Closed Bug 802240 Opened 12 years ago Closed 11 years ago

Figure out how to get Android library data from the APK to Breakpad for SPS profiling

Categories

(Core :: mozglue, defect)

ARM
Android
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla23

People

(Reporter: ted, Assigned: glandium)

References

Details

Attachments

(2 files, 5 obsolete files)

In Julian's patch to make SPS use Breakpad for unwinding we're relying on Breakpad being able to load libraries so it can read symbols out of them. This only works if you have the libraries on disk. For the loading-from-the-APK case we'll need to figure out how to make the library loader and Breakpad meet in the middle to get the symbol data. If we used my "dump symbols from ARM unwind tables" patch and we fixed this then we could profile release Android builds, which would be really cool.

I think the answer here is going to be to expose an API to get the load addresses of the .ARM.exidx and .ARM.extab sections, and to also make sure that we can call directly into the Breakpad methods that convert that data into debug info.
dl_iterate_phdr can give your the Phdrs, where you can find the EXIDX program headers yourself.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #0)
> I think the answer here is going to be to expose an API to get the load
> addresses of the .ARM.exidx and .ARM.extab sections

I'd prefer something a bit more general, which is that Mike's linker
gives us a pair of functions, to mmap and munmap files in the mapped
APK.  Then we can hand those to breakpad and it can use them in
preference to the standard mmap/munmap, to read whatever it wants.
This makes it agnostic to whether we go with EXIDX or CFI or some
combination of both.  It would be used to replace the mmap call in
LoadELF in
toolkit/crashreporter/google-breakpad/src/common/linux/dump_symbols.cc

The functionality can be very restricted compared to the standard
mmap:

* MAP_ANON, MAP_PRIVATE only
* mmap gets to choose where to put it
* zero offset in the file

Also, presumably it would need to identify the file by name rather
than by fd.
Oh, and no partial unmaps necessary.  munmap would only have to handle
unmapping precisely the range that just got mapped (I assume -- unless
breakpad is doing wierd stuff w/ mappings.)
Breakpad simply maps the entire file, pokes at it, and then unmaps it.
Julian, can you test if this works for you?

You need to define the following functions where you'll be using them:
extern "C" {
  MFBT_API size_t __dl_get_mappable_length(void *handle);
  MFBT_API void *__dl_mmap(void *handle, void *addr, size_t length, off_t offset);
  MFBT_API void __dl_munmap(void *handle, void *addr, size_t length);
}

handle is what you get from dlopen();
Assignee: nobody → mh+mozilla
Random comment, of no great importance, while reading through the patch:

+void
+CustomElf::MappableMUnmap(void *addr, size_t length) const
+{
+  return mappable->munmap(addr, length);
+}

Inconsistency in return type (void vs implied-not-void?)  Can this
return an error code instead of void, so it's possible to check it
didn't fail?
As found through irc debugging, this fails because of MappableDeflate::finalize freeing memory.
Contents of this patch:

* Mike's linker patch from comment 5

* Fixes to the above that Mike told me over irc

* My changes to the SPS and breakpad that hook up to the linker
  changes.  These center around the use of the type AltMapFns to
  abstractify out how breakpad maps objects so as to read from them.
  Basically, local_debug_info_symbolizer.cc creates one of these
  AltMapFns objects that hook up to faulty.lib.  It then hands it
  off to ReadSymbolData, which uses that to do the mapping.
No functional changes.  Is a minorly cleaned up version of
the previous rollup patch -- some debug printing removed,
and a couple of unrelated hunks deleted.
Attachment #733934 - Attachment is obsolete: true
Comment on attachment 734841 [details] [diff] [review]
Rollup WIP patch, somewhat cleaned up

Ted, what's your view on the breakpad side changes here? 
Specifically, the changes the following 3 files:

toolkit/crashreporter/google-breakpad/src/common/linux/dump_symbols.cc
toolkit/crashreporter/google-breakpad/src/common/linux/dump_symbols.h
tools/profiler/local_debug_info_symbolizer.cc

as summarised in comment 8.  Would this approach be considered acceptable?
Attachment #734841 - Flags: feedback?(ted)
(In reply to Mike Hommey [:glandium] from comment #5)
> Created attachment 729470 [details] [diff] [review]
> Expose an API to mmap the underlying file for a library loaded by faulty.lib
> Julian, can you test if this works for you?

This does indeed seem to work, when using the extra three fixes you
told me on irc.  These are marked using //MH-IRC in the rollup patch.
Comment on attachment 734841 [details] [diff] [review]
Rollup WIP patch, somewhat cleaned up

Review of attachment 734841 [details] [diff] [review]:
-----------------------------------------------------------------

I think this would be much simpler if you just exposed ReadSymbolDataInternal as a public API, and did the mapping management in local_debug_info_symbolizer.

In fact, you could already just use that API like the unit tests do, since it's not static. (You just have to define a prototype for it because it's not in the header.)
Attachment #734841 - Flags: feedback?(ted) → feedback-
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #12)

Good suggestion.  That makes the patch about 25% shorter and almost
completely non-intrusive for Breakpad.  Could you pls look over:

toolkit/crashreporter/google-breakpad/src/common/linux/dump_symbols.h
tools/profiler/local_debug_info_symbolizer.cc
Attachment #734841 - Attachment is obsolete: true
Attachment #735400 - Flags: feedback?(ted)
Comment on attachment 735400 [details] [diff] [review]
Rollup WIP patch, revised as per comment 12

Review of attachment 735400 [details] [diff] [review]:
-----------------------------------------------------------------

This looks a lot better, thanks. I'll land the dump_symbols.h change upstream for you.
Attachment #735400 - Flags: feedback?(ted) → feedback+
Julian, can you test this? (totally untested locally, besides the fact that it builds)
Attachment #729470 - Attachment is obsolete: true
(In reply to Mike Hommey [:glandium] from comment #15)
> Created attachment 736304 [details] [diff] [review]
> Expose an API to mmap the underlying file for a library loaded by faulty.lib
> 
> Julian, can you test this? (totally untested locally, besides the fact that
> it builds)

Note, you shouldn't need the __attribute__((weak)) anymore (in fact, you can probably include ElfLoader.h instead of making your own declarations), and the api should now take care of system libraries too (no need for your fallback)
Attachment #736304 - Attachment is obsolete: true
(In reply to Mike Hommey [:glandium] from comment #17)
> Created attachment 736433 [details] [diff] [review]
> Expose an API to mmap the underlying file for a library loaded by faulty.lib

Let's already get a review on this. It's kind of awful in that it sucks memory and the restrictions on GetMappableLength and friends are horrible, but it's only a quick hack to have something working. A proper patch will come in a followup.
Attachment #736433 - Flags: review?(nfroyd) → review+
(In reply to Mike Hommey [:glandium] from comment #15)
> Julian, can you test this? (totally untested locally, besides the fact that
> it builds)

This appears to work OK in the case where it is only used to process
in-APK files.

When passed a non-fully-qualified, not-in-APK file name (eg
"libc.so"), it appears to fail at the __dl_get_mappable_length call --
even though the preceding dlopen call looks like it succeeds:

E/Profiler( 4282): 2013-04-11 20:46:18:
local_debug_info_symbolizer.cc:75: INFO: ReadSymbolData_APK: Unable to
get size for ELF file 'libc.so'
(In reply to Mike Hommey [:glandium] from comment #15)
> Julian, can you test this? (totally untested locally, besides the fact that
> it builds)

If I install the Quit Now extension, and start the profiler, then it
segfaults at exit.  It segfault if I don't start the profiler:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 5781]
0xb00055f2 in ?? () from /home/sewardj/MOZ/JimDB/jimdb/lib/0288504044c06217/system/bin/linker
(gdb) where
#0  0xb00055f2 in ?? () from /home/sewardj/MOZ/JimDB/jimdb/lib/0288504044c06217/system/bin/linker
#1  0xb000593e in ?? () from /home/sewardj/MOZ/JimDB/jimdb/lib/0288504044c06217/system/bin/linker
#2  0x5b77158e in ElfLoader::DebuggerHelper::Remove (this=0x5b785258 <ElfLoader::Singleton+64>, map=0x1c3634c)
    at /home/sewardj/MOZ/MC-11-04-2013/mozglue/linker/ElfLoader.cpp:770
#3  0x5b77186a in ElfLoader::Forget (this=0x5b785218 <ElfLoader::Singleton>, handle=<error reading variable: Cannot access memory at address 0xfffffffe>)
    at /home/sewardj/MOZ/MC-11-04-2013/mozglue/linker/ElfLoader.cpp:416
#4  0x5b7728c2 in CustomElf::~CustomElf (this=0x1c36338, __in_chrg=<optimized out>) at /home/sewardj/MOZ/MC-11-04-2013/mozglue/linker/CustomElf.cpp:237
#5  0x5b772946 in CustomElf::~CustomElf (this=0x1c36338, __in_chrg=<optimized out>) at /home/sewardj/MOZ/MC-11-04-2013/mozglue/linker/CustomElf.cpp:238
#6  0x5b771ed0 in Release (this=<optimized out>) at /home/sewardj/MOZ/MC-11-04-2013/mozglue/linker/ElfLoader.h:230
#7  ReleaseDirectRef (this=<optimized out>) at /home/sewardj/MOZ/MC-11-04-2013/mozglue/linker/ElfLoader.h:153
#8  ElfLoader::~ElfLoader (this=0x5b785218 <ElfLoader::Singleton>, __in_chrg=<optimized out>) at /home/sewardj/MOZ/MC-11-04-2013/mozglue/linker/ElfLoader.cpp:444
#9  0x400b7f84 in __cxa_finalize () from /home/sewardj/MOZ/JimDB/jimdb/lib/0288504044c06217/system/lib/libc.so
#10 0x400b8320 in exit () from /home/sewardj/MOZ/JimDB/jimdb/lib/0288504044c06217/system/lib/libc.so
#11 0x400b8320 in exit () from /home/sewardj/MOZ/JimDB/jimdb/lib/0288504044c06217/system/lib/libc.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> If I install the Quit Now extension, and start the profiler, then it
> segfaults at exit.  It segfault if I don't start the profiler:
                        ^
Duh.  That should be "doesn't segfault".
Add DEFINES += -DMOZ_DEBUG_LINKER to mozglue/linker/Makefile.in and attach a logcat of GeckoLinker near the crash.
(In reply to Julian Seward from comment #19)
> E/Profiler( 4282): 2013-04-11 20:46:18:
> local_debug_info_symbolizer.cc:75: INFO: ReadSymbolData_APK: Unable to
> get size for ELF file 'libc.so'

> [23:31] <glandium> sewardj: edit mozglue/linker/ElfLoader.cpp, 
  find /system/lib and add a / (/system/lib/)

-    systemPath = "/system/lib";
+    systemPath = "/system/lib/";

This doesn't fix the problem.  If an incorrect systemPath was the
problem, would the initial dlopen("libc.so") call have succeeded?
Note, this is also with   systemPath = "/system/lib/";  as per
comment 23, and you can see the calls dlopen("libc.so") etc in
the log.
(In reply to Julian Seward from comment #23)
> (In reply to Julian Seward from comment #19)
> > E/Profiler( 4282): 2013-04-11 20:46:18:
> > local_debug_info_symbolizer.cc:75: INFO: ReadSymbolData_APK: Unable to
> > get size for ELF file 'libc.so'
> 
> > [23:31] <glandium> sewardj: edit mozglue/linker/ElfLoader.cpp, 
>   find /system/lib and add a / (/system/lib/)
> 
> -    systemPath = "/system/lib";
> +    systemPath = "/system/lib/";
> 
> This doesn't fix the problem.

It does for me:
void * foo = dlopen("libc.so", RTLD_GLOBAL | RTLD_LAZY);
__android_log_print(ANDROID_LOG_ERROR, "GeckoLinker", "__dl_get_mappable_length returns %d", __dl_get_mappable_length(foo));
dlclose(foo);

--> E/GeckoLinker( 5644): __dl_get_mappable_length returns 286500
(In reply to Julian Seward from comment #24)
> Created attachment 736529 [details]
> Complete logcat (.txt.bz2) as per comment 22
> 
> Note, this is also with   systemPath = "/system/lib/";  as per
> comment 23, and you can see the calls dlopen("libc.so") etc in
> the log.

Ah, but the error is different this time: "Failed to mmap ELF file 'libc.so'" instead of "Unable to
get size for ELF file 'libc.so'"

About the crash, the log shows this:
E/GeckoLinker( 7034): Caught segmentation fault @0x7c3de0e8
That's within the libxul range, after libxul is unloaded. Sounds like the good old "run code after libxul is unloaded", probably triggered by the profiler being on. This doesn't sound related to this patch at all.
Attachment #735400 - Attachment is obsolete: true
Blocks: 861141
https://hg.mozilla.org/mozilla-central/rev/6d4badbe3db2
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
Depends on: 861796
Blocks: 863453
Depends on: 1082527
Blocks: 1082529
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: