Closed Bug 1773313 Opened 3 years ago Closed 24 days ago

In profiler shared library enumeration, obtain the ELF build ID from memory and not by reading the file again

Categories

(Core :: Gecko Profiler, task, P2)

All
Linux
task

Tracking

()

RESOLVED FIXED
143 Branch
Tracking Status
firefox143 --- fixed

People

(Reporter: mstange, Assigned: canova)

References

(Blocks 1 open bug)

Details

(Keywords: perf-alert, Whiteboard: [fxp])

Attachments

(4 files, 4 obsolete files)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review

When we enumerate the libraries that are loaded in the libraries on Linux and Android, we call getId(filename) for each library in order to obtain the ELF build ID of the library. getId reads the file from disk:

https://searchfox.org/mozilla-central/rev/eb554e155a28ad36aca62281406757833b9c467a/tools/profiler/core/shared-libraries-linux.cc#73-86

// Get the breakpad Id for the binary file pointed by bin_name
static nsCString getId(const char* bin_name) {
  using namespace google_breakpad;

  PageAllocator allocator;
  auto_wasteful_vector<uint8_t, kDefaultBuildIdSize> identifier(&allocator);

  FileID file_id(bin_name);
  if (file_id.ElfFileIdentifier(identifier)) {
    return IDtoUUIDString(identifier);
  }

  return ""_ns;
}

This is problematic in two three cases:

  • (New:) When using AAB builds on Android without "legacy packaging", there is no uncompressed libxul.so on the disk to read the ID from. We have switched to legacy packing in bug 1865634 to keep build ID reading working. (Bug 1867280 tracks not using legacy packaging.)
  • If the file has changed between being loaded in our process and now, then we get the ID of the new file but we want to have the ID of the old file. This means we can get bad symbols (bug 773890).
  • In child processes with restrictive sandboxes such as the GMP process, the file read is disallowed (bug 1770905). This means that we cannot get the ID with the current implementation, and then we cannot symbolicate stacks.

Instead, we should get the ID from memory. We can get the base address where the library is loaded, so we should be able to parse the ELF commands in memory and find the build ID note.

Severity: -- → N/A
Priority: -- → P2
Attachment #9364806 - Attachment description: WIP: Bug 1773313 - Obtain the ElF build ID from the memory instead of reading the file for shared libs r?mstange → WIP: Bug 1773313 - Obtain the ELF build ID from the memory instead of reading the file for shared libs r?mstange
No longer blocks: 1865634
See Also: → 1865634

I have a patch that works on Linux here. It's mostly working on Android, but unfortunately it's crashing when it's trying to retrieve the ELF build ID from some certain shared objects. I've spent quite some time to understand the problem and I think I understand the issue now, so let me explain that.

On Android it's crashing when we call FileID::ElfFileIdentifierFromMappedFile directly, on this line here. And specifically it segfaults when it's trying to access the sh_offset member of section_names. When I use lldb, section_names doesn't look like it has anything, probably I can't access that memory.

When I looked at the breakpad's repository I found this commit that our breakpad dependency doesn't have. And I thought this would solve our problem, but unfortunately it didn't. But at least the commit message shows that some vendor libraries don't contain the section header. So I think this is very similar to what's happening to us.

When I reproduced it locally, it was the /lib/arm64/libjnidispatch.so shared object that was giving this error. Here's what readelf shows me for this object file:

➜  gecko-android readelf -WS libjnidispatch.so
There are 22 section headers, starting at offset 0x28b70:

Section Headers:
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            0000000000000000 000000 000000 00      0   0  0
  [ 1] .hash             HASH            0000000000000190 000190 000528 04   A  2   0  8
  [ 2] .dynsym           DYNSYM          00000000000006b8 0006b8 001278 18   A  3   3  8
  [ 3] .dynstr           STRTAB          0000000000001930 001930 0011db 00   A  0   0  1
  [ 4] .gnu.version      VERSYM          0000000000002b0c 002b0c 00018a 02   A  2   0  2
  [ 5] .gnu.version_r    VERNEED         0000000000002c98 002c98 000040 00   A  3   2  8
  [ 6] .rela.dyn         RELA            0000000000002cd8 002cd8 000918 18   A  2   0  8
  [ 7] .rela.plt         RELA            00000000000035f0 0035f0 0003f0 18  AI  2   8  8
  [ 8] .plt              PROGBITS        00000000000039e0 0039e0 0002c0 10  AX  0   0 16
  [ 9] .text             PROGBITS        0000000000004000 004000 01c080 00  AX  0   0 16384
  [10] .rodata           PROGBITS        0000000000020080 020080 002a10 00   A  0   0 16
  [11] .eh_frame_hdr     PROGBITS        0000000000022a90 022a90 00057c 00   A  0   0  4
  [12] .eh_frame         PROGBITS        0000000000023010 023010 002328 00   A  0   0  8
  [13] .data.rel.ro      PROGBITS        0000000000035ba0 025ba0 0000d8 00  WA  0   0  8
  [14] .dynamic          DYNAMIC         0000000000035c78 025c78 0001b0 10  WA  3   0  8
  [15] .got              PROGBITS        0000000000035e28 025e28 0001d8 08  WA  0   0  8
  [16] .data             PROGBITS        0000000000036000 026000 000244 00  WA  0   0 16
  [17] .bss              NOBITS          0000000000036248 026244 000838 00  WA  0   0  8
  [18] .comment          PROGBITS        0000000000000000 026244 000027 01  MS  0   0  1
  [19] .shstrtab         STRTAB          0000000000000000 02626b 0000b6 00      0   0  1
  [20] .symtab           SYMTAB          0000000000000000 026328 001500 18     21  30  8
  [21] .strtab           STRTAB          0000000000000000 027828 001347 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  D (mbind), p (processor specific)

As you can see, .shstrtab doesn't have A(alloc) flag. So I think it's not put into the memory when it's read. And if I'm not mistaken, this is the one that's trying to read while segfaulting. This is what ELF documentation says about .shstrtab:

.shstrtab
              This section holds section names.  This section is of type
              **SHT_STRTAB**.  No attribute types are used.

So it looks like the problem is that the required information is not loaded into the memory to be able to get this build ID.

Currently we are iterating the shared objects with dl_iterate_phdr. Coincidentally, when I looked at this object in this iteration callback, it didn't have any PT_PHDR(program header table) segment loaded. So I made the assumption that, if this program header is not loaded, then we can fall back to our initial implementation of reading the whole file from disk.

It worked locally on simulator and on my Android device, but when I pushed to try, I got another crash in this location this time. Which is again related to the section names. Interestingly, this object file has PT_PHDR loaded, but doesn't have section names.

Here's my try push: https://treeherder.mozilla.org/jobs?repo=try&revision=12e0540f1e544dd088683dd7f236fbc97385ed87&selectedTaskRun=S6dJlueqTtq-Am9uBmTX4g.0
This push also has some debugging information in the logcat logs if you are interested (if you search canova in it): https://firefoxci.taskcluster-artifacts.net/S6dJlueqTtq-Am9uBmTX4g/0/public/test_info/logcat-emulator-5554.log

The current failing library is /system/framework/x86_64/boot-okhttp.oat, which works fine on my simulator and Android device.

dl_iterate_phdr shows that it has PT_PHDR, but no PT_NOTE.

Name: "/system/framework/x86_64/boot-okhttp.oat" (6 segments)
     0: [    0x7225d040; memsz:    150] flags: 0x4; 
PT_PHDR
     1: [    0x7225d000; memsz:  87000] flags: 0x4; 
PT_LOAD
     2: [    0x722e4000; memsz:  61d97] flags: 0x5; 
PT_LOAD
     3: [    0x72346000; memsz:     ac] flags: 0x4; 
PT_LOAD
     4: [    0x72347000; memsz:     70] flags: 0x6; 
PT_LOAD
     5: [    0x72347000; memsz:     70] flags: 0x6; 
PT_DYNAMIC

I couldn't get this .oat file to my device to be able to investigate with readelf.

So I tried to find a way to get the information of "do we have section headers loaded" programmatically, but unfortunately we don't have this information in dl_iterate_phdr as it deals with program headers and not section headers. Looks like only place we can get that information is readelf, which requires us to read the file.

So, considering all this, I'm not sure how to move forward with it now, considering that (afaik) it's not possible to check if we have section headers loaded or not.

Hi Nazim! question for you...

(In reply to Nazım Can Altınova [:canova][:canaltinova on phabricator] from comment #2)

I have a patch that works on Linux here. It's mostly working on Android, but unfortunately it's crashing when it's trying to retrieve the ELF build ID from some certain shared objects.
[...]
So, considering all this, I'm not sure how to move forward with it now, considering that (afaik) it's not possible to check if we have section headers loaded or not.

If the patch is fine on Linux but we're stalled getting it to work on Android: maybe for now we could adjust it such that the changes here only take effect for linux builds, and we just leave Android using the old codepath for now, with a followup bug? (using #ifdefs to differentiate at build time, or somesuch)

That would be a big help for me at least when capturing profiles of performance issues that I hit in daily browsing -- I sometimes run into bug 773890 (often without knowing it right away), and per bug 773890 comment 8, it sounds like this bug's patch would help. It's on my mind at the moment because I happened to run up against this today; I caught a pretty bad parent-process hang in the profiler in my main browsing session, and I'm blocked from investigating the hang, because the profile ended up getting the wrong symbols.

(This is one aspect of the nightly-updating-footgun that arises from my development-and-testing workflow, where I often create disposable profiles to test things in a fresh environment (mkdir /tmp/foo; firefox -profile /tmp/foo) and that results in staged Nightly updates being installed, while my main browsing session is ~silently stuck on the previous Nightly version until I notice and interrupt my flow to restart it. We mostly handle this gracefully, but bug 773890 describes one way in which it causes trouble.)

Flags: needinfo?(mstange.moz)

Hi! Thanks for the detailed information and sorry for the frustration. It's currently pretty annoying to have this happen.

Good questions, indeed it's going to be easier for Linux without Android. My biggest concern was that, even though if it was working on Linux at that moment, I wasn't sure if it was because I was lucky with the libraries. And I wasn't sure if we would encounter another bad library that's not fully in memory later which would lead to crashes. But at least we can start by adding a whitelisted shared libraries, which includes libxul and libmozglue as a start. This should fix the biggest part of your problem.

But also Markus gave me some ideas on how to prevent this crash by pre-emptively checking the segfault. So probably we can also make it happen by updating the breakpad's code.

I'm currently working on something but I will look into this issue as long as I have some cycles, hopefully soon.

Flags: needinfo?(mstange.moz)
Whiteboard: [fxp]
Attachment #9364806 - Attachment is obsolete: true

This is the import of changes in the upstream here:
https://chromium.googlesource.com/breakpad/breakpad/+/62d927241962ad40f3bca9fa3841edf9d7a56b5f

It says:
Some vendor library doesn't contain the section header. It causes
segmentation fault in FindElfClassSection.

e_shoff:
This member holds the section header table's file offset in bytes.
If the file has no section header table this member holds zero.

Assignee: nobody → canaltinova
Status: NEW → ASSIGNED

Previously we were reading the shared library from disk and getting its
elf file identifier that way. But we already have libraries loaded in
memory, so we should try to read the library from memory first, and if
that fails, read it from disk.

In case the library is not fully loaded into memory (for example PT_NOTE
or section headers might not be loaded), we might fail to find the elf
build identifier. For cases like this, we fall back to reading the library
from disk.

ElfFileIdentifierFromMappedFile used to assume that every byte of a shared
library was in memory, because mostly we were reading it from disk. But we
changed it to read the in-memory libraries instead. Still this is mostly true
on desktop Linux, so the code never noticed its mistake, but it is not true on
Android:

  • Sometimes Android loader (and “strip --strip-unneeded” builds) only maps the
    PT_LOAD segments of an elf file. Section headers, string tables
    and some PT_NOTE segments might stay on disk.

  • When Breakpad went to look for the .note.gnu.build-id note it sometimes
    walked past the end of the mapped area and hit unmapped memory,
    causing a crash during dl_iterate_phdr.

This patch adds a new lib_size parameter—calculated as libEnd - libStart
from the ranges we record in dl_iterate_phdr—down to all helper functions.
Every read now checks that the requested bytes fit inside these bounds. If they
do not, we fall back to the slower but safe file-based path instead of crashing.

Actually, I found another case where Android could crash. I'm updating the code.

On Android 11+, the section header table may be mapped but without read
permissions, causing SIGSEGV(SEGV_ACCERR) when dereferencing section_names
fields. This patch adds:

  1. Memory bounds checking before accessing section headers.
  2. mprotect() to restore read permissions if possible.
  3. Handling of empty section name string tables (common on Android).
  4. Safe bounds checking in FindElfSectionByName for section names.

When in-memory access fails, the code falls back to reading ELF data
from disk. Some of the bounds checks are defensive programming to handle
edge cases in ELF files with malformed or missing section data.

Finally managed to solve the Android crash that was happening before.

The root cause of the crash was because necessary parts of the shared library wasn't loaded into memory. The patch that fixes the crash is D257262 and D257510.

Here are before and after try runs:
Before - After (pushed to try recently so it's still running at the time of posting, but here's an older one that's green)

Also some notes on the new behavior. It tries to get the ELF build identifier from memory. If it fails it falls back to reading it from file.

On Linux, it looks like it's consistently reading it from memory, so we are safe there.

But on Linux, it's failing to read from memory from some of the shared libraries that belong to the operating system, like libc. Here's a list of libraries that it fails to read it from memory:

On x86_64 emulator:

falling back to reading from disk for: /dev/ashmem/dalvik-jit-code-cache
failed to read it from disk too
falling back to reading from disk for: /system/bin/app_process64
falling back to reading from disk for: /system/lib64/libandroid_runtime.so
falling back to reading from disk for: /system/lib64/libc.so
falling back to reading from disk for: /system/lib64/libm.so
falling back to reading from disk for: /system/lib64/libGLESv1_CM.so
falling back to reading from disk for: /system/lib64/libGLESv2.so
falling back to reading from disk for: /system/lib64/libunwind.so
falling back to reading from disk for: [vdso]
failed to read it from disk too
falling back to reading from disk for: /system/framework/x86_64/boot.oat
falling back to reading from disk for: /system/framework/x86_64/boot-core-libart.oat
falling back to reading from disk for: /system/framework/x86_64/boot-conscrypt.oat
falling back to reading from disk for: /system/framework/x86_64/boot-okhttp.oat
falling back to reading from disk for: /system/framework/x86_64/boot-core-junit.oat
falling back to reading from disk for: /system/framework/x86_64/boot-bouncycastle.oat
falling back to reading from disk for: /system/framework/x86_64/boot-ext.oat
falling back to reading from disk for: /system/framework/x86_64/boot-framework.oat
falling back to reading from disk for: /system/framework/x86_64/boot-telephony-common.oat
falling back to reading from disk for: /system/framework/x86_64/boot-voip-common.oat
falling back to reading from disk for: /system/framework/x86_64/boot-ims-common.oat
falling back to reading from disk for: /system/framework/x86_64/boot-apache-xml.oat
falling back to reading from disk for: /system/framework/x86_64/boot-org.apache.http.legacy.boot.oat
falling back to reading from disk for: /system/lib64/libGLESv3.so
falling back to reading from disk for: /data/app/org.mozilla.fenix.debug-1/oat/x86_64/base.odex
falling back to reading from disk for: /data/app/org.mozilla.fenix.debug-1/lib/x86_64/libjnidispatch.so

And on a real aarch64 device, it's actually a lot better:

falling back to reading from disk for: [vdso]
failed to read it from disk too
falling back to reading from disk for: /system/lib64/lib_SoundAlive_SRC384_ver320.so
falling back to reading from disk for: /data/misc/apexdata/com.android.art/dalvik-cache/arm64/boot.oat
falling back to reading from disk for: /data/misc/apexdata/com.android.art/dalvik-cache/arm64/boot-framework-adservices.oat
failed to read it from disk too
falling back to reading from disk for: /data/app/~~qjIAU3UmLUE8w8c1MG1oSg==/org.mozilla.fenix.debug-3oZHzvW4IgTsI0118qte8g==/lib/arm64/libjnidispatch.so

So there is no shared library from our bundle is affected (except libjnidispatch? don't think it's too important but let me know). I think this is completely fine, because we fall back to reading them from disk and not crash.

So we fix all the use cases:

  1. (New:) When using AAB builds on Android without "legacy packaging", there is no uncompressed libxul.so on the disk to read the ID from. We have switched to legacy packing in bug 1865634 to keep build ID reading working. (Bug 1867280 tracks not using legacy packaging.)

We can read the .so files we package perfectly fine. So we can revert the "legacy packaging"

  1. If the file has changed between being loaded in our process and now, then we get the ID of the new file but we want to have the ID of the old file. This means we can get bad symbols (bug 773890).

Fine for Linux. For Android: Since the only libraries that we can't read it from memory are the system libraries, they don't really change between being loaded and reading.

  1. In child processes with restrictive sandboxes such as the GMP process, the file read is disallowed (bug 1770905). This means that we cannot get the ID with the current implementation, and then we cannot symbolicate stacks.

Fine for Linux. For Android: It's fine for libxul etc. but it might still be a problem for system libraries like libc on Android. But there is really no way around it since the data is not in memory at the time.

Given the trouble with the section names, I propose a different approach:

  1. Only handle the case where we have a PT_NOTE segment with a note header of type NT_GNU_BUILD_ID.
  2. For in-memory build ID obtaining, don't do the fallback where we look for a section with the name ".note.gnu.build-id".
  3. For in-memory build ID obtaining, don't do the fallback where we hash the first page of the ".text" section.
  4. Move the in-memory build ID obtaining into the dl_iterate_callback, because we already have access to the segments (ELF commands) there.
  5. Don't worry about sharing code with FileID for the in-memory stuff; a reimplementation inside dl_iterate_callback will be simple enough.
  6. Use the FileID code only when getting the build ID from a file.
Attachment #9500987 - Attachment is obsolete: true
Attachment #9500579 - Attachment is obsolete: true
Attachment #9500578 - Attachment is obsolete: true

Previously, we were iterating over libInfoList twice. First iterating
over the whole list and generating the libraries. Then, in the second
loop, we try to find a library that belongs to this executable and
give a proper library name to it.

This was inefficient because we generated the breakpadId and codeId for the
same library twice, which isn't great for performance. But more importantly,
it's needed for the following patch. I'm going to pass the ELF file identifier
from outside (from the dl_iterate_phdr callback), and without this patch, it
essentially overrides the ELF file identifier that we found inside that
callback.

This commit modifies the dl_iterate_callback to extract ELF file identifier
directly from PT_NOTE segments in memory instead of always reading from the
file system. This makes sure that if Firefox is updated in between and shared
libraries are changed, we still read the correct in-memory library.

If we fail to get the ELF file identifier from PT_NOTE, we fall back to reading
it from disk again.

I avoided using the FileID code as much as possible. But still used
ElfClassBuildIDNoteIdentifier inside because this was going to be nearly
identical.

Pushed by canaltinova@gmail.com: https://github.com/mozilla-firefox/firefox/commit/57abb8720c29 https://hg.mozilla.org/integration/autoland/rev/be0da5e5bc55 Fix segfault when there is no section header in ELF file r=mstange,profiler-reviewers https://github.com/mozilla-firefox/firefox/commit/3b8d3c8796fb https://hg.mozilla.org/integration/autoland/rev/5e1205891044 Merge two consecutive loops with SharedLibraryAtPath calls into one r=mstange,profiler-reviewers https://github.com/mozilla-firefox/firefox/commit/0e31e0897162 https://hg.mozilla.org/integration/autoland/rev/543a57747130 Extract ELF file identifier from memory during shared library iteration r=mstange,profiler-reviewers https://github.com/mozilla-firefox/firefox/commit/9f06f89cc867 https://hg.mozilla.org/integration/autoland/rev/e98a3f6ae5be Update the test to ensure that libxul has valid breakpadId and codeId r=mstange,profiler-reviewers

(In reply to Nazım Can Altınova [:canova][:canaltinova on phabricator] from comment #11)

So there is no shared library from our bundle is affected (except libjnidispatch? don't think it's too important but let me know).

Looks like libjnidispatch will have a build ID as of 5.15.0: https://github.com/java-native-access/jna/commit/7a5d9850a1ad8ab66836cea15fc67967bbe29006
So this will be fixed once we update which version of jna we use, https://github.com/mozilla/application-services/blob/928a23f1f1258d1967d03b8e394b1b58ac4da9e4/gradle/libs.versions.toml#L28

(In reply to Pulsebot from comment #16)

Pushed by canaltinova@gmail.com:
https://github.com/mozilla-firefox/firefox/commit/57abb8720c29
https://hg.mozilla.org/integration/autoland/rev/be0da5e5bc55
Fix segfault when there is no section header in ELF file
r=mstange,profiler-reviewers
https://github.com/mozilla-firefox/firefox/commit/3b8d3c8796fb
https://hg.mozilla.org/integration/autoland/rev/5e1205891044
Merge two consecutive loops with SharedLibraryAtPath calls into one
r=mstange,profiler-reviewers
https://github.com/mozilla-firefox/firefox/commit/0e31e0897162
https://hg.mozilla.org/integration/autoland/rev/543a57747130
Extract ELF file identifier from memory during shared library iteration
r=mstange,profiler-reviewers
https://github.com/mozilla-firefox/firefox/commit/9f06f89cc867
https://hg.mozilla.org/integration/autoland/rev/e98a3f6ae5be
Update the test to ensure that libxul has valid breakpadId and codeId
r=mstange,profiler-reviewers

Perfherder has detected a devtools performance change from push e98a3f6ae5bedda6aed1b1a614a3c7e7de49d218.

If you have any questions, please reach out to a performance sheriff. Alternatively, you can find help on Slack by joining #perf-help, and on Matrix you can find help by joining #perftest.

Improvements:

Ratio Test Platform Options Absolute values (old vs new)
4% damp custom.webconsole.out-of-order linux1804-64-shippable-qr e10s fission stylo webrender 499.96 -> 480.66
3% damp custom.webconsole.out-of-order linux1804-64-shippable-qr e10s fission stylo webrender 495.92 -> 479.90

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests.

If you need the profiling jobs you can trigger them yourself from treeherder job view or ask a performance sheriff to do that for you.

You can run all of these tests on try with ./mach try perf --alert 46091

The following documentation link provides more information about this command.

Keywords: perf-alert
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: