Closed
Bug 1414506
Opened 7 years ago
Closed 6 years ago
Crash in Elf::Ehdr::validate
Categories
(Core :: mozglue, defect, P3)
Tracking
()
RESOLVED
FIXED
mozilla60
Tracking | Status | |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox56 | --- | unaffected |
firefox57 | --- | unaffected |
firefox58 | --- | wontfix |
firefox59 | --- | wontfix |
firefox60 | --- | fixed |
People
(Reporter: calixte, Assigned: glandium)
References
(Blocks 1 open bug)
Details
(Keywords: crash, regression, topcrash, Whiteboard: [clouseau])
Crash Data
Attachments
(4 files)
This bug was filed from the Socorro interface and is report bp-1f68c277-d071-4621-a89c-3c0b20171030. ============================================================= There are 16 crashes in nightly 58 (FennecAndroid only) starting with buildid 20171030100130. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1412405. [1]Â https://hg.mozilla.org/mozilla-central/rev?node=339e44e41f22670bd665d935ef56f2262f9360db
Flags: needinfo?(nfroyd)
Comment 1•7 years ago
|
||
That's possible, I guess, but the crash stack isn't really dealing with the code touched in that patch, and we've had intermittent issues with this signature, it looks like. I'm inclined to say the patch is not responsible...unless we have a large increase in the number of occurrences.
Flags: needinfo?(nfroyd)
Priority: -- → P3
Comment 2•7 years ago
|
||
Gonna fix-optional this given Comment #1.
Comment 3•6 years ago
|
||
This is the #1 topcrash for the Android nightly of 20180111100641, happening to 8 different installs, which is a bit concerning. Is there anything we can or should do here?
Flags: needinfo?(mh+mozilla)
Assignee | ||
Comment 4•6 years ago
|
||
From scraping those crash reports: - They all crash at base-address of a library + 3 - They happen in a variety of libraries: android.hardware.media.omx@1.0.so, android.hardware.renderscript@1.0.so, libGLESv2_adreno.so, libaaudio.so, libandroid.so, libandroid_runtime.so, libart-compiler.so, libc++.so, libcamera2ndk.so, libcamera_client.so, libdrmframework.so, libexif.so, libgui.so, libgui_ext.so, libllvm-glnext.so, libmedia.so, libmediaplayerservice.so, libopenjdk.so, libpdfium.so, libqservice.so, libvixl-arm.so, libvixl-arm64.so, libwilhelm.so, libxml2.so. - They happen on a variety of Android versions. That would seem to indicate that some change in Fennec made this code exercised more or something. The crashing code is: if (memcmp(ELFMAG, &ehdr->e_ident, SELFMAG) || And at the assembly level it looks like: 441ae: b5d0 push {r4, r6, r7, lr} 441b0: af02 add r7, sp, #8 441b2: 78c1 ldrb r1, [r0, #3] 441b4: 7883 ldrb r3, [r0, #2] 441b6: 7842 ldrb r2, [r0, #1] 441b8: f890 c000 ldrb.w ip, [r0] 441bc: ea43 2101 orr.w r1, r3, r1, lsl #8 441c0: 4c14 ldr r4, [pc, #80] 441c2: ea4c 2202 orr.w r2, ip, r2, lsl #8 441c6: ea42 4101 orr.w r1, r2, r1, lsl #16 441ca: 42a1 cmp r1, r4 The crash being on 0x441b2, so ldrb r1, [r0, #3], thus the offset always being 3. I really don't know how one can go past this https://dxr.mozilla.org/mozilla-central/source/mozglue/linker/ElfLoader.cpp#144-172 and still crash reading that memory... Specifically, that would imply write() was happy to read that data, but later reading it was forbidden...
Flags: needinfo?(mh+mozilla)
Assignee | ||
Comment 5•6 years ago
|
||
Oh, digging in one minidump actually says that the modules list in the crash report is obsolete, and the libraries are not actually mapped anymore. So that would imply a race condition with another thread dlclose()ing the libraries. The problem is that while we can add a lock to avoid races with our code calling into dl* functions, there will always be a race with e.g. dalvik threads calling the system dl* functions directly... I wonder if bionic has a dl_iterate_phdr function these days and if it does, if we could use it instead to loop on the system-loaded libraries.
Comment 6•6 years ago
|
||
https://wiki.mozilla.org/Bug_Triage/Projects/Bug_Handling/Bug_Husbandry#Move_fix-optionals
status-firefox59:
--- → ?
Comment 7•6 years ago
|
||
Currently this is the #3 overall top browser crash on nightly. One comment says "browsing Feedly with FF as my browser"
Keywords: topcrash
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Updated•6 years ago
|
Assignee: nobody → mh+mozilla
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 14•6 years ago
|
||
Comment on attachment 8943920 [details] Bug 1414506 - Move dl_phdr_info filling and callback invocation to a separate class. Oh great... turns out Android NDK doesn't have the dl_iterate_phdr symbol below version 21... That'd be Android 5.
Attachment #8943920 -
Flags: review?(nfroyd)
Assignee | ||
Comment 15•6 years ago
|
||
(but as usual with bionic, the system headers do expose it at any level)
Assignee | ||
Updated•6 years ago
|
Attachment #8943921 -
Flags: review?(nfroyd)
Assignee | ||
Comment 16•6 years ago
|
||
The good news, though, is that it /seems/ all the crashes happen with versions >= 23 (so, Android 6).
Comment 17•6 years ago
|
||
mozreview-review |
Comment on attachment 8943918 [details] Bug 1414506 - Drive-by: Add missing <cstring> include to XZStream.cpp. https://reviewboard.mozilla.org/r/214266/#review220014
Attachment #8943918 -
Flags: review?(nfroyd) → review+
Comment 18•6 years ago
|
||
mozreview-review |
Comment on attachment 8943919 [details] Bug 1414506 - Drive-by: Remove MOZ_CONCAT definition from mozglue/linker/Logging.h. https://reviewboard.mozilla.org/r/214268/#review220016
Attachment #8943919 -
Flags: review?(nfroyd) → review+
Assignee | ||
Comment 19•6 years ago
|
||
So, I was testing a new patch on try, and added a log to confirm whether I'm detecting dl_iterate_phdr correctly, but it looks like all our tests are on 4.3, so I only get to know the branch using the old code is called... I tested on my phone which has 6.0, and I do get a log saying it takes the other branch, so it looks like it works. Anyways, what puzzles me is that in the test logs for those builds, I see logs from my additions a *lot*. Is is expected to call dl_iterate_phdr that much? See for example: curl -sL https://public-artifacts.taskcluster.net/E8N2jH1sRSGV9r_klUd0Sw/0/public/logs/live_backing.log | zgrep dl_iterate_phdr | awk 'NR == 1 { print } END { print; print NR }' [task 2018-01-23T06:17:09.013Z] 06:17:09 INFO - 01-22 22:13:42.228 W/GeckoLinker( 798): NOT using system dl_iterate_phdr [task 2018-01-23T06:17:09.145Z] 06:17:09 INFO - 01-22 22:16:45.458 W/GeckoLinker( 798): NOT using system dl_iterate_phdr 272 That's 272 times in 3 minutes. That's more than once per second! I wonder if that's the reason for the sudden spike in crashes. If we call it that many, no wonder why we're hitting the race condition more.
Flags: needinfo?(mstange)
Comment 20•6 years ago
|
||
The background hang reporter might be responsible for the more frequent calls to SharedLibraryInfo. Nika, is BHR enabled on Android?
Flags: needinfo?(mstange) → needinfo?(nika)
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 25•6 years ago
|
||
mozreview-review |
Comment on attachment 8943921 [details] Bug 1414506 - Use system dl_iterate_phdr for system loaded libraries when we can. https://reviewboard.mozilla.org/r/214272/#review220426 ::: mozglue/linker/ElfLoader.cpp:227 (Diff revision 3) > + return dl_iterate_phdr(callback, data); > + } > + > + /* For versions of Android that don't support dl_iterate_phdr (< 5.0), > + * we go through the debugger helper data, which is known to be racy, but > + * there's not much we can do about this :( . */ Note that we /could/ stop all the other threads. But that sounds horrible.
Comment 26•6 years ago
|
||
(In reply to Markus Stange [:mstange] from comment #20) > The background hang reporter might be responsible for the more frequent > calls to SharedLibraryInfo. Nika, is BHR enabled on Android? We should be avoiding doing anything related to shared libraries on android because of these undefs: https://searchfox.org/mozilla-central/rev/b7e3ec2468d42fa59d86c03ec7afeb209813f1d4/toolkit/components/backgroundhangmonitor/ThreadStackHelper.h#35-40 There's a chance we collect the module information unnecessarily on android though, as the monitor isn't disabled completely. I can add some more compile time checks to disable the code which calls into SharedLibraryInfo on Android as well :-).
Flags: needinfo?(nika)
Comment 27•6 years ago
|
||
mozreview-review |
Comment on attachment 8943920 [details] Bug 1414506 - Move dl_phdr_info filling and callback invocation to a separate class. https://reviewboard.mozilla.org/r/214270/#review221284 r=me; not going to insist on the comment below, but it'd be nice. ::: mozglue/linker/ElfLoader.cpp:128 (Diff revision 3) > +int > +DlIteratePhdrHelper::fill_and_call(dl_phdr_cb callback, const void* l_addr, > + const char* l_name, void *data) Maybe provide a brief comment about what this method is supposed to do? `fill_and_call` is not exactly evocative of what this method does.
Attachment #8943920 -
Flags: review?(nfroyd) → review+
Comment 28•6 years ago
|
||
mozreview-review-reply |
Comment on attachment 8943921 [details] Bug 1414506 - Use system dl_iterate_phdr for system loaded libraries when we can. https://reviewboard.mozilla.org/r/214272/#review220426 > Note that we /could/ stop all the other threads. But that sounds horrible. Totally agreed on "that sounds horrible"...let's not try to do that.
Comment 29•6 years ago
|
||
mozreview-review |
Comment on attachment 8943921 [details] Bug 1414506 - Use system dl_iterate_phdr for system loaded libraries when we can. https://reviewboard.mozilla.org/r/214272/#review221290 ::: mozglue/linker/ElfLoader.cpp:47 (Diff revision 3) > +extern "C" MOZ_EXPORT int > +dl_iterate_phdr(dl_phdr_cb callback, void *data) __attribute__((weak)); Random thought: I noticed, while trying to figure out where we setup our `__wrap_*` functions to be called (since we don't appear to use `-Wl,--wrap=...`?), that `mozglue/linker/dladdr.h` appears to be missing `extern "C"`. We should either delete that header (surely everybody has `dladdr`?) or fix the declaration. ::: mozglue/linker/ElfLoader.cpp:207 (Diff revision 3) > > int > __wrap_dl_iterate_phdr(dl_phdr_cb callback, void *data) > { > + DlIteratePhdrHelper helper; > + AutoLock lock(&ElfLoader::Singleton.handlesMutex); Do we want to lock this for the entirety of the function, or only for the `dl_iterate_phdr` case?
Attachment #8943921 -
Flags: review?(nfroyd) → review+
Assignee | ||
Comment 30•6 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #29) > Comment on attachment 8943921 [details] > Bug 1414506 - Use system dl_iterate_phdr for system loaded libraries when we > can. > > https://reviewboard.mozilla.org/r/214272/#review221290 > > ::: mozglue/linker/ElfLoader.cpp:47 > (Diff revision 3) > > +extern "C" MOZ_EXPORT int > > +dl_iterate_phdr(dl_phdr_cb callback, void *data) __attribute__((weak)); > > Random thought: I noticed, while trying to figure out where we setup our > `__wrap_*` functions to be called (since we don't appear to use > `-Wl,--wrap=...`?), that `mozglue/linker/dladdr.h` appears to be missing > `extern "C"`. We should either delete that header (surely everybody has > `dladdr`?) or fix the declaration. I'll file a separate bug for this. It was only used as a forced include for nspr, but that was removed. > ::: mozglue/linker/ElfLoader.cpp:207 > (Diff revision 3) > > > > int > > __wrap_dl_iterate_phdr(dl_phdr_cb callback, void *data) > > { > > + DlIteratePhdrHelper helper; > > + AutoLock lock(&ElfLoader::Singleton.handlesMutex); > > Do we want to lock this for the entirety of the function, or only for the > `dl_iterate_phdr` case? I think it's safer to get the lock for the old code too. At least it avoids racing with other threads calling /our/ dlclose.
Comment 31•6 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #30) > (In reply to Nathan Froyd [:froydnj] from comment #29) > > ::: mozglue/linker/ElfLoader.cpp:207 > > (Diff revision 3) > > > > > > int > > > __wrap_dl_iterate_phdr(dl_phdr_cb callback, void *data) > > > { > > > + DlIteratePhdrHelper helper; > > > + AutoLock lock(&ElfLoader::Singleton.handlesMutex); > > > > Do we want to lock this for the entirety of the function, or only for the > > `dl_iterate_phdr` case? > > I think it's safer to get the lock for the old code too. At least it avoids > racing with other threads calling /our/ dlclose. Probably worth a comment to that effect, then.
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment 36•6 years ago
|
||
Pushed by mh@glandium.org: https://hg.mozilla.org/integration/autoland/rev/6d040b09b48d Drive-by: Add missing <cstring> include to XZStream.cpp. r=froydnj https://hg.mozilla.org/integration/autoland/rev/534f60b7cb2b Drive-by: Remove MOZ_CONCAT definition from mozglue/linker/Logging.h. r=froydnj https://hg.mozilla.org/integration/autoland/rev/ce1f9ca065cf Move dl_phdr_info filling and callback invocation to a separate class. r=froydnj https://hg.mozilla.org/integration/autoland/rev/3b862e2b9e0b Use system dl_iterate_phdr for system loaded libraries when we can. r=froydnj
Comment 37•6 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/6d040b09b48d https://hg.mozilla.org/mozilla-central/rev/534f60b7cb2b https://hg.mozilla.org/mozilla-central/rev/ce1f9ca065cf https://hg.mozilla.org/mozilla-central/rev/3b862e2b9e0b
Status: NEW → RESOLVED
Closed: 6 years ago
status-firefox60:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
Comment 38•6 years ago
|
||
Not seeing any reports of this outside trunk, so calling this wontfix for 59.
You need to log in
before you can comment on or make changes to this bug.
Description
•