Closed Bug 1414506 Opened 2 years ago Closed 2 years ago

Crash in Elf::Ehdr::validate

Categories

(Core :: mozglue, defect, P3, critical)

58 Branch
Unspecified
Android
defect

Tracking

()

RESOLVED FIXED
mozilla60
Tracking Status
firefox-esr52 --- unaffected
firefox56 --- unaffected
firefox57 --- unaffected
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- fixed

People

(Reporter: calixte, Assigned: glandium)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, topcrash, Whiteboard: [clouseau])

Crash Data

Attachments

(4 files)

This bug was filed from the Socorro interface and is 
report bp-1f68c277-d071-4621-a89c-3c0b20171030.
=============================================================

There are 16 crashes in nightly 58 (FennecAndroid only) starting with buildid 20171030100130. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1412405.

[1] https://hg.mozilla.org/mozilla-central/rev?node=339e44e41f22670bd665d935ef56f2262f9360db
Flags: needinfo?(nfroyd)
That's possible, I guess, but the crash stack isn't really dealing with the code touched in that patch, and we've had intermittent issues with this signature, it looks like.  I'm inclined to say the patch is not responsible...unless we  have a large increase in the number of occurrences.
Flags: needinfo?(nfroyd)
Priority: -- → P3
Gonna fix-optional this given Comment #1.
This is the #1 topcrash for the Android nightly of 20180111100641, happening
to 8 different installs, which is a bit concerning.  Is there anything we can
or should do here?
Flags: needinfo?(mh+mozilla)
From scraping those crash reports:
- They all crash at base-address of a library + 3
- They happen in a variety of libraries: android.hardware.media.omx@1.0.so, android.hardware.renderscript@1.0.so, libGLESv2_adreno.so, libaaudio.so, libandroid.so, libandroid_runtime.so, libart-compiler.so, libc++.so, libcamera2ndk.so, libcamera_client.so, libdrmframework.so, libexif.so, libgui.so, libgui_ext.so, libllvm-glnext.so, libmedia.so, libmediaplayerservice.so, libopenjdk.so, libpdfium.so, libqservice.so, libvixl-arm.so, libvixl-arm64.so, libwilhelm.so, libxml2.so.
- They happen on a variety of Android versions.

That would seem to indicate that some change in Fennec made this code exercised more or something.

The crashing code is:
  if (memcmp(ELFMAG, &ehdr->e_ident, SELFMAG) ||

And at the assembly level it looks like:
   441ae:       b5d0            push    {r4, r6, r7, lr}
   441b0:       af02            add     r7, sp, #8
   441b2:       78c1            ldrb    r1, [r0, #3]
   441b4:       7883            ldrb    r3, [r0, #2]
   441b6:       7842            ldrb    r2, [r0, #1]
   441b8:       f890 c000       ldrb.w  ip, [r0]
   441bc:       ea43 2101       orr.w   r1, r3, r1, lsl #8
   441c0:       4c14            ldr     r4, [pc, #80]
   441c2:       ea4c 2202       orr.w   r2, ip, r2, lsl #8
   441c6:       ea42 4101       orr.w   r1, r2, r1, lsl #16
   441ca:       42a1            cmp     r1, r4

The crash being on 0x441b2, so ldrb    r1, [r0, #3], thus the offset always being 3.

I really don't know how one can go past this https://dxr.mozilla.org/mozilla-central/source/mozglue/linker/ElfLoader.cpp#144-172 and still crash reading that memory... Specifically, that would imply write() was happy to read that data, but later reading it was forbidden...
Flags: needinfo?(mh+mozilla)
Oh, digging in one minidump actually says that the modules list in the crash report is obsolete, and the libraries are not actually mapped anymore. So that would imply a race condition with another thread dlclose()ing the libraries. The problem is that while we can add a lock to avoid races with our code calling into dl* functions, there will always be a race with e.g. dalvik threads calling the system dl* functions directly... I wonder if bionic has a dl_iterate_phdr function these days and if it does, if we could use it instead to loop on the system-loaded libraries.
Currently this is the #3 overall top browser crash on nightly. One comment says "browsing Feedly with FF as my browser"
Keywords: topcrash
Assignee: nobody → mh+mozilla
Comment on attachment 8943920 [details]
Bug 1414506 - Move dl_phdr_info filling and callback invocation to a separate class.

Oh great... turns out Android NDK doesn't have the dl_iterate_phdr symbol below version 21... That'd be Android 5.
Attachment #8943920 - Flags: review?(nfroyd)
(but as usual with bionic, the system headers do expose it at any level)
Attachment #8943921 - Flags: review?(nfroyd)
The good news, though, is that it /seems/ all the crashes happen with versions >= 23 (so, Android 6).
Comment on attachment 8943918 [details]
Bug 1414506 - Drive-by: Add missing <cstring> include to XZStream.cpp.

https://reviewboard.mozilla.org/r/214266/#review220014
Attachment #8943918 - Flags: review?(nfroyd) → review+
Comment on attachment 8943919 [details]
Bug 1414506 - Drive-by: Remove MOZ_CONCAT definition from mozglue/linker/Logging.h.

https://reviewboard.mozilla.org/r/214268/#review220016
Attachment #8943919 - Flags: review?(nfroyd) → review+
So, I was testing a new patch on try, and added a log to confirm whether I'm detecting dl_iterate_phdr correctly, but it looks like all our tests are on 4.3, so I only get to know the branch using the old code is called... I tested on my phone which has 6.0, and I do get a log saying it takes the other branch, so it looks like it works.

Anyways, what puzzles me is that in the test logs for those builds, I see logs from my additions a *lot*. Is is expected to call dl_iterate_phdr that much? See for example:

curl -sL https://public-artifacts.taskcluster.net/E8N2jH1sRSGV9r_klUd0Sw/0/public/logs/live_backing.log | zgrep dl_iterate_phdr | awk 'NR == 1 { print } END { print; print NR }'
[task 2018-01-23T06:17:09.013Z] 06:17:09     INFO -  01-22 22:13:42.228 W/GeckoLinker(  798): NOT using system dl_iterate_phdr
[task 2018-01-23T06:17:09.145Z] 06:17:09     INFO -  01-22 22:16:45.458 W/GeckoLinker(  798): NOT using system dl_iterate_phdr
272


That's 272 times in 3 minutes. That's more than once per second! I wonder if that's the reason for the sudden spike in crashes. If we call it that many, no wonder why we're hitting the race condition more.
Flags: needinfo?(mstange)
The background hang reporter might be responsible for the more frequent calls to SharedLibraryInfo. Nika, is BHR enabled on Android?
Flags: needinfo?(mstange) → needinfo?(nika)
Comment on attachment 8943921 [details]
Bug 1414506 - Use system dl_iterate_phdr for system loaded libraries when we can.

https://reviewboard.mozilla.org/r/214272/#review220426

::: mozglue/linker/ElfLoader.cpp:227
(Diff revision 3)
> +    return dl_iterate_phdr(callback, data);
> +  }
> +
> +  /* For versions of Android that don't support dl_iterate_phdr (< 5.0),
> +   * we go through the debugger helper data, which is known to be racy, but
> +   * there's not much we can do about this :( . */

Note that we /could/ stop all the other threads. But that sounds horrible.
(In reply to Markus Stange [:mstange] from comment #20)
> The background hang reporter might be responsible for the more frequent
> calls to SharedLibraryInfo. Nika, is BHR enabled on Android?

We should be avoiding doing anything related to shared libraries on android because of these undefs: https://searchfox.org/mozilla-central/rev/b7e3ec2468d42fa59d86c03ec7afeb209813f1d4/toolkit/components/backgroundhangmonitor/ThreadStackHelper.h#35-40

There's a chance we collect the module information unnecessarily on android though, as the monitor isn't disabled completely. I can add some more compile time checks to disable the code which calls into SharedLibraryInfo on Android as well :-).
Flags: needinfo?(nika)
Comment on attachment 8943920 [details]
Bug 1414506 - Move dl_phdr_info filling and callback invocation to a separate class.

https://reviewboard.mozilla.org/r/214270/#review221284

r=me; not going to insist on the comment below, but it'd be nice.

::: mozglue/linker/ElfLoader.cpp:128
(Diff revision 3)
> +int
> +DlIteratePhdrHelper::fill_and_call(dl_phdr_cb callback, const void* l_addr,
> +                                   const char* l_name, void *data)

Maybe provide a brief comment about what this method is supposed to do?  `fill_and_call` is not exactly evocative of what this method does.
Attachment #8943920 - Flags: review?(nfroyd) → review+
Comment on attachment 8943921 [details]
Bug 1414506 - Use system dl_iterate_phdr for system loaded libraries when we can.

https://reviewboard.mozilla.org/r/214272/#review220426

> Note that we /could/ stop all the other threads. But that sounds horrible.

Totally agreed on "that sounds horrible"...let's not try to do that.
Comment on attachment 8943921 [details]
Bug 1414506 - Use system dl_iterate_phdr for system loaded libraries when we can.

https://reviewboard.mozilla.org/r/214272/#review221290

::: mozglue/linker/ElfLoader.cpp:47
(Diff revision 3)
> +extern "C" MOZ_EXPORT int
> +dl_iterate_phdr(dl_phdr_cb callback, void *data) __attribute__((weak));

Random thought: I noticed, while trying to figure out where we setup our `__wrap_*` functions to be called (since we don't appear to use `-Wl,--wrap=...`?), that `mozglue/linker/dladdr.h` appears to be missing `extern "C"`.  We should either delete that header (surely everybody has `dladdr`?) or fix the declaration.

::: mozglue/linker/ElfLoader.cpp:207
(Diff revision 3)
>  
>  int
>  __wrap_dl_iterate_phdr(dl_phdr_cb callback, void *data)
>  {
> +  DlIteratePhdrHelper helper;
> +  AutoLock lock(&ElfLoader::Singleton.handlesMutex);

Do we want to lock this for the entirety of the function, or only for the `dl_iterate_phdr` case?
Attachment #8943921 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] from comment #29)
> Comment on attachment 8943921 [details]
> Bug 1414506 - Use system dl_iterate_phdr for system loaded libraries when we
> can.
> 
> https://reviewboard.mozilla.org/r/214272/#review221290
> 
> ::: mozglue/linker/ElfLoader.cpp:47
> (Diff revision 3)
> > +extern "C" MOZ_EXPORT int
> > +dl_iterate_phdr(dl_phdr_cb callback, void *data) __attribute__((weak));
> 
> Random thought: I noticed, while trying to figure out where we setup our
> `__wrap_*` functions to be called (since we don't appear to use
> `-Wl,--wrap=...`?), that `mozglue/linker/dladdr.h` appears to be missing
> `extern "C"`.  We should either delete that header (surely everybody has
> `dladdr`?) or fix the declaration.

I'll file a separate bug for this. It was only used as a forced include for nspr, but that was removed.
 
> ::: mozglue/linker/ElfLoader.cpp:207
> (Diff revision 3)
> >  
> >  int
> >  __wrap_dl_iterate_phdr(dl_phdr_cb callback, void *data)
> >  {
> > +  DlIteratePhdrHelper helper;
> > +  AutoLock lock(&ElfLoader::Singleton.handlesMutex);
> 
> Do we want to lock this for the entirety of the function, or only for the
> `dl_iterate_phdr` case?

I think it's safer to get the lock for the old code too. At least it avoids racing with other threads calling /our/ dlclose.
(In reply to Mike Hommey [:glandium] from comment #30)
> (In reply to Nathan Froyd [:froydnj] from comment #29)
> > ::: mozglue/linker/ElfLoader.cpp:207
> > (Diff revision 3)
> > >  
> > >  int
> > >  __wrap_dl_iterate_phdr(dl_phdr_cb callback, void *data)
> > >  {
> > > +  DlIteratePhdrHelper helper;
> > > +  AutoLock lock(&ElfLoader::Singleton.handlesMutex);
> > 
> > Do we want to lock this for the entirety of the function, or only for the
> > `dl_iterate_phdr` case?
> 
> I think it's safer to get the lock for the old code too. At least it avoids
> racing with other threads calling /our/ dlclose.

Probably worth a comment to that effect, then.
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/6d040b09b48d
Drive-by: Add missing <cstring> include to XZStream.cpp. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/534f60b7cb2b
Drive-by: Remove MOZ_CONCAT definition from mozglue/linker/Logging.h. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/ce1f9ca065cf
Move dl_phdr_info filling and callback invocation to a separate class. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/3b862e2b9e0b
Use system dl_iterate_phdr for system loaded libraries when we can. r=froydnj
Depends on: 1433418
Not seeing any reports of this outside trunk, so calling this wontfix for 59.
You need to log in before you can comment on or make changes to this bug.