Nightly crash on startup Galaxy S4 ( SIV ) GT-I9505

RESOLVED FIXED in Firefox 24

Status

()

defect
--
critical
RESOLVED FIXED
6 years ago
3 years ago

People

(Reporter: kbrosnan, Assigned: glandium)

Tracking

({crash, regression, reproducible})

24 Branch
Firefox 24
ARM
Android
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox21 unaffected, firefox22 unaffected, firefox23 unaffected, firefox24+ fixed, fennec24+)

Details

(Whiteboard: [native-crash])

Attachments

(4 attachments, 3 obsolete attachments)

We are crashing on startup. No crash id. This bit of the log looks interesting. CCing a Brian/Sriram as the device is in MV on my desk.

05-21 17:26:02.543 D/GeckoProfile(20291): Found profile dir: /data/data/org.mozilla.fennec/files/mozilla/2ziuifha.default
05-21 17:26:02.563 D/GeckoFavicons(20291): Creating Favicons instance
05-21 17:26:02.573 D/qdoverlay(  213): Unset pipe=VG0 dpy=0; Unset pipe=RGB1 dpy=0; 
05-21 17:26:02.593 D/skia    (20291): new locale en-Latn-US
05-21 17:26:02.613 D/AbsListView(20291): onVisibilityChanged() is called, visibility : 8
05-21 17:26:02.613 D/AbsListView(20291): unregisterIRListener() is called 
05-21 17:26:02.623 D/AbsListView(20291): Get MotionRecognitionManager
05-21 17:26:02.623 D/MotionRecognitionService(  734):  ssp status : true
05-21 17:26:02.633 D/libEGL  (20291): loaded /system/lib/egl/libEGL_adreno200.so
05-21 17:26:02.633 D/libEGL  (20291): loaded /system/lib/egl/libGLESv1_CM_adreno200.so
05-21 17:26:02.633 D/libEGL  (20291): loaded /system/lib/egl/libGLESv2_adreno200.so
05-21 17:26:02.643 I/Adreno200-EGL(20291): <qeglDrvAPI_eglInitialize:265>: EGL 1.4 QUALCOMM build:  (CL3544079)
05-21 17:26:02.643 I/Adreno200-EGL(20291): Build Date: 03/28/13 Thu
05-21 17:26:02.643 I/Adreno200-EGL(20291): Local Branch: adreno_20130328
05-21 17:26:02.643 I/Adreno200-EGL(20291): Remote Branch: 
05-21 17:26:02.643 I/Adreno200-EGL(20291): Local Patches: 
05-21 17:26:02.643 I/Adreno200-EGL(20291): Reconstruct Branch: 
05-21 17:26:02.643 D/GeckoViewsFactory(20291): Warning: unknown custom view: org.mozilla.gecko.BrowserToolbarLayout
05-21 17:26:02.643 D/GeckoViewsFactory(20291): Warning: unknown custom view: org.mozilla.gecko.BrowserToolbarLayout
05-21 17:26:02.663 D/GeckoViewsFactory(20291): Warning: unknown custom view: org.mozilla.gecko.TabCounter
05-21 17:26:02.663 D/GeckoViewsFactory(20291): Warning: unknown custom view: org.mozilla.gecko.TabCounter
Severity: normal → critical
Keywords: crash, regression
Whiteboard: [native-crash]
Version: Trunk → Firefox 24
Crash on my device (GT-I9505) as well. We can't startup-crash on flagship devices; so tracking 24.
I threw my local debug build on Aaron's device and that crashes too. Based on the logcat output (attached) it looks like it's crashing in loadSQLiteLibsNative in mozglue. I can't jimdb the SIGSEGV because the device isn't rooted.
Bisecting inbound builds shows this was caused by ef5b7b1039ac, bug 848764.
Blocks: 848764
Also for some reason the "run-as" command on Galaxy S4 seems to not find the packages specified, so I can't attach the gdbserver on this device without rooting it. I don't even know if rooting it will work, but I assume it will.

Using package org.mozilla.fennec_kats.
Launching org.mozilla.fennec_kats... Done
Attaching to pid 22482... as non-root... as root... 
"gdbserver" output:
 Cannot attach to lwp 22482: Operation not permitted (1)
 Exiting
"run-as" output:
 run-as: Package 'org.mozilla.fennec_kats' is unknown
"su -c" output:
 /system/bin/sh: su: not found
/Users/kats/android/jdb/moz-gdb/bin/../utils/gdbinit:133: Error in sourced command file:
failed to run gdbserver
(gdb) q
Can you build with DEFINES += -DMOZ_DEBUG_LINKER added to mozglue/linker/Makefile.in and attach a new logcat?
Attachment #753260 - Attachment mime type: text/x-log → text/plain
tracking-fennec: ? → 24+
So, thanks to a newer log from Aaron mapped to a build where i had the debugging info, I can interpret from that attached log:

> Caught segmentation fault @0x6e6d6409
So here, the linker is catching a segfault. The log itself comes from http://hg.mozilla.org/mozilla-central/file/8eebe35aae63/mozglue/linker/ElfLoader.cpp#l827 and the printed address is info->si_addr.
It happens to be inside a thread stack, so it doesn't make much sense that there would be an access error there, except if we tried to execute it, but it seems unlikely. The other weird thing is that we don't even get to the segfault handler printing something when mozglue is built with optimizations on.

> Redispatching to registered handler @0x401bec51
Since that address doesn't match a mapped library, it passes to another signal handler, which happens to be bionic's. And it calls the handler with the exact parameters it got.

> Fatal signal 11 (SIGSEGV), thread 5992 (Gecko)
This is bionic's signal handler displaying info about the crash, printed from https://github.com/android/platform_bionic/blob/master/linker/debugger.cpp#L138
It then connects to the debuggerd daemon to give it some info about the crash https://github.com/android/platform_bionic/blob/master/linker/debugger.cpp#L190
and then it returns and segfault for real, for debuggerd to catch and display. So theoretically, debuggerd is getting the same crash that triggered the signal handler in the first place...

> signal 11 (SIGSEGV), code 2 (SEGV_ACCERR), fault addr 798444c8
... but the address debuggerd gives for the fault address (and it uses info->si_addr for that, cf. https://github.com/android/platform_system_core/blob/d7cab8bff1334ba48620a16d1b98f2ae623fee7d/debuggerd/tombstone.c#L180 ) is different... That address corresponds to the linker itself reading DT_DYNAMIC data from libnss3.so in http://hg.mozilla.org/mozilla-central/file/8eebe35aae63/mozglue/linker/CustomElf.cpp#l470
For posterity, glandium and I discussed this a little on IRC yesterday. I noticed that this output:

> Fatal signal 11 (SIGSEGV), thread 5992 (Gecko)

actually comes from https://github.com/android/platform_bionic/blob/master/linker/debugger.cpp#L143 (line 143 instead of line 138) which means that either (1) debuggerd_signal_handler is getting called with a null siginfo_t or (2) the haveSiginfo function is returning false. These imply that there is another signal handler chained somewhere, either between us and bionic in case (1), or on top of us in case (2).

However, we wrap the signal and sigaction functions in mozglue. So it seems pretty unlikely that (2) would ever happen, because any attempt to do that would our wrapper which moves the signal handler down the chain to below our handler. So it is likely to be case (1). The other odd thing is the stack address we are getting in si_addr, as glandium mentioned in the last comment. This can be explained if there is a signal handler on top of ours that is not using SA_SIGINFO, but because we wrap the functions I don't see how that could happen. It doesn't make much sense either way.
Using an updated build from Mike with a fix and that works for me (as conversed in IRC). I think it's fine with getting a bandaid fix up there for Nightly users before diving into the crux of the issue with on-demand decompression on the S4.
Well, the build with a mitigation was the one crashing on the compositor ; the build that didn't crash essentially had on-demand decompression disabled, but by means of decompressing everything ahead, not by actually backing out 848764. I think a reasonable way forward here is to setup a blacklist of devices where we disable on-demand decompression at runtime.
Blocks: 875824
So, thanks to some tracking I added in a custom build that Aaron ran, I was able to figure out what is going on:
libsc-a3xx.so is essentially doing this:
  sighandler_t old6 = signal(6, my_handler);
  sighandler_t old11 = signal(11, my_handler);
  do_stuff();
  signal(6, old6);
  signal(11, old11);

This effectively resets the flags set with sigaction and when we're back to our segfault handler and it's invoked, the siginfo and context arguments are not set by the kernel because of the lack of SA_SIGINFO flag. Also, since we're multi-threaded, when one thread is in do_stuff() and another should be hitting our sigsegv handler, it doesn't, and that crashes; that's why the first attempt at fixing things failed with crashes in the compositor. Also, I don't know what their segfault handler does, but if it happens that there is a legitimate crash happening when it's set on a separate thread, we may lose crash reports (although that's probably a corner case).

I have a working patch that makes fennec work on the S4 without the workaround from bug 875824, but it needs to be modified to work on x86 and validated to work on armv6 (it only supports arm and thumb and was only tested on thumb). That patch essentially replaces the first instructions in libc's bsd_signal, sysv_signal and sigaction with a branch instruction calling into our own wrappers, which then call the sigaction system call directly. This happens to be enough on android because the sigaction libc stub is mostly just calling the kernel directly. It is not the case on glibc systems, though, and i suspect android may at some point in the future add some things like a default restorer such that gdb will be happier with signal handlers.

So I think it would be better if we could call back into the original functions, but since we're modifying them, it means we'd need something like what we're doing in the DllInterceptor on Windows, which is more complicated and fragile. Another option would be to look for all GOT entries pointing to these functions in the libraries that have already been loaded and change them to point to our wrappers, but it might fail in some subtle ways, especially with prelinking. It might also add some unwanted startup overhead.
Posted patch WIP (obsolete) — Splinter Review
This is the current cleaned up work in progress (as in, with all the additional debugging removed)
Assignee: nobody → mh+mozilla
Posted patch WIP (obsolete) — Splinter Review
One more cleanup.
Attachment #754516 - Attachment is obsolete: true
I'm wondering if it's worth putting in this workaround at all. The SGS4 is a pretty beefy device and may not benefit that much from decompression-on-demand anyway. Unless there are other devices that are also affected we could just disable the decompression-on-demand for this device and be done with it.
I suspect all devices with an adreno 3xx are affected. Thanks Qualcomm.

That being said, even on a device like the S4, i think on-demand decompression can have a great benefit, combined with binary reordering.
Attachment #754518 - Attachment is obsolete: true
Comment on attachment 754907 [details] [diff] [review]
Hook libc's sigaction to avoid system libraries replacing our segfault handler temporarily and restoring it wrongly

Apparently, this breaks android x86.
Attachment #754907 - Flags: review?(nfroyd)
No longer blocks: 875824
Depends on: 875824
Attachment #754907 - Attachment is obsolete: true
Comment on attachment 755350 [details] [diff] [review]
Hook libc's sigaction to avoid system libraries replacing our segfault handler temporarily and restoring it wrongly

Review of attachment 755350 [details] [diff] [review]:
-----------------------------------------------------------------

I love working with libraries and signals, don't you?
Attachment #755350 - Flags: review?(nfroyd) → review+
https://hg.mozilla.org/mozilla-central/rev/1f228b1d9566
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 24
Depends on: 886736
No longer depends on: 886736
You need to log in before you can comment on or make changes to this bug.