Search activity crashes on VZW HTC One M8

RESOLVED FIXED in Firefox 40

Status

Firefox for Android Graveyard
Search Activity
--
critical
RESOLVED FIXED
3 years ago
4 months ago

People

(Reporter: snorp, Assigned: snorp)

Tracking

({crash, stackwanted})

34 Branch
Firefox 40
ARM
Android
crash, stackwanted

Details

Attachments

(2 attachments)

Some time last week the search activity started crashing on my Verizon HTC One M8. This happens after clicking the search button in the home screen widget.
Stacks plz.
Severity: normal → critical
tracking-fennec: --- → ?
Keywords: crash, stackwanted
OS: Mac OS X → Android
Hardware: x86 → ARM
Crash reporter gets a blank stack. I'll try to repro with a local build to see what's going on.
This is pretty strange. It looks like it only crashes if I enter the search activity after I've already started Fennec. I see the following log messages:

D/Telemetry( 5879): SendUIEvent: event = launch.1 method = widget timestamp = 698389554 extras = search
D/GeckoHardwareUtils( 5879): HardwareUtils already inited.
D/GeckoHardwareUtils( 5879): HardwareUtils already inited.
W/GeckoBatteryManager( 5879): Already started!
D/GeckoSessInfo( 5879): Recording start of session: 1429129318071
V/GeckoProfile( 5879): Fetching profile: '', 'null'
D/GeckoLayerClient( 5879): Window-size changed to (1080,1701)
I/InputMethodManagerService(11023): Enable input method client, pid=5879
V/GeckoProfile( 5879): Fetching profile: '', 'null'
D/GeckoHealthRec( 5879): Recording session end: P
V/GeckoHealthRec( 5879): Recorded session entry for env 39, current is 39
D/GeckoSessInfo( 5879): Recording session done: 1429129318071
D/GeckoBrowserProvider( 5879): Expiring history.
I/GeckoHealth( 5879): fennec :: HealthReportBroadcastService :: Registering HealthReportPruneService.
I/GeckoHealth( 5879): fennec :: BackgroundService :: Setting inexact repeating alarm for interval 86400000
I/InputMethodManagerService(11023): Disable input method client, pid=5879
D/GeckoBrowserProvider( 5879): Expiring thumbnails.
D/GeckoMemoryMonitor( 5879): onTrimMemory() notification received with level 20
D/GeckoMemoryMonitor( 5879): increasing memory pressure to 1
D/Telemetry( 5879): SendUIEvent: event = launch.1 method = widget timestamp = 698411473 extras = search
I/WebViewFactory( 5879): Loading com.google.android.webview version 40 (1832189-arm) (code 424501)
E/libsigchain( 5879): Warning: Unexpected sigaction action found 0x99ce54d5
I/LibraryLoader( 5879): Time to load native libraries: 3 ms (timestamps 9243-9246)
I/LibraryLoader( 5879): Expected native library version number "",actual native library version number ""
D/Process (11023): killProcessQuiet, pid=5879
E/InputEventReceiver(11335): Looper::removeFd(63) is failed, result(0), input channel 'ClientState{86c1b46 uid 10234 pid 5879} (c)'
I/ActivityManager(11023): Recipient 5879
I/ActivityManager(11023): Process org.mozilla.fennec (pid 5879) has died

It appears that Android is killing us?

The crash report is busted somehow. Example of one here: https://crash-stats.mozilla.com/report/index/a1a3e0fd-c03e-4c24-97fb-14b9d2150415
I can't seem to reproduce it with a local build. Interesting.
I think the important bit here might be:

I/LibraryLoader( 5879): Expected native library version number "",actual native library version number ""

That's coming from here: http://androidxref.com/5.0.0_r2/xref/external/chromium_org/base/android/java/src/org/chromium/base/library_loader/LibraryLoader.java#193

It looks like both versions are the empty string, but maybe not. If they aren't equal, it throws ProcessInitException. Then I think ActivityManager would kill us due to that.
Also I think the libsigchain message is interesting. It's probably complaining about either the crash reporter or the on-demand linker.
I guess since this used to work fine, a bisect might be useful.
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #5)
> I think the important bit here might be:
> 
> I/LibraryLoader( 5879): Expected native library version number "",actual
> native library version number ""
> 
> That's coming from here:
> http://androidxref.com/5.0.0_r2/xref/external/chromium_org/base/android/java/
> src/org/chromium/base/library_loader/LibraryLoader.java#193
> 
> It looks like both versions are the empty string, but maybe not. If they
> aren't equal, it throws ProcessInitException. Then I think ActivityManager
> would kill us due to that.

That message shows up unconditionally, so to me, it sounds like all it says is that the test that follows it is false and that ProcessInitException is not thrown.

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #6)
> Also I think the libsigchain message is interesting. It's probably
> complaining about either the crash reporter or the on-demand linker.

I think it's a red herring, but better safe than sorry, you should try adding some logging in SEGVHandler::__wrap_sigaction. I tried to reproduce on my phone, and while I got a crash (https://crash-stats.mozilla.com/report/index/e25b4d3f-e6a3-45ee-8218-3f7662150415), it only happened once (can't reproduce anymore), doesn't have your symptoms (it does have a backtrace). It also didn't have a libsigchain message in logcat.
tracking-fennec: ? → 40+
Assignee: nobody → snorp
Created attachment 8593596 [details] [diff] [review]
Put the search activity in a separate process
Attachment #8593596 - Flags: review?(margaret.leibovic)
This patch seems to work around it for me. Something about loading a WebView in Fennec is busted on this phone. The WebView was recently upgraded via Play Store, so I think that's the likely cause, as it crashed in old Nightlies as well.

Comment 11

3 years ago
Comment on attachment 8593596 [details] [diff] [review]
Put the search activity in a separate process

Review of attachment 8593596 [details] [diff] [review]:
-----------------------------------------------------------------

This seems fine to me. I'm trying to think of a reason why having the search activity in another process could cause problems, but I can't think of any. But asking rnewman for a double sanity check just in case.
Attachment #8593596 - Flags: review?(rnewman)
Attachment #8593596 - Flags: review?(margaret.leibovic)
Attachment #8593596 - Flags: review+
Perhaps Bug 1151967 regressed something here?
It looks like the libraryloader info message is a red herring; it pops up everywhere folks use WebView (e.g., https://code.google.com/p/android-developer-preview/issues/detail?id=491). Lazy development practices.

Concerns:

* Using a separate process is more memory-hungry, and can result in really undesirable behaviors like killing the main Fennec process in order to show the search activity. There needs to be a darn good reason for doing so.

* In theory this requires profile access from two processes (to get the search engine list), which will bypass any concurrency protections we have in GeckoProfile. The searches ContentProvider *should* be safe (ContentResolver will do IPC for us). This might not be problematic in practice, but I don't like to tempt the concurrency gods.

* We don't know exactly why this broke (I could hazard some guesses), and we don't know why this fixes it. That makes me nervous. We also use a WebView inside FxA setup; if that's broken too, we should find and fix the root cause, because we can't fork a new process every time we need a WebView.


A little digging suggests that we're hitting this:

https://github.com/golang/go/issues/9507
I downgraded the 'Android System WebView' package that was recently updated via the play store, and the problem went away. Updated again and the crash returns -- so I don't think any change we made caused this. I'm going to explore the libsigchain stuff a bit.
Some more notes: A debug build from inbound does not crash, but an opt one does. When using the crashme addon in the debug build, the crash reporter is not triggered, even though the build config indicates that the reporter should be enabled.
Actually the debug build crashes too once I enable the crashreporter with MOZ_CRASHREPORTER. So something is going on with the signal handling there, I think.
That was the Go folks' conclusion (Comment 13).
Alright, I was able to get it to crash while attached. I get this:

Program received signal SIGILL, Illegal instruction.
#0  0xb6f67564 in sigaddset () from /Users/snorp/source/jimdb-arm/lib/HT4AFSF03231/system/lib/libc.so
#1  0xa406c4dc in ?? ()
#2  0xa406c4dc in ?? ()


According to /proc/pid/maps, those unknown symbols are in /data/app/com.google.android.webview-1/lib/arm/libwebviewchromium.so which at least makes sense. Why sigaddset() would trigger SIGILL, though, I have no nidea.
If I ask gdb to disassemble that address, I get something interesting:

disassemble 0xb6f67564
Dump of assembler code for function sigaddset:
=> 0xb6f67564 <+0>:			; <UNDEFINED> instruction: 0xf004e51f
   0xb6f67568 <+4>:	b.n	0xb6f67186 <pthread_setname_np+178>
   0xb6f6756a <+6>:	add	r1, pc, #340	; (adr r1, 0xb6f676c0 <sigprocmask+50>)
   0xb6f6756c <+8>:	bls.n	0xb6f6757c <sigaddset+24>
   0xb6f6756e <+10>:	bl	0xb6f642f8 <__errno>
   0xb6f67572 <+14>:	movs	r3, #22
   0xb6f67574 <+16>:	str	r3, [r0, #0]
   0xb6f67576 <+18>:	mov.w	r0, #4294967295	; 0xffffffff
   0xb6f6757a <+22>:	pop	{r3, pc}
   0xb6f6757c <+24>:	and.w	r2, r1, #31
   0xb6f67580 <+28>:	movs	r3, #1
   0xb6f67582 <+30>:	ldr	r1, [r0, #0]
   0xb6f67584 <+32>:	lsl.w	r12, r3, r2
   0xb6f67588 <+36>:	orr.w	r2, r1, r12
   0xb6f6758c <+40>:	str	r2, [r0, #0]
   0xb6f6758e <+42>:	movs	r0, #0
   0xb6f67590 <+44>:	pop	{r3, pc}
End of assembler dump.

This doesn't match the disassemble of the libc.so I pulled from the device, and we can see the first instruction there is apparently bogus. Corruption?
I see a lot of the same undefined instruction stuff in the objdump disassembly too. Mike do you know what that's all about?
Flags: needinfo?(mh+mozilla)
If I simply install the same signal handlers as breakpad I can reproduce locally. This is the code I'm using for that in nsAppRunner() in the same spot where we'd normally init the CrashReporter (which I can't do because I'm on mac):

const int kExceptionSignals[] = {
  SIGSEGV, SIGABRT, SIGFPE, SIGILL, SIGBUS
};
const int kNumHandledSignals =
    sizeof(kExceptionSignals) / sizeof(kExceptionSignals[0]);

struct sigaction old_handlers[kNumHandledSignals];

static void
MySignalHandler(int sig, siginfo_t* info, void* uc)
{
  printf_stderr("SNORP: got signal: %d at %p\n", sig, info ? info->si_addr : 0);
}

static void
SetMyExceptionHandler()
{
  // Fail if unable to store all the old handlers.
  for (int i = 0; i < kNumHandledSignals; ++i) {
    if (sigaction(kExceptionSignals[i], NULL, &old_handlers[i]) == -1)
      return;
  }

  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sigemptyset(&sa.sa_mask);

  // Mask all exception signals when we're handling one of them.
  for (int i = 0; i < kNumHandledSignals; ++i)
    sigaddset(&sa.sa_mask, kExceptionSignals[i]);

  sa.sa_sigaction = MySignalHandler;
  sa.sa_flags = SA_ONSTACK | SA_SIGINFO;

  for (int i = 0; i < kNumHandledSignals; ++i) {
    if (sigaction(kExceptionSignals[i], &sa, NULL) == -1) {
      // At this point it is impractical to back out changes, and so failure to
      // install a signal is intentionally ignored.
    }
  }
}

My exception handler does get the SIGILL just as breakpad does, but everything continues to work since I don't do anything but print some stuff. I wonder if something in chromium actually expects SIGILL?
Ugh, I'm an idiot. Crash reporter has nothing to do with it. It crashes without that, but Android restarts the search activity so quickly it looks like nothing went wrong. When I tried a couple times in gdb I evidently just got lucky, but now I can repro every time. So we're mostly at square one as to what could cause a SIGILL in sigaddset() of all things.
The SIGILL address does map to the first instruction of sigaddset, as disassembled by objdump below. But the disassembly from gdb and objdump do not agree. They are pretty similar (or the same) after the bls.n line, but the two functions don't even appear to be the same length. And of course the one where we crash starts with a bogus instruction.

00014564 <sigaddset>:
   14564:       b508            push    {r3, lr}
   14566:       3901            subs    r1, #1
   14568:       b108            cbz     r0, 1456e <sigaddset+0xa>
   1456a:       291f            cmp     r1, #31
   1456c:       d906            bls.n   1457c <sigaddset+0x18>
   1456e:       f7fc fec3       bl      112f8 <__errno>
   14572:       2316            movs    r3, #22
   14574:       6003            str     r3, [r0, #0]
   14576:       f04f 30ff       mov.w   r0, #4294967295 ; 0xffffffff
   1457a:       bd08            pop     {r3, pc}
   1457c:       f001 021f       and.w   r2, r1, #31
   14580:       2301            movs    r3, #1
   14582:       6801            ldr     r1, [r0, #0]
   14584:       fa03 fc02       lsl.w   ip, r3, r2
   14588:       ea41 020c       orr.w   r2, r1, ip
   1458c:       6002            str     r2, [r0, #0]
   1458e:       2000            movs    r0, #0
   14590:       bd08            pop     {r3, pc}
Created attachment 8594991 [details] [diff] [review]
Put our sigaction diversion in __sigaction if it exists
Attachment #8594991 - Flags: review?(mh+mozilla)
Mike looked at this for like 5s and immediately realized that 'sigaction' was too small for our diversion trampoline. The attached patch puts it in __sigaction instead, if that exists.
Flags: needinfo?(mh+mozilla)
Comment on attachment 8594991 [details] [diff] [review]
Put our sigaction diversion in __sigaction if it exists

Review of attachment 8594991 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozglue/linker/ElfLoader.cpp
@@ +1115,5 @@
>     *   faulty.lib-loaded library set with sigaction.
>     */
>    void *libc = dlopen("libc.so", RTLD_GLOBAL | RTLD_LAZY);
>    if (libc) {
> +    // Some devices only have a small trampoline in 'sigaction', with the real

It's more of a lollipop thing than a "some devices" thing. The change of sigaction to call __sigaction comes from https://github.com/android/platform_bionic/commit/c7e9b2331771e5e87c34a8ee3dc6cc41d35b02fe

@@ +1116,5 @@
>     */
>    void *libc = dlopen("libc.so", RTLD_GLOBAL | RTLD_LAZY);
>    if (libc) {
> +    // Some devices only have a small trampoline in 'sigaction', with the real
> +    // work happening in '__sigaction'. Put our stuff there too, if it exists.

s/too//, since we don't divert sigaction in that case :)

That said, "Divert it instead of sigaction" would probably be more appropriate.

Please use /* */ instead of //.

Note, for the future in which system library symbols are resolved by our linker itself, we'll be able to get the symbol size and avoid patching when we don't have enough space.
Attachment #8594991 - Flags: review?(mh+mozilla) → review+
Attachment #8593596 - Flags: review?(rnewman) → review-
Re-nomming, 'cos it seems this'll bite us sooner than 40.
Status: NEW → ASSIGNED
tracking-fennec: 40+ → ?
https://hg.mozilla.org/mozilla-central/rev/5192a5da9260
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
status-firefox40: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → Firefox 40
[Tracking Requested - why for this release]: Lets get this on RelMgt radar as well.
status-firefox37: --- → affected
status-firefox38: --- → affected
status-firefox39: --- → affected
tracking-firefox38: --- → ?
tracking-firefox39: --- → ?
Thanks Kevin.
James, can we have an uplift request? Thanks
status-firefox37: affected → wontfix
tracking-firefox38: ? → +
tracking-firefox39: ? → +
Flags: needinfo?(snorp)
Comment on attachment 8594991 [details] [diff] [review]
Put our sigaction diversion in __sigaction if it exists

Approval Request Comment
[Feature/regressing bug #]: Android 5.0 changes
[User impact if declined]: Crashes
[Describe test coverage new/current, TreeHerder]: Nightly, local
[Risks and why]: low
[String/UUID change made/needed]: none
Flags: needinfo?(snorp)
Attachment #8594991 - Flags: approval-mozilla-beta?
Attachment #8594991 - Flags: approval-mozilla-aurora?
tracking-fennec: ? → 38+
Comment on attachment 8594991 [details] [diff] [review]
Put our sigaction diversion in __sigaction if it exists

Looks like a pretty safe fix. Beta+ Aurora+
Attachment #8594991 - Flags: approval-mozilla-beta?
Attachment #8594991 - Flags: approval-mozilla-beta+
Attachment #8594991 - Flags: approval-mozilla-aurora?
Attachment #8594991 - Flags: approval-mozilla-aurora+
Comment on attachment 8594991 [details] [diff] [review]
Put our sigaction diversion in __sigaction if it exists

[Triage Comment]
We merged m-b into m-b for 38
Should be in 38 beta 8
Attachment #8594991 - Flags: approval-mozilla-beta+ → approval-mozilla-release+
status-firefox38.0.5: --- → fixed

Updated

4 months ago
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.