Closed Bug 810471 Opened 8 years ago Closed 5 years ago

Crash on startup during various tests

Categories

(Firefox for Android :: General, defect)

x86
Android
defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: gbrown, Assigned: gbrown)

References

(Depends on 1 open bug)

Details

Various intermittent test failure bugs all appear to have one thing in common: Fennec crashes on startup. I believe this includes reftests, mochitests, robocop, and talos.

Logcat often shows:

E/GeckoLibLoad( 1536): Load nss done

shortly before the crash. 

Stack traces are often not available, or show little more that libc and system library calls.

Consider: 

Bug 722166 - Android tests intermittently just don't start 
Bug 796914 - Intermittent "talosError: 'stack found after process termination (org.mozilla.fennec: terminated by testAgent.killProcess, plugin-container: terminated by testAgent.killProcess, crashreporter: terminated by testAgent.killProcess)'"
Bug 747787 - Intermittent Tegra crash in libdvm.so during tests
Bug 686245 - Intermittent Android "talosError: 'browser non-zero return code (1) [browser_output.txt]'"

and possibly also:

Bug 810405 - Browser crashes not reported as test failures
Bug 690176 - Intermittent Android "talosError: 'initialization timed out'"

These bugs have thousands of tbpl failure logs. Many of these bugs contain logs that are not startup crashes, but I think startup crashes are a significant cause of the failures in most of these bugs.

The absence of good stacks is a significant barrier to resolution.
(In reply to Geoff Brown [:gbrown] from comment #0)
> Logcat often shows:
> 
> E/GeckoLibLoad( 1536): Load nss done
> 
> shortly before the crash. 

E/GeckoConsole(15253): Could not read chrome manifest 'file:///data/data/org.mozilla.fennec/chrome.manifest'.

is just as common, and occurs after "Load nss done".
(In reply to Geoff Brown [:gbrown] from comment #1)
> E/GeckoConsole(15253): Could not read chrome manifest
> 'file:///data/data/org.mozilla.fennec/chrome.manifest'.
> 
> is just as common, and occurs after "Load nss done".

Neither are something wrong happening.
(In reply to Mike Hommey [:glandium] from comment #2)
> (In reply to Geoff Brown [:gbrown] from comment #1)
> > E/GeckoConsole(15253): Could not read chrome manifest
> > 'file:///data/data/org.mozilla.fennec/chrome.manifest'.
> > 
> > is just as common, and occurs after "Load nss done".
> 
> Neither are something wrong happening.

We should make them INFOs rather than ERRORs and/or suppress them if that's the case.
(In reply to Ed Morley [:edmorley UTC+0] from comment #3)
> (In reply to Mike Hommey [:glandium] from comment #2)
> > (In reply to Geoff Brown [:gbrown] from comment #1)
> > > E/GeckoConsole(15253): Could not read chrome manifest
> > > 'file:///data/data/org.mozilla.fennec/chrome.manifest'.
> > > 
> > > is just as common, and occurs after "Load nss done".
> > 
> > Neither are something wrong happening.
> 
> We should make them INFOs rather than ERRORs and/or suppress them if that's
> the case.

The "Could not read chrome manifest" one is from the component manager, which sends it to the error console. You'll see similar messages on desktop builds, in the error console. There is no severity distinction for the component manager messages.

For the others, there's a tendency to use ANDROID_LOG_ERROR for calls to __android_log_print in APKOpen.cpp which i'm not sure where they come from.

The latter could be filed as a separate bug. The former would need to add severity to the component manager logging. File another bug for that if you want.
(In reply to Mike Hommey [:glandium] from comment #2)
> (In reply to Geoff Brown [:gbrown] from comment #1)
> > E/GeckoConsole(15253): Could not read chrome manifest
> > 'file:///data/data/org.mozilla.fennec/chrome.manifest'.
> > 
> > is just as common, and occurs after "Load nss done".
> 
> Neither are something wrong happening.

I know they are normal -- I just wanted to note that these are often the most-recent log messages before the crash. Since I cannot reproduce the crash with any regularity, I have been adding more logging to try runs. Preliminary results suggest -- but only suggest -- that the crash is in loadGeckoLibsNative.
See Also: → 780831
(In reply to Geoff Brown [:gbrown] from comment #5)
> Preliminary results suggest -- but only suggest -- that the crash is in
> loadGeckoLibsNative.

Logging shows that many of the crashes occur while loadGeckoLibsNative is executing...but sometimes the crash occurs after loadGeckoLibsNative completes, and occasionally before it starts. I'm thinking:
 - maybe the crash is on a different thread
 - maybe there are multiple causes 
 - maybe the logging is not reliable
Blocks: 816584
Blocks: 817583
On a tip from glandium on an unrelated matter, I thought I would try re-creating the crash with additional logging enabled with:

DEFINES += -DMOZ_DEBUG_LINKER

in mozglue/linker/Makefile.in

Here are 2 such crashes with the extra logging:

https://tbpl.mozilla.org/php/getParsedLog.php?id=17566044&tree=Try&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=17561794&tree=Try&full=1

Here's a sample:

12-03 13:05:48.236 E/GeckoLinker( 7522): /data/app/org.mozilla.fennec-1.apk - GetFile libxul.so
12-03 13:05:48.236 E/GeckoLinker( 7522): /data/app/org.mozilla.fennec-1.apk - Scan directory entries in search for libxul.so
12-03 13:05:48.236 E/GeckoLinker( 7522): Decompression buffer of size 37564622 in ashmem "libxul.so", mapped @0x5274d000
12-03 13:05:48.236 E/GeckoLinker( 7522): CustomElf::Load("/data/app/org.mozilla.fennec-1.apk!/libxul.so", 3) = ...
12-03 13:05:48.236 E/GeckoLinker( 7522): /data/app/org.mozilla.fennec-1.apk!/libxul.so: Warning: program header type #1879048193 not handled
12-03 13:05:48.236 E/GeckoLinker( 7522): PT_LOAD @0x00000000 (filesz: 0x0006c850, memsz: 0x0006c850, offset: 0x00000000, flags: r-x)
12-03 13:05:48.236 E/GeckoLinker( 7522): PT_LOAD @0x002880f4 (filesz: 0x010a1f14, memsz: 0x010a1f14, offset: 0x0006d0f4, flags: r-x)
12-03 13:05:48.236 E/GeckoLinker( 7522): PT_LOAD @0x0132b008 (filesz: 0x001a2cb0, memsz: 0x0021315c, offset: 0x0110f008, flags: rw-)
12-03 13:05:48.236 E/GeckoLinker( 7522): PT_DYNAMIC @0x01474c40 (filesz: 0x00000150, memsz: 0x00000150, offset: 0x01258c40, flags: rw-)
12-03 13:05:48.236 E/GeckoLinker( 7522): /data/app/org.mozilla.fennec-1.apk!/libxul.so: Warning: program header type #1685382480 not handled
12-03 13:05:48.236 E/GeckoLinker( 7522): PT_GNU_STACK @0x00000000 (filesz: 0x00000000, memsz: 0x00000000, offset: 0x00000000, flags: rw-)
12-03 13:05:48.236 E/GeckoLinker( 7522): /data/app/org.mozilla.fennec-1.apk!/libxul.so: Loading segment @0x54b22000 r-x
12-03 13:05:48.256 E/GeckoLinker( 7522): cacheflush(0x5274d000, 0x527b9850)
12-03 13:05:48.256 E/GeckoLinker( 7522): /data/app/org.mozilla.fennec-1.apk!/libxul.so: Loading segment @0x54daa000 r-x
12-03 13:05:48.316 E/GeckoLinker( 7522): Caught segmentation fault @0x270000
12-03 13:05:48.316 E/GeckoLinker( 7522): Redispatching to registered handler @0xb0003515
12-03 13:05:48.416 I/DEBUG   (  937): *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
12-03 13:05:48.416 I/DEBUG   (  937): Build fingerprint: 'nvidia/harmony/harmony/harmony:2.2/FRF91/20110202.102810:eng/test-keys'
12-03 13:05:48.416 I/DEBUG   (  937): pid: 7522, tid: 7522  >>> org.mozilla.fennec <<<
12-03 13:05:48.416 I/DEBUG   (  937): signal 11 (SIGSEGV), fault addr 00270000
12-03 13:05:48.416 I/DEBUG   (  937):  r0 00310b00  r1 0026fffc  r2 000019e5  r3 00fb709a
12-03 13:05:48.416 I/DEBUG   (  937):  r4 9a164904  r5 f7f865dd  r6 2b36a00b  r7 73f43a94
12-03 13:05:48.416 I/DEBUG   (  937):  r8 566f90b6  r9 08e40b93  10 eda26da8  fp fb709aed
12-03 13:05:48.416 I/DEBUG   (  937):  ip 9a164904  sp bef04cc0  lr 00000018  pc afd0f388  cpsr 20000010
12-03 13:05:48.416 I/DEBUG   (  937):  d0  4481a00044564000  d1  0000009d000000b4
12-03 13:05:48.416 I/DEBUG   (  937):  d2  0000000000000000  d3  0000000000000000
12-03 13:05:48.416 I/DEBUG   (  937):  d4  0000000000000000  d5  3ff000003f800000
12-03 13:05:48.416 I/DEBUG   (  937):  d6  4456400040800000  d7  431d000040800000
12-03 13:05:48.416 I/DEBUG   (  937):  d8  0000040d00000359  d9  0000000000000004
12-03 13:05:48.416 I/DEBUG   (  937):  d10 0000000000000000  d11 0000000000000000
12-03 13:05:48.416 I/DEBUG   (  937):  d12 0000000000000000  d13 0000000000000000
12-03 13:05:48.416 I/DEBUG   (  937):  d14 0000000000000000  d15 0000000000000000
12-03 13:05:48.416 I/DEBUG   (  937):  scr 80000012
12-03 13:05:48.416 I/DEBUG   (  937): 
12-03 13:05:48.466 I/DEBUG   (  937):          #00  pc 0000f388  /system/lib/libc.so
12-03 13:05:48.466 I/DEBUG   (  937):          #01  pc 0008fabc  /system/lib/libskia.so
12-03 13:05:48.466 I/DEBUG   (  937):          #02  pc 000402b0  /system/lib/libskia.so
12-03 13:05:48.466 I/DEBUG   (  937):          #03  pc 000d8640  /system/lib/libskia.so
12-03 13:05:48.466 I/DEBUG   (  937):          #04  pc 000dd2ce  /system/lib/libskia.so
12-03 13:05:48.466 I/DEBUG   (  937):          #05  pc 000d785c  /system/lib/libskia.so
12-03 13:05:48.466 I/DEBUG   (  937):          #06  pc 000d7aac  /system/lib/libskia.so
12-03 13:05:48.466 I/DEBUG   (  937):          #07  pc 00040d44  /system/lib/libskia.so
12-03 13:05:48.476 I/DEBUG   (  937):          #08  pc 0003d7a8  /system/lib/libskia.so
12-03 13:05:48.476 I/DEBUG   (  937):          #09  pc 00049b2c  /system/lib/libskia.so
12-03 13:05:48.476 I/DEBUG   (  937):          #10  pc 000423d4  /system/lib/libskia.so
12-03 13:05:48.476 I/DEBUG   (  937):          #11  pc 00042448  /system/lib/libskia.so
12-03 13:05:48.476 I/DEBUG   (  937):          #12  pc 000852cc  /system/lib/libskia.so
12-03 13:05:48.476 I/DEBUG   (  937):          #13  pc 0004b568  /system/lib/libskia.so
12-03 13:05:48.476 I/DEBUG   (  937):          #14  pc 0006d65c  /system/lib/libskia.so
12-03 13:05:48.476 I/DEBUG   (  937): 
12-03 13:05:48.476 I/DEBUG   (  937): code around pc:
12-03 13:05:48.476 I/DEBUG   (  937): afd0f368 e1a0a82a e18aa80b e8a007f8 e1a0382b 
12-03 13:05:48.476 I/DEBUG   (  937): afd0f378 2affffe8 ea000030 e491c004 e1a0400c 
12-03 13:05:48.476 I/DEBUG   (  937): afd0f388 e8b10fe0 f5d1f040 e2522020 2491c004 
12-03 13:05:48.476 I/DEBUG   (  937): afd0f398 e1833c04 e1a04424 e1844c05 e1a05425 
12-03 13:05:48.476 I/DEBUG   (  937): afd0f3a8 e1855c06 e1a06426 e1866c07 e1a07427 
12-03 13:05:48.476 I/DEBUG   (  937): 
12-03 13:05:48.476 I/DEBUG   (  937): code around lr:
12-03 13:05:48.476 I/DEBUG   (  937): 
12-03 13:05:48.476 I/DEBUG   (  937): stack:
12-03 13:05:48.476 I/DEBUG   (  937):     bef04c80  00312568  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04c84  00312800  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04c88  00000001  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04c8c  00002530  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04c90  000000c5  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04c94  002f50a4  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04c98  000000eb  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04c9c  000002d1  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04ca0  af710f0c  /system/lib/libz.so
12-03 13:05:48.476 I/DEBUG   (  937):     bef04ca4  000001ff  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04ca8  00000008  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04cac  afd4a9b4  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04cb0  00000208  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04cb4  002f2d90  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04cb8  00000001  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04cbc  af712158  /system/lib/libz.so
12-03 13:05:48.476 I/DEBUG   (  937): #00 bef04cc0  00002000  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04cc4  00002000  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04cc8  00314a48  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04ccc  002f2d90  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04cd0  00000001  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04cd4  0000005f  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04cd8  00000000  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04cdc  00310508  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04ce0  00249c28  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04ce4  ab18fac0  /system/lib/libskia.so
12-03 13:05:48.476 I/DEBUG   (  937): #01 bef04ce8  002f2d90  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04cec  00002000  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04cf0  00002000  
12-03 13:05:48.476 I/DEBUG   (  937):     bef04cf4  ab1402b4  /system/lib/libskia.so
12-03 13:05:48.916 E/GeckoLinker( 7522): cacheflush(0x527ba000, 0x5385c008)
12-03 13:05:49.106 I/DEBUG   (  937): debuggerd committing suicide to free the zombie!
12-03 13:05:49.106 I/DEBUG   ( 7541): debuggerd: Feb  2 2011 10:46:35
12-03 13:05:49.116 I/BootReceiver( 1020): Copying /data/tombstones/tombstone_09 to DropBox (SYSTEM_TOMBSTONE)
12-03 13:05:49.116 I/ActivityManager( 1020): Process org.mozilla.fennec (pid 7522) has died.
12-03 13:05:49.126 D/Zygote  (  939): Process 7522 terminated by signal (11)

Is this helpful? Where can we go from here?
Flags: needinfo?(mh+mozilla)
Blocks: 818107
Blocks: 818235
(In reply to Geoff Brown [:gbrown] from comment #5)
> Preliminary results suggest -- but only suggest -- that the crash is in
> loadGeckoLibsNative.

Comment 7 suggests otherwise.
Flags: needinfo?(mh+mozilla)
Blocks: 749785
There is also bug 757405, which has a SIGSEGV blamed on libskia in reftests. In that bug, the crash happens later -- typically 5 to 10 seconds after libs are loaded.
Blocks: 823535
And in bug 823535, we see the same symptoms on a pandaboard.
Blocks: 781107
Blocks: 825643
Depends on: 824760
Depends on: 803158
Blocks: 829371
These unexplained crash-on-startup failures are now quite rare. Reviews of recent logs reported in bugs 796914, 722166, 686245, 816584, and 781107 show that the most frequent problems are:

Bug 761987 - shutdown crashes in nsCertificate_Destroy
Bug 829419 - crashes in nsDeleteDir::PostTimer 
Bug 830557 - java.lang.NullPointerException in IInputMethodManager

There is a fairly high (temporary?) incidence of unexplained crashes on mozilla-beta: see https://bugzilla.mozilla.org/show_bug.cgi?id=686245#c1344.

The remaining unexplained crash on startup logs that I can find from the last week are:

https://bugzilla.mozilla.org/show_bug.cgi?id=796914#c623 (Jan 12)
https://bugzilla.mozilla.org/show_bug.cgi?id=781107#c233 (Jan 10)
https://bugzilla.mozilla.org/show_bug.cgi?id=816584#c49 (mozilla-aurora, Jan 7)
https://bugzilla.mozilla.org/show_bug.cgi?id=816584#c50 (mozilla-aurora, Jan 7)
https://bugzilla.mozilla.org/show_bug.cgi?id=816584#c51 (mozilla-aurora, Jan 7)


(There are additional examples on Jan 6 and earlier; eg https://bugzilla.mozilla.org/show_bug.cgi?id=686245#c1291).
Bug 833766 contains a crash report with symbols that would look similar to our unexplained crash reports, if it did not have symbols. In that case, the top frame is /data/data/org.mozilla.fennec_morbo/lib/libmozglue.so (Java_org_mozilla_gecko_GeckoAppShell_reportJavaCrash) ... worth investigating.
Blocks: 822296
Bug 837821 describes a possible cause for the remaining, unexplained startup crashes, and the test failures that initiated that bug look very similar to those we have been chasing in this bug.
Depends on: 837821
Depends on: 830557, 829419, 823452, 826385
Depends on: 838636
Assignee: nobody → gbrown
We still see some crashes on test startup, but they occur much less frequently than when this bug was opened. Reporting uncaught java exceptions was particularly helpful. Dependent bugs 803158 and 824760 are still good ideas that we should pursue when priorities allow.
Assignee: gbrown → nobody
We did good work here. There are far fewer startup crashes now and this bug has outlived its usefulness.
Assignee: nobody → gbrown
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.