Closed Bug 722166 Opened 12 years ago Closed 8 years ago

Fennec intermittently crashes on startup ("Automation Error: No crash directory (/mnt/sdcard/tests/profile/minidumps/) found on remote device" or "Automation Error: No crash directory (/mnt/sdcard/tests/reftest/profile/minidumps/) found on remote device")

Categories

(Testing :: General, defect)

ARM
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: philor, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: [android_tier_1_mozharness])

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #689856 +++

A horrible summary, for horrible logs, not unlike the "just stop" clone source. A typical example is

https://tbpl.mozilla.org/php/getParsedLog.php?id=8920720&tree=Firefox
Android Tegra 250 mozilla-central opt test jsreftest-2 on 2012-01-29 03:52:55 PST for push aee879a3190a

python reftest/remotereftest.py...
unable to execute ADB: ensure Android SDK is installed and adb is in your $PATH
restarting as root failed
reconnecting socket
args: ['../hostutils/bin/xpcshell', '-g', '/builds/tegra-163/test/build/hostutils/xre', '-v', '170', '-f', '/builds/tegra-163/test/build/tests/reftest/reftest/components/httpd.js', '-e', "const _PROFILE_PATH = '/tmp/tmpF9TDpg';const _SERVER_PORT = '30163'; const _SERVER_ADDR ='10.250.48.218';", '-f', '/builds/tegra-163/test/build/tests/reftest/server.js']
INFO | remotereftests.py | Server pid: 96872
pushing directory: /tmp/tmptATQTB to /mnt/sdcard/tests/reftest/profile
pushing directory: /tmp/tmptATQTB to /mnt/sdcard/tests/reftest/profile
REFTEST INFO | runreftest.py | Running tests: start.

FIRE PROC: '"MOZ_CRASHREPORTER=1,XPCOM_DEBUG_BREAK=stack,MOZ_CRASHREPORTER_NO_REPORT=1,NO_EM_RESTART=1,MOZ_PROCESS_LOG=/tmp/tmplVo9sfpidlog,XPCOM_MEM_BLOAT_LOG=/tmp/tmptATQTB/runreftest_leaks.log" org.mozilla.fennec -no-remote -profile /mnt/sdcard/tests/reftest/profile/'
INFO | automation.py | Application pid: 1538
DeviceManager: error pulling file '/mnt/sdcard/tests/reftest/reftest.log': No such file or directory
DeviceManager: error pulling file '/mnt/sdcard/tests/reftest/reftest.log': No such file or directory
(repeats over and over, because DeviceManager is sure that browser which isn't actually running at all will soon start filling up the log)
DeviceManager: error pulling file '/mnt/sdcard/tests/reftest/reftest.log': No such file or directory
DeviceManager: error pulling file '/mnt/sdcard/tests/reftest/reftest.log': No such file or directory

INFO | automation.py | Application ran for: 1:01:18.565134
INFO | automation.py | Reading PID log: /tmp/tmplVo9sfpidlog
getting files in '/mnt/sdcard/tests/reftest/profile/minidumps/'
WARNING | automationutils.processLeakLog() | refcount logging is off, so leaks can't be detected!

REFTEST INFO | runreftest.py | Running tests: end.
DeviceManager: error pulling file '/mnt/sdcard/tests/reftest/reftest.log': No such file or directory
program finished with exit code 0
elapsedTime=3688.976364
TinderboxPrint: jsreftest-2<br/><em class="testfail">T-FAIL</em>

logcat doesn't appear to have much to say - there's a "java.net.SocketException: Connection reset by peer" but then, that's no real surprise. Might just be a flavor of bug 681861, with the lost connection happening right at startup instead of later on, I guess.
https://tbpl.mozilla.org/php/getParsedLog.php?id=9124289&tree=Mozilla-Inbound
Whiteboard: [orange] → [orange][android_tier_∞]
(-- start of Android native is broken but won't admit it --)

https://tbpl.mozilla.org/php/getParsedLog.php?id=10016879&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10017153&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10018449&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10018650&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10019399&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10020912&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10019236&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10019620&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10020269&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10022679&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10032865&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10032795&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10033111&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10033212&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10036500&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10035078&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10034938&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10038180&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10040049&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10040578&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10041917&tree=Mozilla-Inbound
This might have always been yours, but it certainly is after the March 12th comment 282 bustage.

https://tbpl.mozilla.org/php/getParsedLog.php?id=10065410&tree=Mozilla-Inbound
Product: Testing → Fennec Native
QA Contact: general → general
https://tbpl.mozilla.org/php/getParsedLog.php?id=10091515&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10091523&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10091430&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10092097&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10092844&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10094791&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10095652&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10097172&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10102009&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10101448&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10099167&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10100642&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10098895&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10099119&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10099170&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10098929&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10099308&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10100007&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10100033&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10100302&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10100144&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10101979&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10102573&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10103118&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10102794&tree=Mozilla-Inbound
You're not long at all away from no longer being tolerable as a tier 1, visible on tbpl platform :(
Severity: normal → blocker
https://tbpl.mozilla.org/php/getParsedLog.php?id=10156567&tree=Firefox (after I merged the backout that should have fixed the explosion of this, because the world is a cruel place, and tegras are a particularly cruel part of it)
Whiteboard: [orange][android_tier_∞] → [orange][android_tier_1_mozharness]
blocking-fennec1.0: --- → -
https://tbpl.mozilla.org/php/getParsedLog.php?id=15423565&tree=Mozilla-Inbound
tegra-271
Product: Firefox for Android → Testing
this is interesting as I get very similar log entries on a passing version:
01-17 11:17:34.496 I/ActivityManager( 1021): Displayed activity org.mozilla.fennec_jmaher/.App: 540 ms (total 540 ms)
01-17 11:17:34.526 I/SUTAgentAndroid( 1503): 192.168.1.73 : ps
01-17 11:17:34.596 I/GeckoAppShell(19856): env0: ANDROID_SOCKET_zygote=10
01-17 11:17:34.596 I/GeckoAppShell(19856): env1: ANDROID_BOOTLOGO=1
01-17 11:17:34.596 I/GeckoAppShell(19856): env2: EXTERNAL_STORAGE=/mnt/sdcard
01-17 11:17:34.596 I/GeckoAppShell(19856): env3: ANDROID_ASSETS=/system/app
01-17 11:17:34.596 I/GeckoAppShell(19856): env4: ASEC_MOUNTPOINT=/mnt/asec
01-17 11:17:34.596 I/GeckoAppShell(19856): env5: PATH=/sbin:/system/sbin:/system/bin:/system/xbin
01-17 11:17:34.596 I/GeckoAppShell(19856): env6: ANDROID_DATA=/data
01-17 11:17:34.596 I/GeckoAppShell(19856): env7: BOOTCLASSPATH=/system/framework/core.jar:/system/framework/ext.jar:/system/framework/framework.jar:/system/framework/android.policy.jar:/system/framework/services.jar:/system/framework/com.nvidia.graphics.jar
01-17 11:17:34.596 I/GeckoAppShell(19856): env8: ANDROID_PROPERTY_WORKSPACE=9,32768
01-17 11:17:34.596 I/GeckoAppShell(19856): env9: ANDROID_ROOT=/system
01-17 11:17:34.596 I/GeckoAppShell(19856): env10: LD_LIBRARY_PATH=/system/lib
01-17 11:17:34.596 I/GeckoAppShell(19856): env11: MOZ_CRASHREPORTER_NO_REPORT=1
01-17 11:17:34.596 I/GeckoAppShell(19856): env12: MOZ_CRASHREPORTER=1
01-17 11:17:34.596 I/GeckoAppShell(19856): env13: XPCOM_MEM_BLOAT_LOG=/tmp/tmpjHmyoY/runtests_leaks.log
01-17 11:17:34.596 I/GeckoAppShell(19856): env14: XPCOM_DEBUG_BREAK=stack
01-17 11:17:34.596 I/GeckoAppShell(19856): env15: MOZ_PROCESS_LOG=/tmp/tmpUGYRh5pidlog
01-17 11:17:34.596 I/GeckoAppShell(19856): env16: NO_EM_RESTART=1
01-17 11:17:34.596 I/GeckoAppShell(19856): env17: null
01-17 11:17:34.606 E/GeckoLibLoad(19856): Load sqlite start
01-17 11:17:34.626 E/GeckoLinker(19856): /data/app/org.mozilla.fennec_jmaher-1.apk!/libmozsqlite3.so: Warning: relocation to NULL @0x00054d18
01-17 11:17:34.626 E/GeckoLinker(19856): /data/app/org.mozilla.fennec_jmaher-1.apk!/libmozsqlite3.so: Warning: relocation to NULL @0x00054c14 for symbol "__cxa_begin_cleanup"
01-17 11:17:34.626 E/GeckoLinker(19856): /data/app/org.mozilla.fennec_jmaher-1.apk!/libmozsqlite3.so: Warning: relocation to NULL @0x00054c80 for symbol "__cxa_type_match"
01-17 11:17:34.626 E/GeckoLibLoad(19856): Load sqlite done
01-17 11:17:34.636 E/GeckoLibLoad(19856): Load nss start
01-17 11:17:34.666 E/GeckoLinker(19856): /data/app/org.mozilla.fennec_jmaher-1.apk!/libnspr4.so: Warning: relocation to NULL @0x00020fa4
01-17 11:17:34.666 E/GeckoLinker(19856): /data/app/org.mozilla.fennec_jmaher-1.apk!/libnspr4.so: Warning: relocation to NULL @0x00020d54 for symbol "__cxa_begin_cleanup"
01-17 11:17:34.666 E/GeckoLinker(19856): /data/app/org.mozilla.fennec_jmaher-1.apk!/libnspr4.so: Warning: relocation to NULL @0x00020e70 for symbol "__cxa_type_match"
01-17 11:17:34.676 E/GeckoLibLoad(19856): Load nss done
01-17 11:17:35.376 E/GeckoLinker(19856): /data/app/org.mozilla.fennec_jmaher-1.apk!/libmozalloc.so: Warning: relocation to NULL @0x00003da8
01-17 11:17:35.376 E/GeckoLinker(19856): /data/app/org.mozilla.fennec_jmaher-1.apk!/libmozalloc.so: Warning: relocation to NULL @0x00003d3c for symbol "__cxa_begin_cleanup"
01-17 11:17:35.376 E/GeckoLinker(19856): /data/app/org.mozilla.fennec_jmaher-1.apk!/libmozalloc.so: Warning: relocation to NULL @0x00003d64 for symbol "__cxa_type_match"
01-17 11:17:35.409 E/GeckoLibLoad(19856): Loaded libs in 728ms total, 450ms user, 240ms system, 0 faults
01-17 11:17:35.409 W/GeckoThread(19856): zerdatime 9510084 - runGecko
01-17 11:17:35.409 I/GeckoThread(19856): RunGecko - args = -no-remote -profile /mnt/sdcard/tests/profile/
01-17 11:17:35.409 I/GeckoAppShell(19856): post native init
01-17 11:17:35.409 I/GeckoAppShell(19856): setLayerClient called
01-17 11:17:35.419 E/profiler(19856): Registering start signal
01-17 11:17:35.439 W/GeckoApp(19856): unexpected exception, passing url directly to Gecko but we should explicitly catch this
01-17 11:17:35.439 W/GeckoApp(19856): java.net.UnknownHostException: Host is unresolved: mochi.test:8888
01-17 11:17:35.439 W/GeckoApp(19856): 	at java.net.Socket.connect(Socket.java:1038)
01-17 11:17:35.439 W/GeckoApp(19856): 	at org.apache.harmony.luni.internal.net.www.protocol.http.HttpConnection.<init>(HttpConnection.java:62)
01-17 11:17:35.439 W/GeckoApp(19856): 	at org.apache.harmony.luni.internal.net.www.protocol.http.HttpConnectionPool.get(HttpConnectionPool.java:88)
01-17 11:17:35.439 W/GeckoApp(19856): 	at org.apache.harmony.luni.internal.net.www.protocol.http.HttpURLConnectionImpl.getHTTPConnection(HttpURLConnectionImpl.java:927)
01-17 11:17:35.439 W/GeckoApp(19856): 	at org.apache.harmony.luni.internal.net.www.protocol.http.HttpURLConnectionImpl.connect(HttpURLConnectionImpl.java:909)
01-17 11:17:35.439 W/GeckoApp(19856): 	at org.mozilla.gecko.GeckoApp$PrefetchRunnable.run(GeckoApp.java:1816)
01-17 11:17:35.439 W/GeckoApp(19856): 	at android.os.Handler.handleCallback(Handler.java:587)
01-17 11:17:35.439 W/GeckoApp(19856): 	at android.os.Handler.dispatchMessage(Handler.java:92)
01-17 11:17:35.439 W/GeckoApp(19856): 	at android.os.Looper.loop(Looper.java:123)
01-17 11:17:35.439 W/GeckoApp(19856): 	at org.mozilla.gecko.util.GeckoBackgroundThread.run(GeckoBackgroundThread.java:31)
01-17 11:17:35.439 W/GeckoApp(19856): zerdatime 9510113 - pre checkLaunchState
01-17 11:17:35.546 I/SUTAgentAndroid( 1503): 192.168.1.73 : ps
01-17 11:17:35.566 I/SUTAgentAndroid( 1503): 192.168.1.73 : ps
01-17 11:17:35.616 I/SUTAgentAndroid( 1503): 192.168.1.73 : isdir /mnt/sdcard/tests/logs
01-17 11:17:35.636 I/SUTAgentAndroid( 1503): 192.168.1.73 : cd /mnt/sdcard/tests/logs
01-17 11:17:35.666 I/SUTAgentAndroid( 1503): 192.168.1.73 : ls
01-17 11:17:36.096 E/GeckoConsole(19856): Could not read chrome manifest 'file:///data/data/org.mozilla.fennec_jmaher/chrome.manifest'.
01-17 11:17:40.886 I/SUTAgentAndroid( 1503): 192.168.1.73 : ps
01-17 11:17:40.916 I/SUTAgentAndroid( 1503): 192.168.1.73 : isdir /mnt/sdcard/tests/logs
01-17 11:17:40.946 I/SUTAgentAndroid( 1503): 192.168.1.73 : cd /mnt/sdcard/tests/logs
01-17 11:17:40.966 I/SUTAgentAndroid( 1503): 192.168.1.73 : ls
01-17 11:17:46.196 I/SUTAgentAndroid( 1503): 192.168.1.73 : ps
01-17 11:17:46.236 I/SUTAgentAndroid( 1503): 192.168.1.73 : isdir /mnt/sdcard/tests/logs
01-17 11:17:46.256 I/SUTAgentAndroid( 1503): 192.168.1.73 : cd /mnt/sdcard/tests/logs
01-17 11:17:46.276 I/SUTAgentAndroid( 1503): 192.168.1.73 : ls
01-17 11:17:51.546 I/SUTAgentAndroid( 1503): 192.168.1.73 : ps
01-17 11:17:51.609 I/SUTAgentAndroid( 1503): 192.168.1.73 : isdir /mnt/sdcard/tests/logs
01-17 11:17:51.639 I/SUTAgentAndroid( 1503): 192.168.1.73 : cd /mnt/sdcard/tests/logs
01-17 11:17:51.659 I/SUTAgentAndroid( 1503): 192.168.1.73 : ls
01-17 11:17:56.709 I/SUTAgentAndroid( 1503): 192.168.1.73 : ps
01-17 11:17:56.759 I/SUTAgentAndroid( 1503): 192.168.1.73 : isdir /mnt/sdcard/tests/logs
01-17 11:17:56.779 I/SUTAgentAndroid( 1503): 192.168.1.73 : cd /mnt/sdcard/tests/logs
01-17 11:17:56.806 I/SUTAgentAndroid( 1503): 192.168.1.73 : ls
01-17 11:18:02.469 I/SUTAgentAndroid( 1503): 192.168.1.73 : ps
01-17 11:18:02.499 I/SUTAgentAndroid( 1503): 192.168.1.73 : isdir /mnt/sdcard/tests/logs
01-17 11:18:02.529 I/SUTAgentAndroid( 1503): 192.168.1.73 : cd /mnt/sdcard/tests/logs
01-17 11:18:02.546 I/SUTAgentAndroid( 1503): 192.168.1.73 : ls
01-17 11:18:07.626 I/SUTAgentAndroid( 1503): 192.168.1.73 : ps
01-17 11:18:07.656 I/SUTAgentAndroid( 1503): 192.168.1.73 : isdir /mnt/sdcard/tests/logs
01-17 11:18:07.676 I/SUTAgentAndroid( 1503): 192.168.1.73 : cd /mnt/sdcard/tests/logs
01-17 11:18:07.696 I/SUTAgentAndroid( 1503): 192.168.1.73 : ls
01-17 11:18:12.886 I/SUTAgentAndroid( 1503): 192.168.1.73 : ps
01-17 11:18:12.936 I/SUTAgentAndroid( 1503): 192.168.1.73 : isdir /mnt/sdcard/tests/logs
01-17 11:18:12.966 I/SUTAgentAndroid( 1503): 192.168.1.73 : cd /mnt/sdcard/tests/logs
01-17 11:18:12.986 I/SUTAgentAndroid( 1503): 192.168.1.73 : ls
01-17 11:18:18.119 I/SUTAgentAndroid( 1503): 192.168.1.73 : ps
01-17 11:18:18.169 I/SUTAgentAndroid( 1503): 192.168.1.73 : isdir /mnt/sdcard/tests/logs
01-17 11:18:18.189 I/SUTAgentAndroid( 1503): 192.168.1.73 : cd /mnt/sdcard/tests/logs
01-17 11:18:18.216 I/SUTAgentAndroid( 1503): 192.168.1.73 : ls
01-17 11:18:18.857 I/Gecko   (19856): Detected osrelease `2.6.32.9'
01-17 11:18:18.857 I/Gecko   (19856): JITs are not broken
01-17 11:18:18.926 E/GeckoConsole(19856): [JavaScript Warning: "Unknown property '-moz-align-self'.  Declaration dropped." {file: "resource://gre-resources/ua.css" line: 44}]
01-17 11:18:19.769 E/GeckoConsole(19856): zerdatime 1394299773 - browser chrome startup finished.
01-17 11:18:19.789 I/GeckoApp(19856): Return 
01-17 11:18:20.017 E/GeckoConsole(19856): [JavaScript Error: "DOMApplicationRegistry: Could not read from /mnt/sdcard/tests/webapps/webapps.json : [Exception... "Component returned failure code: 0x80520012 (NS_ERROR_FILE_NOT_FOUND) [nsIChannel.asyncOpen]"  nsresult: "0x80520012 (NS_ERROR_FILE_NOT_FOUND)"  location: "JS frame :: resource://gre/modules/NetUtil.jsm :: NetUtil_asyncOpen :: line 165"  data: no]" {file: "resource://gre/modules/Webapps.jsm" line: 202}]



What I see as different in the passing case is we do the " relocation to NULL @0x00054d18" work before we get the exception about unknown host mochi.test:8888.  Also, I never see:
01-17 11:17:35.409 I/GeckoThread(19856): RunGecko - args = -no-remote -profile /mnt/sdcard/tests/profile/


could it be that we are skipping over something internally, or doing things in a different order sometimes?
Could be. Another interesting point here is that we could check to verify that the httpd.js webserver is actually running before we kick off the tests. That would be pretty simple to do and we if we for some reason can't get the webserver to start here we can flag this as purple and try to retrigger this job on a different tegra/foopy combination.

One thing that is nice is that thanks to Philor's starring, we have good data over time of this issue, and as we have made improvements to the android infrastructure we can see that it is occurring less frequently:

http://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=722166&entireHistory=true&tree=mozilla-inbound

Joel can you think of anything that would cause us to do things in a different order sometimes? I can't, but I should go look at the code. Adding a simple "is the webserver alive" check would be pretty easy to do and I wonder if that would catch any of these problems.
I don't think this is related to the webserver not being there because I see the same error pattern in the logcat for a successful run.  The one exception would be when we run for <TIMEOUT> seconds.  In the dozen or so logs I looked at we died in <60 seconds, usually <30 seconds.
This log has a crashdump backtrace in the logcat section:

https://tbpl.mozilla.org/php/getParsedLog.php?id=16140188&tree=Mozilla-Inbound
(In reply to William Lachance (:wlach) from comment #835)
> This log has a crashdump backtrace in the logcat section:

Actually, almost all of the recent logs show a SIGSEGV and a stack...but the stacks look unhelpful. libmozglue, libskia, and libc figure prominently, but not consistently.

I/DEBUG   (  937): *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
I/DEBUG   (  937): Build fingerprint: 'nvidia/harmony/harmony/harmony:2.2/FRF91/20110202.102810:eng/test-keys'
I/DEBUG   (  937): pid: 1735, tid: 1735  >>> org.mozilla.fennec <<<
I/DEBUG   (  937): signal 11 (SIGSEGV), fault addr 002511cc
I/DEBUG   (  937):  r0 00000fc0  r1 00000007  r2 00000000  r3 afd42328
I/DEBUG   (  937):  r4 afd43748  r5 002511c8  r6 00000007  r7 f0000000
I/DEBUG   (  937):  r8 bec912b0  r9 bec912b0  10 00205918  fp 00000000
I/DEBUG   (  937):  ip fffff040  sp bec911c8  lr afd0c69f  pc afd0bf5c  cpsr a0000030
I/DEBUG   (  937):  d0  3f80000000000000  d1  3f80000000000000
I/DEBUG   (  937):  d2  3ff0000000000000  d3  0000000000000000
I/DEBUG   (  937):  d4  0000000000000000  d5  8000000000000000
I/DEBUG   (  937):  d6  0000000000000000  d7  0000000000000000
I/DEBUG   (  937):  d8  0000000000000000  d9  0000000000000000
I/DEBUG   (  937):  d10 0000000000000000  d11 0000000000000000
I/DEBUG   (  937):  d12 0000000000000000  d13 0000000000000000
I/DEBUG   (  937):  d14 0000000000000000  d15 0000000000000000
I/DEBUG   (  937):  scr 20000012
I/DEBUG   (  937): 
I/DEBUG   (  937):          #00  pc 0000bf5c  /system/lib/libc.so
I/DEBUG   (  937):          #01  pc 0000c69a  /system/lib/libc.so
I/DEBUG   (  937):          #02  pc 0000cd8c  /system/lib/libc.so
I/DEBUG   (  937):          #03  pc 0002cb20  /system/lib/libskia.so
I/DEBUG   (  937):          #04  pc 0005f208  /system/lib/libskia.so
I/DEBUG   (  937):          #05  pc 00059ce0  /system/lib/libskia.so
I/DEBUG   (  937):          #06  pc 0006cec0  /system/lib/libskia.so
I/DEBUG   (  937): 
I/DEBUG   (  937): code around pc:
I/DEBUG   (  937): afd0bf3c 2d00512c 291fd038 2700d101 084fe002 
I/DEBUG   (  937): afd0bf4c 0719f1c7 fa102200 9204f707 9301460e 
I/DEBUG   (  937): afd0bf5c f0216869 1a1c0303 bf284564 d2054664 
I/DEBUG   (  937): afd0bf6c 4631b91c 46a49b01 462ae01e c014f8d5 
I/DEBUG   (  937): afd0bf7c 1d190ffb f8559b04 f1bc5021 bf180100 
I/DEBUG   (  937): 
I/DEBUG   (  937): code around lr:
I/DEBUG   (  937): afd0c67c f115e00f bf880f41 35fff04f 350bd80c 
I/DEBUG   (  937): afd0c68c f0254927 18600507 b12b6843 f7ff4628 
I/DEBUG   (  937): afd0c69c 4606fc1d d1322800 19a24e21 428d6891 
I/DEBUG   (  937): afd0c6ac 1b4bd819 2b0f6956 1970d908 f0436093 
I/DEBUG   (  937): afd0c6bc 61500c01 c004f8c0 e01750c3 f0412300 
I/DEBUG   (  937): 
I/DEBUG   (  937): stack:
I/DEBUG   (  937):     bec91188  00230dc4  
I/DEBUG   (  937):     bec9118c  ab14b298  /system/lib/libskia.so
I/DEBUG   (  937):     bec91190  afd4372c  /system/lib/libc.so
I/DEBUG   (  937):     bec91194  29522444  
I/DEBUG   (  937):     bec91198  00000001  
I/DEBUG   (  937):     bec9119c  29522444  
I/DEBUG   (  937):     bec911a0  00276140  
I/DEBUG   (  937):     bec911a4  bec912b0  [stack]
I/DEBUG   (  937):     bec911a8  00205918  
I/DEBUG   (  937):     bec911ac  afd0be63  /system/lib/libc.so
I/DEBUG   (  937):     bec911b0  00000000  
I/DEBUG   (  937):     bec911b4  00000000  
I/DEBUG   (  937):     bec911b8  00000018  
I/DEBUG   (  937):     bec911bc  00000000  
I/DEBUG   (  937):     bec911c0  ab20f440  /system/lib/libskia.so
I/DEBUG   (  937):     bec911c4  00276170  
I/DEBUG   (  937): #00 bec911c8  00276170  
I/DEBUG   (  937):     bec911cc  afd42328  /system/lib/libc.so
I/DEBUG   (  937):     bec911d0  00000001  
I/DEBUG   (  937):     bec911d4  afd0cdaf  /system/lib/libc.so
I/DEBUG   (  937):     bec911d8  00000000  
I/DEBUG   (  937):     bec911dc  00000fbc  
I/DEBUG   (  937):     bec911e0  00000003  
I/DEBUG   (  937):     bec911e4  afd42328  /system/lib/libc.so
I/DEBUG   (  937):     bec911e8  00000fc0  
I/DEBUG   (  937):     bec911ec  00000003  
I/DEBUG   (  937):     bec911f0  00205918  
I/DEBUG   (  937):     bec911f4  afd0c69f  /system/lib/libc.so
I/DEBUG   (  937): #01 bec911f8  00000001  
I/DEBUG   (  937):     bec911fc  ab14e5cc  /system/lib/libskia.so
I/DEBUG   (  937):     bec91200  00230da0  
I/DEBUG   (  937):     bec91204  ab20fb64  /system/lib/libskia.so
I/DEBUG   (  937):     bec91208  00276140  
I/DEBUG   (  937):     bec9120c  00205918  
I/DEBUG   (  937):     bec91210  00205950  
I/DEBUG   (  937):     bec91214  00000fbc  
I/DEBUG   (  937):     bec91218  00000002  
I/DEBUG   (  937):     bec9121c  00276140  
I/DEBUG   (  937):     bec91220  00205918  
I/DEBUG   (  937):     bec91224  afd0cd8f  /system/lib/libc.so
we really need to make use of these backtraces.  We have 800+ of these which are documented, or 4/day.  Maybe we can get creative in how we debug this.  I know in the past jchen has done some magic os level hunting and found workarounds.
I tried to reproduce this failure on a panda board (in hopes of getting better traces) but could not.

However, I notice that very similar crash-on-test-startup failures are being logged against a variety of other bugs...and some of them have more information!

For example:

https://tbpl.mozilla.org/php/getParsedLog.php?id=16454462&tree=Try&full=1
Slightly different dump from comment 851:

I/DEBUG   (  937): *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
I/DEBUG   (  937): Build fingerprint: 'nvidia/harmony/harmony/harmony:2.2/FRF91/20110202.102810:eng/test-keys'
I/DEBUG   (  937): pid: 1540, tid: 1549  >>> org.mozilla.fennec <<<
I/DEBUG   (  937): signal 11 (SIGSEGV), fault addr 002c0914
I/DEBUG   (  937):  r0 002c08f8  r1 4e6d6f00  r2 00000028  r3 002c08f8
I/DEBUG   (  937):  r4 002c08f8  r5 00000000  r6 002c08f8  r7 00000000
I/DEBUG   (  937):  r8 4e6d6b80  r9 43094b18  10 43094b00  fp 00223970
I/DEBUG   (  937):  ip ffffffff  sp 4e6d6ad8  lr a8b1c219  pc a8b1c1c0  cpsr 60000030
I/DEBUG   (  937):  d0  0000007400000064  d1  4042000042e80000
I/DEBUG   (  937):  d2  40187e3d66b2ee64  d3  4059000000000000
I/DEBUG   (  937):  d4  412e848000000000  d5  3fe0000000000000
I/DEBUG   (  937):  d6  412e848000000000  d7  411e848c00000000
I/DEBUG   (  937):  d8  0000000000000000  d9  0000000000000000
I/DEBUG   (  937):  d10 0000000000000000  d11 0000000000000000
I/DEBUG   (  937):  d12 0000000000000000  d13 0000000000000000
I/DEBUG   (  937):  d14 0000000000000000  d15 0000000000000000
I/DEBUG   (  937):  scr 60000012
I/DEBUG   (  937): 
I/DEBUG   (  937):          #00  pc 0001c1c0  /system/lib/libsqlite.so
I/DEBUG   (  937):          #01  pc 0001c214  /system/lib/libsqlite.so
I/DEBUG   (  937):          #02  pc 00021a4a  /system/lib/libsqlite.so
I/DEBUG   (  937):          #03  pc 00021b12  /system/lib/libsqlite.so
I/DEBUG   (  937):          #04  pc 0003bc0c  /system/lib/libandroid_runtime.so
I/DEBUG   (  937):          #05  pc 00016e34  /system/lib/libdvm.so
I/DEBUG   (  937): 
I/DEBUG   (  937): code around pc:
I/DEBUG   (  937): a8b1c1a0 fc78f7e9 21058ba0 4380f420 f02377a1 
I/DEBUG   (  937): a8b1c1b0 f04c0cff 83a20201 bf00bd10 4604b570 
I/DEBUG   (  937): a8b1c1c0 f2428b82 ea024060 b3030300 5500f412 
I/DEBUG   (  937): a8b1c1d0 4620d007 f7ee6821 4620feff f818f000 
I/DEBUG   (  937): a8b1c1e0 f412e015 d0056f80 b11b6a23 47986960 
I/DEBUG   (  937): 
I/DEBUG   (  937): code around lr:
I/DEBUG   (  937): a8b1c1f8 6820d003 fc4af7e9 f012e005 d0020f40 
I/DEBUG   (  937): a8b1c208 f7ff4620 bd70ffbb 4604b510 ffd2f7ff 
I/DEBUG   (  937): a8b1c218 6a616920 f930f7e8 61632300 62236263 
I/DEBUG   (  937): a8b1c228 bf00bd10 1e03b5f7 2301bf18 29004604 
I/DEBUG   (  937): a8b1c238 2300bf0c 0301f003 6906b303 7fb02728 
I/DEBUG   (  937): 
I/DEBUG   (  937): stack:
I/DEBUG   (  937):     4e6d6a98  4859df30  /dev/ashmem/mspace/dalvik-heap/2 (deleted)
I/DEBUG   (  937):     4e6d6a9c  4859df30  /dev/ashmem/mspace/dalvik-heap/2 (deleted)
I/DEBUG   (  937):     4e6d6aa0  808a23f4  /system/lib/libdvm.so
I/DEBUG   (  937):     4e6d6aa4  4449f05b  /data/dalvik-cache/system@framework@framework.jar@classes.dex
I/DEBUG   (  937):     4e6d6aa8  80884688  /system/lib/libdvm.so
I/DEBUG   (  937):     4e6d6aac  80888a3e  /system/lib/libdvm.so
I/DEBUG   (  937):     4e6d6ab0  0011f450  
I/DEBUG   (  937):     4e6d6ab4  afd43928  /system/lib/libc.so
I/DEBUG   (  937):     4e6d6ab8  060d4000  
I/DEBUG   (  937):     4e6d6abc  afd106d8  /system/lib/libc.so
I/DEBUG   (  937):     4e6d6ac0  0011f450  
I/DEBUG   (  937):     4e6d6ac4  00000001  
I/DEBUG   (  937):     4e6d6ac8  a8b50948  /system/lib/libsqlite.so
I/DEBUG   (  937):     4e6d6acc  00000000  
I/DEBUG   (  937):     4e6d6ad0  df0027ad  
I/DEBUG   (  937):     4e6d6ad4  00000000  
I/DEBUG   (  937): #00 4e6d6ad8  002c08f8  
I/DEBUG   (  937):     4e6d6adc  00000000  
I/DEBUG   (  937):     4e6d6ae0  002c08f8  
I/DEBUG   (  937):     4e6d6ae4  a8b1c219  /system/lib/libsqlite.so
I/DEBUG   (  937): #01 4e6d6ae8  0027c358  
I/DEBUG   (  937):     4e6d6aec  a8b21a4f  /system/lib/libsqlite.so
See Also: → 805353
Depends on: 810471
blocking-fennec1.0: - → ---
I believe bug 813647 to be the same as this, just with the bug 808410 change from  the string "Warning: No crash directory..." to "Automation Error: No crash directory..." so this failure mode is now starrable. In addition, this failure is now fatal on robocop, whereas previously we would have missed it.
Summary: Android tests intermittently just don't start → Android tests intermittently just don't start ("Automation Error: No crash directory (/mnt/sdcard/tests/profile/minidumps/) found on remote device")
Summary: Android tests intermittently just don't start ("Automation Error: No crash directory (/mnt/sdcard/tests/profile/minidumps/) found on remote device") → Android tests intermittently don't start ("Automation Error: No crash directory (/mnt/sdcard/tests/profile/minidumps/) found on remote device" or "Automation Error: No crash directory (/mnt/sdcard/tests/reftest/profile/minidumps/) found on remote device")
Whiteboard: [orange][android_tier_1_mozharness] → [android_tier_1_mozharness]
I checked the 5 most recent logs; all of the logcats show the same pattern as bug 810471: tests crash on startup.
Comment 1315 is a mis-star, that's real bustage (even without looking at the log, it would be extremely unlikely for this intermittent to occur on every testsuite at once on a push). However the confusion is kind of understandable, given the misleading string used for this failure mode (filed bug 823452 for this), leading people to think it's an automation-only failure mode and not suspect in-tree checkins (or in the comment 1315 case, a likely-needed-clobber after merge due to our broken fennec build system).
Summary: Android tests intermittently don't start ("Automation Error: No crash directory (/mnt/sdcard/tests/profile/minidumps/) found on remote device" or "Automation Error: No crash directory (/mnt/sdcard/tests/reftest/profile/minidumps/) found on remote → Fennec intermittently crashes on startup ("Automation Error: No crash directory (/mnt/sdcard/tests/profile/minidumps/) found on remote device" or "Automation Error: No crash directory (/mnt/sdcard/tests/reftest/profile/minidumps/) found on remote
(In reply to Ed Morley (Away 20th Dec-2nd Jan) [UTC+0; email:edmorley@moco] from comment #1316)
> Comment 1315 is a mis-star
[...]
> a likely-needed-clobber after merge due to our broken fennec build system).

For the record -- we ended up tracking that fully-orange cycle in bug 824276, and it did indeed end up just needing a clobber.
Depends on: 824760
Depends on: 823452
The last few Panda logs - comments 1439, 1438, 1436 - look like the Panda is rebooting when the test starts. In these cases, there is no evidence that Fennec is crashing, or has even been started. (Same Automation Error, but different cause...bug 811444?)
Comment 1441 is a tegra reboot-during-test; see: "INFO: sending rebt command"
Comment 1442 is a panda reboot, like 1439, 1438, 1436.
Comments 1437, 1443 and 1444 are Java exceptions: now reported in bug 830557

I don't see any logs showing the classic case of an unexplained crash on startup for several days now...are those cases being reported somewhere else now, or... (fingers crossed).
Depends on: 830557
Thank you for looking through these :-D

(In reply to Geoff Brown [:gbrown] from comment #1445)
> Comment 1441 is a tegra reboot-during-test; see: "INFO: sending rebt command"

We need to change this to something that TBPL's parser will be able to catch.

> Comment 1442 is a panda reboot, like 1439, 1438, 1436.

For these we currently get:

{
INFO | automation.py | Application pid: 2123
Could not connect; sleeping for 5 seconds.
reconnecting socket
INFO | automation.py | Application ran for: 0:04:44.582392
INFO | automation.py | Reading PID log: /tmp/tmpTBmdRApidlog
Automation Error: No crash directory (/mnt/sdcard/tests/reftest/profile/minidumps/) found on remote device
}

I'd really like us to handle these better. If we have a disconnect, we shouldn't continue and then be surprised when there is no minidumps directory.

> Comments 1437, 1443 and 1444 are Java exceptions: now reported in bug 830557

So bug 823452 will help us star these ones correctly, right?
(In reply to Ed Morley [:edmorley UTC+0] from comment #1446)
> > Comments 1437, 1443 and 1444 are Java exceptions: now reported in bug 830557
> 
> So bug 823452 will help us star these ones correctly, right?

Correct (they contain ">>> REPORTING UNCAUGHT EXCEPTION FROM THREAD").
Comment 1448, 1450, 1453, 1454, 1459, 1463: bug 830557.
Bugs 1469 - 1473 are all APK installation failures: bug 838647.
Depends on: 838647
s/Bugs/Comments/ !
10 failures in a row on tegra-035 -- something is wrong with that tegra!
Depends on: tegra-035
Almost all of the recent failures here have been on tegra-256.
Depends on: tegra-256
(In reply to TinderboxPushlog Robot from comment #1508)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=20068768&tree=Mozilla-Inbound

There was an additional exception in this log, filed bug 844902.
Comments 1512 and 1513 are both installation errors, not detected until test run time. I don't understand the cause of the installation failure. I recall seeing similar errors in logs elsewhere...will try to collect more examples in bug 847835.

03-04 21:07:03.001 W/System.err( 1511): android.content.pm.PackageManager$NameNotFoundException: org.mozilla.fennec
03-04 21:07:03.011 W/System.err( 1511): 	at android.app.ContextImpl$ApplicationPackageManager.getPackageInfo(ContextImpl.java:1651)
03-04 21:07:03.011 W/System.err( 1511): 	at com.mozilla.SUTAgentAndroid.service.DoCommand.StartJavaPrg(DoCommand.java:3393)
03-04 21:07:03.011 W/System.err( 1511): 	at com.mozilla.SUTAgentAndroid.service.DoCommand.StartPrg2(DoCommand.java:3666)
03-04 21:07:03.011 W/System.err( 1511): 	at com.mozilla.SUTAgentAndroid.service.DoCommand.processCommand(DoCommand.java:689)
03-04 21:07:03.011 W/System.err( 1511): 	at com.mozilla.SUTAgentAndroid.service.CmdWorkerThread.run(CmdWorkerThread.java:141)
03-04 21:07:03.011 W/System.err( 1511): android.content.ActivityNotFoundException: No Activity found to handle Intent { act=android.intent.action.VIEW dat=http://mochi.test:8888/tests/?autorun=1&closeWhenDone=1&logFile=%2Fmnt%2Fsdcard%2Ftests%2Flogs%2Fmochitest.log&fileLevel=INFO&consoleLevel=INFO&totalChunks=8&thisChunk=3&hideResultsTable=1&testManifest=android.json&runOnly=true flg=0x10000000 pkg=org.mozilla.fennec (has extras) }
03-04 21:07:03.011 W/System.err( 1511): 	at android.app.Instrumentation.checkStartActivityResult(Instrumentation.java:1408)
03-04 21:07:03.011 W/System.err( 1511): 	at android.app.Instrumentation.execStartActivity(Instrumentation.java:1378)
03-04 21:07:03.011 W/System.err( 1511): 	at android.app.ContextImpl.startActivity(ContextImpl.java:622)
03-04 21:07:03.011 W/System.err( 1511): 	at android.content.ContextWrapper.startActivity(ContextWrapper.java:258)
03-04 21:07:03.011 W/System.err( 1511): 	at com.mozilla.SUTAgentAndroid.service.DoCommand.StartJavaPrg(DoCommand.java:3464)
03-04 21:07:03.011 W/System.err( 1511): 	at com.mozilla.SUTAgentAndroid.service.DoCommand.StartPrg2(DoCommand.java:3666)
03-04 21:07:03.011 W/System.err( 1511): 	at com.mozilla.SUTAgentAndroid.service.DoCommand.processCommand(DoCommand.java:689)
03-04 21:07:03.011 W/System.err( 1511): 	at com.mozilla.SUTAgentAndroid.service.CmdWorkerThread.run(CmdWorkerThread.java:141)
Depends on: 847835
Most of the recent failures here are on tegra-038 and tegra-315, already under investigation in bug 832052.

These show:

05-10 09:12:07.643 E/AndroidRuntime( 1790): FATAL EXCEPTION: main
05-10 09:12:07.643 E/AndroidRuntime( 1790): java.lang.RuntimeException: Unable to instantiate instrumentation ComponentInfo{org.mozilla.roboexample.test/org.mozilla.fennec.FennecInstrumentationTestRunner}: java.lang.ClassNotFoundException: org.mozilla.fennec.FennecInstrumentationTestRunner in loader dalvik.system.PathClassLoader[/system/framework/android.test.runner.jar:/data/app/org.mozilla.roboexample.test-1.apk:/data/app/org.mozilla.fennec-1.apk]
05-10 09:12:07.643 E/AndroidRuntime( 1790): 	at android.app.ActivityThread.handleBindApplication(ActivityThread.java:4202)
05-10 09:12:07.643 E/AndroidRuntime( 1790): 	at android.app.ActivityThread.access$3000(ActivityThread.java:125)
05-10 09:12:07.643 E/AndroidRuntime( 1790): 	at android.app.ActivityThread$H.handleMessage(ActivityThread.java:2071)
05-10 09:12:07.643 E/AndroidRuntime( 1790): 	at android.os.Handler.dispatchMessage(Handler.java:99)
05-10 09:12:07.643 E/AndroidRuntime( 1790): 	at android.os.Looper.loop(Looper.java:123)
05-10 09:12:07.643 E/AndroidRuntime( 1790): 	at android.app.ActivityThread.main(ActivityThread.java:4627)
05-10 09:12:07.643 E/AndroidRuntime( 1790): 	at java.lang.reflect.Method.invokeNative(Native Method)
05-10 09:12:07.643 E/AndroidRuntime( 1790): 	at java.lang.reflect.Method.invoke(Method.java:521)
05-10 09:12:07.643 E/AndroidRuntime( 1790): 	at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:868)
05-10 09:12:07.643 E/AndroidRuntime( 1790): 	at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:626)
05-10 09:12:07.643 E/AndroidRuntime( 1790): 	at dalvik.system.NativeStart.main(Native Method)
05-10 09:12:07.643 E/AndroidRuntime( 1790): Caused by: java.lang.ClassNotFoundException: org.mozilla.fennec.FennecInstrumentationTestRunner in loader dalvik.system.PathClassLoader[/system/framework/android.test.runner.jar:/data/app/org.mozilla.roboexample.test-1.apk:/data/app/org.mozilla.fennec-1.apk]
05-10 09:12:07.643 E/AndroidRuntime( 1790): 	at dalvik.system.PathClassLoader.findClass(PathClassLoader.java:243)
05-10 09:12:07.643 E/AndroidRuntime( 1790): 	at java.lang.ClassLoader.loadClass(ClassLoader.java:573)
05-10 09:12:07.643 E/AndroidRuntime( 1790): 	at java.lang.ClassLoader.loadClass(ClassLoader.java:532)
05-10 09:12:07.643 E/AndroidRuntime( 1790): 	at android.app.ActivityThread.handleBindApplication(ActivityThread.java:4199)
05-10 09:12:07.643 E/AndroidRuntime( 1790): 	... 10 more
05-10 09:12:07.643 W/ActivityManager( 1020): Error in app org.mozilla.fennec running instrumentation ComponentInfo{org.mozilla.roboexample.test/org.mozilla.fennec.FennecInstrumentationTestRunner}:
05-10 09:12:07.643 W/ActivityManager( 1020):   java.lang.ClassNotFoundException
05-10 09:12:07.643 W/ActivityManager( 1020):   java.lang.ClassNotFoundException: org.mozilla.fennec.FennecInstrumentationTestRunner in loader dalvik.system.PathClassLoader[/system/framework/android.test.runner.jar:/data/app/org.mozilla.roboexample.test-1.apk:/data/app/org.mozilla.fennec-1.apk]
Depends on: 832052
Depends on: 872116
Depends on: tegra-262
Depends on: 876110
Many of the recent failures are reported at the end of rc2's testJarReader, which was enabled recently.
Depends on: 889939
The current message is a bit unclear as there is nothing inherently wrong with there being no crash directory -- it's just that gecko is supposed to create one automatically at startup and it not being there is a sign that something may have gone wrong. Here's a simple patch which adds a few details to the error message.

A slight change like this won't mess up orange starring, will it?
Assignee: nobody → wlachance
Attachment #771074 - Flags: review?(gbrown)
Attachment #771074 - Flags: feedback?(emorley)
Attachment #771074 - Flags: review?(gbrown) → review+
Comment on attachment 771074 [details] [diff] [review]
Make error message clearer

TBPL will fall back to searching for the full failure line, which will continue to work as long as the bug summary is updated with the new string. The longer failure line might mean we need to split this bug into two (there are a couple of variations in the summary currently) due to the character limit. However, this bug has been a general dumping ground in the past, with many of the issues now being fixed - so might be good to start afresh anyway.
Attachment #771074 - Flags: feedback?(emorley) → feedback+
The most recent failures have SIGSEGV reported by libc very soon after launch:

07:49:11     INFO -  09-04 07:49:08.593 W/GeckoGLController( 2259): GLController::surfaceChanged, creating compositor; mCompositorCreated=false, mSurfaceValid=true
07:49:11     INFO -  09-04 07:49:08.601 D/GeckoLoader( 2259): Gecko environment env0: MOZ_CRASHREPORTER=1
07:49:11     INFO -  09-04 07:49:08.601 D/GeckoLoader( 2259): env1: MOZ_CRASHREPORTER_NO_REPORT=1
07:49:11     INFO -  09-04 07:49:08.601 D/GeckoLoader( 2259): env2: XPCOM_DEBUG_BREAK=stack
07:49:11     INFO -  09-04 07:49:08.601 D/GeckoLoader( 2259): env3: MOZ_HIDE_RESULTS_TABLE=1
07:49:11     INFO -  09-04 07:49:08.601 D/GeckoLoader( 2259): env4: XPCOM_MEM_BLOAT_LOG=/tmp/tmpMSEsAd/runtests_leaks.log
07:49:11     INFO -  09-04 07:49:08.601 D/GeckoLoader( 2259): env5: null
07:49:11     INFO -  09-04 07:49:08.601 F/libc    ( 2259): Fatal signal 11 (SIGSEGV) at 0x00000070 (code=1)
07:49:11     INFO -  09-04 07:49:08.601 E/GeckoLibLoad( 2259): Load sqlite start
Do we get a stack in logcat out of that?
Time to re-image tegra-228!
No longer depends on: tegra-262
Some of the recent crashes here are variations on the loadNSSLibs exception being investigated in bug 832052.
:glandium -- This bug has more intermittent crashes in/around GeckoLibLoad/GeckoLinker, although almost every one crashes in a slightly different way. Hoping you can make some sense of this...?

One example, from https://tbpl.mozilla.org/php/getParsedLog.php?id=32723157&full=1&branch=fx-team#error0:

1-08 14:40:28.591 E/GeckoLibLoad( 1835): Load nss start
01-08 14:40:28.591 E/GeckoLinker( 1835): /data/app/org.mozilla.fennec-1.apk!/assets/libnss3.so: Warning: unhandled flags #8 not handled
01-08 14:40:28.611 E/GeckoLibLoad( 1835): Load sqlite start
01-08 14:40:28.611 E/GeckoLinker( 1835): mprotect failed
01-08 14:40:28.611 E/GeckoLibLoad( 1835): Load nss done
01-08 14:40:28.611 E/GeckoLinker( 1835): /data/app/org.mozilla.fennec-1.apk!/assets/libxul.so: Warning: unhandled flags #8 not handled
01-08 14:40:28.611 E/GeckoLinker( 1835): /data/app/org.mozilla.fennec-1.apk!/assets/libmozalloc.so: Warning: unhandled flags #8 not handled
01-08 14:40:28.621 E/GeckoLinker( 1835): /data/app/org.mozilla.fennec-1.apk!/assets/libxul.so: Warning: relocation to NULL @0x01e5d144
01-08 14:40:28.631 E/GeckoLinker( 1835): /data/app/org.mozilla.fennec-1.apk!/assets/libxul.so: Warning: relocation to NULL @0x01e5e260 for symbol "__cxa_begin_cleanup"
01-08 14:40:28.631 E/GeckoLinker( 1835): /data/app/org.mozilla.fennec-1.apk!/assets/libxul.so: Warning: relocation to NULL @0x01e5e264 for symbol "__cxa_type_match"
01-08 14:40:28.713 E/GeckoLibLoad( 1835): Load sqlite done
01-08 14:40:28.713 I/DEBUG   (  937): debuggerd committing suicide to free the zombie!
01-08 14:40:28.713 I/DEBUG   ( 1860): debuggerd: Feb  2 2011 10:46:35
01-08 14:40:28.723 D/Zygote  (  939): Process 1835 terminated by signal (11)
Flags: needinfo?(mh+mozilla)
Depends on: 959254
See bug 959254.
Flags: needinfo?(mh+mozilla)
Is there a bug on file to make the tbpl reports less useless and not bring sheriffs here? This bug has been around for too long and used for at least five completely different issues.
(In reply to Mike Hommey [:glandium] from comment #1821)
> Is there a bug on file to make the tbpl reports less useless and not bring
> sheriffs here? This bug has been around for too long and used for at least
> five completely different issues.

Yeah it would be good to have more specific failure strings in the log, which would either use a prefix TBPL's parsers understands, or else we can add to the regexp used. Other than the patch in bug 959254, do you have any other suggestions of where we can make the log output more useful? :-)
Is it possible to get a minidump for crashes like this? We seem to often not get one for startup crashes, which seems reasonable, but I'm actually not sure of the failure mechanism in play.

On another note, would anyone mind if we routinely cloned bugs like this (long-lived bugs for generic failures with 1000+ TBPL Robot comments), and closed the original?
(In reply to Geoff Brown [:gbrown] from comment #1825)
> On another note, would anyone mind if we routinely cloned bugs like this
> (long-lived bugs for generic failures with 1000+ TBPL Robot comments), and
> closed the original?

Yeah that would be useful in this case; we can remove the "intermittent-failure" keyword from the old/closed bug to ensure new unrelated instances aren't lumped in here.
(In reply to Geoff Brown [:gbrown] from comment #1825)
> Is it possible to get a minidump for crashes like this? We seem to often not
> get one for startup crashes, which seems reasonable, but I'm actually not
> sure of the failure mechanism in play.

We could setup the test environments so that core dumps are produced in case of crashes, and get those dumps somehow. They wouldn't be minidumps, though, because breakpad is not up at the time of (some of) those crashes.
Depends on: 967032
(In reply to TBPL Robot from comment #1850)
> Tomcat
> https://tbpl.mozilla.org/php/getParsedLog.php?id=33870944&tree=Mozilla-
> Inbound
> Android 2.2 Tegra mozilla-inbound opt test robocop-2 on 2014-01-31 03:12:15
> revision: e84fc82b4d22
> slave: tegra-281
> 
> Automation Error: No crash directory (/mnt/sdcard/tests/profile/minidumps/)
> found on remote device
> Mochi-Remote ERROR | Automation Error: Missing end of test marker (process
> crashed?)
> Mochi-Remote ERROR   | Automation Error: Missing end of test marker (process
> crashed?)

This one, and several like it happen during robocop tests after 5 minutes:

63 INFO TEST-PASS | testSystemPages | The correct number of tabs are opened - 4 should equal 4
waitForText timeout on ^Settings$
INFO | automation.py | Application ran for: 0:05:01.326236
INFO | zombiecheck | Reading PID log: /tmp/tmpzGgUJ2pidlog
/data/anr/traces.txt not found
WARNING | leakcheck | refcount logging is off, so leaks can't be detected!
runtests.py | Running tests: end.
Mochi-Remote WARNING | cleaning up pidfile '/builds/tegra-281/test/../runtestsremote.pid' was unsuccessful from the test harness
Mochi-Remote ERROR   | Automation Error: Missing end of test marker (process crashed?)

That is bug 963838.
Depends on: 963838
For the last week (at least) failures reported here are dominated by bug 959254.
(In reply to TBPL Robot from comment #1965)
> KWierso
> https://tbpl.mozilla.org/php/getParsedLog.php?id=37093703&tree=Mozilla-
> Central

Here, and in Comment 1964, the problem is:

14:37:58     INFO - Installing Fennec for test-1
ERROR:  Failure [INSTALL_FAILED_INSUFFICIENT_STORAGE]

...another case of bug 988657 (bug 988262).
Depends on: 991020
Geoff, apart from bug 991020 and bug 959254, do you know of any other cases where we can add logging or improve the strings used so TBPL can pick them up (or even add to TBPL's regex) - so we can depreciate this bug? Cheers :-)
Flags: needinfo?(gbrown)
(And thank you for always digging into failures that get posted here... <3)
(In reply to Ed Morley [:edmorley UTC+0] from comment #1968)
> Geoff, apart from bug 991020 and bug 959254, do you know of any other cases
> where we can add logging or improve the strings used so TBPL can pick them
> up (or even add to TBPL's regex) - so we can depreciate this bug? Cheers :-)

No. I think nearly all of the recent failures here are bug 991020 or bug 959254.
Flags: needinfo?(gbrown)
Most of these still look like bug 959254:

05-28 11:53:31.875 E/GeckoJNI( 1826): missing NSS_Initialize
05-28 11:53:31.875 E/GeckoLibLoad( 1826): Throw
05-28 11:53:31.875 E/GeckoLibLoad( 1826): Load sqlite done
05-28 11:53:31.875 I/TestRunner( 1826): failed: testPasswordProvider(org.mozilla.gecko.tests.testPasswordProvider)
05-28 11:53:31.875 I/TestRunner( 1826): ----- begin exception -----
05-28 11:53:31.875 I/TestRunner( 1826): 
05-28 11:53:31.875 I/TestRunner( 1826): java.lang.Exception: Error loading sqlite libraries
05-28 11:53:31.875 I/TestRunner( 1826): 	at org.mozilla.gecko.mozglue.GeckoLoader.loadSQLiteLibsNative(Native Method)
05-28 11:53:31.875 I/TestRunner( 1826): 	at org.mozilla.gecko.mozglue.GeckoLoader.loadSQLiteLibs(GeckoLoader.java:241)
05-28 11:53:31.875 I/TestRunner( 1826): 	at org.mozilla.gecko.FennecNativeActions.<init>(FennecNativeActions.java:40)
05-28 11:53:31.875 I/TestRunner( 1826): 	at org.mozilla.gecko.tests.BaseTest.setUp(BaseTest.java:116)


05-28 03:45:43.313 E/GeckoLibLoad( 1851): Load sqlite start
05-28 03:45:43.323 E/GeckoLinker( 1851): mprotect @0x5319e000, 0xffe99000, 0x3 failed with errno 12
05-28 03:45:43.323 E/GeckoLinker( 1851): mprotect failed
05-28 03:45:43.323 E/GeckoLibLoad( 1851): Load nss done
Depends on: 1017578
Not working on this.

After some thought I'm not sure if my patch in comment 1684 genuinely makes things better or not. Ed suggested it was ok in comment 1697, but I myself am not sure if it won't just cause more confusion. The underlying issue is really that Android isn't stable enough to run things to completion in many cases and I don't know if a slightly more accurate error message really addresses that in a meaningful way. Given that I'm not really actively working on Android automation these days, I'm inclined to just leave things be.
Assignee: wlachance → nobody
Status: NEW → RESOLVED
blocking-b2g: 2.5? → ---
Closed: 8 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: