Closed
Bug 960265
Opened 10 years ago
Closed 10 years ago
Android test hangs should be handled by the test harness, outputting last test seen & attempting to get a stack
Categories
(Testing :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: gbrown, Assigned: gbrown)
References
(Blocks 1 open bug)
Details
(Keywords: sheriffing-P1, Whiteboard: [leave open])
Attachments
(2 files)
3.81 KB,
patch
|
jmaher
:
review+
|
Details | Diff | Splinter Review |
2.00 KB,
patch
|
jmaher
:
review-
|
Details | Diff | Splinter Review |
When an Android test is killed for no output in N seconds (bug 663657), the resulting log is pretty useless. We may be able to retrieve logcats, minidumps, etc., but no attempt is made currently. If the browser is hung, perhaps we can kill it and retrieve the stack trace. I am thinking that the test harness (runtestsremote.py or remoteautomation.py) could have a shorter timeout than the existing one (2300 seconds without output) and could try to report diagnostics before taking everything down.
Comment 1•10 years ago
|
||
Big +1 on this :-) Even if it's only a TEST-UNEXPECTED-FAIL | /foo/bar/baz.html | application timed out after 330 seconds with no output using the last test name seen (like we do elsewhere), that would be a great help for sheriffs (and prevent the dumping ground that is bug 663657.
Comment 2•10 years ago
|
||
Jim, any thoughts?
Comment 3•10 years ago
|
||
We really should just do the same thing we do on desktop tests: http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#1050 Spit a timeout, try to kill the browser if it's still running in a way that gets a stack, check for dumps, etc.
Updated•10 years ago
|
Summary: Better logs for hung Android tests → Android test hangs should be handled by the test harness, outputting last test seen & attempting to get a stack
Updated•10 years ago
|
Keywords: sheriffing-P1
Comment 4•10 years ago
|
||
I think it'd be good to look for minidumps and ANR reports (both from the app data directory). Do we know roughly the causes for no-output? I can think of perhaps network issues, OOM kills, and actual hangs.
Assignee | ||
Comment 5•10 years ago
|
||
(In reply to Jim Chen [:jchen :nchen] from comment #4) > I think it'd be good to look for minidumps and ANR reports (both from the > app data directory). With any luck we can do all the "normal" checks easily: minidumps, ANR reports, Java exceptions + dump the logcat. > Do we know roughly the causes for no-output? I can > think of perhaps network issues, OOM kills, and actual hangs. I think those are the main possibilities. I have come to the conclusion that most are actual hangs. If there are network issues, we should see a socket error, or at least a device manager failure (the harness pings the device at least once a minute to check if the browser is running). If OOM kills the browser but sutagent is still alive, the harness should see the browser is gone and end the test. If OOM kills the sutagent, device manager should report an error ("timeout in command xxx").
Assignee | ||
Comment 6•10 years ago
|
||
Try run demonstrating the issue by intentionally hanging a few tests: https://tbpl.mozilla.org/?tree=Try&rev=8e7652982b3f. Desktop (Linux x86 Opt) has a useful failure message and fails in under 6 minutes: 21:22:21 INFO - 78 INFO TEST-END | /tests/Harness_sanity/test_SpecialPowersPushPermissions.html | finished in 184ms 21:22:21 INFO - 79 INFO TEST-START | /tests/Harness_sanity/test_SpecialPowersPushPrefEnv.html 21:27:51 WARNING - TEST-UNEXPECTED-FAIL | /tests/Harness_sanity/test_SpecialPowersPushPrefEnv.html | application timed out after 330 seconds with no output 21:27:51 INFO - Xlib: extension "RANDR" missing on display ":0". 21:27:52 WARNING - TEST-UNEXPECTED-FAIL | /tests/Harness_sanity/test_SpecialPowersPushPrefEnv.html | application terminated with exit code 6 21:27:52 INFO - INFO | runtests.py | Application ran for: 0:05:43.541083 21:27:52 INFO - INFO | zombiecheck | Reading PID log: /tmp/tmpDBhZocpidlog 21:27:52 INFO - mozcrash INFO | Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/gbrown@mozilla.com-8e7652982b3f/try-linux64/firefox-29.0a1.en-US.linux-x86_64.crashreporter-symbols.zip 21:28:18 WARNING - PROCESS-CRASH | /tests/Harness_sanity/test_SpecialPowersPushPrefEnv.html | application crashed [@ libc-2.15.so + 0xe8403] 21:28:19 INFO - Crash dump filename: /tmp/tmpwROy0P/minidumps/7f49d10a-4f31-d795-637327b7-136b3cc5.dmp 21:28:19 INFO - Operating system: Linux 21:28:19 INFO - 0.0.0 Linux 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 21:28:19 INFO - CPU: amd64 21:28:19 INFO - family 6 model 45 stepping 7 21:28:19 INFO - 1 CPU 21:28:19 INFO - Crash reason: SIGABRT 21:28:19 INFO - Crash address: 0x3e800000932 21:28:19 INFO - Thread 0 (crashed) 21:28:19 INFO - 0 libc-2.15.so + 0xe8403 ... Android is terse and anonymous and takes 40 minutes to fail: 21:11:27 INFO - 78 INFO TEST-END | /tests/Harness_sanity/test_SpecialPowersPushPermissions.html | finished in 413ms 21:11:27 INFO - 79 INFO TEST-START | /tests/Harness_sanity/test_SpecialPowersPushPrefEnv.html command timed out: 2400 seconds without output, attempting to kill process killed by signal 9 program finished with exit code -1 elapsedTime=3296.769143 ========= Finished '/tools/buildbot/bin/python scripts/scripts/android_panda.py ...' failed (results: 2, elapsed: 54 mins, 57 secs) (at 2014-01-27 21:51:27.465442) =========
Assignee | ||
Comment 7•10 years ago
|
||
Detect "no output for N seconds", using the existing timeout parameter. In practice, this means N=330 seconds. On timeout, fail, identifying last test seen and reason for failure.
Attachment #8366951 -
Flags: review?(jmaher)
Assignee | ||
Comment 8•10 years ago
|
||
On no-output timeout, this patch kill -6's (that's SIGABRT) the browser, to trigger a crash report. This part feels more fragile, but it seems to work fine. Try run demonstrating "no harm done": https://bugzilla.mozilla.org/show_bug.cgi?id=960265 Try run demonstrating new hang handling (with some infra problems): https://tbpl.mozilla.org/?tree=Try&rev=2582c5afa280 Note https://tbpl.mozilla.org/php/getParsedLog.php?id=33693610&tree=Try&full=1#error0; now Android handling is very similar to desktop: 08:17:57 INFO - 78 INFO TEST-END | /tests/Harness_sanity/test_SpecialPowersPushPermissions.html | finished in 413ms 08:17:57 INFO - 79 INFO TEST-START | /tests/Harness_sanity/test_SpecialPowersPushPrefEnv.html 08:23:42 WARNING - TEST-UNEXPECTED-FAIL | /tests/Harness_sanity/test_SpecialPowersPushPrefEnv.html | application timed out after 330.0 seconds with no output 08:23:42 INFO - INFO | automation.py | Application ran for: 0:06:46.273393 08:23:42 INFO - INFO | zombiecheck | Reading PID log: /tmp/tmp52TwC5pidlog 08:23:42 INFO - /data/anr/traces.txt not found 08:23:42 INFO - mozcrash INFO | Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/gbrown@mozilla.com-2582c5afa280/try-android/fennec-29.0a1.en-US.android-arm.crashreporter-symbols.zip 08:24:00 WARNING - PROCESS-CRASH | /tests/Harness_sanity/test_SpecialPowersPushPrefEnv.html | application crashed [@ libc.so + 0xcff0] 08:24:00 INFO - Crash dump filename: /tmp/tmptKpkyU/1b948ac6-70c4-1368-729ff642-4fce1466.dmp 08:24:00 INFO - Operating system: Android 08:24:00 INFO - 0.0.0 Linux 3.2.0+ #2 SMP PREEMPT Thu Nov 29 08:06:57 EST 2012 armv7l pandaboard/pandaboard/pandaboard:4.0.4/IMM76I/5:eng/test-keys 08:24:00 INFO - CPU: arm 08:24:00 INFO - 2 CPUs 08:24:00 INFO - Crash reason: SIGABRT 08:24:00 INFO - Crash address: 0x908 08:24:00 INFO - Thread 0 (crashed) ... - test name is identified - faster (330 second) failure - crash dump - logcat - ANR and Java exception checks (not evident here because of the nature of the hang)
Attachment #8366971 -
Flags: review?(jmaher)
Comment 9•10 years ago
|
||
\o/ :-D
Comment 10•10 years ago
|
||
Comment on attachment 8366951 [details] [diff] [review] (1) add no-output timeout to remoteautomation This gives failures of form: "application timed out after 330.0 seconds with no output" Slight nit, but could we display the seconds integer without the trailing zero, for consistency with other platforms? (in runtests.py we cast to an int / use %d) :-)
Comment 11•10 years ago
|
||
Comment on attachment 8366951 [details] [diff] [review] (1) add no-output timeout to remoteautomation Review of attachment 8366951 [details] [diff] [review]: ----------------------------------------------------------------- cool function. ::: build/mobile/remoteautomation.py @@ +247,5 @@ > @property > def getLastTestSeen(self): > return self.lastTestSeen > > + def wait(self, timeout = None, noOutputTimeout = None): please document this that noOutputTimeout is in seconds
Attachment #8366951 -
Flags: review?(jmaher) → review+
Comment 12•10 years ago
|
||
Comment on attachment 8366971 [details] [diff] [review] (2) on no-output timeout, kill -6 the browser Review of attachment 8366971 [details] [diff] [review]: ----------------------------------------------------------------- not sure I like this implementation as much. ::: build/mobile/remoteautomation.py @@ +285,5 @@ > + # kill -6 (SIGABRT) the process, to trigger a crash dump > + pid = self.dm.processExist(self.procName) > + if pid and pid > 0: > + try: > + self.dm.shellCheckOutput(['kill', '-6', str(pid)], root=True) do we really support "kill -6 <pid>" via the sutagent and on android ? Will this work on 2.2, 2.3, 4.0.4, 4.1, 4.2, 4.3 ? @@ +288,5 @@ > + try: > + self.dm.shellCheckOutput(['kill', '-6', str(pid)], root=True) > + except DMError: > + print "Error killing %s (pid %s)" % (self.procName, str(pid)) > + pass if abort we kill and then we dm.killProcess, shouldn't this go in devicemanager? do we want the double try here?
Attachment #8366971 -
Flags: review?(jmaher) → review-
Assignee | ||
Comment 13•10 years ago
|
||
(In reply to Joel Maher (:jmaher) from comment #12) > do we really support "kill -6 <pid>" via the sutagent and on android ? Will > this work on 2.2, 2.3, 4.0.4, 4.1, 4.2, 4.3 ? For the sutagent (or adb for that matter), it is just a command it execs. "kill -6 <pid>" works on 2.2: https://tbpl.mozilla.org/php/getParsedLog.php?id=33693538&tree=Try&full=1 and on 4.0.4: https://tbpl.mozilla.org/php/getParsedLog.php?id=33693610&tree=Try&full=1 and I verified it works on 4.2 locally. I think it is pretty standard across Unix/Linux/Android. I'll look into re-organizing/folding into killProcess.
Assignee | ||
Comment 14•10 years ago
|
||
Clean try run for (1): https://tbpl.mozilla.org/?tree=Try&rev=493b38c7fe0b
Assignee | ||
Comment 15•10 years ago
|
||
(1) with comment and %d added: https://hg.mozilla.org/integration/mozilla-inbound/rev/529211698369 Sheriffs - Be aware that this patch tranforms many cases of Android "2400 seconds without output" into "testXXX | application timed out after 330 seconds with no output". I expect we will need to open some new Android bugs today.
Whiteboard: [leave open]
Assignee | ||
Comment 17•10 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #13) > I'll look into re-organizing/folding into killProcess. I am pretty happy with the way this is working now, so I am reluctant to make further changes. Let's leave this as-is and open a new bug if we find cause.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•