Android test hangs should be handled by the test harness, outputting last test seen & attempting to get a stack

RESOLVED FIXED

Status

defect
RESOLVED FIXED
6 years ago
5 years ago

People

(Reporter: gbrown, Assigned: gbrown)

Tracking

(Blocks 1 bug, {sheriffing-P1})

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [leave open])

Attachments

(2 attachments)

When an Android test is killed for no output in N seconds (bug 663657), the resulting log is pretty useless. We may be able to retrieve logcats, minidumps, etc., but no attempt is made currently. If the browser is hung, perhaps we can kill it and retrieve the stack trace.

I am thinking that the test harness (runtestsremote.py or remoteautomation.py) could have a shorter timeout than the existing one (2300 seconds without output) and could try to report diagnostics before taking everything down.
Big +1 on this :-)

Even if it's only a 
TEST-UNEXPECTED-FAIL | /foo/bar/baz.html | application timed out after 330 seconds with no output
using the last test name seen (like we do elsewhere), that would be a great help for sheriffs (and prevent the dumping ground that is bug 663657.
Jim, any thoughts?
We really should just do the same thing we do on desktop tests:
http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#1050

Spit a timeout, try to kill the browser if it's still running in a way that gets a stack, check for dumps, etc.
Blocks: 663657
Blocks: 778688
Summary: Better logs for hung Android tests → Android test hangs should be handled by the test harness, outputting last test seen & attempting to get a stack
Keywords: sheriffing-P1
I think it'd be good to look for minidumps and ANR reports (both from the app data directory). Do we know roughly the causes for no-output? I can think of perhaps network issues, OOM kills, and actual hangs.
(In reply to Jim Chen [:jchen :nchen] from comment #4)
> I think it'd be good to look for minidumps and ANR reports (both from the
> app data directory). 

With any luck we can do all the "normal" checks easily: minidumps, ANR reports, Java exceptions + dump the logcat.

> Do we know roughly the causes for no-output? I can
> think of perhaps network issues, OOM kills, and actual hangs.

I think those are the main possibilities. I have come to the conclusion that most are actual hangs. If there are network issues, we should see a socket error, or at least a device manager failure (the harness pings the device at least once a minute to check if the browser is running). If OOM kills the browser but sutagent is still alive, the harness should see the browser is gone and end the test. If OOM kills the sutagent, device manager should report an error ("timeout in command xxx").
Try run demonstrating the issue by intentionally hanging a few tests: https://tbpl.mozilla.org/?tree=Try&rev=8e7652982b3f. 

Desktop (Linux x86 Opt) has a useful failure message and fails in under 6 minutes:

21:22:21     INFO -  78 INFO TEST-END | /tests/Harness_sanity/test_SpecialPowersPushPermissions.html | finished in 184ms
21:22:21     INFO -  79 INFO TEST-START | /tests/Harness_sanity/test_SpecialPowersPushPrefEnv.html
21:27:51  WARNING -  TEST-UNEXPECTED-FAIL | /tests/Harness_sanity/test_SpecialPowersPushPrefEnv.html | application timed out after 330 seconds with no output
21:27:51     INFO -  Xlib:  extension "RANDR" missing on display ":0".
21:27:52  WARNING -  TEST-UNEXPECTED-FAIL | /tests/Harness_sanity/test_SpecialPowersPushPrefEnv.html | application terminated with exit code 6
21:27:52     INFO -  INFO | runtests.py | Application ran for: 0:05:43.541083
21:27:52     INFO -  INFO | zombiecheck | Reading PID log: /tmp/tmpDBhZocpidlog
21:27:52     INFO -  mozcrash INFO | Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/gbrown@mozilla.com-8e7652982b3f/try-linux64/firefox-29.0a1.en-US.linux-x86_64.crashreporter-symbols.zip
21:28:18  WARNING -  PROCESS-CRASH | /tests/Harness_sanity/test_SpecialPowersPushPrefEnv.html | application crashed [@ libc-2.15.so + 0xe8403]
21:28:19     INFO -  Crash dump filename: /tmp/tmpwROy0P/minidumps/7f49d10a-4f31-d795-637327b7-136b3cc5.dmp
21:28:19     INFO -  Operating system: Linux
21:28:19     INFO -                    0.0.0 Linux 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64
21:28:19     INFO -  CPU: amd64
21:28:19     INFO -       family 6 model 45 stepping 7
21:28:19     INFO -       1 CPU
21:28:19     INFO -  Crash reason:  SIGABRT
21:28:19     INFO -  Crash address: 0x3e800000932
21:28:19     INFO -  Thread 0 (crashed)
21:28:19     INFO -   0  libc-2.15.so + 0xe8403
...

Android is terse and anonymous and takes 40 minutes to fail:

21:11:27     INFO -  78 INFO TEST-END | /tests/Harness_sanity/test_SpecialPowersPushPermissions.html | finished in 413ms
21:11:27     INFO -  79 INFO TEST-START | /tests/Harness_sanity/test_SpecialPowersPushPrefEnv.html

command timed out: 2400 seconds without output, attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=3296.769143
========= Finished '/tools/buildbot/bin/python scripts/scripts/android_panda.py ...' failed (results: 2, elapsed: 54 mins, 57 secs) (at 2014-01-27 21:51:27.465442) =========
Detect "no output for N seconds", using the existing timeout parameter. In practice, this means N=330 seconds. On timeout, fail, identifying last test seen and reason for failure.
Attachment #8366951 - Flags: review?(jmaher)
On no-output timeout, this patch kill -6's (that's SIGABRT) the browser, to trigger a crash report. This part feels more fragile, but it seems to work fine.

Try run demonstrating "no harm done": https://bugzilla.mozilla.org/show_bug.cgi?id=960265

Try run demonstrating new hang handling (with some infra problems): https://tbpl.mozilla.org/?tree=Try&rev=2582c5afa280

Note https://tbpl.mozilla.org/php/getParsedLog.php?id=33693610&tree=Try&full=1#error0; now Android handling is very similar to desktop:

08:17:57     INFO -  78 INFO TEST-END | /tests/Harness_sanity/test_SpecialPowersPushPermissions.html | finished in 413ms
08:17:57     INFO -  79 INFO TEST-START | /tests/Harness_sanity/test_SpecialPowersPushPrefEnv.html
08:23:42  WARNING -  TEST-UNEXPECTED-FAIL | /tests/Harness_sanity/test_SpecialPowersPushPrefEnv.html | application timed out after 330.0 seconds with no output
08:23:42     INFO -  INFO | automation.py | Application ran for: 0:06:46.273393
08:23:42     INFO -  INFO | zombiecheck | Reading PID log: /tmp/tmp52TwC5pidlog
08:23:42     INFO -  /data/anr/traces.txt not found
08:23:42     INFO -  mozcrash INFO | Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/gbrown@mozilla.com-2582c5afa280/try-android/fennec-29.0a1.en-US.android-arm.crashreporter-symbols.zip
08:24:00  WARNING -  PROCESS-CRASH | /tests/Harness_sanity/test_SpecialPowersPushPrefEnv.html | application crashed [@ libc.so + 0xcff0]
08:24:00     INFO -  Crash dump filename: /tmp/tmptKpkyU/1b948ac6-70c4-1368-729ff642-4fce1466.dmp
08:24:00     INFO -  Operating system: Android
08:24:00     INFO -                    0.0.0 Linux 3.2.0+ #2 SMP PREEMPT Thu Nov 29 08:06:57 EST 2012 armv7l pandaboard/pandaboard/pandaboard:4.0.4/IMM76I/5:eng/test-keys
08:24:00     INFO -  CPU: arm
08:24:00     INFO -       2 CPUs
08:24:00     INFO -  Crash reason:  SIGABRT
08:24:00     INFO -  Crash address: 0x908
08:24:00     INFO -  Thread 0 (crashed)
...

 - test name is identified
 - faster (330 second) failure
 - crash dump
 - logcat
 - ANR and Java exception checks (not evident here because of the nature of the hang)
Attachment #8366971 - Flags: review?(jmaher)
\o/ :-D
Comment on attachment 8366951 [details] [diff] [review]
(1) add no-output timeout to remoteautomation

This gives failures of form:
"application timed out after 330.0 seconds with no output"

Slight nit, but could we display the seconds integer without the trailing zero, for consistency with other platforms? (in runtests.py we cast to an int / use %d) :-)
Comment on attachment 8366951 [details] [diff] [review]
(1) add no-output timeout to remoteautomation

Review of attachment 8366951 [details] [diff] [review]:
-----------------------------------------------------------------

cool function.

::: build/mobile/remoteautomation.py
@@ +247,5 @@
>          @property
>          def getLastTestSeen(self):
>              return self.lastTestSeen
>  
> +        def wait(self, timeout = None, noOutputTimeout = None):

please document this that noOutputTimeout is in seconds
Attachment #8366951 - Flags: review?(jmaher) → review+
Comment on attachment 8366971 [details] [diff] [review]
(2) on no-output timeout, kill -6 the browser

Review of attachment 8366971 [details] [diff] [review]:
-----------------------------------------------------------------

not sure I like this implementation as much.

::: build/mobile/remoteautomation.py
@@ +285,5 @@
> +                # kill -6 (SIGABRT) the process, to trigger a crash dump
> +                pid = self.dm.processExist(self.procName)
> +                if pid and pid > 0:
> +                    try:
> +                        self.dm.shellCheckOutput(['kill', '-6', str(pid)], root=True)

do we really support "kill -6 <pid>" via the sutagent and on android ?  Will this work on 2.2, 2.3, 4.0.4, 4.1, 4.2, 4.3 ?

@@ +288,5 @@
> +                    try:
> +                        self.dm.shellCheckOutput(['kill', '-6', str(pid)], root=True)
> +                    except DMError:
> +                        print "Error killing %s (pid %s)" % (self.procName, str(pid))
> +                        pass

if abort we kill and then we dm.killProcess, shouldn't this go in devicemanager?  do we want the double try here?
Attachment #8366971 - Flags: review?(jmaher) → review-
(In reply to Joel Maher (:jmaher) from comment #12)
> do we really support "kill -6 <pid>" via the sutagent and on android ?  Will
> this work on 2.2, 2.3, 4.0.4, 4.1, 4.2, 4.3 ?

For the sutagent (or adb for that matter), it is just a command it execs.

"kill -6 <pid>" works on 2.2: https://tbpl.mozilla.org/php/getParsedLog.php?id=33693538&tree=Try&full=1

and on 4.0.4: https://tbpl.mozilla.org/php/getParsedLog.php?id=33693610&tree=Try&full=1

and I verified it works on 4.2 locally. I think it is pretty standard across Unix/Linux/Android.

I'll look into re-organizing/folding into killProcess.
(1) with comment and %d added: https://hg.mozilla.org/integration/mozilla-inbound/rev/529211698369

Sheriffs - Be aware that this patch tranforms many cases of Android "2400 seconds without output" into "testXXX | application timed out after 330 seconds with no output". I expect we will need to open some new Android bugs today.
Whiteboard: [leave open]
(In reply to Geoff Brown [:gbrown] from comment #13)
> I'll look into re-organizing/folding into killProcess.

I am pretty happy with the way this is working now, so I am reluctant to make further changes. Let's leave this as-is and open a new bug if we find cause.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
See Also: → 1027802
You need to log in before you can comment on or make changes to this bug.