Some android-hw tasks still hang after failing to install apk (appear to hang in verify_device)
Categories
(Testing :: General, defect, P1)
Tracking
(Not tracked)
People
(Reporter: gbrown, Assigned: gbrown)
References
Details
Recent android-hw intermittent hangs producing task timeouts (bug 1587207) other than raptor tasks (bug 1600193) still appear to hang in about the same place as observed in bug 1605790, but after 1605790's landing:
Some appear to hang during pip freeze, just before the verify-device step:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=286397444&repo=autoland&lineNumber=1113
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=285807461&repo=autoland&lineNumber=1109
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=286180439&repo=autoland&lineNumber=1109
[task 2020-01-24T18:32:43.918Z] 18:32:43 INFO - mozsystemmonitor == 0.4
[task 2020-01-24T18:32:43.918Z] 18:32:43 INFO - mozterm == 1.0.0
[task 2020-01-24T18:32:43.918Z] 18:32:43 INFO - moztest == 1.0.0
[task 2020-01-24T18:32:43.918Z] 18:32:43 INFO - mozversion == 2.2.0
[taskcluster:error] Aborting task...
[taskcluster:warn 2020-01-24T19:01:12.387Z] no such process
[taskcluster 2020-01-24T19:01:12.387Z] === Task Finished ===
and some appear to hang during the verify-device step:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=285701156&repo=mozilla-central&lineNumber=1120
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=285726800&repo=autoland&lineNumber=1116
[task 2020-01-21T07:31:40.477Z] 07:31:40 INFO - Starting resource monitoring.
[task 2020-01-21T07:31:40.477Z] 07:31:40 INFO - [mozharness: 2020-01-21 07:31:40.476397Z] Finished create-virtualenv step (success)
[task 2020-01-21T07:31:40.477Z] 07:31:40 INFO - [mozharness: 2020-01-21 07:31:40.476833Z] Running verify-device step.
[taskcluster:error] Aborting task...
[taskcluster:warn 2020-01-21T08:30:35.375Z] no such process
[taskcluster 2020-01-21T08:30:35.375Z] === Task Finished ===
![]() |
Assignee | |
Comment 1•5 years ago
|
||
However, in all of these cases, the affected tasks have both complete android-performance.log artifacts and logcat-<device>.log artifacts, so the test log appearance is misleading, and we know that execution is at least proceeding to:
Updated•5 years ago
|
Updated•5 years ago
|
![]() |
Assignee | |
Comment 2•5 years ago
|
||
I hit on this debugging technique: shell_output("log blah") puts "blah" in logcat.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2395eaaf30c59f389bec71d33b036f6561e54097
reproduced the hang with additional logging -- it looks like the end of verify_device() was executed. Trying to track it down further...
![]() |
Assignee | |
Comment 3•5 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b134d665a446aae221a2944dffa4f640128934c8
reproduced the hang:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=286762076&repo=try&lineNumber=1113
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=286762085&repo=try&lineNumber=1114
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=286758163&repo=try&lineNumber=1119
In each case, test logs suggest the hang is in pip freeze.
In each case, extra logcat logging shows that execution actually reaches install_apk(), which catches an exception and logs a fatal message at:
https://hg.mozilla.org/try/rev/9d33132173ff784b39b0f4d93c628c51d9c04f27#l1.16
Comment 4•5 years ago
|
||
I see
except (mozdevice.ADBError, mozdevice.ADBProcessError, mozdevice.ADBTimeoutError) as e:
+ self.shell_output('log installapk-except')
Can we not lump ADBErrors and ADBTimeoutError together? They are fundamentally different issues. I'd much rather have separate except clauses and either log timeouts and re-raise them or not catch them at all.
Note that ADBProcessError is a child of ADBError. We don't need both in the except clause.
class ADBProcessError(ADBError):
![]() |
Assignee | |
Comment 5•5 years ago
|
||
Agreed, ADBProcessError is not needed.
I don't mind having separate clauses for ADBError and ADBTimeoutError, but why not fail with self.fatal('...', EXIT_STATUS_DICT[TBPL_RETRY]) for both? The concept was, if we cannot install an apk, abandon the task, and rely on taskcluster to spin up a new task to try again.
Comment 6•5 years ago
|
||
fatal is fine. It is probably better in that we have a clear exit and can retry. I just want the script to not attempt to do anything else with the device after ADBTimeoutError.
![]() |
Assignee | |
Comment 7•5 years ago
|
||
Okay, got it.
![]() |
Assignee | |
Comment 8•5 years ago
|
||
By the way, now we can reliably reproduce the hang on try by simply forcing install_app to throw. That behavior continues even with the changes to our except clauses and simplified error messages:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=bcd99080aeccd4d499e9523860f9828766dc484e
![]() |
Assignee | |
Comment 9•5 years ago
|
||
If the android_hardware_unittest.py has exited, I think there is little chance that test-linux.sh will hang:
That leaves me wondering if script.py is not exiting its main loop:
https://github.com/bclary/mozilla-bitbar-docker/blob/master/taskcluster/script.py#L219
if line_len == 0 and bytes_written == bytes_read and rc is not None:
break
That's hard to tell with the incomplete logging (which I also don't understand!), but we might be able to sort this out by testing changes to script.py against try pushes like in comment 8.
Comment 10•5 years ago
|
||
Great detective work!
Andrew: Would you mind working with Geoff to figure out if script.py is the culprit?
Comment 11•5 years ago
|
||
Yeah, will do.
Geoff, does https://github.com/bclary/mozilla-bitbar-docker/pull/35 look like it should work?
![]() |
Assignee | |
Comment 12•5 years ago
|
||
Sorry, I don't know! I think you previously convinced me that pull/35 was not needed, and I don't actually see anything wrong with the existing code. I worry that one of the conditions is not being met....but which one, and why??
Want to try pull/35 as a quick possible fix? I think that should be fine.
Another idea: since script.py has an ADBDevice instance already, it should be easy to reliably log some additional diagnostics to logcat, like device.shell_output("log proc.poll returned None!") -- if that seems useful.
![]() |
Assignee | |
Updated•5 years ago
|
![]() |
Assignee | |
Comment 13•5 years ago
|
||
Side note:
I tried adding retries in install_apk: if install_app() fails for any reason, try to install again, x2, before giving up -- it didn't seem to help. We hit ADBTimeoutError on the final attempt with about the same frequency (and then we hit the hang, of course).
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b8909e097f698b6339c8cd4061fc4753847fb14f
Comment 14•5 years ago
|
||
OK, I'll start with the PR and then we can add more debug output if needed.
Comment 15•5 years ago
|
||
Testing update:
test command used: ./mach try fuzzy --full --query "android-hw 'debug-geckoview-mochitest-webgl1 !fis" --no-push
PR35
- https://github.com/bclary/mozilla-bitbar-docker/pull/35
- no change in output and still failing
PR41
Comment 16•5 years ago
|
||
We've found a fix and are testing it in https://github.com/bclary/mozilla-bitbar-docker/pull/42/.
Comment 17•5 years ago
|
||
The fix is now live (the docker image with the fix is now the default).
![]() |
Assignee | |
Comment 18•5 years ago
|
||
The fix is good! I no longer see this type of hang. (There are other android task timeouts, but those have raptor activity in the logcats.)
Thanks!
Description
•