Closed Bug 1611936 Opened 5 years ago Closed 5 years ago

Some android-hw tasks still hang after failing to install apk (appear to hang in verify_device)

Categories

(Testing :: General, defect, P1)

Version 3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gbrown, Assigned: gbrown)

References

Details

Recent android-hw intermittent hangs producing task timeouts (bug 1587207), excluding raptor tasks (bug 1600193), still appear to hang in about the same place as observed in bug 1605790, even after bug 1605790's fix landed:

Some appear to hang during pip freeze, just before the verify-device step:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=286397444&repo=autoland&lineNumber=1113
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=285807461&repo=autoland&lineNumber=1109
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=286180439&repo=autoland&lineNumber=1109

[task 2020-01-24T18:32:43.918Z] 18:32:43     INFO -   mozsystemmonitor == 0.4
[task 2020-01-24T18:32:43.918Z] 18:32:43     INFO -   mozterm == 1.0.0
[task 2020-01-24T18:32:43.918Z] 18:32:43     INFO -   moztest == 1.0.0
[task 2020-01-24T18:32:43.918Z] 18:32:43     INFO -   mozversion == 2.2.0
[taskcluster:error] Aborting task...
[taskcluster:warn 2020-01-24T19:01:12.387Z] no such process
[taskcluster 2020-01-24T19:01:12.387Z] === Task Finished ===

and some appear to hang during the verify-device step:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=285701156&repo=mozilla-central&lineNumber=1120
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=285726800&repo=autoland&lineNumber=1116

[task 2020-01-21T07:31:40.477Z] 07:31:40     INFO - Starting resource monitoring.
[task 2020-01-21T07:31:40.477Z] 07:31:40     INFO - [mozharness: 2020-01-21 07:31:40.476397Z] Finished create-virtualenv step (success)
[task 2020-01-21T07:31:40.477Z] 07:31:40     INFO - [mozharness: 2020-01-21 07:31:40.476833Z] Running verify-device step.
[taskcluster:error] Aborting task...
[taskcluster:warn 2020-01-21T08:30:35.375Z] no such process
[taskcluster 2020-01-21T08:30:35.375Z] === Task Finished ===

However, in all of these cases, the affected tasks have both complete android-performance.log artifacts and logcat-<device>.log artifacts, so the apparent hang point in the test log is misleading, and we know that execution proceeds at least as far as:

https://searchfox.org/mozilla-central/rev/aa684c0c76136be80af4c6e429bce81dea55c429/testing/mozharness/mozharness/mozilla/testing/android.py#614

Flags: needinfo?(bob)

I hit on this debugging technique: shell_output("log blah") puts "blah" in logcat.
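For reference, a minimal sketch of that technique, assuming a mozdevice.ADBDevice instance (the serial and marker strings are just illustrative):

    # Stamp markers into the device's logcat so progress can be reconstructed
    # from the logcat-<device>.log artifact even when the test log is cut off.
    from mozdevice import ADBDevice

    device = ADBDevice(device="<serial>")            # illustrative serial
    device.shell_output("log before-suspect-code")   # shows up in logcat
    # ... code under suspicion ...
    device.shell_output("log after-suspect-code")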

https://treeherder.mozilla.org/#/jobs?repo=try&revision=2395eaaf30c59f389bec71d33b036f6561e54097

That push reproduced the hang with additional logging -- it looks like the end of verify_device() was executed. Trying to track it down further...

I see

         except (mozdevice.ADBError, mozdevice.ADBProcessError, mozdevice.ADBTimeoutError) as e:
+            self.shell_output('log installapk-except')

Can we not lump ADBErrors and ADBTimeoutError together? They are fundamentally different issues. I'd much rather have separate except clauses and either log timeouts and re-raise them or not catch them at all.

Note that ADBProcessError is a child of ADBError. We don't need both in the except clause.

class ADBProcessError(ADBError):

Agreed, ADBProcessError is not needed.

I don't mind having separate clauses for ADBError and ADBTimeoutError, but why not fail with self.fatal('...', EXIT_STATUS_DICT[TBPL_RETRY]) for both? The concept was, if we cannot install an apk, abandon the task, and rely on taskcluster to spin up a new task to try again.

fatal is fine. It is probably better in that we have a clear exit and can retry. I just want the script to not attempt to do anything else with the device after ADBTimeoutError.

Okay, got it.
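For the record, a rough sketch of the direction agreed on here -- not the landed patch -- assuming mozharness's EXIT_STATUS_DICT/TBPL_RETRY and an ADBDevice at self.device:

    import mozdevice
    from mozharness.mozilla.automation import EXIT_STATUS_DICT, TBPL_RETRY

    def install_apk(self, apk):
        try:
            self.device.install_app(apk)
        except mozdevice.ADBTimeoutError:
            # adb is unresponsive: issue no further device commands, just exit
            # with a retryable status so taskcluster spins up a new task.
            self.fatal('INFRA-ERROR: adb timed out installing %s' % apk,
                       EXIT_STATUS_DICT[TBPL_RETRY])
        except mozdevice.ADBError as e:
            # ADBProcessError is a subclass of ADBError, so one clause covers both.
            self.fatal('INFRA-ERROR: failed to install %s: %s' % (apk, e),
                       EXIT_STATUS_DICT[TBPL_RETRY])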

By the way, we can now reliably reproduce the hang on try simply by forcing install_app to throw. The hang persists even with the changes to our except clauses and simplified error messages:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=bcd99080aeccd4d499e9523860f9828766dc484e

If android_hardware_unittest.py has exited, I think there is little chance that test-linux.sh will hang:

https://searchfox.org/mozilla-central/rev/ad6234148892bc29bf08d736604ac71925138040/taskcluster/scripts/tester/test-linux.sh#227

That leaves me wondering if script.py is not exiting its main loop:

https://github.com/bclary/mozilla-bitbar-docker/blob/master/taskcluster/script.py#L219

        if line_len == 0 and bytes_written == bytes_read and rc is not None:
            break

That's hard to tell with the incomplete logging (which I also don't understand!), but we might be able to sort this out by testing changes to script.py against try pushes like in comment 8.
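For context, here is the shape of a pump loop with that exit condition -- a simplified single-threaded sketch, not the actual script.py. The loop only breaks once the child has a return code and all captured output has been flushed, so if any one condition is never satisfied, the loop spins until the task hits its max run time.

    import subprocess

    proc = subprocess.Popen(["python", "some_harness.py"],   # illustrative command
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    bytes_read = bytes_written = 0
    while True:
        line = proc.stdout.readline()      # b'' once the pipe reaches EOF
        line_len = len(line)
        bytes_read += line_len
        if line_len:
            print(line.decode("utf-8", "replace"), end="")
            bytes_written += line_len
        rc = proc.poll()                   # None until the child exits
        if line_len == 0 and bytes_written == bytes_read and rc is not None:
            break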

Great detective work!

Andrew: Would you mind working with Geoff to figure out if script.py is the culprit?

Yeah, will do.

Geoff, does https://github.com/bclary/mozilla-bitbar-docker/pull/35 look like it should work?

Sorry, I don't know! I think you previously convinced me that pull/35 was not needed, and I don't actually see anything wrong with the existing code. I worry that one of the conditions is not being met... but which one, and why?

Want to try pull/35 as a quick possible fix? I think that should be fine.

Another idea: since script.py has an ADBDevice instance already, it should be easy to reliably log some additional diagnostics to logcat, like device.shell_output("log proc.poll returned None!") -- if that seems useful.
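Purely illustrative -- device and proc stand in for whatever script.py actually holds -- something like:

    # Stamp the loop state into logcat so a hung task's logcat artifact shows
    # which exit condition was never met.
    rc = proc.poll()
    if rc is None:
        device.shell_output("log script.py: proc.poll() is None, read=%d written=%d"
                            % (bytes_read, bytes_written))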

Summary: Some android-hw tasks still appear to hang during/near verify_device → Some android-hw tasks still hang after failing to install apk (appear to hang in verify_device)

Side note:
I tried adding retries in install_apk: if install_app() fails for any reason, retry the install up to two more times before giving up -- it didn't seem to help. We hit ADBTimeoutError on the final attempt with about the same frequency (and then we hit the hang, of course). A sketch of the experiment follows below.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b8909e097f698b6339c8cd4061fc4753847fb14f
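Roughly what that experiment looked like (a sketch with hypothetical names; EXIT_STATUS_DICT/TBPL_RETRY as in the earlier sketch):

    def install_apk_with_retries(self, apk, attempts=3):
        # Retry the install a couple of times before giving up; on the try push
        # above, the final attempt still hit ADBTimeoutError about as often.
        for attempt in range(attempts):
            try:
                self.device.install_app(apk)
                return
            except (mozdevice.ADBError, mozdevice.ADBTimeoutError) as e:
                self.info("install attempt %d of %d failed: %s"
                          % (attempt + 1, attempts, e))
        self.fatal('INFRA-ERROR: failed to install %s after %d attempts'
                   % (apk, attempts), EXIT_STATUS_DICT[TBPL_RETRY])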

OK, I'll start with the PR and then we can add more debug output if needed.

The fix is now live (the Docker image containing it is now the default).

The fix is good! I no longer see this type of hang. (There are other android task timeouts, but those have raptor activity in the logcats.)

Thanks!

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED