android-hw tasks intermittently hang
Categories
(Testing :: General, defect, P3)
Tracking
(Not tracked)
People
(Reporter: gbrown, Assigned: gbrown)
References
(Blocks 1 open bug)
Details
Attachments
(1 file)
There are quite a few reports of android-hw mochitest-media exceeding its task max-run-time (90 minutes) in bug 1523098.
These tasks normally run in less than 15 minutes.
The failures appear to be hangs, after the process list is reported but before any tests are run.
Examples:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=270157725&repo=mozilla-central&lineNumber=1874
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=270186349&repo=autoland&lineNumber=1873
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=270203425&repo=autoland&lineNumber=1834
Comment 1•5 years ago (Assignee)
aerickson, bc: Do you understand what's happening here?
Comment 2•5 years ago
Not really. The first thing I would try is to log whether self.info(self.shell_output('ps')) returns in verify_device, by inserting a log statement immediately after it. If that hangs, split the line into ps_output = self.shell_output('ps') and self.info(ps_output) to see which half is at fault. If that doesn't help, I would sprinkle more diagnostic logs to narrow down where the hang is occurring.
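A minimal sketch of the suggested instrumentation, assuming the mozharness context where verify_device lives; the extra self.info() calls are the proposed additions, not existing code.

    def verify_device(self):
        ...
        ps_output = self.shell_output('ps')  # split out: does this call return at all?
        self.info('ps returned')             # proposed: log immediately afterwards
        self.info(ps_output)                 # or does the hang occur while logging the output?
        self.info('ps output logged')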
Comment 3•5 years ago
My observations/thoughts:
- ps takes ~2 seconds to run on 'normal' runs
- this error seems to have been happening for a while on android-hw
  - probably not due to the recent hostutils change
- for the linked errors above:
  - different devices (pixel2-08,10,17)
  - does seem to be hanging towards the end of the ps output
- "[taskcluster:warn 2019-10-08T04:22:17.000Z] no such process"
  - what's sending this? g-w?
  - it's already finished? what happened to it?
- adb connection issues?
  - already using latest adb (29.0.4)
  - not clustered around a set of devices on the same hub
- painful that we wait 90 minutes to fail
  - can we set timeouts on these commands? (see the sketch after this list)
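A sketch of the per-command timeout idea, assuming mozdevice's ADBDevice, whose commands take a timeout in seconds and raise ADBTimeoutError instead of blocking until the 90-minute task limit:

    from mozdevice import ADBDevice, ADBTimeoutError

    device = ADBDevice(device='FA84B1A00179')  # serial borrowed from a linked logcat
    try:
        print(device.shell_output('ps', timeout=60))
    except ADBTimeoutError:
        print('ps timed out after 60s; failing fast instead of hanging the task')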
Comment 4•5 years ago (Assignee)
Successful runs look like this:
[task 2019-10-10T17:35:52.580Z] 17:35:37 INFO - system 5174 743 2190528 37176 0 0 S com.android.keychain
[task 2019-10-10T17:35:52.580Z] 17:35:37 INFO - shell 5491 839 9204 1844 __skb_recv_datagram 7f45ff9c88 S logcat
[task 2019-10-10T17:35:52.580Z] 17:35:37 INFO - shell 5551 839 10452 2084 0 750394bc28 R ps
[task 2019-10-10T17:35:52.580Z] 17:35:37 INFO - u0_a40 5552 743 2195996 44876 0 0 D android.process.acore
[task 2019-10-10T17:35:52.580Z] 17:35:37 INFO - Running post-action listener: _resource_record_post_action
[task 2019-10-10T17:35:52.580Z] 17:35:37 INFO - [mozharness: 2019-10-10 17:35:37.785042Z] Finished verify-device step (success)
[task 2019-10-10T17:35:52.580Z] 17:35:37 INFO - [mozharness: 2019-10-10 17:35:37.785115Z] Running install step.
[task 2019-10-10T17:35:52.580Z] 17:35:37 INFO - Running pre-action listener: _resource_record_pre_action
[task 2019-10-10T17:35:52.580Z] 17:35:37 INFO - Running main action method: install
Comment 5•5 years ago (Assignee)
Comment 7•5 years ago
bugherder
Comment 8•5 years ago (Assignee)
There are similar hangs seen in android-hw jsreftest/jittest in bug 1571322. Here's the most recent:
[task 2019-10-18T11:39:53.402Z] 11:39:44 INFO - shell 10050 840 9204 1848 __skb_recv_datagram 774ded0c88 S logcat
[task 2019-10-18T11:39:53.402Z] 11:39:44 INFO - shell 10109 840 10452 2084 0 796b1eac28 R ps
[task 2019-10-18T11:39:53.402Z] 11:39:44 INFO - verify_device complete
[task 2019-10-18T11:39:53.402Z] 11:39:44 INFO - Running post-action listener: _resource_record_post_action
[task 2019-10-18T11:39:53.403Z] 11:39:44 INFO - [mozharness: 2019-10-18 11:39:44.077432Z] Finished verify-device step (success)
[task 2019-10-18T11:39:53.403Z] 11:39:44 INFO - [mozharness: 2019-10-18 11:39:44.077487Z] Running install step.
[task 2019-10-18T11:39:53.403Z] 11:39:44 INFO - Running pre-action listener: _resource_record_pre_action
[task 2019-10-18T11:39:53.403Z] 11:39:44 INFO - Running main action method: install
[taskcluster:error] Aborting task...
[taskcluster:warn 2019-10-18T12:38:45.937Z] no such process
[taskcluster 2019-10-18T12:38:45.938Z] === Task Finished ===
Comment 9•5 years ago
This, from the logcat, looks relevant. We've seen this signature issue before. Should we just uninstall before trying to install, and get around these issues once and for all? (See the sketch after the logcat lines.)
10-18 11:39:53.297 1198 1242 W PackageManager: Package org.mozilla.geckoview.test signatures do not match the previously installed version; ignoring!
10-18 11:39:53.336 881 2249 E : Couldn't opendir /data/app/vmdl1316328988.tmp: No such file or directory
10-18 11:39:53.336 881 2249 E installd: Failed to delete /data/app/vmdl1316328988.tmp: No such file or directory
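A sketch of that uninstall-before-install idea, assuming mozdevice's ADBDevice API; removing any stale org.mozilla.geckoview.test first avoids the signature-mismatch rejection shown above.

    from mozdevice import ADBDevice, ADBError

    device = ADBDevice()
    try:
        device.uninstall_app('org.mozilla.geckoview.test')
    except ADBError:
        pass  # nothing installed, or uninstall failed; install will surface any real error
    device.install_app('geckoview-androidTest.apk', timeout=300)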
Comment 10•5 years ago (Assignee)
Sounds good!
Comment 11•5 years ago (Assignee)
Noticed this in a recent try run:
https://treeherder.mozilla.org/logviewer.html#?job_id=272055262&repo=try
[task 2019-10-20T03:28:24.649Z] 03:28:24 INFO - u0_a71 4758 737 2241788 59476 0 0 S com.google.android.gm
[task 2019-10-20T03:28:24.649Z] 03:28:24 INFO - u0_a41 4839 737 2200184 47868 0 0 S com.google.android.apps.gcs
[task 2019-10-20T03:28:24.649Z] 03:28:24 INFO - u0_a108 4860 737 2549932 88296 0 0 S com.google.android.youtube
[taskcluster:error] Aborting task...
[taskcluster:warn 2019-10-20T04:56:58.955Z] no such process
[taskcluster 2019-10-20T04:56:58.955Z] === Task Finished ===
[taskcluster 2019-10-20T04:56:58.955Z] Task Duration: 1h30m0.000571885s
Logcat: https://taskcluster-artifacts.net/cN8pZXgDRBWgrmul3y0k_A/0/public/test_info/logcat-FA84B1A00179.log
...was active for the duration of the hang. The only geckoview mention I see is:
10-20 03:25:55.126 1178 1222 I BackupManagerService: Package org.mozilla.geckoview.test not installed; dropping from full backup
Is this a concern:
10-20 03:26:32.293 849 1116 E adbd : remote usb: read terminated (message): No such device
10-20 03:26:32.296 849 1115 I adbd : closing functionfs transport
10-20 03:26:32.297 849 856 I adbd : initializing functionfs
10-20 03:26:33.282 2523 2709 I WorkerManager: dispose()
10-20 03:26:33.294 2523 2709 W ThreadPoolDumper: Queue length for executor EventBus is now 11. Perhaps some tasks are too long, or the pool is too small.
10-20 03:26:33.308 849 856 I adbd : functionfs successfully initialized
10-20 03:26:33.309 849 856 I adbd : registering usb transport
Comment 12•5 years ago (Assignee)
In https://treeherder.mozilla.org/#/jobs?repo=try&revision=ae986444378fd2296d1239292abda30ba2e5feae I accidentally introduced a hang (or a Python exception?) during the install step -- the logs looked just like what we have seen here. I think there's a decent chance that bug 1589691 will resolve these hangs.
Comment 13•5 years ago (Assignee)
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=272906716&repo=autoland&lineNumber=1817
[task 2019-10-25T06:35:31.496Z] 06:35:31 INFO - Failed to install /builds/task_1571984945/workspace/build/geckoview-androidTest.apk on pixel2-09: ADBTimeoutError args: adb wait-for-device install /builds/task_1571984945/workspace/build/geckoview-androidTest.apk, exitcode: None, stdout:
[taskcluster:error] Aborting task...
[taskcluster:warn 2019-10-25T07:29:11.090Z] no such process
[taskcluster 2019-10-25T07:29:11.090Z] === Task Finished ===
[taskcluster 2019-10-25T07:29:11.090Z] Task Duration: 1h0m0.000544034s
    self.info('Failed to install %s on %s: %s %s' %
              (apk, self.device_name,
               type(e).__name__, e))
    self.fatal('INFRA-ERROR: %s Failed to install %s' %
               (type(e).__name__, os.path.basename(apk)),
               EXIT_STATUS_DICT[TBPL_RETRY])
What happened to "INFRA-ERROR"? Is self.fatal() hanging?
Comment 14•5 years ago (Assignee)
A try push that calls fatal(0) during the install step:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb47ae5c135be9e924cd9626d03b2a24224d4505
Note that android-em tasks exit with 0 and have full logging; android-hw tasks often hang and sometimes have incomplete logging.
Comment 15•5 years ago (Assignee)
The missing logging is more explicit and more reliably reproduced in
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0644da131caa525b14e925b8ee6af3f5cc4e560a
even with explicit log handler flushes in mozharness' fatal(). (A sketch of what those flushes look like follows.)
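A rough sketch of what "explicit log handler flushes in fatal()" means here, under an assumed mozharness-like shape; the real BaseScript/logging internals differ, so treat this as illustrative only.

    import logging
    import sys

    def fatal(message, exit_code=2):
        logging.getLogger().critical(message)
        for handler in logging.getLogger().handlers:
            handler.flush()    # flush each log handler explicitly
        sys.stdout.flush()     # and the raw streams feeding script.py's pipe
        sys.stderr.flush()
        sys.exit(exit_code)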
Comment 16•5 years ago (Assignee)
(In reply to Geoff Brown [:gbrown] from comment #12)
> I think there's a decent chance that bug 1589691 will resolve these hangs.
No, it didn't, though it may have reduced the frequency a little. Failures continue in bug 1589796. Recent ones:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=275064351&repo=autoland&lineNumber=1884
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=275033023&repo=mozilla-inbound&lineNumber=1883
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=274971978&repo=autoland&lineNumber=1883
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=274676230&repo=autoland&lineNumber=1816
Comment 17•5 years ago (Assignee)
Many (maybe all) of these happen (or appear to happen?) in or near the "install" mozharness step -- early in the task. That's consistent with my finding that this sort of hang is easily reproducible by calling fatal() during the install step; when I call fatal() during the "run_tests" step, tasks generally complete normally (with the error code requested in the fatal() call).
I say "appear to happen" because the logging is inconsistent. I keep wondering if some logging is being truncated in these failures.
Comment 18•5 years ago (Assignee)
script.py is specific to android-hw and pipes logging:
https://github.com/bclary/mozilla-bitbar-docker/blob/master/taskcluster/script.py#L201
I would like to try adding bufsize=0 to the Popen. Also, let's check the script.py write() return values.
I also find myself wondering if perhaps the readline()/write()/poll() loop has actually completed and something else in script.py is hung (enable_charging?). Can we print() immediately after the loop? (All three changes are sketched below.)
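A hypothetical sketch of the script.py log-relay loop with all three proposed changes (bufsize=0, checked write() return values, and a print() after the loop); the real loop and enable_charging step live in the linked mozilla-bitbar-docker script, so the details here are assumptions.

    import subprocess
    import sys

    cmd = ['python', 'mozharness_test.py']  # placeholder for the real test command

    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            bufsize=0)  # proposed: unbuffered, so partial lines aren't held back

    while proc.poll() is None:
        line = proc.stdout.readline()
        if line:
            written = sys.stdout.write(line.decode('utf-8', 'replace'))
            if written == 0:  # proposed: stop ignoring the write() return value
                sys.stderr.write('script.py: stdout write returned 0\n')
            sys.stdout.flush()

    # proposed: log immediately after the loop, to distinguish a hung loop
    # from a hang later in script.py (e.g. in enable_charging)
    print('script.py: log relay loop complete, rc=%s' % proc.returncode)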
aerickson - I don't know how to test and deploy these potential changes to script.py. Can you take that on?
Comment 20•5 years ago (Assignee)
Doing better here, but bug 1600193 persists.
Comment 21•5 years ago (Assignee)
Finally, all dependencies resolved.