Closed Bug 1587207 Opened 1 year ago Closed 11 months ago

android-hw tasks intermittently hang

Categories

(Testing :: General, defect, P3)

Version 3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gbrown, Assigned: gbrown)

References

(Blocks 2 open bugs)

Details

Attachments

(1 file)

There are quite a few reports of android-hw mochitest-media exceeding its task max-run-time (90 minutes) in bug 1523098.

These tasks normally run in less than 15 minutes.

The failures appear to be hangs, after the process list is reported but before any tests are run.

Examples:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=270157725&repo=mozilla-central&lineNumber=1874
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=270186349&repo=autoland&lineNumber=1873
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=270203425&repo=autoland&lineNumber=1834

aerickson, bc: Do you understand what's happening here?

Flags: needinfo?(bob)
Flags: needinfo?(aerickson)

Not really. The first thing I would try is to check whether self.info(self.shell_output('ps')) returns in verify_device, by inserting a log statement immediately after it. If that still hangs, split the line into ps_output = self.shell_output('ps') and self.info(ps_output). If that doesn't help, I would sprinkle more diagnostic logging to see where the hang is occurring.
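A sketch of the split described above, with a stub standing in for the mozharness device mixin (self.info and self.shell_output are the real harness methods; the class and log contents here are illustrative only):

```python
class VerifyDeviceSketch:
    """Stub standing in for the mozharness test mixin (illustrative only)."""

    def __init__(self):
        self.log_lines = []

    def info(self, msg):
        # The real self.info() writes to the mozharness log.
        self.log_lines.append(msg)

    def shell_output(self, cmd):
        # The real method runs `adb shell <cmd>`; stubbed for the sketch.
        return 'stub ps output'

    def verify_device(self):
        # Original: self.info(self.shell_output('ps')) -- one line, so a hang
        # cannot be localized. Split it and log between the two calls:
        ps_output = self.shell_output('ps')      # hang here => adb call stuck
        self.info('shell_output(ps) returned')   # hang after => logging stuck
        self.info(ps_output)
```

If the "returned" marker shows up in the task log but the ps output never does, the hang is in logging rather than in the adb call.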

Flags: needinfo?(bob)

My observations/thoughts:

  • ps takes ~2 seconds to run on 'normal' runs
  • this error seems to have been happening for a while on android-hw.
    • probably not due to recent hostutils change.
  • for linked errors above
    • different devices (pixel2-08,10,17)
    • does seem to be hanging towards the end of the ps output
    • "[taskcluster:warn 2019-10-08T04:22:17.000Z] no such process"
      • what's sending this? g-w?
      • it's already finished? what happened to it?
        • adb connection issues?
          • already using latest adb (29.0.4)
          • not clustered around set of devices on the same hub
      • painful that we wait 90 minutes to fail
        • can we set timeouts on these commands?
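On the last point: one generic way to put a deadline on a shell command is the timeout parameter of subprocess.run. This is a stand-in sketch, not the harness code; the harness actually drives adb through mozdevice, whose calls take their own timeout arguments (the ADBTimeoutError seen later in this bug comes from there).

```python
import subprocess

def shell_output_with_timeout(cmd, timeout=30):
    """Run a command, but raise subprocess.TimeoutExpired instead of
    hanging indefinitely. Generic sketch, not the harness code."""
    result = subprocess.run(cmd, capture_output=True, text=True,
                            timeout=timeout, check=True)
    return result.stdout

# A hung command would then fail fast instead of burning the full
# 90-minute max-run-time, e.g.:
# shell_output_with_timeout(['adb', 'shell', 'ps'], timeout=30)
```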
Flags: needinfo?(aerickson)
Priority: -- → P3

Successful runs look like this:

[task 2019-10-10T17:35:52.580Z] 17:35:37     INFO - system        5174   743 2190528  37176 0                   0 S com.android.keychain
[task 2019-10-10T17:35:52.580Z] 17:35:37     INFO - shell         5491   839    9204   1844 __skb_recv_datagram 7f45ff9c88 S logcat
[task 2019-10-10T17:35:52.580Z] 17:35:37     INFO - shell         5551   839   10452   2084 0          750394bc28 R ps
[task 2019-10-10T17:35:52.580Z] 17:35:37     INFO - u0_a40        5552   743 2195996  44876 0                   0 D android.process.acore
[task 2019-10-10T17:35:52.580Z] 17:35:37     INFO - Running post-action listener: _resource_record_post_action
[task 2019-10-10T17:35:52.580Z] 17:35:37     INFO - [mozharness: 2019-10-10 17:35:37.785042Z] Finished verify-device step (success)
[task 2019-10-10T17:35:52.580Z] 17:35:37     INFO - [mozharness: 2019-10-10 17:35:37.785115Z] Running install step.
[task 2019-10-10T17:35:52.580Z] 17:35:37     INFO - Running pre-action listener: _resource_record_pre_action
[task 2019-10-10T17:35:52.580Z] 17:35:37     INFO - Running main action method: install
Assignee: nobody → gbrown
Keywords: leave-open
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a30f6255ffd4
Add logging to investigate intermittent hang in android-hw tests; r=bc

There are similar hangs seen in android-hw jsreftest/jittest in bug 1571322. Here's the most recent:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=271869247&repo=mozilla-central&lineNumber=1812

[task 2019-10-18T11:39:53.402Z] 11:39:44     INFO - shell        10050   840    9204   1848 __skb_recv_datagram 774ded0c88 S logcat
[task 2019-10-18T11:39:53.402Z] 11:39:44     INFO - shell        10109   840   10452   2084 0          796b1eac28 R ps
[task 2019-10-18T11:39:53.402Z] 11:39:44     INFO - verify_device complete
[task 2019-10-18T11:39:53.402Z] 11:39:44     INFO - Running post-action listener: _resource_record_post_action
[task 2019-10-18T11:39:53.403Z] 11:39:44     INFO - [mozharness: 2019-10-18 11:39:44.077432Z] Finished verify-device step (success)
[task 2019-10-18T11:39:53.403Z] 11:39:44     INFO - [mozharness: 2019-10-18 11:39:44.077487Z] Running install step.
[task 2019-10-18T11:39:53.403Z] 11:39:44     INFO - Running pre-action listener: _resource_record_pre_action
[task 2019-10-18T11:39:53.403Z] 11:39:44     INFO - Running main action method: install
[taskcluster:error] Aborting task...
[taskcluster:warn 2019-10-18T12:38:45.937Z] no such process
[taskcluster 2019-10-18T12:38:45.938Z] === Task Finished ===
Blocks: 1571322
Summary: android-hw mochitest-media intermittently hangs → android-hw tasks intermittently hang

This entry from the logcat looks relevant. We've seen this signature issue before. Should we just uninstall before trying to install, and get around these issues once and for all?

10-18 11:39:53.297  1198  1242 W PackageManager: Package org.mozilla.geckoview.test signatures do not match the previously installed version; ignoring!
10-18 11:39:53.336   881  2249 E         : Couldn't opendir /data/app/vmdl1316328988.tmp: No such file or directory
10-18 11:39:53.336   881  2249 E installd: Failed to delete /data/app/vmdl1316328988.tmp: No such file or directory

Sounds good!
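The uninstall-first approach could look something like this sketch (the adb binary, apk path, and package name are illustrative; `adb uninstall` failing for a package that is not installed is deliberately tolerated):

```python
import subprocess

def install_clean(adb, apk, package):
    """Uninstall any existing copy before installing, so a signature
    mismatch with a previously installed build cannot block the install.
    Sketch only, not the harness code."""
    # Fails harmlessly (and is ignored) if the package is not installed.
    subprocess.run([adb, 'uninstall', package], check=False)
    # -r: replace existing; check=True surfaces a real install failure.
    subprocess.run([adb, 'install', '-r', apk], check=True)

# install_clean('adb', 'geckoview-androidTest.apk',
#               'org.mozilla.geckoview.test')
```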

Depends on: 1589691
Blocks: 1589796

Noticed this in a recent try run:

https://treeherder.mozilla.org/logviewer.html#?job_id=272055262&repo=try

[task 2019-10-20T03:28:24.649Z] 03:28:24     INFO - u0_a71        4758   737 2241788  59476 0                   0 S com.google.android.gm
[task 2019-10-20T03:28:24.649Z] 03:28:24     INFO - u0_a41        4839   737 2200184  47868 0                   0 S com.google.android.apps.gcs
[task 2019-10-20T03:28:24.649Z] 03:28:24     INFO - u0_a108       4860   737 2549932  88296 0                   0 S com.google.android.youtube
[taskcluster:error] Aborting task...
[taskcluster:warn 2019-10-20T04:56:58.955Z] no such process
[taskcluster 2019-10-20T04:56:58.955Z] === Task Finished ===
[taskcluster 2019-10-20T04:56:58.955Z] Task Duration: 1h30m0.000571885s

Logcat: https://taskcluster-artifacts.net/cN8pZXgDRBWgrmul3y0k_A/0/public/test_info/logcat-FA84B1A00179.log
...was active for the duration of the hang. The only geckoview mention I see is:

10-20 03:25:55.126  1178  1222 I BackupManagerService: Package org.mozilla.geckoview.test not installed; dropping from full backup

Is this a concern?

10-20 03:26:32.293   849  1116 E adbd    : remote usb: read terminated (message): No such device
10-20 03:26:32.296   849  1115 I adbd    : closing functionfs transport
10-20 03:26:32.297   849   856 I adbd    : initializing functionfs
10-20 03:26:33.282  2523  2709 I WorkerManager: dispose()
10-20 03:26:33.294  2523  2709 W ThreadPoolDumper: Queue length for executor EventBus is now 11. Perhaps some tasks are too long, or the pool is too small.
10-20 03:26:33.308   849   856 I adbd    : functionfs successfully initialized
10-20 03:26:33.309   849   856 I adbd    : registering usb transport

In https://treeherder.mozilla.org/#/jobs?repo=try&revision=ae986444378fd2296d1239292abda30ba2e5feae I accidentally introduced a hang (or python exception?) during the install step -- logs appeared just like what we have seen here. I think there's a decent chance that bug 1589691 will resolve these hangs.

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=272906716&repo=autoland&lineNumber=1817

[task 2019-10-25T06:35:31.496Z] 06:35:31     INFO - Failed to install /builds/task_1571984945/workspace/build/geckoview-androidTest.apk on pixel2-09: ADBTimeoutError args: adb wait-for-device install /builds/task_1571984945/workspace/build/geckoview-androidTest.apk, exitcode: None, stdout: 
[taskcluster:error] Aborting task...
[taskcluster:warn 2019-10-25T07:29:11.090Z] no such process
[taskcluster 2019-10-25T07:29:11.090Z] === Task Finished ===
[taskcluster 2019-10-25T07:29:11.090Z] Task Duration: 1h0m0.000544034s

vs.
https://searchfox.org/mozilla-central/rev/2a355e56c490768be676689de18629485c9d699b/testing/mozharness/mozharness/mozilla/testing/android.py#343

            self.info('Failed to install %s on %s: %s %s' %
                      (apk, self.device_name,
                       type(e).__name__, e))
            self.fatal('INFRA-ERROR: %s Failed to install %s' %
                       (type(e).__name__, os.path.basename(apk)),
                       EXIT_STATUS_DICT[TBPL_RETRY])

What happened to "INFRA-ERROR"? Is self.fatal() hanging?
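For context on what fatal() is supposed to do here: log the message, flush, and exit the script with the requested status. A minimal sketch of that shape (not the real mozharness implementation):

```python
import sys

def fatal(message, exit_code=2):
    """Minimal sketch of a mozharness-style fatal(): log, flush, exit.
    Illustrative only, not the real mozharness implementation."""
    sys.stderr.write('FATAL - %s\n' % message)
    sys.stderr.flush()   # flush so the message cannot be lost on exit
    sys.exit(exit_code)
```

In the hanging tasks above, neither the "FATAL" log line nor the prompt exit shows up, which is what makes the question of where fatal() gets stuck (or whose output gets dropped) interesting.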

Blocks: 1500266

try push which calls fatal(0) during the install step:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb47ae5c135be9e924cd9626d03b2a24224d4505

Note that android-em tasks exit with 0 and have full logging; android-hw tasks often hang and sometimes have incomplete logging.

The missing logging is reproduced more explicitly and more reliably in

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0644da131caa525b14e925b8ee6af3f5cc4e560a

even with explicit log handler flushes in mozharness fatal().

Many (maybe all) of these happen (or appear to happen?) in or near the "install" mozharness step -- early in the task. That's consistent with my finding that this sort of hang is easily reproducible by calling fatal() during the install step; when I call fatal() during the "run_tests" step, tasks generally complete normally (with the error code requested in the fatal() call).

I say "appear to happen" because the logging is inconsistent. I keep wondering if some logging is being truncated in these failures.

script.py is specific to android-hw and pipes logging:

https://github.com/bclary/mozilla-bitbar-docker/blob/master/taskcluster/script.py#L201

I would like to try adding bufsize=0 to the Popen, like

https://searchfox.org/mozilla-central/rev/b2b0077c2e6a516a76bf8077d6f0237c58f5959a/taskcluster/scripts/run-task#141

Also, let's check the script.py write() return values, like

https://searchfox.org/mozilla-central/rev/b2b0077c2e6a516a76bf8077d6f0237c58f5959a/taskcluster/scripts/run-task#107-109

I also find myself wondering if perhaps the readline()/write()/poll() loop has actually completed and something else in script.py is hung (enable_charging?). Can we print() immediately after the loop?
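Taken together, the three ideas above (unbuffered pipe, checked writes, and a marker after the loop) could be sketched like this; run_and_pipe and its details are illustrative, not the actual script.py code:

```python
import subprocess
import sys

def run_and_pipe(cmd, out=None):
    """Illustrative sketch of a script.py-style log-piping loop with the
    proposed diagnostics; not the actual script.py code."""
    out = out if out is not None else sys.stdout.buffer
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            bufsize=0)   # unbuffered pipe, as in run-task
    while True:
        line = proc.stdout.readline()
        if not line:                     # EOF: child closed its end
            break
        written = out.write(line)
        if written != len(line):         # surface short writes, don't drop log
            raise IOError('short write: %d of %d bytes'
                          % (written, len(line)))
    proc.wait()
    # Marker after the loop: if this appears but the task still hangs,
    # the hang is elsewhere in script.py (enable_charging?).
    print('log-piping loop complete')
    return proc.returncode
```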

aerickson - I don't know how to test and deploy these potential changes to script.py. Can you take that on?

Flags: needinfo?(aerickson)
Depends on: 1595129

Absolutely. Created Bug 1595129 for tracking.

Flags: needinfo?(aerickson)
Depends on: 1600193
Keywords: leave-open
Depends on: 1605790
Depends on: 1611936

Doing better here, but bug 1600193 persists.

Finally, all dependencies resolved.

Status: NEW → RESOLVED
Closed: 11 months ago
Resolution: --- → FIXED