Bug 1500266 (Open): Opened 6 years ago, Updated 19 days ago

Intermittent raise ADBTimeoutError("%s" % adb_process)

Categories: Testing :: General, defect, P3

Tracking: Not tracked

Status: REOPENED

People: Reporter: intermittent-bug-filer, Unassigned

Keywords: intermittent-failure
Whiteboard: [stockwell infra][comment 20]

Attachments: 3 files

Filed by: bclary [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=206171878&repo=mozilla-central

https://queue.taskcluster.net/v1/task/M2TxQOtuQc2lk6juvZDLFQ/runs/0/artifacts/public/logs/live_backing.log

This bug will be used to classify ADBTimeoutErrors on Android emulators and hardware. Either ActiveData or Redash can be used to obtain breakdowns of failures by device.
Depends on: 1512352
Assignee: nobody → gbrown
Priority: -- → P1
I don't know what you will be able to do about adb timeouts, especially for hardware.
True. I was going to look at the android-em issues, but I've already got bug 1512352 for that -- I'll work over there.
Assignee: gbrown → nobody
Priority: P1 → P3
The most common android-em cases are resolved now (bug 1512352).

Many of the recent android-hw cases are in mochitest-media-1, following the problematic dom/media/test directory; I wonder if these will be resolved by bug 1517170.
See Also: → 1517124

There are 23 total failures in the last 7 days, all on android-hw-p2-8-0-arm7-api-16 debug and opt.

Recent failure log:

16:19:39 INFO - 354 INFO TEST-START | dom/media/webaudio/test/test_waveShaperPassThrough.html
16:19:39 INFO - 355 INFO TEST-OK | dom/media/webaudio/test/test_waveShaperPassThrough.html | took 396ms
16:24:49 INFO - Traceback (most recent call last):
16:24:49 INFO - File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2791, in doTests
16:24:49 INFO - e10s=options.e10s
16:24:49 INFO - File "/builds/worker/workspace/build/tests/mochitest/runtestsremote.py", line 341, in runApp
16:24:49 INFO - ret, _ = self.automation.runApp(*args, **kwargs)
16:24:49 INFO - File "/builds/worker/workspace/build/tests/mochitest/remoteautomation.py", line 69, in runApp
16:24:49 INFO - status = self.waitForFinish(timeout, maxTime)
16:24:49 INFO - File "/builds/worker/workspace/build/tests/mochitest/remoteautomation.py", line 122, in waitForFinish
16:24:49 INFO - status = self.wait(timeout=maxTime, noOutputTimeout=timeout)
16:24:49 INFO - File "/builds/worker/workspace/build/tests/mochitest/remoteautomation.py", line 409, in wait
16:24:49 INFO - hasOutput = self.read_stdout()
16:24:49 INFO - File "/builds/worker/workspace/build/tests/mochitest/remoteautomation.py", line 324, in read_stdout
16:24:49 INFO - if not self.device.is_file(self.remoteLog):
16:24:49 INFO - File "/builds/worker/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 1962, in is_file
16:24:49 INFO - return self._test_path('f', path, timeout, root)
16:24:49 INFO - File "/builds/worker/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 1900, in _test_path
16:24:49 INFO - timeout=timeout, root=root)
16:24:49 INFO - File "/builds/worker/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 1426, in shell_bool
16:24:49 CRITICAL - raise ADBTimeoutError("%s" % adb_process)
16:24:49 CRITICAL - ADBTimeoutError: args: adb wait-for-device shell test -f /sdcard/tests/logs/mochitest.log; echo adb_returncode=$?, exitcode: None, stdout:
16:24:49 INFO - 356 ERROR Automation Error: Received unexpected exception while running application
16:24:49 INFO - runtests.py | Device disconnected. Aborting test.
16:24:49 INFO - Stopping web server
16:24:49 INFO - Stopping web socket server
16:24:49 INFO - Stopping ssltunnel
16:24:49 INFO - websocket/process bridge listening on port 8191
16:24:50 INFO - Stopping websocket/process bridge
16:24:50 INFO - 357 ERROR Automation Error: Exception caught while running tests
16:24:50 INFO - Traceback (most recent call last):
16:24:50 INFO - File "/builds/worker/workspace/build/tests/mochitest/runtestsremote.py", line 371, in run_test_harness
16:24:50 INFO - retVal = mochitest.runTests(options)
16:24:50 INFO - File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2590, in runTests
16:24:50 INFO - res = self.runMochitests(options, tests_in_manifest)
16:24:50 INFO - File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2384, in runMochitests
16:24:50 INFO - result = self.doTests(options, testsToRun)
16:24:50 INFO - File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2791, in doTests
16:24:50 INFO - e10s=options.e10s
16:24:50 INFO - File "/builds/worker/workspace/build/tests/mochitest/runtestsremote.py", line 341, in runApp
16:24:50 INFO - ret, _ = self.automation.runApp(*args, **kwargs)
16:24:50 INFO - File "/builds/worker/workspace/build/tests/mochitest/remoteautomation.py", line 69, in runApp
16:24:50 INFO - status = self.waitForFinish(timeout, maxTime)
16:24:50 INFO - File "/builds/worker/workspace/build/tests/mochitest/remoteautomation.py", line 122, in waitForFinish
16:24:50 INFO - status = self.wait(timeout=maxTime, noOutputTimeout=timeout)
16:24:50 INFO - File "/builds/worker/workspace/build/tests/mochitest/remoteautomation.py", line 409, in wait
16:24:50 INFO - hasOutput = self.read_stdout()
16:24:50 INFO - File "/builds/worker/workspace/build/tests/mochitest/remoteautomation.py", line 324, in read_stdout
16:24:50 INFO - if not self.device.is_file(self.remoteLog):
16:24:50 INFO - File "/builds/worker/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 1962, in is_file
16:24:50 INFO - return self._test_path('f', path, timeout, root)
16:24:50 INFO - File "/builds/worker/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 1900, in _test_path
16:24:50 INFO - timeout=timeout, root=root)
16:24:50 INFO - File "/builds/worker/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 1426, in shell_bool
16:24:50 CRITICAL - raise ADBTimeoutError("%s" % adb_process)
16:24:50 CRITICAL - ADBTimeoutError: args: adb wait-for-device shell test -f /sdcard/tests/logs/mochitest.log; echo adb_returncode=$?, exitcode: None, stdout:
16:24:50 INFO - Device disconnected. Will not run mochitest.cleanup().
16:24:50 INFO - Buffered messages finished
16:24:50 INFO - SUITE-END | took 506s
16:24:50 ERROR - Return code: 1
16:24:50 INFO - TinderboxPrint: mochitest-media<br/>66/0/0
16:24:50 INFO - ##### mochitest-media log ends
16:24:50 CRITICAL - # TBPL RETRY #
16:24:50 WARNING - setting return code to 4
16:24:50 CRITICAL - The mochitest-media suite: mochitest-media ran with return status: RETRY

Geoff, do you have time to take a look at this?

Flags: needinfo?(gbrown)

There isn't anything Geoff or anyone else can do. This error occurs when the physical device or emulator stops responding, for whatever reason, for the duration of the timeout period, which is 5 minutes. Unfortunately, it is inherent to Android.
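
For context, the mechanism is roughly the following (a simplified sketch, not the actual mozdevice adb.py code; the 5-minute default and the error text mirror the logs in this bug):

import subprocess
import time


class ADBTimeoutError(Exception):
    """Raised when an adb command does not finish within the timeout."""


def run_adb(args, timeout=300, poll_interval=0.1):
    # "adb wait-for-device ..." blocks until a device is attached; if the
    # device then stops responding, the command never exits and the harness
    # hits the 5-minute timeout described above.
    proc = subprocess.Popen(
        ["adb", "wait-for-device"] + list(args),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    deadline = time.time() + timeout
    while time.time() < deadline:
        if proc.poll() is not None:
            return proc.returncode, proc.stdout.read()
        time.sleep(poll_interval)
    proc.kill()
    raise ADBTimeoutError("args: %s, exitcode: None, stdout:" % " ".join(args))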

Flags: needinfo?(gbrown)

Thank you Bob.

Joel, can we label this as stockwell infra based on the above comment?

Flags: needinfo?(jmaher)

:apavel, yes. Do you want to make the whiteboard tag change?

Flags: needinfo?(jmaher)

Yes Joel, thank you.

Whiteboard: [stockwell needswork:owner] → [stockwell infra]

There are 31 total failures in the last 7 days on Android platforms, but most of them seem to be misclassifications: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-04-19&endday=2019-04-26&tree=trunk&bug=1500266

The most recent 3 failures from today and yesterday are due to pixel2-22 which we have flagged to be reflashed in Bug 1549313.

In the last 7 days there have been 28 occurrences on android-hw-p2-8-0-arm7-api-16, build types debug and pgo.

This bug failed 35 times in the last 7 days. Occurs on android-em-7-0-x86_64 and android-hw-p2-8-0-arm7-api-16 on opt and pgo build types.

Log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=251919414&repo=mozilla-inbound&lineNumber=15981

All but one of these classifications was correct. pixel2-14 accounted for 21 instances. I'll speak to bitbar later today.

Flags: needinfo?(bob)

I've determined pixel2-14 is having problems and have raised the issue with bitbar.

Flags: needinfo?(bob)

pixel2-14 was responsible for 10 of these. I moved it back into quarantine yesterday. We are working on a plan to see if upgrading it to Android 8.1 will help.

pixel2-14 was responsible for all of these and now has been quarantined again. I'm not sure why it wasn't in quarantine already.

Device counts for failures 2019-07-25

  5 motog5-13
  4 pixel2-02
  4 pixel2-01
  3 pixel2-37
  3 pixel2-04
  3 pixel2-03
  2 pixel2-54
  2 pixel2-40
  2 pixel2-38
  2 pixel2-36
  2 pixel2-19
  1 pixel2-60
  1 pixel2-58
  1 pixel2-57
  1 pixel2-56
  1 pixel2-55
  1 pixel2-53
  1 pixel2-52
  1 pixel2-51
  1 pixel2-50
  1 pixel2-47
  1 pixel2-46
  1 pixel2-45
  1 pixel2-44
  1 pixel2-43
  1 pixel2-42
  1 pixel2-41
  1 pixel2-39
  1 pixel2-33
  1 pixel2-28
  1 pixel2-27
  1 pixel2-26
  1 motog5-40
  1 motog5-34
  1 motog5-10
  1 motog5-09
  1 motog5-07
  1 motog5-06
  1 motog5-01

Device counts for the last 7 days

  8 pixel2-27
  7 pixel2-28
  6 pixel2-26
  6 motog5-10
  6 motog5-07
  6 motog5-06
  5 pixel2-37
  5 pixel2-36
  5 pixel2-02
  5 motog5-13
  4 pixel2-54
  4 pixel2-41
  4 pixel2-04
  4 pixel2-03
  4 pixel2-01
  4 motog5-09
  3 pixel2-58
  3 pixel2-46
  3 pixel2-45
  3 pixel2-40
  3 pixel2-38
  3 motog5-01
  2 pixel2-47
  2 pixel2-39
  2 pixel2-19
  2 motog5-40
  1 pixel2-60
  1 pixel2-57
  1 pixel2-56
  1 pixel2-55
  1 pixel2-53
  1 pixel2-52
  1 pixel2-51
  1 pixel2-50
  1 pixel2-49
  1 pixel2-48
  1 pixel2-44
  1 pixel2-43
  1 pixel2-42
  1 pixel2-33
  1 pixel2-30
  1 pixel2-29
  1 pixel2-24
  1 pixel2-22
  1 pixel2-12
  1 pixel2-09
  1 motog5-34

I'll raise the issue with bitbar. We are in the process of evaluating whether upgrading to the latest Android 9 will help with these issues, though this might just be the devices wearing out.

pixel2-14 has been reflashed. I've removed it from quarantine and will watch it closely.

Since removing pixel2-14 from quarantine, 2/3 jobs were failures for adb issues. Re-quarantined for now.

Bitbar has moved 2-14 to a different USB hub and swapped USB cables. I've removed the device from quarantine. All 4 of its jobs since being removed from quarantine have been successful.

The time outs are pretty evenly distributed among the devices.

37 devices experienced ADBTimeoutErrors, with the following having the most:

  4 pixel2-30
  4 pixel2-26
  3 pixel2-25
  3 pixel2-23
  2 pixel2-57
  2 pixel2-27
  2 pixel2-24

The other devices had one instance each.

There are 54 total failures in the last 7 days on

android-hw-p2-8-0-arm7-api-16-qr debug and opt
android-hw-p2-8-0-arm7-api-16 pgo, opt and debug
android-hw-p2-8-0-android-aarch64 pgo.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=264024651&repo=mozilla-central&lineNumber=828

[task 2019-08-29T10:40:10.903Z] 10:35:10 INFO - New mozdevice with adb=adb, device=HT8641A00557
[task 2019-08-29T10:40:10.903Z] 10:35:10 INFO - adb -s HT8641A00557 logcat -v threadtime Trace:S StrictMode:S ExchangeService:S
[task 2019-08-29T10:40:10.903Z] 10:40:10 INFO - Failed to install /builds/task_1567074849/workspace/build/geckoview_example.apk on pixel2-32: ADBTimeoutError args: adb wait-for-device install /builds/task_1567074849/workspace/build/geckoview_example.apk, exitcode: None, stdout:
[task 2019-08-29T10:40:10.903Z] 10:40:10 FATAL - INFRA-ERROR: ADBTimeoutError Failed to install geckoview_example.apk
[task 2019-08-29T10:40:10.903Z] 10:40:10 FATAL - Running post_fatal callback...
[task 2019-08-29T10:40:10.903Z] 10:40:10 FATAL - Exiting 4
[task 2019-08-29T10:40:10.903Z] 10:40:10 INFO - Killing logcat pid 545.

Attached file 2019-08-29 devices —

The issues are evenly distributed and no particular device stands out. This is from the IFV for this bug, but there are still a few other ADBTimeoutError bugs being used for classification, which reduces the overall count here. No action is necessary at this time.

aerickson: Looks like 32 devices with pixel2-01, pixel2-02, pixel2-03, pixel2-04 leading the list. They are among our oldest devices. Perhaps bitbar can reimage them.

  7 pixel2-04
  7 pixel2-01
  5 pixel2-03
  5 pixel2-02
  4 pixel2-47
  3 pixel2-52
  3 pixel2-45
  3 pixel2-43
  3 pixel2-38
  3 pixel2-36
  3 pixel2-23
  2 pixel2-51
  2 pixel2-49
  2 pixel2-41
  2 pixel2-37
  2 pixel2-33
  2 pixel2-27
  2 pixel2-26
  2 motog5-13
  1 pixel2-57
  1 pixel2-55
  1 pixel2-53
  1 pixel2-50
  1 pixel2-48
  1 pixel2-42
  1 pixel2-40
  1 pixel2-30
  1 pixel2-25
  1 pixel2-24
  1 pixel2-21
  1 pixel2-14
  1 motog5-34
Flags: needinfo?(aerickson)

In the last week there have been 99 occurrences. The occurrence rate is decreasing at the moment.

Recent failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=266990171&repo=autoland&lineNumber=909

[task 2019-09-17T07:39:08.416Z] 07:34:03 INFO - Using adb 1.0.41
[task 2019-09-17T07:39:08.416Z] 07:34:04 INFO - adbd not restarted as root
[task 2019-09-17T07:39:08.416Z] 07:34:04 INFO - su -c setenforce 0 exitcode 0, stdout: None
[task 2019-09-17T07:39:08.416Z] 07:34:05 INFO - su -c supported
[task 2019-09-17T07:39:08.416Z] 07:34:05 INFO - /system/bin/ls -1A supported
[task 2019-09-17T07:39:08.416Z] 07:34:05 INFO - Native cp support: True
[task 2019-09-17T07:39:08.416Z] 07:34:05 INFO - Native chmod -R support: True
[task 2019-09-17T07:39:08.416Z] 07:34:05 INFO - Native chown -R support: True
[task 2019-09-17T07:39:08.416Z] 07:34:05 INFO - Setting SELinux Permissive
[task 2019-09-17T07:39:08.416Z] 07:34:06 INFO - Native pidof support: True
[task 2019-09-17T07:39:08.416Z] 07:34:06 INFO - New mozdevice with adb=adb, device=HT86B1A01456
[task 2019-09-17T07:39:08.416Z] 07:34:07 INFO - adb -s HT86B1A01456 logcat -v threadtime Trace:S StrictMode:S ExchangeService:S
[task 2019-09-17T07:39:08.416Z] 07:39:07 INFO - Failed to install /builds/task_1568705501/workspace/build/geckoview_example.apk on pixel2-39: ADBTimeoutError args: adb wait-for-device install /builds/task_1568705501/workspace/build/geckoview_example.apk, exitcode: None, stdout:
[task 2019-09-17T07:39:08.416Z] 07:39:07 FATAL - INFRA-ERROR: ADBTimeoutError Failed to install geckoview_example.apk
[task 2019-09-17T07:39:08.416Z] 07:39:07 FATAL - Running post_fatal callback...
[task 2019-09-17T07:39:08.416Z] 07:39:07 FATAL - Exiting 4
[task 2019-09-17T07:39:08.416Z] 07:39:07 INFO - Killing logcat pid 729.

I've asked bitbar to reimage pixel2-01 to pixel2-04.

Flags: needinfo?(aerickson)

There are 39 total failures in the last 7 days on android-hw-p2-8-0-arm7-api-16 pgo, opt and debug, android-hw-p2-8-0-android-aarch64 pgo and opt.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=268460439&repo=autoland&lineNumber=22109

[task 2019-09-26T00:04:59.549Z] 00:03:38 INFO - REFTEST TEST-START | http://10.7.205.209:8854/jsreftest/tests/jsreftest.html?test=test262/language/statementList/eval-block-with-statment-array-literal-with-item.js
[task 2019-09-26T00:04:59.549Z] 00:03:38 INFO - REFTEST TEST-LOAD | http://10.7.205.209:8854/jsreftest/tests/jsreftest.html?test=test262/language/statementList/eval-block-with-statment-array-literal-with-item.js | 4951 / 5967 (82%)
[task 2019-09-26T00:04:59.549Z] 00:03:49 INFO - remoteautomation.py | exception reading log: [Errno 2] No such file or directory: '/tmp/tmpTCgosn'
[task 2019-09-26T00:16:01.385Z] 00:04:59 INFO - Failed to get top activity, retrying, once...
[task 2019-09-26T00:16:01.385Z] 00:10:59 INFO - Automation Error: Exception caught while running tests
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - Traceback (most recent call last):
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - File "/builds/task_1569454873/workspace/build/tests/reftest/remotereftest.py", line 429, in run_test_harness
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - retVal = reftest.runTests(options.tests, options)
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - File "/builds/task_1569454873/workspace/build/tests/reftest/runreftest.py", line 588, in runTests
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - return self.runSerialTests(manifests, options, cmdargs)
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - File "/builds/task_1569454873/workspace/build/tests/reftest/runreftest.py", line 923, in runSerialTests
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - return run()
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - File "/builds/task_1569454873/workspace/build/tests/reftest/runreftest.py", line 914, in run
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - **kwargs)
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - File "/builds/task_1569454873/workspace/build/tests/reftest/remotereftest.py", line 383, in runApp
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - e10s=options.e10s)
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - File "/builds/task_1569454873/workspace/build/tests/reftest/remoteautomation.py", line 68, in runApp
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - status = self.waitForFinish(timeout, maxTime)
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - File "/builds/task_1569454873/workspace/build/tests/reftest/remoteautomation.py", line 118, in waitForFinish
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - status = self.wait(timeout=maxTime, noOutputTimeout=timeout)
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - File "/builds/task_1569454873/workspace/build/tests/reftest/remoteautomation.py", line 370, in wait
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - self.read_stdout()
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - File "/builds/task_1569454873/workspace/build/tests/reftest/remoteautomation.py", line 257, in read_stdout
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - if not self.device.is_file(self.remoteLog):
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - File "/builds/task_1569454873/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 2134, in is_file
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - return self._test_path('f', path, timeout=timeout, root=root)
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - File "/builds/task_1569454873/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 2072, in _test_path
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - timeout=timeout, root=root)
[task 2019-09-26T00:16:01.386Z] 00:10:59 INFO - File "/builds/task_1569454873/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 1573, in shell_bool
[task 2019-09-26T00:16:01.386Z] 00:10:59 CRITICAL - raise ADBTimeoutError("%s" % adb_process)
[task 2019-09-26T00:16:01.386Z] 00:10:59 CRITICAL - ADBTimeoutError: args: adb wait-for-device shell test -f /sdcard/tests/reftest/reftest.log, exitcode: None, stdout:
[task 2019-09-26T00:16:01.386Z] 00:16:01 INFO - Traceback (most recent call last):
[task 2019-09-26T00:16:01.386Z] 00:16:01 INFO - File "/builds/task_1569454873/workspace/build/tests/reftest/remotereftest.py", line 446, in <module>
[task 2019-09-26T00:16:01.386Z] 00:16:01 INFO - sys.exit(run_test_harness(parser, options))
[task 2019-09-26T00:16:01.386Z] 00:16:01 INFO - File "/builds/task_1569454873/workspace/build/tests/reftest/remotereftest.py", line 438, in run_test_harness
[task 2019-09-26T00:16:01.386Z] 00:16:01 INFO - reftest.printDeviceInfo(printLogcat=(retVal != 0))
[task 2019-09-26T00:16:01.386Z] 00:16:01 INFO - File "/builds/task_1569454873/workspace/build/tests/reftest/remotereftest.py", line 329, in printDeviceInfo
[task 2019-09-26T00:16:01.386Z] 00:16:01 INFO - logcat = self.device.get_logcat(filter_out_regexps=fennecLogcatFilters)
[task 2019-09-26T00:16:01.386Z] 00:16:01 INFO - File "/builds/task_1569454873/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 1719, in get_logcat
[task 2019-09-26T00:16:01.386Z] 00:16:01 INFO - lines = self.command_output(cmds, timeout=timeout).splitlines()
[task 2019-09-26T00:16:01.386Z] 00:16:01 INFO - File "/builds/task_1569454873/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 1215, in command_output
[task 2019-09-26T00:16:01.386Z] 00:16:01 INFO - timeout=timeout)
[task 2019-09-26T00:16:01.386Z] 00:16:01 INFO - File "/builds/task_1569454873/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 318, in command_output
[task 2019-09-26T00:16:01.386Z] 00:16:01 CRITICAL - raise ADBTimeoutError("%s" % adb_process)
[task 2019-09-26T00:16:12.049Z] 00:16:01 CRITICAL - mozdevice.adb.ADBTimeoutError: args: adb wait-for-device logcat -v time -d dalvikvm:I ConnectivityService:S WifiMonitor:S WifiStateTracker:S wpa_supplicant:S NetworkStateTracker:S EmulatedCamera_Camera:S EmulatedCamera_Device:S EmulatedCamera_FakeCamera:S EmulatedCamera_FakeDevice:S EmulatedCamera_CallbackNotifier:S GnssLocationProvider:S Hyphenator:S BatteryStats:S, exitcode: None, stdout:
[task 2019-09-26T00:16:12.049Z] 00:16:01 ERROR - Return code: 1
[task 2019-09-26T00:16:12.049Z] 00:16:01 ERROR - No tests run or test summary not found
[task 2019-09-26T00:16:12.049Z] 00:16:01 INFO - TinderboxPrint: jsreftest<br/><em class="testfail">T-FAIL</em>
[task 2019-09-26T00:16:12.049Z] 00:16:01 INFO - ##### jsreftest log ends
[task 2019-09-26T00:16:12.049Z] 00:16:01 CRITICAL - # TBPL RETRY #
[task 2019-09-26T00:16:12.049Z] 00:16:01 WARNING - setting return code to 4
[task 2019-09-26T00:16:12.049Z] 00:16:01 CRITICAL - The jsreftest suite: jsreftest ran with return status: RETRY

Andrew, are there updates here?

Flags: needinfo?(aerickson)

I don't think we'll ever eliminate these errors. We're working to make jobs be retried when this error occurs.

I've pinged bitbar regarding pixel2-01 - 04 getting reimaged. I haven't heard anything yet.

I noticed that for the last week, pixel2-04 is the only host with issues... so I'm not sure if they have underlying issues or we're just seeing the regular failure rate.

Flags: needinfo?(aerickson)

I understand, thank you for explaining.

Whiteboard: [stockwell infra] → [stockwell infra][comment 20]

Mostly spread out among 29 devices, with pixel2-11 the worst offender.

      4 pixel2-11
      3 pixel2-55
      3 pixel2-53
      3 pixel2-51
      3 pixel2-41
      3 pixel2-04
      3 pixel2-02
      3 motog5-13
      2 pixel2-44
      2 pixel2-43
      2 pixel2-30
      2 pixel2-29
      2 pixel2-09
      1 pixel2-59
      1 pixel2-56
      1 pixel2-54
      1 pixel2-50
      1 pixel2-46
      1 pixel2-42
      1 pixel2-31
      1 pixel2-28
      1 pixel2-26
      1 pixel2-24
      1 pixel2-23
      1 pixel2-12
      1 pixel2-10
      1 pixel2-03
      1 motog5-07
      1 motog5-06

  5 pixel2-50
  4 pixel2-30
  4 pixel2-25
  3 pixel2-49
  3 pixel2-23
  2 pixel2-57
  2 pixel2-53
  2 pixel2-48
  2 pixel2-47
  2 pixel2-43
  2 pixel2-40
  2 pixel2-33
  2 pixel2-29
  2 pixel2-13
  2 pixel2-09
  1 pixel2-58
  1 pixel2-56
  1 pixel2-55
  1 pixel2-52
  1 pixel2-51
  1 pixel2-41
  1 pixel2-38
  1 pixel2-37
  1 pixel2-36
  1 pixel2-31
  1 pixel2-24
  1 pixel2-20
  1 pixel2-16
  1 pixel2-12
  1 pixel2-06
  1 pixel2-02
  1 motog5-14
  1 motog5-07
  1 motog5-06

pixel2-50 may need investigation as well as pixel2-25 and pixel2-30.

Flags: needinfo?(aerickson)

Update on pixel2-01 to 04. The hub has been replaced.

Regarding the new top 3 offenders (pixel2-50, pixel2-30, pixel2-25), they're all on different cluster managers/hubs. Bitbar seems to think that re-imaging doesn't help, so just replace them?

Flags: needinfo?(aerickson) → needinfo?(bob)

I'll leave the NI open, check the timeouts for today, and see if this is a continuing pattern before making the call to replace them.

  3 motog5-03
  2 pixel2-11
  2 pixel2-10
  2 pixel2-04
  2 pixel2-02
  2 motog5-13
  1 pixel2-57
  1 pixel2-55
  1 pixel2-53
  1 pixel2-37
  1 pixel2-33
  1 pixel2-31
  1 pixel2-30
  1 pixel2-29
  1 pixel2-28
  1 pixel2-27
  1 pixel2-26
  1 pixel2-25
  1 pixel2-23
  1 pixel2-16
  1 pixel2-15
  1 pixel2-14
  1 pixel2-12
  1 pixel2-09
  1 pixel2-08
  1 pixel2-07
  1 pixel2-05
  1 pixel2-03
  1 motog5-12
  1 motog5-11
  1 motog5-09
  1 motog5-05

All but one of these were actual ADBTimeoutErrors. No real pattern here.

Flags: needinfo?(bob)

Bug 1573269 should help with this (failures will cause retries for most ADB connection issues). I think there are a few spots returning non-4 exit codes on these issues that we may have to clean up.

50 failures were associated with this bug in the last 7 days: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-10-23&endday=2019-10-30&tree=all&bug=1500266

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=273668976&repo=autoland&lineNumber=2696

This failed on: android-hw-g5-7-0-arm7-api-16, android-hw-p2-8-0-android-aarch64 pgo build type.

Attached file 2019-10-30 bad devices —

The biggest offenders are:

  6 motog5-07
  5 pixel2-28
  5 motog5-14
  5 motog5-03
  4 pixel2-27
  4 motog5-06

aerickson: Where do we stand with getting retries working?
motog5-07, motog5-06, motog5-14 are on and01.
motog5-03 is on and4.
pixel2-27 and pixel2-28 are on and02.

Perhaps we need a hub or cables replaced on and01.

Flags: needinfo?(aerickson)

The work required to make g-w rerun jobs that exit with exit code 4 is done (Bug 1573269). I think that the issue we're seeing with the jobs is that they're not exiting 4, so we need to go through and change those parts of the code to send exit code 4.

[task 2019-10-30T13:51:09.966Z] 13:40:58     INFO -  mozdevice.adb.ADBTimeoutError: args: adb wait-for-device shell sync, exitcode: None, stdout:
[task 2019-10-30T13:51:09.966Z] 13:40:59    ERROR - Return code: 1
[task 2019-10-29T10:24:17.728Z] 10:14:07     INFO -  mozdevice.adb.ADBTimeoutError: args: adb wait-for-device shell su -c "echo 512 > /sys/kernel/debug/msm-bus-dbg/shell-client/slv", exitcode: None, stdout:
[task 2019-10-29T10:24:17.728Z] 10:14:07    ERROR - Return code: 1

Bitbar just replaced hubs and cables on 3 device managers (and04, and05, and and06). I'll see when and01's hub and cables were last replaced.

Flags: needinfo?(aerickson)

(In reply to Andrew Erickson [:aerickson] from comment #146)

For android-em, the mozharness log parser translates "ADBTimeoutError" and other log strings into an exit code of 4:

https://searchfox.org/mozilla-central/rev/9df2ccc256fe54c314b34e52920471a0d04b283e/testing/mozharness/mozharness/mozilla/testing/unittest.py#169-175
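
Roughly, that translation looks like the sketch below (illustrative only, not the actual unittest.py parser; the marker strings and the retry exit code 4 are taken from the comments and logs in this bug):

# Sketch of a log parser that maps known infra-error strings in harness
# output to the worker's retry exit code (4), which is what the mozharness
# parser is described as doing for android-em jobs.
TBPL_RETRY_EXIT_CODE = 4
INFRA_ERROR_MARKERS = ("ADBTimeoutError", "INFRA-ERROR")


def exit_code_for_log(log_lines, harness_exit_code):
    for line in log_lines:
        if any(marker in line for marker in INFRA_ERROR_MARKERS):
            return TBPL_RETRY_EXIT_CODE
    return harness_exit_code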

That mechanism doesn't seem to be working for raptor tests; I can have a closer look.

There are also some on-going failures in mochitest-media, where I would expect the mozharness log parser to be working:

https://taskcluster-artifacts.net/StZkla0IQtO5QRrPYdrarQ/0/public/logs/live_backing.log

[task 2019-10-25T19:13:27.906Z] 19:13:27     INFO - Failed to install /builds/task_1572030418/workspace/build/geckoview-androidTest.apk on pixel2-21: ADBTimeoutError args: adb wait-for-device install /builds/task_1572030418/workspace/build/geckoview-androidTest.apk, exitcode: None, stdout: 
[task 2019-10-25T19:13:27.906Z] 19:13:27    FATAL - INFRA-ERROR: ADBTimeoutError Failed to install geckoview-androidTest.apk
[taskcluster:error] Aborting task...
[taskcluster:warn 2019-10-25T20:37:03.681Z] no such process
[taskcluster 2019-10-25T20:37:03.681Z] === Task Finished ===

Notice that there is a hang after INFRA-ERROR. That is consistent with what I am seeing in bug 1587207: android-hw tasks hang after calling mozharness fatal().

Depends on: 1587207
Depends on: 1592681

Bitbar will be replacing the hub on and01 today. Sakari doesn't think that the cables are going bad.

Depends on: 1594982
Depends on: 1595245

2 of the 3 most recent failures exit mozharness correctly with code 4 but subsequently hang (bug 1587207).

The remaining failure is an ADBTimeoutError in script.py:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=276622899&repo=mozilla-central&lineNumber=102

I'm not sure how to handle that, or even if it is worth the effort.

Andrew: Can you think of a way for us to retry the ADBTimeoutError failures in script.py?

Flags: needinfo?(aerickson)
See Also: → 1597364

Yeah, shouldn't be hard. I've created Bug 1597364.

Flags: needinfo?(aerickson)

Update: There are 26 total failures in the last 7 days on android-hw-p2-8-0-arm7-api-16 and android-hw-g5-7-0-arm7-api-16 pgo and debug.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=278398835&repo=autoland&lineNumber=1804

[task 2019-11-27T13:57:32.077Z] 13:52:31 INFO - verify_device complete
[task 2019-11-27T13:57:32.077Z] 13:52:31 INFO - Running post-action listener: _resource_record_post_action
[task 2019-11-27T13:57:32.077Z] 13:52:31 INFO - [mozharness: 2019-11-27 13:52:31.507450Z] Finished verify-device step (success)
[task 2019-11-27T13:57:32.077Z] 13:52:31 INFO - [mozharness: 2019-11-27 13:52:31.507490Z] Running install step.
[task 2019-11-27T13:57:32.077Z] 13:52:31 INFO - Running pre-action listener: _resource_record_pre_action
[task 2019-11-27T13:57:32.077Z] 13:52:31 INFO - Running main action method: install
[task 2019-11-27T13:57:32.077Z] 13:57:32 INFO - Failed to install /builds/task_1574862101/workspace/build/geckoview-androidTest.apk on pixel2-06: ADBTimeoutError args: adb wait-for-device install /builds/task_1574862101/workspace/build/geckoview-androidTest.apk, exitcode: None, stdout:
[task 2019-11-27T13:57:32.077Z] 13:57:32 FATAL - INFRA-ERROR: ADBTimeoutError Failed to install geckoview-androidTest.apk
[task 2019-11-27T13:57:32.077Z] 13:57:32 FATAL - Running post_fatal callback...
[task 2019-11-27T13:57:32.077Z] 13:57:32 FATAL - Exiting 4
[task 2019-11-27T13:57:32.077Z] 13:57:32 INFO - Running post-action listener: _resource_record_post_action
[task 2019-11-27T13:57:32.077Z] 13:57:32 INFO - [mozharness: 2019-11-27 13:57:32.021293Z] Finished install step (failed)
[task 2019-11-27T13:57:32.077Z] 13:57:32 INFO - Running post-run listener: _resource_record_post_run
[taskcluster:error] Aborting task...
[taskcluster:warn 2019-11-27T14:51:28.545Z] no such process
[taskcluster 2019-11-27T14:51:28.546Z] === Task Finished ===
[taskcluster 2019-11-27T14:51:28.546Z] Task Duration: 1h0m0.000317628s
[taskcluster 2019-11-27T14:51:28.786Z] Uploading artifact public/logs/localconfig.json from file workspace/logs/localconfig.json with content encoding "gzip", mime type "application/json" and expiry 2020-11-26T11:05:26.357Z
[taskcluster 2019-11-27T14:51:29.022Z] Uploading artifact public/test_info/android-performance.log from file workspace/build/blobber_upload_dir/android-performance.log with content encoding "gzip", mime type "text/plain" and expiry 2020-11-26T11:05:26.357Z
[taskcluster 2019-11-27T14:51:29.221Z] Uploading artifact public/test_info/logcat-FA84C1A00065.log from file workspace/build/blobber_upload_dir/logcat-FA84C1A00065.log with content encoding "gzip", mime type "text/plain" and expiry 2020-11-26T11:05:26.357Z
[taskcluster 2019-11-27T14:51:29.597Z] Uploading artifact public/test_info/resource-usage.json from file workspace/build/blobber_upload_dir/resource-usage.json with content encoding "gzip", mime type "application/json" and expiry 2020-11-26T11:05:26.357Z
[taskcluster 2019-11-27T14:51:29.897Z] Uploading redirect artifact public/logs/live.log to URL https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/VFShMb3fR2mFFK5jy1p3Ng/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2020-11-26T11:05:26.357Z
[taskcluster:error] Task aborted - max run time exceeded

Andrew: It looks like some of the motog5s are having problems:

  6 motog5-06
  4 motog5-07
  4 motog5-14
  1 pixel2-02
  1 pixel2-53

:bc, sorry just seeing your comment now.

Some of the g5 hosts you mentioned were at a lower success rate earlier today, but seem to have improved. Current hosts under 85% success rate:

gecko-t-bitbar-gw-perf-g5.motog5-22  {sr: [========  ]  84.2%, suc: 16, cmp: 19, exc:  0, rng:  1, notes: ['No jobs in queue.'], alerts: ['Low health (less than 0.85)!']}
gecko-t-bitbar-gw-perf-g5.motog5-28  {sr: [=======   ]  77.8%, suc: 14, cmp: 18, exc:  1, rng:  1, notes: ['No jobs in queue.'], alerts: ['Low health (less than 0.85)!']}
gecko-t-bitbar-gw-unit-p2.pixel2-05  {sr: [========  ]  84.2%, suc: 16, cmp: 19, exc:  0, rng:  1, alerts: ['Low health (less than 0.85)!']}
gecko-t-bitbar-gw-unit-p2.pixel2-16  {sr: [========  ]  84.2%, suc: 16, cmp: 19, exc:  0, rng:  1, alerts: ['Low health (less than 0.85)!']}
gecko-t-bitbar-gw-unit-p2.pixel2-20  {sr: [========  ]  84.2%, suc: 16, cmp: 19, exc:  0, rng:  1, alerts: ['Low health (less than 0.85)!']}

(In reply to Intermittent Failures Robot from comment #167)

61 failures in 4867 pushes (0.013 failures/push) were associated with this bug in the last 7 days.

Looks like the usual suspects...

 14 motog5-07
 14 motog5-06
 13 motog5-14
  4 motog5-03
  2 pixel2-24
  2 pixel2-21
  2 pixel2-16
  2 pixel2-12
  1 pixel2-52
  1 pixel2-43
  1 pixel2-29
  1 pixel2-05
  1 pixel2-04
  1 motog5-32
  1 motog5-26
  1 motog5-12

If we restrict the range to just Saturday through today, we have:
7 motog5-14
6 motog5-07
6 motog5-06
1 pixel2-47

aerickson: These top 3 motog5s appear to still be an issue.

Flags: needinfo?(aerickson)

I've quarantined the hosts. Will see what bitbar has to say about them.

gecko-t-bitbar-gw-perf-g5.motog5-06  {sr: [=======   ]  73.7%, suc: 14, cmp: 19, exc:  0, rng:  1, notes: ['No jobs in queue.'], alerts: ['Low health (less than 0.85)!']}
gecko-t-bitbar-gw-perf-g5.motog5-07  {sr: [=======   ]  75.0%, suc: 15, cmp: 20, exc:  0, rng:  0, notes: ['No jobs in queue.'], alerts: ['Low health (less than 0.85)!']}
gecko-t-bitbar-gw-perf-g5.motog5-14  {sr: [=====     ]  60.0%, suc: 12, cmp: 20, exc:  0, rng:  0, notes: ['No jobs in queue.'], alerts: ['Low health (less than 0.85)!']}
Flags: needinfo?(aerickson)

Sakari is investigating. Logs look ok. The hosts are all on the and01 cluster manager. We replaced the hub on that manager in October.

There wasn't anything interesting in the logs. They've swapped cables and rebooted the cluster manager.

Removing from quarantine and watching them.

Mostly pixels this time...
4 pixel2-10
3 pixel2-11
3 pixel2-09
3 motog5-10
3 motog5-03
2 pixel2-28
2 pixel2-23
2 pixel2-04
2 pixel2-03
2 motog5-09
1 pixel2-27
1 pixel2-26
1 pixel2-25
1 pixel2-24
1 pixel2-16
1 pixel2-15
1 pixel2-13
1 pixel2-08
1 pixel2-07
1 pixel2-06
1 pixel2-05
1 pixel2-02
1 motog5-13

Depends on: 1605239

I asked Bitbar to reboot the cluster manager and investigate for pixel2-09, pixel2-10, pixel2-11 yesterday.

Today Sakari posted "actually we swapped usb hub, cable and updated the host earlier today already - devices were going offline every 5-6 hours
i checked a few hours ago and runs were showing mostly green(except pixel2-11)".

There are 23 total failures in the last 7 days on android-hw-p2-8-0-arm7-api-16-qr opt and debug, and android-hw-p2-8-0-arm7-api-16 debug and pgo.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=283340661&repo=autoland&lineNumber=3045

[task 2020-01-03T13:24:34.744Z] 13:17:13 INFO - 327 INFO TEST-START | dom/media/tests/mochitest/test_peerConnection_trackDisabling_clones.html
[task 2020-01-03T13:24:34.744Z] 13:17:13 INFO - 328 INFO TEST-OK | dom/media/tests/mochitest/test_peerConnection_trackDisabling_clones.html | took 5408ms
[task 2020-01-03T13:24:34.744Z] 13:17:24 INFO - remoteautomation.py | exception reading log: [Errno 2] No such file or directory: '/tmp/tmpzdy3Gc'
[task 2020-01-03T13:24:34.744Z] 13:18:34 INFO - Failed to get top activity, retrying, once...
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - Traceback (most recent call last):
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - File "/builds/task_1578056703/workspace/build/tests/mochitest/runtests.py", line 2852, in doTests
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - e10s=options.e10s
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - File "/builds/task_1578056703/workspace/build/tests/mochitest/runtestsremote.py", line 339, in runApp
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - ret, _ = self.automation.runApp(*args, **kwargs)
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - File "/builds/task_1578056703/workspace/build/tests/mochitest/remoteautomation.py", line 68, in runApp
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - status = self.waitForFinish(timeout, maxTime)
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - File "/builds/task_1578056703/workspace/build/tests/mochitest/remoteautomation.py", line 118, in waitForFinish
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - status = self.wait(timeout=maxTime, noOutputTimeout=timeout)
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - File "/builds/task_1578056703/workspace/build/tests/mochitest/remoteautomation.py", line 369, in wait
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - self.read_stdout()
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - File "/builds/task_1578056703/workspace/build/tests/mochitest/remoteautomation.py", line 252, in read_stdout
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - newLogContent = self.device.get_file(self.remoteLog, offset=self.stdoutlen)
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - File "/builds/task_1578056703/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 2440, in get_file
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - self._sync(timeout=timeout)
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - File "/builds/task_1578056703/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 980, in _sync
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - self.shell_output("sync", timeout=timeout)
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - File "/builds/task_1578056703/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 1621, in shell_output
[task 2020-01-03T13:24:34.744Z] 13:24:34 CRITICAL - raise ADBTimeoutError("%s" % adb_process)
[task 2020-01-03T13:24:34.744Z] 13:24:34 CRITICAL - ADBTimeoutError: args: adb wait-for-device shell sync, exitcode: None, stdout:
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - 329 ERROR Automation Error: Received unexpected exception while running application
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - runtests.py | Device disconnected. Aborting test.
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - Stopping web server
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - Stopping web socket server
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - Stopping ssltunnel
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - websocket/process bridge listening on port 8191
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - Stopping websocket/process bridge
[task 2020-01-03T13:24:34.744Z] 13:24:34 INFO - 330 ERROR Automation Error: Exception caught while running tests

I reviewed the most recent 10 failures reported here. In all cases, the mozharness script returned 4, but script.py subsequently raised another ADBTimeoutError, during enable_charging().

[task 2020-01-04T12:38:51.684Z] + exit 4
[task 2020-01-04T12:38:51.684Z] script.py: command finished (bytes read: 179087, bytes written: 179087)
[task 2020-01-04T12:38:51.687Z] Traceback (most recent call last):
[task 2020-01-04T12:38:51.687Z]   File "/builds/taskcluster/script.py", line 250, in <module>
[task 2020-01-04T12:38:51.687Z]     sys.exit(main())
[task 2020-01-04T12:38:51.687Z]   File "/builds/taskcluster/script.py", line 227, in main
[task 2020-01-04T12:38:51.687Z]     enable_charging(device, device_type)
[task 2020-01-04T12:38:51.687Z]   File "/builds/taskcluster/script.py", line 58, in enable_charging
[task 2020-01-04T12:38:51.687Z]     print("script.py: enabling charging for device '%s' ('%s')..." % (device_type, device.get_info('id')['id']))
[task 2020-01-04T12:38:51.687Z]   File "/usr/local/lib/python3.6/dist-packages/mozdevice/adb.py", line 2883, in get_info
[task 2020-01-04T12:38:51.687Z]     info['id'] = self.command_output(['get-serialno'], timeout=timeout)
[task 2020-01-04T12:38:51.687Z]   File "/usr/local/lib/python3.6/dist-packages/mozdevice/adb.py", line 1215, in command_output
[task 2020-01-04T12:38:51.687Z]     timeout=timeout)
[task 2020-01-04T12:38:51.687Z]   File "/usr/local/lib/python3.6/dist-packages/mozdevice/adb.py", line 318, in command_output
[task 2020-01-04T12:38:51.687Z]     raise ADBTimeoutError("%s" % adb_process)
[task 2020-01-04T12:38:51.687Z] mozdevice.adb.ADBTimeoutError: args: adb wait-for-device get-serialno, exitcode: None, stdout: 

aerickson: is this something you could take care of?

Flags: needinfo?(aerickson)

Definitely. I've landed a fix (https://github.com/bclary/mozilla-bitbar-docker/pull/38) that is being tested (https://github.com/bclary/mozilla-bitbar-devicepool/pull/87). It should land soon.
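
One plausible shape for that kind of fix is sketched below (a sketch only; it assumes the change guards the post-run device calls in script.py, and enable_charging and device_type come from the traceback above, not from the linked pull requests):

from mozdevice.adb import ADBTimeoutError


def run_post_run_cleanup(enable_charging, device, device_type):
    # Sketch: swallow ADBTimeoutError during cleanup so a dead device in
    # enable_charging() cannot replace the exit code 4 that mozharness
    # already returned for the task.
    try:
        enable_charging(device, device_type)
    except ADBTimeoutError as exc:
        print("script.py: ignoring ADBTimeoutError during cleanup: %s" % exc)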

Flags: needinfo?(aerickson)

The failures in the last week all involve a failure to install the APK followed by a timeout. It seems mozharness exits with code 4 but hangs on exit, and taskcluster then terminates the task when it hits the 1.5 hour max run time.

Is "Running post-run listener: _resource_record_post_run" running more ADB commands that fail or something?

[task 2020-01-21T20:31:09.042Z] 20:26:08     INFO - Running main action method: install
[task 2020-01-21T20:31:09.042Z] 20:31:08     INFO - Failed to install /builds/task_1579638288/workspace/build/geckoview-androidTest.apk on pixel2-03: ADBTimeoutError args: adb wait-for-device install /builds/task_1579638288/workspace/build/geckoview-androidTest.apk, exitcode: None, stdout: 
[task 2020-01-21T20:31:09.042Z] 20:31:08    FATAL - INFRA-ERROR: ADBTimeoutError Failed to install geckoview-androidTest.apk
[task 2020-01-21T20:31:09.042Z] 20:31:08    FATAL - Running post_fatal callback...
[task 2020-01-21T20:31:09.042Z] 20:31:08    FATAL - Exiting 4
[task 2020-01-21T20:31:09.042Z] 20:31:08     INFO - Running post-action listener: _resource_record_post_action
[task 2020-01-21T20:31:09.042Z] 20:31:08     INFO - [mozharness: 2020-01-21 20:31:08.991588Z] Finished install step (failed)
[task 2020-01-21T20:31:09.042Z] 20:31:08     INFO - Running post-run listener: _resource_record_post_run
[task 2020-01-21T20:31:09.042Z] 20:31:09     INFO - Total resource usage - Wall time: 305s; CPU: 5.0%; Read bytes: 0; Write bytes: 191791104; Read time: 0; Write time: 197364
[task 2020-01-21T20:31:09.042Z] 20:31:09     INFO - TinderboxPrint: CPU usage<br/>5.4%
[task 2020-01-21T20:31:09.042Z] 20:31:09     INFO - TinderboxPrint: I/O read bytes / time<br/>0 / 0
[taskcluster:error] Aborting task...
[taskcluster:warn 2020-01-21T21:54:51.070Z] no such process
[taskcluster 2020-01-21T21:54:51.070Z] === Task Finished ===
[taskcluster 2020-01-21T21:54:51.070Z] Task Duration: 1h30m0.00058578s
[taskcluster 2020-01-21T21:54:51.269Z] Uploading artifact public/logs/localconfig.json from file workspace/logs/localconfig.json with content encoding "gzip", mime type "application/json" and expiry 2021-01-20T19:50:24.625Z
[taskcluster 2020-01-21T21:54:51.516Z] Uploading artifact public/test_info/android-performance.log from file workspace/build/blobber_upload_dir/android-performance.log with content encoding "gzip", mime type "text/plain" and expiry 2021-01-20T19:50:24.625Z
[taskcluster 2020-01-21T21:54:51.679Z] Uploading artifact public/test_info/logcat-FA83V1A02389.log from file workspace/build/blobber_upload_dir/logcat-FA83V1A02389.log with content encoding "gzip", mime type "text/plain" and expiry 2021-01-20T19:50:24.625Z
[taskcluster 2020-01-21T21:54:51.954Z] Uploading artifact public/test_info/resource-usage.json from file workspace/build/blobber_upload_dir/resource-usage.json with content encoding "gzip", mime type "application/json" and expiry 2021-01-20T19:50:24.625Z
[taskcluster 2020-01-21T21:54:52.163Z] Uploading redirect artifact public/logs/live.log to URL https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Of92DctoSxOFJ7DHOix93g/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2021-01-20T19:50:24.625Z
[taskcluster:error] Task aborted - max run time exceeded

Flags: needinfo?(bob)
Flags: needinfo?(bob)

The issue I brought up in https://bugzilla.mozilla.org/show_bug.cgi?id=1500266#c192 has been tracked down by gbrown in https://bugzilla.mozilla.org/show_bug.cgi?id=1611936.

I've deployed a fix for that bug today.

(In reply to Andrew Erickson [:aerickson] from comment #196)

I've deployed a fix for that bug today.

FYI, there are still test failures logged over the last few days.

It seems like there is still a timeout (it seems slightly different), but there's also:

[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL - Uncaught exception: Traceback (most recent call last):
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL -   File "/builds/task_1582332516/workspace/mozharness/mozharness/base/script.py", line 2132, in run
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL -     self.run_action(action)
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL -   File "/builds/task_1582332516/workspace/mozharness/mozharness/base/script.py", line 2071, in run_action
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL -     self._possibly_run_method(method_name, error_if_missing=True)
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL -   File "/builds/task_1582332516/workspace/mozharness/mozharness/base/script.py", line 2026, in _possibly_run_method
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL -     return getattr(self, method_name)()
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL -   File "/builds/task_1582332516/workspace/mozharness/mozharness/mozilla/testing/raptor.py", line 650, in install
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL -     self.device.uninstall_app(self.binary_path)
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL -   File "/builds/task_1582332516/workspace/mozharness/mozharness/mozilla/testing/android.py", line 67, in device
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL -     verbose=True)
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL -   File "/builds/task_1582332516/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 625, in __init__
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL -     self._initialize_boot_state(timeout=timeout)
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL -   File "/builds/task_1582332516/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 855, in _initialize_boot_state
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL -     self._wait_for_boot_completed(timeout=timeout)
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL -   File "/builds/task_1582332516/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 886, in _wait_for_boot_completed
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL -     sys_boot_completed = self.shell_output('getprop sys.boot_completed', timeout=timeout)
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL -   File "/builds/task_1582332516/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 1621, in shell_output
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL -     raise ADBTimeoutError("%s" % adb_process)
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL - ADBTimeoutError: args: adb wait-for-device shell getprop sys.boot_completed; echo adb_returncode=$?, exitcode: None, stdout: 
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL - Running post_fatal callback...
[task 2020-02-22T00:55:22.751Z] 00:55:22    FATAL - Exiting -1

Another bug that should probably be an ADBTimeoutError exception?

from https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=289983091&repo=mozilla-central&lineNumber=411

[task 2020-02-22T00:53:29.621Z] Traceback (most recent call last):
[task 2020-02-22T00:53:29.621Z]   File "condprof/../condprof/android.py", line 79, in prepare
[task 2020-02-22T00:53:29.621Z]     self.device = ADBDevice(verbose=self.verbose, logger_name="adb")
[task 2020-02-22T00:53:29.621Z]   File "/builds/task_1582331887/fetches/condprofile/lib/python3.6/site-packages/mozdevice/adb.py", line 625, in __init__
[task 2020-02-22T00:53:29.621Z]     self._initialize_boot_state(timeout=timeout)
[task 2020-02-22T00:53:29.621Z]   File "/builds/task_1582331887/fetches/condprofile/lib/python3.6/site-packages/mozdevice/adb.py", line 855, in _initialize_boot_state
[task 2020-02-22T00:53:29.621Z]     self._wait_for_boot_completed(timeout=timeout)
[task 2020-02-22T00:53:29.621Z]   File "/builds/task_1582331887/fetches/condprofile/lib/python3.6/site-packages/mozdevice/adb.py", line 886, in _wait_for_boot_completed
[task 2020-02-22T00:53:29.621Z]     sys_boot_completed = self.shell_output('getprop sys.boot_completed', timeout=timeout)
[task 2020-02-22T00:53:29.621Z]   File "/builds/task_1582331887/fetches/condprofile/lib/python3.6/site-packages/mozdevice/adb.py", line 1619, in shell_output
[task 2020-02-22T00:53:29.621Z]     timeout=timeout, root=root)
[task 2020-02-22T00:53:29.621Z]   File "/builds/task_1582331887/fetches/condprofile/lib/python3.6/site-packages/mozdevice/adb.py", line 1484, in shell
[task 2020-02-22T00:53:29.621Z]     if (self._device_serial.startswith('emulator') or
[task 2020-02-22T00:53:29.621Z] AttributeError: 'NoneType' object has no attribute 'startswith'

It looks like _device_serial is None here. Interestingly, there is only a single place where it gets set, which is in the constructor of the ADBCommand class via a call to _get_device_serial(). The only situation I can see where None is returned is when there is no device present:

https://searchfox.org/mozilla-central/rev/96f1457323cc598a36f5701f8e67aedaf97acfcf/testing/mozbase/mozdevice/mozdevice/adb.py#904

But I wonder why that happens because it's below shell_output in the stack, which I would expect to only be called for an available device.
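
A guard along these lines would at least turn the crash into a clearer error (a sketch only, not the eventual mozdevice fix; whether a missing device should instead be reported as an ADBTimeoutError, so the task gets retried, is the open question in the comments above):

from mozdevice.adb import ADBError


def require_device_serial(device_serial):
    # Sketch: fail early with a descriptive error when _get_device_serial()
    # found no attached device, instead of letting shell() raise
    # AttributeError on a None serial later in the run.
    if device_serial is None:
        raise ADBError("no attached device; adb returned no serial number")
    return device_serial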

Recent android-em failures here are for webrender-android-emulator tasks. The task exits with code 4, as I expect for retries. I see task payloads including:

  "onExitStatus": {
    "purgeCaches": [
      72
    ],
    "retry": [
      72
    ]
  }

Why 72?

Component: Testing → General
Product: Firefox for Android → Testing

Looks like there's been a spike in this bug in the last 7 days: it failed 49 times.
Occurs on:
android-hw-p2-8-0-android-aarch64-shippable
android-hw-g5-7-0-arm7-api-16-shippable
android-hw-g5-7-0-arm7-api-16, android-em-7-0-x86

on opt and debug build types.

Recent log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=303158965&repo=autoland&lineNumber=942

I've taken a look at some of the failures. Notes:

  • Exit 4's (retries) are being shown. These aren't technically failures, but infrastructure issues (that will likely never be solved). Can we filter out or display the exit code in the report?
  • Mozdevice still has uncaught ADBTimeouts that should be exit 4's:

device.rm

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=303553388&repo=mozilla-central&lineNumber=1643

[task 2020-05-24T22:23:11.900Z] 22:13:11    ERROR -  Traceback (most recent call last):
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -    File "/builds/task_1590357368/workspace/build/tests/raptor/raptor/raptor.py", line 201, in <module>
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -      main()
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -    File "/builds/task_1590357368/workspace/build/tests/raptor/raptor/raptor.py", line 148, in main
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -      success = raptor.run_tests(raptor_test_list, raptor_test_names)
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -    File "/builds/task_1590357368/workspace/build/tests/raptor/raptor/webextension/android.py", line 209, in run_tests
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -      return super(WebExtensionAndroid, self).run_tests(tests, test_names)
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -    File "/builds/task_1590357368/workspace/build/tests/raptor/raptor/perftest.py", line 395, in run_tests
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -      self.clean_up()
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -    File "/builds/task_1590357368/workspace/build/tests/raptor/raptor/webextension/android.py", line 420, in clean_up
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -      self.device.rm(self.remote_test_root, force=True, recursive=True, root=True)
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -    File "/builds/task_1590357368/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 2573, in rm
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -      self._sync(timeout=timeout)
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -    File "/builds/task_1590357368/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 1003, in _sync
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -      self.shell_output("sync", timeout=timeout)
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -    File "/builds/task_1590357368/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 1676, in shell_output
[task 2020-05-24T22:23:11.900Z] 22:13:11 CRITICAL -      raise ADBTimeoutError("%s" % adb_process)
[task 2020-05-24T22:23:11.900Z] 22:13:11 CRITICAL -  mozdevice.adb.ADBTimeoutError: args: adb wait-for-device shell sync, exitcode: None, stdout:
[task 2020-05-24T22:23:11.900Z] 22:13:11    ERROR - Return code: 1
[task 2020-05-24T22:23:11.900Z] 22:13:11  WARNING - setting return code to 1
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO - Killing logcat pid 622.
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO - Copying Raptor results to upload dir:
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO - /builds/task_1590357368/workspace/build/blobber_upload_dir/perfherder-data.json
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO - Copying raptor results from /builds/task_1590357368/workspace/build/raptor.json to /builds/task_1590357368/workspace/build/blobber_upload_dir/perfherder-data.json
[task 2020-05-24T22:23:11.900Z] 22:13:11 CRITICAL - Error copying results /builds/task_1590357368/workspace/build/raptor.json to upload dir /builds/task_1590357368/workspace/build/blobber_upload_dir/perfherder-data.json

device.uninstall_app

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=303158965&repo=autoland&lineNumber=942

[task 2020-05-24T22:23:11.900Z] 22:13:11    ERROR -  Traceback (most recent call last):
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -    File "/builds/task_1590357368/workspace/build/tests/raptor/raptor/raptor.py", line 201, in <module>
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -      main()
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -    File "/builds/task_1590357368/workspace/build/tests/raptor/raptor/raptor.py", line 148, in main
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -      success = raptor.run_tests(raptor_test_list, raptor_test_names)
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -    File "/builds/task_1590357368/workspace/build/tests/raptor/raptor/webextension/android.py", line 209, in run_tests
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -      return super(WebExtensionAndroid, self).run_tests(tests, test_names)
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -    File "/builds/task_1590357368/workspace/build/tests/raptor/raptor/perftest.py", line 395, in run_tests
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -      self.clean_up()
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -    File "/builds/task_1590357368/workspace/build/tests/raptor/raptor/webextension/android.py", line 420, in clean_up
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -      self.device.rm(self.remote_test_root, force=True, recursive=True, root=True)
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -    File "/builds/task_1590357368/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 2573, in rm
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -      self._sync(timeout=timeout)
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -    File "/builds/task_1590357368/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 1003, in _sync
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -      self.shell_output("sync", timeout=timeout)
[task 2020-05-24T22:23:11.900Z] 22:13:11     INFO -    File "/builds/task_1590357368/workspace/build/venv/lib/python2.7/site-packages/mozdevice/adb.py", line 1676, in shell_output
[task 2020-05-24T22:23:11.900Z] 22:13:11 CRITICAL -      raise ADBTimeoutError("%s" % adb_process)
[task 2020-05-24T22:23:11.900Z] 22:13:11 CRITICAL -  mozdevice.adb.ADBTimeoutError: args: adb wait-for-device shell sync, exitcode: None, stdout:
[task 2020-05-24T22:23:11.900Z] 22:13:11    ERROR - Return code: 1

(In reply to Andrew Erickson [:aerickson] from comment #218)

I've taken a look at some of the failures. Notes:

  • Exit 4's (retries) are being shown. These aren't technically failures, but infrastructure issues (that will likely never be solved). Can we filter out or display the exit code in the report?
  • Mozdevice still has uncaught ADBTimeouts that should be exit 4's:

mozdevice/adb can't control the exit code. That is something the test framework is responsible for handling. I've filed bug 1642620 for these two cases in Raptor.
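
For illustration, the framework-side handling would look roughly like this (a sketch assuming the harness owns its process exit code; the actual Raptor changes are tracked in bug 1642620):

import sys

from mozdevice.adb import ADBTimeoutError

TBPL_RETRY = 4  # exit code that the worker is configured to retry


def run_with_infra_retry(run_tests):
    # Sketch: mozdevice only raises ADBTimeoutError; the test framework has
    # to catch it and exit 4 so the task is retried rather than reported as
    # a genuine test failure.
    try:
        return run_tests()
    except ADBTimeoutError as exc:
        print("INFRA-ERROR: ADBTimeoutError %s" % exc)
        sys.exit(TBPL_RETRY)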

Bug 1689171 showed that the harness is improperly handling the ADBTimeoutError in Python 3. It doesn't address what's going on in this bug, but I'm fixing that part of the error handling in bug 1689213.

Depends on: 1739422

Hi Andreea, no I don't think it could cause that - the patch (https://phabricator.services.mozilla.com/D140455) just adds a few pixels extra fuzziness to a couple of tests, so it shouldn't have any impact (except by some possibly arbitrary timing change it makes to the CI run perhaps affecting timeouts?).

Flags: needinfo?(gwatson)

(In reply to Glenn Watson [:gw] from comment #312)

Hi Andreea, no I don't think it could cause that - the patch (https://phabricator.services.mozilla.com/D140455) just adds a few pixels extra fuzziness to a couple of tests, so it shouldn't have any impact (except by some possibly arbitrary timing change it makes to the CI run perhaps affecting timeouts?).

Thanks for looking Glenn!

Looks like there are no failures since the 9th of March.

Recent failures here are mostly bug 1615599.

Severity: normal → S3
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 1 year ago → 11 months ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

(In reply to Treeherder Bug Filer from comment #374)

New failure instance: https://treeherder.mozilla.org/logviewer?job_id=422867800&repo=mozilla-esr102

This is unrelated to this bug; it's some infra issue.

Status: REOPENED → RESOLVED
Closed: 11 months ago → 9 months ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---