Open Bug 1411358 Opened 3 years ago Updated 2 days ago

Intermittent [taskcluster:error] Task timeout after 3600 seconds. Force killing container. / [taskcluster:error] Task timeout after 5400 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container.

Categories

(Testing :: General, defect, P3)

defect

Tracking

(Not tracked)

People

(Reporter: gbrown, Unassigned)

References

(Depends on 5 open bugs)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra] [see summary at comment 278])

+++ This bug was initially created as a clone of Bug #1204281 +++

There is considerable history in bug 1204281; I have cloned and closed that bug because it was simply too big -- too many comments, too much ancient history.

I expect these failures to continue. Task timeouts are the last resort whenever a task is hung or otherwise runs for too long. The immediate goal is to address timeouts which recur in the same task frequently.
Duplicate of this bug: 1204281
Whiteboard: [stockwell disable-recommended] → [stockwell infra]
No longer depends on: 1407687
Bug 1339568 is the single biggest cause of these failures.
Depends on: 1412953
Waiting for updates on bug 1339568, which depends on bug 1357466, which had no progress in the last months.
Still waiting for updates on bug 1339568.
Flags: needinfo?(gbrown)
Bug 1339568 remains the biggest contributer of failures reported in this bug.
Flags: needinfo?(gbrown)
Update: there are 65 failures for the last 7 days
Some tests have been disabled in bug 1339568, reducing mochitest-media-e10s-2 shutdown hangs considerably.

I expect to still see some intermittent failures reported here from other causes.
Depends on: 1422863
In the last week there have been 31 total failures.
The failures occur on debug and opt build types and on the following platforms:
- android-4-3-armv7-api16: 14
- android-api-16-gradle: 5
- Android 4.2 x86: 2
- Linux x64: 2
- linux64-stylo-disabled: 2
- android-4-0-armv7-api16: 1
- Linux: 1
- linux32-stylo-disabled: 1
- linux64-ccov: 1
- linux64-qr: 1
- windows-mingw32-32: 1

Here is a recent log file: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=150857059&lineNumber=9563
And a relevant snippet from it:
[task 2017-12-08T21:22:31.523Z] 21:22:31     INFO -  make[4]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox/xpcom/build'
9560
[task 2017-12-08T21:22:31.524Z] 21:22:31     INFO -  make[4]: Entering directory '/builds/worker/workspace/build/src/obj-firefox/xpcom/build'
9561
[task 2017-12-08T21:22:31.557Z] 21:22:31     INFO -  make[4]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox/xpcom/build'
9562

9563
[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
9564
[taskcluster 2017-12-08 21:22:56.375Z] === Task Finished ===
9565
[taskcluster 2017-12-08 21:22:56.383Z] Unsuccessful task run with exit code: -1 completed in 7596.094 seconds


:gbrown, any updates on this?
Flags: needinfo?(gbrown)
The latest increase in frequency is due to failing Android builds. I cannot tell if that is an on-going problem, or a temporary glitch but will continue to monitor.
Flags: needinfo?(gbrown)
Most recent failures have been Android tests, for a variety of reasons. Some show intermittent download failures/retries -- hopefully a temporary condition. I don't see enough consistency to warrant further investigation right now.
Depends on: 1430668
Last week there were a variety of tasks failing here. The biggest group is linux32/debug jsreftests -- bug 1430668.
Depends on: 1433560
Some recent android opt mochitest failures here - bug 1433560 should help.

Also some test-verify failures - bug 1431125.
Depends on: 1431125
Priority: -- → P3
Depends on: 1435079
Recent failures seem to be mostly bug 1321605 -- probably nothing more I can do there. There may also be some fallout from android mochitest run-by-manifest -- keeping an eye on that.
There are 83 failures in the past 7 days.
Platforms: most of them on  android-4-3-armv7-api16 debug/ opt and linux64-jsdcov opt; and a few occurrences on Linux/Linux x64 debug, linux32-stylo-disabled debug, linux64-jsdcov opt, linux64-qr debug and windows10-64-ccov debug.
Recent failure log example:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=165586428&lineNumber=5170877
Depends on: 1442823
(In reply to OrangeFactor Robot from comment #37)
> Platform breakdown:
> * linux64-jsdcov: 91
> * android-4-3-armv7-api16: 11

The linux64-jsdcov problem is apparent here -- working on that in bug 1442823.
Keywords: leave-open
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c291e7dfe010
Increase taskcluster max-run-time for Android mochitest-chrome and Android/opt mochitest; r=me,a=test-only
There are 105 failures in the past week.
Platforms: 
- Android 4.2 x86 opt
- android-4-0-armv7-api16 opt and debug
- android-4-3-armv7-api16 opt and debug
- android-5-0-aarch64 opt
- Linux opt and debug (many occurrences)
- Linux x64 opt, debug and asan (many occurrences)
- linux32-stylo-disabled debug
- linux64-ccov opt
- linux64-jsdcov opt (many occurrences)
- linux64-noopt debug
- linux64-qr opt and debug
- linux64-stylo-disabled opt and debug
- windows10-64-ccov debug

Recent log failure: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=168753923&lineNumber=1537

Relevant part of the log: 
[task 2018-03-17T23:01:53.204Z] 23:01:53     INFO -  Successfully installed browsermob-proxy-0.6.0 firefox-puppeteer-52.1.0 marionette-driver-2.5.0 marionette-harness-4.3.0 wptserve-1.4.0
[task 2018-03-17T23:01:53.253Z] 23:01:53     INFO - Return code: 0
[task 2018-03-17T23:01:53.255Z] 23:01:53     INFO - Installing None into virtualenv /builds/worker/workspace/build/venv
[task 2018-03-17T23:01:53.256Z] 23:01:53     INFO - error resolving pypi.pvt.build.mozilla.org (ignoring): 

[taskcluster:error] Task timeout after 5400 seconds. Force killing container.
[task 2018-03-17T23:01:53.260Z] 23:01:53     INFO - retry: Calling run_command with args: [['/builds/worker/workspace/build/venv/bin/pip', 'install', '--timeout', '120', '-r', '/builds/worker/workspace/build/tests/config/marionette_requirements.txt', '--no-index', '--find-links', 'http://pypi.pub.build.mozilla.org/pub', '--trusted-host', 'pypi.pub.build.mozilla.org']], kwargs: {'error_level': 'warning', 'error_list': [{'substr': 'not found or a compiler error:', 'level': 'warning'}, {'regex': <_sre.SRE_Pattern object at 0x7f7cc33d4490>, 'level': 'error'}, {'regex': <_sre.SRE_Pattern object at 0x7f7cc3490300>, 'level': 'warning'}, {'regex': <_sre.SRE_Pattern object at 0x158aa80>, 'level': 'debug'}, {'substr': 'command not found', 'level': 'error'}, {'regex': <_sre.SRE_Pattern object at 0x7f7cc488fbb8>, 'level': 'warning'}, {'substr': 'Traceback (most recent call last)', 'level': 'error'}, {'substr': 'SyntaxError: ', 'level': 'error'}, {'substr': 'TypeError: ', 'level': 'error'}, {'substr': 'NameError: ', 'level': 'error'}, {'substr': 'ZeroDivisionError: ', 'level': 'error'}, {'regex': <_sre.SRE_Pattern object at 0x7f7cc3493188>, 'level': 'critical'}, {'regex': <_sre.SRE_Pattern object at 0x7f7cc48a7b28>, 'level': 'critical'}], 'cwd': '/builds/worker/workspace/build/tests/config', 'env': {'TASKCLUSTER_PORT_80_TCP_ADDR': '172.17.0.2', 'TASKCLUSTER_INSTANCE_TYPE': 'm3.large', 'TASKCLUSTER_WORKER_TYPE': 'gecko-t-linux-large', 'MOZ_AUTOMATION': '1', 'MOZ_SOURCE_CHANGESET': 'efce78e62b6dac195e7cb4898684da54155d1661', 'MOCHITEST_FLAVOR': 'plain', 'LOGNAME': 'worker', 'USER': 'worker', 'HOME': '/builds/worker', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/builds/worker/bin', 'DISPLAY': ':0', 'LANG': 'en_US.UTF-8', 'TERM': 'xterm', 'SHELL': '/bin/bash', 'MOZ_NODE_PATH': '/usr/local/bin/node', 'RUN_ID': '0', 'HG_STORE_PATH': '/builds/worker/checkouts/hg-store', 'MOZILLA_BUILD_URL': 'https://queue.taskcluster.net/v1/task/cjvNZxdBSKKfmKHtz6MnGQ/artifacts/public/build/target.tar.bz2', 'TASKCLUSTER_PORT': 'tcp://172.17.0.2:80', 'MOZHARNESS_SCRIPT': 'desktop_unittest.py', 'TASKCLUSTER_NAME': '/laughing_khorana/taskcluster', 'GECKO_HEAD_REPOSITORY': 'https://hg.mozilla.org/mozilla-central', 'SCCACHE_DISABLE': '1', 'TASKCLUSTER_PORT_80_TCP_PORT': '80', 'MOZ_SOURCE_REPO': 'https://hg.mozilla.org/mozilla-central', 'GECKO_HEAD_REV': 'efce78e62b6dac195e7cb4898684da54155d1661', 'MOZHARNESS_URL': 'https://queue.taskcluster.net/v1/task/cjvNZxdBSKKfmKHtz6MnGQ/artifacts/public/build/mozharness.zip', 'GECKO_BASE_REPOSITORY': 'https://hg.mozilla.org/mozilla-unified', 'LC_ALL': 'en_US.UTF-8', 'TASKCLUSTER_PORT_80_TCP_PROTO': 'tcp', 'TASK_ID': 'OgTQKS-PS-udVgG1SWL7lA', 'TASKCLUSTER_PORT_80_TCP': 'tcp://172.17.0.2:80', 'OLDPWD': '/builds/worker/workspace', 'HOSTNAME': 'taskcluster-worker', 'SHLVL': '1', 'PWD': '/builds/worker/workspace', 'NEED_PULSEAUDIO': 'true', 'ENABLE_E10S': 'false', 'NEED_WINDOW_MANAGER': 'true', 'MOZHARNESS_CONFIG': 'unittests/linux_unittest.py remove_executables.py', 'TASKCLUSTER_WORKER_GROUP': 'us-west-1', 'TASKCLUSTER_PUBLIC_IP': '54.183.87.60'}}, attempt #1
[task 2018-03-17T23:01:53.261Z] 23:01:53     INFO - Running command: ['/builds/worker/workspace/build/venv/bin/pip', 'install', '--timeout', '120', '-r', '/builds/worker/workspace/build/tests/config/marionette_requirements.txt', '--no-index', '--find-links', 'http://pypi.pub.build.mozilla.org/pub', '--trusted-host', 'pypi.pub.build.mozilla.org'] in /builds/worker/workspace/build/tests/config
[task 2018-03-17T23:01:53.261Z] 23:01:53     INFO - Copy/paste: /builds/worker/workspace/build/venv/bin/pip install --timeout 120 -r /builds/worker/workspace/build/tests/config/marionette_requirements.txt --no-index --find-links http://pypi.pub.build.mozilla.org/pub --trusted-host pypi.pub.build.mozilla.org
[task 2018-03-17T23:01:53.261Z] 23:01:53     INFO - Using env: {'DISPLAY': ':0',
Some failures are from 'try -p all -u all' runs which have triggered jsdcov/dev-tools failures -- carry over from bug 1442823.
See Also: → 1445838
Many of the recent logs on linux and android show pypi failures, similar to bug 1445580.
Depends on: 1445580
Depends on: 1448049
Over the last 7 days there are 81 failures on this bug. These happen on Android 4.2 x86, android-4-0-armv7-api16, android-4-3-armv7-api16, android-5-0-aarch64, Linux, Linux x64, linux64-qr, 

Here is a recent log example: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=170655917&lineNumber=12538

[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
Flags: needinfo?(gbrown)
See comment 0 and bug dependencies. Investigation and mitigation is on-going.
Flags: needinfo?(gbrown)
Recent failures on Android and Linux are frequent and irregular: Not associated with particular tests or test tasks. Some are related to network failures or delays during 'pip install':

[task 2018-04-03T18:13:55.354Z] 18:13:55     INFO -  Collecting jsonschema==2.5.1
[task 2018-04-03T18:15:55.610Z] 18:15:55     INFO -    Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x7fba9e73e3d0>, 'Connection to pypi.pub.build.mozilla.org timed out. (connect timeout=120.0)')': /pub
[task 2018-04-03T18:17:56.643Z] 18:17:56     INFO -    Retrying (Retry(total=3, connect=None, read=None, redirect=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x7fba9e73e550>, 'Connection to pypi.pub.build.mozilla.org timed out. (connect timeout=120.0)')': /pub
[task 2018-04-03T18:19:58.106Z] 18:19:58     INFO -    Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x7fba9e73e6d0>, 'Connection to pypi.pub.build.mozilla.org timed out. (connect timeout=120.0)')': /pub
[task 2018-04-03T18:22:00.205Z] 18:22:00     INFO -    Retrying (Retry(total=1, connect=None, read=None, redirect=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x7fba9e73e850>, 'Connection to pypi.pub.build.mozilla.org timed out. (connect timeout=120.0)')': /pub
[task 2018-04-03T18:24:04.445Z] 18:24:04     INFO -    Retrying (Retry(total=0, connect=None, read=None, redirect=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x7fba9e73e9d0>, 'Connection to pypi.pub.build.mozilla.org timed out. (connect timeout=120.0)')': /pub
[task 2018-04-03T18:26:04.821Z] 18:26:04     INFO -    Could not find a version that satisfies the requirement jsonschema==2.5.1 (from versions: )
[task 2018-04-03T18:26:04.822Z] 18:26:04     INFO -  No matching distribution found for jsonschema==2.5.1
Depends on: 1452956
Depends on: 1451432
Updates:
There have been 141 failures in the last 7 days.

Summary: Intermittent [taskcluster:error] Task timeout after 3600 seconds. Force killing container. / [taskcluster:error] Task timeout after 5400 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container.

The most affected build types are debug and opt, but there are also over 10 failures on asan & pgo.

Failures per platforms:
Linux: 32	
Linux x64: 32
android-4-3-armv7-api16: 31
linux64-qr: 19
linux64-ccov: 18
android-4-0-armv7-api16: 5
android-5-0-aarch64: 2
linux32-nightly: 1
Android 4.2 x86: 1
Bug 1451432 remains my biggest current concern here.
Over the last 7 days there are 99 failures present on this bug. These happen on Android 4.2 x86, android-4-3-armv7-api16, android-5-0-aarch64, Linux, Linux x64, linux64-ccov, linux64-qr, Windows 7.

Here is the most recent log example: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=176126037&lineNumber=25270


[task 2018-04-28T15:11:00.833Z] 15:11:00     INFO - TEST-START | /referrer-policy/unsafe-url/meta-referrer/same-origin/http-https/iframe-tag/generic.swap-origin-redirect.http.html
[task 2018-04-28T15:11:00.844Z] 15:11:00     INFO - PID 15175 | --DOMWINDOW == 28 (0x7fa601cc9800) [pid = 15232] [serial = 17] [outer = (nil)] [url = http://web-platform.test:8000/referrer-policy/generic/subresource/document.py?redirection=no-redirect&cache_destroyer=1524928253055]
[task 2018-04-28T15:11:00.849Z] 15:11:00     INFO - PID 15175 | --DOMWINDOW == 27 (0x7fa601508000) [pid = 15232] [serial = 14] [outer = (nil)] [url = about:blank]
[task 2018-04-28T15:11:00.853Z] 15:11:00     INFO - PID 15175 | --DOMWINDOW == 26 (0x7fa602492800) [pid = 15232] [serial = 3] [outer = (nil)] [url = about:blank]
[task 2018-04-28T15:11:00.855Z] 15:11:00     INFO - PID 15175 | --DOMWINDOW == 25 (0x7fa609723000) [pid = 15232] [serial = 2] [outer = (nil)] [url = about:blank]
[task 2018-04-28T15:11:00.858Z] 15:11:00     INFO - PID 15175 | --DOCSHELL 0x7fa601552800 == 8 [pid = 15232] [id = {62097795-cefa-4626-92b1-e2b01134ebcf}]
[task 2018-04-28T15:11:00.862Z] 15:11:00     INFO - PID 15175 | --DOCSHELL 0x7fa5ff39c800 == 7 [pid = 15232] [id = {907aed2f-82ba-484c-a1a7-209937e276ec}]
[task 2018-04-28T15:11:00.864Z] 15:11:00     INFO - PID 15175 | --DOMWINDOW == 24 (0x7fa5ff365000) [pid = 15232] [serial = 12] [outer = (nil)] [url = http://web-platform.test:8000/referrer-policy/generic/subresource/document.py]

[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
[task 2018-04-28T15:11:00.874Z] 15:11:00     INFO - PID 15175 | --DOMWINDOW == 23 (0x7fa609720c00) [pid = 15232] [serial = 7] [outer = (nil)] [url = about:blank]
[task 2018-04-28T15:11:00.875Z] 15:11:00     INFO - PID 15175 | --DOMWINDOW == 22 (0x7fa609721000) [pid = 15232] [serial = 6] [outer = (nil)] [url = about:blank]
[task 2018-04-28T15:11:00.876Z] 15:11:00     INFO - PID 15175 | --DOMWINDOW == 21 (0x7fa6098bf000) [pid = 15232] [serial = 9] [outer = (nil)] [url = about:blank]
[task 2018-04-28T15:11:00.878Z] 15:11:00     INFO - PID 15175 | --DOCSHELL 0x7fa601555000 == 6 [pid = 15232] [id = {f6b879e5-be92-4156-adc9-5489f3b17629}]
[taskcluster 2018-04-28 15:11:01.679Z] === Task Finished ===
[taskcluster 2018-04-28 15:11:01.685Z] Unsuccessful task run with exit code: -1 completed in 7740.784 seconds
Flags: needinfo?(gbrown)
Update:
There have been 62 failures in the last 7 days, 230 total failures in the last 21 days.

Depends on:	1243080, 1246165, 1321605, 1404190, 1431125, 1448049, 1451432
Blocks:	1306635

Summary: Intermittent [taskcluster:error] Task timeout after 3600 seconds. Force killing container. / [taskcluster:error] Task timeout after 5400 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container.

Failure per platform and build type:

- Android 4.2 x86 / opt: 2
- android-4-0-armv7-api16 / debug & opt :3
- android-4-3-armv7-api16 / debug & opt: 17
- android-5-0-aarch64 / debug & op: 1
- Linux x64 / pgo, debug & asan: 8
- linux64-ccov / opt: 8
- linux64-qr / opt & debug: 9
- linux64-asan-reporter / opt: 1
- Linux / debug: 13

Recent log with the failure: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=177354933&lineNumber=1772
Depends on: 1346126
No longer depends on: 1246165
Flags: needinfo?(gbrown)
Depends on: 1461393
Depends on: 1457694
I expect these failures to continue and this bug to remain open indefinitely. Task timeouts are the last resort whenever a task is hung or otherwise runs for too long. My immediate goal is to address timeouts which recur in the same task frequently and identify other related issues in dependent bugs.

I expect infrequent failures on Android especially due to bug 1321605.

There are some infrequent failures on Android that can be addressed with more chunks - bug 1461393.

Recent failures on Android and Linux are frequent and irregular: Not associated with particular tests or test tasks. I'm hoping those might be addressed in bug 1457694.

(Recent network failures and delays during 'pip install' appear to have been resolved.)
Keywords: leave-open
Whiteboard: [stockwell infra] → [stockwell infra] [see summary at comment 92]
Over the last 7 days there are 124 failures present on this bug. These happen on android-4-0-armv7-api16, android-4-2-x86, android-4-3-armv7-api16, linux32, linux64, linux64-ccov, linux64-nightly, linux64-noopt, linux64-qr, windows7-32.
Here is the most recent log example: https://treeherder.mozilla.org/logviewer.html#?job_id=180438377&repo=mozilla-inbound&lineNumber=1806

[task 2018-05-27T02:38:21.245Z] 02:38:21     INFO -  TEST-OK | testSessionOOMRestore | took 335841ms
[task 2018-05-27T02:38:21.246Z] 02:38:21     INFO -  TEST-START | Shutdown
[task 2018-05-27T02:38:21.247Z] 02:38:21     INFO -  Passed: 97
[task 2018-05-27T02:38:21.249Z] 02:38:21     INFO -  Failed: 0
[task 2018-05-27T02:38:21.250Z] 02:38:21     INFO -  Todo: 0
[task 2018-05-27T02:38:21.250Z] 02:38:21     INFO -  SimpleTest FINISHED
[task 2018-05-27T02:38:42.819Z] 02:38:42     INFO -  INFO | automation.py | Application ran for: 0:06:10.045962
[task 2018-05-27T02:38:42.821Z] 02:38:42     INFO -  INFO | zombiecheck | Reading PID log: /tmp/tmp_o252qpidlog
[task 2018-05-27T02:38:43.458Z] 02:38:43     INFO -  /data/tombstones does not exist; tombstone check skipped
[task 2018-05-27T02:38:50.701Z] 02:38:50     INFO -  INFO | automation.py | Application pid: 5599
[task 2018-05-27T02:39:03.345Z] 02:39:03     INFO -  SimpleTest START
[task 2018-05-27T02:39:03.347Z] 02:39:03     INFO -  TEST-START | testSessionPrivateBrowsing

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[taskcluster 2018-05-27 02:39:11.369Z] === Task Finished ===
[taskcluster 2018-05-27 02:39:11.372Z] Unsuccessful task run with exit code: -1 completed in 3601.941 seconds
Flags: needinfo?(gbrown)
linux32-debug/jittest-4 and android-debug/xpcshell-11 are frequent and might be actionable. Otherwise, see comment 92.
Flags: needinfo?(gbrown)
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5989424834d2
Increase Android/debug xpcshell max-run-time; r=me,a=test-only
Keywords: leave-open
As an update, there are 76 failures in the last 7 days as it follows:
- 36 on linux64 (asan-reporter, ccov and qr)
- 29 on android. Mostly on 4-3-armv7-api16 but there are also 3 on 4-0-armv7-api16 and 2 on 4-2-x86
- 11 on linux32
Depends on: 1469288
Depends on: 1469720
In the last 7 days, there are 111 failures on this bug.
They occur mostly on linux 32, 64, 64-ccov, 64-qr, all builds type.
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=183726361&repo=mozilla-inbound&lineNumber=24638
Depends on: 1470177
In the last 7 days there are 127 failures:

- 51 failures on linux64 / asan, debug, pgo & lto
- 47 failures on android-em-4-3-armv7-api16 / opt & debug
- 11 failures on linux 32 / debug, pgo & opt
- 7 failures on linux64-ccov / debug
- 6 failures on osx-cross / opt & debug
- 2 failures on android-4-0-armv7-api16 / debug
- 2 failures on linux64-qr / debug
- 1 failure on linux64-noopt / debug 

As per comment 92, these will keep occuring. 

:gbrown is there no way to decrease their rate on the most common platforms they occur: linux64 and android?
Flags: needinfo?(gbrown)
(In reply to Andreea Pavel [:apavel] from comment #134)
> In the last 7 days there are 127 failures:
> 
> - 51 failures on linux64 / asan, debug, pgo & lto
> - 47 failures on android-em-4-3-armv7-api16 / opt & debug
> - 11 failures on linux 32 / debug, pgo & opt
> - 7 failures on linux64-ccov / debug
> - 6 failures on osx-cross / opt & debug
> - 2 failures on android-4-0-armv7-api16 / debug
> - 2 failures on linux64-qr / debug
> - 1 failure on linux64-noopt / debug 
> 
> As per comment 92, these will keep occuring. 
> 
> :gbrown is there no way to decrease their rate on the most common platforms
> they occur: linux64 and android?

These are certainly too frequent, but I don't know what to do about it. There are multiple causes, even on one platform. Bug 1470177 eliminated a common cause of android timeouts, but the weekly stats hardly changed.

Bug 1457694 was an attempt to address a common scenario I (may) see on linux and android, but that hasn't made any progress and I'm not sure how to make progress on it. :(

I will keep monitoring failures and acting on those cases that I can.
Flags: needinfo?(gbrown)
:gbrown thanks for looking into this!
Depends on: 1115253
(In reply to Geoff Brown [:gbrown] from comment #137)
> Bug 1457694 was an attempt to address a common scenario I (may) see on linux
> and android, but that hasn't made any progress and I'm not sure how to make
> progress on it. :(

Bug 1457694 has been confirmed as a real issue and is progressing, I think!
Update:
 There has been a total of 119 failures in the last 7 days( and 339 failures in the last 21 days).
- 2 failures on android-4-0-armv7-api16 debug
- 1 failure on android-4-2-x86 opt
- 1 failure on android-em-4-2-x86 opt
- 19 failures on android-em-4-3-armv7-api16 debug/opt
- 17 failures on linux32 debug/opt
- 51 failures on linux64 asan/debug/lto/opt/pgo
- 10 failures on linux64-ccov debug
- 10 failures on linux64-qr debug/opt
- 1 failure on linux64-noopt debug
- 1 failure on osx-10-10 opt
- 3 failures on osx-cross opt
- 1 failure on windows10-64 asan
- 2 failures on windows7-32 debug
 There are different type of tests running, but all of them have this error in common:
[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[taskcluster 2018-07-18 12:00:30.241Z] Unsuccessful task run with exit code: -1 completed in 3924.674 seconds
(In reply to OrangeFactor Robot from comment #163)
> 135 failures in 138 pushes (0.978 failures/push) were associated with this
> bug yesterday.    
> 
> Platform breakdown:
> * android-em-4-3-armv7-api16: 105

robocop is perma-failing - https://bugzilla.mozilla.org/show_bug.cgi?id=1451513#c19.
(In reply to Geoff Brown [:gbrown] from comment #164)
> robocop is perma-failing -
> https://bugzilla.mozilla.org/show_bug.cgi?id=1451513#c19.

That was corrected with a backout and subsequent re-landing - great!
(In reply to Geoff Brown [:gbrown] from comment #150)
> Bug 1457694 has been confirmed as a real issue and is progressing, I think!

That bug hasn't been marked as resolved, but some changes have landed in related bugs and I am seeing far fewer of those issues. 

Failure rates in this bug have recently been significantly lower than we've seen in months.
Depends on: 1480348
Thanks :aryx! Yes, it looks like bug 1480494 is the biggest contributor to current failures here.
Depends on: 1480494
See Also: 1480494
Duplicate of this bug: 1484064
Update:
There have been 98 failures in the last week.
Failures per platform and build type:

linux64 / debug, asan, opt, lto: 39
osx-10-10 / opt & debug: 12
android-em-4-3-armv7-api16 / debug & opt: 11
linux64-ccov / debug: 11
linux32 / debug & opt: 7
osx-cross / opt & asan: 7
windows2012-32 / debug & pgo: 2
windows7-32 / opt & pgo: 3
android-em-4-2-x86 / debug: 2
linux64-noopt / debug: 1
linux64-qr / debug: 1
android-4-0-armv7-api16 / opt: 1
android-5-0-aarch64 / opt: 1

Summary: Intermittent [taskcluster:error] Task timeout after 3600 seconds. Force killing container. / [taskcluster:error] Task timeout after 5400 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container.
Depends on: 1487877
Depends on: 1487886
There have been 118 failures in the last week.

The most affected platform / build: linux64-ccov / debug.

Recent relevat log file: https://treeherder.mozilla.org/logviewer.html#?job_id=197122988&repo=mozilla-central&lineNumber=180429

Summary: Intermittent [taskcluster:error] Task timeout after 3600 seconds. Force killing container. / [taskcluster:error] Task timeout after 5400 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container.

Blocks:	1306635
Depends on: 1115253, 1243080, 1321605, 1404190, 1431125, 1457694, 1480348, 1487877, 1487886
Depends on: 1489351
Depends on: 1489645
linux64/ccov devtool and android/debug mochitest failure are being addressed with more test chunks.

linux64 mochitest-media failures are reportedly 1480348.

There are also some build task timeouts, which seem quite random -- I don't know what to do about those.
Update:
There have been 37 failures in the last week.

Failures per platform and build type:
- linux64 / debug, opt, asan, pgo, lto: 16
- linux32 / debug & opt: 10
- android-em-4-3-armv7-api16 / debug & opt: 5
- linux64-qr / debug: 2
- linux64-ccov / debug: 1
- osx-10-10-dmd / opt: 1
- osx-cross / opt: 1
- osx-cross-noopt / debug: 1

Recent relevant log file:
https://treeherder.mozilla.org/logviewer.html#?job_id=199174505&repo=mozilla-inbound&lineNumber=40467

Summary: Intermittent [taskcluster:error] Task timeout after 3600 seconds. Force killing container. / [taskcluster:error] Task timeout after 5400 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container.
(In reply to Geoff Brown [:gbrown] from comment #199)
> linux64/ccov devtool and android/debug mochitest failure are being addressed
> with more test chunks.
> 
> linux64 mochitest-media failures are reportedly 1480348.
> 
> There are also some build task timeouts, which seem quite random -- I don't
> know what to do about those.

Many of those (apparent) build task timeouts are in rusttests; I haven't investigated, but note bug 1489277 might be tangentially related.
Duplicate of this bug: 1493001
Depends on: 1485638
There was a recent spike in linux64-ccov/asan build timeouts after 60 minutes; these are "build-linux64-asan-fuzzing-ccov/opt (bocf)" builds, on mozilla-central only. The most recent builds have run close to 50 minutes, so I won't take action until/unless this issue returns.
No longer depends on: 1115253
Some failures last week happened during robustcheckout - fallout from bug 1490703 possibly - should be resolved now.
Depends on: 1500177
Depends on: 1505625
Depends on: 1505174
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=212394686&repo=autoland&lineNumber=45090

[task 2018-11-17T15:05:43.073Z] 15:05:43     INFO - TEST-START | /referrer-policy/same-origin/attr-referrer/cross-origin/http-https/script-tag/cross-origin.keep-origin-redirect.http.html
[task 2018-11-17T15:05:43.296Z] 15:05:43     INFO - PID 22748 | ++DOCSHELL 0x7fdd2decb800 == 2 [pid = 22854] [id = {f599e7eb-3bd6-4194-91f2-ee9bbeb55f76}]
[task 2018-11-17T15:05:43.298Z] 15:05:43     INFO - PID 22748 | ++DOMWINDOW == 5 (0x7fdd2ca0e800) [pid = 22854] [serial = 5] [outer = (nil)]
[task 2018-11-17T15:05:43.397Z] 15:05:43     INFO - PID 22748 | ++DOMWINDOW == 6 (0x7fdd2dd62400) [pid = 22854] [serial = 6] [outer = 0x7fdd2ca0e800]
[task 2018-11-17T15:05:43.679Z] 15:05:43     INFO - PID 22748 | ++DOMWINDOW == 7 (0x7fdd2dd63400) [pid = 22854] [serial = 7] [outer = 0x7fdd2ca0e800]
[task 2018-11-17T15:05:44.143Z] 15:05:44     INFO - PID 22748 | Couldn't convert chrome URL: chrome://branding/locale/brand.properties

[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
[taskcluster 2018-11-17 15:05:45.094Z] === Task Finished ===
[taskcluster 2018-11-17 15:05:45.096Z] Unsuccessful task run with exit code: -1 completed in 7692.491 seconds

Details in Comment 92.
Tests are in good shape -- almost no test failures here recently.

However, there are a bunch of build failures here. I don't see clear correlations to specific build types, but...needs more investigation.
Recent Android 7.0 failures are on machine-16 -- a ticket has been opened on packet.net to have this machine specific performance problem addressed.
Depends on: 1511084
This bug failed 45 times in the last 7 days. These occur on linux32, linux64 on opt, asan and pgo. For the following platforms there's only 1 failures for each: linux64-ccov, linux64-qr, android-em-4-3-armv7-api16.

Recent log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=217017974&repo=mozilla-inbound&lineNumber=87534

15:52:26     INFO - ============================= test session starts ==============================
[task 2018-12-14T15:52:26.563Z] 15:52:26     INFO - platform linux2 -- Python 2.7.9, pytest-3.6.2, py-1.5.4, pluggy-0.6.0 -- /builds/worker/workspace/build/src/obj-firefox/_virtualenvs/obj-firefox-8yIyzR8r-2.7/bin/python
[task 2018-12-14T15:52:26.563Z] 15:52:26     INFO - rootdir: /builds/worker/workspace/build/src, inifile: /builds/worker/workspace/build/src/config/mozunit/mozunit/pytest.ini
[task 2018-12-14T15:52:26.563Z] 15:52:26     INFO - collecting ... collected 1 item
[task 2018-12-14T15:52:26.563Z] 15:52:26     INFO - ../config/tests/unit-mozunit.py::TestMozUnit::test_mocked_open PASSED
[task 2018-12-14T15:52:26.564Z] 15:52:26     INFO - =========================== 1 passed in 0.01 seconds ===========================
[task 2018-12-14T15:52:26.564Z] 15:52:26     INFO - /builds/worker/workspace/build/src/config/tests/test_mozbuild_reading.py
[task 2018-12-14T15:52:26.564Z] 15:52:26     INFO - ============================= test session starts ==============================
[task 2018-12-14T15:52:26.564Z] 15:52:26     INFO - platform linux2 -- Python 2.7.9, pytest-3.6.2, py-1.5.4, pluggy-0.6.0 -- /builds/worker/workspace/build/src/obj-firefox/_virtualenvs/obj-firefox-8yIyzR8r-2.7/bin/python
[task 2018-12-14T15:52:26.565Z] 15:52:26     INFO - rootdir: /builds/worker/workspace/build/src, inifile: /builds/worker/workspace/build/src/config/mozunit/mozunit/pytest.ini
[task 2018-12-14T15:52:26.565Z] 15:52:26     INFO - collecting ... collected 3 items
[task 2018-12-14T15:52:26.565Z] 15:52:26     INFO - ../config/tests/test_mozbuild_reading.py::TestMozbuildReading::test_filesystem_traversal_no_config PASSED
[task 2018-12-14T15:52:26.565Z] 15:52:26     INFO - ../config/tests/test_mozbuild_reading.py::TestMozbuildReading::test_filesystem_traversal_reading <- ../../../../../usr/lib/python2.7/unittest/case.py SKIPPED
[task 2018-12-14T15:52:26.566Z] 15:52:26     INFO - ../config/tests/test_mozbuild_reading.py::TestMozbuildReading::test_orphan_file_patterns PASSED
[task 2018-12-14T15:52:26.566Z] 15:52:26     INFO - ===================== 2 passed, 1 skipped in 79.39 seconds =====================

[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
[taskcluster 2018-12-14 15:53:03.117Z] === Task Finished ===
[taskcluster 2018-12-14 15:53:03.118Z] Unsuccessful task run with exit code: -1 completed in 7252.08 seconds
Depends on: 1514388
There are 29 total failures in the last 7 days and 162 total failures in the last 30 days (some of them might be misclassifications), majority on linux. 
Recent failure log:  https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=219384257&repo=autoland&lineNumber=194352
 
[task 2018-12-31T13:12:56.848Z] 13:12:56     INFO - TEST-START | dom/base/test/test_bug715041.xul

[task 2018-12-31T13:13:08.337Z] 13:13:08     INFO - GECKO(5728) | JS CALLED OBSERVE FUNCTION!!!
[task 2018-12-31T13:13:08.338Z] 13:13:08     INFO - GECKO(5728) | msg 1 Count: 1
[task 2018-12-31T13:13:09.378Z] 13:13:09     INFO - GECKO(5728) | --DOMWINDOW == 28 (0xdb22d000) [pid = 5728] [serial = 132] [outer = (nil)] [url = about:blank]
[task 2018-12-31T13:13:09.382Z] 13:13:09     INFO - GECKO(5728) | --DOMWINDOW == 27 (0xdb238800) [pid = 5728] [serial = 129] [outer = (nil)] [url = chrome://mochikit/content/tests/SimpleTest/iframe-between-tests.html]
[task 2018-12-31T13:13:09.385Z] 13:13:09     INFO - GECKO(5728) | --DOMWINDOW == 26 (0xdb22b800) [pid = 5728] [serial = 124] [outer = (nil)] [url = chrome://mochitests/content/chrome/dom/base/test/test_blocking_image.html]
[task 2018-12-31T13:13:09.387Z] 13:13:09     INFO - GECKO(5728) | --DOMWINDOW == 25 (0xdb231000) [pid = 5728] [serial = 127] [outer = (nil)] [url = chrome://mochikit/content/tests/SimpleTest/iframe-between-tests.html]
[task 2018-12-31T13:13:09.389Z] 13:13:09     INFO - GECKO(5728) | --DOMWINDOW == 24 (0xdb223000) [pid = 5728] [serial = 123] [outer = (nil)] [url = chrome://mochikit/content/tests/SimpleTest/iframe-between-tests.html]
[task 2018-12-31T13:13:09.391Z] 13:13:09     INFO - GECKO(5728) | --DOMWINDOW == 23 (0xdc944c00) [pid = 5728] [serial = 110] [outer = (nil)] [url = chrome://mochitests/content/chrome/dom/base/test/test_blockParsing.html]
[task 2018-12-31T13:13:09.396Z] 13:13:09     INFO - GECKO(5728) | --DOMWINDOW == 22 (0xdb237c00) [pid = 5728] [serial = 128] [outer = (nil)] [url = chrome://mochitests/content/chrome/dom/base/test/test_bug1008126.html]
[task 2018-12-31T13:13:09.398Z] 13:13:09     INFO - GECKO(5728) | --DOMWINDOW == 21 (0xdb225000) [pid = 5728] [serial = 130] [outer = (nil)] [url = chrome://mochitests/content/chrome/dom/base/test/test_bug1016960.html]
[task 2018-12-31T13:13:09.400Z] 13:13:09     INFO - GECKO(5728) | --DOMWINDOW == 20 (0xdb2b9000) [pid = 5728] [serial = 133] [outer = (nil)] [url = chrome://mochikit/content/tests/SimpleTest/iframe-between-tests.html]
[task 2018-12-31T13:13:10.312Z] 13:13:10     INFO - GECKO(5728) | msg 3 Count: 1
[task 2018-12-31T13:13:13.318Z] 13:13:13     INFO - GECKO(5728) | msg 4 Count: 1
[task 2018-12-31T13:13:13.319Z] 13:13:13     INFO - GECKO(5728) | msg 2 Count: 1
[task 2018-12-31T13:13:13.322Z] 13:13:13     INFO - GECKO(5728) | TESTING CASE AddShiftLocalCleanUp
[task 2018-12-31T13:13:13.325Z] 13:13:13     INFO - GECKO(5728) | ==============
[task 2018-12-31T13:13:13.327Z] 13:13:13     INFO - GECKO(5728) | JS REMOVE IDLE OBSERVER () time to be removed: 1
[task 2018-12-31T13:13:13.332Z] 13:13:13     INFO - GECKO(5728) | JS removeIdleObserver() observer time: 0
[task 2018-12-31T13:13:13.334Z] 13:13:13     INFO - GECKO(5728) | JS removeIdleObserver() observer time: 1
[task 2018-12-31T13:13:13.339Z] 13:13:13     INFO - GECKO(5728) | MOCK IDLE SERVICE REMOVING idle observer with time 1
[task 2018-12-31T13:13:13.341Z] 13:13:13     INFO - GECKO(5728) | MOCK IDLE SERVICE numIdleObserversRemoved: 1 numIdleObserversAdded: 0
[task 2018-12-31T13:13:13.342Z] 13:13:13     INFO - GECKO(5728) | JS FAKE IDLE SERVICE end of remove idle observer
[task 2018-12-31T13:13:13.350Z] 13:13:13     INFO - GECKO(5728) | JS NUM OBSERVERS: 1
[task 2018-12-31T13:13:13.351Z] 13:13:13     INFO - GECKO(5728) | AddShiftLocalCleanUp() done clean up
[task 2018-12-31T13:13:13.353Z] 13:13:13     INFO - GECKO(5728) | TESTING CASE AddNewLocalWhileAllIdle
[task 2018-12-31T13:13:13.354Z] 13:13:13     INFO - GECKO(5728) | ==============
[task 2018-12-31T13:13:13.356Z] 13:13:13     INFO - GECKO(5728) | JS FAKE IDLE SERVICE add idle observer before
[task 2018-12-31T13:13:13.358Z] 13:13:13     INFO - GECKO(5728) | JS NUM OBSERVERS: 1
[task 2018-12-31T13:13:13.360Z] 13:13:13     INFO - GECKO(5728) | window is: [object Window]
[task 2018-12-31T13:13:13.361Z] 13:13:13     INFO - GECKO(5728) | MOCK IDLE SERVICE ADDING idle observer with time: 1
[task 2018-12-31T13:13:13.363Z] 13:13:13     INFO - GECKO(5728) | MOCK IDLE SERVICE: num idle observers added: 1
[task 2018-12-31T13:13:13.364Z] 13:13:13     INFO - GECKO(5728) | JS FAKE IDLE SERVICE end of add idle observer
[task 2018-12-31T13:13:13.366Z] 13:13:13     INFO - GECKO(5728) | JS NUM OBSERVERS: 2
[task 2018-12-31T13:13:13.368Z] 13:13:13     INFO - GECKO(5728) | JS FAKE IDLE SERVICE
[task 2018-12-31T13:13:13.369Z] 13:13:13     INFO - GECKO(5728) | JS NUM OBSERVERS: 2
[task 2018-12-31T13:13:13.373Z] 13:13:13     INFO - GECKO(5728) | JS CALLED OBSERVE FUNCTION!!!
[task 2018-12-31T13:13:13.375Z] 13:13:13     INFO - GECKO(5728) | msg 1 Count: 1
[task 2018-12-31T13:13:14.322Z] 13:13:14     INFO - GECKO(5728) | msg 2 Count: 1
[task 2018-12-31T13:13:15.321Z] 13:13:15     INFO - GECKO(5728) | msg 2 Count: 2
[task 2018-12-31T13:13:16.821Z] 13:13:16     INFO - GECKO(5728) | msg 5 Count: 1
[task 2018-12-31T13:13:16.825Z] 13:13:16     INFO - GECKO(5728) | function performNextTest()
[task 2018-12-31T13:13:16.828Z] 13:13:16     INFO - GECKO(5728) | currTestCaseNum: 4
[task 2018-12-31T13:13:16.831Z] 13:13:16     INFO - GECKO(5728) | cleanUp: false
[task 2018-12-31T13:13:16.834Z] 13:13:16     INFO - GECKO(5728) | passed: true
[task 2018-12-31T13:13:16.838Z] 13:13:16     INFO - GECKO(5728) | numIdleObserversRemoved: 0
[task 2018-12-31T13:13:16.840Z] 13:13:16     INFO - GECKO(5728) | numIdleObservesAdded: 1
[task 2018-12-31T13:13:16.844Z] 13:13:16     INFO - GECKO(5728) | TESTING CASE AddNewLocalWhileAllIdleCleanUp
[task 2018-12-31T13:13:16.846Z] 13:13:16     INFO - GECKO(5728) | ==============
[task 2018-12-31T13:13:16.850Z] 13:13:16     INFO - GECKO(5728) | JS REMOVE IDLE OBSERVER () time to be removed: 1
[task 2018-12-31T13:13:16.853Z] 13:13:16     INFO - GECKO(5728) | JS removeIdleObserver() observer time: 0
[task 2018-12-31T13:13:16.854Z] 13:13:16     INFO - GECKO(5728) | JS removeIdleObserver() observer time: 1
[task 2018-12-31T13:13:16.856Z] 13:13:16     INFO - GECKO(5728) | MOCK IDLE SERVICE REMOVING idle observer with time 1
[task 2018-12-31T13:13:16.857Z] 13:13:16     INFO - GECKO(5728) | MOCK IDLE SERVICE numIdleObserversRemoved: 1 numIdleObserversAdded: 0
[task 2018-12-31T13:13:16.858Z] 13:13:16     INFO - GECKO(5728) | JS FAKE IDLE SERVICE end of remove idle observer
[task 2018-12-31T13:13:16.859Z] 13:13:16     INFO - GECKO(5728) | JS NUM OBSERVERS: 1
[task 2018-12-31T13:13:16.861Z] 13:13:16     INFO - GECKO(5728) | TESTING CASE ShiftLocalTimerBack()
[task 2018-12-31T13:13:16.863Z] 13:13:16     INFO - GECKO(5728) | ==============
[task 2018-12-31T13:13:16.866Z] 13:13:16     INFO - GECKO(5728) | JS FAKE IDLE SERVICE add idle observer before
[task 2018-12-31T13:13:16.867Z] 13:13:16     INFO - GECKO(5728) | JS NUM OBSERVERS: 1
[task 2018-12-31T13:13:16.869Z] 13:13:16     INFO - GECKO(5728) | window is: [object Window]
[task 2018-12-31T13:13:16.871Z] 13:13:16     INFO - GECKO(5728) | MOCK IDLE SERVICE ADDING idle observer with time: 1
[task 2018-12-31T13:13:16.873Z] 13:13:16     INFO - GECKO(5728) | MOCK IDLE SERVICE: num idle observers added: 1
[task 2018-12-31T13:13:16.875Z] 13:13:16     INFO - GECKO(5728) | JS FAKE IDLE SERVICE end of add idle observer
[task 2018-12-31T13:13:16.877Z] 13:13:16     INFO - GECKO(5728) | JS NUM OBSERVERS: 2
[task 2018-12-31T13:13:16.879Z] 13:13:16     INFO - GECKO(5728) | JS FAKE IDLE SERVICE
[task 2018-12-31T13:13:16.881Z] 13:13:16     INFO - GECKO(5728) | JS NUM OBSERVERS: 2
[task 2018-12-31T13:13:16.883Z] 13:13:16     INFO - GECKO(5728) | JS CALLED OBSERVE FUNCTION!!!
[task 2018-12-31T13:13:16.885Z] 13:13:16     INFO - GECKO(5728) | msg 2 Count: 1
[task 2018-12-31T13:13:18.751Z] 13:13:18     INFO - GECKO(5728) | msg 4 Count: 1

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[taskcluster 2018-12-31 13:13:22.095Z] === Task Finished ===
[taskcluster 2018-12-31 13:13:22.096Z] Unsuccessful task run with exit code: -1 completed in 3603.519 seconds
(In reply to Intermittent Failures Robot from comment #232)
> 29 failures in 1008 pushes (0.029 failures/push) were associated with this
> bug in the last 7 days.
> 
> Platform breakdown:
> * linux64: 11
> * linux32: 8
> * linux64-qr: 10

I reviewed these linux timeouts but couldn't find a common cause. Existing max-run-times seem appropriate.

aws slow-down? Seems better recently -- monitoring.
See Also: → 1420394

Most recent failures are on linux/debug, but vary by test suite. Average task durations are << max-run-time. I wonder if these are related to logspam, like bug 1437991, bug 1515833, and bug 1515827.

(In reply to Geoff Brown [:gbrown] from comment #236)

Most recent failures are on linux/debug, but vary by test suite. Average task durations are << max-run-time. I wonder if these are related to logspam, like bug 1437991, bug 1515833, and bug 1515827.

Failure frequency did seem to decrease with resolution of bug 1437991, but linux/debug continues to dominate ongoing failures.

(In reply to Geoff Brown [:gbrown] from comment #236)

Most recent failures are on linux/debug, but vary by test suite. Average task durations are << max-run-time. I wonder if these are related to logspam, like bug 1437991, bug 1515833, and bug 1515827.

Those logspam bugs are all resolved now - great! - and did seem to help the random linux/debug failures here.

Recently we are seeing more linux/debug jsreftest failures specifically -- investigating...

(In reply to Geoff Brown [:gbrown] from comment #241)

Recently we are seeing more linux/debug jsreftest failures specifically -- investigating...

And linux/debug reftest also.

linux64/debug jsreftest might benefit from more chunks.

Other platforms are more puzzling: linux64/debug reftests run in about 30 minutes generally...then suddenly a chunk runs in 60+ minutes. Affected chunks appear "random"...could that be because tests are being shuffled?

Some failures on Feb 4/5 appear to be related to certificate expiry (temporary infrastructure condition caused mass failures, hangs).

Depends on: 1528405

Over the last 7 days this bug has 30 failures. These happen on windows7-32, osx-cross-noopt, osx-cross, linux64-qr, linux64, linux32

Here is the latest log failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=228854900&repo=mozilla-inbound&lineNumber=30960

[task 2019-02-16T23:10:38.226Z] 23:10:38 INFO - REFTEST TEST-START | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-004.html == file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-001-ref.html
[task 2019-02-16T23:10:38.231Z] 23:10:38 INFO - REFTEST TEST-LOAD | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-004.html | 1167 / 1195 (97%)
[task 2019-02-16T23:10:38.256Z] 23:10:38 INFO - ++DOMWINDOW == 334 (0x7f0529d71c00) [pid = 8577] [serial = 2751] [outer = 0x7f052d531800]
[task 2019-02-16T23:10:38.337Z] 23:10:38 INFO - [Child 8577, Main Thread] WARNING: HTMLEditRules::BeforeEdit() failed to handle something: 'NS_SUCCEEDED(rv)', file /builds/worker/workspace/build/src/editor/libeditor/HTMLEditor.cpp, line 3576
[task 2019-02-16T23:10:38.344Z] 23:10:38 INFO - [Child 8577, Main Thread] WARNING: '!aSelection.RangeCount()', file /builds/worker/workspace/build/src/editor/libeditor/EditorBase.cpp, line 3569
[task 2019-02-16T23:10:38.351Z] 23:10:38 INFO - [Child 8577, Main Thread] WARNING: '!selectionStartPoint.IsSet()', file /builds/worker/workspace/build/src/editor/libeditor/HTMLEditRules.cpp, line 9806
[task 2019-02-16T23:10:38.355Z] 23:10:38 INFO - [Child 8577, Main Thread] WARNING: Failed to normalize Selection: 'NS_SUCCEEDED(rv)', file /builds/worker/workspace/build/src/editor/libeditor/HTMLEditRules.cpp, line 453
[task 2019-02-16T23:10:41.103Z] 23:10:41 INFO - REFTEST TEST-PASS | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-004.html == file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-001-ref.html | image comparison, max difference: 0, number of differing pixels: 0
[task 2019-02-16T23:10:41.106Z] 23:10:41 INFO - REFTEST TEST-END | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-004.html == file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-001-ref.html
[task 2019-02-16T23:10:41.136Z] 23:10:41 INFO - ++DOMWINDOW == 335 (0x7f0528b65000) [pid = 8577] [serial = 2752] [outer = 0x7f052d531800]
[task 2019-02-16T23:10:41.196Z] 23:10:41 INFO - REFTEST TEST-START | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-005.html == file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-001-ref.html
[task 2019-02-16T23:10:41.203Z] 23:10:41 INFO - REFTEST TEST-LOAD | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-005.html | 1168 / 1195 (97%)
[task 2019-02-16T23:10:41.243Z] 23:10:41 INFO - ++DOMWINDOW == 336 (0x7f0528d1ac00) [pid = 8577] [serial = 2753] [outer = 0x7f052d531800]
[task 2019-02-16T23:10:41.628Z] 23:10:41 INFO - REFTEST TEST-PASS | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-005.html == file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-001-ref.html | image comparison, max difference: 0, number of differing pixels: 0
[task 2019-02-16T23:10:41.634Z] 23:10:41 INFO - REFTEST TEST-END | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-005.html == file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-001-ref.html
[task 2019-02-16T23:10:41.659Z] 23:10:41 INFO - ++DOMWINDOW == 337 (0x7f0529084c00) [pid = 8577] [serial = 2754] [outer = 0x7f052d531800]
[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[task 2019-02-16T23:10:41.723Z] 23:10:41 INFO - REFTEST TEST-START | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-006.html == file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-006-ref.html
[task 2019-02-16T23:10:41.746Z] 23:10:41 INFO - REFTEST TEST-LOAD | file:///builds/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/selectors/focus-within-006.html | 1169 / 1195 (97%)
[taskcluster 2019-02-16 23:10:42.612Z] === Task Finished ===
[taskcluster 2019-02-16 23:10:42.613Z] Unsuccessful task run with exit code: -1 completed in 4378.344 seconds

Flags: needinfo?(gbrown)
Flags: needinfo?(gbrown)
See Also: → 1529777
Depends on: 1529781

Over the last 7 days there are 39 failures on this bug. These happen on osx-cross, linux64-qr, linux64, linux32, android-em-4-3-armv7-api16, android-5-0-x86_64.

Here is the latest failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=231041560&repo=autoland&lineNumber=35054

Flags: needinfo?(gbrown)
Flags: needinfo?(gbrown)
See Also: → 1532236
Depends on: 1533565

Thanks Arthur. I noticed and hope my patch in bug 1533565 will resolve this.

Flags: needinfo?(gbrown)

There were 4 timeouts on android builds here:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&searchStr=nadroid%2Cnightly&revision=c89f024c023fa816d700b151e8e0cbb9a1907cb8

(On retry, they were all fine.)

It looks like hg robustcheckout was taking a very long time.

[task 2019-03-07T10:38:21.362Z] 10:38:21 INFO - Running main action method: multi_l10n
...
[task 2019-03-07T12:05:06.758Z] 12:05:06 INFO - 12:05:06 INFO - Running command: ['hg', '--config', 'ui.merge=internal:merge', '--config', 'extensions.robustcheckout=/builds/worker/workspace/build/src/testing/mozharness/external_tools/robustcheckout.py', 'robustcheckout', u'https://hg.mozilla.org/l10n-central/mr', u'mr', '--sharebase', u'/builds/hg-shared', '--branch', u'default']
[task 2019-03-07T12:05:06.758Z] 12:05:06 INFO - 12:05:06 INFO - Copy/paste: hg --config ui.merge=internal:merge --config extensions.robustcheckout=/builds/worker/workspace/build/src/testing/mozharness/external_tools/robustcheckout.py robustcheckout https://hg.mozilla.org/l10n-central/mr mr --sharebase /builds/hg-shared --branch default
[task 2019-03-07T12:05:06.809Z] 12:05:06 INFO - 12:05:06 INFO - (using Mercurial 4.8.1)
[task 2019-03-07T12:05:06.810Z] 12:05:06 INFO - 12:05:06 INFO - ensuring https://hg.mozilla.org/l10n-central/mr@default is available at mr
[task 2019-03-07T12:05:07.025Z] 12:05:07 INFO - 12:05:07 INFO - (sharing from new pooled repository cf74f930fde306218d72c790d69ad9ba7c8b8b51)
[task 2019-03-07T12:05:07.216Z] 12:05:07 INFO - 12:05:07 INFO - requesting all changes

[taskcluster:error] Task timeout after 7200 seconds. Force killing container.

vs a successful retry:

[task 2019-03-07T12:43:25.400Z] 12:43:25 INFO - Running main action method: multi_l10n
...
[task 2019-03-07T12:51:28.535Z] 12:51:28 INFO - 12:51:28 INFO - [mozharness: 2019-03-07 12:51:28.535244Z] Finished pull-locale-source step (success)
...
[task 2019-03-07T12:56:32.376Z] 12:56:32 INFO - [mozharness: 2019-03-07 12:56:32.376373Z] Finished multi-l10n step (success)
[task 2019-03-07T12:56:32.376Z] 12:56:32 INFO - [mozharness: 2019-03-07 12:56:32.376448Z] Skipping package-source step.
[task 2019-03-07T12:56:32.376Z] 12:56:32 INFO - Running post-run listener: _parse_build_tests_ccov
[task 2019-03-07T12:56:32.376Z] 12:56:32 INFO - Running post-run listener: _shutdown_sccache
[task 2019-03-07T12:56:32.376Z] 12:56:32 INFO - Running post-run listener: _summarize
[task 2019-03-07T12:56:32.376Z] 12:56:32 INFO - # TBPL SUCCESS #
[task 2019-03-07T12:56:32.376Z] 12:56:32 INFO - [mozharness: 2019-03-07 12:56:32.376769Z] FxDesktopBuild summary:
[task 2019-03-07T12:56:32.376Z] 12:56:32 INFO - # TBPL SUCCESS #
[taskcluster 2019-03-07 12:56:32.926Z] === Task Finished ===
[taskcluster 2019-03-07 12:58:24.583Z] Successful task run with exit code: 0 completed in 3083.283 seconds

I won't follow-up unless there are more problems like this.

Depends on: 1534822

There are still some linux64/debug devtools timeouts after bug 1534822. Chunks are very unbalanced. Filed bug 1536253 as a first step to improvement.

There are some infrequent linux64-qr/debug reftest 3600 s timeouts, but chunks normally run in 20 to 40 minutes so I am reluctant to act; watching...

Depends on: 1539520

source-test-clang-tidy is perma-fail, causing a spike here. There is some existing activity on that test, like bug 1539779 -- waiting to see if that helps. (Otherwise, it might be worth noting that there was a sharp spike in run time around here: https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=tidy&tochange=98452610cfcc81cba0d4478797fe1e83a51172e8&fromchange=cf6cfe33476622c202fcbc34ef12f8ca1c039b8a. )

Geoff, I've filed bug 1540325 for the clang tidy timeouts.

Depends on: 1540280

Over the last 7 days there are 31 failures present on this bug. These happen on: android-em-7-0-x86_64, linux32-shippable, linux64, linux64-qr, osx-cross, osx-cross-ccov

Here is the most recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=239147679&repo=autoland&lineNumber=78131

Flags: needinfo?(gbrown)
Flags: needinfo?(gbrown)
Depends on: 1543558

Over the last 7 days there are 38 failures present on this bug. These happen on android-4-0-armv7-api16, android-5-0-aarch64, android-em-7-0-x86, android-em-7-0-x86_64, linux32-shippable, linux64, linux64-shippable, osx-cross, osx-cross-noopt, osx-shippable, windows-mingw32, windows2012-32-shippable.

Here is the most recent log example: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=240827727&repo=mozilla-central&lineNumber=5178

[task 2019-04-17T00:24:53.121Z] 00:24:53 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-283.xht
[task 2019-04-17T00:24:55.017Z] 00:24:55 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-283.xht | took 1905ms
[task 2019-04-17T00:24:55.019Z] 00:24:55 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-284.xht
[task 2019-04-17T00:24:57.803Z] 00:24:57 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-284.xht | took 2782ms
[task 2019-04-17T00:24:57.805Z] 00:24:57 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-285.xht
[task 2019-04-17T00:24:59.685Z] 00:24:59 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-285.xht | took 1879ms
[task 2019-04-17T00:24:59.687Z] 00:24:59 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-286.xht
[task 2019-04-17T00:25:01.657Z] 00:25:01 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-286.xht | took 1971ms
[task 2019-04-17T00:25:01.657Z] 00:25:01 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-287.xht
[task 2019-04-17T00:25:03.591Z] 00:25:03 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-287.xht | took 1935ms
[task 2019-04-17T00:25:03.593Z] 00:25:03 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-288.xht
[task 2019-04-17T00:25:05.387Z] 00:25:05 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-288.xht | took 1796ms
[task 2019-04-17T00:25:05.388Z] 00:25:05 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-289.xht
[task 2019-04-17T00:25:07.057Z] 00:25:07 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-289.xht | took 1667ms
[task 2019-04-17T00:25:07.058Z] 00:25:07 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-290.xht
[task 2019-04-17T00:25:08.942Z] 00:25:08 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-290.xht | took 1880ms
[task 2019-04-17T00:25:08.944Z] 00:25:08 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-291.xht
[task 2019-04-17T00:25:10.795Z] 00:25:10 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-291.xht | took 1857ms
[task 2019-04-17T00:25:10.796Z] 00:25:10 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-292.xht
[task 2019-04-17T00:25:12.671Z] 00:25:12 INFO - TEST-PASS | /css/CSS2/selectors/first-letter-punctuation-292.xht | took 1873ms
[task 2019-04-17T00:25:12.671Z] 00:25:12 INFO - TEST-START | /css/CSS2/selectors/first-letter-punctuation-293.xht
[taskcluster:error] Task timeout after 5400 seconds. Force killing container.
[taskcluster 2019-04-17 00:25:15.346Z] === Task Finished ===
[taskcluster 2019-04-17 00:25:15.347Z] Unsuccessful task run with exit code: -1 completed in 5402.945 seconds

Flags: needinfo?(gbrown)

Looking at a couple of the linux builds it seems sometimes it takes 1/2 to 1 hour to clone and update the repo. I do see '(warning: large working directory being used without fsmonitor enabled; enable fsmonitor to improve performance; see "hg help -e fsmonitor")' in the log as well.

Perhaps we a) either adjust the max run time to account for the slow hg performance or b) improve the hg performance perhaps with fsmonitor or both or some other approach. Too bad we have to clone and update > 3G of data each time as well.

Perhaps we can just try to bump the max run time to see if we prevent or reduce these errors for the problematic jobs, not just builds, then for the ones which complete when given enough time, try to improve their run times. For the ones that won't complete regardless of the run time we can reduce it again and investigate that as a distinct issue.

Flags: needinfo?(gbrown)
Depends on: 1545308

(In reply to Bob Clary [:bc:] from comment #272)

Thanks :bc.

Looking at a couple of the linux builds it seems sometimes it takes 1/2 to 1 hour to clone and update the repo. I do see '(warning: large working directory being used without fsmonitor enabled; enable fsmonitor to improve performance; see "hg help -e fsmonitor")' in the log as well.

I also see the fsmonitor warning in other logs, where the cloning completes quickly -- maybe it is normal / expected?

There are 35 total failures in the last 7 days on multiple platforms: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-04-19&endday=2019-04-26&tree=trunk&bug=1411358

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=242811869&repo=mozilla-inbound&lineNumber=17137

[task 2019-04-26T09:12:42.250Z] 09:12:42 INFO - REFTEST TEST-START | http://10.0.2.2:8854/jsreftest/tests/jsreftest.html?test=test262/language/expressions/equals/S11.9.1_A3.3.js
[task 2019-04-26T09:12:42.251Z] 09:12:42 INFO - REFTEST TEST-LOAD | http://10.0.2.2:8854/jsreftest/tests/jsreftest.html?test=test262/language/expressions/equals/S11.9.1_A3.3.js | 4550 / 6609 (68%)
[task 2019-04-26T09:12:42.255Z] 09:12:42 INFO - wait for org.mozilla.geckoview.test complete; top activity=org.mozilla.geckoview.test
[task 2019-04-26T09:12:42.466Z] 09:12:42 INFO - org.mozilla.geckoview.test unexpectedly found running. Killing...
[task 2019-04-26T09:12:42.467Z] 09:12:42 INFO - REFTEST TEST-INFO | started process screentopng
[task 2019-04-26T09:12:44.529Z] 09:12:44 INFO - REFTEST TEST-INFO | screentopng: exit 0
[task 2019-04-26T09:13:00.704Z] 09:13:00 WARNING - TEST-UNEXPECTED-FAIL | http://10.0.2.2:8854/jsreftest/tests/jsreftest.html?test=test262/language/expressions/equals/S11.9.1_A3.3.js | application ran for longer than allowed maximum time
[task 2019-04-26T09:13:00.705Z] 09:13:00 INFO - remoteautomation.py | Application ran for: 1:50:49.448078
[task 2019-04-26T09:13:02.391Z] 09:13:02 INFO - REFTEST INFO | Copy/paste: /builds/worker/workspace/build/linux64-minidump_stackwalk /tmp/tmpPTQpi_/4c2ef0dc-18bd-c459-f2b2-5b20c3643e3e.dmp /builds/worker/workspace/build/symbols
[taskcluster:error] Task timeout after 7200 seconds. Force killing container.
[taskcluster 2019-04-26 09:13:20.899Z] === Task Finished ===
[taskcluster 2019-04-26 09:13:20.899Z] Unsuccessful task run with exit code: -1 completed in 7369.418 seconds

Whiteboard: [stockwell infra] [see summary at comment 92] → [stockwell infra] [see summary at comment 92][stockwell needswork:owner]
Whiteboard: [stockwell infra] [see summary at comment 92][stockwell needswork:owner] → [stockwell infra] [see summary at comment 92]

Most failures are android-em-7-0*. I have checked that these are not hung (recent entries in log at timeout) and that the same tasks normally runs in much less time on other pushes. Although android-performance.log artifacts are not available in this bug, I strongly suspect bug 1545308 and hope to see that bug re-open soon.

I expect these failures to continue and this bug to remain open indefinitely. Task timeouts are the last resort whenever a task is hung or otherwise runs for too long. My goal is to address timeouts which recur in the same task frequently and identify other related issues in dependent bugs.

I try to review failures here at least once each week. There is no point in pasting recent failure logs or failure counts in this bug. If a spike in frequency is noted, or you have insight into a failure, or I haven't commented on an on-going issue for more than a week, need-info is appreciated.

Whiteboard: [stockwell infra] [see summary at comment 92] → [stockwell infra] [see summary at comment 278]

(In reply to Intermittent Failures Robot from comment #279)

19 failures in 933 pushes (0.02 failures/push) were associated with this bug yesterday.

Repository breakdown:

  • try: 11

Mass mis-classification.

Most recent failures are on android-em-7-0-x86. These are mostly wpt-reftests which normally complete in 15 to 30 minutes but sometimes exceed 90 minutes. Evidence is not conclusive, but I hope these will be resolved by bug 1545308.

Depends on: 1552580
Depends on: 1552334