Closed Bug 1420394 Opened 7 years ago Closed 5 years ago

Intermittent reftest [taskcluster:error] Task aborted - max run time exceeded

Categories

(Testing :: Reftest, defect)

Version 3
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1589796

People

(Reporter: aiakab, Assigned: gbrown)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra])

Here is the bug from autoland https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=6cd9206792f9f0cdb4828fba6d6f9dd0745d550b&selectedJob=147327027

It is e10s test-windows7-32/debug-reftest-gpu-e10s-1 tc-R-e10s(Rg1). It has failed with Aborting task - max run time exceeded!       and      Result: IDLENESS_LIMIT_EXCEEDED.  I have re triggered the job twice and  the job finished sucessfuly 

Log link: https://treeherder.mozilla.org/logviewer.html#?job_id=147327027&repo=autoland

Part of a log:
06:12:59     INFO -  [Child 2844, Main Thread] WARNING: Subsequent listeners are not retargetable: file z:/build/build/src/netwerk/protocol/http/HttpChannelChild.cpp, line 3416
10486
06:13:01     INFO -  REFTEST TEST-LOAD | http://localhost:49308/1511501287663/83/encoder.html?img=size-33x33.png&mime=image/vnd.microsoft.icon | 796 / 1936 (41%)
10487
06:13:01     INFO -  --DOMWINDOW == 12 (09147000) [pid = 2844] [serial = 2161] [outer = 00000000] [url = http://localhost:49308/1511501287649/80/encoder.html?img=size-17x17.png&mime=image/vnd.microsoft.icon]
10488
06:13:01     INFO -  --DOMWINDOW == 11 (09149000) [pid = 2844] [serial = 2160] [outer = 00000000] [url = http://localhost:49308/1511501287649/80/size-17x17.png]
10489
06:13:01     INFO -  --DOMWINDOW == 10 (0EB6F000) [pid = 2844] [serial = 2162] [outer = 00000000] [url = data:text/html;charset=UTF-8,%3C%21%2D%2DCLEAR%2D%2D%3E]
10490
06:13:01     INFO -  --DOMWINDOW == 9 (0EB6B800) [pid = 2844] [serial = 2158] [outer = 00000000] [url = http://localhost:49308/1511501287646/79/encoder.html?img=size-16x16.png&mime=image/vnd.microsoft.icon]
10491
06:13:01     INFO -  --DOMWINDOW == 8 (0914D800) [pid = 2844] [serial = 2159] [outer = 00000000] [url = data:text/html;charset=UTF-8,%3C%21%2D%2DCLEAR%2D%2D%3E]
10492
06:13:02     INFO -  ++DOMWINDOW == 9 (09149000) [pid = 2844] [serial = 2170] [outer = 061843A0]
10493
[taskcluster 2017-11-24T06:13:03.201Z] Aborting task - max run time exceeded!
10494
[taskcluster 2017-11-24T06:13:03.419Z]   Exit Code: 0
10495
[taskcluster 2017-11-24T06:13:03.419Z]   User Time: 0s
10496
[taskcluster 2017-11-24T06:13:03.422Z] Kernel Time: 31.2002ms
10497
[taskcluster 2017-11-24T06:13:03.422Z]   Wall Time: 59m55.211s
10498
[taskcluster 2017-11-24T06:13:03.422Z] Peak Memory: 2248704
10499
[taskcluster 2017-11-24T06:13:03.422Z]      Result: IDLENESS_LIMIT_EXCEEDED
10500
[taskcluster 2017-11-24T06:13:03.424Z] === Task Finished ===
10501
[taskcluster 2017-11-24T06:13:03.424Z] Task Duration: 59m55.264s
Summary: Intermittent reftest Aborting task - max run time exceeded! → Intermittent reftest [taskcluster:error] Task aborted - max run time exceeded
This bug failed 38 times in the last 7 days, occurs on windows7-32 opt, pgo and debug.

Log:
https://treeherder.mozilla.org/logviewer.html#?job_id=206763080&repo=autoland&lineNumber=17179

1:05:05     INFO - REFTEST TEST-PASS | file:///Z:/task_1540029163/build/tests/reftest/tests/layout/reftests/text-overflow/vertical-decorations-4.html != file:///Z:/task_1540029163/build/tests/reftest/tests/layout/reftests/text-overflow/vertical-decorations-3-4-notref.html | image comparison, max difference: 255, number of differing pixels: 125
11:05:05     INFO - REFTEST TEST-END | file:///Z:/task_1540029163/build/tests/reftest/tests/layout/reftests/text-overflow/vertical-decorations-4.html != file:///Z:/task_1540029163/build/tests/reftest/tests/layout/reftests/text-overflow/vertical-decorations-3-4-notref.html
[taskcluster:error] Aborting task...
11:05:06     INFO - REFTEST INFO | Slowest test took 6198ms (file:///Z:/task_1540029163/build/tests/reftest/tests/layout/reftests/text-overflow/selection.html)
11:05:06     INFO - REFTEST INFO | Total canvas count = 3
[taskcluster 2018-10-20T11:05:06.461Z] SUCCESS: The process with PID 5916 (child process of PID 5812) has been terminated.
[taskcluster 2018-10-20T11:05:06.461Z] SUCCESS: The process with PID 5424 (child process of PID 5812) has been terminated.
[taskcluster 2018-10-20T11:05:06.461Z] SUCCESS: The process with PID 5192 (child process of PID 5812) has been terminated.
[taskcluster 2018-10-20T11:05:06.461Z] SUCCESS: The process with PID 4392 (child process of PID 5812) has been terminated.
[taskcluster 2018-10-20T11:05:06.461Z] SUCCESS: The process with PID 5812 (child process of PID 5752) has been terminated.
[taskcluster 2018-10-20T11:05:06.461Z] SUCCESS: The process with PID 5752 (child process of PID 2932) has been terminated.
[taskcluster 2018-10-20T11:05:06.461Z] SUCCESS: The process with PID 2508 (child process of PID 1188) has been terminated.
[taskcluster 2018-10-20T11:05:06.461Z] SUCCESS: The process with PID 2932 (child process of PID 1188) has been terminated.
[taskcluster 2018-10-20T11:05:06.461Z] SUCCESS: The process with PID 1188 (child process of PID 2104) has been terminated.
[taskcluster 2018-10-20T11:05:06.461Z] SUCCESS: The process with PID 2104 (child process of PID 1164) has been terminated.
[taskcluster 2018-10-20T11:05:06.461Z] 
[taskcluster 2018-10-20T11:05:06.468Z] === Task Finished ===
[taskcluster 2018-10-20T11:05:06.469Z] Task Duration: 59m32.742s
[taskcluster 2018-10-20T11:05:07.424Z] Uploading artifact public/logs/localconfig.json from file logs\localconfig.json with content encoding "gzip", mime type "application/octet-stream" and expiry 2019-10-20T08:45:01.371Z
[taskcluster 2018-10-20T11:05:08.388Z] Uploading artifact public/logs/log_critical.log from file logs\log_critical.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-20T08:45:01.371Z
[taskcluster 2018-10-20T11:05:08.832Z] Uploading artifact public/logs/log_error.log from file logs\log_error.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-20T08:45:01.371Z
[taskcluster 2018-10-20T11:05:09.477Z] Uploading artifact public/logs/log_fatal.log from file logs\log_fatal.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-20T08:45:01.371Z
[taskcluster 2018-10-20T11:05:09.896Z] Uploading artifact public/logs/log_info.log from file logs\log_info.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-20T08:45:01.371Z
[taskcluster 2018-10-20T11:05:10.938Z] Uploading artifact public/logs/log_raw.log from file logs\log_raw.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-20T08:45:01.371Z
[taskcluster 2018-10-20T11:05:11.634Z] Uploading artifact public/logs/log_warning.log from file logs\log_warning.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-20T08:45:01.371Z
[taskcluster 2018-10-20T11:05:12.128Z] Uploading artifact public/test_info/reftest-gpu_errorsummary.log from file build\blobber_upload_dir\reftest-gpu_errorsummary.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-20T08:45:01.371Z
[taskcluster 2018-10-20T11:05:12.558Z] Uploading artifact public/test_info/reftest-gpu_raw.log from file build\blobber_upload_dir\reftest-gpu_raw.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-20T08:45:01.371Z
[taskcluster 2018-10-20T11:05:13.662Z] Uploading artifact public/test_info/system-info.log from file build\blobber_upload_dir\system-info.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-20T08:45:01.371Z
[taskcluster 2018-10-20T11:05:14.260Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/bt3YEyWcSvCl1IPb9lpCpg/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-10-20T08:45:01.371Z
[taskcluster:error] Task aborted - max run time exceeded

ahal: Can you please take a look at the this bug?
Flags: needinfo?(ahal)
Whiteboard: [stockwell needswork]
This might just be a case of a task near the borderline of the timeout threshold. I'll keep the needinfo to dig deeper.
In the last 7 days there have been 83 occurrences mostly on Windows 10 64, some on Os X, build types Asan and Debug.
Whiteboard: [stockwell unknown] → [stockwell unknown][stockwell needswork]
This bug has failed 32 times in the last 7 days. Occurs on osx-10-10, linux64-qr, windows10-64-ccov and windows7-32 on debug opt and pgo build types.

Log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=216733713&repo=mozilla-inbound&lineNumber=21901

 INFO - REFTEST TEST-START | file:///Users/cltbld/tasks/task_1544669417/build/tests/jsreftest/tests/jsreftest.html?test=non262/lexical-environment/block-scoped-functions-annex-b-with.js
20:41:57     INFO - REFTEST TEST-LOAD | file:///Users/cltbld/tasks/task_1544669417/build/tests/jsreftest/tests/jsreftest.html?test=non262/lexical-environment/block-scoped-functions-annex-b-with.js | 1700 / 17967 (9%)
20:41:57     INFO - ++DOMWINDOW == 97 (0x122930400) [pid = 2036] [serial = 3628] [outer = 0x112ec0800]
20:41:57     INFO - TEST-INFO |  FAILED!  TypeError: Assertion failed: got "true", expected "false"
20:41:57     INFO - JavaScript error: file:///Users/cltbld/tasks/task_1544669417/build/tests/jsreftest/tests/shell.js, line 127: TypeError: Assertion failed: got "true", expected "false"
20:41:57     INFO - [Child 2036, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file /builds/worker/workspace/build/src/dom/base/nsContentUtils.cpp, line 3764
20:41:57     INFO - REFTEST TEST-PASS | file:///Users/cltbld/tasks/task_1544669417/build/tests/jsreftest/tests/jsreftest.html?test=non262/lexical-environment/block-scoped-functions-annex-b-with.js | (LOAD ONLY)
20:41:57     INFO - REFTEST TEST-END | file:///Users/cltbld/tasks/task_1544669417/build/tests/jsreftest/tests/jsreftest.html?test=non262/lexical-environment/block-scoped-functions-annex-b-with.js
20:41:58     INFO - ++DOMWINDOW == 98 (0x122934800) [pid = 2036] [serial = 3629] [outer = 0x112ec0800]
20:41:58     INFO - [Child 2036, Main Thread] WARNING: 'NS_FAILED(rv)', file /builds/worker/workspace/build/src/dom/base/nsContentUtils.cpp, line 8218
20:41:58     INFO - REFTEST TEST-START | file:///Users/cltbld/tasks/task_1544669417/build/tests/jsreftest/tests/jsreftest.html?test=non262/lexical-environment/block-scoped-functions-annex-b.js
(In reply to Dorel Luca [:dluca] from comment #48)
> This bug has failed 32 times in the last 7 days. Occurs on osx-10-10,
> linux64-qr, windows10-64-ccov and windows7-32 on debug opt and pgo build
> types.

I think that latest issue got exposed by bug 1512989. To be clear, bug 1512989 didn't cause it, but the jsreftests were perma-green (i.e even when there was a failure). So since the fix landed in bug 1512989 I'd expect to see all sorts of intermittent jsreftest stuff popping up. Unfortunately there won't be any easy way of tracking down when the intermittents started happening :(.

On another note, the failure in there is more specific than this bug. "Task aborted - max runtime exceeded" is just a generic message that can happen for a multitude of reasons. I think we should avoid starring things against this bug unless there is no other more specific log to star against. Maybe we can improve the logging of the harnesses so no intermittents get starred against this.
Flags: needinfo?(ahal)
See Also: → 1411358

In the last 7 days, there are 32 failures.

They occur on
16 linux64 (asan, debug)
5 linux32 (debug)
5 macosx64-qr (debug, opt)
3 linux64-qr (debug)
2 windows7-32 (debug)
1 osx-cross (debug)

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=221131401&repo=mozilla-central&lineNumber=115821

Whiteboard: [stockwell unknown][stockwell needswork] → [stockwell needswork]
See Also: → 1529777

Hi Andrew, can you assign someone to take a look at this bug?

Flags: needinfo?(ahal)
Whiteboard: [stockwell unknown] → [stockwell needswork:owner]

There's probably not a whole we can do here. This error means that for whatever reason (slow network, background processes eating CPU, etc), things ran a bit slower than usual. If there were a specific platform / suite / chunk that were consistently exceeding the maxruntime then we could easily fix it by increasing chunks or bumping up the maxruntime.

But because the distribution of platforms is very evenly spread out there aren't any clear actionable signals to work on here. Fwiw, I'd tend to treat each platform this error happens on as a separate issue. So grouping all reftest tasks under a single bug is misleading in terms of how bad of an intermittent this is.

Flags: needinfo?(ahal)
Depends on: 1579905

I can monitor this bug like I do for bug 1411358. Expect failures to continue, but there may be occasional adjustments we can make to keep the failures infrequent.

Assignee: nobody → gbrown

Thank you both.

Depends on: 1581847
Whiteboard: [stockwell unknown] → [stockwell infra]

:aryx - I'd like to consolidate several similar bugs (this one, bug 1523098, bug 1457641, maybe some others) into one generic "max run time exceeded" bug. I think it will be easier for me to monitor these failures that way. Would that be okay?

Flags: needinfo?(aryx.bugmail)

Thanks for monitoring those issues in general.

Flags: needinfo?(aryx.bugmail)
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.