Closed Bug 1455953 (Opened 7 years ago, Closed 6 years ago)

Intermittent Aborting task - max run time exceeded! after talos hang after download of target.common.tests.zip took very long

Categories

(Testing :: Talos, defect)

Version: Version 3
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: CosminS, Unassigned)

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

11:52:31 INFO - Processing c:\users\task_1524395807\build\tests\mozbase\mozprocess
11:52:43 INFO - Processing c:\users\task_1524395807\build\tests\mozbase\mozprofile
11:52:57 INFO - Processing c:\users\task_1524395807\build\tests\mozbase\mozrunner
11:53:06 INFO - Processing c:\users\task_1524395807\build\tests\mozbase\mozscreenshot
11:53:12 INFO - Processing c:\users\task_1524395807\build\tests\mozbase\moztest
11:53:20 INFO - Processing c:\users\task_1524395807\build\tests\mozbase\mozversion
11:53:27 INFO - Installing collected packages: mozterm, manifestparser, mozcrash, mozdebug, mozdevice, mozfile, mozhttpd, mozinfo, mozInstall, mozleak, mozlog, moznetwork, mozprocess, mozprofile, mozrunner, mozscreenshot, moztest, mozversion
11:53:27 INFO - Running setup.py install for mozterm: started
11:53:40 INFO - Running setup.py install for mozterm: finished with status 'done'
11:53:41 INFO - Running setup.py install for manifestparser: started
11:53:51 INFO - Running setup.py install for manifestparser: finished with status 'done'
11:53:52 INFO - Running setup.py install for mozcrash: started
11:54:01 INFO - Running setup.py install for mozcrash: finished with status 'done'
11:54:02 INFO - Running setup.py install for mozdebug: started
11:54:11 INFO - Running setup.py install for mozdebug: finished with status 'done'
11:54:11 INFO - Running setup.py install for mozdevice: started
11:54:26 INFO - Running setup.py install for mozdevice: finished with status 'done'
11:54:26 INFO - Running setup.py install for mozfile: started
11:54:37 INFO - Running setup.py install for mozfile: finished with status 'done'
11:54:37 INFO - Running setup.py install for mozhttpd: started
11:54:48 INFO - Running setup.py install for mozhttpd: finished with status 'done'
11:54:49 INFO - Running setup.py install for mozinfo: started
11:55:00 INFO - Running setup.py install for mozinfo: finished with status 'done'
11:55:00 INFO - Running setup.py install for mozInstall: started
[taskcluster 2018-04-22T11:55:00.802Z] Exit Code: 0
[taskcluster 2018-04-22T11:55:00.802Z] User Time: 0s
[taskcluster 2018-04-22T11:55:00.802Z] Kernel Time: 0s
[taskcluster 2018-04-22T11:55:00.802Z] Wall Time: 24m57.0562737s
[taskcluster 2018-04-22T11:55:00.802Z] Peak Memory: 6041600
[taskcluster 2018-04-22T11:55:00.802Z] Result: IDLENESS_LIMIT_EXCEEDED
[taskcluster 2018-04-22T11:55:00.803Z] === Task Finished ===
[taskcluster 2018-04-22T11:55:00.803Z] Task Duration: 24m57.2678403s
Flags: needinfo?(rwood)
Summary: Intermittent talos (O) IDLENESS_LIMIT_EXCEEDED → Intermittent Aborting task - max run time exceeded! after talos hang after download of target.common.tests.zip took very long
I believe most of these issues are on Windows moonshot machines; this is infra related, as we are timing out or not getting the specific resources we are looking for. :markco, do you have backend logs about networking issues with the moonshots? :aryx, how can we get a list of the problems? Specifically, I want to look at all the failures and see whether specific machine names show up.
Flags: needinfo?(rwood) → needinfo?(mcornmesser)
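(Editorial aside, not part of the bug as filed: a minimal sketch of the kind of triage being asked for above, tallying worker hostnames across a directory of downloaded failure logs. It assumes hostnames appear in the T-W1064-MS-NNN form seen in the later comments; the ./failure-logs path and *.log pattern are hypothetical.)

```python
# Minimal sketch (not from this bug): tally which Windows moonshot workers
# show up in a set of downloaded task logs. Assumes hostnames appear in the
# T-W1064-MS-NNN form; the ./failure-logs directory is hypothetical.
import re
from collections import Counter
from pathlib import Path

HOST_RE = re.compile(r"T-W1064(?:-MS)?-\d+", re.IGNORECASE)

counts = Counter()
for log in Path("./failure-logs").glob("*.log"):
    counts.update(h.upper() for h in HOST_RE.findall(log.read_text(errors="ignore")))

for host, n in counts.most_common():
    print(f"{host}: {n} failing log(s)")
```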
Is this isolated to T-W1064-244? That machine appears to hit a state where GenericWorker has hit an issue with privileges. I have quarantined the machine, so it will stop picking up tasks.

Apr 22 08:27:49 T-W1064-MS-244.mdc1.mozilla.com generic-worker: time="2018-04-22T15:27:48Z" level=error msg="Error terminating process 1368: Access is denied." #015
Apr 22 13:23:35 T-W1064-MS-244.mdc1.mozilla.com generic-worker: time="2018-04-22T20:23:34Z" level=error msg="Error terminating process 1344: Access is denied." #015
Apr 22 19:20:22 T-W1064-MS-244.mdc1.mozilla.com generic-worker: time="2018-04-23T02:20:21Z" level=error msg="Error terminating process 1360: Access is denied." #015
Apr 23 02:16:03 T-W1064-MS-244.mdc1.mozilla.com generic-worker: time="2018-04-23T09:16:01Z" level=error msg="Error terminating process 1340: Access is denied." #015
Apr 23 03:54:38 T-W1064-MS-244.mdc1.mozilla.com generic-worker: time="2018-04-23T10:54:37Z" level=error msg="Error terminating process 1352: Access is denied." #015

I will dive in and see if I can find a root cause.
Flags: needinfo?(mcornmesser)
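(Editorial aside, not from the bug itself: one way to dig into the syslog lines above is to pull out the timestamp and PID from each "Error terminating process" error; the PIDs quoted all fall in a narrow range (1340-1368), which might point at the same long-lived process each time. The parsing pattern below is an assumption based only on the lines quoted above.)

```python
# Minimal sketch (not from this bug): extract timestamp and PID from the
# generic-worker "Error terminating process ... Access is denied" lines quoted
# above. The regex is an assumption based only on those sample lines.
import re

LINE_RE = re.compile(
    r'time="(?P<ts>[^"]+)"\s+level=error\s+'
    r'msg="Error terminating process (?P<pid>\d+): Access is denied\."'
)

def parse_errors(syslog_text):
    """Yield (timestamp, pid) for each matching error line."""
    for m in LINE_RE.finditer(syslog_text):
        yield m.group("ts"), int(m.group("pid"))

sample = (
    'Apr 22 08:27:49 T-W1064-MS-244.mdc1.mozilla.com generic-worker: '
    'time="2018-04-22T15:27:48Z" level=error '
    'msg="Error terminating process 1368: Access is denied." #015'
)
for ts, pid in parse_errors(sample):
    print(ts, pid)  # -> 2018-04-22T15:27:48Z 1368
```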
Attached image ms-244.jpg
I was doing some testing and unquarantined the node. The test gets to the point shown in the attached screenshot and just hangs. I have not found anything interesting in the logs yet.
If I click on the window, the test resumes, closes, opens a new window, and hangs again.
I wonder if this is running the same generic-worker version as the other machines? I suspect it is the same and the error is different, but that behavior is what I saw with the 10.6* generic-worker for reftests when we were trying the upgraded agent as a solution to the disk space and log file issues.
Just verified the version is 8.3.0.
Maybe we should reinstall the machine? It would be nice to figure out whether there is a system resource hang, or maybe multiple processes running (we didn't clean up after an earlier job or something?).
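(Editorial aside, not something that was actually run on the machine: a rough sketch of one way to check the "leftover processes" theory above, listing suspicious processes on the Windows test host via tasklist. The process names to look for are an assumption.)

```python
# Minimal sketch (not from this bug): check a Windows test machine for
# processes an earlier job may have left behind. The SUSPECTS names are an
# assumption; tasklist /FO CSV /NH prints one quoted CSV row per process.
import subprocess

SUSPECTS = {"firefox.exe", "plugin-container.exe", "python.exe"}

def leftover_processes():
    out = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    ).stdout
    hits = []
    for line in out.splitlines():
        if not line.strip():
            continue
        name = line.split(",")[0].strip('"')
        if name.lower() in SUSPECTS:
            hits.append(line)
    return hits

if __name__ == "__main__":
    for row in leftover_processes():
        print(row)
```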
I have kicked off an install and set up a Papertrail alert that will email relops if another machine hits this state.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE