Closed Bug 1437407 Opened 6 years ago Closed 6 years ago

Task started, then no logs until deadline-exceeded

Categories

(Taskcluster :: Workers, defect, P2)


Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: nataliaCs, Assigned: wcosta)

References

Details

(Whiteboard: [stockwell infra])

Two failures appeared on m-c, with no error log parsed.
Example: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=a8e153c55eeee93a11e87d325fb20c644421036f&filter-platform=Linux%20x64%20JSDCov%20opt&selectedJob=161504033 

The raw log redirects to https://queue.taskcluster.net/v1/task/K-ST-Z1kS-2h6HbIyxNOzA/runs/0/artifacts/public/logs/live_backing.log
-> getArtifact
   errorCode:  ResourceNotFound
   statusCode: 404

Checking taskcluster: https://tools.taskcluster.net/groups/SVkufmy3SyOukV-EnNzVqA/tasks/K-ST-Z1kS-2h6HbIyxNOzA/runs/0/logs/public%2Flogs%2Flive.log 

-> Failed to fetch.
An error occurred attempting to load the provided log.
Please check the URL and ensure it is reachable.
The task failed because it hung until its deadline (one day after it was created).  I'm not sure how that happened, and it has scrolled off the end of our log retention, so I can't look again.

In theory, the max-run-time (usually an hour or two) should have caused it to terminate earlier than that.  Or, if the worker crashed, it should have been resolved with claim-expired and retried.  If you see this again, please re-open.
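For context, the max-run-time enforcement described above could be sketched roughly as follows. This is a hypothetical illustration of the behavior, not docker-worker's actual code (which is JavaScript); the function name is made up.

```python
# Hypothetical sketch: a worker should kill the task process once maxRunTime
# elapses and resolve the run as failed, instead of letting the queue's
# deadline (a day after task creation) resolve it as deadline-exceeded.
import subprocess


def run_with_max_run_time(argv, max_run_time):
    """Run argv, failing the task if it exceeds max_run_time seconds."""
    proc = subprocess.Popen(argv)
    try:
        returncode = proc.wait(timeout=max_run_time)
        return ("completed", returncode)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        return ("failed", "max-run-time exceeded")
```

A hang like the one in this bug would correspond to this watchdog never firing, so nothing resolved the run before the deadline.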
Component: General → Queue
Summary: Error: Failed to fetch → Task started, then no logs until deadline-exceeded
(In reply to Natalia Csoregi [:nataliaCs] from comment #3)
> Recurred on central, and the rerun was successful:
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=f01d1def46fb53a6523768c6e9188e66b89e664e&filter-searchStr=lt&selectedJob=162643510

The link above is wrong; the correct links are:
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=57abd8e4ab25bc21d9c75dbf43546e7272fddf0e&selectedJob=162565479
https://tools.taskcluster.net/groups/UvzDAAviTKOBOUmKvY2_Wg/tasks/E22tHJmdRe-4CQXuUT6a0w/runs/0/logs/public%2Flogs%2Flive.log
Looking at the instance that ran the task, I can see logging about re-claiming the task:

Feb 16 19:58:37 docker-worker.aws-provisioner.us-east-1e.ami-10272f6a.c3-xlarge.i-011754588353bc4ba docker-worker: {"type":"reclaiming task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-011754588353bc4ba","workerGroup":"us-east-1","workerType":"gecko-t-linux-xlarge","workerNodeType":"c3.xlarge","primaryTaskId":"E22tHJmdRe-4CQXuUT6a0w","primaryRunId":0,"taskId":"E22tHJmdRe-4CQXuUT6a0w","runId":0,"takenUntil":"2018-02-16T20:03:15.694Z"}
Feb 16 19:58:37 docker-worker.aws-provisioner.us-east-1e.ami-10272f6a.c3-xlarge.i-011754588353bc4ba docker-worker: {"type":"reclaimed task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-011754588353bc4ba","workerGroup":"us-east-1","workerType":"gecko-t-linux-xlarge","workerNodeType":"c3.xlarge","primaryTaskId":"E22tHJmdRe-4CQXuUT6a0w","primaryRunId":0,"taskId":"E22tHJmdRe-4CQXuUT6a0w","runId":0,"takenUntil":"2018-02-16T20:18:37.521Z"} 

Looking back to worker startup, this was the first task it took.  I don't see anything suspicious in the logs -- it seems like it's just happily executing that task, reclaiming, logging its aliveness checks, and so on for that whole time.  Which is weird because:

  "maxRunTime": 7200,

Wander, any ideas what might be going on here?
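For reference, the reclaim cycle visible in those log lines can be sketched as below. The safety margin is illustrative; the point is that the worker must reclaim before takenUntil passes, or the queue resolves the run as claim-expired and retries it.

```python
# Sketch of reclaim scheduling: each successful reclaimTask call returns a new
# takenUntil, and the worker schedules the next reclaim shortly before it.
from datetime import datetime, timedelta


def next_reclaim_at(taken_until, margin=timedelta(minutes=3)):
    """Schedule the next reclaim call shortly before the claim expires."""
    return taken_until - margin


# takenUntil from the "reclaimed task" log line above
taken_until = datetime(2018, 2, 16, 20, 18, 37)
```

Since the log shows these reclaims happening on schedule, the queue had no reason to resolve the run as claim-expired, which is why the task sat until its deadline.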
Flags: needinfo?(wcosta)
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #5)
> Looking at the instance that ran the task, I can see logging about
> re-claiming the task: [...]
> 
> Wander, any ideas what might be going on here?

Assigning this bug to me to investigate.
Assignee: nobody → wcosta
Flags: needinfo?(wcosta)
Component: Queue → Docker-Worker
Priority: -- → P2
This bug failed 61 times in the last 7 days, on Windows 10, affecting opt, debug, pgo, and asan build types.
Failing jobs: opt-web-platform-tests-e10s, debug-reftest-e10, opt-web-platform-tests-e10, opt-reftest-e10, opt-talos-svgr-e10.
Link to a recent log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=164880559
Part of the log:
 Unable to read VR Path Registry from C:\Users\GenericWorker\AppData\Local\openvr\openvrpaths.vrpath
16:31:14     INFO -  [Child 3712, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
16:31:14     INFO -  [Child 3712, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
16:31:14     INFO -  [Parent 6976, Gecko_IOThread] WARNING: file z:/build/build/src/ipc/chromium/src/base/process_util_win.cc, line 188
16:37:24     INFO -  REFTEST ERROR | file:///C:/Users/task_1519834749/build/tests/reftest/tests/layout/reftests/svg/smil/transform/use-1.svg | application timed out after 370 seconds with no output
16:37:24     INFO -  REFTEST ERROR | Force-terminating active process(es).
16:37:24     INFO -  REFTEST TEST-INFO | started process screenshot
16:37:24     INFO -  REFTEST TEST-INFO | screenshot: exit 0
16:37:24     INFO -  TEST-INFO | crashinject: exit 0

wcosta: do you have any updates on this bug?
Flags: needinfo?(wcosta)
Whiteboard: [stockwell needswork]
Whiteboard: [stockwell needswork] → [stockwell infra]
I suspect this is related to the period when we ran reftests/talos on Windows 10 Moonshot hardware with taskcluster-worker.
Moving to P1 due to the frequency of occurrence.
Blocks: tc-stability
Status: NEW → ASSIGNED
Flags: needinfo?(wcosta)
NI :pmoore as it seems this affects generic-worker as well.
Flags: needinfo?(pmoore)
(In reply to Arthur Iakab [arthur_iakab] from comment #10)
> This bug failed 61 times in the last 7 days, on Windows 10 affecting
> opt,debug,pgo and asan build types. [...]
> 
> wcosta: do you have any updates on this bug?

The worker behaved correctly in the cited example (https://tools.taskcluster.net/groups/DH7tXg5qRii8JQBRiZbsqQ/tasks/dc7NQnaPS-urwMB9uE82aA/runs/2/logs/public%2Flogs%2Flive_backing.log): the task ran but did not complete within the duration specified in the maxRunTime payload property, so the worker resolved the task as failed.

Therefore, in the case of this Windows task, the problem appears to be with the task rather than the worker, which also aligns with Joel's comment 11.
Flags: needinfo?(pmoore)
There were 30 failures in the last 7 days, on the following platforms:
- android-4-3-armv7-api16 OPT - 1
- android-4-3-armv7-api16 DEBUG - 1
- linux x64 OPT - 4
- linux x64 PGO - 1
- linux x64-noopt DEBUG - 1
- linux x64-stylo-disabled OPT - 1
- linux x64-stylo-disabled DEBUG - 1
- osx-cross OPT - 2
- windows 2012 x64 OPT - 5
- windows 10-64 DEBUG - 1
- windows10-64-ccov DEBUG - 10
- windows10-64-qr DEBUG - 1
- windows2012-32 DEBUG - 1
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=168776987 
Nothing is shown in the parsed log, but opening the taskcluster link https://tools.taskcluster.net/groups/e5xStjolQvmBHJ-k3xomhA/tasks/ULaD9YhITrikpfuz5v6mPw/runs/0/logs/public%2Flogs%2Fchain_of_trust.log produces the error:
Failed to fetch
An error occurred attempting to load the provided log.
Please check the URL and ensure it is reachable.
https://taskcluster-artifacts.net/ULaD9YhITrikpfuz5v6mPw/0/public/logs/chain_of_trust.log

Snippet from the above log:
task: {
  "maxRunTime": 3600,
  "upstreamArtifacts": [
    {
      "formats": [
        "sha2signcode"
      ],
      "paths": [
        "public/build/setup.exe"
      ],
      "taskId": "FFM4oDVXRQOxFNfr-Tekeg",
      "taskType": "build"
    },
    {
      "formats": [
        "sha2signcode",
        "widevine"
      ],
      "paths": [
        "public/build/target.zip"
      ],
      "taskId": "FFM4oDVXRQOxFNfr-Tekeg",
      "taskType": "build"
    }
  ]
}
signing ULaD9YhITrikpfuz5v6mPw workerType differs!
 graph: "gecko-t-win10-64"
 task: "depsigning"
2018-03-18T02:03:42 CRITICAL - Can't find task signing ULaD9YhITrikpfuz5v6mPw in signing:parent e5xStjolQvmBHJ-k3xomhA task-graph.json!
2018-03-18T02:03:42 CRITICAL - Chain of Trust verification error!
Traceback (most recent call last):
  File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 1819, in verify_chain_of_trust
    task_count = await verify_task_types(chain)
  File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 1601, in verify_task_types
    await valid_task_types[task_type](chain, obj)
  File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 1383, in verify_parent_task
    verify_link_in_task_graph(chain, link, target_link)
  File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 974, in verify_link_in_task_graph
    task_link.name, task_link.task_id, decision_link.name, decision_link.task_id
  File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 302, in raise_on_errors
    raise CoTError("\n".join(errors))
scriptworker.exceptions.CoTError: "Can't find task signing ULaD9YhITrikpfuz5v6mPw in signing:parent e5xStjolQvmBHJ-k3xomhA task-graph.json!"
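The failing check in the traceback boils down to a membership test: every task in the chain must appear, by taskId, in its decision task's task-graph.json. A simplified sketch of that check follows; the function and data layout here are illustrative, not scriptworker's actual internals.

```python
# Simplified chain-of-trust check: look up a task by its taskId in the
# parent decision task's task graph, raising the kind of error seen above
# when the task is absent.
def verify_link_in_task_graph(task_graph, task_id, task_name):
    """Return the graph label for task_id, or raise if it is missing."""
    for label, node in task_graph.items():
        if node.get("task_id") == task_id:
            return label
    raise ValueError(
        "Can't find task {} {} in task-graph.json!".format(task_name, task_id)
    )
```

In this bug's instance, the signing task ULaD9YhITrikpfuz5v6mPw was not present in decision task e5xStjolQvmBHJ-k3xomhA's graph, so the lookup failed and chain-of-trust verification aborted.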
Flags: needinfo?(dustin)
Hm, that is a signing error, which does not involve docker-worker -- but this bug is about docker-worker.  I'm not sure what's going on, or why I got the needinfo.
Flags: needinfo?(dustin)
All of the instances mentioned since this bug was filed have no parsed logs, and taskcluster shows the "Failed to fetch" error for each and every one.
The raw logs show this when opened:
{
  "code": "ResourceNotFound",
  "message": "Artifact not found\n----\nmethod:     getArtifact\nerrorCode:  ResourceNotFound\nstatusCode: 404\ntime:       2018-03-21T06:44:04.511Z",
  "requestInfo": {
    "method": "getArtifact",
    "params": {
      "0": "public/logs/live_backing.log",
      "taskId": "K-ST-Z1kS-2h6HbIyxNOzA",
      "runId": "0",
      "name": "public/logs/live_backing.log"
    },
    "payload": {},
    "time": "2018-03-21T06:44:04.511Z"
  }
}
But the link given in taskcluster after the message "Please check the URL and ensure it is reachable." (e.g. in comment 19, https://taskcluster-artifacts.net/ULaD9YhITrikpfuz5v6mPw/0/public/logs/chain_of_trust.log) shows the signing error. Perhaps some mis-classifications led to this; I will look into it.
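For anyone triaging more of these, the 404 body quoted above can be recognized programmatically. The sketch below builds the raw-log URL and classifies the queue's JSON error response; the helper names are illustrative, not part of any Taskcluster client library.

```python
# Build the queue artifact URL for a run and classify the JSON error body
# the queue returns when the artifact is gone (as in the 404 above).
import json


def artifact_url(task_id, run_id, name):
    """Build the queue artifact URL for a given task run."""
    return ("https://queue.taskcluster.net/v1/task/{}/runs/{}/artifacts/{}"
            .format(task_id, run_id, name))


def classify_error(body):
    """Classify a queue JSON error body such as the ResourceNotFound above."""
    err = json.loads(body)
    if err.get("code") == "ResourceNotFound":
        return "artifact expired or was never uploaded"
    return "unexpected error: {}".format(err.get("code"))
```

A ResourceNotFound here covers both cases seen in this bug: logs that scrolled past retention and logs that were never uploaded because the task hung.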
The :ni was there because I looked up the triage manager for docker-worker.

If there is any more info I can provide regarding this, please let me know.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Component: Docker-Worker → Workers