Closed Bug 1437407 Opened 7 years ago Closed 6 years ago
Task started, then no logs until deadline-exceeded
Categories: Taskcluster :: Workers, defect, P2
Tracking: Not tracked
Status: RESOLVED WORKSFORME
People: Reporter: nataliaCs; Assigned: wcosta
Whiteboard: [stockwell infra]
Two failures appeared on m-c, with no error log parsed.
Example: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=a8e153c55eeee93a11e87d325fb20c644421036f&filter-platform=Linux%20x64%20JSDCov%20opt&selectedJob=161504033
The raw log redirects to https://queue.taskcluster.net/v1/task/K-ST-Z1kS-2h6HbIyxNOzA/runs/0/artifacts/public/logs/live_backing.log, which returns:
-> getArtifact, errorCode: ResourceNotFound, statusCode: 404
Checking taskcluster: https://tools.taskcluster.net/groups/SVkufmy3SyOukV-EnNzVqA/tasks/K-ST-Z1kS-2h6HbIyxNOzA/runs/0/logs/public%2Flogs%2Flive.log
-> Failed to fetch.
An error occurred attempting to load the provided log.
Please check the URL and ensure it is reachable.
Comment 2•7 years ago
The task failed because it hung until its deadline (one day after it was created). I'm not sure how that happened, and it has scrolled off the end of our log retention, so I can't look again.
In theory, the max-run-time (usually an hour or two) should have caused it to terminate earlier than that. Or, if the worker crashed, it should have been resolved with claim-expired and retried. If you see this again, please re-open.
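For illustration, the intended max-run-time behavior described above can be sketched as follows (a minimal Python sketch under assumed semantics; run_task is hypothetical, not docker-worker's implementation):

```python
import subprocess
import sys

def run_task(cmd, max_run_time):
    """Run a task command, force-failing it after max_run_time seconds.

    Hypothetical sketch of a worker's max-run-time guard; docker-worker's
    real implementation differs.
    """
    try:
        proc = subprocess.run(cmd, timeout=max_run_time)
        return "completed" if proc.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        # Resolve the task as failed well before the queue's one-day
        # deadline would mark it deadline-exceeded.
        return "failed: max-run-time exceeded"

# A task that outlives its max-run-time is terminated and failed:
print(run_task([sys.executable, "-c", "import time; time.sleep(10)"],
               max_run_time=1))  # prints: failed: max-run-time exceeded
```

The bug here is that neither this guard nor claim-expired retries kicked in, so the task sat until deadline-exceeded.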
Component: General → Queue
Summary: Error: Failed to fetch → Task started, then no logs until deadline-exceeded
Reporter
Comment 3•7 years ago
Reoccurred on central, and the rerun was successful: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=f01d1def46fb53a6523768c6e9188e66b89e664e&filter-searchStr=lt&selectedJob=162643510
Reporter
Comment 4•7 years ago
(In reply to Natalia Csoregi [:nataliaCs] from comment #3)
> Reoccurred on central, and the rerun was successful:
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=f01d1def46fb53a6523768c6e9188e66b89e664e&filter-searchStr=lt&selectedJob=162643510
wrong link ^
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=57abd8e4ab25bc21d9c75dbf43546e7272fddf0e&selectedJob=162565479
https://tools.taskcluster.net/groups/UvzDAAviTKOBOUmKvY2_Wg/tasks/E22tHJmdRe-4CQXuUT6a0w/runs/0/logs/public%2Flogs%2Flive.log
Comment 5•7 years ago
Looking at the instance that ran the task, I can see logging about re-claiming the task:
Feb 16 19:58:37 docker-worker.aws-provisioner.us-east-1e.ami-10272f6a.c3-xlarge.i-011754588353bc4ba docker-worker: {"type":"reclaiming task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-011754588353bc4ba","workerGroup":"us-east-1","workerType":"gecko-t-linux-xlarge","workerNodeType":"c3.xlarge","primaryTaskId":"E22tHJmdRe-4CQXuUT6a0w","primaryRunId":0,"taskId":"E22tHJmdRe-4CQXuUT6a0w","runId":0,"takenUntil":"2018-02-16T20:03:15.694Z"}
Feb 16 19:58:37 docker-worker.aws-provisioner.us-east-1e.ami-10272f6a.c3-xlarge.i-011754588353bc4ba docker-worker: {"type":"reclaimed task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-011754588353bc4ba","workerGroup":"us-east-1","workerType":"gecko-t-linux-xlarge","workerNodeType":"c3.xlarge","primaryTaskId":"E22tHJmdRe-4CQXuUT6a0w","primaryRunId":0,"taskId":"E22tHJmdRe-4CQXuUT6a0w","runId":0,"takenUntil":"2018-02-16T20:18:37.521Z"}
Looking back to worker startup, this was the first task it took. I don't see anything suspicious in the logs -- it seems like it's just happily executing that task, reclaiming, logging its aliveness checks, and so on for that whole time. Which is weird because:
"maxRunTime": 7200,
Wander, any ideas what might be going on here?
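For context, the reclaim cycle visible in those log lines can be sketched like this (simplified; the takenUntil value comes from the logged JSON, while the buffer and function name are assumptions):

```python
from datetime import datetime, timedelta

def next_reclaim_at(taken_until_iso, buffer_seconds=180):
    """Schedule the next reclaim some buffer before takenUntil expires.

    If the worker dies and stops reclaiming, takenUntil passes, the queue
    resolves the run as claim-expired, and the task is retried.
    Sketch only; the buffer and function name are assumptions.
    """
    taken_until = datetime.strptime(taken_until_iso, "%Y-%m-%dT%H:%M:%S.%fZ")
    return taken_until - timedelta(seconds=buffer_seconds)

# takenUntil from the "reclaimed task" log line above:
print(next_reclaim_at("2018-02-16T20:18:37.521Z"))
# prints: 2018-02-16 20:15:37.521000
```

The logs show this cycle running normally, which is why the hang past maxRunTime is surprising: the worker was alive and reclaiming, yet never enforced the run-time limit.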
Flags: needinfo?(wcosta)
Comment hidden (Intermittent Failures Robot) |
Assignee
Comment 7•7 years ago
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #5)
> Looking at the instance that ran the task, I can see logging about re-claiming the task:
>
> Feb 16 19:58:37 docker-worker.aws-provisioner.us-east-1e.ami-10272f6a.c3-xlarge.i-011754588353bc4ba docker-worker: {"type":"reclaiming task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-011754588353bc4ba","workerGroup":"us-east-1","workerType":"gecko-t-linux-xlarge","workerNodeType":"c3.xlarge","primaryTaskId":"E22tHJmdRe-4CQXuUT6a0w","primaryRunId":0,"taskId":"E22tHJmdRe-4CQXuUT6a0w","runId":0,"takenUntil":"2018-02-16T20:03:15.694Z"}
> Feb 16 19:58:37 docker-worker.aws-provisioner.us-east-1e.ami-10272f6a.c3-xlarge.i-011754588353bc4ba docker-worker: {"type":"reclaimed task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-011754588353bc4ba","workerGroup":"us-east-1","workerType":"gecko-t-linux-xlarge","workerNodeType":"c3.xlarge","primaryTaskId":"E22tHJmdRe-4CQXuUT6a0w","primaryRunId":0,"taskId":"E22tHJmdRe-4CQXuUT6a0w","runId":0,"takenUntil":"2018-02-16T20:18:37.521Z"}
>
> Looking back to worker startup, this was the first task it took. I don't see anything suspicious in the logs -- it seems like it's just happily executing that task, reclaiming, logging its aliveness checks, and so on for that whole time. Which is weird because:
>
> "maxRunTime": 7200,
>
> Wander, any ideas what might be going on here?
Assigning this bug to myself to investigate.
Assignee: nobody → wcosta
Flags: needinfo?(wcosta)
Updated•7 years ago
Component: Queue → Docker-Worker
Comment hidden (Intermittent Failures Robot)
Assignee
Updated•7 years ago
Priority: -- → P2
Comment hidden (Intermittent Failures Robot)
Comment 10•7 years ago
This bug failed 61 times in the last 7 days on Windows 10, affecting opt, debug, pgo, and asan build types.
Failing jobs: opt-web-platform-tests-e10s, debug-reftest-e10, opt-web-platform-tests-e10, opt-reftest-e10, opt-talos-svgr-e10.
Link to a recent log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=164880559
Excerpt from the log:
Unable to read VR Path Registry from C:\Users\GenericWorker\AppData\Local\openvr\openvrpaths.vrpath
16:31:14 INFO - [Child 3712, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
16:31:14 INFO - [Child 3712, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
16:31:14 INFO - [Parent 6976, Gecko_IOThread] WARNING: file z:/build/build/src/ipc/chromium/src/base/process_util_win.cc, line 188
16:37:24 INFO - REFTEST ERROR | file:///C:/Users/task_1519834749/build/tests/reftest/tests/layout/reftests/svg/smil/transform/use-1.svg | application timed out after 370 seconds with no output
16:37:24 INFO - REFTEST ERROR | Force-terminating active process(es).
16:37:24 INFO - REFTEST TEST-INFO | started process screenshot
16:37:24 INFO - REFTEST TEST-INFO | screenshot: exit 0
16:37:24 INFO - TEST-INFO | crashinject: exit 0
wcosta: do you have any updates on this bug?
Flags: needinfo?(wcosta)
Whiteboard: [stockwell needswork]
Updated•7 years ago
Whiteboard: [stockwell needswork] → [stockwell infra]
Comment 11•7 years ago
I suspect this is related to the period when we ran reftests/talos on Windows 10 moonshot hardware with taskcluster-worker.
Reporter
Comment 12•7 years ago
This issue occurred on Android 4.3 Api16+ debug as well: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=e20052aa341d6da4505eb84be6e061bb7f37de51&selectedJob=165132839
Log is not parsed, and taskcluster https://tools.taskcluster.net/groups/MR69WC2BTiar8Y2ZtCBjvA/tasks/T9iIMP_fQRGmh4rRNJrA4Q/runs/0/logs/public%2Flogs%2Flive.log mentions the same "failed to fetch" error.
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 14•7 years ago
Moving to P1 due to the frequency of occurrence.
Assignee
Comment 15•7 years ago
NI :pmoore as it seems this affects generic-worker as well.
Flags: needinfo?(pmoore)
Comment 16•7 years ago
(In reply to Arthur Iakab [arthur_iakab] from comment #10)
> This bug failed 61 times in the last 7 days on Windows 10, affecting opt, debug, pgo, and asan build types.
> Failing jobs: opt-web-platform-tests-e10s, debug-reftest-e10, opt-web-platform-tests-e10, opt-reftest-e10, opt-talos-svgr-e10.
> Link to a recent log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=164880559
> Excerpt from the log:
> Unable to read VR Path Registry from C:\Users\GenericWorker\AppData\Local\openvr\openvrpaths.vrpath
> 16:31:14 INFO - [Child 3712, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
> 16:31:14 INFO - [Child 3712, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
> 16:31:14 INFO - [Parent 6976, Gecko_IOThread] WARNING: file z:/build/build/src/ipc/chromium/src/base/process_util_win.cc, line 188
> 16:37:24 INFO - REFTEST ERROR | file:///C:/Users/task_1519834749/build/tests/reftest/tests/layout/reftests/svg/smil/transform/use-1.svg | application timed out after 370 seconds with no output
> 16:37:24 INFO - REFTEST ERROR | Force-terminating active process(es).
> 16:37:24 INFO - REFTEST TEST-INFO | started process screenshot
> 16:37:24 INFO - REFTEST TEST-INFO | screenshot: exit 0
> 16:37:24 INFO - TEST-INFO | crashinject: exit 0
>
> wcosta: do you have any updates on this bug?
The worker behaved correctly in the cited example (https://tools.taskcluster.net/groups/DH7tXg5qRii8JQBRiZbsqQ/tasks/dc7NQnaPS-urwMB9uE82aA/runs/2/logs/public%2Flogs%2Flive_backing.log). A task was executed but did not complete within the duration specified in the maxRunTime payload property, so the worker resolved the task as failed.
Therefore, in the case of this Windows task, it appears to be a problem with the task rather than the worker, which also aligns with Joel's comment 11.
Flags: needinfo?(pmoore)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Reporter
Comment 19•7 years ago
There were 30 failures in the last 7 days, on the following platforms:
- android-4-3-armv7-api16 OPT - 1
- android-4-3-armv7-api16 DEBUG - 1
- linux x64 OPT - 4
- linux x64 PGO - 1
- linux x64-noopt DEBUG - 1
- linux x64-stylo-disabled OPT - 1
- linux x64-stylo-disabled DEBUG - 1
- osx-cross OPT - 2
- windows 2012 x64 OPT - 5
- windows 10-64 DEBUG - 1
- windows10-64-ccov DEBUG - 10
- windows10-64-qr DEBUG - 1
- windows2012-32 DEBUG - 1
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=168776987
As nothing is shown in the parsed log, opening the Taskcluster link https://tools.taskcluster.net/groups/e5xStjolQvmBHJ-k3xomhA/tasks/ULaD9YhITrikpfuz5v6mPw/runs/0/logs/public%2Flogs%2Fchain_of_trust.log produces the error:
Failed to fetch
An error occurred attempting to load the provided log.
Please check the URL and ensure it is reachable.
https://taskcluster-artifacts.net/ULaD9YhITrikpfuz5v6mPw/0/public/logs/chain_of_trust.log
Snippet example from the above log ^:
task: {
"maxRunTime": 3600,
"upstreamArtifacts": [
{
"formats": [
"sha2signcode"
],
"paths": [
"public/build/setup.exe"
],
"taskId": "FFM4oDVXRQOxFNfr-Tekeg",
"taskType": "build"
},
{
"formats": [
"sha2signcode",
"widevine"
],
"paths": [
"public/build/target.zip"
],
"taskId": "FFM4oDVXRQOxFNfr-Tekeg",
"taskType": "build"
}
]
}
signing ULaD9YhITrikpfuz5v6mPw workerType differs!
graph: "gecko-t-win10-64"
task: "depsigning"
2018-03-18T02:03:42 CRITICAL - Can't find task signing ULaD9YhITrikpfuz5v6mPw in signing:parent e5xStjolQvmBHJ-k3xomhA task-graph.json!
2018-03-18T02:03:42 CRITICAL - Chain of Trust verification error!
Traceback (most recent call last):
File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 1819, in verify_chain_of_trust
task_count = await verify_task_types(chain)
File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 1601, in verify_task_types
await valid_task_types[task_type](chain, obj)
File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 1383, in verify_parent_task
verify_link_in_task_graph(chain, link, target_link)
File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 974, in verify_link_in_task_graph
task_link.name, task_link.task_id, decision_link.name, decision_link.task_id
File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 302, in raise_on_errors
raise CoTError("\n".join(errors))
scriptworker.exceptions.CoTError: "Can't find task signing ULaD9YhITrikpfuz5v6mPw in signing:parent e5xStjolQvmBHJ-k3xomhA task-graph.json!"
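The failing check boils down to a membership test: every task in the chain must appear in its parent decision task's task-graph.json. A simplified sketch (hypothetical; scriptworker's real verify_link_in_task_graph does more, such as comparing the stored task definitions):

```python
def verify_link_in_task_graph(task_graph, task_name, task_id, parent_id):
    """Raise if a task is missing from its parent decision task's graph.

    task_graph maps taskId -> task definition, as in task-graph.json.
    Hypothetical sketch, not scriptworker's actual implementation.
    """
    if task_id not in task_graph:
        raise ValueError(
            "Can't find task {} {} in parent {} task-graph.json!".format(
                task_name, task_id, parent_id))

# Reproducing the failure mode from the traceback above:
graph = {"FFM4oDVXRQOxFNfr-Tekeg": {"taskType": "build"}}
try:
    verify_link_in_task_graph(graph, "signing",
                              "ULaD9YhITrikpfuz5v6mPw",
                              "e5xStjolQvmBHJ-k3xomhA")
except ValueError as err:
    print(err)
```

In other words, the signing task's taskId was simply absent from the decision task's graph, which is a task-graph/signing problem, not a worker problem.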
Flags: needinfo?(dustin)
Comment 20•7 years ago
Hm, that is a signing error, which is not on docker-worker -- but this bug is about docker-worker. I'm not sure what's going on, or why I got the needinfo.
Flags: needinfo?(dustin)
Reporter
Comment 21•7 years ago
None of the issues mentioned so far have a parsed log, and Taskcluster only shows the "Failed to fetch" error for each and every one.
The raw logs show this when they're opened:
{
"code": "ResourceNotFound",
"message": "Artifact not found\n----\nmethod: getArtifact\nerrorCode: ResourceNotFound\nstatusCode: 404\ntime: 2018-03-21T06:44:04.511Z",
"requestInfo": {
"method": "getArtifact",
"params": {
"0": "public/logs/live_backing.log",
"taskId": "K-ST-Z1kS-2h6HbIyxNOzA",
"runId": "0",
"name": "public/logs/live_backing.log"
},
"payload": {},
"time": "2018-03-21T06:44:04.511Z"
}
}
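That 404 body can be told apart from a transient fetch failure by its code field; a small parsing sketch based on the error above (the helper name and classification strings are made up, not a Taskcluster API):

```python
import json

# Error body as returned by the queue for the missing artifact (from above,
# timestamp trimmed):
ERROR_BODY = """{
  "code": "ResourceNotFound",
  "message": "Artifact not found\\n----\\nmethod: getArtifact\\nerrorCode: ResourceNotFound\\nstatusCode: 404",
  "requestInfo": {
    "method": "getArtifact",
    "params": {"name": "public/logs/live_backing.log"}
  }
}"""

def classify_artifact_error(body):
    """Distinguish a never-uploaded artifact (404) from other fetch errors.

    Illustrative helper only; the JSON shape follows the raw log above.
    """
    err = json.loads(body)
    if err.get("code") == "ResourceNotFound":
        return "artifact was never uploaded: " + err["requestInfo"]["params"]["name"]
    return "other error: " + err.get("code", "unknown")

print(classify_artifact_error(ERROR_BODY))
# prints: artifact was never uploaded: public/logs/live_backing.log
```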
But checking the link given in Taskcluster after this message
"Please check the URL and ensure it is reachable."
e.g. in comment 19, https://taskcluster-artifacts.net/ULaD9YhITrikpfuz5v6mPw/0/public/logs/chain_of_trust.log shows the signing error. Perhaps there were some mis-classifications that led to this; I will look into it.
The needinfo was there because I looked up the triage owner for docker-worker.
If there is any more info I can provide regarding this, please let me know.
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Updated•6 years ago
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Updated•6 years ago
Component: Docker-Worker → Workers