Closed Bug 1437407 Opened 6 years ago Closed 6 years ago

Task started, then no logs until deadline-exceeded

Categories

(Taskcluster :: Workers, defect, P2)


Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: nataliaCs, Assigned: wcosta)

References

Details

(Whiteboard: [stockwell infra])

Two failures appeared on m-c, with no error log parsed.
Example: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=a8e153c55eeee93a11e87d325fb20c644421036f&filter-platform=Linux%20x64%20JSDCov%20opt&selectedJob=161504033 

The raw log redirects to https://queue.taskcluster.net/v1/task/K-ST-Z1kS-2h6HbIyxNOzA/runs/0/artifacts/public/logs/live_backing.log
-> getArtifact
   errorCode:  ResourceNotFound
   statusCode: 404

Checking taskcluster: https://tools.taskcluster.net/groups/SVkufmy3SyOukV-EnNzVqA/tasks/K-ST-Z1kS-2h6HbIyxNOzA/runs/0/logs/public%2Flogs%2Flive.log 

-> Failed to fetch.
An error occurred attempting to load the provided log.
Please check the URL and ensure it is reachable.
The task failed because it hung until its deadline (one day after it was created).  I'm not sure how that happened, and it has scrolled off the end of our log retention, so I can't look again.

In theory, the max-run-time (usually an hour or two) should have caused it to terminate earlier than that.  Or, if the worker crashed, it should have been resolved with claim-expired and retried.  If you see this again, please re-open.
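For context, the max-run-time enforcement described above could be sketched roughly as follows. This is a hypothetical illustration of the behavior, not docker-worker's actual code (which is JavaScript); the function name is made up.

```python
# Hypothetical sketch: a worker should kill the task process once maxRunTime
# elapses and resolve the run as failed, instead of letting the queue's
# deadline (a day after task creation) resolve it as deadline-exceeded.
import subprocess


def run_with_max_run_time(argv, max_run_time):
    """Run argv, failing the task if it exceeds max_run_time seconds."""
    proc = subprocess.Popen(argv)
    try:
        returncode = proc.wait(timeout=max_run_time)
        return ("completed", returncode)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        return ("failed", "max-run-time exceeded")
```

A hang like the one in this bug would correspond to this watchdog never firing, so nothing resolved the run before the deadline.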
Component: General → Queue
Summary: Error: Failed to fetch → Task started, then no logs until deadline-exceeded
(In reply to Natalia Csoregi [:nataliaCs] from comment #3)
> Recurred on central, and the rerun was successful:
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=f01d1def46fb53a6523768c6e9188e66b89e664e&filter-searchStr=lt&selectedJob=162643510

The link above is wrong; the correct links are:
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=57abd8e4ab25bc21d9c75dbf43546e7272fddf0e&selectedJob=162565479
https://tools.taskcluster.net/groups/UvzDAAviTKOBOUmKvY2_Wg/tasks/E22tHJmdRe-4CQXuUT6a0w/runs/0/logs/public%2Flogs%2Flive.log
Looking at the instance that ran the task, I can see logging about re-claiming the task:

Feb 16 19:58:37 docker-worker.aws-provisioner.us-east-1e.ami-10272f6a.c3-xlarge.i-011754588353bc4ba docker-worker: {"type":"reclaiming task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-011754588353bc4ba","workerGroup":"us-east-1","workerType":"gecko-t-linux-xlarge","workerNodeType":"c3.xlarge","primaryTaskId":"E22tHJmdRe-4CQXuUT6a0w","primaryRunId":0,"taskId":"E22tHJmdRe-4CQXuUT6a0w","runId":0,"takenUntil":"2018-02-16T20:03:15.694Z"}
Feb 16 19:58:37 docker-worker.aws-provisioner.us-east-1e.ami-10272f6a.c3-xlarge.i-011754588353bc4ba docker-worker: {"type":"reclaimed task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-011754588353bc4ba","workerGroup":"us-east-1","workerType":"gecko-t-linux-xlarge","workerNodeType":"c3.xlarge","primaryTaskId":"E22tHJmdRe-4CQXuUT6a0w","primaryRunId":0,"taskId":"E22tHJmdRe-4CQXuUT6a0w","runId":0,"takenUntil":"2018-02-16T20:18:37.521Z"} 

Looking back to worker startup, this was the first task it took.  I don't see anything suspicious in the logs -- it seems like it's just happily executing that task, reclaiming, logging its aliveness checks, and so on for that whole time.  Which is weird because:

  "maxRunTime": 7200,

Wander, any ideas what might be going on here?
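For reference, the reclaim cycle visible in those log lines can be sketched as below. The safety margin is illustrative; the point is that the worker must reclaim before takenUntil passes, or the queue resolves the run as claim-expired and retries it.

```python
# Sketch of reclaim scheduling: each successful reclaimTask call returns a new
# takenUntil, and the worker schedules the next reclaim shortly before it.
from datetime import datetime, timedelta


def next_reclaim_at(taken_until, margin=timedelta(minutes=3)):
    """Schedule the next reclaim call shortly before the claim expires."""
    return taken_until - margin


# takenUntil from the "reclaimed task" log line above
taken_until = datetime(2018, 2, 16, 20, 18, 37)
```

Since the log shows these reclaims happening on schedule, the queue had no reason to resolve the run as claim-expired, which is why the task sat until its deadline.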
Flags: needinfo?(wcosta)
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #5)
> Looking at the instance that ran the task, I can see logging about
> re-claiming the task: [...]
> 
> Wander, any ideas what might be going on here?

Assigning this bug to me to investigate.
Assignee: nobody → wcosta
Flags: needinfo?(wcosta)
Component: Queue → Docker-Worker
Priority: -- → P2
This bug failed 61 times in the last 7 days, on Windows 10, affecting opt, debug, pgo, and asan build types.
Failing jobs: opt-web-platform-tests-e10s, debug-reftest-e10, opt-web-platform-tests-e10, opt-reftest-e10, opt-talos-svgr-e10.
Link to a recent log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=164880559
Part of the log:
 Unable to read VR Path Registry from C:\Users\GenericWorker\AppData\Local\openvr\openvrpaths.vrpath
16:31:14     INFO -  [Child 3712, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
16:31:14     INFO -  [Child 3712, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
16:31:14     INFO -  [Parent 6976, Gecko_IOThread] WARNING: file z:/build/build/src/ipc/chromium/src/base/process_util_win.cc, line 188
16:37:24     INFO -  REFTEST ERROR | file:///C:/Users/task_1519834749/build/tests/reftest/tests/layout/reftests/svg/smil/transform/use-1.svg | application timed out after 370 seconds with no output
16:37:24     INFO -  REFTEST ERROR | Force-terminating active process(es).
16:37:24     INFO -  REFTEST TEST-INFO | started process screenshot
16:37:24     INFO -  REFTEST TEST-INFO | screenshot: exit 0
16:37:24     INFO -  TEST-INFO | crashinject: exit 0

wcosta: do you have any updates on this bug?
Flags: needinfo?(wcosta)
Whiteboard: [stockwell needswork]
Whiteboard: [stockwell needswork] → [stockwell infra]
I suspect this is related to the period when we ran reftests/talos on Windows 10 Moonshot hardware with taskcluster-worker.
Moving to P1 due to the frequency of occurrence.
Blocks: tc-stability
Status: NEW → ASSIGNED
Flags: needinfo?(wcosta)
NI :pmoore as it seems this affects generic-worker as well.
Flags: needinfo?(pmoore)
(In reply to Arthur Iakab [arthur_iakab] from comment #10)
> This bug failed 61 times in the last 7 days, on Windows 10 affecting
> opt,debug,pgo and asan build types. [...]
> 
> wcosta: do you have any updates on this bug?

The worker behaved correctly in the cited example (https://tools.taskcluster.net/groups/DH7tXg5qRii8JQBRiZbsqQ/tasks/dc7NQnaPS-urwMB9uE82aA/runs/2/logs/public%2Flogs%2Flive_backing.log): the task ran but did not complete within the duration specified in the maxRunTime payload property, so the worker resolved the task as failed.

Therefore, in the case of this Windows task, the problem appears to be with the task rather than the worker, which also aligns with Joel's comment 11.
Flags: needinfo?(pmoore)
There were 30 failures in the last 7 days, on the following platforms:
- android-4-3-armv7-api16 OPT - 1
- android-4-3-armv7-api16 DEBUG - 1
- linux x64 OPT - 4
- linux x64 PGO - 1
- linux x64-noopt DEBUG - 1
- linux x64-stylo-disabled OPT - 1
- linux x64-stylo-disabled DEBUG - 1
- osx-cross OPT - 2
- windows 2012 x64 OPT - 5
- windows 10-64 DEBUG - 1
- windows10-64-ccov DEBUG - 10
- windows10-64-qr DEBUG - 1
- windows2012-32 DEBUG - 1
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=168776987 
Nothing is shown in the parsed log, but opening the taskcluster link https://tools.taskcluster.net/groups/e5xStjolQvmBHJ-k3xomhA/tasks/ULaD9YhITrikpfuz5v6mPw/runs/0/logs/public%2Flogs%2Fchain_of_trust.log produces the error:
Failed to fetch
An error occurred attempting to load the provided log.
Please check the URL and ensure it is reachable.
https://taskcluster-artifacts.net/ULaD9YhITrikpfuz5v6mPw/0/public/logs/chain_of_trust.log

Snippet from the above log:
task: {
  "maxRunTime": 3600,
  "upstreamArtifacts": [
    {
      "formats": [
        "sha2signcode"
      ],
      "paths": [
        "public/build/setup.exe"
      ],
      "taskId": "FFM4oDVXRQOxFNfr-Tekeg",
      "taskType": "build"
    },
    {
      "formats": [
        "sha2signcode",
        "widevine"
      ],
      "paths": [
        "public/build/target.zip"
      ],
      "taskId": "FFM4oDVXRQOxFNfr-Tekeg",
      "taskType": "build"
    }
  ]
}
signing ULaD9YhITrikpfuz5v6mPw workerType differs!
 graph: "gecko-t-win10-64"
 task: "depsigning"
2018-03-18T02:03:42 CRITICAL - Can't find task signing ULaD9YhITrikpfuz5v6mPw in signing:parent e5xStjolQvmBHJ-k3xomhA task-graph.json!
2018-03-18T02:03:42 CRITICAL - Chain of Trust verification error!
Traceback (most recent call last):
  File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 1819, in verify_chain_of_trust
    task_count = await verify_task_types(chain)
  File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 1601, in verify_task_types
    await valid_task_types[task_type](chain, obj)
  File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 1383, in verify_parent_task
    verify_link_in_task_graph(chain, link, target_link)
  File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 974, in verify_link_in_task_graph
    task_link.name, task_link.task_id, decision_link.name, decision_link.task_id
  File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 302, in raise_on_errors
    raise CoTError("\n".join(errors))
scriptworker.exceptions.CoTError: "Can't find task signing ULaD9YhITrikpfuz5v6mPw in signing:parent e5xStjolQvmBHJ-k3xomhA task-graph.json!"
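The failing check in the traceback boils down to a membership test: every task in the chain must appear, by taskId, in its decision task's task-graph.json. A simplified sketch of that check follows; the function and data layout here are illustrative, not scriptworker's actual internals.

```python
# Simplified chain-of-trust check: look up a task by its taskId in the
# parent decision task's task graph, raising the kind of error seen above
# when the task is absent.
def verify_link_in_task_graph(task_graph, task_id, task_name):
    """Return the graph label for task_id, or raise if it is missing."""
    for label, node in task_graph.items():
        if node.get("task_id") == task_id:
            return label
    raise ValueError(
        "Can't find task {} {} in task-graph.json!".format(task_name, task_id)
    )
```

In this bug's instance, the signing task ULaD9YhITrikpfuz5v6mPw was not present in decision task e5xStjolQvmBHJ-k3xomhA's graph, so the lookup failed and chain-of-trust verification aborted.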
Flags: needinfo?(dustin)
Hm, that is a signing error, which does not involve docker-worker -- but this bug is about docker-worker.  I'm not sure what's going on, or why I got the needinfo.
Flags: needinfo?(dustin)
All of the instances mentioned since this bug was filed have no parsed logs, and taskcluster shows the "Failed to fetch" error for each and every one.
The raw logs show this when opened:
{
  "code": "ResourceNotFound",
  "message": "Artifact not found\n----\nmethod:     getArtifact\nerrorCode:  ResourceNotFound\nstatusCode: 404\ntime:       2018-03-21T06:44:04.511Z",
  "requestInfo": {
    "method": "getArtifact",
    "params": {
      "0": "public/logs/live_backing.log",
      "taskId": "K-ST-Z1kS-2h6HbIyxNOzA",
      "runId": "0",
      "name": "public/logs/live_backing.log"
    },
    "payload": {},
    "time": "2018-03-21T06:44:04.511Z"
  }
}
But the link given in taskcluster after the message "Please check the URL and ensure it is reachable." (e.g. in comment 19, https://taskcluster-artifacts.net/ULaD9YhITrikpfuz5v6mPw/0/public/logs/chain_of_trust.log) shows the signing error. Perhaps some mis-classifications led to this; I will look into it.
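For anyone triaging more of these, the 404 body quoted above can be recognized programmatically. The sketch below builds the raw-log URL and classifies the queue's JSON error response; the helper names are illustrative, not part of any Taskcluster client library.

```python
# Build the queue artifact URL for a run and classify the JSON error body
# the queue returns when the artifact is gone (as in the 404 above).
import json


def artifact_url(task_id, run_id, name):
    """Build the queue artifact URL for a given task run."""
    return ("https://queue.taskcluster.net/v1/task/{}/runs/{}/artifacts/{}"
            .format(task_id, run_id, name))


def classify_error(body):
    """Classify a queue JSON error body such as the ResourceNotFound above."""
    err = json.loads(body)
    if err.get("code") == "ResourceNotFound":
        return "artifact expired or was never uploaded"
    return "unexpected error: {}".format(err.get("code"))
```

A ResourceNotFound here covers both cases seen in this bug: logs that scrolled past retention and logs that were never uploaded because the task hung.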
The :ni was there because I looked up the triage manager for docker-worker.

If there is any more info I can provide regarding this, please let me know.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Component: Docker-Worker → Workers