[tracker] Give more information for a deadline-exceeded error

RESOLVED WORKSFORME

Status

Taskcluster
General
RESOLVED WORKSFORME
3 years ago
2 years ago

People

(Reporter: armenzg, Unassigned)

Tracking

Details

"Reason Resolved: deadline-exceeded"

Presumably the spot node it ran on died. When this happens the log isn't stored.
It should be automatically retried after `takenUntil` has expired, but when this happened deadline had exceeded.

Or you managed to create a command that doesn't completed in an hour and it expired by deadline before maxRunTime was exceeded.
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #1)
> "Reason Resolved: deadline-exceeded"
> 
> Presumably the spot node it ran on died. When this happens the log isn't
> stored.
> It should be automatically retried after `takenUntil` has expired, but when
> this happened deadline had exceeded.
> 
> Or you managed to create a command that doesn't completed in an hour and it
> expired by deadline before maxRunTime was exceeded.

Is there a way to lead the user to this specific information? This was very useful.

For the second part, how could I know that it was the latter reason? Why should there not be a log showing the output plus the killing because we passed the maxRunTime?
If you exceed maxRunTime there should be a log that reflects this...

Note, I managed to run the task here:
https://tools.taskcluster.net/task-inspector/#UMa16mqoSHqp1NJvgQmEyw/0
(The command doesn't work)


Anyways, talking to gandt we suspect something died along way... Perhaps because deadline somehow exceeded without docker-worker doing something... We're looking at logs...
Component: TaskCluster → General
Product: Testing → Taskcluster
The task is gone now, and I don't think there's any specific action to take on this bug.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WORKSFORME
Summary: Task submitted does not show log → Give more information for a deadline-exceeded error
"Provide a treeherder-parseable message and a log that treeherder knows to link to as a log, or RETRY the job, instead of expecting that sheriffs will grovel around until they find that it's just a totally useless failure message which should have been retried" seems like a specific action to take.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Philor, can you point to another task where this occurred?  deadline-exceeded does cause a re-run of the task, so this is still WORKSFORME as far as I know.
Ugh, greg reminds me I was confusing deadline-exceeded (which is permanent) with claim-expired (which results in a re-run).  At any rate, this was an unexpected situation that occurred over a year ago, and all of the logs are gone now.  In fact, the bug is likely to have been fixed in the last year, if it hasn't recurred since then.

Philor, I figure something happened to cause you to re-open this, so please provide some more information.
Flags: needinfo?(philringnalda)

Comment 8

2 years ago
I think some of these tasks resolved as "deadline-exceeded" have not been showing up on treeherder as much since the bug was originally opened with the fix made in 1148965. (tasks that have a run created just to report an exception will not be reported to treeherder)

Comment 9

2 years ago
I think there could be situations where a run is resolved by the queue with deadline-exceeded as the reason while the task is currently being executed by the worker.  

I would like to see a more recent task that this happened to use as reference to figure out the timing of the events and how we can better prevent this within the workers so that a log is uploaded (or at least an attempt is made to do so).

Also, mozilla-taskcluster could use an update to try to include the log when available.  Right now from what I can tell when the run is resolved as an exception, it does not insert a log reference into the job data submitted to treeherder.  This should help in cases of an exception being reported because of claim-expired or worker-shutdown as well.
Thanks!  So, that instance started at

>  Apr 05 17:54:55 docker-worker.aws-provisioner.us-east-1a.ami-ef363b85.m3-xlarge.i-001a931d9a0ee6b31 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="1650" x-info="http://www.rsyslog.com"] start 

and claimed the task at

> Apr 05 18:01:06 docker-worker.aws-provisioner.us-east-1a.ami-ef363b85.m3-xlarge.i-001a931d9a0ee6b31 docker-worker: {"type":"claim task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-001a931d9a0ee6b31","workerGroup":"us-east-1a","workerType":"desktop-test-xlarge","workerNodeType":"m3.xlarge","taskId":"BWlHKQLSSvmM7xswdRq70Q","runId":0} 

It reclaimed every periodically for a VERY long time, starting with

>  Apr 05 18:16:29 docker-worker.aws-provisioner.us-east-1a.ami-ef363b85.m3-xlarge.i-001a931d9a0ee6b31 docker-worker: {"type":"reclaiming task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-001a931d9a0ee6b31","workerGroup":"us-east-1a","workerType":"desktop-test-xlarge","workerNodeType":"m3.xlarge","primaryTaskId":"BWlHKQLSSvmM7xswdRq70Q","primaryRunId":0,"taskId":"BWlHKQLSSvmM7xswdRq70Q","runId":0,"takenUntil":"2016-04-06T01:21:06.016Z"} 

until it finally got an error reclaiming due to deadline exceeded, just under 23 hours later.  Note that this is far longer than the maxRunTime!

> Apr 06 17:20:27 docker-worker.aws-provisioner.us-east-1a.ami-ef363b85.m3-xlarge.i-001a931d9a0ee6b31 docker-worker: {"type":"reclaiming task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-001a931d9a0ee6b31","workerGroup":"us-east-1a","workerType":"desktop-test-xlarge","workerNodeType":"m3.xlarge","primaryTaskId":"BWlHKQLSSvmM7xswdRq70Q","primaryRunId":0,"taskId":"BWlHKQLSSvmM7xswdRq70Q","runId":0,"takenUntil":"2016-04-07T00:25:05.619Z"}
> Apr 06 17:20:27 docker-worker.aws-provisioner.us-east-1a.ami-ef363b85.m3-xlarge.i-001a931d9a0ee6b31 docker-worker: {"type":"error reclaiming task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-001a931d9a0ee6b31","workerGroup":"us-east-1a","workerType":"desktop-test-xlarge","workerNodeType":"m3.xlarge","error":"Could not reclaim task. Error: Task BWlHKQLSSvmM7xswdRq70Q Can't be reclaimed past it's deadline of 2016-04-07T00:06:59.365Z.\n----\nerrorCode:  RequestConflict\nstatusCode: 409\nrequestInfo:\n  method:   reclaimTask\n  params:   {\"taskId\":\"BWlHKQLSSvmM7xswdRq70Q\",\"runId\":\"0\"}\n  payload:  {}\n  time:     2016-04-07T00:20:27.428Z\ndetails:\n{\n  \"taskId\": \"BWlHKQLSSvmM7xswdRq70Q\",\n  \"deadline\": \"2016-04-07T00:06:59.365Z\"\n}\n    at /home/ubuntu/docker_worker/node_modules/taskcluster-client/lib/client.js:322:21\n    at tryCallOne (/home/ubuntu/docker_worker/node_modules/promise/lib/core.js:37:12)\n    at /home/ubuntu/docker_worker/node_modules/promise/lib/core.js:123:15\n    at flush (/home/ubuntu/docker_worker/node_modules/promise/node_modules/asap/raw.js:50:29)\n    at process.
> Apr 06 17:20:27 docker-worker.aws-provisioner.us-east-1a.ami-ef363b85.m3-xlarge.i-001a931d9a0ee6b31 docker-worker: _tickDomainCallback (node.js:381:11)","primaryTaskId":"BWlHKQLSSvmM7xswdRq70Q","primaryRunId":0,"taskId":"BWlHKQLSSvmM7xswdRq70Q","runId":0,"takenUntil":"2016-04-07T00:25:05.619Z"}
> Apr 06 17:20:27 docker-worker.aws-provisioner.us-east-1a.ami-ef363b85.m3-xlarge.i-001a931d9a0ee6b31 docker-worker: {"type":"cancel task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-001a931d9a0ee6b31","workerGroup":"us-east-1a","workerType":"desktop-test-xlarge","workerNodeType":"m3.xlarge","taskId":"BWlHKQLSSvmM7xswdRq70Q","runId":0,"message":"Could not reclaim task. Error: Task BWlHKQLSSvmM7xswdRq70Q Can't be reclaimed past it's deadline of 2016-04-07T00:06:59.365Z.\n----\nerrorCode:  RequestConflict\nstatusCode: 409\nrequestInfo:\n  method:   reclaimTask\n  params:   {\"taskId\":\"BWlHKQLSSvmM7xswdRq70Q\",\"runId\":\"0\"}\n  payload:  {}\n  time:     2016-04-07T00:20:27.428Z\ndetails:\n{\n  \"taskId\": \"BWlHKQLSSvmM7xswdRq70Q\",\n  \"deadline\": \"2016-04-07T00:06:59.365Z\"\n}\n    at /home/ubuntu/docker_worker/node_modules/taskcluster-client/lib/client.js:322:21\n    at tryCallOne (/home/ubuntu/docker_worker/node_modules/promise/lib/core.js:37:12)\n    at /home/ubuntu/docker_worker/node_modules/promise/lib/core.js:123:15\n    at flush (/home/ubuntu/docker_worker/node_modules/promise/node_modules
> Apr 06 17:20:27 docker-worker.aws-provisioner.us-east-1a.ami-ef363b85.m3-xlarge.i-001a931d9a0ee6b31 docker-worker: /asap/raw.js:50:29)\n    at process._tickDomainCallback (node.js:381:11)"} 

and the instance appears to have been running ever since (39 hours and counting now).  So hopefully that's some good data to go on.
Oh, been a while since I saw these, I'd forgotten deadline-exceeded was almost a day later.

Long before that everyone has lost any interest in seeing the job run, so ignore everything I've said here and elsewhere about RETRY, nobody will ever see the result if you do. The part that would make this class of failure sheriff-manageable would be to make deadline-exceeded much shorter and to provide a log and a parseable error message, but without the much shorter they mostly just won't ever be seen.
Thank you! this is very helpful so I'll begin checking out these instances.
I have opened up the following bugs to address separate issues with this:

bug 1262959
bug 1262960

Updated

2 years ago
Depends on: 1262960, 1262959
Summary: Give more information for a deadline-exceeded error → [tracker] Give more information for a deadline-exceeded error

Comment 16

2 years ago
5 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-aurora: 5

Platform breakdown:
* linux64: 5

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1107970&startday=2016-04-18&endday=2016-04-24&tree=all
I don't think there's any better information we can give here -- from the bugs Greg added, this only occurs when there are unexpected bugs, and by their nature they're unexpected and require investigation and fixing.
Status: REOPENED → RESOLVED
Last Resolved: 2 years ago2 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.