Bugzilla

Comment 4

•

8 years ago

The task is gone now, and I don't think there's any specific action to take on this bug.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → WORKSFORME

Summary: Task submitted does not show log → Give more information for a deadline-exceeded error

Comment 5

•

8 years ago

"Provide a treeherder-parseable message and a log that treeherder knows to link to as a log, or RETRY the job, instead of expecting that sheriffs will grovel around until they find that it's just a totally useless failure message which should have been retried" seems like a specific action to take.

Status: RESOLVED → REOPENED

Resolution: WORKSFORME → ---

Comment 6

•

8 years ago

Philor, can you point to another task where this occurred?  deadline-exceeded does cause a re-run of the task, so this is still WORKSFORME as far as I know.

Comment 7

•

8 years ago

Ugh, greg reminds me I was confusing deadline-exceeded (which is permanent) with claim-expired (which results in a re-run).  At any rate, this was an unexpected situation that occurred over a year ago, and all of the logs are gone now.  In fact, the bug is likely to have been fixed in the last year, if it hasn't recurred since then.

Philor, I figure something happened to cause you to re-open this, so please provide some more information.

Flags: needinfo?(philringnalda)

Comment 8

•

8 years ago

I think some of these tasks resolved as "deadline-exceeded" have not been showing up on treeherder as much since the bug was originally opened with the fix made in 1148965. (tasks that have a run created just to report an exception will not be reported to treeherder)

Comment 9

•

8 years ago

I think there could be situations where a run is resolved by the queue with deadline-exceeded as the reason while the task is currently being executed by the worker.  

I would like to see a more recent task that this happened to use as reference to figure out the timing of the events and how we can better prevent this within the workers so that a log is uploaded (or at least an attempt is made to do so).

Also, mozilla-taskcluster could use an update to try to include the log when available.  Right now from what I can tell when the run is resolved as an exception, it does not insert a log reference into the job data submitted to treeherder.  This should help in cases of an exception being reported because of claim-expired or worker-shutdown as well.

Comment 10

•

8 years ago

Given their lack of logs, it's hard to even point to them.

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=336759fb7df024ec7fe456df8337316e8361c4d5&selectedJob=25234015&filter-searchStr=cbc56d6ed036f0e3591efe72f5333a1aa2ff9f5c

Flags: needinfo?(philringnalda)

Comment 11

•

8 years ago

Thanks!  So, that instance started at

>  Apr 05 17:54:55 docker-worker.aws-provisioner.us-east-1a.ami-ef363b85.m3-xlarge.i-001a931d9a0ee6b31 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="1650" x-info="http://www.rsyslog.com"] start 

and claimed the task at

> Apr 05 18:01:06 docker-worker.aws-provisioner.us-east-1a.ami-ef363b85.m3-xlarge.i-001a931d9a0ee6b31 docker-worker: {"type":"claim task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-001a931d9a0ee6b31","workerGroup":"us-east-1a","workerType":"desktop-test-xlarge","workerNodeType":"m3.xlarge","taskId":"BWlHKQLSSvmM7xswdRq70Q","runId":0} 

It reclaimed every periodically for a VERY long time, starting with

>  Apr 05 18:16:29 docker-worker.aws-provisioner.us-east-1a.ami-ef363b85.m3-xlarge.i-001a931d9a0ee6b31 docker-worker: {"type":"reclaiming task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-001a931d9a0ee6b31","workerGroup":"us-east-1a","workerType":"desktop-test-xlarge","workerNodeType":"m3.xlarge","primaryTaskId":"BWlHKQLSSvmM7xswdRq70Q","primaryRunId":0,"taskId":"BWlHKQLSSvmM7xswdRq70Q","runId":0,"takenUntil":"2016-04-06T01:21:06.016Z"} 

until it finally got an error reclaiming due to deadline exceeded, just under 23 hours later.  Note that this is far longer than the maxRunTime!

> Apr 06 17:20:27 docker-worker.aws-provisioner.us-east-1a.ami-ef363b85.m3-xlarge.i-001a931d9a0ee6b31 docker-worker: {"type":"reclaiming task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-001a931d9a0ee6b31","workerGroup":"us-east-1a","workerType":"desktop-test-xlarge","workerNodeType":"m3.xlarge","primaryTaskId":"BWlHKQLSSvmM7xswdRq70Q","primaryRunId":0,"taskId":"BWlHKQLSSvmM7xswdRq70Q","runId":0,"takenUntil":"2016-04-07T00:25:05.619Z"}
> Apr 06 17:20:27 docker-worker.aws-provisioner.us-east-1a.ami-ef363b85.m3-xlarge.i-001a931d9a0ee6b31 docker-worker: {"type":"error reclaiming task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-001a931d9a0ee6b31","workerGroup":"us-east-1a","workerType":"desktop-test-xlarge","workerNodeType":"m3.xlarge","error":"Could not reclaim task. Error: Task BWlHKQLSSvmM7xswdRq70Q Can't be reclaimed past it's deadline of 2016-04-07T00:06:59.365Z.\n----\nerrorCode:  RequestConflict\nstatusCode: 409\nrequestInfo:\n  method:   reclaimTask\n  params:   {\"taskId\":\"BWlHKQLSSvmM7xswdRq70Q\",\"runId\":\"0\"}\n  payload:  {}\n  time:     2016-04-07T00:20:27.428Z\ndetails:\n{\n  \"taskId\": \"BWlHKQLSSvmM7xswdRq70Q\",\n  \"deadline\": \"2016-04-07T00:06:59.365Z\"\n}\n    at /home/ubuntu/docker_worker/node_modules/taskcluster-client/lib/client.js:322:21\n    at tryCallOne (/home/ubuntu/docker_worker/node_modules/promise/lib/core.js:37:12)\n    at /home/ubuntu/docker_worker/node_modules/promise/lib/core.js:123:15\n    at flush (/home/ubuntu/docker_worker/node_modules/promise/node_modules/asap/raw.js:50:29)\n    at process.
> Apr 06 17:20:27 docker-worker.aws-provisioner.us-east-1a.ami-ef363b85.m3-xlarge.i-001a931d9a0ee6b31 docker-worker: _tickDomainCallback (node.js:381:11)","primaryTaskId":"BWlHKQLSSvmM7xswdRq70Q","primaryRunId":0,"taskId":"BWlHKQLSSvmM7xswdRq70Q","runId":0,"takenUntil":"2016-04-07T00:25:05.619Z"}
> Apr 06 17:20:27 docker-worker.aws-provisioner.us-east-1a.ami-ef363b85.m3-xlarge.i-001a931d9a0ee6b31 docker-worker: {"type":"cancel task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-001a931d9a0ee6b31","workerGroup":"us-east-1a","workerType":"desktop-test-xlarge","workerNodeType":"m3.xlarge","taskId":"BWlHKQLSSvmM7xswdRq70Q","runId":0,"message":"Could not reclaim task. Error: Task BWlHKQLSSvmM7xswdRq70Q Can't be reclaimed past it's deadline of 2016-04-07T00:06:59.365Z.\n----\nerrorCode:  RequestConflict\nstatusCode: 409\nrequestInfo:\n  method:   reclaimTask\n  params:   {\"taskId\":\"BWlHKQLSSvmM7xswdRq70Q\",\"runId\":\"0\"}\n  payload:  {}\n  time:     2016-04-07T00:20:27.428Z\ndetails:\n{\n  \"taskId\": \"BWlHKQLSSvmM7xswdRq70Q\",\n  \"deadline\": \"2016-04-07T00:06:59.365Z\"\n}\n    at /home/ubuntu/docker_worker/node_modules/taskcluster-client/lib/client.js:322:21\n    at tryCallOne (/home/ubuntu/docker_worker/node_modules/promise/lib/core.js:37:12)\n    at /home/ubuntu/docker_worker/node_modules/promise/lib/core.js:123:15\n    at flush (/home/ubuntu/docker_worker/node_modules/promise/node_modules
> Apr 06 17:20:27 docker-worker.aws-provisioner.us-east-1a.ami-ef363b85.m3-xlarge.i-001a931d9a0ee6b31 docker-worker: /asap/raw.js:50:29)\n    at process._tickDomainCallback (node.js:381:11)"} 

and the instance appears to have been running ever since (39 hours and counting now).  So hopefully that's some good data to go on.

Comment 12

•

8 years ago

Oh, been a while since I saw these, I'd forgotten deadline-exceeded was almost a day later.

Long before that everyone has lost any interest in seeing the job run, so ignore everything I've said here and elsewhere about RETRY, nobody will ever see the result if you do. The part that would make this class of failure sheriff-manageable would be to make deadline-exceeded much shorter and to provide a log and a parseable error message, but without the much shorter they mostly just won't ever be seen.

Comment 13

•

8 years ago

https://tools.taskcluster.net/task-inspector/#THPB9OnjR2yZ9wiAdx_8XQ/0
https://tools.taskcluster.net/task-inspector/#ZGm78zfEQ5e2LhHxdLz9QA/0
https://tools.taskcluster.net/task-inspector/#ba_DsIsnTD6muBmTAJqgUw/0

are probably headed toward the same fate, but only 18/18/16 hours into the process.

Comment 14

•

8 years ago

Thank you! this is very helpful so I'll begin checking out these instances.

Comment 15

•

8 years ago

I have opened up the following bugs to address separate issues with this:

bug 1262959
bug 1262960

Selena Deckelmann :selenamarie :selena

Updated

•

8 years ago

Depends on: 1262960, 1262959

Updated

•

8 years ago

Summary: Give more information for a deadline-exceeded error → [tracker] Give more information for a deadline-exceeded error

Comment hidden (Intermittent Failures Robot)

5 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-aurora: 5

Platform breakdown:
* linux64: 5

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1107970&startday=2016-04-18&endday=2016-04-24&tree=all