Bug 1503430 (closed): release action tasks timing out but successfully submitting new task group

Opened 6 years ago; closed 6 years ago.
Categories: Taskcluster :: Services, defect, P1
Tracking: not tracked
Status: RESOLVED FIXED
People: Reporter jlund; Unassigned
Description
<jlund> dustin: getting hangs with gecko-3-decision worker based tasks for release graphs.
14:32:52 e.g. https://tools.taskcluster.net/groups/MH723l7ZTfyYMNX5NXvVEQ/tasks/E0yoL16jTu-PenRG_2Dsgg/runs/0
14:33:17 unclear what's happening behind the scenes. Seems graph is getting submitted okay but task doesn't complete
14:33:57 same with https://tools.taskcluster.net/tasks/bKu596bqSRC1Lmd67jJj_A
14:34:05 blocking 63.0.1. Any tips?
14:34:33 <dustin> hm
14:35:29 <tomprince> Definitely seeing some spot instance termination with that worker type: https://tools.taskcluster.net/aws-provisioner/gecko-3-decision/errors
14:38:28 <dustin> that doesn't *appear* to be what happened here
14:40:09 <aki> 63.0rc took 10min to run https://tools.taskcluster.net/groups/NUU3EwFFT4q5kLZmmSD8PQ/tasks/SObDu5XHQTGgFHTOz4IuXQ/runs/0
14:40:48 so we more than doubled that time in 63.0.1 bKu596bqSRC1Lmd67jJj_A
14:40:51 run0 and 1 at least
14:43:18 <•jlund> Hm. Wonder if I didn't give enough time for build2 run0 before cancelling.
14:43:40 I figured I should cancel given there was a partial graph but it was still running.
14:43:44 <dustin> Oct 30 20:27:59 docker-worker.aws-provisioner.us-east-1c.ami-17373968.c5-xlarge.i-0af3223c7c085631b docker-worker: {"type":"reclaiming task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0af3223c7c085631b","workerGroup":"us-east-1","workerType":"gecko-3-decision","workerNodeType":"c5.xlarge","primaryTaskId":"bKu596bqSRC1Lmd67jJj_A","primaryRunId":0,"taskId":"bKu596bqSRC1Lmd67jJj_A","runId":0,"takenUntil":"2018-10-30T20:28:45.324Z"}
14:44:37 it's been hard to tell the difference between failure modes today
14:44:40 lots and lots of hg failures
14:45:01 (of which this isn't one)
14:45:36 so that log above, it called reclaimTask and got a takenUntil 46 seconds in the future
14:45:39 which makes no sense
14:45:56 https://github.com/taskcluster/taskcluster-queue/blob/master/src/api.js#L1560
14:45:59 claimTimeout is 1200
14:46:13 <aki> clock skew?
14:46:16 <dustin> maybe
14:50:36 I wonder if papertrail throws out log entries from "the future"
14:50:37 hm
14:50:42 I can't find any of those reclaim API calls in the queue logs
14:51:05 <•mtabara> nalexander: https://bugzilla.mozilla.org/show_bug.cgi?id=1503392#c3
14:51:06 <firebot> Bug 1503392 is not accessible
14:52:11 ⇐ stpeter quit (stpeter@moz-nmh.7pt.25.76.IP) Client exited
14:52:19 <nalexander> mtabara: beautiful.
14:53:21 <dustin> oh
14:53:32 maybe takenUntil is the *previous* takenUntil
14:53:37 <•mtabara> np. happy to expand on any of those if needed
14:54:34 <dustin> yeah, ok, that sort of makes sense then
14:54:59 so the queue never got that request
14:59:12 https://status.heroku.com/incidents/1655
14:59:37 <aki> ah
15:00:16 <dustin> I suspect that's responsible for the worker's inability to reclaim

[... after Heroku resolved the issue:]

<jlund> dustin: tomprince aki - looks like heroku issue resolved. will give a few min then try build3
15:34:19 <aki> awesome
16:02:57 <•jlund> dustin: 16m running. usually <10m. https://tools.taskcluster.net/groups/MH723l7ZTfyYMNX5NXvVEQ/tasks/VxMf3uF2TJOc0p4jDrnWrw/runs/0
16:03:10 not sure if it helps to debug while live
16:03:21 graph exists https://tools.taskcluster.net/groups/VxMf3uF2TJOc0p4jDrnWrw
16:10:02 bstack: ^ need help debugging to unblock 63.0.1
16:10:15 — bstack looks
16:10:28 <nthomas> just hit claim expired on task VxMf3uF2TJOc0p4jDrnWrw
16:10:43 <•jlund> see context above; determined at the time to be https://status.heroku.com/incidents/1655 as the cause of the expiring
16:11:52 <bstack> do I remember correctly that there is a bug for this? I think I saw restarting the queue helped last time
16:11:56 I will do that now to unblock
16:11:56 → kwierso joined (sid1594@moz-gmnqpf.n4vs.eb63.0101.2620.IP)
16:12:04 <bstack> and increase priority of figuring out what is going on
16:12:23 <•jlund> thank you. can't recall: can ciduty restart now?
16:12:33 ah no, waiting on something
16:13:00 <bstack> yeah
16:13:19 plus I don't want to get into a world where we're manually restarting services frequently anyway
16:13:33 it should be something that bugs tc engineers so that we're motivated to fix it
16:13:49 restarted
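For context on the takenUntil discussion above: a worker keeps its claim on a task alive by calling reclaimTask before takenUntil expires, and each successful reclaim moves takenUntil forward by the queue's claimTimeout (1200 s here). Below is a minimal sketch of such a loop; the names, types, and 2-minute safety margin are assumptions for illustration, not the actual docker-worker code.

    // TypeScript sketch of a claim/reclaim loop (illustrative only).
    interface Claim {
      taskId: string;
      runId: number;
      takenUntil: Date; // claim expires at this time unless reclaimed
    }

    // Placeholder for the queue's reclaimTask API call.
    type ReclaimFn = (taskId: string, runId: number) => Promise<Claim>;

    const RECLAIM_SAFETY_MS = 120_000; // assumed margin before takenUntil

    async function keepClaimAlive(claim: Claim, reclaimTask: ReclaimFn): Promise<void> {
      for (;;) {
        // Sleep until shortly before the current claim expires.
        const wakeAt = claim.takenUntil.getTime() - RECLAIM_SAFETY_MS;
        await new Promise<void>((r) => setTimeout(r, Math.max(0, wakeAt - Date.now())));
        try {
          // On success the queue returns a new takenUntil, claimTimeout
          // (1200 s in the log above) past "now", so the claim moves
          // roughly 20 minutes into the future.
          claim = await reclaimTask(claim.taskId, claim.runId);
        } catch (err) {
          // If the request never reaches the queue (e.g. a routing-layer
          // outage), takenUntil is NOT extended; once it passes, the queue
          // resolves the run as claim-expired, as nthomas observed.
          if (Date.now() >= claim.takenUntil.getTime()) throw err;
          await new Promise<void>((r) => setTimeout(r, 30_000)); // brief backoff, retry
        }
      }
    }

This also matches dustin's reading of the log: a reclaim response carrying a takenUntil only 46 seconds out makes sense if it was the previous takenUntil, i.e. the new reclaim request never actually reached the queue.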
Updated • 6 years ago (Reporter)
Severity: normal → blocker
Priority: -- → P1
Comment 1 • 6 years ago (Reporter)
Verifying the fix with a new shipit go-to-build.
Comment 2 • 6 years ago (Reporter)
https://tools.taskcluster.net/groups/MH723l7ZTfyYMNX5NXvVEQ/tasks/eAQCnoHFQtKgrtaNXSZTYA/runs/0

Closing for now.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Comment 3 • 6 years ago
I think the original, Heroku-caused issue was that reclaim calls were hanging for 30 s in the Heroku routing layer and then failing with a 5xx response. The worker would have retried a few times, but it had given itself only about 90 seconds in which to reclaim, so the retries probably overran that window. I would not be surprised to learn that the sudden influx of traffic once the incident resolved got us into trouble with Azure, and bug 1503062 shows that the dependency resolver does not recover from Azure errors.
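To make the overrun concrete, here is a rough back-of-the-envelope sketch. The 30 s hang and ~90 s budget are the figures quoted above; the retry count and backoff schedule are assumptions.

    // Illustrative arithmetic only; not the worker's actual retry logic.
    const HANG_PER_ATTEMPT_S = 30;  // each reclaim call stalls in the router, then 5xx
    const RECLAIM_BUDGET_S = 90;    // time the worker allows itself before takenUntil
    const BACKOFFS_S = [5, 10, 20]; // hypothetical backoff between attempts

    let elapsed = 0;
    for (let attempt = 1; attempt <= BACKOFFS_S.length; attempt++) {
      elapsed += HANG_PER_ATTEMPT_S; // the attempt hangs for 30 s, then fails
      console.log(`attempt ${attempt} fails at t=${elapsed}s`);
      if (elapsed > RECLAIM_BUDGET_S) break; // takenUntil has already passed
      elapsed += BACKOFFS_S[attempt - 1];    // wait before retrying
    }
    // attempt 1 fails at t=30s, attempt 2 at t=65s, attempt 3 at t=105s:
    // the third attempt already overruns the ~90 s reclaim window, so the
    // claim expires even though the worker retried "correctly".

Under these assumptions the claim is lost after only two retries, which is consistent with the claim-expired runs seen during the incident.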
Comment 4 • 6 years ago
In fact, I see a big bump in task messages around 22:12-23:30 that caused PulseGuardian to threaten to delete our queues (it didn't delete them). That suggests we had a very busy spike around that time.
Updated • 5 years ago (Assignee)
Component: Queue → Services