Closed Bug 1503430 Opened 6 years ago Closed 6 years ago

release action tasks timing out but successfully submitting new task group

Categories

(Taskcluster :: Services, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Unassigned)

References

Details

<jlund> dustin: getting hangs with gecko-3-decision worker-based tasks for release graphs.
14:32:52 e.g. https://tools.taskcluster.net/groups/MH723l7ZTfyYMNX5NXvVEQ/tasks/E0yoL16jTu-PenRG_2Dsgg/runs/0
14:33:17 unclear what's happening behind the scenes. Seems graph is getting submitted okay but task doesn't complete
14:33:57 same with https://tools.taskcluster.net/tasks/bKu596bqSRC1Lmd67jJj_A
14:34:05 blocking 63.0.1. Any tips?
14:34:33 
<dustin> hm
14:35:29 
<tomprince> Definitely seeing some spot instance termination with that worker type: https://tools.taskcluster.net/aws-provisioner/gecko-3-decision/errors
14:38:28 
<dustin> that doesn't *appear* to be what happened here
14:40:09 
<aki> 63.0rc took 10min to run https://tools.taskcluster.net/groups/NUU3EwFFT4q5kLZmmSD8PQ/tasks/SObDu5XHQTGgFHTOz4IuXQ/runs/0
14:40:48 so we more than doubled that time in 63.0.1 bKu596bqSRC1Lmd67jJj_A
14:40:51 run0 and 1 at least
14:43:18 
<•jlund> Hm. Wonder if I didn’t give enough time for build2 run0 before cancelling.
14:43:40 I figured I should cancel given there was a partial graph, but it was still running.
14:43:44 
<dustin> Oct 30 20:27:59 docker-worker.aws-provisioner.us-east-1c.ami-17373968.c5-xlarge.i-0af3223c7c085631b docker-worker: {"type":"reclaiming task","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0af3223c7c085631b","workerGroup":"us-east-1","workerType":"gecko-3-decision","workerNodeType":"c5.xlarge","primaryTaskId":"bKu596bqSRC1Lmd67jJj_A","primaryRunId":0,"taskId":"bKu596bqSRC1Lmd67jJj_A","runId":0,"takenUntil":"2018-10-30T20:28:45.324Z"}
14:44:37 it's been hard to tell the difference between failure modes today
14:44:40 lots and lots of hg failures
14:45:01 (of which this isn't one)
14:45:36 so that log above, it called reclaimTask and got a takenUntil 46 seconds in the future
14:45:39 which makes no sense
14:45:56 https://github.com/taskcluster/taskcluster-queue/blob/master/src/api.js#L1560
14:45:59 claimTimeout is 1200
14:46:13 
<aki> clock skew?
14:46:16 
<dustin> maybe
14:50:36 I wonder if papertrail throws out log entries from "the future"
14:50:37 hm
14:50:42 I can't find any of those reclaim API calls in the queue
14:50:44 logs
14:51:05 
<•mtabara> nalexander: https://bugzilla.mozilla.org/show_bug.cgi?id=1503392#c3
14:51:06 
<firebot> Bug 1503392 is not accessible
14:52:19 
<nalexander> mtabara: beautiful.
14:53:21 
<dustin> oh
14:53:32 maybe takenUntil is the *previous* takenUntil
14:53:37 
<•mtabara> np. happy to expand on any of those if needed
14:54:34 
<dustin> yeah, ok, that sort of makes sense then
14:54:59 so the queue never got that request
14:59:12 https://status.heroku.com/incidents/1655
14:59:37 
<aki> ah
15:00:16 
<dustin> I suspect that's responsible for the worker's inability to reclaim
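The numbers dustin cites above line up as follows. The sketch below is not the actual taskcluster-queue or docker-worker code, just an illustration: with claimTimeout at 1200 seconds, a reclaim the queue actually processed would return a takenUntil about 20 minutes out, so a takenUntil only ~46 seconds away has to be the previous takenUntil, which fits the conclusion that the queue never got the request.

// Minimal sketch (not the real queue or worker code) of the takenUntil arithmetic.
const CLAIM_TIMEOUT_SECONDS = 1200; // claimTimeout value cited from src/api.js in the chat

// What the queue would hand back for a reclaim it actually processed.
function expectedTakenUntil(now: Date): Date {
  return new Date(now.getTime() + CLAIM_TIMEOUT_SECONDS * 1000);
}

// How far in the future a given takenUntil is, in seconds.
function secondsUntil(takenUntil: Date, now: Date): number {
  return (takenUntil.getTime() - now.getTime()) / 1000;
}

// Values from the docker-worker log line quoted above.
const loggedAt = new Date('2018-10-30T20:27:59Z');
const loggedTakenUntil = new Date('2018-10-30T20:28:45.324Z');

console.log(secondsUntil(loggedTakenUntil, loggedAt));              // ~46 s: the stale, previous takenUntil
console.log(secondsUntil(expectedTakenUntil(loggedAt), loggedAt));  // 1200 s: what a fresh reclaim would return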


... after Heroku resolved the issue

<jlund> dustin: tomprince aki - looks like the Heroku issue is resolved. will give it a few min then try build3
15:34:19 
<aki> awesome
16:02:57 
<•jlund> dustin: 16m running. usually <10m. https://tools.taskcluster.net/groups/MH723l7ZTfyYMNX5NXvVEQ/tasks/VxMf3uF2TJOc0p4jDrnWrw/runs/0
16:03:10 not sure if it helps to debug while live
16:03:21 graph exists https://tools.taskcluster.net/groups/VxMf3uF2TJOc0p4jDrnWrw
16:10:02 bstack: ^ need help debugging to unblock 63.0.1
16:10:15 
— bstack looks
16:10:28 
<nthomas> just hit claim expired on task VxMf3uF2TJOc0p4jDrnWrw
16:10:43 
<•jlund> see context above; at the time we determined the cause of the expiry to be https://status.heroku.com/incidents/1655
16:11:52 
<bstack> do I remember correctly that there is a bug for this? I think I saw that restarting the queue helped last time
16:11:56 I will do that now to unblock
16:12:04 
<bstack> and increase priority of figuring out what is going on
16:12:23 
<•jlund> thank you. can't recall: can ciduty restart it now?
16:12:33 ah no, waiting on something
16:13:00 
<bstack> yeah
16:13:19 plus I don't want to get into a world where we're manually restarting services frequently anyway
16:13:33 it should be something that bugs tc engineers so that we're motivated to fix it 
16:13:49 restarted
Severity: normal → blocker
Priority: -- → P1
verifying fix with new shipit go-to-build
See Also: → 1503062
https://tools.taskcluster.net/groups/MH723l7ZTfyYMNX5NXvVEQ/tasks/eAQCnoHFQtKgrtaNXSZTYA/runs/0

closing for now
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
I think the original, Heroku-caused issue was that reclaim calls were hanging for 30 seconds in the Heroku routing layer before failing with a 5xx response.  The worker would have retried a few times, but it had given itself only something like 90 seconds to reclaim, so it probably overran that window with the retries.
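To make the timing concrete, here is a rough sketch of that failure mode. The 30-second and 90-second figures are the approximations from this comment, and the loop is illustrative rather than docker-worker's actual retry logic: a handful of 30-second router hangs is enough to exhaust a ~90-second reclaim budget before any attempt can succeed.

// Illustrative only: not docker-worker's real retry implementation.
const RECLAIM_BUDGET_S = 90;  // approximate lead time the worker gives itself before takenUntil
const ROUTER_HANG_S = 30;     // per-attempt hang in the routing layer before the 5xx
const MAX_ATTEMPTS = 5;

function simulateReclaimAttempts(): void {
  let remaining = RECLAIM_BUDGET_S;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    if (remaining <= 0) {
      console.log(`claim expired before attempt ${attempt} could be made`);
      return;
    }
    remaining -= ROUTER_HANG_S; // the attempt hangs, then fails with a 5xx
    console.log(`attempt ${attempt} hung ${ROUTER_HANG_S}s then got a 5xx; ${remaining}s of budget left`);
  }
}

simulateReclaimAttempts();
// attempt 1 hung 30s then got a 5xx; 60s of budget left
// attempt 2 hung 30s then got a 5xx; 30s of budget left
// attempt 3 hung 30s then got a 5xx; 0s of budget left
// claim expired before attempt 4 could be made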

I would not be surprised to learn that a sudden influx of traffic when that resolved caused us to get in trouble with Azure, and bug 1503062 shows that the dependency resolver does not recover from Azure errors.
In fact I see a big bump in task messages around 22:12-23:30 that caused PulseGuardian to threaten to delete our queues (it didn't delete them).  That suggests we had a very busy spike around that time.
Component: Queue → Services