Closed Bug 1180187 Opened 9 years ago Closed 6 years ago

generic-worker: listen for and handle worker shutdown


(Taskcluster :: Workers, defect)

Not set


(Not tracked)



(Reporter: pmoore, Assigned: pmoore)



(Whiteboard: [generic-worker])


(1 file)

Spot nodes can terminate:

The generic worker should have a dedicated goroutine that polls at 5s intervals to catch spot node termination notices.

Once it is established that the node will be terminated, the generic worker should promptly terminate any running tasks and resolve them via reportException with a reason of "worker-shutdown". The line "[taskcluster] Spot node shutdown" should be appended to the task log on a new line, and the log should be uploaded. If other artifacts already exist, it may be reasonable to upload them too.
Assigning all generic worker bugs to myself for now. If anyone wants to take this bug, feel free to add a comment to request it. I can provide context.
Assignee: nobody → pmoore
Component: TaskCluster → Generic-Worker
Product: Testing → Taskcluster
Component: Generic-Worker → Worker
Whiteboard: [generic-worker]
Component: Worker → Generic-Worker
QA Contact: pmoore
See Also: → 1444168
Blocks: 1444168
See Also: 1444168
Drafted an implementation - still need to add test(s).
Attachment #8957670 - Flags: review?(jhford)
Commits pushed to master at
Bug 1180187 - Resolve worker-shutdown for spot terminations
Merge pull request #78 from taskcluster/bug1180187

Bug 1180187 - listen for spot termination notice and abort task on discovery
This was released in 10.7.0 but doesn't seem to be catching all spot terminations.

I've added extra debugging in 10.7.2:

I've rolled that out to staging, so will wait for some reports to come in on papertrail.
Looks to me like a deadlock occurring...

Mar 16 22:07:09 generic-worker: 2018/03/16 21:07:07  
Mar 16 22:07:09 generic-worker: Spot request has MAYBE been issued??? Decide for yourself! 
Mar 16 22:07:09 generic-worker: 2018/03/16 21:07:07 HTTP/1.0 200 OK 
Mar 16 22:07:09 generic-worker: Content-Length: 20 
Mar 16 22:07:09 generic-worker: Accept-Ranges: bytes 
Mar 16 22:07:09 generic-worker: Connection: close 
Mar 16 22:07:09 generic-worker: Content-Type: text/plain 
Mar 16 22:07:09 generic-worker: Date: Fri, 16 Mar 2018 21:07:08 GMT 
Mar 16 22:07:09 generic-worker: Etag: "3708311465" 
Mar 16 22:07:09 generic-worker: Last-Modified: Fri, 16 Mar 2018 21:07:05 GMT 
Mar 16 22:07:09 generic-worker: Server: EC2ws 
Mar 16 22:07:09 generic-worker: 2018-03-16T21:09:05Z 
Mar 16 22:07:09 generic-worker: 2018/03/16 21:07:07 resp.StatusCode = 200 
Mar 16 22:07:09 generic-worker: 2018/03/16 21:07:07 WARNING: ABORTING task since an imminent spot termination notice has been received! 

The last line we see in the logs comes from:

The abort function it calls is:

This needs to update the task status, which is protected with a mutex. I suspect something else holds the mutex.

The machine shuts down a couple of minutes later, and the task run is eventually resolved by the queue as exception/claim-expired rather than the worker resolving it as exception/worker-shutdown as is intended...

Strangely the unit test passes:

So something must be different between running as a unit test and running in production for real...
:pmoore, what is the status here? This bug is marked as the root cause for bug 1444168, which had 80+ failures in the last week.
Flags: needinfo?(pmoore)
Hi Joel,

It turned out not to be a deadlock; rather, process termination wasn't implemented. This was then fixed in bug 1447265, but the fix hasn't been rolled out yet as I am on PTO. It will be rolled out next week in bug 1399401 when I'm back.
Depends on: 1447265
Flags: needinfo?(pmoore)
Depends on: 1399401
Closed: 6 years ago
Resolution: --- → FIXED
Component: Generic-Worker → Workers