Closed Bug 1180187 Opened 9 years ago Closed 6 years ago

generic-worker: listen for and handle worker shutdown

Product/Component: Taskcluster :: Workers
Type: defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: pmoore
Assignee: pmoore
Whiteboard: [generic-worker]
Attachments: 1 file

Spot nodes can terminate: https://aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/

The generic worker should have a dedicated goroutine that polls http://169.254.169.254/latest/meta-data/spot/termination-time at 5-second intervals in order to catch spot node termination notices.

Once it is established that the node will be terminated, the generic worker should promptly terminate any running tasks and resolve them via reportException, giving a reason of "worker-shutdown". The log should have "[taskcluster] Spot node shutdown" added on a new line, and the log should be uploaded. If other artifacts already exist, it may be reasonable to upload them too.
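
The two concrete outputs described above (the extra log line and the exception reason) could be produced roughly as follows. This is only a sketch: exceptionRequest, withShutdownNotice and shutdownPayload are hypothetical names, the JSON shape with a single "reason" field is an assumption about what reportException expects, and the actual queue call and log upload are left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// exceptionRequest is a hypothetical stand-in for the reportException request
// body; the single "reason" field is an assumption made for illustration.
type exceptionRequest struct {
	Reason string `json:"reason"`
}

// withShutdownNotice returns the task log with the shutdown marker appended
// on a line of its own, adding a newline first if the log lacks one.
func withShutdownNotice(log string) string {
	if log != "" && !strings.HasSuffix(log, "\n") {
		log += "\n"
	}
	return log + "[taskcluster] Spot node shutdown\n"
}

// shutdownPayload builds the JSON body carrying the "worker-shutdown" reason.
func shutdownPayload() ([]byte, error) {
	return json.Marshal(exceptionRequest{Reason: "worker-shutdown"})
}

func main() {
	log := withShutdownNotice("task output in progress")
	payload, err := shutdownPayload()
	if err != nil {
		fmt.Println("could not build payload:", err)
		return
	}
	fmt.Print(log)
	fmt.Println(string(payload))
}
```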
Assigning all generic worker bugs to myself for now. If anyone wants to take this bug, feel free to add a comment to request it. I can provide context.
Assignee: nobody → pmoore
Component: TaskCluster → Generic-Worker
Product: Testing → Taskcluster
Component: Generic-Worker → Worker
Whiteboard: [generic-worker]
Component: Worker → Generic-Worker
QA Contact: pmoore
See Also: → 1444168
Blocks: 1444168
See Also: 1444168
Drafted an implementation - still need to add test(s).
Attachment #8957670 - Flags: review?(jhford)
Commits pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/92d7f96555267d3cba10ca890289f4a65c6a8499
Bug 1180187 - Resolve worker-shutdown for spot terminations

https://github.com/taskcluster/generic-worker/commit/0593d903d56475ec5ffe65fbd8d942ebc02b5d6c
Merge pull request #78 from taskcluster/bug1180187

Bug 1180187 - listen for spot termination notice and abort task on discovery
This was released in 10.7.0 but doesn't seem to be catching all spot terminations.

I've added extra debugging in 10.7.2:

  https://github.com/taskcluster/generic-worker/commit/a0d5a75dd704a16fa11b035b6d492d0696b6a577

I've rolled that out to staging, so I will wait for some reports to come in on Papertrail.
Looks to me like a deadlock is occurring...


Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: 2018/03/16 21:07:07  
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: Spot request has MAYBE been issued??? Decide for yourself! 
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: 2018/03/16 21:07:07 HTTP/1.0 200 OK 
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: Content-Length: 20 
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: Accept-Ranges: bytes 
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: Connection: close 
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: Content-Type: text/plain 
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: Date: Fri, 16 Mar 2018 21:07:08 GMT 
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: Etag: "3708311465" 
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: Last-Modified: Fri, 16 Mar 2018 21:07:05 GMT 
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: Server: EC2ws 
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: 2018-03-16T21:09:05Z 
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: 2018/03/16 21:07:07 resp.StatusCode = 200 
Mar 16 22:07:09 i-0b43219e9cb0e74c9.gecko-t-win7-32-gpu-b.usw2.mozilla.com generic-worker: 2018/03/16 21:07:07 WARNING: ABORTING task since an imminent spot termination notice has been received! 

The last line we see in the logs comes from:

https://github.com/taskcluster/generic-worker/blob/v10.7.3/aws.go#L323

The abort function it calls is:

https://github.com/taskcluster/generic-worker/blob/v10.7.3/main.go#L1283-L1289

This needs to update the task status, which is protected with a mutex. I suspect something else holds the mutex.

The machine shuts down a couple of minutes later, and the task run is eventually resolved by the queue as exception/claim-expired rather than by the worker as exception/worker-shutdown, as intended...
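
For a sense of how little time the worker has, the gap between the moment the notice was logged (2018/03/16 21:07:07) and the termination time returned by the metadata service (2018-03-16T21:09:05Z) can be computed as below. This is just an illustration using the values from the log excerpt; timeRemaining is a hypothetical helper, and it assumes the worker's own log timestamps are in UTC, which matches the Date header in the response.

```go
package main

import (
	"fmt"
	"time"
)

// timeRemaining returns how long is left between the moment a notice was
// observed (in the "2006/01/02 15:04:05" layout of the worker log, assumed
// to be UTC) and the RFC 3339 termination time from the metadata service.
func timeRemaining(observed, termination string) (time.Duration, error) {
	obs, err := time.Parse("2006/01/02 15:04:05", observed)
	if err != nil {
		return 0, err
	}
	term, err := time.Parse(time.RFC3339, termination)
	if err != nil {
		return 0, err
	}
	return term.Sub(obs), nil
}

func main() {
	d, err := timeRemaining("2018/03/16 21:07:07", "2018-03-16T21:09:05Z")
	if err != nil {
		fmt.Println("could not parse timestamps:", err)
		return
	}
	fmt.Println("time left to abort, upload logs and report:", d) // 1m58s
}
```

Just under two minutes is the whole budget for aborting the task, uploading the log and resolving the run.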

Strangely the unit test passes:

https://github.com/taskcluster/generic-worker/blob/v10.7.3/aws_test.go#L8-L23

So something must differ between running as a unit test and running in production for real...
:pmoore, what is the status here? This bug is marked as the root cause for bug 1444168, which had 80+ failures in the last week.
Flags: needinfo?(pmoore)
Hi Joel,

It turned out not to be a deadlock, rather that process termination wasn't implemented. This then got fixed in bug 1447265, but hasn't been rolled out yet as I am on PTO. It will be rolled out next week in bug 1399401 when I'm back.
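
To illustrate what "process termination" means here, the sketch below kills a running task command as soon as an abort signal arrives, instead of waiting for the command to finish on its own. It is not the code from bug 1447265: runTask and errAborted are hypothetical names, and the example assumes a Unix-like system where a sleep command exists (the affected workers were Windows machines, where the command would differ).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// errAborted is a hypothetical sentinel returned when the task was killed
// because the abort channel fired before the command completed.
var errAborted = errors.New("task aborted: worker-shutdown")

// runTask starts the command and waits for it, but kills it immediately if
// abort is closed first (for example by a spot termination poller).
func runTask(abort <-chan struct{}, name string, args ...string) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	cmd := exec.CommandContext(ctx, name, args...)
	if err := cmd.Start(); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		return err // the command finished on its own
	case <-abort:
		cancel() // cancelling the context makes os/exec kill the process
		<-done   // wait for the killed process to be reaped
		return errAborted
	}
}

func main() {
	abort := make(chan struct{})
	// Simulate a termination notice arriving shortly after the task starts.
	time.AfterFunc(100*time.Millisecond, func() { close(abort) })

	err := runTask(abort, "sleep", "30")
	fmt.Println(err)
}
```

Note that cancelling the context only kills the process that was started directly; any children it spawned would need separate handling, for example via process groups or job objects.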
Depends on: 1447265
Flags: needinfo?(pmoore)
Depends on: 1399401
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Generic-Worker → Workers