Taskcluster provisioning issues Sep 29, 2020
Categories
(Taskcluster :: Operations and Service Requests, defect)
Tracking
(Not tracked)
People
(Reporter: bstack, Assigned: bstack)
Details
Attachments
(1 file)
33.19 KB, application/json
Worker manager seems to be terminating perfectly healthy workers after an upgrade to 37.2.0
Comment 1•5 years ago
I believe I have a fix for much of this in https://github.com/taskcluster/taskcluster/pull/3602
We're also rolling back production now.
Comment 2•5 years ago
Attaching the logs for one of the workers that was terminated before we wanted it to be.
Comment 3•5 years ago
And here's what that worker looks like in the db currently (scrubbed of secrets):
taskcluster=> select * from workers where worker_id='i-0192d85f39c8842cc';
-[ RECORD 1 ]--+----------------------------------------------------------
worker_pool_id | gecko-t/t-linux-large
worker_group | us-west-2
worker_id | i-0192d85f39c8842cc
provider_id | aws
created | 2020-09-29 17:21:26.232+00
expires | 2020-10-07 17:24:47.566+00
state | stopped
provider_data | {"owner": "692406183521", "state": "pending", "groups": [], "region": "us-west-2", "imageId": "ami-02209580e11786d58", "privateIp": "10.144.52.102", "stateReason": "pending", "architecture": "x86_64", "instanceType": "m5.large", "workerConfig": {"capacity": 1, "shutdown": {"enabled": true, "afterIdleSeconds": 15}, "dockerConfig": {"allowPrivileged": false}, "deviceManagement": {"kvm": {"enabled": false}, "hostSharedMemory": {"enabled": false}}, "capacityManagement": {"diskspaceThreshold": 20000000000}}, "amiLaunchIndex": 4, "terminateAfter": 1601401883271, "availabilityZone": "us-west-2c", "reregistrationTimeout": 345600000}
capacity | 1
last_modified | 2020-09-29 17:22:57.976+00
last_checked | 2020-09-29 18:03:45.092+00
etag | d25b5076-e35f-4566-b3e2-4c88763caf14
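As a side note, the `terminateAfter` value in `provider_data` above is epoch milliseconds. A quick sketch (not part of the original report) decodes it and compares it against the row's `created` timestamp; the delta comes out to roughly 30 minutes, consistent with it still being the initial registration deadline rather than an updated value:

```typescript
// terminateAfter from provider_data in the row above, epoch milliseconds.
const terminateAfter = new Date(1601401883271);
// created timestamp from the same row.
const created = new Date("2020-09-29T17:21:26.232Z");

console.log(terminateAfter.toISOString()); // -> 2020-09-29T17:51:23.271Z
// Minutes between created and terminateAfter: roughly 30.
console.log((terminateAfter.getTime() - created.getTime()) / 60000);
```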
Comment 4•5 years ago
What I know so far:

- The worker's `provider_data.terminateAfter` is set to the initial time it would have been given when the worker was created. This implies that `registerWorker` never updated the row. However...
- The `last_modified` time matches up with the time in the logs where `registerWorker` logged `worker-running`, so it seems likely that the update did work.
- Sheriffs have mentioned things have felt slower than usual for the past couple of weeks but got really bad today. This implies that this might be a long-standing bug that was made worse by the worker-manager changes in 37.2.0. The only change I see that feels related is the new worker-scanning logic in #3306. As far as I can see that logic is correct, but I suspect something about how it works exacerbates the underlying issue.
- `update_worker_2` explicitly calls out "If the etag argument is empty then the update will overwrite the matched row", but I don't see it actually using the etag in the function. I think this is actually OK, but the logic around optimistic concurrency there is complicated, so I want to go over it with someone tomorrow.

My current hunch is that `registerWorker` and `checkWorker` are somehow racing, and the update in `checkWorker` is overwriting `terminateAfter` when it doesn't actually mean to. This is not a strongly held idea, however; it's just my best guess so far.

Please feel free to poke at this more before I'm around if you're looking into this!
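For anyone following along, the suspected failure mode is a classic lost update. A minimal sketch, assuming a hypothetical in-memory store (none of these names are the real worker-manager code), shows how an etag compare-and-swap would reject the stale write instead of letting it clobber `terminateAfter`:

```typescript
// Hypothetical illustration of the suspected registerWorker/checkWorker race.
interface WorkerRow {
  etag: string;
  terminateAfter: number;
}

class WorkerStore {
  private row: WorkerRow = { etag: "v1", terminateAfter: 1000 };

  read(): WorkerRow {
    return { ...this.row };
  }

  // Optimistic concurrency: the write only succeeds if the caller's etag
  // still matches the stored row; a loser of the race must re-read.
  update(updated: WorkerRow, expectedEtag: string): boolean {
    if (this.row.etag !== expectedEtag) return false; // stale write rejected
    this.row = { ...updated, etag: expectedEtag + "+1" };
    return true;
  }
}

const store = new WorkerStore();

// checkWorker reads the row first...
const seenByScanner = store.read();

// ...then registerWorker extends terminateAfter and commits...
const seenByRegister = store.read();
const ok1 = store.update(
  { ...seenByRegister, terminateAfter: 9999 },
  seenByRegister.etag,
);

// ...so checkWorker's write-back of its stale copy (which would reset
// terminateAfter to 1000) fails instead of silently winning.
const ok2 = store.update(seenByScanner, seenByScanner.etag);

console.log(ok1, ok2, store.read().terminateAfter); // true false 9999
```

If the etag were ignored on update (as comment 4 worries `update_worker_2` might), the second write would succeed and `terminateAfter` would revert to its initial value, which matches the observed row.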
Comment 5•5 years ago
I believe this will be fixed by https://github.com/taskcluster/taskcluster/pull/3602 and then we can roll forward again!