Bug 1668111 Opened 5 years ago Closed 5 years ago

Taskcluster provisioning issues Sep 29, 2020

Categories

(Taskcluster :: Operations and Service Requests, defect)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bstack, Assigned: bstack)

Details

Attachments

(1 file)

Worker-manager seems to be terminating perfectly healthy workers after an upgrade to 37.2.0.

Assignee: nobody → bstack
Status: NEW → ASSIGNED

I believe I have a fix for much of this in https://github.com/taskcluster/taskcluster/pull/3602

We're also rolling back production now

Attaching the logs for one of the workers that was terminated before we wanted it to be.

And here's what that worker looks like in the db currently (scrubbed of secrets):

taskcluster=> select * from workers where worker_id='i-0192d85f39c8842cc';
-[ RECORD 1 ]--+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
worker_pool_id | gecko-t/t-linux-large
worker_group   | us-west-2
worker_id      | i-0192d85f39c8842cc
provider_id    | aws
created        | 2020-09-29 17:21:26.232+00
expires        | 2020-10-07 17:24:47.566+00
state          | stopped
provider_data  | {"owner": "692406183521", "state": "pending", "groups": [], "region": "us-west-2", "imageId": "ami-02209580e11786d58", "privateIp": "10.144.52.102", "stateReason": "pending", "architecture": "x86_64", "instanceType": "m5.large", "workerConfig": {"capacity": 1, "shutdown": {"enabled": true, "afterIdleSeconds": 15}, "dockerConfig": {"allowPrivileged": false}, "deviceManagement": {"kvm": {"enabled": false}, "hostSharedMemory": {"enabled": false}}, "capacityManagement": {"diskspaceThreshold": 20000000000}}, "amiLaunchIndex": 4, "terminateAfter": 1601401883271, "availabilityZone": "us-west-2c", "reregistrationTimeout": 345600000}
capacity       | 1
last_modified  | 2020-09-29 17:22:57.976+00
last_checked   | 2020-09-29 18:03:45.092+00
etag           | d25b5076-e35f-4566-b3e2-4c88763caf14

What I know so far:

  1. The worker's provider_data.terminateAfter is still set to the initial value it was given when the worker was created. This implies that registerWorker never updated the row. However...
  2. The last_modified time matches the time in the logs where registerWorker logged worker-running, so it seems likely that the update did go through.
  3. Sheriffs have mentioned that things have felt slower than usual for the past couple of weeks but got really bad today. This suggests this might be a long-standing bug that the worker-manager changes in 37.2.0 made worse. The only change I see that feels related is the new worker-scanning logic in #3306. As far as I can see that logic is correct, but I'm thinking that something about how it works exacerbates the underlying issue.
  4. update_worker_2's documentation explicitly calls out that "if the etag argument is empty then the update will overwrite the matched row", but I don't see the function actually using the etag. I think this is actually OK, but the logic around optimistic concurrency there is complicated, so I want to go over it with someone tomorrow.

My current hunch is that registerWorker and checkWorker are somehow racing, and the update in checkWorker is overwriting terminateAfter when it doesn't actually mean to. This is not a strongly held theory, just my best guess so far.

Please feel free to poke this more before I'm around if you're looking into this!

I believe this will be fixed by https://github.com/taskcluster/taskcluster/pull/3602 and then we can roll forward again!

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
