I looked into a worker that has been "stopping" for a while. Many of the workers have transitioned to "stopped" since I wrote my last comment, so pool ``gecko-t/win10-64-2004`` was looking a lot more healthy:

```
State      Workers  Capacity
requested  53       53      
running    490      490     
stopping   21       21      
stopped    16171    16171  
```

The oldest was worker ``vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7``. It ran 19 tasks, with the last one resulting in an Exception:

https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-t/worker-types/win10-64-2004/workers/centralus/vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7
https://firefox-ci-tc.services.mozilla.com/tasks/TJIuuqi-SEGXtiLSE9xlXw/runs/0

The final task exited with code 1073807364, or 0x40010004, which could be DBG_TERMINATE_PROCESS: https://stackoverflow.com/q/37078953/10612
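
For what it's worth, the decimal-to-hex mapping is easy to verify (the DBG_TERMINATE_PROCESS reading is only a guess, per the Stack Overflow link):

```python
# Quick sanity check on the exit code of the final task run.
code = 1073807364
assert hex(code) == "0x40010004"
# NTSTATUS layout: the top bits 01 mark an "informational" status, and
# 0x40010004 is DBG_TERMINATE_PROCESS in ntstatus.h -- i.e. the process
# was terminated externally rather than exiting on its own.
print(hex(code))  # 0x40010004
```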

The logs are available in GCP, in the ``taskcluster-prod`` project on firefox.gcp.mozilla.com, with this query:

```
resource.type="k8s_container"
resource.labels.location="us-west1"
resource.labels.cluster_name="taskcluster-firefoxcitc-v1"
resource.labels.namespace_name="firefoxcitc-taskcluster"
labels.k8s-pod/app_kubernetes_io/component="taskcluster-worker-manager-workerscanner"
labels.k8s-pod/app_kubernetes_io/instance="firefoxcitc-taskcluster-taskcluster-1"
labels.k8s-pod/app_kubernetes_io/name="taskcluster-worker-manager"
labels.k8s-pod/app_kubernetes_io/part-of="taskcluster"
jsonPayload.Fields.workerId="vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7"
```
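
The same entries can also be pulled programmatically; here is a minimal sketch using the google-cloud-logging Python client (assuming read access to the ``taskcluster-prod`` project; the filter is shortened to the clauses that matter for this worker):

```python
# Minimal sketch: fetch the same worker-scanner entries via the Cloud
# Logging API instead of the console (assumes read access to the project).
from google.cloud import logging

FILTER = """
resource.type="k8s_container"
resource.labels.cluster_name="taskcluster-firefoxcitc-v1"
labels.k8s-pod/app_kubernetes_io/component="taskcluster-worker-manager-workerscanner"
jsonPayload.Fields.workerId="vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7"
"""

client = logging.Client(project="taskcluster-prod")
for entry in client.list_entries(filter_=FILTER, order_by=logging.ASCENDING):
    print(entry.timestamp, entry.severity, entry.payload)
```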

I noticed that the log messages were grouped by time, which may correspond to worker-manager-workerscanner iterations. The log entries (timestamps in GMT) are DEBUG level unless otherwise noted; the "minutes later" gaps are reproduced in the sketch after the list:

* Loop 0
    * ``2021-10-13 13:07:42.828`` fetched instance view
* Loop 1 - 86 minutes later
    * ``2021-10-13 14:33:45.870`` fetched instance view
* Loop 2 - 53 minutes later
    * ``2021-10-13 15:27:07.547`` fetched instance view
    * ``2021-10-13 16:32:45.857`` **ERROR** ``monitor.periodic`` *Not in vm-ovcv... query*
    * ``2021-10-13 16:32:45.874`` **WARNING** ``monitor.error`` Error: Iteration exceeded maximum time allowed *Not in vm-ovcv... query*
* Loop 3 - 79 minutes later
    * ``2021-10-13 16:46:08.453`` fetched instance view
    * ``2021-10-13T17:03:55.882`` *Not in GCP logs*, but when the final task exited
* Loop 4 - 41 minutes later
    * ``2021-10-13 17:27:49.260`` fetched instance view
    * ``2021-10-13 17:27:49.260`` **NOTICE**  ``worker-removed`` failed power state; powerStates=ProvisioningState/succeeded, PowerState/deallocated
* Loop 5 - 83 minutes later
    * ``2021-10-13 18:51:20.439`` deprovisionResource for vm with index undefined
    * ``2021-10-13 18:51:20.439`` deleting resource
* Loop 6 - 77 minutes later
    * ``2021-10-13 20:08:47.828`` deprovisionResource for vm with index undefined
    * ``2021-10-13 20:08:48.040`` resource vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7 not found; removing its id
    * ``2021-10-13 20:08:48.045`` deprovisionResource for nic with index undefined
    * ``2021-10-13 20:08:48.045`` deleting resource
* Loop 7 - 32 minutes later
    * ``2021-10-13 20:41:41.257`` deprovisionResource for vm with index undefined
    * ``2021-10-13 20:41:41.445`` resource vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7 not found; removing its id
    * ``2021-10-13 20:41:41.456`` deprovisionResource for nic with index undefined
    * ``2021-10-13 20:41:41.650`` resource nic-skxgyjkssukgavv58lar not found; removing its id
    * ``2021-10-13 20:41:41.655`` deprovisionResource for ip with index undefined
    * ``2021-10-13 20:41:41.656`` deleting resource
* Loop 8 - 2 minutes later
    * ``2021-10-13 20:44:18.892`` deprovisionResource for vm with index undefined
    * ``2021-10-13 20:44:19.426`` resource vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7 not found; removing its id
    * ``2021-10-13 20:44:19.436`` deprovisionResource for nic with index undefined
    * ``2021-10-13 20:44:19.635`` resource nic-skxgyjkssukgavv58lar not found; removing its id
    * ``2021-10-13 20:44:19.639`` deprovisionResource for ip with index undefined
    * ``2021-10-13 20:44:19.814`` resource pip-eehpjfwmr7kj7h5omk0r not found; removing its id
    * ``2021-10-13 20:44:19.819`` deprovisionResource for disks with index 0
    * ``2021-10-13 20:44:19.819`` deleting resource
    * ``2021-10-13 20:44:20.360`` deprovisionResource for disks with index 1
    * ``2021-10-13 20:44:20.360`` deleting resource
* Loop 9 - 30 minutes later
    * ``2021-10-13 21:15:05.773`` deprovisionResource for vm with index undefined
    * ``2021-10-13 21:15:06.186`` resource vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7 not found; removing its id
    * ``2021-10-13 21:15:06.194`` deprovisionResource for nic with index undefined
    * ``2021-10-13 21:15:06.496`` resource nic-skxgyjkssukgavv58lar not found; removing its id
    * ``2021-10-13 21:15:06.503`` deprovisionResource for ip with index undefined
    * ``2021-10-13 21:15:06.818`` resource pip-eehpjfwmr7kj7h5omk0r not found; removing its id
    * ``2021-10-13 21:15:06.825`` deprovisionResource for disks with index 0
    * ``2021-10-13 21:15:07.129`` resource disk-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7-os not found; removing its id
    * ``2021-10-13 21:15:07.135`` deprovisionResource for disks with index 1
    * ``2021-10-13 21:15:07.308`` resource disk-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7-1 not found; removing its id
    * ``2021-10-13 21:15:07.314`` setting state to STOPPED
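
For the record, the per-loop gaps above (and the 251-minute total below) are just floor-of-minutes differences between consecutive timestamps:

```python
# Reproduce the "N minutes later" gaps between scans and the total time
# from the last task exiting to the worker being marked STOPPED.
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"
scans = [
    "2021-10-13 13:07:42.828", "2021-10-13 14:33:45.870",
    "2021-10-13 15:27:07.547", "2021-10-13 16:46:08.453",
    "2021-10-13 17:27:49.260", "2021-10-13 18:51:20.439",
    "2021-10-13 20:08:47.828", "2021-10-13 20:41:41.257",
    "2021-10-13 20:44:18.892", "2021-10-13 21:15:05.773",
]
ts = [datetime.strptime(s, FMT) for s in scans]
gaps = [int((b - a).total_seconds() // 60) for a, b in zip(ts, ts[1:])]
print(gaps)  # [86, 53, 79, 41, 83, 77, 32, 2, 30]

task_exit = datetime.strptime("2021-10-13 17:03:55.882", FMT)
stopped = datetime.strptime("2021-10-13 21:15:07.314", FMT)
print(int((stopped - task_exit).total_seconds() // 60))  # 251
```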

It was 251 minutes from the last task exiting (perhaps from generic-worker shutting down the VM?) to the worker being marked STOPPED. The ``deprovisionResource`` code appears to "fail" at each step, making progress on only one resource type per pass, so it takes at least 5 workerscanner loops to completely de-provision a worker. On top of that, a workerscanner loop is really long, sometimes more than an hour.
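
If I'm reading the logs right, the behaviour matches something like the following hypothetical sketch (my guess at the pattern, not the actual worker-manager code): every pass re-checks each resource type in order, the first type that still exists gets its delete issued, and then the pass stops making progress until a later scan confirms the deletion.

```python
# Hypothetical sketch of the pattern the logs suggest -- NOT the actual
# worker-manager implementation. Resource names are abbreviated stand-ins.
RESOURCE_TYPES = {
    "vm": ["vm-ovcv..."],
    "nic": ["nic-skxg..."],
    "ip": ["pip-eehp..."],
    "disks": ["disk-...-os", "disk-...-1"],
}

class FakeCloud:
    """Stand-in for Azure: a delete issued now is visible by the next pass."""
    def __init__(self):
        self.live = {r for rs in RESOURCE_TYPES.values() for r in rs}

    def exists(self, r):
        return r in self.live

    def delete(self, r):
        self.live.discard(r)

def scan(cloud):
    """One workerscanner pass over a stopping worker's resources."""
    for rtype, resources in RESOURCE_TYPES.items():
        pending = [r for r in resources if cloud.exists(r)]
        if pending:
            for r in pending:
                cloud.delete(r)   # "deleting resource"
            return "stopping"     # give up until the next pass re-checks
        # already gone: "resource ... not found; removing its id",
        # which is re-logged on every subsequent pass in the real logs
    return "stopped"              # "setting state to STOPPED"

cloud, passes = FakeCloud(), 1
while scan(cloud) == "stopping":
    passes += 1
print(passes)  # 5 -- one pass per resource type, plus a final confirming pass
```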

I think this is a worker-manager bug, and it's a pretty good explanation of why stopping a VM is so slow; it only gets worse when CI load is high.
