Open Bug 1735411 Opened 4 years ago Updated 1 year ago

windows 2004 test worker backlog (gecko-t/win10-64-2004)

Categories

(Release Engineering :: Firefox-CI Administration, defect)


Tracking

(Not tracked)

People

(Reporter: malexandru, Unassigned)

References

Details

Attachments

(1 file)

Windows 2004 workers are seeing frequent backlogs. As far as I know, the issue was first seen on Thursday, October 7th.

There are very few active workers (~25). firefox-ci reports a capacity of 704 workers.

The worker pool reports the following error: Cannot create more than 200 public IP addresses for this subscription in this region.

Component: General → Firefox-CI Administration
Summary: Frequent windows 2004 worker backlogs → windows 2004 test worker backlog (gecko-t/win10-64-2004) - production trees closed

Decision tasks for Try are not running; this might be the same issue with spinning up workers.

Severity: -- → S1
Priority: -- → P1
Summary: windows 2004 test worker backlog (gecko-t/win10-64-2004) - production trees closed → decision tasks not starting, windows 2004 test worker backlog (gecko-t/win10-64-2004) - production trees closed

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #2)

> There are very few active workers (~25). firefox-ci reports a capacity of 704 workers.
>
> The worker pool reports the following error: Cannot create more than 200 public IP addresses for this subscription in this region.

That is because worker-manager's logic for determining when a worker actually is active (and when not) is broken.
That error happens because of either:

  • It wants to spin up more workers because the worker count is incorrect.

and/or:

  • It does not know about the per-region limit, so it errors out a few times and then moves on to the next region. We need a way to make it aware of the limit, so it creates at most x workers per region and stays within it.
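
A per-region allocation pass along these lines could keep worker-manager under the quota. This is a minimal sketch: the region names, quota constant, and function are hypothetical, not actual worker-manager code.

```python
# Hypothetical sketch: split a desired worker count across regions while
# staying under a per-region public-IP quota. Values are illustrative.
REGION_IP_QUOTA = 200

def plan_per_region(desired, in_use_by_region):
    """Return how many new workers to request per region without exceeding quota."""
    plan = {}
    remaining = desired
    for region, in_use in in_use_by_region.items():
        headroom = max(0, REGION_IP_QUOTA - in_use)
        take = min(remaining, headroom)
        if take:
            plan[region] = take
        remaining -= take
        if remaining <= 0:
            break
    return plan

# Example: 250 workers wanted, centralus nearly at quota
print(plan_per_region(250, {"centralus": 190, "eastus": 40}))
# → {'centralus': 10, 'eastus': 160}
```

The point is that the provisioner would stop requesting IPs in a region once headroom runs out, rather than erroring repeatedly against the quota.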

The decision task issue resolved itself; only 7 pushes are left without a decision task.

The Windows worker count jumped to 120 machines and has plateaued there. masterwayz provided some insight. Mark, can you investigate this further?

Flags: needinfo?(mtabara) → needinfo?(mcornmesser)

i found 49 azure instances in the stopped state but retaining a public ip address. i've deleted them to free up those ips.

i found 240 public ip addresses not associated with any network interface and deleted them to free up further allocation.

i found 383 network interfaces not associated with a virtual machine but retaining an ip address and purged them and their public ip addresses to free up further allocation.
edit: the azure console does not provide a good mechanism for removal of orphaned nics. to purge these i triggered a purge job with an empty commit using a commit message containing the magic instruction purge-taskcluster-resources.

I think we hit a handful of different issues here. From grenade's comments, it looks like worker-manager is missing some network resources when it cleans up, or is possibly just delayed in cleaning them up. I will add a scheduled task in Azure to try to catch those orphaned resources.

Between the orphaned network resources and actually-running workers, we hit our IP quotas in one or more regions. When worker-manager hits these kinds of issues, there is then a delay in VMs being created for new tasks. However, this seems to have had minimal impact, only affecting the creation of 5 VMs.

Though I think the biggest issue is the one Michelle mentioned, where we had ~25 active workers and worker-manager thought there were 700+. Currently there are 174 active workers and worker-manager sees 700+.

pmoore/Jwhitlock: Any thoughts?

Flags: needinfo?(pmoore)
Flags: needinfo?(mcornmesser)
Flags: needinfo?(jwhitlock)

Quoting jwhitlock:

I show this for worker manager's view of gecko-t/win10-64-2004:

State      Workers  Capacity
requested  223      223     
running    155      155     
stopping   230      230     
stopped    15706    15706   

So that's 155 available for running tasks, but capacity will show 608 (223 + 155 + 230)

First off, this is one of those sticky shared issues. There are several teams involved, the impact is different and at different times, and none of us have the full information or ability to fix the issue. Frustrating, but also pretty standard for firefox-ci. I'm focusing on the Azure workers; the decision tasks run on AWS docker workers, and seem like a separate issue to me. Azure pools are only in firefox-tc-ci, not community-tc, so any debugging will have to happen in firefox-ci, and any bug fixes or features will require Taskcluster version bumps in firefox-ci. Thanks for your input and patience!

Here's what I know, some of which I've repeated elsewhere (bug 1723789, etc.).

Worker manager uses a single capacity number to determine when to launch new instances. The capacity is the requested + running + stopping workers (times a capacity per instance, which is 1 for these Windows workers, and 2 for many decision workers). The worker manager will launch new instances if there is a backlog in the queue, until it hits the maximum capacity (600 for gecko-t/win10-64-2004, and 600 for gecko-t/win10-64-2004-gpu). This reflects Worker Manager's bias to control costs - we pay for VMs before they start tasks and while they are shutting down, even if they aren't doing useful work.
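
The capacity arithmetic described above can be sketched like this. This is a rough model: the field names, helpers, and the 600 cap are taken from this description, not from the actual worker-manager source.

```python
# Rough model of worker-manager's launch decision, per the description above.
def estimated_capacity(workers, capacity_per_instance=1):
    """Sum capacity over workers that count against the pool limit."""
    counted_states = {"requested", "running", "stopping"}  # "stopped" excluded
    return sum(capacity_per_instance
               for w in workers if w["state"] in counted_states)

def should_launch(workers, backlog, max_capacity=600):
    # Launch more instances only while there is queue backlog and headroom.
    return backlog > 0 and estimated_capacity(workers) < max_capacity
```

With 223 requested + 155 running + 230 stopping workers, `estimated_capacity` returns 608, which already exceeds the 600 cap, so no new instances launch even though only 155 workers can actually take tasks.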

The statuses:

  • requested workers have been launched but have not yet requested and received their configuration
  • running workers are running tasks or are available for tasks
  • stopping workers are in the process of shutting down. Only Azure workers have this state, AWS and GCP workers go straight to Stopped.
  • stopped workers are shut down but retained in the database. It looks like we retain for a week.

There is an additional wrinkle for Azure workers. When stopping a worker, the VM is shutdown first, then other resources. So, worker-manager may say that a worker is "stopping", when the only remaining Azure resource is a disk that was assigned to the VM. This is one source of the disconnect between VMs in Azure and pool capacity.

Running workers request tasks from the queue, and the queue "sees" a worker when it requests reserved tasks. These can be retrieved via queue APIs like listWorkers, used by the taskcluster UI. There is no second worker call to the queue when a worker stops itself, so these workers may include ones that worker manager thinks are "stopping" or "stopped". So, the queue view of workers can be misleading for Azure workers and pools.

Since the Azure pools are often at maximum capacity, it makes sense to look closely at requested and stopping workers, which take up the majority of the capacity almost every time I look at it. Unfortunately, worker manager doesn't provide these counts easily. Everyone who wants this breakdown uses the listWorkersForWorkerGroup API to iterate through all the workers and count them by state. I've written a Python script to do this, and others have created scripts to generate data for the Grafana charts.
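
The counting approach is roughly the following. This is a sketch of the kind of script described: `fetch_page` stands in for a real paginated Taskcluster client call, whose exact method name and response shape are assumptions.

```python
# Sketch: tally workers by state from a paginated list API.
# `fetch_page` is a stand-in for a real Taskcluster client call
# (e.g. worker-manager's listWorkersForWorkerGroup); the response
# shape ({"workers": [...], "continuationToken": ...}) is assumed.
from collections import Counter

def count_by_state(fetch_page):
    counts = Counter()
    token = None
    while True:
        page = fetch_page(continuation_token=token)
        for worker in page["workers"]:
            counts[worker["state"]] += 1
        token = page.get("continuationToken")
        if not token:
            return counts
```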

Here's the summary data for the three Azure pools, plus some non-Azure ones

Pool                        Provider  Requested  Running  Stopping  Stopped
gecko-t/win10-64-2004       azure     125        354      64        16105
gecko-t/win10-64-2004-gpu   azure     33         295      158       12692
gecko-t/win10-64-2004-test  azure     18         16       0         1389
gecko-t/t-linux-large       aws       12         2351     0         140005
gecko-t/t-linux-xlarge      aws       0          1006     0         39032
gecko-3/b-win2012           aws       0          35       0         8391
gecko-t/t-win7-32           aws       2          17       0         2948
gecko-1/b-linux-gcp         gcp       0          1        0         35
gecko-3/b-linux-gcp         gcp       4          13       0         1032

This is better than this morning, but requested and stopping workers are still a larger proportion of pool capacity for Azure workers than AWS or GCP worker pools. I think focusing on issues with requesting and stopping Azure workers is a productive use of our time.

We run Azure workers across regions (Group ID in worker manager, centralus, eastus, etc). It looks to me that requested and stopping workers are spread across regions, so this is not an issue with a particular region.

listWorkersForWorkerGroup returns a "created", "lastModified", and "lastChecked" timestamp, which can give an idea of the age. My script has a CSV output that lets me easily sort and filter. Out of the 126 "requested" workers, 86 are older than 26 minutes, and the rest were just requested. Most of the 86 are older than 30 minutes, which is the registrationTimeout for these pools - a worker that has been requested but not registered with the queue in 30 minutes is supposed to be terminated by worker manager. That looks like an issue with worker manager, but could be due to slowness in stopping workers. The average age of workers in "requested" is 65 minutes.

Looking at the "stopping" workers, I can use "lastChecked" and "lastModified" to determine how long they have been stopping. Out of 65 in the "stopping" state, 5 have just switched to stopping, and 59 have been stopping for more than 30 minutes. The longest has been "stopping" for 8 hours, and the average is 73 minutes in the "stopping" state.
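
Computing those ages is straightforward once the timestamps are parsed. This is a sketch; the ISO-8601 "Z"-suffixed timestamp format is an assumption based on typical Taskcluster API responses.

```python
# Sketch: derive how long a worker has been in its current state from the
# "lastModified" timestamp returned by the list API (format assumed).
from datetime import datetime, timezone

def minutes_in_state(last_modified_iso, now=None):
    now = now or datetime.now(timezone.utc)
    changed = datetime.fromisoformat(last_modified_iso.replace("Z", "+00:00"))
    return (now - changed).total_seconds() / 60

# Example: a worker that entered "stopping" at 16:46 UTC, checked at 18:00 UTC
now = datetime(2021, 10, 13, 18, 0, tzinfo=timezone.utc)
print(minutes_in_state("2021-10-13T16:46:00Z", now))  # → 74.0
```

Filtering the result against the 30-minute registrationTimeout (for "requested") or an expected shutdown window (for "stopping") flags the stuck workers.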

This seems like a problem. A worker will shut itself off after 16 minutes of idle time. If it takes an hour or more to de-provision, the system will not respond well to spikes in CI requests, as compared to AWS and GCP workers.

Are the workers in "requested" and "stopping" the exceptions or the rule? I believe we'd need to process the logs to determine this, and it is not clear to me if there is sufficient data in the logs to make these calculations. I'd start with the worker-manager service logs in GCP, but we may need the worker logs in Papertrail as well.

The "stopping" state was added for Azure because it is complex and multi-step:

  • deprovision the VM
  • deprovision the NIC
  • deprovision the IP
  • deprovision the OS and data disks
  • Switch state to "stopped"

Based on the hanging resources, there may be some bugs in deprovisioning. However, I don't see sufficient logging in worker manager to determine what is going wrong.

In summary, here are some questions to investigate, which all require looking at the running system in firefox-ci:

  1. Is worker manager attempting to launch Azure workers, or is it limiting due to capacity? Signs are it is hitting max capacity.
  2. Are the per-region limits a limiting factor, or does worker-manager send an email and quickly retry a different region? My guess is this is an annoyance but not a root cause.
  3. Do we have sufficient logging to measure worker lifecycle transitions?
  4. How long does it take for an Azure worker to go from requested to running, and what is the distribution of startup times?
  5. How long does it take for an Azure worker to go from stopping to stopped, and what is the distribution of stopping times?
  6. What resources (VMs, NIC, IP, disks) are not being de-provisioned? Can we detect these with current tools?

Bugs:

  • Requested workers older than registrationTimeout are not terminated - why?
  • Workers are stuck in "STOPPING" for hours - why?

Features:

  • Increase logging for better visibility into Azure provisioning and de-provisioning from worker manager
  • (Maybe) Track startup, shutdown durations in the Taskcluster database

Tasks:

  • Update firefox-ci to the current Taskcluster version, and then continue updating for worker manager changes
  • (Maybe) Add an Azure pool to community-tc, so we can more rapidly test Taskcluster changes

Nice to have:

Flags: needinfo?(jwhitlock)

I looked into a worker that has been "stopping" for a while. Many of the workers have transitioned to "stopped" since I wrote my last comment, so pool gecko-t/win10-64-2004 was looking a lot more healthy:

State      Workers  Capacity
requested  53       53      
running    490      490     
stopping   21       21      
stopped    16171    16171  

The oldest was worker vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7. It ran 19 tasks, with the last one resulting in an Exception:

https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-t/worker-types/win10-64-2004/workers/centralus/vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7
https://firefox-ci-tc.services.mozilla.com/tasks/TJIuuqi-SEGXtiLSE9xlXw/runs/0

The final task exited with code 1073807364, or 0x40010004, which could be DBG_TERMINATE_PROCESS: https://stackoverflow.com/q/37078953/10612
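
A quick sanity check on that conversion (the DBG_TERMINATE_PROCESS identification is from the Windows NTSTATUS numbering; the "could be" above still applies):

```python
# Verify the decimal exit code matches the NTSTATUS value 0x40010004,
# which is DBG_TERMINATE_PROCESS in the Windows NTSTATUS numbering.
code = 1073807364
print(hex(code))        # → 0x40010004

# NTSTATUS packs a severity in the top two bits: 0b01 = "informational",
# consistent with a debugger-initiated termination rather than a crash.
severity = code >> 30
print(severity)         # → 1
```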

The logs are available in GCP, taskcluster-prod project in firefox.gcp.mozilla.com, with this query:

resource.type="k8s_container"
resource.labels.location="us-west1"
resource.labels.cluster_name="taskcluster-firefoxcitc-v1"
resource.labels.namespace_name="firefoxcitc-taskcluster"
labels.k8s-pod/app_kubernetes_io/component="taskcluster-worker-manager-workerscanner"
labels.k8s-pod/app_kubernetes_io/instance="firefoxcitc-taskcluster-taskcluster-1"
labels.k8s-pod/app_kubernetes_io/name="taskcluster-worker-manager"
labels.k8s-pod/app_kubernetes_io/part-of="taskcluster"
jsonPayload.Fields.workerId="vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7"

I noticed that the log messages were grouped by time, which may correspond to worker-manager-workerscanner iterations. The log entries (GMT) are DEBUG unless noted

  • Loop 0
    • 2021-10-13 13:07:42.828 fetched instance view
  • Loop 1 - 86 minutes later
    • 2021-10-13 14:33:45.870 fetched instance view
  • Loop 2 - 53 minutes later
    • 2021-10-13 15:27:07.547 fetched instance view
    • 2021-10-13 16:32:45.857 ERROR monitor.periodic Not in vm-ovcv... query
    • 2021-10-13 16:32:45.874 WARNING monitor.error Error: Iteration exceeded maximum time allowed Not in vm-ovcv... query
  • Loop 3 - 79 minutes later
    • 2021-10-13 16:46:08.453 fetched instance view
    • 2021-10-13T17:03:55.882 Not in GCP logs, but when the final task exited
  • Loop 4 - 41 minutes later
    • 2021-10-13 17:27:49.260 fetched instance view
    • 2021-10-13 17:27:49.260 NOTICE worker-removed failed power state; powerStates=ProvisioningState/succeeded, PowerState/deallocated
  • Loop 5 - 83 minutes later
    • 2021-10-13 18:51:20.439 deprovisionResource for vm with index undefined
    • 2021-10-13 18:51:20.439 deleting resource
  • Loop 6 - 77 minutes later
    • 2021-10-13 20:08:47.828 deprovisionResource for vm with index undefined
    • 2021-10-13 20:08:48.040 resource vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7 not found; removing its id
    • 2021-10-13 20:08:48.045 deprovisionResource for nic with index undefined
    • 2021-10-13 20:08:48.045 deleting resource
  • Loop 7 - 32 minutes later
    • 2021-10-13 20:41:41.257 deprovisionResource for vm with index undefined
    • 2021-10-13 20:41:41.445 resource vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7 not found; removing its id
    • 2021-10-13 20:41:41.456 deprovisionResource for nic with index undefined
    • 2021-10-13 20:41:41.650 resource nic-skxgyjkssukgavv58lar not found; removing its id
    • 2021-10-13 20:41:41.655 deprovisionResource for ip with index undefined
    • 2021-10-13 20:41:41.656 deleting resource
  • Loop 8 - 2 minutes later
    • 2021-10-13 20:44:18.892 deprovisionResource for vm with index undefined
    • 2021-10-13 20:44:19.426 resource vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7 not found; removing its id
    • 2021-10-13 20:44:19.436 deprovisionResource for nic with index undefined
    • 2021-10-13 20:44:19.635 resource nic-skxgyjkssukgavv58lar not found; removing its id
    • 2021-10-13 20:44:19.639 deprovisionResource for ip with index undefined
    • 2021-10-13 20:44:19.814 resource pip-eehpjfwmr7kj7h5omk0r not found; removing its id
    • 2021-10-13 20:44:19.819 deprovisionResource for disks with index 0
    • 2021-10-13 20:44:19.819 deleting resource
    • 2021-10-13 20:44:20.360 deprovisionResource for disks with index 1
    • 2021-10-13 20:44:20.360 deleting resource
  • Loop 9 - 30 minutes later
    • 2021-10-13 21:15:05.773 deprovisionResource for vm with index undefined
    • 2021-10-13 21:15:06.186 resource vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7 not found; removing its id
    • 2021-10-13 21:15:06.194 deprovisionResource for nic with index undefined
    • 2021-10-13 21:15:06.496 resource nic-skxgyjkssukgavv58lar not found; removing its id
    • 2021-10-13 21:15:06.503 deprovisionResource for ip with index undefined
    • 2021-10-13 21:15:06.818 resource pip-eehpjfwmr7kj7h5omk0r not found; removing its id
    • 2021-10-13 21:15:06.825 deprovisionResource for disks with index 0
    • 2021-10-13 21:15:07.129 resource disk-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7-os not found; removing its id
    • 2021-10-13 21:15:07.135 deprovisionResource for disks with index 1
    • 2021-10-13 21:15:07.308 resource disk-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7-1 not found; removing its id
    • 2021-10-13 21:15:07.314 setting state to STOPPED

It was 251 minutes from the last task exiting (maybe from generic-worker shutting down the VM?) to the worker being marked as STOPPED. It appears the deprovisionResource code is "failing" at each step, so it takes at least 5 workerscanner loops to completely de-provision a worker. Plus, a workerscanner loop is really long, sometimes more than an hour.
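
A back-of-the-envelope model of that behavior (an assumption about the loop mechanics inferred from the logs above, not actual worker-manager code):

```python
# Model: each workerscanner loop issues one resource delete, and the delete is
# only confirmed (resource "not found; removing its id") on the *next* pass.
# A final loop then sets the state to STOPPED. Resource names mirror the logs.
def loops_to_stop(resource_types=("vm", "nic", "ip", "disks")):
    # one loop per resource-type delete, plus a final loop that confirms the
    # last delete and sets state to STOPPED
    return len(resource_types) + 1

print(loops_to_stop())  # → 5
```

At observed loop durations of 30-86 minutes, 5 loops readily accounts for the 251 minutes seen above, without any single Azure operation being slow.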

I think this is a worker-manager bug, and is a pretty good explanation why stopping a VM is so slow, and only worse when CI load is high.

I need to do some more investigation into the history of this code, which has some comments that are unclear to me. This may be a feature, giving Azure time to actually de-provision one resource before moving on to the next. If that is the case, then we may not be able to de-provision all the Azure resources in the same loop. It may be better to look for ways to speed up the loops instead.

Flags: needinfo?(jwhitlock)
Flags: needinfo?(pmoore)

I looked more closely at the start-up process for an Azure worker. The worker-manager provisioner process will "start" a worker, based on the queue backlog and capacity. This simply adds a row with the required data to the database, and does not request any resources from Azure.

Next, the worker-manager worker-scanner reads from the database, and provisions the resources, one per loop:

  1. The IP
  2. The NIC
  3. The VM

The VM will then boot, and call the worker-manager API to register itself. At this point, it switches to "running" state and starts claiming work.

So, when the worker-scanner loops are fast, Azure workers are started quickly, but still need at least three loops. If worker-scanner is slow (more than an hour per loop, as seen in comment 12), it can take 3-4 hours for a worker to be ready to work. The worker-scanner does work for requested, running, and stopping workers, so I expect that worker-scanner loop time is roughly proportional to the sum of the current capacity across pools (gecko-t/win10-64-2004, gecko-t/win10-64-2004-gpu, and gecko-t/win10-64-2004-test).

I think more logging is needed to 1) measure how long the worker-scanner loops are, and 2) model loop duration as a function of worker capacity. A lower max capacity may result in a more responsive system, and we may want to consider a minimum capacity as well.

There may be ways to improve the provisioning and de-provisioning code as well, but it looks like this was a known issue in the past, and past TC developers were unable to make progress. More on that in the next comment about the code history.

Summary: decision tasks not starting, windows 2004 test worker backlog (gecko-t/win10-64-2004) - production trees closed → windows 2004 test worker backlog (gecko-t/win10-64-2004)

I've looked at the code history, to determine when incremental resource provisioning and de-provisioning was added. The answer appears to be shortly after the Azure code was first shipped, in February 2020. This implies to me that the focus should be in making the worker-scanner loop as fast as possible, by identifying and removing bottlenecks, batching operations, or even terminating a loop faster.

I'm no longer working on Taskcluster, so a new developer will have to take on this work. My first suggestion is to determine how long each provider (AWS, Azure, Google) takes to scan, and how many API calls each makes. This can help determine which should be optimized. I'm guessing Azure, but that's a guess!

I'm curious if multiple items can be downloaded at once (like all the paginated VM data), rather than once per worker, and if that would speed things up. I also wonder if we should rethink resources in Azure. Should we have a pool of IPs and NICs that get attached to worker instances and released to the pools, rather than be allocated and de-allocated?

Here are the details of the Azure provisioner history:

2020-01-13 - v24.2.0 - Initially added Azure provider

The Azure work was done under bug 1600966, and Azure issues in tracking bug 1601033. The code was added in PR 2197.

The code notes that the Azure SDK has some built-in retry logic, and tries to handle 429 Too Many Requests and 5xx server errors.

Provisioning creates an IP address, a network interface (NIC), and the virtual machine (VM). Any error results in a VM Creation Error and deleting the worker.

Deleting a worker deletes the VM, the NIC, and the disk in parallel, then the IP. This went through some iterations during the PR process.

2020-02-19 - v25.4.0 - Avoid blocking resource calls

Bug 1616649 notes that the Azure API calls are blocking calls, and that operations can take a long time. This was fixed in PR 2384.

Resources are now allocated or de-allocated one per worker-scanner run.
The provisioner now incrementally provisions the IP Address, the NIC, and finally the VM.
The de-provisioner now incrementally de-provisions the VM, NIC, IP, and then the disk.

2020-02-25 - v25.4.0 - Avoid over-provisioning

Bug 1600071 tracks an over-provisioning bug affecting t-win10-64 instances in AWS. The discussion covers a lot of ground, and identifies a few issues. PR 2407 addresses one issue, counting "requested" workers in the capacity estimate, to avoid provisioning new workers when one is slow to transition to "running".

Note, we have almost the opposite issue, that "stopping" Azure workers contribute to the capacity, so new Azure workers are not provisioned.

2020-05-28 - v30.0.3? - Test incremental resource provisioning / deprovisioning

PR 2909 was a "silent" change that refactored the way Azure code was tested, and re-wrote the tests. One of the tests demonstrates that it takes repeated loops to fully provision or remove a worker.

There are some minor changes in the production code as well, to avoid re-creating resources if creation is in progress.

2020-06-04 - v30.0.3 - Delete all of a VM's disks

Bug 1631824 notes that only the first disk is deleted, and a VM may have other data disks. PR 2790 changes this to delete each disk assigned to the VM. There was a bug that kept this from running correctly, fixes around 2021-04-27 (see below).

2020-06-05 - v30.1.0 - Worker table migration

Issue 2877 tracks migrating the worker table to support a re-registration token. This was fixed in PR 2908.

There were some code changes, such as the provisionResource returning the (updated?) worker.

2020-08-05 - v37.2.0 - Update Azure instance state recognition

Issue 3346 mentions bug 1646061, where an Azure VM had a state 'ProvisioningState/creating', but was terminated because it wasn't one of the expected states. PR 3489 changes the code to stop making one API call, and to only act on known failure states. For unknown states, it is assumed the (newish) registration timeout would catch and terminate the instance.

2020-09-15 - v37.2.0 - Clarify state transitions in Azure code

Issue 3058 talks about refactoring the provisioning logic, to allow for better testing, and being clearer if a resource is in the process of being provisioned. PR 3505 implements the changes, with tests.

One external change is that the API call to remove a worker no longer does the work itself; instead, it waits for the next worker-scanner loop to start de-provisioning. However, there was a bug that meant no VMs were deprovisioned; see the next entry.

2021-04-06 - v43.0.0 - Always de-provision workers in STOPPING state

Issue 4715 notes that workers in the STOPPING state are not having their resources removed, until the VM was removed. It guesses this was added in v37.2.0 with PR 3505, when the removeWorker API call no longer took the first step of removing the VM. PR 4718 fixes this.

This fix means that scanner loops are (again?) doing work to de-provision Azure workers, which may make those loops slower.

2021-04-27 - v43.2.0 - Azure disks get unique names, can be deleted

Work continues on bug 1631824, since disks are still present. PR 4806 gives disks unique names, so they can be deleted along with the VM.

This has a few consequences:

  • At the end of provisioning, the worker data is written to the database, to record the disk names.
  • Worker scanner may be slower, as it actually deletes disks
  • It may take additional worker scanner loops to fully stop an Azure worker
  • Disks can't be shared between VMs; the first VM to stop will delete the shared disk
Flags: needinfo?(jwhitlock)
Severity: -- → S3
See Also: → 1752461
QA Contact: mtabara → michelle

The win 2004 pools were removed in bug 1857614. Are we still seeing this with the current ones, or have things gotten better in the last couple of years?

Flags: needinfo?(aryx.bugmail)

Over the past 30 days, there have sometimes been periods when tasks had to wait for a machine. I lack data on how long they have to wait (Treeherder doesn't know when all dependencies are finished). The longest backlogs last 4-5.5h, but we did not have to close the trees for this (and we don't watch the Try tree, which would be impacted first).

Flags: needinfo?(aryx.bugmail)
QA Contact: michelle
