windows 2004 test worker backlog (gecko-t/win10-64-2004)
Categories
(Release Engineering :: Firefox-CI Administration, defect)
Tracking
(Not tracked)
People
(Reporter: malexandru, Unassigned)
References
Details
Attachments
(1 file)
10.64 KB, image/png
Windows 2004 workers are seeing frequent backlogs; as far as I know, the issue was first seen on Thursday, 7th October.
Comment 2•4 years ago
There are very few active workers (~25). firefox-ci reports a capacity of 704 workers.
The worker pool reports the following error: Cannot create more than 200 public IP addresses for this subscription in this region.
Comment 3•4 years ago
Decision tasks for Try are not running; this might be the same issue with spinning up workers.
Comment 4•4 years ago
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #2)
There are very few active workers (~25). firefox-ci reports a capacity of 704 workers.
The worker pool reports the following error:
Cannot create more than 200 public IP addresses for this subscription in this region.
That is because worker-manager's logic for determining whether a worker is actually active is broken.
That error happens because of either:
- It wants to spin up more workers because the worker count is incorrect.
and/or:
- It does not know of the per-region limit, so it'll error out a few times and then move on to the next region. We need to find a way to make it aware of the limit, so that it creates at most x workers per region and stays within it.
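A hedged sketch of how per-region-aware allocation could work, assuming we know each region's public-IP quota. All names and numbers here are illustrative; worker-manager does not implement this today.

```python
# Illustrative only: split a desired worker count across regions without
# exceeding each region's public IP quota (e.g. the 200-IP limit above).
def allocate_across_regions(desired, quotas, in_use):
    """Return {region: workers_to_create}, capped by remaining IP headroom."""
    plan = {}
    for region, quota in quotas.items():
        if desired <= 0:
            break
        headroom = max(0, quota - in_use.get(region, 0))
        take = min(desired, headroom)
        if take:
            plan[region] = take
            desired -= take
    return plan

# centralus is nearly at its 200-IP quota, so most new workers go to eastus.
plan = allocate_across_regions(
    desired=100,
    quotas={"centralus": 200, "eastus": 200},
    in_use={"centralus": 180, "eastus": 30},
)
# plan == {"centralus": 20, "eastus": 80}
```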
Comment 5•4 years ago
The decision task issue resolved itself; only 7 pushes are left without a decision task.
The Windows worker count jumped to 120 machines and has plateaued there. masterwayz provided some insight. Mark, can you investigate this further?
Comment 6•4 years ago
I found 49 Azure instances in the stopped state but retaining a public IP address. I've deleted them to free up those IPs.
Comment 7•4 years ago
I found 240 public IP addresses not associated with any network interface and deleted them to free up further allocation.
Comment 8•4 years ago
I found 383 network interfaces not associated with a virtual machine but retaining an IP address, and purged them and their public IP addresses to free up further allocation.
Edit: the Azure console does not provide a good mechanism for removing orphaned NICs. To purge these I triggered a purge job with an empty commit, using a commit message containing the magic instruction purge-taskcluster-resources.
Comment 9•4 years ago
I think we hit a handful of different issues here. From grenade's comments, it looks like worker-manager is missing some network resources when it is cleaning up, or is possibly just delayed in cleaning them up. I will add a scheduled task in Azure to try to catch those orphaned resources.
Between the orphaned network resources and the workers actually running, we hit our IP quotas in one or more regions. When worker-manager hits this kind of issue, there is a delay in VMs being created for new tasks. However, this seems to have had minimal impact, only affecting the creation of 5 VMs.
Though I think the biggest issue is the one Michelle mentioned, where we had ~25 active workers and worker-manager thought there were 700+. Currently there are 174 active workers and worker-manager sees 700+.
pmoore/Jwhitlock: Any thoughts?
Comment 10•4 years ago
Quoting jwhitlock:
I show this for worker manager's view of gecko-t/win10-64-2004:
| State | Workers | Capacity |
|---|---|---|
| requested | 223 | 223 |
| running | 155 | 155 |
| stopping | 230 | 230 |
| stopped | 15706 | 15706 |
So that's 155 available for running tasks, but capacity will show 608 (223 + 155 + 230)
Comment 11•4 years ago
First off, this is one of those sticky shared issues. There are several teams involved, the impact is different and at different times, and none of us have the full information or ability to fix the issue. Frustrating, but also pretty standard for firefox-ci. I'm focusing on the Azure workers; the decision tasks run on AWS docker workers, and seem like a separate issue to me. Azure pools are only in firefox-tc-ci, not community-tc, so any debugging will have to happen in firefox-ci, and any bug fixes or features will require Taskcluster version bumps in firefox-ci. Thanks for your input and patience!
Here's what I know, some of which I've repeated elsewhere (bug 1723789, etc.).
Worker manager uses a single capacity number to determine when to launch new instances. The capacity is the requested + running + stopping workers (times a capacity per instance, which is 1 for these Windows workers, and 2 for many decision workers). The worker manager will launch new instances if there is a backlog in the queue, until it hits the maximum capacity (600 for gecko-t/win10-64-2004, and 600 for gecko-t/win10-64-2004-gpu). This reflects worker manager's bias toward controlling costs: we pay for VMs before they start tasks and while they are shutting down, even if they aren't doing useful work.
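The capacity arithmetic described above can be sketched as follows (a simplified model for illustration, not worker-manager's actual code):

```python
# Simplified model of the launch decision. Capacity counts requested +
# running + stopping workers, weighted by capacityPerInstance (1 for these
# Windows pools), even though only "running" workers can take tasks.
def current_capacity(counts, capacity_per_instance=1):
    states = ("requested", "running", "stopping")
    return sum(counts.get(s, 0) for s in states) * capacity_per_instance

def workers_to_launch(counts, backlog, max_capacity=600):
    room = max(0, max_capacity - current_capacity(counts))
    return min(backlog, room)

# With comment 10's numbers: 223 + 155 + 230 = 608, already over the 600
# maximum, so no new workers launch even though only 155 can run tasks.
counts = {"requested": 223, "running": 155, "stopping": 230}
# current_capacity(counts) == 608; workers_to_launch(counts, backlog=50) == 0
```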
The statuses:
- requested: workers have been launched but have not yet requested and received their configuration
- running: workers are running tasks or are available for tasks
- stopping: workers are in the process of shutting down. Only Azure workers have this state; AWS and GCP workers go straight to stopped.
- stopped: workers are shut down but retained in the database. It looks like we retain them for a week.
There is an additional wrinkle for Azure workers. When stopping a worker, the VM is shutdown first, then other resources. So, worker-manager may say that a worker is "stopping", when the only remaining Azure resource is a disk that was assigned to the VM. This is one source of the disconnect between VMs in Azure and pool capacity.
Running workers request tasks from the queue, and the queue "sees" a worker when it requests reserved tasks. These can be retrieved via queue APIs like listWorkers, used by the taskcluster UI. There is no second worker call to the queue when a worker stops itself, so these workers may include ones that worker manager thinks are "stopping" or "stopped". So, the queue view of workers can be misleading for Azure workers and pools.
Since the Azure pools are often at maximum capacity, it makes sense to look closely at requested and stopping workers, which are taking up the majority of the capacity almost every time I look at it. Unfortunately, worker manager doesn't provide these counts easily. Everyone who wants this breakdown uses the listWorkersForWorkerGroup API to iterate through all the workers and count them by state. I've written a Python script to do this, and others have created scripts to generate data for the Grafana charts.
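A minimal sketch of such a counting script. It assumes pagination shaped like Taskcluster's list APIs (a `workers` array plus a `continuationToken`); the in-memory pages below stand in for real API responses.

```python
from collections import Counter

def count_worker_states(fetch_page):
    """Tally workers by state across a paginated listing.

    fetch_page(token) is assumed to return a dict shaped like Taskcluster's
    list responses: {"workers": [...], "continuationToken": ...}.
    """
    counts, token = Counter(), None
    while True:
        page = fetch_page(token)
        counts.update(w["state"] for w in page["workers"])
        token = page.get("continuationToken")
        if not token:
            return counts

# Tiny in-memory fake standing in for the real API:
pages = [
    {"workers": [{"state": "requested"}, {"state": "running"}],
     "continuationToken": "p2"},
    {"workers": [{"state": "stopping"}, {"state": "running"}]},
]
counts = count_worker_states(lambda token: pages[0] if token is None else pages[1])
# counts == Counter({"running": 2, "requested": 1, "stopping": 1})
```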
Here's the summary data for the three Azure pools, plus some non-Azure ones
| Pool | Provider | Requested | Running | Stopping | Stopped |
|---|---|---|---|---|---|
| gecko-t/win10-64-2004 | azure | 125 | 354 | 64 | 16105 |
| gecko-t/win10-64-2004-gpu | azure | 33 | 295 | 158 | 12692 |
| gecko-t/win10-64-2004-test | azure | 18 | 16 | 0 | 1389 |
| gecko-t/t-linux-large | aws | 12 | 2351 | 0 | 140005 |
| gecko-t/t-linux-xlarge | aws | 0 | 1006 | 0 | 39032 |
| gecko-3/b-win2012 | aws | 0 | 35 | 0 | 8391 |
| gecko-t/t-win7-32 | aws | 2 | 17 | 0 | 2948 |
| gecko-1/b-linux-gcp | gcp | 0 | 1 | 0 | 35 |
| gecko-3/b-linux-gcp | gcp | 4 | 13 | 0 | 1032 |
This is better than this morning, but requested and stopping workers are still a larger proportion of pool capacity for Azure workers than AWS or GCP worker pools. I think focusing on issues with requesting and stopping Azure workers is a productive use of our time.
We run Azure workers across regions (Group ID in worker manager, centralus, eastus, etc). It looks to me that requested and stopping workers are spread across regions, so this is not an issue with a particular region.
listWorkersForWorkerGroup returns a "created", "lastModified", and "lastChecked" timestamp, which can give an idea of the age. My script has a CSV output that lets me easily sort and filter. Out of the 126 "requested" workers, 86 are older than 26 minutes, and the rest were just requested. Most of the 86 are older than 30 minutes, which is the registrationTimeout for these pools - a worker that has been requested but not registered with the queue in 30 minutes is supposed to be terminated by worker manager. That looks like an issue with worker manager, but could be due to slowness in stopping workers. The average age of workers in "requested" is 65 minutes.
Looking at the "stopping" workers, I can use "lastChecked" and "lastModified" to determine how long they have been stopping. Out of 65 in the "stopping" state, 5 have just switched to stopping, and 59 have been stopping for more than 30 minutes. The longest has been "stopping" for 8 hours, and the average is 73 minutes in the "stopping" state.
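The age arithmetic itself is simple; here is a sketch using the "lastModified" timestamp, assuming it arrives as an ISO 8601 string as the list APIs return:

```python
from datetime import datetime, timezone

def minutes_in_state(last_modified, now=None):
    """Minutes since the worker's state last changed, from an ISO timestamp."""
    now = now or datetime.now(timezone.utc)
    then = datetime.fromisoformat(last_modified.replace("Z", "+00:00"))
    return (now - then).total_seconds() / 60

# A worker last modified at 17:27 has, by 18:00, spent 33 minutes in its
# state, already past the 30-minute registrationTimeout.
now = datetime(2021, 10, 13, 18, 0, tzinfo=timezone.utc)
age = minutes_in_state("2021-10-13T17:27:00Z", now=now)
```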
This seems like a problem. A worker will shut itself off after 16 minutes of idle time. If it takes an hour or more to de-provision, the system will not respond well to spikes in CI requests, as compared to AWS and GCP workers.
Are the workers in "requested" and "stopping" the exceptions or the rule? I believe we'd need to process the logs to determine this, and it is not clear to me if there is sufficient data in the logs to make these calculations. I'd start with the worker-manager service logs in GCP, but we may need the worker logs in Papertrail as well.
The "stopping" state was added for Azure because it is complex and multi-step:
- deprovision the VM
- deprovision the NIC
- deprovision the IP
- deprovision the OS and data disks
- switch state to "stopped"
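A toy simulation of the one-resource-per-loop behaviour those steps imply (illustrative only, not the actual worker-manager code):

```python
# Toy model: each scanner loop deprovisions at most one remaining resource,
# so a worker with a VM, NIC, IP, and two disks needs five loops to stop.
def scanner_loops_to_stop(resources):
    loops = 0
    pending = list(resources)
    while pending:
        pending.pop(0)  # one deprovision step per scanner iteration
        loops += 1
    return loops

loops = scanner_loops_to_stop(["vm", "nic", "ip", "disk-os", "disk-1"])
# loops == 5; if each loop takes an hour, "stopping" alone takes five hours.
```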
Based on the hanging resources, there may be some bugs in deprovisioning. However, I don't see sufficient logging in worker manager to determine what is going wrong.
In summary, here are some questions to investigate, which all require looking at the running system in firefox-ci:
- Is worker manager attempting to launch Azure workers, or is it limiting due to capacity? Signs are it is hitting max capacity.
- Are the per-region limits a limiting factor, or does worker-manager send an email and quickly retry a different region? My guess is this is an annoyance but not a root cause.
- Do we have sufficient logging to measure worker lifecycle transitions?
- How long does it take for an Azure worker to go from requested to running, and what is the distribution of startup times?
- How long does it take for an Azure worker to go from stopping to stopped, and what is the distribution of stopping times?
- What resources (VMs, NIC, IP, disks) are not being de-provisioned? Can we detect these with current tools?
Bugs:
- Requested workers older than registrationTimeout are not terminated - why?
- Workers are stuck in "STOPPING" for hours - why?
Features:
- Increase logging for better visibility into Azure provisioning and de-provisioning from worker manager
- (Maybe) Track startup, shutdown durations in the Taskcluster database
Tasks:
- Update firefox-ci to the current Taskcluster version, and then continue updating for worker manager changes
- (Maybe) Add an Azure pool to community-tc, so we can more rapidly test Taskcluster changes
Nice to have:
- Worker counts by state - https://github.com/taskcluster/taskcluster/issues/4942
- Log registration warnings: https://github.com/taskcluster/taskcluster/pull/5000
- Per-region limits - https://github.com/taskcluster/taskcluster/issues/4938
Comment 12•4 years ago
I looked into a worker that has been "stopping" for a while. Many of the workers have transitioned to "stopped" since I wrote my last comment, so pool gecko-t/win10-64-2004 was looking a lot healthier:
| State | Workers | Capacity |
|---|---|---|
| requested | 53 | 53 |
| running | 490 | 490 |
| stopping | 21 | 21 |
| stopped | 16171 | 16171 |
The oldest was worker vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7. It ran 19 tasks, with the last one resulting in an Exception:
https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-t/worker-types/win10-64-2004/workers/centralus/vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7
https://firefox-ci-tc.services.mozilla.com/tasks/TJIuuqi-SEGXtiLSE9xlXw/runs/0
The final task exited with code 1073807364, or 0x40010004, which could be DBG_TERMINATE_PROCESS: https://stackoverflow.com/q/37078953/10612
The logs are available in GCP, taskcluster-prod project in firefox.gcp.mozilla.com, with this query:
resource.type="k8s_container"
resource.labels.location="us-west1"
resource.labels.cluster_name="taskcluster-firefoxcitc-v1"
resource.labels.namespace_name="firefoxcitc-taskcluster"
labels.k8s-pod/app_kubernetes_io/component="taskcluster-worker-manager-workerscanner"
labels.k8s-pod/app_kubernetes_io/instance="firefoxcitc-taskcluster-taskcluster-1"
labels.k8s-pod/app_kubernetes_io/name="taskcluster-worker-manager"
labels.k8s-pod/app_kubernetes_io/part-of="taskcluster"
jsonPayload.Fields.workerId="vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7"
I noticed that the log messages were grouped by time, which may correspond to worker-manager-workerscanner iterations. The log entries (GMT) are DEBUG unless noted.
- Loop 0
  2021-10-13 13:07:42.828 fetched instance view
- Loop 1 - 86 minutes later
  2021-10-13 14:33:45.870 fetched instance view
- Loop 2 - 53 minutes later
  2021-10-13 15:27:07.547 fetched instance view
  2021-10-13 16:32:45.857 ERROR monitor.periodic (not in the vm-ovcv... query)
  2021-10-13 16:32:45.874 WARNING monitor.error Error: Iteration exceeded maximum time allowed (not in the vm-ovcv... query)
- Loop 3 - 79 minutes later
  2021-10-13 16:46:08.453 fetched instance view
  2021-10-13 17:03:55.882 (not in the GCP logs, but when the final task exited)
- Loop 4 - 41 minutes later
  2021-10-13 17:27:49.260 fetched instance view
  2021-10-13 17:27:49.260 NOTICE worker-removed failed power state; powerStates=ProvisioningState/succeeded, PowerState/deallocated
- Loop 5 - 83 minutes later
  2021-10-13 18:51:20.439 deprovisionResource for vm with index undefined
  2021-10-13 18:51:20.439 deleting resource
- Loop 6 - 77 minutes later
  2021-10-13 20:08:47.828 deprovisionResource for vm with index undefined
  2021-10-13 20:08:48.040 resource vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7 not found; removing its id
  2021-10-13 20:08:48.045 deprovisionResource for nic with index undefined
  2021-10-13 20:08:48.045 deleting resource
- Loop 7 - 32 minutes later
  2021-10-13 20:41:41.257 deprovisionResource for vm with index undefined
  2021-10-13 20:41:41.445 resource vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7 not found; removing its id
  2021-10-13 20:41:41.456 deprovisionResource for nic with index undefined
  2021-10-13 20:41:41.650 resource nic-skxgyjkssukgavv58lar not found; removing its id
  2021-10-13 20:41:41.655 deprovisionResource for ip with index undefined
  2021-10-13 20:41:41.656 deleting resource
- Loop 8 - 2 minutes later
  2021-10-13 20:44:18.892 deprovisionResource for vm with index undefined
  2021-10-13 20:44:19.426 resource vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7 not found; removing its id
  2021-10-13 20:44:19.436 deprovisionResource for nic with index undefined
  2021-10-13 20:44:19.635 resource nic-skxgyjkssukgavv58lar not found; removing its id
  2021-10-13 20:44:19.639 deprovisionResource for ip with index undefined
  2021-10-13 20:44:19.814 resource pip-eehpjfwmr7kj7h5omk0r not found; removing its id
  2021-10-13 20:44:19.819 deprovisionResource for disks with index 0
  2021-10-13 20:44:19.819 deleting resource
  2021-10-13 20:44:20.360 deprovisionResource for disks with index 1
  2021-10-13 20:44:20.360 deleting resource
- Loop 9 - 30 minutes later
  2021-10-13 21:15:05.773 deprovisionResource for vm with index undefined
  2021-10-13 21:15:06.186 resource vm-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7 not found; removing its id
  2021-10-13 21:15:06.194 deprovisionResource for nic with index undefined
  2021-10-13 21:15:06.496 resource nic-skxgyjkssukgavv58lar not found; removing its id
  2021-10-13 21:15:06.503 deprovisionResource for ip with index undefined
  2021-10-13 21:15:06.818 resource pip-eehpjfwmr7kj7h5omk0r not found; removing its id
  2021-10-13 21:15:06.825 deprovisionResource for disks with index 0
  2021-10-13 21:15:07.129 resource disk-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7-os not found; removing its id
  2021-10-13 21:15:07.135 deprovisionResource for disks with index 1
  2021-10-13 21:15:07.308 resource disk-ovcv76e0q6ucfdi4oalhwsj1hhoorqvckx7-1 not found; removing its id
  2021-10-13 21:15:07.314 setting state to STOPPED
It was 251 minutes from the last task exiting (maybe from generic-worker shutting down the VM?) to the worker being marked as STOPPED. It appears the deprovisionResource code is "failing" at each step, so that it takes at least 5 workerscanner loops to completely de-provision a worker. Plus, a workerscanner loop is really long, sometimes more than an hour.
I think this is a worker-manager bug, and is a pretty good explanation why stopping a VM is so slow, and only worse when CI load is high.
Comment 13•4 years ago
I need to do some more investigation into the history of this code, which has some comments that are unclear to me. This may be a feature, to give Azure time to actually de-provision a resource before moving on to the next one. If that is the case, then we may not be able to de-provision all the Azure resources in the same loop. It may be better to look for ways to speed up the loops instead.
Comment 14•4 years ago
I looked more closely at the start-up process for an Azure worker. The worker-manager provisioner process will "start" a worker, based on the queue backlog and capacity. This simply adds a row with the required data to the database, and does not request any resources from Azure.
Next, the worker-manager worker-scanner reads from the database, and provisions the resources, one per loop:
- The IP
- The NIC
- The VM
The VM will then boot, and call the worker-manager API to register itself. At this point, it switches to "running" state and starts claiming work.
So, when the worker-scanner loops are fast, Azure workers are started quickly, but still in at least three loops. If worker-scanner is slow (more than an hour per loop, as seen in comment 12), it can take 3-4 hours for a worker to be ready to work. The worker-scanner does work for requested, running, and stopping workers, so I expect that worker-scanner loop time is roughly proportional to the sum of the current capacity across pools (gecko-t/win10-64-2004, gecko-t/win10-64-2004-gpu, and gecko-t/win10-64-2004-test.)
I think more logging is needed to 1) measure how long the worker-scanner loops are, and 2) model loop duration as a function of worker capacity. A lower max capacity may result in a more responsive system, and we may want to consider a minimum capacity as well.
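A hedged sketch of the kind of loop-duration logging suggested here; the function and pool names are invented for illustration, not taken from worker-manager.

```python
import logging
import time

log = logging.getLogger("worker-scanner")

def timed_scan(scan_pool, pools):
    """Run one scanner iteration, logging per-pool and total durations."""
    start = time.monotonic()
    for pool in pools:
        pool_start = time.monotonic()
        scan_pool(pool)  # whatever work the scanner does for this pool
        log.info("scanned %s in %.1fs", pool, time.monotonic() - pool_start)
    total = time.monotonic() - start
    log.info("scanner loop finished in %.1fs", total)
    return total

# With durations logged per loop, loop time can later be modelled as a
# function of the total worker capacity being scanned.
total = timed_scan(lambda pool: None, ["gecko-t/win10-64-2004"])
```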
There may be ways to improve the provisioning and de-provisioning code as well, but it looks like this was a known issue in the past, and past TC developers were unable to make progress. More on that in the next comment about the code history.
Comment 15•4 years ago
I've looked at the code history, to determine when incremental resource provisioning and de-provisioning was added. The answer appears to be shortly after the Azure code was first shipped, in February 2020. This implies to me that the focus should be in making the worker-scanner loop as fast as possible, by identifying and removing bottlenecks, batching operations, or even terminating a loop faster.
I'm no longer working on Taskcluster, so a new developer will have to take on this work. My first suggestion is to determine how long each provider (AWS, Azure, Google) takes to scan, and how many API calls each makes. This can help determine which should be optimized. I'm guessing Azure, but that's a guess!
I'm curious if multiple items can be downloaded at once (like all the paginated VM data), rather than once per worker, and if that would speed things up. I also wonder if we should rethink resources in Azure. Should we have a pool of IPs and NICs that get attached to worker instances and released to the pools, rather than be allocated and de-allocated?
Here are the details of the Azure provisioner history:
2020-01-13 - v24.2.0 - Initially added Azure provider
The Azure work was done under bug 1600966, and Azure issues in tracking bug 1601033. The code was added in PR 2197.
The code notes that the Azure SDK has some built-in retry logic, and tries to handle 429 Too Many Requests and 5xx server errors.
Provisioning creates an IP address, a network interface (NIC), and the virtual machine (VM). Any error results in a VM Creation Error and deleting the worker.
Deleting a worker deletes the VM, the NIC, and the disk in parallel, then the IP. This went through some iterations during the PR process.
2020-02-19 - v25.4.0 - Avoid blocking resource calls
Bug 1616649 notes that the Azure API calls are blocking calls, and that operations can take a long time. This was fixed in PR 2384.
Resources are now allocated or de-allocated one per worker-scanner run.
The provisioner now incrementally provisions the IP Address, the NIC, and finally the VM.
The de-provisioner now incrementally de-provisions the VM, NIC, IP, and then the disk.
2020-02-25 - v25.4.0 - Avoid over-provisioning
Bug 1600071 tracks an over-provisioning bug affecting t-win10-64 instances in AWS. The discussion covers a lot of ground, and identifies a few issues. PR 2407 addresses one issue, counting "requested" workers in the capacity estimate, to avoid provisioning new workers when one is slow to transition to "running".
Note, we have almost the opposite issue, that "stopping" Azure workers contribute to the capacity, so new Azure workers are not provisioned.
2020-05-28 - v30.0.3? - Test incremental resource provisioning / deprovisioning
PR 2909 was a "silent" change that refactored the way Azure code was tested, and re-wrote the tests. One of the tests demonstrates that it takes repeated loops to fully provision or remove a worker.
There are some minor changes in the production code as well, to avoid re-creating resources if creation is in progress.
2020-06-04 - v30.0.3 - Delete all of a VM's disks
Bug 1631824 notes that only the first disk is deleted, and a VM may have other data disks. PR 2790 changes this to delete each disk assigned to the VM. There was a bug that kept this from running correctly, fixed around 2021-04-27 (see below).
2020-06-05 - v30.1.0 - Worker table migration
Issue 2877 tracks migrating the worker table to support a re-registration token. This was fixed in PR 2908.
There were some code changes, such as the provisionResource returning the (updated?) worker.
2020-08-05 - v37.2.0 - Update Azure instance state recognition
Issue 3346 mentions bug 1646061, where an Azure VM had a state 'ProvisioningState/creating', but was terminated because it wasn't one of the expected states. PR 3489 changes the code to stop making one API call, and to only act on known failure states. For unknown states, it is assumed the (newish) registration timeout would catch and terminate the instance.
2020-09-15 - v37.2.0 - Clarify state transitions in Azure code
Issue 3058 talks about refactoring the provisioning logic, to allow for better testing, and being clearer if a resource is in the process of being provisioned. PR 3505 implements the changes, with tests.
One external change is that the API call to remove a worker doesn't do the work itself, but instead waits for the next worker-scanner loop to start de-provisioning. However, there was a bug that meant no VMs were deprovisioned; see the next entry.
2021-04-06 - v43.0.0 - Always de-provision workers in STOPPING state
Issue 4715 notes that workers in the STOPPING state are not having their resources removed, until the VM was removed. It guesses this was added in v37.2.0 with PR 3505, when the removeWorker API call no longer took the first step of removing the VM. PR 4718 fixes this.
This fix means that scanner loops are (again?) doing work to de-provision Azure workers, which may make those loops slower.
2021-04-27 - v43.2.0 - Azure disks get unique names, can be deleted
Work continues on bug 1631824, since disks are still present. PR 4806 gives disks unique names, so they can be deleted along with the VM.
This has a few consequences:
- At the end of provisioning, the worker data is written to the database, to record the disk names.
- Worker scanner may be slower, as it actually deletes disks
- It may take additional worker scanner loops to fully stop an Azure worker
- Disks can't be shared between VMs; the first VM to stop would delete them
Comment 16•2 years ago
The win 2004 pools were removed in bug 1857614. Are we still seeing this with the current ones, or have things gotten better in the last couple of years?
Comment 17•2 years ago
Over the past 30 days, there have sometimes been periods when tasks had to wait for a machine. I lack data on how long they have to wait (Treeherder doesn't know when all dependencies are finished). The longest backlogs last 4-5.5h, but we did not have to close the trees for this (and we don't watch the Try tree, which would be impacted first).