Azure provider capacity and provisioning
Categories
(Taskcluster :: Services, task)
Tracking
(Not tracked)
People
(Reporter: markco, Unassigned)
Details
Attachments
(1 file)
202.89 KB, image/jpeg
2021-08-02 there was a delay in Azure's creation of w10-64-2004 workers, and pending counts continued to rise. The time frame seemed to be between 17:30 UTC and 19:30 UTC. The pending count continued to climb and no additional workers were created. At the time, worker-manager showed 594 of these workers available, when there were actually only 77 total in varying states (running, stopping, or stopped).
Reporter
Updated • 4 years ago
Reporter
Comment 1 • 4 years ago
I also wonder whether hitting VM creation errors like the following:
The requested size for resource '/subscriptions/108d46d5-fe9b-4850-9a7d-8c914aa6c1f0/resourceGroups/rg-taskcluster-worker-manager-production/providers/Microsoft.Compute/virtualMachines/vm-m7jufq1vsju5ij2yd5fkmazwrn1x8nsf2ij' is currently not available in location 'northcentralus' zones '' for subscription '108d46d5-fe9b-4850-9a7d-8c914aa6c1f0'. Please try another size or deploy to a different location or zones. See https://aka.ms/azureskunotavailable for details.
Reported
2021-08-03T18:00:29.958Z
causes an issue with worker-runner and available capacity. It does cause a significant delay, up to an additional 30 minutes, before the task is picked up.
Reporter
Comment 2 • 4 years ago
Currently worker-manager thinks there is a capacity of 742 w10-64-2004 workers. In actuality there are about 300 instances of that type running, with fewer than 350 total instances going. This is resulting in an hour's delay between a task being triggered and being run. See: https://firefox-ci-tc.services.mozilla.com/tasks/SD2B_p3KTOmlzoWclffquQ
Our true capacity for total instances in Azure is around 1200, depending on whether we get outbid for spot instances, which we are not currently seeing. https://firefox-ci-tc.services.mozilla.com/worker-manager/gecko-t%2Fwin10-64-2004/errors
Reporter
Comment 3 • 4 years ago
Just a note: the w10-64-2004 pool config is set for a max capacity of 600 workers, which we have not hit as of yet.
Comment 4 • 4 years ago
I've been looking into this from the Taskcluster side, learning about the system.
Workers can be in four states:
- requested - A new worker is being provisioned, but has not checked in as ready for tasks
- running - A worker is ready to run tasks, and is probably running one
- stopping - A worker is being shut down, and its resources released
- stopped - A worker has been shut down
Workers can have a capacity, which says how many tasks they can run. All these Azure workers have a capacity of 1, which makes the math easy going from an Azure VM to a unit of worker capacity.
The Worker Pool capacity is the total capacity of workers that are not in the stopped state. This means it includes workers in the requested and stopping states, while it seems to me that only running workers are available for work. The worker state does not appear to be available from the UI, only from the API, so I created a script to collate the data for a worker pool. During a recent run, I saw 496 VMs in Azure (2 creating, 2 deleting, 492 running) and the TC workers showed 37 requested, 490 running, 77 stopping, and 9601 stopped. This corresponded to a pool capacity of 604.
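The divergence described above can be sketched in a few lines of Python. This is a minimal illustration using the counts from the run above (37 requested, 490 running, 77 stopping, 9601 stopped, each with capacity 1); the dict shape and function names are hypothetical, not the actual worker-manager schema.

```python
# Hypothetical worker records mirroring the counts observed in the comment
# above; every Azure worker in this pool has a capacity of 1.
workers = (
    [{"state": "requested", "capacity": 1}] * 37
    + [{"state": "running", "capacity": 1}] * 490
    + [{"state": "stopping", "capacity": 1}] * 77
    + [{"state": "stopped", "capacity": 1}] * 9601
)

def pool_capacity(workers):
    """Capacity as worker-manager counts it: every non-stopped worker."""
    return sum(w["capacity"] for w in workers if w["state"] != "stopped")

def effective_capacity(workers):
    """Capacity actually able to take tasks: running workers only."""
    return sum(w["capacity"] for w in workers if w["state"] == "running")

print(pool_capacity(workers))       # 37 + 490 + 77 = 604
print(effective_capacity(workers))  # 490
```

The gap between the two numbers (604 vs. 490) is exactly the requested and stopping workers that the pool capacity counts but that cannot take tasks.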
Workers can shut themselves down when idle. I believe this means they will be marked as running until a periodic scan detects they have been shut down. I need to read more code to understand this process. But the number of running workers is pretty close to the number of running VMs in Azure, which implies that the worker-manager quickly transitions workers to the stopping state. A cross-reference of VM names would be needed to see how accurately the worker-manager states line up with the Azure dashboard.
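The cross-reference suggested above amounts to a set comparison. The sketch below uses made-up VM names purely for illustration; in practice the two sets would come from the Azure VM list and the worker-manager worker list.

```python
# Hypothetical name sets: what Azure reports as running vs. what
# worker-manager believes is running.
azure_running_vms = {"vm-aaa", "vm-bbb", "vm-ccc"}
tc_running_workers = {"vm-aaa", "vm-bbb", "vm-ddd"}

# Workers TC thinks are running but Azure no longer has: likely shut
# themselves down and are waiting for the next worker-scanner pass.
stale_in_tc = tc_running_workers - azure_running_vms

# VMs Azure is running that TC does not yet count as running.
unknown_to_tc = azure_running_vms - tc_running_workers

print(sorted(stale_in_tc))    # ['vm-ddd']
print(sorted(unknown_to_tc))  # ['vm-ccc']
```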
I think that worker state should be exposed on the queue view, and you should be able to filter out stopped workers. I think it would be useful to show the worker count by state for a pool as well, and the number of workers in a running state is a better guide to how quickly tasks will be executed.
Comment 5 • 4 years ago
I'm going to have to hand this off to someone else on the SysEng team. I'm not sure who will pick it up, or when they will start. I think there's a large amount of work to understand Azure provisioning, and to make a worker-manager scanner and provisioner that efficiently work with it.
The big takeaways:
- Worker-manager's view of the worker capacity is very different from the queue's. "Stopping" workers are counted as part of the capacity, even though they can't take tasks.
- Worker shutdown is an important part of the process, and is slow because each resource (IP, network interface, disks, and finally VM) is taken down one worker-scanner cycle at a time. When the worker scanner takes an hour per cycle, this can mean 4 hours to shut down a worker!
- A faster worker scanner, with more cycles per hour, would have a huge positive impact on Azure provisioning, and probably Firefox CI provisioning in general.
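The teardown math in the takeaways above can be sketched as a back-of-the-envelope calculation. This assumes, as the comment describes, one resource torn down per scanner cycle and four resources per worker (IP, network interface, disks, VM); the function is illustrative, not part of worker-manager.

```python
RESOURCES_PER_WORKER = 4  # IP, network interface, disks, and finally the VM

def teardown_hours(scan_cycle_minutes):
    """Total time to fully tear down one worker, assuming one resource
    is removed per worker-scanner cycle."""
    return RESOURCES_PER_WORKER * scan_cycle_minutes / 60

print(teardown_hours(60))  # 4.0 hours with an hour-long scan cycle
print(teardown_hours(10))  # ~0.67 hours with six cycles per hour
```

Under this model, teardown time scales linearly with the scan interval, which is why more scanner cycles per hour would have such an outsized effect.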
Comment 6 • 3 years ago
Still an issue, or fixed with the worker-manager changes in H1?