Open Bug 1723789 Opened 4 years ago Updated 3 years ago

Azure provider capacity and provisioning

Categories

(Taskcluster :: Services, task)

Tracking

(Not tracked)

People

(Reporter: markco, Unassigned)

Details

Attachments

(1 file)

Attached image delay.jpg

On 2021-08-02 there was a delay in Azure creation of w10-64-2004 workers, roughly between 17:30 UTC and 19:30 UTC. The pending count continued to climb, but no additional workers were created. At the time, worker-manager showed 594 of these workers available, when in fact there were only 77 total, in varying states (running, stopping, or stopped).

Assignee: nobody → jwhitlock

I also wonder whether hitting VM creation errors like:

The requested size for resource '/subscriptions/108d46d5-fe9b-4850-9a7d-8c914aa6c1f0/resourceGroups/rg-taskcluster-worker-manager-production/providers/Microsoft.Compute/virtualMachines/vm-m7jufq1vsju5ij2yd5fkmazwrn1x8nsf2ij' is currently not available in location 'northcentralus' zones '' for subscription '108d46d5-fe9b-4850-9a7d-8c914aa6c1f0'. Please try another size or deploy to a different location or zones. See https://aka.ms/azureskunotavailable for details.
Reported: 2021-08-03T18:00:29.958Z

causes an issue with worker-runner and available capacity. It does cause a significant delay, up to an additional 30 minutes, before the task is picked up.

Currently worker-manager thinks there is a capacity of 742 w10-64-2004 workers. In actuality there are about 300 instances of that type running, with fewer than 350 total instances going. This is resulting in an hour's delay between a task being triggered and being run. See: https://firefox-ci-tc.services.mozilla.com/tasks/SD2B_p3KTOmlzoWclffquQ

Our true capacity for total instances in Azure is around 1200, depending on whether we get outbid for spot instances, which we are not seeing currently. https://firefox-ci-tc.services.mozilla.com/worker-manager/gecko-t%2Fwin10-64-2004/errors

Just a note: the w10-64-2004 pool config is set for a max capacity of 600 workers, which we have not hit as of yet.

I've been looking into this from the Taskcluster side, learning about the system.

Workers can be in four states:

  • requested - A new worker is being provisioned, but has not checked in as ready for tasks
  • running - A worker is ready to run tasks, and is probably running one
  • stopping - A worker is being shut down, and its resources released
  • stopped - A worker has been shut down
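The four states above can be sketched as a small model; the helper reflects the observation later in this bug that only running workers can actually take tasks. This is illustrative, not the Taskcluster source:

```python
# Sketch of the four worker states listed above. Names mirror the bug
# text; this is a hypothetical model, not worker-manager's own code.
from enum import Enum

class WorkerState(Enum):
    REQUESTED = "requested"
    RUNNING = "running"
    STOPPING = "stopping"
    STOPPED = "stopped"

def can_take_tasks(state: WorkerState) -> bool:
    """Only running workers are available for work."""
    return state is WorkerState.RUNNING

print([s.value for s in WorkerState if can_take_tasks(s)])  # ['running']
```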

Workers can have a capacity, which says how many tasks they can run. All these Azure workers have a capacity of 1, which makes the math easy going from an Azure VM to a unit of worker capacity.

The Worker Pool capacity is the total capacity of workers that are not in the stopped state. This means it includes workers in the requested and stopping states, while it seems to me that only running workers are available for work. The worker state does not appear to be available from the UI, only from the API, so I created a script to collate the data for a worker pool. During a recent run, I saw 496 VMs in Azure (2 creating, 2 deleting, 492 running) and the TC workers showed 37 requested, 490 running, 77 stopping, and 9601 stopped. This corresponded to a pool capacity of 604.
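A sketch of the collation described above, assuming worker records shaped like those the worker-manager API returns (a `state` string and a `capacity` count); the sample numbers reproduce the figures from the run above:

```python
# Tally worker capacity by state for one pool. Pool capacity counts
# everything except 'stopped', matching worker-manager's accounting.
# The record shape (state/capacity fields) is an assumption based on
# the worker-manager API; the data below is illustrative.
from collections import Counter

def pool_capacity_by_state(workers):
    by_state = Counter()
    for w in workers:
        by_state[w["state"]] += w.get("capacity", 1)
    pool_capacity = sum(c for s, c in by_state.items() if s != "stopped")
    return by_state, pool_capacity

workers = (
    [{"state": "requested", "capacity": 1}] * 37
    + [{"state": "running", "capacity": 1}] * 490
    + [{"state": "stopping", "capacity": 1}] * 77
    + [{"state": "stopped", "capacity": 1}] * 9601
)
by_state, capacity = pool_capacity_by_state(workers)
print(by_state["running"], capacity)  # 490 604
```

Note how the 37 requested and 77 stopping workers inflate the pool capacity to 604, even though only 490 workers can take tasks.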

Workers can shut themselves down when idle. I believe this means they will be marked as running until a periodic scan detects they have been shut down. I need to read more code to understand this process. But the number of running workers is pretty close to the number of running VMs in Azure, which implies that the worker-manager quickly transitions workers to the stopping state. A cross-reference of VM names would be needed to see how accurately the worker-manager states line up with the Azure dashboard.

I think that worker state should be exposed on the queue view, and you should be able to filter out stopped workers. I think it would be useful to show the worker count by state for a pool as well, and the number of workers in a running state is a better guide to how quickly tasks will be executed.

I'm going to have to hand this off to someone else on the SysEng team. I'm not sure who will pick it up, or when they will start. I think there's a large amount of work to understand Azure provisioning, and to make a worker-manager scanner and provisioner that efficiently work with it.

The big takeaways:

  • Worker manager's view of the worker capacity is very different from the queue's. "Stopping" workers are counted as part of the capacity, even though they can't take tasks.
  • Worker shutdown is an important part of the process, and is slow because each resource (IP, network interface, disks, and finally the VM) is taken down one worker-scanner cycle at a time. When the worker scanner takes an hour per cycle, this can mean 4 hours to shut down a worker!
  • A faster worker scanner, with more cycles per hour, would have a huge positive impact on Azure provisioning, and probably Firefox CI provisioning in general.
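The shutdown arithmetic in the takeaways can be made explicit. This is a back-of-the-envelope sketch; the four-resources-per-worker and one-hour-cycle figures come from this comment, not from measurement:

```python
# Each resource (IP, NIC, disks, then the VM) is released on a separate
# worker-scanner cycle, so teardown latency is roughly
# resources_per_worker * scan_cycle_minutes.
def shutdown_minutes(resources_per_worker, scan_cycle_minutes):
    return resources_per_worker * scan_cycle_minutes

print(shutdown_minutes(4, 60))  # 240 minutes, i.e. ~4 hours at an hourly scan
print(shutdown_minutes(4, 10))  # 40 minutes with a 10-minute scan cycle
```

This is why a faster scanner helps: the same four teardown steps cost 40 minutes instead of 4 hours if the cycle drops from 60 minutes to 10.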
Assignee: jwhitlock → nobody

Still an issue, or fixed with the worker-manager changes in H1?

