I've been looking into this from the Taskcluster side, learning about the system.

Workers can be in four states:

* ``requested`` - A new worker is being provisioned, but has not checked in as ready for tasks
* ``running`` - A worker is ready to run tasks, and is probably running one
* ``stopping`` - A worker is being shut down, and its resources released
* ``stopped`` - A worker has been shut down

Workers can have a capacity, which says how many tasks they can run. All these Azure workers have a capacity of 1, which makes the math easy going from an Azure VM to a unit of worker capacity.

The Worker Pool capacity is the total capacity of workers that are not in the ``stopped`` state. This means it includes workers in the ``requested`` and ``stopping`` states, even though it seems to me that only ``running`` workers are available for work.

The worker state does not appear to be available from the UI, only from the API, so I created [a script](https://github.com/jwhitlock/tc-scripts/blob/main/worker-pool-stats.py) to collate the data for a worker pool. During a recent run, I saw 496 VMs in Azure (2 creating, 2 deleting, 492 running) and the TC workers showed 37 requested, 490 running, 77 stopping, and 9601 stopped. This corresponded to a pool capacity of 604 (37 + 490 + 77).

Workers can shut themselves down when idle. I believe this means they will be marked as ``running`` until a periodic scan detects they have been shut down. I need to read more code to understand this process. However, the number of ``running`` workers is pretty close to the number of running VMs in Azure, which implies that the worker-manager quickly transitions workers to the ``stopping`` state. A cross-reference of VM names would be needed to see how accurately the worker-manager states line up with the Azure dashboard.

I think that worker state should be exposed on the [queue view](https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-t/worker-types/win10-64-2004), and you should be able to filter out stopped workers. I think it would be useful to show the worker count by state for a pool as well, since the number of workers in the ``running`` state is a better guide to how quickly tasks will be executed.
Bug 1723789 Comment 4 Edit History
I've been looking into this from the Taskcluster side, learning about the system.

Workers can be in four states:

* ``requested`` - A new worker is being provisioned, but has not checked in as ready for tasks
* ``running`` - A worker is ready to run tasks, and is probably running one
* ``stopping`` - A worker is being shut down, and its resources released
* ``stopped`` - A worker has been shut down

Workers can have a capacity, which says how many tasks they can run. All these Azure workers have a capacity of 1, which makes the math easy going from an Azure VM to a unit of worker capacity.

The Worker Pool capacity is the total capacity of workers that are not in the ``stopped`` state. This means it includes workers in the ``requested`` and ``stopping`` states, even though it seems to me that only ``running`` workers are available for work.

The worker state does not appear to be available from the UI, only from the API, so I created [a script](https://github.com/jwhitlock/tc-scripts/blob/main/worker_pool_stats.py) to collate the data for a worker pool. During a recent run, I saw 496 VMs in Azure (2 creating, 2 deleting, 492 running) and the TC workers showed 37 requested, 490 running, 77 stopping, and 9601 stopped. This corresponded to a pool capacity of 604 (37 + 490 + 77).

Workers can shut themselves down when idle. I believe this means they will be marked as ``running`` until a periodic scan detects they have been shut down. I need to read more code to understand this process. However, the number of ``running`` workers is pretty close to the number of running VMs in Azure, which implies that the worker-manager quickly transitions workers to the ``stopping`` state. A cross-reference of VM names would be needed to see how accurately the worker-manager states line up with the Azure dashboard.

I think that worker state should be exposed on the [queue view](https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-t/worker-types/win10-64-2004), and you should be able to filter out stopped workers. I think it would be useful to show the worker count by state for a pool as well, since the number of workers in the ``running`` state is a better guide to how quickly tasks will be executed.
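
As a rough illustration of the capacity arithmetic described above, here is a minimal sketch that counts workers by state and compares the pool capacity (everything not ``stopped``) with the capacity that is actually ``running``. The record shape (dicts with ``state`` and ``capacity`` keys) and the helper names are assumptions made for this example; they are not taken from the linked script or from the worker-manager API response.

```python
from collections import Counter


def summarize_pool(workers):
    """Return (counts by state, pool capacity, running capacity).

    `workers` is an iterable of dicts with hypothetical "state" and
    "capacity" keys; the real API response may be shaped differently.
    """
    workers = list(workers)
    counts = Counter(w["state"] for w in workers)
    # Pool capacity as described in the comment: all workers not yet stopped.
    pool_capacity = sum(w["capacity"] for w in workers if w["state"] != "stopped")
    # Capacity that can actually take tasks right now.
    running_capacity = sum(w["capacity"] for w in workers if w["state"] == "running")
    return counts, pool_capacity, running_capacity


def make_workers(state, n, capacity=1):
    """Build n placeholder worker records in the given state."""
    return [{"state": state, "capacity": capacity} for _ in range(n)]


# The counts reported in the comment, with capacity 1 per worker.
workers = (
    make_workers("requested", 37)
    + make_workers("running", 490)
    + make_workers("stopping", 77)
    + make_workers("stopped", 9601)
)

counts, pool_capacity, running_capacity = summarize_pool(workers)
print(dict(counts))
print("pool capacity:", pool_capacity)        # 604
print("running capacity:", running_capacity)  # 490
```

With the numbers from the recent run, the pool capacity comes out to 604 while only 490 units of it belong to ``running`` workers, which is the gap the comment is pointing at.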