Bug 1723789 Comment 4 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

Original comment by John Whitlock [:jwhitlock] on 2021-08-03 16:12:06 PDT

I've been looking into this from the Taskcluster side, learning about the system.

Workers can be in four states:

* ``requested`` - A new worker is being provisioned, but has not yet checked in as ready for tasks
* ``running`` - A worker is ready to run tasks, and is probably running one
* ``stopping`` - A worker is being shut down, and its resources are being released
* ``stopped`` - A worker has been shut down

Workers have a capacity, which is the number of tasks they can run at once. All of these Azure workers have a capacity of 1, so each Azure VM corresponds to exactly one unit of worker capacity, which keeps the math easy.

The Worker Pool capacity is the total capacity of all workers that are not in the ``stopped`` state. This means it includes workers in the ``requested`` and ``stopping`` states, although, as far as I can tell, only ``running`` workers are available for work.

The worker state does not appear to be available from the UI, only from the API, so I created [a script](https://github.com/jwhitlock/tc-scripts/blob/main/worker-pool-stats.py) to collate the data for a worker pool. During a recent run, I saw 496 VMs in Azure (2 creating, 2 deleting, 492 running), and the TC workers showed 37 requested, 490 running, 77 stopping, and 9601 stopped. This corresponded to a pool capacity of 604 (37 + 490 + 77).
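The capacity arithmetic above can be checked with a minimal sketch. It is not the linked script; it assumes every worker has a capacity of 1, as in this pool, and the names `state_counts`, `pool_capacity`, and `running_capacity` are made up for illustration.

```python
# Worker counts by state, taken from the run described above.
state_counts = {"requested": 37, "running": 490, "stopping": 77, "stopped": 9601}

# Assumption: every worker in this pool has a capacity of 1.
CAPACITY_PER_WORKER = 1


def pool_capacity(counts, per_worker=CAPACITY_PER_WORKER):
    """Total capacity of all workers that are not in the stopped state."""
    return sum(n * per_worker for state, n in counts.items() if state != "stopped")


def running_capacity(counts, per_worker=CAPACITY_PER_WORKER):
    """Capacity of workers in the running state only."""
    return counts.get("running", 0) * per_worker


print(pool_capacity(state_counts))     # 604
print(running_capacity(state_counts))  # 490
```

The gap between the two numbers (114 here) is the capacity attributed to workers that are still being provisioned or are already shutting down.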

Workers can shut themselves down when idle. I believe this means they will be marked as ``running`` until a periodic scan detects that they have shut down. I need to read more code to understand this process. However, the number of ``running`` workers is pretty close to the number of running VMs in Azure, which implies that the worker-manager quickly transitions workers to the ``stopping`` state. A cross-reference of VM names would be needed to see how accurately the worker-manager states line up with the Azure dashboard.
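A rough sketch of what such a cross-reference could look like, using set operations. It assumes that the Taskcluster worker ID matches the Azure VM name, which I have not verified, and that both lists have already been fetched. The function name and the sample data are made up for illustration.

```python
def cross_reference(azure_running_vms, tc_workers):
    """Compare the VMs Azure reports as running with Taskcluster worker states.

    azure_running_vms: set of VM names that Azure reports as running
    tc_workers: dict mapping worker ID to its state string
    Assumption: worker IDs and VM names are the same strings.
    """
    tc_running = {wid for wid, state in tc_workers.items() if state == "running"}
    return {
        "agree": azure_running_vms & tc_running,
        "running_in_tc_only": tc_running - azure_running_vms,
        "running_in_azure_only": azure_running_vms - tc_running,
    }


# Made-up sample data, not real worker or VM names.
azure = {"vm-a", "vm-b", "vm-c"}
workers = {"vm-a": "running", "vm-b": "stopping", "vm-d": "running", "vm-e": "stopped"}

result = cross_reference(azure, workers)
for label, names in result.items():
    print(label, sorted(names))
```

A large `running_in_tc_only` set would suggest that workers stay marked as ``running`` for a while after their VM is gone; a large `running_in_azure_only` set would suggest VMs that are still up after the worker-manager has moved them on.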

I think that worker state should be exposed on the [queue view](https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-t/worker-types/win10-64-2004), and that you should be able to filter out stopped workers. I think it would also be useful to show the worker count by state for a pool, since the number of workers in the ``running`` state is a better guide than the pool capacity to how quickly tasks will be executed.
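As an illustration of the per-state summary suggested above, here is a small sketch that tallies workers by state and filters out stopped workers. The record shape (dicts with `workerId` and `state` keys) is an assumption about what the API returns, and the sample records are made up.

```python
from collections import Counter


def count_by_state(workers):
    """Return the number of workers in each state."""
    return Counter(w["state"] for w in workers)


def without_stopped(workers):
    """Drop workers in the stopped state, as a UI filter might."""
    return [w for w in workers if w["state"] != "stopped"]


# Made-up sample records; the field names are assumed, not confirmed.
sample = [
    {"workerId": "vm-1", "state": "running"},
    {"workerId": "vm-2", "state": "running"},
    {"workerId": "vm-3", "state": "requested"},
    {"workerId": "vm-4", "state": "stopping"},
    {"workerId": "vm-5", "state": "stopped"},
]

counts = count_by_state(sample)
active = without_stopped(sample)
print(dict(counts))
print([w["workerId"] for w in active])
```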

Revision 1 by John Whitlock [:jwhitlock] on 2021-09-28 09:36:06 PDT

I've been looking into this from the Taskcluster side, learning about the system.

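Workers can be in four states:

* ``requested`` - A new worker is being provisioned, but has not yet checked in as ready for tasks
* ``running`` - A worker is ready to run tasks, and is probably running one
* ``stopping`` - A worker is being shut down, and its resources are being released
* ``stopped`` - A worker has been shut down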

Workers have a capacity, which is the number of tasks they can run at once. All of these Azure workers have a capacity of 1, so each Azure VM corresponds to exactly one unit of worker capacity, which keeps the math easy.

The Worker Pool capacity is the total capacity of all workers that are not in the ``stopped`` state. This means it includes workers in the ``requested`` and ``stopping`` states, although, as far as I can tell, only ``running`` workers are available for work.

The worker state does not appear to be available from the UI, only from the API, so I created [a script](https://github.com/jwhitlock/tc-scripts/blob/main/worker_pool_stats.py) to collate the data for a worker pool. During a recent run, I saw 496 VMs in Azure (2 creating, 2 deleting, 492 running), and the TC workers showed 37 requested, 490 running, 77 stopping, and 9601 stopped. This corresponded to a pool capacity of 604 (37 + 490 + 77).
