Closed Bug 1378357 Opened 7 years ago Closed 5 years ago

Inefficient use of Windows workers

Categories

(Taskcluster :: Services, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1585644

People

(Reporter: pmoore, Unassigned)

References

Details

A combination of conditions leads to us creating (and paying for) far more Windows workers than we actually need:

* The queue exposes only the number of pending tasks per worker type, without any data about what those tasks are (rather than e.g. a list of taskIds).
* The provisioner therefore has no means to track how many spot instances it has already requested to meet the current demand (it does not know whether pending tasks are new ones, or ones it has already requested spot instances for).
* The Windows workers spawned by OCC currently take a considerable time to reach a state where they can execute a task, causing the provisioner to believe it needs to spawn new instances to meet current demand.

There are several ways to attack this problem, so this is a tracking bug for work in all of these areas.

1) The queue should provide better data. For example, it could provide two data points per worker type (see the sketch below). There are several equivalent variations here, for example:
* the absolute number of claimed tasks and the absolute number of submitted tasks (for a given worker type), or
* the absolute number of claimed tasks and the absolute number of pending tasks, or
* a v4 uuid which is rotated after some considerable interval (e.g. once per week), together with one of the options above, where the absolute numbers become absolute totals for the current uuid.

The third option is just a way to "reset" the counters to 0 when the absolute numbers become too big, or when services restart and the running totals are liable to be lost (they are volatile). Providing additional data such as this would allow the provisioner to reason more intelligently about how many instances it should request.

2) Wherever possible, instance initialisation should occur at AMI creation time rather than at instance startup. Time-consuming setup would then be performed once, when a worker type is updated, rather than every time an instance starts up.

3) Currently we have a time-consuming (25 minutes or so) overhead of formatting the Z: drive between tasks (including before the first task). This is done to avoid the performance hit of the copy-on-read semantics of the EBS volume that backs the Z: drive. However, since we do not need to initialise the drive with data, we might be able to either dynamically create the EBS-backed volume, or use a local instance store volume for the Z: drive instead.

There may also be other things we can do to improve efficiency, but these are the main options that stand out to me at the moment. I'll create sub-bugs for these optimisations, and we can keep this as a top-level tracking bug for any optimisations we wish to try out.
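To make option 1 concrete, here is a minimal sketch of the bookkeeping a provisioner could do if the queue exposed a rotation uuid plus absolute submitted/claimed totals. This is not an actual Taskcluster queue or provisioner API; the names and shapes are made up for illustration only.

```python
class DemandTracker:
    """Per-workerType bookkeeping a provisioner could keep, assuming the queue
    exposed (generation uuid, submitted total, claimed total) as in option 1."""

    def __init__(self):
        self.generation = None  # uuid of the counter "era" we last saw
        self.covered = 0        # submitted-total up to which we already requested capacity

    def instances_to_request(self, generation, submitted, claimed):
        # If the queue rotated its uuid, the totals were reset, so our
        # bookkeeping must be reset too.
        if generation != self.generation:
            self.generation = generation
            self.covered = 0

        pending = submitted - claimed                 # tasks still waiting
        uncovered = max(submitted - self.covered, 0)  # tasks we have not yet requested capacity for
        to_request = min(uncovered, pending)          # never exceed the real backlog
        self.covered = submitted
        return to_request


# Example: the second poll still shows 5 pending tasks, but only 2 of them are
# new, so only 2 extra instances are requested instead of 5.
tracker = DemandTracker()
print(tracker.instances_to_request("uuid-1", submitted=10, claimed=0))  # 10
print(tracker.instances_to_request("uuid-1", submitted=12, claimed=7))  # 2
```

With only a bare pending count (the current situation), the second poll above would look identical to five brand-new tasks, which is exactly the over-provisioning described in this comment.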
4) We might be able to tweak some settings in the worker type configs to get the provisioner to scale up more slowly.
From the provisioner docs for scalingRatio:

---
A scaling ratio of 0.2 means that the provisioner will attempt to keep the number of pending tasks around 20% of the provisioned capacity. This results in pending tasks waiting 20% of the average task execution time before starting to run. A higher scaling ratio often results in better utilization and longer waiting times. For workerTypes running long tasks a short scaling ratio may be preferred, but for workerTypes running quick tasks a higher scaling ratio may increase utilization without major delays. If using a scaling ratio of 0, the provisioner will attempt to keep the capacity of pending spot requests equal to the number of pending tasks.
---

The underlying issue is that there's a "started but not taking jobs yet" phase that is invisible to the provisioner and, for Windows instances, is quite long. It exists for Linux instances too -- it's just much shorter. The scaling ratio parameter could probably be used to compensate for that. It looks like it's set to 0 for Windows workerTypes right now (at least the few I looked at). That means that for every pending task the provisioner wants an instance in the "spot-request" phase (that is, not in the "started but not taking jobs yet" phase). Increasing that ratio would mean starting fewer new instances, but it's not going to be a perfect fix since the appropriate ratio will depend on the number of running instances, so it will slow down provisioning from a cold start.

It seems that #2/#3 (I think #3 is a specific case of #2?) is the best fix here, not least because that startup time is contributing substantially to end-to-end times and developer delays. #4 allows some fine-tuning once we get the startup time down to something more reasonable. Adjusting the provisioner (#1) would be complex and maybe unnecessary after #2/#3 is addressed.
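As an illustration only: this is one reading of the scalingRatio docs quoted above, not the actual aws-provisioner algorithm, and all names here are hypothetical. It treats the ratio as defining a target capacity of pending/ratio, and assumes instances in the "started but not taking jobs yet" phase count toward provisioned capacity, which is why a non-zero ratio requests fewer instances when such instances already exist.

```python
def instances_to_request(pending, provisioned_capacity, spot_request_capacity, scaling_ratio):
    """Rough reading of the scalingRatio semantics quoted above.

    pending               - pending tasks for the workerType
    provisioned_capacity  - capacity of instances already launched (assumed to
                            include "started but not taking jobs yet" instances)
    spot_request_capacity - capacity of spot requests not yet fulfilled
    scaling_ratio         - the workerType's scalingRatio setting
    """
    if scaling_ratio == 0:
        # Ratio 0: keep pending spot-request capacity equal to pending tasks,
        # regardless of how many instances are already booting.
        return max(pending - spot_request_capacity, 0)

    # Ratio > 0: keep pending tasks at roughly ratio * provisioned capacity,
    # i.e. aim for a total capacity of about pending / ratio.
    target_capacity = pending / scaling_ratio
    return max(round(target_capacity - provisioned_capacity - spot_request_capacity), 0)


# With 5 pending tasks and 50 instances already launched (many of them
# perhaps still in the invisible startup phase):
print(instances_to_request(5, 50, 0, scaling_ratio=0))    # 5 - one spot request per pending task
print(instances_to_request(5, 50, 0, scaling_ratio=0.2))  # 0 - target capacity 25, already exceeded
```

This also shows the cold-start downside mentioned above: with zero provisioned capacity and the same 5 pending tasks, a ratio of 0.2 would ask for 25 instances' worth of capacity at once.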
I did mention this in the generic-provisioner RFC (https://github.com/taskcluster/taskcluster-rfcs/issues/31) as it may be useful to model this phase in future provisioner designs.
Depends on: 1378381
Depends on: 1378383
(In reply to Pete Moore [:pmoore][:pete] from comment #0)

> 2) Wherever possible, instance initialisation should occur at AMI creation
> time, rather than instance startup. This would mean time-consuming tasks
> could be performed one-time when a worker type is updated, rather than when
> any instance starts up.

bug 1378383

> 3) Currently we have a time-consuming (25 mins or so) overhead of formatting
> the Z: drive between tasks (including before the first task). This is
> performed to overcome the performance hit we have for copy-on-read semantics
> of accessing the EBS volume that backs the Z: drive. However, since we do
> not need to initialise the drive with data, we might be able to either
> dynamically create the EBS backed volume, or instead use a local instance
> store volume for the Z: drive.

bug 1378381
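For bug 1378381, the kind of change being considered is roughly the following. This is illustrative only: the device name, volume size, and the exact shape of the workerType launch spec are assumptions, though the BlockDeviceMappings fields themselves are standard EC2 RunInstances parameters.

```python
# EBS-backed Z: volume, roughly as described in comment 0: either pay the
# copy-on-read penalty, or spend ~25 minutes formatting it before use.
ebs_backed_z = {
    "DeviceName": "xvdb",
    "Ebs": {"VolumeSize": 120, "VolumeType": "gp2", "DeleteOnTermination": True},
}

# Instance-store alternative: the ephemeral volume ships with the instance
# type, so there is no EBS copy-on-read penalty; the startup scripts would
# still need to format it and assign the Z: drive letter.
instance_store_z = {
    "DeviceName": "xvdb",
    "VirtualName": "ephemeral0",
}
```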
Blocks: 1377332
No longer blocks: 1377332
Found in triage. This is still very much a valid bug. Pete will be working with relops this quarter to push as much setup as possible back into the initial AMI creation (bug 1378383).
Pete: has any of this been addressed by moving to generic-worker10 and/or OCC improvements?
Flags: needinfo?(pmoore)
(In reply to Chris Cooper [:coop] pronoun: he from comment #7)

> Pete: has any of this been addressed by moving to generic-worker10 and/or
> OCC improvements?

No. Bug 1378383 is still a problem, but the main problem described in this bug will be solved by having a Postgres-backed queue.
Depends on: 1436478
Flags: needinfo?(pmoore)
I've closed bug 1378383 after reading the comment above, as it's not clear to me what problem is being described by that bug.
Component: Integration → Services

Duping this forward to bug 1585644

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → DUPLICATE