Closed Bug 1294919 Opened 9 years ago Closed 7 years ago

Workers are too eager to self-terminate, create long gaps until provisioner respawns them

Categories: Taskcluster :: Workers (defect)
Type: defect
Priority: Not set
Severity: normal

Tracking: Not tracked

Status: RESOLVED WORKSFORME

People: Reporter: gps; Assignee: Unassigned

Details

TC workers, unlike The Terminator, are programmed to self-terminate. The code for this lives at https://github.com/taskcluster/docker-worker/blob/master/lib/shutdown_manager.js. Basically, if a worker is idle, there is no more work to do, and a new AWS billing cycle is coming up, the worker terminates itself so as not to waste money sitting around idle. The workers are very good at not sitting around idle and wasting money. I dare say too good.

There are a number of problems with eagerly self-terminating:

1) The termination of the worker may bring the running capacity below the capacity configured for the worker type in the provisioner. In the worst case, the total worker count goes to 0 and the provisioner needs to spawn a new worker. If the end result is no different, self-terminating just creates more overall work and introduces a period of lower capacity between when the worker terminates and when the provisioner creates a replacement.

2) Terminating a worker loses the caches associated with it. When new workers come online, those caches are empty and must be recreated. Jobs that touch source control must clone a Firefox repo and perform a checkout (2+ minutes). Firefox builds lose the objdir (possibly losing 10+ minutes of CPU time). Tooltool downloads are lost. Docker images are lost too (downloading and importing a 2+ GB Docker image can add 5+ minutes). A large number of workers doing Firefox tasks require 10+ minutes after startup to get into a state where they can run a task. This overhead is avoided entirely if a task goes to an existing worker.

3) The provisioner is slow to react. If new workers spawned into existence as soon as existing workers terminated, things wouldn't be so bad. Unfortunately, it can often take 20+ minutes for new workers to be created. Earlier today, I increased the capacity of the desktop-test worker pool from 2500 to 4000 so we could more quickly process a backlog of 10,000 test tasks. It took nearly 1 hour for 1500 new workers to come online. This is unacceptable.

4) The cost savings optimization is premature in many cases. The m1.medium instances we run cost $0.01-$0.03/hr at spot prices, so we're terminating a worker to save a penny. Then we often have to spend 10+ minutes repopulating cached worker state to get back to where we were. The cost savings often just aren't worth it.

5) Worker pool capacity is highly volatile. We had a prolonged tree closure today due to tasks not running. I increased the desktop-test worker capacity from 2500 to 4000 to churn through the backlog quicker so we could open the trees sooner. The plan worked. When the test backlog reached ~1000, I asked the sheriffs to reopen the trees. A deluge of 25+ pushes landed on the autoland repo in rapid succession. By the time the build tasks for those pushes completed, the desktop-test worker pool had already decreased in size from 4000 to ~2000. The pool was then under-provisioned to handle the load and a test backlog developed. It took a while for the provisioner to kick in. Tasks were running slowly on new workers because they had empty caches. And the backlog quickly increased to 5k+. In other words, the worker pool size is a rollercoaster.

The fundamental problem with worker self-termination, as I see it, is that it reacts too quickly to the absence of pending work. This violates an important principle we wish Firefox automation to follow: machines should wait on humans - humans should never have to wait on machines.
This statement reflects the reality that a machine literally costs orders of magnitude less than a human ($0.01/hr versus $100.00/hr) and we don't want to be wasting people's time.

I propose a simple fix for worker self-termination that I believe will help with the above problems: the ability to define a minimum time that must elapse between the last completed task and self-termination. If a worker doesn't terminate until N minutes after its last completed task, workers don't disappear so quickly. This means worker capacity remains high, even during temporary drops in available work. Put another way, machines sit around waiting on humans. If a deluge of new tasks arrives, workers start processing them immediately, on populated caches. If the number of tasks ramps up, the worker pool ramps up with it. As the number of tasks ramps down, the worker pool decreases in due time (lagging a bit).

Yes, workers would remain idle for longer on average. However, at a cost of $0.01/hr, even 1000 idle workers is only $10/hr. That's still only 10% of what an engineer costs. Obviously the costs for some worker types are higher. Our builders use c4.4xlarge instances, which run $0.20-$0.30/hr. But we tend to have fewer than 100 of them, so a tax of $20-$30/hr doesn't seem so bad. Still, we can and should support a configurable minimum time before termination for each worker type so we can aggressively terminate worker types whose costs could spiral.

One potential problem with a minimum time to termination is unwanted over-capacity. Depending on the order in which workers get tasks, it could be possible for just enough tasks to be scheduled that workers stick around even though their capacity isn't needed. I think you really want a system where the workers that have been idle the least have the highest priority for taking an available job. That way, workers have the opportunity to expire if there is excess worker capacity. Otherwise, if the most idle worker or a random worker takes available tasks, a worker may never expire. This problem can be mitigated by keeping the minimum time to expiration low - say 5 or 10 minutes. We may also want to throw "jitter" into the minimum time to expiration: that way, if a bunch of similar tasks all complete around the same time, we don't have large pools of workers self-terminating simultaneously. Again, smoothing over changes in worker capacity is paramount.
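To make the proposal concrete, here is a minimal sketch of the idle-timeout idea in Node.js-flavored JavaScript (docker-worker is a Node project). All of the names here (shouldSelfTerminate, minIdleSeconds, idleJitterSeconds, lastTaskCompleted, and so on) are hypothetical illustrations, not the actual shutdown_manager.js configuration or API:

  // Hypothetical sketch - not the real shutdown_manager.js API.
  // An idle worker only self-terminates once a configurable minimum idle
  // period (plus per-worker random jitter) has passed since its last
  // completed task, so large pools don't all expire at the same instant.
  function shouldSelfTerminate(config, state, now = Date.now()) {
    const idleMs = now - state.lastTaskCompleted;  // time since the last task finished
    const thresholdMs = (config.minIdleSeconds + state.jitterSeconds) * 1000;
    return state.pendingTasks === 0 && idleMs >= thresholdMs;
  }

  // Jitter is picked once at worker startup, so every worker gets a slightly
  // different expiration threshold.
  function pickJitter(config) {
    return Math.random() * config.idleJitterSeconds;
  }

  // Example usage: 10 minute minimum plus up to 2 minutes of jitter.
  const config = { minIdleSeconds: 600, idleJitterSeconds: 120 };
  const state = {
    pendingTasks: 0,
    lastTaskCompleted: Date.now() - 15 * 60 * 1000,  // last task finished 15 minutes ago
    jitterSeconds: pickJitter(config),
  };
  shouldSelfTerminate(config, state);  // true: idle longer than even the maximum jittered threshold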
Alternatively, if workers checked with the provisioner before they self-terminated, we'd have a host of other mechanisms available to control worker capacity. A first- or second-order low-pass filter on the worker pool size would probably be a good start. But that's probably a significant architectural change.
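For illustration only, a first-order low-pass filter over the desired pool size could look something like the sketch below. It assumes a hypothetical provisioner-side loop that periodically recomputes a target capacity from instantaneous demand; none of these names exist in the actual provisioner:

  // Hypothetical provisioner-side smoothing - not actual provisioner code.
  // First-order low-pass filter: the target pool size moves a fraction
  // `alpha` of the way toward the instantaneous demand on every tick, so
  // short spikes and dips in pending work don't whipsaw the pool size.
  function smoothTargetCapacity(previousTarget, instantDemand, alpha = 0.2) {
    return previousTarget + alpha * (instantDemand - previousTarget);
  }

  // Example: demand drops from 4000 to 1000 pending-task-equivalents.
  // The smoothed target decays gradually instead of collapsing at once.
  let target = 4000;
  for (let tick = 0; tick < 5; tick++) {
    target = smoothTargetCapacity(target, 1000);
  }
  // target is now ~1983 after 5 ticks, rather than 1000 immediately.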
A combination of points #1 and #3 is happening right now with the gecko-decision worker. This worker type is configured such that 1 worker should always be running. This is because the gecko-decision task is the task that determines the task graph for pushes. It is effectively a common root node of the task graph, so it needs to complete ASAP or it delays all subsequent tasks.

We currently have 0 gecko-decision workers provisioned (the active workers likely self-terminated during a lull). There are 17 pending tasks for this worker type. I reckon the provisioner is busy provisioning desktop-test instances, which are backlogged and *still* haven't returned to their capacity of 4000 after workers mass self-terminated in the lull between the trees reopening and builds unlocking a flood of test tasks. At the time I wrote this, we have a gecko-decision task on autoland that has been pending for 110 minutes (https://tools.taskcluster.net/task-inspector/#alk_zWp7RQSpAhIcfSuz9w/0). Presumably the provisioner first became aware of the need for a worker that long ago. We should never have reached 0 capacity. And the provisioner should not take this long to create a worker, especially when the worker pool is empty or the worker type is important. Even 1 gecko-decision worker would have churned through the backlog of tasks by now.
Another idea to consider: assigning a worker a probability of not terminating on every billing cycle when it otherwise would. This might work a bit better in our world where workers don't coordinate with a central provisioner on whether to terminate.
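Purely as an illustration of that probabilistic idea (the survival probability here is a made-up number, not anything agreed upon):

  // Hypothetical sketch of probabilistic survival at billing-cycle boundaries.
  // Each cycle, a worker that would otherwise terminate flips a biased coin;
  // with probability `surviveProbability` it stays alive for another cycle,
  // so an idle pool drains gradually instead of vanishing all at once.
  function shouldSurviveAnotherCycle(surviveProbability = 0.5) {
    return Math.random() < surviveProbability;
  }

  // With surviveProbability = 0.5, roughly half of the idle workers stick
  // around after each billing cycle: 1000 -> ~500 -> ~250 -> ...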
One observation here is that the difference between 0 instances and 1 is infinitely larger than the difference between 1 instance and 2, which in turn is larger than the difference between 2 and 3, and so on. I feel like we could invent a fairly simple priority metric that follows that pattern.
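One metric with exactly that shape, offered as a hypothetical illustration: rank worker types by the reciprocal of their current running count, so an empty pool has unbounded priority and each additional instance matters less than the last.

  // Hypothetical provisioning-priority metric. The marginal value of the
  // next instance is 1 / currentCount: infinite when the pool is empty,
  // 1 when one instance is running, 1/2 at two, and so on.
  function provisioningPriority(currentCount) {
    return currentCount === 0 ? Infinity : 1 / currentCount;
  }

  // Example: an empty gecko-decision pool outranks a desktop-test pool that
  // is merely below its configured capacity.
  provisioningPriority(0);     // Infinity (spawn this first)
  provisioningPriority(2000);  // 0.0005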
I believe the agreed-upon first step here is to at least keep a machine around for two billing cycles. Work will be done in bug 1295184.
I think extending the machine for another billing cycle is an excellent first step and will go a long way towards mitigating capacity issues. We may still have problems when there are long gaps in load (such as during extended tree closures), but this should definitely help smooth things over during normal operation.
Does this still apply to how things are done today?
Flags: needinfo?(gps)
QA Contact: pmoore
A lot has changed in 2 years. Some points in comment #0 are still valid. Others aren't. Anyway, the proposal seems to have been to terminate workers after N seconds of idle time. This was actually implemented in bug 1424376 as part of moving away from hourly billing assumptions. So it appears there is nothing more to do here. I do think there is value in more cooperation between worker lifetimes and the provisioner, but that can be tracked elsewhere.
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(gps)
Resolution: --- → WORKSFORME
Component: Worker → Workers