Closed Bug 1303839 Opened 8 years ago Closed 7 years ago

Evaluate worker idle times and adjust knobs to minimize it

Categories

(Taskcluster :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: dustin, Assigned: jonasfj)

..bug for a work in progress.

Jonas made up a nice dashboard showing busy and idle time on workers: https://app.signalfx.com/#/dashboard/Csf_QKhAcBU and has been experimenting with scalingFactor and its impact on those numbers. This bug is to track those efforts.

Greg and I had a few thoughts on the topic this morning:

1. It should be pretty simple to test this in simulations. Capture all of the task creation events, and the task durations, then simulate the workers and provisioner and whatnot. This would allow enough experimentation to come up with some rough estimates of the effects of parameters like scalingFactor on task latency and efficiency (and other measures of import) without using our production workload as the experimental subject.

2. Not all of that "idle" time is costing us money. If a worker starts up and runs a job, then the worker is essentially "free" until the end of its billing period, and we should probably track that separately. It's only the second hour of idle time that we're paying for. Observation: there is exactly one such hour for each spot instance we start (ignoring AWS-induced terminations).
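As a rough illustration of idea 1, here is a minimal time-stepped sketch of such a simulation. Everything in it is an assumption made for illustration: the `simulate` function, the `boot_time`, `idle_timeout` and `step` parameters, and in particular the reading of `scaling_factor` as "workers requested per pending task", which may not match what the real provisioner does with scalingFactor. Times are only resolved to the nearest `step`.

```python
import math
from collections import deque


def simulate(tasks, scaling_factor, boot_time=300, idle_timeout=600, step=60):
    """Replay (created_at, duration) pairs, in seconds, against a toy provisioner.

    Returns (mean_latency_seconds, total_idle_seconds).
    """
    if scaling_factor <= 0:
        raise ValueError("scaling_factor must be positive")
    incoming = deque(sorted(tasks))
    queue = deque()   # tasks created but not yet claimed by a worker
    workers = []      # dicts: free_at (boot or task end time), booting flag
    latencies = []
    idle_seconds = 0
    now = 0
    while incoming or queue or workers:
        # Move newly created tasks into the pending queue.
        while incoming and incoming[0][0] <= now:
            queue.append(incoming.popleft())
        # Free workers claim pending tasks in FIFO order.
        for w in workers:
            if not queue:
                break
            if w["free_at"] <= now:
                created, duration = queue.popleft()
                latencies.append(now - created)
                w["free_at"] = now + duration
                w["booting"] = False
        # Toy provisioner (assumed policy): keep ceil(pending * scaling_factor)
        # workers booting while tasks are still pending.
        booting = sum(1 for w in workers if w["booting"] and w["free_at"] > now)
        wanted = math.ceil(len(queue) * scaling_factor)
        for _ in range(max(0, wanted - booting)):
            workers.append({"free_at": now + boot_time, "booting": True})
        # Idle workers accrue idle time and shut down after idle_timeout.
        survivors = []
        for w in workers:
            if w["free_at"] <= now:
                if now - w["free_at"] >= idle_timeout:
                    continue  # worker terminates
                idle_seconds += step
            survivors.append(w)
        workers = survivors
        now += step
    mean_latency = sum(latencies) / len(latencies) if latencies else 0.0
    return mean_latency, idle_seconds
```

Feeding it captured task creation times and durations, then sweeping `scaling_factor`, would give the kind of latency versus idle-time trade-off curve described above, for example `simulate(captured_tasks, 0.5)` against `simulate(captured_tasks, 2)`, where `captured_tasks` is a hypothetical list of `(created_at, duration)` pairs.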
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #0)
> 1. It should be pretty simple to test this in simulations. Capture all of the task creation events, and the task durations, then simulate the workers and provisioner and whatnot. This would allow enough experimentation to come up with some rough estimates of the effects of parameters like scalingFactor on task latency and efficiency (and other measures of import) without using our production workload as the experimental subject.
>
> 2. Not all of that "idle" time is costing us money. If a worker starts up and runs a job, then the worker is essentially "free" until the end of its billing period, and we should probably track that separately. It's only the second hour of idle time that we're paying for. Observation: there is exactly one such hour for each spot instance we start (ignoring AWS-induced terminations).

Bug 1424376 dealt with #2, i.e. the switch to per-second billing, but #1 is about being more predictive and responding better to load events.

How aggressively do we want to pursue #1? Should I leave this open for our eventual team switch to an efficiency focus after the redeployability work is done?
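To make the accounting in #2 concrete, the sketch below splits a worker's idle time into idle seconds that fall in a billing period where the worker also did some work (already paid for, so effectively free) and idle seconds in periods with no work at all. This is one possible reading of the comment, not an agreed metric; the `split_idle` name, its arguments, and the assumption that busy intervals do not overlap are all illustrative.

```python
def split_idle(start, end, busy, period=3600):
    """Split a worker's idle seconds by billing period.

    start, end: worker lifetime in seconds.
    busy: non-overlapping (begin, finish) intervals spent running tasks.
    Returns (sunk_idle, paid_idle): idle time in periods that contained
    some work, and idle time in periods that contained none.
    """
    sunk = paid = 0
    t = start
    while t < end:
        p_end = min(t + period, end)
        # Seconds of work that overlap this billing period.
        worked = sum(max(0, min(f, p_end) - max(b, t)) for b, f in busy)
        idle = (p_end - t) - worked
        if worked > 0:
            sunk += idle
        else:
            paid += idle
        t += period
    return sunk, paid
```

For a worker that lives two hours and runs one 20-minute task in the first hour, `split_idle(0, 7200, [(300, 1500)])` returns `(2400, 3600)`: 40 idle minutes ride along with the paid first hour, and the whole second hour is idle time paid for its own sake. With `period=1`, a stand-in for per-second billing, the same worker yields `(0, 6000)`, so under this model every idle second is paid for.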
Flags: needinfo?(jopsen)
I don't think this particular bug will be useful at that time. But it's something that's been discussed on and off for a while now -- it's a great thought exercise! If we do come up with something, it should start as an RFC.
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(jopsen)
Resolution: --- → INCOMPLETE
There are a *ton* of future improvements on the table.

Bug 1424376 established a static idle timeout for all workers. We probably want that to be configurable per worker. And we may want to make the idle timeout a range and have the worker choose a random point in that range.

We also don't take not-ready tasks into account. A little predictive analysis could pre-spawn workers so they are ready to accept tasks once those tasks become unblocked.

Then there's another large topic around workers taking the "best" task available, e.g. running tasks on workers that have a populated cache.

Still a lot to consider here. But it is the territory of an RFC, not a single bug. It's a complex topic.
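The "idle timeout as a range" idea could look something like the sketch below, where each worker draws its own timeout once at startup so that a batch of workers started together does not all shut down at the same moment. The function name, the parameters and the choice of a uniform distribution are assumptions for illustration; nothing here reflects an existing worker configuration option.

```python
import random


def pick_idle_timeout(min_seconds, max_seconds, rng=random):
    """Return a per-worker idle timeout drawn uniformly from the given range."""
    if min_seconds > max_seconds:
        raise ValueError("min_seconds must not exceed max_seconds")
    return rng.uniform(min_seconds, max_seconds)


def should_shut_down(idle_seconds, idle_timeout):
    """True once the worker has been idle for at least its chosen timeout."""
    return idle_seconds >= idle_timeout
```

A worker would call `pick_idle_timeout(300, 900)` once, keep the result, and check `should_shut_down` whenever it finds no task to claim. Passing a seeded `random.Random` instance as `rng` makes the draw reproducible in simulations.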