Support prototyping external scheduling algorithms (aka optimizers)
Categories
(Firefox Build System :: Task Configuration, task)
Tracking
(firefox69 fixed)
Tracking | Status
---|---
firefox69 | fixed
People
(Reporter: ahal, Assigned: ahal)
Details
Attachments
(1 file, 2 obsolete files)
Problem Statement
Our current scheduling algorithm, SETA, has served us well over the years, but it is very naive, and we believe we can improve the ratio of regressions caught to hours spent running tasks. We believe things like code coverage, inspecting task queues, and even simple tweaks to SETA's parameters can help.
The problem is that we have no way of testing our theories. The best we can do is:
- Gather a bunch of data with the current algorithm.
- Make a change to the algorithm.
- Wait a month or two.
- Gather a different bunch of data.
- Compare how each did.
This is problematic because:
A. The turnaround time for testing changes is months.
B. We are using different data sets to compare, so the comparison might not be fair.
We need an established way to test novel optimizers.
Possible Solution #1 - Shadow Scheduling
This is the simplest method. We could create one or more dummy decision tasks that run on every push (but don't schedule anything). These would emit artifacts listing which tasks would have been scheduled, had the associated scheduling algorithm been used. We could then download these artifacts and compare them to what actually got scheduled.
The downside here is that this only fixes problem B. We would still have month-plus turnaround times for testing out new algorithms. It also introduces a new problem:
C. If an algorithm is tweaked, any data that spans before+after that change will be invalid. If we have multiple algorithms being updated at various times, it might become easy to lose track of which ranges are valid for which algorithms.
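As a rough illustration of the comparison step (this is not part of any actual patch; all names and the scoring metric are made up for the example), a shadow scheduler's emitted artifact could be scored against reality like so:

```python
# Illustrative only: score a shadow scheduler's emitted task list against
# what actually ran on a push. All names here are hypothetical.

def score_shadow_run(shadow_tasks, actual_tasks, failed_tasks):
    """Return (fraction of real failures the shadow set would have caught,
    number of tasks saved relative to the real schedule)."""
    shadow, actual, failed = set(shadow_tasks), set(actual_tasks), set(failed_tasks)
    caught = shadow & failed
    ratio = len(caught) / len(failed) if failed else 1.0
    saved = len(actual) - len(shadow)
    return ratio, saved

# Example: the shadow scheduler ran one fewer task, catching 1 of 2 failures.
ratio, saved = score_shadow_run(
    shadow_tasks=["test-linux/opt-a", "test-win/opt-b"],
    actual_tasks=["test-linux/opt-a", "test-win/opt-b", "test-mac/opt-c"],
    failed_tasks=["test-linux/opt-a", "test-mac/opt-c"],
)
# ratio == 0.5, saved == 1
```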
Possible Solution #2 - Retroactive Analysis
I think the better approach is to build a mechanism that lets us plug in "external" algorithms and see how they would have performed on past data. To avoid bit-rot, these "test" algorithms would need to live outside of the tree, but still be able to get "plugged in" to the taskgraph.
Running an analysis this way would be a bit more complicated, but I think it will be worth it. I envision creating a script which:
- Accepts one or more test algorithms as input.
- Uses ActiveData to get push information (like parent revision).
- Physically updates a clone of mozilla-central to each push head in the desired range.
- Runs `mach taskgraph` (with the experimental schedulers applied) to get the list of tasks that would be scheduled.
- Does the comparison.
Such an analysis might take a few hours to complete (depending on the range), but it could be run as an on-demand, try-only task only when necessary. This would give us a turnaround of ~half a day for testing new scheduling hypotheses.
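To make those steps concrete, here is a rough sketch of what such a script could look like. This is not the actual patch: the push dict shape, the `--strategies` invocation, and the helper names are assumptions for illustration.

```python
# Rough sketch of the retroactive analysis loop; not the real patch.
# The push data shape and helper names are assumptions.
import subprocess

def diff_schedules(experimental, actual):
    """Pure comparison step: what differs between the two task-label sets."""
    experimental, actual = set(experimental), set(actual)
    return {
        "only_experimental": sorted(experimental - actual),
        "only_actual": sorted(actual - experimental),
        "common": sorted(experimental & actual),
    }

def analyze_push(repo_path, push, strategies_spec):
    """Update the clone to a push head, regenerate the graph, and compare.

    `push` is assumed to look like {"rev": "...", "scheduled": [...]},
    with the scheduled labels coming from ActiveData.
    """
    subprocess.run(["hg", "update", "-r", push["rev"]], cwd=repo_path, check=True)
    out = subprocess.run(
        ["./mach", "taskgraph", "optimized", "--strategies", strategies_spec],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    experimental = out.stdout.splitlines()
    return diff_schedules(experimental, push["scheduled"])
```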
I already have a patch that allows me to support option #2 (which I'll upload shortly). This comment is meant to provide some context around why I'm proposing it.
Comment 1•5 years ago
I think in order for 2 to be effective, we also need easy ways (maybe more recipes?) to get past data from ActiveData. Some algorithms may use data from past failures and code changes in order to schedule tasks.
Assignee
Comment 2•5 years ago
Assignee
Comment 3•5 years ago
Depends on D32853
Assignee
Comment 4•5 years ago
(In reply to Marco Castelluccio [:marco] from comment #1)
> I think in order for 2 to be effective, we also need easy ways (maybe more recipes?) to get past data from ActiveData. Some algorithms may use data from past failures and code changes in order to schedule tasks.
Yep, I have a prototype that (mostly) solves this. See:
https://github.com/ahal/ci-recipes/blob/master/recipes/push_health.py
My plan is to set up a Redis server that caches the AD results so we don't need to hit it over and over again for every single push.
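For illustration, that caching layer could be as thin as something like the following. A plain dict stands in for Redis here (a real Redis client would give shared state between runs and TTLs); the key scheme and the fetcher signature are assumptions, not part of the prototype.

```python
# Illustrative cache for ActiveData query results, keyed by query name and
# push revision. A dict stands in for Redis; key scheme and fetcher
# signature are assumptions.
import json

class QueryCache:
    def __init__(self, store=None):
        # `store` only needs dict-style get/set; a thin Redis wrapper
        # exposing the same interface would slot in here.
        self.store = store if store is not None else {}

    def get_or_fetch(self, query, rev, fetch):
        key = f"activedata:{query}:{rev}"
        cached = self.store.get(key)
        if cached is not None:
            return json.loads(cached)
        result = fetch(query, rev)  # the slow ActiveData round-trip
        self.store[key] = json.dumps(result)
        return result
```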
Assignee
Comment 5•5 years ago
Hey Dustin, Tom.
I've attached my attempt at option #2. I kind of expect you won't be thrilled about it (see comment 0 for context). I could potentially accomplish this via monkeypatching, but would prefer something more properly supported.
Feedback greatly appreciated, I'm open to any and all suggestions!
Assignee
Comment 6•5 years ago
With the above patches, here's how I would test out my custom scheduling algorithm:
- Save the following file somewhere on `$PYTHONPATH`:

custom_scheduler.py

```python
import random

from taskgraph.optimize.base import Either, OptimizationStrategy
from taskgraph.optimize.strategies import (
    IndexSearch,
    SETA,
    SkipUnlessChanged,
    SkipUnlessSchedules,
)


class RandomOptimizer(OptimizationStrategy):
    probability = 0.5  # probability a task is optimized away

    def should_remove_task(self, task, params, _):
        if random.random() < self.probability:
            return True
        return False


CUSTOM_STRATEGIES = {
    'never': OptimizationStrategy(),  # "never" is the default behavior
    'index-search': IndexSearch(),
    'seta': RandomOptimizer(),
    'skip-unless-changed': SkipUnlessChanged(),
    'skip-unless-schedules': SkipUnlessSchedules(),
    'skip-unless-schedules-or-seta': Either(SkipUnlessSchedules(), RandomOptimizer()),
}
```

- Run:

```shell
$ ./mach taskgraph optimized --strategies "custom_scheduler:CUSTOM_STRATEGIES"
```
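For reference, a `"module:attribute"` spec like the one above can be resolved in a few lines with `importlib`. This is a sketch of the general pattern, not necessarily how the actual `--strategies` patch implements it:

```python
# Sketch of resolving a "module:attribute" spec; the real --strategies
# implementation may differ.
import importlib

def load_strategies(spec):
    module_name, _, attr = spec.partition(":")
    module = importlib.import_module(module_name)  # must be importable, e.g. on PYTHONPATH
    return getattr(module, attr)
```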
Comment 7•5 years ago
It looks like this is mainly intended for testing replacements for SETA: `skip-unless-changed` isn't really a name that suggests multiple competing implementations.
And SETA brings its own testing challenges, as it has its own external state that isn't included in Parameters -- it might give different answers today than it did a month ago (at least, that's my understanding). I'm not sure how to solve that except by "snapshotting" the SETA data.
Anyway, an alternative place to "splice" the alternatives in would be in `taskcluster/taskgraph/util/seta.py`, perhaps after applying an extra layer of abstraction by renaming the `SETA`/`"seta"` optimization strategy to `only-if-important` or something similarly generic, and likewise `seta.py` -> `importance.py`. Then `importance.py` can have a way to apply a variety of approaches to defining "importance".
For after-the-fact analysis, it's probably easier to just feed an environment variable in, as it avoids threading a new parameter and command-line arg through a bunch of files. Something like:

`FORCE_IMPORTANCE_APPROACH=random ./mach taskgraph optimized`
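A minimal sketch of that environment-variable approach might look like this. The variable name comes from the comment above, but the registry shape and default are illustrative assumptions:

```python
# Sketch of env-var strategy selection; the registry and default name
# are illustrative, only FORCE_IMPORTANCE_APPROACH comes from the comment.
import os

def pick_importance_approach(approaches, default="seta"):
    """Select an 'importance' implementation, overridable via the env."""
    name = os.environ.get("FORCE_IMPORTANCE_APPROACH", default)
    if name not in approaches:
        raise KeyError(f"unknown importance approach: {name}")
    return approaches[name]
```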
Assignee
Comment 8•5 years ago
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #7)
> It looks like this is mainly intended for testing replacements for SETA: `skip-unless-changed` isn't really a name that suggests multiple competing implementations.
Yes, one needs to shadow the same strategy names to avoid needing to update all the configs/transforms. Maybe we can come up with some more generic labels for these, though it's not strictly necessary.
> And SETA brings its own testing challenges, as it has its own external state that isn't included in Parameters -- it might give different answers today than it did a month ago (at least, that's my understanding). I'm not sure how to solve that except by "snapshotting" the SETA data.
The SETA results are already frozen in the `task-labels.json` artifact on every decision task. Though in general, you are correct that any non-deterministic algorithm will have this problem (think of a scheduler that inspects task queues). Using a shadow scheduler would solve this; maybe there's room for both approaches in the future. For the immediate case, code coverage should be deterministic though.
> Anyway, an alternative place to "splice" the alternatives in would be in `taskcluster/taskgraph/util/seta.py`, perhaps after applying an extra layer of abstraction by renaming the `SETA`/`"seta"` optimization strategy to `only-if-important` or something similarly generic, and likewise `seta.py` -> `importance.py`. Then `importance.py` can have a way to apply a variety of approaches to defining "importance".
SETA is definitely the initial focus, but I'd prefer to prove this out a bit more generically. For example, a question that might come up once we have a robust code coverage scheduler is whether `skip-unless-schedules` is still necessary or not. I think we'll want to be in the habit of continually testing our scheduling to see what performs best going into the future. I.e., this isn't intended to be a one-off analysis with the only goal of replacing SETA.
> For after-the-fact analysis, it's probably easier to just feed an environment variable in, as it avoids threading a new parameter and command-line arg through a bunch of files. Something like:
> `FORCE_IMPORTANCE_APPROACH=random ./mach taskgraph optimized`
Yes, this is simpler. Good idea!
Comment 9•5 years ago
I don't see a `task-labels.json` in https://tools.taskcluster.net/groups/YJnsHDOhQWGiav8GfGXIcA/tasks/YJnsHDOhQWGiav8GfGXIcA/runs/0/artifacts, for example. But it's good to know SETA data is captured somewhere.
Assignee
Comment 10•5 years ago
This opens the door to swapping in external strategies at runtime. This will be used for backtesting experimental strategies.
Assignee
Comment 11•5 years ago
We can handle feedback in review moving forward.
Comment 12•5 years ago
Pushed by ahalberstadt@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/3a5ec7efe864 [taskgraph] Add ability to pass external optimize strategies via env, r=tomprince
Comment 13•5 years ago
bugherder