Our current scheduling algorithm, SETA, has served us well over the years, but it is very naive, and we believe we can improve the ratio of regressions caught to hours spent running tasks. Things like code coverage, inspecting task queues, and even simple tweaks to SETA's parameters could all help.
The problem is that we have no way of testing our theories. The best we can do is:
- Gather a bunch of data with the current algorithm.
- Make a change to the algorithm.
- Wait a month or two.
- Gather a different bunch of data.
- Compare how each did.
This is problematic because:
A. The turnaround time for testing changes is months.
B. We are comparing two different data sets, so the comparison might not be fair.
We need an established way to test novel optimizers.
Possible Solution #1 - Shadow Scheduling
This is the simplest method. We could create one or more dummy decision tasks that run on every push but don't actually schedule anything. These would emit artifacts listing which tasks would have been scheduled had the associated scheduling algorithm been used. We could then download those artifacts and compare them to what actually got scheduled.
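To make that comparison concrete, here is a minimal sketch of what the analysis step could look like. It assumes a hypothetical artifact name (public/shadow-scheduler.json), a hypothetical JSON shape ({"tasks": [...]}), and a placeholder queue URL; none of these exist yet, they are just illustration.

# Rough sketch of comparing a shadow scheduler's artifact to what
# actually ran. Artifact name, URL layout, and JSON shape are
# assumptions for illustration, not an existing API.
import json
import urllib.request

QUEUE = "https://queue.example.com/v1"  # placeholder queue root

def fetch_json(url):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def shadow_vs_actual(shadow_task_id, actual_task_labels):
    """Compare what the shadow scheduler *would* have run to what ran."""
    # Hypothetical artifact emitted by the dummy decision task.
    artifact_url = f"{QUEUE}/task/{shadow_task_id}/artifacts/public/shadow-scheduler.json"
    shadow_labels = set(fetch_json(artifact_url)["tasks"])
    actual_labels = set(actual_task_labels)
    return {
        "only_shadow": sorted(shadow_labels - actual_labels),
        "only_actual": sorted(actual_labels - shadow_labels),
        "common": len(shadow_labels & actual_labels),
    }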
The downside here is that this only fixes problem B. We would still have month-plus turnaround times for testing out new algorithms. It also introduces a new problem:
C. If an algorithm is tweaked, any data that spans the before and after of that change will be invalid. If we have multiple algorithms being updated at various times, it would be easy to lose track of which ranges are valid for which algorithms.
Possible Solution #2 - Retroactive Analysis
I think the better approach is to build a mechanism that lets us plug in "external" algorithms and see how they would have performed on past data. To avoid bit-rot, these "test" algorithms would need to live outside of the tree, but still be able to get "plugged in" to the taskgraph.
Running an analysis this way would be a bit more complicated, but I think it would be worth it. I envision creating a script (sketched below, after this list) which:
- Accepts one or more test algorithms as input.
- Uses ActiveData to get push information (like parent revision).
- Physically updates a clone of mozilla-central to each push head in the desired range.
- Runs |mach taskgraph| (with the experimental schedulers applied) to get the list of tasks that would be scheduled.
- Does the comparison.
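Roughly, the driver could look something like the sketch below. The helper names, the --json output handling, and the TASKGRAPH_OPTIMIZER environment variable used to plug in an external algorithm are assumptions made up for illustration; the real hook is in the patch mentioned at the end.

# Rough sketch of the retroactive-analysis loop. Helper names, the
# optimizer hook, and the output format are illustrative assumptions,
# not the actual patch.
import json
import os
import subprocess

def run(cmd, cwd, env=None):
    return subprocess.run(cmd, cwd=cwd, env=env, check=True,
                          capture_output=True, text=True).stdout

def tasks_for_push(repo, rev, optimizer=None):
    """Update the clone to `rev` and ask taskgraph what it would schedule."""
    run(["hg", "update", "-r", rev], cwd=repo)
    env = dict(os.environ)
    if optimizer:
        # Hypothetical hook for plugging in an external algorithm.
        env["TASKGRAPH_OPTIMIZER"] = optimizer
    out = run(["./mach", "taskgraph", "optimized", "--json"], cwd=repo, env=env)
    return set(json.loads(out).keys())

def analyze(repo, push_revs, optimizer):
    """Compare an experimental optimizer against the default, per push."""
    results = []
    for rev in push_revs:  # push heads, e.g. obtained via ActiveData
        baseline = tasks_for_push(repo, rev)
        candidate = tasks_for_push(repo, rev, optimizer)
        results.append({
            "rev": rev,
            "extra": sorted(candidate - baseline),
            "missing": sorted(baseline - candidate),
        })
    return results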
Such an analysis might take a few hours to complete (depending on the range), but it could be run as an on-demand, try-only task when necessary. This would give us a turnaround of roughly half a day for testing new scheduling hypotheses.
I already have a patch that supports option #2 (which I'll upload shortly). This comment is meant to provide some context for why I'm proposing it.