Closed Bug 1555032 Opened 3 months ago Closed 3 months ago

Support prototyping external scheduling algorithms (aka optimizers)

Categories

(Firefox Build System :: Task Configuration, task)

task
Not set

Tracking

(firefox69 fixed)

RESOLVED FIXED
mozilla69
Tracking Status
firefox69 --- fixed

People

(Reporter: ahal, Assigned: ahal)

Details

Attachments

(1 file, 2 obsolete files)

Problem Statement

Our current scheduling algorithm SETA has served us well over the years, but it is very naive and we believe we can improve the ratio of regressions caught / hours spent running tasks. We believe things like code coverage, inspecting task queues and even simple tweaks to SETA's parameters can help.

The problem is that we have no way of testing our theories. The best we can do is:

  1. Gather a bunch of data with the current algorithm.
  2. Make a change to the algorithm.
  3. Wait a month or two.
  4. Gather a different bunch of data.
  5. Compare how each did.

This is problematic because:

A. The turnaround time for testing changes is months.
B. We are using different data sets to compare, so the comparison might not be fair.

We need an established way to test novel optimizers.

Possible Solution #1 - Shadow Scheduling

This is the simplest method. We could create one or more dummy decision task(s) that run(s) on every push (but doesn't schedule anything). These would emit artifacts that say which tasks would have been scheduled, had the associated scheduling algorithm been used. We can then download these artifacts and compare them to what actually got scheduled.

The downside here is that this only fixes problem B. We would still have month+ long turn-around times for testing out new algorithms. It also introduces a new problem:

C. If an algorithm is tweaked, any data that spans before+after that change will be invalid. If we have multiple algorithms being updated at various times, it might become easy to lose track of which ranges are valid for which algorithms.

Possible Solution #2 - Retroactive Analysis

I think the better approach is to build a mechanism that lets us plug in "external" algorithms and see how they would have performed on past data. To avoid bit-rot, these "test" algorithms would need to live outside of the tree, but still be able to get "plugged in" to the taskgraph.

Running an analysis this way would be a bit more complicated, but I think it will be worth it. I envision creating a script which:

  1. Accepts one or more test algorithms as input.
  2. Uses ActiveData to get push information (like parent revision).
  3. Physically updates a clone of mozilla-central to each push head in the desired range.
  4. Runs |mach taskgraph| (with the experimental schedulers applied) to get list of tasks that would be scheduled.
  5. Does the comparison.

Such an analysis might take a few hours to complete (depending on the range), but it could be run in an on demand try-only task only when necessary. This would give us a turn around of ~half a day for testing new scheduling hypotheses.

I already have a patch that allows me to support option #2 (which I'll upload shortly). This comment is meant to provide some context around why I'm proposing it.

I think in order for 2 to be effective, we also need easy ways (maybe more recipes?) to get past data from ActiveData. Some algorithms may use data from past failures and code changes in order to schedule tasks.

(In reply to Marco Castelluccio [:marco] from comment #1)

I think in order for 2 to be effective, we also need easy ways (maybe more recipes?) to get past data from ActiveData. Some algorithms may use data from past failures and code changes in order to schedule tasks.

Yep, I have a prototype that (mostly) solves this. See:
https://github.com/ahal/ci-recipes/blob/master/recipes/push_health.py

My plan is to set up a Redis server that caches the AD results so we don't need to hit it over and over again for every single push.

Hey Dustin, Tom.

I've attached my attempt at option #2. I kind of expect you to not be thrilled about it (see comment 0 for context). I could potentially accomplish this via monkeypatching.. but would prefer something more properly supported.

Feedback greatly appreciated, I'm open to any and all suggestions!

Flags: needinfo?(mozilla)
Flags: needinfo?(dustin)

With the above patches, here's how I would test out my custom scheduling algorithm:

  1. Save the following file somewhere on $PYTHONPATH:

custom_scheduler.py

import random

from taskgraph.optimize.base import Either, OptimizationStrategy
from taskgraph.optimize.strategies import (
    IndexSearch,
    SETA,
    SkipUnlessChanged,
    SkipUnlessSchedules,
)


class RandomOptimizer(OptimizationStrategy):
    probability = 0.5  # probability task is optimized away

    def should_remove_task(self, task, params, _):
        if random.random() < self.probability:
            return True
        return False


CUSTOM_STRATEGIES = {
    'never': OptimizationStrategy(),  # "never" is the default behavior
    'index-search': IndexSearch(),
    'seta': RandomOptimizer(),
    'skip-unless-changed': SkipUnlessChanged(),
    'skip-unless-schedules': SkipUnlessSchedules(),
    'skip-unless-schedules-or-seta': Either(SkipUnlessSchedules(), RandomOptimizer()),
}
  1. Run:
$ ./mach taskgraph optimized --strategies "custom_scheduler:CUSTOM_STRATEGIES"

It looks like this is mainly intended for testing replacements for SETA: skip-unless-changed isn't really a name that suggests multiple competing implementations.

And SETA brings its own testing challenges, as it has its own external state that isn't included in Parameters -- it might give different answers today than it did a month ago (at least, that's my understanding). I'm not sure how to solve that except by "snapshotting" the SETA data.

Anyway, an alternative place to "splice" the alternatives in would be in taskcluster/taskgraph/util/seta.py, perhaps after applying an extra layer of abstraction by renaming the SETA/"seta" optimization strategy to only-if-important or something similarly generic, and likewise seta.py -> importance.py. Then importance.py can have a way to apply a variety of approaches to defining "importance".

For after-the-fact analysis, it's probably easier to just feed an environment variable in, as it avoids threading through a new parameter and command-line arg through a bunch of files. Something like:

FORCE_IMPORTANCE_APPROACH=random ./mach taskgraph optimized
Flags: needinfo?(dustin)

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #7)

It looks like this is mainly intended for testing replacements for SETA: skip-unless-changed isn't really a name that suggests multiple competing implementations.

Yes, one needs to shadow the same strategy names to avoid needing to update all the configs/transforms. Maybe we can come up with some more generic labels for these, though it's not strictly necessary.

And SETA brings its own testing challenges, as it has its own external state that isn't included in Parameters -- it might give different answers today than it did a month ago (at least, that's my understanding). I'm not sure how to solve that except by "snapshotting" the SETA data.

The SETA results are already frozen in the task-labels.json artifact on every decision task. Though in general, you are correct that any non-deterministic algorithm will have this problem (think about a scheduler that inspects task queues). Using a shadow scheduler would solve this.. maybe there's room for both approaches in the future. For the immediate case, code coverage should be deterministic though.

Anyway, an alternative place to "splice" the alternatives in would be in taskcluster/taskgraph/util/seta.py, perhaps after applying an extra layer of abstraction by renaming the SETA/"seta" optimization strategy to only-if-important or something similarly generic, and likewise seta.py -> importance.py. Then importance.py can have a way to apply a variety of approaches to defining "importance".

SETA is definitely the initial focus, but I'd prefer to prove this out a bit more generically. For example, a question that might come up once we have a robust code coverage scheduler is whether skip-unless-schedules is still necessary or not. I think we'll want to be in a habit of continually testing our scheduling to see what performs best going into the future. I.e this isn't intended to be a one-off analysis with the only goal of replacing SETA.

For after-the-fact analysis, it's probably easier to just feed an environment variable in, as it avoids threading through a new parameter and command-line arg through a bunch of files. Something like:

FORCE_IMPORTANCE_APPROACH=random ./mach taskgraph optimized

Yes, this is simpler. Good idea!

I don't see a task-labels.json in https://tools.taskcluster.net/groups/YJnsHDOhQWGiav8GfGXIcA/tasks/YJnsHDOhQWGiav8GfGXIcA/runs/0/artifacts for example. But it's good to know SETA data is captured somewhere.

This opens the door to swap in external strategies at runtime. This will be
used for back testing experimental strategies.

Attachment #9068079 - Attachment description: Bug 1555032 - [taskgraph] Allow defining optimization strategies in parameters → Bug 1555032 - [taskgraph] Allow defining optimization strategies in an environment variable, r?tomprince
Attachment #9068078 - Attachment is obsolete: true

We can handle feedback in review moving forward.

Flags: needinfo?(mozilla)
Attachment #9068816 - Attachment description: Bug 1555032 - [taskgraph] Allow optimize strategies to be python_path.find_object strings, r?tomprince → Bug 1555032 - [taskgraph] A ability to pass external optimize strategies via env, r?tomprince
Attachment #9068079 - Attachment is obsolete: true
Attachment #9068816 - Attachment description: Bug 1555032 - [taskgraph] A ability to pass external optimize strategies via env, r?tomprince → Bug 1555032 - [taskgraph] Add ability to pass external optimize strategies via env, r?tomprince
Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3a5ec7efe864
[taskgraph] Add ability to pass external optimize strategies via env, r=tomprince
Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69
You need to log in before you can comment on or make changes to this bug.