Open Bug 1608836 Opened 4 years ago Updated 1 year ago

Enable chunking in the taskgraph for all reftest suites

Categories

(Firefox Build System :: Task Configuration, task, P2)

task

Tracking

(Not tracked)

People

(Reporter: ahal, Unassigned)

References

(Depends on 2 open bugs, Blocks 1 open bug)

Details

+++ This bug was initially created as a clone of Bug #1608834 +++

In bug 1583353 we started chunking test suites in the taskgraph (rather than at test runtime), though I only enabled it for a subset of suites.

This bug tracks getting all reftest suites enabled (including crashtest and jsreftest). To enable taskgraph chunking, remove the suite from this config:
https://searchfox.org/mozilla-central/rev/dd1dafd5c9c05640e76af30b58749076e0199704/taskcluster/taskgraph/transforms/tests.py#1280

I normally like to test that we still run the same set of tests before and after. To do this I:

  1. Push to try scheduling all tests in the suite, both with and without this change.
  2. Clone https://github.com/mozilla/ci-recipes and follow installation instructions.
  3. Run poetry run adr compare_pushes --branch try -r1 <rev1> -r2 <rev2>
  4. Wait for it to finish; it will take a while (~10-20 min).
  5. Make sure the output says all tasks are the same. If the output shows one task ran more manifests than another, then we need to investigate.
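
The check in step 5 boils down to a per-task set comparison. A minimal sketch of that idea, assuming the manifests each task ran have already been collected into dicts keyed by task label. The function name, labels and paths below are hypothetical, and this is not how the adr recipe itself is implemented:

```python
def diff_manifests(before, after):
    """Return {label: (only_before, only_after)} for labels whose manifest sets differ."""
    diffs = {}
    for label in sorted(set(before) | set(after)):
        b = set(before.get(label, ()))
        a = set(after.get(label, ()))
        if b != a:
            # Record which manifests ran only in one of the two pushes.
            diffs[label] = (sorted(b - a), sorted(a - b))
    return diffs


# Hypothetical data: the second push lost one manifest in the reftest task.
before = {
    "reftest-task": ["layout/reftests/a/reftest.list", "layout/reftests/b/reftest.list"],
    "crashtest-task": ["dom/base/crashtests/crashtests.list"],
}
after = {
    "reftest-task": ["layout/reftests/a/reftest.list"],
    "crashtest-task": ["dom/base/crashtests/crashtests.list"],
}

for label, (only_before, only_after) in diff_manifests(before, after).items():
    print(label, "only before:", only_before, "only after:", only_after)
```

An empty result means both pushes ran the same manifests; any entry is something to investigate.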

Reftest will be a bit trickier to enable than other suites as it doesn't use manifestparser natively. Some modifications to the taskgraph code will also be required.

Summary: Enable chunking in the taskgraph for all mochitest suites → Enable chunking in the taskgraph for all reftest suites
Assignee: nobody → ahal

I have patches locally to make this work, but many tests are failing.

At least one issue is that under the taskgraph chunking world, we will pass several reftest manifests into the test harness. This contrasts with how things currently work where we only pass the single root manifest at layout/reftests/reftest.list. This root manifest contains several skip-if keys that conditionally include manifests based on the sandbox. With my patches, these skip-ifs are no longer being honoured.

I think we'll need to:

  1. Implement the ability to have manifest DEFAULTS in reftest manifests (similar to manifestparser)
  2. Move all skip-if(*) include statements to this new DEFAULTS section in the included manifest itself
  3. Remove the root layout/reftests/reftest.list manifest (and register all manifests with moz.build like we do other suites)
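
To illustrate what steps 1 and 2 are after: once a condition lives in the included manifest's own defaults, it applies to every test in that manifest no matter which manifests are passed to the harness. A rough sketch under that assumption. The dict-based data model, key names and condition strings are purely illustrative and are not reftest manifest syntax or an existing API:

```python
def apply_defaults(defaults, entries):
    """Prepend manifest-level skip conditions to each entry's own conditions."""
    inherited = list(defaults.get("skip_if", []))
    return [
        {"test": e["test"], "skip_if": inherited + list(e.get("skip_if", []))}
        for e in entries
    ]


# Hypothetical manifest: one default condition, one test with its own condition.
defaults = {"skip_if": ["some_sandbox_condition"]}
entries = [
    {"test": "a.html"},
    {"test": "b.html", "skip_if": ["some_platform_condition"]},
]

for entry in apply_defaults(defaults, entries):
    print(entry["test"], entry["skip_if"])
```

With this shape, the conditions travel with the manifest rather than with an include line in the root manifest.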
Depends on: 1616368
Depends on: 1616961
Status: NEW → ASSIGNED
Depends on: 1617261

Not currently working on this, though will try to get it back in the conversation.

Assignee: ahal → nobody
Status: ASSIGNED → NEW

Discussion we had recently on Matrix that ended up being related to this bug:

marco
I checked the difference between bugbug's response at CT_MEDIUM and the bugbug-disperse-medium shadow scheduler. The regression detection rate is worse by 8% (76% instead of 84%), but the number of selected groups is around 1/3.
bugbug's response at CT_LOW would have almost half the number of selected groups as bugbug-disperse-medium, with basically the same regression detection rate (83% instead of 84%)
so we could almost halve our CI load without affecting the regression detection rate

ahal
this is assuming we solved those remaining suites?

jmaher
if we don't have to solve the other suites and can cut our load in ~1/2 for autoland and ./mach try auto, that would be great
I think m-c replacement would be a bigger win

marco
yes, IDK which suite contributes the most

ahal
I'd guess reftest.. though it's possible some of those smaller suites simply haven't been tested. I.e. I wonder why a11y is disabled

marco
what could I do to verify?
I could count the number of times groups outside the ones selected by bugbug were chosen, by group
and then by looking at which groups they are, we can figure out which suite they belong to
does that make sense? Or is it surely reftest?

ahal
Pretty surely reftest

marco
I guess reftest is the most difficult?

ahal
yeah, it needs some refactoring of how manifests work
I had made some good progress on it back in the day, but then other aspects of the project were more important

marco
I was thinking a "workaround" to at least reduce part of the problem would be to chunk reftests more, but 1) we might be unlucky and still run all of the chunks; 2) the overhead increases

ahal
I think the issue I had in that bug was that there were generated jsreftests that had skip-if in them
and I didn't know how to handle those
but, maybe we could have a mode where it's ok to enable it for jsreftest and then we can at least enable manifest scheduling on reftest/crashtest
once that is fixed, there still might be test interdependency issues and the normal problems that arose with other suites
but hard to say how much of that will be a problem (if any)

jmaher
jsreftest should only run if js/src/ code is changed, so that is an all or none type of package right now and logically makes sense

ahal
yeah, I think they have a SCHEDULES rule, which means they would ignore bugbug anyway
if you think we can reduce load on autoland / mach try auto by half... that definitely seems worth a month of work :)
probably won't happen this month, but I'll chat with Mihai about it (I think you've already been talking to him too)

jmaher
would this be only reducing reftests we schedule?
right now we are looking at web-platform-tests and browser-chrome being two of the most expensive suites (raptor is #1), reftest is half the cost of bc/wpt

ahal
I think Marco's "half" estimate is just within a bugbug push (so ignores backstops)
and it includes all suites that aren't manifest chunking enabled, not just reftest
so yeah, it's likely a lot less than half overall
so we should be careful about what we promise :)

marco
yes, this is only on "normal" pushes, not backstops

jmaher
assuming backstops are 1/2 the cost of autoland- not sure of cost of try for !./mach try auto pushes

marco
if backstops are 1/2, then reducing the cost by 1/2 on normal pushes would reduce the cost by 1/4 overall

ahal
(math gets a bit trickier with expanded backstops)
but yeah

marco
are full backstops 1/2 of the cost, or are full+optimized backstops 1/2?

ahal
I think jmaher was just spitballing a number

marco
oh ok :P

jmaher
yeah, I was just guesstimating. So maybe we start with 25% reduction in cost as an upper end?

ahal
I'm unsure how much time people are spending on mach try auto now since AD is gone
but whatever that is could be halved on top

marco
true

jmaher
I think reviewbot is 25% of overall try usage:)

marco
do you mean by cost or by # of pushes?

jmaher
it is probably 50% of total pushes (I had looked into it a while back), they are cheap, but it adds up to a good price
maybe we could find a way to query treeherder or hg logs of try and get total % pushes using mach try auto
but what problem are we solving- capacity or money? if we want to solve money, storage should be resolved first, then idle time;

ahal
yeah, we could build this into mozci likely

agreed.. though those don't seem to have clear paths forward, or do you know the causes?
whereas here there are tangible steps that could take ~1-2 weeks and still have relatively large payoffs

jmaher
artifact cleanup is straightforward- just needs ~3 weeks of work to save $100K+/year

marco
regarding storage, we had https://bugzilla.mozilla.org/show_bug.cgi?id=1651965 which I thought was supposedly easy
regarding idle time, by running fewer tests we might also reduce the idle time (as we need to launch fewer instances), hard to calculate by how much though

jmaher
that hasn't worked in the past- idle time is just hung machines or paying 24x7 for a machine in DC and only using it 20% of the time

marco
anyway, if it is a relatively small amount of work to do it, I'd say it's worth doing it (even if you "only" save 25% of autoland cost and increase overall capacity)
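
The back-of-the-envelope estimate from the discussion above (backstops at 1/2 of the cost, normal pushes cut by 1/2, so about 1/4 overall) can be written out as a small calculation. This is only a sketch: both input fractions were guesses in the chat, the function name is made up, and it assumes backstop cost stays unchanged:

```python
def overall_reduction(backstop_share, normal_push_reduction):
    """Fraction of total cost saved when only non-backstop pushes get cheaper."""
    normal_share = 1.0 - backstop_share  # share of cost spent on normal pushes
    return normal_share * normal_push_reduction


# Guessed inputs from the chat: backstops are half the cost, normal pushes are halved.
print(overall_reduction(0.5, 0.5))  # 0.25
```

A larger backstop share shrinks the overall saving, which is why the 25% figure was treated as an upper bound.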

...

Severity: normal → S3