Open Bug 1272042 Opened 4 years ago Updated 4 years ago

Consider scheduling reftests only when necessary

Categories: Testing :: Reftest (defect)

People: (Reporter: gps, Unassigned)

References: (Blocks 1 open bug)

Reftests (along with most jobs in Firefox automation) are executed during most checkins. So e.g. if I change some JS code in the Firefox frontend, reftests still run.

Reftests can consume 6+ days of machine time per checkin (if all platforms and configurations run). Only mochitests consume a larger percentage of machine time than reftests.

Furthermore, we have a test machine capacity issue on Windows and OS X right now. So reducing the machine time spent running test jobs on those platforms will have a huge impact on e.g. Try turnaround times.

It seems to me that reftests are highly specific to layout code. So, I'm going to argue that we should not run reftests in automation unless specific, identified parts of the code have changed. Or, at least we don't need to run the full reftest suite by default if we think a checkin isn't going to impact reftests (e.g. we could run a smoketest of 1-5% of reftests).

Does this sound reasonable? Can we limit running reftests to when certain paths change? Ideally we'd build a whitelist and only run reftests if certain files changed. But I'd settle for a blacklist, e.g. "don't build if the change only touched browser/*."
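A blacklist check like that is only a few lines. This is a hypothetical sketch, not an existing scheduling API; the `SKIP_PREFIXES` set and function name are illustrative:

```python
# Sketch of a path-based blacklist: skip reftests when a push only
# touches directories believed not to affect rendering.
# SKIP_PREFIXES is a hypothetical example, not a vetted set.
SKIP_PREFIXES = ("browser/",)

def should_skip_reftests(changed_files):
    """Return True if every changed file falls under a skip prefix."""
    return bool(changed_files) and all(
        f.startswith(SKIP_PREFIXES) for f in changed_files
    )
```

An empty change list conservatively returns False, i.e. the tests still run.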
Doesn't sound bad in principle (aren't we trying to do this more generically for all test suites in the long run anyway?).
It should be possible to backfill the skipped runs if needed, in case something unexpected causes a regression.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #1)
> Doesn't sound bad in principle (aren't we trying to do this more generically
> for all test suites in the long run anyway?).

Yes. But that's likely several months down the road. I'm looking for low-hanging fruit right now.
(In reply to Gregory Szorc [:gps] from comment #0)
> It seems to me that reftests are highly specific to layout code.

s/layout/"rendering"/ (which includes graphics, imagelib, native-widget code, some DOM-ish stuff like SVG... Pretty broad slice of the platform.)

> So, I'm going to argue that we should not run reftests in automation unless
> specific, identified parts of the code have changed.

This seems OK to me, if we had a perfect whitelist of those parts of the code.  But I'm skeptical that we can come up with a perfect whitelist... So, I'd suggest we *also* have an "at least every Nth push gets a full run of reftests" requirement.

> Or, at least we don't
> need to run the full reftest suite by default if we think a checkin isn't
> going to impact reftests (e.g. we could run a smoketest of 1-5% of reftests).

(That also seems maybe-useful, but I also think it'd be worth doing a full run at least every N pushes.)

> Can we limit running reftests to when certain
> paths change? Ideally we'd build a whitelist and only run reftests if
> certain files changed. But I'd settle for a blacklist. e.g. "don't build if
> change only touched browser/*."

A blacklist might be easier to construct.  (But even the strawman browser/* suggestion is not blacklistable, because e.g. pref-tweaks in browser/app/profile/prefs.js could still impact arbitrary behavior and cause reftests to start unexpectedly failing/passing. Maybe that's the only exception in there, but I'm not sure.)
ISTM there are a couple of possible approaches here. We could be "conservative", and say that we'll skip the tests only if we're (almost) certain the push shouldn't affect them -- such as when it only touches front-end JS code or browser CSS or the like. I suspect, though, that such a blacklist would either end up looking very small (and therefore not all that useful), or needing to be extremely detailed (and therefore hard to maintain).

So the alternative more "aggressive" approach would be to adopt a whitelist that aims to cover the main places where frequent reftest-relevant changes are liable to happen, and then in the (rare, we hope!) event that a change elsewhere -- like a pref tweak under browser -- does cause reftest failures, rely on backfilling skipped tests to isolate the culprit.

From a quick look at top-level directories, I'd suggest running reftests only when a patch touches anything within

    dom/
    gfx/
    image/
    intl/
    layout/
    media/
    modules/

and skipping them otherwise. Perhaps with dholbert's added condition of "run them at least every N pushes", for some value of N (try something modest like 8 or 16, and see how it goes?).
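Combined with dholbert's periodic-full-run condition, the policy might look roughly like this. The directory tuple comes from the list above; the function and parameter names (and the default N) are illustrative assumptions, not an existing scheduler:

```python
# Hypothetical scheduling predicate: run reftests when a push touches a
# whitelisted directory, and unconditionally on every Nth push so that
# misses outside the whitelist are caught within N pushes.
REFTEST_DIRS = ("dom/", "gfx/", "image/", "intl/",
                "layout/", "media/", "modules/")

def should_run_reftests(changed_files, push_number, every_nth=16):
    if push_number % every_nth == 0:
        return True  # periodic full run, regardless of what changed
    return any(f.startswith(REFTEST_DIRS) for f in changed_files)
```

Tuning `every_nth` trades machine time against how far a regression can drift before the periodic run (or a backfill) catches it.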

(Perhaps we could take a selection of the dom/ subdirs -- things like canvas and svg are clear candidates, and maybe a few more -- as it's a big subtree and many parts of it look relatively safe to skip.)

The list above is admittedly very crude, but my hunch is that bugs outside these directories -- including things like editor/, view/ and widget/ that are certainly connected to rendering -- will tend to show up as mochitest failures rather than (or in addition to) reftest failures. It's definitely _possible_ to break reftests with changes in all sorts of other places in the tree, but I'd like to think that it's relatively rare.

Maybe we could try something along these lines on an explicitly experimental basis for a couple of weeks, and see how much machine time it saves (leading, presumably, to overall quicker turnaround) versus how much extra hassle it causes when we need to backfill tests (causing added delay) to track down problems.
I think having a global reftest list is the root of the problem.
http://mxr.mozilla.org/mozilla-central/source/layout/reftests/reftest.list

As an example, we could probably remove 'image/' from Jonathan's list
above if we didn't have "include ../../image/test/reftest/reftest.list"
in this file.  While changes in image/ can certainly break lots of other
reftests, they should also break image/'s own reftests in most cases.
Changes in layout/, though, are very unlikely to break the image/
reftests, and those take a long time to run, IIRC.

Other things that should not live in this file:
include ../../netwerk/test/reftest/reftest.list
include ../../parser/htmlparser/tests/reftest/reftest.list
etc.

I agree we should probably exclude editor/ but it would be nice if we
could run editor/reftests/reftest.list by default at least.

It may be too late to change layout/reftests/reftest.list from being
the global reftest list that runs all reftests, but we should start
moving towards more topic-related reftest lists that map source
directories to a smaller set of reftest.list files, and only run
those by default.
(With an option to run the global list when you need it.)
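As a sketch, the split might use per-topic manifests that avoid cross-module includes, plus an opt-in aggregate for full runs. The include and `==` lines below follow the existing reftest.list syntax, but the file layout and test names are hypothetical:

```
# layout/reftests/reftest.list -- layout-only tests, no cross-module includes
include text/reftest.list
include svg/reftest.list

# image/test/reftest/reftest.list -- scheduled only when image/ changes
== downscale-1.html downscale-1-ref.html

# testing/global-reftest.list -- aggregate, run on demand or every Nth push
include ../layout/reftests/reftest.list
include ../image/test/reftest/reftest.list
```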
I would say splitting up reftests into different jobs in order to run fewer reftests on average would be in scope! There is some overhead to creating more jobs (it takes a few minutes to set up and tear down jobs). But I'll gladly pay that tax if it means e.g. eliminating hours of layout reftests that don't need to run if layout/dom/whatever doesn't change.
The whitelist in comment 5 plus browser/app/profile/prefs.js as mentioned in comment 4, plus widget/, seems reasonable to me.

There are also two reftest.list files in toolkit/. These reftests test XUL features which can be affected by stylesheets in toolkit/, so they need to run if anything in toolkit/ changes.

Do we know which reftests or reftest directories are the most expensive?
Also, can we calculate the amount of savings we'd get if we used this whitelist?
So far today, physical Windows machines have spent 3,655,508s (42.3 days) of machine time on reftest jobs on Windows platforms (t-xp32-ix, t-w732-ix, t-w864-ix). There are ~600 machines in that pool. Assuming machines are loaded 50% of the day, that pool provides the equivalent of ~300 machine-days of Windows tester time per day. So reftests are ~14% of machine time.
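A quick check of that arithmetic, using only the figures quoted above:

```python
# Back-of-envelope check of the reftest machine-time share.
reftest_seconds = 3_655_508
machines = 600
load_factor = 0.5  # machines assumed busy half the day

reftest_days = reftest_seconds / 86_400   # ~42.3 machine-days of reftests
available_days = machines * load_factor   # ~300 machine-days available/day
share = reftest_days / available_days     # ~0.14, i.e. ~14%
```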

It's worth noting that since I filed this bug, Release Engineering has started running Windows test jobs in AWS, drastically reducing the capacity problem on the physical Windows machines. They've also enabled a few hundred Mac machines in the datacenter(s), mostly eliminating capacity problems on that platform.

Mochitests account for more physical Windows machine time than reftests (it's at least 5x greater). However, mochitests are harder to optimize since they are more generic than reftests.

I heard we're planning to move more Windows test jobs to AWS, so the priority of this bug has gone down a bit. I'm in wait-and-see mode right now regarding where we stand capacity-wise on Windows...