Bug 1572820 (Open) · Opened 4 months ago · Updated 8 days ago

run expected-[fail|crash|timeout] tests once/day as tier-2 instead of all the time

Categories

(Testing :: General, task, P3)

Version: 3
Type: task

Tracking

(Not tracked)

People

(Reporter: jmaher, Unassigned)

References

(Blocks 1 open bug)

Details

22% of our overall web-platform-test execution time is spent on tests that are not expected to pass. Most likely the tests are not well written or Firefox doesn't fully support the feature under test. Either way, we run these per commit, on all try pushes, and as tier-1. I would like to run these as tier-2 and only once per day on mozilla-central.

We could do this for xpcshell/mochitest as well; let's start with wpt, since the runtime there will be a significant savings.

Some things to solve:

  1. wpt-update scripts will need to account for this (ensure the jobs still run on try, etc.)
  2. write better code than this: https://hg.mozilla.org/try/rev/e39120a4e76eeb3b5140fec372fd54ed33f6ba09 (see the sketch after this list)
  3. taskcluster scheduling with SETA currently skips many wpt tests when they are updated; we should add a scheduling rule so that when wpt manifests change we run all wpt jobs (tier-1?)
  4. we should quantify how many regressions we would miss per commit and only find in the new once-per-day scenario (ignoring wpt updates).
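
As a rough illustration of what "better code" for item 2 might look like, here is a minimal sketch of a filter that keeps only expected-pass tests for the per-commit runs. The names (is_expected_to_pass, test.metadata) are hypothetical rather than the actual wptrunner API, and real expectation data can be conditional, so this shows only the shape of the idea:

  PASSING = {"PASS", "OK"}

  def is_expected_to_pass(metadata):
      # Tests with no recorded expectation default to PASS in wpt
      # metadata; anything else would move to the once-daily job.
      expected = metadata.get("expected", "PASS")
      if isinstance(expected, list):
          # Conditional expectations: only treat the test as
          # expected-pass if every branch passes.
          return all(status in PASSING for status in expected)
      return expected in PASSING

  def tier1_tests(tests):
      # Per-commit (tier-1) runs keep only expected-pass tests;
      # everything else moves to the once-daily tier-2 job.
      return [t for t in tests if is_expected_to_pass(t.metadata)]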
Blocks: 1573872

Some data for item #4 above:

In 2019-H1 I found references to 102 regressions where wpt tests were the only annotated items. Of those, 80 were resolved by editing test cases, manifests, or tooling/infrastructure.

That leaves 22 regressions that had a product fix or an unknown fix (not straightforward to figure out):
fixed revision, type of fix, bug number
c908cdfffe30, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1509575
008355f71ca0, unknown,
c7085705cd57, unknown,
2e610b19a641, unknown,
31b3a7b0f085, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1524648
36ae1038bf8c, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1509738
caab6cb81820, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1509738
903ca27a7aed, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1521476
31bb66676827, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1529117
3a2ced4fbd98, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1529831
66a4a5cb3fc7, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1525245
6b8e4909d303, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1521923
ede970649f87, unknown, https://bugzilla.mozilla.org/show_bug.cgi?id=1531294
1df7ae29b121, unknown, https://bugzilla.mozilla.org/show_bug.cgi?id=1535507
5d179e192915, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1538969
4d450ab98f58, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1539006
45bf2b32a377, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1525640
29be14061be5, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1545751
e8766f96041a, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1516722
5179e588a94f, product,
0a1d0653490a, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1554244
ff649d101ade, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1561773

For these 22 regressions we should check which test failed and whether that test had any expectations in the manifest other than PASS/OK. I suspect that list will be short. Knowing which types of expectations we have, and their counts, will help determine how many regressions would be found much more slowly; if the numbers are high, or concentrated in a single type of failure, we could adjust the goals of this bug.
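
A quick way to scope that check would be something like the sketch below, which walks the in-tree metadata directory and counts files carrying an expectation other than PASS/OK. The path and the line-based matching are assumptions; real metadata files use wptmanifest's conditional syntax, so this is an over-approximation at best:

  import os
  import re

  META_DIR = "testing/web-platform/meta"
  EXPECTED = re.compile(r"^\s*expected:\s*(\S+)")

  def non_pass_expectations(meta_dir=META_DIR):
      hits = []
      for root, _dirs, files in os.walk(meta_dir):
          for name in files:
              if not name.endswith(".ini"):
                  continue
              path = os.path.join(root, name)
              with open(path) as f:
                  for line in f:
                      m = EXPECTED.match(line)
                      if m and m.group(1).upper() not in ("PASS", "OK"):
                          hits.append((path, m.group(1)))
                          break  # one hit per file is enough for a count
      return hits

  print(len(non_pass_expectations()), "metadata files with a non-PASS/OK expectation")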

Priority: -- → P3

My rough calculations show this would save $100k/yr going forward. Definitely worth prioritizing.

See Also: → 1594796

I have some concerns with the approach suggested here:

  • The wpt sync relies on having complete results on central in order to detect differences introduced by PRs. Not running all tests on every central commit would break that.
  • The sync also relies on try running all tests, and I think that developers do too. Determining whether failing tests are in fact fixed by a patch is one of the most important functions of running wpt through try, so it shouldn't require extra effort to get that correct.
  • The suggested semantics ("don't run tests that aren't expected to pass") don't make much sense in detail. There is a difference between reftests and test types with subtests. Reftests have a single status that they report at the top level, whereas for other tests the top-level status only records whether the test completed without an uncaught exception, timeout, or crash. This means we would suppress running failing layout tests but continue to run testharness tests in which all the subtests fail, even though the former are most likely Gecko bugs while the latter may be features we don't intend to implement (see the sketch after this list).
  • It's also possible for a test to have a status of ERROR or TIMEOUT but a number of stably passing subtests. They may also show leaks, asserts, or crashes. So the proposed semantics also don't correspond to only running tests that are able to show regressions.
  • I worry about the messaging associated with labelling failing tests "Tier 2". The failing tests — particularly the ones that are passing in Chrome and Safari — are high value web-platform-tests because they are the ones that are identifying issues that we should triage and prioritise for fixing. Although logically we can still do that if they run less frequently, I think we already have a tendency to view tests we aren't already passing as "probably low value" without actually checking that assumption, and this plays into that narrative.
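
To make the reftest/testharness asymmetry in the third point concrete, here is a toy sketch (simplified statuses, not the wptrunner data model) of what "only run expected-pass tests" would actually select:

  def runs_per_commit(test_type, expected):
      # Under the comment-0 proposal, only expected-pass tests keep
      # running per commit.
      if test_type == "reftest":
          # A reftest has a single status covering the whole result.
          return expected == "PASS"
      # A testharness test's top-level status is OK as long as the
      # harness finished, even if every subtest is expected to FAIL.
      return expected == "OK"

  # A failing layout reftest (likely a Gecko bug) stops running...
  assert not runs_per_commit("reftest", "FAIL")
  # ...while a test whose subtests all fail (possibly an unimplemented
  # feature) keeps running, because its top-level status is OK.
  assert runs_per_commit("testharness", "OK")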

So at the very least we need to always run all wpt tests on central and try and only change things for integration branches.

However I think we should consider different approaches here. Two classes of tests were identified as being "low value" in comment #0:

  • Not well written
  • Not supported in Firefox

For the first case I think we should improve the tests. There's no sense in running buggy tests in any circumstance (on central, in upstream CI, etc.). This presumably isn't like intermittents, where the issues are often subtle race conditions, so it shouldn't be nearly as hard to work through the backlog.

The tests that aren't supported in Firefox break down into two subsets:

  • Features that we intend to support but where we have implementation issues
  • Features we don't plan to support at this time

For features we intend to support, the tests are high value in the sense that they're telling us about bugs in our implementation. For all the reasons given above I'd like to keep running these at tier 1. For features we simply don't intend to support, the overall status of the tests isn't so important; a well-written WebUSB test might have a status of OK in Firefox, and so run under the above proposal, but be useless to us unless we plan to actually implement that feature. But we could add metadata to mark those directories as tier-2 and only run them on central to check for leaks, crashes, etc. This of course requires some manual work to label the tier-2 directories, but it seems like a useful exercise to record the choices we've made, and the data can be used for other purposes (e.g. not autogenerating bugs corresponding to changes in those directories).
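
For what it's worth, the directory metadata could be as small as a one-line __dir__.ini under the relevant meta directory (e.g. testing/web-platform/meta/webusb/__dir__.ini); the key name and value here follow the proposal in this comment, but the exact syntax is an assumption to be settled when implementing:

  implementation-status: not-implementing

Every test under that directory would then inherit the status.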

I keep seeing wpt-sync cited as a reason this approach is not ideal. Can we fix wpt-sync to work in a before/after mode instead of depending on mozilla-central?

We do not have the luxury of running unlimited tests just because they exist. If there are subsets of non-expected-pass tests that we need to run, I would like them to be justified individually with a business case:

  • this specific test has been finding regressions in our code
  • we have developers actively working on a new feature which will depend on these tests
  • why we cannot wait 24 hours for a nightly build to find the failure and backfill

If we need to help figure out wpt-sync changes before deploying this change, then that should be part of the scope of this bug. Assuming we are running tests at a much lower rate, having a secondary try push seems reasonable from a cost/machine-usage perspective.

Can we fix wpt-sync to work in a before/after mode instead of depending on mozilla-central?

Yes, probably, but that's additional complexity that we don't have today.

We do not have the luxury of running unlimited tests just because they exist. If there are subsets of non-expected-pass tests that we need to run, I would like them to be justified individually with a business case:

Given that "expected pass" as defined here doesn't map logically onto the semantics you want, I don't think that's a reasonable position to take.

I have proposed an alternative solution (deprioritise tests for features we've made a business decision not to support). I think it's reasonable to have a discussion about the different tradeoffs, but I don't think it's reasonable to start from the assumption that the proposal in comment #0 is the correct solution.

See Also: → 1595515

Yesterday Joel, James, and I met about this bug. As James pointed out in comment 3, we can't just switch off everything that is expected to fail, as doing so would create a lot of work across the engineering productivity tooling.

We wrote up two possible scenarios in https://docs.google.com/document/d/1IaHpE0WVlK_ME_dXqSqrlzOCciHBjGwWFGHFwHPkDwY/edit?usp=sharing

After the meeting the plan is to

  • add more expectation data describing features we will never implement
  • add more expectation data for features we will implement but that aren't on a roadmap

We will have those tests turned off for now.

The cost of updating the tooling and processes, and of retriggering jobs to track down regressions, would offset the savings from simply turning the tests off wholesale.

The next step here is to identify the parts of the wpt suite that are for features we aren't going to implement, or don't have plans to implement at this time. I'll put together a spreadsheet with the list of test directories and the results summary for each, and arrange for it to be sent around so that the relevant code owners and people in product/engineering can fill in the implementation status. After that the steps are:

  • Add support for implementation status in the expectation metadata files.
  • Allow wptrunner to skip tests matching a given status using a command-line option like --skip-implementation-status=not-implementing,backlog (or whatever names we end up using); a sketch of this skip logic follows the list.
  • Update the TC tasks (and perhaps mozharness) to pass the relevant command-line options to the jobs.
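
A minimal sketch of the skip logic in the second bullet, assuming the status is inherited from __dir__.ini metadata and defaults to "implementing" when unset (both assumptions, to be settled when implementing):

  def should_skip(test, skip_statuses):
      # "implementation-status" would be inherited from __dir__.ini
      # metadata; assume "implementing" when the key is unset.
      status = test.metadata.get("implementation-status", "implementing")
      return status in skip_statuses

  # --skip-implementation-status=not-implementing,backlog would parse to:
  skip_statuses = {"not-implementing", "backlog"}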

Looking at https://docs.google.com/spreadsheets/d/1YOlTYJ6cXtXBcdmPVRm0zH6xP6KEJ08q4BrXg0NedTM/edit#gid=0 we have some candidates for areas where we should not be running tests.
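
The results summary for that spreadsheet could be generated from a wptreport log; the sketch below assumes a wptreport.json produced with --log-wptreport, whose top-level "results" list has "test" and "status" fields (subtest results are ignored here for brevity):

  import json
  from collections import Counter, defaultdict

  def summarize(report_path="wptreport.json"):
      with open(report_path) as f:
          report = json.load(f)
      by_dir = defaultdict(Counter)
      for result in report["results"]:
          # e.g. "/webusb/usb.https.any.html" -> "webusb"
          top = result["test"].strip("/").split("/")[0]
          by_dir[top][result["status"]] += 1
      for directory, counts in sorted(by_dir.items()):
          print(directory, dict(counts))

  summarize()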
