Bug 1572820 (Open) · Opened 4 months ago · Updated 8 days ago

run expected-[fail|crash|timeout] tests once/day as tier-2 instead of all the time

Categories

(Testing :: General, task, P3)

Version: 3
Type: task

Tracking

(Not tracked)

People

(Reporter: jmaher, Unassigned)

References

(Blocks 1 open bug)

Details

22% of our overall web-platform-test execution time is spent on tests that are not expected to pass. Most likely the tests are not well written or Firefox doesn't fully support the feature under test. Either way, we run these per commit, on all try pushes, and as tier-1. I would like to run these as tier-2 and only once per day on mozilla-central.

We could do this for xpcshell/mochitest as well; let's start with wpt, since the runtime there will be a significant savings.

Some things to solve:

  1. wpt-update scripts will need to account for this (ensure the jobs still run on try, etc.)
  2. write better code than this: https://hg.mozilla.org/try/rev/e39120a4e76eeb3b5140fec372fd54ed33f6ba09 (see the sketch after this list)
  3. taskcluster scheduling with SETA currently skips many wpt tests when they are updated; we should add a scheduling rule so that when wpt manifests change we run all wpt jobs (tier-1?)
  4. we should quantify how many regressions we would miss per commit and only find in the new once-per-day scenario (ignoring wpt updates).
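
As a rough illustration of what "better code" for item 2 might look like, here is a minimal sketch of a filter that keeps only expected-pass tests for the per-commit runs. The names (is_expected_to_pass, test.metadata) are hypothetical rather than the actual wptrunner API, and real expectation data can be conditional, so this shows only the shape of the idea:

  PASSING = {"PASS", "OK"}

  def is_expected_to_pass(metadata):
      # Tests with no recorded expectation default to PASS in wpt
      # metadata; anything else would move to the once-daily job.
      expected = metadata.get("expected", "PASS")
      if isinstance(expected, list):
          # Conditional expectations: only treat the test as
          # expected-pass if every branch passes.
          return all(status in PASSING for status in expected)
      return expected in PASSING

  def tier1_tests(tests):
      # Per-commit (tier-1) runs keep only expected-pass tests;
      # everything else moves to the once-daily tier-2 job.
      return [t for t in tests if is_expected_to_pass(t.metadata)]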
Blocks: 1573872

Some data for item #4 above:

In 2019-H1 I found references to 102 regressions where wpt tests were the only annotated items. Of those, 80 were resolved by editing test cases, manifests, or tooling/infrastructure.

That leaves 22 regressions that had a product fix or an unknown fix (not straightforward to figure out):
fixed revision, type of fix, bug number
c908cdfffe30, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1509575
008355f71ca0, unknown,
c7085705cd57, unknown,
2e610b19a641, unknown,
31b3a7b0f085, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1524648
36ae1038bf8c, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1509738
caab6cb81820, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1509738
903ca27a7aed, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1521476
31bb66676827, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1529117
3a2ced4fbd98, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1529831
66a4a5cb3fc7, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1525245
6b8e4909d303, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1521923
ede970649f87, unknown, https://bugzilla.mozilla.org/show_bug.cgi?id=1531294
1df7ae29b121, unknown, https://bugzilla.mozilla.org/show_bug.cgi?id=1535507
5d179e192915, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1538969
4d450ab98f58, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1539006
45bf2b32a377, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1525640
29be14061be5, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1545751
e8766f96041a, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1516722
5179e588a94f, product,
0a1d0653490a, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1554244
ff649d101ade, product, https://bugzilla.mozilla.org/show_bug.cgi?id=1561773

For these 22 regressions we should check which test failed and whether that test had any expectations in the manifest other than PASS/OK. I suspect that list will be short. Knowing which types of expectations we have, and their counts, will help determine how many regressions would be found much more slowly; if the numbers are high, or concentrated in a single type of failure, we could adjust the goals of this bug.
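
A quick way to scope that check would be something like the sketch below, which walks the in-tree metadata directory and counts files carrying an expectation other than PASS/OK. The path and the line-based matching are assumptions; real metadata files use wptmanifest's conditional syntax, so this is an over-approximation at best:

  import os
  import re

  META_DIR = "testing/web-platform/meta"
  EXPECTED = re.compile(r"^\s*expected:\s*(\S+)")

  def non_pass_expectations(meta_dir=META_DIR):
      hits = []
      for root, _dirs, files in os.walk(meta_dir):
          for name in files:
              if not name.endswith(".ini"):
                  continue
              path = os.path.join(root, name)
              with open(path) as f:
                  for line in f:
                      m = EXPECTED.match(line)
                      if m and m.group(1).upper() not in ("PASS", "OK"):
                          hits.append((path, m.group(1)))
                          break  # one hit per file is enough for a count
      return hits

  print(len(non_pass_expectations()), "metadata files with a non-PASS/OK expectation")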

Priority: -- → P3

My rough calculations show this would save $100k/yr going forward. Definitely worth prioritizing.

See Also: → 1594796

I have some concerns with the approach suggested here:

  • The wpt sync relies on having complete results on central in order to detect differences introduced by PRs. Not running all tests on every central commit would break that.
  • The sync also relies on try running all tests, and I think that developers do too. Determining whether failing tests are in fact fixed by a patch is one of the most important functions of running wpt through try, so it shouldn't require extra effort to get that correct.
  • The suggested semantics ("don't run tests that aren't expected to pass") don't make much sense in detail. There is a difference between reftests and test types with subtests. Reftests have a single status that they report at the top level, whereas for other tests the top-level status only records whether the test completed without an uncaught exception, timeout, or crash. This means we would suppress running failing layout tests but continue to run testharness tests in which all the subtests fail, even though the former are most likely Gecko bugs while the latter may be features we don't intend to implement (see the sketch after this list).
  • It's also possible for a test to have a status of ERROR or TIMEOUT but a number of stably passing subtests. They may also show leaks, asserts, or crashes. So the proposed semantics also don't correspond to only running tests that are able to show regressions.
  • I worry about the messaging associated with labelling failing tests "Tier 2". The failing tests — particularly the ones that are passing in Chrome and Safari — are high value web-platform-tests because they are the ones that are identifying issues that we should triage and prioritise for fixing. Although logically we can still do that if they run less frequently, I think we already have a tendency to view tests we aren't already passing as "probably low value" without actually checking that assumption, and this plays into that narrative.
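
To make the reftest/testharness asymmetry in the third point concrete, here is a toy sketch (simplified statuses, not the wptrunner data model) of what "only run expected-pass tests" would actually select:

  def runs_per_commit(test_type, expected):
      # Under the comment-0 proposal, only expected-pass tests keep
      # running per commit.
      if test_type == "reftest":
          # A reftest has a single status covering the whole result.
          return expected == "PASS"
      # A testharness test's top-level status is OK as long as the
      # harness finished, even if every subtest is expected to FAIL.
      return expected == "OK"

  # A failing layout reftest (likely a Gecko bug) stops running...
  assert not runs_per_commit("reftest", "FAIL")
  # ...while a test whose subtests all fail (possibly an unimplemented
  # feature) keeps running, because its top-level status is OK.
  assert runs_per_commit("testharness", "OK")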

So at the very least we need to always run all wpt tests on central and try and only change things for integration branches.

However I think we should consider different approaches here. Two classes of tests were identified as being "low value" in comment #0:

  • Not well written
  • Not supported in Firefox

For the first case I think we should improve the tests. There's no sense in running buggy tests in any circumstance (on central, in upstream CI, etc.). This presumably isn't like intermittents, where the issues are often subtle race conditions, so it shouldn't be nearly as hard to work through the backlog.

The tests that aren't supported in Firefox break down into two subsets:

  • Features that we intend to support but where we have implementation issues
  • Features we don't plan to support at this time

For features we intend to support, the tests are high value in the sense that they're telling us about bugs in our implementation. For all the reasons given above I'd like to keep running these at tier 1. For features we simply don't intend to support, the overall status of the tests isn't so important; a well-written WebUSB test might have a status of OK in Firefox, and so run under the above proposal, but be useless to us unless we plan to actually implement that feature. But we could add metadata to mark those directories as tier-2 and only run them on central to check for leaks, crashes, etc. This of course requires some manual work to label the tier-2 directories, but it seems like a useful exercise to record the choices we've made, and the data can be used for other purposes (e.g. not autogenerating bugs corresponding to changes in those directories).
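
For what it's worth, the directory metadata could be as small as a one-line __dir__.ini under the relevant meta directory (e.g. testing/web-platform/meta/webusb/__dir__.ini); the key name and value here follow the proposal in this comment, but the exact syntax is an assumption to be settled when implementing:

  implementation-status: not-implementing

Every test under that directory would then inherit the status.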

I keep seeing wpt-sync cited as a reason this approach is not ideal. Can we fix wpt-sync to work in a before/after mode instead of depending on mozilla-central?

We do not have the luxury of running unlimited tests just because they exist. If there are subsets of non-expected-pass tests that we need to run, I would like them to be justified individually with a business case:

  • this specific test has been finding regressions in our code
  • we have developers actively working on a new feature which will depend on these tests
  • why we cannot wait 24 hours for a nightly build to find the failure and backfill

If we need to help figure out wpt-sync changes before deploying this change, then that should be part of the scope of this bug. Assuming we are running tests at a much lower rate, having a secondary try push seems reasonable from a cost/machine-usage perspective.

Can we fix wpt-sync to work in a before/after mode instead of depending on mozilla-central?

Yes, probably, but that's additional complexity that we don't have today.

We do not have the luxury of running unlimited tests just because they exist. If there are subsets of non-expected-pass tests that we need to run, I would like them to be justified individually with a business case:

Given that "expected pass" as defined here doesn't map logically onto the semantics you want, I don't think that's a reasonable position to take.

I have proposed an alternative solution (deprioritise tests for features we've made a business decision not to support). I think it's reasonable to have a discussion about the different tradeoffs, but I don't think it's reasonable to start from the assumption that the proposal in comment #0 is the correct solution.

See Also: → 1595515

Yesterday Joel, James, and I met about this bug. As James pointed out in comment 3, we can't just switch off everything that is expected to fail, as doing so would create a lot of work across the engineering productivity tooling.

We wrote up two possible scenarios in https://docs.google.com/document/d/1IaHpE0WVlK_ME_dXqSqrlzOCciHBjGwWFGHFwHPkDwY/edit?usp=sharing

After the meeting the plan is to

  • add more expectation data describing features we will never implement
  • add more expectation data for features we will implement but that aren't on a roadmap

We will have those tests turned off for now.

The cost of updating the tooling and processes, and of retriggering jobs to track down regressions, would offset the savings from simply turning the tests off wholesale.

The next step here is to identify the parts of the wpt suite that are for features we aren't going to implement, or don't have plans to implement at this time. I'll put together a spreadsheet with the list of test directories and the results summary for each, and arrange for it to be sent around so that the relevant code owners and people in product/engineering can fill in the implementation status. After that the steps are:

  • Add support for implementation status in the expectation metadata files.
  • Allow wptrunner to skip tests matching a given status using a command-line option like --skip-implementation-status=not-implementing,backlog (or whatever names we end up using); a sketch of this skip logic follows the list.
  • Update the TC tasks (and perhaps mozharness) to pass the relevant command-line options to the jobs.
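
A minimal sketch of the skip logic in the second bullet, assuming the status is inherited from __dir__.ini metadata and defaults to "implementing" when unset (both assumptions, to be settled when implementing):

  def should_skip(test, skip_statuses):
      # "implementation-status" would be inherited from __dir__.ini
      # metadata; assume "implementing" when the key is unset.
      status = test.metadata.get("implementation-status", "implementing")
      return status in skip_statuses

  # --skip-implementation-status=not-implementing,backlog would parse to:
  skip_statuses = {"not-implementing", "backlog"}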

Looking at https://docs.google.com/spreadsheets/d/1YOlTYJ6cXtXBcdmPVRm0zH6xP6KEJ08q4BrXg0NedTM/edit#gid=0 we have some candidates for areas where we should not be running tests.
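
The results summary for that spreadsheet could be generated from a wptreport log; the sketch below assumes a wptreport.json produced with --log-wptreport, whose top-level "results" list has "test" and "status" fields (subtest results are ignored here for brevity):

  import json
  from collections import Counter, defaultdict

  def summarize(report_path="wptreport.json"):
      with open(report_path) as f:
          report = json.load(f)
      by_dir = defaultdict(Counter)
      for result in report["results"]:
          # e.g. "/webusb/usb.https.any.html" -> "webusb"
          top = result["test"].strip("/").split("/")[0]
          by_dir[top][result["status"]] += 1
      for directory, counts in sorted(by_dir.items()):
          print(directory, dict(counts))

  summarize()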
