Open Bug 1357513 (test-verify) Opened 3 years ago Updated 1 month ago

[meta] New/modified test verification

Categories

(Testing :: General, enhancement, P3)

Tracking

(Not tracked)

People

(Reporter: gbrown, Unassigned)

References

(Depends on 13 open bugs, Blocks 2 open bugs)

Details

(Keywords: meta)

A major finding from the Stockwell project's triaging experience: Many frequent intermittent test failures arise from the introduction of new tests, or the modification of existing tests. A few random examples:

https://bugzilla.mozilla.org/show_bug.cgi?id=1340413#c7
https://bugzilla.mozilla.org/show_bug.cgi?id=1318389#c2
https://bugzilla.mozilla.org/show_bug.cgi?id=1353894#c2
https://bugzilla.mozilla.org/show_bug.cgi?id=1351456#c1
https://bugzilla.mozilla.org/show_bug.cgi?id=1351409#c5

Sometimes a new/modified test fails frequently and obviously on try and the test is improved before check-in to an integration branch.

Sometimes a new/modified test fails frequently and obviously on check-in and the changeset is backed out.

But sometimes those checks fail and an intermittent test failure is introduced anyway. We can reduce intermittent failures by introducing tools and processes which find these cases faster. The basic strategy here is to notice when tests are being updated and subject those tests to more stringent verification right away.

For example, when mochitest test_blah.html is updated in a push to try or an integration branch, a new test-verification job runs test_blah.html 50 times, in isolation. A similar test-verification mach command might be useful for ad-hoc use in development environments.
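
A minimal sketch of the "run the updated test many times, in isolation" idea, in Python. This is not the test-verify implementation: `verify`, `VerifyResult`, and `make_flaky` are hypothetical names, and `run_once` stands in for whatever would launch test_blah.html in a fresh harness instance and report pass/fail.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VerifyResult:
    attempts: int  # how many times the test was actually run
    passed: bool   # True only if every attempt passed


def verify(run_once: Callable[[], bool], repeats: int = 50) -> VerifyResult:
    """Run one test up to `repeats` times, stopping at the first failure."""
    for attempt in range(1, repeats + 1):
        if not run_once():
            return VerifyResult(attempts=attempt, passed=False)
    return VerifyResult(attempts=repeats, passed=True)


def make_flaky(fail_on: int) -> Callable[[], bool]:
    """Hypothetical stand-in for a test that fails on its Nth run."""
    calls = {"n": 0}

    def run_once() -> bool:
        calls["n"] += 1
        return calls["n"] != fail_on

    return run_once


if __name__ == "__main__":
    print(verify(lambda: True))    # a stable test uses all 50 runs
    print(verify(make_flaky(7)))   # a flaky test stops at the failing run
```

Stopping at the first failure keeps the job short when a test is clearly unstable; only a test that passes every run costs the full budget.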

Not all of the implementation details are clear to me, but some of them are; I'll file dependent bugs.
Duplicate of this bug: 1323044
Depends on: 1357520
Depends on: 1357551
See Also: → 1357557
It feels like there's a fair amount of overlap between this and bug 1322433. Some of the business logic is no doubt different but the basic concept ("run this job N times or until it fails") seems similar. You could consider using action tasks for this, which would have the added bonus of exposing the feature in the treeherder UI:

http://gecko.readthedocs.io/en/latest/taskcluster/taskcluster/actions.html
Depends on: 1371782
Depends on: 1380121
Depends on: 1380122
Depends on: 1380126
Depends on: 1390599
Depends on: 1390884
Depends on: 1390889
Depends on: 1390893
Depends on: 1391694
Depends on: 1396901
Depends on: 1396905
Depends on: 1397043
Depends on: 1397970
Depends on: 1398953
Depends on: 1398933
Depends on: 1394910
Depends on: 1400405
Depends on: 1400691
Depends on: 1400895
Depends on: 1400967
Depends on: 1400979
Depends on: 1404525
Depends on: 1404526
Depends on: 1405141
Depends on: 1405143
Depends on: 1403565
Depends on: 1405428
Depends on: 1406204
Depends on: 1406213
Depends on: 1406407
Depends on: 1409507
Depends on: 1409511
Depends on: 1410911
Depends on: 1411660
Depends on: 1412349
Depends on: 1418375
Depends on: 1418363
Depends on: 1423918
Depends on: 1411298
Blocks: 1428828
Depends on: 1431125
Priority: -- → P3
Depends on: 1439589
Depends on: 1441990
Depends on: 1443177
Depends on: 1453056
See Also: → 1447179
Depends on: 1455316
Depends on: 1455309
Assignee: gbrown → nobody
Depends on: 1461440
Depends on: 1461809
Depends on: 1462182
Depends on: 1465117
Depends on: 1466187
Depends on: 1466578
Depends on: 1466862
Depends on: 1460901
Depends on: 1466923
Depends on: 1467837
Depends on: 1469583
Depends on: 1471227
Depends on: 1473392
Depends on: 1476318
Depends on: 1475194
Depends on: 1477976
Depends on: 1483421
Depends on: 1482413
Depends on: 1483292
Depends on: 1522113
Depends on: 1534867
Depends on: 1535417
Depends on: 1536696
Alias: test-verify
Depends on: 1529238
Depends on: 1545297
Depends on: 1528471
Depends on: 1550735
Depends on: 1535287
Depends on: 1552300

One weakness of TV is that the TV task may not run with the same task configuration as the normal test task in which the tests would run. For instance, if the xpcshell test task and the mochitest test task for a particular platform use different builds (e.g. Windows xpcshell tests may run against a signed build), then TV can be configured to match one or the other, but not both. This is why TVg was introduced: so that TV and TVg could use different virtualizations. But deciding which tests apply to TV vs. TVg has also been tricky, and task configurations are always changing.

:bc's recent work on "test isolation" suggests a different approach -- https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=bc8f78e7ea0b947c07b6a6c4c502882faa1b973f -- where existing task definitions are cloned. There would be additional challenges for TV, but I'm thinking TV could identify test files from the hg log (as it does today), then spawn new tasks for each supported suite affected by the push. If a push modified a mochitest and an xpcshell test, TV would notice that, then spawn M-tv and X-tv tasks, each cloned from the appropriate existing task definition.
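
A rough sketch of the per-suite grouping step described above, in Python. Everything here is an assumption for illustration: `plan_tv_tasks`, `toy_suite_of`, and the `SUITE_TO_TASK` table are hypothetical names, and the filename heuristic is a toy (a real classifier would need to consult the test manifests). Only the M-tv and X-tv labels come from the comment itself.

```python
import posixpath
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical mapping from suite name to the cloned task's label.
SUITE_TO_TASK = {"mochitest": "M-tv", "xpcshell": "X-tv"}


def toy_suite_of(path: str) -> Optional[str]:
    """Toy heuristic only: guess the suite from the file name."""
    name = posixpath.basename(path)
    if name.startswith("test_") and name.endswith(".html"):
        return "mochitest"
    if name.startswith("test_") and name.endswith(".js"):
        return "xpcshell"
    return None


def plan_tv_tasks(
    changed_files: Iterable[str],
    suite_of: Callable[[str], Optional[str]],
) -> Dict[str, List[str]]:
    """Group the changed test files by the task that should verify them."""
    by_task = defaultdict(list)
    for path in changed_files:
        suite = suite_of(path)
        if suite in SUITE_TO_TASK:  # skip non-tests and unsupported suites
            by_task[SUITE_TO_TASK[suite]].append(path)
    return {label: sorted(paths) for label, paths in by_task.items()}


if __name__ == "__main__":
    push = [
        "dom/tests/test_blah.html",
        "netwerk/test/unit/test_foo.js",
        "README.txt",
    ]
    for label, paths in plan_tv_tasks(push, toy_suite_of).items():
        print(label, paths)
```

Each resulting label would then correspond to one task cloned from the matching existing task definition, so the verification run inherits that suite's build and platform configuration.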

Depends on: 1561884
Depends on: 1568063
Depends on: 1569982
Depends on: 1577197
Depends on: 1593779
Depends on: 1599242

:gbrown, given the upcoming changes in test scheduling (test manifest level) as well as recent fixes to retain metadata while retriggering, do you think fixing some of the scheduling issues for test-verify is accurate in the coming months?

I would like to know that test-verify works for all our major test harnesses and configs and that it is scheduled properly. Maybe a stretch goal is to treat tests that do not pass test-verify as something we only run on m-c and not on try by default (i.e. lower value). I don't think we can consider something like that without knowing if test-verify is accurate.

Based on the dependencies:
https://bugzilla.mozilla.org/showdependencytree.cgi?id=1357513&hide_resolved=1

it looks like there is some work to do here but not a lot.

Flags: needinfo?(gbrown)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #5)

> :gbrown, given the upcoming changes in test scheduling (test manifest level) as well as recent fixes to retain metadata while retriggering, do you think fixing some of the scheduling issues for test-verify is accurate in the coming months?

Since https://bugzilla.mozilla.org/show_bug.cgi?id=1522113#c14, most of my TV scheduling concerns have been addressed. Do you have TV scheduling concerns? How would test manifest level test scheduling affect TV scheduling?

> I would like to know that test-verify works for all our major test harnesses and configs and that it is scheduled properly.

test-verify supports wpt, mochitest (including subsuites, etc), reftest/crashtest/jsreftest, and xpcshell; nothing else.

> Maybe a stretch goal is to treat tests that do not pass test-verify as something we only run on m-c and not on try by default (i.e. lower value). I don't think we can consider something like that without knowing if test-verify is accurate.

TV is intended as an early warning system which draws attention to test vulnerabilities that can lead to intermittent failures; also, it provides a fast and convenient way to reproduce many intermittent failures quickly. I don't think it is appropriate to modify test scheduling based on TV results; certainly intermittent failure history is a more direct, simple, and fair metric to use for such purposes. (This is part of why I keep saying that tier-1 TV should be a non-goal.)

I believe that TV is mostly accurate: it finds genuine vulnerabilities in tests, it reproduces most frequent intermittent failures, and it very rarely fails without good reason. There is sometimes a perception that TV is not accurate because it reports failures in tests that rely on state established by other tests (e.g. tests that cannot run standalone cannot pass TV).

> Based on the dependencies:
> https://bugzilla.mozilla.org/showdependencytree.cgi?id=1357513&hide_resolved=1
>
> it looks like there is some work to do here but not a lot.

Bug dependencies here reflect a mixture of in-progress work and imminent plans that have been unexpectedly postponed. More extensive, longer-term plans for TV were proposed in planning documents in Q4 2018 and Q1 2019, and a trimmed-down version ("smart" TV) was proposed again recently, but none of these proposals has been supported. Given the ongoing lack of investment in TV, I am considering de-scheduling it entirely in 2020.

Flags: needinfo?(gbrown)
Depends on: 1610886
Depends on: 1551889