Return list of never-fail tests

RESOLVED INCOMPLETE

Status

Testing
ActiveData
RESOLVED INCOMPLETE
2 years ago
2 months ago

People

(Reporter: ekyle, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

2 years ago
Get a one-time list of never-fail tests
(Reporter)

Updated

2 years ago
Depends on: 1161188
More specifically, I'd like to see a list of mochitest-browser-chrome tests (sorted by dir) along with their failure rates over the period for which we have data in ActiveData.  This will allow us to detect which dirs, if any, are candidates for moving into rarely scheduled "perma-green" chunks.

I'd also like the same data for mochitest-plain.

There are likely random infra problems that contribute to some failures, which is why I'd like to see failure rates in addition to "never fail" tests.  Some of these failures may be discountable, but we'll have to investigate them to be sure.
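
A minimal sketch of the per-dir failure-rate report requested above. It assumes the raw data can be reduced to (test path, passed) pairs, one per test run; that input shape and the function names `failure_rates_by_dir` and `never_fail_dirs` are hypothetical and not part of ActiveData.

```python
from collections import defaultdict
from posixpath import dirname


def failure_rates_by_dir(results):
    """Return {dir: failure rate}, sorted by dir.

    `results` is an iterable of (test_path, passed) pairs, one per run.
    This input shape is an assumption made for illustration.
    """
    runs = defaultdict(int)
    failures = defaultdict(int)
    for test_path, passed in results:
        test_dir = dirname(test_path)
        runs[test_dir] += 1
        if not passed:
            failures[test_dir] += 1
    # Dicts preserve insertion order, so the result is sorted by dir.
    return {d: failures[d] / runs[d] for d in sorted(runs)}


def never_fail_dirs(rates):
    """Dirs with no recorded failure in the data set."""
    return [d for d, rate in rates.items() if rate == 0.0]
```

A dir with a small non-zero rate is not filtered out here; whether its failures are discountable infra noise still has to be checked by hand.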
I am still quite bothered by this kind of project. Over a timescale of even a year, a test may never fail simply because no one alters the code that it happens to be testing. As soon as someone does alter that code, it's rather inconvenient if the test is not run for N cycles and so we miss an obvious regression for much longer than we otherwise would. Of course one can hope that those tests would be run on try if they are directly relevant to the piece of code that has changed, but if we really trusted people to test on try we would turn off all tests on inbound.

Doing this as a one-shot thing seems even worse; given that we have relatively little historical data, I suspect the list of tests that fail maps strongly onto "areas of the code that are actively being developed today". But in one year we might have emphasis on very different projects touching very different parts of the code, so this data will be out of date.
I agree there's a risk here, so one of the things I'd like to do is to attempt to pick some dirs that seem stable and track their stability over time.  Anecdotally, we believe that there are certain low-value test dirs, like those for DOM level-0/1/2, that practically never fail unless something pretty fundamental breaks, in which case there are other failures as well.

I'm not suggesting we turn these off altogether, merely that we schedule them infrequently.  This is complementary to the SETA scheduling changes which are already in place.  Ideally, we'd be able to dynamically migrate test dirs in and out of less frequently scheduled "stable" chunks based on recent activity, so that if a "stable" dir starts exhibiting failures because a relevant feature is being worked on, this would signal some automation that these tests need to be run more frequently.

SETA scheduling changes dynamically based on the accidental ordering of tests in jobs; I think we could improve this targeting using ActiveData by prioritizing dirs which have exhibited failures recently.  The more stable the dir, the less often it's scheduled.
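
One way to picture "the more stable the dir, the less often it's scheduled" is the sketch below. It is not how SETA works; the function names, the `window` and `max_interval` parameters, and their default values are all assumptions chosen for illustration.

```python
def schedule_interval(pushes_since_last_failure, window=100, max_interval=16):
    """Return N, meaning "run this dir on every Nth push".

    The interval grows by one for each full `window` of pushes without
    a failure, capped at `max_interval`. Both numbers are hypothetical.
    """
    stable_windows = pushes_since_last_failure // window
    return min(max_interval, 1 + stable_windows)


def should_run(push_index, interval):
    """True when a dir with the given interval is due on this push."""
    return push_index % interval == 0
```

A failure resets `pushes_since_last_failure` to zero, so a previously stable dir goes back to running on every push as soon as it starts failing.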

The first step here is to determine whether we do in fact have stable dirs beyond a few obvious candidates.
(Reporter)

Comment 4

2 years ago
From Joel:

> There are 3 things here:
> 1) data to collect for analysis
> 2) what vectors do we care about
> 3) time window of concern
> 
> let me start with vectors:
> 1) all platforms, options, builds
> 2) per platform/buildtype combo: linux32 opt, linux32 debug,
>    linux64 opt, linux64 debug, ... (I think we are ~35)
> 
> for time windows, I would like to keep things on a multiple of 
> weeks for simplicity:
> 1) 7
> 2) 14
> 3) 21
> 4) 28
> so starting with vector-1, and time-1 to keep it simple here 
> is the data we want to collect:
> * for all unique test files, return a count of the number of 
>   failures
> ** no need to worry about a file with different failures
> ** no need to worry if the file failed for a real regression, etc.
> ** no need to worry if this is a side effect of an earlier failure
> * we will also have failures unassociated with test files 
>   (leaks, shutdown crashes)- bucket them into entries as well
> 
> repeat for the other time windows so we would have:
> [{"time_window": 7, "files": {"filename": count, "filename": count, ...}},
>  {"time_window": 14, "files": {"filename": count, "filename": count, ...}},
>  ...
> ]
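
A rough sketch of the collection Joel describes, for the "all platforms" vector only. It assumes failures arrive as (timestamp, filename) pairs with `None` for failures not tied to a test file (leaks, shutdown crashes); that input shape, the `failure_counts` name, and the `NO_FILE` bucket label are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOWS_DAYS = (7, 14, 21, 28)
NO_FILE = "(no test file)"  # hypothetical bucket for leaks, shutdown crashes


def failure_counts(failures, now, windows=WINDOWS_DAYS):
    """Count failures per test file for each time window ending at `now`.

    `failures` is an iterable of (timestamp, filename) pairs, where
    filename is None for failures not associated with a test file.
    """
    failures = list(failures)  # allow several passes over a generator
    report = []
    for days in windows:
        cutoff = now - timedelta(days=days)
        counts = Counter(
            filename or NO_FILE
            for timestamp, filename in failures
            if cutoff <= timestamp <= now
        )
        report.append({"time_window": days, "files": dict(counts)})
    return report
```

Every failure counts once, with no attempt to separate real regressions from side effects of earlier failures, as the notes above allow. The per platform/buildtype vector would need an extra field on each record.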
(Reporter)

Updated

2 years ago
Assignee: klahnakoski → nobody
(Reporter)

Comment 5

2 months ago
The test failures dashboard [1] is a (stalled) project to produce and visualize test-level aggregates, including failure rates. The one failing of this dashboard is that it only tracks tests that have failed at least once.  It should be expanded to cover all tests so that other test-level statistics (such as duration) can be tracked.

Closing this bug; it will be solved when the OrangeFactor v2 project [2] is resumed, if ever.


[1] https://people-mozilla.org/~klahnakoski/testfailures/
[2] https://docs.google.com/document/d/1LB4Ppj55rw9IFC-ggQr6Vy0yx8Gq-c6rcD7aNt_0Ib4
Status: NEW → RESOLVED
Last Resolved: 2 months ago
Resolution: --- → INCOMPLETE