Open Bug 1438500 Opened 2 years ago Updated 5 months ago

Investigate using code coverage data to prioritize intermittent tests

Categories

(Testing :: Code Coverage, enhancement)


Tracking

(Not tracked)

People

(Reporter: marco, Unassigned)

References

(Blocks 1 open bug)

Details

We could use code coverage data to prioritize intermittent tests.

For example, intermittent tests that have unique coverage should be more important (and more effort should be spent in fixing them) than intermittent tests that cover code that is already covered by other non-intermittent tests.
I recall thinking this in Austin and Greg did an experiment there with a few intermittent failures and didn't have much luck.  Since we keep coming up with the same idea, we should come up with ways to get this data at scale and how to measure whether this is useful or not.

For example, we wanted to use 'rr' for intermittent failures; the problem is that we could rarely (as in <5%) get intermittent failures to reproduce under 'rr'.  Typically we find other errors with the tools, the failure needs an exact config/environment, or by the time we reproduce it and get data the failure pattern has changed or was fixed by other code.

I suspect similar patterns here, but we should validate that.  Probably taking the needswork:owner list is a great place to start:
https://bugzilla.mozilla.org/buglist.cgi?resolution=---&resolution=FIXED&resolution=INVALID&resolution=WONTFIX&resolution=DUPLICATE&resolution=WORKSFORME&status_whiteboard_type=allwordssubstr&query_format=advanced&status_whiteboard=[stockwell%20needswork:owner]&list_id=14005287

We could query that and determine which tests fail to some reasonable extent on linux64-debug or win10-debug (most similar to our ccov environments).  We should run the jobs the same way they are run on Linux or Windows, that is, with the same chunking, since the state of the browser often causes issues with the test, or the test itself might not run solo or repeat reliably (there are hundreds of tests that cannot run by themselves or be repeated).  My thought is to then push to try for each test and run the job perhaps hundreds of times (or use some run-until-failure feature).

Then comes analyzing a green job vs a failed job. This is where Greg's work in December had trouble (well, reproducing was hard).  I suspect if we had 20 reproduced failures only a few would yield a clear signal.  Of course if we had 10 green jobs and 10 orange jobs, we could aggregate the pass/fail data and have a clearer signal.
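
To make the "aggregate the pass/fail data" idea concrete, here is a minimal sketch, not an existing tool. It assumes each job's coverage is available as a set of (file, line) pairs; that representation and the name coverage_signal are assumptions for illustration. It scores each line by how much more often orange jobs hit it than green jobs.

```python
from collections import Counter


def coverage_signal(green_runs, orange_runs):
    """Score each line by (share of orange runs hitting it) minus
    (share of green runs hitting it).

    Each run is assumed to be a set of (file, line) tuples; both lists
    must be non-empty. A score near +1 means the line shows up almost
    only in failing runs, near -1 almost only in passing runs.
    """
    green_hits = Counter(line for run in green_runs for line in run)
    orange_hits = Counter(line for run in orange_runs for line in run)

    scores = {}
    for line in set(green_hits) | set(orange_hits):
        green_rate = green_hits[line] / len(green_runs)
        orange_rate = orange_hits[line] / len(orange_runs)
        scores[line] = orange_rate - green_rate
    return scores
```

Lines that score near zero are covered about equally on both sides and can be ignored; with only one green and one orange job every differing line scores +1 or -1, which is why more data points on each side should give a clearer signal.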

Interested in other thoughts here, my thinking is usually linear which often results in missing great innovation or ideas.
Blocks: 1429455
Greg's experiments looked, for a given intermittent test, at what code was executed in the case of success and what in the case of failure, then analyzed the difference to see whether it made the cause of the failure clearer. In that case, reproducing the problem is paramount.

What I'm proposing instead doesn't require us to reproduce the failure. We actually want to collect the coverage for the test when it is successful. Then, we compare this coverage with the coverage from all other tests and check if it covers code that the other tests don't cover.
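
As a rough sketch of that comparison (assuming per-test coverage is available as a mapping from test name to a set of (file, line) pairs, which is an assumption about the data format, and with unique_coverage being a hypothetical helper name):

```python
def unique_coverage(target, coverage_by_test):
    """Return the lines covered by `target` and by no other test.

    `coverage_by_test` is assumed to map a test name to the set of
    (file, line) tuples covered when that test passes.
    """
    covered_by_others = set()
    for name, lines in coverage_by_test.items():
        if name != target:
            covered_by_others |= lines
    return coverage_by_test[target] - covered_by_others


# Illustrative data only; test names and lines are made up.
coverage = {
    "test_a.js": {("foo.cpp", 10), ("foo.cpp", 11)},
    "test_b.js": {("foo.cpp", 10), ("bar.cpp", 5)},
}
print(unique_coverage("test_a.js", coverage))
```

An empty result means everything the test exercises is also exercised by some other test.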
oh- would the end result be to disable intermittent tests that have no unique coverage?
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #3)
> oh- would the end result be to disable intermittent tests that have no
> unique coverage?

Not necessarily disable the tests with no unique coverage, but give more "fixing priority" to the ones that do have unique coverage.
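
One possible way to turn this into a "fixing priority" list, sketched under the assumption that per-test coverage is a mapping from test name to a set of (file, line) pairs (rank_by_unique_coverage is a hypothetical name, and counting unique lines is only one of several possible priority measures):

```python
def rank_by_unique_coverage(intermittent_tests, coverage_by_test):
    """Order intermittent tests by how many lines only they cover.

    Returns (test, unique_line_count) pairs, highest count first.
    Tests with a count of 0 have no unique coverage.
    """
    ranked = []
    for test in intermittent_tests:
        # Union of everything covered by every *other* test.
        others = set().union(
            *(lines for name, lines in coverage_by_test.items() if name != test)
        )
        unique = coverage_by_test[test] - others
        ranked.append((test, len(unique)))
    # sorted() is stable, so ties keep their input order.
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

The tests at the bottom of the list are not automatically candidates for disabling; they are just the ones where a fix buys the least extra coverage.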
that makes sense- and I have heard requests from developers that they would like to know which tests have unique coverage.

The challenge here is how to get coverage information for a specific test.  A simple way would be to run the job as it is, then again with the test skipped, and then analyze the coverage differences.  Of course we have a lot of variability in coverage, so we would need multiple data points on each side.

Are there other ideas on how to achieve this?
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #5)
> that makes sense- and I have heard requests from developers that they would
> like to know which tests have unique coverage.
> 
> The challenge here is how to get coverage information for a specific test. 
> A simple way would be to run the job as it is, then again with the test
> skipped- then analyze the coverage differences.  Of course we have a lot of
> variability in coverage- so we would need multiple data points on each side.
> 
> Are there other ideas on how to achieve this?

To reduce the variability, it would be useful to run the tests on the same build. Is there a way to do that? Is it possible to request additional tests to run for a given build?
We could:
1) Build Firefox;
2) With normal chunking, run all tests except the one we want to analyze;
3) In a single, separate chunk, run the test that we want to analyze.
There is not an easy way to do the workflow mentioned in comment 6.  I am sure we could build that, but I imagine just doing that with a couple of try pushes would be difficult, let alone with a few clicks from treeherder.
I'm thinking that we could definitely do what is suggested in comment 1 because we wouldn't have to deal with variability as much. Although, it would always have the disclaimer that the unique coverage found is not necessarily what is causing the failure, if it occurred on a different build (like linux64-pgo). But this will still give us a very good idea of how important the test is. To do this, we follow something like what :marco says in comment 6 and it would go like this:

  1) Run all test chunks at least 5-10 times, without the affected test [1]. (B)
  2) Run a single test chunk with the one affected test on the same build, at least 5-10 times. (A)
  3) Perform a set difference between A and B (A \ B), keeping everything in A that is not in B.

OR

  1) Run all test chunks at least 5-10 times. (B)
  2) Run all test chunks, without the affected test, at least 5-10 times. (A)
  3) Perform the set difference between B and A (B \ A), keeping everything that was uniquely touched by that test in B.

  ((1) and (2) would have to be done in a single push - on a single build).
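
A minimal sketch of step (3) when each side has been run several times. It assumes each run's coverage is a set of (file, line) pairs, and it makes one conservative choice that is not stated above: a line counts as unique only if every run with the test hit it and no run without the test hit it. The name stable_unique_lines is hypothetical.

```python
def stable_unique_lines(runs_with_test, runs_without_test):
    """Lines hit in every run that includes the test and in no run
    that excludes it.

    Both arguments are non-empty lists of sets of (file, line) tuples,
    one set per repeated run (e.g. the 5-10 repetitions).
    """
    always_with = set.intersection(*runs_with_test)
    ever_without = set().union(*runs_without_test)
    return always_with - ever_without
```

Taking the intersection on one side and the union on the other filters out lines that only appear because of run-to-run variability, at the cost of missing lines the test covers only some of the time.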

The problem, though, is that we can't do (1) and (2) in one push. So, we would need a way to tell a try push to get its build package from somewhere else. But this raises the question: does removing a test from a manifest file require a re-build?

We can always make a single extra build with all the test chunks run in it and find the build-based variability, then use that along with the chunk-based variability from (1) to determine unique coverage. It's not perfect but I think it's the best we may be able to do when we have variability to deal with.

One other idea I have now is that we could try to pinpoint all the tests that cause variability and blacklist them in these types of coverage collections, where we are looking at the uniqueness of a test, assuming that not all tests cause variability. (We still don't know if all tests cause variability or if only some of them cause it; this could actually be something interesting to analyze.)

In my opinion, we'll definitely have to solve this variability issue, find a nice way around it, or find a way to use it to our advantage if we'd like to make coverage useful for disabling tests, or for anything more than this. We'd also need to find a way to deal with build differences or to get coverage on other build configurations.



[1]: We need to run it without the affected test so that the unique coverage (if any) of this test is not included in the global view and so that the difference will work.
Another idea I have is that we could just look at the test coverage and get the variability measure from there by running it multiple times. If there are lines that are variable here, they are the only ones that would really affect our results. So, running all test chunks multiple times to capture global variability is not needed. (There is still the possibility of rare variability though as we can't approach infinity in terms of number of times the test was run).

Then, we can simply take that coverage and see what is unique in it by comparing it to a full coverage run on m-c. This will leave us with unique lines, and intermittently covered lines that occur in the test but might not necessarily be unique to it. I think this is the simplest way we could do this, and that this idea would be more of a trade-off between accuracy and ease of implementation.
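
Roughly, this could look like the sketch below. It assumes each repeated run of the test yields a set of (file, line) pairs, and that the baseline coverage was collected without the test under analysis (otherwise nothing would ever come out as unique, as per footnote [1]). The names classify_test_lines and baseline are hypothetical.

```python
def classify_test_lines(test_runs, baseline):
    """Split a test's coverage into stable-unique and variable lines.

    `test_runs` is a non-empty list of sets of (file, line) tuples,
    one per repeated run of the single test. `baseline` is the set of
    lines covered by the full run, assumed to exclude this test.
    """
    stable = set.intersection(*test_runs)  # hit in every run
    seen = set().union(*test_runs)         # hit in at least one run
    variable = seen - stable               # hit only in some runs
    return {
        # Always covered by the test, never by the baseline.
        "unique": stable - baseline,
        # Intermittently covered and absent from the baseline; these
        # may or may not be unique to the test.
        "variable_not_in_baseline": variable - baseline,
    }
```

More repetitions shrink the chance that a rarely hit line is misfiled as stable, but they can't eliminate it.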
Possibly we can reuse the existing work in test-verify, as this takes a list of individual test files and runs them in a unique job.  It also repeats the test file a few times. I think we could copy or use the existing test-verify to solve this in a prototype.  Test Verify is limited as it is tier-2 and cannot be tier-1 until we can run all existing tests in TV (our solution will be to annotate tests and create an ignore list).
Depends on: 1431753
With bug 1431753 fixed, we can do this manually with some effort. After we get all tests' data ingested into ActiveData, we will be able to do this more easily.
Blocks: code-coverage
No longer blocks: 1429455