Open Bug 657738 Opened 9 years ago Updated 6 years ago

Help detect orange randomness

Categories

(Testing :: Reftest, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

People

(Reporter: glandium, Unassigned)

References

(Blocks 1 open bug)

Details

Copy/pasted from http://glandium.org/blog/?p=1998:
I think we lack one important piece of information when we have a test failure: does it reliably happen with a given build? Chances are that most random oranges don’t (like the two I mentioned further above), but those that do may point to subtle problems, such as compiler optimizations breaking some of our assumptions (though so far, most of the time, they just turn into permanent oranges). The self-serve API does help in that regard, since it allows re-triggering a given test suite on the same build, but I think we should enhance our test harnesses to automatically retry failing tests.

We should do that on all our harnesses (xpcshell, reftest, make check)
We'd probably need to add something to this to flag known random oranges as tests that may fail randomly, and possibly set a threshold (e.g. a test may fail 1 time in 25; if it fails more often than that, we should report failure). Known random oranges would then avoid making the tree orange...
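A minimal sketch of the threshold idea above, in Python. The helper name `exceeds_allowed_failure_rate` and its parameters are hypothetical and not part of any existing harness; the default of 1 failure in 25 runs is only the example figure from this comment.

```python
from fractions import Fraction


def exceeds_allowed_failure_rate(failures, runs, allowed_failures=1, allowed_runs=25):
    """Return True if a known random orange fails more often than tolerated.

    Hypothetical helper: compares the observed failure rate with an allowed
    rate (1 failure in 25 runs by default) using exact fractions, so there
    is no floating-point rounding at the boundary.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    observed = Fraction(failures, runs)
    allowed = Fraction(allowed_failures, allowed_runs)
    return observed > allowed
```

With these defaults, a test that failed once in 25 runs would stay within the threshold, while one that failed twice in 25 runs would be reported as a failure.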
We have data on this in the OrangeFactor web tool <http://brasstacks.mozilla.com/orangefactor/>, so I'm not sure if we need to modify the harness to get that data (in fact, I'm pretty sure that we don't!).

I don't get the rest of your proposal: what should we do with that data?  We don't have a reliable way to determine failure types for all of our intermittent oranges, and associating failures with test files is not correct.  I like the general idea, but I think it should be way more concrete before we can do anything useful with it.
OrangeFactor / Bugzilla + TBPL's tools are helpful for tests that have been intermittently failing for a while.

But we're always going to have new tests added to the tree (one hopes ;), and new intermittent orange (either from new tests, or existing tests that start to go intermittent for one of a variety of reasons). Rerunning a failed test would be helpful the first few times something goes wonky.

A concrete example: A change is pushed, and a new test goes orange. (Unexpectedly, since I have _of course_ passed on try!) I could immediately back out, but if I can't reproduce it, what then? Try relanding later, and hope that I luck out and miss the new orange?

If the test boxes immediately reran the specific failed test a couple times, it would help people watching the tree to know that either (1) the test is intermittent, and someone needs to start looking at why or (2) the test is _not_ intermittent, and the committer should back out (or otherwise close the tree).

[I wouldn't undersell #2 -- people don't like to think their change could have caused the problem, and having immediate data that it failed 3x would dash any hope that it might go green on the next cycle.]
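To make the two outcomes above concrete, here is a rough Python sketch of rerunning one failed test a few times and classifying the result. `classify_failure` and the `run_test` callable are assumptions for illustration only; no current harness exposes them.

```python
def classify_failure(run_test, reruns=3):
    """Rerun a test that has just failed and classify the failure.

    `run_test` is a hypothetical zero-argument callable that runs a single
    test and returns True on pass, False on failure.
    """
    results = [run_test() for _ in range(reruns)]
    failures = results.count(False)
    if failures == reruns:
        # Failed on every rerun: case (2), the committer should back out.
        return "not-intermittent"
    # Passed at least once after the initial failure: case (1).
    return "intermittent"
```

For instance, `classify_failure(lambda: False)` would report "not-intermittent", since the stand-in test fails on all three reruns.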
Just a thought: why don't we also flag our known intermittent oranges in the test suites that allow such flagging (reftest and mochitest), such that instead of TEST-UNEXPECTED-FAIL, we'd get TEST-INTERMITTENT-FAIL + a bug number.

This would:
- help tbpl for all the cases where it can't find the corresponding bug (and seeing how many times I had to do some Bugzilla search or tbpl digging to find a corresponding bug, that'd be a clear win)
- help people that hit these intermittent failures on their local builds

The downside is that possibly, we may be getting a different failure from the one that is already known.
In conjunction with this approach, I think it's a great idea (if it's flagged as intermittent, and isn't, still fail).
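A possible shape for combining the two ideas, sketched in Python. The status names TEST-UNEXPECTED-FAIL and TEST-INTERMITTENT-FAIL come from the comments above; everything else is assumed for illustration: the `failure_status` function, the `KNOWN_INTERMITTENTS` table, the test path, the placeholder bug number, and the exact log line layout.

```python
KNOWN_INTERMITTENTS = {
    # test path -> bug number; this entry is a made-up placeholder
    "tests/example/test_flaky.html": 123456,
}


def failure_status(test_path, failed_reruns, total_reruns, known=KNOWN_INTERMITTENTS):
    """Pick a log status for a test that failed and was then rerun.

    A test flagged as a known intermittent only gets the softer status if
    it passed at least one rerun; if it failed every rerun it is not
    behaving intermittently, so it still fails.
    """
    bug = known.get(test_path)
    if bug is not None and failed_reruns < total_reruns:
        return "TEST-INTERMITTENT-FAIL | %s | bug %d" % (test_path, bug)
    return "TEST-UNEXPECTED-FAIL | %s" % test_path
```

This does not address the downside raised in comment 4: a flagged test that fails for a new reason but passes on rerun would still be reported under the old bug number.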
(In reply to comment #3)
> OrangeFactor / Bugzilla + TBPL's tools are helpful for tests that have been
> intermittently failing for a while.
> 
> But we're always going to have new tests added to the tree (one hopes ;), and
> new intermittent orange (either from new tests, or existing tests that start to
> go intermittent for one of a variety of reasons). Rerunning a failed test would
> be helpful the first few times something goes wonky.
> 
> A concrete example: A change is pushed, and a new test goes orange.
> (Unexpectedly, since I have _of course_ passed on try!) I could immediately
> back out, but if I can't reproduce it, what then? Try relanding later, and hope
> that I luck out and miss the new orange?
> 
> If the test boxes immediately reran the specific failed test a couple times, it
> would help people watching the tree to know that either (1) the test is
> intermittent, and someone needs to start looking at why or (2) the test is
> _not_ intermittent, and the committer should back out (or otherwise close the
> tree).
> 
> [I wouldn't undersell #2 -- people don't like to think their change could have
> caused the problem, and having immediate data that it failed 3x would dash any
> hope that it might go green on the next cycle.]

OK, this proposal makes sense to me.  The only problem with it is that I don't necessarily think that we need additional test runs for intermittent oranges that TBPL knows about (which are the most common type of intermittent oranges).  I think for those cases, rerunning the tests just wastes everyone's time.

Also, right now, we're almost 99% there, with us being able to rerun a test job from TBPL.  I use this very technique quite often when I see a new orange.  So I guess this proposal is about making it automated, right?
(In reply to comment #4)
> The downside is that possibly, we may be getting a different failure from the
> one that is already known.

Which is pretty serious!

On at least a few occasions, I've come across test failures on my own patches which looked the same as an intermittent orange.  In one case my patch was making an intermittent orange much worse (nearly perma-orange), in another case I mistakenly pushed the patch to m-c without realizing that I was actually seeing a perma-orange, and in the rest of the cases, my patch was triggering similar failures for very different reasons (bugs in my patch).

This proposal makes detecting these cases before pushing to m-c nearly impossible.
(In reply to comment #7)
> This proposal makes detecting these cases before pushing to m-c nearly
> impossible.

I'm not saying they shouldn't be orange. I'm saying it might help to have them tagged somehow.
I hinted at this a year ago in a blog post:
https://elvis314.wordpress.com/2010/07/05/improving-personal-hygiene-by-adjusting-mochitests/

Basically, if we had metadata for each test, we could know its history and report a failure as an orange or as some other color.  This metadata could come from a web service (think of querying the OrangeFactor database) to determine whether this is a known failure for the given platform.

In the past I have had test harnesses rerun a single test case (not test suite) if it fails to verify it reproduces.  I saw about 75% of the noise removed from my automation by doing that.  Almost every test file will run in seconds and having the harness rerun it to verify it fails could save us a lot of time.  Having to rerun the whole test suite could be time consuming and add more burden to our already backlogged machine pool.
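A rough Python sketch of how the single-test rerun and the metadata lookup might fit together. Every name here is hypothetical: `triage_failure`, the `rerun_test` callable, the `is_known_failure` callable (a stand-in for a query against a failure-history service), and the returned labels.

```python
def triage_failure(test_path, platform, rerun_test, is_known_failure):
    """Decide how to report a single failed test.

    `rerun_test(test_path)` reruns just that test and returns True on pass.
    `is_known_failure(test_path, platform)` stands in for a metadata query
    and returns True if the failure is already known on that platform.
    """
    if rerun_test(test_path):
        # The failure did not reproduce on an immediate rerun.
        return "did-not-reproduce"
    if is_known_failure(test_path, platform):
        # Reproduced, but the history says it is known on this platform.
        return "known-failure"
    # Reproduced and unknown: report it as a regular orange.
    return "orange"
```

Passing the two callables in keeps the sketch independent of any particular harness or web service; only the failed test is rerun, not the whole suite.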
Blocks: 996504