Open Bug 918921 Opened 11 years ago Updated 2 years ago

Retry failing tests in Mochitest

Categories

(Testing :: Mochitest, defect)

Tracking

(Not tracked)

People

(Reporter: ted, Unassigned)

References

(Blocks 1 open bug)

Details

In bug 906510, as part of the "parallel xpcshell tests" work, mihneadb made the xpcshell harness retry tests that fail, and only count them as failures if they fail again. I think we should expand this to Mochitest as well, to help with our intermittent failure problem.
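
For reference, the retry policy from bug 906510 amounts to roughly the following sketch; run_test() here is a hypothetical stand-in for the real harness entry point, not the actual xpcshell code:

def run_suite(tests, run_test):
    # Run everything once; collect the failures.
    first_failures = [t for t in tests if not run_test(t)]
    # Re-run only the failures; a test counts as failed only if it fails again.
    return [t for t in first_failures if not run_test(t)]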
See Also: → 906510
See Also: → 918922
I feel quite strongly that we should not do this.

Tests were written with a particular pass/fail threshold in mind, and moving the threshold might change whether they're still testing what they're intended to test.  For example, a test might have been written that would pass all the time without a bug, but fail intermittently when the bug is present.  (I've definitely written tests of that form, and I suspect others have as well.)

Moving the line might relieve the pain temporarily, but it won't solve the problem that there will always be some bugs (whether bugs in the code or the tests) that cause tests to intermittently cross the pass/fail line.

The point of running tests is to prevent regressing the things that they're testing.  The way to deal with an intermittent failure problem is to put resources into fixing the tests that fail intermittently (particularly into finding the changes that caused them to start failing intermittently).
I agree with dbaron that there are concerns about silently hiding inconsistent results. We should be very careful about that. Perhaps we need a new color in TBPL to denote intra-job fluctuating results?

That being said, I think we need to move forward with this bug.

As we split up the bc suite into multiple jobs, tests are crossing job boundaries, exposing previously hidden test inter-dependencies and often leading to extended tree closures. fx-team is currently closed, apparently due to one such issue (see bug 989689).

As my Try push at https://tbpl.mozilla.org/?tree=Try&rev=ed2e90bc32ef shows, shuffling test execution causes a number of new test failures. (bc wasn't shuffled in that push - sorry.) If we retried each failing test in isolation, a test that passed on retry would point to a) inter-dependencies, while one that kept failing would be shown to b) not run well on its own. Either way, it should help identify bad tests. And that's a good thing.
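
A sketch of what that classification step could look like, assuming a hypothetical run_test(test, alone=...) that re-runs a single test by itself (not the real Mochitest API):

def classify_failure(test, run_test):
    # Re-run the failed test by itself, outside the original chunk.
    if run_test(test, alone=True):
        # Passes alone: some earlier test in the job affected it.
        return "inter-dependent"
    # Still fails alone: the test doesn't run well on its own.
    return "fails-in-isolation"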
I agree with David that it's not clear we want *unexpected* intermittent failures to silently pass if the test is re-run successfully.  But there should be a way to mark *known* intermittent failures, so that if a particular failure pattern occurs the test will be auto-rerun a couple of times.  This is effectively a way to automate what we already do manually -- known failure patterns get manually starred.  Marking a test as known-intermittent is much less drastic than disabling it, so it could be done without a fuss to keep the tree clean.  Most normal tests are just as useful if intermittent failures are ignored, especially if it's only a particular known pattern of intermittent failure (matching some regex or whatever).

Of course, this would not make the random orange problem disappear, but it would make it drastically more manageable.  People would still need to do some manual marking, just much less than now.
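
One way to express the "known pattern only" idea, as a sketch: the annotation table, the retry count, and the (ok, log) shape of run_test() are all invented for illustration, but the key point is that an unexpected failure is never retried silently:

import re

# Hypothetical per-test annotations: only this failure pattern may be retried.
KNOWN_INTERMITTENT = {
    "browser_example.js": re.compile(r"timed out waiting for popup"),
}

def run_with_known_intermittents(test, run_test, max_retries=2):
    for _ in range(max_retries + 1):
        ok, log = run_test(test)
        if ok:
            return True
        pattern = KNOWN_INTERMITTENT.get(test)
        if pattern is None or not pattern.search(log):
            return False  # unexpected failure: report it immediately
    return False  # matched the known pattern but never passed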
I don't think there is any automated way for us to check whether a test failure is intermittent or not without rerunning the test multiple times.

David, would you still object to the idea of running tests multiple times only for those tests which have an open intermittent failure bug, or something like that?
Flags: needinfo?(dbaron)
I'd be ok with it if it came from some test annotation, and those annotations were generally added by test authors or module owners.  I still wouldn't want to see it used on existing tests in modules I'm responsible for, though.
Flags: needinfo?(dbaron)
(In reply to comment #5)
> I'd be ok with it if it came from some test annotation, and those annotations
> were generally added by test authors or module owners.

Sure.

> I still wouldn't want
> to see it used on existing tests in modules I'm responsible for, though.

Even with the above in place?  If yes, then that means that you really think it's a bad idea in general, right?
I still think it's a bad idea to use as a response to a regression in a previously-working test, at least in the absence of an understanding of why we expect the regression.  (We need better tools for finding these regressions; I'm hoping I might find some time to work on that, but I'm not sure how realistic that is.)
mcote looked at writing tooling to do this in bug 669316, but it turns out to be hard.
(In reply to David Baron [:dbaron] (needinfo? me) (UTC-7) from comment #5)
> I'd be ok with it if it came from some test annotation, and those
> annotations were generally added by test authors or module owners.  I still
> wouldn't want to see it used on existing tests in modules I'm responsible
> for, though.

Is that true even if a bug is filed, so that the relevant developers are aware of the intermittent failure?  If it takes some time to track it down, or people don't want to track it down at all, I don't think everyone who checks in code should have to deal with random orange in the meantime.
Yes, still true; disabling tests often creates additional work to get them re-enabled, since further failures can accumulate while the test is disabled.  It can be hard to unwind.

If it's bad enough that it can't be in the tree for a few days, then it's frequent enough to easily find the changeset that caused it and back it out (especially given better tools for bisection).
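
Such a bisection tool might look roughly like this sketch (all names here are invented): because a single run can pass by luck, each candidate revision is run several times, and any failure marks it bad:

def revision_is_bad(revision, run_test_at, runs_per_rev=10):
    # One lucky pass shouldn't exonerate a revision; run it repeatedly.
    return any(not run_test_at(revision) for _ in range(runs_per_rev))

def first_bad_revision(revisions, run_test_at):
    # revisions is ordered oldest-to-newest and assumed to end with a bad one.
    lo, hi = 0, len(revisions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if revision_is_bad(revisions[mid], run_test_at):
            hi = mid
        else:
            lo = mid + 1
    return revisions[lo]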
(In reply to David Baron [:dbaron] (needinfo? me) (UTC-7) from comment #10)
> If it's bad enough that it can't be in the tree for a few days, then it's
> frequent enough to easily find the changeset that caused it and back it out
> (especially given better tools for bisection).

Sadly this isn't true; death by a thousand cuts and all that.
Blocks: 996504
Severity: normal → S3