Jobs that fail for infrastructure reasons should automatically be rerun

RESOLVED INCOMPLETE

Status

RESOLVED INCOMPLETE
7 years ago
5 months ago

People

(Reporter: jrmuizel, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [automation][retry])

(Reporter)

Description

7 years ago
Getting purple results on Android is pretty common. It's a pain to have to reschedule these jobs manually after waiting for the results. It would be much better if these were just rerun automatically.

Comment 1

7 years ago
We already retry on a bunch of purples. Which ones are you referring to that aren't getting retried?

Comment 2

7 years ago
It could easily be cleanup.py and updateSUT.py.

Perhaps it's time to collect issues and address them.

Callek, philor: what are your insights in all of this?

Comment 3

7 years ago
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #2)
> It could easily be cleanup.py and updateSUT.py.

Anything that is resolving as purple should already be getting auto-retried on Android jobs.

> 
> Perhaps its time to collect issues and address them.
> 
> Callek, philor: what are your insights in all of this?

Callek has already collected the issues and is addressing them.

On Android quite a few tests fail in ways that look to be infrastructure related and Philor has hidden quite a few of them because no one wants to fix them.

Comment 4

7 years ago
(In reply to Mike Taylor [:bear] from comment #3)
> (In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #2)
> > It could easily be cleanup.py and updateSUT.py.
> 
> anything that is resolving as purple should already be getting auto retried
> on Android jobs
> 
Just to clarify (IIUC):
purple (infra failure + no retry) doesn't mean auto-retry.
blue (infra failure + *retry*) does.
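
A minimal sketch of the distinction above, for anyone reading along. The status names SUCCESS, EXCEPTION and RETRY and the colours purple, blue, orange, green and red all appear in this bug; WARNINGS, FAILURE, the full colour mapping, and the helper function names are assumptions made for illustration, not code from the automation.

```python
from enum import Enum


class Result(Enum):
    """Job result statuses (names only; numeric codes deliberately omitted)."""
    SUCCESS = "SUCCESS"
    WARNINGS = "WARNINGS"    # assumed name for a run with test failures
    FAILURE = "FAILURE"      # assumed name for a busted job
    EXCEPTION = "EXCEPTION"  # infra failure, no retry
    RETRY = "RETRY"          # infra failure, job is rescheduled


# Assumed mapping from status to the colour shown on the results page.
TBPL_COLOR = {
    Result.SUCCESS: "green",
    Result.WARNINGS: "orange",
    Result.FAILURE: "red",
    Result.EXCEPTION: "purple",
    Result.RETRY: "blue",
}


def is_auto_retried(result):
    """Only RETRY (blue) gets rerun without anyone asking."""
    return result is Result.RETRY


def needs_manual_retrigger(result):
    """EXCEPTION (purple) is an infra failure that nobody reruns for you."""
    return result is Result.EXCEPTION


if __name__ == "__main__":
    for r in Result:
        print(r.name, TBPL_COLOR[r], "auto-retried" if is_auto_retried(r) else "")
```

So a request to "auto-retry purples" amounts to making more infra failures end as RETRY rather than EXCEPTION.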

Comment 5

7 years ago
Let's try a RETRY on this bug:

"""
Hey, jrmuizel here. I pushed to Try in https://tbpl.mozilla.org/?tree=Try&rev=552c5b65bb78 and got a bunch of Android failures which have nothing to do with my push and have nothing to do with actual tests failing, and I'd like it better if they just automatically retried instead of me having to do it manually or just getting sick of Android and completely ignoring the results I get from it.
"""

That's actually a fairly typical push:

* 3 jobs set RETRY and retried, 2 from devicemanager.DMError and 1 from Remote Device Error when it failed at the probably-pointless setting of resolution before running crashtests. Those are fine, nothing to worry about.

* 4 jobs set EXCEPTION and were purple. Two were talos jobs that did so in the second cleanup device step, hitting bug 660480 after a completely successful talos run. Those should not set RETRY, and should not be manually retried; they should have ignored those failures and set SUCCESS, but there's considerable question whether we know how to make that happen. One, https://tbpl.mozilla.org/php/getParsedLog.php?id=11631062&tree=Try, hit roughly bug 711725, or maybe not, during the first cleanup device step, and should have set RETRY, which maybe we do or don't know how to do. One hit bug 660480 during the first cleanup device step, and should have set RETRY, but we don't know how to do that; see bug 660480 comment 818.

* 1 job hit bug 681861 during the first cleanup device step and was purple, which would be nice to RETRY on, but risky, since virtually every log of every step has a "reconnecting socket" in it.

* 1 job hit bug 681861 during the test run and was orange, which would be nice to RETRY on, but risky, since virtually every log of every step has a "reconnecting socket" in it.

* 1 job hit bug 686245, which would be delightful to retry but a good patch will hit it 5 out of 6 runs if you keep retriggering it, while a bad patch will hit it 6 out of 6 runs, so someone's going to have to sack up and fix it rather than wallpapering with RETRY.

* 1 jsreftest hit bug 686143 and we should just stop running jsreftests on Android, they are pointless and we ignore every single failure in them.

(* Just for completeness, every single Android native crashtest and reftest and jsreftest except the ones that failed without making it to the end of the test run crashed on shutdown, but was green despite that thing which should have turned them red.)

(* Just for bonus fun, although we have native reftest-2 hidden for being permaorange, it's not actually permaorange, and it would have made it more clear that the patch being tried did in fact break a reftest on Android if both the XUL and the native had been visible failing that test.)
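
To make the triage above concrete, here is a rough sketch of the decision it implies. It is not what the automation does. The two retry strings and the "reconnecting socket" string are taken from the list above; the step names ("cleanup_before", "run_tests", "cleanup_after") and the function `suggest_result` are hypothetical.

```python
import re

# Failure signatures that already led to RETRY in the push described above.
RETRY_PATTERNS = [
    re.compile(r"devicemanager\.DMError"),
    re.compile(r"Remote Device Error"),
]

# Deliberately NOT a retry trigger: it shows up in virtually every log of
# every step, so matching on it would retry nearly everything.
NOISY_PATTERNS = [re.compile(r"reconnecting socket")]


def suggest_result(step, failed, log_text):
    """Suggest a status override for one step, or None to leave it alone.

    step     -- hypothetical step name: "cleanup_before", "run_tests",
                or "cleanup_after"
    failed   -- whether the step itself reported a failure
    log_text -- the step's log output
    """
    if not failed:
        return None
    # A failed cleanup after a completed run says nothing about the run.
    if step == "cleanup_after":
        return "SUCCESS"
    # Known device errors are safe to retry wherever they occur.
    if any(p.search(log_text) for p in RETRY_PATTERNS):
        return "RETRY"
    # Everything else, including logs that only contain noisy strings,
    # keeps whatever status it already has.
    return None


if __name__ == "__main__":
    print(suggest_result("run_tests", True, "devicemanager.DMError: timed out"))
    print(suggest_result("cleanup_after", True, "reconnecting socket"))
    print(suggest_result("cleanup_before", True, "reconnecting socket"))
```

The hard cases in the list are exactly the ones this sketch punts on: failures whose only visible signature is a string that also appears in passing runs.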
(Reporter)

Comment 6

7 years ago
The purple jobs I was interested in from this push (https://tbpl.mozilla.org/?tree=Try&rev=552c5b65bb78) were rck2 and rp.
Component: Release Engineering → Release Engineering: Automation (General)
OS: Mac OS X → All
QA Contact: release → catlee
Hardware: x86 → All
Whiteboard: [automation][retry]
This bug is too broad.

Please list specific cases that need to be automatically retried.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → INCOMPLETE
(Assignee)

Updated

5 years ago
Product: mozilla.org → Release Engineering
(Assignee)

Updated

5 months ago
Component: General Automation → General
Product: Release Engineering → Release Engineering