753761 - Jobs that fail for infrastructure reasons should automatically be rerun

Reporter

Description

12 years ago

Getting purple results on Android is pretty common. It's a pain to have to reschedule these jobs manually after waiting for the results. It would be much better if they were just rerun automatically.

Chris AtLee [:catlee]

Comment 1

12 years ago

We already retry on a bunch of purples. Which ones are you referring to that aren't getting retried?

Armen [:armenzg]

Comment 2

12 years ago

It could easily be cleanup.py and updateSUT.py.

Perhaps it's time to collect issues and address them.

Callek, philor: what are your insights in all of this?

Mike Taylor [:bear]

Comment 3

12 years ago

(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #2)
> It could easily be cleanup.py and updateSUT.py.

Anything that is resolving as purple should already be getting auto-retried on Android jobs.

> 
> Perhaps it's time to collect issues and address them.
> 
> Callek, philor: what are your insights in all of this?

Callek has already collected the issues and is addressing them.

On Android, quite a few tests fail in ways that look to be infrastructure-related, and Philor has hidden quite a few of them because no one wants to fix them.

Armen [:armenzg]

Comment 4

12 years ago

(In reply to Mike Taylor [:bear] from comment #3)
> (In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #2)
> > It could easily be cleanup.py and updateSUT.py.
> 
> anything that is resolving as purple should already be getting auto retried
> on Android jobs
> 
Just to clarify (IIUC):
purple (infra failure + no retry) doesn't mean auto-retry.
blue (infra failure + *retry*) does.
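
A minimal sketch of that distinction, not taken from the actual automation code. The `Status` enum and `is_auto_retried` helper are hypothetical names; only the status names and colors (EXCEPTION is purple, RETRY is blue, SUCCESS is green) come from this bug.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical job statuses, each mapped to the color shown for it."""
    SUCCESS = "green"
    EXCEPTION = "purple"  # infra failure, no automatic retry
    RETRY = "blue"        # infra failure, rescheduled automatically


def is_auto_retried(status: Status) -> bool:
    """Only a job that ended in RETRY is rerun without manual action."""
    return status is Status.RETRY


for s in Status:
    print(f"{s.name:<9} {s.value:<6} auto-retried={is_auto_retried(s)}")
```

Under these assumptions, a purple job stays purple until someone retriggers it by hand, which is the case the description complains about.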

Phil Ringnalda (:philor)

Comment 5

12 years ago

Let's try a RETRY on this bug:

"""
Hey, jrmuizel here. I pushed to Try in https://tbpl.mozilla.org/?tree=Try&rev=552c5b65bb78 and got a bunch of Android failures which have nothing to do with my push and have nothing to do with actual tests failing, and I'd like it better if they just automatically retried instead of me having to do it manually or just getting sick of Android and completely ignoring the results I get from it.
"""

That's actually a fairly typical push:

* 3 jobs set RETRY and retried, 2 from devicemanager.DMError and 1 from Remote Device Error when it failed at the probably-pointless setting of resolution before running crashtests. Those are fine, nothing to worry about.

* 4 jobs set EXCEPTION and were purple. Two were talos jobs that did so in the second cleanup device step, hitting bug 660480 after a completely successful talos run. They should not have set RETRY and should not be manually retried; they should have ignored those failures and set SUCCESS, but there's considerable question whether we know how to make that happen. One, https://tbpl.mozilla.org/php/getParsedLog.php?id=11631062&tree=Try, hit roughly bug 711725, or maybe not, during the first cleanup device step, and should have set RETRY, which maybe we do or don't know how to do. One hit bug 660480 during the first cleanup device step and should have set RETRY, but we don't know how to do that (see bug 660480 comment 818).

* 1 job hit bug 681861 during the first cleanup device step and was purple, which would be nice to RETRY on, but risky, since virtually every log of every step has a "reconnecting socket" in it.

* 1 job hit bug 681861 during the test run and was orange, which would be nice to RETRY on, but risky, since virtually every log of every step has a "reconnecting socket" in it.

* 1 job hit bug 686245, which would be delightful to retry but a good patch will hit it 5 out of 6 runs if you keep retriggering it, while a bad patch will hit it 6 out of 6 runs, so someone's going to have to sack up and fix it rather than wallpapering with RETRY.

* 1 jsreftest hit bug 686143, and we should just stop running jsreftests on Android: they are pointless and we ignore every single failure in them.

(* Just for completeness, every single Android native crashtest and reftest and jsreftest except the ones that failed without making it to the end of the test run crashed on shutdown, but was green despite that thing which should have turned them red.)

(* Just for bonus fun, although we have native reftest-2 hidden for being permaorange, it's not actually permaorange, and it would have made it more clear that the patch being tried did in fact break a reftest on Android if both the XUL and the native had been visible failing that test.)
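
To illustrate why RETRY on "reconnecting socket" is risky while RETRY on the device errors is fine, here is a rough sketch of log-based retry matching. It is not the real automation code: `RETRY_PATTERNS`, `TOO_NOISY`, `should_retry`, and both sample logs are made up for illustration. Only the three quoted strings come from the list above.

```python
import re

# Strings named above as failures that already lead to RETRY.
RETRY_PATTERNS = [
    re.compile(r"devicemanager\.DMError"),
    re.compile(r"Remote Device Error"),
]

# Shows up in virtually every log of every step, so it cannot
# separate a broken run from a healthy one.
TOO_NOISY = "reconnecting socket"


def should_retry(log_text: str) -> bool:
    """Return True if the log contains one of the specific retry strings."""
    return any(p.search(log_text) for p in RETRY_PATTERNS)


# Hypothetical log excerpts, invented for this example.
healthy_log = "reconnecting socket\nall tests passed\n"
broken_log = "reconnecting socket\ndevicemanager.DMError: device went away\n"

print(should_retry(healthy_log))  # specific patterns leave the healthy run alone
print(should_retry(broken_log))   # and catch the device error
print(TOO_NOISY in healthy_log)   # the noisy string would match both logs
```

Matching on the noisy string would retry healthy runs too, so any rule for bug 681861 would need a more specific signature than that line.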

Jeff Muizelaar [:jrmuizel]

Reporter

Comment 6

12 years ago

The purple jobs I was interested in on this push, https://tbpl.mozilla.org/?tree=Try&rev=552c5b65bb78, were rck2 and rp.

Chris Cooper [:coop] (he/him)

Updated

12 years ago

Component: Release Engineering → Release Engineering: Automation (General)

OS: Mac OS X → All

QA Contact: release → catlee

Hardware: x86 → All

Whiteboard: [automation][retry]

Chris AtLee [:catlee]

Comment 7

12 years ago

This bug is too broad.

Please list specific cases that need to be automatically retried.

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → INCOMPLETE

Nobody; OK to take it and work on it

Assignee

Updated

11 years ago

Product: mozilla.org → Release Engineering

Nobody; OK to take it and work on it

Assignee

Updated

6 years ago

Component: General Automation → General


Categories

(Release Engineering :: General, defect)

Tracking

(Not tracked)

People

(Reporter: jrmuizel, Unassigned)

References

Details

(Whiteboard: [automation][retry])

Crash Data

Security

(public)

User Story
