Re-run behavior should be consistent between Gaia repos and Gecko repos

RESOLVED FIXED in mozilla46

Status

Taskcluster
Integration
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: ato, Assigned: aus)

Tracking

unspecified
mozilla46

Details

Attachments

(2 attachments)

(Reporter)

Description

2 years ago
The gaia and gaia-master branches re-run failed and errored jobs automatically.  This predominantly happens for Gij because of its high intermittent rate.

To have consistent behaviour across branches, it would make sense to enable the behaviour for try.

Example of manually re-triggered jobs: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9074cf9ebb25&group_state=expanded
Example of automatically re-run jobs: https://treeherder.mozilla.org/#/jobs?repo=gaia&revision=09d58fa42fd0db566ee322f29a637153532c1421

Comment 1

2 years ago
garndt: how do we set tasks to be run again if the tests fail?

I only see 'rerun' set in those two base classes.

I can't see the variable rerun explained in here: http://docs.taskcluster.net/queue/api-docs/#createTask

Comment 2

2 years ago
So there are two things that could cause a task to get run again.

The first is when there is an infrastructure issue, like a worker shutting down or a worker failing to renew the claim on a task, both of which will result in a task being run again.  In this case, it's called "retry" and is specified by "retries" in the task payload in the link in comment 1.  I believe the default is around 5.

The second cause of a task being run again is when a task is explicitly marked as "failed", which is the setting you're looking for.  This is specified for each task when a task graph is created[1] and is called "rerun".  This, by default, is 0.

Here [2] is where reruns for Gij were removed across the board.  This can be enabled on a per task level by reversing that change. To change the number of reruns per branch might require a little change to the mach target along with the branch configs.  I don't think this is supported out of the box, but I might be mistaken. 


[1] http://docs.taskcluster.net/scheduler/api-docs/#createTaskGraph
[2] http://hg.mozilla.org/mozilla-central/diff/14e1c43f09e8/testing/taskcluster/tasks/tests/b2g_gaia_js_integration_tests.yml
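
To make the distinction in this comment concrete, here is a minimal sketch of the two counters. It is a hypothetical model for illustration only, not the Taskcluster queue or scheduler API: the `TaskPolicy` class, the `should_run_again` function, and the outcome strings are assumptions, and the field name `reruns` follows the spelling used in comment 8. The defaults mirror the values mentioned above (around 5 for retries, 0 for reruns).

```python
from dataclasses import dataclass


@dataclass
class TaskPolicy:
    """Hypothetical per-task settings, modelled on the description in comment 2."""
    retries: int = 5  # run again after infrastructure problems (default "around 5")
    reruns: int = 0   # run again after the task is explicitly marked "failed"


def should_run_again(policy, outcome, retries_used, reruns_used):
    """Return True if the task would be scheduled again.

    outcome is "exception" for an infrastructure problem (worker shut down,
    claim not renewed), "failed" for a task explicitly marked as failed,
    and anything else for a task that needs no further run.
    """
    if outcome == "exception":
        return retries_used < policy.retries
    if outcome == "failed":
        return reruns_used < policy.reruns
    return False


default_policy = TaskPolicy()       # reruns removed, as in [2]
gij_policy = TaskPolicy(reruns=3)   # reruns enabled for a single task

print(should_run_again(default_policy, "failed", 0, 0))     # False
print(should_run_again(default_policy, "exception", 0, 0))  # True
print(should_run_again(gij_policy, "failed", 0, 2))         # True
print(should_run_again(gij_policy, "failed", 0, 3))         # False
```

With the default of 0 reruns, a test failure is reported straight away, while an infrastructure problem still triggers a retry; the two counters are independent of each other.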
(Reporter)

Comment 3

2 years ago
(In reply to Greg Arndt [:garndt] from comment #2)
> Here [2] is where reruns for Gij were removed across the board.  This can be
> enabled on a per task level by reversing that change. To change the number
> of reruns per branch might require a little change to the mach target along
> with the branch configs.  I don't think this is supported out of the box,
> but I might be mistaken. 

I think consistency trumps everything in this context.  Having differing behaviour for Gij on gaia-master and try implicitly sets different bars for how much noise is tolerated in the results.

Comment 4

2 years ago
(In reply to Andreas Tolfsen (:ato) from comment #3)
> (In reply to Greg Arndt [:garndt] from comment #2)
> > Here [2] is where reruns for Gij were removed across the board.  This can be
> > enabled on a per task level by reversing that change. To change the number
> > of reruns per branch might require a little change to the mach target along
> > with the branch configs.  I don't think this is supported out of the box,
> > but I might be mistaken. 
> 
> I think consistency trumps everything in this context.  Having differing
> behaviour for Gij on gaia-master and try implicitly sets different bars for
> how much noise is tolerated in the results.

I can totally understand that.  So is the goal to have the retries *only* on try, or could these test suites be retried on all branches?
(Reporter)

Comment 5

2 years ago
(In reply to Greg Arndt [:garndt] from comment #4)
> I can totally understand that.  So is the goal to have the retries *only* on
> try, or could these test suites be retried on all branches?

On try we want the same re-run behaviour for Gij as on gaia-master because it’s tedious to manually re-run them.

For all other trees I’m not sure, and since that would be a question for the sheriffs I’m Cc’ing one.
Flags: needinfo?(cbook)
(Assignee)

Comment 6

2 years ago
I think it would make sense to do this for all branches for the time being as we're attempting to sort out various intermittent test issues. This would enable the sheriffs to only flag the really bad failures. 

Currently we have the retry count set to 3 on gaia and gaia-master. This is a really small change and would save a lot of people a lot of pain. Whenever we change the retry scheme on gaia and gaia-master in the future, we'll update it across the board.
Flags: needinfo?(cbook)
(Assignee)

Comment 7

2 years ago
Carsten, hopefully you agree with comment #6. I know :philor has been flagging many failures because we recently *removed* our per-file retry scheme, which has surfaced intermittents at a much higher rate because we do not retry at all on any platform tree.

I believe the plan is to eventually treat a retry caused by a test failure as an orange, but we'd like to first use fewer retries instead!
Flags: needinfo?(cbook)
(Assignee)

Comment 8

2 years ago
I double checked and we actually use reruns: 4 for gaia, gaia-master. It would probably be good to be uniform across the board. Typically, we need only 1 rerun to get an intermittent to pass. So maybe meeting in the middle would be best. I'll set it to 3 everywhere.
Assignee: nobody → aus
Status: NEW → ASSIGNED
(Assignee)

Comment 9

2 years ago
Created attachment 8707109 [details] [diff] [review]
Patch - v1 - Use 3 retries when running Gij on platform builds.

:garndt -- feel free to punt if you're not the correct person to look at this.
Attachment #8707109 - Flags: review?(garndt)

Updated

2 years ago
Attachment #8707109 - Flags: review?(garndt) → review+
(Assignee)

Comment 10

2 years ago
Updating summary.
Summary: Enable re-run behaviour of Gij jobs on try → Re-run behavior should be consistent between Gaia repos and Gecko repos
Created attachment 8707112 [details] [review]
[gaia] nullaus:bug1228079 > mozilla-b2g:master
(Assignee)

Updated

2 years ago
Attachment #8707112 - Flags: review?(gaye)
Comment on attachment 8707112 [details] [review]
[gaia] nullaus:bug1228079 > mozilla-b2g:master

haha I have nothing to say about this
Attachment #8707112 - Flags: review?(gaye) → review+
(Assignee)

Comment 13

2 years ago
Commit (gaia-master): https://github.com/mozilla-b2g/gaia/commit/f2542f993f4e851c9b9ac81e2d62096955b3d28f

Leaving open for landing + merging of gecko tree changes.
(Assignee)

Comment 14

2 years ago
https://hg.mozilla.org/integration/b2g-inbound/rev/e443d68c6df6602490211c15a4f74c9eaeefb1ef
Bug 1228079 - Use 3 retries when running Gij on platform builds. r=garndt

Comment 15

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/e443d68c6df6
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
status-firefox46: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla46

Comment 16

2 years ago
I believe we need to back this out. Sheriffs ignore blue runs and I just spend a lot of time starring m-i. We need precise information about intermittent test failures and I have a strong opinion that we need consistent behavior across all platforms on m-i and b-i. We can't teach the whole company that they can ignore blue runs except on b2g mulet.
Blue runs mean infra issues and developers can safely ignore them. We shouldn't mix them with intermittent test issues.
Flags: needinfo?(aus)
Flags: needinfo?(ato)
(Reporter)

Comment 17

2 years ago
Did you see the “Orange is the new Bad (Gij)” thread on dev-fxos@?  It explains some of the background around the state and the situation Gij is in:

    https://groups.google.com/d/msg/mozilla.dev.fxos/LTTobhx4tCc/nN_gad51AgAJ

There seems to be general agreement in the Firefox OS organisation that intermittent tests need to be eradicated.  mhenretty, aus, and gaye have done a stellar job so far in improving the situation.  As mhenretty explains in one of the latest emails to the thread, there are still some remaining issues.

The situation we had before this patch, where Gij intermittents would immediately turn orange on Gecko trees, was in my opinion not much better:  Sheriffs would then manually trigger re-runs and they would attribute the last failure to flaky tests if the re-run passed.  In other words, this would not be cause for action.

What has changed with this patch is that the re-runs happen automatically.  But if a developer introduces a persistent failure to Gij, the last of the re-runs will be marked as an orange.

I do see the argument that we need precise information on intermittents, but I can’t see that we’re not getting that now.  On the contrary, I would argue that because the Gij jobs are re-run they much more clearly visualise just how unstable the tests are right now.

I know that mhenretty has disabled a lot of intermittent tests in Gij already, but it would be interesting to hear what can be done about the remaining intermittents.  We are obviously in a bad situation, and a prudent question is whether disabling the remaining unstable tests would put us in a worse situation than the one we’re already in.  In other words, are developers _actually_ finding the current tests valuable?
Flags: needinfo?(ato)
(Assignee)

Comment 18

2 years ago
(In reply to Gregor Wagner [:gwagner] from comment #16)
> I believe we need to back this out. Sheriffs ignore blue runs and I just
> spend a lot of time starring m-i. We need precise information about
> intermittent test failures and I have a strong opinion that we need
> consistent behavior across all platforms on m-i and b-i. We can't teach the
> whole company that they can ignore blue runs except on b2g mulet.
> Blue runs mean infra issues and developers can safely ignore them. We
> shouldn't mix them with intermittent test issues.

We all feel this sentiment, but it's realistically the only way to keep the tree open. We used to use the PER FILE retry, which would retry PER FILE up to 5 times. Now we've got it down to 3 retries per CHUNK. The difference is huge.

When we first removed the PER FILE retry we ended up with ZERO retries, which caused our intermittent rate to skyrocket overnight, which all sheriffs HATED. On the flip side, gaia-try and gaia-master have been using the built-in retry since the beginning: we had 4 retries plus the per-file retry (up to 5 times), for up to 25 retries. We then removed the per-file retry, and now we've updated it to 3 retries everywhere.
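
One possible reading of the "up to 25" figure, sketched as arithmetic. This is an assumption about how the two schemes combined, not something stated explicitly in the bug: the chunk runs once plus its reruns, and within each chunk run a failing file could be retried up to the per-file limit. The function names are hypothetical.

```python
def max_chunk_runs(chunk_reruns):
    """Total runs of a chunk: the initial run plus every allowed rerun."""
    return 1 + chunk_reruns


def max_per_file_retries(chunk_reruns, per_file_retries):
    """Upper bound on retries of one file across all runs of its chunk.

    Assumes the per-file retry budget is available again in every chunk run.
    """
    return max_chunk_runs(chunk_reruns) * per_file_retries


# Old gaia scheme: 4 chunk reruns combined with up to 5 per-file retries.
print(max_per_file_retries(4, 5))  # 25

# New scheme everywhere: 3 chunk reruns, no per-file retry.
print(max_chunk_runs(3))  # 4
```

Under this reading, a failing file could previously be retried up to 25 times before the job went orange, whereas a chunk now runs at most 4 times in total.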

I would be happy to remove all retry usage, but people don't care enough about their tests to fix them, and it can't be the responsibility of the test automation team to do that for them. Without retries we'll just get our suites hidden, and that's ultimately counter-productive.

It's my opinion that there is _plenty_ of data for people to work with to fix their tests.
Flags: needinfo?(aus)

Updated

2 years ago
Flags: needinfo?(cbook)
Moving closed bugs across to new Bugzilla product "TaskCluster".
status-firefox46: fixed → ---
Component: TaskCluster → Integration
Product: Testing → Taskcluster