Closed Bug 812395 Opened 7 years ago Closed 7 years ago

Make mozharness emulator scripts re-try the run if a failure occurs during gecko installation

Categories

(Release Engineering :: Applications: MozharnessCore, defect)

All
Gonk (Firefox OS)
defect
Not set

Tracking

(firefox18 fixed, firefox19 fixed, firefox20 fixed)

RESOLVED FIXED
B2G C2 (20nov-10dec)
Tracking Status
firefox18 --- fixed
firefox19 --- fixed
firefox20 --- fixed

People

(Reporter: jgriffin, Assigned: jgriffin)

References

Details

Attachments

(3 files, 3 obsolete files)

This is in response to bug 809437.

It currently seems like the fastest way (but by no means the best way) to get tests unhidden in TBPL again is to change the mozharness script so that it can detect when the test run it has invoked fails due to a problem installing gecko into the emulator, and re-try it.

Longer-term solutions are to either identify and fix the B2G or emulator bug that's causing this problem (which would require B2G dev support), or to use a full emulator build and avoid the necessity of installing gecko into the emulator during the test run.

I have a patch for this that I'm currently debugging; hopefully it will be ready to try Friday.
Target Milestone: --- → B2G C2 (20nov-10dec)
Assignee: nobody → jgriffin
This comes in two patches:

- a gecko patch which changes emulator.py to fail gracefully during install_gecko, so instead of raising an exception, it prints the exception, prints "Error installing gecko!", and exits with sys.exit().  This is done to avoid tripping mozharness error detection.

- a patch for the marionette mozharness script which runs Marionette, and if it sees "Error installing gecko!", it re-runs it, up to 5 times, before bailing.  No ERROR lines appear in the log during install_gecko failures, unless it fails for all 5 times.

This is a little yucky (well, probably more than a little), but I think it has a high likelihood of working.  If this works, I'll need to tweak things a bit more so it works with mochitest/reftest.

To test it in production, I propose landing the mozharness script (which should have no effect on trees where the corresponding gecko patch hasn't landed) and then landing the gecko changes on cedar and retriggering several times (we haven't had much luck reproducing the failure in question on try).
Attached patch Mozharness patchSplinter Review
Attachment #682582 - Flags: review?(aki)
Attached patch gecko patch (obsolete) — Splinter Review
Attachment #682584 - Flags: review?(ahalberstadt)
Attachment #682584 - Flags: feedback?(aki)
Comment on attachment 682584 [details] [diff] [review]
gecko patch

Review of attachment 682584 [details] [diff] [review]:
-----------------------------------------------------------------

Someday someone is going to look at this and go: http://i.imgur.com/UOdjD.png

Patch looks good, but my guess is you made this off cedar. This will bitrot as soon m-c gets merged. Anyway, I think it is a little excessive to have the whole retry self._restart_b2g as well as the retry at the mozharness level. We might want to even get rid of the sleeps (which is what's currently in m-c).

nit: add this bug number somewhere
Attachment #682584 - Flags: review?(ahalberstadt) → review+
(In reply to Andrew Halberstadt [:ahal] from comment #4)
> Comment on attachment 682584 [details] [diff] [review]
> gecko patch
> 
> Review of attachment 682584 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Someday someone is going to look at this and go: http://i.imgur.com/UOdjD.png

I think we already look like that.  :)

> 
> Patch looks good, but my guess is you made this off cedar. This will bitrot
> as soon m-c gets merged. Anyway, I think it is a little excessive to have
> the whole retry self._restart_b2g as well as the retry at the mozharness
> level. We might want to even get rid of the sleeps (which is what's
> currently in m-c).
> 
> nit: add this bug number somewhere

Yes, I'll take out the additional attempts at restarting b2g, and add this bug # to the comments.  Then I'll land on cedar and we'll see how it does.
Comment on attachment 682582 [details] [diff] [review]
Mozharness patch

If you want additional retry, you can set self.buildbot_status(TBPL_RETRY) and the job will retry automatically in buildbot.  This would go above the self.fatal() in the else: .  (This would require importing TBPL_RETRY as well.)

r=me either way.
Attachment #682582 - Flags: review?(aki) → review+
Comment on attachment 682582 [details] [diff] [review]
Mozharness patch

Also, this fixes marionette-webapi but we might need something similar for b2g_emulator_unittest.py .
Comment on attachment 682584 [details] [diff] [review]
gecko patch

I wish we could limit the except: to one or a handful of exceptions, but I'll take it if it helps here :)
Attachment #682584 - Flags: feedback?(aki) → feedback+
(In reply to Aki Sasaki [:aki] from comment #7)
> Comment on attachment 682582 [details] [diff] [review]
> Mozharness patch
> 
> Also, this fixes marionette-webapi but we might need something similar for
> b2g_emulator_unittest.py .

Yes, I'll work on that next if this doesn't completely blow something up.
(In reply to Aki Sasaki [:aki] from comment #6)
> Comment on attachment 682582 [details] [diff] [review]
> Mozharness patch
> 
> If you want additional retry, you can set self.buildbot_status(TBPL_RETRY)
> and the job will retry automatically in buildbot.  This would go above the
> self.fatal() in the else: .  (This would require importing TBPL_RETRY as
> well.)
> 
> r=me either way.

Cool, I'll add that!
(In reply to Aki Sasaki [:aki] from comment #8)
> Comment on attachment 682584 [details] [diff] [review]
> gecko patch
> 
> I wish we could limit the except: to one or a handful of exceptions, but
> I'll take it if it helps here :)

Sadly over the course of bug 809437 I've seen about 4-5 different exceptions thrown from that chunk of code.

p.s. it kind of feels like we are breaking the tests out of the matrix here. :)
Comment on attachment 682582 [details] [diff] [review]
Mozharness patch

http://hg.mozilla.org/build/mozharness/rev/b97f11b392ef, with changes suggested by aki
Attachment #682582 - Flags: checked-in+
Attached patch gecko patch (obsolete) — Splinter Review
Carry r+ forward.  Updated per aki's and ahal's comments.
Attachment #682584 - Attachment is obsolete: true
Attachment #682609 - Flags: review+
I've landed this on cedar.  I'll retrigger the Mn tests a few times, and if all is well, I'll land on inbound, then get to work on corresponding changes for mochitest/reftest.

https://tbpl.mozilla.org/?tree=Cedar&rev=6d2ff516cd68
(In reply to Andrew Halberstadt [:ahal] from comment #11)
> (In reply to Aki Sasaki [:aki] from comment #8)
> > Comment on attachment 682584 [details] [diff] [review]
> > gecko patch
> > 
> > I wish we could limit the except: to one or a handful of exceptions, but
> > I'll take it if it helps here :)
> 
> Sadly over the course of bug 809437 I've seen about 4-5 different exceptions
> thrown from that chunk of code.
> 
> p.s. it kind of feels like we are breaking the tests out of the matrix here.
> :)

On the cedar version I'm only catching DMError and MarionetteException (which will catch call subclasses of that).  This accounts for the majority of failures; if we see exceptions not covered by those, I'll add them.
So far so good on cedar.  Here's the corresponding mozharness patch for emulator unittests.
Attachment #682675 - Flags: review?(aki)
Attachment #682675 - Flags: review?(aki) → review+
Attached patch gecko patch (obsolete) — Splinter Review
This patch works with reftest and mochitests as well.  It also adds a convenience method to Marionette, getMarionetteOrExit, to handle all of the dirty work, so we don't have to repeat similar code elsewhere.
Attachment #682609 - Attachment is obsolete: true
Attachment #682686 - Flags: review?(ahalberstadt)
Landed newest gecko patch on cedar to see how it works:  https://tbpl.mozilla.org/?tree=Cedar&rev=e4fd39c0ee1b
Comment on attachment 682686 [details] [diff] [review]
gecko patch

Review of attachment 682686 [details] [diff] [review]:
-----------------------------------------------------------------

r+. Personally I think I preferred just printing the string and bailing at the source. But no point in delaying this over bikeshedding

::: testing/marionette/client/marionette/emulator.py
@@ +411,5 @@
> +            time.sleep(5)
> +            self.dm.shellCheckOutput(['stop', 'b2g'])
> +            time.sleep(10)
> +            self.dm.shellCheckOutput(['start', 'b2g'])
> +            time.sleep(5)

nit: we should probably remove at least the first and last sleep (not sure why I put them in in the first place) in the interest of saving 10sec/job on the linux slaves. I'm not even sure if the middle sleep makes much of a difference.
Attachment #682686 - Flags: review?(ahalberstadt) → review+
This job on cedar triggered the retry code and it worked:

https://tbpl.mozilla.org/php/getParsedLog.php?id=17160532&tree=Cedar&full=1

I'm going to land this on inbound after taking out the extra sleeps.

One of the things that we'll have to watch for is that each retry can add up to 10 min to the test run.  So if the test run is close to the 60min cutoff, adding a retry or two could push it over that threshold and cause the test run to timeout.  We should make sure our test chunks normally don't run over 40 min or so, I think.
Attached patch gecko patchSplinter Review
Remove extra sleeps; carry r+ forward.
Attachment #682686 - Attachment is obsolete: true
Attachment #683172 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/c780ff5ec8a8
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Were going to need this on aurora and beta still
Whiteboard: [automation-needed-in-aurora][automation-needed-in-beta]
Reopening to land on those branches
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
https://hg.mozilla.org/releases/mozilla-aurora/rev/221d91248314

Will land beta once tests are passing in aurora.
Whiteboard: [automation-needed-in-aurora][automation-needed-in-beta] → [automation-needed-in-beta]
https://hg.mozilla.org/releases/mozilla-beta/rev/88f664cf0188
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Whiteboard: [automation-needed-in-beta]
Product: mozilla.org → Release Engineering
Component: General Automation → Mozharness
You need to log in before you can comment on or make changes to this bug.