Closed Bug 1119830 Opened 9 years ago Closed 9 years ago

Jobs using run-by-dir don't turn orange if a mozharness timeout is triggered

Categories

(Release Engineering :: Applications: MozharnessCore, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ahal, Assigned: ahal)

References

(Blocks 1 open bug)

Details

Attachments

(3 files, 2 obsolete files)

E.g, all the "green" bc1 jobs actually have failures:
https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&revision=fec317aa00bc&filter-searchStr=bc1

Chris figured out what was happening:
> Those jobs are green because that test hangs the browser until it hits the
> mozprocess timeout, so no "Failed:   5" line has a chance to get out.
> Mozharness would typically turn this job orange due to seeing no "Passed: "
> with a count, but it gets those from other dirs, so it doesn't see a problem.

The orange bc1 jobs are only orange because they have other test failures within them.

In addition to failing the job if no summary is detected, mozharness should obviously fail it if the entire job timed out.
Attached file MozReview Request: bz://1119830/ahal (obsolete) —
Attachment #8546838 - Flags: review?(cmanchester)
/r/2309 - Bug 1119830 - Set tbpl_status to TBPL_EXCEPTION if test job returns with a non-success code, r=chmanchester

Pull down this commit:

hg pull review -r 5dbdf6e97d2168fcd0f971586afc6bcdcc9e4cc0
The root of the problem was that many of the mozharness test scripts were just completely ignoring the return value of the test job when determining tbpl_status. This change manually sets tbpl_status = TBPL_EXCEPTION if there was a non-success return code coming out of the test job.

The structuredlog parser (used by web-platform-tests and possibly others) takes the return code into account. So this is more of a band-aid fix until we get all test jobs using the structuredlog parser.

Here's a try run that (I think) runs a job from each file I touched against my user mozharness:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=837f5b9af192
https://reviewboard.mozilla.org/r/2307/#review1481

::: scripts/b2g_desktop_unittest.py
(Diff revision 1)
> +            tbpl_status = TBPL_EXCEPTION

I would expect these to be red, not purple.

::: scripts/desktop_unittest.py
(Diff revision 1)
> +                    log_level = ERROR

I think we'll always hit this on windows, because those test processes routinely return 1, and run_command appears to throw if you specify throw_exception and it gets a bad return code.
Attachment #8546838 - Flags: review?(cmanchester)
/r/2309 - Bug 1119830 - Set tbpl_status to TBPL_EXCEPTION if test job returns with a non-success code, r=chmanchester

Pull down this commit:

hg pull review -r 5dbdf6e97d2168fcd0f971586afc6bcdcc9e4cc0
I also just realized that we probably still need to call parser.evaluate_parser() as we might still need the passed/fail counts. I'll get a new patch up shortly.
android_emulator_unittest.py could use this too, but I think the place to fix this is in run_command. We know there when we time out, we don't really need to involve the returncode (my impression is that relying on them is basically hopeless at this point, because of windows).

Let's run this through try on the broken rev from comment 0 to prove the fix.

This wouldn't catch something like bug 1097278 combined with a situation where we're parsing multiple summaries. It's pretty much coincidental, but an effect of mozharness' summary parsing has been that any mochitest run that ends abruptly will be an orange job. Once we have multiple summaries, we don't have that property.
Attachment #8546838 - Flags: review?(cmanchester)
/r/2309 - Bug 1119830 - DesktopUnittestParser should use the return_code in its tbpl_status determination, r=chmanchester

Pull down this commit:

hg pull review -r 6a60368e46b094f3c12542f1df9f35fb1729718c
I think this version is a little bit better. It papers over the windows return code 1 problem, but at least it should start catching other strange and wonderful return codes..

Oh and I basically started over, so ignore the interdiff.. it won't make much sense.
(In reply to Chris Manchester [:chmanchester] from comment #7)
> android_emulator_unittest.py could use this too, but I think the place to
> fix this is in run_command. We know there when we time out, we don't really
> need to involve the returncode (my impression is that relying on them is
> basically hopeless at this point, because of windows).

I'm not sure we want to add anything to run_command, as that is used in so many places for so many different things. I borrowed run_command's success_codes concept to paper over the windows problem. We use a similar hack for b2g xpcshell which also inexplicably returns 1, so at least there's precedent (though admittedly a hacky precedent).

If we land this we at least constrain the return_code problem to just windows and b2g xpcshell. If we ever fix those problems, we could remove the 'success_codes' concept. I'm worried that if we continue ignoring the return_code we'll leave the door open for more broken return codes to slip through unnoticed.
(In reply to Chris Manchester [:chmanchester] from comment #7)
> Let's run this through try on the broken rev from comment 0 to prove the fix.

Here's the updated try run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d67b58d96c73

Also I don't think the rev matters as every run seems broken.
(In reply to Andrew Halberstadt [:ahal] from comment #11)
> (In reply to Chris Manchester [:chmanchester] from comment #7)
> > Let's run this through try on the broken rev from comment 0 to prove the fix.
> 
> Here's the updated try run:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=d67b58d96c73
> 
> Also I don't think the rev matters as every run seems broken.

I don't see that failure after the bustage fix in the subsequent revision. Sure, the bug is present in them all, but I want to see that revision turn bc1 orange.
Comment on attachment 8546838 [details]
MozReview Request: bz://1119830/ahal

This looks much better, but I want to see the tree burn as a result of this kind of failure before we land it.

We introduced a new way of running mochitests that means we don't detect test failures when this sort of timeout happens. Ok, we can turn red when we get a wonky return code because we get those when that timeout happens, but it might only be addressing part of the problem. That being said, I think this is itself is worth doing because of the potentially severe nature of the issue.
Attachment #8546838 - Flags: review?(cmanchester) → feedback+
https://reviewboard.mozilla.org/r/2307/#review1497

::: scripts/desktop_unittest.py
(Diff revision 2)
> +                    success_codes = [0, 1]

Probably not a problem in practice, but if we fail on windows, hit the mozprocess timeout, and return 1, we're still green, right?
https://reviewboard.mozilla.org/r/2307/#review1509

> Probably not a problem in practice, but if we fail on windows, hit the mozprocess timeout, and return 1, we're still green, right?

Maybe this isn't so far fetched. I just noticed a few test failures followed by crashes on windows b-c runs that return 1. These would have been green if the crash failed to get reported for some reason.
(In reply to Chris Manchester [:chmanchester] from comment #12)
> (In reply to Andrew Halberstadt [:ahal] from comment #11)
> > (In reply to Chris Manchester [:chmanchester] from comment #7)
> > > Let's run this through try on the broken rev from comment 0 to prove the fix.
> > 
> > Here's the updated try run:
> > https://treeherder.mozilla.org/#/jobs?repo=try&revision=d67b58d96c73
> > 
> > Also I don't think the rev matters as every run seems broken.
> 
> I don't see that failure after the bustage fix in the subsequent revision.
> Sure, the bug is present in them all, but I want to see that revision turn
> bc1 orange.

I mean red. I pushed it to try, and at least or linux, bc1 is as red as I hoped (it's red):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6cd83abb25a3
https://reviewboard.mozilla.org/r/2307/#review1523

Ship It!

::: scripts/desktop_unittest.py
(Diff revision 2)
> +                # but don't pass this into run_command so it remains at least somewhat visible.

Update the comment above instead of adding this.
https://reviewboard.mozilla.org/r/2307/#review1525

The mozharness timeout gets hit instead of the TestRunner.js or runtests.py in those jobs because runtests.py is hanging while attempting to shut down httpd.js.
Ok. Something here needs to land, but M-11 emulator debug jobs are consistently returning 247, so I bet those will burn.
(In reply to Chris Manchester [:chmanchester] from comment #13)
> This looks much better, but I want to see the tree burn as a result of this
> kind of failure before we land it.

Sorry, meant to run bc1 as part of my try push but I guess the syntax in the trychooser extension is out of date or something.
 
> Probably not a problem in practice, but if we fail on windows, hit the
> mozprocess timeout, and return 1, we're still green, right?

Hm, on unix we return -<signal> when the process is terminated. I forgot we don't do that on windows, is it returning 1 on timeout? I guess mozharness' run_command could set the returncode itself if the job is windows and we hit a mozharness timeout.

> Ok. Something here needs to land, but M-11 emulator debug jobs are
> consistently returning 247, so I bet those will burn.

File a bug and add it to success_codes for that job (with a comment pointing to the bug of course)? My thinking is that even if we put failure codes in success_codes and some jobs appear green when they aren't, it's still better than the current situation in which all jobs appear green no matter what the return code.

> > +                # but don't pass this into run_command so it remains at least somewhat visible.
> 
> Update the comment above instead of adding this.

I'm not quite sure what you mean by this. I'll do a new push and try to address these concerns. Thanks!
(In reply to Andrew Halberstadt [:ahal] from comment #20)
> (In reply to Chris Manchester [:chmanchester] from comment #13)
> > This looks much better, but I want to see the tree burn as a result of this
> > kind of failure before we land it.
> 
> Sorry, meant to run bc1 as part of my try push but I guess the syntax in the
> trychooser extension is out of date or something.
Even with bc on that run it don't think it would have been red, but let's see. The situation in comment 0 is from a particularly broken revision.
>  
> > Probably not a problem in practice, but if we fail on windows, hit the
> > mozprocess timeout, and return 1, we're still green, right?
> 
> Hm, on unix we return -<signal> when the process is terminated. I forgot we
> don't do that on windows, is it returning 1 on timeout? I guess mozharness'
> run_command could set the returncode itself if the job is windows and we hit
> a mozharness timeout.
Right, I doubt we'd return 1 on a mozprocess timeout. But because every mochitest job on windows returns 1, if we have a failure scenario like the one in comment 0, but the problem does not result in a mozharness timeout, we're still open to issues like the one hypothesized in comment 15.
> 
> > Ok. Something here needs to land, but M-11 emulator debug jobs are
> > consistently returning 247, so I bet those will burn.
> 
> File a bug and add it to success_codes for that job (with a comment pointing
> to the bug of course)? My thinking is that even if we put failure codes in
> success_codes and some jobs appear green when they aren't, it's still better
> than the current situation in which all jobs appear green no matter what the
> return code.
Adding it to success_codes is probably ok, because it reflects the current criteria. I don't think it's a separate bug.
> 
> > > +                # but don't pass this into run_command so it remains at least somewhat visible.
> > 
> > Update the comment above instead of adding this.
> 
> I'm not quite sure what you mean by this. I'll do a new push and try to
> address these concerns. Thanks!
There's already a comment about how failures are determined directly above the one you added.
(In reply to Chris Manchester [:chmanchester] from comment #21)
> Right, I doubt we'd return 1 on a mozprocess timeout. But because every
> mochitest job on windows returns 1, if we have a failure scenario like the
> one in comment 0, but the problem does not result in a mozharness timeout,
> we're still open to issues like the one hypothesized in comment 15.

Yeah, there would still be work to do after this lands. At least it would be an incremental improvement.

> Adding it to success_codes is probably ok, because it reflects the current
> criteria. I don't think it's a separate bug.

Sorry, I meant file a bug for the fact that mochitest-11 returns 245 (if there isn't one already).

> There's already a comment about how failures are determined directly above
> the one you added.

Ah. I was trying to make it explicitly clear that we shouldn't pass in 'success_codes' into run_command. I'll try and word it a little better.
Depends on: 1120580
Depends on: 1120644
The state of mozreview for this bug is very messed up for some reason, so just using splinter for this small follow-up patch.

As you pointed out, there is still room for false positives if a windows job that returns 1 crashes but doesn't trigger crash reporting. But this is still a big improvement over the status quo.

Since apparently random chunks can return non-zero and pass, here is a try -a:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=84d8d6943bc7

And here's one that should be based on that faulty changeset:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=2c73c7b3abb6
Attachment #8547802 - Flags: review?(cmanchester)
Comment on attachment 8547802 [details] [diff] [review]
Add 247 to success_codes for emulator mochitest-11

Review of attachment 8547802 [details] [diff] [review]:
-----------------------------------------------------------------

::: scripts/b2g_emulator_unittest.py
@@ +307,5 @@
>              if suite_options in self.tree_config:
>                  missing_key = False
>              options = self.tree_config[suite_options]
>  
> +        if missing_key:

Extraneous whitespace change. If intentional, feel free to leave it in.
Attachment #8547802 - Flags: review?(cmanchester) → review+
Nice, the second try run turned bc1 red as expected. Waiting on some retriggers on the first try run. Look unrelated at first glance, but doesn't hurt to be sure.
Attachment #8546838 - Flags: checked-in+
Attachment #8547802 - Flags: checked-in+
Blocks: 1048775
There's a job that always returns 1 due to a shutdown timeout on older b2g release branches:
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=68608&repo=mozilla-b2g34_v2_1
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=84703&repo=mozilla-b2g32_v2_0

I'll add those to the success_codes whitelist as well. Patch upcoming.
(In reply to Andrew Halberstadt [:ahal] from comment #28)
> There's a job that always returns 1 due to a shutdown timeout on older b2g
> release branches:
> https://treeherder.mozilla.org/ui/logviewer.html#?job_id=68608&repo=mozilla-
> b2g34_v2_1
> https://treeherder.mozilla.org/ui/logviewer.html#?job_id=84703&repo=mozilla-
> b2g32_v2_0
> 
> I'll add those to the success_codes whitelist as well. Patch upcoming.

They return 247, too. Whatever's happening in m-11 changed chunks.
Ugh, this is pretty horrible.. but not sure of any better way of doing it.
Attachment #8549143 - Flags: review?(cmanchester)
Comment on attachment 8549143 [details] [diff] [review]
Add 1 to success_codes for emulator mochitest-12 on older b2g branches

Review of attachment 8549143 [details] [diff] [review]:
-----------------------------------------------------------------

::: scripts/b2g_emulator_unittest.py
@@ +371,5 @@
> +            success_codes = [0, 1]
> +        elif suite_name == 'mochitest' and platform == 'emulator':
> +            if chunk == 12 and branch in ('mozilla-b2g32_v2_0', 'mozilla-b2g34_v2_1'):
> +                # mochitest-12 on older b2g release branches have shutdown timeouts
> +                success_codes = [0, 1]

Looking at the logs in comment 28, these start by returning 247, then run ps because that return results in TBPL_FAILURE, and ps then returns 1.
Attachment #8549143 - Flags: review?(cmanchester) → review-
Ah thanks for catching that!
Attachment #8549143 - Attachment is obsolete: true
Attachment #8549153 - Flags: review?(cmanchester)
Comment on attachment 8549153 [details] [diff] [review]
Add 1 to success_codes for emulator mochitest-12 on older b2g branches

Review of attachment 8549153 [details] [diff] [review]:
-----------------------------------------------------------------

::: scripts/b2g_emulator_unittest.py
@@ +359,5 @@
>          # the base implementation from running here.
>          pass
>  
> +    def _get_success_codes(self):
> +        suite_name = [x for x in self.test_suites if x in self.config['test_suite']][0]

I'm assuming this does the right thing. It would be clearer if you passed suite_name as an argument to _get_success_codes.
Attachment #8549153 - Flags: review?(cmanchester) → review+
Comment on attachment 8549153 [details] [diff] [review]
Add 1 to success_codes for emulator mochitest-12 on older b2g branches

Pushed with review comment fixed:
https://hg.mozilla.org/build/mozharness/rev/d2efcba0f986
Attachment #8549153 - Flags: checked-in+
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Attachment #8546838 - Attachment is obsolete: true
Attachment #8619094 - Flags: review+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: