Closed
Bug 1119830
Opened 9 years ago
Closed 9 years ago
Jobs using run-by-dir don't turn orange if a mozharness timeout is triggered
Categories
(Release Engineering :: Applications: MozharnessCore, defect)
Release Engineering
Applications: MozharnessCore
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ahal, Assigned: ahal)
References
(Blocks 1 open bug)
Details
Attachments
(3 files, 2 obsolete files)
3.83 KB,
patch
|
chmanchester
:
review+
ahal
:
checked-in+
|
Details | Diff | Splinter Review |
3.61 KB,
patch
|
chmanchester
:
review+
ahal
:
checked-in+
|
Details | Diff | Splinter Review |
39 bytes,
text/x-review-board-request
|
chmanchester
:
review+
|
Details |
E.g, all the "green" bc1 jobs actually have failures: https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&revision=fec317aa00bc&filter-searchStr=bc1 Chris figured out what was happening: > Those jobs are green because that test hangs the browser until it hits the > mozprocess timeout, so no "Failed: 5" line has a chance to get out. > Mozharness would typically turn this job orange due to seeing no "Passed: " > with a count, but it gets those from other dirs, so it doesn't see a problem. The orange bc1 jobs are only orange because they have other test failures within them. In addition to failing the job if no summary is detected, mozharness should obviously fail it if the entire job timed out.
Assignee | ||
Comment 1•9 years ago
|
||
Attachment #8546838 -
Flags: review?(cmanchester)
Assignee | ||
Comment 2•9 years ago
|
||
/r/2309 - Bug 1119830 - Set tbpl_status to TBPL_EXCEPTION if test job returns with a non-success code, r=chmanchester Pull down this commit: hg pull review -r 5dbdf6e97d2168fcd0f971586afc6bcdcc9e4cc0
Assignee | ||
Comment 3•9 years ago
|
||
The root of the problem was that many of the mozharness test scripts were just completely ignoring the return value of the test job when determining tbpl_status. This change manually sets tbpl_status = TBPL_EXCEPTION if there was a non-success return code coming out of the test job. The structuredlog parser (used by web-platform-tests and possibly others) takes the return code into account. So this is more of a band-aid fix until we get all test jobs using the structuredlog parser. Here's a try run that (I think) runs a job from each file I touched against my user mozharness: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=837f5b9af192
Comment 4•9 years ago
|
||
https://reviewboard.mozilla.org/r/2307/#review1481 ::: scripts/b2g_desktop_unittest.py (Diff revision 1) > + tbpl_status = TBPL_EXCEPTION I would expect these to be red, not purple. ::: scripts/desktop_unittest.py (Diff revision 1) > + log_level = ERROR I think we'll always hit this on windows, because those test processes routinely return 1, and run_command appears to throw if you specify throw_exception and it gets a bad return code.
Assignee | ||
Updated•9 years ago
|
Attachment #8546838 -
Flags: review?(cmanchester)
Assignee | ||
Comment 5•9 years ago
|
||
/r/2309 - Bug 1119830 - Set tbpl_status to TBPL_EXCEPTION if test job returns with a non-success code, r=chmanchester Pull down this commit: hg pull review -r 5dbdf6e97d2168fcd0f971586afc6bcdcc9e4cc0
Assignee | ||
Comment 6•9 years ago
|
||
I also just realized that we probably still need to call parser.evaluate_parser() as we might still need the passed/fail counts. I'll get a new patch up shortly.
Comment 7•9 years ago
|
||
android_emulator_unittest.py could use this too, but I think the place to fix this is in run_command. We know there when we time out, we don't really need to involve the returncode (my impression is that relying on them is basically hopeless at this point, because of windows). Let's run this through try on the broken rev from comment 0 to prove the fix. This wouldn't catch something like bug 1097278 combined with a situation where we're parsing multiple summaries. It's pretty much coincidental, but an effect of mozharness' summary parsing has been that any mochitest run that ends abruptly will be an orange job. Once we have multiple summaries, we don't have that property.
Assignee | ||
Updated•9 years ago
|
Attachment #8546838 -
Flags: review?(cmanchester)
Assignee | ||
Comment 8•9 years ago
|
||
/r/2309 - Bug 1119830 - DesktopUnittestParser should use the return_code in its tbpl_status determination, r=chmanchester Pull down this commit: hg pull review -r 6a60368e46b094f3c12542f1df9f35fb1729718c
Assignee | ||
Comment 9•9 years ago
|
||
I think this version is a little bit better. It papers over the windows return code 1 problem, but at least it should start catching other strange and wonderful return codes.. Oh and I basically started over, so ignore the interdiff.. it won't make much sense.
Assignee | ||
Comment 10•9 years ago
|
||
(In reply to Chris Manchester [:chmanchester] from comment #7) > android_emulator_unittest.py could use this too, but I think the place to > fix this is in run_command. We know there when we time out, we don't really > need to involve the returncode (my impression is that relying on them is > basically hopeless at this point, because of windows). I'm not sure we want to add anything to run_command, as that is used in so many places for so many different things. I borrowed run_command's success_codes concept to paper over the windows problem. We use a similar hack for b2g xpcshell which also inexplicably returns 1, so at least there's precedent (though admittedly a hacky precedent). If we land this we at least constrain the return_code problem to just windows and b2g xpcshell. If we ever fix those problems, we could remove the 'success_codes' concept. I'm worried that if we continue ignoring the return_code we'll leave the door open for more broken return codes to slip through unnoticed.
Assignee | ||
Comment 11•9 years ago
|
||
(In reply to Chris Manchester [:chmanchester] from comment #7) > Let's run this through try on the broken rev from comment 0 to prove the fix. Here's the updated try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d67b58d96c73 Also I don't think the rev matters as every run seems broken.
Comment 12•9 years ago
|
||
(In reply to Andrew Halberstadt [:ahal] from comment #11) > (In reply to Chris Manchester [:chmanchester] from comment #7) > > Let's run this through try on the broken rev from comment 0 to prove the fix. > > Here's the updated try run: > https://treeherder.mozilla.org/#/jobs?repo=try&revision=d67b58d96c73 > > Also I don't think the rev matters as every run seems broken. I don't see that failure after the bustage fix in the subsequent revision. Sure, the bug is present in them all, but I want to see that revision turn bc1 orange.
Comment 13•9 years ago
|
||
Comment on attachment 8546838 [details]
MozReview Request: bz://1119830/ahal
This looks much better, but I want to see the tree burn as a result of this kind of failure before we land it.
We introduced a new way of running mochitests that means we don't detect test failures when this sort of timeout happens. Ok, we can turn red when we get a wonky return code because we get those when that timeout happens, but it might only be addressing part of the problem. That being said, I think this is itself is worth doing because of the potentially severe nature of the issue.
Attachment #8546838 -
Flags: review?(cmanchester) → feedback+
Comment 14•9 years ago
|
||
https://reviewboard.mozilla.org/r/2307/#review1497 ::: scripts/desktop_unittest.py (Diff revision 2) > + success_codes = [0, 1] Probably not a problem in practice, but if we fail on windows, hit the mozprocess timeout, and return 1, we're still green, right?
Comment 15•9 years ago
|
||
https://reviewboard.mozilla.org/r/2307/#review1509 > Probably not a problem in practice, but if we fail on windows, hit the mozprocess timeout, and return 1, we're still green, right? Maybe this isn't so far fetched. I just noticed a few test failures followed by crashes on windows b-c runs that return 1. These would have been green if the crash failed to get reported for some reason.
Comment 16•9 years ago
|
||
(In reply to Chris Manchester [:chmanchester] from comment #12) > (In reply to Andrew Halberstadt [:ahal] from comment #11) > > (In reply to Chris Manchester [:chmanchester] from comment #7) > > > Let's run this through try on the broken rev from comment 0 to prove the fix. > > > > Here's the updated try run: > > https://treeherder.mozilla.org/#/jobs?repo=try&revision=d67b58d96c73 > > > > Also I don't think the rev matters as every run seems broken. > > I don't see that failure after the bustage fix in the subsequent revision. > Sure, the bug is present in them all, but I want to see that revision turn > bc1 orange. I mean red. I pushed it to try, and at least or linux, bc1 is as red as I hoped (it's red): https://treeherder.mozilla.org/#/jobs?repo=try&revision=6cd83abb25a3
Updated•9 years ago
|
Attachment #8546838 -
Flags: review+
Comment 17•9 years ago
|
||
https://reviewboard.mozilla.org/r/2307/#review1523 Ship It! ::: scripts/desktop_unittest.py (Diff revision 2) > + # but don't pass this into run_command so it remains at least somewhat visible. Update the comment above instead of adding this.
Comment 18•9 years ago
|
||
https://reviewboard.mozilla.org/r/2307/#review1525 The mozharness timeout gets hit instead of the TestRunner.js or runtests.py in those jobs because runtests.py is hanging while attempting to shut down httpd.js.
Comment 19•9 years ago
|
||
Ok. Something here needs to land, but M-11 emulator debug jobs are consistently returning 247, so I bet those will burn.
Assignee | ||
Comment 20•9 years ago
|
||
(In reply to Chris Manchester [:chmanchester] from comment #13) > This looks much better, but I want to see the tree burn as a result of this > kind of failure before we land it. Sorry, meant to run bc1 as part of my try push but I guess the syntax in the trychooser extension is out of date or something. > Probably not a problem in practice, but if we fail on windows, hit the > mozprocess timeout, and return 1, we're still green, right? Hm, on unix we return -<signal> when the process is terminated. I forgot we don't do that on windows, is it returning 1 on timeout? I guess mozharness' run_command could set the returncode itself if the job is windows and we hit a mozharness timeout. > Ok. Something here needs to land, but M-11 emulator debug jobs are > consistently returning 247, so I bet those will burn. File a bug and add it to success_codes for that job (with a comment pointing to the bug of course)? My thinking is that even if we put failure codes in success_codes and some jobs appear green when they aren't, it's still better than the current situation in which all jobs appear green no matter what the return code. > > + # but don't pass this into run_command so it remains at least somewhat visible. > > Update the comment above instead of adding this. I'm not quite sure what you mean by this. I'll do a new push and try to address these concerns. Thanks!
Comment 21•9 years ago
|
||
(In reply to Andrew Halberstadt [:ahal] from comment #20) > (In reply to Chris Manchester [:chmanchester] from comment #13) > > This looks much better, but I want to see the tree burn as a result of this > > kind of failure before we land it. > > Sorry, meant to run bc1 as part of my try push but I guess the syntax in the > trychooser extension is out of date or something. Even with bc on that run it don't think it would have been red, but let's see. The situation in comment 0 is from a particularly broken revision. > > > Probably not a problem in practice, but if we fail on windows, hit the > > mozprocess timeout, and return 1, we're still green, right? > > Hm, on unix we return -<signal> when the process is terminated. I forgot we > don't do that on windows, is it returning 1 on timeout? I guess mozharness' > run_command could set the returncode itself if the job is windows and we hit > a mozharness timeout. Right, I doubt we'd return 1 on a mozprocess timeout. But because every mochitest job on windows returns 1, if we have a failure scenario like the one in comment 0, but the problem does not result in a mozharness timeout, we're still open to issues like the one hypothesized in comment 15. > > > Ok. Something here needs to land, but M-11 emulator debug jobs are > > consistently returning 247, so I bet those will burn. > > File a bug and add it to success_codes for that job (with a comment pointing > to the bug of course)? My thinking is that even if we put failure codes in > success_codes and some jobs appear green when they aren't, it's still better > than the current situation in which all jobs appear green no matter what the > return code. Adding it to success_codes is probably ok, because it reflects the current criteria. I don't think it's a separate bug. > > > > + # but don't pass this into run_command so it remains at least somewhat visible. > > > > Update the comment above instead of adding this. > > I'm not quite sure what you mean by this. I'll do a new push and try to > address these concerns. Thanks! There's already a comment about how failures are determined directly above the one you added.
Assignee | ||
Comment 22•9 years ago
|
||
(In reply to Chris Manchester [:chmanchester] from comment #21) > Right, I doubt we'd return 1 on a mozprocess timeout. But because every > mochitest job on windows returns 1, if we have a failure scenario like the > one in comment 0, but the problem does not result in a mozharness timeout, > we're still open to issues like the one hypothesized in comment 15. Yeah, there would still be work to do after this lands. At least it would be an incremental improvement. > Adding it to success_codes is probably ok, because it reflects the current > criteria. I don't think it's a separate bug. Sorry, I meant file a bug for the fact that mochitest-11 returns 245 (if there isn't one already). > There's already a comment about how failures are determined directly above > the one you added. Ah. I was trying to make it explicitly clear that we shouldn't pass in 'success_codes' into run_command. I'll try and word it a little better.
Assignee | ||
Comment 23•9 years ago
|
||
The state of mozreview for this bug is very messed up for some reason, so just using splinter for this small follow-up patch. As you pointed out, there is still room for false positives if a windows job that returns 1 crashes but doesn't trigger crash reporting. But this is still a big improvement over the status quo. Since apparently random chunks can return non-zero and pass, here is a try -a: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=84d8d6943bc7 And here's one that should be based on that faulty changeset: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=2c73c7b3abb6
Attachment #8547802 -
Flags: review?(cmanchester)
Comment 24•9 years ago
|
||
Comment on attachment 8547802 [details] [diff] [review] Add 247 to success_codes for emulator mochitest-11 Review of attachment 8547802 [details] [diff] [review]: ----------------------------------------------------------------- ::: scripts/b2g_emulator_unittest.py @@ +307,5 @@ > if suite_options in self.tree_config: > missing_key = False > options = self.tree_config[suite_options] > > + if missing_key: Extraneous whitespace change. If intentional, feel free to leave it in.
Attachment #8547802 -
Flags: review?(cmanchester) → review+
Assignee | ||
Comment 25•9 years ago
|
||
Nice, the second try run turned bc1 red as expected. Waiting on some retriggers on the first try run. Look unrelated at first glance, but doesn't hurt to be sure.
Assignee | ||
Comment 26•9 years ago
|
||
Re-triggers came back green: https://hg.mozilla.org/build/mozharness/rev/a8fc88f41679
Assignee | ||
Updated•9 years ago
|
Attachment #8546838 -
Flags: checked-in+
Assignee | ||
Updated•9 years ago
|
Attachment #8547802 -
Flags: checked-in+
Comment 27•9 years ago
|
||
In production: https://hg.mozilla.org/build/mozharness/rev/a8fc88f41679
Assignee | ||
Comment 28•9 years ago
|
||
There's a job that always returns 1 due to a shutdown timeout on older b2g release branches: https://treeherder.mozilla.org/ui/logviewer.html#?job_id=68608&repo=mozilla-b2g34_v2_1 https://treeherder.mozilla.org/ui/logviewer.html#?job_id=84703&repo=mozilla-b2g32_v2_0 I'll add those to the success_codes whitelist as well. Patch upcoming.
Comment 29•9 years ago
|
||
(In reply to Andrew Halberstadt [:ahal] from comment #28) > There's a job that always returns 1 due to a shutdown timeout on older b2g > release branches: > https://treeherder.mozilla.org/ui/logviewer.html#?job_id=68608&repo=mozilla- > b2g34_v2_1 > https://treeherder.mozilla.org/ui/logviewer.html#?job_id=84703&repo=mozilla- > b2g32_v2_0 > > I'll add those to the success_codes whitelist as well. Patch upcoming. They return 247, too. Whatever's happening in m-11 changed chunks.
Assignee | ||
Comment 30•9 years ago
|
||
Ugh, this is pretty horrible.. but not sure of any better way of doing it.
Attachment #8549143 -
Flags: review?(cmanchester)
Comment 31•9 years ago
|
||
Comment on attachment 8549143 [details] [diff] [review] Add 1 to success_codes for emulator mochitest-12 on older b2g branches Review of attachment 8549143 [details] [diff] [review]: ----------------------------------------------------------------- ::: scripts/b2g_emulator_unittest.py @@ +371,5 @@ > + success_codes = [0, 1] > + elif suite_name == 'mochitest' and platform == 'emulator': > + if chunk == 12 and branch in ('mozilla-b2g32_v2_0', 'mozilla-b2g34_v2_1'): > + # mochitest-12 on older b2g release branches have shutdown timeouts > + success_codes = [0, 1] Looking at the logs in comment 28, these start by returning 247, then run ps because that return results in TBPL_FAILURE, and ps then returns 1.
Attachment #8549143 -
Flags: review?(cmanchester) → review-
Assignee | ||
Comment 32•9 years ago
|
||
Ah thanks for catching that!
Attachment #8549143 -
Attachment is obsolete: true
Attachment #8549153 -
Flags: review?(cmanchester)
Comment 33•9 years ago
|
||
Comment on attachment 8549153 [details] [diff] [review] Add 1 to success_codes for emulator mochitest-12 on older b2g branches Review of attachment 8549153 [details] [diff] [review]: ----------------------------------------------------------------- ::: scripts/b2g_emulator_unittest.py @@ +359,5 @@ > # the base implementation from running here. > pass > > + def _get_success_codes(self): > + suite_name = [x for x in self.test_suites if x in self.config['test_suite']][0] I'm assuming this does the right thing. It would be clearer if you passed suite_name as an argument to _get_success_codes.
Attachment #8549153 -
Flags: review?(cmanchester) → review+
Assignee | ||
Comment 34•9 years ago
|
||
Comment on attachment 8549153 [details] [diff] [review] Add 1 to success_codes for emulator mochitest-12 on older b2g branches Pushed with review comment fixed: https://hg.mozilla.org/build/mozharness/rev/d2efcba0f986
Attachment #8549153 -
Flags: checked-in+
Assignee | ||
Updated•9 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 35•9 years ago
|
||
Attachment #8546838 -
Attachment is obsolete: true
Attachment #8619094 -
Flags: review+
Assignee | ||
Comment 36•9 years ago
|
||
You need to log in
before you can comment on or make changes to this bug.
Description
•