Closed Bug 1119830 Opened 10 years ago Closed 10 years ago

Jobs using run-by-dir don't turn orange if a mozharness timeout is triggered

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: ahal, Assigned: ahal)

References

(Blocks 1 open bug)

Details

Attachments

(3 files, 2 obsolete files)

MozReview Request: bz://1119830/ahal 10 years ago Andrew Halberstadt [:ahal] 39 bytes, text/x-review-board-request	chmanchester : review+ chmanchester : feedback+ ahal : checked-in+	Details
Add 247 to success_codes for emulator mochitest-11 10 years ago Andrew Halberstadt [:ahal] 3.83 KB, patch	chmanchester : review+ ahal : checked-in+	Details \| Diff \| Splinter Review
Add 1 to success_codes for emulator mochitest-12 on older b2g branches 10 years ago Andrew Halberstadt [:ahal] 3.74 KB, patch	chmanchester : review-	Details \| Diff \| Splinter Review
Add 1 to success_codes for emulator mochitest-12 on older b2g branches 10 years ago Andrew Halberstadt [:ahal] 3.61 KB, patch	chmanchester : review+ ahal : checked-in+	Details \| Diff \| Splinter Review
MozReview Request: Bug 1119830 - DesktopUnittestParser should use the return_code in its tbpl_status determination, r=chmanchester 10 years ago Andrew Halberstadt [:ahal] 39 bytes, text/x-review-board-request	chmanchester : review+	Details

Andrew Halberstadt [:ahal]

Assignee

Description

•

10 years ago

E.g, all the "green" bc1 jobs actually have failures: https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&revision=fec317aa00bc&filter-searchStr=bc1 Chris figured out what was happening: > Those jobs are green because that test hangs the browser until it hits the > mozprocess timeout, so no "Failed: 5" line has a chance to get out. > Mozharness would typically turn this job orange due to seeing no "Passed: " > with a count, but it gets those from other dirs, so it doesn't see a problem. The orange bc1 jobs are only orange because they have other test failures within them. In addition to failing the job if no summary is detected, mozharness should obviously fail it if the entire job timed out.

Andrew Halberstadt [:ahal]

Assignee

Comment 1

•

10 years ago

Attached file MozReview Request: bz://1119830/ahal (obsolete) — Details

Attachment #8546838 - Flags: review?(cmanchester)

Andrew Halberstadt [:ahal]

Assignee

Comment 2

•

10 years ago

/r/2309 - Bug 1119830 - Set tbpl_status to TBPL_EXCEPTION if test job returns with a non-success code, r=chmanchester Pull down this commit: hg pull review -r 5dbdf6e97d2168fcd0f971586afc6bcdcc9e4cc0

Andrew Halberstadt [:ahal]

Assignee

Comment 3

•

10 years ago

The root of the problem was that many of the mozharness test scripts were just completely ignoring the return value of the test job when determining tbpl_status. This change manually sets tbpl_status = TBPL_EXCEPTION if there was a non-success return code coming out of the test job. The structuredlog parser (used by web-platform-tests and possibly others) takes the return code into account. So this is more of a band-aid fix until we get all test jobs using the structuredlog parser. Here's a try run that (I think) runs a job from each file I touched against my user mozharness: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=837f5b9af192

Chris Manchester (limited bugmail, email directly)

Comment 4

•

10 years ago

https://reviewboard.mozilla.org/r/2307/#review1481 ::: scripts/b2g_desktop_unittest.py (Diff revision 1) > + tbpl_status = TBPL_EXCEPTION I would expect these to be red, not purple. ::: scripts/desktop_unittest.py (Diff revision 1) > + log_level = ERROR I think we'll always hit this on windows, because those test processes routinely return 1, and run_command appears to throw if you specify throw_exception and it gets a bad return code.

Andrew Halberstadt [:ahal]

Assignee

Updated

•

10 years ago

Attachment #8546838 - Flags: review?(cmanchester)

Andrew Halberstadt [:ahal]

Assignee

Comment 5

•

10 years ago

/r/2309 - Bug 1119830 - Set tbpl_status to TBPL_EXCEPTION if test job returns with a non-success code, r=chmanchester Pull down this commit: hg pull review -r 5dbdf6e97d2168fcd0f971586afc6bcdcc9e4cc0

Andrew Halberstadt [:ahal]

Assignee

Comment 6

•

10 years ago

I also just realized that we probably still need to call parser.evaluate_parser() as we might still need the passed/fail counts. I'll get a new patch up shortly.

Chris Manchester (limited bugmail, email directly)

Comment 7

•

10 years ago

android_emulator_unittest.py could use this too, but I think the place to fix this is in run_command. We know there when we time out, we don't really need to involve the returncode (my impression is that relying on them is basically hopeless at this point, because of windows). Let's run this through try on the broken rev from comment 0 to prove the fix. This wouldn't catch something like bug 1097278 combined with a situation where we're parsing multiple summaries. It's pretty much coincidental, but an effect of mozharness' summary parsing has been that any mochitest run that ends abruptly will be an orange job. Once we have multiple summaries, we don't have that property.

Andrew Halberstadt [:ahal]

Assignee

Updated

•

10 years ago

Attachment #8546838 - Flags: review?(cmanchester)

Andrew Halberstadt [:ahal]

Assignee

Comment 8

•

10 years ago

/r/2309 - Bug 1119830 - DesktopUnittestParser should use the return_code in its tbpl_status determination, r=chmanchester Pull down this commit: hg pull review -r 6a60368e46b094f3c12542f1df9f35fb1729718c

Andrew Halberstadt [:ahal]

Assignee

Comment 9

•

10 years ago

I think this version is a little bit better. It papers over the windows return code 1 problem, but at least it should start catching other strange and wonderful return codes.. Oh and I basically started over, so ignore the interdiff.. it won't make much sense.

Andrew Halberstadt [:ahal]

Assignee

Comment 10

•

10 years ago

(In reply to Chris Manchester [:chmanchester] from comment #7) > android_emulator_unittest.py could use this too, but I think the place to > fix this is in run_command. We know there when we time out, we don't really > need to involve the returncode (my impression is that relying on them is > basically hopeless at this point, because of windows). I'm not sure we want to add anything to run_command, as that is used in so many places for so many different things. I borrowed run_command's success_codes concept to paper over the windows problem. We use a similar hack for b2g xpcshell which also inexplicably returns 1, so at least there's precedent (though admittedly a hacky precedent). If we land this we at least constrain the return_code problem to just windows and b2g xpcshell. If we ever fix those problems, we could remove the 'success_codes' concept. I'm worried that if we continue ignoring the return_code we'll leave the door open for more broken return codes to slip through unnoticed.

Andrew Halberstadt [:ahal]

Assignee

Comment 11

•

10 years ago

(In reply to Chris Manchester [:chmanchester] from comment #7) > Let's run this through try on the broken rev from comment 0 to prove the fix. Here's the updated try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d67b58d96c73 Also I don't think the rev matters as every run seems broken.

Chris Manchester (limited bugmail, email directly)

Comment 12

•

10 years ago

(In reply to Andrew Halberstadt [:ahal] from comment #11) > (In reply to Chris Manchester [:chmanchester] from comment #7) > > Let's run this through try on the broken rev from comment 0 to prove the fix. > > Here's the updated try run: > https://treeherder.mozilla.org/#/jobs?repo=try&revision=d67b58d96c73 > > Also I don't think the rev matters as every run seems broken. I don't see that failure after the bustage fix in the subsequent revision. Sure, the bug is present in them all, but I want to see that revision turn bc1 orange.

Chris Manchester (limited bugmail, email directly)

Comment 13

•

10 years ago

Comment on attachment 8546838 [details] MozReview Request: bz://1119830/ahal This looks much better, but I want to see the tree burn as a result of this kind of failure before we land it. We introduced a new way of running mochitests that means we don't detect test failures when this sort of timeout happens. Ok, we can turn red when we get a wonky return code because we get those when that timeout happens, but it might only be addressing part of the problem. That being said, I think this is itself is worth doing because of the potentially severe nature of the issue.

Attachment #8546838 - Flags: review?(cmanchester) → feedback+

Chris Manchester (limited bugmail, email directly)

Comment 14

•

10 years ago

https://reviewboard.mozilla.org/r/2307/#review1497 ::: scripts/desktop_unittest.py (Diff revision 2) > + success_codes = [0, 1] Probably not a problem in practice, but if we fail on windows, hit the mozprocess timeout, and return 1, we're still green, right?

Chris Manchester (limited bugmail, email directly)

Comment 15

•

10 years ago

https://reviewboard.mozilla.org/r/2307/#review1509 > Probably not a problem in practice, but if we fail on windows, hit the mozprocess timeout, and return 1, we're still green, right? Maybe this isn't so far fetched. I just noticed a few test failures followed by crashes on windows b-c runs that return 1. These would have been green if the crash failed to get reported for some reason.

Chris Manchester (limited bugmail, email directly)

Comment 16

•

10 years ago

(In reply to Chris Manchester [:chmanchester] from comment #12) > (In reply to Andrew Halberstadt [:ahal] from comment #11) > > (In reply to Chris Manchester [:chmanchester] from comment #7) > > > Let's run this through try on the broken rev from comment 0 to prove the fix. > > > > Here's the updated try run: > > https://treeherder.mozilla.org/#/jobs?repo=try&revision=d67b58d96c73 > > > > Also I don't think the rev matters as every run seems broken. > > I don't see that failure after the bustage fix in the subsequent revision. > Sure, the bug is present in them all, but I want to see that revision turn > bc1 orange. I mean red. I pushed it to try, and at least or linux, bc1 is as red as I hoped (it's red): https://treeherder.mozilla.org/#/jobs?repo=try&revision=6cd83abb25a3

Chris Manchester (limited bugmail, email directly)

Updated

•

10 years ago

Attachment #8546838 - Flags: review+

Chris Manchester (limited bugmail, email directly)

Comment 17

•

10 years ago

https://reviewboard.mozilla.org/r/2307/#review1523 Ship It! ::: scripts/desktop_unittest.py (Diff revision 2) > + # but don't pass this into run_command so it remains at least somewhat visible. Update the comment above instead of adding this.

Chris Manchester (limited bugmail, email directly)

Comment 18

•

10 years ago

https://reviewboard.mozilla.org/r/2307/#review1525 The mozharness timeout gets hit instead of the TestRunner.js or runtests.py in those jobs because runtests.py is hanging while attempting to shut down httpd.js.

Chris Manchester (limited bugmail, email directly)

Comment 19

•

10 years ago

Ok. Something here needs to land, but M-11 emulator debug jobs are consistently returning 247, so I bet those will burn.

Andrew Halberstadt [:ahal]

Assignee

Comment 20

•

10 years ago

(In reply to Chris Manchester [:chmanchester] from comment #13) > This looks much better, but I want to see the tree burn as a result of this > kind of failure before we land it. Sorry, meant to run bc1 as part of my try push but I guess the syntax in the trychooser extension is out of date or something. > Probably not a problem in practice, but if we fail on windows, hit the > mozprocess timeout, and return 1, we're still green, right? Hm, on unix we return -<signal> when the process is terminated. I forgot we don't do that on windows, is it returning 1 on timeout? I guess mozharness' run_command could set the returncode itself if the job is windows and we hit a mozharness timeout. > Ok. Something here needs to land, but M-11 emulator debug jobs are > consistently returning 247, so I bet those will burn. File a bug and add it to success_codes for that job (with a comment pointing to the bug of course)? My thinking is that even if we put failure codes in success_codes and some jobs appear green when they aren't, it's still better than the current situation in which all jobs appear green no matter what the return code. > > + # but don't pass this into run_command so it remains at least somewhat visible. > > Update the comment above instead of adding this. I'm not quite sure what you mean by this. I'll do a new push and try to address these concerns. Thanks!

Chris Manchester (limited bugmail, email directly)

Comment 21

•

10 years ago

(In reply to Andrew Halberstadt [:ahal] from comment #20) > (In reply to Chris Manchester [:chmanchester] from comment #13) > > This looks much better, but I want to see the tree burn as a result of this > > kind of failure before we land it. > > Sorry, meant to run bc1 as part of my try push but I guess the syntax in the > trychooser extension is out of date or something. Even with bc on that run it don't think it would have been red, but let's see. The situation in comment 0 is from a particularly broken revision. > > > Probably not a problem in practice, but if we fail on windows, hit the > > mozprocess timeout, and return 1, we're still green, right? > > Hm, on unix we return -<signal> when the process is terminated. I forgot we > don't do that on windows, is it returning 1 on timeout? I guess mozharness' > run_command could set the returncode itself if the job is windows and we hit > a mozharness timeout. Right, I doubt we'd return 1 on a mozprocess timeout. But because every mochitest job on windows returns 1, if we have a failure scenario like the one in comment 0, but the problem does not result in a mozharness timeout, we're still open to issues like the one hypothesized in comment 15. > > > Ok. Something here needs to land, but M-11 emulator debug jobs are > > consistently returning 247, so I bet those will burn. > > File a bug and add it to success_codes for that job (with a comment pointing > to the bug of course)? My thinking is that even if we put failure codes in > success_codes and some jobs appear green when they aren't, it's still better > than the current situation in which all jobs appear green no matter what the > return code. Adding it to success_codes is probably ok, because it reflects the current criteria. I don't think it's a separate bug. > > > > + # but don't pass this into run_command so it remains at least somewhat visible. > > > > Update the comment above instead of adding this. > > I'm not quite sure what you mean by this. I'll do a new push and try to > address these concerns. Thanks! There's already a comment about how failures are determined directly above the one you added.

Andrew Halberstadt [:ahal]

Assignee

Comment 22

•

10 years ago

(In reply to Chris Manchester [:chmanchester] from comment #21) > Right, I doubt we'd return 1 on a mozprocess timeout. But because every > mochitest job on windows returns 1, if we have a failure scenario like the > one in comment 0, but the problem does not result in a mozharness timeout, > we're still open to issues like the one hypothesized in comment 15. Yeah, there would still be work to do after this lands. At least it would be an incremental improvement. > Adding it to success_codes is probably ok, because it reflects the current > criteria. I don't think it's a separate bug. Sorry, I meant file a bug for the fact that mochitest-11 returns 245 (if there isn't one already). > There's already a comment about how failures are determined directly above > the one you added. Ah. I was trying to make it explicitly clear that we shouldn't pass in 'success_codes' into run_command. I'll try and word it a little better.

Andrew Halberstadt [:ahal]

Assignee

Updated

•

10 years ago

Depends on: 1120580

Andrew Halberstadt [:ahal]

Assignee

Updated

•

10 years ago

Depends on: 1120644

Andrew Halberstadt [:ahal]

Assignee

Comment 23

•

10 years ago

Attached patch Add 247 to success_codes for emulator mochitest-11 — Details — Splinter Review

The state of mozreview for this bug is very messed up for some reason, so just using splinter for this small follow-up patch. As you pointed out, there is still room for false positives if a windows job that returns 1 crashes but doesn't trigger crash reporting. But this is still a big improvement over the status quo. Since apparently random chunks can return non-zero and pass, here is a try -a: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=84d8d6943bc7 And here's one that should be based on that faulty changeset: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=2c73c7b3abb6

Attachment #8547802 - Flags: review?(cmanchester)

Chris Manchester (limited bugmail, email directly)

Comment 24

•

10 years ago

Comment on attachment 8547802 [details] [diff] [review] Add 247 to success_codes for emulator mochitest-11 Review of attachment 8547802 [details] [diff] [review]: ----------------------------------------------------------------- ::: scripts/b2g_emulator_unittest.py @@ +307,5 @@ > if suite_options in self.tree_config: > missing_key = False > options = self.tree_config[suite_options] > > + if missing_key: Extraneous whitespace change. If intentional, feel free to leave it in.

Attachment #8547802 - Flags: review?(cmanchester) → review+

Andrew Halberstadt [:ahal]

Assignee

Comment 25

•

10 years ago

Nice, the second try run turned bc1 red as expected. Waiting on some retriggers on the first try run. Look unrelated at first glance, but doesn't hurt to be sure.

Andrew Halberstadt [:ahal]

Assignee

Comment 26

•

10 years ago

Re-triggers came back green: https://hg.mozilla.org/build/mozharness/rev/a8fc88f41679

Andrew Halberstadt [:ahal]

Assignee

Updated

•

10 years ago

Attachment #8546838 - Flags: checked-in+

Andrew Halberstadt [:ahal]

Assignee

Updated

•

10 years ago

Attachment #8547802 - Flags: checked-in+

Chris Cooper [:coop] (he/him)

Comment 27

•

10 years ago

In production: https://hg.mozilla.org/build/mozharness/rev/a8fc88f41679

Ed Morley [:emorley]

Updated

•

10 years ago

Blocks: 1048775

Andrew Halberstadt [:ahal]

Assignee

Comment 28

•

10 years ago

There's a job that always returns 1 due to a shutdown timeout on older b2g release branches: https://treeherder.mozilla.org/ui/logviewer.html#?job_id=68608&repo=mozilla-b2g34_v2_1 https://treeherder.mozilla.org/ui/logviewer.html#?job_id=84703&repo=mozilla-b2g32_v2_0 I'll add those to the success_codes whitelist as well. Patch upcoming.

Chris Manchester (limited bugmail, email directly)

Comment 29

•

10 years ago

(In reply to Andrew Halberstadt [:ahal] from comment #28) > There's a job that always returns 1 due to a shutdown timeout on older b2g > release branches: > https://treeherder.mozilla.org/ui/logviewer.html#?job_id=68608&repo=mozilla- > b2g34_v2_1 > https://treeherder.mozilla.org/ui/logviewer.html#?job_id=84703&repo=mozilla- > b2g32_v2_0 > > I'll add those to the success_codes whitelist as well. Patch upcoming. They return 247, too. Whatever's happening in m-11 changed chunks.

Andrew Halberstadt [:ahal]

Assignee

Comment 30

•

10 years ago

Attached patch Add 1 to success_codes for emulator mochitest-12 on older b2g branches (obsolete) — Details — Splinter Review

Ugh, this is pretty horrible.. but not sure of any better way of doing it.

Attachment #8549143 - Flags: review?(cmanchester)

Chris Manchester (limited bugmail, email directly)

Comment 31

•

10 years ago

Comment on attachment 8549143 [details] [diff] [review] Add 1 to success_codes for emulator mochitest-12 on older b2g branches Review of attachment 8549143 [details] [diff] [review]: ----------------------------------------------------------------- ::: scripts/b2g_emulator_unittest.py @@ +371,5 @@ > + success_codes = [0, 1] > + elif suite_name == 'mochitest' and platform == 'emulator': > + if chunk == 12 and branch in ('mozilla-b2g32_v2_0', 'mozilla-b2g34_v2_1'): > + # mochitest-12 on older b2g release branches have shutdown timeouts > + success_codes = [0, 1] Looking at the logs in comment 28, these start by returning 247, then run ps because that return results in TBPL_FAILURE, and ps then returns 1.

Attachment #8549143 - Flags: review?(cmanchester) → review-

Andrew Halberstadt [:ahal]

Assignee

Comment 32

•

10 years ago

Attached patch Add 1 to success_codes for emulator mochitest-12 on older b2g branches — Details — Splinter Review

Ah thanks for catching that!

Attachment #8549143 - Attachment is obsolete: true

Attachment #8549153 - Flags: review?(cmanchester)

Chris Manchester (limited bugmail, email directly)

Comment 33

•

10 years ago

Comment on attachment 8549153 [details] [diff] [review] Add 1 to success_codes for emulator mochitest-12 on older b2g branches Review of attachment 8549153 [details] [diff] [review]: ----------------------------------------------------------------- ::: scripts/b2g_emulator_unittest.py @@ +359,5 @@ > # the base implementation from running here. > pass > > + def _get_success_codes(self): > + suite_name = [x for x in self.test_suites if x in self.config['test_suite']][0] I'm assuming this does the right thing. It would be clearer if you passed suite_name as an argument to _get_success_codes.

Attachment #8549153 - Flags: review?(cmanchester) → review+

Andrew Halberstadt [:ahal]

Assignee

Comment 34

•

10 years ago

Comment on attachment 8549153 [details] [diff] [review] Add 1 to success_codes for emulator mochitest-12 on older b2g branches Pushed with review comment fixed: https://hg.mozilla.org/build/mozharness/rev/d2efcba0f986

Attachment #8549153 - Flags: checked-in+

Andrew Halberstadt [:ahal]

Assignee

Updated

•

10 years ago

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Andrew Halberstadt [:ahal]

Assignee

Comment 35

•

10 years ago

Attachment #8546838 - Attachment is obsolete: true

Attachment #8619094 - Flags: review+

Andrew Halberstadt [:ahal]

Assignee

Comment 36

•

10 years ago

Attached file MozReview Request: Bug 1119830 - DesktopUnittestParser should use the return_code in its tbpl_status determination, r=chmanchester — Details

You need to log in before you can comment on or make changes to this bug.