Closed Bug 881950 Opened 11 years ago Closed 11 years ago

Update test harness to return a retry return code when activation fails

Categories

(Release Engineering :: General, defect)

x86_64
Windows 8.1
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jimm, Assigned: coop)

References

Details

Attachments

(4 files)

Spun off from bug 847442 - > WRT to C, jimm can you please write a patch that has a return code of 4 if the > metrotestharness.exe fails to activate? It will turn the job purple and run it > on another win8 machine.
Blocks: 880693
Assignee: nobody → jmathies
Attached patch patchSplinter Review
Comment on attachment 761491 [details] [diff] [review] patch I haven't gotten this to trigger yet on try, but assuming I do and this works, seeking review.
Attachment #761491 - Flags: review?(netzen)
Comment on attachment 761491 [details] [diff] [review] patch Review of attachment 761491 [details] [diff] [review]: ----------------------------------------------------------------- ::: browser/metro/shell/testing/metrotestharness.cpp @@ +193,5 @@ > +// From buildbotcustom logic: > +#define SUCCESS 0 > +#define WARNINGS 1 > +#define FAILURE 2 > +#define EXCEPTION 3 Is it expected that EXCEPTION is not used? @@ +194,5 @@ > +#define SUCCESS 0 > +#define WARNINGS 1 > +#define FAILURE 2 > +#define EXCEPTION 3 > +#define RETRY 4 /* will retry endlessly on new slaves, be careful with this! */ Did you mean to add any RETRY logic? You return it but I don't see it used anywhere.
These are just return codes that buildbot picks up and acts on. I added defines for each but we only currently return three out of the four. According to armen, if I return RETRY the harness will pick that up, mark it as a blue exception on tbpl, and retrigger the run automatically.
(In reply to Jim Mathies [:jimm] from comment #5) > These are just return codes that buildbot picks up and acts on. I added > defines for each but we only currently return three out of the four. > According to armen, if I return RETRY the harness will pick that up, mark it > as a blue exception on tbpl, and retrigger the run automatically. * three out of the five
Comment on attachment 761491 [details] [diff] [review] patch Review of attachment 761491 [details] [diff] [review]: ----------------------------------------------------------------- Thanks for explaining, sounds good.
Attachment #761491 - Flags: review?(netzen) → review+
I triggered about twenty runs and didn't get it, so I'm just going to land this and we'll see what happens on inbound. There we're seeing about five per day.
https://tbpl.mozilla.org/php/getParsedLog.php?id=24070010&tree=Mozilla-Inbound - I sort of wondered whether it was really your return code being 4 that would do it, or it was mozharness returning 4 to buildbot. Looks like it's the latter, and you have to do something to tell mozharness that it should be giving buildbot the 4.
Whiteboard: [leave-open]
I will look into it. Thanks philor.
Assignee: jmathies → armenzg
The return codes are coming back, so I guess we just need some sort of automation harness change to pick these up - 06:57:41 WARNING - TEST-UNEXPECTED-FAIL | metrotestharness.exe | ActivateApplication result 80270254 Bug 880693 - Intermittent metrotestharness.exe | ActivateApplication result 80270254 06:57:41 WARNING - TEST-UNEXPECTED-FAIL | automation.py | Exited with code 4 during test run 06:57:43 ERROR - Return code: 4
(In reply to Jim Mathies [:jimm] from comment #13) > The return codes are coming back, so I guess we just need some sort of > automation harness change to pick these up - > > 06:57:41 WARNING - TEST-UNEXPECTED-FAIL | metrotestharness.exe | > ActivateApplication result 80270254 > Bug 880693 - Intermittent metrotestharness.exe | ActivateApplication result > 80270254 06:57:41 WARNING - TEST-UNEXPECTED-FAIL | automation.py | Exited > with code 4 during test run > 06:57:43 ERROR - Return code: 4 Yeah. I'm looking into mozharness' code to find out how to make it pipe correctly.
Component: Tests → Release Engineering: Automation (General)
Product: Firefox for Metro → mozilla.org
QA Contact: catlee
Version: Trunk → other
Besides mozharness not respecting return codes because all other suites do not necessarily respect standards, I'm pretty sure that this two lines are what makes the parser turn the whole job as just a warning. 12:58:43 WARNING - TEST-UNEXPECTED-FAIL | metrotestharness.exe | ActivateApplication result 80270254 12:58:43 WARNING - TEST-UNEXPECTED-FAIL | automation.py | Exited with code 4 during test run I'm pretty sure it is "TEST-UNEXPECTED-FAIL" that makes us end with TBPL WARNING status. I'm not sure though. This is so messy. This happens somewhere in parse_line_by_line: http://hg.mozilla.org/build/mozharness/file/default/mozharness/mozilla/testing/unittest.py#l74 which grabs its matching info from here: http://hg.mozilla.org/build/mozharness/file/59a2f5792c69/mozharness/mozilla/testing/errors.py#l22 I think if we added another entry under "harness_error": http://hg.mozilla.org/build/mozharness/file/59a2f5792c69/mozharness/mozilla/testing/errors.py#l46 and we printed it from inside the harness (not the mozharness) we could then turn it into a TBPL_RETRY. Here's some of the code path: * run the command [1] * we parse the log and the return code [2] * the parser runs [3] * we set the buildbot status to warning [4] > WARNING - - # TBPL WARNING # * self.worst_buildbot_status is what is 1 and makes us just warn [5] > program finished with exit code 1 I'm going to experiment with something on staging. [1] http://hg.mozilla.org/build/mozharness/file/default/scripts/desktop_unittest.py#l349 return_code = \ self.run_command(cmd, cwd=dirs['abs_work_dir'], output_parser=parser, env=env) [2] http://hg.mozilla.org/build/mozharness/file/default/scripts/desktop_unittest.py#l361 tbpl_status, log_level = parser.evaluate_parser(return_code) [3] http://hg.mozilla.org/build/mozharness/file/default/mozharness/mozilla/testing/unittest.py#l116 [4] http://hg.mozilla.org/build/mozharness/file/default/scripts/desktop_unittest.py#l364 [5] http://hg.mozilla.org/build/mozharness/file/tip/mozharness/mozilla/buildbot.py#l89 self.buildbot_status(tbpl_status, level=log_level)
We will have to replace the "RETRY_ME_PLEASE" with some other string that becomes a consensus. Does everything else look good? I will send an email to get a consensus meanwhile.
Attachment #762282 - Flags: feedback?(aki)
(In reply to Armen Zambrano G. [:armenzg] (Release Enginerring) from comment #15) > Besides mozharness not respecting return codes because all other suites do > not necessarily respect standards, I'm pretty sure that this two lines are > what makes the parser turn the whole job as just a warning. > 12:58:43 WARNING - TEST-UNEXPECTED-FAIL | metrotestharness.exe | > ActivateApplication result 80270254 > 12:58:43 WARNING - TEST-UNEXPECTED-FAIL | automation.py | Exited with code > 4 during test run > > I'm pretty sure it is "TEST-UNEXPECTED-FAIL" that makes us end with TBPL > WARNING status. I'm not sure though. This is so messy. I can strip that text out of the internal harness output - http://mxr.mozilla.org/mozilla-central/source/browser/metro/shell/testing/metrotestharness.cpp#59 http://mxr.mozilla.org/mozilla-central/source/browser/metro/shell/testing/metrotestharness.cpp#319 if you think it will help.
Comment on attachment 762282 [details] [diff] [review] mozharness retry example I'm not sure if CRITICAL is the right log level, but if it is, that's fine. I'm actually fine with "RETRY_ME_PLEASE" but am also ok with changing this to another unique string. This looks good in terms of how it's written; thanks for figuring this out!
Attachment #762282 - Flags: feedback?(aki) → feedback+
Assigning to coop to find someone to drive this to completion. Volunteers welcomed! This is what we need: 1) get a consensus on what string to fail on (I'm sending an email now) 2) jimm to get the metro job to print that magic retry string 3) review and land this patch
Assignee: armenzg → coop
(In reply to Aki Sasaki [:aki] from comment #18) > Comment on attachment 762282 [details] [diff] [review] > mozharness retry example > > I'm not sure if CRITICAL is the right log level, but if it is, that's fine. > > I'm actually fine with "RETRY_ME_PLEASE" but am also ok with changing this > to another unique string. > > This looks good in terms of how it's written; thanks for figuring this out! How about 'FAIL_SHOULD_RETRY'? The test framework should make it's own decision about how many retries it grants a single build (say 3 times tops, then it should fail the run). I think armen filed a bug on doing this, currently I guess we will retry indefinitely, potentially exhausting the slave pool.
(In reply to Jim Mathies [:jimm] from comment #20) > (In reply to Aki Sasaki [:aki] from comment #18) > > Comment on attachment 762282 [details] [diff] [review] > > mozharness retry example > > > > I'm not sure if CRITICAL is the right log level, but if it is, that's fine. > > > > I'm actually fine with "RETRY_ME_PLEASE" but am also ok with changing this > > to another unique string. > > > > This looks good in terms of how it's written; thanks for figuring this out! > > How about 'FAIL_SHOULD_RETRY'? The test framework should make it's own > decision about how many retries it grants a single build (say 3 times tops, > then it should fail the run). I think armen filed a bug on doing this, > currently I guess we will retry indefinitely, potentially exhausting the > slave pool. Actually to match other test output, that should be 'FAIL-SHOULD-RETRY'.
Attached patch harness updateSplinter Review
Add retry message support.
Attachment #762742 - Flags: review?(netzen)
Attachment #762742 - Flags: review?(netzen) → review+
I'll check this in once we have sign off from releng.
Blocks: 884247
(In reply to Jim Mathies [:jimm] from comment #23) > I'll check this in once we have sign off from releng. What sign off do you need, Jim? I'm fine with 'FAIL-SHOULD-RETRY' if that's what you're looking for.
(In reply to Chris Cooper [:coop] from comment #24) > (In reply to Jim Mathies [:jimm] from comment #23) > > I'll check this in once we have sign off from releng. > > What sign off do you need, Jim? > > I'm fine with 'FAIL-SHOULD-RETRY' if that's what you're looking for. Sounds good, I'll check it in. Can someone from releng pick up step 3 in comment 19?
Comment on attachment 762282 [details] [diff] [review] mozharness retry example Checked in, with new retry string. https://hg.mozilla.org/build/mozharness/rev/5c195cb35e14 Not live yet. I can transplant to production if we need this sooner, or wait til :edmorley is comfy with http://hg.mozilla.org/build/mozharness/rev/7a39cb9045f9 and merge everything at once.
Attachment #762282 - Flags: checked-in+
Made them unstarrable, since tbpl doesn't know that FAIL-SHOULD-RETRY has any bearing on results, so at least you quieted me down about them :)
The mozharness portion is in production.
This doesn't seem to be working. From a recent build on inbound: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&showall=1&jobname=winnt%206.2&rev=5b3196ad66f4 15:52:42 CRITICAL - FAIL-SHOULD-RETRY | metrotestharness.exe | ActivateApplication result 80270254 15:52:42 INFO - INFO | metrotestharness.exe | Deleting C:\slave\test\build\application\firefox\tests.ini 15:52:42 WARNING - TEST-UNEXPECTED-FAIL | automation.py | Exited with code 4 during test run 15:52:42 INFO - INFO | automation.py | Application ran for: 0:00:00.385000 15:52:42 INFO - INFO | zombiecheck | Reading PID log: c:\users\cltbld~1.t-w\appdata\local\temp\tmprgt87ipidlog 15:52:43 INFO - SUCCESS: The process with PID 3236 has been terminated. 15:52:43 INFO - ERROR: The process with PID 1572 could not be terminated. 15:52:43 INFO - Reason: There is no running instance of the task. 15:52:43 INFO - SUCCESS: The process with PID 3956 has been terminated. 15:52:43 INFO - WARNING | leakcheck | refcount logging is off, so leaks can't be detected! 15:52:43 INFO - INFO | runtests.py | Running tests: end. 15:52:43 ERROR - Return code: 4 15:52:43 INFO - TinderboxPrint: mochitest-mochitest-metro-chrome<br/><em class="testfail">T-FAIL</em> 15:52:43 CRITICAL - # TBPL WARNING # 15:52:43 CRITICAL - The mochitest suite: mochitest-metro-chrome ran with return status: WARNING 15:52:43 INFO - Copying logs to upload dir... 15:52:43 INFO - mkdir: C:\slave\test\build\upload\logs The FAIL-SHOULD-RETRY seems to get picked up as a CRITICAL error, but the run doesn't restart.
We used to either a) not use worst_level, or b) use worst_level against the keys of an unordered dict. Now we're using worst_level in all self.tbpl_status setting locations, and using an ordered tuple.
Attachment #765138 - Flags: review?(rail)
Attachment #765138 - Flags: review?(rail) → review+
This should be fixed.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [leave-open]
Product: mozilla.org → Release Engineering
OS: Windows 8 Metro → Windows 8.1
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: