Update test harness to return a retry return code when activation fails

RESOLVED FIXED

Status

defect
RESOLVED FIXED
6 years ago
a year ago

People

(Reporter: jimm, Assigned: coop)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(4 attachments)

Reporter

Description

6 years ago
Spun off from bug 847442 - 

> WRT to C, jimm can you please write a patch that has a return code of 4 if the > metrotestharness.exe fails to activate? It will turn the job purple and run it > on another win8 machine.
Reporter

Updated

6 years ago
Blocks: 880693
Reporter

Updated

6 years ago
Assignee: nobody → jmathies
Reporter

Comment 1

6 years ago
Posted patch patchSplinter Review
Reporter

Comment 3

6 years ago
Comment on attachment 761491 [details] [diff] [review]
patch

I haven't gotten this to trigger yet on try, but assuming I do and this works, seeking review.
Attachment #761491 - Flags: review?(netzen)
Comment on attachment 761491 [details] [diff] [review]
patch

Review of attachment 761491 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/metro/shell/testing/metrotestharness.cpp
@@ +193,5 @@
> +// From buildbotcustom logic:
> +#define SUCCESS   0
> +#define WARNINGS  1
> +#define FAILURE   2
> +#define EXCEPTION 3

Is it expected that EXCEPTION is not used?

@@ +194,5 @@
> +#define SUCCESS   0
> +#define WARNINGS  1
> +#define FAILURE   2
> +#define EXCEPTION 3
> +#define RETRY     4 /* will retry endlessly on new slaves, be careful with this! */

Did you mean to add any RETRY logic? You return it but I don't see it used anywhere.
Reporter

Comment 5

6 years ago
These are just return codes that buildbot picks up and acts on. I added defines for each but we only currently return three out of the four. According to armen, if I return RETRY the harness will pick that up, mark it as a blue exception on tbpl, and retrigger the run automatically.
Reporter

Comment 6

6 years ago
(In reply to Jim Mathies [:jimm] from comment #5)
> These are just return codes that buildbot picks up and acts on. I added
> defines for each but we only currently return three out of the four.
> According to armen, if I return RETRY the harness will pick that up, mark it
> as a blue exception on tbpl, and retrigger the run automatically.

* three out of the five
Comment on attachment 761491 [details] [diff] [review]
patch

Review of attachment 761491 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for explaining, sounds good.
Attachment #761491 - Flags: review?(netzen) → review+
Reporter

Comment 8

6 years ago
I triggered about twenty runs and didn't get it, so I'm just going to land this and we'll see what happens on inbound. There we're seeing about five per day.
https://tbpl.mozilla.org/php/getParsedLog.php?id=24070010&tree=Mozilla-Inbound - I sort of wondered whether it was really your return code being 4 that would do it, or it was mozharness returning 4 to buildbot. Looks like it's the latter, and you have to do something to tell mozharness that it should be giving buildbot the 4.
Reporter

Updated

6 years ago
Whiteboard: [leave-open]
I will look into it. Thanks philor.
Assignee: jmathies → armenzg
Reporter

Comment 13

6 years ago
The return codes are coming back, so I guess we just need some sort of automation harness change to pick these up - 

06:57:41 WARNING - TEST-UNEXPECTED-FAIL | metrotestharness.exe | ActivateApplication result 80270254
Bug 880693 - Intermittent metrotestharness.exe | ActivateApplication result 80270254 06:57:41 WARNING - TEST-UNEXPECTED-FAIL | automation.py | Exited with code 4 during test run
06:57:43 ERROR - Return code: 4
(In reply to Jim Mathies [:jimm] from comment #13)
> The return codes are coming back, so I guess we just need some sort of
> automation harness change to pick these up - 
> 
> 06:57:41 WARNING - TEST-UNEXPECTED-FAIL | metrotestharness.exe |
> ActivateApplication result 80270254
> Bug 880693 - Intermittent metrotestharness.exe | ActivateApplication result
> 80270254 06:57:41 WARNING - TEST-UNEXPECTED-FAIL | automation.py | Exited
> with code 4 during test run
> 06:57:43 ERROR - Return code: 4

Yeah. I'm looking into mozharness' code to find out how to make it pipe correctly.
Component: Tests → Release Engineering: Automation (General)
Product: Firefox for Metro → mozilla.org
QA Contact: catlee
Version: Trunk → other
Besides mozharness not respecting return codes because all other suites do not necessarily respect standards, I'm pretty sure that this two lines are what makes the parser turn the whole job as just a warning.
12:58:43  WARNING -  TEST-UNEXPECTED-FAIL | metrotestharness.exe | ActivateApplication result 80270254
12:58:43  WARNING -  TEST-UNEXPECTED-FAIL | automation.py | Exited with code 4 during test run

I'm pretty sure it is "TEST-UNEXPECTED-FAIL" that makes us end with TBPL WARNING status. I'm not sure though. This is so messy.
This happens somewhere in parse_line_by_line:
http://hg.mozilla.org/build/mozharness/file/default/mozharness/mozilla/testing/unittest.py#l74
which grabs its matching info from here:
http://hg.mozilla.org/build/mozharness/file/59a2f5792c69/mozharness/mozilla/testing/errors.py#l22

I think if we added another entry under "harness_error":
http://hg.mozilla.org/build/mozharness/file/59a2f5792c69/mozharness/mozilla/testing/errors.py#l46
and we printed it from inside the harness (not the mozharness) we could then turn it into a TBPL_RETRY.

Here's some of the code path:
* run the command [1]
* we parse the log and the return code [2]
* the parser runs [3]
* we set the buildbot status to warning [4]
>   WARNING - - # TBPL WARNING #
* self.worst_buildbot_status is what is 1 and makes us just warn [5]
> program finished with exit code 1

I'm going to experiment with something on staging.

[1] http://hg.mozilla.org/build/mozharness/file/default/scripts/desktop_unittest.py#l349
return_code = \
  self.run_command(cmd, cwd=dirs['abs_work_dir'], output_parser=parser, env=env)

[2] http://hg.mozilla.org/build/mozharness/file/default/scripts/desktop_unittest.py#l361
tbpl_status, log_level = parser.evaluate_parser(return_code)

[3] http://hg.mozilla.org/build/mozharness/file/default/mozharness/mozilla/testing/unittest.py#l116

[4] http://hg.mozilla.org/build/mozharness/file/default/scripts/desktop_unittest.py#l364

[5] http://hg.mozilla.org/build/mozharness/file/tip/mozharness/mozilla/buildbot.py#l89
self.buildbot_status(tbpl_status, level=log_level)
We will have to replace the "RETRY_ME_PLEASE" with some other string that becomes a consensus.

Does everything else look good?
I will send an email to get a consensus meanwhile.
Attachment #762282 - Flags: feedback?(aki)
Reporter

Comment 17

6 years ago
(In reply to Armen Zambrano G. [:armenzg] (Release Enginerring) from comment #15)
> Besides mozharness not respecting return codes because all other suites do
> not necessarily respect standards, I'm pretty sure that this two lines are
> what makes the parser turn the whole job as just a warning.
> 12:58:43  WARNING -  TEST-UNEXPECTED-FAIL | metrotestharness.exe |
> ActivateApplication result 80270254
> 12:58:43  WARNING -  TEST-UNEXPECTED-FAIL | automation.py | Exited with code
> 4 during test run
> 
> I'm pretty sure it is "TEST-UNEXPECTED-FAIL" that makes us end with TBPL
> WARNING status. I'm not sure though. This is so messy.

I can strip that text out of the internal harness output - 

http://mxr.mozilla.org/mozilla-central/source/browser/metro/shell/testing/metrotestharness.cpp#59
http://mxr.mozilla.org/mozilla-central/source/browser/metro/shell/testing/metrotestharness.cpp#319

if you think it will help.
Comment on attachment 762282 [details] [diff] [review]
mozharness retry example

I'm not sure if CRITICAL is the right log level, but if it is, that's fine.

I'm actually fine with "RETRY_ME_PLEASE" but am also ok with changing this to another unique string.

This looks good in terms of how it's written; thanks for figuring this out!
Attachment #762282 - Flags: feedback?(aki) → feedback+
Assigning to coop to find someone to drive this to completion.
Volunteers welcomed!

This is what we need:
1) get a consensus on what string to fail on (I'm sending an email now)
2) jimm to get the metro job to print that magic retry string
3) review and land this patch
Assignee: armenzg → coop
Reporter

Comment 20

6 years ago
(In reply to Aki Sasaki [:aki] from comment #18)
> Comment on attachment 762282 [details] [diff] [review]
> mozharness retry example
> 
> I'm not sure if CRITICAL is the right log level, but if it is, that's fine.
> 
> I'm actually fine with "RETRY_ME_PLEASE" but am also ok with changing this
> to another unique string.
> 
> This looks good in terms of how it's written; thanks for figuring this out!

How about 'FAIL_SHOULD_RETRY'? The test framework should make it's own decision about how many retries it grants a single build (say 3 times tops, then it should fail the run). I think armen filed a bug on doing this, currently I guess we will retry indefinitely, potentially exhausting the slave pool.
Reporter

Comment 21

6 years ago
(In reply to Jim Mathies [:jimm] from comment #20)
> (In reply to Aki Sasaki [:aki] from comment #18)
> > Comment on attachment 762282 [details] [diff] [review]
> > mozharness retry example
> > 
> > I'm not sure if CRITICAL is the right log level, but if it is, that's fine.
> > 
> > I'm actually fine with "RETRY_ME_PLEASE" but am also ok with changing this
> > to another unique string.
> > 
> > This looks good in terms of how it's written; thanks for figuring this out!
> 
> How about 'FAIL_SHOULD_RETRY'? The test framework should make it's own
> decision about how many retries it grants a single build (say 3 times tops,
> then it should fail the run). I think armen filed a bug on doing this,
> currently I guess we will retry indefinitely, potentially exhausting the
> slave pool.

Actually to match other test output, that should be 'FAIL-SHOULD-RETRY'.
Reporter

Comment 22

6 years ago
Add retry message support.
Attachment #762742 - Flags: review?(netzen)
Attachment #762742 - Flags: review?(netzen) → review+
Reporter

Comment 23

6 years ago
I'll check this in once we have sign off from releng.
Reporter

Updated

6 years ago
Blocks: 884247
(In reply to Jim Mathies [:jimm] from comment #23)
> I'll check this in once we have sign off from releng.

What sign off do you need, Jim? 

I'm fine with 'FAIL-SHOULD-RETRY' if that's what you're looking for.
Reporter

Comment 25

6 years ago
(In reply to Chris Cooper [:coop] from comment #24)
> (In reply to Jim Mathies [:jimm] from comment #23)
> > I'll check this in once we have sign off from releng.
> 
> What sign off do you need, Jim? 
> 
> I'm fine with 'FAIL-SHOULD-RETRY' if that's what you're looking for.

Sounds good, I'll check it in. Can someone from releng pick up step 3 in comment 19?
Comment on attachment 762282 [details] [diff] [review]
mozharness retry example

Checked in, with new retry string.
https://hg.mozilla.org/build/mozharness/rev/5c195cb35e14
Not live yet.  I can transplant to production if we need this sooner, or wait til :edmorley is comfy with http://hg.mozilla.org/build/mozharness/rev/7a39cb9045f9 and merge everything at once.
Attachment #762282 - Flags: checked-in+
Made them unstarrable, since tbpl doesn't know that FAIL-SHOULD-RETRY has any bearing on results, so at least you quieted me down about them :)
The mozharness portion is in production.
Reporter

Comment 31

6 years ago
This doesn't seem to be working. From a recent build on inbound:

https://tbpl.mozilla.org/?tree=Mozilla-Inbound&showall=1&jobname=winnt%206.2&rev=5b3196ad66f4

15:52:42 CRITICAL -  FAIL-SHOULD-RETRY | metrotestharness.exe | ActivateApplication result 80270254
15:52:42     INFO -  INFO | metrotestharness.exe | Deleting C:\slave\test\build\application\firefox\tests.ini
15:52:42  WARNING -  TEST-UNEXPECTED-FAIL | automation.py | Exited with code 4 during test run
15:52:42     INFO -  INFO | automation.py | Application ran for: 0:00:00.385000
15:52:42     INFO -  INFO | zombiecheck | Reading PID log: c:\users\cltbld~1.t-w\appdata\local\temp\tmprgt87ipidlog
15:52:43     INFO -  SUCCESS: The process with PID 3236 has been terminated.
15:52:43     INFO -  ERROR: The process with PID 1572 could not be terminated.
15:52:43     INFO -  Reason: There is no running instance of the task.
15:52:43     INFO -  SUCCESS: The process with PID 3956 has been terminated.
15:52:43     INFO -  WARNING | leakcheck | refcount logging is off, so leaks can't be detected!
15:52:43     INFO -  INFO | runtests.py | Running tests: end.
15:52:43    ERROR - Return code: 4
15:52:43     INFO - TinderboxPrint: mochitest-mochitest-metro-chrome<br/><em class="testfail">T-FAIL</em>
15:52:43 CRITICAL - # TBPL WARNING #
15:52:43 CRITICAL - The mochitest suite: mochitest-metro-chrome ran with return status: WARNING
15:52:43     INFO - Copying logs to upload dir...
15:52:43     INFO - mkdir: C:\slave\test\build\upload\logs

The FAIL-SHOULD-RETRY seems to get picked up as a CRITICAL error, but the run doesn't restart.
We used to either a) not use worst_level, or b) use worst_level against the keys of an unordered dict.

Now we're using worst_level in all self.tbpl_status setting locations, and using an ordered tuple.
Attachment #765138 - Flags: review?(rail)

Updated

6 years ago
Attachment #765138 - Flags: review?(rail) → review+
This should be fixed.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Reporter

Updated

6 years ago
Whiteboard: [leave-open]
Product: mozilla.org → Release Engineering
OS: Windows 8 Metro → Windows 8.1
Component: General Automation → General
Product: Release Engineering → Release Engineering
You need to log in before you can comment on or make changes to this bug.