Bug 940699 (Closed): Opened 9 years ago, Closed 9 years ago

Trigger a retry for metro talos suites that fail to initialize metro browser

Categories: Release Engineering :: General, defect
Platform: x86 Windows 8
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: jlund; Assigned: jlund
Attachments: 1 file, 2 obsolete files

No description provided.
In general I don't like this. I would love to figure out the root cause, but the time I've spent on it so far hasn't got me there.

If we can get an auto-retry added in, then we will be able to see other issues.  How do we determine the retry?  Is this a return code from Talos?  A specific error message output that some log parser interprets?

Please let me know if we need to adjust talos for this.
Any time the slave config isn't working right, the harness will pick up the error and dump a note in the output:

12:15:01     INFO -  INFO | metrotestharness.exe | Launching browser...
12:15:01     INFO -  INFO | metrotestharness.exe | App model id='E4CFE2E6B75AA3A3'
12:15:01     INFO -  INFO | metrotestharness.exe | Harness process id: 3860
12:15:01     INFO -  INFO | metrotestharness.exe | Using bin path: 'C:\slave\test\build\a..
12:15:01     INFO -  INFO | metrotestharness.exe | Writing out tests.ini to: 'C:\slave\..
12:15:01     INFO -  FAIL-SHOULD-RETRY | metrotestharness.exe | ActivateApplication result 80270254
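The failure line above is what the log parser keys off. A minimal sketch of that idea (hypothetical; the real mozharness OutputParser matches against an error_list of regexes rather than a single pattern):

```python
import re

# Marker the harness emits when the launch failed in a retryable way.
RETRY_PATTERN = re.compile(r"FAIL-SHOULD-RETRY")

def needs_retry(log_lines):
    """Return True if any log line carries the harness retry marker."""
    return any(RETRY_PATTERN.search(line) for line in log_lines)

log = [
    "INFO | metrotestharness.exe | Launching browser...",
    "FAIL-SHOULD-RETRY | metrotestharness.exe | ActivateApplication result 80270254",
]
print(needs_retry(log))  # True
```

When a line like this is seen, the job's worst status can be escalated to retry instead of failing red.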
I have tried painstakingly hard to reproduce the fail-should-retry scenario. I may have done over 20 sendchanges to no avail. However, it at least does not produce any mozharness errors.

Aki, I don't think we used buildbot_status() for talos before, so this now produces log lines like:
12:08:32     INFO - # TBPL SUCCESS #

I thought this would be OK and the only way to signal a retry?
Attachment #8341241 - Flags: review?(aki)
This fixes the retry source code comments. There were some grammar/logic mistakes.

The diff from previous is:
https://github.com/lundjordan/mozharness/commit/dd243415f41637ae8af2bec288ede284a7bd7f8e
Attachment #8341241 - Attachment is obsolete: true
Attachment #8341241 - Flags: review?(aki)
Attachment #8341251 - Flags: review?(aki)
Comment on attachment 8341251 [details] [diff] [review]
bug_940699_trigger_retry_metro_fail.021213-4.diff

>+        if self.return_code not in [0]:
>+            # update the worst log level and tbpl status
>+            parser.worst_log_level = parser.worst_level(ERROR,
>+                                                        parser.worst_log_level)
>+            parser.worst_tbpl_status = parser.worst_level(
>+                TBPL_FAILURE, parser.worst_tbpl_status,
>+                levels=TBPL_WORST_LEVEL_TUPLE
>+            )
>+        self.buildbot_status(parser.worst_tbpl_status,
>+                             level=parser.worst_log_level)

It's not clear to me why we want to set these two variables (worst_{log_level,tbpl_status}) in the parser object, as opposed to two local variables.
I don't think it's harmful at all, though.
Attachment #8341251 - Flags: review?(aki) → review+
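The worst_level() calls in the quoted diff pick the more severe of two statuses against an ordered tuple. A hypothetical stand-in for the mozharness helper, with assumed level names, to show the semantics:

```python
# Ordered worst-to-best; the real TBPL_WORST_LEVEL_TUPLE in mozharness
# uses its own status constants, these names are illustrative.
TBPL_WORST_LEVEL_TUPLE = ("retry", "failure", "warning", "success")

def worst_level(level, existing, levels=TBPL_WORST_LEVEL_TUPLE):
    """Return whichever of the two levels is more severe per `levels`."""
    return level if levels.index(level) < levels.index(existing) else existing

print(worst_level("failure", "warning"))  # failure
print(worst_level("success", "retry"))   # retry
```

Storing the running worst value on the parser object (as the patch does) means every parsed chunk can escalate, but never downgrade, the final status.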
(In reply to Aki Sasaki [:aki] from comment #6)
> It's not clear to me why we want to set these two variables
> (worst_{log_level,tbpl_status}) in the parser object, as opposed to two
> local variables.
> I don't think it's harmful at all, though.

N/m, I see those already exist.  I should have checked that before replying :)
(In reply to Jordan Lund (:jlund) from comment #3)
> Created attachment 8341241 [details] [diff] [review]
> like with unittests, talos will use the harness retry error pattern:
> FAIL-SHOULD-RETRY and tell buildbot to retry if parsed
> 
> I have tried painstackingly hard to reproduce the fail should retry
> scenario. I may have done over 20 sendchanges to no avail. However, it at
> least does not produce any mozharn errors.
> 
> Aki, I don't think we used buildbot_status() for talos before, so this now
> produces log lines like:
> 12:08:32     INFO - # TBPL SUCCESS #
> 
> I thought this would be OK and the only way to signal a retry?

This depends on what's running the job.
http://hg.mozilla.org/build/buildbotcustom/file/8756b09ddc1c/misc.py#l446 only looks at the exit code, not the string, which is the preferred way because:

* buildbot log parsing is expensive on the master
* truncated logs will not show this string
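In the spirit of that exit-code-only evaluation, here is a hypothetical sketch (not the actual buildbotcustom code) of mapping a harness return code to a buildbot-style status, with 1 treated as a viable "orange" code as discussed below:

```python
SUCCESS, WARNINGS, FAILURE = "success", "warnings", "failure"

def evaluate_exit_code(code, viable=(0, 1)):
    """Map a harness exit code to a buildbot-style status.

    0 is success, 1 (test oranges) is treated as warnings, anything
    else is a failure. A caller could still escalate to retry when
    the in-process parser saw FAIL-SHOULD-RETRY.
    """
    if code == 0:
        return SUCCESS
    if code in viable:
        return WARNINGS
    return FAILURE

print(evaluate_exit_code(0))  # success
print(evaluate_exit_code(1))  # warnings
print(evaluate_exit_code(2))  # failure
```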
As per our email thread, this is the same patch with 1 (orange) included in the list of viable return codes.
Attachment #8341251 - Attachment is obsolete: true
Attachment #8342605 - Flags: review?(aki)
Attachment #8342605 - Flags: review?(aki) → review+
Attachment #8342605 - Flags: checked-in+
In production
I think this is resolved. The fail-should-retry was parsed, and it looks like it took precedence over the return code and triggered a retry:
https://tbpl.mozilla.org/php/getParsedLog.php?id=31752089&tree=Cedar
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
(In reply to Jordan Lund (:jlund) from comment #12)
> I think this is resolved. the fail-should-retry was parsed and looks like it
> persisted over the return code and triggered a retry:
> https://tbpl.mozilla.org/php/getParsedLog.php?id=31752089&tree=Cedar

Hmm, I don't think this is going to accomplish what we want. With mochitests, the retry happens on a new slave, so the test runs can complete. Looks like we are retrying on the same slave three times and then bailing? That won't help, since it's the slave's startup browser config that's hosed.
> Hmm, don't think this is going to accomplish what we want. With mochitests,
> the retry happens on a new slave, so the test runs can complete. Looks like
> we are retrying on the same slave three times then bailing? That won't help
> since it's the slaves startup browser config that's hosed.

Shoot. I'll look into this more. I must be missing something.

I'll explain what I see so we are on the same page: from what I can tell in the log I sent, the metroharness launch is retried three times. I think we have always done this (it was just never highlighted). What I was referring to in that log is how I am parsing those "FAIL-SHOULD-RETRY" lines and marking them as "Critical". I then add log output at the end: "# TBPL RETRY #". Before, jobs in this scenario showed up red in tbpl:
https://tbpl.mozilla.org/php/getParsedLog.php?id=31510103&tree=Cedar

Now when we look at tbpl, this shows that the buildbot build is signaled as a retry, and another build is started. So if you look here:
https://tbpl.mozilla.org/?tree=Cedar&jobname=svgr-metro

In the above, there is a recent blue (critical, retry) build and a green (success) one beside it in the same Cedar push.

Is this what you are seeing and I am still mistaken or missing something?
Ah, my mistake, all I did was look at the log. Tbpl looks right!
Component: General Automation → General