Closed Bug 657835 Opened 13 years ago Closed 12 years ago

Watch for and re-try intermittent oranges from autolanded patches on try before pushing to $branch

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lsblakk, Assigned: mjessome)

References

Details

Attachments

(3 files, 6 obsolete files)

Watch results coming back with one or two oranges (allow for settin a threshold) and re-trigger - watch for the second set of results - to try and catch intermittent/random oranges on a branch push.

This script/tool/bot needs to then backout patches from $branchname where even with a rebuild the result is still orange

I'm marking this as dependent on the results polling in bug 430942 for now - this will be one of the last parts of the first implementation of Assisted/Auto-Landing on branches through Bugzilla.
Adjusting this bug to reflect a more recent take on what we're going to try and do here.

The proposed workflow is to have:
* per-branch threshold of MAX_ORANGES and then a passable amount of oranges after retrying the builds again.
    eg: 5/2 -- If a build has 5 or less oranges, retry each orange and wait for the second round of results. If there are now 2 or less oranges (ie: some rebuilds went green) - we can go ahead and push to branch, otherwise fail the build and print comment to bug with results.

* adding --retry-oranges to the commit message (and thus, try syntax) so that the autoland script knows when to check for oranges and do retrying (ie: we do not want this on by default for every push to try with --post-to-bugzilla which also calls the same script, schedulerdbpoller)

This bug does not cover future plans to do better handling of oranges in ways other than retrying and comparing results.  So, explicit non-goal is starring or reading from orangefactor db - those will be other projects, other bugs.
Summary: Watch for and re-try intermittent oranges from autolanding patches on $branchname → Watch for and re-try intermittent oranges from autolanded patches on try before pushing to $branch
Attached patch Retry Oranges v1 (obsolete) — Splinter Review
Unnecessary to look at and pair warnings with their retries, since OrangeFactorHandling will only be called if all builds are in a complete state. This means that everything can be calculated to figure out 1. if it is a retried run or not, and 2. if the retried run was successful or not.
Attachment #615331 - Flags: feedback?(lsblakk)
Attached patch Retry Oranges - Tests v1 (obsolete) — Splinter Review
Tests to match. Note that it also includes some trivial rename fixes on function calls.
The relevant tests start at line 382 of the patch.
Attachment #615334 - Flags: feedback?(lsblakk)
Need a way of handling per-branch thresholds.
Since schedulerDbPoller should be running somewhat independently from autoland, I feel it inappropriate to have sdbpoller check the autoland databse for branch information. I also feel like it is a lot of maintenance to keep a separate information store for sdbpoller.
Perhaps --retry-oranges could take an argument, --retry-orages N, where N is the MAX_ORANGES value for that run?
Comment on attachment 615331 [details] [diff] [review]
Retry Oranges v1

Looks good - the only thing missing (from what I can tell) is that you return a final_status of RETRYING but that never seems to get checked for anywhere to kick that revision out of the processing loop to wait for its return later when retries are done.  Also I like your idea of adding the threshold to the --retry-oranges flag, let's go with that for now until the unknown future autoland can integrate with [orangefactor]
Attachment #615331 - Flags: feedback?(lsblakk) → feedback+
Comment on attachment 615334 [details] [diff] [review]
Retry Oranges - Tests v1

Tests look good (as do the trivial changes)
Attachment #615334 - Flags: feedback?(lsblakk) → feedback+
Attached patch Retry Oranges v2 (obsolete) — Splinter Review
Putting these up for review. We need to think of some way that we can test this properly.
Attachment #615331 - Attachment is obsolete: true
Attachment #616120 - Flags: review?(lsblakk)
Comment on attachment 616120 [details] [diff] [review]
Retry Oranges v2

cancelling review, realized there was a small mistake in there.
Attachment #616120 - Flags: review?(lsblakk)
Attached patch Retry Oranges v3 (obsolete) — Splinter Review
Fixed a little mistake getting an argument for --retry-orange flag.
Attachment #616120 - Attachment is obsolete: true
Attachment #616191 - Flags: review?(lsblakk)
Attachment #615334 - Attachment is obsolete: true
Attachment #616193 - Flags: review?(lsblakk)
Comment on attachment 616191 [details] [diff] [review]
Retry Oranges v3

Review of attachment 616191 [details] [diff] [review]:
-----------------------------------------------------------------

Just one fix (and then a few nits) needed for the OrangeFactor portion but also for consistency, since you're taking out many of the self.verbose == True messages please go ahead and remove them from the whole script.

::: schedulerDBpoller.py
@@ +129,5 @@
> +            log.debug("Complete, all builds successful.")
> +            is_complete = True
> +            final_status = "SUCCESS"
> +        elif results['warnings'] <= max_orange:
> +            log.debug("Complete, have %d warnings with threshold %s."

nit: be more explicit in the log comment that we're under the set threshold:  "Complete, %d warnings are less than %d threshold"

@@ +135,5 @@
> +            is_complete = True
> +            final_status = "SUCCESS"
> +        elif results['total_builds'] == results['success'] + results['warnings']:
> +            print "Have only warnings and success. Retry case."
> +            log.debug("Incomplete, have warnings.")

remove print and add to the log message that this is where we check if this is a retry case

@@ +158,5 @@
> +
> +            retry_count = len(duplicates)
> +
> +            if retry_count*2 >= results['warnings']:
> +                # this is a finished retry run

don't need this comment since the log statement is the same on next line

@@ +169,4 @@
>                  else:
> +                    final_status = "FAILURE"
> +            elif results['warnings'] > max_orange:
> +                print "Over max orange, trigger retry"

nit: 'triggering retries' (plural is more accurate to what we're doing here)

@@ +280,5 @@
> +                                comments.split('--retry-orange') if x]
> +                        if not len(max_orange) > 1:
> +                            max_orange = MAX_ORANGE
> +                            continue
> +                        print max_orange

nit: remove print, I suspect we don't need it beyond debugging

@@ +471,2 @@
>                  is_complete, status['status_string'] = \
>                          self.OrangeFactorHandling(buildrequests)

should pass max_orange to OrangeFactorHandling, yes?
Attachment #616191 - Flags: review?(lsblakk) → review-
Attached patch Retry Oranges v4 (obsolete) — Splinter Review
Addressed changes from v3.
Attachment #616191 - Attachment is obsolete: true
Attachment #616292 - Flags: review?(lsblakk)
Comment on attachment 616193 [details] [diff] [review]
Retry Oranges - Tests v3

tests++
Attachment #616193 - Flags: review?(lsblakk) → review+
Comment on attachment 616292 [details] [diff] [review]
Retry Oranges v4

Review of attachment 616292 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with print statements removed

::: schedulerDBpoller.py
@@ +146,5 @@
>                  else:
>                      buildernames[br['buildername']].append(
>                              (br['results_str'], br['branch'], br['bid']))
> +
> +            print "buildernames: %s" % (buildernames)

sorry, must have missed this the first time - please remove print statement

@@ +159,5 @@
> +
> +            if retry_count*2 >= results['warnings']:
> +                log.debug("Finished retry run")
> +                is_complete = True
> +                print "Warnings: %s" % (results['warnings'])

same here

@@ +169,5 @@
> +            elif results['warnings'] > max_orange:
> +                log.debug("Over max orange, trigger retries")
> +                # attempt rebuilds
> +                for info in buildernames.values():
> +                    print info

and here

@@ +278,5 @@
> +                                comments.split('--retry-orange') if x]
> +                        if not len(max_orange) > 1:
> +                            max_orange = MAX_ORANGE
> +                            continue
> +                        print max_orange

and here
Attachment #616292 - Flags: review?(lsblakk) → review+
Attached patch Retry Oranges v5 (obsolete) — Splinter Review
Removed all prints, can't believe I missed those.
Attachment #616292 - Attachment is obsolete: true
Attachment #616303 - Flags: review?(lsblakk)
Comment on attachment 616303 [details] [diff] [review]
Retry Oranges v5

Review of attachment 616303 [details] [diff] [review]:
-----------------------------------------------------------------

all good.
Attachment #616303 - Flags: review?(lsblakk) → review+
I've tested this out by doing a few "controlled" dry runs. Haven't done anything that uses --retry-orange, but the tests show that it works. I had to make a few minor changes, will post a v6 soon.

Any suggestions on getting a --retry-orange job running? If I push to try with --retry-orange, I could end up waiting and getting an exception or something non-orange, which wouldn't be so nice.

I was also thinking, should there be a maximum number of oranges to retry? say we have 15 builds run, 12 come back orange... chances are that it's no good & we're wasting resources.
Attached patch Retry Oranges v6Splinter Review
This contains changes required while staging. Just a few logging changes, took out an incorrect variable usage with buildapi retry request, moved a dry-run check, changed a few log.debug to log.warn, fixed a config=>configs variable usage, and fixed --verbose toggle for DEBUG loglevel.
Attachment #616303 - Attachment is obsolete: true
Attachment #617055 - Flags: review?(lsblakk)
A bit of a strange question, but do if we're making the try syntax widely available, do we want --retry-orange or --retry-oranges ?
Talked with lsblakk, going with --retry-oranges, plural is more descriptive to the function.
Comment on attachment 617055 [details] [diff] [review]
Retry Oranges v6


>-        return push_type
>+                    if '--retry-orange' in comments:
>+                        push_type = "RETRY"

should be if '--retry-oranges'

>+                        # eliminate any empty strings from the split
>+                        max_orange = [x for x in
>+                                comments.split('--retry-orange') if x]

same here

r+ with those changes addressed.
Attachment #617055 - Flags: review?(lsblakk) → review+
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Attachment #619096 - Flags: review?(rail)
Attachment #619096 - Flags: review?(rail) → review+
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: