Closed Bug 1167467 Opened 8 years ago Closed 7 years ago

release-runner and automated reconfigs can clash

Categories

(Release Engineering :: Release Automation: Other, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: coop)

Details

Attachments

(1 file)

For Firefox 38.0.5 build2 we had a case of bad timing. Release runner (r-r) was attempting a reconfig at 2pm, just as the hourly reconfig check was also running. My theory is that the latter created a lockfile, and we hit this in r-r:
 [FAIL] lockfile (reconfig.lock) found in buildbot-master81.bb.releng.scl3.mozilla.com:/builds/buildbot/build_scheduler
 [FAIL] lockfile (reconfig.lock) found in buildbot-master81.bb.releng.scl3.mozilla.com:/builds/buildbot/tests_scheduler
Also hit this on bm71 and bm74. r-r carried on to do the sendchange, which then had no effect because the build scheduler thought it was still doing 38.0.1.

There should probably be a retry around getting the lockfile, with a bit of a sliding backoff, and then a good message if that still fails.
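A minimal sketch of the retry suggested above, written in Python 3 for illustration. The names `acquire_with_backoff`, `LockfileBusy`, and `try_acquire`, and all the default values, are assumptions made up for this example; they are not taken from the tools repo.

```python
import time


class LockfileBusy(Exception):
    """Hypothetical error: the reconfig lockfile is still held."""


def acquire_with_backoff(try_acquire, attempts=5, sleeptime=60,
                         factor=2, max_sleeptime=300, sleep=time.sleep):
    """Call try_acquire() until it returns True, waiting longer each time.

    Returns the number of the attempt that succeeded. All parameter
    names and defaults are illustrative assumptions.
    """
    delay = sleeptime
    for attempt in range(1, attempts + 1):
        if try_acquire():
            return attempt
        if attempt < attempts:
            # Back off before the next try, capped at max_sleeptime.
            sleep(delay)
            delay = min(delay * factor, max_sleeptime)
    # Out of attempts: fail loudly with a message a human can act on.
    raise LockfileBusy(
        "reconfig.lock still present after %d attempts; "
        "another reconfig may be stuck" % attempts)
```

Passing `sleep` in as a parameter keeps the sketch testable without real waiting; a caller would normally leave it at `time.sleep`.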

release-runner should not continue to the sendchange if the reconfig has failed on some masters. In this case it was just a delay, but if the scheduler had reconfiged and some build masters had not then we'd be in a bad state. We need to find out if r-r is not paying attention to the result from fabric, or if fabric is not reporting it. Bonus points for not needing to do build N+1 to recover.
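One way the "don't sendchange after a failed reconfig" rule could look, as a rough Python 3 sketch. The function `reconfig_then_sendchange` and its `reconfig` and `sendchange` callables are hypothetical stand-ins, not release-runner's actual code; `reconfig(master)` is assumed to raise on failure.

```python
def reconfig_then_sendchange(masters, reconfig, sendchange):
    """Reconfig every master; only send the change if all succeeded.

    `reconfig(master)` is assumed to raise on failure, and `sendchange()`
    stands in for whatever triggers the release builds.
    """
    failed = {}
    for master in masters:
        try:
            reconfig(master)
        except Exception as e:
            # Keep going so the final message names every failed master.
            failed[master] = str(e)
    if failed:
        details = "; ".join("%s: %s" % (m, failed[m]) for m in sorted(failed))
        raise RuntimeError("reconfig failed, not sending change (%s)" % details)
    sendchange()
```

This only blocks the sendchange. It does not address the mixed state where the scheduler has reconfiged and some build masters have not.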
release-runner builds its own reconfig command, so it's not respecting the actions that already check for the lockfile:

https://hg.mozilla.org/build/tools/file/c643b28eae40/lib/python/util/fabric/common.py#l58
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: -- → P2
The log in comment #0 is from release-runner, so that fabric call must be paying attention to the lockfile?
(In reply to Nick Thomas [:nthomas] from comment #2)
> The log in comment #0 is from release-runner, so that fabric call must be
> paying attention to the lockfile?

Yes, it's building a call to manage_master.py which does obey the lockfile, but doesn't properly raise an exception when it finds one. Patch incoming.
I didn't change the retry intervals here. With the default sleep-time and back-off, we're looking at 1+2+4+5+5=17 minutes of retries which *should* be enough time for a build master to finish an in-progress reconfig. 

Let me know if you don't agree and I can adjust those timings.
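For reference, the 17-minute figure can be reproduced with a short Python 3 sketch. The one-minute starting wait, the doubling factor, and the five-minute cap are inferred from the sum quoted above, not read from the tools repo, and `backoff_waits` is a made-up helper name.

```python
def backoff_waits(sleeptime=1, factor=2, max_sleeptime=5, waits=5):
    """Return the successive waits in minutes (all defaults are assumed)."""
    out = []
    delay = sleeptime
    for _ in range(waits):
        out.append(delay)
        # Grow the wait, but never beyond the cap.
        delay = min(delay * factor, max_sleeptime)
    return out


schedule = backoff_waits()
print(schedule, sum(schedule))  # prints: [1, 2, 4, 5, 5] 17
```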
Attachment #8617481 - Flags: review?(nthomas)
Comment on attachment 8617481 [details] [diff] [review]
[tools] Retry reconfig when called from release-runner; raise an exception when a reconfig lockfile is found

>diff --git a/lib/python/util/fabric/actions.py b/lib/python/util/fabric/actions.py
>         lockfile_check = run('if [ -e %s ]; then echo "lockfile found"; fi' % RECONFIG_LOCKFILE, workdir=master['basedir'])
>         if lockfile_check != "":
>             print FAIL, "lockfile (%s) found in %s:%s" % (RECONFIG_LOCKFILE,
>                                                           master['hostname'],
>                                                           master['basedir'])
>-            return
>+            raise

Doesn't look like a re-raise, so how about something like
    raise Exception("Couldn't get lockfile to reconfig")
release-runner should catch that, and post the explanation into ship-it.
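To illustrate the difference, here is a small Python 3 sketch; the tools repo at the time was Python 2, so treat this as an analogy rather than the patched code. `ReconfigLockError`, the two `check_lockfile_*` functions, and `run_reconfig` are hypothetical names. In Python 3, a bare `raise` with no exception being handled produces a `RuntimeError`, which says nothing about the lockfile and is not something a caller would think to catch.

```python
class ReconfigLockError(Exception):
    """Hypothetical exception type for a held reconfig lockfile."""


def check_lockfile_bare(lockfile_found):
    if lockfile_found:
        raise  # no active exception here, so Python 3 raises RuntimeError


def check_lockfile_explicit(lockfile_found):
    if lockfile_found:
        raise ReconfigLockError("Couldn't get lockfile to reconfig")


def run_reconfig(check, report):
    """Stand-in for release-runner: catch the failure and report it."""
    try:
        check(True)
    except ReconfigLockError as e:
        # `report` stands in for posting the explanation somewhere visible.
        report("reconfig failed: %s" % e)
        return False
    return True
```

With the explicit exception, `run_reconfig` can pass a readable explanation along; with the bare `raise`, the `RuntimeError` escapes the handler untouched.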

The timing for the retry seems fine to me, we can adjust if needed.
Attachment #8617481 - Flags: review?(nthomas) → review+
(In reply to Nick Thomas [:nthomas] from comment #5)
> Doesn't look like a re-raise, so how about something like
>     raise Exception("Couldn't get lockfile to reconfig")
> release-runner should catch that, and post the explanation into ship-it.

That's fair. I was comparing with action_checkconfig(), which actually is a re-raise, so your suggestion makes sense.
Comment on attachment 8617481 [details] [diff] [review]
[tools] Retry reconfig when called from release-runner; raise an exception when a reconfig lockfile is found

Review of attachment 8617481 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/tools/rev/d7cf4f2dc5fc
Attachment #8617481 - Flags: checked-in+
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED