Closed
Bug 1167467
Opened 9 years ago
Closed 9 years ago
release-runner and automated reconfigs can clash
Categories
(Release Engineering :: Release Automation: Other, defect, P2)
Release Engineering
Release Automation: Other
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Assigned: coop)
Details
Attachments
(1 file)
2.22 KB, patch | nthomas: review+ | coop: checked-in+
For Firefox 38.0.5 build2 we had a case of bad timing. Release runner (r-r) was attempting a reconfig at 2pm, just as the hourly reconfig check was also running. My theory is that the latter created a lockfile, and we hit this in r-r:

[FAIL] lockfile (reconfig.lock) found in buildbot-master81.bb.releng.scl3.mozilla.com:/builds/buildbot/build_scheduler
[FAIL] lockfile (reconfig.lock) found in buildbot-master81.bb.releng.scl3.mozilla.com:/builds/buildbot/tests_scheduler

Also hit this on bm71 and bm74. r-r carried on to do the sendchange, which then had no effect because the build scheduler thought it was still doing 38.0.1.

There should probably be a retry around getting the lockfile, with a bit of a sliding backoff, and then a good message if that still fails.

release-runner should not continue to the sendchange if the reconfig has failed on some masters. In this case it was just a delay, but if the scheduler had reconfiged and some build masters had not then we'd be in a bad state. We need to find out if r-r is not paying attention to the result from fabric, or if fabric is not reporting it.

Bonus points for not needing to do build N+1 to recover.
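The two requests above (retry around the lockfile with a sliding backoff, and no sendchange unless every master reconfigured) could be sketched roughly as follows. This is a minimal illustration, not the release-runner code: `LockfileError`, `reconfig_with_retry`, `reconfig_all_then_sendchange`, and the injected `reconfig`, `sendchange`, and `sleep` callables are all hypothetical names, and the delay values are placeholders.

```python
import time


class LockfileError(Exception):
    """Hypothetical error: a reconfig lockfile was still present on a master."""


def reconfig_with_retry(reconfig, master, attempts=5, sleeptime=60,
                        max_sleeptime=300, sleep=time.sleep):
    """Call reconfig(master), retrying with a doubling delay capped at max_sleeptime."""
    for attempt in range(1, attempts + 1):
        try:
            return reconfig(master)
        except LockfileError:
            if attempt == attempts:
                raise  # out of retries: let the caller see the failure
            sleep(sleeptime)
            sleeptime = min(sleeptime * 2, max_sleeptime)


def reconfig_all_then_sendchange(masters, reconfig, sendchange, **retry_kwargs):
    """Send the change only if every master reconfigured successfully."""
    failed = []
    for master in masters:
        try:
            reconfig_with_retry(reconfig, master, **retry_kwargs)
        except LockfileError as e:
            failed.append((master, str(e)))
    if failed:
        # Abort before the sendchange and say which masters were stuck.
        raise RuntimeError("reconfig failed, not sending change: %r" % failed)
    sendchange()
```

Passing `sleep` in as a parameter is only there so the backoff can be exercised without real waiting; the key point is that a lockfile failure surfaces as an exception instead of a silent return.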
Comment 1 • 9 years ago (Assignee)
release-runner builds its own reconfig command, so it's not respecting the actions that already check for the lockfile: https://hg.mozilla.org/build/tools/file/c643b28eae40/lib/python/util/fabric/common.py#l58
Updated • 9 years ago (Assignee)
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: -- → P2
Comment 2 • 9 years ago (Reporter)
The log in comment #0 is from release-runner, so that fabric call must be paying attention to the lockfile?
Comment 3 • 9 years ago (Assignee)
(In reply to Nick Thomas [:nthomas] from comment #2)
> The log in comment #0 is from release-runner, so that fabric call must be
> paying attention to the lockfile ?

Yes, it's building a call to manage_master.py, which does obey the lockfile but doesn't properly raise an exception when it finds one. Patch incoming.
Comment 4 • 9 years ago (Assignee)
I didn't change the retry intervals here. With the default sleep-time and back-off, we're looking at 1+2+4+5+5=17 minutes of retries which *should* be enough time for a build master to finish an in-progress reconfig. Let me know if you don't agree and I can adjust those timings.
Attachment #8617481 -
Flags: review?(nthomas)
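The 17-minute figure in comment 4 can be reproduced with a short calculation. The sketch below assumes a 1-minute initial sleep that doubles after each attempt and is capped at 5 minutes, with five waits in total; those parameters are inferred from the sum quoted in the comment, not read from the patch, and `backoff_schedule` is a hypothetical helper.

```python
def backoff_schedule(waits, sleeptime=1, max_sleeptime=5, factor=2):
    """Return the delays (in minutes) between successive attempts."""
    delays = []
    for _ in range(waits):
        delays.append(sleeptime)
        # Grow the delay, but never beyond the cap (assumed to be 5 minutes).
        sleeptime = min(sleeptime * factor, max_sleeptime)
    return delays


schedule = backoff_schedule(waits=5)
print(schedule, sum(schedule))  # [1, 2, 4, 5, 5] 17
```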
Comment 5 • 9 years ago (Reporter)
Comment on attachment 8617481
[tools] Retry reconfig when called from release-runner; raise an exception when a reconfig lockfile if found

>diff --git a/lib/python/util/fabric/actions.py b/lib/python/util/fabric/actions.py
>     lockfile_check = run('if [ -e %s ]; then echo "lockfile found"; fi' % RECONFIG_LOCKFILE, workdir=master['basedir'])
>     if lockfile_check != "":
>         print FAIL, "lockfile (%s) found in %s:%s" % (RECONFIG_LOCKFILE,
>                                                        master['hostname'],
>                                                        master['basedir'])
>-        return
>+        raise

Doesn't look like a re-raise, so how about something like
  raise Exception("Couldn't get lockfile to reconfig")
release-runner should catch that, and post the explanation into ship-it.

The timing for the retry seems fine to me, we can adjust if needed.
Attachment #8617481 -
Flags: review?(nthomas) → review+
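The review point in comment 5 can be illustrated with a small comparison. This is a sketch in Python 3 (the patch itself is Python 2), and `ReconfigLockfileError`, `check_lockfile_bare_raise`, `check_lockfile_explicit`, and `run_reconfig` are hypothetical names used only for illustration. A bare `raise` outside an `except` block has no active exception to re-raise, so the caller sees an unrelated error instead of an explanation it could post.

```python
class ReconfigLockfileError(Exception):
    """Hypothetical exception type for a lockfile found during reconfig."""


def check_lockfile_bare_raise(lockfile_found):
    if lockfile_found:
        raise  # no active exception here: Python 3 raises RuntimeError


def check_lockfile_explicit(lockfile_found):
    if lockfile_found:
        raise ReconfigLockfileError("Couldn't get lockfile to reconfig")


def run_reconfig(check, lockfile_found):
    """Return None on success, or a message the caller could report."""
    try:
        check(lockfile_found)
    except Exception as e:
        return "%s: %s" % (type(e).__name__, e)
    return None


print(run_reconfig(check_lockfile_explicit, True))
print(run_reconfig(check_lockfile_bare_raise, True))
```

With the explicit exception, the message names the actual problem, which is what a caller such as release-runner would need in order to report it.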
Comment 6 • 9 years ago (Assignee)
(In reply to Nick Thomas [:nthomas] from comment #5)
> Doesn't look like a re-raise, so how about something like
> raise Exception("Couldn't get lockfile to reconfig")
> release-runner should catch that, and post the explanation into ship-it.

That's fair. I was comparing with action_checkconfig(), which actually is a re-raise, so your suggestion makes sense.
Comment 7 • 9 years ago (Assignee)
Comment on attachment 8617481
[tools] Retry reconfig when called from release-runner; raise an exception when a reconfig lockfile if found

Review of attachment 8617481:
https://hg.mozilla.org/build/tools/rev/d7cf4f2dc5fc
Attachment #8617481 -
Flags: checked-in+
Updated • 9 years ago (Assignee)
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED