Closed Bug 1256118 Opened 9 years ago Closed 7 years ago

Make master restart script more tolerant of common error conditions

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P5)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: nthomas, Unassigned)

References

Details

(Whiteboard: Buildbot)

The weekly master restarts (bug 1057888) failed this weekend, because the tools repo at dev-master2:~coop/restart-masters/tools didn't have the removal of the Panda masters and hit:

Mar 12 15:02:12 dev-master2.bb.releng.use1.mozilla.com restart_masters.sh: 2016-03-12 15:02:12,746 - INFO - __main__ - Disabling buildbot-master102.bb.releng.scl3.mozilla.com in slavealloc.
Mar 12 15:02:12 dev-master2.bb.releng.use1.mozilla.com restart_masters.sh: Traceback (most recent call last):
Mar 12 15:02:12 dev-master2.bb.releng.use1.mozilla.com restart_masters.sh:   File "./restart_masters.py", line 410, in <module>
Mar 12 15:02:12 dev-master2.bb.releng.use1.mozilla.com restart_masters.sh:     if disable_master(running_buckets[key]):
Mar 12 15:02:12 dev-master2.bb.releng.use1.mozilla.com restart_masters.sh:   File "./restart_masters.py", line 206, in disable_master
Mar 12 15:02:12 dev-master2.bb.releng.use1.mozilla.com restart_masters.sh:     disable_url = furl(slavealloc_api_url + "/" + str(master_ids[master['name']]))
Mar 12 15:02:12 dev-master2.bb.releng.use1.mozilla.com restart_masters.sh: KeyError: 'bm102-tests1-panda'

At this point it had disabled 9 other masters, which duly stopped gracefully but didn't start up again (the script had exited on the error). There are Nagios alerts about '0 processes with command name buildbot' and SNS alerts about stale reconfig locks. The masters affected are:

buildbot-master79.bb.releng.usw2.mozilla.com
buildbot-master86.bb.releng.scl3.mozilla.com
buildbot-master87.bb.releng.scl3.mozilla.com
buildbot-master94.bb.releng.use1.mozilla.com
buildbot-master105.bb.releng.scl3.mozilla.com
buildbot-master108.bb.releng.scl3.mozilla.com
buildbot-master124.bb.releng.use1.mozilla.com
buildbot-master125.bb.releng.usw2.mozilla.com
buildbot-master127.bb.releng.scl3.mozilla.com
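For reference, a minimal sketch of how the disable step could be made tolerant of masters that slavealloc doesn't know about (assuming master_ids is a plain dict keyed by master nickname; the logger and the skip-and-continue behaviour are illustrative rather than the script's actual code, and plain string concatenation stands in for the furl call):

    import logging

    log = logging.getLogger(__name__)

    def disable_master_tolerant(master, master_ids, slavealloc_api_url):
        """Disable a master in slavealloc, skipping names slavealloc has no entry for."""
        master_id = master_ids.get(master['name'])
        if master_id is None:
            # e.g. 'bm102-tests1-panda' after the panda masters were removed
            log.warning("No slavealloc entry for %s, skipping disable", master['name'])
            return False
        disable_url = slavealloc_api_url + "/" + str(master_id)
        # ...the existing disable request would be sent to disable_url here...
        return True

Returning False instead of raising would let the main loop carry on with the remaining masters rather than leaving nine of them shut down and disabled.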
It looks like there are local mods to tools/buildfarm/maintenance/restart_masters.py, and updating it is commented out in the script (dev-master2:~coop/bin/restart_masters.sh), so I'm going to leave that for coop to assess. I did restart the masters with this (although obviously all the other ones are still a week old):

ssh <master>
cd /builds/buildbot/*1* && pwd && rm -v reconfig.lock && make update checkconfig start || tail -F master/twistd.log | grep configuration

Plus re-enable in slavealloc.
Assignee: nobody → coop
There is a patch to remove the panda/foopy code in tools in bug 1186617
(In reply to Nick Thomas [:nthomas] from comment #1)
> It looks like there are local mods to tools/buildfarm/maintenance/restart_masters.py, and updating it is commented out in the script (dev-master2:~coop/bin/restart_masters.sh), so I'm going to leave that for coop to assess. I did restart the masters with this (although obviously all the other ones are still a week old):

Mea culpa. I was testing some changes for bug 1249356 (ironically, by using the panda masters) and forgot to remove the code before the weekend. I'll get the patch reviewed/landed this week.
Alerts are alerting again today. How's that patch coming?
Flags: needinfo?(coop)
(In reply to Wes Kocher (:KWierso) from comment #4)
> Alerts are alerting again today. How's that patch coming?

The only master I'm seeing alerting is http://buildbot-master116.bb.releng.usw2.mozilla.com/
From a very quick look in papertrail it looks like bm116 shut down in just a few seconds, confusing the script. Started it back up again, cleared the reconfig.lock, and re-enabled in slavealloc. Ran nicely for everything else \o/
(In reply to Wes Kocher (:KWierso) from comment #4)
> Alerts are alerting again today. How's that patch coming?

Some of the logging changes already went in, which is why Nick could see what he did in papertrail in comment #6. I can revisit the script logic itself and try to add some resiliency. One change I could make would be to track the PID of the existing master process to better determine when we've restarted. I could also modify the hourly reconfig check to restart the master if it's supposed to be running and isn't.
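A minimal sketch of that PID-tracking idea, assuming the master records its PID in master/twistd.pid and that the check can run on the master host (or through whatever remote-command helper the script already uses); the function names are hypothetical, not restart_masters.py's actual API:

    import os
    import time

    def read_master_pid(master_dir):
        """Return the PID from twistd.pid if that process exists, else None."""
        pid_file = os.path.join(master_dir, "master", "twistd.pid")
        try:
            with open(pid_file) as f:
                pid = int(f.read().strip())
        except (IOError, ValueError):
            return None
        try:
            os.kill(pid, 0)  # signal 0 only tests that the process is alive
        except OSError:
            return None
        return pid

    def wait_for_restart(master_dir, old_pid, timeout=1800, poll=30):
        """Poll until a master comes back with a PID different from old_pid."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            new_pid = read_master_pid(master_dir)
            if new_pid is not None and new_pid != old_pid:
                return True  # a new buildbot process is up
            time.sleep(poll)
        return False

Comparing PIDs rather than just checking that something is running would also cover the bm116 case in comment #6, where the master went down so quickly that the script got confused.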
Flags: needinfo?(coop)
Summary: Master restart failures March 13th → Make master restart script more tolerant of common error conditions
See Also: → 1263414
13 masters were found disabled in slavealloc today, causing buildbot master lag alerts by loading up the remaining masters in each pool.
(In reply to Nick Thomas [:nthomas] from comment #8)
> 13 masters were found disabled in slavealloc today, causing buildbot master lag alerts by loading up the remaining masters in each pool.

For reference:

11:52 <aselagea|buildduty> catlee: with respect to the disabled bb masters, there are 13 at the moment (not counting bm69 which is disabled by bug 1021086)
11:52 <aselagea|buildduty> tests-use1-linux64: 6, tests-use1-linux32: 2, tests-usw2-linux32: 1, tests-usw2-linux64: 3, try-usw2: 1

Alin said he didn't think these were all necessarily caused by the master shutdown script just based on uptime, but that's the most reasonable explanation. Handing this off to Andrei, who has been looking into bug 1275428.
Assignee: coop → andrei.obreja
The list of robustness improvements I've wanted to make:

* track the buildbot master PID so we can tell when it's down and when it's restarted
* use the master PID to verify the master has really restarted, rather than just assuming
* track how long we have been waiting for a given master to shut down so we can do a forced shutdown after some interval (see the sketch after this list). This happens a lot on linux test masters where we end up with no jobs running, but *something* prevents the master from shutting down. Our longest legitimate job is/was >4hr, so if we set a hard limit at 5hr and then issued a 'make stop', we could avoid the need for manual intervention in these cases.
* if we're tracking duration, we could conceivably check the buildslaves?no_builders=1 page to see how many jobs are still running. This is expensive while lots of jobs are still running, so we might want to wait >1hr after triggering shutdown before trying this so most jobs will have already drained. This might still be too expensive to do, and would rely on parsing the web page anyway. Just an idea, we don't necessarily need to go with this.
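A rough sketch of the forced-shutdown idea above; the 5hr limit mirrors the "longest legitimate job is >4hr" reasoning, and is_still_running / run_make_target are hypothetical stand-ins for whatever the script already uses to check and control a master:

    import time

    SHUTDOWN_DEADLINE = 5 * 3600  # hard limit; longest legitimate job is/was >4hr

    # master name -> time we issued the graceful shutdown
    shutdown_started = {}

    def check_stuck_shutdown(master, is_still_running, run_make_target):
        """Escalate to 'make stop' if a graceful shutdown has stalled too long."""
        name = master['name']
        if not is_still_running(master):
            # Master is down; forget any pending deadline for it.
            shutdown_started.pop(name, None)
            return
        started = shutdown_started.setdefault(name, time.time())
        if time.time() - started > SHUTDOWN_DEADLINE:
            # Past the hard limit with the master still up: force it down.
            run_make_target(master, "stop")

This would need to be called from whatever loop already polls the masters while waiting for them to drain.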
Note explaining the priority level: P5 doesn't mean we've lowered the priority; quite the contrary. We're aligning these levels with the quarterly deliverables, where P1-P3 are taken by our daily waterline KTLO operational tasks.
Priority: -- → P5
Assignee: aobreja → nobody
Bulk change of QA Contact to :jlund, per https://bugzilla.mozilla.org/show_bug.cgi?id=1428483
QA Contact: bugspam.Callek → jlund
As we are moving away from BB and we only need to run this script every ~20 days, I will close this bug for now. Please reopen and needinfo me if you think this should not be the case.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Whiteboard: Buildbot
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard