Closed Bug 1256118 Opened 9 years ago Closed 7 years ago

Make master restart script more tolerant of common error conditions

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P5)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: nthomas, Unassigned)

References

Details

(Whiteboard: Buildbot)

The weekly master restarts (bug 1057888) failed this weekend, because the tools repo at dev-master2:~coop/restart-masters/tools didn't have the removal of the Panda masters and hit:

Mar 12 15:02:12 dev-master2.bb.releng.use1.mozilla.com restart_masters.sh: 2016-03-12 15:02:12,746 - INFO - __main__ - Disabling buildbot-master102.bb.releng.scl3.mozilla.com in slavealloc.
Mar 12 15:02:12 dev-master2.bb.releng.use1.mozilla.com restart_masters.sh: Traceback (most recent call last):
Mar 12 15:02:12 dev-master2.bb.releng.use1.mozilla.com restart_masters.sh:   File "./restart_masters.py", line 410, in <module>
Mar 12 15:02:12 dev-master2.bb.releng.use1.mozilla.com restart_masters.sh:     if disable_master(running_buckets[key]):
Mar 12 15:02:12 dev-master2.bb.releng.use1.mozilla.com restart_masters.sh:   File "./restart_masters.py", line 206, in disable_master
Mar 12 15:02:12 dev-master2.bb.releng.use1.mozilla.com restart_masters.sh:     disable_url = furl(slavealloc_api_url + "/" + str(master_ids[master['name']]))
Mar 12 15:02:12 dev-master2.bb.releng.use1.mozilla.com restart_masters.sh: KeyError: 'bm102-tests1-panda'

At this point it had disabled 9 other masters, which duly stopped gracefully but didn't start up again (the script had exited on the error). There are Nagios alerts about '0 processes with command name buildbot' and SNS alerts about stale reconfig locks. The masters affected are:

buildbot-master79.bb.releng.usw2.mozilla.com
buildbot-master86.bb.releng.scl3.mozilla.com
buildbot-master87.bb.releng.scl3.mozilla.com
buildbot-master94.bb.releng.use1.mozilla.com
buildbot-master105.bb.releng.scl3.mozilla.com
buildbot-master108.bb.releng.scl3.mozilla.com
buildbot-master124.bb.releng.use1.mozilla.com
buildbot-master125.bb.releng.usw2.mozilla.com
buildbot-master127.bb.releng.scl3.mozilla.com
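For reference, a minimal sketch of how the disable step could be made tolerant of masters that slavealloc doesn't know about (assuming master_ids is a plain dict keyed by master nickname; the logger and the skip-and-continue behaviour are illustrative rather than the script's actual code, and plain string concatenation stands in for the furl call):

    import logging

    log = logging.getLogger(__name__)

    def disable_master_tolerant(master, master_ids, slavealloc_api_url):
        """Disable a master in slavealloc, skipping names slavealloc has no entry for."""
        master_id = master_ids.get(master['name'])
        if master_id is None:
            # e.g. 'bm102-tests1-panda' after the panda masters were removed
            log.warning("No slavealloc entry for %s, skipping disable", master['name'])
            return False
        disable_url = slavealloc_api_url + "/" + str(master_id)
        # ...the existing disable request would be sent to disable_url here...
        return True

Returning False instead of raising would let the main loop carry on with the remaining masters rather than leaving nine of them shut down and disabled.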
It looks like there are local mods to tools/buildfarm/maintenance/restart_masters.py, and updating it is commented out in the script (dev-master2:~coop/bin/restart_masters.sh), so I'm going to leave that for coop to assess. I did restart the masters with this (although obviously all the other ones are still a week old):

ssh <master>
cd /builds/buildbot/*1* && pwd && rm -v reconfig.lock && make update checkconfig start || tail -F master/twistd.log | grep configuration

Plus re-enable in slavealloc.
Assignee: nobody → coop
There is a patch to remove the panda/foopy code in tools in bug 1186617
(In reply to Nick Thomas [:nthomas] from comment #1)
> It looks like there are local mods to tools/buildfarm/maintenance/restart_masters.py, and updating it is commented out in the script (dev-master2:~coop/bin/restart_masters.sh), so I'm going to leave that for coop to assess. I did restart the masters with this (although obviously all the other ones are still a week old):

Mea culpa. I was testing some changes for bug 1249356 (ironically, by using the panda masters) and forgot to remove the code before the weekend. I'll get the patch reviewed/landed this week.
Alerts are alerting again today. How's that patch coming?
Flags: needinfo?(coop)
(In reply to Wes Kocher (:KWierso) from comment #4)
> Alerts are alerting again today. How's that patch coming?

The only master I'm seeing alerting is http://buildbot-master116.bb.releng.usw2.mozilla.com/
From a very quick look in papertrail it looks like bm116 shut down in just a few seconds, confusing the script. Started it back up again, cleared the reconfig.lock, and re-enabled in slavealloc. Ran nicely for everything else \o/
(In reply to Wes Kocher (:KWierso) from comment #4)
> Alerts are alerting again today. How's that patch coming?

Some of the logging changes already went in, which is why Nick could see what he did in papertrail in comment #6. I can revisit the script logic itself and try to add some resiliency. One change I could make would be to track the PID of the existing master process to better determine when we've restarted. I could also modify the hourly reconfig check to restart the master if it's supposed to be running and isn't.
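A minimal sketch of that PID-tracking idea, assuming the master records its PID in master/twistd.pid and that the check can run on the master host (or through whatever remote-command helper the script already uses); the function names are hypothetical, not restart_masters.py's actual API:

    import os
    import time

    def read_master_pid(master_dir):
        """Return the PID from twistd.pid if that process exists, else None."""
        pid_file = os.path.join(master_dir, "master", "twistd.pid")
        try:
            with open(pid_file) as f:
                pid = int(f.read().strip())
        except (IOError, ValueError):
            return None
        try:
            os.kill(pid, 0)  # signal 0 only tests that the process is alive
        except OSError:
            return None
        return pid

    def wait_for_restart(master_dir, old_pid, timeout=1800, poll=30):
        """Poll until a master comes back with a PID different from old_pid."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            new_pid = read_master_pid(master_dir)
            if new_pid is not None and new_pid != old_pid:
                return True  # a new buildbot process is up
            time.sleep(poll)
        return False

Comparing PIDs rather than just checking that something is running would also cover the bm116 case in comment #6, where the master went down so quickly that the script got confused.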
Flags: needinfo?(coop)
Summary: Master restart failures March 13th → Make master restart script more tolerant of common error conditions
See Also: → 1263414
13 masters were found disabled in slavealloc today, causing buildbot master lag alerts by loading up the remaining masters in each pool.
(In reply to Nick Thomas [:nthomas] from comment #8)
> 13 masters were found disabled in slavealloc today, causing buildbot master lag alerts by loading up the remaining masters in each pool.

For reference:

11:52 <aselagea|buildduty> catlee: with respect to the disabled bb masters, there are 13 at the moment (not counting bm69 which is disabled by bug 1021086)
11:52 <aselagea|buildduty> tests-use1-linux64: 6, tests-use1-linux32: 2, tests-usw2-linux32: 1, tests-usw2-linux64: 3, try-usw2: 1

Alin said he didn't think these were all necessarily caused by the master shutdown script just based on uptime, but that's the most reasonable explanation. Handing this off to Andrei, who has been looking into bug 1275428.
Assignee: coop → andrei.obreja
The list of robustness improvements I've wanted to make:

* track the buildbot master PID so we can tell when it's down and when it's restarted
* use the master PID to verify the master has really restarted, rather than just assuming
* track how long we have been waiting for a given master to shut down so we can do a forced shutdown after some interval (see the sketch after this list). This happens a lot on linux test masters where we end up with no jobs running, but *something* prevents the master from shutting down. Our longest legitimate job is/was >4hr, so if we set a hard limit at 5hr and then issued a 'make stop', we could avoid the need for manual intervention in these cases.
* if we're tracking duration, we could conceivably check the buildslaves?no_builders=1 page to see how many jobs are still running. This is expensive while lots of jobs are still running, so we might want to wait >1hr after triggering shutdown before trying this so most jobs will have already drained. This might still be too expensive to do, and would rely on parsing the web page anyway. Just an idea, we don't necessarily need to go with this.
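A rough sketch of the forced-shutdown idea above; the 5hr limit mirrors the "longest legitimate job is >4hr" reasoning, and is_still_running / run_make_target are hypothetical stand-ins for whatever the script already uses to check and control a master:

    import time

    SHUTDOWN_DEADLINE = 5 * 3600  # hard limit; longest legitimate job is/was >4hr

    # master name -> time we issued the graceful shutdown
    shutdown_started = {}

    def check_stuck_shutdown(master, is_still_running, run_make_target):
        """Escalate to 'make stop' if a graceful shutdown has stalled too long."""
        name = master['name']
        if not is_still_running(master):
            # Master is down; forget any pending deadline for it.
            shutdown_started.pop(name, None)
            return
        started = shutdown_started.setdefault(name, time.time())
        if time.time() - started > SHUTDOWN_DEADLINE:
            # Past the hard limit with the master still up: force it down.
            run_make_target(master, "stop")

This would need to be called from whatever loop already polls the masters while waiting for them to drain.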
Note explaining the priority level: P5 doesn't mean we've lowered the priority; quite the contrary. We're aligning these levels with the quarterly deliverables, where P1-P3 are taken by our daily waterline KTLO operational tasks.
Priority: -- → P5
Assignee: aobreja → nobody
Bulk change of QA Contact to :jlund, per https://bugzilla.mozilla.org/show_bug.cgi?id=1428483
QA Contact: bugspam.Callek → jlund
As we are moving away from BB and we only need to run this script every ~20 days, I will close this bug for now. Please reopen and needinfo me if you think this should not be the case.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Whiteboard: Buildbot
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard