Closed Bug 565397 Opened 14 years ago Closed 14 years ago

Idle slaves should reboot after 6 hours, so they get puppet/opsi changes

Categories

(Release Engineering :: General, defect, P2)

x86
All
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: dustin)

References

Details

(Whiteboard: [automation][puppet][opsi])

Attachments

(1 file)

RelEng slaves are configured to check puppet/opsi on reboot. All good. However, during a recent rollout of puppet changes, we discovered a side effect of this: any slaves that had been idle for days had not rebooted, and hence had not picked up the config-management changes. This meant that enabling puppet builds in production required us to manually verify that all slaves (including idle slaves) had rebooted and picked up the changes before we could start doing android builds in production. We should force slaves, even if they have been idle, to reboot every 6? 12? hours. This would ensure that every slave checks for updates, and would simplify future rollouts.
Whiteboard: [automation][puppet][opsi]
(from bug 576158) In support of our mobile test pool I wrote a script that reboots a slave after 10 hours of uptime, or if a job has been running for more than 10 hours. It is located at http://hg.mozilla.org/build/tools/file/10b1787feb65/buildfarm/mobile/n900-imaging/rootfs/bin/uptime-check.py

The script currently runs only on Linux, but porting it to another platform should mostly be a matter of finding a source of system uptime information. We'd need to make the first step of *any and every* build on these systems be copying the uptime information to the offset file. The script calls out to reboot-user, which cleanly reboots the machine (shutting down the buildslave first); reboot-user does have some mobile specifics that could be removed.

This script is running in production in our N900 pool and is reliably rebooting our slaves after 10 hours of inactivity. It is also tested to respect the offset file.

This would mean that after deploying a puppet change, we know that after 10 hours every slave is either up to date (because it has rebooted and is back in the pool) or is not yet up to date (because puppet is running or is about to run). This depends on making sure that the buildbot slave is only started after a successful puppet run.
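For illustration, here is a minimal sketch of the same idea rather than the actual uptime-check.py; the 10-hour threshold, the offset-file convention, and the reboot-user helper come from the comment above, while the offset-file path and everything else are assumptions:

    #!/usr/bin/env python
    # Sketch: reboot an idle slave once system uptime exceeds a threshold,
    # unless the offset file shows that a build has touched the machine
    # more recently.
    import subprocess

    MAX_IDLE_SECONDS = 10 * 3600            # 10 hours, per the comment above
    OFFSET_FILE = '/builds/uptime-offset'   # hypothetical path; rewritten at the start of every build

    def system_uptime():
        # Linux-specific, as noted above; other platforms need another source.
        with open('/proc/uptime') as f:
            return float(f.read().split()[0])

    def last_build_offset():
        # The first step of every build copies the current uptime here.
        try:
            with open(OFFSET_FILE) as f:
                return float(f.read().strip())
        except (IOError, ValueError):
            return 0.0

    if __name__ == '__main__':
        if system_uptime() - last_build_offset() > MAX_IDLE_SECONDS:
            # reboot-user stops the buildslave cleanly, then reboots.
            subprocess.call(['reboot-user'])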
Not having this caused some burning yesterday on our Linux slaves. A bunch of idle Linux slaves had not rebooted in a while, and hence had not connected to puppet. This meant they didn't have the updated toolchain, so they broke as predicted when they started building. The fix was to manually reboot them all, which got them the updated toolchain via puppet, and then they built successfully again, as expected.
We have a solution that works and has been tested on Linux (using maemo). This solution carries a slight risk in that a job could be picked up by the slave seconds before the slave is stopped. The failure case is something like:

1. slave starts but remains idle
2. 10 hours pass with the slave idle
3. reboot script triggers and runs "buildbot stop /builds/slave"
4. new job is dispatched to the slave and started
5. buildbot stop kills the slave, and thus the job

As this script is currently configured, we are at risk of this happening for under one minute in 10 hours. With a trivial change to the script, we could reduce our exposure to a couple of seconds in a 36,000-second window (see the sketch below). This does carry risk, but it might carry less risk than not updating to the latest configuration. This code is not currently cross-platform, but could be made so without too much trouble, I think. http://hg.mozilla.org/build/tools/file/10b1787feb65/buildfarm/mobile/n900-imaging/rootfs/bin/uptime-check.py
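One hypothetical way to get from a one-minute window down to a couple of seconds (this is not necessarily the trivial change referred to above, and it reuses the assumed offset-file convention from the earlier sketch): re-read the offset file immediately before stopping the slave, and skip the reboot if a build has started in the meantime.

    # Sketch: last-second re-check before stopping the buildslave.
    import subprocess

    OFFSET_FILE = '/builds/uptime-offset'   # hypothetical path, as above

    def read_offset():
        try:
            with open(OFFSET_FILE) as f:
                return float(f.read().strip())
        except (IOError, ValueError):
            return 0.0

    def uptime():
        with open('/proc/uptime') as f:
            return float(f.read().split()[0])

    baseline = read_offset()
    if uptime() - baseline > 10 * 3600:
        # If a job started while we were deciding, its first step rewrote
        # the offset file, so the value moved and we leave the slave alone.
        if read_offset() == baseline:
            subprocess.call(['buildbot', 'stop', '/builds/slave'])
            subprocess.call(['reboot-user'])

The remaining exposure is only the gap between the final read_offset() call and the buildbot stop.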
Can we have the script post to the master's graceful shutdown page before killing off the slave?
That wouldn't be difficult. We would either have to make that request block until the graceful shutdown completes, or perform subsequent requests to the /buildslave/slavename page to see that the slave has actually shut down.

Another option would be to run buildbot on the slaves in non-daemon mode, in a script like:

    #!/bin/bash
    twistd --no_save --nodaemon -y /builds/slave/buildbot.tac <other options as needed>
    if [[ -f /builds/slave/buildbot.tac.off ]] ; then
        echo 'Not Rebooting'
    else
        sudo reboot
    fi

and have this script do its magic by hitting the graceful shutdown page after a certain number of seconds have elapsed.
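Here is a rough sketch of the polling variant (the master URL, slave name, shutdown endpoint, and "not connected" marker text are all assumptions, not taken from our configs):

    # Sketch: ask the master to gracefully shut this slave down, then poll
    # the slave's status page until it has actually disconnected.
    import time
    import urllib
    import urllib2

    MASTER = 'http://buildmaster.example.com:8010'   # assumed web-status URL
    SLAVE = 'linux-ix-slave01'                       # assumed slave name

    def request_graceful_shutdown():
        # Roughly what pressing "Graceful Shutdown" on the slave's page does;
        # the exact endpoint is an assumption.
        url = '%s/buildslaves/%s/shutdown' % (MASTER, SLAVE)
        urllib2.urlopen(url, data=urllib.urlencode({}))

    def slave_connected():
        # The slave's status page says whether it is connected; the exact
        # marker text is an assumption.
        page = urllib2.urlopen('%s/buildslaves/%s' % (MASTER, SLAVE)).read()
        return 'not connected' not in page.lower()

    if __name__ == '__main__':
        request_graceful_shutdown()
        while slave_connected():
            time.sleep(30)
        # Safe to reboot now: the slave has finished its work and dropped
        # its connection to the master.

The non-daemon wrapper above wouldn't need the polling at all, since twistd exiting is itself the signal to reboot.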
Fixing this will simplify nagios configs and help clean up nagios noise.
jhford: What would it take to rework this script so there is *no* race condition? It would be bad to accidentally kill a real job in production because we thought the slave was idle and needed rebooting.
Assignee: nobody → jhford
Summary: Idle slaves should reboot after 'n' hours, so they get puppet/opsi changes → Idle slaves should reboot after 6 hours, so they get puppet/opsi changes
Blocks: 617166
I think that the best approach here would be to have the buildslave code itself stop and request a reboot in one of two cases:

1. idle for N hours
2. unable to connect to a master for M hours

The slave could do this in a raceless way, since it can shut down its PB connection immediately on determining that it's idle. Both cases, but especially #2, would be a big help with slave allocation, as the reboot would cause the slave to reallocate. Then if we have a master go down, we won't have to resuscitate the stranded slaves.

You'd need to do some futzing with the interactions with runslave.py - that script could potentially wait for the slave to exit, then re-allocate and re-start it, or just reboot. Those are still open questions.
Oh, and note that runslave.py isn't installed on any Windows builders right now, nor on mobile devices. I intend to get it onto Windows, and would like to get it onto mobile too at some point.
I'll work on this (assuming jhford's OK with it..)
Assignee: jhford → dustin
Priority: P5 → P2
Blocks: 629701
Blocks: 627126
So the worry with leaving runslave.py running is that it turns runslave.py into a daemon, which is complicated, and that the daemon may consume resources - memory or CPU - during talos runs.

So the plan is to modify buildbot itself to:
- initiate a graceful shutdown when idle for > N hours - but don't shut down, reboot instead
- monitor itself for disconnectedness, and respond to extended disconnection by rebooting
- take configuration for the above from buildbot.tac

The reboot code should come from build/buildfarm/maintenance/count_and_reboot.py.

This will *not* entail an upgrade of the slave-side buildbot code (which is currently at 0.8.0). I'll add a new branch to build/buildbot to track the version currently deployed on slaves. I'll open new bugs to deploy this via puppet, OPSI, and VNC helper-monkey once I have the buildbot modifications in place.
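A rough sketch of what such a watcher might look like as a twisted service (this is not the eventual patch; the check interval, hook points, and reboot invocation are assumptions, and in particular the "am I busy / am I connected" checks are stubbed out, since hooking those into the buildslave internals is the real work):

    # Sketch: a twisted service that periodically checks the slave's state and
    # reboots the machine when it has been idle or disconnected for too long.
    import os
    import time

    from twisted.application import service
    from twisted.internet import task

    class RebootWatcher(service.Service):
        def __init__(self, buildslave, max_idle_time, max_disconnected_time):
            self.buildslave = buildslave
            self.max_idle_time = max_idle_time
            self.max_disconnected_time = max_disconnected_time
            self.last_active = time.time()
            self.loop = task.LoopingCall(self.check)

        def startService(self):
            service.Service.startService(self)
            self.loop.start(60)      # check once a minute

        def stopService(self):
            if self.loop.running:
                self.loop.stop()
            return service.Service.stopService(self)

        def is_busy(self):
            # Stub: the real thing asks the buildslave about running commands.
            return False

        def is_connected(self):
            # Stub: the real thing checks for a live PB connection to a master.
            return True

        def check(self):
            now = time.time()
            if self.is_busy():
                self.last_active = now
                return
            limit = (self.max_idle_time if self.is_connected()
                     else self.max_disconnected_time)
            if now - self.last_active > limit:
                self.reboot()

        def reboot(self):
            # The plan is to borrow count_and_reboot.py; this is the simplest
            # possible stand-in.
            os.system('sudo reboot')

It would be wired up from buildbot.tac the same way as any other service, with setServiceParent(application).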
So that begins with finding out what version of Buildbot is *currently* installed on all of the slaves. It's labeled "0.8.0", but we need a specific hg revision (assuming it was built from something in hg).
From my look at the source on linux-ix-slave01, talos-r3-snow-001, and talos-r3-w7-001, it looks like we have revision 39e3ca7f3c87 on the talos systems, and an older version on linux-ix-slave01. The differences are in master-side code, so I don't think this is a big problem. I'm going to add a branch named 'slaves' that tracks the buildbot we have installed on slaves, with a tag (SLAVES_0_8_0), and base it on that revision. When I deploy the new version, I'll do it with an _R1 suffix on the SLAVES tag.
OK, here's what I've got so far: http://hg.mozilla.org/users/dmitchell_mozilla.com/buildbot/rev/a4417ba031aa It's just the idleness detection - it doesn't deal with the graceful shutdown part yet - that's next. catlee, any thoughts?
ayust suggests that this would make a good addition to contrib/ in upstream, and I agree - then if others want to use it, they can, and we can merge it if it gets popular. So I'll try to make the rebooting functionality a little more modular to suit other installations.
Blocks: 631851
Depends on: 631849
Blocks: 516808
I name catlee for r?, since he did the slave-side graceful shutdown stuff.

Note that this won't work with 0.7.x masters. Too bad, so sad. On slaves connected to 0.7.x masters, it will log messages about being unable to gracefully shut down, and the slave will stay up.

This runs as a distinct service from buildbot.tac, and only monkey-patches into the buildslave code itself (with the exception of a missing 'return', which I've already committed upstream). The idea is to make it resilient to upstream changes.

To use, just add this to buildbot.tac:

    from buildslave import idleizer
    idlz = idleizer.Idleizer(s,
                             max_idle_time=3600*7,
                             max_disconnected_time=3600*1)
    idlz.setServiceParent(application)

(and slavealloc can handle producing those times - yay!)
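For context, here is roughly where those lines sit in a slave's buildbot.tac. The host, port, credentials, and slave name below are placeholders, and the BuildSlave arguments are just the usual 0.8.0 ones, not anything specific to our configs:

    # Sketch of a buildbot.tac with the idleizer hooked in (placeholder values).
    from twisted.application import service
    from buildslave.bot import BuildSlave
    from buildslave import idleizer

    basedir = '/builds/slave'
    buildmaster_host = 'buildmaster.example.com'   # placeholder
    port = 9010                                    # placeholder
    slavename = 'linux-ix-slave01'                 # placeholder
    passwd = 'pass'                                # placeholder
    keepalive = 600
    usepty = False
    umask = None

    application = service.Application('buildslave')

    s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
                   keepalive, usepty, umask=umask)
    s.setServiceParent(application)

    # The new piece: reboot after 7 idle hours or 1 disconnected hour, with
    # the times eventually supplied by slavealloc.
    idlz = idleizer.Idleizer(s,
                             max_idle_time=3600*7,
                             max_disconnected_time=3600*1)
    idlz.setServiceParent(application)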
Attachment #512535 - Flags: review?(catlee)
Attachment #512535 - Flags: review?(catlee) → review+
landed in c349df349db3 on the slaves branch.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering