Bug 565397 (Closed)
Opened 14 years ago, Closed 14 years ago
Idle slaves should reboot after 6 hours, so they get puppet/opsi changes
Categories: Release Engineering :: General, defect, P2
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: joduinn, Assigned: dustin
Whiteboard: [automation][puppet][opsi]
Attachments (1 file): patch, 9.85 KB (catlee: review+)
RelEng slaves are configured to check puppet/opsi on reboot. All good.
However, during a recent rollout of puppet changes we discovered a side effect of this: any slaves that had been idle for days had not rebooted, and so had not picked up the config-management changes. This meant that enabling the puppet-managed builds in production required us to manually verify that all slaves (including idle slaves) had rebooted and picked up the changes before we could start doing android builds in production.
We should force slaves to reboot every 6 or 12 hours, even if they have been idle. This would ensure that every slave checks for updates and would simplify future rollouts.
Updated • 14 years ago
Whiteboard: [automation][puppet][opsi]
Comment 2 • 14 years ago
(from bug 576158)
In support of our mobile test pool I wrote a script that reboots a slave after 10 hours of uptime or if a job has been going for more than 10 hours.
It is located at
http://hg.mozilla.org/build/tools/file/10b1787feb65/buildfarm/mobile/n900-imaging/rootfs/bin/uptime-check.py
This script currently runs only on Linux, but porting it to another platform should mostly be a matter of finding a source of system uptime information. We'd need to make the first step of *any and every* build on these systems copy the uptime information to the offset file. The script calls out to reboot-user, which is used to cleanly reboot the machine (shutting down the buildslave first).
reboot-user does have some mobile specifics that could be removed.
This script is running in production in our N900 pool and is reliably rebooting our slaves after 10 hours of inactivity. It is also tested to respect the offset file.
This would mean that after deploying a puppet change, we know that after 10
hours every slave is either up to date (because it has rebooted and is back in
the pool) or is not up to date (because puppet is running or is about to run),
though this depends on making sure that the buildbot slave is only started
after a successful puppet run.
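(For illustration only, a rough sketch of the approach comment 2 describes; this is not the actual uptime-check.py. The offset-file path is an assumption, and reboot-user is the helper mentioned above.)
#!/usr/bin/env python
# Hypothetical sketch: reboot once the machine has been up more than
# 10 hours past the last recorded job start.
import subprocess

MAX_SECONDS = 10 * 3600
OFFSET_FILE = '/builds/uptime-offset'  # assumed path; written by the first step of every build

def current_uptime():
    # Linux-specific, as noted above; other platforms need another uptime source.
    with open('/proc/uptime') as f:
        return float(f.read().split()[0])

def offset():
    try:
        with open(OFFSET_FILE) as f:
            return float(f.read().strip())
    except (IOError, ValueError):
        return 0.0  # no offset recorded yet

if __name__ == '__main__':
    if current_uptime() - offset() > MAX_SECONDS:
        # reboot-user cleanly shuts down the buildslave before rebooting.
        subprocess.check_call(['reboot-user'])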
Reporter
Comment 3 • 14 years ago
Not having this caused some burning yesterday on our Linux slaves.
A bunch of idle Linux slaves had not rebooted in a while, hence had not connected to puppet; this meant they didn't have the updated toolchain, so they broke as predicted when they started building.
The fix was to manually reboot them all, which meant they got updated toolchains via puppet and then built successfully again, as expected.
Comment 4 • 14 years ago
We have a solution that works and has been tested on Linux (using maemo). This solution carries a slight risk in that a job could be picked up by the slave seconds before the slave is stopped.
The failure case is something like:
1. slave starts but remains idle
2. 10 hours passes of slave being idle
3. reboot script triggers and runs buildbot stop /builds/slave
4. new job is dispatched to slave and started
5. buildbot stop kills the slave, and thus, the job
As this script is currently configured, we are at risk of this happening for under one minute in every 10 hours. With a trivial change to the script (sketched below), we could reduce our exposure to a couple of seconds in a 36,000-second window. This does carry risk, but it might carry less risk than not updating to the latest configuration.
This code is not currently cross-platform, but I think it could be made so without too much trouble.
http://hg.mozilla.org/build/tools/file/10b1787feb65/buildfarm/mobile/n900-imaging/rootfs/bin/uptime-check.py
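(A rough sketch of the trivial change mentioned above, under the assumption that the script tracks idleness via the offset file; the paths are hypothetical.)
#!/usr/bin/env python
# Hypothetical sketch: re-check idleness immediately before stopping the
# slave, so the race window shrinks from the polling interval (~1 minute)
# to the few seconds it takes to run 'buildbot stop'.
import os
import subprocess
import time

MAX_IDLE = 10 * 3600
OFFSET_FILE = '/builds/uptime-offset'  # assumed; refreshed at the start of every job

def seconds_since_last_job():
    try:
        return time.time() - os.path.getmtime(OFFSET_FILE)
    except OSError:
        return float('inf')  # no job recorded yet

if __name__ == '__main__':
    if seconds_since_last_job() <= MAX_IDLE:
        raise SystemExit(0)
    # ... the real script would do its other prep work here ...
    # Final re-check immediately before stopping: if a job started since the
    # check above, its first step has refreshed the offset file, so we bail
    # out instead of killing it.
    if seconds_since_last_job() <= MAX_IDLE:
        raise SystemExit(0)
    subprocess.check_call(['buildbot', 'stop', '/builds/slave'])
    subprocess.check_call(['sudo', 'reboot'])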
Comment 5 • 14 years ago
Can we have the script post to the master's graceful shutdown page before killing off the slave?
Comment 6 • 14 years ago
That wouldn't be difficult. We would either have to make the request block until the graceful shutdown completes, or perform subsequent requests to the /buildslave/slavename page to see whether the slave has actually shut down.
Another option would be to run buildbot on the slaves in non-daemon mode from a script like:
#!/bin/bash
# Run the buildslave in the foreground; when twistd exits (e.g. after a
# graceful shutdown), reboot unless the flag file says not to.
twistd --no_save --nodaemon -y /builds/slave/buildbot.tac <other options as needed>
if [[ -f /builds/slave/buildbot.tac.off ]] ; then
  echo 'Not Rebooting'
else
  sudo reboot
fi
And have this script do its magic by hitting the graceful shutdown page after a certain number of seconds have elapsed.
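(A rough sketch of the post-and-poll approach described above, in Python 2 to match the slave environment of the time. The master URL, the exact shutdown path, and the "not connected" marker on the status page are assumptions about the 0.8 web status, not verified behaviour.)
#!/usr/bin/env python
# Hypothetical sketch: request a graceful shutdown from the master's web
# status, then poll the slave's page until it shows as disconnected.
import subprocess
import time
import urllib2

MASTER = 'http://buildmaster.example.com:8010'  # assumed master web-status URL
SLAVE = 'linux-ix-slave01'

def request_graceful_shutdown():
    # Assumed to be the URL the "Graceful Shutdown" button posts to.
    urllib2.urlopen('%s/buildslaves/%s/shutdown' % (MASTER, SLAVE), data='')

def wait_until_disconnected(timeout=3600, poll=30):
    deadline = time.time() + timeout
    while time.time() < deadline:
        page = urllib2.urlopen('%s/buildslaves/%s' % (MASTER, SLAVE)).read()
        if 'NOT CONNECTED' in page:  # assumed marker text on the status page
            return True
        time.sleep(poll)
    return False

if __name__ == '__main__':
    request_graceful_shutdown()
    if wait_until_disconnected():
        subprocess.check_call(['sudo', 'reboot'])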
Reporter
Comment 7 • 14 years ago
Fixing this will simplify nagios configs, and help cleanup nagios noise.
Blocks: releng-nagios
Reporter
Comment 8 • 14 years ago
jhford: What would it take to rework this script so there is *no* race condition? It would be bad to accidentally kill a real job in production because we thought the slave was idle and needed rebooting.
Assignee: nobody → jhford
Summary: Idle slaves should reboot after 'n' hours, so they get puppet/opsi changes → Idle slaves should reboot after 6 hours, so they get puppet/opsi changes
Assignee
Comment 9 • 14 years ago
I think that the best approach here would be to have the buildslave code itself stop and request a reboot in one of two cases:
1. idle for N hours
2. unable to connect to master for M hours
The slave could do this in a raceless way, since it can shut down its PB connection immediately on determining that it's idle.
Both, but especially #2, would be a big help with slave allocation, as it would cause the slave to reallocate. Then if we have a master go down, we won't have to resuscitate the stranded slaves.
You'd need to do some futzing with the interactions with runslave.py - that script could potentially wait for the slave to exit, then re-allocate and re-start it. Or just reboot. Open questions.
Assignee
Comment 10 • 14 years ago
Oh, and note that runslave.py isn't installed on any Windows builders right now, nor on mobile devices. I intend to get it onto Windows, and would like to get it onto mobile too at some point.
Assignee
Comment 11 • 14 years ago
I'll work on this (assuming jhford's OK with it..)
Assignee: jhford → dustin
Assignee
Updated • 14 years ago
Priority: P5 → P2
Assignee
Comment 12 • 14 years ago
So the worry with leaving runslave.py running is that it turns runslave.py into a daemon, which is complicated, and that the daemon may consume resources (memory or CPU) during talos runs.
So the plan is to modify buildbot itself to
- initiate a graceful shutdown when idle for > N hours
- but don't shut down - reboot instead
- monitor itself for disconnectedness
- respond to extended disconnection by rebooting
- take configuration for the above from buildbot.tac
The reboot code should come from build/buildfarm/maintenance/count_and_reboot.py
This will *not* entail an upgrade of the slave-side buildbot code (which is currently at 0.8.0). I'll add a new branch to build/buildbot to track the version currently deployed on slaves.
I'll open new bugs to deploy this via puppet, OPSI, and VNC helper-monkey once I have the buildbot modifications in place.
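(To make the plan concrete, a rough sketch of what such a slave-side Twisted service could look like; this is not the actual patch. The is_connected()/is_busy() hooks are placeholders for whatever the real buildslave code exposes, and the reboot command stands in for count_and_reboot.py.)
from twisted.application import service
from twisted.internet import task
import subprocess

class IdleWatcher(service.Service):
    """Hypothetical sketch: reboot the slave when it has been idle or
    disconnected for too long. Configuration comes from buildbot.tac."""

    POLL_INTERVAL = 60  # seconds between checks

    def __init__(self, buildslave, max_idle_time, max_disconnected_time):
        self.buildslave = buildslave
        self.max_idle_time = max_idle_time
        self.max_disconnected_time = max_disconnected_time
        self.idle_for = self.disconnected_for = 0
        self.loop = task.LoopingCall(self.check)

    def startService(self):
        service.Service.startService(self)
        self.loop.start(self.POLL_INTERVAL, now=False)

    def check(self):
        # is_connected()/is_busy() are placeholders for real slave-side hooks.
        if not self.buildslave.is_connected():
            self.disconnected_for += self.POLL_INTERVAL
        elif not self.buildslave.is_busy():
            self.idle_for += self.POLL_INTERVAL
        else:
            self.idle_for = self.disconnected_for = 0
        if (self.idle_for > self.max_idle_time or
                self.disconnected_for > self.max_disconnected_time):
            self.loop.stop()
            # The real plan triggers a graceful shutdown first and reuses
            # count_and_reboot.py for the reboot itself.
            subprocess.call(['sudo', 'reboot'])
The actual change (see comment 17) implements this as an "idleizer" service inside the slave-side buildbot code rather than as a separate script.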
Assignee
Comment 13 • 14 years ago
So that begins with finding out what version of Buildbot is *currently* installed on all of the slaves. It's labeled "0.8.0", but we need a specific hg revision (assuming it was built from something in hg).
Assignee
Comment 14 • 14 years ago
From my look at the source on
linux-ix-slave01
talos-r3-snow-001
talos-r3-w7-001
it looks like we have revision 39e3ca7f3c87 on the talos systems, and an older version on linux-ix-slave01. The differences are in master-side code, so I don't think that this is a big problem. I'm going to add a branch named 'slaves' that tracks the buildbot we have installed on slaves, with a tag (SLAVES_0_8_0), and base it on that revision. When I deploy the new version, I'll do it with an _R1 suffix on the SLAVES tag.
Assignee
Comment 15 • 14 years ago
OK, here's what I've got so far:
http://hg.mozilla.org/users/dmitchell_mozilla.com/buildbot/rev/a4417ba031aa
It's just the idleness detection - it doesn't deal with the graceful shutdown part yet - that's next.
catlee, any thoughts?
Assignee
Comment 16 • 14 years ago
ayust suggests that this would make a good addition to contrib/ in upstream, and I agree - then if others want to use it, they can, and we can merge it if it gets popular. So I'll try to make the rebooting functionality a little more modular to suit other installations.
Assignee
Updated • 14 years ago
Assignee
Comment 17 • 14 years ago
I name catlee for r?, since he did the slave-side graceful shutdown stuff. Note that this won't work with 0.7.x masters. Too bad, so sad. On slaves connected to 0.7.x masters, it will log messages about being unable to gracefully shut down, and the slave will stay up.
This runs as a distinct service from buildbot.tac, and only monkey-patches into the buildslave code itself (with the exception of a missing 'return', which I've already committed upstream). The idea is to make it resilient to upstream changes.
To use, just add this to buildbot.tac:
from buildslave import idleizer
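# 's' is the BuildSlave instance defined earlier in the standard buildbot.tac;
# times are in seconds: reboot after 7 idle hours or 1 hour without a master.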
idlz = idleizer.Idleizer(s, max_idle_time=3600*7, max_disconnected_time=3600*1)
idlz.setServiceParent(application)
(and slavealloc can handle producing those times - yay!)
Attachment #512535 - Flags: review?(catlee)
Updated • 14 years ago
Attachment #512535 - Flags: review?(catlee) → review+
Assignee
Comment 18 • 14 years ago
Landed in c349df349db3 on the slaves branch.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated • 11 years ago
Product: mozilla.org → Release Engineering