Closed Bug 565397 Opened 14 years ago Closed 14 years ago

Idle slaves should reboot after 6 hours, so they get puppet/opsi changes

Categories

(Release Engineering :: General, defect, P2)

x86
All
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: dustin)

References

Details

(Whiteboard: [automation][puppet][opsi])

Attachments

(1 file)

RelEng slaves are configured to check puppet/opsi on reboot. All good. However, during a recent rollout of puppet changes, we discovered a side effect of this: any slaves that had been idle for days had not rebooted, and hence had not picked up the config-management changes. This meant that enabling puppet builds in production required us to manually verify that all slaves (including idle slaves) had rebooted and picked up the changes before we could start doing android builds in production. We should force slaves, even if they have been idle, to reboot every 6? 12? hours. This would ensure that every slave checks for updates, and would simplify future rollouts.
Whiteboard: [automation][puppet][opsi]
(from bug 576158) In support of our mobile test pool I wrote a script that reboots a slave after 10 hours of uptime, or if a job has been running for more than 10 hours. It is located at http://hg.mozilla.org/build/tools/file/10b1787feb65/buildfarm/mobile/n900-imaging/rootfs/bin/uptime-check.py

The script currently runs only on Linux, but porting it to another platform should mostly be a matter of finding a source of system uptime information. We'd need to make the first step of *any and every* build on these systems be copying the uptime information to the offset file. The script calls out to reboot-user, which cleanly reboots the machine (shutting down the buildslave first); reboot-user does have some mobile specifics that could be removed.

This script is running in production in our N900 pool and is reliably rebooting our slaves after 10 hours of inactivity. It is also tested to respect the offset file.

This would mean that after deploying a puppet change, we know that after 10 hours every slave is either up to date (because it has rebooted and is back in the pool) or is not yet up to date (because puppet is running or is about to run). This depends on making sure that the buildbot slave is only started after a successful puppet run.
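For illustration, here is a minimal sketch of the same idea rather than the actual uptime-check.py; the 10-hour threshold, the offset-file convention, and the reboot-user helper come from the comment above, while the offset-file path and everything else are assumptions:

    #!/usr/bin/env python
    # Sketch: reboot an idle slave once system uptime exceeds a threshold,
    # unless the offset file shows that a build has touched the machine
    # more recently.
    import subprocess

    MAX_IDLE_SECONDS = 10 * 3600            # 10 hours, per the comment above
    OFFSET_FILE = '/builds/uptime-offset'   # hypothetical path; rewritten at the start of every build

    def system_uptime():
        # Linux-specific, as noted above; other platforms need another source.
        with open('/proc/uptime') as f:
            return float(f.read().split()[0])

    def last_build_offset():
        # The first step of every build copies the current uptime here.
        try:
            with open(OFFSET_FILE) as f:
                return float(f.read().strip())
        except (IOError, ValueError):
            return 0.0

    if __name__ == '__main__':
        if system_uptime() - last_build_offset() > MAX_IDLE_SECONDS:
            # reboot-user stops the buildslave cleanly, then reboots.
            subprocess.call(['reboot-user'])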
Not having this caused some burning yesterday on our Linux slaves. A bunch of idle Linux slaves had not rebooted in a while, and hence had not connected to puppet. This meant they didn't have the updated toolchain, so they broke as predicted when they started building. The fix was to manually reboot them all, which got them the updated toolchain via puppet, and then they built successfully again, as expected.
We have a solution that works and has been tested on Linux (using maemo). This solution carries a slight risk in that a job could be picked up by the slave seconds before the slave is stopped. The failure case is something like:

1. slave starts but remains idle
2. 10 hours pass with the slave idle
3. reboot script triggers and runs "buildbot stop /builds/slave"
4. new job is dispatched to the slave and started
5. buildbot stop kills the slave, and thus the job

As this script is currently configured, we are at risk of this happening for under one minute in 10 hours. With a trivial change to the script, we could reduce our exposure to a couple of seconds in a 36,000-second window (see the sketch below). This does carry risk, but it might carry less risk than not updating to the latest configuration. This code is not currently cross-platform, but could be made so without too much trouble, I think. http://hg.mozilla.org/build/tools/file/10b1787feb65/buildfarm/mobile/n900-imaging/rootfs/bin/uptime-check.py
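One hypothetical way to get from a one-minute window down to a couple of seconds (this is not necessarily the trivial change referred to above, and it reuses the assumed offset-file convention from the earlier sketch): re-read the offset file immediately before stopping the slave, and skip the reboot if a build has started in the meantime.

    # Sketch: last-second re-check before stopping the buildslave.
    import subprocess

    OFFSET_FILE = '/builds/uptime-offset'   # hypothetical path, as above

    def read_offset():
        try:
            with open(OFFSET_FILE) as f:
                return float(f.read().strip())
        except (IOError, ValueError):
            return 0.0

    def uptime():
        with open('/proc/uptime') as f:
            return float(f.read().split()[0])

    baseline = read_offset()
    if uptime() - baseline > 10 * 3600:
        # If a job started while we were deciding, its first step rewrote
        # the offset file, so the value moved and we leave the slave alone.
        if read_offset() == baseline:
            subprocess.call(['buildbot', 'stop', '/builds/slave'])
            subprocess.call(['reboot-user'])

The remaining exposure is only the gap between the final read_offset() call and the buildbot stop.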
Can we have the script post to the master's graceful shutdown page before killing off the slave?
That wouldn't be difficult. We would either have to make that request block until the graceful shutdown completes, or perform subsequent requests to the /buildslave/slavename page to see that the slave has actually shut down.

Another option would be to run buildbot on the slaves in non-daemon mode, in a script like:

    #!/bin/bash
    twistd --no_save --nodaemon -y /builds/slave/buildbot.tac <other options as needed>
    if [[ -f /builds/slave/buildbot.tac.off ]] ; then
        echo 'Not Rebooting'
    else
        sudo reboot
    fi

and have this script do its magic by hitting the graceful shutdown page after a certain number of seconds have elapsed.
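Here is a rough sketch of the polling variant (the master URL, slave name, shutdown endpoint, and "not connected" marker text are all assumptions, not taken from our configs):

    # Sketch: ask the master to gracefully shut this slave down, then poll
    # the slave's status page until it has actually disconnected.
    import time
    import urllib
    import urllib2

    MASTER = 'http://buildmaster.example.com:8010'   # assumed web-status URL
    SLAVE = 'linux-ix-slave01'                       # assumed slave name

    def request_graceful_shutdown():
        # Roughly what pressing "Graceful Shutdown" on the slave's page does;
        # the exact endpoint is an assumption.
        url = '%s/buildslaves/%s/shutdown' % (MASTER, SLAVE)
        urllib2.urlopen(url, data=urllib.urlencode({}))

    def slave_connected():
        # The slave's status page says whether it is connected; the exact
        # marker text is an assumption.
        page = urllib2.urlopen('%s/buildslaves/%s' % (MASTER, SLAVE)).read()
        return 'not connected' not in page.lower()

    if __name__ == '__main__':
        request_graceful_shutdown()
        while slave_connected():
            time.sleep(30)
        # Safe to reboot now: the slave has finished its work and dropped
        # its connection to the master.

The non-daemon wrapper above wouldn't need the polling at all, since twistd exiting is itself the signal to reboot.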
Fixing this will simplify nagios configs and help clean up nagios noise.
jhford: What would it take to rework this script so there is *no* race condition? It would be bad to accidentally kill a real job in production because we thought the slave was idle and needed rebooting.
Assignee: nobody → jhford
Summary: Idle slaves should reboot after 'n' hours, so they get puppet/opsi changes → Idle slaves should reboot after 6 hours, so they get puppet/opsi changes
Blocks: 617166
I think that the best approach here would be to have the buildslave code itself stop and request a reboot in one of two cases:

1. idle for N hours
2. unable to connect to a master for M hours

The slave could do this in a raceless way, since it can shut down its PB connection immediately on determining that it's idle. Both cases, but especially #2, would be a big help with slave allocation, as the reboot would cause the slave to reallocate. Then if we have a master go down, we won't have to resuscitate the stranded slaves.

You'd need to do some futzing with the interactions with runslave.py - that script could potentially wait for the slave to exit, then re-allocate and re-start it, or just reboot. Those are still open questions.
Oh, and note that runslave.py isn't installed on any Windows builders right now, nor on mobile devices. I intend to get it onto Windows, and would like to get it onto mobile too at some point.
I'll work on this (assuming jhford's OK with it..)
Assignee: jhford → dustin
Priority: P5 → P2
Blocks: 629701
Blocks: 627126
So the worry with leaving runslave.py running is that it turns runslave.py into a daemon, which is complicated, and that the daemon may consume resources - memory or CPU - during talos runs.

So the plan is to modify buildbot itself to:
- initiate a graceful shutdown when idle for > N hours - but don't shut down, reboot instead
- monitor itself for disconnectedness, and respond to extended disconnection by rebooting
- take configuration for the above from buildbot.tac

The reboot code should come from build/buildfarm/maintenance/count_and_reboot.py.

This will *not* entail an upgrade of the slave-side buildbot code (which is currently at 0.8.0). I'll add a new branch to build/buildbot to track the version currently deployed on slaves. I'll open new bugs to deploy this via puppet, OPSI, and VNC helper-monkey once I have the buildbot modifications in place.
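A rough sketch of what such a watcher might look like as a twisted service (this is not the eventual patch; the check interval, hook points, and reboot invocation are assumptions, and in particular the "am I busy / am I connected" checks are stubbed out, since hooking those into the buildslave internals is the real work):

    # Sketch: a twisted service that periodically checks the slave's state and
    # reboots the machine when it has been idle or disconnected for too long.
    import os
    import time

    from twisted.application import service
    from twisted.internet import task

    class RebootWatcher(service.Service):
        def __init__(self, buildslave, max_idle_time, max_disconnected_time):
            self.buildslave = buildslave
            self.max_idle_time = max_idle_time
            self.max_disconnected_time = max_disconnected_time
            self.last_active = time.time()
            self.loop = task.LoopingCall(self.check)

        def startService(self):
            service.Service.startService(self)
            self.loop.start(60)      # check once a minute

        def stopService(self):
            if self.loop.running:
                self.loop.stop()
            return service.Service.stopService(self)

        def is_busy(self):
            # Stub: the real thing asks the buildslave about running commands.
            return False

        def is_connected(self):
            # Stub: the real thing checks for a live PB connection to a master.
            return True

        def check(self):
            now = time.time()
            if self.is_busy():
                self.last_active = now
                return
            limit = (self.max_idle_time if self.is_connected()
                     else self.max_disconnected_time)
            if now - self.last_active > limit:
                self.reboot()

        def reboot(self):
            # The plan is to borrow count_and_reboot.py; this is the simplest
            # possible stand-in.
            os.system('sudo reboot')

It would be wired up from buildbot.tac the same way as any other service, with setServiceParent(application).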
So that begins with finding out what version of Buildbot is *currently* installed on all of the slaves. It's labeled "0.8.0", but we need a specific hg revision (assuming it was built from something in hg).
From my look at the source on linux-ix-slave01, talos-r3-snow-001, and talos-r3-w7-001, it looks like we have revision 39e3ca7f3c87 on the talos systems, and an older version on linux-ix-slave01. The differences are in master-side code, so I don't think this is a big problem. I'm going to add a branch named 'slaves' that tracks the buildbot we have installed on slaves, with a tag (SLAVES_0_8_0), and base it on that revision. When I deploy the new version, I'll do it with an _R1 suffix on the SLAVES tag.
OK, here's what I've got so far: http://hg.mozilla.org/users/dmitchell_mozilla.com/buildbot/rev/a4417ba031aa It's just the idleness detection - it doesn't deal with the graceful shutdown part yet - that's next. catlee, any thoughts?
ayust suggests that this would make a good addition to contrib/ in upstream, and I agree - then if others want to use it, they can, and we can merge it if it gets popular. So I'll try to make the rebooting functionality a little more modular to suit other installations.
Blocks: 631851
Depends on: 631849
Blocks: 516808
I name catlee for r?, since he did the slave-side graceful shutdown stuff.

Note that this won't work with 0.7.x masters. Too bad, so sad. On slaves connected to 0.7.x masters, it will log messages about being unable to gracefully shut down, and the slave will stay up.

This runs as a distinct service from buildbot.tac, and only monkey-patches into the buildslave code itself (with the exception of a missing 'return', which I've already committed upstream). The idea is to make it resilient to upstream changes.

To use, just add this to buildbot.tac:

    from buildslave import idleizer
    idlz = idleizer.Idleizer(s,
                             max_idle_time=3600*7,
                             max_disconnected_time=3600*1)
    idlz.setServiceParent(application)

(and slavealloc can handle producing those times - yay!)
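For context, here is roughly where those lines sit in a slave's buildbot.tac. The host, port, credentials, and slave name below are placeholders, and the BuildSlave arguments are just the usual 0.8.0 ones, not anything specific to our configs:

    # Sketch of a buildbot.tac with the idleizer hooked in (placeholder values).
    from twisted.application import service
    from buildslave.bot import BuildSlave
    from buildslave import idleizer

    basedir = '/builds/slave'
    buildmaster_host = 'buildmaster.example.com'   # placeholder
    port = 9010                                    # placeholder
    slavename = 'linux-ix-slave01'                 # placeholder
    passwd = 'pass'                                # placeholder
    keepalive = 600
    usepty = False
    umask = None

    application = service.Application('buildslave')

    s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
                   keepalive, usepty, umask=umask)
    s.setServiceParent(application)

    # The new piece: reboot after 7 idle hours or 1 disconnected hour, with
    # the times eventually supplied by slavealloc.
    idlz = idleizer.Idleizer(s,
                             max_idle_time=3600*7,
                             max_disconnected_time=3600*1)
    idlz.setServiceParent(application)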
Attachment #512535 - Flags: review?(catlee)
Attachment #512535 - Flags: review?(catlee) → review+
landed in c349df349db3 on the slaves branch.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering