Closed Bug 576158 Opened 15 years ago Closed 15 years ago

reboot linux slaves after 10 hours of inactivity

Categories

(Release Engineering :: General, defect, P3)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 565397

People

(Reporter: jhford, Unassigned)

Details

(Whiteboard: [automation])

Sometimes slaves sit idle for an extended period of time. Because puppet changes are only applied on boot, a slave can be up for hours between its last puppet run and the start of a build.

In support of our mobile test pool I wrote a script that reboots a slave after 10 hours of uptime, or if a job has been running for more than 10 hours. It is located at http://hg.mozilla.org/build/tools/file/10b1787feb65/buildfarm/mobile/n900-imaging/rootfs/bin/uptime-check.py

The script currently runs only on Linux, but porting it to another platform should mainly be a matter of finding a source of system uptime information. We'd need to make the first step of *any and every* build on these systems copy the uptime information to the offset file. The script calls out to reboot-user, which cleanly reboots the machine (shutting down the buildslave first). reboot-user does have some mobile-specific parts that could be removed.

This script is running in production in our N900 pool and is reliably rebooting our slaves after 10 hours of inactivity. It has also been tested to respect the offset file.

This would mean that 10 hours after deploying a puppet change, every slave is either up to date (because it has rebooted and is back in the pool) or not yet up to date (because puppet is running or is about to run). This depends on making sure that the buildbot slave is only started after a successful puppet run.
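
For readers who do not want to open the linked file, here is a rough sketch of the uptime/offset check described above. It is not the contents of uptime-check.py. The offset file path, the function names, and calling reboot-user without arguments are all assumptions made for illustration. The only platform-specific part is reading /proc/uptime, whose first field is the number of seconds since boot.

```python
import subprocess

UPTIME_SOURCE = "/proc/uptime"      # Linux-only source of uptime information
OFFSET_FILE = "/tmp/uptime-offset"  # hypothetical path, not taken from the real script
MAX_SECONDS = 10 * 60 * 60          # the 10 hour limit from the description


def read_seconds(path, default=0.0):
    """Return the first whitespace-separated field of a file as a float."""
    try:
        with open(path) as f:
            return float(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        # Missing, unreadable, or empty file: fall back to the default.
        return default


def record_build_start(uptime_path=UPTIME_SOURCE, offset_path=OFFSET_FILE):
    """First step of every build: copy the current uptime to the offset file."""
    uptime = read_seconds(uptime_path)
    with open(offset_path, "w") as f:
        f.write("%f\n" % uptime)


def should_reboot(uptime, offset, limit=MAX_SECONDS):
    """True once `limit` seconds have passed since boot or the last build start."""
    if offset > uptime:
        # Assumption: an offset larger than the uptime was written
        # before the last reboot, so it is ignored.
        offset = 0.0
    return uptime - offset >= limit


def main():
    uptime = read_seconds(UPTIME_SOURCE)
    offset = read_seconds(OFFSET_FILE)
    if should_reboot(uptime, offset):
        # Assumption: reboot-user is on PATH and needs no arguments.
        subprocess.call(["reboot-user"])
```

In this sketch main() would be run periodically (for example from cron), and record_build_start() would be the first step of every build, so the 10 hour window restarts whenever a job begins.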
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → DUPLICATE
Product: mozilla.org → Release Engineering