Closed Bug 695220 Opened 13 years ago Closed 11 years ago

failing to run count_and_reboot.py should reboot slave

Categories

(Release Engineering :: General, defect, P5)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 712205

People

(Reporter: jhford, Unassigned)

Details

(Whiteboard: [puppet][rebooting])

A failure mode we encountered recently was DNS breaking on some slaves, which meant they couldn't check out the tools repo. Because count_and_reboot.py lives in that repo, the slave was then unable to reboot itself when we tried to run it. When count_and_reboot.py fails to run, the slave should leave the pool. Whether the slave stays on but stops its buildbot process, or reboots, doesn't matter to me (the former is better for debugging, the latter is better for resiliency). I don't know whether this should mean that the whole job is retried, but that's something to consider. Another option is to deploy a copy of count_and_reboot.py with puppet instead of using the copy in tools.
Why not solve the root cause and fail the build if tools can't be checked out? Or, if we really must allow the build to continue after a failed tools checkout, check that essential tools are present before continuing:

    checkout_tools
    if checkout_fail:
        if exists(['count_and_reboot.py', 'other_essential_tool.py']):
            pass
        else:
            # TODO: could RETRY build instead
            throw exception "cannot continue without essential tools"

Better to fail early than at the last step in the build.
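The "check that essential tools are present" idea can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names ESSENTIAL_TOOLS, MissingToolsError, missing_tools and ensure_essential_tools are hypothetical, and the paths are treated as relative to wherever the tools checkout lives, which this bug does not specify.

```python
import os

# Hypothetical list: the bug names only count_and_reboot.py explicitly.
ESSENTIAL_TOOLS = ["count_and_reboot.py"]


class MissingToolsError(RuntimeError):
    """Raised when a build cannot continue without essential tools."""


def missing_tools(tools_dir, essential=ESSENTIAL_TOOLS):
    """Return the essential tools that are not present under tools_dir."""
    return [
        name
        for name in essential
        if not os.path.isfile(os.path.join(tools_dir, name))
    ]


def ensure_essential_tools(tools_dir, essential=ESSENTIAL_TOOLS):
    """Fail early, before any build step, if an essential tool is missing."""
    missing = missing_tools(tools_dir, essential)
    if missing:
        raise MissingToolsError(
            "cannot continue without essential tools: " + ", ".join(missing)
        )
```

A caller would run ensure_essential_tools() only after a failed checkout, so that a stale but complete copy of tools still lets the job proceed.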
The jobs are failing early, on (or because of) the hg tools clone failing. The problem is that the machine cannot reboot when the tools checkout fails, so even though we fail the individual job early, we never reboot the slave and it continues to take jobs. Because the only fix for this class of problem is a reboot, we get wedged into a state where every job on the slave fails: each job fails the tools clone, none of the jobs reboot, and so the slave keeps taking jobs and deterministically failing on cloning tools. I'd modify your pseudocode to do:

    checkout_tools()
    if checkout_fail:
        fail_job()
        try:
            count_and_reboot()
        except:
            set_job_status(RETRY)
            forcefully_reboot()
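A minimal Python sketch of that fallback, assuming the two reboot paths are passed in as callables. finish_failed_job, run_count_and_reboot and force_reboot are hypothetical names, and the status strings are placeholders rather than real buildbot constants.

```python
def finish_failed_job(run_count_and_reboot, force_reboot):
    """Try the normal reboot path; fall back to a forced reboot.

    Both arguments are hypothetical callables supplied by the caller.
    Returns the status the job should report.
    """
    try:
        # Normal path: count_and_reboot.py is available and runs.
        run_count_and_reboot()
        return "FAILURE"
    except Exception:
        # The script is missing or failed, so reboot the slave anyway
        # and ask for the job to be retried.
        force_reboot()
        return "RETRY"
```

Passing the reboot actions in keeps the decision logic testable without actually rebooting a machine; how force_reboot is implemented on each platform is left open here.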
I agree with jhopkins: we should exit early when we fail to check out tools/. I'm not sure what forcefully rebooting the slave buys us, though, if we don't have a way to prevent buildbot from starting up again on reboot. Especially if we're exiting earlier, that could be a pretty tight loop.
Priority: -- → P5
Product: mozilla.org → Release Engineering
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE