Closed Bug 695220 Opened 13 years ago Closed 11 years ago

failing to run count_and_reboot.py should reboot slave

Tracking

(Not tracked)

Status:

RESOLVED DUPLICATE of bug 712205

People

(Reporter: jhford, Unassigned)

Details

(Whiteboard: [puppet][rebooting])

John Ford [:jhford] CET/CEST Berlin Time

Reporter

Description

•

13 years ago

We recently hit a failure mode where DNS broke on some slaves. Without DNS we couldn't check out the tools repo, so when we tried to run count_and_reboot.py, the slave was unable to reboot itself.

When count_and_reboot.py fails to run, the slave should leave the pool. Whether the slave stays up but stops its buildbot process, or reboots, doesn't matter to me (the former is better for debugging, the latter is better for resiliency). I don't know whether the whole job should be retried in this case, but that's something to consider.

Another option is to deploy a copy of count_and_reboot.py with puppet instead of using the copy in tools.

John Hopkins (:jhopkins)

Comment 1

•

13 years ago

Why not solve the root cause and fail the build if tools can't be checked out? Or, if we really must allow the build to continue when the tools checkout fails, check that essential tools are present before continuing:

  checkout_tools
  if checkout_fail:
    if exists(['count_and_reboot.py', 'other_essential_tool.py']):
      pass
    else:
      #TODO could RETRY build instead
      throw exception "cannot continue without essential tools"

Better to fail early than at the last step in the build.
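For illustration only, the check proposed in this comment could be sketched in Python roughly as below. The names `MissingToolsError`, `ESSENTIAL_TOOLS`, `missing_tools`, and `ensure_essential_tools` are hypothetical, and the file names are the placeholders from the pseudocode rather than real paths in the tools repo. How the build learns that the checkout failed (`checkout_ok`) is also assumed.

```python
import os


class MissingToolsError(Exception):
    """Raised when a required script is absent from the tools checkout."""


# Placeholder names copied from the pseudocode; real paths would differ.
ESSENTIAL_TOOLS = ["count_and_reboot.py", "other_essential_tool.py"]


def missing_tools(tools_dir, required=ESSENTIAL_TOOLS):
    """Return the required scripts that are not present under tools_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(tools_dir, name))]


def ensure_essential_tools(tools_dir, checkout_ok):
    """Fail early if the checkout failed and no usable older copy exists."""
    if checkout_ok:
        return
    missing = missing_tools(tools_dir)
    if missing:
        # TODO: could RETRY the build instead of failing outright.
        raise MissingToolsError(
            "cannot continue without essential tools: " + ", ".join(missing))
```

If an earlier checkout left the scripts on disk, a failed update passes this check and the build proceeds with the stale copy, which matches the "pass" branch of the pseudocode.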

John Ford [:jhford] CET/CEST Berlin Time

Reporter

Comment 2

•

13 years ago

They are failing early, on (or because of) the failed hg tools clone. The problem is that the machine cannot reboot when the tools checkout fails, so even though we fail the individual job early, we never reboot the slave and it continues to take jobs. Because the only fix for this class of problem is a reboot, we get wedged: every job on the slave fails at the tools clone, no job reboots the slave, and so the slave keeps taking jobs and deterministically failing on cloning tools.

I'd modify your pseudocode to do:

  checkout_tools()
  if checkout_fail:
    fail_job()
    try:
      count_and_reboot()
    except:
      set_job_status(RETRY)
      forcefully_reboot()
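As a rough sketch of the fallback described in this comment, the Python below tries the normal reboot step and, if that fails for any reason, marks the job for retry and forces a reboot by other means. Everything here is an assumption for illustration: `finish_failed_job`, `reboot_via_shell`, and the `RETRY` constant are hypothetical names, the callables stand in for whatever the build harness actually provides, and `sudo reboot` assumes a Unix slave whose build user is allowed to reboot.

```python
import subprocess

RETRY = "RETRY"  # hypothetical status value, stands in for the harness's own


def finish_failed_job(count_and_reboot, set_job_status, force_reboot):
    """Run the normal reboot step; on failure mark RETRY and force a reboot."""
    try:
        count_and_reboot()
    except Exception:
        # The normal path is unavailable (e.g. the script was never
        # checked out), so fall back to a reboot that needs no tools.
        set_job_status(RETRY)
        force_reboot()


def reboot_via_shell():
    """Fallback reboot that does not depend on the tools checkout.

    Assumes a Unix slave where the build user may run `sudo reboot`.
    """
    subprocess.check_call(["sudo", "reboot"])
```

A caller would pass `reboot_via_shell` as `force_reboot`. Passing the steps in as callables keeps the control flow testable without actually rebooting a machine.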

Chris Cooper [:coop] (he/him)

Comment 3

•

13 years ago

I agree with jhopkins: we should exit early when we fail to check out tools/. I'm not sure what forcefully rebooting the slave buys us, though, if we don't have a way to prevent buildbot from starting up again on reboot. Especially if we're exiting earlier, that could be a pretty tight loop.

Priority: -- → P5

Nobody; OK to take it and work on it

Assignee

Updated

•

12 years ago

Product: mozilla.org → Release Engineering

Chris Cooper [:coop] (he/him)

Updated

•

11 years ago

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → DUPLICATE

Categories

(Release Engineering :: General, defect, P5)
