Closed Bug 971861 Opened 10 years ago Closed 7 years ago

slaveapi's shutdown_buildslave action doesn't cope well with a machine that isn't connected to buildbot

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P5)

x86_64
Linux

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: bhearsum, Unassigned)

References

Details

Specifically, verification doesn't happen. We get stuck in the "Server Shut Down" loop until we hit max time: http://git.mozilla.org/?p=build/slaveapi.git;a=blob;f=slaveapi/actions/shutdown_buildslave.py;h=772e45d44300635e76d2b38f48d0a62c16c2c737;hb=HEAD#l45
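For reference, a minimal sketch of the kind of poll-until-timeout loop described above (not the actual slaveapi code; the function and parameter names are illustrative). If the slave was never attached to buildbot, the shutdown state never shows up and the loop only exits when max_time elapses:

import time

def wait_for_graceful_shutdown(is_shut_down, max_time=300, interval=30):
    # Poll is_shut_down() until it reports True or max_time passes.
    start = time.time()
    while time.time() - start < max_time:
        if is_shut_down():
            return True
        time.sleep(interval)
    # Timed out: the caller aborts the reboot, which is the failure mode
    # this bug describes when the slave isn't connected to buildbot.
    return False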
Why do we try to shut it down nicely?

If it hasn't taken a job for a while can we just reboot with ssh/ipmi?
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #1)
> Why do we try to shut it down nicely?
> 
> If it hasn't taken a job for a while can we just reboot with ssh/ipmi?

Something may have happened (eg, there's finally enough load for it to take a job) between the time you see that it hasn't taken a job and when you start the reboot. A graceful shutdown protects against that case. Without it, we would burn some (small) amount of jobs. If someone else wants to be on the hook for it, we could try taking out the graceful shutdown and see what happens.
In the t-w732-ix-* class of machines there are several that are disconnected and which the slave rebooter is unable to shut down gracefully:

2014-02-27 00:43:37,096 - INFO - t-w732-ix-020 - Last job ended at Thursday, February 20, 21:33, rebooting
2014-02-27 00:44:07,363 - INFO - t-w732-ix-020 - Graceful shutdown failed, aborting reboot

If you look at the date/time stamp on twistd.log, it is February 20, 2014 (today is February 27). That detail should give the slave rebooter enough confidence to reboot the slave even if the graceful shutdown fails (the check could even take the place of a graceful shutdown attempt).
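A minimal sketch of that heuristic, assuming twistd.log exists on the slave; the path and threshold below are illustrative assumptions, not taken from slaveapi:

import os
import time

STALE_AFTER = 3 * 24 * 3600  # three days, in seconds (illustrative)

def twistd_log_is_stale(twistd_log="/builds/slave/twistd.log"):
    # True when twistd.log hasn't been written to within STALE_AFTER,
    # i.e. buildbot hasn't logged anything for days and a forced reboot
    # is unlikely to burn a running job.
    return time.time() - os.path.getmtime(twistd_log) > STALE_AFTER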
I agree that we should reboot at that point.
We used to do that with kittenherder.
It's better to burn a job (at worst) as long as the machine comes back into action sooner.
It would be nice to catch the root cause but that can take a while.

FYI, I added logging to runslave.py to determine why a Windows machine would not start buildbot after starting up.
A tegra (verifying against a foopy with Bug 921067) just hit this as well, because twistd.log was not present.

I propose we assume buildbot isn't running at that point.
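A sketch of that assumption, with the log path again being illustrative rather than the real slaveapi configuration:

import os

def buildbot_appears_absent(twistd_log="/builds/slave/twistd.log"):
    # No twistd.log at all: buildbot presumably never started, so the
    # graceful shutdown attempt can be skipped entirely.
    return not os.path.exists(twistd_log)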
See Also: → 977341
Assignee: nobody → bugspam.Callek
slaveapi's new get_last_activity API call (being added in bug 987158) should help deal with this situation.
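A rough sketch of how a caller might use that, assuming slaveapi exposes last activity over HTTP; the URL shape, response field, and threshold here are guesses for illustration only (see bug 987158 for the real interface):

import time
import requests

def should_force_reboot(slave, base_url="http://slaveapi.example.com",
                        max_idle=3 * 24 * 3600):
    # Assumed endpoint and JSON field; not the confirmed slaveapi API.
    resp = requests.get("%s/slaves/%s/last_activity" % (base_url, slave))
    resp.raise_for_status()
    last_activity = resp.json().get("last_activity", 0)  # assumed epoch seconds
    return time.time() - last_activity > max_idle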
Just about all of the talos-linux32-ix class was out of action, because buildbot stopped but the machine didn't reboot, and slaveapi can't pick it up from there.
I'm not working on this, but offhand I'm not sure how important this is.
Assignee: bugspam.Callek → nobody
Component: Tools → Buildduty
QA Contact: hwine → bugspam.Callek
Priority: -- → P5
We're transitioning away from buildbot, so this is no longer critical.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard