Specifically, verification doesn't happen. We get stuck in the "Server Shut Down" loop until we hit max time: http://git.mozilla.org/?p=build/slaveapi.git;a=blob;f=slaveapi/actions/shutdown_buildslave.py;h=772e45d44300635e76d2b38f48d0a62c16c2c737;hb=HEAD#l45
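The failure mode is essentially a poll loop that never sees a verified shutdown, so it spins until the overall time cap. A minimal sketch of that shape (hypothetical names and values; the real logic is in shutdown_buildslave.py linked above):

```python
import time

MAX_WAIT = 300       # overall cap in seconds (hypothetical value)
POLL_INTERVAL = 30   # seconds between status checks (hypothetical value)

def wait_for_shutdown(is_shut_down, max_wait=MAX_WAIT, interval=POLL_INTERVAL):
    """Poll until is_shut_down() reports True or max_wait elapses.

    Returns True on a verified shutdown, False if we hit the time cap,
    which is the "stuck until max time" case described above.
    """
    start = time.monotonic()
    while time.monotonic() - start < max_wait:
        if is_shut_down():
            return True
        time.sleep(interval)
    return False
```

When the status check keeps returning "Server Shut Down" without the slave ever actually going away, this loop only exits via the False branch.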
Why do we try to shut it down nicely? If it hasn't taken a job for a while can we just reboot with ssh/ipmi?
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #1)
> Why do we try to shut it down nicely?
>
> If it hasn't taken a job for a while can we just reboot with ssh/ipmi?

Something may have happened (e.g., there's finally enough load for it to take a job) between the time you see that it hasn't taken a job and the time you start the reboot. A graceful shutdown protects against that case. Without it, we would burn some (small) number of jobs. If someone else wants to be on the hook for it, we could try taking out the graceful shutdown and see what happens.
In the t-w732-ix-* class of machines, several are disconnected and the slave rebooter is unable to shut them down gracefully:

2014-02-27 00:43:37,096 - INFO - t-w732-ix-020 - Last job ended at Thursday, February 20, 21:33, rebooting
2014-02-27 00:44:07,363 - INFO - t-w732-ix-020 - Graceful shutdown failed, aborting reboot

If you look at the date/time stamp on twistd.log, it is February 20, 2014 (today is February 27). That detail should give the slave rebooter confidence that it can reboot the slave even if the graceful shutdown fails (or even skip the graceful shutdown attempt entirely).
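The proposed check could look something like this sketch: compare the mtime of twistd.log against a staleness threshold, and only allow a forced reboot when buildbot has clearly been idle. The function name and threshold are hypothetical, not part of slaveapi:

```python
import os
import time

STALE_AFTER = 6 * 3600  # hypothetical threshold: 6 hours without log activity

def safe_to_force_reboot(twistd_log, stale_after=STALE_AFTER, now=None):
    """Return True when twistd.log is old enough that buildbot is clearly
    not doing anything, so a forced reboot cannot burn an in-progress job.

    Raises OSError if the log file does not exist.
    """
    now = time.time() if now is None else now
    mtime = os.path.getmtime(twistd_log)
    return now - mtime > stale_after
```

In the t-w732-ix-020 case above, the log was a week old, so any reasonable threshold would have cleared the reboot.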
I agree that we should reboot at that point. We used to do that with kittenherder. It's better to burn a job (at worst) as long as the machine comes back into action sooner. It would be nice to catch the root cause, but that can take a while. FYI, I added logging to runslave.py to determine why a Windows machine would not start buildbot after starting up.
A tegra (verifying against a foopy with Bug 921067) just hit this as well, due to twistd.log not being present. I propose we assume buildbot is not running at that point.
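Following that proposal, a missing twistd.log could be treated as "buildbot never started", which also makes a forced reboot safe. A sketch combining both conditions (function name and threshold are hypothetical):

```python
import os
import time

def buildbot_appears_idle(twistd_log, stale_after=6 * 3600):
    """A missing twistd.log means buildbot never started on this slave,
    so there is no job to protect; otherwise fall back to checking
    whether the log has gone stale."""
    if not os.path.exists(twistd_log):
        return True
    return time.time() - os.path.getmtime(twistd_log) > stale_after
```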
slaveapi's new get_last_activity API call (being added in bug 987158) should help deal with this situation.
Just about all of the talos-linux32-ix class was out of action, because buildbot stopped but the machine didn't reboot, and slaveapi can't pick it up from there.
I'm not working on this, but offhand I'm not sure how important this is.