slaveapi's shutdown_buildslave action doesn't cope well with a machine that isn't connected to buildbot

NEW
Unassigned

Status

Release Engineering
Buildduty
3 years ago
8 months ago

People

(Reporter: bhearsum, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

3 years ago
Specifically, verification doesn't happen. We get stuck in the "Server Shut Down" loop until we hit max time: http://git.mozilla.org/?p=build/slaveapi.git;a=blob;f=slaveapi/actions/shutdown_buildslave.py;h=772e45d44300635e76d2b38f48d0a62c16c2c737;hb=HEAD#l45
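
One way to avoid spinning until max time is to check for a buildbot connection before entering the wait loop. The following is a minimal sketch under stated assumptions, not the actual slaveapi code: `wait_for_shutdown`, `is_connected` and `saw_shutdown` are hypothetical names, and the two callables stand in for whatever checks the real action performs.

```python
import time


def wait_for_shutdown(is_connected, saw_shutdown, max_time=60.0, interval=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Return "not_connected", "shutdown", or "timeout".

    is_connected and saw_shutdown are hypothetical zero-argument callables
    supplied by the caller; they are not part of slaveapi.
    """
    # A slave that is not connected to buildbot has nothing to shut down
    # gracefully, so return immediately instead of waiting for max_time.
    if not is_connected():
        return "not_connected"

    deadline = clock() + max_time
    while clock() < deadline:
        if saw_shutdown():
            return "shutdown"
        sleep(interval)
    return "timeout"
```

A caller could treat "not_connected" as already verified and move straight on to the reboot.
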
Why do we try to shut it down nicely?

If it hasn't taken a job for a while can we just reboot with ssh/ipmi?
(Reporter)

Comment 2

3 years ago
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #1)
> Why do we try to shut it down nicely?
> 
> If it hasn't taken a job for a while can we just reboot with ssh/ipmi?

Something may have happened (e.g., there's finally enough load for it to take a job) between the time you see that it hasn't taken a job and when you start the reboot. A graceful shutdown protects against that case. Without it, we would burn a (small) number of jobs. If someone else wants to be on the hook for it, we could try taking out the graceful shutdown and see what happens.
In the t-w732-ix-* class of machines, several are disconnected from buildbot, and slave rebooter is unable to shut them down gracefully:

2014-02-27 00:43:37,096 - INFO - t-w732-ix-020 - Last job ended at Thursday, February 20, 21:33, rebooting
2014-02-27 00:44:07,363 - INFO - t-w732-ix-020 - Graceful shutdown failed, aborting reboot

If you look at the date/time stamp on twistd.log, it is February 20, 2014 (today is February 27).  That detail should give slave rebooter confidence that it can reboot the slave even if the graceful shutdown fails (or even take the place of a graceful shutdown attempt).
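
The rule suggested above could be sketched roughly as follows. This is an illustration only: `should_reboot` and the six-hour `max_idle` threshold are assumptions, not slave rebooter's actual logic, and `last_log_write` is meant to be the modification time of twistd.log, however the caller obtains it.

```python
from datetime import datetime, timedelta


def should_reboot(graceful_ok, last_log_write, now, max_idle=timedelta(hours=6)):
    """Decide whether to go ahead with a reboot.

    graceful_ok:    whether the graceful shutdown succeeded
    last_log_write: datetime of the last write to twistd.log
    max_idle:       hypothetical threshold; tune it to the pool's job lengths
    """
    if graceful_ok:
        return True
    # Graceful shutdown failed, but a log that has been silent for longer
    # than max_idle suggests buildbot is not running any job.
    return (now - last_log_write) > max_idle


# Dates from the report above; the time of day of the log stamp is not given.
print(should_reboot(False, datetime(2014, 2, 20), datetime(2014, 2, 27, 0, 44, 7)))
```

With these inputs the log has been idle for about a week, so the function returns True and the reboot would proceed instead of being aborted.
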
I agree that we should reboot at that point.
We used to do that with kittenherder.
It's better to burn a job (at worst) as long as the machine comes back into action sooner.
It would be nice to catch the root cause, but that can take a while.

FYI, I added logging to runslave.py to determine why a Windows machine would not start buildbot after starting up.
A tegra (verifying against a foopy with Bug 921067) just hit this as well, due to the twistd.log not being present.

I propose we assume buildbot is not running at that point.
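
A minimal sketch of that assumption, with hypothetical helper names (`twistd_log_mtime`, `may_have_buildbot`) that do not exist in slaveapi: a missing twistd.log is reported as "no buildbot" rather than raised as an error.

```python
import os


def twistd_log_mtime(log_path):
    """Return the log's modification time in epoch seconds, or None if it cannot be read."""
    try:
        return os.path.getmtime(log_path)
    except OSError:
        # Covers a missing file as well as an unreadable path.
        return None


def may_have_buildbot(log_path):
    # Assumption taken from the proposal above: no twistd.log means no buildbot,
    # so the graceful shutdown step could be skipped.
    return twistd_log_mtime(log_path) is not None
```

Note that an existing log only means buildbot may be running; the caller would still need to check how recent the last write was.
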
See Also: → bug 977341

Updated

3 years ago
Assignee: nobody → bugspam.Callek
slaveapi's new get_last_activity API call (being added in bug 987158) should help deal with this situation.
Just about all of the talos-linux32-ix class was out of action, because buildbot stopped but the machine didn't reboot, and slaveapi can't pick it up from there.

Comment 8

8 months ago
I'm not working on this, but offhand I'm not sure how important this is.
Assignee: bugspam.Callek → nobody
Component: Tools → Buildduty
QA Contact: hwine → bugspam.Callek