Closed
Bug 971861
Opened 11 years ago
Closed 7 years ago
slaveapi's shutdown_buildslave action doesn't cope well with a machine that isn't connected to buildbot
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task, P5)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: bhearsum, Unassigned)
References
Details
Specifically, verification doesn't happen. We get stuck in the "Server Shut Down" loop until we hit max time: http://git.mozilla.org/?p=build/slaveapi.git;a=blob;f=slaveapi/actions/shutdown_buildslave.py;h=772e45d44300635e76d2b38f48d0a62c16c2c737;hb=HEAD#l45
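A minimal sketch of how the wait loop could avoid this. The callables `is_connected` and `log_has_shutdown_marker` are hypothetical stand-ins, not taken from slaveapi's actual code. The idea is to check the connection state first, since a slave that isn't connected is assumed never to log "Server Shut Down":

```python
import time


def wait_for_shutdown(is_connected, log_has_shutdown_marker,
                      max_wait=60.0, interval=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Return 'not_connected', 'shut_down', or 'timed_out'.

    All parameter names are hypothetical; they stand in for whatever
    slaveapi uses to query buildbot and to read twistd.log.
    """
    # Assumption: a slave that is not connected will not produce the
    # shutdown marker, so skip the polling loop entirely.
    if not is_connected():
        return "not_connected"

    deadline = clock() + max_wait
    while clock() < deadline:
        if log_has_shutdown_marker():
            return "shut_down"
        sleep(interval)
    return "timed_out"
```

A caller could treat "not_connected" the same as "shut_down" and go straight to verification or reboot, rather than waiting out the full timeout.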
Comment 1•11 years ago
Why do we try to shut it down nicely?
If it hasn't taken a job for a while can we just reboot with ssh/ipmi?
Reporter
Comment 2•11 years ago
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #1)
> Why do we try to shut it down nicely?
>
> If it hasn't taken a job for a while can we just reboot with ssh/ipmi?
Something may have happened (e.g., there's finally enough load for it to take a job) between the time you see that it hasn't taken a job and when you start the reboot. A graceful shutdown protects against that case. Without it, we would burn some (small) number of jobs. If someone else wants to be on the hook for it, we could try taking out the graceful shutdown and see what happens.
Comment 3•11 years ago
In the t-w732-ix-* class of machines, several are disconnected, and slave rebooter is unable to gracefully shut them down:
2014-02-27 00:43:37,096 - INFO - t-w732-ix-020 - Last job ended at Thursday, February 20, 21:33, rebooting
2014-02-27 00:44:07,363 - INFO - t-w732-ix-020 - Graceful shutdown failed, aborting reboot
If you look at the date/time stamp on twistd.log, it is February 20, 2014 (today is February 27). That detail should give slave rebooter confidence that it can reboot the slave even if the graceful shutdown fails (or even take the place of a graceful shutdown attempt).
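The staleness check suggested above could look roughly like this. It is a sketch only: the function name and the one-day default threshold are assumptions, not part of slave rebooter.

```python
import os
import time


def log_is_stale(path, max_idle_seconds=24 * 60 * 60, now=None):
    """Return True if the file at `path` was last modified more than
    `max_idle_seconds` ago.

    Hypothetical helper; the default threshold is arbitrary.
    """
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(path)) > max_idle_seconds
```

With a twistd.log last written a week ago, as in the example above, `log_is_stale` would return True and the reboot could go ahead even though the graceful shutdown failed.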
Comment 4•11 years ago
I support that we should reboot at that point.
We used to do that with kittenherder.
It's better to burn a job (at worst) as long as the machine comes back into action sooner.
It would be nice to catch the root cause but that can take a while.
FYI, I added logging to runslave.py to determine why a Windows machine would not start buildbot after starting up.
Comment 5•11 years ago
A tegra (verifying against a foopy with Bug 921067) just hit this as well, due to the twistd.log not being present.
I propose we can assume no buildbot at that point.
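One way to express that assumption in code, as a sketch with a hypothetical helper name: report a missing twistd.log as "no activity" instead of raising, so the caller can treat it as "no buildbot running".

```python
import os


def twistd_log_mtime(log_path):
    """Return the log's modification time in seconds since the epoch,
    or None if the log does not exist.

    Hypothetical helper: None is meant to be read by the caller as
    "assume buildbot is not running on this machine".
    """
    try:
        return os.path.getmtime(log_path)
    except FileNotFoundError:
        return None
```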
Updated•11 years ago
Assignee: nobody → bugspam.Callek
Comment 6•11 years ago
slaveapi's new get_last_activity API call (being added in bug 987158) should help deal with this situation.
Comment 7•10 years ago
Just about all of the talos-linux32-ix class was out of action, because buildbot stopped but the machine didn't reboot, and slaveapi can't pick it up from there.
Comment 8•8 years ago
I'm not working on this, but offhand I'm not sure how important this is.
Assignee: bugspam.Callek → nobody
Component: Tools → Buildduty
QA Contact: hwine → bugspam.Callek
Updated•7 years ago
Priority: -- → P5
Comment 9•7 years ago
We're transitioning away from buildbot, so this is no longer critical.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard