Closed Bug 971861 Opened 10 years ago Closed 7 years ago

slaveapi's shutdown_buildslave action doesn't cope well with a machine that isn't connected to buildbot

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P5)

x86_64
Linux

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: bhearsum, Unassigned)

References

Details

Specifically, verification doesn't happen. We get stuck in the "Server Shut Down" loop until we hit max time: http://git.mozilla.org/?p=build/slaveapi.git;a=blob;f=slaveapi/actions/shutdown_buildslave.py;h=772e45d44300635e76d2b38f48d0a62c16c2c737;hb=HEAD#l45
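For reference, a minimal sketch of the kind of poll-until-timeout loop described above (not the actual slaveapi code; the function and parameter names are illustrative). If the slave was never attached to buildbot, the shutdown state never shows up and the loop only exits when max_time elapses:

import time

def wait_for_graceful_shutdown(is_shut_down, max_time=300, interval=30):
    # Poll is_shut_down() until it reports True or max_time passes.
    start = time.time()
    while time.time() - start < max_time:
        if is_shut_down():
            return True
        time.sleep(interval)
    # Timed out: the caller aborts the reboot, which is the failure mode
    # this bug describes when the slave isn't connected to buildbot.
    return False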
Why do we try to shut it down nicely?

If it hasn't taken a job for a while can we just reboot with ssh/ipmi?
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #1)
> Why do we try to shut it down nicely?
> 
> If it hasn't taken a job for a while can we just reboot with ssh/ipmi?

Something may have happened (eg, there's finally enough load for it to take a job) between the time you see that it hasn't taken a job and when you start the reboot. A graceful shutdown protects against that case. Without it, we would burn some (small) amount of jobs. If someone else wants to be on the hook for it, we could try taking out the graceful shutdown and see what happens.
In the t-w732-ix-* class of machines there are several that are disconnected and which the slave rebooter is unable to shut down gracefully:

2014-02-27 00:43:37,096 - INFO - t-w732-ix-020 - Last job ended at Thursday, February 20, 21:33, rebooting
2014-02-27 00:44:07,363 - INFO - t-w732-ix-020 - Graceful shutdown failed, aborting reboot

If you look at the date/time stamp on twistd.log, it is February 20, 2014 (today is February 27). That detail should give the slave rebooter enough confidence to reboot the slave even if the graceful shutdown fails (the check could even take the place of a graceful shutdown attempt).
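A minimal sketch of that heuristic, assuming twistd.log exists on the slave; the path and threshold below are illustrative assumptions, not taken from slaveapi:

import os
import time

STALE_AFTER = 3 * 24 * 3600  # three days, in seconds (illustrative)

def twistd_log_is_stale(twistd_log="/builds/slave/twistd.log"):
    # True when twistd.log hasn't been written to within STALE_AFTER,
    # i.e. buildbot hasn't logged anything for days and a forced reboot
    # is unlikely to burn a running job.
    return time.time() - os.path.getmtime(twistd_log) > STALE_AFTER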
I agree that we should reboot at that point.
We used to do that with kittenherder.
It's better to burn a job (at worst) as long as the machine comes back into action sooner.
It would be nice to catch the root cause but that can take a while.

FYI, I added logging to runslave.py to determine why a Windows machine would not start buildbot after starting up.
A tegra (verifying against a foopy with Bug 921067) just hit this as well, because twistd.log was not present.

I propose we assume buildbot isn't running at that point.
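A sketch of that assumption, with the log path again being illustrative rather than the real slaveapi configuration:

import os

def buildbot_appears_absent(twistd_log="/builds/slave/twistd.log"):
    # No twistd.log at all: buildbot presumably never started, so the
    # graceful shutdown attempt can be skipped entirely.
    return not os.path.exists(twistd_log)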
See Also: → 977341
Assignee: nobody → bugspam.Callek
slaveapi's new get_last_activity API call (being added in bug 987158) should help deal with this situation.
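A rough sketch of how a caller might use that, assuming slaveapi exposes last activity over HTTP; the URL shape, response field, and threshold here are guesses for illustration only (see bug 987158 for the real interface):

import time
import requests

def should_force_reboot(slave, base_url="http://slaveapi.example.com",
                        max_idle=3 * 24 * 3600):
    # Assumed endpoint and JSON field; not the confirmed slaveapi API.
    resp = requests.get("%s/slaves/%s/last_activity" % (base_url, slave))
    resp.raise_for_status()
    last_activity = resp.json().get("last_activity", 0)  # assumed epoch seconds
    return time.time() - last_activity > max_idle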
Just about all of the talos-linux32-ix class was out of action, because buildbot stopped but the machine didn't reboot, and slaveapi can't pick it up from there.
I'm not working on this, but offhand I'm not sure how important this is.
Assignee: bugspam.Callek → nobody
Component: Tools → Buildduty
QA Contact: hwine → bugspam.Callek
Priority: -- → P5
We're transitioning away from buildbot, so this is no longer critical.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard