1167580 - slaveapi seems to claim machines do not reboot when they do

Reporter

Description

9 years ago

      No description provided.

Amy Rich [:arr] [:arich]

Comment 1

9 years ago

We've been opening a lot of bugs recently for machines that are pingable and also have a pingable IPMI interface, and I'm wondering if something is broken with slaveapi. Callek, could you verify that slaveapi actually tried to fix this host and, if it did, see why it failed?

Flags: needinfo?(bugspam.Callek)

Justin Wood (:Callek)

Assignee

Comment 2

9 years ago

2015-05-22 06:24:53,517 - INFO - -.- - Processing item: (u't-xp32-ix-092', <function reboot at 0x28a8758>, (), {}, <slaveapi.actions.results.ActionResult object at 0x3f9d590>)
2015-05-22 06:24:53,519 - INFO - -=- - 10.22.81.88 - - [2015-05-22 06:24:53] "POST /slaves/t-xp32-ix-092/actions/reboot HTTP/1.1" 202 274 0.001686
2015-05-22 06:24:53,522 - INFO - t-xp32-ix-092 - Getting inventory info
2015-05-22 06:24:53,682 - INFO - t-xp32-ix-092 - Getting devices.json info
2015-05-22 06:24:53,871 - INFO - t-xp32-ix-092 - Getting bug info
2015-05-22 06:24:53,871 - INFO - t-xp32-ix-092 - Sending request: GET https://bugzilla.mozilla.org/rest/bug/t-xp32-ix-092
2015-05-22 06:24:54,184 - INFO - t-xp32-ix-092 - Got response: 200
2015-05-22 06:24:54,186 - INFO - t-xp32-ix-092 - Sending request: GET https://bugzilla.mozilla.org/rest/bug?product=Infrastructure%20%26%20Operations&component=DCOps&blocks=940092&resolution=---
2015-05-22 06:24:54,511 - INFO - t-xp32-ix-092 - Got response: 200
2015-05-22 06:25:07,529 - INFO - t-xp32-ix-092 - Powercycling
2015-05-22 06:27:28,033 - INFO - t-xp32-ix-092 - Powercycle completed.
2015-05-22 06:27:28,033 - INFO - t-xp32-ix-092 - Waiting 60 seconds for reboot.
2015-05-22 06:27:30,040 - INFO - t-xp32-ix-092 - Checking for signs of life
2015-05-22 06:32:36,245 - ERROR - t-xp32-ix-092 - Timeout of 300 exceeded, giving up
2015-05-22 06:32:36,247 - INFO - t-xp32-ix-092 - Sending request: POST https://bugzilla.mozilla.org/rest/bug
2015-05-22 06:32:37,195 - INFO - t-xp32-ix-092 - Got response: 200
2015-05-22 06:32:37,197 - INFO - t-xp32-ix-092 - Sending request: GET https://bugzilla.mozilla.org/rest/bug/1167580
2015-05-22 06:32:37,488 - INFO - t-xp32-ix-092 - Got response: 200
2015-05-22 06:32:37,490 - INFO - t-xp32-ix-092 - Sending request: PUT https://bugzilla.mozilla.org/rest/bug/940092
2015-05-22 06:32:38,652 - INFO - t-xp32-ix-092 - Got response: 200
2015-05-22 06:32:38,655 - INFO - t-xp32-ix-092 - Finished Processing item: (u't-xp32-ix-092', <function reboot at 0x28a8758>, (), {}, <slaveapi.actions.results.ActionResult object at 0x3f9d590>)

Specifically, it issued a reboot (via IPMI), then waited for the machine to respond to ping, and gave up after 300 seconds...
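The wait-then-give-up behaviour seen in the log can be sketched roughly as follows. This is only an illustration, not slaveapi's actual code: the function name `wait_for_alive`, the `is_alive` callable, and the polling interval are all assumptions; only the 300-second timeout comes from the log above.

```python
import time
from typing import Callable


def wait_for_alive(
    is_alive: Callable[[], bool],
    timeout: float = 300.0,   # matches "Timeout of 300 exceeded" in the log
    interval: float = 5.0,    # hypothetical polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_alive() until it returns True or `timeout` seconds elapse.

    `is_alive` is a hypothetical probe (for example, a ping check).
    `clock` and `sleep` are injectable so the loop can be tested
    without really waiting.
    """
    deadline = clock() + timeout
    while True:
        if is_alive():
            return True
        if clock() >= deadline:
            return False  # the caller would file an "unreachable" bug here
        sleep(interval)
```

With a fixed 300-second budget, a machine that needs longer than that to come back is reported as unreachable even though it recovers shortly afterwards.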

Flags: needinfo?(bugspam.Callek)

Amy Rich [:arr] [:arich]

Comment 3

9 years ago

The machine is very clearly responding to ping now... Can we increase the timeout, since Windows machines are generally slow to boot, or try more than once?

Amy Rich [:arr] [:arich]

Updated

9 years ago

Flags: needinfo?(bugspam.Callek)

Justin Wood (:Callek)

Assignee

Comment 4

9 years ago

(In reply to Amy Rich [:arich] [:arr] from comment #3)
> The machine is very clearly responding to ping now... Can we increase the
> timeout since windows machines are generally slow to boot or try more than
> once?

We could, but it does that "wait for ping-up" after each action it takes, e.g. ssh reboot (wait 300 seconds) --> ipmi reboot (wait 300 seconds).

I suppose we can add one final "wait 300 seconds" at the end of everything, but I'd like that to be a separate bug.
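As a rough sketch of that flow (try each reboot method, wait for ping-up after each, then optionally allow one final grace period), something like the following would do it. The names `reboot_with_fallbacks`, `methods`, and `final_grace` are hypothetical and are not taken from slaveapi; the 300-second values mirror the ones mentioned above.

```python
from typing import Callable, Iterable, Optional, Tuple


def reboot_with_fallbacks(
    methods: Iterable[Tuple[str, Callable[[], None]]],
    wait_for_alive: Callable[[float], bool],
    per_method_timeout: float = 300.0,
    final_grace: float = 300.0,
) -> Optional[str]:
    """Try each reboot method in order, waiting for ping-up after each one.

    `methods` is a hypothetical list of (name, trigger) pairs, e.g. an ssh
    reboot followed by an ipmi powercycle. `wait_for_alive(timeout)` is a
    hypothetical helper that returns True once the host answers.

    Returns the name of the method after which the host came up,
    "final-grace" if it only came up during the extra wait, or None.
    """
    attempted = False
    for name, trigger in methods:
        trigger()
        attempted = True
        if wait_for_alive(per_method_timeout):
            return name
    # Proposed addition: one last wait before declaring the host unreachable.
    if attempted and final_grace > 0 and wait_for_alive(final_grace):
        return "final-grace"
    return None
```

Setting `final_grace=0` gives the behaviour without the extra wait, so the two can be compared directly.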

Of note, the credentials file had the correct IPMI password; however, slaveapi's server was started in early April and the credentials file was updated in early May. So I restarted it in case the reload didn't quite do the right thing....

If we need that extra time for bringup (as in, if Q/mark/etc can confirm that time-to-come-back-up is typically >300 seconds, or at least often enough that it matters) I can work on said patch and deployment in a separate bug.

Flags: needinfo?(bugspam.Callek)

Amy Rich [:arr] [:arich]

Comment 5

9 years ago

Slaveapi having a wrong password wouldn't have mattered if it's just looking at ping. Either it never would have taken the machine down and it's still in the state it was in before (which is pingable), or it worked and the machine came back up (since it's pingable).
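To make that reasoning concrete: a check that only asks "is it pingable now?" cannot tell those two cases apart, whereas a check that first waits for the host to stop answering and then waits for it to answer again can. The sketch below is hypothetical (the function name, the status strings, and the timeouts are illustrative assumptions, not slaveapi behaviour), and a reboot faster than the polling interval could still be missed.

```python
import time
from typing import Callable


def confirm_reboot(
    is_alive: Callable[[], bool],
    down_timeout: float = 120.0,   # hypothetical: how long to wait for the host to drop
    up_timeout: float = 600.0,     # hypothetical: how long to wait for it to return
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return 'rebooted', 'never-went-down', or 'did-not-come-back'."""

    def wait_until(target: bool, timeout: float) -> bool:
        # Poll until is_alive() equals `target` or the timeout expires.
        deadline = clock() + timeout
        while True:
            if is_alive() == target:
                return True
            if clock() >= deadline:
                return False
            sleep(interval)

    if not wait_until(False, down_timeout):
        return "never-went-down"    # the reboot request apparently had no effect
    if not wait_until(True, up_timeout):
        return "did-not-come-back"  # genuinely unreachable within the budget
    return "rebooted"
```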

I'm pretty sure our Windows machines can commonly take longer than 5 minutes to do a shutdown and a reboot. Mark, can you verify?

Flags: needinfo?(mcornmesser)

Van Le [:van]

Comment 6

9 years ago

vans-MacBook-Pro:~ vle$ fping t-xp32-ix-092.wintest.releng.scl3.mozilla.com
t-xp32-ix-092.wintest.releng.scl3.mozilla.com is alive
vans-MacBook-Pro:~ vle$ ssh !$
ssh t-xp32-ix-092.wintest.releng.scl3.mozilla.com
The authenticity of host 't-xp32-ix-092.wintest.releng.scl3.mozilla.com (10.26.41.112)' can't be established.
RSA key fingerprint is 92:e7:12:2b:98:a0:70:ff:8a:05:eb:9c:f2:ac:79:fe.
Are you sure you want to continue connecting (yes/no)?

Status: NEW → RESOLVED

colo-trip: --- → scl3

Closed: 9 years ago

Resolution: --- → FIXED

Amy Rich [:arr] [:arich]

Updated

9 years ago

Assignee: server-ops-dcops → nobody

Status: RESOLVED → REOPENED

Component: DCOps → Tools

Product: Infrastructure & Operations → Release Engineering

QA Contact: hwine

Resolution: FIXED → ---

Summary: t-xp32-ix-092 is unreachable → slaveapi seems to claim machines do not reboot when they do

Amy Rich [:arr] [:arich]

Updated

9 years ago

Assignee: nobody → bugspam.Callek

Mark Cornmesser [:markco] OOO 2024/04/15

Comment 7

9 years ago

It can take longer than 5 minutes for the full process. It depends on what's happening with the GPOs being applied and the time it takes for the machine to communicate back to the domain.

Flags: needinfo?(mcornmesser)

Phil Ringnalda (:philor)

Updated

9 years ago

No longer blocks: t-xp32-ix-092

Phil Ringnalda (:philor)

Comment 8

9 years ago

I doubt that this is actually a Windows-specific problem, so much as just that Windows is mostly what I have to reboot. I don't think there has ever been a time when I rebooted more than a couple of talos-linux64-ix machines when it didn't wind up filing at least one invalid unreachable bug.

Nobody; OK to take it and work on it

Updated

7 years ago

Component: Tools → General

Justin Wood (:Callek)

Assignee

Comment 9

6 years ago

Not expecting to work on this tool anymore.

Status: REOPENED → RESOLVED

Closed: 9 years ago → 6 years ago

Resolution: --- → WONTFIX

Categories

(Release Engineering :: General, defect)

Tracking

(Not tracked)

People

(Reporter: slaveapi, Assigned: Callek)

Security

(public)