Closed Bug 1167580 Opened 9 years ago Closed 6 years ago

slaveapi seems to claim machines do not reboot when they do

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: slaveapi, Assigned: Callek)

Details

      No description provided.
We've been opening a lot of bugs recently for machines that are both pingable and have a pingable IPMI interface, and I'm wondering if something is broken with slaveapi. Callek, could you verify that slaveapi actually tried to fix this host and, if it did, see why it failed?
Flags: needinfo?(bugspam.Callek)
2015-05-22 06:24:53,517 - INFO - -.- - Processing item: (u't-xp32-ix-092', <function reboot at 0x28a8758>, (), {}, <slaveapi.actions.results.ActionResult object at 0x3f9d590>)
2015-05-22 06:24:53,519 - INFO - -=- - 10.22.81.88 - - [2015-05-22 06:24:53] "POST /slaves/t-xp32-ix-092/actions/reboot HTTP/1.1" 202 274 0.001686
2015-05-22 06:24:53,522 - INFO - t-xp32-ix-092 - Getting inventory info
2015-05-22 06:24:53,682 - INFO - t-xp32-ix-092 - Getting devices.json info
2015-05-22 06:24:53,871 - INFO - t-xp32-ix-092 - Getting bug info
2015-05-22 06:24:53,871 - INFO - t-xp32-ix-092 - Sending request: GET https://bugzilla.mozilla.org/rest/bug/t-xp32-ix-092
2015-05-22 06:24:54,184 - INFO - t-xp32-ix-092 - Got response: 200
2015-05-22 06:24:54,186 - INFO - t-xp32-ix-092 - Sending request: GET https://bugzilla.mozilla.org/rest/bug?product=Infrastructure%20%26%20Operations&component=DCOps&blocks=940092&resolution=---
2015-05-22 06:24:54,511 - INFO - t-xp32-ix-092 - Got response: 200
2015-05-22 06:25:07,529 - INFO - t-xp32-ix-092 - Powercycling
2015-05-22 06:27:28,033 - INFO - t-xp32-ix-092 - Powercycle completed.
2015-05-22 06:27:28,033 - INFO - t-xp32-ix-092 - Waiting 60 seconds for reboot.
2015-05-22 06:27:30,040 - INFO - t-xp32-ix-092 - Checking for signs of life
2015-05-22 06:32:36,245 - ERROR - t-xp32-ix-092 - Timeout of 300 exceeded, giving up
2015-05-22 06:32:36,247 - INFO - t-xp32-ix-092 - Sending request: POST https://bugzilla.mozilla.org/rest/bug
2015-05-22 06:32:37,195 - INFO - t-xp32-ix-092 - Got response: 200
2015-05-22 06:32:37,197 - INFO - t-xp32-ix-092 - Sending request: GET https://bugzilla.mozilla.org/rest/bug/1167580
2015-05-22 06:32:37,488 - INFO - t-xp32-ix-092 - Got response: 200
2015-05-22 06:32:37,490 - INFO - t-xp32-ix-092 - Sending request: PUT https://bugzilla.mozilla.org/rest/bug/940092
2015-05-22 06:32:38,652 - INFO - t-xp32-ix-092 - Got response: 200
2015-05-22 06:32:38,655 - INFO - t-xp32-ix-092 - Finished Processing item: (u't-xp32-ix-092', <function reboot at 0x28a8758>, (), {}, <slaveapi.actions.results.ActionResult object at 0x3f9d590>)

Specifically, it issued a reboot (via IPMI), then waited for the machine to respond to ping, and gave up after 300 seconds...
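
The durations in the log excerpt can be checked directly from its timestamps. This is only a minimal sketch: the helper name `elapsed` is made up for illustration and is not part of slaveapi. The timestamps are copied from the "Powercycling" / "Powercycle completed." and "Checking for signs of life" / "Timeout of 300 exceeded" lines above.

```python
from datetime import datetime

# Timestamp layout used by the log lines above (comma before milliseconds).
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


# "Powercycling" -> "Powercycle completed."
powercycle = elapsed("2015-05-22 06:25:07,529", "2015-05-22 06:27:28,033")
# "Checking for signs of life" -> "Timeout of 300 exceeded, giving up"
alive_check = elapsed("2015-05-22 06:27:30,040", "2015-05-22 06:32:36,245")

print(f"powercycle: {powercycle:.1f}s, alive check: {alive_check:.1f}s")
```

By these timestamps the powercycle itself took about 140 seconds, and the alive check ran for roughly 306 seconds before giving up.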
Flags: needinfo?(bugspam.Callek)
The machine is very clearly responding to ping now... Can we increase the timeout, since Windows machines are generally slow to boot, or try more than once?
Flags: needinfo?(bugspam.Callek)
(In reply to Amy Rich [:arich] [:arr] from comment #3)
> The machine is very clearly responding to ping now... Can we increase the
> timeout since windows machines are generally slow to boot or try more than
> once?

We could, but it does that "wait for ping-up" after each action it takes, e.g. ssh reboot (wait 300 seconds) --> ipmi reboot (wait 300 seconds).

I suppose we can add one final "wait 300 seconds" at the end of everything, but I'd like that to be a separate bug.

Of note: the credentials file had the correct IPMI password; HOWEVER, slaveapi's server was started in early April and the credentials file was updated in early May. So I restarted it in case the reload didn't quite do the right thing....

If we need that extra time for bringup (as in, if Q/mark/etc can confirm that time-to-come-back-up is typically >300 seconds, or at least often enough that it matters), I can work on said patch and deployment in a separate bug.
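
The flow described above (wait for ping after each reboot method, plus the proposed final wait at the end) could be sketched roughly as follows. This is not slaveapi's actual code: `reboot_with_escalation`, its parameters, and the injected `wait_for_alive` callable are all hypothetical names chosen for illustration. Only the 300-second figure comes from this bug.

```python
from typing import Callable, Iterable, Tuple


def reboot_with_escalation(
    methods: Iterable[Tuple[str, Callable[[], None]]],
    wait_for_alive: Callable[[float], bool],
    per_method_timeout: float = 300.0,
    final_grace: float = 300.0,
) -> Tuple[bool, str]:
    """Try each reboot method in order, waiting for the host after each one.

    Returns (success, description of what finally worked).
    """
    tried_any = False
    for name, do_reboot in methods:
        tried_any = True
        do_reboot()
        # Existing behaviour as described: wait for ping after every action.
        if wait_for_alive(per_method_timeout):
            return True, name
    # Hypothetical addition: one last wait once every method has been tried,
    # so a slow-booting host is not reported as unreachable too early.
    if tried_any and wait_for_alive(final_grace):
        return True, "final grace wait"
    return False, "none"


# Simulated slow host: not pingable after ssh or ipmi, pingable during the grace wait.
calls = []
answers = iter([False, False, True])
result = reboot_with_escalation(
    [("ssh", lambda: calls.append("ssh")), ("ipmi", lambda: calls.append("ipmi"))],
    wait_for_alive=lambda timeout: next(answers),
)
print(result, calls)
```

In the simulated run both methods are attempted and the host is only counted as back during the final wait, which is the case the extra 300 seconds is meant to cover.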
Flags: needinfo?(bugspam.Callek)
Slaveapi having a wrong password wouldn't have mattered if it's just looking at ping. Either it never would have taken the machine down and it's still in the state it was in before (which is pingable), or it worked and the machine came back up (since it's pingable).
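
One way to tell those two cases apart would be to watch for the host going down before waiting for it to come back. The sketch below only illustrates that idea and is not something slaveapi is known to do: `wait_for_state`, `classify_reboot`, and the default timeouts are assumptions, and `is_alive` stands in for whatever ping check is actually used.

```python
import time
from typing import Callable


def wait_for_state(
    is_alive: Callable[[], bool],
    want_alive: bool,
    timeout: float,
    poll_interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_alive() until it equals want_alive or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if is_alive() == want_alive:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)


def classify_reboot(
    is_alive: Callable[[], bool],
    down_timeout: float = 120.0,  # assumed value, not from the bug
    up_timeout: float = 300.0,    # the timeout discussed in this bug
    **poll_options,
) -> str:
    """Distinguish 'never went down' from 'went down and came back up'."""
    if not wait_for_state(is_alive, False, down_timeout, **poll_options):
        return "never went down"
    if not wait_for_state(is_alive, True, up_timeout, **poll_options):
        return "went down, did not come back"
    return "rebooted"
```

Note that a host which drops and returns between two polls would be misread as "never went down", so the poll interval would need to be short relative to the reboot time.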

I'm pretty sure our Windows machines can commonly take longer than 5 minutes to do a shutdown and a reboot. Mark, can you verify?
Flags: needinfo?(mcornmesser)
vans-MacBook-Pro:~ vle$ fping t-xp32-ix-092.wintest.releng.scl3.mozilla.com
t-xp32-ix-092.wintest.releng.scl3.mozilla.com is alive
vans-MacBook-Pro:~ vle$ ssh !$
ssh t-xp32-ix-092.wintest.releng.scl3.mozilla.com
The authenticity of host 't-xp32-ix-092.wintest.releng.scl3.mozilla.com (10.26.41.112)' can't be established.
RSA key fingerprint is 92:e7:12:2b:98:a0:70:ff:8a:05:eb:9c:f2:ac:79:fe.
Are you sure you want to continue connecting (yes/no)?
Status: NEW → RESOLVED
colo-trip: --- → scl3
Closed: 9 years ago
Resolution: --- → FIXED
Assignee: server-ops-dcops → nobody
Status: RESOLVED → REOPENED
Component: DCOps → Tools
Product: Infrastructure & Operations → Release Engineering
QA Contact: hwine
Resolution: FIXED → ---
Summary: t-xp32-ix-092 is unreachable → slaveapi seems to claim machines do not reboot when they do
Assignee: nobody → bugspam.Callek
It can take longer than 5 minutes for the full process. It depends on what's happening with the GPOs being applied and the time it takes for the machine to communicate back to the domain.
Flags: needinfo?(mcornmesser)
No longer blocks: t-xp32-ix-092
I doubt this is actually a Windows-specific problem; it's more that Windows is mostly what I have to reboot. I don't think there has ever been a time when I rebooted more than a couple of talos-linux64-ix machines without it filing at least one invalid "unreachable" bug.
Component: Tools → General
Not expecting to work on this tool anymore.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 6 years ago
Resolution: --- → WONTFIX