Closed
Bug 1167580
Opened 9 years ago
Closed 6 years ago
slaveapi seems to claim machines do not reboot when they do
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: slaveapi, Assigned: Callek)
Details
No description provided.
Comment 1•9 years ago
We've been opening a lot of bugs recently for machines that are both pingable and have a pingable IPMI interface, and I'm wondering if something is broken with slaveapi. Callek, could you verify that slaveapi actually tried to fix this host and, if it did, see why it failed?
Flags: needinfo?(bugspam.Callek)
Assignee
Comment 2•9 years ago
2015-05-22 06:24:53,517 - INFO - -.- - Processing item: (u't-xp32-ix-092', <function reboot at 0x28a8758>, (), {}, <slaveapi.actions.results.ActionResult object at 0x3f9d590>)
2015-05-22 06:24:53,519 - INFO - -=- - 10.22.81.88 - - [2015-05-22 06:24:53] "POST /slaves/t-xp32-ix-092/actions/reboot HTTP/1.1" 202 274 0.001686
2015-05-22 06:24:53,522 - INFO - t-xp32-ix-092 - Getting inventory info
2015-05-22 06:24:53,682 - INFO - t-xp32-ix-092 - Getting devices.json info
2015-05-22 06:24:53,871 - INFO - t-xp32-ix-092 - Getting bug info
2015-05-22 06:24:53,871 - INFO - t-xp32-ix-092 - Sending request: GET https://bugzilla.mozilla.org/rest/bug/t-xp32-ix-092
2015-05-22 06:24:54,184 - INFO - t-xp32-ix-092 - Got response: 200
2015-05-22 06:24:54,186 - INFO - t-xp32-ix-092 - Sending request: GET https://bugzilla.mozilla.org/rest/bug?product=Infrastructure%20%26%20Operations&component=DCOps&blocks=940092&resolution=---
2015-05-22 06:24:54,511 - INFO - t-xp32-ix-092 - Got response: 200
2015-05-22 06:25:07,529 - INFO - t-xp32-ix-092 - Powercycling
2015-05-22 06:27:28,033 - INFO - t-xp32-ix-092 - Powercycle completed.
2015-05-22 06:27:28,033 - INFO - t-xp32-ix-092 - Waiting 60 seconds for reboot.
2015-05-22 06:27:30,040 - INFO - t-xp32-ix-092 - Checking for signs of life
2015-05-22 06:32:36,245 - ERROR - t-xp32-ix-092 - Timeout of 300 exceeded, giving up
2015-05-22 06:32:36,247 - INFO - t-xp32-ix-092 - Sending request: POST https://bugzilla.mozilla.org/rest/bug
2015-05-22 06:32:37,195 - INFO - t-xp32-ix-092 - Got response: 200
2015-05-22 06:32:37,197 - INFO - t-xp32-ix-092 - Sending request: GET https://bugzilla.mozilla.org/rest/bug/1167580
2015-05-22 06:32:37,488 - INFO - t-xp32-ix-092 - Got response: 200
2015-05-22 06:32:37,490 - INFO - t-xp32-ix-092 - Sending request: PUT https://bugzilla.mozilla.org/rest/bug/940092
2015-05-22 06:32:38,652 - INFO - t-xp32-ix-092 - Got response: 200
2015-05-22 06:32:38,655 - INFO - t-xp32-ix-092 - Finished Processing item: (u't-xp32-ix-092', <function reboot at 0x28a8758>, (), {}, <slaveapi.actions.results.ActionResult object at 0x3f9d590>)
Specifically, it issued a reboot (via IPMI), then waited for the machine to respond to ping, and gave up after 300 seconds...
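To make the timing in the log above concrete, here is a small sketch that computes how long the powercycle took and how long slaveapi waited for ping before giving up. The timestamps are copied from the log; the helper name `elapsed_seconds` is made up for this illustration and is not part of slaveapi.

```python
from datetime import datetime

# Timestamp layout used by the log lines in this comment.
FMT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start, end):
    """Return the number of seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# "Powercycling" -> "Powercycle completed."
powercycle = elapsed_seconds("2015-05-22 06:25:07,529", "2015-05-22 06:27:28,033")

# "Checking for signs of life" -> "Timeout of 300 exceeded, giving up"
ping_wait = elapsed_seconds("2015-05-22 06:27:30,040", "2015-05-22 06:32:36,245")

print("powercycle took %.3f s" % powercycle)
print("waited for ping %.3f s" % ping_wait)
```

So the powercycle itself took a little over two minutes, and the host then had roughly five minutes to start answering ping before the bug was filed.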
Flags: needinfo?(bugspam.Callek)
Comment 3•9 years ago
The machine is very clearly responding to ping now... Can we increase the timeout, since Windows machines are generally slow to boot, or try more than once?
Updated•9 years ago
Flags: needinfo?(bugspam.Callek)
Assignee
Comment 4•9 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #3)
> The machine is very clearly responding to ping now... Can we increase the
> timeout since windows machines are generally slow to boot or try more than
> once?

We could, but it does that "wait for ping-up" after each action it takes, e.g. ssh reboot (wait 300 seconds) --> ipmi reboot (wait 300 seconds). I suppose we can add one final "wait 300 seconds" at the end of everything, but I'd like that to be a separate bug.

Of note, the credentials file had the correct ipmi password; HOWEVER, slaveapi's server was started in early April and the credentials file was updated in early May. So I restarted it in case the reload didn't quite do the right thing...

If we need that extra time for bringup (as in, if Q/mark/etc can confirm that time-to-come-back-up is typically >300 seconds, or at least often enough that it matters), I can work on said patch and deployment in a separate bug.
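The escalation described above (one wait-for-ping after each reboot method, plus the proposed final wait) could look roughly like the sketch below. This is not slaveapi's actual code: `wait_for_ping`, `reboot_with_escalation`, and all parameter names are hypothetical, and the caller is assumed to supply its own `is_alive` ping check and the list of reboot actions.

```python
import time


def wait_for_ping(is_alive, timeout, interval=5.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll is_alive() until it returns True or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if is_alive():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def reboot_with_escalation(actions, is_alive, per_action_timeout=300,
                           final_grace=300, **poll_kwargs):
    """Try each reboot action in order, waiting for ping after each one.

    If no action brings the host back within its own timeout, wait one
    extra grace period before reporting failure (the proposed final wait).
    """
    for action in actions:
        try:
            action()  # e.g. an ssh reboot, then an ipmi powercycle
        except Exception:
            continue  # this method failed outright; escalate to the next
        if wait_for_ping(is_alive, per_action_timeout, **poll_kwargs):
            return True
    # Every method was tried; give a slow-booting host one last chance.
    return wait_for_ping(is_alive, final_grace, **poll_kwargs)
```

Passing `clock` and `sleep` as parameters is only there so the timing can be exercised with a fake clock instead of real five-minute waits.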
Flags: needinfo?(bugspam.Callek)
Comment 5•9 years ago
Slaveapi having a wrong password wouldn't have mattered if it's just looking at ping. Either it never took the machine down and it's still in the state it was in before (which is pingable), or it worked and the machine came back up (since it's pingable). I'm pretty sure our Windows machines can commonly take longer than 5 minutes to do a shutdown and a reboot. Mark, can you verify?
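The two cases above can be told apart only if the ping results are looked at over time rather than once at the end. A minimal sketch, assuming a hypothetical list of periodic ping samples (True when the host answered); the function name and return labels are invented for illustration:

```python
def classify_reboot(samples):
    """Classify a host from ordered ping samples taken after a reboot request.

    samples: iterable of booleans, True when the host answered ping.
    """
    samples = list(samples)
    if not samples:
        return "unknown"
    if all(samples):
        # Never seen down: either the reboot never happened, or it was
        # faster than the sampling interval. Ping alone cannot tell.
        return "never_seen_down"
    if samples[-1]:
        # Seen down at least once and answering at the end.
        return "came_back_up"
    return "still_down"


print(classify_reboot([True, True, True]))
print(classify_reboot([True, False, False, True]))
print(classify_reboot([True, False, False]))
```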
Flags: needinfo?(mcornmesser)
Comment 6•9 years ago
vans-MacBook-Pro:~ vle$ fping t-xp32-ix-092.wintest.releng.scl3.mozilla.com
t-xp32-ix-092.wintest.releng.scl3.mozilla.com is alive
vans-MacBook-Pro:~ vle$ ssh !$
ssh t-xp32-ix-092.wintest.releng.scl3.mozilla.com
The authenticity of host 't-xp32-ix-092.wintest.releng.scl3.mozilla.com (10.26.41.112)' can't be established.
RSA key fingerprint is 92:e7:12:2b:98:a0:70:ff:8a:05:eb:9c:f2:ac:79:fe.
Are you sure you want to continue connecting (yes/no)?
Status: NEW → RESOLVED
colo-trip: --- → scl3
Closed: 9 years ago
Resolution: --- → FIXED
Updated•9 years ago
Assignee: server-ops-dcops → nobody
Status: RESOLVED → REOPENED
Component: DCOps → Tools
Product: Infrastructure & Operations → Release Engineering
QA Contact: hwine
Resolution: FIXED → ---
Summary: t-xp32-ix-092 is unreachable → slaveapi seems to claim machines do not reboot when they do
Updated•9 years ago
Assignee: nobody → bugspam.Callek
Comment 7•9 years ago
It can take longer than 5 minutes for the full process. It depends on what's happening with the GPOs being applied and the time it takes for the machine to communicate back to the domain.
Flags: needinfo?(mcornmesser)
Updated•9 years ago
No longer blocks: t-xp32-ix-092
Comment 8•9 years ago
I doubt that this is actually a Windows-specific problem, so much as that Windows is mostly what I have to reboot. I don't think there has ever been a time when I rebooted more than a couple of talos-linux64-ix machines and it didn't wind up filing at least one invalid unreachable bug.
Updated•7 years ago
Component: Tools → General
Assignee
Comment 9•6 years ago
Not expecting to work on this tool anymore.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 6 years ago
Resolution: --- → WONTFIX