Closed Bug 1107736 Opened 10 years ago Closed 10 years ago

Please run diagnostics on t-xp32-ix-140

Categories

(Infrastructure & Operations :: DCOps, task)

x86
Windows XP
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Unassigned)

References

Details

(Whiteboard: no IPMI reboot attempts)

This machine is currently unreachable, and has put itself into that state multiple times over the past few weeks. Please run full diagnostics.
colo-trip: --- → scl3
Opened #AFB-729-61182 to drop off this host at iX for an extended burn-in test.
Whiteboard: #AFB-729-61182
This host was dropped off at iX (Bug 1106633).
Passed burn-in tests, reimaging.

From iX:

"Hey Sal,

Both nodes have passed our burn-in.

We went ahead and updated the BIOS, as well as IPMI, on both units but please feel free to drop by any time.

Cheers!"
Whiteboard: #AFB-729-61182 → reimaging
host is back online.

sals-MacBook-Pro-3:~ sal$ sudo fping  10.26.19.236
10.26.19.236 is alive
sals-MacBook-Pro-3:~ sal$ sudo fping  10.26.41.237
10.26.41.237 is alive
sals-MacBook-Pro-3:~ sal$ ssh !$
ssh 10.26.41.237
The authenticity of host '10.26.41.237 (10.26.41.237)' can't be established.
RSA key fingerprint is fe:ba:e0:31:37:7b:97:08:b9:68:6f:73:49:c2:69:7c.
Are you sure you want to continue connecting (yes/no)?
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Just exactly like it was before its trip home.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
These hosts (137 and 140) not trying IPMI might be a config issue, per :callek. He said he would have to dig into this, but that was during the Portland work week. I'll needinfo him and see if he can find any more info regarding these 2 hosts.
Flags: needinfo?(bugspam.Callek)
Whiteboard: reimaging → no IPMI reboot attempts
for my own knowledge, slaveapi1 log of a reboot attempt:

2014-12-22 21:34:34,343 - INFO - -.- - Processing item: (u't-xp32-ix-140', <function reboot at 0x2a6cb18>, (), {}, <slaveapi.actions.results.ActionResult object at 0x10d25550>)
2014-12-22 21:34:34,345 - INFO - -=- - 10.22.81.89 - - [2014-12-22 21:34:34] "POST /slaves/t-xp32-ix-140/actions/reboot HTTP/1.1" 202 275 0.002916

2014-12-22 21:34:34,348 - INFO - t-xp32-ix-140 - Getting inventory info
2014-12-22 21:34:34,684 - INFO - t-xp32-ix-140 - Getting devices.json info
2014-12-22 21:34:34,892 - INFO - t-xp32-ix-140 - Unable to establish IPMI session, retrying...
2014-12-22 21:34:34,893 - INFO - t-xp32-ix-140 - Getting bug info
2014-12-22 21:34:34,893 - INFO - t-xp32-ix-140 - Sending request: GET https://bugzilla.mozilla.org/rest/bug/t-xp32-ix-140
2014-12-22 21:34:35,177 - INFO - t-xp32-ix-140 - Got response: 200
2014-12-22 21:34:35,181 - INFO - t-xp32-ix-140 - Sending request: GET https://bugzilla.mozilla.org/rest/bug?product=Infrastructure%20%26%20Operations&component=DCOps&blocks=1095980&resolution=---
2014-12-22 21:34:35,477 - INFO - t-xp32-ix-140 - Got response: 200
2014-12-22 21:34:39,211 - WARNING - t-xp32-ix-140 - First password as administrator didn't work.
2014-12-22 21:34:41,138 - WARNING - t-xp32-ix-140 - First password as root didn't work.
2014-12-22 21:34:42,336 - WARNING - t-xp32-ix-140 - First password as cltbld didn't work.
2014-12-22 21:34:43,543 - INFO - t-xp32-ix-140 - Couldn't connect with any credentials.
2014-12-22 21:34:43,544 - ERROR - t-xp32-ix-140 - Authentication failed.
2014-12-22 21:34:43,544 - ERROR - t-xp32-ix-140 - Traceback (most recent call last):
2014-12-22 21:34:43,545 - ERROR - t-xp32-ix-140 -
2014-12-22 21:34:43,545 - ERROR - t-xp32-ix-140 -   File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/slave.py", line 139, in get_console
2014-12-22 21:34:43,545 - ERROR - t-xp32-ix-140 -     console.connect()  # Make sure we can connect properly
2014-12-22 21:34:43,546 - ERROR - t-xp32-ix-140 -
2014-12-22 21:34:43,546 - ERROR - t-xp32-ix-140 -   File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 86, in connect
2014-12-22 21:34:43,547 - ERROR - t-xp32-ix-140 -     raise last_exc
2014-12-22 21:34:43,547 - ERROR - t-xp32-ix-140 -
2014-12-22 21:34:43,547 - ERROR - t-xp32-ix-140 - AuthenticationException: Authentication failed.
2014-12-22 21:34:43,547 - ERROR - t-xp32-ix-140 -
2014-12-22 21:34:43,548 - ERROR - t-xp32-ix-140 -
2014-12-22 21:34:43,550 - INFO - t-xp32-ix-140 - Sending request: POST https://bugzilla.mozilla.org/rest/bug
2014-12-22 21:34:44,462 - INFO - t-xp32-ix-140 - Got response: 200
2014-12-22 21:34:44,464 - INFO - t-xp32-ix-140 - Sending request: GET https://bugzilla.mozilla.org/rest/bug/1114877
2014-12-22 21:34:44,691 - INFO - t-xp32-ix-140 - Got response: 200
2014-12-22 21:34:44,694 - INFO - t-xp32-ix-140 - Sending request: PUT https://bugzilla.mozilla.org/rest/bug/1095980
2014-12-22 21:34:45,660 - INFO - t-xp32-ix-140 - Got response: 200
2014-12-22 21:34:45,664 - INFO - t-xp32-ix-140 - Finished Processing item: (u't-xp32-ix-140', <function reboot at 0x2a6cb18>, (), {}, <slaveapi.actions.results.ActionResult object at 0x10d25550>)
2014-12-22 21:34:49,675 - INFO - -=- - 10.22.81.89 - - [2014-12-22 21:34:49] "GET /slaves/t-xp32-ix-140/actions/reboot?requestid=282219856 HTTP/1.1" 200 320 0.002137

2014-12-22 21:34:49,788 - INFO - -=- - 10.22.81.88 - - [2014-12-22 21:34:49] "GET /slaves/t-xp32-ix-140/actions/shutdown_buildslave HTTP/1.1" 200 135 0.001704

2014-12-22 21:34:49,898 - INFO - -=- - 10.22.81.89 - - [2014-12-22 21:34:49] "GET /slaves/t-xp32-ix-140/actions/reboot HTTP/1.1" 200 573 0.001381

=========================

Specifically seeing: 2014-12-22 21:34:34,684 - INFO - t-xp32-ix-140 - Getting devices.json info
2014-12-22 21:34:34,892 - INFO - t-xp32-ix-140 - Unable to establish IPMI session, retrying...
2014-12-22 21:34:34,893 - INFO - t-xp32-ix-140 - Getting bug info

tells me that it can't connect to IPMI with the stored credentials (which work for other IPMI sessions).
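
A minimal sketch of how the same check could be scripted: scan a slaveapi-style log for the "Unable to establish IPMI session" message for one host. The line layout is taken from the excerpt above; the function name `ipmi_failures` and the regex are illustrative and not part of slaveapi.

```python
import re

# Matches lines shaped like the log excerpt above:
# "<date> <time>,<ms> - <LEVEL> - <host> - <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<level>[A-Z]+) - (?P<host>\S+) - (?P<msg>.*)$"
)

IPMI_FAILURE = "Unable to establish IPMI session"


def ipmi_failures(log_text, host):
    """Return the timestamps of IPMI session failures logged for `host`."""
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m and m.group("host") == host and IPMI_FAILURE in m.group("msg"):
            hits.append(m.group("ts"))
    return hits


sample = """\
2014-12-22 21:34:34,684 - INFO - t-xp32-ix-140 - Getting devices.json info
2014-12-22 21:34:34,892 - INFO - t-xp32-ix-140 - Unable to establish IPMI session, retrying...
2014-12-22 21:34:34,893 - INFO - t-xp32-ix-140 - Getting bug info
"""

print(ipmi_failures(sample, "t-xp32-ix-140"))
```

Note that the message is logged at INFO level, so filtering the log by WARNING or ERROR would miss it.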

:van, can you confirm that:

(a) We have IPMI available on t-xp32-ix-140-mgmt.build.mozilla.org
(b) Our user/password combo for IPMI is accurate per docs

If (a) is wrong then we need to update inventory; if (b) is wrong we need to correct the IPMI interface; if neither is wrong I need to delve back in. (This probably all applies to Bug 1106633 as well.)
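
One way to test (a) and (b) by hand is to ask the management interface for the chassis power state with `ipmitool`. The sketch below only wraps that command. It assumes `ipmitool` is installed and that the BMC accepts the `lanplus` interface, which this bug does not confirm; the function names are hypothetical and this is not how slaveapi itself talks to IPMI.

```python
import os
import subprocess


def ipmi_status_command(mgmt_host, user):
    """Build an ipmitool call that queries chassis power state.

    -E makes ipmitool read the password from the IPMI_PASSWORD
    environment variable, so it does not show up in the process list.
    """
    return [
        "ipmitool", "-I", "lanplus",  # assumption: BMC speaks IPMI v2.0 (lanplus)
        "-H", mgmt_host,
        "-U", user,
        "-E",
        "chassis", "power", "status",
    ]


def ipmi_login_works(mgmt_host, user, password, timeout=15):
    """Return True if ipmitool exits 0, i.e. a session could be established."""
    env = dict(os.environ, IPMI_PASSWORD=password)
    try:
        result = subprocess.run(
            ipmi_status_command(mgmt_host, user),
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ipmitool missing, or the management host never answered
        return False
    return result.returncode == 0
```

A `False` result does not by itself separate case (a) from case (b): an unreachable management host and a rejected password both end in failure, so the error text from `ipmitool` still needs a look.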
Flags: needinfo?(bugspam.Callek) → needinfo?(vle)
Looks like it came back from iX with its IPMI password reset. I've changed the IPMI credentials for both -140 and -137; can you give it another try? Thanks!
Flags: needinfo?(vle) → needinfo?(bugspam.Callek)
Whether or not Callek will find any difference in the logs, from the slave health UI nothing's changed - I rebooted them both, they both failed ssh and "didn't attempt" (which as comment 9 says, actually means failed to connect to) IPMI.
(In reply to Van Le [:van] from comment #10)
> looks like it came back from iX with its IPMI password reset. ive changed
> the IPMI credentials for both -140 and -137, can you give it another try?
> thanks!

And per web-UI, the IPMI user/password is still incorrect

http://t-xp32-ix-140-mgmt.build.mozilla.org/

(I can't connect with any of the ipmi/ilo/pdu password+user combos I have either)
Flags: needinfo?(bugspam.Callek)
After IRC discussion, :van fixed the IPMI user/pass combo:

12/23/2014, 11:49:15 AM	reboot	Attempting SSH reboot...Failed. Attempting IPMI reboot...Success!
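
The history line above reflects a fallback order: SSH first, then IPMI. A minimal sketch of that pattern follows. The function name `reboot_with_fallback` and the stand-in callables are hypothetical and are not slaveapi's actual reboot code.

```python
def reboot_with_fallback(methods):
    """Try each (name, attempt) pair in order and stop at the first success.

    Returns (history, used), where `history` mimics the slave health
    wording above and `used` is the name of the method that worked,
    or None if every method failed.
    """
    parts = []
    for name, attempt in methods:
        try:
            ok = bool(attempt())
        except Exception:
            # Treat any error (e.g. failed authentication) as a failed attempt
            ok = False
        outcome = "Success!" if ok else "Failed."
        parts.append("Attempting %s reboot...%s" % (name, outcome))
        if ok:
            return " ".join(parts), name
    return " ".join(parts), None


# Stand-ins for the real reboot methods: SSH fails, IPMI succeeds.
history, used = reboot_with_fallback([
    ("SSH", lambda: False),
    ("IPMI", lambda: True),
])
print(history)
```

With wrong IPMI credentials the second attempt would fail too, and the caller would get `None` back as the method used.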

Let's do a final reimage and then try in prod again.
The issue was that there are 2 passwords used for iX IPMI. I read this and used the default mozillaadmin password:

* ix ipmi passwords (default)
    see bug 837165
    New:
        Username: mozillaadmin
        Password: (see above, keep matched to infra)

The issue is resolved: there is also a releng account that needs to be added for the reboot script to work. Host reimaged.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED