Closed
Bug 1107736
Opened 10 years ago
Closed 10 years ago
Please run diagnostics on t-xp32-ix-140
Categories
(Infrastructure & Operations :: DCOps, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: coop, Unassigned)
References
Details
(Whiteboard: no IPMI reboot attempts)
This machine is currently unreachable, and has put itself into that state multiple times over the past few weeks. Please run full diagnostics.
Updated•10 years ago
colo-trip: --- → scl3
Comment 1•10 years ago
opened #AFB-729-61182 to drop off this host at iX for an extended burn in test.
Whiteboard: #AFB-729-61182
Comment 2•10 years ago
This host was dropped off at iX (Bug 1106633).
Comment 3•10 years ago
passed burn-in tests, reimaging. from iX:
"Hey Sal, Both nodes have passed our burn-in. We went ahead and updated the BIOS, as well as IPMI, on both units but please feel free to drop by any time. Cheers!"
Whiteboard: #AFB-729-61182 → reimaging
Comment 4•10 years ago
host is back online.
sals-MacBook-Pro-3:~ sal$ sudo fping 10.26.19.236
10.26.19.236 is alive
sals-MacBook-Pro-3:~ sal$ sudo fping 10.26.41.237
10.26.41.237 is alive
sals-MacBook-Pro-3:~ sal$ ssh !$
ssh 10.26.41.237
The authenticity of host '10.26.41.237 (10.26.41.237)' can't be established.
RSA key fingerprint is fe:ba:e0:31:37:7b:97:08:b9:68:6f:73:49:c2:69:7c.
Are you sure you want to continue connecting (yes/no)?
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 6•10 years ago
Just exactly like it was before its trip home.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 7•10 years ago
these hosts (137 and 140) not trying IPMI might be a config issue per :callek. he said he would have to dig into this, but that was during the Portland work week. i'll needinfo him to see if he can find any more info regarding these 2 hosts.
Flags: needinfo?(bugspam.Callek)
Updated•10 years ago
Whiteboard: reimaging → no IPMI reboot attempts
Comment 9•10 years ago
for my own knowledge, slaveapi1 log of a reboot attempt:

2014-12-22 21:34:34,343 - INFO - -.- - Processing item: (u't-xp32-ix-140', <function reboot at 0x2a6cb18>, (), {}, <slaveapi.actions.results.ActionResult object at 0x10d25550>)
2014-12-22 21:34:34,345 - INFO - -=- - 10.22.81.89 - - [2014-12-22 21:34:34] "POST /slaves/t-xp32-ix-140/actions/reboot HTTP/1.1" 202 275 0.002916
2014-12-22 21:34:34,348 - INFO - t-xp32-ix-140 - Getting inventory info
2014-12-22 21:34:34,684 - INFO - t-xp32-ix-140 - Getting devices.json info
2014-12-22 21:34:34,892 - INFO - t-xp32-ix-140 - Unable to establish IPMI session, retrying...
2014-12-22 21:34:34,893 - INFO - t-xp32-ix-140 - Getting bug info
2014-12-22 21:34:34,893 - INFO - t-xp32-ix-140 - Sending request: GET https://bugzilla.mozilla.org/rest/bug/t-xp32-ix-140
2014-12-22 21:34:35,177 - INFO - t-xp32-ix-140 - Got response: 200
2014-12-22 21:34:35,181 - INFO - t-xp32-ix-140 - Sending request: GET https://bugzilla.mozilla.org/rest/bug?product=Infrastructure%20%26%20Operations&component=DCOps&blocks=1095980&resolution=---
2014-12-22 21:34:35,477 - INFO - t-xp32-ix-140 - Got response: 200
2014-12-22 21:34:39,211 - WARNING - t-xp32-ix-140 - First password as administrator didn't work.
2014-12-22 21:34:41,138 - WARNING - t-xp32-ix-140 - First password as root didn't work.
2014-12-22 21:34:42,336 - WARNING - t-xp32-ix-140 - First password as cltbld didn't work.
2014-12-22 21:34:43,543 - INFO - t-xp32-ix-140 - Couldn't connect with any credentials.
2014-12-22 21:34:43,544 - ERROR - t-xp32-ix-140 - Authentication failed.
2014-12-22 21:34:43,544 - ERROR - t-xp32-ix-140 - Traceback (most recent call last):
2014-12-22 21:34:43,545 - ERROR - t-xp32-ix-140 -
2014-12-22 21:34:43,545 - ERROR - t-xp32-ix-140 - File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/slave.py", line 139, in get_console
2014-12-22 21:34:43,545 - ERROR - t-xp32-ix-140 - console.connect() # Make sure we can connect properly
2014-12-22 21:34:43,546 - ERROR - t-xp32-ix-140 -
2014-12-22 21:34:43,546 - ERROR - t-xp32-ix-140 - File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/ssh.py", line 86, in connect
2014-12-22 21:34:43,547 - ERROR - t-xp32-ix-140 - raise last_exc
2014-12-22 21:34:43,547 - ERROR - t-xp32-ix-140 -
2014-12-22 21:34:43,547 - ERROR - t-xp32-ix-140 - AuthenticationException: Authentication failed.
2014-12-22 21:34:43,547 - ERROR - t-xp32-ix-140 -
2014-12-22 21:34:43,548 - ERROR - t-xp32-ix-140 -
2014-12-22 21:34:43,550 - INFO - t-xp32-ix-140 - Sending request: POST https://bugzilla.mozilla.org/rest/bug
2014-12-22 21:34:44,462 - INFO - t-xp32-ix-140 - Got response: 200
2014-12-22 21:34:44,464 - INFO - t-xp32-ix-140 - Sending request: GET https://bugzilla.mozilla.org/rest/bug/1114877
2014-12-22 21:34:44,691 - INFO - t-xp32-ix-140 - Got response: 200
2014-12-22 21:34:44,694 - INFO - t-xp32-ix-140 - Sending request: PUT https://bugzilla.mozilla.org/rest/bug/1095980
2014-12-22 21:34:45,660 - INFO - t-xp32-ix-140 - Got response: 200
2014-12-22 21:34:45,664 - INFO - t-xp32-ix-140 - Finished Processing item: (u't-xp32-ix-140', <function reboot at 0x2a6cb18>, (), {}, <slaveapi.actions.results.ActionResult object at 0x10d25550>)
2014-12-22 21:34:49,675 - INFO - -=- - 10.22.81.89 - - [2014-12-22 21:34:49] "GET /slaves/t-xp32-ix-140/actions/reboot?requestid=282219856 HTTP/1.1" 200 320 0.002137
2014-12-22 21:34:49,788 - INFO - -=- - 10.22.81.88 - - [2014-12-22 21:34:49] "GET /slaves/t-xp32-ix-140/actions/shutdown_buildslave HTTP/1.1" 200 135 0.001704
2014-12-22 21:34:49,898 - INFO - -=- - 10.22.81.89 - - [2014-12-22 21:34:49] "GET /slaves/t-xp32-ix-140/actions/reboot HTTP/1.1" 200 573 0.001381

=========================

Specifically seeing:

2014-12-22 21:34:34,684 - INFO - t-xp32-ix-140 - Getting devices.json info
2014-12-22 21:34:34,892 - INFO - t-xp32-ix-140 - Unable to establish IPMI session, retrying...
2014-12-22 21:34:34,893 - INFO - t-xp32-ix-140 - Getting bug info

tells me that it can't connect to ipmi with the stored credentials (that work for other ipmi sessions)

:van can you confirm that
(a) We have ipmi available on t-xp32-ix-140-mgmt.build.mozilla.org
(b) Our user/password combo for ipmi is accurate per docs

If (a) is wrong then we need to update inventory, if (b) is wrong we need to correct the ipmi interface, if neither is wrong I need to delve back in.

(This probably all applies also to Bug 1106633)
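For anyone repeating this diagnosis on another host, the check in comment 9 can be scripted. The following is a minimal sketch, assuming the log keeps the "timestamp - LEVEL - host - message" shape shown in the excerpt; the function name `find_ipmi_failures` and the regular expression are illustrative and are not part of slaveapi.

```python
import re

# Matches lines shaped like the slaveapi excerpt in comment 9:
# "<date> <time>,<ms> - <LEVEL> - <host> - <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<level>[A-Z]+) - (?P<host>\S+) - (?P<msg>.*)$"
)


def find_ipmi_failures(lines, host):
    """Return (timestamp, message) pairs where `host` failed to open an IPMI session."""
    failures = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # not a line in the expected shape; skip it
        if match.group("host") != host:
            continue
        if "Unable to establish IPMI session" in match.group("msg"):
            failures.append((match.group("ts"), match.group("msg")))
    return failures


# The three lines called out in comment 9.
sample = [
    "2014-12-22 21:34:34,684 - INFO - t-xp32-ix-140 - Getting devices.json info",
    "2014-12-22 21:34:34,892 - INFO - t-xp32-ix-140 - Unable to establish IPMI session, retrying...",
    "2014-12-22 21:34:34,893 - INFO - t-xp32-ix-140 - Getting bug info",
]
print(find_ipmi_failures(sample, "t-xp32-ix-140"))
```

A non-empty result only says the session could not be established; it does not distinguish between case (a) and case (b) above.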
Flags: needinfo?(bugspam.Callek) → needinfo?(vle)
Comment 10•10 years ago
looks like it came back from iX with its IPMI password reset. ive changed the IPMI credentials for both -140 and -137, can you give it another try? thanks!
Flags: needinfo?(vle) → needinfo?(bugspam.Callek)
Comment 11•10 years ago
Whether or not Callek will find any difference in the logs, from the slave health UI nothing's changed - I rebooted them both, they both failed ssh and "didn't attempt" (which as comment 9 says, actually means failed to connect to) IPMI.
Comment 14•10 years ago
(In reply to Van Le [:van] from comment #10)
> looks like it came back from iX with its IPMI password reset. ive changed
> the IPMI credentials for both -140 and -137, can you give it another try?
> thanks!

And per web-UI, the IPMI user/password is still incorrect
http://t-xp32-ix-140-mgmt.build.mozilla.org/
(I can't connect with any of the ipmi/ilo/pdu password+user combos I have either)
Flags: needinfo?(bugspam.Callek)
Comment 15•10 years ago
after IRC discussion, :van fixed the ipmi user/pass combo

12/23/2014, 11:49:15 AM reboot
Attempting SSH reboot...Failed.
Attempting IPMI reboot...Success!

Let's do a final reimage and then try in prod again.
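The slave health output in this comment shows the reboot falling back from SSH to IPMI. A rough sketch of that ordering follows, assuming each method is a callable that returns True on success; `reboot_with_fallback` and the two stand-in functions are hypothetical and do not reflect how slaveapi actually implements it.

```python
def reboot_with_fallback(methods):
    """Try each (name, callable) in order and stop at the first that succeeds.

    Returns (name_of_winning_method_or_None, transcript_lines).
    """
    transcript = []
    for name, attempt in methods:
        try:
            ok = bool(attempt())
        except Exception:
            # Treat any error (e.g. an authentication failure) as a failed attempt.
            ok = False
        transcript.append(
            "Attempting %s reboot...%s" % (name, "Success!" if ok else "Failed.")
        )
        if ok:
            return name, transcript
    return None, transcript


# Stand-ins for the real SSH and IPMI calls (assumptions for illustration only).
def fake_ssh_reboot():
    raise RuntimeError("Authentication failed.")


def fake_ipmi_reboot():
    return True


winner, transcript = reboot_with_fallback(
    [("SSH", fake_ssh_reboot), ("IPMI", fake_ipmi_reboot)]
)
print(winner)
for line in transcript:
    print(line)
```

Keeping the transcript separate from the return value makes it easy to tell "IPMI was never attempted" apart from "IPMI was attempted and failed", which is the distinction comment 11 notes the UI blurred.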
Comment 16•10 years ago
the issue was that there are 2 passwords used for iX IPMI. i read this and used the default mozillaadmin password:

* ix ipmi passwords (default) see bug 837165
New:
Username: mozillaadmin
Password: (see above, keep matched to infra)

issue is resolved as there is a releng account that needs to be added for the reboot script to work. host reimaged.
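Since the root cause was an IPMI account missing after the vendor reset, a quick comparison of configured versus required usernames could catch this before a host goes back to production. This is only a sketch: `mozillaadmin` comes from the note quoted above, while `releng-reboot` is a made-up placeholder because this bug does not name the releng account, and the list of configured users would have to come from whatever tool is used to query the BMC.

```python
# "mozillaadmin" is quoted in comment 16; "releng-reboot" is a made-up
# placeholder, since the thread does not name the releng account.
REQUIRED_ACCOUNTS = ("mozillaadmin", "releng-reboot")


def missing_accounts(configured, required=REQUIRED_ACCOUNTS):
    """Return the required IPMI usernames absent from `configured`, sorted."""
    return sorted(set(required) - set(configured))


# Illustrative input only, not read from a real BMC.
example_users = ["mozillaadmin"]
print(missing_accounts(example_users))
```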
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED