Closed
Bug 850001
Opened 11 years ago
Closed 11 years ago
Failed power supply on node[13-16].mango.metrics.scl3
Categories
(Infrastructure & Operations :: DCOps, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ericz, Assigned: ericz)
Details
< nagios-scl3> | Mon 14:52:49 PDT [506] node16.mango.metrics.scl3.mozilla.com:IPMI Log is CRITICAL: CRITICAL - 18 -- 03/11/2013 -- 21:21:12 -- Power Supply #0xce -- Failure detected
[eziegenhorn@node16.mango.metrics.scl3 ~]$ sudo ipmitool sel list
18 | 03/11/2013 | 21:21:12 | Power Supply #0xce | Failure detected | Asserted
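For anyone triaging similar alerts, here is a minimal sketch of splitting a pipe-delimited SEL line like the one above into fields and flagging an asserted power supply failure. The `SelEntry`, `parse_sel_line`, and `is_psu_failure` names are hypothetical helpers, not part of ipmitool, and the six-field layout is assumed from the single line pasted in this bug.

```python
from typing import NamedTuple, Optional


class SelEntry(NamedTuple):
    """One SEL record, with fields named after the columns seen in this bug."""
    record_id: str
    date: str
    time: str
    sensor: str
    event: str
    state: str


def parse_sel_line(line: str) -> Optional[SelEntry]:
    """Split one pipe-delimited SEL line; return None if it is not six fields."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) != 6:
        return None
    return SelEntry(*fields)


def is_psu_failure(entry: SelEntry) -> bool:
    """True for an asserted 'Failure detected' event on a power supply sensor."""
    return (
        entry.sensor.startswith("Power Supply")
        and "Failure detected" in entry.event
        and entry.state == "Asserted"
    )


# The SEL line reported on node16 in this bug.
line = "18 | 03/11/2013 | 21:21:12 | Power Supply #0xce | Failure detected | Asserted"
entry = parse_sel_line(line)
print(entry.sensor, is_psu_failure(entry))
```

Other SEL records may carry a different number of columns, so the parser returns `None` rather than guessing.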
Comment 1•11 years ago
ops: If you do find a failed power supply or a loose cable, please use that opportunity to install a retention sleeve on the PDU junction at the top of the rack.
Updated•11 years ago
colo-trip: --- → scl3
Comment 2•11 years ago
The way these hosts are set up, each chassis has 4 power supplies, but every 2 power supplies are bonded to go to 1 PDU (we only have 2 PDUs per rack). I am wondering if the alerts we're getting are because the hardware is doing some power balancing or power management and, when it does, it's somehow sending out a false alarm because of the bonded power. These Y-splitting power cables don't have any intelligent switching capabilities. I've checked the power supplies in the chassis and there are no error lights/LEDs. Everything looks copacetic in the IPMI logs and IML, and nothing else has been reported since. Is there a way to raise the threshold on these alerts so that they page only if we get more than 1 alert in the same day?
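The "more than 1 alert in the same day" threshold asked about above could be sketched roughly as follows. This is only an illustration of the counting logic, not a Nagios configuration; the `hosts_to_page` helper is hypothetical and the sample events are made up for the example.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, List, Tuple


def hosts_to_page(
    events: Iterable[Tuple[str, datetime]], max_per_day: int = 1
) -> List[Tuple[str, str]]:
    """Return (host, day) pairs that saw more than max_per_day alerts."""
    counts = Counter((host, ts.date().isoformat()) for host, ts in events)
    return sorted(key for key, n in counts.items() if n > max_per_day)


# Made-up sample data: node16 alerts twice on one day, node13 only once.
sample_events = [
    ("node16", datetime(2013, 3, 11, 21, 21, 12)),
    ("node16", datetime(2013, 3, 11, 23, 5, 0)),
    ("node13", datetime(2013, 3, 11, 20, 0, 0)),
]
pages = hosts_to_page(sample_events)
print(pages)
```

With this rule a single one-off event stays quiet, while a supply that flaps twice in a day still pages.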
Comment 3•11 years ago
It could genuinely be an intermittent, failing power supply. Do we have a feeling for how many different chassis we've seen these alerts on?
Comment 4•11 years ago
We've been alerted for node[1-16],[41-44] before. Yeah, I'm pretty sure you're correct; there might be an intermittently failing power supply, since there are a bunch of hosts in that rack that haven't reported anything. SREs, any chance we can pinpoint which power supply it is? Not really sure which one it is with only a "Power Supply #0xce" error.
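To check the SEL on every affected host, the bracketed range notation used above can be expanded into individual hostnames. A minimal sketch, assuming the `[start-end]` groups are inclusive and that every node shares the domain suffix seen in the Nagios alert; `expand_nodes` is a hypothetical helper.

```python
import re
from typing import List


def expand_nodes(prefix: str, ranges: str, suffix: str = "") -> List[str]:
    """Expand inclusive '[lo-hi]' groups into prefix + number + suffix names."""
    hosts: List[str] = []
    for lo, hi in re.findall(r"\[(\d+)-(\d+)\]", ranges):
        hosts.extend(f"{prefix}{n}{suffix}" for n in range(int(lo), int(hi) + 1))
    return hosts


# Ranges as written in the comment above; suffix taken from the alert hostname.
hosts = expand_nodes("node", "[1-16],[41-44]", ".mango.metrics.scl3.mozilla.com")
print(len(hosts), hosts[0], hosts[-1])
```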
Comment 5•11 years ago
SRE, HP's SmartStart Linux software can do hardware diagnostics. Is this something you guys can install and run?
Updated•11 years ago
Assignee: server-ops-dcops → eziegenhorn
Comment 6•11 years ago
Will see if we can get downtime to boot off the SmartStart image.
Comment 7•11 years ago
Case number 4644467337 opened for 2 PSUs.
Comment 8•11 years ago
PSU replaced and SEL cleared. Please let me know if issues persist.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•10 years ago
Product: mozilla.org → Infrastructure & Operations