Closed Bug 850001 Opened 11 years ago Closed 11 years ago

Failed power supply on node[13-16].mango.metrics.scl3

Categories

(Infrastructure & Operations :: DCOps, task)

x86
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ericz, Assigned: ericz)

Details

< nagios-scl3> | Mon 14:52:49 PDT [506] node16.mango.metrics.scl3.mozilla.com:IPMI Log is CRITICAL: CRITICAL -   18 -- 03/11/2013 -- 21:21:12 -- Power Supply #0xce -- Failure detected

[eziegenhorn@node16.mango.metrics.scl3 ~]$ sudo ipmitool sel list
  18 | 03/11/2013 | 21:21:12 | Power Supply #0xce | Failure detected | Asserted
ops: If you do find a failed power supply or loose cable, please use that opportunity to install a retention sleeve on the PDU junction at the top of the rack.
colo-trip: --- → scl3
These hosts are set up so that each chassis has 4 power supplies, with each pair of supplies bonded to 1 PDU (we only have 2 PDUs per rack). I'm wondering whether we're getting these alerts because the hardware is doing some power balancing or power management and, when it does, somehow sends out a false alarm because of the bonded power. These Y-splitting power cables don't have any intelligent switching capabilities.

I'm checking the power supplies in the chassis and there are no error lights/LEDs. Everything looks copacetic when I check the IPMI logs and IML, and nothing else has been reported since. Is there a way to raise the threshold on these alerts so they page only if we get more than 1 alert in the same day?
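One way the "more than 1 alert in the same day" threshold could be approximated is by counting power supply failure entries per date in the `ipmitool sel list` output. The following is only a sketch, assuming the SEL lines keep the pipe-separated layout shown earlier in this bug; the function names `failures_per_day` and `should_page` are hypothetical and are not part of the existing Nagios check.

```python
from collections import Counter
from datetime import datetime


def failures_per_day(sel_text):
    """Count asserted power supply failure entries per calendar date.

    Assumes each SEL line looks like:
      18 | 03/11/2013 | 21:21:12 | Power Supply #0xce | Failure detected | Asserted
    Lines that do not match this layout are skipped.
    """
    counts = Counter()
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 6:
            continue
        _entry_id, date, _time, sensor, event, state = fields[:6]
        if not (sensor.startswith("Power Supply")
                and event == "Failure detected"
                and state == "Asserted"):
            continue
        try:
            day = datetime.strptime(date, "%m/%d/%Y").date()
        except ValueError:
            # Entries without a parseable date are ignored in this sketch.
            continue
        counts[day] += 1
    return counts


def should_page(sel_text, threshold=1):
    """Return True if any single day has more than `threshold` failures."""
    return any(n > threshold for n in failures_per_day(sel_text).values())


# The single entry quoted in this bug would not page under this rule.
sel_text = "  18 | 03/11/2013 | 21:21:12 | Power Supply #0xce | Failure detected | Asserted"
print(should_page(sel_text))  # False
```

A check built this way would still record the first event of the day but would only escalate on a repeat, which matches the request above. Whether a single failure should stay silent is a policy decision this sketch does not settle.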
It could genuinely be an intermittent, failing power supply. Do we have a feeling for how many different chassis we've seen these alerts on?
We've been alerted for node[1-16],[41-44] before. Yeah, I'm pretty sure you're correct: there might be an intermittent power supply, since there are a bunch of hosts in that rack that haven't reported anything.

SREs, any chance we can pinpoint which power supply it is? I'm not sure which one it is from only a "Power Supply #0xce" error.
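One possible approach to pinpointing the supply is to match the sensor number from the SEL entry (0xce) against the sensor numbers listed by `ipmitool sdr elist`. The sketch below assumes the usual pipe-separated `sdr elist` layout, with the sensor name in the first column and the sensor number as hex followed by `h` in the second. The sample lines are made up for illustration and are not output from these hosts. The helper name `find_sensor` is hypothetical, and the BMC may not expose an SDR record for this sensor number at all.

```python
import re


def find_sensor(sdr_elist_text, sensor_number):
    """Return the names of SDR entries whose sensor number matches.

    Assumes lines shaped like:
      <name> | <hex>h | <status> | <entity> | <reading>
    """
    matches = []
    for line in sdr_elist_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2:
            continue
        m = re.fullmatch(r"([0-9A-Fa-f]{1,2})h", fields[1])
        if m and int(m.group(1), 16) == sensor_number:
            matches.append(fields[0])
    return matches


# Hypothetical sample data, NOT captured from node16.
sample = """\
Power Supply 1   | CDh | ok  | 10.1 | Presence detected
Power Supply 2   | CEh | ok  | 10.2 | Presence detected
"""
print(find_sensor(sample, 0xCE))  # ['Power Supply 2']
```

If no entry matches, the sensor name cannot be recovered this way and a vendor diagnostic would still be needed.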
SREs, HP's SmartStart Linux software can do hardware diagnostics. Is this something you can install and run?
Assignee: server-ops-dcops → eziegenhorn
Will see if we can get downtime to boot off the SmartStart image.
Case number 4644467337 opened for 2 PSUs.
PSU replaced and SEL cleared. Please let me know if issues persist.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations