Failed power supply on node[13-16].mango.metrics.scl3

RESOLVED FIXED

Status

Infrastructure & Operations
DCOps
RESOLVED FIXED
5 years ago
3 years ago

People

(Reporter: ericz, Assigned: ericz)

Description

5 years ago
< nagios-scl3> | Mon 14:52:49 PDT [506] node16.mango.metrics.scl3.mozilla.com:IPMI Log is CRITICAL: CRITICAL -   18 -- 03/11/2013 -- 21:21:12 -- Power Supply #0xce -- Failure detected

[eziegenhorn@node16.mango.metrics.scl3 ~]$ sudo ipmitool sel list
  18 | 03/11/2013 | 21:21:12 | Power Supply #0xce | Failure detected | Asserted
ops: If you do find a failed power supply or loose cable, please use that opportunity to install a retention sleeve on the PDU junction at the top of the rack.
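
For anyone scripting against this output: the `ipmitool sel list` row quoted above splits on `|` into six fields. The following is only a minimal sketch that assumes every row has exactly the six columns seen in this bug (record ID, date, time, sensor, event, state); the `SelEntry` and `parse_sel_line` names are made up for illustration and are not part of ipmitool or the Nagios check.

```python
from typing import NamedTuple


class SelEntry(NamedTuple):
    """One row of `ipmitool sel list`, as seen in this bug (hypothetical type)."""
    record_id: str
    date: str
    time: str
    sensor: str
    event: str
    state: str


def parse_sel_line(line: str) -> SelEntry:
    """Split one SEL row on '|' and trim whitespace from each field."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 6:
        # Assumption: rows with a different column count are rejected, not guessed at.
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    return SelEntry(*fields)


# The row reported for node16 in the description.
entry = parse_sel_line(
    "  18 | 03/11/2013 | 21:21:12 | Power Supply #0xce | Failure detected | Asserted"
)
print(entry.sensor, "->", entry.event)
```

Note that the sensor field only carries the sensor number (`#0xce`), which is why the row alone does not say which physical supply failed.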

Updated

5 years ago
colo-trip: --- → scl3

Comment 2

5 years ago
The way these hosts are set up, each chassis has 4 power supplies, but each pair of power supplies is bonded to 1 PDU (we only have 2 PDUs per rack). I am wondering if these alerts are caused by the hardware doing some power balancing or power management that somehow sends out a false alarm because of the bonded power. These Y-splitting power cables don't have any intelligent switching capabilities.

I'm checking the power supplies in the chassis and there are no error lights/LEDs. Everything looks copacetic in the IPMI logs and IML, and nothing else has been reported since. Is there a way to raise the threshold on these alerts so that they page only if we get more than 1 alert in the same day?
It could genuinely be an intermittent, failing power supply. Do we have a feeling for how many different chassis we've seen these alerts on?
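
The thresholding idea raised above (page only when more than one alert arrives on the same day) could be sketched roughly as follows. This is not how the Nagios IPMI check is configured; `days_over_threshold` and its `max_per_day` parameter are hypothetical names, and the only event listed is the one from this bug.

```python
from collections import Counter
from datetime import datetime


def days_over_threshold(timestamps, max_per_day=1):
    """Return the calendar days that saw more than `max_per_day` events."""
    counts = Counter(ts.date() for ts in timestamps)
    return sorted(day for day, n in counts.items() if n > max_per_day)


# The single SEL event reported for node16 in this bug.
events = [datetime(2013, 3, 11, 21, 21, 12)]

# One event in a day stays under the threshold, so nothing would page.
print(days_over_threshold(events))  # []
```

With a rule like this, a genuinely intermittent supply would still page once it logged a second failure on the same day.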

Comment 4

5 years ago
We've been alerted for node[1-16],[41-44] before. Yeah, I'm pretty sure you're correct: there might be an intermittent power supply, since a bunch of hosts in that rack haven't reported anything.

SREs, any chance we can pinpoint which power supply it is? It's hard to tell which one from only a "Power Supply #0xce" error.
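
To check each affected host in turn, the bracketed range shorthand used in this bug (for example `node[13-16].mango.metrics.scl3` in the summary) has to be expanded into individual hostnames. A minimal sketch, assuming one `[start-end]` range per pattern; `expand_hosts` is a hypothetical helper, not an existing tool.

```python
import re
from typing import List


def expand_hosts(pattern: str) -> List[str]:
    """Expand a single [start-end] range in a host pattern into hostnames."""
    match = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if match is None:
        # No range present: treat the pattern as a single hostname.
        return [pattern]
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1)]


hosts = expand_hosts("node[13-16].mango.metrics.scl3")
print(hosts)
```

The resulting list could then drive a loop that runs `sudo ipmitool sel list` on each host, as was done for node16 in the description.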

Comment 5

5 years ago
SRE, HP's SmartStart Linux software can do hardware diagnostics. Is this something you guys can install and run?

Updated

5 years ago
Assignee: server-ops-dcops → eziegenhorn

Comment 6

5 years ago
Will see if we can get downtime to boot off SmartStart image.

Comment 7

5 years ago
Case number 4644467337 opened for 2 PSUs.

Comment 8

5 years ago
PSU replaced and SEL cleared. Please let me know if issues persist.
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations