node2.testing.stage.metrics.scl3 has a bad DIMM

RESOLVED WONTFIX

Status

Product: Infrastructure & Operations
Component: DCOps
Status: RESOLVED WONTFIX
Reported: 6 years ago
Modified: 3 years ago

People

(Reporter: ericz, Unassigned)

(Reporter)

Description

6 years ago
node2.testing.stage.metrics.scl3.mozilla.com:IPMI Log is CRITICAL: CRITICAL -    4 -- 08/07/2012 -- 19:20:07 -- Memory -- Uncorrectable ECC -
(Reporter)

Comment 1

6 years ago
It also has a bad power supply:

node2.testing.stage.metrics.scl3 ~]$ sudo ipmitool sel list
   1 | 05/07/2012 | 23:30:41 | OEM #0x02 | 
   2 | 05/07/2012 | 23:30:44 | OEM #0x02 | 
   3 | 08/07/2012 | 19:04:57 | Power Supply #0x17 | Failure detected | Asserted
   4 | 08/07/2012 | 19:20:07 | Memory | Uncorrectable ECC | Asserted
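
For triage, the listing above can be filtered down to the asserted events. What follows is only a minimal sketch, assuming the pipe-separated layout shown in this comment; the names `SelEntry`, `parse_sel_list`, and `asserted_faults` are hypothetical helpers, not part of ipmitool. It does not identify which DIMM or power supply is at fault, since the log itself does not say.

```python
from typing import List, NamedTuple


class SelEntry(NamedTuple):
    """One row of SEL output (hypothetical helper type, not an ipmitool API)."""
    record_id: str
    date: str
    time: str
    sensor: str
    event: str
    state: str


def parse_sel_list(output: str) -> List[SelEntry]:
    """Parse text in the pipe-separated layout shown in the comment above."""
    entries = []
    for line in output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Skip blank lines and messages such as "SEL has no entries",
        # which contain no pipe-separated columns.
        if len(fields) < 4:
            continue
        # Pad short rows (e.g. OEM records with an empty event column).
        fields += [""] * (6 - len(fields))
        entries.append(SelEntry(*fields[:6]))
    return entries


def asserted_faults(entries: List[SelEntry]) -> List[SelEntry]:
    """Keep only rows whose last column reads "Asserted"."""
    return [e for e in entries if e.state == "Asserted"]


# Sample copied from the listing in this comment.
SAMPLE = """\
   1 | 05/07/2012 | 23:30:41 | OEM #0x02 |
   2 | 05/07/2012 | 23:30:44 | OEM #0x02 |
   3 | 08/07/2012 | 19:04:57 | Power Supply #0x17 | Failure detected | Asserted
   4 | 08/07/2012 | 19:20:07 | Memory | Uncorrectable ECC | Asserted
"""

for e in asserted_faults(parse_sel_list(SAMPLE)):
    print(e.record_id, e.sensor, e.event)
```

On the sample this leaves records 3 and 4, the power supply and memory events. An empty log yields an empty list, so a caller can tell "nothing asserted" apart from a parse failure only by inspecting the raw text as well.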
(Reporter)

Comment 2

6 years ago
Transferring to DC Ops to do the RMA process.  IX Systems ticket submission is online at http://support.ixsystems.com/index.php?_m=tickets&_a=submit in case that is helpful (though I imagine you already knew that).
Assignee: eziegenhorn → server-ops
Component: Server Operations → Server Operations: DCOps
QA Contact: jdow → dmoore

Updated

6 years ago
colo-trip: --- → scl3

Comment 3

6 years ago
:ericz, is there any more information you can provide us? There's more than one p/s in the chassis and multiple DIMMs on the board. I've tried logging in to the host and checking ipmitool but wasn't able to narrow it down. Hardware-wise, neither p/s has an amber light indicating hardware failure.

Thanks,
Van
Duplicate of this bug: 781938
(Reporter)

Comment 5

6 years ago
No, it really doesn't give much information. I'd ask the vendor how to identify the bad DIMM; in my experience that is difficult. I'd also ask about the power supply, because this keeps happening: it alerts as bad and then looks OK.

Comment 6

6 years ago
:ericz, the host is no longer showing the error messages. I'm not seeing any amber LED indicating bad hardware. What do you suggest?

[vle@node2.testing.stage.metrics.scl3 ~]$ sudo ipmitool sel elist
SEL has no entries
[vle@node2.testing.stage.metrics.scl3 ~]$ sudo ipmitool sel list
SEL has no entries

Comment 7

6 years ago
Spoke to ericz and we're closing this as a false alarm: the system event logs are cleared, we're not seeing any amber hardware lights, and the logs aren't showing any specific hardware component failing. We'll reopen if it alerts again.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → WONTFIX

Updated

5 years ago
Assignee: server-ops → server-ops-dcops
Product: mozilla.org → Infrastructure & Operations