Closed Bug 781242 Opened 12 years ago Closed 12 years ago

node2.testing.stage.metrics.scl3 has a bad DIMM

Categories

(Infrastructure & Operations :: DCOps, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: ericz, Unassigned)

node2.testing.stage.metrics.scl3.mozilla.com:IPMI Log is CRITICAL: CRITICAL -    4 -- 08/07/2012 -- 19:20:07 -- Memory -- Uncorrectable ECC -
It also has a bad power supply:

node2.testing.stage.metrics.scl3 ~]$ sudo ipmitool sel list
   1 | 05/07/2012 | 23:30:41 | OEM #0x02 | 
   2 | 05/07/2012 | 23:30:44 | OEM #0x02 | 
   3 | 08/07/2012 | 19:04:57 | Power Supply #0x17 | Failure detected | Asserted
   4 | 08/07/2012 | 19:20:07 | Memory | Uncorrectable ECC | Asserted
Transferring to DC Ops to do the RMA process.  IX Systems ticket submission is online at http://support.ixsystems.com/index.php?_m=tickets&_a=submit in case that is helpful (though I imagine you already knew that).
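For anyone triaging a similar alert, the `ipmitool sel list` output above can be filtered mechanically. The following is only a minimal sketch: it assumes the pipe-delimited layout shown in this bug, and the names `SelEntry`, `parse_sel`, and `hardware_faults` are hypothetical helpers, not part of ipmitool. Like the SEL itself, it reports only the sensor class (for example "Memory"), not which DIMM or power supply is at fault.

```python
from typing import List, NamedTuple


class SelEntry(NamedTuple):
    record_id: str
    date: str
    time: str
    sensor: str
    event: str
    state: str


# Sample copied from the output pasted in this bug.
SAMPLE = """\
   1 | 05/07/2012 | 23:30:41 | OEM #0x02 | 
   2 | 05/07/2012 | 23:30:44 | OEM #0x02 | 
   3 | 08/07/2012 | 19:04:57 | Power Supply #0x17 | Failure detected | Asserted
   4 | 08/07/2012 | 19:20:07 | Memory | Uncorrectable ECC | Asserted
"""


def parse_sel(text: str) -> List[SelEntry]:
    """Parse pipe-delimited SEL lines; skip prompts, blanks, and 'SEL has no entries'."""
    entries = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4:
            continue
        try:
            # Assumption: the first column is a numeric (hex) record ID.
            int(fields[0], 16)
        except ValueError:
            continue
        # OEM records may have fewer columns; pad so every entry has six fields.
        fields += [""] * (6 - len(fields))
        entries.append(SelEntry(*fields[:6]))
    return entries


def hardware_faults(entries: List[SelEntry]) -> List[SelEntry]:
    """Return asserted entries whose event text suggests a hardware fault."""
    # Assumption: these keywords cover the events seen in this bug only.
    keywords = ("failure", "uncorrectable")
    return [
        e for e in entries
        if e.state == "Asserted" and any(k in e.event.lower() for k in keywords)
    ]


if __name__ == "__main__":
    for entry in hardware_faults(parse_sel(SAMPLE)):
        print(entry.date, entry.time, entry.sensor, entry.event)
```

An empty result from `parse_sel` corresponds to a cleared log ("SEL has no entries"), so a check like this cannot tell a genuine recovery from a log that was simply cleared.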
Assignee: eziegenhorn → server-ops
Component: Server Operations → Server Operations: DCOps
QA Contact: jdow → dmoore
colo-trip: --- → scl3
:ericz, is there any more information you can provide us? There's more than one p/s in the chassis and multiple DIMMs on the board. I've tried logging in to the host and checking ipmitool but wasn't able to narrow it down. Hardware-wise, neither p/s has an amber light indicating hardware failure.

Thanks,
Van
No, it really doesn't give much information.  I'd ask the vendor how to identify the bad DIMM.  It's difficult in my experience.  I'd also ask about the power supply, because this keeps happening: it alerts as bad and then looks OK.
:ericz, the host is no longer showing the error messages, and I'm not seeing any amber LED indicating bad hardware. What do you suggest?

[vle@node2.testing.stage.metrics.scl3 ~]$ sudo ipmitool sel elist
SEL has no entries
[vle@node2.testing.stage.metrics.scl3 ~]$ sudo ipmitool sel list
SEL has no entries
Spoke to ericz and we're closing this as a false alarm: the system event log is clear, we're not seeing any amber hardware lights, and the logs don't point to any specific failing hardware component. We'll reopen if it alerts again.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
Assignee: server-ops → server-ops-dcops
Product: mozilla.org → Infrastructure & Operations