Closed
Bug 1235844
Opened 8 years ago
Closed 8 years ago
vertica2.stage.metrics.scl3.mozilla.com:HP RAID is CRITICAL: RAID CRITICAL - HP Smart Array Failed: Smart Array E200i in Slot 0
Categories: Infrastructure & Operations :: DCOps (task)
Tracking: Not tracked
Status: RESOLVED WONTFIX
People: Reporter: mlankford; Assignee: van
Details
8:38 AM <@nagios-scl3> Wed 08:38:13 PST [5180] vertica2.stage.metrics.scl3.mozilla.com:HP RAID is CRITICAL: RAID CRITICAL - HP Smart Array Failed: Smart Array E200i in Slot 0 (Embedded) Controller Status: OK Cache Status: OK Smart Array P400 in Slot 3 Controller Status: OK Cache Status: Temporarily Disabled Battery/Capacitor Status: Failed (Replace Batteries/Capacitors) (http://m.mozilla.org/HP+RAID)
[root@vertica2.stage.metrics.scl3 ~]# hpacucli controller slot=3 show config

Smart Array P400 in Slot 3 (sn: P61630G9SVN6IK)

   Internal Drive Cage at Port 1I, Box 1, OK
   Internal Drive Cage at Port 2I, Box 1, OK
   array A (SAS, Unused Space: 0 MB)

      logicaldrive 1 (546.8 GB, RAID 6, OK)

      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 146 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 146 GB, OK)
      physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS, 146 GB, OK)
      physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS, 146 GB, OK)
      physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS, 146 GB, OK)
      physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SAS, 146 GB, OK)

[root@vertica2.stage.metrics.scl3 ~]# hpacucli controller all show

Smart Array E200i in Slot 0 (Embedded) (sn: PBACB0A9VVH1B9)
Smart Array P400 in Slot 3 (sn: P61630G9SVN6IK)

[root@vertica2.stage.metrics.scl3 ~]# hpacucli controller all show status

Smart Array E200i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK

Smart Array P400 in Slot 3
   Controller Status: OK
   Cache Status: Temporarily Disabled
   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)

[root@vertica2.stage.metrics.scl3 ~]#
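To pick the failing fields out of `hpacucli controller all show status` output like the session above, here is a minimal Python sketch. It assumes the layout shown in this bug (a controller header line followed by indented `<Name> Status: <value>` lines) and treats anything other than exactly `OK` as a problem. The helper name `find_failures` is hypothetical; it only parses captured text, does not run hpacucli, and is not the logic of the Nagios HP RAID check.

```python
import re

# Matches "<Name> Status: <value>" lines, e.g. "Cache Status: Temporarily Disabled".
# Assumes the line layout shown in this bug's hpacucli output.
STATUS_RE = re.compile(r"^\s*(?P<field>[\w/ ]+? Status):\s*(?P<value>.+?)\s*$")


def find_failures(status_text: str) -> list[tuple[str, str, str]]:
    """Return (controller, field, value) for every status that is not exactly 'OK'.

    Hypothetical helper: parses text captured from
    `hpacucli controller all show status`; it does not call hpacucli itself.
    """
    failures = []
    controller = None
    for line in status_text.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        match = STATUS_RE.match(line)
        if match is None:
            # Any non-blank, non-status line is treated as a controller header,
            # e.g. "Smart Array P400 in Slot 3".
            controller = line.strip()
            continue
        value = match.group("value")
        if value != "OK":
            failures.append((controller, match.group("field"), value))
    return failures


if __name__ == "__main__":
    # Status output copied from this bug.
    sample = """\
Smart Array E200i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK

Smart Array P400 in Slot 3
   Controller Status: OK
   Cache Status: Temporarily Disabled
   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)
"""
    for ctrl, field, value in find_failures(sample):
        print(f"{ctrl}: {field} = {value}")
```

On the sample above this reports two entries for the P400 in slot 3: the disabled cache and the failed battery/capacitor. The E200i is not listed because all of its statuses are `OK`.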
Updated • 8 years ago
Assignee: infra → server-ops-dcops
Component: Infrastructure: Other → DCOps
QA Contact: jdow
Comment 2 • 8 years ago (Assignee)
Looks like another failed RAID battery in the P400 storage blade, and there is no warranty information for it. The host has been out of warranty since Feb 2011, so it's safe to assume both expired at the same time. Do we want to renew the warranty on these two devices?
Flags: needinfo?(mpressman)
Comment 3 • 8 years ago (Assignee)
Also note that this is a G1 blade, so if we decide to keep the host, it may be better to upgrade the blade and renew its service contract instead?
QA Contact: jbarnell
Updated • 8 years ago (Assignee)
colo-trip: --- → scl3
Comment 4 • 8 years ago
I'm not sure we know the time frame for the Vertica servers. They were initially slated for decommissioning last year, then this quarter, and I don't know that we want to spend more on the stage servers. Right now their only purpose is testing upgrades, and I don't see us upgrading without a future plan for the prod service. So for now we can probably hold off on fixing this until a decision is made.
Flags: needinfo?(mpressman)
Comment 5 • 8 years ago (Assignee)
This is a stage server, and the failure only affects the cache on the storage array. Resolving as WONTFIX per comment 4.
Assignee: server-ops-dcops → vle
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Comment 6 • 8 years ago
Just resolved the following alert with an hp-health service restart: vertica2.stage.metrics.scl3.mozilla.com:HP Health is CRITICAL: CHECK_NRPE: Socket timeout after 60 seconds. (http://m.mozilla.org/HP+Health) Should this machine still be monitored by Nagios?