Closed Bug 1235844 Opened 8 years ago Closed 8 years ago

vertica2.stage.metrics.scl3.mozilla.com:HP RAID is CRITICAL: RAID CRITICAL - HP Smart Array Failed: Smart Array E200i in Slot 0

Categories

(Infrastructure & Operations :: DCOps, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: mlankford, Assigned: van)

Details

8:38 AM <@nagios-scl3> Wed 08:38:13 PST [5180] vertica2.stage.metrics.scl3.mozilla.com:HP RAID is CRITICAL: RAID CRITICAL - HP Smart Array Failed: Smart Array E200i in Slot 0 (Embedded) Controller Status: OK Cache Status: OK Smart Array P400 in Slot 3 Controller Status: OK Cache Status: Temporarily Disabled Battery/Capacitor Status: Failed (Replace Batteries/Capacitors) (http://m.mozilla.org/HP+RAID)
[root@vertica2.stage.metrics.scl3 ~]# hpacucli controller slot=3 show config

Smart Array P400 in Slot 3                (sn: P61630G9SVN6IK)


   Internal Drive Cage at Port 1I, Box 1, OK

   Internal Drive Cage at Port 2I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (546.8 GB, RAID 6, OK)

      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 146 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 146 GB, OK)
      physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS, 146 GB, OK)
      physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS, 146 GB, OK)
      physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS, 146 GB, OK)
      physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SAS, 146 GB, OK)

[root@vertica2.stage.metrics.scl3 ~]# hpacucli controller all show

Smart Array E200i in Slot 0 (Embedded)    (sn: PBACB0A9VVH1B9)
Smart Array P400 in Slot 3                (sn: P61630G9SVN6IK)

[root@vertica2.stage.metrics.scl3 ~]# hpacucli controller all show status

Smart Array E200i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK

Smart Array P400 in Slot 3
   Controller Status: OK
   Cache Status: Temporarily Disabled
   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)


[root@vertica2.stage.metrics.scl3 ~]#
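For triage, the `hpacucli controller all show status` output above can be reduced to just the fields that are not `OK`. The following is a minimal sketch under stated assumptions, not the actual Nagios HP RAID check: the function names `parse_controller_status` and `failing_fields` are hypothetical, and the parser assumes every controller heading starts with "Smart Array" and every status line has the form `Field: Value`, as in the paste above.

```python
# Sample text copied from the `hpacucli controller all show status` output above.
SAMPLE = """
Smart Array E200i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK

Smart Array P400 in Slot 3
   Controller Status: OK
   Cache Status: Temporarily Disabled
   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)
"""


def parse_controller_status(text):
    """Map each controller heading to a dict of its {field: value} status lines.

    Assumption: headings start with "Smart Array"; status lines contain a colon.
    """
    controllers = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Smart Array"):
            current = line
            controllers[current] = {}
        elif current is not None and ":" in line:
            # Split on the first colon only, so values may contain colons.
            field, _, value = line.partition(":")
            controllers[current][field.strip()] = value.strip()
    return controllers


def failing_fields(controllers):
    """Return (controller, field, value) for every status that is not exactly "OK"."""
    return [
        (name, field, value)
        for name, fields in controllers.items()
        for field, value in fields.items()
        if value != "OK"
    ]


if __name__ == "__main__":
    for name, field, value in failing_fields(parse_controller_status(SAMPLE)):
        print(f"{name}: {field} = {value}")
```

On the sample, only the P400 in Slot 3 is reported (cache and battery/capacitor), which matches the alert: the embedded E200i named in the alert summary is itself healthy.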
Assignee: infra → server-ops-dcops
Component: Infrastructure: Other → DCOps
QA Contact: jdow
looks like another failed RAID battery in the P400 storage blade, which has no warranty information. the host has been out of warranty since Feb 2011, so it would be safe to assume they both expired at the same time. do we want to renew the warranty on these 2 devices?
Flags: needinfo?(mpressman)
also please note this is a G1 blade, so if we decide to keep the host it may be better to just upgrade it and renew the service contract on the blade.
QA Contact: jbarnell
colo-trip: --- → scl3
I'm not sure we know what the time frame is with regard to the vertica servers. They were initially planned to be decommissioned last year and then this quarter, and I don't know if we want to spend more on the stage servers. As of right now, their only purpose is for testing upgrades, and I don't see us upgrading without a future plan for the prod service. So, for right now, we can probably hold off on fixing this until a decision is made.
Flags: needinfo?(mpressman)
this is a stage server and it's only affecting the cache on the storage array. going to WONTFIX per c#4.
Assignee: server-ops-dcops → vle
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Just resolved the following:

vertica2.stage.metrics.scl3.mozilla.com:HP Health is CRITICAL: CHECK_NRPE: Socket timeout after 60 seconds. (http://m.mozilla.org/HP+Health)

with an hp-health service restart. Is this machine still appropriate for Nagios monitoring?