
vertica2.stage.metrics.scl3.mozilla.com:HP RAID is CRITICAL: RAID CRITICAL - HP Smart Array Failed: Smart Array E200i in Slot 0

RESOLVED WONTFIX

Status

Infrastructure & Operations
DCOps
2 years ago
a year ago

People

(Reporter: Marlena, Assigned: van)

(Reporter)

Description

2 years ago
8:38 AM <@nagios-scl3> Wed 08:38:13 PST [5180] vertica2.stage.metrics.scl3.mozilla.com:HP RAID is CRITICAL: RAID CRITICAL - HP Smart Array Failed: Smart Array E200i in Slot 0 (Embedded) Controller Status: OK Cache Status: OK Smart Array P400 in Slot 3 Controller Status: OK Cache Status: Temporarily Disabled Battery/Capacitor Status: Failed (Replace Batteries/Capacitors) (http://m.mozilla.org/HP+RAID)
(Reporter)

Comment 1

2 years ago
[root@vertica2.stage.metrics.scl3 ~]# hpacucli controller slot=3 show config

Smart Array P400 in Slot 3                (sn: P61630G9SVN6IK)


   Internal Drive Cage at Port 1I, Box 1, OK

   Internal Drive Cage at Port 2I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (546.8 GB, RAID 6, OK)

      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 146 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 146 GB, OK)
      physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS, 146 GB, OK)
      physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS, 146 GB, OK)
      physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS, 146 GB, OK)
      physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SAS, 146 GB, OK)

[root@vertica2.stage.metrics.scl3 ~]# hpacucli controller all show

Smart Array E200i in Slot 0 (Embedded)    (sn: PBACB0A9VVH1B9)
Smart Array P400 in Slot 3                (sn: P61630G9SVN6IK)

[root@vertica2.stage.metrics.scl3 ~]# hpacucli controller all show status

Smart Array E200i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK

Smart Array P400 in Slot 3
   Controller Status: OK
   Cache Status: Temporarily Disabled
   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)


[root@vertica2.stage.metrics.scl3 ~]#
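For anyone triaging similar alerts, here is a minimal sketch of how the `hpacucli controller all show status` output above could be scanned for non-OK fields. It assumes the layout shown in this comment (an unindented controller line followed by indented `Field: value` lines). The function names `parse_controller_status` and `failing_fields` are hypothetical, and this is not the actual nagios check.

```python
# Sample text copied from the "controller all show status" output above.
SAMPLE = """\
Smart Array E200i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK

Smart Array P400 in Slot 3
   Controller Status: OK
   Cache Status: Temporarily Disabled
   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)
"""


def parse_controller_status(text):
    """Return {controller name: {field: value}} (hypothetical helper).

    Assumes unindented lines name a controller and indented
    "Field: value" lines belong to the most recent controller.
    """
    controllers = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        if not line[0].isspace():
            current = controllers.setdefault(line.strip(), {})
        elif current is not None and ":" in line:
            # Split on the first colon only; values may contain more text.
            field, _, value = line.strip().partition(":")
            current[field.strip()] = value.strip()
    return controllers


def failing_fields(controllers):
    """List (controller, field, value) for every field that is not 'OK'."""
    return [
        (name, field, value)
        for name, fields in controllers.items()
        for field, value in fields.items()
        if value != "OK"
    ]


if __name__ == "__main__":
    for name, field, value in failing_fields(parse_controller_status(SAMPLE)):
        print(f"{name}: {field} = {value}")
```

Run against the sample, this reports only the P400 in Slot 3 (cache and battery/capacitor), which matches the alert text: the E200i is named in the alert summary but its own status fields are all OK.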
Assignee: infra → server-ops-dcops
Component: Infrastructure: Other → DCOps
QA Contact: jdow
(Assignee)

Comment 2

2 years ago
looks like another failed RAID battery in the p400 storage blade with no warranty information. the host has been out of warranty since feb 2011, so it'd be safe to assume they both expired at the same time. do we want to renew the warranty on these 2 devices?
Flags: needinfo?(mpressman)
(Assignee)

Comment 3

2 years ago
also please note this is a G1 blade, so perhaps it's better to just upgrade and renew the service contract on the blade if we decide to keep the host?
QA Contact: jbarnell
(Assignee)

Updated

2 years ago
colo-trip: --- → scl3
I'm not sure we know the time frame for the vertica servers. They were initially planned to be decommissioned last year, then this quarter, and I don't know if we want to spend more on the stage servers. Right now their only purpose is testing upgrades, and I don't see us upgrading without a future plan for the prod service. So for now we can probably hold off on fixing this until a decision is made.
Flags: needinfo?(mpressman)
(Assignee)

Comment 5

2 years ago
this is a stage server and it's only affecting the cache on the storage array. going to WONTFIX per c#4.
Assignee: server-ops-dcops → vle
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WONTFIX
Just resolved the following:

vertica2.stage.metrics.scl3.mozilla.com:HP Health is CRITICAL: CHECK_NRPE: Socket timeout after 60 seconds. (http://m.mozilla.org/HP+Health)

with an hp-health service restart. Is this machine still appropriate for nagios monitoring?