vertica2.stage.metrics.scl3.mozilla.com:HP RAID is CRITICAL: RAID CRITICAL

RESOLVED WONTFIX

Status

mozilla.org Graveyard
Server Operations
3 years ago
3 years ago

People

(Reporter: w0ts0n, Unassigned)

Details

(Whiteboard: out of warranty [data: consultative])

Description

3 years ago
nagios-scl3	 Tue 07:12:19 PDT [5277] vertica2.stage.metrics.scl3.mozilla.com:HP RAID is CRITICAL: RAID CRITICAL - HP Smart Array Failed:  Smart Array E200i in Slot 0 (Embedded) Controller Status: OK Cache Status: OK Smart Array P400 in Slot 3 Controller Status: OK Cache Status: Temporarily Disabled Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)


 sudo hpacucli controller all show status

Smart Array E200i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK

Smart Array P400 in Slot 3
   Controller Status: OK
   Cache Status: Temporarily Disabled
   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)
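
The status output above is easy to check by script. Below is a minimal sketch, not the actual Nagios check: it assumes the layout shown here (an unindented controller name followed by indented `Field: value` lines) and treats any value other than `OK` as a problem. The function names `parse_controller_status` and `failing_fields` are hypothetical.

```python
# Sample text copied from the hpacucli output in this bug.
SAMPLE = """\
Smart Array E200i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK

Smart Array P400 in Slot 3
   Controller Status: OK
   Cache Status: Temporarily Disabled
   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)
"""


def parse_controller_status(text):
    """Return {controller name: {field: value}} from the status text.

    Assumption: unindented lines name a controller and indented
    "Field: value" lines belong to the most recent controller.
    """
    controllers = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        if not line[0].isspace():
            current = line.strip()
            controllers[current] = {}
        elif current is not None and ":" in line:
            # Split on the first colon only; values may contain more text.
            key, _, value = line.strip().partition(":")
            controllers[current][key.strip()] = value.strip()
    return controllers


def failing_fields(controllers):
    """List (controller, field, value) for every field whose value is not "OK"."""
    return [
        (name, field, value)
        for name, fields in controllers.items()
        for field, value in fields.items()
        if value != "OK"
    ]


if __name__ == "__main__":
    for name, field, value in failing_fields(parse_controller_status(SAMPLE)):
        print(f"{name}: {field} = {value}")
```

Run against the sample, this reports only the two non-OK fields on the P400 in Slot 3, which matches what the alert flags.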

Updated

3 years ago
colo-trip: --- → scl3
  <nagios-scl3>	 Tue 13:12:20 PDT [5034] vertica2.stage.metrics.scl3.mozilla.com:HP RAID is CRITICAL: RAID CRITICAL - HP Smart Array Failed:  Smart Array E200i in Slot 0 (Embedded) Controller Status: OK Cache Status: OK Smart Array P400 in Slot 3 Controller Status: OK Cache Status: Temporarily Disabled Battery/Capacitor Status: Failed (Replace Batteries/Capacitors) (http://m.mozilla.org/HP+RAID)

Comment 2

3 years ago
This blade is a G1 and has been out of warranty since 2011. Any idea who the owner is? We should ask whether they want to upgrade this server (we have spare G6/G7 blades), P2V it, or decommission it.
Assignee: server-ops-dcops → server-ops
Component: Server Operations: DCOps → Server Operations
QA Contact: dmoore → shyam
Whiteboard: out of warranty

Comment 3

3 years ago
Vertica is a metrics server. CC: tmary, srich

Comment 4

3 years ago
What's the impact here? Is Vertica accessible? This cannot be decom'd or P2V'd. I'm assuming the blade can be replaced, but I'm not sure if both vertica1 and vertica2 need to be upgraded.
Note: this is a stage server.

Impact: Vertica is still accessible - production is not affected, and stage is still up. RAID functionality on the disks is compromised, so the system is not as fully redundant as we'd like it to be.

This is a stage server, which is why it's out of warranty, as are vertica1 and vertica2 in stage.

We are working on testing how much hardware we want/need for Vertica stage, but other issues have prevented us from moving forward for the last 2 months. However, this is a priority for Q4.

For now, please upgrade with the spare g6. You can create it as vertica4.stage, and we'll add it to the cluster and decommission vertica2.stage. In Q4 we will come up with a plan for what we want to do with the rest of the hardware in Vertica stage, as well as Vertica production.

Comment 6

3 years ago
Thanks, Sheeri. 

Rick, Ashlee, Ryan: will you confirm this is not vertica2.stage.metrics.scl3-storage? Even though it's named stage, it's listed in Inventory as production.
We have 4 statuses in inventory:
building
decommissioned
spare
production

Stage servers are marked as production.
Ludo - Sean is asking if this is the storage blade, or the server blade.

Comment 9

3 years ago
(In reply to Sheeri Cabral [:sheeri] from comment #8)
> Ludo - Sean is asking if this is the storage blade, or the server blade.

The alarm is for the battery on the *storage blade* itself. I can see that is not clear in the alert.  The storage blade has an integrated raid controller, with a replaceable battery.  


Storage Blade
Manufacturer	HP
Product Name	HP StorageWorks SB40c
Part Number	411243-B21
System Board Spare Part Number	430798-001
Serial Number	SGI833003L
ROM Version	1.20
*nod* and from what I understand, the battery is only used if there's a power outage, to finish saving any disk changes. Is that right?

Seems like a very low risk here, for a stage machine, if that's the case.
Whiteboard: out of warranty → out of warranty [consultative]
Whiteboard: out of warranty [consultative] → out of warranty [data: consultative]

Updated

3 years ago
See Also: → bug 1083805

Updated

3 years ago
See Also: → bug 833472

Comment 11

3 years ago
Closing this bug out, as there are multiple bugs for decom in 2015 and, as Sheeri mentions, this is low risk.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → WONTFIX
Product: mozilla.org → mozilla.org Graveyard