Some blades in SCL3 with SSDs report predictive failure

RESOLVED FIXED

Status

mozilla.org Graveyard
Server Operations
--
blocker
RESOLVED FIXED
6 years ago
3 years ago

People

(Reporter: jabba, Assigned: dumitru)

(Reporter)

Description

6 years ago
[root@dev1.db.scl3 ~]# hpacucli controller slot=0 show config

Smart Array P410i in Slot 0 (Embedded)    (sn: 5001438018F37170)

   array A (Solid State SATA, Unused Space: 0 MB)


      logicaldrive 1 (223.5 GB, RAID 1, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 240.0 GB, Predictive Failure)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 240.0 GB, Predictive Failure)

   SEP (Vendor ID PMCSIERA, Model  SRC 8x6G) 250 (WWID: 5001438018F3717F)

[root@dev1.db.scl3 ~]# 

This prevents the cache from being enabled, which in turn makes Puppet fail to create my.cnf, so it is blocking us from rolling out DB servers properly. The same issue is on bedrock1.db.scl3, but not on bedrock2.db.scl3. Perhaps a difference in firmware for either the disks or the controller?
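To check this across many blades, the per-drive status can be pulled out of the `show config` output with a small parser. This is only a minimal sketch, assuming the output format pasted above; the function name `parse_physical_drives` and the embedded sample are illustrative and not part of any HP tool.

```python
import re

# Matches lines such as:
#   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 240.0 GB, Predictive Failure)
PD_LINE = re.compile(r"^\s*physicaldrive\s+(\S+)\s+\((.*)\)\s*$")


def parse_physical_drives(output):
    """Return {drive_id: status} parsed from `show config` style text."""
    drives = {}
    for line in output.splitlines():
        match = PD_LINE.match(line)
        if match:
            drive_id, details = match.groups()
            # The status is the last comma-separated field inside the parentheses.
            drives[drive_id] = details.split(",")[-1].strip()
    return drives


# Sample copied from the dev1.db.scl3 output in this bug.
SAMPLE = """\
   array A (Solid State SATA, Unused Space: 0 MB)
      logicaldrive 1 (223.5 GB, RAID 1, OK)
      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 240.0 GB, Predictive Failure)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 240.0 GB, Predictive Failure)
"""

if __name__ == "__main__":
    for drive_id, status in parse_physical_drives(SAMPLE).items():
        flag = "" if status == "OK" else "  <-- needs attention"
        print(f"{drive_id}: {status}{flag}")
```

The logical drive and SEP lines are deliberately ignored, since only `physicaldrive` lines carry the predictive failure state.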
(Reporter)

Updated

6 years ago
Severity: minor → critical
I cannot find any differences in either the drive firmware or the controller firmware.
Comparison:

bedrock1.db.scl3                               bedrock2.db.scl3
Smart Array P410i in Slot 0 (Embedded)	Smart Array P410i in Slot 0 (Embedded)
   Bus Interface: PCI	   Bus Interface: PCI
   Slot: 0	   Slot: 0
   Serial Number: 5001438018EC9860	   Serial Number: 5001438018E46E50
   Cache Serial Number: PBCDH0CRH1FAUW	   Cache Serial Number: PBCDH0CRH1FARF
   RAID 6 (ADG) Status: Disabled	   RAID 6 (ADG) Status: Disabled
   Controller Status: OK	   Controller Status: OK   
   Hardware Revision: Rev C	   Hardware Revision: Rev C
   Firmware Version: 5.14	   Firmware Version: 5.14  
   Rebuild Priority: Medium	   Rebuild Priority: Medium
   Expand Priority: Medium	   Expand Priority: Medium 
   Surface Scan Delay: 15 secs	   Surface Scan Delay: 15 secs
   Surface Scan Mode: Idle	   Surface Scan Mode: Idle 
   Queue Depth: Automatic	   Queue Depth: Automatic  
   Monitor and Performance Delay: 60 min	   Monitor and Performance Delay: 60 min
   Elevator Sort: Enabled	   Elevator Sort: Enabled  
   Degraded Performance Optimization: Disabled	   Degraded Performance Optimization: Disabled
   Inconsistency Repair Policy: Disabled	   Inconsistency Repair Policy: Disabled
   Wait for Cache Room: Disabled	   Wait for Cache Room: Disabled
   Surface Analysis Inconsistency Notification: Disabled	   Surface Analysis Inconsistency Notification: Disabled
   Post Prompt Timeout: 0 secs	   Post Prompt Timeout: 0 secs
   Cache Board Present: True	   Cache Board Present: True
   Cache Status: OK	   Cache Status: OK
   Accelerator Ratio: 25% Read / 75% Write	   Accelerator Ratio: 25% Read / 75% Write
   Drive Write Cache: Disabled	   Drive Write Cache: Enabled
   Total Cache Size: 512 MB	   Total Cache Size: 512 MB
   No-Battery Write Cache: Disabled	   No-Battery Write Cache: Disabled
   Cache Backup Power Source: Capacitors	   Cache Backup Power Source: Capacitors
   Battery/Capacitor Count: 1	   Battery/Capacitor Count: 1
   Battery/Capacitor Status: OK	   Battery/Capacitor Status: OK
   SATA NCQ Supported: True	   SATA NCQ Supported: True
	
   Array: A	   Array: A
      Interface Type: Solid State SATA	      Interface Type: Solid State SATA
      Unused Space: 0 MB	      Unused Space: 0 MB   
      Status: OK	      Status: OK
	
	
	
      Logical Drive: 1	      Logical Drive: 1
         Size: 223.5 GB	         Size: 223.5 GB
         Fault Tolerance: RAID 1	         Fault Tolerance: RAID 1
         Heads: 255	         Heads: 255
         Sectors Per Track: 32	         Sectors Per Track: 32
         Cylinders: 57450	         Cylinders: 57450
         Strip Size: 256 KB	         Strip Size: 256 KB
         Status: OK	         Status: OK
         Array Accelerator: Enabled	         Array Accelerator: Enabled
         Unique Identifier: 600508B1001CB4AB3C8F929917B45C86	         Unique Identifier: 600508B1001C731CDDE9AFE8693F03F6
         Disk Name: /dev/sda	         Disk Name: /dev/sda
         Mount Points: /boot 100 MB, / 221.4 GB	         Mount Points: /boot 100 MB, / 221.4 GB
         Logical Drive Label: AF57FFBD5001438018EC98603540	         Logical Drive Label: AF7465BA5001438018E46E5067F4
         Mirror Group 0:	         Mirror Group 0:   
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 240.0 GB, Predictive Failure)	            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 240.0 GB, OK)
         Mirror Group 1:	         Mirror Group 1:   
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 240.0 GB, Predictive Failure)	            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 240.0 GB, OK)
	
      physicaldrive 1I:1:1	      physicaldrive 1I:1:1 
         Port: 1I	         Port: 1I
         Box: 1	         Box: 1
         Bay: 1	         Bay: 1
         Status: Predictive Failure	         Status: OK
         Drive Type: Data Drive	         Drive Type: Data Drive
         Interface Type: Solid State SATA	         Interface Type: Solid State SATA
         Size: 240.0 GB	         Size: 240.0 GB
         Firmware Revision: 321ABBF0	         Firmware Revision: 321ABBF0
         Serial Number: 240BA0002299        	         Serial Number: 240BA0001992
         Model: ATA     KINGSTON SKC100S	         Model: ATA     KINGSTON SKC100S
         SATA NCQ Capable: True	         SATA NCQ Capable: True
         SATA NCQ Enabled: True	         SATA NCQ Enabled: True
         SSD Smart Trip Wearout: Not Supported	         SSD Smart Trip Wearout: Not Supported
         PHY Count: 1	         PHY Count: 1
         PHY Transfer Rate: 3.0GBPS	         PHY Transfer Rate: 3.0GBPS
	
      physicaldrive 1I:1:2	      physicaldrive 1I:1:2 
         Port: 1I	         Port: 1I
         Box: 1	         Box: 1
         Bay: 2	         Bay: 2
         Status: Predictive Failure	         Status: OK
         Drive Type: Data Drive	         Drive Type: Data Drive
         Interface Type: Solid State SATA	         Interface Type: Solid State SATA
         Size: 240.0 GB	         Size: 240.0 GB
         Firmware Revision: 321ABBF0	         Firmware Revision: 321ABBF0
         Serial Number: 240BA0002313        	         Serial Number: 240BA0002606
         Model: ATA     KINGSTON SKC100S	         Model: ATA     KINGSTON SKC100S
         SATA NCQ Capable: True	         SATA NCQ Capable: True
         SATA NCQ Enabled: True	         SATA NCQ Enabled: True
         SSD Smart Trip Wearout: Not Supported	         SSD Smart Trip Wearout: Not Supported
         PHY Count: 1	         PHY Count: 1
         PHY Transfer Rate: 3.0GBPS	         PHY Transfer Rate: 3.0GBPS
	
	
   SEP (Vendor ID PMCSIERA, Model  SRC 8x6G) 250	   SEP (Vendor ID PMCSIERA, Model  SRC 8x6G) 250
      Device Number: 250	      Device Number: 250
      Firmware Version: RevC	      Firmware Version: RevC
      WWID: 5001438018EC986F	      WWID: 5001438018E46E5F
      Vendor ID: PMCSIERA	      Vendor ID: PMCSIERA
      Model:  SRC 8x6G       	      Model:  SRC 8x6G
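A side-by-side listing like the one above is easy to misread, so a mechanical comparison can help. The following is a rough sketch, assuming each host's detailed output has been saved separately and that both have the same lines in the same order; `differing_fields` and `IGNORED_KEYS` are hypothetical names, and the samples are a few lines excerpted from the comparison above. Per-unit identifiers (serial numbers, WWIDs, labels) are skipped because they are expected to differ.

```python
# Fields that are unique per unit and therefore expected to differ.
IGNORED_KEYS = {
    "Serial Number",
    "Cache Serial Number",
    "Unique Identifier",
    "Logical Drive Label",
    "WWID",
}


def clean_lines(text):
    """Strip whitespace and drop empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def differing_fields(left, right, ignored=IGNORED_KEYS):
    """Return (left_line, right_line) pairs that differ, pairing lines by position."""
    diffs = []
    for a, b in zip(clean_lines(left), clean_lines(right)):
        key = a.split(":", 1)[0].strip()
        if a != b and key not in ignored:
            diffs.append((a, b))
    return diffs


BEDROCK1 = """\
   Firmware Version: 5.14
   Drive Write Cache: Disabled
   Serial Number: 5001438018EC9860
      physicaldrive 1I:1:1
         Status: Predictive Failure
         Firmware Revision: 321ABBF0
"""

BEDROCK2 = """\
   Firmware Version: 5.14
   Drive Write Cache: Enabled
   Serial Number: 5001438018E46E50
      physicaldrive 1I:1:1
         Status: OK
         Firmware Revision: 321ABBF0
"""

if __name__ == "__main__":
    for a, b in differing_fields(BEDROCK1, BEDROCK2):
        print(f"{a!r} vs {b!r}")
```

In the listing above, the only lines that differ apart from the drive status and the per-unit identifiers are the `Drive Write Cache` settings (Disabled on bedrock1, Enabled on bedrock2). Whether that is related to the predictive failure reports is not established in this bug.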
Assignee: server-ops → ashish
dev1.db.scl3's IML shows:

Event: 7 Added: 03/29/2012 15:59
CAUTION: POST Messages - POST Error: 1720-S.M.A.R.T. Hard Drive Detects Imminent Failure.

However, the IML on bedrock1/2 was cleared on the same day, so I have no info about that. The SSDs aren't from HP, so I'm assigning this to phong to take up with Rich.
Assignee: ashish → phong
Blocks: 741100
I'm pretty sure this might actually be hard drive failure. 

If it's not, this is worrying...because we have a lot of these SSDs and predictive failures on them that aren't real == tons of false alerts, possibly lost data.
Rich,

Is there a known issue with HP and false positives for SSD "predictive failure"??  For some reason that rings a bell - but we have about half of the DB blades with Kingston SSDs showing imminent failure here.
Corey

I do not have access to the bug, but we did discuss this when we were looking at the use of 3rd party SSDs. The issue of false positives is a known one: it has to do with the interpretation of thermal output, as I believe most 3rd party SSDs do not have a temp sensor, as well as 3rd party drives not including the HP "SMART" technology. Please make sure you have the latest firmware and BIOS on the Smart Array and iLO.

As you recall we sent a demo unit to Phong for eval - he should have seen this error message during testing.

Feel free to call me to discuss 

Rich
617 308 9117
So, my fault for not escalating this sooner - but we have production DBs moving to this gear and it needs fixed.
Severity: critical → blocker
Can't reach phong on IRC and it's noon. This needs resolved today. Giving to oncall.
Assignee: phong → server-ops

Comment 8

6 years ago
Adding Rich from Terminal (SSD vendor) to the bug to see what we can do about this.
Assignee: server-ops → mburns
(In reply to Phong Tran [:phong] from comment #8)
> Adding Rich from Terminal (SSD vendor) to the bug to see what we can do
> about this.

I already did this, and he couldn't see the bug afterwards, so he emailed me directly.
What's the update here?
08:27:17 < nagios-scl3> [519] b1-db1.db.scl3.mozilla.com:HP Log is WARNING: WARNING 0013: POST Error: 1720-S.M.A.R.T. Hard Drive Detects Imminent Failure

08:33:58 < nagios-scl3> [520] bugzilla1.stage.db.scl3.mozilla.com:HP Log is WARNING: WARNING 0008: POST Error: 1720-S.M.A.R.T. Hard Drive Detects Imminent Failure
Group: infra

Updated

6 years ago
Assignee: mburns → phong
b1-db1 is the master in production for internal stuff. Let's schedule a time we can fix it.

Bugzilla stage isn't used yet. Can we get whatever needs to be done, done on bugzilla stage first, and then schedule maintenance for b1-db1?
I'm here in SCL3 tonight, figured I would make a definitive list of all nodes showing amber SSD disks:

buildbot2.db
b1-db2.db  (both disks)
developer2.db
b1-db1.db  (both disks)
getpersonas1.db 
bedrock1.db (both disks)
dev1.db  (both disks)
dev2.db  (both disks)
bugzilla1.stage.db  (both disks)
In use:
b1-db1.db  (both disks)
b1-db2.db  (both disks)

Not in use yet:
bugzilla1.stage.db  (both disks)
buildbot2.db
developer2.db
getpersonas1.db 
bedrock1.db (both disks)
dev1.db  (both disks)
dev2.db  (both disks)

So plenty of work can be done if we know what to do; there are only 2 hosts that can't be worked on at any time.
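The split between hosts that need a maintenance window and hosts that can be worked on at any time can be derived from the two lists above. This is just an illustrative sketch; the host names are taken from the lists in this bug, and `plan_work` is a hypothetical helper name.

```python
# Hosts showing amber SSD disks, in the order listed above.
AFFECTED = [
    "buildbot2.db",
    "b1-db2.db",
    "developer2.db",
    "b1-db1.db",
    "getpersonas1.db",
    "bedrock1.db",
    "dev1.db",
    "dev2.db",
    "bugzilla1.stage.db",
]

# Hosts already serving production traffic.
IN_USE = {"b1-db1.db", "b1-db2.db"}


def plan_work(affected, in_use):
    """Split affected hosts into (anytime, needs_window), preserving list order."""
    anytime = [host for host in affected if host not in in_use]
    needs_window = [host for host in affected if host in in_use]
    return anytime, needs_window


if __name__ == "__main__":
    anytime, needs_window = plan_work(AFFECTED, IN_USE)
    print("Can be done at any time:", ", ".join(anytime))
    print("Needs a scheduled window:", ", ".join(needs_window))
```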

Comment 15

6 years ago
Phong

I have been in constant contact with Kingston and one of their technical managers.  He said they are aware of the issue and are working on a correction.  They have recreated the issue in their labs and are working on a solution.  In all of my research before and after the sale of the Kingston drives it appears that this is a very common issue with all 3rd party SSD vendors with the HP Smart Array.  What is your time frame on this - I assume yesterday.  I know that Kingston is working on the issue but there is not a definitive timetable as to when they hope to have a resolution.  The only answer may be to use only HP SSD drives but that is a costly answer and one that I am sure you don't want to pursue.  Can we give Kingston a few more days to try to come up with an answer?

Rich
Two more hosts paged at the same time:

02:30:54 < nagios-scl3> [571] b1-db1.db.scl3.mozilla.com:HP Log is CRITICAL: (Service Check Timed Out)
02:30:54 < nagios-scl3> [572] b1-db2.db.scl3.mozilla.com:HP Log is CRITICAL: (Service Check Timed Out)
02:35:53 < nagios-scl3> [573] b1-db1.db.scl3.mozilla.com:HP Log is WARNING: WARNING 0013: POST Error: 1720-S.M.A.R.T. Hard Drive Detects Imminent Failure
02:35:54 < nagios-scl3> [574] b1-db2.db.scl3.mozilla.com:HP Log is WARNING: WARNING 0013: POST Error: 1720-S.M.A.R.T. Hard Drive Detects Imminent Failure
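To keep the list of affected hosts current, the hosts reporting POST error 1720 can be extracted from the IRC alert lines. This is a minimal sketch, assuming the line format shown in the alerts above; the regular expression and the function name `hosts_with_smart_1720` are illustrative guesses at that format, not a documented Nagios interface.

```python
import re

# Assumed line shape: "HH:MM:SS < bot> [id] host:service is STATE: message"
ALERT = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}) < (?P<bot>\S+)> \[(?P<id>\d+)\] "
    r"(?P<host>[^:\s]+):(?P<service>.+?) is (?P<state>[A-Z]+): (?P<message>.*)$"
)


def hosts_with_smart_1720(log):
    """Return a sorted list of hosts whose alert message mentions POST error 1720."""
    hosts = set()
    for line in log.splitlines():
        match = ALERT.match(line.strip())
        if match and "POST Error: 1720" in match.group("message"):
            hosts.add(match.group("host"))
    return sorted(hosts)


# Alert lines copied from this bug.
SAMPLE = """\
02:30:54 < nagios-scl3> [571] b1-db1.db.scl3.mozilla.com:HP Log is CRITICAL: (Service Check Timed Out)
02:30:54 < nagios-scl3> [572] b1-db2.db.scl3.mozilla.com:HP Log is CRITICAL: (Service Check Timed Out)
02:35:53 < nagios-scl3> [573] b1-db1.db.scl3.mozilla.com:HP Log is WARNING: WARNING 0013: POST Error: 1720-S.M.A.R.T. Hard Drive Detects Imminent Failure
02:35:54 < nagios-scl3> [574] b1-db2.db.scl3.mozilla.com:HP Log is WARNING: WARNING 0013: POST Error: 1720-S.M.A.R.T. Hard Drive Detects Imminent Failure
"""

if __name__ == "__main__":
    for host in hosts_with_smart_1720(SAMPLE):
        print(host)
```

The timed-out service checks are skipped because their message does not contain the 1720 error text.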
(Assignee)

Updated

6 years ago
Duplicate of this bug: 753714
(Assignee)

Updated

6 years ago
Duplicate of this bug: 753717
(Assignee)

Updated

6 years ago
Duplicate of this bug: 753720
(Assignee)

Comment 20

6 years ago
Corey, Phong, do you have any updates from Rich about this matter?
(Assignee)

Updated

6 years ago
Duplicate of this bug: 753749
(Assignee)

Comment 22

6 years ago
bugzilla2.stage.db.scl3 too
(Assignee)

Updated

6 years ago
Duplicate of this bug: 754666
Any updates from Rich about this matter?

Comment 25

6 years ago
Kingston is still working on the issue. I am checking with them every few days.
Duplicate of this bug: 758985
Duplicate of this bug: 758987

Comment 28

6 years ago
Still waiting for word from Kingston.

Comment 29

6 years ago
I got some new drives from Kingston with the new firmware, but unfortunately they are 120 GB drives instead of 240 GB.
(Assignee)

Updated

6 years ago
See Also: → bug 769073
Also affecting getpersonas2.db.scl3.
Also affecting developer1.db.scl3.
(Assignee)

Updated

6 years ago
Assignee: phong → dgherman

Updated

6 years ago
Duplicate of this bug: 765516

Comment 33

6 years ago
bedrock2.db.scl3 predictive drive failure

Comment 34

6 years ago
dev1.db.scl3.mozilla.com
dev2.db.scl3.mozilla.com

Updated

6 years ago
Duplicate of this bug: 741100
(Assignee)

Comment 36

6 years ago
We had 32 new SSD drives shipped and swapped them into dev1.db.scl3. So far it looks good. If no issues or alerts show up by Thursday, we'll swap them into all the blades:

[root@dev1.db.scl3 ~]# hpacucli ctrl slot=0 pd all show

Smart Array P410i in Slot 0 (Embedded)

   array A

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 240.0 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 240.0 GB, OK)
Status: NEW → ASSIGNED
(Assignee)

Comment 37

6 years ago
All drives replaced and array recovered.
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard