sync75.db.scl2.svc: failed drive


(Whiteboard: [/dev/sdc swapped, selftested, resync in progress])

sdc was locking up the MySQL RAID device, so I removed it from md125 and started a badblocks -v -w on /dev/sdc in a root screen session titled 'badblocks'.
Aborted badblocks because it found unrecoverably damaged sectors that were not repaired by writes. Please swap the drive during the next scl2 trip.
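
For reference, the removal described above could look roughly like this. It is a sketch, not a record of the exact commands that were run: the member name /dev/sdc1 is taken from the mdstat output later in this ticket, and the print_removal_plan helper is hypothetical. It only prints the commands so they can be reviewed first, since badblocks -w overwrites everything on the disk.

```shell
# Print (do not run) the commands for pulling a member out of an md array
# and starting a destructive badblocks write test in a named screen session.
# Arguments: array device, member partition, whole-disk device.
print_removal_plan() {
    echo "mdadm $1 --fail $2"
    echo "mdadm $1 --remove $2"
    echo "screen -S badblocks badblocks -v -w $3"
}

print_removal_plan /dev/md125 /dev/sdc1 /dev/sdc
```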

[root@sync75.db.scl2.svc rsoderberg]# smartctl -x /dev/sdc
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen,

Model Family:     SAMSUNG SpinPoint F1 DT series
Device Model:     SAMSUNG HD103UJ
Serial Number:    S13PJ1MQ605724
Firmware Version: 1AA01112
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Sun Oct  9 11:57:25 2011 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 (12088) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 202) minutes.
Conveyance self-test routine
recommended polling time: 	 (  22) minutes.
SCT capabilities: 	       (0x003f)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   099   021   051    Pre-fail  Always   In_the_past 12226
  3 Spin_Up_Time            0x0007   067   067   011    Pre-fail  Always       -       10720
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       67
  5 Reallocated_Sector_Ct   0x0033   067   067   010    Pre-fail  Always       -       1438
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       17910
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       2
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       67
 13 Read_Soft_Error_Rate    0x000e   099   022   000    Old_age   Always       -       11546
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   099    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       26370
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   077   065   000    Old_age   Always       -       23 (Lifetime Min/Max 22/23)
194 Temperature_Celsius     0x0022   077   060   000    Old_age   Always       -       23 (Lifetime Min/Max 22/24)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       394025282
196 Reallocated_Event_Count 0x0032   066   066   000    Old_age   Always       -       1438
197 Current_Pending_Sector  0x0012   093   091   000    Old_age   Always       -       283
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   099   099   000    Old_age   Always       -       6

General Purpose Logging (GPL) feature set supported
General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
GP/S  Log at address 0x00 has    1 sectors [Log Directory]
SMART Log at address 0x01 has    1 sectors [Summary SMART error log]
SMART Log at address 0x02 has    2 sectors [Comprehensive SMART error log]
GP    Log at address 0x03 has    2 sectors [Ext. Comprehensive SMART error log]
SMART Log at address 0x06 has    1 sectors [SMART self-test log]
GP    Log at address 0x07 has    2 sectors [Extended self-test log]
SMART Log at address 0x09 has    1 sectors [Selective self-test log]
GP    Log at address 0x10 has    1 sectors [NCQ Command Error]
GP    Log at address 0x11 has    1 sectors [SATA Phy Event Counters]
GP/S  Log at address 0xe0 has    1 sectors [SCT Command/Status]
GP/S  Log at address 0xe1 has    1 sectors [SCT Data Transfer]

SMART Extended Comprehensive Error Log Version: 1 (2 sectors)
Device Error Count: 18272 (device log contains only the most recent 8 errors)
	CR     = Command Register
	FEATR  = Features Register
	COUNT  = Count (was: Sector Count) Register
	LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
	LH     = LBA High (was: Cylinder High) Register    ]   LBA
	LM     = LBA Mid (was: Cylinder Low) Register      ] Register
	LL     = LBA Low (was: Sector Number) Register     ]
	DV     = Device (was: Device/Head) Register
	DC     = Device Control Register
	ER     = Error register
	ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 18272 [7] occurred at disk power-on lifetime: 17907 hours (746 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 03 b2 21 db e3 00  Error: UNC at LBA = 0x03b221db = 62005723

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  c8 00 00 00 80 00 00 00 b2 21 60 e3 08  3d+17:09:17.670  READ DMA
  27 00 00 00 00 00 00 00 00 00 00 e0 08  3d+17:09:17.670  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 00 00 00 00 a0 08  3d+17:09:17.650  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08  3d+17:09:17.650  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 00 00 00 00 00 e0 08  3d+17:09:17.610  READ NATIVE MAX ADDRESS EXT

Error 18271 [6] occurred at disk power-on lifetime: 17907 hours (746 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 03 b2 21 db e3 00  Error: UNC at LBA = 0x03b221db = 62005723

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  c8 00 00 00 80 00 00 00 b2 21 60 e3 08  3d+17:09:16.110  READ DMA
  27 00 00 00 00 00 00 00 00 00 00 e0 08  3d+17:09:16.100  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 00 00 00 00 a0 08  3d+17:09:16.080  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08  3d+17:09:16.080  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 00 00 00 00 00 e0 08  3d+17:09:16.080  READ NATIVE MAX ADDRESS EXT

Error 18270 [5] occurred at disk power-on lifetime: 17907 hours (746 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 03 b2 21 db e3 00  Error: UNC at LBA = 0x03b221db = 62005723

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  c8 00 00 00 80 00 00 00 b2 21 60 e3 08  3d+17:09:14.750  READ DMA
  27 00 00 00 00 00 00 00 00 00 00 e0 08  3d+17:09:14.750  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 00 00 00 00 a0 08  3d+17:09:14.730  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08  3d+17:09:14.730  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 00 00 00 00 00 e0 08  3d+17:09:14.730  READ NATIVE MAX ADDRESS EXT

Error 18269 [4] occurred at disk power-on lifetime: 17907 hours (746 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 03 b2 21 db e3 00  Error: UNC at LBA = 0x03b221db = 62005723

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  c8 00 00 00 80 00 00 00 b2 21 60 e3 08  3d+17:09:12.750  READ DMA
  c8 00 00 00 80 00 00 00 b2 20 e0 e3 08  3d+17:09:12.750  READ DMA
  c8 00 00 00 80 00 00 00 b2 20 60 e3 08  3d+17:09:12.750  READ DMA
  c8 00 00 00 80 00 00 00 b2 1f e0 e3 08  3d+17:09:12.750  READ DMA
  c8 00 00 00 80 00 00 00 b2 1f 60 e3 08  3d+17:09:12.750  READ DMA

Error 18268 [3] occurred at disk power-on lifetime: 17907 hours (746 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 03 b0 e9 09 e3 00  Error: UNC at LBA = 0x03b0e909 = 61925641

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  c8 00 00 00 08 00 00 00 b0 e9 08 e3 08  3d+17:09:09.830  READ DMA
  27 00 00 00 00 00 00 00 00 00 00 e0 08  3d+17:09:09.820  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 00 00 00 00 a0 08  3d+17:09:09.810  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08  3d+17:09:09.810  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 00 00 00 00 00 e0 08  3d+17:09:09.810  READ NATIVE MAX ADDRESS EXT

Error 18267 [2] occurred at disk power-on lifetime: 17907 hours (746 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 03 b0 e9 09 e3 00  Error: UNC at LBA = 0x03b0e909 = 61925641

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  c8 00 00 00 08 00 00 00 b0 e9 08 e3 08  3d+17:09:08.320  READ DMA
  27 00 00 00 00 00 00 00 00 00 00 e0 08  3d+17:09:08.310  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 00 00 00 00 a0 08  3d+17:09:08.290  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08  3d+17:09:08.290  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 00 00 00 00 00 e0 08  3d+17:09:08.290  READ NATIVE MAX ADDRESS EXT

Error 18266 [1] occurred at disk power-on lifetime: 17907 hours (746 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 03 b0 e9 09 e3 00  Error: UNC at LBA = 0x03b0e909 = 61925641

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  c8 00 00 00 08 00 00 00 b0 e9 08 e3 08  3d+17:09:06.910  READ DMA
  27 00 00 00 00 00 00 00 00 00 00 e0 08  3d+17:09:06.910  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 00 00 00 00 a0 08  3d+17:09:06.890  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08  3d+17:09:06.890  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 00 00 00 00 00 e0 08  3d+17:09:06.890  READ NATIVE MAX ADDRESS EXT

Error 18265 [0] occurred at disk power-on lifetime: 17907 hours (746 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 03 b0 e9 09 e3 00  Error: UNC at LBA = 0x03b0e909 = 61925641

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  c8 00 00 00 08 00 00 00 b0 e9 08 e3 08  3d+17:09:05.500  READ DMA
  27 00 00 00 00 00 00 00 00 00 00 e0 08  3d+17:09:05.490  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 00 00 00 00 a0 08  3d+17:09:05.470  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08  3d+17:09:05.470  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 00 00 00 00 00 e0 08  3d+17:09:05.470  READ NATIVE MAX ADDRESS EXT

SMART Extended Self-test Log Version: 0 (2 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  2
SCT Version (vendor specific):       256 (0x0100)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                 23 Celsius
Power Cycle Max Temperature:         24 Celsius
Lifetime    Max Temperature:         47 Celsius
SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:     -4/72 Celsius
Min/Max Temperature Limit:           -9/77 Celsius
Temperature History Size (Index):    128 (30)

Index    Estimated Time   Temperature Celsius
  31    2011-10-09 09:50    23  ****
 ...    ..(126 skipped).    ..  ****
  30    2011-10-09 11:57    23  ****

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2        13974  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2        13974  Transition from drive PhyRdy to drive PhyNRdy
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC

[root@sync75.db.scl2.svc rsoderberg]#
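
Note that the overall-health assessment above still reports PASSED; the trouble shows up in the raw counts for reallocated, pending, and reported-uncorrectable sectors (1438, 283, and 26370 here). A minimal sketch for pulling those out of smartctl -A output follows. The flag_smart_attrs name and the choice of attribute IDs to watch are assumptions for illustration, not something this ticket prescribes.

```shell
# Print the name and raw value of selected failure-indicating SMART
# attributes whose raw count is non-zero. Reads attribute rows in the
# `smartctl -A` layout on stdin, where field 1 is the ID, field 2 the
# name, and field 10 the raw value.
# The watched IDs (5, 187, 196, 197, 198) are an illustrative choice.
flag_smart_attrs() {
    awk '$1 ~ /^(5|187|196|197|198)$/ && $10 + 0 > 0 { print $2, $10 }'
}
```

Usage would be along the lines of: smartctl -A /dev/sdc | flag_smart_attrs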

Whiteboard: [swap /dev/sdc]

Swapped /dev/sdc with a new drive (detected as /dev/sdc). SMART self-test in progress:

# smartctl -S on -s on -t long /dev/sdc
smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen,

SMART Enabled.
SMART Attribute Autosave Enabled.

Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 194 minutes for test to complete.
Test will complete after Thu Oct 13 23:01:36 2011

Use smartctl -X to abort test.
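
Once the test has had time to finish, the result can be read back from the self-test log. A rough sketch, assuming the usual smartctl -l selftest layout in which the most recent test is listed as entry "# 1"; the latest_selftest_ok helper name is made up for illustration.

```shell
# Succeed (exit 0) only if the most recent self-test log entry (# 1)
# reports "Completed without error". Reads `smartctl -l selftest`
# output on stdin; exits 1 if there is no entry or it reports anything else.
latest_selftest_ok() {
    awk '/^# *1 / { found = 1; ok = ($0 ~ /Completed without error/) }
         END { exit !(found && ok) }'
}
```

Usage would be along the lines of: smartctl -l selftest /dev/sdc | latest_selftest_ok && echo "self-test passed"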

Whiteboard: [swap /dev/sdc] → [/dev/sdc swapped, selftesting]

Re-added /dev/sdc1 to the RAID10 device; resync in progress (will close after completion):

# cat /proc/mdstat 
Personalities : [raid1] [raid10] 
md125 : active raid10 sdc1[9] sdl1[8] sdk1[7] sdj1[6] sdf1[5] sde1[4] sdd1[3] sdb1[1] sda1[0]
      2930277888 blocks super 1.2 256K chunks 3 offset-copies [9/8] [UU_UUUUUU]
      [>....................]  recovery =  0.0% (40704/976759296) finish=799.5min speed=20352K/sec
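
The finish estimate can be sanity-checked from the other numbers on the recovery line: remaining blocks divided by the current speed. A small sketch; the resync_eta_minutes helper is hypothetical, and it assumes the block counts are in 1K units, matching the K/sec speed.

```shell
# Estimate remaining resync time in minutes.
# Arguments: blocks already recovered, total blocks, speed in K/sec.
resync_eta_minutes() {
    awk -v finished="$1" -v total="$2" -v speed="$3" \
        'BEGIN { printf "%.1f\n", (total - finished) / speed / 60 }'
}

resync_eta_minutes 40704 976759296 20352
```

With the figures above this comes out at roughly 800 minutes, in line with the kernel's own finish=799.5min. The estimate only holds while the speed stays near 20352K/sec.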

Whiteboard: [/dev/sdc swapped, selftesting] → [/dev/sdc swapped, selftested, resync in progress]

Resync completed successfully.

Assignee: nobody → jlaz
Closed: 13 years ago
Resolution: --- → FIXED
Component: Operations: Hardware → Operations
