Closed Bug 666943 Opened 14 years ago Closed 14 years ago

Drive failure suspected on sync48.db.scl2.svc

Categories

(Cloud Services :: Operations: Miscellaneous, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlaz, Assigned: jlaz)

Details

Looks like /dev/sdf may be dead on sync48.db.scl2.svc.

We received a nagios alert for high load at around 3AM:

03:07:01 < nagios-sjc1> [95] sync48.db.scl2.svc:load is CRITICAL: CRITICAL - load average: 28.21, 14.22, 8.41
03:11:57 < nagios-sjc1> sync48.db.scl2.svc:load is OK: OK - load average: 6.79, 9.29, 7.83

Contents of /proc/mdstat show that md125 is running a check (all eight members are still up):

md125 : active raid10 sdd1[0] sdl1[8](S) sdk1[7] sdj1[6] sdi1[5] sdh1[4] sdg1[3] sdf1[2] sde1[1]
      3907037184 blocks super 1.1 512K chunks 2 near-copies [8/8] [UUUUUUUU]
      [==>..................] check = 12.3% (482911616/3907037184) finish=5761.3min speed=9904K/sec
      bitmap: 4/30 pages [16KB], 65536KB chunk

and a look at dmesg shows that /dev/sdf could be our failed drive:

ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
ata6: EH complete
ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata6.00: irq_stat 0x40000001
ata6.00: failed command: READ DMA EXT
ata6.00: cmd 25/00:00:00:26:01/00:04:00:00:00/e0 tag 0 dma 524288 in
         res 51/40:00:3e:27:01/00:00:00:00:00/e0 Emask 0x9 (media error)
ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
sd 6:0:0:0: [sdf] Unhandled sense code
sd 6:0:0:0: [sdf] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 6:0:0:0: [sdf] Sense Key : Medium Error [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
        00 01 27 3e
sd 6:0:0:0: [sdf] Add. Sense: Unrecovered read error - auto reallocate failed
sd 6:0:0:0: [sdf] CDB: Read(10): 28 00 00 01 26 00 00 04 00 00
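For anyone repeating this triage, the two checks above (which disk is logging medium errors, and whether any array has actually lost a member) can be sketched as a pair of shell filters. This is only a rough sketch: the function names `medium_error_disks` and `degraded_arrays` are made up for illustration, and the patterns are assumed from the log lines pasted in this bug, so other kernels or drivers may word these messages differently.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the match patterns are taken from the logs in this bug.

# Read kernel log text on stdin and print each sdX device that reported
# a medium error or an unrecovered read error, one name per line.
medium_error_disks() {
    grep -E 'Sense Key : Medium Error|Unrecovered read error' \
        | grep -oE '\[sd[a-z]+\]' \
        | tr -d '[]' \
        | sort -u
}

# Read /proc/mdstat text on stdin and print each md array whose member
# status map (e.g. [UU_UUUUU]) contains "_", meaning a member is missing.
degraded_arrays() {
    awk '/^md[0-9]+ :/           { name = $1 }
         /\[[U_]*_[U_]*\]/       { print name }'
}
```

Typical use would be `dmesg | medium_error_disks` and `degraded_arrays < /proc/mdstat`. Against the output pasted here, the first would report sdf, while the second would print nothing, since md125 was still showing [8/8] [UUUUUUUU] at that point.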
Drive swapped, RAID rebuilding, and sdf marked as a spare.

[root@sync48.db.scl2.svc ~]# mdadm --manage /dev/md125 --add ${DEAD}1
mdadm: added /dev/sdf1
[root@sync48.db.scl2.svc ~]# cat /proc/mdstat
Personalities : [raid1] [raid10]
md125 : active raid10 sdf1[9](S) sdd1[0] sdl1[8] sdk1[7] sdj1[6] sdi1[5] sdh1[4] sdg1[3] sde1[1]
      3907037184 blocks super 1.1 512K chunks 2 near-copies [8/7] [UU_UUUUU]
      [>....................] recovery = 0.1% (1320000/976759296) finish=1655.4min speed=9819K/sec
      bitmap: 4/30 pages [16KB], 65536KB chunk

md126 : active raid1 sda1[0] sdc1[2](S) sdb1[1]
      102388 blocks super 1.0 [2/2] [UU]

md127 : active raid1 sda2[0] sdc2[2](S) sdb2[1]
      958095228 blocks super 1.1 [2/2] [UU]
      bitmap: 0/8 pages [0KB], 65536KB chunk

unused devices: <none>
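As a note for future swaps, the sequence around the `--add` above can be sketched as a dry-run helper that only prints the mdadm commands and runs nothing. The `--fail` and `--remove` steps are an assumption about the usual procedure; this bug only records the `--add`. The function name `swap_member_plan` is hypothetical, and the member is assumed to be partition 1 of the disk, as in `${DEAD}1`.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints the commands, does not execute them.
# Usage: swap_member_plan ARRAY DISK   (e.g. swap_member_plan /dev/md125 /dev/sdf)
swap_member_plan() {
    local array=$1 disk=$2
    local part="${disk}1"   # assumption: the array member is partition 1

    # Assumed pre-swap steps (not shown in this bug): mark the member
    # failed, then detach it from the array.
    printf 'mdadm --manage %s --fail %s\n'   "$array" "$part"
    printf 'mdadm --manage %s --remove %s\n' "$array" "$part"

    # The drive itself is swapped and repartitioned by hand at this point.
    printf '# replace %s physically and recreate %s, then:\n' "$disk" "$part"

    # The step recorded in this bug: add the new partition back.
    printf 'mdadm --manage %s --add %s\n' "$array" "$part"
}
```

Printing the plan first leaves room to double-check the device name before anything touches the array. If the kernel has already dropped the old member, the first two commands may be unnecessary or may report an error.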
Assignee: nobody → jlazaro
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Component: Operations: Hardware → Operations