Closed
Bug 1116759
Opened 10 years ago
Closed 10 years ago
install.build.releng.scl3.mozilla.com has a failed raid1 disk
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: coop, Assigned: dividehex)
References
Details
(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4268] bad drive)
This machine is alerting constantly in #buildduty for failing nagios checks.
It's not desperate yet, but we won't be able to re-image any Mac build machines until this machine is back online.
Comment 1•10 years ago
The host gets stuck at boot-up. I suspect a bad drive, but we don't have any replacements.
(SN open for 500 GB drives)
Comment 2•10 years ago
After another reboot the host came back online. Please reopen if issues come up.
[sespinoza@admin1a.private.scl3 ~]$ fping install.build.releng.scl3.mozilla.com
install.build.releng.scl3.mozilla.com is alive
[sespinoza@admin1a.private.scl3 ~]$ ssh !$
ssh install.build.releng.scl3.mozilla.com
The authenticity of host 'install.build.releng.scl3.mozilla.com (10.26.52.17)' can't be established.
RSA key fingerprint is 70:32:94:83:9e:7c:c0:3c:a3:fa:85:55:0a:48:65:fb.
Are you sure you want to continue connecting (yes/no)?
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee
Comment 3•10 years ago
One of the disks in the RAID1 has failed. We need to back up the data in /Deploy and the current netboot NBI ASAP.
install:~ root# diskutil info /dev/disk2
Device Identifier: disk2
Device Node: /dev/disk2
Part of Whole: disk2
Device / Media Name: Raid1
Volume Name: Raid1
Escaped with Unicode: Raid1
Mounted: Yes
Mount Point: /
Escaped with Unicode: /
File System Personality: Journaled HFS+
Type (Bundle): hfs
Name (User Visible): Mac OS Extended (Journaled)
Journal: Journal size 40960 KB at offset 0xe8e000
Owners: Enabled
Content (IOContent): Apple_HFS
OS Can Be Installed: Yes
Media Type: Generic
Protocol: SATA
SMART Status: Not Supported
Volume UUID: F3351D6D-8032-34FA-8B08-1A763966E501
Total Size: 499.8 GB (499763838976 Bytes) (exactly 976101248 512-Byte-Blocks)
Volume Free Space: 260.9 GB (260928294912 Bytes) (exactly 509625576 512-Byte-Blocks)
Device Block Size: 512 Bytes
Read-Only Media: No
Read-Only Volume: No
Ejectable: No
Whole: Yes
Internal: Yes
Solid State: No
OS 9 Drivers: No
Low Level Format: Not supported
This disk is a RAID Set. RAID Set Information:
Set Name: Raid1
RAID Set UUID: 47126AFD-E94F-451A-AA93-08BEBFADCC43
Level Type: Mirror
Status: Degraded
Chunk Count: 7625791
Assignee
Updated•10 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•10 years ago
colo-trip: --- → scl3
Whiteboard: bad drive
Assignee
Comment 4•10 years ago
Sal attached an external HDD and I've started a tarball of /Deploy. When it is done I'll grab the NBI image as well. I'm still trying to determine the serial number of the bad HDD, though. Hopefully system_profiler will give that up.
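For reference, the backup amounts to something like the following sketch; the external drive's mount point below is a placeholder, not the actual volume name:
# Placeholder mount point for the external HDD; substitute the real volume
tar -czf /Volumes/external/Deploy-backup.tar.gz -C / Deploy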
Assignee
Comment 5•10 years ago
Looks like disk1 has failed. Serial # is 110411PCG420GLHD11DC
AppleRAID sets (1 found)
===============================================================================
Name: Raid1
Unique ID: 47126AFD-E94F-451A-AA93-08BEBFADCC43
Type: Mirror
Status: Degraded
Size: 499.8 GB (499763838976 Bytes)
Rebuild: manual
Device Node: disk2
-------------------------------------------------------------------------------
# DevNode UUID Status Size
-------------------------------------------------------------------------------
0 disk1s2 3FF0599F-7428-494C-96DF-DC82392BC309 Failed 499763838976
1 disk0s2 1064CBEB-795D-4F86-8EF9-B876B283FB90 Online 499763838976
===============================================================================
Hitachi HTS725050A9A362:
Capacity: 500.11 GB (500,107,862,016 bytes)
Model: Hitachi HTS725050A9A362
Revision: PC4ACB1E
Serial Number: 110411PCG420GLHD11DC
Native Command Queuing: Yes
Queue Depth: 32
Removable Media: No
Detachable Drive: No
BSD Name: disk1
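For anyone retracing this, a minimal sketch of how to match a failed RAID member to a physical drive's serial number (the output above looks like it came from these two commands, but the exact invocations are an assumption):
# List AppleRAID sets and member status; note the failed member's BSD node (e.g. disk1s2)
diskutil appleRAID list
# Map the BSD name (disk1) to its model and serial number
system_profiler SPSerialATADataType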
Assignee
Updated•10 years ago
Summary: install.build.releng.scl3.mozilla.com is unavailable → install.build.releng.scl3.mozilla.com has a failed raid1 disk
Assignee
Updated•10 years ago
Assignee: server-ops-dcops → relops
Component: DCOps → RelOps
QA Contact: arich
Assignee
Comment 6•10 years ago
I've requested another reboot for this system since it seems to have locked up again. When it comes back, I'm going to double-check that /Deploy and the NBI have finished backing up. Then I'm going to remove the failed drive from the array and have it physically removed. Hopefully that will stop it from continually ending up in limbo.
Assignee
Comment 7•10 years ago
Sal wasn't able to bring the system back up even after removing the external USB HDD. I've requested that he remove the failed drive in the hope that the system will boot from the degraded RAID array. Luckily, the backup of /Deploy finished before I left last night; I'm just not sure whether the NBI backed up. It isn't a big deal if it didn't and we aren't able to recover the RAID, since the NBI can be rebuilt. If the system is able to boot the array on just the remaining disk, I'll remove the old HDD's UUID so the array doesn't keep looking for it.
Assignee
Comment 8•10 years ago
With the failed disk removed, the system booted successfully. I've gone ahead and removed the failed disk member from the array. When the new disk arrives, it just needs to be installed and added to the array. The rebuild flag is set to manual, so the rebuild will need to be initiated by hand.
install:~ root# diskutil listRAID 47126AFD-E94F-451A-AA93-08BEBFADCC43
===============================================================================
Name: Raid1
Unique ID: 47126AFD-E94F-451A-AA93-08BEBFADCC43
Type: Mirror
Status: Degraded
Size: 499.8 GB (499763838976 Bytes)
Rebuild: manual
Device Node: disk1
-------------------------------------------------------------------------------
# DevNode UUID Status Size
-------------------------------------------------------------------------------
- -none- 3FF0599F-7428-494C-96DF-DC82392BC309 Missing/Damaged
1 disk0s2 1064CBEB-795D-4F86-8EF9-B876B283FB90 Online 499763838976
===============================================================================
install:~ root# diskutil removeFromRaid 3FF0599F-7428-494C-96DF-DC82392BC309 47126AFD-E94F-451A-AA93-08BEBFADCC43
Started RAID operation on disk1 Raid1
Removing disk from RAID
Finished RAID operation on disk1 Raid1
install:~ root# diskutil listRAID 47126AFD-E94F-451A-AA93-08BEBFADCC43
===============================================================================
Name: Raid1
Unique ID: 47126AFD-E94F-451A-AA93-08BEBFADCC43
Type: Mirror
Status: Online
Size: 499.8 GB (499763838976 Bytes)
Rebuild: manual
Device Node: disk1
-------------------------------------------------------------------------------
# DevNode UUID Status Size
-------------------------------------------------------------------------------
0 disk0s2 1064CBEB-795D-4F86-8EF9-B876B283FB90 Online 499763838976
===============================================================================
Comment 9•10 years ago
It's back to screaming alerts again.
Updated•10 years ago
Whiteboard: bad drive → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4268] bad drive
Comment 10•10 years ago
We certainly need to back this up ASAP, since it has gone down again.
Flags: needinfo?(jwatkins)
Assignee
Comment 11•10 years ago
I've asked DCOps for an ETA on the replacement drive and for a reboot of the host. I'm a bit surprised it stopped responding again; it makes me wonder whether more than just the disk has failed. There is a backup of /Deploy and the NBI files; I was able to grab them before the host went down.
When the host comes back up, I'll dig a little more into why it is going deaf to ssh and nrpe.
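Roughly what I plan to check once it's reachable, as a sketch (log paths are the stock defaults for this OS X release, nothing host-specific):
# Look for disk/controller errors around the time ssh and nrpe went deaf
grep -iE 'raid|ahci|i/o error' /var/log/system.log
# Confirm sshd is still registered with launchd
launchctl list | grep -i ssh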
Flags: needinfo?(jwatkins)
Assignee
Updated•10 years ago
Assignee: relops → jwatkins
Assignee
Comment 12•10 years ago
I couldn't find anything obvious as to why it went offline last time. DCOps was able to find a spare HDD before the ordered disk arrives. They installed it and the host came back online (after a while).
The new disk info according to system_profiler:
NVidia MCP89 AHCI:
Vendor: NVidia
Product: MCP89 AHCI
Link Speed: 3 Gigabit
Negotiated Link Speed: 3 Gigabit
Description: AHCI Version 1.30 Supported
Hitachi HTS727550A9E364:
Capacity: 500.11 GB (500,107,862,016 bytes)
Model: Hitachi HTS727550A9E364
Revision: JF3OA0E0
Serial Number: J3350080GPLNBC
Native Command Queuing: Yes
Queue Depth: 32
Removable Media: No
Detachable Drive: No
BSD Name: disk2
Rotational Rate: 7200
Medium Type: Rotational
Bay Name: Upper
Partition Map Type: Unknown
S.M.A.R.T. status: Verified
Assignee
Comment 13•10 years ago
It took some futzing, since removing a failed drive from a mirrored array converts it to a "single disk" mirror array (oh, Apple), but I got the drive added as a member. It required the AutoRebuild key to be set to true before it would rebuild. At its current rebuild rate, it shouldn't take more than a few hours to complete. I'll check back towards EOB.
Here are the correct steps for adding the disk and rebuilding the array:
install:~ root# diskutil appleraid update AutoRebuild 1 47126AFD-E94F-451A-AA93-08BEBFADCC43
The RAID has been successfully updated
install:~ root# diskutil appleraid add member /dev/disk2 47126AFD-E94F-451A-AA93-08BEBFADCC43
Started RAID operation on disk1 Raid1
Unmounting disk
Repartitioning disk2 so it can be in a RAID set
Unmounting disk
Creating the partition map
Adding disk2s2 to the RAID Set
Finished RAID operation on disk1 Raid1
install:~ root# diskutil appleraid list
AppleRAID sets (1 found)
===============================================================================
Name: Raid1
Unique ID: 47126AFD-E94F-451A-AA93-08BEBFADCC43
Type: Mirror
Status: Degraded
Size: 499.8 GB (499763838976 Bytes)
Rebuild: automatic
Device Node: disk1
-------------------------------------------------------------------------------
# DevNode UUID Status Size
-------------------------------------------------------------------------------
0 disk0s2 1064CBEB-795D-4F86-8EF9-B876B283FB90 Online 499763838976
1 disk2s2 CC9FCBFA-FCEA-45DD-BBE7-3EF355823401 0% (Rebuilding)499763838976
===============================================================================
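To keep an eye on the rebuild without re-running the full listing by hand, a throwaway loop like this works (the 5-minute interval is arbitrary):
# Poll the RAID member status lines every 5 minutes
while true; do
  diskutil appleRAID list | grep -E 'Status|Rebuilding|Online'
  sleep 300
done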
Assignee
Comment 14•10 years ago
I've been checking on the rebuild status throughout the day, and over the past hour it looks to be stuck at 98% progress. The only thing I can think to do at this point is check on it in the morning and hope it completes. :-/
install:~ root# diskutil appleraid list
AppleRAID sets (1 found)
===============================================================================
Name: Raid1
Unique ID: 47126AFD-E94F-451A-AA93-08BEBFADCC43
Type: Mirror
Status: Degraded
Size: 499.8 GB (499763838976 Bytes)
Rebuild: automatic
Device Node: disk1
-------------------------------------------------------------------------------
# DevNode UUID Status Size
-------------------------------------------------------------------------------
0 disk0s2 1064CBEB-795D-4F86-8EF9-B876B283FB90 Online 499763838976
1 disk2s2 CC9FCBFA-FCEA-45DD-BBE7-3EF355823401 98% (Rebuilding)499763838976
===============================================================================
Assignee
Comment 15•10 years ago
It finally finished. Now let's hope it stays up and running.
install:~ root# diskutil appleraid list
AppleRAID sets (1 found)
===============================================================================
Name: Raid1
Unique ID: 47126AFD-E94F-451A-AA93-08BEBFADCC43
Type: Mirror
Status: Online
Size: 499.8 GB (499763838976 Bytes)
Rebuild: automatic
Device Node: disk1
-------------------------------------------------------------------------------
# DevNode UUID Status Size
-------------------------------------------------------------------------------
0 disk0s2 1064CBEB-795D-4F86-8EF9-B876B283FB90 Online 499763838976
1 disk2s2 CC9FCBFA-FCEA-45DD-BBE7-3EF355823401 Online 499763838976
===============================================================================
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Assignee
Comment 16•10 years ago
I just logged into install.build, and now it looks like the OTHER drive has failed.
[root@install.build.releng.scl3.mozilla.com ~]# diskutil appleRaid list
AppleRAID sets (1 found)
===============================================================================
Name: Raid1
Unique ID: 47126AFD-E94F-451A-AA93-08BEBFADCC43
Type: Mirror
Status: Degraded
Size: 499.8 GB (499763838976 Bytes)
Rebuild: automatic
Device Node: disk2
-------------------------------------------------------------------------------
# DevNode UUID Status Size
-------------------------------------------------------------------------------
- -none- 1064CBEB-795D-4F86-8EF9-B876B283FB90 Missing/Damaged
1 disk1s2 CC9FCBFA-FCEA-45DD-BBE7-3EF355823401 Online 499763838976
===============================================================================
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 17•10 years ago
Vinh replaced the failed drive, but when I added the new drive to the array the host locked up. It's still pingable, but ssh and VNC are unresponsive. :-/ It's been about 15 minutes, so I've given up on it coming back on its own. Going to power cycle it.
Assignee
Comment 18•10 years ago
All looks good after a power cycle. The RAID is rebuilding. I'll check back in tomorrow morning.
[root@install.build.releng.scl3.mozilla.com ~]# diskutil appleraid list
AppleRAID sets (1 found)
===============================================================================
Name: Raid1
Unique ID: 47126AFD-E94F-451A-AA93-08BEBFADCC43
Type: Mirror
Status: Degraded
Size: 499.8 GB (499763838976 Bytes)
Rebuild: automatic
Device Node: disk2
-------------------------------------------------------------------------------
# DevNode UUID Status Size
-------------------------------------------------------------------------------
0 disk1s2 B1E94151-ED13-40F0-B8AA-BF53EF05E223 0% (Rebuilding)499763838976
1 disk0s2 CC9FCBFA-FCEA-45DD-BBE7-3EF355823401 Online 499763838976
===============================================================================
Assignee
Comment 19•10 years ago
The RAID is 100% rebuilt, and this time it only took 2.5 hours, much faster than the first rebuild. You can see this in the relevant log snippets below.
AppleRAID sets (1 found)
===============================================================================
Name: Raid1
Unique ID: 47126AFD-E94F-451A-AA93-08BEBFADCC43
Type: Mirror
Status: Online
Size: 499.8 GB (499763838976 Bytes)
Rebuild: automatic
Device Node: disk2
-------------------------------------------------------------------------------
# DevNode UUID Status Size
-------------------------------------------------------------------------------
0 disk1s2 B1E94151-ED13-40F0-B8AA-BF53EF05E223 Online 499763838976
1 disk0s2 CC9FCBFA-FCEA-45DD-BBE7-3EF355823401 Online 499763838976
===============================================================================
<drive fails>
Jan 12 02:40:29 install kernel[0]: Failed to issue COM RESET successfully after 3 attempts. Failing...
Jan 12 02:40:29 install kernel[0]: AppleRAIDMember::synchronizeCacheCallout: failed with e00002c0 on 1064CBEB-795D-4F86-8EF9-B876B283FB90
Jan 12 02:40:29 install kernel[0]: IOBlockStorageDriver[IOBlockStorageDriver]; executeRequest: request failed to start!
Jan 12 02:40:29 install kernel[0]: AppleRAID::recover() member 1064CBEB-795D-4F86-8EF9-B876B283FB90 from set "Raid1" (47126AFD-E94F-451A-AA93-08BEBFADCC43) has been marked offline.
Jan 12 02:40:29 install kernel[0]: AppleRAID::restartSet - restarting set "Raid1" (47126AFD-E94F-451A-AA93-08BEBFADCC43).
<system was rebooted after getting hung up when the disk was inserted into the array as a hot spare>
Jan 13 17:59:11 localhost kernel[0]: Darwin Kernel Version 11.2.0: Tue Aug 9 20:54:00 PDT 2011; root:xnu-1699.24.8~1/RELEASE_X86_64
<rebuild completed>
Jan 13 20:33:07 install kernel[0]: AppleRAID::restartSet - restarting set "Raid1" (47126AFD-E94F-451A-AA93-08BEBFADCC43).
Jan 13 20:33:07 install kernel[0]: AppleRAIDMirrorSet::rebuild complete for set "Raid1" (47126AFD-E94F-451A-AA93-08BEBFADCC43).
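The snippets above can be pulled out of the system log with something like this (assuming the default syslog destination on this release):
# AppleRAID state changes: member marked offline, set restart, rebuild complete
grep -i appleraid /var/log/system.log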
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED