Closed Bug 1116759 Opened 10 years ago Closed 10 years ago

install.build.releng.scl3.mozilla.com has a failed raid1 disk

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
macOS
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: dividehex)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4268] bad drive)

This machine is alerting constantly in #buildduty for failing nagios checks. It's not desperate yet, but we won't be able to re-image any Mac build machines until this machine is back online.
The host gets stuck at boot up. I suspect a bad drive, but we don't have any replacements. (SN open for 500GB drives)
After another reboot the host came back online. Please reopen if issues come up.

[sespinoza@admin1a.private.scl3 ~]$ fping install.build.releng.scl3.mozilla.com
install.build.releng.scl3.mozilla.com is alive
[sespinoza@admin1a.private.scl3 ~]$ ssh !$
ssh install.build.releng.scl3.mozilla.com
The authenticity of host 'install.build.releng.scl3.mozilla.com (10.26.52.17)' can't be established.
RSA key fingerprint is 70:32:94:83:9e:7c:c0:3c:a3:fa:85:55:0a:48:65:fb.
Are you sure you want to continue connecting (yes/no)?
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
One of the disks in the raid1 has failed. We need to back up the data on /Deploy and the current netboot nbi ASAP.

install:~ root# diskutil info /dev/disk2
   Device Identifier:        disk2
   Device Node:              /dev/disk2
   Part of Whole:            disk2
   Device / Media Name:      Raid1
   Volume Name:              Raid1
   Escaped with Unicode:     Raid1
   Mounted:                  Yes
   Mount Point:              /
   Escaped with Unicode:     /
   File System Personality:  Journaled HFS+
   Type (Bundle):            hfs
   Name (User Visible):      Mac OS Extended (Journaled)
   Journal:                  Journal size 40960 KB at offset 0xe8e000
   Owners:                   Enabled
   Content (IOContent):      Apple_HFS
   OS Can Be Installed:      Yes
   Media Type:               Generic
   Protocol:                 SATA
   SMART Status:             Not Supported
   Volume UUID:              F3351D6D-8032-34FA-8B08-1A763966E501
   Total Size:               499.8 GB (499763838976 Bytes) (exactly 976101248 512-Byte-Blocks)
   Volume Free Space:        260.9 GB (260928294912 Bytes) (exactly 509625576 512-Byte-Blocks)
   Device Block Size:        512 Bytes
   Read-Only Media:          No
   Read-Only Volume:         No
   Ejectable:                No
   Whole:                    Yes
   Internal:                 Yes
   Solid State:              No
   OS 9 Drivers:             No
   Low Level Format:         Not supported
   This disk is a RAID Set.
   RAID Set Information:
      Set Name:             Raid1
      RAID Set UUID:        47126AFD-E94F-451A-AA93-08BEBFADCC43
      Level Type:           Mirror
      Status:               Degraded
      Chunk Count:          7625791
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
colo-trip: --- → scl3
Whiteboard: bad drive
Sal attached an ext hdd and I've started a tarball of /Deploy. When it's done I'll grab the nbi image as well. I'm still trying to determine the serial of the bad hdd, though; hopefully system_profiler will give that up.
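The backup run itself isn't shown in the bug, but the kind of tarball described here could be sketched as follows. Every path is a stand-in (on the real host the source would be /Deploy and the destination the external drive's mount point); temp dirs are used so the sketch is self-contained.

```shell
# Hedged sketch of the /Deploy backup described above. SRC and DEST are
# hypothetical stand-ins; on the real host they would be /Deploy and the
# external drive's mount point under /Volumes.
SRC=$(mktemp -d)
DEST=$(mktemp -d)
echo "sample nbi payload" > "$SRC/example.nbi"   # stand-in content

# Archive the source tree, then list the archive to confirm it's readable.
tar -czf "$DEST/Deploy-backup.tar.gz" -C "$SRC" .
tar -tzf "$DEST/Deploy-backup.tar.gz"
```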
Looks like disk1 has failed. Serial # is 110411PCG420GLHD11DC.

AppleRAID sets (1 found)
===============================================================================
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Degraded
Size:                 499.8 GB (499763838976 Bytes)
Rebuild:              manual
Device Node:          disk2
-------------------------------------------------------------------------------
#  DevNode   UUID                                  Status     Size
-------------------------------------------------------------------------------
0  disk1s2   3FF0599F-7428-494C-96DF-DC82392BC309  Failed     499763838976
1  disk0s2   1064CBEB-795D-4F86-8EF9-B876B283FB90  Online     499763838976
===============================================================================

Hitachi HTS725050A9A362:
   Capacity:                 500.11 GB (500,107,862,016 bytes)
   Model:                    Hitachi HTS725050A9A362
   Revision:                 PC4ACB1E
   Serial Number:            110411PCG420GLHD11DC
   Native Command Queuing:   Yes
   Queue Depth:              32
   Removable Media:          No
   Detachable Drive:         No
   BSD Name:                 disk1
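A quick way to pick the failed member out of a listing like that is to match on the member rows. This is an illustrative sketch only; the heredoc reuses the two member lines pasted above, where a live host would pipe the real `diskutil appleraid list` output instead.

```shell
# Print the DevNode of any member whose Status column reads "Failed".
# Member rows have the form: <#> <DevNode> <UUID> <Status> <Size>
failed=$(awk '$NF ~ /^[0-9]+$/ && $(NF-1) == "Failed" { print $2 }' <<'EOF'
0  disk1s2   3FF0599F-7428-494C-96DF-DC82392BC309  Failed     499763838976
1  disk0s2   1064CBEB-795D-4F86-8EF9-B876B283FB90  Online     499763838976
EOF
)
echo "$failed"   # → disk1s2
```

From the member DevNode (disk1s2) the whole disk is disk1, and `system_profiler SPSerialATADataType` maps that BSD name back to the drive's serial number, as shown above.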
Summary: install.build.releng.scl3.mozilla.com is unavailable → install.build.releng.scl3.mozilla.com has a failed raid1 disk
Depends on: 1117192
Assignee: server-ops-dcops → relops
Component: DCOps → RelOps
QA Contact: arich
I've requested this system get another reboot since it seems to have locked up again. When it comes back, I'm going to double-check that /Deploy and the nbi have finished backing up. Then I'm going to remove the failed drive from the array and have it physically removed. Hopefully that will stop it from continually ending up in limbo.
Sal wasn't able to bring the system back up even after removing the ext usb hdd. I've requested that he remove the failed drive in hopes that the system will boot the degraded raid array. Luckily, the backup of /Deploy finished before I left last night; I'm just not sure if the NBI backed up. It isn't a big deal if it didn't, or if we aren't able to recover the raid, since the NBI is rebuildable. If the system is able to boot the array on just the remaining disk, I'll remove the old hdd uuid so it doesn't keep looking for it.
With the failed disk removed, the system successfully booted. I've gone ahead and removed the failed disk member from the array. When the new disk arrives it just needs to be installed and added to the array. The rebuild flag is set to manual, so the rebuild will need to be initiated by hand.

install:~ root# diskutil listRAID 47126AFD-E94F-451A-AA93-08BEBFADCC43
===============================================================================
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Degraded
Size:                 499.8 GB (499763838976 Bytes)
Rebuild:              manual
Device Node:          disk1
-------------------------------------------------------------------------------
#  DevNode   UUID                                  Status           Size
-------------------------------------------------------------------------------
-  -none-    3FF0599F-7428-494C-96DF-DC82392BC309  Missing/Damaged
1  disk0s2   1064CBEB-795D-4F86-8EF9-B876B283FB90  Online           499763838976
===============================================================================
install:~ root# diskutil removeFromRaid 3FF0599F-7428-494C-96DF-DC82392BC309 47126AFD-E94F-451A-AA93-08BEBFADCC43
Started RAID operation on disk1 Raid1
Removing disk from RAID
Finished RAID operation on disk1 Raid1
install:~ root# diskutil listRAID 47126AFD-E94F-451A-AA93-08BEBFADCC43
===============================================================================
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Online
Size:                 499.8 GB (499763838976 Bytes)
Rebuild:              manual
Device Node:          disk1
-------------------------------------------------------------------------------
#  DevNode   UUID                                  Status     Size
-------------------------------------------------------------------------------
0  disk0s2   1064CBEB-795D-4F86-8EF9-B876B283FB90  Online     499763838976
===============================================================================
It's back to screaming alerts again.
Whiteboard: bad drive → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4268] bad drive
We certainly need to back this up ASAP, since it has gone down again.
Flags: needinfo?(jwatkins)
I've asked dcops for an ETA on the replacement drive and to reboot the host. I'm a bit surprised it stopped responding again; it makes me wonder if more than just the disk has failed. There is a backup of /Deploy and the nbi files; I was able to grab them before it went down. When the host comes back up, I'll dig a little more into why it is going deaf to ssh and nrpe.
Flags: needinfo?(jwatkins)
Assignee: relops → jwatkins
I couldn't find anything really obvious as to why it went offline last time. DCOps was able to find a spare hdd to use until the ordered disk arrives. They installed it and the host came back online (after a while). The new disk info according to system_profiler:

NVidia MCP89 AHCI:
   Vendor:                   NVidia
   Product:                  MCP89 AHCI
   Link Speed:               3 Gigabit
   Negotiated Link Speed:    3 Gigabit
   Description:              AHCI Version 1.30 Supported

   Hitachi HTS727550A9E364:
      Capacity:                500.11 GB (500,107,862,016 bytes)
      Model:                   Hitachi HTS727550A9E364
      Revision:                JF3OA0E0
      Serial Number:           J3350080GPLNBC
      Native Command Queuing:  Yes
      Queue Depth:             32
      Removable Media:         No
      Detachable Drive:        No
      BSD Name:                disk2
      Rotational Rate:         7200
      Medium Type:             Rotational
      Bay Name:                Upper
      Partition Map Type:      Unknown
      S.M.A.R.T. status:       Verified
It took some futzing, since removing a failed drive from a mirrored array causes the array to be converted to a "single disk" mirror array (oh apple), but I got the drive added as a member. It required the AutoRebuild key to be set to true before it would rebuild. At its current rebuild rate, it shouldn't take more than a few hours to complete. I'll check back towards EOB. Here are the correct steps for adding the disk and rebuilding the array:

install:~ root# diskutil appleraid update AutoRebuild 1 47126AFD-E94F-451A-AA93-08BEBFADCC43
The RAID has been successfully updated
install:~ root# diskutil appleraid add member /dev/disk2 47126AFD-E94F-451A-AA93-08BEBFADCC43
Started RAID operation on disk1 Raid1
Unmounting disk
Repartitioning disk2 so it can be in a RAID set
Unmounting disk
Creating the partition map
Adding disk2s2 to the RAID Set
Finished RAID operation on disk1 Raid1
install:~ root# diskutil appleraid list
AppleRAID sets (1 found)
===============================================================================
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Degraded
Size:                 499.8 GB (499763838976 Bytes)
Rebuild:              automatic
Device Node:          disk1
-------------------------------------------------------------------------------
#  DevNode   UUID                                  Status            Size
-------------------------------------------------------------------------------
0  disk0s2   1064CBEB-795D-4F86-8EF9-B876B283FB90  Online            499763838976
1  disk2s2   CC9FCBFA-FCEA-45DD-BBE7-3EF355823401  0% (Rebuilding)   499763838976
===============================================================================
I've been checking on the rebuild status throughout the day, and over the past hour it looks to be stuck at 98%. The only thing I can think to do at this time is check on it in the morning and hope it completes. :-/

install:~ root# diskutil appleraid list
AppleRAID sets (1 found)
===============================================================================
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Degraded
Size:                 499.8 GB (499763838976 Bytes)
Rebuild:              automatic
Device Node:          disk1
-------------------------------------------------------------------------------
#  DevNode   UUID                                  Status            Size
-------------------------------------------------------------------------------
0  disk0s2   1064CBEB-795D-4F86-8EF9-B876B283FB90  Online            499763838976
1  disk2s2   CC9FCBFA-FCEA-45DD-BBE7-3EF355823401  98% (Rebuilding)  499763838976
===============================================================================
It finally finished. Now let's hope it stays up and running.

install:~ root# diskutil appleraid list
AppleRAID sets (1 found)
===============================================================================
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Online
Size:                 499.8 GB (499763838976 Bytes)
Rebuild:              automatic
Device Node:          disk1
-------------------------------------------------------------------------------
#  DevNode   UUID                                  Status     Size
-------------------------------------------------------------------------------
0  disk0s2   1064CBEB-795D-4F86-8EF9-B876B283FB90  Online     499763838976
1  disk2s2   CC9FCBFA-FCEA-45DD-BBE7-3EF355823401  Online     499763838976
===============================================================================
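Given how often this set has degraded, a trivial status check could be scripted along these lines. This is a sketch only: the heredoc stands in for live `diskutil appleraid list` output, and the exit code follows the usual nagios CRITICAL convention.

```shell
# Extract the set-level Status line and alert unless it reads "Online".
# The heredoc is sample output; a real check would run the command instead.
status=$(awk -F': *' '/^Status:/ { print $2 }' <<'EOF'
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Online
EOF
)
if [ "$status" = "Online" ]; then
    echo "RAID OK: $status"
else
    echo "RAID CRITICAL: $status"
    exit 2    # nagios CRITICAL
fi
```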
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
I just logged into install.build, and now it looks like the OTHER drive has failed.

[root@install.build.releng.scl3.mozilla.com ~]# diskutil appleRaid list
AppleRAID sets (1 found)
===============================================================================
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Degraded
Size:                 499.8 GB (499763838976 Bytes)
Rebuild:              automatic
Device Node:          disk2
-------------------------------------------------------------------------------
#  DevNode   UUID                                  Status           Size
-------------------------------------------------------------------------------
-  -none-    1064CBEB-795D-4F86-8EF9-B876B283FB90  Missing/Damaged
1  disk1s2   CC9FCBFA-FCEA-45DD-BBE7-3EF355823401  Online           499763838976
===============================================================================
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Vinh replaced the failed drive, but when I added the new drive to the array the system locked up. Still pingable, but ssh and vnc are unresponsive. :-/ It's been about 15m, so I've given up on it coming back on its own. Going to powercycle it.
All looks good after a powercycle. The raid is rebuilding. I'll check back in tomorrow morning.

[root@install.build.releng.scl3.mozilla.com ~]# diskutil appleraid list
AppleRAID sets (1 found)
===============================================================================
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Degraded
Size:                 499.8 GB (499763838976 Bytes)
Rebuild:              automatic
Device Node:          disk2
-------------------------------------------------------------------------------
#  DevNode   UUID                                  Status            Size
-------------------------------------------------------------------------------
0  disk1s2   B1E94151-ED13-40F0-B8AA-BF53EF05E223  0% (Rebuilding)   499763838976
1  disk0s2   CC9FCBFA-FCEA-45DD-BBE7-3EF355823401  Online            499763838976
===============================================================================
The raid is 100% rebuilt, and this time it only took 2.5 hours, which was much faster than the first rebuild. You can see this in the relevant log snippets below.

AppleRAID sets (1 found)
===============================================================================
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Online
Size:                 499.8 GB (499763838976 Bytes)
Rebuild:              automatic
Device Node:          disk2
-------------------------------------------------------------------------------
#  DevNode   UUID                                  Status     Size
-------------------------------------------------------------------------------
0  disk1s2   B1E94151-ED13-40F0-B8AA-BF53EF05E223  Online     499763838976
1  disk0s2   CC9FCBFA-FCEA-45DD-BBE7-3EF355823401  Online     499763838976
===============================================================================

<drive fails>
Jan 12 02:40:29 install kernel[0]: Failed to issue COM RESET successfully after 3 attempts. Failing...
Jan 12 02:40:29 install kernel[0]: AppleRAIDMember::synchronizeCacheCallout: failed with e00002c0 on 1064CBEB-795D-4F86-8EF9-B876B283FB90
Jan 12 02:40:29 install kernel[0]: IOBlockStorageDriver[IOBlockStorageDriver]; executeRequest: request failed to start!
Jan 12 02:40:29 install kernel[0]: AppleRAID::recover() member 1064CBEB-795D-4F86-8EF9-B876B283FB90 from set "Raid1" (47126AFD-E94F-451A-AA93-08BEBFADCC43) has been marked offline.
Jan 12 02:40:29 install kernel[0]: AppleRAID::restartSet - restarting set "Raid1" (47126AFD-E94F-451A-AA93-08BEBFADCC43).

<system was rebooted after getting hung up when the disk was inserted into the array as a hot spare>
Jan 13 17:59:11 localhost kernel[0]: Darwin Kernel Version 11.2.0: Tue Aug 9 20:54:00 PDT 2011; root:xnu-1699.24.8~1/RELEASE_X86_64

<rebuild completed>
Jan 13 20:33:07 install kernel[0]: AppleRAID::restartSet - restarting set "Raid1" (47126AFD-E94F-451A-AA93-08BEBFADCC43).
Jan 13 20:33:07 install kernel[0]: AppleRAIDMirrorSet::rebuild complete for set "Raid1" (47126AFD-E94F-451A-AA93-08BEBFADCC43).
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED