install.build.releng.scl3.mozilla.com has a failed raid1 disk

RESOLVED FIXED

Status

Priority: --
Severity: major
Status: RESOLVED FIXED
Opened: 4 years ago
Last modified: 4 years ago

People

(Reporter: coop, Assigned: dividehex)

Tracking

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4268] bad drive)

(Reporter)

Description

4 years ago
This machine is alerting constantly in #buildduty for failing nagios checks.

It's not desperate yet, but we won't be able to re-image any Mac build machines until this machine is back online.
Host gets stuck at boot-up. I suspect a bad drive, but we don't have any replacements.
(SN open for 500gb drives)
After another reboot the host came back online. Please reopen if issues come up.


[sespinoza@admin1a.private.scl3 ~]$ fping install.build.releng.scl3.mozilla.com
install.build.releng.scl3.mozilla.com is alive
[sespinoza@admin1a.private.scl3 ~]$ ssh !$
ssh install.build.releng.scl3.mozilla.com
The authenticity of host 'install.build.releng.scl3.mozilla.com (10.26.52.17)' can't be established.
RSA key fingerprint is 70:32:94:83:9e:7c:c0:3c:a3:fa:85:55:0a:48:65:fb.
Are you sure you want to continue connecting (yes/no)?
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
(Assignee)

Comment 3

4 years ago
One of the disks in the raid1 has failed.  We need to back up the data on /Deploy and the current netboot nbi ASAP.

install:~ root# diskutil info /dev/disk2
   Device Identifier:        disk2
   Device Node:              /dev/disk2
   Part of Whole:            disk2
   Device / Media Name:      Raid1

   Volume Name:              Raid1
   Escaped with Unicode:     Raid1

   Mounted:                  Yes
   Mount Point:              /
   Escaped with Unicode:     /

   File System Personality:  Journaled HFS+
   Type (Bundle):            hfs
   Name (User Visible):      Mac OS Extended (Journaled)
   Journal:                  Journal size 40960 KB at offset 0xe8e000
   Owners:                   Enabled

   Content (IOContent):      Apple_HFS
   OS Can Be Installed:      Yes
   Media Type:               Generic
   Protocol:                 SATA
   SMART Status:             Not Supported
   Volume UUID:              F3351D6D-8032-34FA-8B08-1A763966E501

   Total Size:               499.8 GB (499763838976 Bytes) (exactly 976101248 512-Byte-Blocks)
   Volume Free Space:        260.9 GB (260928294912 Bytes) (exactly 509625576 512-Byte-Blocks)
   Device Block Size:        512 Bytes

   Read-Only Media:          No
   Read-Only Volume:         No
   Ejectable:                No

   Whole:                    Yes
   Internal:                 Yes
   Solid State:              No
   OS 9 Drivers:             No
   Low Level Format:         Not supported

   This disk is a RAID Set.  RAID Set Information:
      Set Name:          Raid1
      RAID Set UUID:     47126AFD-E94F-451A-AA93-08BEBFADCC43
      Level Type:        Mirror
      Status:            Degraded
      Chunk Count:       7625791
(Assignee)

Updated

4 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
colo-trip: --- → scl3
Whiteboard: bad drive
(Assignee)

Comment 4

4 years ago
Sal attached an ext hdd and I've started a tarball of /Deploy. When it is done, I'll grab the nbi image as well. I'm still trying to determine the serial of the bad hdd, though. Hopefully system_profiler will give that up.
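For reference, the backup step above is essentially a gzipped tarball of /Deploy written to the external drive. A minimal sketch of that, hedged: the paths below are stand-ins exercised against a temp directory so the example stays runnable; on the real host the source would be /Deploy and the destination a path on the attached ext hdd.

```python
import os
import tarfile
import tempfile

def backup_dir(src, dest_tarball):
    """Create a gzipped tarball of src at dest_tarball; return its size in bytes."""
    with tarfile.open(dest_tarball, "w:gz") as tar:
        tar.add(src, arcname=os.path.basename(src))
    return os.path.getsize(dest_tarball)

# Stand-in for /Deploy so the sketch is runnable anywhere; on the real host
# src would be /Deploy and dest would live on the attached external drive.
src = tempfile.mkdtemp()
with open(os.path.join(src, "example.txt"), "w") as f:
    f.write("deploy payload")
dest = os.path.join(tempfile.mkdtemp(), "Deploy-backup.tar.gz")
size = backup_dir(src, dest)
print(size > 0)  # True
```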
(Assignee)

Comment 5

4 years ago
Looks like disk1 has failed. Serial # is 110411PCG420GLHD11DC

AppleRAID sets (1 found)
===============================================================================
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Degraded
Size:                 499.8 GB (499763838976 Bytes)
Rebuild:              manual
Device Node:          disk2
-------------------------------------------------------------------------------
#  DevNode   UUID                                  Status     Size
-------------------------------------------------------------------------------
0  disk1s2   3FF0599F-7428-494C-96DF-DC82392BC309  Failed     499763838976
1  disk0s2   1064CBEB-795D-4F86-8EF9-B876B283FB90  Online     499763838976
===============================================================================

Hitachi HTS725050A9A362:

          Capacity: 500.11 GB (500,107,862,016 bytes)
          Model: Hitachi HTS725050A9A362
          Revision: PC4ACB1E
          Serial Number: 110411PCG420GLHD11DC
          Native Command Queuing: Yes
          Queue Depth: 32
          Removable Media: No
          Detachable Drive: No
          BSD Name: disk1
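Matching a BSD device name to a serial number by eyeballing system_profiler output, as done above, is easy to script. A minimal sketch, assuming the plain-text field layout shown in this comment (where each drive's Serial Number line precedes its BSD Name line):

```python
import re

def drive_serials(profiler_text):
    """Map BSD device names (e.g. 'disk1') to serial numbers by scanning the
    'Key: Value' lines of system_profiler's plain-text output.  Relies on the
    Serial Number line appearing before the BSD Name line for each drive."""
    serials = {}
    pending = None
    for line in profiler_text.splitlines():
        m = re.match(r"\s*Serial Number:\s*(\S+)", line)
        if m:
            pending = m.group(1)
            continue
        m = re.match(r"\s*BSD Name:\s*(disk\d+)\s*$", line)
        if m and pending:
            serials[m.group(1)] = pending
            pending = None
    return serials

sample = """\
          Model: Hitachi HTS725050A9A362
          Serial Number: 110411PCG420GLHD11DC
          Native Command Queuing: Yes
          BSD Name: disk1
"""
print(drive_serials(sample))  # {'disk1': '110411PCG420GLHD11DC'}
```

On the host itself the input would come from running `system_profiler SPSerialATADataType` via subprocess.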
(Assignee)

Updated

4 years ago
Summary: install.build.releng.scl3.mozilla.com is unavailable → install.build.releng.scl3.mozilla.com has a failed raid1 disk
(Assignee)

Updated

4 years ago
Depends on: 1117192
(Assignee)

Updated

4 years ago
Assignee: server-ops-dcops → relops
Component: DCOps → RelOps
QA Contact: arich
(Assignee)

Comment 6

4 years ago
I've requested this system get another reboot since it seems to have locked up again.  When it comes back, I'm going to double-check that /Deploy and the nbi have finished backing up.  Then I'm going to remove the failed drive from the array and have it removed physically.  Hopefully that will stop it from continually ending up in limbo.
(Assignee)

Comment 7

4 years ago
Sal wasn't able to bring the system back up even after removing the ext USB hdd.  I've requested that he remove the failed drive in hopes that the system will boot the degraded raid array.  Luckily, the backup of /Deploy finished before I left last night; I'm just not sure if the NBI backed up.  It isn't a big deal if it didn't and we aren't able to recover the raid, since the NBI is rebuildable.  If the system is able to boot the array on just the remaining disk, I'll remove the old hdd's uuid so the array doesn't keep looking for it.
(Assignee)

Comment 8

4 years ago
With the failed disk removed, the system successfully booted.  I've gone ahead and removed the failed disk member from the array.  When the new disk arrives, it just needs to be installed and added to the array.  The rebuild flag is set to manual, so the rebuild will need to be initiated by hand.

install:~ root# diskutil listRAID 47126AFD-E94F-451A-AA93-08BEBFADCC43
===============================================================================
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Degraded
Size:                 499.8 GB (499763838976 Bytes)
Rebuild:              manual
Device Node:          disk1
-------------------------------------------------------------------------------
#  DevNode   UUID                                  Status     Size
-------------------------------------------------------------------------------
-  -none-    3FF0599F-7428-494C-96DF-DC82392BC309  Missing/Damaged
1  disk0s2   1064CBEB-795D-4F86-8EF9-B876B283FB90  Online     499763838976
===============================================================================
install:~ root# diskutil removeFromRaid 3FF0599F-7428-494C-96DF-DC82392BC309 47126AFD-E94F-451A-AA93-08BEBFADCC43
Started RAID operation on disk1 Raid1
Removing disk from RAID
Finished RAID operation on disk1 Raid1
install:~ root# diskutil listRAID 47126AFD-E94F-451A-AA93-08BEBFADCC43
===============================================================================
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Online
Size:                 499.8 GB (499763838976 Bytes)
Rebuild:              manual
Device Node:          disk1
-------------------------------------------------------------------------------
#  DevNode   UUID                                  Status     Size
-------------------------------------------------------------------------------
0  disk0s2   1064CBEB-795D-4F86-8EF9-B876B283FB90  Online     499763838976
===============================================================================
It's back to screaming alerts again.

Updated

4 years ago
Whiteboard: bad drive → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4268] bad drive
We certainly need to back this up ASAP, since it has gone down again.
Flags: needinfo?(jwatkins)
(Assignee)

Comment 11

4 years ago
I've asked dcops for an ETA on the replacement drive and to reboot the host.  I'm a bit surprised it stopped responding again; it makes me wonder if more than just the disk has failed.  There is a backup of /Deploy and the nbi files.  I was able to grab them before it went down.

When the host comes back up, I'll dig a little more into why it is going deaf to ssh and nrpe.
Flags: needinfo?(jwatkins)
(Assignee)

Updated

4 years ago
Assignee: relops → jwatkins
(Assignee)

Comment 12

4 years ago
I couldn't find anything really obvious as to why it went offline last time.  DCOps was able to find a spare hdd before the ordered disk arrived.  They installed it, and the host came back online (after a while).

The new disk info according to system profiler:
    NVidia MCP89 AHCI:

      Vendor: NVidia
      Product: MCP89 AHCI
      Link Speed: 3 Gigabit
      Negotiated Link Speed: 3 Gigabit
      Description: AHCI Version 1.30 Supported

        Hitachi HTS727550A9E364:

          Capacity: 500.11 GB (500,107,862,016 bytes)
          Model: Hitachi HTS727550A9E364
          Revision: JF3OA0E0
          Serial Number:       J3350080GPLNBC
          Native Command Queuing: Yes
          Queue Depth: 32
          Removable Media: No
          Detachable Drive: No
          BSD Name: disk2
          Rotational Rate: 7200
          Medium Type: Rotational
          Bay Name: Upper
          Partition Map Type: Unknown
          S.M.A.R.T. status: Verified
(Assignee)

Comment 13

4 years ago
It took some futzing, since removing a failed drive from a mirrored array causes the array to be converted to a "single disk" mirror array (oh, Apple), but I got the drive added as a member.  It required the AutoRebuild key to be set to true before it would rebuild.  At its current rebuild rate, it shouldn't take more than a few hours to complete.  I'll check back towards EOB.

Here are the correct steps for adding the disk to rebuild the array:

install:~ root# diskutil appleraid update AutoRebuild 1 47126AFD-E94F-451A-AA93-08BEBFADCC43
The RAID has been successfully updated

install:~ root# diskutil appleraid add member /dev/disk2 47126AFD-E94F-451A-AA93-08BEBFADCC43
Started RAID operation on disk1 Raid1
Unmounting disk
Repartitioning disk2 so it can be in a RAID set
Unmounting disk
Creating the partition map
Adding disk2s2 to the RAID Set
Finished RAID operation on disk1 Raid1

install:~ root# diskutil appleraid list
AppleRAID sets (1 found)
===============================================================================
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Degraded
Size:                 499.8 GB (499763838976 Bytes)
Rebuild:              automatic
Device Node:          disk1
-------------------------------------------------------------------------------
#  DevNode   UUID                                  Status     Size
-------------------------------------------------------------------------------
0  disk0s2   1064CBEB-795D-4F86-8EF9-B876B283FB90  Online     499763838976
1  disk2s2   CC9FCBFA-FCEA-45DD-BBE7-3EF355823401  0% (Rebuilding)499763838976
===============================================================================
(Assignee)

Comment 14

4 years ago
I've been checking on the rebuild status throughout the day, and over the past hour it looks to be stuck at 98% progress.  The only thing I can think to do at this time is check on it in the morning and hope it completes. :-/

install:~ root# diskutil appleraid list
AppleRAID sets (1 found)
===============================================================================
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Degraded
Size:                 499.8 GB (499763838976 Bytes)
Rebuild:              automatic
Device Node:          disk1
-------------------------------------------------------------------------------
#  DevNode   UUID                                  Status     Size
-------------------------------------------------------------------------------
0  disk0s2   1064CBEB-795D-4F86-8EF9-B876B283FB90  Online     499763838976
1  disk2s2   CC9FCBFA-FCEA-45DD-BBE7-3EF355823401  98% (Rebuilding)499763838976
===============================================================================
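Checking on the rebuild throughout the day, as above, could be scripted. A minimal sketch, assuming the plain-text listing format shown in these comments; on the host the input would come from running `diskutil appleraid list` via subprocess and the function could be called in a polling loop:

```python
import re

def rebuild_progress(listing):
    """Return the rebuild percentages found in 'diskutil appleraid list'
    output, one per member whose status reads 'NN% (Rebuilding)'."""
    return [int(m.group(1))
            for m in re.finditer(r"(\d+)% \(Rebuilding\)", listing)]

sample = (
    "0  disk0s2   1064CBEB-795D-4F86-8EF9-B876B283FB90  Online     499763838976\n"
    "1  disk2s2   CC9FCBFA-FCEA-45DD-BBE7-3EF355823401  98% (Rebuilding)499763838976\n"
)
print(rebuild_progress(sample))  # [98]
```

An empty list would mean no member is currently rebuilding (either fully Online or failed outright), so the caller should also check the set's Status line.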
(Assignee)

Comment 15

4 years ago
It finally finished.  Now let's hope it stays up and running.

install:~ root# diskutil appleraid list
AppleRAID sets (1 found)
===============================================================================
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Online
Size:                 499.8 GB (499763838976 Bytes)
Rebuild:              automatic
Device Node:          disk1
-------------------------------------------------------------------------------
#  DevNode   UUID                                  Status     Size
-------------------------------------------------------------------------------
0  disk0s2   1064CBEB-795D-4F86-8EF9-B876B283FB90  Online     499763838976
1  disk2s2   CC9FCBFA-FCEA-45DD-BBE7-3EF355823401  Online     499763838976
===============================================================================
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
(Assignee)

Comment 16

4 years ago
I just logged into install.build and now it looks like the OTHER drive has failed.

[root@install.build.releng.scl3.mozilla.com ~]# diskutil appleRaid list
AppleRAID sets (1 found)
===============================================================================
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Degraded
Size:                 499.8 GB (499763838976 Bytes)
Rebuild:              automatic
Device Node:          disk2
-------------------------------------------------------------------------------
#  DevNode   UUID                                  Status     Size
-------------------------------------------------------------------------------
-  -none-    1064CBEB-795D-4F86-8EF9-B876B283FB90  Missing/Damaged
1  disk1s2   CC9FCBFA-FCEA-45DD-BBE7-3EF355823401  Online     499763838976
===============================================================================
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 17

4 years ago
Vinh replaced the failed drive, but when I added the new drive to the array, the host locked up. It's still pingable, but ssh and vnc are unresponsive. :-/  It's been about 15m, so I've given up on it coming back on its own.  Going to powercycle it.
(Assignee)

Comment 18

4 years ago
All looks good after a powercycle. The Raid is rebuilding.  I'll check back in tomorrow morning.  

[root@install.build.releng.scl3.mozilla.com ~]# diskutil appleraid list
AppleRAID sets (1 found)
===============================================================================
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Degraded
Size:                 499.8 GB (499763838976 Bytes)
Rebuild:              automatic
Device Node:          disk2
-------------------------------------------------------------------------------
#  DevNode   UUID                                  Status     Size
-------------------------------------------------------------------------------
0  disk1s2   B1E94151-ED13-40F0-B8AA-BF53EF05E223  0% (Rebuilding)499763838976
1  disk0s2   CC9FCBFA-FCEA-45DD-BBE7-3EF355823401  Online     499763838976
===============================================================================
(Assignee)

Comment 19

4 years ago
Raid is 100% rebuilt, and this time it only took 2.5 hours, which was much faster than the first rebuild.  You can see this in the relevant log snippets below.

AppleRAID sets (1 found)
===============================================================================
Name:                 Raid1
Unique ID:            47126AFD-E94F-451A-AA93-08BEBFADCC43
Type:                 Mirror
Status:               Online
Size:                 499.8 GB (499763838976 Bytes)
Rebuild:              automatic
Device Node:          disk2
-------------------------------------------------------------------------------
#  DevNode   UUID                                  Status     Size
-------------------------------------------------------------------------------
0  disk1s2   B1E94151-ED13-40F0-B8AA-BF53EF05E223  Online     499763838976
1  disk0s2   CC9FCBFA-FCEA-45DD-BBE7-3EF355823401  Online     499763838976
===============================================================================

<drive fails>
Jan 12 02:40:29 install kernel[0]: Failed to issue COM RESET successfully after 3 attempts. Failing...
Jan 12 02:40:29 install kernel[0]: AppleRAIDMember::synchronizeCacheCallout: failed with e00002c0 on 1064CBEB-795D-4F86-8EF9-B876B283FB90
Jan 12 02:40:29 install kernel[0]: IOBlockStorageDriver[IOBlockStorageDriver]; executeRequest: request failed to start!
Jan 12 02:40:29 install kernel[0]: AppleRAID::recover() member 1064CBEB-795D-4F86-8EF9-B876B283FB90 from set "Raid1" (47126AFD-E94F-451A-AA93-08BEBFADCC43) has been marked offline.
Jan 12 02:40:29 install kernel[0]: AppleRAID::restartSet - restarting set "Raid1" (47126AFD-E94F-451A-AA93-08BEBFADCC43).

<system was rebooted after getting hung up when the disk was inserted into the array as a hot spare>
Jan 13 17:59:11 localhost kernel[0]: Darwin Kernel Version 11.2.0: Tue Aug  9 20:54:00 PDT 2011; root:xnu-1699.24.8~1/RELEASE_X86_64

<rebuild completed>
Jan 13 20:33:07 install kernel[0]: AppleRAID::restartSet - restarting set "Raid1" (47126AFD-E94F-451A-AA93-08BEBFADCC43).
Jan 13 20:33:07 install kernel[0]: AppleRAIDMirrorSet::rebuild complete for set "Raid1" (47126AFD-E94F-451A-AA93-08BEBFADCC43).
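The failure event in these logs has a fixed shape, so spotting which member went offline can be automated. A minimal sketch, assuming kernel log lines in the format shown above:

```python
import re

OFFLINE = re.compile(
    r'AppleRAID::recover\(\) member (\S+) from set "([^"]+)"'
    r'.*has been marked offline')

def offline_members(log_lines):
    """Return (member UUID, set name) pairs from kernel log lines that
    record an AppleRAID member being marked offline."""
    hits = []
    for line in log_lines:
        m = OFFLINE.search(line)
        if m:
            hits.append((m.group(1), m.group(2)))
    return hits

sample = [
    'Jan 12 02:40:29 install kernel[0]: AppleRAID::recover() member '
    '1064CBEB-795D-4F86-8EF9-B876B283FB90 from set "Raid1" '
    '(47126AFD-E94F-451A-AA93-08BEBFADCC43) has been marked offline.',
]
print(offline_members(sample))
# [('1064CBEB-795D-4F86-8EF9-B876B283FB90', 'Raid1')]
```

Fed from the system log, this would give the member UUID to cross-reference against `diskutil appleraid list` without waiting for nagios to notice.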
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED