Fallout from storage hiccup in maintenance window

Status: RESOLVED FIXED
Type: task
Priority: --
Severity: major
Opened: 11 years ago
Last updated: 6 years ago
Reporter: nthomas
Assignee: aravind
The following machines need help (because they've fallen, and can't get up).

On bm-vmware11
 moz2-linux-slave19 - hung, did not respond to my Reset request
 
On bm-vmware02 (DRS attempted to migrate these and has failed)
 egg
 fx-win32-1.9-slave1
 karma
 production-prometheus-vm
 qm-buildbot01

aravind says both ESX servers will require reboots to resolve this, which truly sucks on VMware's part (or the iSCSI provider's, or both). There are a lot of machines on bm-vmware11; only two got moved off by putting it into maintenance mode (then the operation got blocked).
EqualLogic says they discovered an issue between recent VMware updates and their eql arrays that causes VMs to be less fault tolerant when the arrays go into a spasm like this. They have an internal firmware update that's supposed to help with the problem, and they are promising to make this build available to us ASAP.
Assignee: server-ops → aravind
bm-vmware02 may have affected tb-linux-tbox.
bm-vmware11 has been rebooted, and try-linux-slave01 and moz2-linux-slave19 are started again. They're booting really slowly because the eql array is rebuilding after a disk died (the trigger for all these problems). I'll be back later to bring up more machines.

bm-vmware02 to be rebooted tomorrow.
Update:

This all started when a disk died in the eql array; there's an ongoing performance hit while the array is rebuilt. We've seen some hg timeouts from this.

All the slaves that were running on bm-vmware11 are back up, except
* moz2-win32-slave16 & 19 are running chkdsk on boot, should come up by themselves
* moz2-win32-slave18 is running chkdsk on e: as administrator, will need a reboot when done
* all three have clobbers set for the build that was running when the array glitched

All the slaves on bm-vmware02 will need attention after that host is rebooted. Some of them are still cycling now, some are not.
(In reply to comment #4)
> Update:
> 
> This all started when a disk died in the eql array, there's an ongoing
> performance hit while the array is rebuilt. We've seen some hg timeouts from
> this.
> 
> All the slaves that were running on bm-vmware11 are back up, except
> * moz2-win32-slave16 & 19 are running chkdsk on boot, should come up by
> themselves
> * moz2-win32-slave18 is running chkdsk on e: as administrator, will need a
> reboot when done

Sadly, none of these machines have finished chkdsk yet. slave16 is stuck at 82%, slave19 at 70%, and slave18 has the chkdsk dialog up with 0% completion.

Considering they've been this way for hours I'm going to give them a kick and hope it helps...
(In reply to comment #5)
> (In reply to comment #4)
> > All the slaves that were running on bm-vmware11 are back up, except
> > * moz2-win32-slave16 & 19 are running chkdsk on boot, should come up by
> > themselves
> > * moz2-win32-slave18 is running chkdsk on e: as administrator, will need a
> > reboot when done
> 
> Sadly, none of these machines have finished chkdsk yet. slave16 is stuck at
> 82%, slave19 at 70%, and slave18 has the chkdsk dialog up with 0% completion.
> 
> Considering they've been this way for hours I'm going to give them a kick and
> hope it helps...

I rebooted slave18, and when I went to check on the other two they had moved up to 83% and 71% complete. Guess they just need more time.
(In reply to comment #2)
> bm-vmware02 may have affected tb-linux-tbox.

Yes, and tb-linux-tbox is still down, most likely as a result of this.
(In reply to comment #7)
> (In reply to comment #2)
> > bm-vmware02 may have affected tb-linux-tbox.
> 
> Yes, and tb-linux-tbox is still down, most likely as a result of this.

Yep. bm-vmware02 is still in a disconnected state in the VI client.
Can we get an update on this? Has the storage array finished rebuilding? What's the status of bm-vmware02?

Going back through some history, I see we've had timeouts on VMs on the following hosts/storage arrays:
eql01-bm01
eql01-bm02
eql01-bm04
eql01-bm05
eql01-bm07
bm-vmware10
bm-vmware03
bm-vmware08
bm-vmware05
bm-vmware11
bm-vmware13

And currently, some VMs are running _extremely_ slow. moz2-win32-slave18 is a prime example of this. It's running so slowly that, despite the VNC server being started, I cannot connect to it with a VNC client.
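
(Purely illustrative, not part of the original report: a quick check of whether the VNC port is even accepting TCP connections, or whether the guest is simply too slow to answer. The hostname below is a hypothetical FQDN for moz2-win32-slave18.)

import socket

HOST = "moz2-win32-slave18.build.example.com"  # hypothetical FQDN
PORT = 5900                                    # default VNC display :0

try:
    with socket.create_connection((HOST, PORT), timeout=15) as sock:
        # An RFB server sends a 12-byte version banner, e.g. b"RFB 003.008\n"
        print("connected, banner:", sock.recv(12))
except OSError as exc:
    print("no connection:", exc)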
bm-vmware02 has been rebooted and should be available now.

The array itself is still rebuilding and is now at 7% completion. Looks like it's going to take a while to be done. At this point, there isn't much we can do to improve performance. It would probably help to turn off non-essential VMs backed by these eql volumes.
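
(Not from the original bug; a minimal sketch, assuming the modern pyVmomi SDK, of how a batch of non-essential VMs could be powered off through the vSphere API. The vCenter host and credentials are placeholders; the VM names are a few of the test slaves listed in the next comment.)

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

NON_ESSENTIAL = ["test-winslave", "test-winslave2", "test-opsi"]

# Placeholder vCenter host/credentials; depending on the pyVmomi version an
# sslContext argument may also be needed here.
si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret")
try:
    content = si.RetrieveContent()
    # Walk every VM in the inventory and power off the non-essential ones.
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        if vm.name in NON_ESSENTIAL and vm.runtime.powerState == "poweredOn":
            WaitForTask(vm.PowerOffVM_Task())  # vm.ShutdownGuest() is the gentler option
finally:
    Disconnect(si)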

We will be contacting eql to see if there is anything we can do to speed things up.
I've shut down the following VMs to help the rebuild go faster:
patrocles - tb 1.8 builds
xr-linux-tbox - xr 1.9 builds
fx-linux-1.9-slave2 - prod. 1.9 l10n/release
fx-win32-1.9-slave08 - 1.9 unittest (currently failing anyways)
fx-win32-1.9-slave09 - same as above
moz2-win32-slave04 - staging
moz2-linux-slave04 - staging
test-winslave2 - test
test-opsi - test
test-winslave - test
try-linux-slave05 - staging
try-win32-slave05 - staging
fx-win32-1.9-slave2 - prod 1.9 l10n/release
moz2-win32-slave21 - staging
moz2-linux-slave17 - staging


Aravind, how are things progressing? Has it sped up at all?
(In reply to comment #10)
> bm-vmware02 has been rebooted and should be available now.
> 
> The array itself is still rebuilding and is now at 7% completion.. looks like
> its going to take a while to be done.  At this point, there isn't much we can
> do to improve performance.  It would probably help to turn off non-essential
> VMs backed by these eql volumes.
> 
> We will be contacting eql to see if there is anything we can do to speed things
> up.
1) Any update on when these disks will be online again?

2) When did we upgrade VMware to trigger this EQL bug? And what does our support contract say about turnaround times on getting a fix from EQL? 

3) Until we get this fix, can we make sure that *none* of our buildbot master VMs are on any EQL volumes?
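
(Illustrative sketch only, assuming pyVmomi and placeholder vCenter credentials: one way to audit which datastore backs each VM, so any buildbot master sitting on an EQL volume stands out.)

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret")  # placeholders
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        stores = ", ".join(ds.name for ds in vm.datastore)
        # Flag anything whose backing datastore name mentions "eql".
        mark = "  <-- EQL" if "eql" in stores.lower() else ""
        print(f"{vm.name:<30} {stores}{mark}")
finally:
    Disconnect(si)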
(In reply to comment #12)
> 1) Any update on when these disks will be online again?
> 

Aravind told me we're 8% rebuilt now. FTR, they're not "offline", they're just slow.

> 3) Until we get this fix, can we make sure that *none* of our buildbot master
> VMs are on any EQL volumes?

AIUI we can't live-migrate from one storage array to another. The Buildbot masters aren't the big problem here though - they're actually quite lively. We're getting hit hard on the build slaves themselves; builds seem to be much slower right now.
There are two storage arrays, one SAS and one SATA.  The failed drive was in the SATA array - it's been replaced and has been reconstructing since last night (on the hot spare).

The issue is with any volume that's currently on that array.  Some volumes are on the SAS array, some aren't.

Aravind has set the build volumes to prefer the SAS array, but there isn't sufficient storage for all of them.

You could storage-vmotion from equallogic over to netapp, but I'm not sure I'd suggest doing that unless it was a last-resort type thing.
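
(For illustration, and only as the last-resort option described above: a pyVmomi sketch of relocating one VM's storage with RelocateVM_Task. The vCenter host and credentials are placeholders; the VM and datastore names mirror the moves mentioned later in this bug.)

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

def find_by_name(content, vimtype, name):
    """Return the first inventory object of the given type with a matching name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next((obj for obj in view.view if obj.name == name), None)

si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret")  # placeholders
try:
    content = si.RetrieveContent()
    vm = find_by_name(content, vim.VirtualMachine, "try-linux-slave1")
    target = find_by_name(content, vim.Datastore, "d-fcal-build-001")
    # Storage vMotion: move the VM's disks to the target datastore.
    spec = vim.vm.RelocateSpec(datastore=target)
    WaitForTask(vm.RelocateVM_Task(spec))
finally:
    Disconnect(si)

Relocating a powered-on VM needs a Storage vMotion licence; a powered-off VM can be cold-migrated on any edition.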

We'll have the firmware fix soon but that requires a reboot of the storage array and I wouldn't do that until the reconstruction is done (and it won't resolve -this- issue, only help prevent it from happening again).
Aravind tells me that EQL has upgraded the case on their side to Critical and we're currently waiting to hear back from them.
According to equallogic, there is no easy fix to this problem, short of buying a new array and vacating the old one.

Their suggestion is to reduce the i/o contention on the box so the background reconstruction process has a chance to make some progress.

From looking at i/o per sec values, bm05, bm04 and bm02 have the most i/o.  Any load taken off those should help.

For the record, here is the current status of the reconstruction.

RAID LUN 0 Degraded.
  14 Drives (0,8,1,9,2,10,3,11,4,12,5,7r,6,14)
  RAID 50 (64KB sectPerSU)
  Capacity 8,641,761,509,376 bytes
  Reconstruction of drive 7 (%10.99 complete)
Available Drives List: 13,15
We basically turned off all the VMs that build and use eql storage, both for the main pool and the try server slaves. 

Since then I've turned fx-win32-1.9-slave08 back on because of the Firefox 3.0.8 firedrill. I'm also moving some VMs from eql to netapp in the hope we can move from a closed tree to sheriff-controlled checkins.
Only got as far as 
  moz2-win32-slave09 and 10 to c-sata-build-002
  try-linux-slave1 to d-fcal-build-001
  try-linux-slave2 to d-fcal-build-002

The eql array is now rebuilt, so I'll start booting up our slaves again.
Just about everything has been booted and any manual startup done. No doubt there will have been some errors in doing that, and the odd build left in a broken state that needs a clobber. Let's handle that via follow-ups and only reopen this if there are slowness problems across the board.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations