485123 - Fallout from storage hiccup in maintenance window

Reporter

Description

•

17 years ago

The following machines need help (because they've fallen, and can't get up). On bm-vwmare11 moz2-linux-slave19 - hung, did not respond to my Reset request On bm-vmware02 (DRS attempted to migrate these and has failed) egg fx-win32-1.9-slave1 karma production-prometheus-vm qm-buildbot01 aravind says both ESX servers will require reboots to resolve this, which truly sucks on VMware's part (or the iSCSI providers or both). There are a lot of machines on bm-vmware11, only two got moved off by putting it into maintenance mode (then it got blocked).

Aravind Gottipati [:aravind]

Assignee

Comment 1

•

17 years ago

Equallogic says that they discovered an issue with recent VMWare updates and their eql arrays, that causes VMs to be less fault tolerant when their arrays go into a spasm like this. They have an internal firmware update thats supposed to help with the problem. They are promising to make this build available to us ASAP.

Assignee: server-ops → aravind

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 2

•

17 years ago

bm-vmware02 may have affected tb-linux-tbox.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 3

•

17 years ago

bm-vmware11 is rebooted, while try-linux-slave01 and moz2-linux-slave19 are started again. They're booting really slowly because the eql array is rebuilding after a disk dying (the initiator for all the problems). I'll be back later to build up more machines. bm-vmware02 to be rebooted tomorrow.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 4

•

17 years ago

Update: This all started when a disk died in the eql array, there's an ongoing performance hit while the array is rebuilt. We've seen some hg timeouts from this. All the slaves that were running on bm-vmware11 are back up, except * moz2-win32-slave16 & 19 are running chkdsk on boot, should come up by themselves * moz2-win32-slave18 is running chkdsk on e: as administrator, will need a reboot when done * all three have clobbers set on the build running when the array glitched All the slaves on bm-vmware02 will need attention after that host is rebooted. Some of them are still cycling now, some are not.

bhearsum@mozilla.com (:bhearsum)

Comment 5

•

17 years ago

(In reply to comment #4) > Update: > > This all started when a disk died in the eql array, there's an ongoing > performance hit while the array is rebuilt. We've seen some hg timeouts from > this. > > All the slaves that were running on bm-vmware11 are back up, except > * moz2-win32-slave16 & 19 are running chkdsk on boot, should come up by > themselves > * moz2-win32-slave18 is running chkdsk on e: as administrator, will need a > reboot when done Sadly, none of these machines have finished chkdsk yet. slave16 is stuck at 82%, slave19 at 70%, and slave18 has the chkdsk dialog up with 0% completion. Considering they've been this way for hours I'm going to give them a kick and hope it helps...

bhearsum@mozilla.com (:bhearsum)

Comment 6

•

17 years ago

(In reply to comment #5) > (In reply to comment #4) > > All the slaves that were running on bm-vmware11 are back up, except > > * moz2-win32-slave16 & 19 are running chkdsk on boot, should come up by > > themselves > > * moz2-win32-slave18 is running chkdsk on e: as administrator, will need a > > reboot when done > > Sadly, none of these machines have finished chkdsk yet. slave16 is stuck at > 82%, slave19 at 70%, and slave18 has the chkdsk dialog up with 0% completion. > > Considering they've been this way for hours I'm going to give them a kick and > hope it helps... I rebooted slave18 and when I went to check on the other two they moved up to 83% and 71% complete. Guess they just need more time.

Philippe M. Chiasson (:gozer)

Comment 7

•

17 years ago

(In reply to comment #2) > bm-vmware02 may have affected tb-linux-tbox. Yes, and tb-linux-tbox is still down, most likely as a result of this.

bhearsum@mozilla.com (:bhearsum)

Comment 8

•

17 years ago

(In reply to comment #7) > (In reply to comment #2) > > bm-vmware02 may have affected tb-linux-tbox. > > Yes, and tb-linux-tbox is still down, most likely as a result of this. Yep. bm-vmware02 is still in a disconnected state in the VI client.

bhearsum@mozilla.com (:bhearsum)

Comment 9

•

17 years ago

Can we get an update on this? Has the storage array finished rebuilding? What's the status of bm-vmware02? Going back through some history, I see we've had timeouts on VMs on the following hosts/storage arrays: eq01-bm01 eql01-bm02 eql01-bm04 eql01-bm05 eql01-bm07 bm-vmware10 bm-vmware03 bm-vmware08 bm-vmware05 bm-vmware11 bm-vmware13 And currently, some VMs are running _extremely_ slow. moz2-win32-slave18 is a prime example of this. It's running so slow that despite the VNC server being started I cannot connect to it with a VNC client.

Aravind Gottipati [:aravind]

Assignee

Comment 10

•

17 years ago

bm-vmware02 has been rebooted and should be available now. The array itself is still rebuilding and is now at 7% completion.. looks like its going to take a while to be done. At this point, there isn't much we can do to improve performance. It would probably help to turn off non-essential VMs backed by these eql volumes. We will be contacting eql to see if there is anything we can do to speed things up.

bhearsum@mozilla.com (:bhearsum)

Comment 11

•

17 years ago

I've shutdown the following VMs to help the rebuild go faster: patrocles - tb 1.8 builds xr-linux-tbox - xr 1.9 builds fx-linux-1.9-slave2 - prod. 1.9 l10n/release fx-win32-1.9-slave08 - 1.9 unittest (currently failing anyways) fx-win32-1.9-slave09 - same as above moz2-win32-slave04 - staging moz2-linux-slave04 - staging test-winslave2 - test test-opsi - test test-winslave - test try-linux-slave05 - staging try-win32-slave05 - staging fx-win32-1.9-slave2 - prod 1.9 l10n/release moz2-win32-slave21 - staging moz2-linux-slave17 - staging Aravind, how are things progressing? Has it sped up at all?

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 12

•

17 years ago

(In reply to comment #10) > bm-vmware02 has been rebooted and should be available now. > > The array itself is still rebuilding and is now at 7% completion.. looks like > its going to take a while to be done. At this point, there isn't much we can > do to improve performance. It would probably help to turn off non-essential > VMs backed by these eql volumes. > > We will be contacting eql to see if there is anything we can do to speed things > up. 1) Any update on when these disks will be online again? 2) When did we upgrade VMware to trigger this EQL bug? And what does our support contract say about turnaround times on getting a fix from EQL? 3) Until we get this fix, can we make sure that *none* of our buildbot master VMs are on any EQL volumes?

bhearsum@mozilla.com (:bhearsum)

Comment 13

•

17 years ago

(In reply to comment #12) > 1) Any update on when these disks will be online again? > Aravind told me we're 8% rebuilt now. FTR, they're not "offline", they're just slow. > 3) Until we get this fix, can we make sure that *none* of our buildbot master > VMs are on any EQL volumes? AIUI we can't live-migrate from one storage to another. The Buildbot masters aren't the big problem here though - they're actually quite lively. Were getting hit hard on the build slaves themselves, though, builds seem to be much slower right now.

matthew zeier [:mrz]

Comment 14

•

17 years ago

There are two storage arrays, one SAS and one SATA. The failed drive was in the SATA array - it's been replaced and has been reconstructing since lastnight (on the hot spare). The issue is with any volume that's currently on that array. Some volumes are on the SAS array, some aren't. Aravind's preferenced the build volumes to prefer the SAS array but there isn't sufficient storage for all the volumes. You could storage vmotion from equallogic over to netapp but I'm not sure I'd suggest doing that unless it was a last resort type thing. We'll have the firmware fix soon but that requires a reboot of the storage array and I wouldn't do that until the reconstruction is done (and it won't resolve -this- issue, only help prevent it from happening again).

bhearsum@mozilla.com (:bhearsum)

Comment 15

•

17 years ago

Aravind tells me that EQL has upgraded the case on their side to Critical and we're currently waiting to hear back from them.

Aravind Gottipati [:aravind]

Assignee

Comment 16

•

17 years ago

According to equallogic, there is no easy fix to this problem, short of buying a new array and vacating the old one. Their suggestion is to reduce the i/o contention on the box so the background reconstruction process has a chance to make some progress. From looking at i/o per sec values, bm05, bm04 and bm02 have the most i/o. Any load taken off those should help. As a record here is the current status of the reconstruction. RAID LUN 0 Degraded. 14 Drives (0,8,1,9,2,10,3,11,4,12,5,7r,6,14) RAID 50 (64KB sectPerSU) Capacity 8,641,761,509,376 bytes Reconstruction of drive 7 (%10.99 complete) Available Drives List: 13,15

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 17

•

17 years ago

We basically turned off all the VMs that build and use eql storage, both for the main pool and the try server slaves. Since then I've turned fx-win32-1.9-slave08 back on because of the Firefox 3.0.8 firedrill. I'm also moving some VMs from eql to netapp in the hope we can move from a closed tree to sheriff controlled checkins.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 18

•

17 years ago

Only got as far as moz2-win32-slave09 and 10 to c-sata-build-002 try-linux-slave1 to d-fcal-build-001 try-linux-slave2 to d-fcal-build-002 The eql array is now rebuilt so I'll start adding booting up our slaves again.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 19

•

17 years ago

Reverted the master change http://hg.mozilla.org/build/buildbot-configs/rev/7a381d880314

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 20

•

17 years ago

Just about everything has been booted and any manual startup done. No doubt there will have been some errors in doing that, and the odd build left in a broken state that needs a clobber. Lets handle that via followups and only reopen this if there are slowness problems across the board.

Status: NEW → RESOLVED

Closed: 17 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

12 years ago

Component: Server Operations: RelEng → RelOps

Product: mozilla.org → Infrastructure & Operations

Bugzilla

Fallout from storage hiccup in maintenance window

Categories

(Infrastructure & Operations :: RelOps: General, task)

Tracking

(Not tracked)

People

(Reporter: nthomas, Assigned: aravind)

References

Details

Crash Data

Security

(public)

User Story

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Comment 12

Comment 13

Comment 14

Comment 15

Comment 16

Comment 17

Comment 18

Comment 19

Comment 20

Updated