Closed Bug 488447 Opened 13 years ago Closed 13 years ago

Storage latency problems

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
Windows Server 2003
task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: aravind)

References

Details

(Keywords: intermittent-failure)

win32 mozilla-central build on moz2-win32-slave26 at 2009/04/15 00:15:13:
timed out after 1 hour trying to execute configure
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1239779713.1239785647.8913.gz
win32 mozilla-central build on moz2-win32-slave13 at 2009/04/14 20:06:02
timed out during compile
Building deps for /e/builds/moz2_slave/mozilla-central-win32/build/content/events/src/nsContentEventHandler.cpp
nsSVGPoint.cpp

command timed out: 5400 seconds without output
program finished with exit code 1
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1239764762.1239775149.26456.gz
win32 unit try build on try-win32-slave04 at 2009/04/15 00:10:01
rm -rf mozilla/
command timed out: 3600 seconds without output
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1239779401.1239783012.5575.gz

win32 try build on try-win32-slave04 at 2009/04/14 20:55:36
rm -rf mozilla/
command timed out: 3600 seconds without output
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1239767736.1239771345.20512.gz

Same for the matching unit test build on try-win32-slave06
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1239767736.1239771348.20518.gz
moz2-win32-slave26 - eql01-bm06 - bm-vmware05
moz2-win32-slave13 - eql01-bm04 - bm-vmware09 at timeout (moved from 10 at 21:47)
 try-win32-slave04 - eql01-bm07 - bm-vmware13
 try-win32-slave06 - eql01-bm06 - bm-vmware07

Common factor appears to be eql01 controller.
bug 488362 may be a warning sign - builds not being able to open a compilation product earlier today.
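The triage step above - mapping each failing slave to its EqualLogic volume and VMware host, then looking for the shared component - amounts to a simple tally. A sketch using the placements reported above:

```python
from collections import Counter

# Failing slave -> (EqualLogic volume, VMware host), from the list above.
placements = {
    "moz2-win32-slave26": ("eql01-bm06", "bm-vmware05"),
    "moz2-win32-slave13": ("eql01-bm04", "bm-vmware09"),
    "try-win32-slave04":  ("eql01-bm07", "bm-vmware13"),
    "try-win32-slave06":  ("eql01-bm06", "bm-vmware07"),
}

volumes = Counter(vol for vol, _host in placements.values())
hosts = Counter(host for _vol, host in placements.values())

# Every VMware host is different, so the hypervisors don't explain the
# failures; every volume lives on the eql01 group, so the storage
# controller is the common factor.
```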

The code freeze for 3.5b4 is scheduled for Wednesday, so we need the pool of slaves in tip-top condition and not causing spurious bustage. Raising severity to critical.

mrz, what does the fancy reporting say about the Equallogic setup ?
Severity: normal → critical
Changes made on our side today include starting work on try-linux-slave06 thru 09 (all on eql01-bm08/09 but proportionally a small increase in I/O I'd have guessed). Setting noatime on the rest of the linux slaves (bug 486765), should mean less I/O rather than more.
Where is the 100GB shared drive for bug 472185?  I started using that today for some ccache testing.
Assignee: server-ops → aravind
Yup, looks like the average read latency went way up.  It used to be around 20ms after the third array was added; it's now around 50ms.  I think phong provisioned a bunch of new VMs on these arrays in the last week or so.  I will open a case with eql about this and ask for their recommendations.  In the meantime I'd suspect that the newly created VMs are causing this.  Is there any way for you folks to turn things off etc?
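The latency comparison above can be expressed as a simple threshold check. The sample windows below are hypothetical; the real numbers come from the EqualLogic group's reporting (a later comment treats "under 30ms" as healthy, so that is used as the alarm line):

```python
def mean_latency_ms(samples):
    """Average read latency in milliseconds over a sampling window."""
    return sum(samples) / len(samples)

# Hypothetical sample windows (ms), matching the averages quoted above:
baseline = [19.0, 21.0, 20.0, 20.0]   # ~20ms after the third array was added
incident = [48.0, 52.0, 49.0, 51.0]   # ~50ms during this bug

ALERT_THRESHOLD_MS = 30.0  # "under 30ms" is treated as healthy later in the bug
```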
I think we can shut off the following VMs on eql storage:
xr-linux-tbox
moz2-win32-slave04
moz2-linux-slave04
moz2-linux-slave17
test-opsi
test-winslave
try-linux-slave05
try-win32-slave05
moz2-win32-slave21

All of the try-* and moz2-* machines are staging, and we can probably live without them for a bit. xr-linux-tbox is pretty non-critical but cycles all the time.
(In reply to comment #7)
> Is there any way for you folks to turn things off etc?

Any idea how long you want them to be off, by the way?
(In reply to comment #8)
> I think we can shut off the following VMs on eql storage:

Aravind helpfully points out that bm05, 07, and 08 are the only arrays affected. There are only two VMs (try-{win32|linux}-slave05) which we can turn off there - both of which are on 07. Almost all of the others are moz2-* production slaves. We could look at turning some off, but I'm wary given that the freeze is today.
Looks like latency numbers are back down to under 30ms now, stuff should be stable now.

I am waiting to hear back from eql, but it's probably a good idea to move some VMs around from the high i/o volumes to the low i/o ones.  eql should automatically be doing that (moving data blocks around on the arrays, so things perform as well as they can), but it looks like that isn't working correctly.

For now we are going to leave things the way they are.  If stuff starts going bad again, we will look at moving some VMs or turning off some.
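The rebalancing idea - data migrates off the busiest member toward the least busy one, which EqualLogic's automatic load balancing is supposed to handle on its own - can be sketched with hypothetical load figures (the real ones come from the SAN reporting):

```python
def pick_rebalance(io_load):
    """Pick a (source, destination) pair for moving data: the busiest
    member sheds load to the least busy one. A toy version of what the
    array's automatic balancing should be doing by itself."""
    hottest = max(io_load, key=io_load.get)
    coolest = min(io_load, key=io_load.get)
    return hottest, coolest

# Hypothetical per-member I/O load (e.g. IOPS) for the eql01 group:
load = {"eql01-bm04": 200, "eql01-bm05": 900, "eql01-bm06": 250,
        "eql01-bm07": 850, "eql01-bm08": 800}
```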
Blocks: 488457
Blocks: 488462
Duplicate of this bug: 488345
Blocks: 488530
(In reply to comment #11)
> stuff should be stable now.

Is that a very recent development?  Because the timeout-during-CVS-checkout issue has happened on 3 WinXP unittest cycles so far today, with the most recent one starting at 9:43 this morning (2 hours ago):

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.0/1239813816.1239815280.31925.gz
WINNT 5.2 fx-win32-1.9-slave08 dep unit test on 2009/04/15 09:43:36
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.0/1239779997.1239781470.3138.gz
WINNT 5.2 fx-win32-1.9-slave08 dep unit test on 2009/04/15 00:19:57
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.0/1239779998.1239781502.3212.gz
WINNT 5.2 fx-win32-1.9-slave09 (pgo01) dep unit test on 2009/04/15 00:19:58
And here's another timeout during CVS checkout from noonish today:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.0/1239822947.1239824355.14440.gz
WINNT 5.2 fx-win32-1.9-slave09 (pgo01) dep unit test on 2009/04/15 12:15:47

(Apologies if this is unnecessary information -- I just got the impression from comment 11 that this issue is thought to be okay for the time being, whereas it looks to me like it's still failing.)
It's not entirely fixed, and I have a case open with equallogic about these problems.  What I meant was that things seemed okay at the moment and that we were probably going to be okay (which was clearly incorrect).

I am still working with eql for a solution.
dholbert, I'm 99% sure those failures are unrelated. Sadly, that's been going on for long before this issue started.
Duplicate of this bug: 488486
Here's an update, based on IRC traffic. Aravind has been talking a lot with Equallogic support today and the conclusion is that the SATA shelf is overloaded with I/O traffic. We (catlee, me, aki) have temporarily shut down the VMs
  moz2-linux-slave01,13,15,16,18; moz2-win32-slave15,16
so that Aravind can configure the Equallogic controller to move two of the busiest eql01-bmXX LUNs from SATA to faster shelves, which have "super low latency right now".

When that's done we'll need to boot them up and re-enable nagios notifications.
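Re-enabling the notifications is done by writing standard Nagios external commands (such as ENABLE_HOST_NOTIFICATIONS) to the Nagios command file. A sketch that formats those lines - the command-file path varies per install and is an assumption here:

```python
import time

def nagios_external_command(command, *args, now=None):
    """Format a Nagios external command line, e.g.
    "[1239822947] ENABLE_HOST_NOTIFICATIONS;moz2-linux-slave01".

    ENABLE_HOST_NOTIFICATIONS is a standard Nagios external command;
    the timestamp is seconds since the epoch.
    """
    ts = int(time.time() if now is None else now)
    return "[%d] %s" % (ts, ";".join((command,) + args))

# One line per slave as it comes back up; each would be appended to the
# Nagios command file, e.g. /usr/local/nagios/var/rw/nagios.cmd
# (path varies per install).
lines = [nagios_external_command("ENABLE_HOST_NOTIFICATIONS", host, now=1239822947)
         for host in ("moz2-linux-slave01", "moz2-win32-slave15")]
```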
(In reply to comment #18)
> Here's an update, based on IRC traffic. Aravind has been talking a lot with
> Equallogic support today and the conclusion is that the SATA shelf is
> overloaded with I/O traffic. We (catlee, me, aki) have temporarily shut down
> the VMs
>   moz2-linux-slave01,13,15,16,18; moz2-win32-slave15,16

These have been started up again. I'll get the nagios notifications soon.


We're working on bm08 now, and I'm shutting down/will be shutting down the following VMs:
linux-slave20, 21, 22, 23, 24, 25
win32-slave06, 27, 28, 29

Same deal as before, once they're started back up we'll have to re-enable nagios.
(In reply to comment #19)
> These have been started up again. I'll get the nagios notifications soon.

Re-enabled notifications for these.

> We're working on bm08 now, and I'm shutting down/will be shutting down the
> following VMs:
> linux-slave20, 21, 22, 23, 24, 25
> win32-slave06, 27, 28, 29

As it turns out, the only one nagios is watching is win32-slave06 - so no need to worry about the rest.
(In reply to comment #20)
> > We're working on bm08 now, and I'm shutting down/will be shutting down the
> > following VMs:
> > linux-slave20, 21, 22, 23, 24, 25
> > win32-slave06, 27, 28, 29

slave06 and 27 have been very busy with builds and I haven't had a chance to grab them yet. slave06 is idle at the moment, but I think it will serve us better to leave it in the pool. We've shut down more than half the VMs on bm08 - is that enough, Aravind?
(In reply to comment #21)
> We've shut down more than half the VMs on bm08 - is that enough, Aravind?

That's just fine for now.
(In reply to comment #21)
> > > We're working on bm08 now, and I'm shutting down/will be shutting down the
> > > following VMs:
> > > linux-slave20, 21, 22, 23, 24, 25
> > > win32-slave06, 27, 28, 29

All of these are back up and connected to the main pool now. Aravind says he's working on bm07 next - without having us shut VMs down.
We had exactly zero timeouts between last night and this morning (yay!).
Okay, things seem stable now and latency numbers are down as well.  Closing this out; we have a separate bug on file to monitor these arrays for problems like this.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [orange]
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations