Closed
Bug 488447
Opened 15 years ago
Closed 15 years ago
Storage latency problems
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Assigned: aravind)
References
Details
(Keywords: intermittent-failure)
win32 mozilla-central build on moz2-win32-slave26 at 2009/04/15 00:15:13: timed out after 1 hour trying to execute configure
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1239779713.1239785647.8913.gz
Reporter
Comment 1•15 years ago
win32 mozilla-central build on moz2-win32-slave13 at 2009/04/14 20:06:02 timed out during compile:
Building deps for /e/builds/moz2_slave/mozilla-central-win32/build/content/events/src/nsContentEventHandler.cpp
nsSVGPoint.cpp
command timed out: 5400 seconds without output
program finished with exit code 1
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1239764762.1239775149.26456.gz
Reporter
Comment 2•15 years ago
win32 unit try build on try-win32-slave04 at 2009/04/15 00:10:01
rm -rf mozilla/
command timed out: 3600 seconds without output
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1239779401.1239783012.5575.gz

win32 try build on try-win32-slave04 at 2009/04/14 20:55:36
rm -rf mozilla/
command timed out: 3600 seconds without output
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1239767736.1239771345.20512.gz

Same for the matching unit test build on try-win32-slave06:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1239767736.1239771348.20518.gz
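The "command timed out: N seconds without output" lines above come from the build harness's watchdog. A rough stand-in can be sketched with coreutils `timeout` (note the hedge: buildbot's watchdog fires after N seconds *without output*, while `timeout` below is a simpler wall-clock limit; the helper name is made up for illustration):

```shell
# Hypothetical helper mimicking the harness's kill-on-timeout behavior.
# coreutils `timeout` returns exit code 124 when it kills the command.
run_with_limit() {
  local limit="$1"; shift
  timeout "$limit" "$@"
  local rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "command timed out: ${limit} seconds"
  fi
  return "$rc"
}

run_with_limit 1 sleep 3 || true   # prints: command timed out: 1 seconds
```

A real inactivity-based timeout would instead reset its timer on every line of output, which is why a hung `rm -rf` or `configure` shows up as "3600 seconds without output" rather than a fixed wall-clock kill.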
Reporter
Comment 3•15 years ago
moz2-win32-slave26 - eql01-bm06 - bm-vmware05
moz2-win32-slave13 - eql01-bm04 - bm-vmware09 at timeout (moved from 10 at 21:47)
try-win32-slave04 - eql01-bm07 - bm-vmware13
try-win32-slave06 - eql01-bm06 - bm-vmware07
Common factor appears to be the eql01 controller.
Reporter
Comment 4•15 years ago
bug 488362 may be a warning sign - builds not being able to open a compilation product earlier today. The code freeze for 3.5b4 is scheduled for Wednesday, so we need to have the pool of slaves in tip-top condition and not causing spurious bustage. Raising severity to critical. mrz, what does the fancy reporting say about the Equallogic setup?
Severity: normal → critical
Reporter
Comment 5•15 years ago
Changes made on our side today include starting work on try-linux-slave06 thru 09 (all on eql01-bm08/09, but proportionally a small increase in I/O, I'd guess) and setting noatime on the rest of the linux slaves (bug 486765), which should mean less I/O rather than more.
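For context on the noatime change from bug 486765: a mount's option string tells you whether access-time updates (a source of extra metadata writes on every read) are disabled. A minimal sketch, where the fstab line and mount point are assumptions rather than the slaves' real configuration:

```shell
# Hypothetical fstab entry for a build partition with noatime set:
fstab_line="/dev/sda3  /builds  ext3  defaults,noatime  0 2"

# Check an option string for noatime (toy check on the string itself;
# on a live system you would grep /proc/mounts instead).
has_noatime() {
  case "$1" in
    *noatime*) echo yes ;;
    *)         echo no  ;;
  esac
}

has_noatime "$fstab_line"          # → yes
# Applying the change without a reboot would look like:
#   mount -o remount,noatime /builds
```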
Comment 6•15 years ago
Where is the 100GB shared drive for bug 472185? I started using that today for some ccache testing.
Assignee
Comment 7•15 years ago
Yup, looks like the average read latency went way up. It used to be around 20ms after the third array was added; it's now around 50ms. I think phong provisioned a bunch of new VMs on these arrays in the last week or so. I will open a case with eql about this and ask for their recommendations. In the meantime I'd suspect that the newly created VMs are causing this. Is there any way for you folks to turn things off etc?
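The 20ms-to-50ms comparison above can be sketched as a toy check: average a set of sampled read latencies and flag the array when the mean crosses a threshold. The sample values and the 30ms threshold here are made up for illustration, not taken from the eql reporting:

```shell
# Hypothetical read-latency samples in milliseconds.
samples="18 22 51 49 47"

# Average them with awk.
avg=$(echo "$samples" | awk '{s=0; for (i=1; i<=NF; i++) s+=$i; printf "%.1f", s/NF}')
echo "average read latency: ${avg} ms"     # → average read latency: 37.4 ms

# Warn when the mean exceeds an assumed 30 ms threshold.
awk -v a="$avg" 'BEGIN { exit !(a > 30) }' && echo "WARN: latency above 30 ms"
```

A real check would pull these samples from the array's SNMP counters or the EqualLogic reporting UI rather than a hard-coded list.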
Comment 8•15 years ago
I think we can shut off the following VMs on eql storage:
xr-linux-tbox
moz2-win32-slave04
moz2-linux-slave04
moz2-linux-slave17
test-opsi
test-winslave
try-linux-slave05
try-win32-slave05
moz2-win32-slave21
All of the try-* and moz2-* machines are staging, and we can probably live without them for a bit. xr-linux-tbox is pretty non-critical but cycles all the time.
Comment 9•15 years ago
(In reply to comment #7)
> Is there any way for you folks to turn things off etc?
Any idea how long you want them to be off, by the way?
Comment 10•15 years ago
(In reply to comment #8)
> I think we can shut off the following VMs on eql storage:
Aravind helpfully points out that bm05, 07, and 08 are the only arrays affected. There are only two VMs there we can turn off (try-{win32|linux}-slave05), both on 07. Almost all of the others are moz2-* production slaves. We could look at turning some off, but I'm wary given that the freeze is today.
Assignee
Comment 11•15 years ago
Looks like latency numbers are back down to under 30ms now, so stuff should be stable. I am waiting to hear back from eql, but it's probably a good idea to move some VMs from the high-I/O volumes to the low-I/O ones. eql should automatically be doing that (moving data blocks around on the arrays so things perform as well as they can), but it looks like that isn't working correctly. For now we are going to leave things the way they are. If stuff starts going bad again, we will look at moving some VMs or turning some off.
Comment 13•15 years ago
(In reply to comment #11)
> stuff should be stable now.
Is that a very recent development? Because the timeout-during-CVS-checkout issue has happened on 3 WinXP unittest cycles so far today, with the most recent one starting at 9:43 this morning (2 hours ago):
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.0/1239813816.1239815280.31925.gz
WINNT 5.2 fx-win32-1.9-slave08 dep unit test on 2009/04/15 09:43:36
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.0/1239779997.1239781470.3138.gz
WINNT 5.2 fx-win32-1.9-slave08 dep unit test on 2009/04/15 00:19:57
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.0/1239779998.1239781502.3212.gz
WINNT 5.2 fx-win32-1.9-slave09 (pgo01) dep unit test on 2009/04/15 00:19:58
Comment 14•15 years ago
And here's another timeout during CVS checkout from noonish today:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.0/1239822947.1239824355.14440.gz
WINNT 5.2 fx-win32-1.9-slave09 (pgo01) dep unit test on 2009/04/15 12:15:47
(Apologies if this is unnecessary information -- I just got the impression from comment 11 that this issue is thought to be okay for the time being, whereas it looks to me like it's still failing.)
Assignee
Comment 15•15 years ago
It's not entirely fixed, and I have a case open with Equallogic about these problems. What I meant was that things seemed okay at the moment and we were probably going to be okay (which is clearly incorrect). I am still working with eql on a solution.
Comment 16•15 years ago
dholbert, I'm 99% sure those failures are unrelated. Sadly, that's been going on for long before this issue started.
Reporter
Comment 18•15 years ago
Here's an update, based on IRC traffic. Aravind has been talking a lot with Equallogic support today, and the conclusion is that the SATA shelf is overloaded with I/O traffic. We (catlee, me, aki) have temporarily shut down the VMs moz2-linux-slave01,13,15,16,18; moz2-win32-slave15,16 so that Aravind can configure the Equallogic controller to move two of the busiest eql01-bmXX LUNs from SATA to faster shelves, which have "super low latency right now". When that's done we'll need to boot them up and re-enable nagios notifications.
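The "re-enable nagios notifications" step above can be sketched via Nagios's external command file interface, which accepts lines of the form `[timestamp] COMMAND;args`. The command-file path, helper name, and host list below are assumptions for illustration:

```shell
# Hypothetical helper: build a Nagios external command line that
# re-enables notifications for one host.
build_cmd() {
  printf '[%s] ENABLE_HOST_NOTIFICATIONS;%s' "$(date +%s)" "$1"
}

# Assumed path to the Nagios command pipe (varies by install).
CMDFILE=/var/lib/nagios/rw/nagios.cmd

for host in moz2-linux-slave01 moz2-win32-slave15; do
  build_cmd "$host"; echo
  # On the nagios server this would be written to the pipe:
  #   build_cmd "$host" > "$CMDFILE"
done
```

The matching `DISABLE_HOST_NOTIFICATIONS` command would be used before shutting a slave down, so nagios doesn't page on the planned outage.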
Comment 19•15 years ago
(In reply to comment #18)
> We (catlee, me, aki) have temporarily shut down the VMs
> moz2-linux-slave01,13,15,16,18; moz2-win32-slave15,16
These have been started up again. I'll get the nagios notifications soon.

We're working on bm08 now, and I'm shutting down/will be shutting down the following VMs:
linux-slave20, 21, 22, 23, 24, 25
win32-slave06, 27, 28, 29
Same deal as before; once they're started back up we'll have to re-enable nagios.
Comment 20•15 years ago
(In reply to comment #19)
> These have been started up again. I'll get the nagios notifications soon.
Re-enabled notifications for these.

> We're working on bm08 now, and I'm shutting down/will be shutting down the
> following VMs:
> linux-slave20, 21, 22, 23, 24, 25
> win32-slave06, 27, 28, 29
As it turns out, the only one nagios is watching is win32-slave06, so no need to worry about the rest.
Comment 21•15 years ago
(In reply to comment #20)
> We're working on bm08 now, and I'm shutting down/will be shutting down the
> following VMs:
> linux-slave20, 21, 22, 23, 24, 25
> win32-slave06, 27, 28, 29
slave06 and 27 have been very busy with builds and I haven't had a chance to grab them yet. slave06 is idle at the moment, but I think it will serve us better to leave it in the pool. We've shut down more than half the VMs on bm08 - is that enough, Aravind?
Assignee
Comment 22•15 years ago
(In reply to comment #21)
> We've shut down more than half the VMs on bm08 - is that enough, Aravind?
That's just fine for now.
Comment 23•15 years ago
(In reply to comment #21)
> linux-slave20, 21, 22, 23, 24, 25
> win32-slave06, 27, 28, 29
All of these are back up and connected to the main pool now. Aravind says he's working on bm07 next - without having us shut VMs down.
Comment 24•15 years ago
We had exactly zero timeouts between last night and this morning (yay!).
Assignee
Comment 25•15 years ago
Okay, things seem stable now and latency numbers are down as well. Closing this out; we have a separate bug on file to monitor these arrays for problems like this.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Updated•14 years ago
Whiteboard: [orange]
Updated•12 years ago
Keywords: intermittent-failure
Updated•12 years ago
Whiteboard: [orange]
Updated•11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations