Bug 483896 (Closed)
Opened 15 years ago
Closed 15 years ago

Problems with builds backed by eql storage

Categories: mozilla.org Graveyard :: Server Operations (task)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: nthomas; Assigned: aravind
Attachments: 1 file (patch, 1.76 KB, review+ from mozilla)
Since about 7:30am this morning any unit test build running on a linux VM, and backed by the eql storage controller, has failed to complete the compilation. In contrast, builds on the c-fcal or d-fcal controllers have not had this problem. We've also seen intermittent problems pulling from hg.m.o (I/O intensive) on Windows VMs backed by eql. Please urgently check loading on the eql array.
Updated•15 years ago
Assignee: server-ops → aravind
Comment 1 (Assignee)•15 years ago
We are aware that the equallogic arrays are overloaded, but didn't expect the problem to manifest itself like this. I cleared out some space on the faster array. This should allow more volumes to use this array and we should see some performance improvements. Give it overnight to settle and please comment here if you notice similar failures tomorrow as well.
Comment 3•15 years ago
Things are looking better today. Since approx. 6pm PDT yesterday (March 17) I only see 3 failures caused by timeouts with the most recent being at 4am PDT (almost 3 hours ago). What's the long term plan here? Just getting more storage so each array is less loaded?
Comment 4 (Assignee)•15 years ago
We are talking to the array vendors to figure out future growth plans. We will most likely buy more arrays, but for now, this bug is resolved. Please re-open if needed.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Comment 5•15 years ago
(In reply to comment #4)
> We are talking to the array vendors to figure out future growth plans. We will
> most likely buy more arrays, but for now, this bug is resolved. Please re-open
> if needed.

We've had 4 failures due to timeout today... that's a lot better than yesterday, but I wouldn't call it "resolved". Is there anything else we can do in the meantime to help this situation?
Comment 6 (Assignee)•15 years ago
Alright, which machines are those? Can you tell me what datastore they are on?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 7•15 years ago
moz2-win32-slave12 @ 8:43am - eql01-bm04
moz2-linux-slave19 @ 8:35am - eql01-bm07
moz2-win32-slave14 @ 8:27am - eql01-bm05
moz2-win32-slave18 @ 8:06am - eql01-bm05
Comment 8 (Assignee)•15 years ago
Hmm... that's pretty much all of them. I moved a few more volumes around. Let's see if that improves things.
Comment 9•15 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1237405530.1237407025.19881.gz

WINNT 5.2 mozilla-central unit test on 2009/03/18 12:45:30 failed with a timeout in "hg clone".
Comment 10•15 years ago
(In reply to comment #9)
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1237405530.1237407025.19881.gz
>
> WINNT 5.2 mozilla-central unit test on 2009/03/18 12:45:30 failed with a
> timeout in "hg clone".

There have been two additional ones, too. Aravind, would it help if we shut off a few VMs?
Comment 11•15 years ago
Could we maybe re-try the clone in the buildbot script? It seems that's where we fail most of the time. It doesn't solve the overload, but it might help us limp a bit further while we fix this bug.
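A retry wrapper of the sort suggested here could look like the following. This is a hypothetical sketch, not the change that actually landed; the function name, attempt count, and delay values are assumptions:

```python
import subprocess
import time

def run_with_retries(cmd, attempts=3, delay=30):
    """Run a command, retrying on failure to ride out transient I/O stalls.

    Hypothetical illustration of the retry idea from this comment; the
    actual fix taken in this bug bumped step timeouts instead.
    """
    for attempt in range(1, attempts + 1):
        if subprocess.call(cmd) == 0:
            return True          # command succeeded
        if attempt < attempts:
            time.sleep(delay)    # back off before trying again
    return False                 # all attempts failed
```

For an hg pull this would be called as something like `run_with_retries(["hg", "clone", repo_url, "build"])`.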
Comment 12•15 years ago
(In reply to comment #11)
> Could we maybe re-try the clone in the buildbot script? It seems that's where
> we fail most of the time. It doesn't solve the overload, but it might help us
> limp a bit further while we fix this bug.

I think the most common case is linking libxul.so on Linux, actually. In any case, it's not as simple a change as you'd think to re-clone. I'm going to bump up the length of the timeouts, however, since it doesn't seem like this problem is going to go away any time soon.
Comment 13•15 years ago
I removed the special casing for Windows in the UnittestBuildFactory, too, since the only difference was the timeout. I also bumped the post-build cleanup for the nightlies to 90 minutes because we've seen some of them time out as well.
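For context, Buildbot's shell steps take a `timeout` argument measured in seconds. The timeout bump described here amounts to something like the following sketch (a paraphrase of the idea, not the actual patch; `factory` and `repo_url` are illustrative placeholders):

```python
from buildbot.steps.shell import ShellCommand

# Note: Buildbot's timeout counts seconds WITHOUT output, not total wall
# time (see comment 21 below). Bumping it to 3600 gives the clone/compile
# an hour of silence before the master kills the step.
factory.addStep(ShellCommand(
    command=["hg", "clone", repo_url, "build"],
    timeout=60 * 60,
))
```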
Attachment #368101 - Flags: review?(aki)
Updated•15 years ago
Attachment #368101 - Flags: review?(aki) → review+
Comment 14•15 years ago
Comment on attachment 368101 [details] [diff] [review]
[checked in] bump unittest clone, compile timeout to 1 hour

looks good
Comment 15•15 years ago
Comment on attachment 368101 [details] [diff] [review]
[checked in] bump unittest clone, compile timeout to 1 hour

changeset: 225:0d1967cfdd60
Attachment #368101 - Attachment description: bump unittest clone, compile timeout to 1 hour → [checked in] bump unittest clone, compile timeout to 1 hour
Comment 16•15 years ago
(In reply to comment #15)
> (From update of attachment 368101 [details] [diff] [review])
> changeset: 225:0d1967cfdd60

The buildbot master has been reconfig'ed for this. Hopefully this will prevent Buildbot from timing out the process.
Comment 17•15 years ago
Aravind, things are looking better since your last shuffle & the patch I checked in. I don't see any failures at all overnight.
Comment 18 (Assignee)•15 years ago
Okay, will close this out since things look stable.
Status: REOPENED → RESOLVED
Closed: 15 years ago → 15 years ago
Resolution: --- → FIXED
Comment 19•15 years ago
Ben, Aravind: as I understand it, extending the timeouts was just a workaround. If this is really "fixed", can we now undo/revert those extended timeouts?
Comment 20•15 years ago
Do we log how long it took each of these timeoutable steps to run? That would let us tell when we're inching closer to the threshold, and let us distinguish "timeout slightly too aggressive" from "something off the rails".
Comment 21•15 years ago
(In reply to comment #20)
> Do we log how long it took each of these timeoutable steps to run? That would
> let us tell when we're inching closer to the threshold, and let us distinguish
> "timeout slightly too aggressive" from "something off the rails".

We log how long each step takes to run. However, Buildbot timeouts are reset whenever there is output from the step. So a timeout of 60 minutes isn't "kill this step after 60 minutes" but rather "kill this step after 60 minutes without output". We don't keep track of "most time without output" for steps.
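The distinction described here, a timeout on silence rather than on total runtime, can be illustrated with the following sketch. This is a simplified, POSIX-only illustration of the behaviour, not Buildbot's actual implementation:

```python
import select
import subprocess
import time

def run_with_inactivity_timeout(cmd, timeout):
    """Kill a process only after `timeout` seconds WITHOUT output.

    Simplified illustration of Buildbot-style step timeouts: every line of
    output resets the clock, so a long-running but chatty step is never
    killed, while a silent one is. Uses select() on a pipe, so POSIX-only.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    last_output = time.monotonic()
    while proc.poll() is None:
        ready, _, _ = select.select([proc.stdout], [], [], 0.1)
        if ready:
            if proc.stdout.readline():
                last_output = time.monotonic()  # output resets the clock
        elif time.monotonic() - last_output > timeout:
            proc.kill()
            proc.wait()
            return None  # timed out: too long with no output
    return proc.returncode
```

With this semantics, a build step that takes three hours but prints a line every minute is never killed, which is why "most time without output" would be the interesting metric to track.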
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard