Closed Bug 483896 Opened 15 years ago Closed 15 years ago

Problems with builds backed by eql storage

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: aravind)

References

Details

Attachments

(1 file)

Since about 7:30am this morning, every unit test build running on a Linux VM backed by the eql storage controller has failed to complete compilation. In contrast, builds on the c-fcal or d-fcal controllers have not had this problem. We've also seen intermittent problems pulling from hg.m.o (an I/O-intensive operation) on Windows VMs backed by eql.

Please urgently check the load on the eql array.
Assignee: server-ops → aravind
We are aware that the EqualLogic arrays are overloaded, but didn't expect the problem to manifest itself like this.  I cleared out some space on the faster array; this should allow more volumes to use it, and we should see some performance improvement.

Give it overnight to settle and please comment here if you notice similar failures tomorrow as well.
Things are looking better today. Since approx. 6pm PDT yesterday (March 17), I see only 3 failures caused by timeouts, with the most recent at 4am PDT (almost 3 hours ago).

What's the long term plan here? Just getting more storage so each array is less loaded?
We are talking to the array vendors to figure out future growth plans.  We will most likely buy more arrays, but for now, this bug is resolved.  Please re-open if needed.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
(In reply to comment #4)
> We are talking to the array vendors to figure out future growth plans.  We will
> most likely buy more arrays, but for now, this bug is resolved.  Please re-open
> if needed.

We've had 4 failures due to timeouts today. That's a lot better than yesterday, but I wouldn't call it "resolved". Is there anything else we can do in the meantime to help the situation?
Alright, which machines are those?  Can you tell me what datastore they are on?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
moz2-win32-slave12 @ 8:43am - eql01-bm04
moz2-linux-slave19 @ 8:35am - eql01-bm07
moz2-win32-slave14 @ 8:27am - eql01-bm05
moz2-win32-slave18 @ 8:06am - eql01-bm05
Hmm, that's pretty much all of them.  I moved a few more volumes around; let's see if that improves things.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1237405530.1237407025.19881.gz

WINNT 5.2 mozilla-central unit test on 2009/03/18 12:45:30 failed with a timeout in "hg clone".
(In reply to comment #9)
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1237405530.1237407025.19881.gz
> 
> WINNT 5.2 mozilla-central unit test on 2009/03/18 12:45:30 failed with a
> timeout in "hg clone".

There have been two additional ones, too.

Aravind, would it help if we shut off a few VMs?
Could we maybe retry the clone in the buildbot script? It seems that's where we fail most of the time. It doesn't solve the overload, but it might help us limp a bit further while we fix this bug.
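
For illustration only, the kind of retry wrapper being suggested might look like the sketch below. This is hypothetical and not part of the actual buildbot configs; the repository URL, retry count, and delay are assumptions.

# Hypothetical sketch of retrying "hg clone" so a transient I/O stall on the
# eql-backed VMs does not fail the whole build. Not the actual buildbot code.
import shutil
import subprocess
import time

def clone_with_retries(repo, dest, attempts=3, delay=60):
    """Try 'hg clone repo dest' up to 'attempts' times, pausing between tries."""
    for attempt in range(1, attempts + 1):
        if subprocess.call(['hg', 'clone', repo, dest]) == 0:
            return True
        print('hg clone failed (attempt %d of %d), retrying in %ds'
              % (attempt, attempts, delay))
        shutil.rmtree(dest, ignore_errors=True)  # discard any partial clone
        time.sleep(delay)
    return False

# Example: clone_with_retries('http://hg.mozilla.org/mozilla-central', 'build')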
(In reply to comment #11)
> Could we maybe retry the clone in the buildbot script? It seems that's where
> we fail most of the time. It doesn't solve the overload, but it might help us
> limp a bit further while we fix this bug.

I think the most common case is linking libxul.so on Linux, actually.

In any case, it's not as simple a change as you'd think to reclone. I'm going to bump up the length of the timeouts, however, since it doesn't seem like this problem is going to go away any time soon.
I removed the special casing for Windows in the UnittestBuildFactory too, since the only difference was the timeout. I also bumped the post-build cleanup timeout for the nightlies to 90 minutes, because we've seen some of them time out as well.
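
For readers unfamiliar with the attached patch, the change is roughly of this shape. This is a minimal sketch assuming stock Buildbot ShellCommand/Compile steps; the actual buildbotcustom factory code and commands differ.

# Minimal sketch of bumping the no-output timeouts to one hour; the step and
# command details are illustrative, not the checked-in buildbotcustom patch.
from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import Compile, ShellCommand

f = BuildFactory()
f.addStep(ShellCommand(
    command=['hg', 'clone', 'http://hg.mozilla.org/mozilla-central', 'build'],
    description=['hg', 'clone'],
    haltOnFailure=True,
    timeout=3600,  # seconds Buildbot will wait without output before killing the step
))
f.addStep(Compile(
    command=['make', '-f', 'client.mk', 'build'],
    timeout=3600,  # bumped from the previous, shorter value
))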
Attachment #368101 - Flags: review?(aki)
Attachment #368101 - Flags: review?(aki) → review+
Comment on attachment 368101 [details] [diff] [review]
[checked in] bump unittest clone, compile timeout to 1 hour

looks good
Comment on attachment 368101 [details] [diff] [review]
[checked in] bump unittest clone, compile timeout to 1 hour

changeset:   225:0d1967cfdd60
Attachment #368101 - Attachment description: bump unittest clone, compile timeout to 1 hour → [checked in] bump unittest clone, compile timeout to 1 hour
(In reply to comment #15)
> (From update of attachment 368101 [details] [diff] [review])
> changeset:   225:0d1967cfdd60

The buildbot master has been reconfig'ed for this. Hopefully this will prevent Buildbot from timing out the process....
Aravind, things are looking better since your last shuffle and the patch I checked in. I didn't see any failures at all overnight.
Okay, will close this out since things look stable.
Status: REOPENED → RESOLVED
Closed: 15 years ago → 15 years ago
Resolution: --- → FIXED
Ben, Aravind: AIUI, extending the timeouts was just a workaround. If this is really "fixed", can we now undo/revert those extended timeouts?
Do we log how long each of these timeoutable steps took?  That would let us tell when we're inching closer to the threshold, and let us distinguish "timeout slightly too aggressive" from "something off the rails".
(In reply to comment #20)
> Do we log how long each of these timeoutable steps took?  That would let us
> tell when we're inching closer to the threshold, and let us distinguish
> "timeout slightly too aggressive" from "something off the rails".

We log how long each step takes to run. However, Buildbot timeouts are reset whenever there is output from the step. So: a timeout of 60min isn't "kill this step after 60min" but rather "kill this step after 60min without output".

We don't keep track of "most time without output" for steps.
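
As a rough illustration of what tracking "most time without output" could look like, here is a hypothetical sketch that scans a build log whose lines start with HH:MM:SS timestamps and reports the longest silent gap, which is the quantity a no-output timeout actually compares against. The timestamp format and log layout are assumptions, not how our logs are necessarily formatted.

# Hypothetical helper: find the longest gap between timestamped log lines,
# i.e. the worst "time without output" that a no-output timeout would see.
# Assumes each interesting line begins with an HH:MM:SS timestamp.
import re
from datetime import datetime, timedelta

STAMP = re.compile(r'^(\d{2}:\d{2}:\d{2})')

def max_silence(log_lines):
    previous = None
    worst = timedelta(0)
    for line in log_lines:
        match = STAMP.match(line)
        if not match:
            continue
        current = datetime.strptime(match.group(1), '%H:%M:%S')
        if previous is not None and current >= previous:
            worst = max(worst, current - previous)
        previous = current
    return worst

# Example: max_silence(open('stdio.log')) -> longest silent stretch as a timedelta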
Product: mozilla.org → mozilla.org Graveyard