Closed Bug 483896 Opened 15 years ago Closed 15 years ago

Problems with builds backed by eql storage

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: aravind)

References

Details

Attachments

(1 file)

Since about 7:30am this morning, every unit test build running on a Linux VM backed by the eql storage controller has failed to complete compilation. In contrast, builds on the c-fcal or d-fcal controllers have not had this problem. We've also seen intermittent problems pulling from hg.m.o (an I/O-intensive operation) on Windows VMs backed by eql.

Please urgently check the load on the eql array.
Assignee: server-ops → aravind
We are aware that the EqualLogic arrays are overloaded, but didn't expect the problem to manifest itself like this.  I cleared out some space on the faster array; this should allow more volumes to use it, and we should see some performance improvement.

Give it overnight to settle and please comment here if you notice similar failures tomorrow as well.
Things are looking better today. Since approx. 6pm PDT yesterday (March 17), I see only 3 failures caused by timeouts, with the most recent at 4am PDT (almost 3 hours ago).

What's the long term plan here? Just getting more storage so each array is less loaded?
We are talking to the array vendors to figure out future growth plans.  We will most likely buy more arrays, but for now, this bug is resolved.  Please re-open if needed.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
(In reply to comment #4)
> We are talking to the array vendors to figure out future growth plans.  We will
> most likely buy more arrays, but for now, this bug is resolved.  Please re-open
> if needed.

We've had 4 failures due to timeouts today. That's a lot better than yesterday, but I wouldn't call it "resolved". Is there anything else we can do in the meantime to help the situation?
Alright, which machines are those?  Can you tell me what datastore they are on?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
moz2-win32-slave12 @ 8:43am - eql01-bm04
moz2-linux-slave19 @ 8:35am - eql01-bm07
moz2-win32-slave14 @ 8:27am - eql01-bm05
moz2-win32-slave18 @ 8:06am - eql01-bm05
Hmm, that's pretty much all of them.  I moved a few more volumes around; let's see if that improves things.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1237405530.1237407025.19881.gz

WINNT 5.2 mozilla-central unit test on 2009/03/18 12:45:30 failed with a timeout in "hg clone".
(In reply to comment #9)
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1237405530.1237407025.19881.gz
> 
> WINNT 5.2 mozilla-central unit test on 2009/03/18 12:45:30 failed with a
> timeout in "hg clone".

There have been two additional ones, too.

Aravind, would it help if we shut off a few VMs?
Could we maybe retry the clone in the buildbot script? It seems that's where we fail most of the time. It doesn't solve the overload, but it might help us limp a bit further while we fix this bug.
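
For illustration only, the kind of retry wrapper being suggested might look like the sketch below. This is hypothetical and not part of the actual buildbot configs; the repository URL, retry count, and delay are assumptions.

# Hypothetical sketch of retrying "hg clone" so a transient I/O stall on the
# eql-backed VMs does not fail the whole build. Not the actual buildbot code.
import shutil
import subprocess
import time

def clone_with_retries(repo, dest, attempts=3, delay=60):
    """Try 'hg clone repo dest' up to 'attempts' times, pausing between tries."""
    for attempt in range(1, attempts + 1):
        if subprocess.call(['hg', 'clone', repo, dest]) == 0:
            return True
        print('hg clone failed (attempt %d of %d), retrying in %ds'
              % (attempt, attempts, delay))
        shutil.rmtree(dest, ignore_errors=True)  # discard any partial clone
        time.sleep(delay)
    return False

# Example: clone_with_retries('http://hg.mozilla.org/mozilla-central', 'build')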
(In reply to comment #11)
> Could we maybe retry the clone in the buildbot script? It seems that's where
> we fail most of the time. It doesn't solve the overload, but it might help us
> limp a bit further while we fix this bug.

I think the most common case is linking libxul.so on Linux, actually.

In any case, it's not as simple a change as you'd think to reclone. I'm going to bump up the length of the timeouts, however, since it doesn't seem like this problem is going to go away any time soon.
I removed the special casing for Windows in the UnittestBuildFactory too, since the only difference was the timeout. I also bumped the post-build cleanup timeout for the nightlies to 90 minutes, because we've seen some of them time out as well.
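
For readers unfamiliar with the attached patch, the change is roughly of this shape. This is a minimal sketch assuming stock Buildbot ShellCommand/Compile steps; the actual buildbotcustom factory code and commands differ.

# Minimal sketch of bumping the no-output timeouts to one hour; the step and
# command details are illustrative, not the checked-in buildbotcustom patch.
from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import Compile, ShellCommand

f = BuildFactory()
f.addStep(ShellCommand(
    command=['hg', 'clone', 'http://hg.mozilla.org/mozilla-central', 'build'],
    description=['hg', 'clone'],
    haltOnFailure=True,
    timeout=3600,  # seconds Buildbot will wait without output before killing the step
))
f.addStep(Compile(
    command=['make', '-f', 'client.mk', 'build'],
    timeout=3600,  # bumped from the previous, shorter value
))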
Attachment #368101 - Flags: review?(aki)
Attachment #368101 - Flags: review?(aki) → review+
Comment on attachment 368101 [details] [diff] [review]
[checked in] bump unittest clone, compile timeout to 1 hour

looks good
Comment on attachment 368101 [details] [diff] [review]
[checked in] bump unittest clone, compile timeout to 1 hour

changeset:   225:0d1967cfdd60
Attachment #368101 - Attachment description: bump unittest clone, compile timeout to 1 hour → [checked in] bump unittest clone, compile timeout to 1 hour
(In reply to comment #15)
> (From update of attachment 368101 [details] [diff] [review])
> changeset:   225:0d1967cfdd60

The buildbot master has been reconfig'ed for this. Hopefully this will prevent Buildbot from timing out the process....
Aravind, things are looking better since your last shuffle and the patch I checked in. I didn't see any failures at all overnight.
Okay, will close this out since things look stable.
Status: REOPENED → RESOLVED
Closed: 15 years ago → 15 years ago
Resolution: --- → FIXED
Ben, Aravind: AIUI, extending the timeouts was just a workaround. If this is really "fixed", can we now undo/revert those extended timeouts?
Do we log how long each of these timeoutable steps took?  That would let us tell when we're inching closer to the threshold, and let us distinguish "timeout slightly too aggressive" from "something off the rails".
(In reply to comment #20)
> Do we log how long each of these timeoutable steps took?  That would let us
> tell when we're inching closer to the threshold, and let us distinguish
> "timeout slightly too aggressive" from "something off the rails".

We log how long each step takes to run. However, Buildbot timeouts are reset whenever there is output from the step. So: a timeout of 60min isn't "kill this step after 60min" but rather "kill this step after 60min without output".

We don't keep track of "most time without output" for steps.
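
As a rough illustration of what tracking "most time without output" could look like, here is a hypothetical sketch that scans a build log whose lines start with HH:MM:SS timestamps and reports the longest silent gap, which is the quantity a no-output timeout actually compares against. The timestamp format and log layout are assumptions, not how our logs are necessarily formatted.

# Hypothetical helper: find the longest gap between timestamped log lines,
# i.e. the worst "time without output" that a no-output timeout would see.
# Assumes each interesting line begins with an HH:MM:SS timestamp.
import re
from datetime import datetime, timedelta

STAMP = re.compile(r'^(\d{2}:\d{2}:\d{2})')

def max_silence(log_lines):
    previous = None
    worst = timedelta(0)
    for line in log_lines:
        match = STAMP.match(line)
        if not match:
            continue
        current = datetime.strptime(match.group(1), '%H:%M:%S')
        if previous is not None and current >= previous:
            worst = max(worst, current - previous)
        previous = current
    return worst

# Example: max_silence(open('stdio.log')) -> longest silent stretch as a timedelta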
Product: mozilla.org → mozilla.org Graveyard