Closed Bug 981903 Opened 8 years ago Closed 8 years ago

Enlarge the mozilla-inbound jacuzzis, or get rid of them until we can figure out how to calculate the proper size

Categories

(Release Engineering :: General, defect)

x86
Windows XP
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

Details

When I just looked at mozilla-inbound pending, we had Windows debug builds pending across 15 pushes and 70 minutes, and b2g emulator ics builds pending across 8 pushes and 40 minutes (unamusingly, the emulator jb builds which don't matter nearly as much since they don't run tests, and which are not in a jacuzzi, are keeping up reasonably well).

Neither 70 minutes nor 40 minutes worth of inbound is a workable chunk to coalesce. What is a workable chunk? Not really sure, and the fact that we can't say is one reason why we're not actually ready to deploy jacuzzis: we don't know how to determine the correct size, in part because we flat out have absolutely no idea how to say whether any particular size is working. If, as I think is the case, our only measure for "that's enough slaves" is "the sheriffs aren't yelling at us, and aren't closing the tree over it," well then the current size is too small, because when I saw 15 pushes and 70 minutes of Win debug, I closed the tree.
Just to document it here:

[20:45:23]	<bhearsum>	nthomas: i was going to turn them off, but philor told me to leave them on in #releng ~20min ago

the timestamp is ET
Things are open this morning, and I've been busy with release work so I haven't done anything about this.

We should probably do some calculations before doing anything here though:
* to verify that the current jacuzzi is actually slower, e.g., that wait+build time is actually longer than the previous wait+build time. If we're just seeing elevated wait times but quicker turnaround overall, I don't think there's as great a need to make a change.
* to make sure that we don't starve other builders before adding more machines to these jacuzzis.
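A rough sketch of the first calculation, comparing total turnaround (wait + build) before and after the jacuzzi rather than wait time alone. The function names and all of the numbers below are made up for illustration; real figures would have to come from actual job data, and the median is only one possible summary.

```python
from statistics import median

def turnaround_minutes(jobs):
    """Return wait + build time, in minutes, for each (wait, build) pair."""
    return [wait + build for wait, build in jobs]

def jacuzzi_is_slower(before, after):
    """True if median turnaround with the jacuzzi exceeds the median without it."""
    return median(turnaround_minutes(after)) > median(turnaround_minutes(before))

# Hypothetical samples, NOT measured data: (wait_minutes, build_minutes)
before = [(5, 120), (10, 110), (0, 130)]   # short waits, slow builds
after = [(70, 50), (40, 45), (20, 50)]     # long waits, fast builds

print(jacuzzi_is_slower(before, after))
```

With numbers like these, waits are much longer but overall turnaround is still shorter, which is the case where a change may not be needed.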
70min wait times sound terrible. Can we build on non-jacuzzis once wait times cross over some threshold?
Alternatively we could let sheriffs trigger 'build on non-jacuzzis nodes' manually during busy times while we figure out a heuristic.
Full build cycles of coalescing should actually be completely expected, particularly on mozilla-inbound, where we land tons of bustage and close all the time. Close for bustage, reopen, five pushes land fairly quickly, the entire Windows jacuzzi gets occupied, and then everything else that lands for the next 90-100 minutes waits for those first five pushes to give up a slave and gets coalesced into one build. That will be an everyday occurrence.

Another immeasurable to add to the calculation is the enormous risk of coalescing: had someone landed Windows bustage without an obvious source during that coalescing of 15 pushes yesterday, I would have closed and retriggered to find out where it came from. With 15 pushes, 5 slaves in the jacuzzi, and some bad luck, that closure would have been three Windows build cycles long.
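The arithmetic behind that "three build cycles" can be sketched as follows. It assumes every coalesced push gets retriggered at once and that the jacuzzi is doing nothing else in the meantime; the 100-minute build time is a placeholder, not a measured value.

```python
import math

def retrigger_cycles(pushes, slaves):
    """Build cycles needed to build every push when only `slaves` can run at once."""
    return math.ceil(pushes / slaves)

def closure_minutes(pushes, slaves, build_minutes):
    """Worst-case closure length if the culprit is only found in the last cycle."""
    return retrigger_cycles(pushes, slaves) * build_minutes

print(retrigger_cycles(15, 5))        # 15 pushes, 5 slaves in the jacuzzi
print(closure_minutes(15, 5, 100))    # build_minutes=100 is hypothetical
```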
I'm very busy with release stuff today, but I disabled the mozilla-inbound jacuzzis, per IRC: https://github.com/bhearsum/static-jacuzzis/commit/dd9df6d731f85ffb410bfdbe677fd524620b5554
And one more measurement challenge: you can't compare speed gains across branches. On b2g-inbound, where the typical push needs to rebuild absolutely none of Gecko, emulator builds went from 90-140 minutes down to 20-25 minutes. On mozilla-inbound, they only dropped to around 50 minutes. On a hypothetical zbarsky-inbound, where every single push needs to rebuild all of js, or dom, or layout, or all three, they'd probably only drop to 80 minutes.
(In reply to Taras Glek (:taras) from comment #4)
> Alternatively we could let sheriffs trigger 'build on non-jacuzzis nodes'
> manually during busy times while we figure out a heuristic.

The way we're implemented right now (static files manually pulled) we don't have a way to give them a switch to flip. If we were OK with automatically pulling, I could give them access to push to https://github.com/bhearsum/static-jacuzzis, but it would still be a very fiddly manual process. I wouldn't really recommend this until we have a more robust system. These static files are just a short term hack IMO.

If/when we do make this possible, we should probably make sure that it's in addition to the jacuzzi'ed machines. I.e., don't let jacuzzi-allocated machines all of a sudden do any build we have, because their objdirs are likely going to get clobbered if they do. Instead, do builds on the jacuzzi'ed machines + any others.
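A minimal sketch of that rule, combined with the wait-time threshold suggested earlier in this bug. Nothing like this exists in the static-files setup; the function, its parameters, and the slave names are all hypothetical. The point is only that overflow adds unallocated machines to a builder's jacuzzi and never borrows machines that belong to another jacuzzi.

```python
def eligible_slaves(jacuzzi, all_slaves, other_jacuzzis, wait_minutes, threshold_minutes):
    """Return the slaves a builder may use.

    The builder's own jacuzzi is always eligible. Once the wait crosses the
    threshold, slaves that are in no jacuzzi at all are added. Slaves in
    other jacuzzis are never added, so their objdirs are left alone.
    """
    eligible = set(jacuzzi)
    if wait_minutes > threshold_minutes:
        allocated_elsewhere = set().union(*other_jacuzzis)
        eligible |= set(all_slaves) - allocated_elsewhere
    return eligible

# Hypothetical slave names, for illustration only
win_debug = {"w1", "w2"}
everything = {"w1", "w2", "w3", "w4", "w5"}
others = [{"w5"}]

print(sorted(eligible_slaves(win_debug, everything, others, 10, 30)))
print(sorted(eligible_slaves(win_debug, everything, others, 70, 30)))
```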
For those who thought following this bug would actually keep you updated: two or three weeks ago, m-i got tossed into the jacuzzi again, this time with 13 slaves for 32+64 opt and 13 slaves for 32+64 debug, which may possibly be the reasonable number. It is absolutely possible to get all of both pools running, then get more pushes, and get coalescing because of the jacuzzi, but at least so far I haven't seen pushes evenly spaced enough to make it clearly only because of the jacuzzi. If we get multiple pushes within a few minutes, yeah, we coalesce; we always have.

Given the build times and the size of the jacuzzis, there's probably some theoretical spacing, like one push every 9 minutes for 90 minutes, which would result in unacceptable coalescing, but we don't push in an orderly fashion, ever. We close, reopen, someone pushes, everyone sees there was a reopening and rebases like mad and pushes over the next 10 minutes, there's a gap with a push or two, we close. Probably, this is close enough.
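To play with that kind of theoretical spacing, here is a toy simulation. It is a simplification under loud assumptions: a single builder with its own dedicated pool (not a pool shared between 32- and 64-bit builds), a fixed build time, and a freed slave building only the newest pending push while all older pending pushes are merged into it. The 100-minute build time and the pool sizes in the usage lines are placeholders.

```python
import heapq

def simulate_coalescing(push_times, pool_size, build_minutes):
    """Return how many pushes never get their own build because they were coalesced."""
    free_at = [0] * pool_size          # time at which each slave becomes free
    heapq.heapify(free_at)
    pushes = sorted(push_times)
    skipped = 0
    i = 0
    while i < len(pushes):
        slave_free = heapq.heappop(free_at)   # earliest available slave
        if slave_free <= pushes[i]:
            # A slave is idle when the push arrives: build it on its own.
            start = pushes[i]
            j = i + 1
        else:
            # Every slave is busy: pushes pile up until this one frees,
            # then they are all merged into a single build.
            start = slave_free
            j = i
            while j < len(pushes) and pushes[j] <= start:
                j += 1
            skipped += (j - i) - 1
        heapq.heappush(free_at, start + build_minutes)
        i = j
    return skipped

every_9_minutes = range(0, 91, 9)      # one push every 9 minutes for 90 minutes
print(simulate_coalescing(every_9_minutes, 13, 100))
print(simulate_coalescing(every_9_minutes, 5, 100))
```

Feeding it real push timestamps instead of an even spacing would be the more interesting test, since we never push in an orderly fashion.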
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: General Automation → General