Closed Bug 1137047 Opened 9 years ago Closed 9 years ago

Reallocate 10 buildpool Mac slaves to trybuildpool to balance load

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: jlund)

Attachments

(1 file, 1 obsolete file)

Percentage of OS X build jobs starting in the first 15 minutes after being scheduled:

      Try   Nontry
2-17  34%    98%
2-18  45%   100%
2-19  37%    79%
2-20  34%    99%
2-21  91%   100%
2-22  96%   100%
2-23  20%    96%
2-24   7%   100%

It's probably a good thing to keep the Mac try build pool a little lean, to encourage people who only need one platform to stay away from it so they don't use our precious Mac test slaves, but 7%? That's getting a little too lean.
Hmm, that 2-24 7% is 372 jobs, and 1-22 had 460 jobs and started 40% in 15 minutes, and all of them in 135 (2-24 claims the final 3% were in 240 minutes, but between the way we're carrying load over to the next day and the way I'm pretty sure we started lying about extremes and calling everything over 240 240, that probably isn't entirely true). So, something's busted.
One (very inconvenient, since nobody's going to want the blame or the ownership for this) thing that's busted is that most if not all the slaves are too-frequently doing "command timed out: 10800 seconds without output" at various random points during builds that appeared to be going just fine, so instead of taking 45-90 minutes per job, they take 3 hours for those.

If that timeout's adjustable per-platform, we could lop an awful lot of it off for Mac - a Windows build might spend some significant percentage of 180 minutes without output, I'd be willing to believe that, but if a Mac build takes 180 minutes *with* output it's broken, so 180 minutes without output is way too generous.
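A minimal sketch of what a per-platform no-output timeout could look like, assuming the timeout can be looked up by platform at all. The table, the function name, the platform keys, and the 5400-second Mac value are all hypothetical (the Mac figure is just the 90-minute upper end of a normal job mentioned above); only the 10800-second default comes from the error message in this bug.

```python
# Default taken from the "command timed out: 10800 seconds without output" message.
DEFAULT_NO_OUTPUT_TIMEOUT = 10800  # seconds (180 minutes)

# Hypothetical overrides; neither the key name nor the value is from the real config.
NO_OUTPUT_TIMEOUTS = {
    "macosx64": 5400,  # assumed: 90 minutes, the top of the normal 45-90 minute job range
}


def no_output_timeout(platform):
    """Return the no-output timeout in seconds for a platform, or the default."""
    return NO_OUTPUT_TIMEOUTS.get(platform, DEFAULT_NO_OUTPUT_TIMEOUT)


print(no_output_timeout("macosx64"))
print(no_output_timeout("win32"))
```

With a table like this, Windows keeps the generous default while a hung Mac build is killed in half the time.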

Given the number of busted disks we've replaced in non-try Mac build slaves lately, it wouldn't surprise me if what's behind the frequent timeouts is bad disks, but we're very unlikely to be told about intermittent "I/O error" failures on try, and I don't know how we could just blindly run diagnostics on them, since... 7%. Guess maybe we could move 10 non-try slaves to try, and 10 try slaves to non-try, so their failures would actually get noticed, and then later move the other 10.
I looked at this from the standpoint of how much work was going through both the build and try pools for Mac in the past week, based on the data from the wait times emails.

3650 total jobs
1547 try (42%)
2103 build (58%)

If we wanted to be equitable based on load, we would divide the 79 builder minis into 46 in build and 33 in try.

I'm fine with some smaller-than-what-we're-currently-seeing wait times on try (7% is ridiculous), so I'd advocate moving 10 slaves from build->try.
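The load-based split above can be reproduced with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and it rounds the try share and gives the remainder to build so the two always add up to the machine total.

```python
def proportional_split(total_machines, build_jobs, try_jobs):
    """Split machines between build and try in proportion to job counts."""
    total_jobs = build_jobs + try_jobs
    # Round the try share; build gets the remainder so the sum stays exact.
    try_machines = round(total_machines * try_jobs / total_jobs)
    build_machines = total_machines - try_machines
    return build_machines, try_machines


# Figures from the wait times emails quoted above: 2103 build jobs, 1547 try jobs.
build_machines, try_machines = proportional_split(79, 2103, 1547)
print(build_machines, try_machines)  # 46 33
```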

(In reply to Phil Ringnalda (:philor) from comment #2)
> If that timeout's adjustable per-platform, we could lop an awful lot of it
> off for Mac - a Windows build might spend some significant percentage of 180
> minutes without output, I'd be willing to believe that, but if a Mac build
> takes 180 minutes *with* output it's broken, so 180 minutes without output
> is way too generous.

This should be doable. Let's spin out a separate bug for that.

> Given the number of busted disks we've replaced in non-try Mac build slaves
> lately, it wouldn't surprise me if what's behind the frequent timeouts is
> bad disks, but we're very unlikely to be told about intermittent "I/O error"
> failures on try, and I don't know how we could just blindly run diagnostics
> on them, since... 7%. Guess maybe we could move 10 non-try slaves to try, and
> 10 try slaves to non-try, so their failures would actually get noticed, and
> then later move the other 10.

If it were easier to move machines between the two pools, this would be an easier sell. I think we tackle the other two points (balance pools & shorten Mac timeouts), and circle back to this one if we feel we still need to.
Assignee: nobody → jlund
Summary: Rebalance the Mac build slaves between buildpool and trybuildpool → Reallocate 10 buildpool Mac slaves to trybuildpool to balance load
See Also: → 1138524
Depends on: 1138672
I think this is all that's needed from buildbot repos
Attachment #8571638 - Flags: review?(coop)
Attachment #8571638 - Flags: review?(coop) → review+
We are no longer moving 006; instead we are moving just 9 Mac build slaves to the try pool.

interdiff:
diff --git a/mozilla/production_config.py b/mozilla/production_config.py
index edb309d..2fb71dd 100644
--- a/mozilla/production_config.py
+++ b/mozilla/production_config.py
@@ -1,4 +1,4 @@
-MAC_LION_MINIS = ['bld-lion-r5-%03d' % x for x in range(1,6) + range(41,69) + \
+MAC_LION_MINIS = ['bld-lion-r5-%03d' % x for x in range(1,7) + range(41,69) + \
                   range(70,87) + range(88,95)]
 WIN64_REV2     = ['b-2008-ix-%04i' % x for x in range(1,18) + range(65,89) + range(90,159) + range(161,173)]
 LINUX64_EC2    = ['bld-linux64-ec2-%03d' % x for x in range(1, 50) + range(301, 350)] + \
@@ -16,7 +16,7 @@ TRY_LINUX64_EC2 = ['try-linux64-ec2-%03d' % x for x in range(1, 60) + range(301,
     ['try-linux64-spot-%03d' % x for x in range(1, 200) + range(300,500)] + \
     ['try-linux64-spot-%d' % x for x in range(1000, 1100)]
 TRY_WIN64_REV2 = ['b-2008-ix-%04i' % x for x in range(18, 65) + range(173,185)]
-TRY_LION         = ['bld-lion-r5-%03d' % x for x in range(6,37)]
+TRY_LION         = ['bld-lion-r5-%03d' % x for x in range(7,37)]
 if set(TRY_WIN64_REV2).intersection(WIN64_REV2):
     raise Exception('TRY_WIN64_REV2 and WIN64_REV2 overlap')
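As a sanity check on the interdiff, here is a sketch that rebuilds the two Lion lists as they stand after the change and verifies they don't overlap. It is a Python 3 adaptation (the config above concatenates `range()` results, which only works in Python 2), the `names` helper is hypothetical, and the overlap check simply mirrors the WIN64 one visible in the diff context; whether the real config has an equivalent check for the Lion lists isn't shown here.

```python
from itertools import chain


def names(prefix, *ranges):
    """Build zero-padded slave names for every number in the given ranges."""
    return ['%s-%03d' % (prefix, x) for x in chain(*ranges)]


# Post-interdiff values: 006 stays in the build pool, try starts at 007.
MAC_LION_MINIS = names('bld-lion-r5', range(1, 7), range(41, 69),
                       range(70, 87), range(88, 95))
TRY_LION = names('bld-lion-r5', range(7, 37))

# Same style of guard as the TRY_WIN64_REV2 / WIN64_REV2 check in the config.
overlap = set(TRY_LION).intersection(MAC_LION_MINIS)
if overlap:
    raise Exception('TRY_LION and MAC_LION_MINIS overlap: %s' % sorted(overlap))

print(len(MAC_LION_MINIS), len(TRY_LION))  # 58 30
```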
Attachment #8571638 - Attachment is obsolete: true
Attachment #8573011 - Flags: review?(coop)
slavealloc has been updated to reflect that it is in the try pool now
Attachment #8573011 - Flags: review?(coop) → review+
buildbot-config patch is in production
I've enabled these slaves and they have started taking jobs. Leaving open till we get a green confirmation.
The first jobs that have completed look good. A couple of TB failures, but that doesn't alarm me given TB build history outside these slaves. I'll leave them running.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: General Automation → General