Bug 545136 (Closed): Opened 14 years ago, Closed 14 years ago

[Tracking bug] Add 52 physical builders into production pool-o-slaves

Categories

(Release Engineering :: General, defect)

Hardware: x86
OS: All
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: bhearsum)

References

Details

(Whiteboard: [buildslaves])

Attachments

(5 files, 2 obsolete files)

Buildbot configs go here. The setup in staging and the subsequent move to production should also be tracked here.
We need to build with "-j1" on these multi-core physical machines to avoid a broken build. It's unclear whether we should use different flags for VMs and physical machines, or explicitly set "-j1" on both VMs and physical machines, accepting some possible time delays on the VMs.

We need to resolve this before we roll these machines into production.
(In reply to comment #1)
> We need to build with "-j1" on these multi-core physical machines, to avoid a
> broken build. Unclear if we should use different flags for VMs and physical, or
> else explicitly set "-j1" on all VMs and physical - with some possible time
> delays on VMs.

We do not need to use a different -j flag for Linux, only for Windows.

> We need to resolve this before we roll these machines into production.

I'll test out -j1 on VMs and various -j options on the physical machines when they come into staging.
For the curious, the physical spec of these machines:

1U Mercury Rackmount Server
* 2 fixed SATA drive bays
* 260W power supply
* Single-socket server board: dual GigE NICs, VGA
* Integrated IPMI remote management with dedicated LAN
* Intel X3430 2.4GHz quad-core, 8MB cache
* 2 x 2GB DDR3 1066 ECC unbuffered DIMMs (4GB total)
* 1 x Seagate 250GB SATA desktop drive, 7200RPM/8MB cache
OS: Mac OS X → All
Pushing to bhearsum, as he's doing all the work here already.
Assignee: nobody → bhearsum
Attachment #426353 - Flags: review?(nrthomas)
Comment on attachment 426353 [details] [diff] [review]
pick fast slaves for builds, slow slaves for tests

Looks good, r+
Attachment #426353 - Flags: review?(nrthomas) → review+
Comment on attachment 426354 [details] [diff] [review]
set fast slave regexes on production/staging

Nice one!
Attachment #426354 - Flags: review?(nrthomas) → review+
Comment on attachment 426353 [details] [diff] [review]
pick fast slaves for builds, slow slaves for tests

>+def _nextSlowSlave(builder, available_slaves):
>+def _nextFastSlave(builder, available_slaves):

Oh hmm, perhaps these should fall back to available_slaves if there's a problem with _partitionSlaves?
This adds exception handling, and prioritizes slaves that most recently completed a build on the builder that wants to start a new one. This should give better depend-build behaviour.
Attachment #426353 - Attachment is obsolete: true
Attachment #426372 - Flags: review?(nrthomas)
Comment on attachment 426372 [details] [diff] [review]
pick fast slaves for builds, slow slaves for tests v2

>diff --git a/misc.py b/misc.py
>+            log.msg("No fast or slow slaves found, choosing randomly instead")

I'd suggest adding the builder name here too, just to make the logs less ambiguous. r+ if this works out on staging.
Attachment #426372 - Flags: review?(nrthomas) → review+
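To make the approach concrete, here's a minimal sketch (not the actual buildbotcustom code) of a nextSlave-style chooser along the lines reviewed above: prefer fast (ix) slaves, fall back to a random pick if partitioning finds nothing or fails, and include the builder name in the log message. The regex, the attribute access on the slave entries, and the fallback details are assumptions, and the "most recently built on this builder" prioritization is omitted for brevity; _nextSlowSlave would be the mirror image, preferring slow slaves.

import random
import re

from twisted.python import log

# Assumption: fast slaves are identifiable by name, e.g. the ix machines.
FAST_REGEXES = [re.compile(r'-ix-')]

def _partitionSlaves(available_slaves):
    """Split the available slaves into (fast, slow) lists by name."""
    fast, slow = [], []
    for s in available_slaves:
        # Assumption: each entry exposes its slave name this way.
        name = s.slave.slavename
        if any(r.search(name) for r in FAST_REGEXES):
            fast.append(s)
        else:
            slow.append(s)
    return fast, slow

def _nextFastSlave(builder, available_slaves):
    """Prefer a fast slave, then a slow one; fall back to a random pick."""
    try:
        fast, slow = _partitionSlaves(available_slaves)
        if fast:
            return random.choice(fast)
        if slow:
            return random.choice(slow)
    except Exception:
        pass
    # Either partitioning blew up or there was nothing usable in it.
    log.msg("No fast or slow slaves found for %s, choosing randomly instead"
            % builder.name)
    if available_slaves:
        return random.choice(available_slaves)
    return None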
Depends on: 545797
Depends on: 545801
linux-ix-slave01 is having problems connecting to clobberer.
(In reply to comment #12)
> linux-ix-slave01 is having problems connecting to clobberer.

We can fix that in bug 545567, it's the same subnet.
Depends on: 545567
Comment on attachment 426354 [details] [diff] [review]
set fast slave regexes on production/staging

changeset:   2082:ffa8f302bdc3
Attachment #426354 - Flags: checked-in+
Comment on attachment 426372 [details] [diff] [review]
pick fast slaves for builds, slow slaves for tests v2

changeset:   610:eee27559b198
Attachment #426372 - Flags: checked-in+
Here's the latest on these machines, as best I know:
* We got all of the currently powered on ones into staging
* Some of the Linux ones are hung at a 'GRUB' screen, refusing to boot
* The Windows ones are having trouble either staying connected to Buildbot, or connecting to it in the first place.
* https://bugzilla.mozilla.org/show_bug.cgi?id=484799 is breaking the Windows 'make package' step. When the tree is in good enough shape, we'll be backing it out from mozilla-central to fix that.

So, we need to resolve these issues and then continue running them in staging, making sure they build ok.
Depends on: 546731
(In reply to comment #20)

> Here's the latest on these machines, as best I know:
> * We got all of the currently powered on ones into staging
> * Some of the Linux ones are hung at a 'GRUB' screen, refusing to boot

Catlee is looking into this

> * The Windows ones are having trouble either staying connected to Buildbot, or
> connecting to it in the first place.

Turns out the Linux ones are too, and dmoore knows how to fix it.  bug 546731 has the details.

> * https://bugzilla.mozilla.org/show_bug.cgi?id=484799 is breaking the Windows
> 'make package' step. When the tree is in good enough shape, we'll be backing it
> out from mozilla-central to fix that.

Still need to back this out, looking to do it tomorrow.


I looked over all the results, too, and other than some known random-orange tests and a few builds that failed for reasons like somebody shutting the master down, they're green. Windows builds have worked fine with -j4, so we should just leave that on.


So, once the stability issues are resolved on the Linux machines, they can move to production. Same thing for the Windows machines, but we need to back out that m-c patch first, too.
Note that I've kept the l10n slaves on VMs, because they don't need as much processing power, and it lets us be more flexible with the IX machine balance between masters.
Attachment #427590 - Flags: review?(catlee)
Comment on attachment 427590 [details] [diff] [review]
add ix machines to production configs

> SLAVES = {
>-    'linux': ['moz2-linux-slave%02i' % x for x in [1,2] +
>-              range(5,17) + range(18,51)],
>+    'linux': LINUX_VMS + LINUX_IXS,
>     'linux64': ['moz2-linux64-slave%02i' % x for x in range(1,13)],
>-    'win32': ['win32-slave%02i' % x for x in [1,2] + range(5,21) +
>-              range(22,60)],
>+    'win32': WIN32_VMS + LINUX_IXS,

This should be WIN32_IXS.

> ACTIVE_BRANCHES = ['mozilla-central', 'mozilla-1.9.2', 'places']
> L10N_SLAVES = {
>-    'linux': SLAVES['linux'][:8],
>-    'win32': SLAVES['win32'][:8],
>+    'linux': LINUX_VMS[:8],
>+    'win32': WIN32_VMS[:8],
>     'macosx': MAC_MINIS[:6] + XSERVES[:2],
> }

>-    'linux': SLAVES['linux'][-8:],
>-    'win32': SLAVES['win32'][-8:],
>+    'linux': LINUX_VMS[:8],
>+    'win32': WIN32_VMS[:8],
>     'macosx': MAC_MINIS[-6:] + XSERVES[-2:],

I think the second set of these should use *_VMS[-8:] instead of *_VMS[:8]
Attachment #427590 - Flags: review?(catlee) → review-
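For clarity, here's a rough sketch of the shape the config takes once those review comments are addressed: win32 uses WIN32_IXS (not LINUX_IXS) and the second l10n set slices with [-8:]. The slave-name lists and counts are illustrative, not the real inventory, the OTHER_L10N_SLAVES name is hypothetical, and the linux64/macosx entries are trimmed (Python 2, as in the configs).

LINUX_VMS = ['moz2-linux-slave%02i' % x for x in [1, 2] + range(5, 17) + range(18, 51)]
LINUX_IXS = ['linux-ix-slave%02i' % x for x in range(2, 25)]    # illustrative
WIN32_VMS = ['win32-slave%02i' % x for x in [1, 2] + range(5, 21) + range(22, 60)]
WIN32_IXS = ['mw32-ix-slave%02i' % x for x in range(2, 26)]     # illustrative

SLAVES = {
    'linux': LINUX_VMS + LINUX_IXS,
    'win32': WIN32_VMS + WIN32_IXS,   # was LINUX_IXS in the first patch
}

# l10n stays on VMs; one set takes the first eight, the other the last eight.
L10N_SLAVES = {
    'linux': LINUX_VMS[:8],
    'win32': WIN32_VMS[:8],
}
OTHER_L10N_SLAVES = {                 # hypothetical name for the second dict
    'linux': LINUX_VMS[-8:],
    'win32': WIN32_VMS[-8:],
}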
Attachment #427590 - Attachment is obsolete: true
Attachment #427757 - Flags: review?(catlee)
Attachment #427757 - Flags: review?(catlee) → review+
(In reply to comment #21)
> (In reply to comment #20)
> 
> > Here's the latest on these machines, as best I know:
> > * We got all of the currently powered on ones into staging
> > * Some of the Linux ones are hung at a 'GRUB' screen, refusing to boot
> 
> Catlee is looking into this
> 

Still having issues here.

> > * The Windows ones are having trouble either staying connected to Buildbot, or
> > connecting to it in the first place.
> 
> Turns out the Linux ones are too, and dmoore knows how to fix it.  bug 546731
> has the details.
> 

This problem has been fixed.

> > * https://bugzilla.mozilla.org/show_bug.cgi?id=484799 is breaking the Windows
> > 'make package' step. When the tree is in good enough shape, we'll be backing it
> > out from mozilla-central to fix that.
> 
> Still need to back this out, looking to do it tomorrow.

I just backed this out.


Once the backout has been cleared I think we're ready to put the Windows machines in production. However, I'd rather wait until Monday to do so, since there will be more visibility and eyes on them at that time.
Comment on attachment 427757 [details] [diff] [review]
add ix machines to production configs, v2

changeset:   2099:296d434d0eca

I pushed all but the LINUX_IXS parts of this.
Attachment #427757 - Flags: checked-in+
This is the same thing as the staging patch, but with an adjusted path for production.
Attachment #428221 - Flags: review?(catlee)
Whiteboard: [buildslaves]
Comment on attachment 428221 [details] [diff] [review]
set proper MOZ_MAKE_FLAGS

Looks good. You missed removing the MOZ_MAKE_FLAGS line from the tracemonkey xulrunner configs.
Attachment #428221 - Flags: review?(catlee) → review+
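For context, a minimal illustrative sketch of the -j split this patch is after (-j1 on the VMs, -j4 on the quad-core ix machines, per the earlier comments). This is not the actual patch, which edits MOZ_MAKE_FLAGS in the per-branch build configs; the dict and helper names here are made up.

# Illustrative only: one way to express the VM vs. ix make-flag split.
MOZ_MAKE_FLAGS = {
    'vm': '-j1',   # virtual machines: serial builds to avoid breakage
    'ix': '-j4',   # physical quad-core ix machines
}

def make_flags_for(slavename):
    """Hypothetical helper: pick flags based on the slave name."""
    return MOZ_MAKE_FLAGS['ix'] if '-ix-' in slavename else MOZ_MAKE_FLAGS['vm']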
Comment on attachment 428221 [details] [diff] [review]
set proper MOZ_MAKE_FLAGS

changeset:   2102:338b8f2db996
Attachment #428221 - Flags: checked-in+
Current state of the Windows machines:
In production: 02-06, 18, 20-23, 25
In staging: 01, 07, 09, 10, 12-17, 24
Not up at all: 8, 11
Waiting for re-imaging: 19

Note that 01 is going to stay in staging permanently.
Depends on: 547799
08 is up now, and in staging
At this point, all Windows machines are in their final location:
01 - staging
02-16 - pm01
17-25 - pm02

Earlier, they seemed to be flapping between connected and disconnected, similar to when the firewall was dropping long-idle TCP connections. When I came back to them later they were all OK, but I don't doubt it will happen again. I'm going to ask Derek to double-check that all the firewalls/switches are set up correctly.

Also, these are suffering from the same intermittent issue that VMs 50-59 hit -- sometimes the Buildbot start-up script exits quickly without actually starting Buildbot, requiring manual intervention.
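If the firewall really is dropping long-idle connections again, one possible mitigation is a shorter application-level keepalive on the slaves. A sketch of a 0.7-era slave buildbot.tac follows; the path, master host, port, and password are illustrative placeholders, not our real setup.

from twisted.application import service
from buildbot.slave.bot import BuildSlave

basedir = r'E:\builds\slave'        # illustrative path
host = 'pm02'                       # illustrative master host
port = 9000                         # illustrative port
slavename = 'mw32-ix-slave14'
passwd = 'XXXXXXXX'                 # placeholder
keepalive = 600                     # ping the master every 10 minutes
usepty = False
umask = None

application = service.Application('buildslave')
s = BuildSlave(host, port, slavename, passwd, basedir, keepalive, usepty,
               umask=umask)
s.setServiceParent(application)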
I haven't seen any flapping at all today and I'm beginning to wonder if things were just rebooting or I was catching them right after they started buildbot, or something.
(In reply to comment #33)
> I haven't seen any flapping at all today and I'm beginning to wonder if things
> were just rebooting or I was catching them right after they started buildbot,
> or something.

Of course, now I see issues. There have been 3 test runs today on an IX machine that hit exceptions due to a spontaneous connection loss.

Digging through the logs I find messages like this for every slave that hits this:
twistd.log.33:2010-02-24 13:25:01-0800 [Broker,10153,10.250.49.187] duplicate slave mw32-ix-slave14 replacing old one
twistd.log.33:2010-02-24 13:25:01-0800 [Broker,10153,10.250.49.187] disconnecting old slave mw32-ix-slave14 now
twistd.log.33:2010-02-24 13:25:01-0800 [Broker,9997,10.250.49.187] BuildSlave.detached(mw32-ix-slave14)


This generally means one of two things: either one slave is starting two Buildbot processes, or two slaves are identifying themselves identically. I'm not sure yet which it is. It *could* explain all the "lost remote" messages we see in the log. More digging here is needed.
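To get a better handle on this, here's a rough scan one could run over the master's rotated twistd.log files for the duplicate-slave messages quoted above, counting hits per slave name (the glob pattern and message format are taken from the logs above; everything else is just a sketch):

import glob
import re
from collections import defaultdict

pattern = re.compile(r'duplicate slave (\S+) replacing old one')
counts = defaultdict(int)

for path in glob.glob('twistd.log*'):
    with open(path) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                counts[m.group(1)] += 1

for slave, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print('%s: %d duplicate connections' % (slave, n))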
Attachment #429617 - Flags: review?(nrthomas) → review+
Comment on attachment 429617 [details] [diff] [review]
Add linux ix slaves to production configs

changeset:   2122:aeb9169ad002
Attachment #429617 - Flags: checked-in+
Actually, we have 52, not 50, of these 1U machines. We've already set aside two other machines as "ref images", so let's just use these two as slaves.
Depends on: 549524
Summary: [Tracking bug] Add 50 physical builders into production pool-o-slaves → [Tracking bug] Add 52 physical builders into production pool-o-slaves
linux slaves 02,03,05-13 are connected to pm01 now.
Depends on: 550815
Depends on: 551950
(In reply to comment #39)
And they had some problems and got moved off again pretty quickly. Today the Linux machines are on pm01 and pm02. What split was used, Ben?
(In reply to comment #40)
> (In reply to comment #39)
> And they had some problems and got moved off again pretty quickly. Today the
> linux machines are on pm01 and pm02 now. What split was used Ben ?

I put 02 through 11 on pm01; 12 through 24 on pm02.
These slaves were fine overnight, besides one random disconnect, which doesn't seem to be specific to them (it also happened on a Linux VM).

Removing blocking on the win32 issue, since it's not specific to these machines.

We're done here!
Status: NEW → RESOLVED
Closed: 14 years ago
No longer depends on: 550815
Resolution: --- → FIXED
Depends on: 550815
fixing deps
No longer depends on: 550815
Product: mozilla.org → Release Engineering