Bug 545136 (Closed): Opened 14 years ago, Closed 14 years ago

[Tracking bug] Add 52 physical builders into production pool-o-slaves

Categories

(Release Engineering :: General, defect)

Hardware: x86
OS: All
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: bhearsum)

References

Details

(Whiteboard: [buildslaves])

Attachments

(5 files, 2 obsolete files)

Buildbot configs go here. The setup in staging and the subsequent move to production should also be tracked here.
We need to build with "-j1" on these multi-core physical machines to avoid a broken build. It's unclear whether we should use different flags for VMs and physical machines, or explicitly set "-j1" on both VMs and physical machines, accepting some possible time delays on the VMs.

We need to resolve this before we roll these machines into production.
(In reply to comment #1)
> We need to build with "-j1" on these multi-core physical machines, to avoid a
> broken build. Unclear if we should use different flags for VMs and physical, or
> else explicitly set "-j1" on all VMs and physical - with some possible time
> delays on VMs.

We do not need to use a different -j flag for Linux, only for Windows.

> We need to resolve this before we roll these machines into production.

I'll test out -j1 on VMs and various -j options on the physical machines when they come into staging.
For the curious, the physical spec of these machines:

1U Mercury Rackmount Server
* 2 fixed SATA drive bays
* 260W power supply
* Single-socket server board: dual GigE NICs, VGA
* Integrated IPMI remote management with dedicated LAN
* Intel X3430 2.4GHz quad-core, 8MB cache
* 2 x 2GB DDR3 1066 ECC unbuffered DIMMs (4GB total)
* 1 x Seagate 250GB SATA desktop drive, 7200RPM/8MB cache
OS: Mac OS X → All
Pushing to bhearsum, as he's doing all the work here already.
Assignee: nobody → bhearsum
Attachment #426353 - Flags: review?(nrthomas)
Comment on attachment 426353 [details] [diff] [review]
pick fast slaves for builds, slow slaves for tests

Looks good, r+
Attachment #426353 - Flags: review?(nrthomas) → review+
Comment on attachment 426354 [details] [diff] [review]
set fast slave regexes on production/staging

Nice one!
Attachment #426354 - Flags: review?(nrthomas) → review+
Comment on attachment 426353 [details] [diff] [review]
pick fast slaves for builds, slow slaves for tests

>+def _nextSlowSlave(builder, available_slaves):
>+def _nextFastSlave(builder, available_slaves):

Oh hmm, perhaps these should fall back to available_slaves if there's a problem with _partitionSlaves?
This adds exception handling, and prioritizes slaves that most recently completed a build on the builder that wants to start a new one. This should give better depend-build behaviour.
Attachment #426353 - Attachment is obsolete: true
Attachment #426372 - Flags: review?(nrthomas)
Comment on attachment 426372 [details] [diff] [review]
pick fast slaves for builds, slow slaves for tests v2

>diff --git a/misc.py b/misc.py
>+            log.msg("No fast or slow slaves found, choosing randomly instead")

I'd suggest adding the builder name here too, just to make the logs less ambiguous. r+ if this works out on staging.
Attachment #426372 - Flags: review?(nrthomas) → review+
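To make the approach concrete, here's a minimal sketch (not the actual buildbotcustom code) of a nextSlave-style chooser along the lines reviewed above: prefer fast (ix) slaves, fall back to a random pick if partitioning finds nothing or fails, and include the builder name in the log message. The regex, the attribute access on the slave entries, and the fallback details are assumptions, and the "most recently built on this builder" prioritization is omitted for brevity; _nextSlowSlave would be the mirror image, preferring slow slaves.

import random
import re

from twisted.python import log

# Assumption: fast slaves are identifiable by name, e.g. the ix machines.
FAST_REGEXES = [re.compile(r'-ix-')]

def _partitionSlaves(available_slaves):
    """Split the available slaves into (fast, slow) lists by name."""
    fast, slow = [], []
    for s in available_slaves:
        # Assumption: each entry exposes its slave name this way.
        name = s.slave.slavename
        if any(r.search(name) for r in FAST_REGEXES):
            fast.append(s)
        else:
            slow.append(s)
    return fast, slow

def _nextFastSlave(builder, available_slaves):
    """Prefer a fast slave, then a slow one; fall back to a random pick."""
    try:
        fast, slow = _partitionSlaves(available_slaves)
        if fast:
            return random.choice(fast)
        if slow:
            return random.choice(slow)
    except Exception:
        pass
    # Either partitioning blew up or there was nothing usable in it.
    log.msg("No fast or slow slaves found for %s, choosing randomly instead"
            % builder.name)
    if available_slaves:
        return random.choice(available_slaves)
    return None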
Depends on: 545797
Depends on: 545801
linux-ix-slave01 is having problems connecting to clobberer.
(In reply to comment #12)
> linux-ix-slave01 is having problems connecting to clobberer.

We can fix that in bug 545567, it's the same subnet.
Depends on: 545567
Comment on attachment 426354 [details] [diff] [review]
set fast slave regexes on production/staging

changeset:   2082:ffa8f302bdc3
Attachment #426354 - Flags: checked-in+
Comment on attachment 426372 [details] [diff] [review]
pick fast slaves for builds, slow slaves for tests v2

changeset:   610:eee27559b198
Attachment #426372 - Flags: checked-in+
Here's the latest on these machines, as best I know:
* We got all of the currently powered on ones into staging
* Some of the Linux ones are hung at a 'GRUB' screen, refusing to boot
* The Windows ones are having trouble either staying connected to Buildbot, or connecting to it in the first place.
* https://bugzilla.mozilla.org/show_bug.cgi?id=484799 is breaking the Windows 'make package' step. When the tree is in good enough shape, we'll be backing it out from mozilla-central to fix that.

So, we need to resolve these issues and then continue running them in staging, making sure they build ok.
Depends on: 546731
(In reply to comment #20)

> Here's the latest on these machines, as best I know:
> * We got all of the currently powered on ones into staging
> * Some of the Linux ones are hung at a 'GRUB' screen, refusing to boot

Catlee is looking into this

> * The Windows ones are having trouble either staying connected to Buildbot, or
> connecting to it in the first place.

Turns out the Linux ones are too, and dmoore knows how to fix it.  bug 546731 has the details.

> * https://bugzilla.mozilla.org/show_bug.cgi?id=484799 is breaking the Windows
> 'make package' step. When the tree is in good enough shape, we'll be backing it
> out from mozilla-central to fix that.

Still need to back this out, looking to do it tomorrow.


I looked over all the results, too, and other than some known random-orange tests and a few builds that failed for reasons like somebody shutting the master down, they're green. Windows builds have worked fine with -j4, so we should just leave that on.


So, once the stability issues are resolved on the Linux machines, they can move to production. Same thing for the Windows machines, but we need to back out that m-c patch first, too.
Note that I've kept the l10n slaves on VMs, because they don't need as much processing power, and it lets us be more flexible with the IX machine balance between masters.
Attachment #427590 - Flags: review?(catlee)
Comment on attachment 427590 [details] [diff] [review]
add ix machines to production configs

> SLAVES = {
>-    'linux': ['moz2-linux-slave%02i' % x for x in [1,2] +
>-              range(5,17) + range(18,51)],
>+    'linux': LINUX_VMS + LINUX_IXS,
>     'linux64': ['moz2-linux64-slave%02i' % x for x in range(1,13)],
>-    'win32': ['win32-slave%02i' % x for x in [1,2] + range(5,21) +
>-              range(22,60)],
>+    'win32': WIN32_VMS + LINUX_IXS,

This should be WIN32_IXS.

> ACTIVE_BRANCHES = ['mozilla-central', 'mozilla-1.9.2', 'places']
> L10N_SLAVES = {
>-    'linux': SLAVES['linux'][:8],
>-    'win32': SLAVES['win32'][:8],
>+    'linux': LINUX_VMS[:8],
>+    'win32': WIN32_VMS[:8],
>     'macosx': MAC_MINIS[:6] + XSERVES[:2],
> }

>-    'linux': SLAVES['linux'][-8:],
>-    'win32': SLAVES['win32'][-8:],
>+    'linux': LINUX_VMS[:8],
>+    'win32': WIN32_VMS[:8],
>     'macosx': MAC_MINIS[-6:] + XSERVES[-2:],

I think the second set of these should use *_VMS[-8:] instead of *_VMS[:8]
Attachment #427590 - Flags: review?(catlee) → review-
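For clarity, here's a rough sketch of the shape the config takes once those review comments are addressed: win32 uses WIN32_IXS (not LINUX_IXS) and the second l10n set slices with [-8:]. The slave-name lists and counts are illustrative, not the real inventory, the OTHER_L10N_SLAVES name is hypothetical, and the linux64/macosx entries are trimmed (Python 2, as in the configs).

LINUX_VMS = ['moz2-linux-slave%02i' % x for x in [1, 2] + range(5, 17) + range(18, 51)]
LINUX_IXS = ['linux-ix-slave%02i' % x for x in range(2, 25)]    # illustrative
WIN32_VMS = ['win32-slave%02i' % x for x in [1, 2] + range(5, 21) + range(22, 60)]
WIN32_IXS = ['mw32-ix-slave%02i' % x for x in range(2, 26)]     # illustrative

SLAVES = {
    'linux': LINUX_VMS + LINUX_IXS,
    'win32': WIN32_VMS + WIN32_IXS,   # was LINUX_IXS in the first patch
}

# l10n stays on VMs; one set takes the first eight, the other the last eight.
L10N_SLAVES = {
    'linux': LINUX_VMS[:8],
    'win32': WIN32_VMS[:8],
}
OTHER_L10N_SLAVES = {                 # hypothetical name for the second dict
    'linux': LINUX_VMS[-8:],
    'win32': WIN32_VMS[-8:],
}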
Attachment #427590 - Attachment is obsolete: true
Attachment #427757 - Flags: review?(catlee)
Attachment #427757 - Flags: review?(catlee) → review+
(In reply to comment #21)
> (In reply to comment #20)
> 
> > Here's the latest on these machines, as best I know:
> > * We got all of the currently powered on ones into staging
> > * Some of the Linux ones are hung at a 'GRUB' screen, refusing to boot
> 
> Catlee is looking into this
> 

Still having issues here.

> > * The Windows ones are having trouble either staying connected to Buildbot, or
> > connecting to it in the first place.
> 
> Turns out the Linux ones are too, and dmoore knows how to fix it.  bug 546731
> has the details.
> 

This problem has been fixed.

> > * https://bugzilla.mozilla.org/show_bug.cgi?id=484799 is breaking the Windows
> > 'make package' step. When the tree is in good enough shape, we'll be backing it
> > out from mozilla-central to fix that.
> 
> Still need to back this out, looking to do it tomorrow.

I just backed this out.


Once the backout has been cleared I think we're ready to put the Windows machines in production. However, I'd rather wait until Monday to do so, since there will be more visibility and eyes on them at that time.
Comment on attachment 427757 [details] [diff] [review]
add ix machines to production configs, v2

changeset:   2099:296d434d0eca

I pushed all but the LINUX_IXS parts of this.
Attachment #427757 - Flags: checked-in+
This is the same thing as the staging patch, but with an adjusted path for production.
Attachment #428221 - Flags: review?(catlee)
Whiteboard: [buildslaves]
Comment on attachment 428221 [details] [diff] [review]
set proper MOZ_MAKE_FLAGS

Looks good. You missed removing the MOZ_MAKE_FLAGS line from the tracemonkey xulrunner configs.
Attachment #428221 - Flags: review?(catlee) → review+
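For context, a minimal illustrative sketch of the -j split this patch is after (-j1 on the VMs, -j4 on the quad-core ix machines, per the earlier comments). This is not the actual patch, which edits MOZ_MAKE_FLAGS in the per-branch build configs; the dict and helper names here are made up.

# Illustrative only: one way to express the VM vs. ix make-flag split.
MOZ_MAKE_FLAGS = {
    'vm': '-j1',   # virtual machines: serial builds to avoid breakage
    'ix': '-j4',   # physical quad-core ix machines
}

def make_flags_for(slavename):
    """Hypothetical helper: pick flags based on the slave name."""
    return MOZ_MAKE_FLAGS['ix'] if '-ix-' in slavename else MOZ_MAKE_FLAGS['vm']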
Comment on attachment 428221 [details] [diff] [review]
set proper MOZ_MAKE_FLAGS

changeset:   2102:338b8f2db996
Attachment #428221 - Flags: checked-in+
Current state of the Windows machines:
In production: 02-06, 18, 20-23, 25
In staging: 01, 07, 09, 10, 12-17, 24
Not up at all: 8, 11
Waiting for re-imaging: 19

Note that 01 is going to stay in staging permanently.
Depends on: 547799
08 is up now, and in staging
At this point, all Windows machines are in their final location:
01 - staging
02-16 - pm01
17-25 - pm02

Earlier, they seemed to be flapping between connected and disconnected, similar to when the firewall was dropping long-idle TCP connections. When I came back to them later they were all OK, but I don't doubt it will happen again. I'm going to ask Derek to double-check that all the firewalls/switches are set up correctly.

Also, these are suffering from the same intermittent issue that VMs 50-59 hit -- sometimes the Buildbot start-up script exits quickly without actually starting Buildbot, requiring manual intervention.
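If the firewall really is dropping long-idle connections again, one possible mitigation is a shorter application-level keepalive on the slaves. A sketch of a 0.7-era slave buildbot.tac follows; the path, master host, port, and password are illustrative placeholders, not our real setup.

from twisted.application import service
from buildbot.slave.bot import BuildSlave

basedir = r'E:\builds\slave'        # illustrative path
host = 'pm02'                       # illustrative master host
port = 9000                         # illustrative port
slavename = 'mw32-ix-slave14'
passwd = 'XXXXXXXX'                 # placeholder
keepalive = 600                     # ping the master every 10 minutes
usepty = False
umask = None

application = service.Application('buildslave')
s = BuildSlave(host, port, slavename, passwd, basedir, keepalive, usepty,
               umask=umask)
s.setServiceParent(application)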
I haven't seen any flapping at all today and I'm beginning to wonder if things were just rebooting or I was catching them right after they started buildbot, or something.
(In reply to comment #33)
> I haven't seen any flapping at all today and I'm beginning to wonder if things
> were just rebooting or I was catching them right after they started buildbot,
> or something.

Of course, now I see issues. There have been 3 test runs today on an IX machine that hit exceptions due to a spontaneous connection loss.

Digging through the logs I find messages like this for every slave that hits this:
twistd.log.33:2010-02-24 13:25:01-0800 [Broker,10153,10.250.49.187] duplicate slave mw32-ix-slave14 replacing old one
twistd.log.33:2010-02-24 13:25:01-0800 [Broker,10153,10.250.49.187] disconnecting old slave mw32-ix-slave14 now
twistd.log.33:2010-02-24 13:25:01-0800 [Broker,9997,10.250.49.187] BuildSlave.detached(mw32-ix-slave14)


This generally means one of two things: either one slave is starting two Buildbot processes, or two slaves are identifying themselves identically. I'm not sure yet which it is. It *could* explain all the "lost remote" messages we see in the log. More digging here is needed.
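To get a better handle on this, here's a rough scan one could run over the master's rotated twistd.log files for the duplicate-slave messages quoted above, counting hits per slave name (the glob pattern and message format are taken from the logs above; everything else is just a sketch):

import glob
import re
from collections import defaultdict

pattern = re.compile(r'duplicate slave (\S+) replacing old one')
counts = defaultdict(int)

for path in glob.glob('twistd.log*'):
    with open(path) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                counts[m.group(1)] += 1

for slave, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print('%s: %d duplicate connections' % (slave, n))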
Attachment #429617 - Flags: review?(nrthomas) → review+
Comment on attachment 429617 [details] [diff] [review]
Add linux ix slaves to production configs

changeset:   2122:aeb9169ad002
Attachment #429617 - Flags: checked-in+
Actually, we have 52, not 50, of these 1U machines. We've already set aside two other machines as "ref images", so let's just use these two as slaves.
Depends on: 549524
Summary: [Tracking bug] Add 50 physical builders into production pool-o-slaves → [Tracking bug] Add 52 physical builders into production pool-o-slaves
linux slaves 02,03,05-13 are connected to pm01 now.
Depends on: 550815
Depends on: 551950
(In reply to comment #39)
And they had some problems and got moved off again pretty quickly. Today the Linux machines are on pm01 and pm02. What split was used, Ben?
(In reply to comment #40)
> (In reply to comment #39)
> And they had some problems and got moved off again pretty quickly. Today the
> linux machines are on pm01 and pm02 now. What split was used Ben ?

I put 02 through 11 on pm01; 12 through 24 on pm02.
These slaves were fine overnight, besides one random disconnect, which doesn't seem to be specific to them (it also happened on a Linux VM).

Removing blocking on the win32 issue, since it's not specific to these machines.

We're done here!
Status: NEW → RESOLVED
Closed: 14 years ago
No longer depends on: 550815
Resolution: --- → FIXED
Depends on: 550815
fixing deps
No longer depends on: 550815
Product: mozilla.org → Release Engineering