Bug 545136: [Tracking bug] Add 52 physical builders into production pool-o-slaves
Status: RESOLVED FIXED (opened 14 years ago, closed 13 years ago)
Categories: Release Engineering :: General, defect
Tracking: Not tracked
People: Reporter: joduinn; Assigned: bhearsum
Whiteboard: [buildslaves]
Attachments (5 files, 2 obsolete files):
* 1.95 KB, patch (nthomas: review+, catlee: checked-in+)
* 11.00 KB, patch (nthomas: review+, catlee: checked-in+)
* 2.92 KB, patch (catlee: review+, bhearsum: checked-in+)
* 11.92 KB, patch (catlee: review+, bhearsum: checked-in+)
* 1.09 KB, patch (nthomas: review+, catlee: checked-in+)
Description
Buildbot configs go here. The setup in staging and the subsequent move to production should also be tracked here.
Reporter
Comment 1•14 years ago
We need to build with "-j1" on these multi-core physical machines, to avoid a broken build. Unclear if we should use different flags for VMs and physical, or else explicitly set "-j1" on all VMs and physical - with some possible time delays on VMs. We need to resolve this before we roll these machines into production.
Assignee
Comment 2•14 years ago

(In reply to comment #1)
> We need to build with "-j1" on these multi-core physical machines, to avoid a broken build. Unclear if we should use different flags for VMs and physical, or else explicitly set "-j1" on all VMs and physical - with some possible time delays on VMs.

We do not need to use a different -j flag for Linux, only for Windows.

> We need to resolve this before we roll these machines into production.

I'll test out -j1 on VMs and various -j options on the physical machines when they come into staging.
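For illustration only, here is a minimal sketch of how make flags could be keyed on slave class in a Python config. This is not the attached patch; the dict and helper names are hypothetical, and the -j values are just the ones being discussed above.

# Hypothetical sketch: pick MOZ_MAKE_FLAGS per slave class so VMs and
# physical IX machines can use different -j values. Not the real config.
MOZ_MAKE_FLAGS_BY_CLASS = {
    'vm': '-j1',   # conservative setting for VMs
    'ix': '-j4',   # quad-core physical IX machines
}

def moz_make_flags(slavename):
    """Return make flags for a slave, treating names containing '-ix-'
    (e.g. 'linux-ix-slave01') as physical machines and the rest as VMs."""
    slave_class = 'ix' if '-ix-' in slavename else 'vm'
    return MOZ_MAKE_FLAGS_BY_CLASS[slave_class]

With a mapping like this, moz_make_flags('linux-ix-slave01') would return '-j4' while a VM name would fall back to '-j1'.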
Reporter
Comment 3•14 years ago

For the curious, physical spec of machines: 1U Mercury Rackmount Server
* 2 Fixed SATA Drive Bay
* 260W Power Supply
* Single Socket Server Board with Dual GigE NICs
* VGA
* Integrated IPMI Remote Management w/dedicated LAN
* Intel X3430 2.4GHz Quad Core, 8MB Cache
* 2 x 2GB DDR3 1066 ECC Unbuffered DIMM (4GB Total)
* 1 x Seagate 250GB SATA Desktop Drive, 7200RPM/8MB Cache
OS: Mac OS X → All
Reporter
Comment 4•14 years ago
Pushing to bhearsum, as he's doing all the work here already.
Assignee: nobody → bhearsum
Comment 5•14 years ago

Attachment #426353 - Flags: review?(nrthomas)
Comment 6•14 years ago

Attachment #426354 - Flags: review?(nrthomas)
Comment 7•14 years ago

Comment on attachment 426353 [details] [diff] [review]
pick fast slaves for builds, slow slaves for tests

Looks good, r+

Attachment #426353 - Flags: review?(nrthomas) → review+
Comment 8•14 years ago

Comment on attachment 426354 [details] [diff] [review]
set fast slave regexes on production/staging

Nice one!

Attachment #426354 - Flags: review?(nrthomas) → review+
Comment 9•14 years ago

Comment on attachment 426353 [details] [diff] [review]
pick fast slaves for builds, slow slaves for tests

>+def _nextSlowSlave(builder, available_slaves):
>+def _nextFastSlave(builder, available_slaves):

Oh hmm, perhaps these should fall back to available_slaves if there's a problem with _partitionSlaves?
Comment 10•14 years ago

This adds exception handling and prioritizes slaves that most recently completed a build on the builder that wants to start a new build. This should give better depend build behaviour.

Attachment #426353 - Attachment is obsolete: true
Attachment #426372 - Flags: review?(nrthomas)
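For readers without the attachment, here is a rough sketch of the fallback behaviour described above, assuming the old-Buildbot nextSlave interface where available_slaves are SlaveBuilder objects exposing slave.slavename. The _partitionSlaves below is just a stand-in, and the real patch also prefers the slave that most recently built on this builder, which this sketch omits.

import random
from twisted.python import log

def _partitionSlaves(available_slaves):
    # Stand-in partition: treat IX machines as 'fast' and VMs as 'slow'.
    fast = [s for s in available_slaves if '-ix-' in s.slave.slavename]
    slow = [s for s in available_slaves if '-ix-' not in s.slave.slavename]
    return fast, slow

def _nextFastSlave(builder, available_slaves):
    if not available_slaves:
        return None
    try:
        fast, slow = _partitionSlaves(available_slaves)
        candidates = fast or slow
        if not candidates:
            # Include the builder name so the log line isn't ambiguous
            # (per the review comment below).
            log.msg("No fast or slow slaves found for %s, choosing randomly instead"
                    % builder.name)
            return random.choice(available_slaves)
        return random.choice(candidates)
    except Exception:
        # Any problem partitioning the slaves falls back to a random pick,
        # so a bad helper never blocks builds from starting.
        log.msg("Error partitioning slaves for %s, choosing randomly instead"
                % builder.name)
        return random.choice(available_slaves)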
Comment 11•14 years ago

Comment on attachment 426372 [details] [diff] [review]
pick fast slaves for builds, slow slaves for tests v2

>diff --git a/misc.py b/misc.py
>+ log.msg("No fast or slow slaves found, choosing randomly instead")

I'd suggest adding the builder name here too, just to make the logs less ambiguous. r+ if this works out on staging.

Attachment #426372 - Flags: review?(nrthomas) → review+
Comment 12•14 years ago
linux-ix-slave01 is having problems connecting to clobberer.
Comment 13•14 years ago

(In reply to comment #12)
> linux-ix-slave01 is having problems connecting to clobberer.

We can fix that in bug 545567; it's the same subnet.
Depends on: 545567
Comment 18•14 years ago

Comment on attachment 426354 [details] [diff] [review]
set fast slave regexes on production/staging

changeset: 2082:ffa8f302bdc3

Attachment #426354 - Flags: checked-in+
Comment 19•14 years ago

Comment on attachment 426372 [details] [diff] [review]
pick fast slaves for builds, slow slaves for tests v2

changeset: 610:eee27559b198

Attachment #426372 - Flags: checked-in+
Assignee
Comment 20•14 years ago

Here's the latest on these machines, as best I know:
* We got all of the currently powered on ones into staging
* Some of the Linux ones are hung at a 'GRUB' screen, refusing to boot
* The Windows ones are having trouble either staying connected to Buildbot, or connecting to it in the first place.
* https://bugzilla.mozilla.org/show_bug.cgi?id=484799 is breaking the Windows 'make package' step. When the tree is in good enough shape, we'll be backing it out from mozilla-central to fix that.

So, we need to resolve these issues and then continue running them in staging, making sure they build ok.
Assignee
Comment 21•14 years ago

(In reply to comment #20)
> Here's the latest on these machines, as best I know:
> * We got all of the currently powered on ones into staging
> * Some of the Linux ones are hung at a 'GRUB' screen, refusing to boot

Catlee is looking into this.

> * The Windows ones are having trouble either staying connected to Buildbot, or connecting to it in the first place.

Turns out the Linux ones are too, and dmoore knows how to fix it. bug 546731 has the details.

> * https://bugzilla.mozilla.org/show_bug.cgi?id=484799 is breaking the Windows 'make package' step. When the tree is in good enough shape, we'll be backing it out from mozilla-central to fix that.

Still need to back this out, looking to do it tomorrow.

I looked over all the results, too, and other than some known rando-orange tests and failed builds for reasons like somebody shutting the master down, they're green. Windows builds have worked fine with -j4, so we should just leave that on.

So, once the stability issues are resolved on the Linux machines, they can move to production. Same thing for the Windows machines, but we need to back out that m-c patch first, too.
Assignee
Comment 22•14 years ago
Note that I've kept the l10n slaves on VMs, because they don't need as much processing power, and it lets us be more flexible with the IX machine balance between masters.
Attachment #427590 - Flags: review?(catlee)
Comment 23•14 years ago

Comment on attachment 427590 [details] [diff] [review]
add ix machines to production configs

> SLAVES = {
>-    'linux': ['moz2-linux-slave%02i' % x for x in [1,2] +
>-        range(5,17) + range(18,51)],
>+    'linux': LINUX_VMS + LINUX_IXS,
>     'linux64': ['moz2-linux64-slave%02i' % x for x in range(1,13)],
>-    'win32': ['win32-slave%02i' % x for x in [1,2] + range(5,21) +
>-        range(22,60)],
>+    'win32': WIN32_VMS + LINUX_IXS,

This should be WIN32_IXS.

> ACTIVE_BRANCHES = ['mozilla-central', 'mozilla-1.9.2', 'places']
> L10N_SLAVES = {
>-    'linux': SLAVES['linux'][:8],
>-    'win32': SLAVES['win32'][:8],
>+    'linux': LINUX_VMS[:8],
>+    'win32': WIN32_VMS[:8],
>     'macosx': MAC_MINIS[:6] + XSERVES[:2],
> }

>-    'linux': SLAVES['linux'][-8:],
>-    'win32': SLAVES['win32'][-8:],
>+    'linux': LINUX_VMS[:8],
>+    'win32': WIN32_VMS[:8],
>     'macosx': MAC_MINIS[-6:] + XSERVES[-2:],

I think the second set of these should use *_VMS[-8:] instead of *_VMS[:8].

Attachment #427590 - Flags: review?(catlee) → review-
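Putting the two review comments together, the corrected config would look roughly like the sketch below (Python 2, as in buildbot-configs). The VM list comprehensions come from the diff above; the IX list boundaries and the L10N_SLAVES_A/B names are made up for illustration.

LINUX_VMS = ['moz2-linux-slave%02i' % x for x in [1, 2] + range(5, 17) + range(18, 51)]
WIN32_VMS = ['win32-slave%02i' % x for x in [1, 2] + range(5, 21) + range(22, 60)]
LINUX_IXS = ['linux-ix-slave%02i' % x for x in range(2, 25)]   # hypothetical range
WIN32_IXS = ['mw32-ix-slave%02i' % x for x in range(2, 26)]    # hypothetical range

SLAVES = {
    'linux': LINUX_VMS + LINUX_IXS,
    'win32': WIN32_VMS + WIN32_IXS,   # WIN32_IXS, not LINUX_IXS
}

# First l10n pool takes slaves from the front of the VM lists...
L10N_SLAVES_A = {
    'linux': LINUX_VMS[:8],
    'win32': WIN32_VMS[:8],
}
# ...and the second pool takes them from the end, per the review.
L10N_SLAVES_B = {
    'linux': LINUX_VMS[-8:],
    'win32': WIN32_VMS[-8:],
}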
Assignee
Comment 24•14 years ago

Attachment #427590 - Attachment is obsolete: true
Attachment #427757 - Flags: review?(catlee)
Updated•14 years ago

Attachment #427757 - Flags: review?(catlee) → review+
Assignee
Comment 25•14 years ago

(In reply to comment #21)
> (In reply to comment #20)
> > Here's the latest on these machines, as best I know:
> > * We got all of the currently powered on ones into staging
> > * Some of the Linux ones are hung at a 'GRUB' screen, refusing to boot
>
> Catlee is looking into this

Still having issues here.

> > * The Windows ones are having trouble either staying connected to Buildbot, or connecting to it in the first place.
>
> Turns out the Linux ones are too, and dmoore knows how to fix it. bug 546731 has the details.

This problem has been fixed.

> > * https://bugzilla.mozilla.org/show_bug.cgi?id=484799 is breaking the Windows 'make package' step. When the tree is in good enough shape, we'll be backing it out from mozilla-central to fix that.
>
> Still need to back this out, looking to do it tomorrow.

I just landed the backout. Once it has cleared, I think we're ready to put the Windows machines into production. However, I'd rather wait until Monday to do so, since there will be more visibility and eyes on them at that time.
Assignee
Comment 26•14 years ago

Comment on attachment 427757 [details] [diff] [review]
add ix machines to production configs, v2

changeset: 2099:296d434d0eca

I pushed all but the LINUX_IXS parts of this.

Attachment #427757 - Flags: checked-in+
Assignee
Comment 27•14 years ago
This is the same thing as the staging patch, but with an adjusted path for production.
Attachment #428221 - Flags: review?(catlee)
Assignee
Updated•14 years ago
Whiteboard: [buildslaves]
Comment 28•14 years ago

Comment on attachment 428221 [details] [diff] [review]
set proper MOZ_MAKE_FLAGS

Looks good. You missed removing the MOZ_MAKE_FLAGS line from the tracemonkey xulrunner configs.

Attachment #428221 - Flags: review?(catlee) → review+
Assignee
Comment 29•14 years ago

Comment on attachment 428221 [details] [diff] [review]
set proper MOZ_MAKE_FLAGS

changeset: 2102:338b8f2db996

Attachment #428221 - Flags: checked-in+
Assignee
Comment 30•14 years ago

Current state of the Windows machines:
* In production: 02-06, 18, 20-23, 25
* In staging: 01, 07, 09, 10, 12-17, 24
* Not up at all: 08, 11
* Waiting for re-imaging: 19

Note that 01 is going to stay in staging permanently.
Assignee
Comment 31•14 years ago

08 is up now, and in staging.
Assignee
Comment 32•14 years ago

At this point, all Windows machines are in their final location:
* 01 - staging
* 02-16 - pm01
* 17-25 - pm02

Earlier, they seemed to be flapping between connected and disconnected, similar to before when the firewall was dropping long-idle TCP connections, but when I came back to it later they were all OK. I don't doubt it will happen again, though, so I'm going to ask Derek to double check that all the firewalls/switches are set up correctly. Also, these are suffering from the same intermittent issue that VMs 50-59 hit: sometimes the Buildbot start-up script exits quickly without actually starting Buildbot, requiring manual intervention.
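On the start-up problem, a hypothetical guard like the following (not something deployed in this bug) illustrates the kind of retry that would cover the script exiting without starting Buildbot: re-run 'buildbot start' until a twistd.pid appears. The function name and timings are made up.

import os
import subprocess
import time

def start_buildbot(basedir, attempts=3):
    """Retry 'buildbot start' until twistd.pid shows up, since the stock
    start-up script sometimes exits without actually starting Buildbot."""
    pidfile = os.path.join(basedir, 'twistd.pid')
    for attempt in range(1, attempts + 1):
        subprocess.call(['buildbot', 'start', basedir])
        time.sleep(15)  # give twistd time to daemonize and write its pid
        if os.path.exists(pidfile):
            return True
        print('buildbot did not start (attempt %d of %d)' % (attempt, attempts))
    return False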
Assignee
Comment 33•14 years ago
I haven't seen any flapping at all today and I'm beginning to wonder if things were just rebooting or I was catching them right after they started buildbot, or something.
Assignee
Comment 34•14 years ago

(In reply to comment #33)
> I haven't seen any flapping at all today and I'm beginning to wonder if things were just rebooting or I was catching them right after they started buildbot, or something.

Of course, now I see issues. There have been 3 test runs today on an IX machine that hit exceptions due to a spontaneous connection loss. Digging through the logs, I find messages like this for every slave that hits this:

twistd.log.33:2010-02-24 13:25:01-0800 [Broker,10153,10.250.49.187] duplicate slave mw32-ix-slave14 replacing old one
twistd.log.33:2010-02-24 13:25:01-0800 [Broker,10153,10.250.49.187] disconnecting old slave mw32-ix-slave14 now
twistd.log.33:2010-02-24 13:25:01-0800 [Broker,9997,10.250.49.187] BuildSlave.detached(mw32-ix-slave14)

This generally means one of two things: either one slave is starting two Buildbot processes, or two slaves are identifying themselves identically. I'm not sure yet which is the case. It *could* explain all the "lost remote" messages we see in the log. More digging is needed here.
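A throwaway triage script along these lines (not something from this bug) can count the "duplicate slave" messages per slave name across the master's rotated twistd logs, which shows which slaves are affected and how often:

import glob
import re
from collections import Counter

dup_re = re.compile(r'duplicate slave (\S+) replacing old one')
counts = Counter()

# Scan every rotated twistd log in the master's basedir.
for logfile in glob.glob('twistd.log*'):
    with open(logfile) as f:
        for line in f:
            m = dup_re.search(line)
            if m:
                counts[m.group(1)] += 1

for slavename, n in counts.most_common():
    print('%s: %d duplicate connections' % (slavename, n))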
Comment 36•13 years ago

Attachment #429617 - Flags: review?(nrthomas)
Updated•13 years ago

Attachment #429617 - Flags: review?(nrthomas) → review+
Comment 37•13 years ago

Comment on attachment 429617 [details] [diff] [review]
Add linux ix slaves to production configs

changeset: 2122:aeb9169ad002

Attachment #429617 - Flags: checked-in+
Reporter
Comment 38•13 years ago

Actually, we have 52, not 50, of these 1U machines. We've already set aside two other machines as "ref images", so let's just use these two as slaves.
Depends on: 549524
Summary: [Tracking bug] Add 50 physical builders into production pool-o-slaves → [Tracking bug] Add 52 physical builders into production pool-o-slaves
Comment 39•13 years ago

Linux slaves 02, 03, 05-13 are connected to pm01 now.
Comment 40•13 years ago

(In reply to comment #39)
And they had some problems and got moved off again pretty quickly. Today the linux machines are on pm01 and pm02 now. What split was used, Ben?
Assignee
Comment 41•13 years ago

(In reply to comment #40)
> (In reply to comment #39)
> And they had some problems and got moved off again pretty quickly. Today the linux machines are on pm01 and pm02 now. What split was used, Ben?

I put 02 through 11 on pm01, and 12-24 on pm02.
Assignee
Comment 42•13 years ago
These slaves were fine overnight, besides one random disconnect, which doesn't seem to be specific to them (it also happened on a Linux VM). Removing blocking on the win32 issue, since it's not specific to these machines. We're done here!
Updated•10 years ago
Product: mozilla.org → Release Engineering