Closed Bug 508312 Opened 15 years ago Closed 15 years ago

separate production-master/moz2-master into two masters

Categories

(Release Engineering :: General, defect)

Platform: x86
OS: macOS
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

References

Details

Attachments

(7 files, 9 obsolete files)

1.90 KB, text/plain
Details
185.60 KB, application/octet-stream
Details
193.31 KB, application/octet-stream
Details
27.92 KB, patch
catlee
: review+
nthomas
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
149.35 KB, patch
nthomas
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
1.30 KB, patch
catlee
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
1.13 KB, patch
catlee
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
The moz2-master instance on production-master has been slowing down more and more over time and as we throw more and more work at it we're inching closer to pushing it beyond its limit. To help this situation we should split it in two. This means at least the following:
* Clone a second production-master VM.
* Set up buildbot-configs/mozilla2/ to have two masters, which split the work between them.
* Once the new master is up, move some slaves over to it.

I think the most logical way to split the jobs is as follows:
* mozilla-central, mozilla-1.9.1, and mozilla-1.9.2 branches will stay on the existing production-master
* places, tracemonkey, electrolysis, and mobile builds will move to the new one
* release builders will exist on both, giving us the ability to do two releases at once

I think we should shoot to get this done before we branch for mozilla-1.9.2.
Depends on: 508313
I'm currently testing this in staging, probably needs a bit of tweaking.
Kairo, you'll probably care about this, since it's going to revamp our buildbot configs a bit.
I've got the two new staging masters running today. While they've been churning through builds I did some verifications with the aid of the Buildbot manhole module and python-deep for comparison purposes. I'll attach some pickles and a script shortly, in case anyone wants to double check my work.

In short, the currently running master has exactly the same builders, change sources, and schedulers as the two new-style staging masters combined. I didn't compare Status modules because they gave me a lot of trouble when I was trying to dump them out, and I didn't include any really basic master.cfg parameters. All of the excluded things are pretty easy to verify manually, however.

For the other things, I used the Buildbot Manhole to get dumps of all of their important attributes, and pickled those to a file. For the new style masters, I pickled one, switched configurations, loaded the pickle, updated the data, and re-pickled it.

I then used the python-deep module to compare the two data structures. They were exactly the same except in the following:
* ['schedulers']['l10n']['builders'] differed ('fennec10x' and 'fx36x' missing from the new-style one) - this is because the contents of 'builders' were overwritten when the second master was added to the data structure.
* I couldn't compare ['schedulers']['l10n']['treeprops'] with python-deep because it contained an instance of defaultdict - when compared manually both versions were the same
* ['schedulers']['l10n']['inipath'] differed - this is because the new-style setup has m-c l10n on one master and 1.9.1 l10n on the other, and thus they needed separate configuration files

After eliminating those differences, python-deep showed the objects to be exactly the same.

One caveat is that WithProperties objects were pickled as the class, and not the string which they contained - so there is a small possibility that there could be undetected differences there. Given that all "basic" types were equal, I don't think this is the case.

All of that is a long way of saying "these patches are pretty damn safe".
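
For anyone who wants to reproduce the comparison without the python-deep module, here's a rough sketch of the idea - a hand-rolled recursive diff over the two pickled dumps. The file names are made up; the real script and pickles are attached to this bug.

    import pickle

    def deep_diff(a, b, path=''):
        """Print every path at which the two dumped data structures differ."""
        if isinstance(a, dict) and isinstance(b, dict):
            for key in set(a) | set(b):
                if key not in a or key not in b:
                    print '%s[%r] present in only one dump' % (path, key)
                else:
                    deep_diff(a[key], b[key], '%s[%r]' % (path, key))
        elif isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
            if len(a) != len(b):
                print '%s differs in length (%d vs %d)' % (path, len(a), len(b))
            for i, (x, y) in enumerate(zip(a, b)):
                deep_diff(x, y, '%s[%d]' % (path, i))
        elif a != b:
            print '%s: %r != %r' % (path, a, b)

    # old single master vs. the two new-style masters' data merged together
    old = pickle.load(open('old-master.pck', 'rb'))
    new = pickle.load(open('new-masters-combined.pck', 'rb'))
    deep_diff(old, new)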
Attached patch generateBranchConfig updates (obsolete) — Splinter Review
This copy of generateBranchConfig is mostly the same as the one I showed you yesterday, but it also deals with some differences we had between staging/production, namely:
* Tinderbox tree hardcoding for l10n, xulrunner, weekly builds - these have been moved to config.py
* Tinderbox tree assumptions for packaged unittest builders - this has been moved to config.py, too
* comment/formatting differences
* difference in number of packaged builds to trigger with the NightlyRebuild scheduler (hardcoded to 2 now, to match the existing production version)
* different TALOS_MASTERS/talosMasters handling for staging/production - synced up to the staging version because it handles defaults
Attachment #392715 - Attachment is obsolete: true
Attachment #392990 - Flags: review?(nthomas)
Attachment #392990 - Flags: review?(catlee)
I looked through pushlog for all of the things we poll and got the following data:
(repo)              (pushes in the last month)
mozilla-central:      538
tracemonkey:          183
mozilla-1.9.1:         98
electrolysis:          72
places:                39
mobile-browser:        37
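
As an aside, counts like these can be pulled out of pushlog automatically - a rough sketch below; the json-pushes endpoint, its date parameters, and the repo paths are assumptions rather than exactly what I used.

    import json, urllib

    # repo paths are assumptions about where these branches live on hg.m.o
    repos = ['mozilla-central', 'tracemonkey', 'releases/mozilla-1.9.1',
             'projects/electrolysis', 'projects/places', 'mobile-browser']
    for repo in repos:
        url = ('http://hg.mozilla.org/%s/json-pushes'
               '?startdate=2009-07-06&enddate=2009-08-06' % repo)
        pushes = json.loads(urllib.urlopen(url).read())
        print '%-25s %d pushes' % (repo, len(pushes))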

Based on those push counts, I've split the jobs as follows:
master1:
* mozilla-central (+l10n)
* places

master2:
* tracemonkey
* mozilla-1.9.1 (+l10n)
* electrolysis
* mobile (all branches, +l10n)

This is done on the assumption that the 1.9.2 branch (+l10n) will go on master1 when landed. While this doesn't get us an even distribution of jobs per master, it splits up the number of builders pretty well, and we know that more builders == more boxes on the waterfall == more slow.

I'm totally open to splitting things differently if someone has a better idea.

Other notes:
* Because the mozilla-central based mobile builds are on a different master than the other m-c builds, I had to add an HgPoller for that repository in master2.cfg (see the sketch below). This should go away once we have out-of-process polling.
* lowercase everything in BRANCH_LEVEL_VARS from the start, rather than calling lower() a bunch of times.
* addition of many *_tinderbox_tree vars to accommodate staging/production differences
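
Roughly, the per-master files end up looking like this. This is a sketch only - the branch list and the HgPoller arguments (hgURL/branch/pollInterval) are my best guess at the shape of the patch, not copied from it.

    # master2.cfg (sketch; HgPoller imported the same way as in the real configs)
    ACTIVE_BRANCHES = ['tracemonkey', 'mozilla-1.9.1', 'electrolysis', 'mobile']

    # shared builders/schedulers/status for the branches active on this master
    execfile('master-main.cfg')

    # mobile builds run off mozilla-central on this master, while
    # mozilla-central proper lives on master1, so poll m-c here as well
    c['change_source'].append(HgPoller(
        hgURL='http://hg.mozilla.org/',
        branch='mozilla-central',
        pollInterval=60,
    ))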
Attachment #392716 - Attachment is obsolete: true
Attachment #392992 - Flags: review?(nthomas)
Attachment #392992 - Flags: review?(catlee)
Comment on attachment 392990 [details] [diff] [review]
generateBranchConfig updates

>diff -r 2f8dcfefe2ab misc.py
>--- a/misc.py	Tue Aug 04 20:33:26 2009 +0200
>+++ b/misc.py	Thu Aug 06 15:11:25 2009 -0400
>@@ -1,5 +1,31 @@
> from urlparse import urljoin
> 
>+import buildbot.status.tinderbox
>+reload(buildbot.status.tinderbox)

No need to reload this.

>+def generateBranchObjects(config, name):

Some comments or a docstring here would be nice, explaining the basic format of
`config`, and what `name` is.

>+    branchObjects['status'].append(TinderboxMailNotifier(
>+        fromaddr="mozilla2.buildbot@build.mozilla.org",
>+        tree=config['tinderbox_tree'],
>+        extraRecipients=["tinderbox-daemon@tinderbox.mozilla.org"],
>+        relayhost="mail.build.mozilla.org",
>+        builders=builders + nightlyBuilders,
>+        logCompression="bzip2"
>+    ))

Did we want to be using relayhost='localhost' these days?

>+                 env= xr_env,

Extra space here.

All in all, it looks great!  Splitting this out from buildbot-configs will make our lives much better.
Attachment #392990 - Flags: review?(catlee) → review-
(In reply to comment #10)
> (From update of attachment 392990 [details] [diff] [review])
> >diff -r 2f8dcfefe2ab misc.py
> >--- a/misc.py	Tue Aug 04 20:33:26 2009 +0200
> >+++ b/misc.py	Thu Aug 06 15:11:25 2009 -0400
> >@@ -1,5 +1,31 @@
> > from urlparse import urljoin
> > 
> >+import buildbot.status.tinderbox
> >+reload(buildbot.status.tinderbox)
> 
> No need to reload this.
> 

Removed.

> >+def generateBranchObjects(config, name):
> 
> Some comments or a docstring here would be nice, explaining the basic format of
> `config`, and what `name` is.
> 

Yeah, good point. Added.
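
Roughly along these lines - a sketch of the shape of it, not the exact wording that landed:

    def generateBranchObjects(config, name):
        """Return a dict with 'builders', 'change_source', 'schedulers' and
        'status' lists for a single branch.

        config - dict of branch-level settings (platforms, hg/tinderbox info,
                 l10n and unittest settings, etc.), i.e. one entry of BRANCHES
        name   - the short branch name, e.g. 'mozilla-central'
        """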

> >+    branchObjects['status'].append(TinderboxMailNotifier(
> >+        fromaddr="mozilla2.buildbot@build.mozilla.org",
> >+        tree=config['tinderbox_tree'],
> >+        extraRecipients=["tinderbox-daemon@tinderbox.mozilla.org"],
> >+        relayhost="mail.build.mozilla.org",
> >+        builders=builders + nightlyBuilders,
> >+        logCompression="bzip2"
> >+    ))
> 
> Did we want to be using relayhost='localhost' these days?
> 

Why would we? In any case, let's deal with that separately if we want to change it.

> >+                 env= xr_env,
> 
> Extra space here.

Fixed.
Attachment #392990 - Attachment is obsolete: true
Attachment #393214 - Flags: review?(nthomas)
Attachment #393214 - Flags: review?(catlee)
Attachment #392990 - Flags: review?(nthomas)
(In reply to comment #11)
> Created an attachment (id=393214) [details]
> generateBranchConfig - catlee's comments addressed
> > Did we want to be using relayhost='localhost' these days?
> Why would we? In any case, let's deal with that separately if we want to 
> change it.

That came out of tinderbox mail getting dropped the last time dm-mail01/02 went down (buildbot doesn't retry). Using the local sendmail as a buffer, we can deliver it later.
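
For the record, the change would just be the relayhost argument in the notifier quoted earlier - something like this (a sketch, not a reviewed patch; all other arguments are copied from the hunk above):

    branchObjects['status'].append(TinderboxMailNotifier(
        fromaddr="mozilla2.buildbot@build.mozilla.org",
        tree=config['tinderbox_tree'],
        extraRecipients=["tinderbox-daemon@tinderbox.mozilla.org"],
        # hand mail to the local sendmail so it gets queued and retried if
        # dm-mail01/02 are down, instead of buildbot dropping it on the floor
        relayhost="localhost",
        builders=builders + nightlyBuilders,
        logCompression="bzip2"
    ))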
Comment on attachment 392992 [details] [diff] [review]
buildbot-configs changes for multiple master support

r+ with assorted comments, suggestions, and the odd question.

>diff -r 5c47b32a4364 mozilla2-staging/config.py
>+    # List of unittest masters to notify of new builds to test,
>+    # and if a failure to notify the master should result in a warning
>+    # TODO: Do we need to differentiate between both staging masters somehow?
>+    'unittest_masters': [('localhost:9010', True)],

Interesting, s-m2 will submit unittest jobs to s-m. Kinda the consequence of having one VM vs two.

>+# We copy the global vars in first, so branches can override them if they want to
>+for branch in BRANCHES.keys():
>+    for attr in BRANCH_LEVEL_VARS.keys():
>+        BRANCHES[branch][attr] = deepcopy(BRANCH_LEVEL_VARS[attr])

Using deepcopy here but not for prod ?

>diff -r 5c47b32a4364 mozilla2-staging/l10nbuilds1.ini

l10nbuilds.ini can be removed then ? Same for production.

>diff -r 5c47b32a4364 mozilla2-staging/l10nbuilds2.ini
>+builders = Maemo\ mozilla-central\ l10n Linux\ Desktop\ Fennec\ mozilla-central\ l10n

Should be Linux\ Fennec\ Desktop, an existing bug.

>diff -r 5c47b32a4364 mozilla2-staging/release_master.py
>     repack_factory = ReleaseRepackFactory(

Missed setting 
  'slavenames': branchConfig['l10n_slaves'][platform]
when using the factory (so if staging ever gets more than 8 slaves we're gonna be ... slightly inconvenienced).

>diff -r 5c47b32a4364 mozilla2/config.py
>+    # for nss/nspr
>+    'cvsroot': ':ext:stgbld@cvs.mozilla.org:/cvsroot',

The comment appears here but not in the staging copy. Out of date comment now that we don't pull nss/nspr from CVS ?

>+    # List of unittest masters to notify of new builds to test,
>+    # and if a failure to notify the master should result in a warning
>+    'l10n_slaves': {
>+        'linux': SLAVES['linux'][:8],
>+        'win32': SLAVES['win32'][:8],
>+        'macosx': SLAVES['macosx'][:8],
>+    },

What's the plan to make sure we have enough l10n slaves on each branch ? I'm sure we discussed this but can't recall the result. I'd also move the l10n_slaves definition away from the unittest masters comment block, maybe down to the end of the branch vars next to the tbox tree.

>diff -r 5c47b32a4364 mozilla2/master-main.cfg
>+# the 'buildbotURL' string should point to the location where the buildbot's
...

Nit: This comment block kinda dangles here at the bottom of the file. I won't be sad if it meets Mr Ketch at the gallows for a swift drop.

>diff -r 5c47b32a4364 mozilla2/master1.cfg

Don't see a removal for master.cfg anywhere, same for staging.

>diff -r 5c47b32a4364 mozilla2/master2.cfg
>+# We need to add an HgPoller for mozilla-central because we're running
>+# mobile builds off mozilla-central in this master instance, but
>+# running mozilla-central proper in the other.

Do you prefer to put this here, rather than the bottom of mobile_master.py with the mobile-browser HgPoller ?
Attachment #392992 - Flags: review?(nthomas) → review+
Attachment #393214 - Flags: review?(nthomas) → review+
Comment on attachment 393214 [details] [diff] [review]
generateBranchConfig - catlee's comments addressed

r+ based on a quick read, since this is really validated by your testing. Just one very very urgent issue you should fix ...

>diff -r 417d1b245bad misc.py
>+    for platform in config['platforms'].keys():
>+        base_name =  config['platforms'][platform]['base_name']

Nit: extra space after the =
(In reply to comment #13)
> (From update of attachment 392992 [details] [diff] [review])
> >diff -r 5c47b32a4364 mozilla2-staging/config.py
> >+    # List of unittest masters to notify of new builds to test,
> >+    # and if a failure to notify the master should result in a warning
> >+    # TODO: Do we need to differentiate between both staging masters somehow?
> >+    'unittest_masters': [('localhost:9010', True)],
> 
> Interesting, s-m2 will submit unittest jobs to s-m. Kinda the consequence of
> having one VM vs two.
> 

This is a good point, and an oversight. I'm going to add in both of the instances - otherwise we're never going to get packaged unittest runs on the second instance.

> >+# We copy the global vars in first, so branches can override them if they want to
> >+for branch in BRANCHES.keys():
> >+    for attr in BRANCH_LEVEL_VARS.keys():
> >+        BRANCHES[branch][attr] = deepcopy(BRANCH_LEVEL_VARS[attr])
> 
> Using deepcopy here but not for prod ?
> 

fixed.
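
For anyone wondering why the deepcopy matters here: without it, every branch ends up pointing at the same nested objects, so mutating one branch's settings silently changes all of them. A toy illustration, not code from the patch (the 'localhost:9012' entry is made up):

    BRANCH_LEVEL_VARS = {'unittest_masters': [('localhost:9010', True)]}
    BRANCHES = {'mozilla-central': {}, 'places': {}}

    for branch in BRANCHES.keys():
        for attr in BRANCH_LEVEL_VARS.keys():
            # plain assignment: both branches now share one list object;
            # deepcopy(BRANCH_LEVEL_VARS[attr]) is what avoids this
            BRANCHES[branch][attr] = BRANCH_LEVEL_VARS[attr]

    # a made-up per-branch override then leaks into the other branch:
    BRANCHES['places']['unittest_masters'].append(('localhost:9012', False))
    print BRANCHES['mozilla-central']['unittest_masters']
    # -> [('localhost:9010', True), ('localhost:9012', False)]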

> >diff -r 5c47b32a4364 mozilla2-staging/l10nbuilds2.ini
> >+builders = Maemo\ mozilla-central\ l10n Linux\ Desktop\ Fennec\ mozilla-central\ l10n
> 
> Should be Linux\ Fennec\ Desktop, an existing bug.
>

I'm going to leave this one to be solved as a separate bug, then.


> >diff -r 5c47b32a4364 mozilla2-staging/release_master.py
> >     repack_factory = ReleaseRepackFactory(
> 
> Missed setting 
>   'slavenames': branchConfig['l10n_slaves'][platform]
> when using the factory (so if staging ever gets more than 8 slaves we're gonna
> be ... slightly inconvenienced).
> 

Fixed.

> >diff -r 5c47b32a4364 mozilla2/config.py
> >+    # for nss/nspr
> >+    'cvsroot': ':ext:stgbld@cvs.mozilla.org:/cvsroot',
> 
> The comment appears here but not in the staging copy. Out of date comment now
> that we don't pull nss/nspr from CVS ?
> 

The only place we use CVS in these masters is for releases - and we have that set in release*.py - I've dropped these entries entirely.

> >+    # List of unittest masters to notify of new builds to test,
> >+    # and if a failure to notify the master should result in a warning
> >+    'l10n_slaves': {
> >+        'linux': SLAVES['linux'][:8],
> >+        'win32': SLAVES['win32'][:8],
> >+        'macosx': SLAVES['macosx'][:8],
> >+    },
> 
> What's the plan to make sure we have enough l10n slaves on each branch ? I'm
> sure we discussed this but can't recall the result. I'd also move the
> l10n_slaves definition away from the unittest masters comment block, maybe down
> to the end of the branch vars next to the tbox tree.
> 

IIRC we decided that we should have "static" slaves for both masters and "dynamic" ones - which we can move about. I'm thinking that the first 8 should be static for production-master and the last 8 should be static for production-master02. This would mean the l10n slaves would change whenever we add more slaves for a platform but that shouldn't be a problem, right?

Implementation of this was surprisingly difficult. I tried a few approaches and ran into various dependency problems. Eg, config.py can't define l10n_slaves because we can't have it importing master1/master2.cfg. We also couldn't have master1/master2.cfg just defining them for ACTIVE_BRANCHES because you should be able to do a release on any branch on any master, not just an active one.

I settled on defining L10N_SLAVES in master1/master2.cfg before exec'ing master-main.cfg - and having master-main.cfg add them to every branch.
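
Roughly, it looks like this - a sketch of the approach, not the actual patch, assuming SLAVES and BRANCHES are available the same way they are in config.py, and using the "first 8 / last 8" split described above:

    # master1.cfg (sketch) - define the static l10n slaves, then pull in
    # the shared config
    L10N_SLAVES = {
        'linux': SLAVES['linux'][:8],
        'win32': SLAVES['win32'][:8],
        'macosx': SLAVES['macosx'][:8],
    }
    execfile('master-main.cfg')

    # master2.cfg would do the same with the last 8, e.g. SLAVES['linux'][-8:]

    # master-main.cfg (sketch) - attach them to every branch, active or not,
    # so a release can run on any branch on either master
    for branch in BRANCHES.keys():
        BRANCHES[branch]['l10n_slaves'] = L10N_SLAVES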

> >diff -r 5c47b32a4364 mozilla2/master-main.cfg
> >+# the 'buildbotURL' string should point to the location where the buildbot's
> ...
> 
> Nit: This comment block kinda dangles here at the bottom of the file. I won't
> be sad if it meets Mr Ketch at the gallows for a swift drop.
> 

Executed.

> >diff -r 5c47b32a4364 mozilla2/master2.cfg
> >+# We need to add an HgPoller for mozilla-central because we're running
> >+# mobile builds off mozilla-central in this master instance, but
> >+# running mozilla-central proper in the other.
> 
> Do you prefer to put this here, rather than the bottom of mobile_master.py with
> the mobile-browser HgPoller ?

Hmmmm. I think I'd rather keep it here - if we were to move mobile to the other master it seems pretty easy to forget to update the HgPoller if it's hiding off in mobile_master.py


I'm not sure what happened to my file additions/removals - I'll try to get them included when I post the next patch, which is incoming soon.
(In reply to comment #14)
> (From update of attachment 393214 [details] [diff] [review])
> r+ based on a quick read, since this is really validated by your testing. Just
> one very very urgent issue you should fix ...
> 
> >diff -r 417d1b245bad misc.py
> >+    for platform in config['platforms'].keys():
> >+        base_name =  config['platforms'][platform]['base_name']
> 
> Nit: extra space after the =

I'll fix that on check-in.
Attached patch buildbot-configs, round 2 (obsolete) — Splinter Review
This patch contains all of the changes described in my previous comment.
Attachment #392992 - Attachment is obsolete: true
Attachment #393516 - Flags: review?(nthomas)
Attachment #393516 - Flags: review?(catlee)
Attachment #392992 - Flags: review?(catlee)
Comment on attachment 393516 [details] [diff] [review]
buildbot-configs, round 2

whoops, there's a reload problem with this one. better one incoming in a few
Attachment #393516 - Attachment is obsolete: true
Attachment #393516 - Flags: review?(nthomas)
Attachment #393516 - Flags: review?(catlee)
This patch is the same as round 2, except I've removed the reload(nightly_config) from release_master.py. The metadiff between round2 and this is pretty messy due to 3.6a1 and staging bbdb patches landing.

Sorry for the churn.
Attachment #393524 - Flags: review?(nthomas)
Attachment #393524 - Flags: review?(catlee)
Comment on attachment 393524 [details] [diff] [review]
round3 - same as before without the excessive config.py reload

catlee pointed out that this patch contains some bad diffs, because of a merge commit. I'll post a better one.
Attachment #393524 - Attachment is obsolete: true
Attachment #393524 - Flags: review?(nthomas)
Attachment #393524 - Flags: review?(catlee)
Attached patch round 4 - a diff that applies (obsolete) — Splinter Review
This is a better diff of the previous patch - it differs in no meaningful way.
Attachment #393599 - Flags: review?(nthomas)
Attachment #393599 - Flags: review?(catlee)
I don't know how the heck this happened in the first place, but this patch fixes it.
Attachment #393599 - Attachment is obsolete: true
Attachment #393615 - Flags: review?(nthomas)
Attachment #393615 - Flags: review?(catlee)
Attachment #393599 - Flags: review?(nthomas)
Attachment #393599 - Flags: review?(catlee)
Attachment #393615 - Attachment is obsolete: true
Attachment #393628 - Flags: review?(nthomas)
Attachment #393628 - Flags: review?(catlee)
Attachment #393615 - Flags: review?(nthomas)
Attachment #393615 - Flags: review?(catlee)
Attachment #393628 - Flags: review?(catlee) → review+
Catlee gave this r+ over IRC
Attachment #393628 - Attachment is obsolete: true
Attachment #393631 - Flags: review?(nthomas)
Attachment #393628 - Flags: review?(nthomas)
Attachment #393631 - Flags: review?(nthomas) → review+
Depends on: 509582
Comment on attachment 393631 [details] [diff] [review]
round 7 - same as before, but with all the proper reload()'s 

Just pushed this in a few changesets, ending at changeset:   1418:13f7841c27b9
Attachment #393631 - Flags: checked-in+
Attachment #393214 - Flags: checked-in+
Comment on attachment 393214 [details] [diff] [review]
generateBranchConfig - catlee's comments addressed

changeset:   377:516d5914ab17
production-master02 is ready to go - The buildbot master has been started. I'm going to move the following slaves over:
* xserve19, 22, darwin9-slave12 -> 18 (8/21)
* linux16 -> 25 (9/22)
* win32 23 -> 39 (16/36)
* clone new linux 64-bit slave (oops)
(In reply to comment #27)
> production-master02 is ready to go - The buildbot master has been started. I'm
> going to move the following slaves over:
> * xserve19, 22, darwin9-slave12 -> 18 (8/21)
> * linux16 -> 25 (9/22)
> * win32 23 -> 39 (16/36)

All of these slaves are on the new master - except the moz2-darwin9 ones that are down (which have been added to the latest reboot bug).

I'm currently cloning the linux 64-bit slave.
Depends on: 509698
production-master and production-master02 are both up, running the new configs and appear to be working fine. The dependent bug is tracking getting a new linux 64-bit slave - almost done here.
Attachment #393774 - Flags: review?(catlee) → review+
Comment on attachment 393774 [details] [diff] [review]
explicitly copy the env in ReleaseBuildFactory to avoid screwing up nightly/dep builds

changeset:   382:7b3c9c41fbc9
required two reconfigs to take, but worked fine
Attachment #393774 - Flags: checked-in+
Attachment #393799 - Flags: review?(catlee) → review+
Comment on attachment 393799 [details] [diff] [review]
add moz2-linux64-slave02 to buildbot configs

changeset:   1423:cfb529beab4a
Attachment #393799 - Flags: checked-in+
Other than the shared dict issue this has been fine - closing now.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Attachment #393214 - Flags: review?(catlee) → review+
Product: mozilla.org → Release Engineering