Closed Bug 892122 Opened 11 years ago Closed 10 years ago

Speed up reconfigs on linux test masters

Component: Release Engineering :: General
Type: defect
Platform: x86 Linux
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: nthomas
Assignee: bhearsum
Attachments: 7 files, 2 obsolete

Reconfigs on tests1-linux masters can be quite slow (many minutes) and are the long pole in any code deployment. This is likely to be related to the number of jobs and/or builders running on each master.

We should figure out which is the major issue. If it's just the number of jobs, we can scale by adding more masters. If it's the # of builders, then we can split the linux tests out by platform, i.e. 32-bit and 64-bit on a few masters each.
Platform                      # builders  # slaves  # jobs  # masters
linux                               3857     2030*   13716     6
macosx                              3358      257    11748     6
windows                             4064      654    13053     6

# of builders and slaves from dumping the masters
* = overcounted: this is more slaves than we actually have connected or created
# of jobs from https://secure.pub.build.mozilla.org/builddata/buildjson/builds-2013-07-10.js.gz
# of enabled masters from production-masters.json

I guess that says more masters first.
Actually we should split the platforms.

* If reconfigs are slow because of the load from running and handling jobs, we'd expect the number of jobs to be higher for linux, but it isn't (I'm assuming all platforms handle roughly the same quantity of data per job; a rough way to recount jobs per platform from the buildjson dump is sketched below).
* If reconfigs are slow because of the work of recreating builders and reparenting slaves, linux should be slower, and it is, even though its builder count isn't the highest; what does stand out is its much larger slave count.
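For anyone re-deriving the jobs column, a rough counting sketch follows. It assumes (unverified) that the buildjson dump has a top-level "builds" list and that each build's "properties" dict carries a "platform" key; adjust to the real schema if it differs.

import gzip
import json
from collections import Counter

def jobs_per_platform(path):
    # Assumed layout: {"builds": [{"properties": {"platform": ...}, ...}, ...]}
    with gzip.open(path, "rt") as f:
        data = json.load(f)
    counts = Counter()
    for build in data.get("builds", []):
        platform = (build.get("properties") or {}).get("platform", "unknown")
        counts[platform] += 1
    return counts

for platform, n in jobs_per_platform("builds-2013-07-10.js.gz").most_common():
    print(platform, n)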
Product: mozilla.org → Release Engineering
Catlee and I talked about this on IRC. Callek already moved in-house machines to their own pool in bug 971780, so we don't need to worry about them. We are going to split up the existing tests-aws-us-east-1-linux and tests-aws-us-west-2-linux into 2 additional pools, though. When we're done we should have the following pools:
* tests-use1-linux32 (~700 machines)
* tests-usw2-linux32 (~700 machines)
* tests-use1-linux64 (~800 machines)
* tests-usw2-linux64 (~800 machines)

In most other pools we've got one master for roughly every 100 slaves. I suspect we can push more than that, though. We currently have 6 masters for all of these slaves, and if we double that to 12 we can give each pool 3 masters. We can change the existing ones to be the -linux32 masters, and the new ones can take on the -linux64 work. If that's not enough, we can add more later.
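As a back-of-envelope check on "push more than that": with the approximate machine counts above and 3 masters per pool, the implied ratio looks like this (a quick sketch, numbers taken from this comment):

# Roughly 230-270 slaves per master, versus the ~100 per master typical elsewhere.
pools = {
    "tests-use1-linux32": 700, "tests-usw2-linux32": 700,
    "tests-use1-linux64": 800, "tests-usw2-linux64": 800,
}
masters_per_pool = 3
for pool, slaves in sorted(pools.items()):
    print(pool, round(slaves / masters_per_pool), "slaves per master")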

I'll be working on the mechanics of bringing the new masters up (launching the instances, netflows, puppet, slavealloc, etc.). Catlee is going to work on making it possible to limit masters to linux32 or linux64 tests.
Assignee: nobody → bhearsum
These will need to be adjusted once we have the new limit* options, but this should be enough to bring up the instances.
Attachment #8392390 - Flags: review?(catlee)
Attachment #8392390 - Flags: review?(catlee) → review+
You'll notice that the names have switched from "bmNN-tests-linux" to "bmNN-tests-linux32". My plan is to bring these new masters up, move the slaves to the new pool, and then change the existing masters' instances to bmNN-tests-linux64, move their slaves, and then get rid of their old instances. This will require two instances on these masters for a period of time, but only one will be running at any given time. I'll have to ease them in one master at a time to avoid downtime.
Attachment #8392402 - Flags: review?(catlee)
Attachment #8392402 - Flags: review?(catlee) → review+
Comment on attachment 8392399 [details] [diff] [review]
support limit_fx_slave_platforms, limit_b2g_slave_platforms

Review of attachment 8392399 [details] [diff] [review]:
-----------------------------------------------------------------

I hate our configs.
Attachment #8392399 - Flags: review?(bhearsum) → review+
Attached patch add masters to slavealloc (obsolete) — Splinter Review
This adds all of the new instances & pools, not just instances for the masters that need to be created.
Attachment #8392447 - Flags: review?(catlee)
Comment on attachment 8392447 [details] [diff] [review]
add masters to slavealloc

Don't forget the passwords table for these new pools. I made that mistake in my in-house work.
Attachment #8392447 - Flags: feedback+
Attachment #8392447 - Flags: review?(catlee) → review+
Attachment #8392399 - Flags: checked-in+
Live in production.
Attachment #8392390 - Flags: checked-in+
Attachment #8392402 - Flags: checked-in+
DNS entries have been added. I'm going to start bringing up the instances shortly.
This has everything disabled still, but should set things up as follows:
- bm01-06 are linux32 AWS test masters; 64-bit platforms and in-house slave platforms are disabled.
- bm51-54, 67, and 68 are linux64 AWS test masters; 32-bit platforms and in-house slave platforms are disabled.
- bm103-105 are in-house linux test masters for 32- and 64-bit; they have AWS slave platforms disabled. Note that their "name" has this bug number appended for now, because I'm pretty sure our tools need that to be unique. (A rough sketch of how limit_* options like this can work follows below.)
Attachment #8393046 - Flags: review?(catlee)
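For readers unfamiliar with the limit_* options mentioned above: the real change is attachment 8392399, but the general idea is just pruning the per-platform slave lists before builders are generated. A minimal sketch with made-up names (the dict layout, argument name, and platform strings are stand-ins, not the actual buildbot-configs structures):

def prune_slave_platforms(platforms, limit_slave_platforms):
    # platforms: {"linux": {"slave_platforms": ["ubuntu32_vm", "ubuntu64_vm"], ...}, ...}
    # limit_slave_platforms: list of allowed slave platforms, or None for "no limit".
    if limit_slave_platforms is None:
        return platforms
    for pf in platforms.values():
        pf["slave_platforms"] = [
            sp for sp in pf.get("slave_platforms", [])
            if sp in limit_slave_platforms
        ]
    return platforms

# e.g. a linux32-only AWS test master would pass something like ["ubuntu32_vm"],
# so 64-bit and in-house slave platforms simply never produce builders there.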
Attached file better master sql (obsolete) —
Adds all of the new masters (the existing in-house ones don't need to be added, because none of their slavealloc details have changed).

Adds slave passwords.
Attachment #8392447 - Attachment is obsolete: true
Depends on: 985113
Attached file final sql for masters
I ran this on the production slavealloc db. I had to fix a couple more FQDNs and the slave password syntax. I subbed in the correct password, of course.
Attachment #8393064 - Attachment is obsolete: true
Attachment #8393512 - Flags: checked-in+
I need to enable b2g reftests across the board and I can't because I'm reaching the max number of builders for the 64-bit test instances.

I don't know what the right solution is.
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #15)
> I need to enable b2g reftests across the board and I can't because I'm
> reaching the max number of builders for the 64-bit test instances.
> 
> I don't know what the right solution is.

Per IRC, this isn't going to help.
Fully rolling this out is going to be a multi-step process. Here's my plan:
1) After IT bugs are resolved, verify the new masters by locking a slave to each of them.
2) Once things look OK, enable the new masters in slavealloc & production-masters.json
3) Repoint all of the tst-linux32 masters to the tst-use1-linux32 and tst-usw2-linux32 pools

At this point, the tst-linux32 aws machines will migrate over to the new masters as they reboot. Next:

4) Create & enable one of the new 64-bit only masters on an existing master (let's say bm51).
5) Verify the master by locking a slave to it.
6) Once it looks OK, enable the master in slavealloc & production-masters.json, and disable the old master instance in slavealloc, production-masters.json, and puppet.
7) Move some of the tst-linux64 aws machines into the new pool
8) Once it's idle, shut down/delete the old master instance (one way to check idleness is sketched after this list)
9) Repeat steps 4 through 8 until bm51, 52, 53, 54, 67, and 68 are only running the new master instances. At this point, all of the tst-linux64 ec2 machines should be in the tests-{use1,usw2}-linux64 pools.

Once all of the above is completed there's some cleanup that should be done:
* Make sure that the old instances aren't referred to in slavealloc, production-masters.json, or puppet.
* Delete the old pools (e.g. tests-aws-us-east-1-linux)
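On step 8's "once it's idle": one simple check is to poll the old master instance's web-status JSON for running builds. This assumes the 0.8-style /json/builders endpoint where each builder entry carries a "currentBuilds" list; the hostname and port below are placeholders.

import json
import time
from urllib.request import urlopen

def master_is_idle(base_url):
    with urlopen(base_url + "/json/builders") as resp:
        builders = json.loads(resp.read().decode("utf-8"))
    return all(not b.get("currentBuilds") for b in builders.values())

base = "http://bm51.example.com:8201"  # placeholder host/port
while not master_is_idle(base):
    time.sleep(300)
print("idle; safe to shut down and delete the old instance")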
(In reply to Ben Hearsum [:bhearsum] from comment #17)
> Fully rolling this out is going to be a multi-step process. Here's my plan:
> 1) After IT bugs are resolved, verify the new masters by locking a slave to
> each of them.

IT bugs are resolved, I've locked a slave to each master now, and they'll be running overnight.
Only 2 of the masters managed to take jobs overnight. All of the jobs completed without issue, and I see results on TBPL for them as well as logs on FTP. Given that these are homogeneous, I'm going to go ahead with steps 2 & 3 today and move all of the linux 32-bit test machines over to these masters.
(In reply to Ben Hearsum [:bhearsum] from comment #17)
> 2) Once things look OK, enable the new masters in slavealloc &
> production-masters.json
> 3) Repoint all of the tst-linux32 masters to the tst-use1-linux32 and
> tst-usw2-linux32

These are done. I'm watching the masters to make sure that slaves connect without issue and take jobs.
(In reply to Ben Hearsum [:bhearsum] from comment #17)
> 4) Create & enable one of the new 64-bit only masters on an existing master
> (let's say bm51).
> 5) Verify the master by locking a slave to it.
> 6) Once it looks OK, enable the master in slavealloc &
> production-masters.json, and disable the old master instance in slavealloc,
> production-masters.json, and puppet.
> 7) Move some of the tst-linux64 aws machines into the new pool
> 8) Once it's idle shut down/delete the old master instance
> 9) Repeat steps 4 through 8 until bm51, 52, 53, 54, 67, and 68 are only
> running the new master instances. At this point, all of the tst-linux64 ec2
> machines should be in the tests-{use1,usw2}-linux64 pools.

Discovered a bug in these last steps; the new steps are:
4) Disable one of the master instances on the existing masters in slavealloc/production-masters.json.
5) Gracefully shut down the master and wait for it to stop (one way to wait is sketched below).
6) Create one of the new 64-bit-only masters on the same host.
7) Verify it by locking a slave to it.
8) Once it looks OK, create the new master instance through Puppet.
9) Lock a slave to it to verify.
10) Once it looks OK, enable the master in slavealloc + production-masters.json.
11) Move some of the tst-linux64 aws machines into the new pool.
12) Repeat until bm51, 52, 53, 54, 67, and 68 are only running the new master instances.

I've started this on bm51 (use1) and bm53 (usw2).
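For step 5's "wait for it to stop", a small sketch that polls the master's twistd.pid until the process has exited; the basedir below is a placeholder path.

import os
import time

def wait_for_master_exit(basedir, poll_seconds=30):
    with open(os.path.join(basedir, "twistd.pid")) as f:
        pid = int(f.read().strip())
    while True:
        try:
            os.kill(pid, 0)  # signal 0: only checks whether the process still exists
        except OSError:
            return
        time.sleep(poll_seconds)

wait_for_master_exit("/builds/buildbot/tests1-linux/master")  # placeholder basedir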
Attachment #8393046 - Flags: review?(catlee) → review+
Attachment #8393046 - Flags: checked-in+
Attachment #8394895 - Flags: review?(rail) → review+
Attachment #8394895 - Flags: checked-in+
OK, the new master instances on bm51 and bm53 are up and I've moved over a portion of the slaves to the tests-use1-linux64 and tests-usw2-linux64 pools. There were some additional commits due to silly errors in my production-masters.json patch, and a bustage fix for the limit_*_slave_platforms support.

I also landed a follow-up in Puppet to remove the old masters from bm51/bm53, so that Nagios will look for the correct number of Buildbot processes.

At this point I'm done making changes for the day. Todo next week:
* Migrate to new buildbot instances on bm52/54/67/68/103/104/105
* Make sure data in production-masters.json is correct
* Make sure all aws test instances are in the new pools (a quick check is sketched below)
* Delete old master instances and pools from slavealloc
* Delete old master dirs from /builds/buildbot
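On the "make sure all aws test instances are in the new pools" item, a hedged sketch that pulls the slave list from slavealloc and flags linux test EC2 machines still sitting in an old pool. The API URL, the field names ("name", "pool"), and the slave-name prefixes are assumptions, not verified against the actual slavealloc interface.

import json
from urllib.request import urlopen

NEW_POOLS = {"tests-use1-linux32", "tests-usw2-linux32",
             "tests-use1-linux64", "tests-usw2-linux64"}

with urlopen("https://secure.pub.build.mozilla.org/slavealloc/api/slaves") as resp:
    slaves = json.loads(resp.read().decode("utf-8"))

for slave in slaves:
    name, pool = slave.get("name", ""), slave.get("pool", "")
    if name.startswith(("tst-linux32-ec2", "tst-linux64-ec2")) and pool not in NEW_POOLS:
        print(name, "is still in", pool)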
(In reply to Ben Hearsum [:bhearsum] from comment #23)
> Todo next week:
> * Migrate to new buildbot instances on bm52/54/67/68/103/104/105
> [...]

bm103-105 have been updated to limit their slave platforms. The existing instances on bm52/54 are currently shutting down, ready to be replaced with the new instances.
I'm not sure how I screwed this up in the first place =\.
Attachment #8395656 - Flags: review?(catlee)
Attachment #8395656 - Flags: review?(catlee) → review+
Attachment #8395656 - Flags: checked-in+
bm52 & 54 are migrated. I've shunted additional slaves over to the tests-{use1,usw2}-linux64 pools.

bm67/68 are in progress.
All of the masters have been transitioned to new instances. I had one follow-up fix to my latest production-masters.json -- I forgot to add a couple platforms to the linux64 masters (I only added slave platforms): https://hg.mozilla.org/build/tools/rev/cef73fa82d58

Now cleaning up old slavealloc data...
OK, all traces of the old pools are now gone from slavealloc. I had to set a bunch of slaves' current_masterid to null to accomplish this, but they'll get that fixed the next time they boot.

I deleted the old instances' dirs on the masters, too.

Reconfigs on these new instances _should_ be much better. I did trial ones earlier and they were about 60% faster than before. We'll see how it goes.
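For anyone wanting to reproduce the before/after timing: "buildbot reconfig <basedir>" signals the master and follows twistd.log until the reconfiguration finishes, so wall-clocking the command gives a usable number. A minimal sketch (the basedir is a placeholder):

import subprocess
import time

def time_reconfig(basedir):
    start = time.time()
    subprocess.check_call(["buildbot", "reconfig", basedir])
    return time.time() - start

print("reconfig took %.1fs" % time_reconfig("/builds/buildbot/tests1-linux32/master"))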
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Blocks: 987759
Component: General Automation → General