Bug 892122 (Closed): Speed up reconfigs on linux test masters
Opened 11 years ago, closed 10 years ago
Categories: Release Engineering :: General, defect
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: nthomas, Assigned: bhearsum
Attachments (7 files, 2 obsolete files):
* 8.67 KB, patch (catlee: review+, bhearsum: checked-in+)
* 8.23 KB, patch (bhearsum: review+, catlee: checked-in+)
* 2.12 KB, patch (catlee: review+, bhearsum: checked-in+)
* 23.81 KB, patch (rail: review+, bhearsum: checked-in+)
* 3.14 KB, text/plain (bhearsum: checked-in+)
* 1.02 KB, patch (rail: review+, bhearsum: checked-in+)
* 4.07 KB, patch (catlee: review+, bhearsum: checked-in+)
Reconfigs on tests1-linux masters can be quite slow (many minutes) and are the long pole in any code deployment. This is likely related to the number of jobs and/or builders running on each master. We should figure out which is the major issue. If it's just the number of jobs, we can scale by adding more masters. If it's the number of builders, we can split the linux tests out by platform, i.e. 32-bit and 64-bit on a few masters each.
Reporter
Comment 1 • 11 years ago

Platform   # builders   # slaves   # jobs   # masters
linux      3857         2030*      13716    6
macosx     3358         257        11748    6
windows    4064         654        13053    6

# of builders and slaves from dumping masters.
* = not really; more slaves than we actually have connected or created.
# of jobs in https://secure.pub.build.mozilla.org/builddata/buildjson/builds-2013-07-10.js.gz
# of enabled masters from production-masters.json

I guess that says more masters first.
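A quick back-of-the-envelope sketch of what the table above implies per master. This is pure arithmetic on the reported numbers; no Buildbot APIs are involved.

```python
# Per-master load, using the numbers from the table above:
# platform: (builders, slaves, jobs, masters)
pools = {
    "linux":   (3857, 2030, 13716, 6),
    "macosx":  (3358, 257, 11748, 6),
    "windows": (4064, 654, 13053, 6),
}

for platform, (builders, slaves, jobs, masters) in pools.items():
    print(
        f"{platform}: {builders // masters} builders/master, "
        f"{jobs // masters} jobs/master"
    )
```

Note how similar the builders-per-master figures are across platforms, which is what motivates the next comment's reasoning about where the slowness comes from.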
Reporter
Comment 2 • 11 years ago

Actually we should split the platforms.
* If reconfigs are slow from the load of running/handling jobs, then we'd expect the number of jobs to be higher for linux, but it's not. I'm assuming that all platforms handle the same quantity of data.
* If reconfigs are slow because of the amount of work recreating builders and reparenting slaves, then linux should be slower, and it is, even though the number of builders isn't higher.
Updated • 11 years ago
Product: mozilla.org → Release Engineering
Assignee
Comment 3 • 10 years ago

Catlee and I talked about this on IRC. Callek already moved in-house machines to their own pool in bug 971780, so we don't need to worry about them. We are going to split up the existing tests-aws-us-east-1-linux and tests-aws-us-west-2-linux into 2 additional pools, though. When we're done we should have the following pools:
* tests-use1-linux32 (~700 machines)
* tests-usw2-linux32 (~700 machines)
* tests-use1-linux64 (~800 machines)
* tests-usw2-linux64 (~800 machines)

In most other pools we've got one master for roughly every 100 slaves. I suspect we can push more than that, though. We currently have 6 masters for all of these slaves; if we double that to 12 we can give each pool 3 masters. We can change the existing ones to be the -linux32 masters, and the new ones can take on the -linux64 work. If that's not enough, we can add more later.

I'll be working on the mechanics of bringing the new masters up (launching the instances, netflows, puppet, slavealloc, etc.). Catlee is going to work on making it possible to limit masters to linux32 or linux64 tests.
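A rough sizing check of the plan above: 12 masters split evenly across the four proposed pools, compared against the ~100 slaves/master guideline mentioned in the comment. The pool sizes are the approximate counts from the plan.

```python
# Proposed pools and their approximate slave counts (from the plan above).
pools = {
    "tests-use1-linux32": 700,
    "tests-usw2-linux32": 700,
    "tests-use1-linux64": 800,
    "tests-usw2-linux64": 800,
}
masters_per_pool = 3  # 12 masters / 4 pools

for pool, slaves in pools.items():
    print(f"{pool}: ~{slaves / masters_per_pool:.0f} slaves per master")
```

Each pool lands in the 233-267 slaves/master range, well above the usual ~100/master, which matches the comment's hedge that more masters may need to be added later.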
Assignee: nobody → bhearsum
Assignee
Comment 4 • 10 years ago
These will need to be adjusted once we have the new limit* options, but this should be enough to bring up the instances.
Attachment #8392390 -
Flags: review?(catlee)
Updated • 10 years ago
Attachment #8392390 -
Flags: review?(catlee) → review+
Comment 5 • 10 years ago
Attachment #8392399 -
Flags: review?(bhearsum)
Assignee
Comment 6 • 10 years ago

You'll notice that the names have switched from "bmNN-tests-linux" to "bmNN-tests-linux32". My plan is to bring these new masters up, move the slaves to the new pool, then change the existing masters' instances to bmNN-tests-linux64, move their slaves, and then get rid of the old instances. This will require two instances on these masters for a period of time, but only one will be running at any given time. I'll have to ease them in one master at a time to avoid downtime.
Attachment #8392402 -
Flags: review?(catlee)
Updated • 10 years ago
Attachment #8392402 -
Flags: review?(catlee) → review+
Assignee
Comment 7 • 10 years ago

Comment on attachment 8392399 [details] [diff] [review]
support limit_fx_slave_platforms, limit_b2g_slave_platforms

Review of attachment 8392399 [details] [diff] [review]:
-----------------------------------------------------------------

I hate our configs.
Attachment #8392399 -
Flags: review?(bhearsum) → review+
Assignee
Comment 8 • 10 years ago
This adds all of the new instances & pools, not just instances for the masters that need to be created.
Attachment #8392447 -
Flags: review?(catlee)
Comment 9 • 10 years ago

Comment on attachment 8392447 [details] [diff] [review]
add masters to slavealloc

Don't forget the passwords table for these new pools. I made that mistake in my in-house work.
Attachment #8392447 -
Flags: feedback+
Updated • 10 years ago
Attachment #8392447 -
Flags: review?(catlee) → review+
Updated • 10 years ago
Attachment #8392399 -
Flags: checked-in+
Comment 10 • 10 years ago
Live in production.
Assignee
Updated • 10 years ago
Attachment #8392390 -
Flags: checked-in+
Assignee
Updated • 10 years ago
Attachment #8392402 -
Flags: checked-in+
Assignee
Comment 11 • 10 years ago

DNS entries have been added. I'm going to start bringing up the instances shortly.
Assignee
Comment 12 • 10 years ago

This has everything disabled still, but should set things up as follows:
- bm01-06 are linux32 AWS test masters; 64-bit platforms and in-house slave platforms are disabled.
- bm51-54, 67, and 68 are linux64 AWS test masters; 32-bit platforms and in-house slave platforms are disabled.
- bm103-105 are in-house linux test masters for 32- and 64-bit; they have AWS slave platforms disabled.

Note that their "name" has this bug number appended for now, because I'm pretty sure that our tools need that to be unique.
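To make the per-master platform limits concrete, here is a hypothetical sketch of how a master entry with the limit options might look. The option names limit_fx_slave_platforms / limit_b2g_slave_platforms come from the patch reviewed in comment 7; the dict shape, master names, and platform names ("ubuntu32_vm", "ubuntu64_vm") are illustrative assumptions, not the real buildbot-configs layout.

```python
# Hypothetical master entries: each master is limited to either 32-bit
# or 64-bit slave platforms, mirroring the split described above.
MASTERS = {
    "bm01-tests-linux32": {
        # only 32-bit AWS test platforms run on this master (assumed names)
        "limit_fx_slave_platforms": ["ubuntu32_vm"],
        "limit_b2g_slave_platforms": ["ubuntu32_vm"],
    },
    "bm51-tests-linux64": {
        # only 64-bit AWS test platforms run on this master (assumed names)
        "limit_fx_slave_platforms": ["ubuntu64_vm"],
        "limit_b2g_slave_platforms": ["ubuntu64_vm"],
    },
}

def platforms_for(master_name):
    """Return the Firefox slave platforms a master is limited to."""
    return MASTERS[master_name]["limit_fx_slave_platforms"]

print(platforms_for("bm01-tests-linux32"))
```

The point of limiting platforms per master is that a reconfig only has to recreate the builders for the platforms that master actually serves, rather than every linux test builder.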
Attachment #8393046 -
Flags: review?(catlee)
Assignee
Comment 13 • 10 years ago

Adds all of the new masters (existing in-house ones don't need to be added, because none of their slavealloc details have changed). Adds slave passwords.
Attachment #8392447 -
Attachment is obsolete: true
Assignee
Comment 14 • 10 years ago

I ran this on the production slavealloc db. I had to fix a couple more FQDNs plus the slave password syntax. I subbed in the correct password, of course.
Attachment #8393064 -
Attachment is obsolete: true
Attachment #8393512 -
Flags: checked-in+
Comment 15 • 10 years ago
I need to enable b2g reftests across the board and I can't because I'm reaching the max number of builders for the 64-bit test instances. I don't know what the right solution is.
Assignee
Comment 16 • 10 years ago

(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #15)
> I need to enable b2g reftests across the board and I can't because I'm
> reaching the max number of builders for the 64-bit test instances.
>
> I don't know what the right solution is.

Per IRC, this isn't going to help.
Assignee
Comment 17 • 10 years ago

Fully rolling this out is going to be a multi-step process. Here's my plan:
1) After IT bugs are resolved, verify the new masters by locking a slave to each of them.
2) Once things look OK, enable the new masters in slavealloc & production-masters.json.
3) Repoint all of the tst-linux32 masters to the tst-use1-linux32 and tst-usw2-linux32 pools. At this point, the tst-linux32 aws machines will migrate over to the new masters as they reboot.

Next:
4) Create & enable one of the new 64-bit only masters on an existing master (let's say bm51).
5) Verify the master by locking a slave to it.
6) Once it looks OK, enable the master in slavealloc & production-masters.json, and disable the old master instance in slavealloc, production-masters.json, and puppet.
7) Move some of the tst-linux64 aws machines into the new pool.
8) Once it's idle, shut down/delete the old master instance.
9) Repeat steps 4 through 8 until bm51, 52, 53, 54, 67, and 68 are only running the new master instances. At this point, all of the tst-linux64 ec2 machines should be in the tests-{use1,usw2}-linux64 pools.

Once all of the above is completed there's some cleanup that should be done:
* Make sure that the old instances aren't referred to in slavealloc, production-masters.json, nor puppet.
* Delete the old pools (eg: tests-aws-us-east-1-linux)
Assignee
Comment 18 • 10 years ago

(In reply to Ben Hearsum [:bhearsum] from comment #17)
> Fully rolling this out is going to be a multi-step process. Here's my plan:
> 1) After IT bugs are resolved, verify the new masters by locking a slave to
> each of them.

IT bugs are resolved, I've locked a slave to each master now, and they'll be running overnight.
Assignee
Comment 19 • 10 years ago

Only 2 of the masters managed to take jobs overnight. All of the jobs completed without issue, and I see results on TBPL for them as well as logs on FTP. Given that these are homogeneous, I'm going to go ahead with steps 2 & 3 today and move all of the linux 32-bit test machines over to these masters.
Assignee
Comment 20 • 10 years ago

(In reply to Ben Hearsum [:bhearsum] from comment #17)
> 2) Once things look OK, enable the new masters in slavealloc &
> production-masters.json
> 3) Repoint all of the tst-linux32 masters to the tst-use1-linux32 and
> tst-usw2-linux32

These are done. I'm watching the masters to make sure that slaves connect without issue and take jobs.
Assignee
Comment 21 • 10 years ago

(In reply to Ben Hearsum [:bhearsum] from comment #17)
> 4) Create & enable one of the new 64-bit only masters on an existing master
> (let's say bm51).
> 5) Verify the master by locking a slave to it.
> 6) Once it looks OK, enable the master in slavealloc &
> production-masters.json, and disable the old master instance in slavealloc,
> production-masters.json, and puppet.
> 7) Move some of the tst-linux64 aws machines into the new pool
> 8) Once it's idle shut down/delete the old master instance
> 9) Repeat steps 4 through 8 until bm51, 52, 53, 54, 67, and 68 are only
> running the new master instances. At this point, all of the tst-linux64 ec2
> machines should be in the tests-{use1,usw2}-linux64 pools.

Discovered a bug in these last steps; the new steps are:
4) Disable one of the master instances on the existing masters in slavealloc/production-masters.json.
5) Gracefully shut down the master and wait for it to stop.
6) Create one of the new 64-bit only masters on the same master.
7) Verify it by locking a slave to it.
8) Once it looks OK, create the new master instance through Puppet.
9) Lock a slave to it to verify.
10) Once it looks OK, enable the master in slavealloc + production-masters.json.
11) Move some of the tst-linux64 aws machines into the new pool.
12) Repeat until bm51, 52, 53, 54, 67, and 68 are only running the new master instances.

I've started this on bm51 (use1) and bm53 (usw2).
Assignee
Comment 22 • 10 years ago
Attachment #8394895 -
Flags: review?(rail)
Updated • 10 years ago
Attachment #8393046 -
Flags: review?(catlee) → review+
Assignee
Updated • 10 years ago
Attachment #8393046 -
Flags: checked-in+
Updated • 10 years ago
Attachment #8394895 -
Flags: review?(rail) → review+
Assignee
Updated • 10 years ago
Attachment #8394895 -
Flags: checked-in+
Assignee
Comment 23 • 10 years ago

OK, the new master instances on bm51 and bm53 are up and I've moved over a portion of the slaves to the tests-use1-linux64 and tests-usw2-linux64 pools. There were some additional commits due to silly errors in my production-masters.json patch, and also some bustage fixes for the limit_*_slave_platforms support.

I also landed a follow-up in Puppet to remove the old masters from bm51/bm53, so that Nagios will look for the correct number of Buildbot processes.

At this point I'm done making changes for the day. Todo next week:
* Migrate to new buildbot instances on bm52/54/67/68/103/104/105
* Make sure data in production-masters.json is correct
* Make sure all aws test instances are in the new pools
* Delete old master instances and pools from slavealloc
* Delete old master dirs from /builds/buildbot
Assignee
Comment 24 • 10 years ago

(In reply to Ben Hearsum [:bhearsum] from comment #23)
> At this point I'm done making changes for the day. Todo next week:
> * Migrate to new buildbot instances on bm52/54/67/68/103/104/105
> * Make sure data in production-masters.json is correct
> * Make sure all aws test instances are in the new pools
> * Delete old master instances and pools from slavealloc
> * Delete old master dirs from /builds/buildbot

bm103-105 have been updated to limit their slave platforms. The existing instances on bm52/54 are currently shutting down, ready to be replaced with the new instances.
Assignee
Comment 25 • 10 years ago
I'm not sure how I screwed this up in the first place =\.
Attachment #8395656 -
Flags: review?(catlee)
Updated • 10 years ago
Attachment #8395656 -
Flags: review?(catlee) → review+
Assignee
Updated • 10 years ago
Attachment #8395656 -
Flags: checked-in+
Assignee
Comment 26 • 10 years ago
bm52 & 54 are migrated. I've shunted additional slaves over to the tests-linux64-* pools. bm67/68 are in progress.
Assignee
Comment 27 • 10 years ago

All of the masters have been transitioned to new instances. I had one follow-up fix to my latest production-masters.json: I forgot to add a couple of platforms to the linux64 masters (I only added slave platforms): https://hg.mozilla.org/build/tools/rev/cef73fa82d58

Now cleaning up old slavealloc data...
Assignee
Comment 28 • 10 years ago

OK, all traces of the old pools are now gone from slavealloc. I had to set a bunch of slaves' current_masterid to null to accomplish this, but they'll get that fixed the next time they boot. I deleted the old instances' dirs on the masters, too.

Reconfigs on these new instances _should_ be much better. I did trial runs earlier and they were about 60% faster than before. We'll see how it goes.
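The slavealloc cleanup described above can be sketched against a throwaway SQLite copy of a slaves table. The real slavealloc schema may differ; the table and column names (slaves, current_masterid) are taken from the comment, and everything else here (slave names, master ids) is illustrative.

```python
import sqlite3

# Throwaway in-memory stand-in for the slavealloc slaves table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE slaves (name TEXT, current_masterid INTEGER)")
conn.executemany(
    "INSERT INTO slaves VALUES (?, ?)",
    [("tst-linux64-ec2-001", 51), ("tst-linux64-ec2-002", 53)],
)

# Detach slaves from the old (about-to-be-deleted) master rows; each
# slave picks up a fresh allocation the next time it boots.
old_master_ids = (51, 53)
conn.execute(
    "UPDATE slaves SET current_masterid = NULL "
    "WHERE current_masterid IN (?, ?)",
    old_master_ids,
)

remaining = conn.execute(
    "SELECT COUNT(*) FROM slaves WHERE current_masterid IS NOT NULL"
).fetchone()[0]
print(remaining)  # 0
```

Nulling the foreign-key-like column first is what allows the old master rows and pools to be deleted without leaving dangling references.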
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated • 6 years ago
Component: General Automation → General