Closed
Bug 1090139
Opened 10 years ago
Closed 10 years ago
Add yet some more linux64 masters
Categories
(Release Engineering :: General, defect)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: pmoore, Assigned: Callek)
References
Details
Attachments
(7 files, 1 obsolete file)
3.70 KB, patch | coop: review+
8.92 KB, patch | rail: review+
1.61 KB, patch | rail: review+, rail: checked-in+
3.70 KB, text/plain | rail: review+
480 bytes, text/plain | rail: review+
27.26 KB, patch | rail: review+, coop: checked-in+
1.28 KB, patch | Callek: review+, nthomas: checked-in+
By far the slowest buildbot masters to reconfig are the linux64 test masters. At the moment we do not really know what the cause is, but they appear to be orders of magnitude slower at reconfig'ing than their brothers and sisters.
We also should try to understand the root causes of the traceback exceptions we often get for them during reconfigs. Basically, a deep dive is needed into the internals of the reconfig process to understand what is going on, causing tracebacks, and taking so long.
Updated•10 years ago
|
Component: Buildduty → General Automation
QA Contact: bugspam.Callek → catlee
Comment 1•10 years ago
|
||
Rail is trying to add more slaves in bug 1090568, so let's get some more masters set up to spread out the potential load. 2 masters in each of usw1 and use1 should suffice.
Reporter | ||
Comment 2•10 years ago
|
||
Nick suggests repurposing buildbot-master69.srv.releng.use1.mozilla.com - a Windows test master originally set up by jhopkins in use1 which is disabled in production-masters.json and no longer used/needed.
Reporter | ||
Comment 3•10 years ago
|
||
I've repurposed buildbot-master69 in this patch to be a linux64 test master in use1, rather than a windows test master.
I've left it disabled for now.
We will also need to update slavealloc, fix up nagios reporting, and reimage the machine. When it is live and enabled, we will probably want to manually disable and then re-enable some slaves to get the load balanced out (or maybe they will automatically pull down a new buildbot.tac when they next reboot, which might be soon enough?).
Then of course, we still need to set up three more test masters - one more in use1, and then two in usw1.
Attachment #8516557 -
Flags: review?(coop)
Comment 4•10 years ago
|
||
Comment on attachment 8516557 [details] [diff] [review]
bug1090139_tools_repurpose_bbm69_v1.patch
Review of attachment 8516557 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good. Can I ask you to make sure the old Windows master dirs are removed on bm69 when this lands?
Attachment #8516557 -
Flags: review?(coop) → review+
Assignee | ||
Comment 5•10 years ago
|
||
(In reply to Pete Moore [:pete][:pmoore] from comment #3)
> Created attachment 8516557 [details] [diff] [review]
> bug1090139_tools_repurpose_bbm69_v1.patch
>
> We will also need to update slavealloc, fix up nagios reporting, reimage
> machine, and when it is live and enabled, we will probably want to manually
Careful if you choose to reimage: if the IP changes, we *need* a firewall flow change bug on file; otherwise we'll have alternate failures, and possible flow issues in the future (when flows need adding or changing).
Should you need to reimage and are getting the IP changed anyway, I would actually suggest *.bb.* rather than its existing fqdn/vlan, since the former is what we want to standardize on anyway, and it is negligible extra work.
Reporter | ||
Comment 6•10 years ago
|
||
So I took a look at buildbot-master52 today, since a reconfig just lasts forever on it.
The logs scroll by very quickly (buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux64/master/twistd.log) with:
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] [Failure instance: Traceback: <class 'twisted.spread.pb.Error'>: Maximum PB reference count exceeded. Goodbye.
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/internet/defer.py:249:addCallbacks
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/internet/defer.py:441:_runCallbacks
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/buildbot-0.8.2_hg_3ce9eb030a5f_production_0.8-py2.7.egg/buildbot/process/builder.py:107:doSetMaster
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/pb.py:328:callRemote
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] --- <exception caught here> ---
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/pb.py:809:_sendMessage
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/pb.py:763:serialize
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/jelly.py:1122:jelly
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/jelly.py:534:jelly
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/jelly.py:588:_jellyIterable
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/jelly.py:475:jelly
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/flavors.py:127:jellyFor
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/pb.py:664:registerReference
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] ]
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] [Failure instance: Traceback: <class 'twisted.spread.pb.Error'>: Maximum PB reference count exceeded. Goodbye.
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/internet/defer.py:249:addCallbacks
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/internet/defer.py:441:_runCallbacks
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/buildbot-0.8.2_hg_3ce9eb030a5f_production_0.8-py2.7.egg/buildbot/process/builder.py:107:doSetMaster
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/pb.py:328:callRemote
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] --- <exception caught here> ---
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/pb.py:809:_sendMessage
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/pb.py:763:serialize
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/jelly.py:1122:jelly
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/jelly.py:534:jelly
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/jelly.py:588:_jellyIterable
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/jelly.py:475:jelly
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/flavors.py:127:jellyFor
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/pb.py:664:registerReference
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] ]
This looks like a bad case of bug 712244 which suggests the problem is not too many slaves per master, but instead, too many builders per slave.
For reference, found some docs here:
http://twistedmatrix.com/documents/current/core/howto/pb.html
Maybe time to either jacuzzi up linux64 test slaves, or create some more generic builders...
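To gauge how many slave connections are hitting this error, the log can be tallied per slave IP. The following is a minimal sketch, assuming twistd.log lines keep the `timestamp [Broker,<n>,<ip>] message` shape shown in the excerpt above; `count_pb_refcount_errors` is a hypothetical helper name, not part of buildbot or Twisted.

```python
import re
from collections import Counter

# Matches the "[Broker,<connection>,<ip>]" prefix seen in the log excerpt.
BROKER_RE = re.compile(r"\[Broker,\d+,(?P<ip>[0-9.]+)\]")
ERROR_TEXT = "Maximum PB reference count exceeded"


def count_pb_refcount_errors(lines):
    """Return a Counter mapping slave IP -> number of PB refcount failures."""
    counts = Counter()
    for line in lines:
        # Only the first line of each traceback carries the error text.
        if ERROR_TEXT not in line:
            continue
        match = BROKER_RE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


# Example use against a real log (path taken from this comment):
#   with open("/builds/buildbot/tests1-linux64/master/twistd.log") as f:
#       print(count_pb_refcount_errors(f).most_common(10))
```

If most slaves on the master show up in the tally, that would be consistent with a per-connection limit (builders per slave) rather than a problem with a few individual slaves.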
Reporter | ||
Comment 7•10 years ago
|
||
Unfortunately I can't even hit the web interface to gracefully shut down:
http://buildbot-master52.srv.releng.use1.mozilla.com:8201
Assignee | ||
Comment 8•10 years ago
|
||
Last time we added aws linux test masters was Bug 1035863 fwiw.
I'm going to be adding 2 masters per region
bm120,bm121 for use1
bm122,bm123 for usw2
Assignee | ||
Comment 9•10 years ago
|
||
Attachment #8562572 -
Flags: review?(rail)
Assignee | ||
Comment 10•10 years ago
|
||
While I'm at it, I noticed some SYS entries from :dustin for the new .bb. hosts that are incorrect (specifying use instead of usw in the informative columns), so f? on him for that tidbit as well.
Attachment #8562586 -
Flags: review?(rail)
Attachment #8562586 -
Flags: feedback?(dustin)
Comment 11•10 years ago
|
||
Comment on attachment 8562586 [details]
[bash commands] add masters to inventory
LGTM. One thing to watch out for is that free_ips.py doesn't reserve IPs, so there is a chance of getting a dupe. Better to either validate or use -n $number_of_ips and then deal with them.
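Because the free addresses are only reported and not reserved, one way to validate is to re-check the candidates against whatever has been allocated in the meantime. This is a minimal sketch under that assumption; `find_conflicts` is a hypothetical helper and is not part of free_ips.py or invtool, and how you obtain the list of allocated addresses is left open.

```python
import ipaddress


def find_conflicts(candidate_ips, allocated_ips):
    """Return candidates that are repeated or already allocated, sorted."""
    allocated = {ipaddress.ip_address(ip) for ip in allocated_ips}
    seen = set()
    conflicts = set()
    for raw in candidate_ips:
        ip = ipaddress.ip_address(raw)  # raises ValueError on malformed input
        if ip in allocated or ip in seen:
            conflicts.add(ip)
        seen.add(ip)
    return [str(ip) for ip in sorted(conflicts)]
```

An empty result means none of the candidates collide with each other or with the allocated set at the time of the check; it does not reserve anything.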
Attachment #8562586 -
Flags: review?(rail) → review+
Updated•10 years ago
|
Attachment #8562572 -
Flags: review?(rail) → review+
Comment 12•10 years ago
|
||
Comment on attachment 8562569 [details] [diff] [review]
[tools] moar_masters
stamp
Attachment #8562569 -
Flags: review?(rail) → review+
Comment 13•10 years ago
|
||
Thanks for taking care of this bug. There is a doc that you may want to update with the invtool commands: https://wiki.mozilla.org/ReleaseEngineering/AWS_Master_Setup
Assignee | ||
Comment 15•10 years ago
|
||
This is being used right now, with post-run review to do the following:
root@relengwebadm.private.scl3
cd /data/releng/www/slavealloc
set +o history # turn off history so I can enter private data
slavealloc_server=mysql://user:pass@server/DB_NAME # from gpg secrets
set -o history
/data/releng/www/slavealloc/slavealloc dbimport -D $slavealloc_server \
--master-data ./import-b1090139.csv
slavealloc_server=wiped
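Since the connection string is pasted by hand from secrets, a quick sanity check before running the import can catch a malformed value. The sketch below assumes the usual `scheme://user:password@host/database` URL layout; `check_db_url` is a hypothetical helper and not part of slavealloc.

```python
from urllib.parse import urlsplit


def check_db_url(url):
    """Raise ValueError unless url has scheme, user, password, host and database."""
    parts = urlsplit(url)
    database = parts.path.lstrip("/")
    missing = [
        name
        for name, value in (
            ("scheme", parts.scheme),
            ("username", parts.username),
            ("password", parts.password),
            ("host", parts.hostname),
            ("database", database),
        )
        if not value
    ]
    if missing:
        raise ValueError("DB URL is missing: " + ", ".join(missing))
    # Return only non-secret parts so the result is safe to print.
    return parts.hostname, database
```

The function deliberately returns only the host and database name, so that logging the result does not leak the credentials.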
Attachment #8563171 -
Flags: review?(rail)
Assignee | ||
Comment 16•10 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] from comment #14)
> Callek, what are you seeing exactly?
Answered in IRC, but to clarify here as well: the SYS entries in inventory had the wrong --rack-pk for one region's hosts. It looks like it's fixed now.
Flags: needinfo?(bugspam.Callek)
Assignee | ||
Comment 17•10 years ago
|
||
In buildduty@aws-manager:/builds/aws_manager/cloud-tools
In 4 separate screen sessions, I'm running each of the following commands now:
aws_create_instance -c configs/buildbot-master -r us-east-1 -s aws-releng -k /builds/aws_manager/secrets/aws-secrets.json --ssh-key ~/.ssh/aws-ssh-key -i ./instance_data/us-east-1.instance_data_master.json buildbot-master120
aws_create_instance -c configs/buildbot-master -r us-east-1 -s aws-releng -k /builds/aws_manager/secrets/aws-secrets.json --ssh-key ~/.ssh/aws-ssh-key -i ./instance_data/us-east-1.instance_data_master.json buildbot-master121
aws_create_instance -c configs/buildbot-master -r us-west-2 -s aws-releng -k /builds/aws_manager/secrets/aws-secrets.json --ssh-key ~/.ssh/aws-ssh-key -i ./instance_data/us-west-2.instance_data_master.json buildbot-master122
aws_create_instance -c configs/buildbot-master -r us-west-2 -s aws-releng -k /builds/aws_manager/secrets/aws-secrets.json --ssh-key ~/.ssh/aws-ssh-key -i ./instance_data/us-west-2.instance_data_master.json buildbot-master123
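The four commands differ only in region and hostname, so they can be generated from a small mapping to avoid copy/paste slips. This is a sketch that only prints the command lines; it uses exactly the flags and paths from the commands above, while `MASTERS`, `build_command` and `render_commands` are hypothetical names.

```python
# Host -> AWS region, as listed in the commands above.
MASTERS = {
    "buildbot-master120": "us-east-1",
    "buildbot-master121": "us-east-1",
    "buildbot-master122": "us-west-2",
    "buildbot-master123": "us-west-2",
}


def build_command(host, region):
    """Return the aws_create_instance argument list for one master."""
    return [
        "aws_create_instance",
        "-c", "configs/buildbot-master",
        "-r", region,
        "-s", "aws-releng",
        "-k", "/builds/aws_manager/secrets/aws-secrets.json",
        "--ssh-key", "~/.ssh/aws-ssh-key",
        "-i", "./instance_data/%s.instance_data_master.json" % region,
        host,
    ]


def render_commands(masters=MASTERS):
    """Return one shell command line per master, sorted by hostname."""
    # No argument contains whitespace, so a plain join is enough here;
    # the "~" is left unquoted so the shell can still expand it.
    return [" ".join(build_command(h, r)) for h, r in sorted(masters.items())]


for line in render_commands():
    print(line)
```

Deriving the instance-data path from the region keeps the `-r` and `-i` arguments from drifting apart.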
Assignee | ||
Comment 18•10 years ago
|
||
(In reply to Justin Wood (:Callek) from comment #17)
> In buildduty@aws-manager:/builds/aws_manager/cloud-tools
>
> In 4 separate screen sessions, I'm running each of the following commands
> now:
For some reason, despite these being the commands in dustin's gist, this created CentOS 6.2 masters. So I'm going to wait for him to reply to an e-mail so I can move forward here.
Assignee | ||
Comment 19•10 years ago
|
||
The patch is straightforward, so I'll use this comment to double as a status point.
----------
So the issue with the initial bringup was that git was not found when trying to pull updates for cloud tools, so the fact that we had updated cloud tools to use 6.5 didn't propagate.
With the new version, an install, and an extra puppet run, I verified the kernel with uname -r (2.6.32-504.3.3.el6.x86_64) and that there was no REBOOT_REQUIRED in motd! \o/
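The same two post-install checks can be expressed as a small function, which is handy when repeating them on all four masters. This is a sketch only: `verify_host` is a hypothetical name, and reading the message of the day from /etc/motd is an assumption about where it lives on these hosts.

```python
def verify_host(kernel_release, motd_text, expected_kernel):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if kernel_release != expected_kernel:
        problems.append(
            "kernel is %s, expected %s" % (kernel_release, expected_kernel)
        )
    if "REBOOT_REQUIRED" in motd_text:
        problems.append("motd reports REBOOT_REQUIRED")
    return problems


# Example use on a master (the /etc/motd path is an assumption):
#   import platform
#   with open("/etc/motd") as f:
#       print(verify_host(platform.release(), f.read(),
#                         "2.6.32-504.3.3.el6.x86_64"))
```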
Attachment #8563258 -
Flags: review?(rail)
Assignee | ||
Updated•10 years ago
|
Attachment #8562586 -
Flags: feedback?(dustin)
Comment 20•10 years ago
|
||
Comment on attachment 8563258 [details] [diff] [review]
[puppet] add to known_hosts
Review of attachment 8563258 [details] [diff] [review]:
-----------------------------------------------------------------
Adding tests masters to this file is not necessary. release-runner won't touch tests masters for a release reconfig. We don't have other tests masters in this file, so let's avoid possible future confusion and not add them.
Attachment #8563258 -
Flags: review?(rail) → review-
Updated•10 years ago
|
Attachment #8563171 -
Flags: review?(rail) → review+
Reporter | ||
Comment 21•10 years ago
|
||
(In reply to Pete Moore [:pete][:pmoore] from comment #6)
> This looks like a bad case of bug 712244 which suggests the problem is not
> too many slaves per master, but instead, too many builders per slave.
>
> For reference, found some docs here:
> http://twistedmatrix.com/documents/current/core/howto/pb.html
>
> Maybe time to either jacuzzi up linux64 test slaves, or create some more
> generic builders...
I noticed that new test masters are getting added in this bug, but based on comment 6, I'm not sure this alone will help. Do we have plans to reduce the number of builders too?
Comment 22•10 years ago
|
||
Comment on attachment 8562572 [details] [diff] [review]
[puppet] moar_masters
http://hg.mozilla.org/build/puppet/rev/312875020520
Attachment #8562572 -
Flags: checked-in+
Comment 23•10 years ago
|
||
(In reply to Rail Aliiev [:rail] from comment #20)
> Adding tests masters to this file is not necessary. release-runner won't
> touch tests masters for a release reconfig. We don't have other tests
> masters in this file, so let's avoid possible future confusion and not add
> them.
Here's a patch to remove the tests masters that have already crept into the file.
Attachment #8563258 -
Attachment is obsolete: true
Attachment #8564157 -
Flags: review?(rail)
Comment 24•10 years ago
|
||
Comment on attachment 8564157 [details] [diff] [review]
[puppet] Remove tests masters from known_hosts
Thanks for the clean up!
Attachment #8564157 -
Flags: review?(rail) → review+
Assignee | ||
Comment 25•10 years ago
|
||
landed a patch to enable these and enabled in slavealloc (with jlund's IRC ok)
http://hg.mozilla.org/build/tools/rev/e116e4c666ec
Comment 26•10 years ago
|
||
Comment on attachment 8564157 [details] [diff] [review]
[puppet] Remove tests masters from known_hosts
Review of attachment 8564157 [details] [diff] [review]:
-----------------------------------------------------------------
https://hg.mozilla.org/build/puppet/rev/371921ca143a
Attachment #8564157 -
Flags: checked-in+
Assignee | ||
Comment 27•10 years ago
|
||
I'm resolving this bug, we alleviated the need of the bug it blocks by adding more masters.
There are still the "many builds" issues here, but that's well known as an issue and should be alleviated by the work we're doing with taskcluster and the general desire for a "generic" builder that can eliminate many of these.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Summary: Reconfigs of linux64 test masters takes an age → Add yet some more linux64 masters
Comment 28•10 years ago
|
||
We have 8 each in use1 and usw2, but you'd think it's 10 & 6 if you look at datacentre in production-masters.json.
Also, the work on bm69 was never finished because it hasn't been processing any jobs. See comment #3 for some things that needed doing. No slaves connected so I suspect the slavealloc config.
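A copy/paste slip like this can be caught by comparing the region token in each hostname with the datacentre field. The sketch below assumes production-masters.json is a list of objects with `hostname` and `datacentre` keys, and that hostnames carry a `.use1.` or `.usw2.` token as in the hostnames quoted in this bug. The `region_to_datacentre` mapping is a placeholder: the actual datacentre values in the file are not shown here, so fill it in from the real data.

```python
import re
from collections import Counter

# Region token as it appears in hostnames such as
# buildbot-master69.srv.releng.use1.mozilla.com
REGION_RE = re.compile(r"\.(use1|usw2)\.")


def datacentre_report(masters, region_to_datacentre):
    """Return (count per datacentre, hostnames whose datacentre looks wrong)."""
    counts = Counter(m["datacentre"] for m in masters)
    mismatched = []
    for m in masters:
        match = REGION_RE.search(m["hostname"])
        if not match:
            continue  # not an AWS hostname we know how to check
        expected = region_to_datacentre[match.group(1)]
        if m["datacentre"] != expected:
            mismatched.append(m["hostname"])
    return counts, mismatched


# Example use (file name from this comment; mapping values are placeholders):
#   import json
#   with open("production-masters.json") as f:
#       masters = json.load(f)
#   print(datacentre_report(masters, {"use1": "use1", "usw2": "usw2"}))
```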
Attachment #8578274 -
Flags: review?(bugspam.Callek)
Assignee | ||
Comment 29•10 years ago
|
||
Comment on attachment 8578274 [details] [diff] [review]
[tools] Fix datacentre on bm122 & 123
Review of attachment 8578274 [details] [diff] [review]:
-----------------------------------------------------------------
Whoops, "missed one field" in my copy/pastes.
Attachment #8578274 -
Flags: review?(bugspam.Callek) → review+
Comment 30•10 years ago
|
||
Comment on attachment 8578274 [details] [diff] [review]
[tools] Fix datacentre on bm122 & 123
https://hg.mozilla.org/build/tools/rev/eeca381bac0d
Attachment #8578274 -
Flags: checked-in+
Comment 31•10 years ago
|
||
Master 69 never actually got put into production, and I've just shut it down.
Updated•7 years ago
|
Component: General Automation → General