Closed Bug 1090139 Opened 10 years ago Closed 9 years ago

Add yet some more linux64 masters

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: Callek)

References

Details

Attachments

(7 files, 1 obsolete file)

By far the slowest buildbot masters to reconfig are the linux64 test masters. At the moment we do not really know what the cause is, but they appear to be orders of magnitude slower at reconfig'ing than their brothers and sisters.

We should also try to understand the root causes of the traceback exceptions we often get for them during reconfigs. A deep dive into the internals of the reconfig process is needed to understand what is going on, what is causing the tracebacks, and why it takes so long.
Component: Buildduty → General Automation
QA Contact: bugspam.Callek → catlee
Blocks: 1078260
Blocks: 1090568
No longer blocks: 1078260
Rail is trying to add more slaves in bug 1090568, so let's get some more masters set up to spread out the potential load. 2 masters in each of usw1 and use1 should suffice.
Nick suggests repurposing buildbot-master69.srv.releng.use1.mozilla.com - a Windows test master originally set up by jhopkins in use1 which is disabled in production-masters.json and no longer used/needed.
See Also: → 1021086
I've repurposed buildbot-master69 in this patch to be a linux64 test master in use1, rather than a windows test master.

I've left it disabled for now.

We will also need to update slavealloc, fix up nagios reporting, and reimage the machine; when it is live and enabled, we will probably want to manually disable and then re-enable some slaves to balance out the load (or maybe they will automatically pull down a new buildbot.tac when they next reboot, which might be soon enough?).
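(For reference, a slave-side buildbot.tac looks roughly like the sketch below — assuming the stock buildslave 0.8 layout; the hostname, PB port, slave name, and password are illustrative placeholders, not real values. The master host/port is baked into this file, which is why pulling down a fresh buildbot.tac on reboot is enough to move a slave to a different master.)

from twisted.application import service
from buildslave.bot import BuildSlave

basedir = '/builds/slave'
host = 'buildbot-master69.srv.releng.use1.mozilla.com'
port = 9201  # PB port; assumed for illustration - check production-masters.json
slavename = 'tst-linux64-ec2-0001'  # hypothetical slave name
passwd = 'REDACTED'

application = service.Application('buildslave')
# positional args after basedir are keepalive and usePty
BuildSlave(host, port, slavename, passwd, basedir,
           None, False).setServiceParent(application)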

Then of course, we still need to set up three more test masters - one more in use1, and then two in usw1.
Attachment #8516557 - Flags: review?(coop)
Comment on attachment 8516557 [details] [diff] [review]
bug1090139_tools_repurpose_bbm69_v1.patch

Review of attachment 8516557 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. Can I ask you to make sure the old Windows master dirs are removed on bm69 when this lands?
Attachment #8516557 - Flags: review?(coop) → review+
(In reply to Pete Moore [:pete][:pmoore] from comment #3)
> Created attachment 8516557 [details] [diff] [review]
> bug1090139_tools_repurpose_bbm69_v1.patch
> 
> We will also need to update slavealloc, fix up nagios reporting, reimage
> machine, and when it is live and enabled, we will probably want to manually

Careful if you choose to reimage: if the IP changes we *need* a firewall flow change bug on file, otherwise we'll see intermittent failures, and possible flow issues in the future (when adding/changing flow needs).

Should you need to reimage and get the IP changed anyway, I would actually suggest *.bb.* rather than its existing fqdn/vlan, since that is what we want to standardize on anyway and is negligible extra work.
So I took a look at buildbot-master52 today, since a reconfig just lasts forever on it.

The logs scroll by very quickly (buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux64/master/twistd.log) with:

2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] [Failure instance: Traceback: <class 'twisted.spread.pb.Error'>: Maximum PB reference count exceeded.  Goodbye.
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/internet/defer.py:249:addCallbacks
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/internet/defer.py:441:_runCallbacks
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/buildbot-0.8.2_hg_3ce9eb030a5f_production_0.8-py2.7.egg/buildbot/process/builder.py:107:doSetMaster
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/pb.py:328:callRemote
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] --- <exception caught here> ---
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/pb.py:809:_sendMessage
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/pb.py:763:serialize
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/jelly.py:1122:jelly
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/jelly.py:534:jelly
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/jelly.py:588:_jellyIterable
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/jelly.py:475:jelly
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/flavors.py:127:jellyFor
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] /builds/buildbot/tests1-linux64/lib/python2.7/site-packages/twisted/spread/pb.py:664:registerReference
2014-11-07 05:38:49-0800 [Broker,11994,10.134.156.13] ]
[... the same traceback repeats back-to-back throughout the log ...]


This looks like a bad case of bug 712244, which suggests the problem is not too many slaves per master but rather too many builders per slave.

For reference, found some docs here:
http://twistedmatrix.com/documents/current/core/howto/pb.html

Maybe it's time to either jacuzzi up the linux64 test slaves, or create some more generic builders...
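(Rough arithmetic on why it's builders per slave rather than slaves per master: Twisted PB caps the number of live remote references per broker connection at MAX_BROKER_REFS — 1024 in the Twisted releases of this era — and every builder a slave is attached to registers at least one reference on that slave's broker, which is what the registerReference frame in the traceback above is doing. A hedged back-of-envelope sketch, with hypothetical counts:)

from twisted.spread import pb

builders_per_slave = 1500  # hypothetical count for a linux64 test slave
refs_per_builder = 1       # assume >= 1 remote reference per attached builder

if builders_per_slave * refs_per_builder > pb.MAX_BROKER_REFS:
    print('reconfig would trip "Maximum PB reference count exceeded"')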
Unfortunately I can't even hit the web interface to gracefully shut down:
http://buildbot-master52.srv.releng.use1.mozilla.com:8201
The last time we added AWS Linux test masters was bug 1035863, fwiw.

I'm going to be adding 2 masters per region

bm120,bm121 for use1
bm122,bm123 for usw2
Assignee: nobody → bugspam.Callek
Status: NEW → ASSIGNED
Attachment #8562569 - Flags: review?(rail)
Attachment #8562572 - Flags: review?(rail)
[while I'm at it, I notice some SYS entries from :dustin for new .bb. stuff that are incorrect (specifying use instead of usw in informative columns)] so f? on him for that tidbit as well.
Attachment #8562586 - Flags: review?(rail)
Attachment #8562586 - Flags: feedback?(dustin)
Comment on attachment 8562586 [details]
[bash commands] add masters to inventory

LGTM. One thing to watch out for is that free_ips.py doesn't reserve IPs, so there is a chance of getting a dupe. Better to either validate, or use -n $number_of_ips and then deal with them.
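(A hedged sketch of that validate-then-use flow; the -n flag is from the comment above, but everything else about the invocation is an assumption — check free_ips.py's actual arguments in cloud-tools first.)

import subprocess

wanted = 4  # bm120, bm121 in use1; bm122, bm123 in usw2
out = subprocess.check_output(['python', 'free_ips.py', '-n', str(wanted)])
ips = out.split()
if len(set(ips)) != len(ips):
    raise RuntimeError('free_ips.py handed back a duplicate IP; validate by hand')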
Attachment #8562586 - Flags: review?(rail) → review+
Attachment #8562572 - Flags: review?(rail) → review+
Comment on attachment 8562569 [details] [diff] [review]
[tools] moar_masters

stamp
Attachment #8562569 - Flags: review?(rail) → review+
Thanks for taking care of this bug. There is a doc that you may want to update with the invtool commands: https://wiki.mozilla.org/ReleaseEngineering/AWS_Master_Setup.
Callek, what are you seeing exactly?
Flags: needinfo?(bugspam.Callek)
Attached file [raw] slavealloc csv
This is being used right now, with post-run review to do the following:

root@relengwebadm.private.scl3

cd /data/releng/www/slavealloc
set +o history # turn off history so I can enter private data
slavealloc_server=mysql://user:pass@server/DB_NAME   #from gpg secrets
set -o history
/data/releng/www/slavealloc/slavealloc dbimport -D $slavealloc_server \
    --master-data ./import-b1090139.csv
slavealloc_server=wiped
Attachment #8563171 - Flags: review?(rail)
(In reply to Dustin J. Mitchell [:dustin] from comment #14)
> Callek, what are you seeing exactly?

Answered in IRC, but to clarify here as well: the SYS entries in inventory had the wrong --rack-pk for one region's hosts; looks like it's fixed now.
Flags: needinfo?(bugspam.Callek)
In buildduty@aws-manager:/builds/aws_manager/cloud-tools

In 4 separate screen sessions, I'm running each of the following commands now:

aws_create_instance -c configs/buildbot-master -r us-east-1 -s aws-releng -k /builds/aws_manager/secrets/aws-secrets.json --ssh-key ~/.ssh/aws-ssh-key -i ./instance_data/us-east-1.instance_data_master.json buildbot-master120
aws_create_instance -c configs/buildbot-master -r us-east-1 -s aws-releng -k /builds/aws_manager/secrets/aws-secrets.json --ssh-key ~/.ssh/aws-ssh-key -i ./instance_data/us-east-1.instance_data_master.json buildbot-master121
aws_create_instance -c configs/buildbot-master -r us-west-2 -s aws-releng -k /builds/aws_manager/secrets/aws-secrets.json --ssh-key ~/.ssh/aws-ssh-key -i ./instance_data/us-west-2.instance_data_master.json buildbot-master122
aws_create_instance -c configs/buildbot-master -r us-west-2 -s aws-releng -k /builds/aws_manager/secrets/aws-secrets.json --ssh-key ~/.ssh/aws-ssh-key -i ./instance_data/us-west-2.instance_data_master.json buildbot-master123
(In reply to Justin Wood (:Callek) from comment #17)
> In buildduty@aws-manager:/builds/aws_manager/cloud-tools
> 
> In 4 seperate screen sessions, I'm running each of the following commands
> now:

For some reason, despite being the commands in dustin's gist, this created CentOS 6.2 masters. So I'm going to wait for him to reply to an e-mail so I can move forward here.
Attached patch [puppet] add to known_hosts (obsolete) — Splinter Review
The patch is straightforward, so I'll use this comment to double as a status point.
----------

So the issue with the initial bringup was that git was not found when trying to pull updates from cloud-tools, so the fact that we had updated cloud-tools to use 6.5 didn't propagate.

With the new version installed, and an extra puppet run, I verified the kernel with uname -r (2.6.32-504.3.3.el6.x86_64) and that there was no REBOOT_REQUIRED in the motd! \o/
Attachment #8563258 - Flags: review?(rail)
Attachment #8562586 - Flags: feedback?(dustin)
Comment on attachment 8563258 [details] [diff] [review]
[puppet] add to known_hosts

Review of attachment 8563258 [details] [diff] [review]:
-----------------------------------------------------------------

Adding tests masters to this file is not necessary. release-runner won't touch tests masters for a release reconfig. We don't have other tests masters in this file, so let's avoid possible future confusion and not add them.
Attachment #8563258 - Flags: review?(rail) → review-
Attachment #8563171 - Flags: review?(rail) → review+
(In reply to Pete Moore [:pete][:pmoore] from comment #6)

> This looks like a bad case of bug 712244 which suggests the problem is not
> too many slaves per master, but instead, too many builders per slave.
> 
> For reference, found some docs here:
> http://twistedmatrix.com/documents/current/core/howto/pb.html
> 
> Maybe time to either jacuzzi up linux64 test slaves, or do create some more
> generic builders...

I noticed that new test masters are getting added in this bug, but based on comment 6, I'm not sure this alone will help. Do we have plans to reduce the number of builders too?
(In reply to Rail Aliiev [:rail] from comment #20)
> Adding tests masters to this file is not necessary. release-runner won't
> touch tests masters for a release reconfig. We don't have other tests
> masters in this files, so let's avoid possible future confusion and not add
> them.

Here's a patch to remove the tests masters that have already crept into the file.
Attachment #8563258 - Attachment is obsolete: true
Attachment #8564157 - Flags: review?(rail)
Comment on attachment 8564157 [details] [diff] [review]
[puppet] Remove tests masters from known_hosts

Thanks for the clean up!
Attachment #8564157 - Flags: review?(rail) → review+
Landed a patch to enable these, and enabled them in slavealloc (with jlund's IRC OK):

http://hg.mozilla.org/build/tools/rev/e116e4c666ec
Comment on attachment 8564157 [details] [diff] [review]
[puppet] Remove tests masters from known_hosts

Review of attachment 8564157 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/puppet/rev/371921ca143a
Attachment #8564157 - Flags: checked-in+
I'm resolving this bug; we addressed the need of the bug it blocks by adding more masters.

There is still the "too many builders" issue here, but that's well known and should be alleviated by the work we're doing with taskcluster, and by the general desire for a "generic" builder that can eliminate many of these.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Summary: Reconfigs of linux64 test masters takes an age → Add yet some more linux64 masters
We have 8 masters in each of use1 and usw2, but you'd think it's 10 & 6 if you look at the datacentre field in production-masters.json.

Also, the work on bm69 was apparently never finished, because it hasn't been processing any jobs. See comment #3 for some of the things that needed doing. No slaves are connected, so I suspect the slavealloc config.
Attachment #8578274 - Flags: review?(bugspam.Callek)
Comment on attachment 8578274 [details] [diff] [review]
[tools] Fix datacentre on bm122 & 123

Review of attachment 8578274 [details] [diff] [review]:
-----------------------------------------------------------------

whoops, "missed one field" in my copy/pastes
Attachment #8578274 - Flags: review?(bugspam.Callek) → review+
Master 69 never actually got put into production, and I've just shut it down.
Component: General Automation → General