Closed Bug 712244 Opened 13 years ago Closed 12 years ago

increase or work around hitting MAX_BROKER_REFS on test masters (too many builders per slave problem)

Categories

(Release Engineering :: General, defect, P3)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: mozilla)

References

Details

(Whiteboard: [buildmasters][capacity])

Attachments

(5 files)

Seen on talos-r3-xp-037 trying to talk to buildbot-master16:

2011-12-19 11:19:51-0800 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
2011-12-19 11:19:51-0800 [Broker,client] While trying to connect:
        Traceback from remote host -- Traceback (most recent call last):
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/defer.py", line 363, in unpause
            self._runCallbacks()
...
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/spread/pb.py", line 664, in registerReference
            raise Error("Maximum PB reference count exceeded.  "
        twisted.spread.pb.Error: Maximum PB reference count exceeded.  Goodbye.

It connected fine after a reboot, so perhaps it tried to hit the master during a reconfig?
Also seen on talos-r3-xp-039 at:
2011-12-19 11:18:52-0800 [Broker,client] While trying to connect:
And on talos-r3-xp-057, slightly older, with the full stack:

2011-12-16 18:22:01-0800 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
2011-12-16 18:22:01-0800 [Broker,client] While trying to connect:
        Traceback from remote host -- Traceback (most recent call last):
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/defer.py", line 363, in unpause
            self._runCallbacks()
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
            self.result = callback(self.result, *args, **kw)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/defer.py", line 397, in _continue
            self.unpause()
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/defer.py", line 363, in unpause
            self._runCallbacks()
        --- <exception caught here> ---
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
            self.result = callback(self.result, *args, **kw)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/spread/pb.py", line 763, in serialize
            return jelly(object, self.security, None, self)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/spread/jelly.py", line 1122, in jelly
            return _Jellier(taster, persistentStore, invoker).jelly(object)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/spread/jelly.py", line 475, in jelly
            return obj.jellyFor(self)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/spread/flavors.py", line 127, in jellyFor
            return "remote", jellier.invoker.registerReference(self)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/spread/pb.py", line 664, in registerReference
            raise Error("Maximum PB reference count exceeded.  "
        twisted.spread.pb.Error: Maximum PB reference count exceeded.  Goodbye.
jhford has a bug to add a couple more masters but it is blocked on IT to open firewalls for DB access (bug 708804).
jhford mentioned those are for tegras. I am asking dustin for more masters.
Also, CPU wio (I/O wait) is happening on that master, which is not surprising.
Depends on: 712398
found in triage:

(In reply to Armen Zambrano G. [:armenzg] - (vactions from Dec. 24th & back on Jan. 9th) from comment #3)
> jhford has a bug to add a couple more masters but it is blocked on IT to
> open firewalls for DB access (bug 708804).

bug#708804 now done.

...and so too is bug#712398.
Once buildbot-master21 is created, we should use our scripts to set it up as a Windows test master, enable it in slavealloc, and let it start grabbing Windows slaves from buildbot-master16.
OS: Mac OS X → Linux
Priority: -- → P3
Summary: Too many builders on buildbot-master16 ? → Master setup for buildbot-master21
Whiteboard: [buildmasters][capacity][buildduty]
I'll set it up in here.
Assignee: nobody → armenzg
Priority: P3 → P2
lsblakk, catlee mentioned that you're setting up some masters.
Do you have instructions on how to? I would like to try one myself.
Thanks!
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #8)
> lsblakk, catlee mentioned that you're setting up some masters.
> Do you have instructions on how to? I would like to try one myself.
> Thanks!

Armen, I just set up buildbot-master{22..27} following https://wiki.mozilla.org/ReleaseEngineering/Master_Setup and have added a few notes where I found it necessary. I have finished everything on my end and am just trying to figure out the best practice for filing the required IT bugs. I will make notes in that wiki page with whatever I discover.
If buildbot-master16 is already Windows-only (win32+win64) and there really are too many builders, is there a plan to split win32 vs win64? Simply adding another master doesn't seem like it'll help at all.
If we're hitting the PB limit, it's a problem with number-of-builders-per-slave, not number-of-slaves-per-master. So we need to reduce number of builders and/or bump the PB limit.
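To illustrate why adding masters does not help here, a toy model follows. It is only a sketch: `ToyBroker` and `attach_slave` are hypothetical names, not Twisted's real `Broker`, and the exact comparison Twisted uses may differ. The one assumption it encodes is that the reference cap applies to each master-to-slave connection separately.

```python
class ToyBroker:
    """Toy stand-in for one master<->slave connection (not Twisted's Broker)."""

    def __init__(self, max_refs):
        self.max_refs = max_refs
        self.local_objects = {}

    def register_reference(self, obj):
        # Each distinct object handed to the remote side uses one slot.
        key = id(obj)
        if key not in self.local_objects:
            if len(self.local_objects) >= self.max_refs:
                raise RuntimeError("Maximum PB reference count exceeded.")
            self.local_objects[key] = obj
        return key


def attach_slave(builder_count, max_refs):
    """Return True if one slave with `builder_count` builders fits under the cap."""
    broker = ToyBroker(max_refs)  # one broker per connection
    builders = [object() for _ in range(builder_count)]
    try:
        for builder in builders:
            broker.register_reference(builder)
    except RuntimeError:
        return False
    return True


# Fifty slaves with 10 builders each all fit under a cap of 10 ...
print(all(attach_slave(10, 10) for _ in range(50)))
# ... but a single slave with 11 builders does not.
print(attach_slave(11, 10))
```

In this model the number of slaves never matters; only the per-slave builder count does, which matches the reasoning in the comment above.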
Should we split the win64 slaves into a separate master, or increase the PB limit?
I think that the limit to the reference count in Twisted is actually reasonable, so it may be time to look for ways to have fewer builders defined. That said, you can monkey-patch it. Details are in the Twisted bug (#2045).
Does this happen when we do a reconfig that adds more builders, and then have to back out?

I just don't know when this problem happens and how often.
No longer depends on: 712398
I will leave this bug open for someone else to determine the way forward.
Assignee: armenzg → nobody
Priority: P2 → --
Summary: Master setup for buildbot-master21 → Too many builders on buildbot-master16 ?
(In reply to Chris AtLee [:catlee] from comment #11)
> If we're hitting the PB limit, it's a problem with
> number-of-builders-per-slave, not number-of-slaves-per-master. So we need to
> reduce number of builders and/or bump the PB limit.

* Do we actually know how many builders we have per slave? 
* Does the new PB ref limit need to be a power-of-2?
* Do we actually have a way to reduce the # of builders per slave? I assume this is a direct result of having (many project branches) x (many tests split into smaller parts)
Priority: -- → P3
It doesn't need to be a power of two.
(In reply to Chris Cooper [:coop] from comment #17)
> (In reply to Chris AtLee [:catlee] from comment #11)
> > If we're hitting the PB limit, it's a problem with
> > number-of-builders-per-slave, not number-of-slaves-per-master. So we need to
> > reduce number of builders and/or bump the PB limit.
> 
> * Do we actually know how many builders we have per slave? 

You can count it locally by editing master.cfg for a test master and running checkconfig.
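A rough sketch of such a count, assuming each builder entry in the master config is a dict with a `slavenames` list (the sample data and the `builders_per_slave` helper are hypothetical; a real master.cfg would supply its own builder list before checkconfig runs):

```python
from collections import Counter


def builders_per_slave(builders):
    """Count how many builders list each slave name."""
    counts = Counter()
    for builder in builders:
        # set() guards against a slave being listed twice on one builder.
        for slave in set(builder["slavenames"]):
            counts[slave] += 1
    return counts


# Hypothetical sample; in master.cfg this would be the real builder list.
sample_builders = [
    {"name": "suite-a", "slavenames": ["slave-1", "slave-2"]},
    {"name": "suite-b", "slavenames": ["slave-1"]},
    {"name": "suite-c", "slavenames": ["slave-1", "slave-2", "slave-3"]},
]

counts = builders_per_slave(sample_builders)
for slave, count in counts.most_common():
    print(slave, count)
```

Printing `most_common()` puts the most heavily loaded slave first, which is the number that matters for the PB limit.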

> * Does the new PB ref limit need to be a power-of-2?
> * Do we actually have a way to reduce the # of builders per slave? I assume
> this is a direct result of having (many project branches) x (many tests
> split into smaller parts)

We'd have to consolidate builders e.g. doing all the different mochitest suites in the same builder, relying on runtime information to run the right suite. This also means tbpl would need updating since it looks at the builder name to determine what type of job something is.
Updating summary.
Summary: Too many builders on buildbot-master16 ? → increase or work around hitting MAX_BROKER_REFS on test masters (too many builders per slave problem)
Component: Release Engineering → Release Engineering: Automation
Priority: P3 → --
QA Contact: release → catlee
Depends on: 731814
Priority: -- → P2
Blocks: 669428
Not sure why this is marked as [buildduty]. This seems like more work than buildduty could hope to tackle in a given week in addition to everything else.
Priority: P2 → P3
Whiteboard: [buildmasters][capacity][buildduty] → [buildmasters][capacity]
Blocks: 698843
Unblocking bug 698843: with Thunderbird builders, the highest usage is on talos-r3-fed-076, which has 981 builders; the limit is 1012, so that is 96 percent of max. Still, we have very little wiggle room for new builders.
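The headroom arithmetic can be sketched as below. `ref_usage` is a hypothetical helper; the second call plugs in 2048 (the value proposed later in this bug) directly as the limit, which is an assumption, since the effective limit may sit slightly below the raw setting.

```python
def ref_usage(builders, limit):
    """Return (whole percent of limit used, builders of headroom left)."""
    percent = builders * 100 // limit  # floor, so 96.9% reports as 96
    headroom = limit - builders
    return percent, headroom


print(ref_usage(981, 1012))  # (96, 31)
print(ref_usage(981, 2048))  # (47, 1067)
```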
No longer blocks: 698843
Depends on: 754517
Dustin:

I'm guessing here, but I think this will make slavealloc give all slavealloc-enabled slaves a MAX_BROKER_REFS of 2048.

Does it look right to you?
Attachment #623878 - Flags: review?(dustin)
This patch hacks the production-0.8 copy of hg.m.o/build/buildbot to have a [larger] hardcoded MAX_BROKER_REFS.

AIUI, this doesn't affect any existing masters, but will affect any new masters.
Attachment #623884 - Flags: review?(catlee)
Blocks: 754429
Comment on attachment 623878 [details] [diff] [review]
update slavealloc's buildbot.tac template

looks right to me, untested
Attachment #623878 - Flags: review?(dustin) → review+
Attachment #623884 - Flags: review?(catlee) → review+
Did a quick'n'dirty test:

1. Started up buildmaster with

  import twisted.spread.pb
  twisted.spread.pb.MAX_BROKER_REFS = 2048

in the master's buildbot.tac.  Connected a linux64 slave with no MAX_BROKER_REFS update.  Forced a build; the slave picked it up and started building.  Killed the job.


2. Stopped the master + slave.  Commented out the MAX_BROKER_REFS lines on the master; added those lines to the slave.  Restarted both, forced a build.  The slave picked it up and started building.
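The same override can be wrapped in a small helper so it never lowers an existing value. This is only a sketch: `raise_broker_ref_limit` is a hypothetical name, and the example runs against a stand-in namespace (with 1024 as a placeholder starting value) so it works without Twisted installed. In a real buildbot.tac the argument would be the `twisted.spread.pb` module.

```python
import types


def raise_broker_ref_limit(pb_module, new_limit):
    """Raise pb_module.MAX_BROKER_REFS to new_limit; return the old value.

    The limit is only ever increased, so applying this twice is harmless.
    """
    old_limit = pb_module.MAX_BROKER_REFS
    if new_limit > old_limit:
        pb_module.MAX_BROKER_REFS = new_limit
    return old_limit


# Stand-in for twisted.spread.pb; the starting value is a placeholder.
fake_pb = types.SimpleNamespace(MAX_BROKER_REFS=1024)

previous = raise_broker_ref_limit(fake_pb, 2048)
print(previous, fake_pb.MAX_BROKER_REFS)

# A later call with a smaller value leaves the limit untouched.
raise_broker_ref_limit(fake_pb, 512)
print(fake_pb.MAX_BROKER_REFS)
```

Setting the module attribute works because the check reads the module-level name at call time, which is consistent with the test above succeeding when the lines were added to either side.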
Comment on attachment 623884 [details] [diff] [review]
hack buildbot/master/buildbot/scripts/runner.py

http://hg.mozilla.org/build/buildbot/rev/082cd6ddcb18

Landed on the production-0.8 branch.
Any new masters created should have this fix.
We still need to land+deploy to slavealloc, and manually update the existing masters.

Catlee: is there anything else I need to do re: the buildbot repo?
Attachment #623884 - Flags: checked-in+
I set up a new test master on dev-master01 via 

make -f Makefile.setup \
MASTER_NAME=preproduction-tests_master \
BASEDIR=/builds/buildbot/aki/test-master \
PYTHON=/usr/bin/python26 \
VIRTUALENV="/usr/bin/python26 /tools/misc-python/virtualenv.py" \
BUILDBOTCUSTOM_BRANCH=default \
BUILDBOTCONFIGS_BRANCH=default \
virtualenv deps install-buildbot master master-makefile

The buildbot.tac had the MAX_BROKER_REFS fix.
When pointed at http://staging-puppet.build.mozilla.org/staging/python-packages/, it dies when trying to find sqlalchemy. Pointing it at repos/python/packages works.
Attachment #624766 - Flags: review?(dustin)
Updated slavealloc with the above script; http://slavealloc.build.mozilla.org/gettac/linux-ix-slave03 gives me a tac with the MAX_BROKER_REFS lines.
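A check like this could be automated by scanning the returned tac text for the override line. The sketch below assumes the tac contains the same two lines used in the earlier manual test; `broker_ref_override` is a hypothetical helper, and fetching the text from slavealloc is left out.

```python
import re

# Matches a line such as: twisted.spread.pb.MAX_BROKER_REFS = 2048
_OVERRIDE = re.compile(
    r"^\s*twisted\.spread\.pb\.MAX_BROKER_REFS\s*=\s*(\d+)\s*$",
    re.MULTILINE,
)


def broker_ref_override(tac_text):
    """Return the MAX_BROKER_REFS value set in tac_text, or None if absent."""
    match = _OVERRIDE.search(tac_text)
    return int(match.group(1)) if match else None


sample_tac = """\
import twisted.spread.pb
twisted.spread.pb.MAX_BROKER_REFS = 2048
"""

print(broker_ref_override(sample_tac))
print(broker_ref_override("# tac without the override\n"))
```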
Comment on attachment 624766 [details] [diff] [review]
fix slavealloc script

This will only work in scl1 and mtv1 right now, but since slavealloc is in scl1, this looks good.
Attachment #624766 - Flags: review?(dustin) → review+
Added the MAX_BROKER_REFS lines to existing enabled masters in production-masters.json.
This shouldn't land until we restart all our test masters, or fix via manhole.
Attachment #624777 - Flags: review?(catlee)
Attachment #624777 - Flags: review?(catlee) → review+
Attachment #627374 - Flags: review?(coop)
Comment on attachment 624777 [details] [diff] [review]
increase builder limit to 2048 in master.cfg

http://hg.mozilla.org/build/buildbot-configs/rev/6da2ccee77a0
Attachment #624777 - Flags: checked-in+
Attachment #627374 - Flags: review?(coop) → review+
Assignee: nobody → aki
All the enabled masters (including bm34) have had this change deployed via the manhole.
Thanks Coop!
-> RESO FIXED
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General