Closed
Bug 832305
Opened 12 years ago
Closed 12 years ago
linux test masters not ready for the jelly
Categories
(Release Engineering :: General, defect)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: bhearsum, Unassigned)
Details
We get tracebacks with jelly in the stack on seemingly every reconfig.
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] --- <exception caught here> ---
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/pb.py:809:_sendMessage
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/pb.py:763:serialize
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/jelly.py:1122:jelly
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/jelly.py:534:jelly
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/jelly.py:588:_jellyIterable
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/jelly.py:475:jelly
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/flavors.py:127:jellyFor
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/pb.py:664:registerReference
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] ]
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] [Failure instance: Traceback: <class 'twisted.spread.pb.Error'>: Maximum PB reference count exceeded. Goodbye.
Reporter | ||
Comment 1•12 years ago
|
||
Hit this again on bm18 today. It's been over an hour and the master is still spewing jelly errors.
Reporter | ||
Comment 2•12 years ago
|
||
(In reply to Ben Hearsum [:bhearsum] from comment #1)
> Hit this again on bm18 today. It's been over an hour and the master is still
> spewing jelly errors.
I ended up giving up on this master and restarted it after 1.5h.
Reporter | ||
Comment 3•12 years ago
|
||
I've been hitting this a lot, and it always ends up with a hard restart (kill -9), burning a bunch of jobs.
Severity: normal → major
Reporter | ||
Comment 4•12 years ago
|
||
I had a master which was spinning on a jelly error today and seemed to get past it. Here's the last part of the log with jelly in it:
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] [Failure instance: Traceback: <class 'twisted.spread.pb.Error'>: Maximum PB reference count exceeded. Goodbye.
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/internet/defer.py:249:addCallbacks
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/internet/defer.py:441:_runCallbacks
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/buildbot-0.8.2_hg_41fc8a9db7a0_production_0.8-py2.6.egg/buildbot/process/builder.py:107:doSetMaster
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/pb.py:328:callRemote
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] --- <exception caught here> ---
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/pb.py:809:_sendMessage
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/pb.py:763:serialize
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/jelly.py:1122:jelly
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/jelly.py:534:jelly
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/jelly.py:588:_jellyIterable
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/jelly.py:475:jelly
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/flavors.py:127:jellyFor
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/pb.py:664:registerReference
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] ]
2013-02-14 07:13:02-0800 [Broker,34002,10.12.50.71] Got slaveinfo from 'talos-r3-fed64-064'
2013-02-14 07:13:02-0800 [-] closing log <buildbot.status.builder.LogFile instance at 0x22a8c908>
2013-02-14 07:13:02-0800 [-] releaseLocks(<buildbotcustom.steps.misc.UnpackFile instance at 0x2e642998>): []
2013-02-14 07:13:02-0800 [-] step 'Unpack build' complete: success
2013-02-14 07:13:02-0800 [-] addCompleteLog(property changes)
2013-02-14 07:13:02-0800 [-] releaseLocks(<buildbotcustom.steps.misc.SetBuildProperty instance at 0x10668ab8>): []
2013-02-14 07:13:02-0800 [-] step 'set build property' complete: success
2013-02-14 07:13:02-0800 [-] addCompleteLog(property changes)
2013-02-14 07:13:02-0800 [-] releaseLocks(<buildbotcustom.steps.misc.SetBuildProperty instance at 0x2acd3998>): []
2013-02-14 07:13:02-0800 [-] step 'set build property_1' complete: success
2013-02-14 07:13:02-0800 [-] ShellCommand.startCommand(cmd=<RemoteShellCommand '['cat', '../firefox/application.ini']'>)
2013-02-14 07:13:02-0800 [-] cmd.args = {'workdir': '../talos-data/talos', 'timeout': 1200, 'env': {'MOZ_CRASHREPORTER_NO_REPORT': '1', 'XPCOM_DEBUG_BREAK': 'warn', 'NO_EM_RESTART': '1', 'MOZ_NO_REMOTE': '1', 'DISPLAY': ':0'}, 'want_stdout': 1, 'usePTY': 'slave-config', 'maxTime': None, 'logEnviron': True, 'want_stderr': 1, 'logfiles': {}}
2013-02-14 07:13:02-0800 [-] <RemoteShellCommand '['cat', '../firefox/application.ini']'>: RemoteCommand.run [519097]
2013-02-14 07:13:02-0800 [-] command '['cat', '../firefox/application.ini']' in dir '../talos-data/talos'
2013-02-14 07:13:02-0800 [-] LoggedRemoteCommand.start
After that it didn't finish the reconfig, but there was _tons_ of forced slave disconnections, like:
2013-02-14 07:19:15-0800 [-] Forcibly disconnecting talos-r3-fed-069
2013-02-14 07:19:15-0800 [-] disconnecting old slave talos-r3-fed-069 now
2013-02-14 07:19:15-0800 [-] waiting for slave to finish disconnecting
2013-02-14 07:19:15-0800 [-] Forcibly disconnecting talos-r3-fed64-020
2013-02-14 07:19:15-0800 [-] disconnecting old slave talos-r3-fed64-020 now
2013-02-14 07:19:15-0800 [-] waiting for slave to finish disconnecting
2013-02-14 07:19:15-0800 [-] Forcibly disconnecting talos-r3-fed64-012
2013-02-14 07:19:15-0800 [-] disconnecting old slave talos-r3-fed64-012 now
2013-02-14 07:19:15-0800 [-] waiting for slave to finish disconnecting
2013-02-14 07:19:15-0800 [-] Forcibly disconnecting talos-r3-fed-064
2013-02-14 07:19:15-0800 [-] disconnecting old slave talos-r3-fed-064 now
2013-02-14 07:19:15-0800 [-] waiting for slave to finish disconnecting
2013-02-14 07:19:15-0800 [-] Forcibly disconnecting talos-r3-fed64-031
2013-02-14 07:19:15-0800 [-] disconnecting old slave talos-r3-fed64-031 now
2013-02-14 07:19:15-0800 [-] waiting for slave to finish disconnecting
Afterwards I waited about 10 minutes for the reconfig to complete, but nothing happened. The twistd.log output looked like a master that was just doing normal work - there was no evidence that the reconfig was in progress. I've had some success fixing masters like this by starting a 2nd reconfig, so I tried that. This reconfig hit jelly errors too. I had to hard restart the master.
I assume this is because the master spent so long in a "hung" state that the slaves tried to reconnect or something. It's probably also why the master didn't finish the reconfig immediately after getting past the jelly error.
Comment 5•12 years ago
|
||
This seems related to http://trac.buildbot.net/ticket/2045 and https://bugzil.la/712244
Reporter | ||
Comment 6•12 years ago
|
||
(In reply to tom.prince from comment #5)
> This seems related to http://trac.buildbot.net/ticket/2045 and
> https://bugzil.la/712244
Yeah...we doubled our builder limit in that bug and even added a check to make sure we don't go over it in bug 731814. This bug presents itself a little different in that sometimes the masters manage to reconfig, and sometimes they don't.
Comment 7•12 years ago
|
||
I think reconfigs can cause extra PB refs to be created while builders and builds are migrated over between the old instances and the new ones. Depending on how quickly this happens, perhaps this can result in the master or slave hitting the max refs? We're over 50% of max refs for several types of slaves. Are those the ones that are generally hitting the problem?
Reporter | ||
Comment 8•12 years ago
|
||
(In reply to Chris AtLee [:catlee] from comment #7)
> I think reconfigs can cause extra PB refs to be created while builders and
> builds are migrated over between the old instances and the new ones.
> Depending on how quickly this happens, perhaps this can result in the master
> or slave hitting the max refs? We're over 50% of max refs for several types
> of slaves. Are those the ones that are generally hitting the problem?
It's hard to say exactly for sure (I don't know which specific slaves are having issues), but there's a strong correlation. Here's a sampling of slaves that have more than half of the max pb refs:
tegra-071 1070
talos-mtnlion-r5-017 1071
talos-r4-snow-021 1127
talos-r4-lion-073 1127
talos-r3-fed64-026 1394
talos-r3-xp-071 1519
talos-r3-w7-029 1587
talos-r3-fed-081 1834
So, clearly Linux in the closest to the limit.
Comment 9•12 years ago
|
||
I'm hitting this with bm24-tests1-linux.
Comment 10•12 years ago
|
||
Working ok for now.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Assignee | ||
Updated•12 years ago
|
Product: mozilla.org → Release Engineering
Assignee | ||
Updated•7 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•