Bug 832305 (Closed) - Opened 12 years ago, Closed 12 years ago

linux test masters not ready for the jelly

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: bhearsum, Unassigned)

Details

We get tracebacks with jelly in the stack on seemingly every reconfig:

2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] --- <exception caught here> ---
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/pb.py:809:_sendMessage
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/pb.py:763:serialize
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/jelly.py:1122:jelly
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/jelly.py:534:jelly
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/jelly.py:588:_jellyIterable
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/jelly.py:475:jelly
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/flavors.py:127:jellyFor
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/pb.py:664:registerReference
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] ]
2013-01-18 06:45:58-0800 [Broker,89041,10.12.50.63] [Failure instance: Traceback: <class 'twisted.spread.pb.Error'>: Maximum PB reference count exceeded. Goodbye.
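For context on the error at the bottom of that stack: Twisted PB keeps a per-connection table of every local object it has handed a remote reference to, and refuses to serialize anything new once that table is full. Below is a minimal Python sketch of that bookkeeping, for illustration only: TinyBroker and the constant value are not the actual twisted.spread.pb source, and the limit running on our masters may differ.

# Illustrative sketch only -- not the real twisted.spread.pb implementation.
# PB assigns each locally referenced object a "local unique id" (luid) per
# connection; once the table fills up, registerReference (the last frame in
# the traceback above) refuses to hand out new references.
MAX_BROKER_REFS = 1024  # Twisted's stock default; the value on our masters may differ

class TinyBroker(object):
    def __init__(self):
        self.luids = {}          # process-unique id -> luid
        self.localObjects = {}   # luid -> object kept alive for the remote side
        self.currentLocalID = 0

    def registerReference(self, obj):
        puid = id(obj)
        luid = self.luids.get(puid)
        if luid is None:
            if len(self.localObjects) >= MAX_BROKER_REFS:
                # Once a connection crosses the cap, every further attempt to
                # jelly a new reference fails the same way, which is why the
                # log fills with identical tracebacks until a restart.
                raise RuntimeError("Maximum PB reference count exceeded. Goodbye.")
            self.currentLocalID += 1
            luid = self.currentLocalID
            self.luids[puid] = luid
            self.localObjects[luid] = obj
        return luid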
Hit this again on bm18 today. It's been over an hour and the master is still spewing jelly errors.
(In reply to Ben Hearsum [:bhearsum] from comment #1)
> Hit this again on bm18 today. It's been over an hour and the master is still
> spewing jelly errors.

I ended up giving up on this master and restarted it after 1.5h.
I've been hitting this a lot, and it always ends up with a hard restart (kill -9), burning a bunch of jobs.
Severity: normal → major
I had a master which was spinning on a jelly error today and seemed to get past it. Here's the last part of the log with jelly in it:

2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] [Failure instance: Traceback: <class 'twisted.spread.pb.Error'>: Maximum PB reference count exceeded. Goodbye.
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/internet/defer.py:249:addCallbacks
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/internet/defer.py:441:_runCallbacks
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/buildbot-0.8.2_hg_41fc8a9db7a0_production_0.8-py2.6.egg/buildbot/process/builder.py:107:doSetMaster
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/pb.py:328:callRemote
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] --- <exception caught here> ---
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/pb.py:809:_sendMessage
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/pb.py:763:serialize
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/jelly.py:1122:jelly
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/jelly.py:534:jelly
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/jelly.py:588:_jellyIterable
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/jelly.py:475:jelly
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/flavors.py:127:jellyFor
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] /builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/spread/pb.py:664:registerReference
2013-02-14 07:13:02-0800 [Broker,34000,10.12.50.237] ]
2013-02-14 07:13:02-0800 [Broker,34002,10.12.50.71] Got slaveinfo from 'talos-r3-fed64-064'
2013-02-14 07:13:02-0800 [-] closing log <buildbot.status.builder.LogFile instance at 0x22a8c908>
2013-02-14 07:13:02-0800 [-] releaseLocks(<buildbotcustom.steps.misc.UnpackFile instance at 0x2e642998>): []
2013-02-14 07:13:02-0800 [-] step 'Unpack build' complete: success
2013-02-14 07:13:02-0800 [-] addCompleteLog(property changes)
2013-02-14 07:13:02-0800 [-] releaseLocks(<buildbotcustom.steps.misc.SetBuildProperty instance at 0x10668ab8>): []
2013-02-14 07:13:02-0800 [-] step 'set build property' complete: success
2013-02-14 07:13:02-0800 [-] addCompleteLog(property changes)
2013-02-14 07:13:02-0800 [-] releaseLocks(<buildbotcustom.steps.misc.SetBuildProperty instance at 0x2acd3998>): []
2013-02-14 07:13:02-0800 [-] step 'set build property_1' complete: success
2013-02-14 07:13:02-0800 [-] ShellCommand.startCommand(cmd=<RemoteShellCommand '['cat', '../firefox/application.ini']'>)
2013-02-14 07:13:02-0800 [-] cmd.args = {'workdir': '../talos-data/talos', 'timeout': 1200, 'env': {'MOZ_CRASHREPORTER_NO_REPORT': '1', 'XPCOM_DEBUG_BREAK': 'warn', 'NO_EM_RESTART': '1', 'MOZ_NO_REMOTE': '1', 'DISPLAY': ':0'}, 'want_stdout': 1, 'usePTY': 'slave-config', 'maxTime': None, 'logEnviron': True, 'want_stderr': 1, 'logfiles': {}}
2013-02-14 07:13:02-0800 [-] <RemoteShellCommand '['cat', '../firefox/application.ini']'>: RemoteCommand.run [519097]
2013-02-14 07:13:02-0800 [-] command '['cat', '../firefox/application.ini']' in dir '../talos-data/talos'
2013-02-14 07:13:02-0800 [-] LoggedRemoteCommand.start

After that it didn't finish the reconfig, but there were _tons_ of forced slave disconnections, like:

2013-02-14 07:19:15-0800 [-] Forcibly disconnecting talos-r3-fed-069
2013-02-14 07:19:15-0800 [-] disconnecting old slave talos-r3-fed-069 now
2013-02-14 07:19:15-0800 [-] waiting for slave to finish disconnecting
2013-02-14 07:19:15-0800 [-] Forcibly disconnecting talos-r3-fed64-020
2013-02-14 07:19:15-0800 [-] disconnecting old slave talos-r3-fed64-020 now
2013-02-14 07:19:15-0800 [-] waiting for slave to finish disconnecting
2013-02-14 07:19:15-0800 [-] Forcibly disconnecting talos-r3-fed64-012
2013-02-14 07:19:15-0800 [-] disconnecting old slave talos-r3-fed64-012 now
2013-02-14 07:19:15-0800 [-] waiting for slave to finish disconnecting
2013-02-14 07:19:15-0800 [-] Forcibly disconnecting talos-r3-fed-064
2013-02-14 07:19:15-0800 [-] disconnecting old slave talos-r3-fed-064 now
2013-02-14 07:19:15-0800 [-] waiting for slave to finish disconnecting
2013-02-14 07:19:15-0800 [-] Forcibly disconnecting talos-r3-fed64-031
2013-02-14 07:19:15-0800 [-] disconnecting old slave talos-r3-fed64-031 now
2013-02-14 07:19:15-0800 [-] waiting for slave to finish disconnecting

Afterwards I waited about 10 minutes for the reconfig to complete, but nothing happened. The twistd.log output looked like a master that was just doing normal work - there was no evidence that the reconfig was in progress. I've had some success fixing masters like this by starting a 2nd reconfig, so I tried that. This reconfig hit jelly errors too, and I had to hard restart the master. I assume this is because the master spent so long in a "hung" state that the slaves tried to reconnect or something. It's probably also why the master didn't finish the reconfig immediately after getting past the jelly error.
(In reply to tom.prince from comment #5)
> This seems related to http://trac.buildbot.net/ticket/2045 and
> https://bugzil.la/712244

Yeah...we doubled our builder limit in that bug and even added a check in bug 731814 to make sure we don't go over it. This bug presents itself a little differently in that sometimes the masters manage to reconfig, and sometimes they don't.
I think reconfigs can cause extra PB refs to be created while builders and builds are migrated over between the old instances and the new ones. Depending on how quickly this happens, perhaps this can result in the master or slave hitting the max refs? We're over 50% of max refs for several types of slaves. Are those the ones that are generally hitting the problem?
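To put rough numbers on that hypothesis: if a slave's connection is already holding on the order of 1,800 live references and a reconfig registers new builder/build objects before the old ones are released, only a couple of hundred extra registrations are needed to trip the cap. Here's a toy sketch of that spike; it is not buildbot code, and the cap value is an assumption, since the exact post-bug-712244 limit isn't stated in this bug.

# Toy model of the reconfig hypothesis above -- not buildbot or Twisted code.
ASSUMED_MAX_BROKER_REFS = 2048   # assumed raised limit; not confirmed in this bug

live_refs = {}                   # stand-in for the broker's per-connection luid table

def register(obj):
    if len(live_refs) >= ASSUMED_MAX_BROKER_REFS:
        raise RuntimeError("Maximum PB reference count exceeded. Goodbye.")
    live_refs[id(obj)] = len(live_refs) + 1

# References already held before the reconfig, roughly where the busiest
# test slaves sit (see the counts in the next comment):
old_builders = [object() for _ in range(1800)]
for b in old_builders:
    register(b)

# The reconfig creates new builder/build objects and registers them while the
# old ones are still alive, so the live count spikes past the cap:
new_builders = [object() for _ in range(400)]
for b in new_builders:
    try:
        register(b)
    except RuntimeError as e:
        print "blew the cap at %d refs: %s" % (len(live_refs), e)
        break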
(In reply to Chris AtLee [:catlee] from comment #7)
> I think reconfigs can cause extra PB refs to be created while builders and
> builds are migrated over between the old instances and the new ones.
> Depending on how quickly this happens, perhaps this can result in the master
> or slave hitting the max refs? We're over 50% of max refs for several types
> of slaves. Are those the ones that are generally hitting the problem?

It's hard to say for sure (I don't know which specific slaves are having issues), but there's a strong correlation. Here's a sampling of slaves that have more than half of the max PB refs:

tegra-071            1070
talos-mtnlion-r5-017 1071
talos-r4-snow-021    1127
talos-r4-lion-073    1127
talos-r3-fed64-026   1394
talos-r3-xp-071      1519
talos-r3-w7-029      1587
talos-r3-fed-081     1834

So, clearly Linux is the closest to the limit.
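For reference, here is a hedged sketch of how per-slave numbers like those above could be pulled out of a running master, e.g. from a manhole session. The attribute path (master.botmaster.slaves -> .slave -> .broker.localObjects) is an assumption about buildbot 0.8.x / Twisted internals rather than something taken from this bug, so it may need adjusting on a real master.

# Hedged sketch: count live PB references per connected slave on a buildbot
# 0.8.x master. The attribute names here are assumptions and may differ.
def pb_ref_counts(master):
    counts = {}
    for name, buildslave in master.botmaster.slaves.items():
        remote = buildslave.slave        # RemoteReference to the slave; None if detached
        if remote is None:
            continue
        broker = remote.broker           # the PB connection carrying that slave's refs
        counts[name] = len(broker.localObjects)
    return counts

# Example use from a manhole prompt, highest counts first:
# for name, n in sorted(pb_ref_counts(master).items(), key=lambda x: -x[1])[:10]:
#     print name, n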
I'm hitting this with bm24-tests1-linux.
Working ok for now.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → Release Engineering
Component: General Automation → General