Closed Bug 668237 Opened 13 years ago Closed 10 years ago

Stale Broker errors not caught, lead to reconfig hangs

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: dustin, Unassigned)

Details

(Whiteboard: [buildbot])

As I posted in #2022, this isn't what I thought. We still have a problem, but since this bug only existed to apply a solution that turned out to be incorrect, it's invalid.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INVALID
Ah, I see the disconnect now. #2022 was looking at Buildbot master, and assuming that the stale broker errors were coming from the sendBuilderList call. That's not the case, though:

  2011-07-24 20:02:18-0700 [-] Log opened.
  2011-07-24 20:02:18-0700 [-] twistd 10.2.0 (/tools/buildbot-0.8.4-pre-moz2/bin/python 2.6.1) starting up.
  2011-07-24 20:02:18-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
  2011-07-24 20:02:18-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x101377560>
  2011-07-24 20:02:18-0700 [-] Connecting to preproduction-master.build.sjc1.mozilla.com:9012
  2011-07-24 20:02:18-0700 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
  2011-07-24 20:02:18-0700 [Broker,client] While trying to connect:
    Traceback from remote host -- Traceback (most recent call last):
      File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/spread/pb.py", line 1346, in remote_respond
        d = self.portal.login(self, mind, IPerspective)
      File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/cred/portal.py", line 116, in login
        ).addCallback(self.realm.requestAvatar, mind, *interfaces
      File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/internet/defer.py", line 260, in addCallback
        callbackKeywords=kw)
      File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
        self.result = callback(self.result, *args, **kw)
      File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/buildbot-0.8.2_hg_3dc678eecd11_production_0.8-py2.6.egg/buildbot/master.py", line 474, in requestAvatar
        p = self.botmaster.getPerspective(mind, avatarID)
      File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/buildbot-0.8.2_hg_3dc678eecd11_production_0.8-py2.6.egg/buildbot/master.py", line 344, in getPerspective
        d = sl.slave.callRemote("print", "master got a duplicate connection; keeping this one")
      File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/spread/pb.py", line 328, in callRemote
        _name, args, kw)
      File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/spread/pb.py", line 807, in _sendMessage
        raise DeadReferenceError("Calling Stale Broker")
    twisted.spread.pb.DeadReferenceError: Calling Stale Broker
  2011-07-24 20:02:18-0700 [Broker,client] Lost connection to preproduction-master.build.sjc1.mozilla.com:9012
  2011-07-24 20:02:18-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x101377560>
  2011-07-24 20:02:18-0700 [-] Main loop terminated.
  2011-07-24 20:02:18-0700 [-] Server Shut Down.

So the error is occurring in the duplicate-connection handling code. The failing call is:

  d = sl.slave.callRemote("print", "master got a duplicate connection; keeping this one")

which, curiously, does not errback (which would get ignored anyway, since 'd' is never handled after this point), but raises an exception. I suspect that the fix is to catch this error and handle it by deleting the old slave.

Sadly, this situation does not resolve itself by rebooting the slave, as the inconsistent state is on the master. There are currently a few slaves caught in this situation:

  mv-moz2-linux-ix-slave01
  talos-r3-snow-002

which will need to be rescued.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
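For illustration only, here is a rough sketch of the fix suggested above: catch the synchronous DeadReferenceError in the duplicate-connection path and drop the stale record so the reconnecting slave can attach. This is not the actual Buildbot code. BotMaster, SlaveRecord, attach, StaleRemote and LiveRemote are hypothetical names, and DeadReferenceError here is a local stand-in for twisted.spread.pb.DeadReferenceError so the sketch runs without Twisted installed.

```python
class DeadReferenceError(Exception):
    """Local stand-in for twisted.spread.pb.DeadReferenceError (assumption)."""


class StaleRemote:
    """Hypothetical remote reference whose broker has already gone away."""

    def callRemote(self, name, *args):
        # Mirrors the behaviour in the traceback: the call raises
        # synchronously instead of returning a failed Deferred.
        raise DeadReferenceError("Calling Stale Broker")


class LiveRemote:
    """Hypothetical remote reference with a healthy connection."""

    def __init__(self):
        self.calls = []

    def callRemote(self, name, *args):
        self.calls.append((name, args))


class SlaveRecord:
    """Hypothetical master-side record; .slave is the remote reference."""

    def __init__(self, remote):
        self.slave = remote


class BotMaster:
    """Hypothetical, heavily simplified duplicate-connection handling."""

    def __init__(self):
        self.slaves = {}

    def attach(self, name, new_remote):
        """Return the remote reference that ends up registered for `name`."""
        old = self.slaves.get(name)
        if old is not None:
            try:
                old.slave.callRemote(
                    "print",
                    "master got a duplicate connection; keeping this one",
                )
            except DeadReferenceError:
                # The old connection is dead: delete the old slave record
                # so the master's state no longer blocks the reconnect.
                del self.slaves[name]
            else:
                # The old connection is alive: keep it, reject the new one.
                return old.slave
        self.slaves[name] = SlaveRecord(new_remote)
        return new_remote
```

With this shape, a slave whose old connection has gone stale is accepted on reconnect, while a genuine duplicate is still turned away in favour of the live connection. Whether the real getPerspective should delete the old slave or go through a fuller detach path is a question for the #2022 work.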
I'll be doing some related work on #2022, which can probably be backported into buildbot-0.8.2. I'll be glad to help with that backporting.
Assignee: dustin → nobody
Aren't we going to roll out 0.8.4 at some point? (though aiui it would still require backporting buildbot #2022)
Whiteboard: [buildbot]
Priority: -- → P3
Product: mozilla.org → Release Engineering
Status: REOPENED → RESOLVED
Closed: 13 years ago → 10 years ago
Resolution: --- → WORKSFORME