Closed
Bug 668237
Opened 13 years ago
Closed 10 years ago
Stale Broker errors not caught, lead to reconfig hangs
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: dustin, Unassigned)
Details
(Whiteboard: [buildbot])
Comment 1 (Reporter) • 13 years ago
As I posted in #2022, this isn't what I thought. We still have a problem, but since this bug only existed to apply a solution that turned out to be incorrect, it's invalid.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INVALID
Comment 2 (Reporter) • 13 years ago
Ah, I see the disconnect now. #2022 was looking at Buildbot master, and assuming that the stale broker errors were coming from the sendBuilderList call. That's not the case, though:
2011-07-24 20:02:18-0700 [-] Log opened.
2011-07-24 20:02:18-0700 [-] twistd 10.2.0 (/tools/buildbot-0.8.4-pre-moz2/bin/python 2.6.1) starting up.
2011-07-24 20:02:18-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2011-07-24 20:02:18-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x101377560>
2011-07-24 20:02:18-0700 [-] Connecting to preproduction-master.build.sjc1.mozilla.com:9012
2011-07-24 20:02:18-0700 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
2011-07-24 20:02:18-0700 [Broker,client] While trying to connect:
Traceback from remote host -- Traceback (most recent call last):
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/spread/pb.py", line 1346, in remote_respond
    d = self.portal.login(self, mind, IPerspective)
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/cred/portal.py", line 116, in login
    ).addCallback(self.realm.requestAvatar, mind, *interfaces
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/internet/defer.py", line 260, in addCallback
    callbackKeywords=kw)
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/buildbot-0.8.2_hg_3dc678eecd11_production_0.8-py2.6.egg/buildbot/master.py", line 474, in requestAvatar
    p = self.botmaster.getPerspective(mind, avatarID)
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/buildbot-0.8.2_hg_3dc678eecd11_production_0.8-py2.6.egg/buildbot/master.py", line 344, in getPerspective
    d = sl.slave.callRemote("print", "master got a duplicate connection; keeping this one")
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/spread/pb.py", line 328, in callRemote
    _name, args, kw)
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/spread/pb.py", line 807, in _sendMessage
    raise DeadReferenceError("Calling Stale Broker")
twisted.spread.pb.DeadReferenceError: Calling Stale Broker
2011-07-24 20:02:18-0700 [Broker,client] Lost connection to preproduction-master.build.sjc1.mozilla.com:9012
2011-07-24 20:02:18-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x101377560>
2011-07-24 20:02:18-0700 [-] Main loop terminated.
2011-07-24 20:02:18-0700 [-] Server Shut Down.
So the error is occurring in the duplicate-connection handling code.
The failing call is:
    d = sl.slave.callRemote("print", "master got a duplicate connection; keeping this one")
which, curiously, raises an exception synchronously instead of errbacking (an errback would be ignored anyway, since 'd' is never handled after this point). I suspect that the fix is to catch this error and handle it by deleting the old slave.
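A rough sketch of that idea follows. It is not the actual Buildbot patch: the classes below are stand-ins so the snippet runs without Twisted, the `slaves` dict and the `notify_or_drop_old_slave` helper are hypothetical, and in the real master the exception would be `twisted.spread.pb.DeadReferenceError`.

```python
class DeadReferenceError(Exception):
    """Stand-in for twisted.spread.pb.DeadReferenceError (assumption: no Twisted here)."""


class StaleSlaveRef:
    """Hypothetical remote reference whose broker has already gone away."""

    def callRemote(self, name, *args):
        # Mirrors the traceback above: the call raises synchronously
        # instead of returning a Deferred that errbacks.
        raise DeadReferenceError("Calling Stale Broker")


class LiveSlaveRef:
    """Hypothetical remote reference that is still connected."""

    def __init__(self):
        self.calls = []

    def callRemote(self, name, *args):
        self.calls.append((name, args))
        return None  # the real call would return a Deferred


def notify_or_drop_old_slave(slaves, name):
    """Tell the old connection about the duplicate, or forget it if it is dead.

    Returns True if the old connection was notified, False if there was no
    old connection or it was stale and has been removed from `slaves`.
    """
    old = slaves.get(name)
    if old is None:
        return False
    try:
        old.callRemote("print", "master got a duplicate connection; keeping this one")
    except DeadReferenceError:
        # The old reference is unusable, so drop the master's record of it;
        # otherwise every later login attempt fails the same way.
        del slaves[name]
        return False
    return True
```

With this shape, a stale entry is cleared on the first reconnect attempt, so a later login from the same slave no longer trips over it.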
Sadly, this situation does not resolve itself by rebooting the slave, as the inconsistent state is on the master. There are currently a few slaves caught in this situation:
mv-moz2-linux-ix-slave01
talos-r3-snow-002
which will need to be rescued.
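To find other slaves stuck the same way, one option is to check each slave's log for the remote traceback quoted above. A minimal sketch, assuming the log has already been read into a list of lines; the function name and the marker constant are made up for illustration.

```python
# The final line of the remote traceback, as it appears in the slave log above.
STALE_MARKER = "DeadReferenceError: Calling Stale Broker"


def has_stale_broker_failure(log_lines):
    """Return True if any log line carries the stale-broker error."""
    return any(STALE_MARKER in line for line in log_lines)
```

This only flags the symptom on the slave side; the stale state itself still has to be cleared on the master.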
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Comment 3 (Reporter) • 13 years ago
I'll be doing some related work on #2022, which can probably be backported into buildbot-0.8.2. I'll be glad to help with that backporting.
Assignee: dustin → nobody
Comment 4 • 13 years ago
Aren't we going to roll out 0.8.4 at some point? (though aiui it would still require backporting buildbot #2022)
Whiteboard: [buildbot]
Updated • 13 years ago
Priority: -- → P3
Updated (Assignee) • 11 years ago
Product: mozilla.org → Release Engineering
Updated • 10 years ago
Status: REOPENED → RESOLVED
Closed: 13 years ago → 10 years ago
Resolution: --- → WORKSFORME