Closed Bug 974493 Opened 11 years ago Closed 11 years ago

some test machines unable to connect to their masters

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

I've seen this on a couple of machines now:

2014-02-19 09:57:00-0800 [-] Log opened.
2014-02-19 09:57:00-0800 [-] twistd 10.2.0 (/tools/buildbot-0.8.4-pre-moz2/bin/python2.7 2.7.3) starting up.
2014-02-19 09:57:00-0800 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2014-02-19 09:57:00-0800 [-] Starting factory <buildslave.bot.BotFactory instance at 0x101427f38>
2014-02-19 09:57:00-0800 [-] Connecting to buildbot-master79.srv.releng.usw2.mozilla.com:9201
2014-02-19 09:57:00-0800 [-] Watching /builds/slave/talos-slave/shutdown.stamp's mtime to initiate shutdown
2014-02-19 09:57:00-0800 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
2014-02-19 09:57:00-0800 [Broker,client] While trying to connect:
Traceback from remote host -- Traceback (most recent call last):
  File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/spread/pb.py", line 1346, in remote_respond
    d = self.portal.login(self, mind, IPerspective)
  File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/cred/portal.py", line 116, in login
    ).addCallback(self.realm.requestAvatar, mind, *interfaces
  File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/internet/defer.py", line 260, in addCallback
    callbackKeywords=kw)
  File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/buildbot-0.8.2_hg_f23f5672becd_production_0.8-py2.7.egg/buildbot/master.py", line 498, in requestAvatar
    p = self.botmaster.getPerspective(mind, avatarID)
  File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/buildbot-0.8.2_hg_f23f5672becd_production_0.8-py2.7.egg/buildbot/master.py", line 364, in getPerspective
    d = sl.slave.callRemote("print", "master got a duplicate connection; keeping this one")
  File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/spread/pb.py", line 328, in callRemote
    _name, args, kw)
  File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/spread/pb.py", line 807, in _sendMessage
    raise DeadReferenceError("Calling Stale Broker")
twisted.spread.pb.DeadReferenceError: Calling Stale Broker
2014-02-19 09:57:00-0800 [Broker,client] Lost connection to buildbot-master79.srv.releng.usw2.mozilla.com:9201
2014-02-19 09:57:00-0800 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x101427f38>
2014-02-19 09:57:00-0800 [-] Main loop terminated.
2014-02-19 09:57:00-0800 [-] Server Shut Down.
Looks like these are caused by stale connections, I think I can fix them through the manhole:

2014-02-20 05:53:35-0800 [Broker,108345,10.12.49.154] duplicate slave talos-r3-fed-027; rejecting new slave and pinging old
2014-02-20 05:53:35-0800 [Broker,108345,10.12.49.154] old slave was connected from IPv4Address(TCP, '10.12.49.154', 56939)
2014-02-20 05:53:35-0800 [Broker,108345,10.12.49.154] new slave is from IPv4Address(TCP, '10.12.49.154', 58556)
2014-02-20 05:53:35-0800 [Broker,108345,10.12.49.154] Peer will receive following PB traceback:
2014-02-20 05:53:35-0800 [Broker,108345,10.12.49.154] Unhandled Error
Traceback (most recent call last):
  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/spread/pb.py", line 1346, in remote_respond
    d = self.portal.login(self, mind, IPerspective)
  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/cred/portal.py", line 116, in login
    ).addCallback(self.realm.requestAvatar, mind, *interfaces
  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/internet/defer.py", line 260, in addCallback
    callbackKeywords=kw)
  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_f23f5672becd_production_0.8-py2.7.egg/buildbot/master.py", line 498, in requestAvatar
    p = self.botmaster.getPerspective(mind, avatarID)
  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_f23f5672becd_production_0.8-py2.7.egg/buildbot/master.py", line 364, in getPerspective
    d = sl.slave.callRemote("print", "master got a duplicate connection; keeping this one")
  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/spread/pb.py", line 328, in callRemote
    _name, args, kw)
  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/spread/pb.py", line 807, in _sendMessage
    raise DeadReferenceError("Calling Stale Broker")
twisted.spread.pb.DeadReferenceError: Calling Stale Broker
I tried following the instructions on https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master, but these slaves didn't have a hung TCP connection. I tried forcing the slave to drop with these two manhole statements:

master.botmaster.slaves['talos-r3-fed-027'].disconnect()
master.botmaster.slaves['talos-r3-fed-027'].slave.broker.transport.loseConnection()

But that didn't work either. Then I noticed that the block that throws the error is conditional on slave.isConnected(), which returns slave.slave. So I set that to None:

master.botmaster.slaves['talos-r3-fed-027'].slave = None

And then the slaves were able to connect.
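For anyone hitting this later, here is a rough stand-alone model of why clearing the stale reference lets the slave log in again. It is only a sketch under assumptions: DeadReferenceError, StaleRemote, SlaveRecord and get_perspective are hypothetical stand-ins written for illustration, not the real twisted or buildbot classes, and what the master does after a successful ping of the old slave is not modeled.

```python
# Simplified stand-ins for illustration; NOT the real buildbot/twisted code.


class DeadReferenceError(Exception):
    """Stand-in for twisted.spread.pb.DeadReferenceError."""


class StaleRemote:
    """Models a remote reference whose broker has already gone away."""

    def callRemote(self, name, *args):
        raise DeadReferenceError("Calling Stale Broker")


class SlaveRecord:
    """Models the master's bookkeeping object for one slave."""

    def __init__(self, name):
        self.name = name
        self.slave = None  # remote reference, or None when disconnected

    def isConnected(self):
        # As described in the comment above: just returns the stored reference.
        return self.slave


def get_perspective(slaves, name, new_remote):
    """Handle a login attempt from `name` (hypothetical simplification)."""
    sl = slaves[name]
    if sl.isConnected():
        # Duplicate connection: ping the old reference. If it is stale this
        # raises, and the error propagates to the slave trying to log in.
        sl.slave.callRemote(
            "print", "master got a duplicate connection; keeping this one")
        return None  # assumption: old connection kept; not the focus here
    sl.slave = new_remote
    return sl


name = "talos-r3-fed-027"
slaves = {name: SlaveRecord(name)}
slaves[name].slave = StaleRemote()  # master still holds a dead reference

try:
    get_perspective(slaves, name, object())
    first_attempt = "accepted"
except DeadReferenceError as exc:
    first_attempt = "rejected: %s" % exc

slaves[name].slave = None  # the manhole workaround from the comment above
second_attempt = get_perspective(slaves, name, object())

print(first_attempt)
print(second_attempt is slaves[name])
```

In this model the first login fails because the duplicate-connection branch is entered and the ping hits the dead reference. Once the stored reference is None, isConnected() is falsy, the branch is skipped, and the new connection is recorded.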
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard