Closed Bug 974493 Opened 8 years ago Closed 8 years ago

some test machines unable to connect to their masters

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

I've seen this on a couple of machines now:

2014-02-19 09:57:00-0800 [-] Log opened.
2014-02-19 09:57:00-0800 [-] twistd 10.2.0 (/tools/buildbot-0.8.4-pre-moz2/bin/python2.7 2.7.3) starting up.
2014-02-19 09:57:00-0800 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2014-02-19 09:57:00-0800 [-] Starting factory <buildslave.bot.BotFactory instance at 0x101427f38>
2014-02-19 09:57:00-0800 [-] Connecting to buildbot-master79.srv.releng.usw2.mozilla.com:9201
2014-02-19 09:57:00-0800 [-] Watching /builds/slave/talos-slave/shutdown.stamp's mtime to initiate shutdown
2014-02-19 09:57:00-0800 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
2014-02-19 09:57:00-0800 [Broker,client] While trying to connect:
	Traceback from remote host -- Traceback (most recent call last):
	  File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/spread/pb.py", line 1346, in remote_respond
	    d = self.portal.login(self, mind, IPerspective)
	  File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/cred/portal.py", line 116, in login
	    ).addCallback(self.realm.requestAvatar, mind, *interfaces
	  File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/internet/defer.py", line 260, in addCallback
	    callbackKeywords=kw)
	  File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
	    self.result = callback(self.result, *args, **kw)
	  File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/buildbot-0.8.2_hg_f23f5672becd_production_0.8-py2.7.egg/buildbot/master.py", line 498, in requestAvatar
	    p = self.botmaster.getPerspective(mind, avatarID)
	  File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/buildbot-0.8.2_hg_f23f5672becd_production_0.8-py2.7.egg/buildbot/master.py", line 364, in getPerspective
	    d = sl.slave.callRemote("print", "master got a duplicate connection; keeping this one")
	  File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/spread/pb.py", line 328, in callRemote
	    _name, args, kw)
	  File "/builds/buildbot/tests1-macosx/lib/python2.7/site-packages/twisted/spread/pb.py", line 807, in _sendMessage
	    raise DeadReferenceError("Calling Stale Broker")
	twisted.spread.pb.DeadReferenceError: Calling Stale Broker
	
2014-02-19 09:57:00-0800 [Broker,client] Lost connection to buildbot-master79.srv.releng.usw2.mozilla.com:9201
2014-02-19 09:57:00-0800 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x101427f38>
2014-02-19 09:57:00-0800 [-] Main loop terminated.
2014-02-19 09:57:00-0800 [-] Server Shut Down.
It looks like these are caused by stale connections on the master side. I think I can fix them through the manhole. From the master's log:
2014-02-20 05:53:35-0800 [Broker,108345,10.12.49.154] duplicate slave talos-r3-fed-027; rejecting new slave and pinging old
2014-02-20 05:53:35-0800 [Broker,108345,10.12.49.154] old slave was connected from IPv4Address(TCP, '10.12.49.154', 56939)
2014-02-20 05:53:35-0800 [Broker,108345,10.12.49.154] new slave is from IPv4Address(TCP, '10.12.49.154', 58556)
2014-02-20 05:53:35-0800 [Broker,108345,10.12.49.154] Peer will receive following PB traceback:
2014-02-20 05:53:35-0800 [Broker,108345,10.12.49.154] Unhandled Error
	Traceback (most recent call last):
	  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/spread/pb.py", line 1346, in remote_respond
	    d = self.portal.login(self, mind, IPerspective)
	  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/cred/portal.py", line 116, in login
	    ).addCallback(self.realm.requestAvatar, mind, *interfaces
	  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/internet/defer.py", line 260, in addCallback
	    callbackKeywords=kw)
	  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
	    self.result = callback(self.result, *args, **kw)
	  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_f23f5672becd_production_0.8-py2.7.egg/buildbot/master.py", line 498, in requestAvatar
	    p = self.botmaster.getPerspective(mind, avatarID)
	  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_f23f5672becd_production_0.8-py2.7.egg/buildbot/master.py", line 364, in getPerspective
	    d = sl.slave.callRemote("print", "master got a duplicate connection; keeping this one")
	  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/spread/pb.py", line 328, in callRemote
	    _name, args, kw)
	  File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/spread/pb.py", line 807, in _sendMessage
	    raise DeadReferenceError("Calling Stale Broker")
	twisted.spread.pb.DeadReferenceError: Calling Stale Broker
I tried following the instructions at https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master, but these slaves didn't have a hung TCP connection. I then tried forcing the master to drop the slave with these two manhole statements:
master.botmaster.slaves['talos-r3-fed-027'].disconnect()
master.botmaster.slaves['talos-r3-fed-027'].slave.broker.transport.loseConnection()

That didn't work either. Then I noticed that the block that throws the error only runs when slave.isConnected() is true, and isConnected() simply returns slave.slave. So I set that attribute to None:
master.botmaster.slaves['talos-r3-fed-027'].slave = None

And then the slaves were able to connect.
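
For anyone hitting this later, here is a rough sketch of why clearing that attribute unblocks the login. It is a toy model, not the real buildbot code: SlaveRecord, StaleRemote, get_perspective and the local DeadReferenceError class are hypothetical stand-ins, and the only behaviour taken from this bug is that isConnected() returns the stored slave reference and that the duplicate-connection branch calls callRemote on it.

```python
class DeadReferenceError(Exception):
    """Local stand-in for twisted.spread.pb.DeadReferenceError."""


class StaleRemote:
    """Hypothetical remote reference whose broker has already gone away."""

    def callRemote(self, name, *args):
        raise DeadReferenceError("Calling Stale Broker")


class SlaveRecord:
    """Toy model of the master's per-slave record (not buildbot's class)."""

    def __init__(self, name):
        self.name = name
        self.slave = None  # remote reference, or None when disconnected

    def isConnected(self):
        # As described in this bug: just return the stored reference.
        return self.slave


def get_perspective(records, name):
    """Simplified duplicate-connection check modelled on the traceback."""
    record = records[name]
    if record.isConnected():
        # Pinging the old connection raises when its broker is stale,
        # and that exception is what rejects the new login.
        record.slave.callRemote(
            "print", "master got a duplicate connection; keeping this one"
        )
    return record


name = "talos-r3-fed-027"
records = {name: SlaveRecord(name)}
records[name].slave = StaleRemote()  # master still holds a dead reference

try:
    get_perspective(records, name)
    login_ok_before = True
except DeadReferenceError:
    login_ok_before = False

# Equivalent of the manhole workaround: drop the stale reference.
records[name].slave = None
record_after = get_perspective(records, name)
login_ok_after = record_after is records[name]
```

In this model the first login attempt fails because the master tries to ping a reference it still believes is live, and the second succeeds because isConnected() is now falsy and the duplicate branch is skipped. Whether the real master leaves any other state behind after the attribute is cleared is not something this sketch covers.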
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard