Investigate hung slaves from tegra foopies

RESOLVED INCOMPLETE


People

(Reporter: Callek, Assigned: Callek)

(Assignee)

Description

7 years ago
So earlier today (Wed 5/23) we had many network issues, and some hung slaves on bm19 at least.

This prevented ben's fabric reconfig from completing. It stayed broken even after I did some lsof checking on both the foopy and the master, and even after using manhole to os.close() the stale connection per https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master
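For context, the unstick procedure from that wiki page boils down to: find the leaked socket fd with lsof, then os.close() it from inside the master's manhole. A minimal sketch of that step (a plain function for illustration; in practice you just run os.close(fd) in the manhole session, and the fd number is whatever lsof reported):

```python
import os

def force_close_fd(fd):
    """Force-close a file descriptor that lsof identified as a stale
    slave connection. Nothing validates that fd really is the stale
    socket, which is why this is a last resort.

    Returns True if the fd was open and is now closed, False if it
    was already closed (or never valid).
    """
    try:
        os.close(fd)
        return True
    except OSError:
        # EBADF: fd was already closed or never valid
        return False
```

The danger, as this bug shows, is that closing the fd on one side does not guarantee the other side's broker notices promptly.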

After all that was said and done there was a network blip in scl3 which took down hg briefly. I went back and looked at some slaves/masters tonight and saw a bunch of hung slaves on bm19 (list to follow). I only checked tegra-035 on foopy07, and nothing obvious stands out.

bm19: 032, 035, 105, 108, 109, 188, 225

The symptom is a twistd traceback on the slave side:

2012-05-23 18:30:51-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2012-05-23 18:30:51-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x101a63c20>
2012-05-23 18:30:51-0700 [-] Connecting to buildbot-master19.build.mozilla.org:9201
2012-05-23 18:31:13-0700 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
2012-05-23 18:31:13-0700 [Broker,client] Unhandled Error
        Traceback from remote host -- Traceback (most recent call last):
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/spread/pb.py", line 1346, in remote_respond
            d = self.portal.login(self, mind, IPerspective)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/cred/portal.py", line 116, in login
            ).addCallback(self.realm.requestAvatar, mind, *interfaces
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/internet/defer.py", line 260, in addCallback
            callbackKeywords=kw)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
            self.result = callback(self.result, *args, **kw)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/buildbot-0.8.2_hg_a2045101fe7a_production_0.8-py2.6.egg/buildbot/master.py", line 474, in requestAvatar
            p = self.botmaster.getPerspective(mind, avatarID)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/buildbot-0.8.2_hg_a2045101fe7a_production_0.8-py2.6.egg/buildbot/master.py", line 344, in getPerspective
            d = sl.slave.callRemote("print", "master got a duplicate connection; keeping this one")
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/spread/pb.py", line 328, in callRemote
            _name, args, kw)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/spread/pb.py", line 807, in _sendMessage
            raise DeadReferenceError("Calling Stale Broker")
        twisted.spread.pb.DeadReferenceError: Calling Stale Broker

2012-05-23 18:31:13-0700 [Broker,client] Lost connection to buildbot-master19.build.mozilla.org:9201
2012-05-23 18:31:13-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x101a63c20>
2012-05-23 18:31:13-0700 [-] Main loop terminated.
2012-05-23 18:31:13-0700 [-] Server Shut Down.

-------------------

And on the master side:

2012-05-23 18:16:55-0700 [Broker,1450,10.250.48.208] duplicate slave tegra-035; rejecting new slave and pinging old
2012-05-23 18:16:55-0700 [Broker,1450,10.250.48.208] old slave was connected from IPv4Address(TCP, '10.250.48.208', 58786)
2012-05-23 18:16:55-0700 [Broker,1450,10.250.48.208] new slave is from IPv4Address(TCP, '10.250.48.208', 63336)
2012-05-23 18:16:55-0700 [Broker,1450,10.250.48.208] Peer will receive following PB traceback:
2012-05-23 18:16:55-0700 [Broker,1450,10.250.48.208] Unhandled Error
        Traceback (most recent call last):
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/spread/pb.py", line 1346, in remote_respond
            d = self.portal.login(self, mind, IPerspective)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/cred/portal.py", line 116, in login
            ).addCallback(self.realm.requestAvatar, mind, *interfaces
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/internet/defer.py", line 260, in addCallback
            callbackKeywords=kw)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
            self.result = callback(self.result, *args, **kw)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/buildbot-0.8.2_hg_a2045101fe7a_production_0.8-py2.6.egg/buildbot/master.py", line 474, in requestAvatar
            p = self.botmaster.getPerspective(mind, avatarID)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/buildbot-0.8.2_hg_a2045101fe7a_production_0.8-py2.6.egg/buildbot/master.py", line 344, in getPerspective
            d = sl.slave.callRemote("print", "master got a duplicate connection; keeping this one")
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/spread/pb.py", line 328, in callRemote
            _name, args, kw)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/spread/pb.py", line 807, in _sendMessage
            raise DeadReferenceError("Calling Stale Broker")
        twisted.spread.pb.DeadReferenceError: Calling Stale Broker
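Reading the traceback: when the master sees a second connection for an already-registered slave, getPerspective pings the old connection before deciding which one to keep. Here the old broker's TCP connection was already dead, so the ping itself raised DeadReferenceError, and that exception propagated back through login and rejected the new connection too. A minimal sketch of that failure mode (class and function names are illustrative stand-ins, not buildbot's actual code):

```python
class DeadReferenceError(Exception):
    """Stand-in for twisted.spread.pb.DeadReferenceError."""


class StaleBroker:
    """A remote reference whose underlying TCP connection is gone."""
    def callRemote(self, name, *args):
        raise DeadReferenceError("Calling Stale Broker")


def get_perspective(old_slave, new_slave_name):
    """Sketch of the duplicate-slave handling seen in the traceback:
    ping the old connection before accepting the new one. If the old
    broker is already dead, the ping raises instead of returning, so
    the *new* slave's login fails with DeadReferenceError -- exactly
    what both sides logged above."""
    # This is the call the traceback points at (master.py getPerspective).
    old_slave.callRemote("print", "master got a duplicate connection; keeping this one")
    return "perspective-for-%s" % new_slave_name
```

The net effect: neither the old nor the new connection ends up usable, and the slave shuts down with "Main loop terminated."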
(Assignee)

Comment 1

7 years ago
Doing a bit of additional investigation... every single one of these had the remote-connection-failed error from reboot.py, i.e. from my code that kills buildbot:

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]

My theories on fixing it are [one or both of]:
* we don't give the master enough time to (fully) realize the slave went away, so we get to the "reconnect me" part too fast.
* because buildbot gets an exception here rather than the "eaten" exception we use for regular slave reboots, it's causing inconsistent state.
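If the first theory is right, the fix on the reboot.py side would amount to waiting (or retrying with backoff) before reconnecting, so the master has time to reap the old connection. A hypothetical sketch -- the function, delay values, and the connect callable are illustrative, not the actual reboot.py code:

```python
import time


def reconnect_with_backoff(connect, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry `connect` with exponential backoff, giving the master a
    chance to notice the old connection died before we present a new
    one. `connect` is assumed to return True on success and to return
    False (or raise) when the master rejects us as a duplicate.

    Returns True once a connect succeeds, False after all attempts fail.
    """
    for attempt in range(attempts):
        # 1s, 2s, 4s, ... between tries (illustrative values)
        sleep(base_delay * (2 ** attempt))
        try:
            if connect():
                return True
        except Exception:
            pass  # treat any error as "master not ready yet"; retry
    return False
```

The second theory (inconsistent master state from the unhandled exception) would need a fix in the master's duplicate-slave path instead, which a slave-side delay can't address.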
Blocks: 746042
(Assignee)

Comment 2

7 years ago
Bug 746042 has been backed out from prod. It's not clear why these were hanging, but the time/reward tradeoff isn't good enough to spend much more time here right now; I'll try to verify/figure it out properly when I get back to bug 746042.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → INCOMPLETE
Product: mozilla.org → Release Engineering

Updated

5 months ago
Product: Release Engineering → Infrastructure & Operations