Bug 758091 - Investigate hung slaves from tegra foopies
Status: RESOLVED INCOMPLETE (task)
Categories: Infrastructure & Operations Graveyard :: CIDuty
Opened: 12 years ago
Closed: 12 years ago
Reporter: Callek
Assignee: Callek
So earlier today (Wed 5/23) we had many network issues, and some hung slaves on bm19 at least. This prevented ben's fabric reconfig from completing, and things stayed broken even after I did some lsof checking on both the foopy and the master, and even after using manhole to os.close() the stuck sockets per https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master

After all that was said and done there was a network blip in scl3 which took down hg briefly. I went back and looked at some slaves/masters tonight and saw a bunch of hung slaves on bm19 (list below). I only checked tegra-035 on foopy07, and nothing obvious stands out.

bm19: 032, 035, 105, 108, 109, 188, 225

The symptom is a twistd traceback on the slave side:

2012-05-23 18:30:51-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2012-05-23 18:30:51-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x101a63c20>
2012-05-23 18:30:51-0700 [-] Connecting to buildbot-master19.build.mozilla.org:9201
2012-05-23 18:31:13-0700 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
2012-05-23 18:31:13-0700 [Broker,client] Unhandled Error
        Traceback from remote host -- Traceback (most recent call last):
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/spread/pb.py", line 1346, in remote_respond
            d = self.portal.login(self, mind, IPerspective)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/cred/portal.py", line 116, in login
            ).addCallback(self.realm.requestAvatar, mind, *interfaces
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/internet/defer.py", line 260, in addCallback
            callbackKeywords=kw)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
            self.result = callback(self.result, *args, **kw)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/buildbot-0.8.2_hg_a2045101fe7a_production_0.8-py2.6.egg/buildbot/master.py", line 474, in requestAvatar
            p = self.botmaster.getPerspective(mind, avatarID)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/buildbot-0.8.2_hg_a2045101fe7a_production_0.8-py2.6.egg/buildbot/master.py", line 344, in getPerspective
            d = sl.slave.callRemote("print", "master got a duplicate connection; keeping this one")
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/spread/pb.py", line 328, in callRemote
            _name, args, kw)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/spread/pb.py", line 807, in _sendMessage
            raise DeadReferenceError("Calling Stale Broker")
        twisted.spread.pb.DeadReferenceError: Calling Stale Broker
2012-05-23 18:31:13-0700 [Broker,client] Lost connection to buildbot-master19.build.mozilla.org:9201
2012-05-23 18:31:13-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x101a63c20>
2012-05-23 18:31:13-0700 [-] Main loop terminated.
2012-05-23 18:31:13-0700 [-] Server Shut Down.

-------------------

And on the master side:

2012-05-23 18:16:55-0700 [Broker,1450,10.250.48.208] duplicate slave tegra-035; rejecting new slave and pinging old
2012-05-23 18:16:55-0700 [Broker,1450,10.250.48.208] old slave was connected from IPv4Address(TCP, '10.250.48.208', 58786)
2012-05-23 18:16:55-0700 [Broker,1450,10.250.48.208] new slave is from IPv4Address(TCP, '10.250.48.208', 63336)
2012-05-23 18:16:55-0700 [Broker,1450,10.250.48.208] Peer will receive following PB traceback:
2012-05-23 18:16:55-0700 [Broker,1450,10.250.48.208] Unhandled Error
        Traceback (most recent call last):
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/spread/pb.py", line 1346, in remote_respond
            d = self.portal.login(self, mind, IPerspective)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/cred/portal.py", line 116, in login
            ).addCallback(self.realm.requestAvatar, mind, *interfaces
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/internet/defer.py", line 260, in addCallback
            callbackKeywords=kw)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
            self.result = callback(self.result, *args, **kw)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/buildbot-0.8.2_hg_a2045101fe7a_production_0.8-py2.6.egg/buildbot/master.py", line 474, in requestAvatar
            p = self.botmaster.getPerspective(mind, avatarID)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/buildbot-0.8.2_hg_a2045101fe7a_production_0.8-py2.6.egg/buildbot/master.py", line 344, in getPerspective
            d = sl.slave.callRemote("print", "master got a duplicate connection; keeping this one")
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/spread/pb.py", line 328, in callRemote
            _name, args, kw)
          File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/spread/pb.py", line 807, in _sendMessage
            raise DeadReferenceError("Calling Stale Broker")
        twisted.spread.pb.DeadReferenceError: Calling Stale Broker
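The failure mode in the tracebacks above is that getPerspective pings the old (already dead) connection via callRemote, and the resulting DeadReferenceError propagates out of the new slave's login instead of being treated as "old slave is gone". A hypothetical sketch of that defensive pattern (this is not buildbot's actual code; DeadReferenceError and SlaveRef are local stand-ins for the Twisted/buildbot objects):

```python
class DeadReferenceError(Exception):
    """Stand-in for twisted.spread.pb.DeadReferenceError."""


class SlaveRef:
    """Stand-in for a PB remote reference to a connected slave."""

    def __init__(self, alive):
        self.alive = alive

    def callRemote(self, *args):
        # A stale broker raises instead of delivering the message,
        # mirroring pb.py's "Calling Stale Broker" behaviour.
        if not self.alive:
            raise DeadReferenceError("Calling Stale Broker")
        return "ok"


def get_perspective(old_slave):
    # Ping the old connection; if its broker is already dead, accept
    # the new connection rather than failing the whole login.
    try:
        old_slave.callRemote("print", "master got a duplicate connection")
        return "kept-old"
    except DeadReferenceError:
        return "accepted-new"
```

With this shape, a dead old broker no longer turns a routine duplicate-connection check into a login failure on the reconnecting slave.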
Comment 1 • 12 years ago (Assignee: Callek)
Doing a bit of additional investigation... every single one of these had the remote-connection-failed error from reboot.py (from my code that kills buildbot):

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.]

My theories on fixing it are [one or both of]:
* We don't give the master enough time to (fully) realize the slave went away, so we get to the "reconnect me" part too fast.
* Because buildbot gets an exception here rather than the "eaten" exception we produce for regular slave reboots, it ends up in an inconsistent state.
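The first theory above amounts to inserting a grace period between killing the slave and reconnecting, so the master can reap the old TCP connection first. A minimal sketch of that idea, assuming a simple reboot script; the function names and the 30-second value are illustrative, not from reboot.py:

```python
import time

# Assumed grace period, long enough for the master's keepalive/cleanup
# to notice the dropped connection; not a value taken from the bug.
MASTER_CLEANUP_GRACE = 30  # seconds


def restart_slave(kill, start, grace=MASTER_CLEANUP_GRACE, sleep=time.sleep):
    """Kill the slave, wait out the master's cleanup, then restart.

    kill/start are callables (e.g. terminate and relaunch the
    buildslave process); sleep is injectable for testing.
    """
    kill()        # drop the old connection
    sleep(grace)  # let the master realize the slave went away
    start()       # reconnect as a fresh slave, not a "duplicate"
```

Without the sleep, the restarted slave can reconnect while the master still holds the stale broker for the old connection, which is exactly the duplicate-slave path seen in the logs.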
Blocks: 746042
Comment 2 • 12 years ago (Assignee: Callek)
Bug 746042 has been backed out from prod. It's not clear why these were hanging, but the time/reward ratio isn't good enough to spend much more time here right now; I'll try to verify/figure it out properly when I get back to bug 746042.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INCOMPLETE
Updated•11 years ago
Product: mozilla.org → Release Engineering
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard