Closed
Bug 758091
Opened 14 years ago
Closed 14 years ago
Investigate hung slaves from tegra foopies
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: Callek, Assigned: Callek)
References
Details
So earlier today (Wed 5/23) we had many network issues, and some hung slaves on bm19 at least.
This prevent ben's fabric reconfig from completing, Stayed broke even after I did some lsof checking on both foopy and master even after using manhole to os.close() per https://wiki.mozilla.org/ReleaseEngineering/How_To/Unstick_a_Stuck_Slave_From_A_Master
After all that was said and done there was a network blip in scl3 which took down hg briefly, and I go back and look at some slaves/masters tonight and I see a bunch of hung slaves on bm19 (list to follow). I only checked with tegra-035 on foopy07 and nothing obvious stands out.
bm19: 032, 035, 105, 108, 109, 188, 225
The symptoms is a twistd traceback on the slave side:
2012-05-23 18:30:51-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2012-05-23 18:30:51-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x101a63c20>
2012-05-23 18:30:51-0700 [-] Connecting to buildbot-master19.build.mozilla.org:9201
2012-05-23 18:31:13-0700 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
2012-05-23 18:31:13-0700 [Broker,client] Unhandled Error
Traceback from remote host -- Traceback (most recent call last):
File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/spread/pb.py", lin
e 1346, in remote_respond
d = self.portal.login(self, mind, IPerspective)
File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/cred/portal.py", l
ine 116, in login
).addCallback(self.realm.requestAvatar, mind, *interfaces
File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/internet/defer.py"
, line 260, in addCallback
callbackKeywords=kw)
File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/internet/defer.py"
, line 249, in addCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/internet/defer.py"
, line 441, in _runCallbacks
self.result = callback(self.result, *args, **kw)
File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/buildbot-0.8.2_hg_a2045101
fe7a_production_0.8-py2.6.egg/buildbot/master.py", line 474, in requestAvatar
p = self.botmaster.getPerspective(mind, avatarID)
File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/buildbot-0.8.2_hg_a2045101
fe7a_production_0.8-py2.6.egg/buildbot/master.py", line 344, in getPerspective
d = sl.slave.callRemote("print", "master got a duplicate connection; keeping this one")
File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/spread/pb.py", lin
e 328, in callRemote
_name, args, kw)
File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/spread/pb.py", lin
e 807, in _sendMessage
raise DeadReferenceError("Calling Stale Broker")
twisted.spread.pb.DeadReferenceError: Calling Stale Broker
2012-05-23 18:31:13-0700 [Broker,client] Lost connection to buildbot-master19.build.mozilla.org:9201
2012-05-23 18:31:13-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1
01a63c20>
2012-05-23 18:31:13-0700 [-] Main loop terminated.
2012-05-23 18:31:13-0700 [-] Server Shut Down.
-------------------
And on master side:
2012-05-23 18:16:55-0700 [Broker,1450,10.250.48.208] duplicate slave tegra-035; rejecting new slave
and pinging old
2012-05-23 18:16:55-0700 [Broker,1450,10.250.48.208] old slave was connected from IPv4Address(TCP, '
10.250.48.208', 58786)
2012-05-23 18:16:55-0700 [Broker,1450,10.250.48.208] new slave is from IPv4Address(TCP, '10.250.48.2
08', 63336)
2012-05-23 18:16:55-0700 [Broker,1450,10.250.48.208] Peer will receive following PB traceback:
2012-05-23 18:16:55-0700 [Broker,1450,10.250.48.208] Unhandled Error
Traceback (most recent call last):
File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/spread/pb.py", lin
e 1346, in remote_respond
d = self.portal.login(self, mind, IPerspective)
File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/cred/portal.py", l
ine 116, in login
).addCallback(self.realm.requestAvatar, mind, *interfaces
File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/internet/defer.py"
, line 260, in addCallback
callbackKeywords=kw)
File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/internet/defer.py"
, line 249, in addCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/internet/defer.py"
, line 441, in _runCallbacks
self.result = callback(self.result, *args, **kw)
File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/buildbot-0.8.2_hg_a2045101
fe7a_production_0.8-py2.6.egg/buildbot/master.py", line 474, in requestAvatar
p = self.botmaster.getPerspective(mind, avatarID)
File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/buildbot-0.8.2_hg_a2045101
fe7a_production_0.8-py2.6.egg/buildbot/master.py", line 344, in getPerspective
d = sl.slave.callRemote("print", "master got a duplicate connection; keeping this one")
File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/spread/pb.py", lin
e 328, in callRemote
_name, args, kw)
File "/builds/buildbot/tests1-tegra/lib/python2.6/site-packages/twisted/spread/pb.py", lin
e 807, in _sendMessage
raise DeadReferenceError("Calling Stale Broker")
twisted.spread.pb.DeadReferenceError: Calling Stale Broker
| Assignee | ||
Comment 1•14 years ago
|
||
Doing a bit of additional investigation... Every single one of these had the remote-connection failed error from reboot.py. From my code that kills buildbot.
remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
My theories on fixing it is [one or both of]
* that we don't give enough time for the master to (fully) realize the slave went away, and then we get to the "reconnect me" part too fast.
* Because buildbot gets an exception here rather than the "eaten" exception we do for the regular slave reboots its causing inconsistent state.
Blocks: 746042
| Assignee | ||
Comment 2•14 years ago
|
||
Bug 746042 has been backed out from prod, Its not clear why these were hanging but time/reward is not good enough to spend too much time here right now, I'll try and verify/figure it out better when I get back at 746042 properly.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → INCOMPLETE
Updated•12 years ago
|
Product: mozilla.org → Release Engineering
Updated•8 years ago
|
Product: Release Engineering → Infrastructure & Operations
Updated•6 years ago
|
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•