Closed
Bug 756365
Opened 13 years ago
Closed 13 years ago
moz2-darwin10 machines are not staying connected to masters in scl1
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Assigned: nthomas)
References
Details
(Whiteboard: [buildduty][capacity][buildslaves])
In https://tbpl.mozilla.org/?tree=Mozilla-Beta&rev=5f412ea09aba things didn't go too well: the first 10.6.2 opt build lost its slave right as it was about to upload, so I retriggered it and moved on. To my surprise, when I looked 7 hours later the retriggered build still hadn't finished, so I triggered a third build. When that third build still hadn't finished after 3 hours, I got worried that it was the push itself that broke things, and retriggered the build on the push before it.
That one has stayed pending (45 minutes so far), which prompted me to look at http://build.mozilla.org/builds/last-job-per-slave.html#compile, which looks to me like it's saying something broke with these slaves on Monday: 48 and 56 have been stuck since then, 44 and 54 since Tuesday morning, 50 since Tuesday afternoon, 53 since Tuesday evening, ...
One of my retriggers on the tip apparently made it as far as uploading, since tests are running, but something is still seriously wrong with the moz2-darwin10 pool.
Reporter
Comment 1•13 years ago
And now esr10 has three pending Mac builds, and zero running. One of those three is the 32-bit debug build, which it builds on 10.5. Or doesn't. Did we maybe get rid of all those slaves without remembering something still used them?
Severity: critical → blocker
Reporter
Comment 2•13 years ago
esr10's closed; the correct reopen state is APPROVAL REQUIRED.
Reporter
Comment 3•13 years ago
And since that retrigger on mozilla-beta's tip-minus-one is still pending, it apparently won't get Mac builds on its next push, so it's closed too.
Comment 5•13 years ago
Restarted the 3 masters that handle esr10 builds.
The working theory is that yesterday's mtv1 network blip caused all of the darwin10 slaves to stop talking to the masters.
Reporter
Comment 6•13 years ago
Doesn't appear to have had any effect.
Comment 7•13 years ago
That's because we had another mtv1 event in the middle of the reset, and now another one, so I'm waiting for the 4th restart to finish and hoping we won't have yet another mtv1 event.
Assignee
Comment 8•13 years ago
I'm not sure what's going on here, but this is what I found:
* the slaves (moz2-darwin10-slave40-50, 53-56) are all accessible via ssh, except for moz2-darwin10-slave45 and 53, which are down
* they're all being rebooted by idleizer after only an hour; a typical slave twistd.log has several stretches like the one below (there's also a rough log check for this state sketched at the end of this comment):
2012-05-20 18:12:17-0700 [-] Log opened.
2012-05-20 18:12:17-0700 [-] twistd 10.2.0 (/tools/buildbot-0.8.4-pre-moz2/bin/python 2.6.4) starting up.
2012-05-20 18:12:17-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2012-05-20 18:12:17-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x1017b93b0>
2012-05-20 18:12:17-0700 [-] Connecting to buildbot-master12.build.scl1.mozilla.com:9001
2012-05-20 18:12:19-0700 [Broker,client] message from master: attached
2012-05-20 19:12:18-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
2012-05-20 19:12:18-0700 [-] No active connection, rebooting NOW
2012-05-20 19:12:18-0700 [-] Invoking platform-specific reboot command
2012-05-20 19:12:18-0700 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
2012-05-20 19:12:18-0700 [Broker,client] we lost the brand-new connection
2012-05-20 19:12:18-0700 [Broker,client] Lost connection to buildbot-master12.build.scl1.mozilla.com:9001
2012-05-20 19:12:18-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1017b93b0>
2012-05-20 19:12:18-0700 [-] Main loop terminated.
NB: there isn't a long list of builders after '[Broker,client] message from master: attached', or a
[Broker,client] Connected to buildbot-master13.build.scl1.mozilla.com:9001; slave is ready
* the masters are bm12, bm13, and bm25; all in scl1. I've restarted 13 and 25, which hasn't helped. bm12 is doing a graceful shutdown but I'm not expecting it to help
* there are a couple of zombie builds in buildbot which both failed during 'download props' steps; they are
http://buildbot-master25.build.scl1.mozilla.com:8001/builders/OS%20X%2010.6.2%20mozilla-beta%20build/builds/1
http://buildbot-master25.build.scl1.mozilla.com:8001/builders/TB%20OS%20X%2010.6.2%20comm-beta%20build/builds/13
There are exceptions on the slave side, but the master is in limbo - it thinks the builds finished at epoch 0.
* the six pending builds look sane in the buildrequests table of the scheduler db, from a quick eyeball
* those jobs show up as pending on waterfalls, and the moz2-darwin10 slaves are assigned to those builders
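Not from the original comment - a minimal sketch of how one could flag the "attached but never got the builder list" state across a batch of slave logs, assuming the twistd.log format shown in the excerpt above; nothing here is taken from the real infra tooling, and the filename layout you'd pass in is hypothetical:

#!/usr/bin/env python
# Flags logs whose most recent attach was never followed by 'slave is ready'.
import sys

ATTACHED = "message from master: attached"
READY = "slave is ready"

def stuck(logfile):
    attached = ready = False
    for line in open(logfile):
        if ATTACHED in line:
            # a fresh attach resets the state; only the latest one matters
            attached, ready = True, False
        elif READY in line:
            ready = True
    return attached and not ready

if __name__ == "__main__":
    for path in sys.argv[1:]:
        if stuck(path):
            print("%s: attached but never became ready" % path)

Run it against the collected logs, e.g. 'python check_attach.py moz2-darwin10-slave*/twistd.log'.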
Assignee
Comment 9•13 years ago
Investigating the incomplete sign-on to the masters further:
* bm12/13/25 are quite possibly the only masters to have been restarted since their buildbot.tac files were modified to double MAX_BROKER_REFS to 2048 (bug 712244)
* non-moz2-darwin10 slaves are connecting OK and doing jobs with this change, but the moz2-darwin10 slaves are not
* taking bm25, setting 'twisted.spread.pb.MAX_BROKER_REFS = 1024' in the buildbot.tac (ie reverting the master side of bug 712244; see the tac sketch right after this list), and restarting doesn't help - still no list of builders and no 'slave is ready'. This is with moz2-darwin10-slave47 connecting with 2048 set in its own tac file
* setting 1024 in both tac files doesn't work either (I launched buildbot with '/usr/bin/python /usr/local/bin/runslave.py --verbose --allocator-url http://example.org' and a hacked tac file)
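For reference, the bug 712244 change and its revert amount to a couple of lines in the tac file; this is only a sketch - the real buildbot.tac files obviously contain much more than this:

# buildbot.tac excerpt (sketch only)
import twisted.spread.pb
twisted.spread.pb.MAX_BROKER_REFS = 1024   # value before bug 712244; that bug doubled it to 2048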
So that rules out MAX_BROKER_REFS (I've reverted my modifications), and this looks like a network issue. The strange thing is that we can create the initial connection OK:
$ nc -vz buildbot-master25.build.scl1.mozilla.com 9001
Connection to buildbot-master25.build.scl1.mozilla.com 9001 port [tcp/etlservicemgr] succeeded!
but then it appears to hang.
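(Illustration only, not from the original comment: nc -vz only exercises the TCP handshake, so here's a minimal probe that also waits to see whether any bytes ever arrive from the master after connecting. Whether a healthy PB master sends anything to a silent client is an assumption; the interesting result is the timeout on the broken path despite the successful connect.)

#!/usr/bin/env python
# Connect to a master's slave port and wait briefly for any data at all.
import socket, sys

host = sys.argv[1] if len(sys.argv) > 1 else "buildbot-master25.build.scl1.mozilla.com"
port = int(sys.argv[2]) if len(sys.argv) > 2 else 9001

s = socket.create_connection((host, port), timeout=10)
print("connected to %s:%d" % (host, port))
s.settimeout(10)
try:
    data = s.recv(1024)
    print("got %d bytes back: %r" % (len(data), data[:80]))
except socket.timeout:
    print("connected, but nothing received within 10s - traffic looks blackholed")
finally:
    s.close()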
Comment 10•13 years ago
FTR, this is affecting the Thunderbird-Beta (comm-beta) builders as well.
Comment 11•13 years ago
Looking at this further, I've found that all of the moz2 slaves hit this error on the master side very soon after attaching:
2012-05-21 09:25:47-0700 [Broker,184,10.250.50.159] Unhandled Error
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
2012-05-21 09:25:47-0700 [Broker,184,10.250.50.159] Peer will receive following PB traceback:
2012-05-21 09:25:47-0700 [Broker,184,10.250.50.159] Unhandled Error
Traceback (most recent call last):
File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/defer.py", line 363, in unpause
self._runCallbacks()
File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
self.result = callback(self.result, *args, **kw)
File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/defer.py", line 397, in _continue
self.unpause()
File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/defer.py", line 363, in unpause
self._runCallbacks()
--- <exception caught here> ---
File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
self.result = callback(self.result, *args, **kw)
File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/spread/pb.py", line 763, in serialize
return jelly(object, self.security, None, self)
File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/spread/jelly.py", line 1122, in jelly
return _Jellier(taster, persistentStore, invoker).jelly(object)
File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/spread/jelly.py", line 475, in jelly
return obj.jellyFor(self)
File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/spread/flavors.py", line 127, in jellyFor
return "remote", jellier.invoker.registerReference(self)
File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/spread/pb.py", line 658, in registerReference
luid = self.luids.get(puid)
exceptions.AttributeError: 'NoneType' object has no attribute 'get'
2012-05-21 09:25:47-0700 [Broker,184,10.250.50.159] BuildSlave.detached(moz2-darwin10-slave48)
Comment 12•13 years ago
Re-opening the Mozilla-Beta and Mozilla-Esr10 tinderboxen, as Nick is attaching some of the darwin10 boxes to masters in scl3.
Updated•13 years ago
Summary: Welfare check in the moz2-darwin10-slave ward → moz2-darwin10 machines are not staying connected to masters in scl1
Assignee
Comment 13•13 years ago
OK, so since last week the slaves in mtv1 can't talk to scl1 - neither the masters nor slavealloc. They can open a connection, so nc looks OK, but no data flows; e.g. curl'ing an HTTP request to slavealloc hangs until you disconnect, with 0 bytes received.
I'll manually point the slaves at scl3 masters to get them processing work; they'll come back to scl1 once the network flow is fixed and they can talk to slavealloc again. That's bug 757193.
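(Illustration only, not from the original comment: "manually pointing" a slave at another master boils down to an edit like the following in the slave's buildbot.tac, followed by a buildslave restart. The variable names assume a stock buildslave-style tac, and the scl3 hostname and port are guesses - the runslave.py-generated file may look different.)

# hypothetical hand-edit of a slave's buildbot.tac while slavealloc is unreachable
buildmaster_host = 'buildbot-master30.build.scl3.mozilla.com'  # previously one of the bm12/13/25 scl1 hosts
port = 9001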
Assignee: bear → nrthomas
Severity: blocker → major
Priority: -- → P2
Assignee
Comment 14•13 years ago
(In reply to Nick Thomas [:nthomas] from comment #8)
> I'm not sure what's going on here, but this is what I found
>
> * the slaves (moz2-darwin10-slave40-50, 53-56) are all accessible via ssh,
> except for moz2-darwin10-slave45 and 53 which are down
These are now split fairly evenly between bm30 and bm32, and busy clearing the backlog of pending builds.
Assignee
Comment 15•13 years ago
Mac builds are working against scl1 masters.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•12 years ago
Product: mozilla.org → Release Engineering
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard