Closed Bug 756365 Opened 13 years ago Closed 13 years ago

moz2-darwin10 machines are not staying connected to masters in scl1

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P2)

x86_64
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: nthomas)

References

Details

(Whiteboard: [buildduty][capacity][buildslaves])

In https://tbpl.mozilla.org/?tree=Mozilla-Beta&rev=5f412ea09aba things didn't go too well - the first 10.6.2 opt build lost its slave right as it was about to upload, I retriggered it and moved on, and then to my surprise when I looked 7 hours later the retriggered build still hadn't finished, so I triggered a third build. When that third build still hadn't finished after 3 hours, I got worried that it was the push that broke things, and retriggered the build on the push before. That stayed pending (45 minutes so far), which prompted me to look at http://build.mozilla.org/builds/last-job-per-slave.html#compile which looks to me like it's saying that something broke with these slaves on Monday - 48 and 56 have been stuck since then, 44 and 54 since Tuesday morning, 50 Tuesday afternoon, 53 Tuesday evening, ... One of my retriggers on the tip apparently made it as far as uploading, since tests are running, but there's still something seriously wrong in the moz2-darwin10 ghetto.
And now esr10 has three pending Mac builds, and zero running. One of those three is the 32-bit debug build, which it builds on 10.5. Or doesn't. Did we maybe get rid of all those slaves without remembering something still used them?
Severity: critical → blocker
esr10's closed, the correct reopen state is APPROVAL REQUIRED.
And since that retrigger on mozilla-beta's tip-minus-one is still pending, it apparently won't get Mac builds on its next push, so it's closed too.
Assigning to buildduty.
Assignee: nobody → bear
Restarted the 3 masters that handle esr10 builds. The working theory is that yesterday's mtv1 network blip caused all of the darwin10 slaves to stop talking to the masters.
Doesn't appear to have had any effect.
That's because we had another mtv1 event in the middle of the reset, and now another one, so I'm waiting for the 4th restart to finish and praying that we won't have yet another mtv1 event.
I'm not sure what's going on here, but this is what I found:

* the slaves (moz2-darwin10-slave40-50, 53-56) are all accessible via ssh, except for moz2-darwin10-slave45 and 53, which are down
* they're all being rebooted by idleizer after only an hour; a typical slave twistd.log has several of these:

  2012-05-20 18:12:17-0700 [-] Log opened.
  2012-05-20 18:12:17-0700 [-] twistd 10.2.0 (/tools/buildbot-0.8.4-pre-moz2/bin/python 2.6.4) starting up.
  2012-05-20 18:12:17-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
  2012-05-20 18:12:17-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x1017b93b0>
  2012-05-20 18:12:17-0700 [-] Connecting to buildbot-master12.build.scl1.mozilla.com:9001
  2012-05-20 18:12:19-0700 [Broker,client] message from master: attached
  2012-05-20 19:12:18-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
  2012-05-20 19:12:18-0700 [-] No active connection, rebooting NOW
  2012-05-20 19:12:18-0700 [-] Invoking platform-specific reboot command
  2012-05-20 19:12:18-0700 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
  2012-05-20 19:12:18-0700 [Broker,client] we lost the brand-new connection
  2012-05-20 19:12:18-0700 [Broker,client] Lost connection to buildbot-master12.build.scl1.mozilla.com:9001
  2012-05-20 19:12:18-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1017b93b0>
  2012-05-20 19:12:18-0700 [-] Main loop terminated.

  NB: there isn't a long list of builders after '[Broker,client] message from master: attached', nor a '[Broker,client] Connected to buildbot-master13.build.scl1.mozilla.com:9001; slave is ready'
* the masters are bm12, bm13, and bm25, all in scl1. I've restarted 13 and 25, which hasn't helped; bm12 is doing a graceful shutdown but I'm not expecting it to help
* there are a couple of zombie builds in buildbot which both failed during 'download props' steps:
  http://buildbot-master25.build.scl1.mozilla.com:8001/builders/OS%20X%2010.6.2%20mozilla-beta%20build/builds/1
  http://buildbot-master25.build.scl1.mozilla.com:8001/builders/TB%20OS%20X%2010.6.2%20comm-beta%20build/builds/13
  There are exceptions on the slave side, but the master is in limbo - it thinks the builds finished at epoch 0.
* the six pending builds look sane in the buildrequests table of the scheduler db, from a quick eyeball
* those jobs show up as pending on waterfalls, and the moz2-darwin10 slaves are assigned to those builders
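For anyone repeating this triage, here is a minimal sketch of the log check described above: scan a slave's twistd.log for an 'attached' message that is never followed by the 'slave is ready' line before the connection is lost. The log path and exact marker strings are assumptions taken from the excerpt in this comment, not a supported tool.

#!/usr/bin/env python
# Hypothetical triage helper: count attaches in a slave twistd.log that never
# reached "slave is ready" before the connection dropped. Path and marker
# strings are assumptions based on the log excerpt above.
import sys

LOG = sys.argv[1] if len(sys.argv) > 1 else "twistd.log"  # assumed location

attached_pending = False   # saw "message from master: attached", no "slave is ready" yet
incomplete = 0

for line in open(LOG):
    if "message from master: attached" in line:
        attached_pending = True
    elif "slave is ready" in line:
        attached_pending = False
    elif "Lost connection to" in line and attached_pending:
        incomplete += 1
        attached_pending = False

print("%s: %d attach(es) that never reached 'slave is ready'" % (LOG, incomplete))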
Investigating the incomplete sign-on to the masters further:

* bm12/13/25 are quite possibly the only masters to be restarted since their buildbot.tac's were modified to double MAX_BROKER_REFS to 2048 (bug 712244)
* non-moz2-darwin10 slaves are connecting OK and doing jobs with this change, but not moz2-darwin10
* taking bm25 and setting 'twisted.spread.pb.MAX_BROKER_REFS = 1024' in the buildbot.tac (ie reverting the master side of 712244) and restarting doesn't help - still no list of builders and no 'slave is ready'. This is with moz2-darwin10-slave47 connecting with 2048 set in its own tac file
* setting 1024 in both tac files doesn't work either (I launched buildbot by hand with '/usr/bin/python /usr/local/bin/runslave.py --verbose --allocator-url http://example.org' and a hacked tac file)

So that rules out MAX_BROKER_REFS (I've reverted my modifications), and this looks like a network issue. The strange thing is that we can create the initial connection OK:

  $ nc -vz buildbot-master25.build.scl1.mozilla.com 9001
  Connection to buildbot-master25.build.scl1.mozilla.com 9001 port [tcp/etlservicemgr] succeeded!

but then it appears to hang.
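A rough sketch of what the nc check above shows, taken one step further: the TCP handshake succeeds, but no application data ever comes back. The assumption here is that a healthy Twisted PB server would send its protocol-negotiation bytes shortly after accepting the connection; the timeout value is arbitrary.

# Sketch of the "connects but hangs" check: the handshake to the master's PB
# port succeeds (so nc -vz reports success), but no bytes ever flow back.
# Host/port come from the comment above; 10s timeout is arbitrary.
import socket

HOST = "buildbot-master25.build.scl1.mozilla.com"
PORT = 9001

s = socket.create_connection((HOST, PORT), timeout=10)
print("connected to %s:%d" % (HOST, PORT))    # this part works, like nc -vz

s.settimeout(10)
try:
    # Assumption: a working PB/banana server sends negotiation data soon after
    # connect, so recv() should return something on a healthy path.
    data = s.recv(1024)
    print("received %d bytes" % len(data))
except socket.timeout:
    print("connection open but no data within 10s - matches the hang seen with nc/curl")
finally:
    s.close()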
FTR, this is affecting the Thunderbird-Beta builders (comm-beta) as well.
Looking at this further, I've found that all of the moz2 slaves get this error very soon after attaching, on the master side:

2012-05-21 09:25:47-0700 [Broker,184,10.250.50.159] Unhandled Error
	Traceback (most recent call last):
	Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
	]
2012-05-21 09:25:47-0700 [Broker,184,10.250.50.159] Peer will receive following PB traceback:
2012-05-21 09:25:47-0700 [Broker,184,10.250.50.159] Unhandled Error
	Traceback (most recent call last):
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/defer.py", line 363, in unpause
	    self._runCallbacks()
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
	    self.result = callback(self.result, *args, **kw)
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/defer.py", line 397, in _continue
	    self.unpause()
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/defer.py", line 363, in unpause
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
	    self.result = callback(self.result, *args, **kw)
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/spread/pb.py", line 763, in serialize
	    return jelly(object, self.security, None, self)
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/spread/jelly.py", line 1122, in jelly
	    return _Jellier(taster, persistentStore, invoker).jelly(object)
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/spread/jelly.py", line 475, in jelly
	    return obj.jellyFor(self)
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/spread/flavors.py", line 127, in jellyFor
	    return "remote", jellier.invoker.registerReference(self)
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/spread/pb.py", line 658, in registerReference
	    luid = self.luids.get(puid)
	exceptions.AttributeError: 'NoneType' object has no attribute 'get'
2012-05-21 09:25:47-0700 [Broker,184,10.250.50.159] BuildSlave.detached(moz2-darwin10-slave48)
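The traceback reads like a teardown race: the connection is lost mid-handshake, the broker's reference table has already been cleared to None, and a callback still in flight then calls .get() on it. A toy illustration of that shape (not Twisted code itself, just an assumed model of the failure):

# Toy sketch (not Twisted) of the failure mode the traceback suggests:
# connection teardown clears the broker's puid->luid table, and a callback
# that is still in flight then tries to use it.
class FakeBroker(object):
    def __init__(self):
        self.luids = {}            # stands in for pb.Broker.luids

    def connection_lost(self):
        self.luids = None          # teardown drops the table

    def register_reference(self, puid):
        # mirrors the failing line: luid = self.luids.get(puid)
        return self.luids.get(puid)

b = FakeBroker()
b.connection_lost()                # the mtv1 <-> scl1 link dies mid-attach
try:
    b.register_reference(id(object()))
except AttributeError as e:
    print("same failure as the master log: %s" % e)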
Depends on: 757193
re-opening the Mozilla-Beta and Mozilla-Esr10 tinderboxen as Nick is attaching some of the darwin10 boxes to masters in scl3
Summary: Welfare check in the moz2-darwin10-slave ward → moz2-darwin10 machines are not staying connected to masters in scl1
OK, so since last week the slaves in mtv1 can't talk to scl1 - neither the masters nor slavealloc. They open a connection, so nc looks OK, but no data flows; e.g. curl'ing an HTTP request to slavealloc hangs until you disconnect, with 0 bytes received. I'll manually point the slaves at scl3 masters to get them processing work, and they'll come back to scl1 once the network flow is fixed and they can talk to slavealloc again. That's bug 757193.
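For reference, a minimal sketch of that slavealloc reachability check with a timeout, so it fails fast instead of hanging like the plain curl did. The URL below is a placeholder, not the real slavealloc endpoint, and this assumes the Python 2 environment the slaves run.

# Sketch of the slavealloc check described above (Python 2, urllib2).
# The URL is a placeholder; the 15s timeout is arbitrary.
import socket
import urllib2

URL = "http://slavealloc.example.com/gettac/moz2-darwin10-slave47"  # placeholder

try:
    body = urllib2.urlopen(URL, timeout=15).read()
    print("got %d bytes back from slavealloc" % len(body))
except (urllib2.URLError, socket.timeout) as e:
    print("no data from slavealloc within 15s (%s) - same symptom as the hung curl" % e)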
Assignee: bear → nrthomas
Severity: blocker → major
Priority: -- → P2
(In reply to Nick Thomas [:nthomas] from comment #8)
> I'm not sure what's going on here, but this is what I found
>
> * the slaves (moz2-darwin10-slave40-50, 53-56) are all accessible via ssh,
> except for moz2-darwin10-slave45 and 53 which are down

These are now split fairly evenly between bm30 and bm32, and busy clearing the backlog of pending builds.
Mac builds are working against scl1 masters.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard