Closed Bug 756365 Opened 13 years ago Closed 13 years ago

moz2-darwin10 machines are not staying connected to masters in scl1

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P2)

x86_64
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: nthomas)

References

Details

(Whiteboard: [buildduty][capacity][buildslaves])

In https://tbpl.mozilla.org/?tree=Mozilla-Beta&rev=5f412ea09aba things didn't go too well - the first 10.6.2 opt build lost its slave right as it was about to upload, I retriggered it and moved on, and then to my surprise when I looked 7 hours later the retriggered build still hadn't finished, so I triggered a third build. When that third build still hadn't finished after 3 hours, I got worried that it was the push that broke things, and retriggered the build on the push before. That stayed pending (45 minutes so far), which prompted me to look at http://build.mozilla.org/builds/last-job-per-slave.html#compile which looks to me like it's saying that something broke with these slaves on Monday - 48 and 56 have been stuck since then, 44 and 54 since Tuesday morning, 50 Tuesday afternoon, 53 Tuesday evening, ... One of my retriggers on the tip apparently made it as far as uploading, since tests are running, but there's still something seriously wrong in the moz2-darwin10 ghetto.
And now esr10 has three pending Mac builds, and zero running. One of those three is the 32-bit debug build, which it builds on 10.5. Or doesn't. Did we maybe get rid of all those slaves without remembering something still used them?
Severity: critical → blocker
esr10's closed, the correct reopen state is APPROVAL REQUIRED.
And since that retrigger on mozilla-beta's tip-minus-one is still pending, it apparently won't get Mac builds on its next push, so it's closed too.
Assigning to buildduty.
Assignee: nobody → bear
Restarted the 3 masters that handle esr10 builds. The working theory is that yesterday's mtv1 network blip caused all of the darwin10 slaves to stop talking to the masters.
Doesn't appear to have had any effect.
That's because we had another mtv1 event in the middle of the reset, and now another one, so I'm waiting for the 4th restart to finish and praying that we won't have yet another mtv1 event.
I'm not sure what's going on here, but this is what I found:

* the slaves (moz2-darwin10-slave40-50, 53-56) are all accessible via ssh, except for moz2-darwin10-slave45 and 53, which are down
* they're all being rebooted by idleizer after only an hour; a typical slave twistd.log has several of these:

  2012-05-20 18:12:17-0700 [-] Log opened.
  2012-05-20 18:12:17-0700 [-] twistd 10.2.0 (/tools/buildbot-0.8.4-pre-moz2/bin/python 2.6.4) starting up.
  2012-05-20 18:12:17-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
  2012-05-20 18:12:17-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x1017b93b0>
  2012-05-20 18:12:17-0700 [-] Connecting to buildbot-master12.build.scl1.mozilla.com:9001
  2012-05-20 18:12:19-0700 [Broker,client] message from master: attached
  2012-05-20 19:12:18-0700 [-] I feel very idle and was thinking of rebooting as soon as the buildmaster says it's OK
  2012-05-20 19:12:18-0700 [-] No active connection, rebooting NOW
  2012-05-20 19:12:18-0700 [-] Invoking platform-specific reboot command
  2012-05-20 19:12:18-0700 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
  2012-05-20 19:12:18-0700 [Broker,client] we lost the brand-new connection
  2012-05-20 19:12:18-0700 [Broker,client] Lost connection to buildbot-master12.build.scl1.mozilla.com:9001
  2012-05-20 19:12:18-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1017b93b0>
  2012-05-20 19:12:18-0700 [-] Main loop terminated.

  NB: there isn't a long list of builders after '[Broker,client] message from master: attached', nor a '[Broker,client] Connected to buildbot-master13.build.scl1.mozilla.com:9001; slave is ready'
* the masters are bm12, bm13, and bm25, all in scl1. I've restarted 13 and 25, which hasn't helped; bm12 is doing a graceful shutdown but I'm not expecting it to help
* there are a couple of zombie builds in buildbot which both failed during 'download props' steps:
  http://buildbot-master25.build.scl1.mozilla.com:8001/builders/OS%20X%2010.6.2%20mozilla-beta%20build/builds/1
  http://buildbot-master25.build.scl1.mozilla.com:8001/builders/TB%20OS%20X%2010.6.2%20comm-beta%20build/builds/13
  There are exceptions on the slave side, but the master is in limbo - it thinks the builds finished at epoch 0.
* the six pending builds look sane in the buildrequests table of the scheduler db, from a quick eyeball
* those jobs show up as pending on waterfalls, and the moz2-darwin10 slaves are assigned to those builders
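For anyone repeating this triage, here is a minimal sketch of the log check described above: scan a slave's twistd.log for an 'attached' message that is never followed by the 'slave is ready' line before the connection is lost. The log path and exact marker strings are assumptions taken from the excerpt in this comment, not a supported tool.

#!/usr/bin/env python
# Hypothetical triage helper: count attaches in a slave twistd.log that never
# reached "slave is ready" before the connection dropped. Path and marker
# strings are assumptions based on the log excerpt above.
import sys

LOG = sys.argv[1] if len(sys.argv) > 1 else "twistd.log"  # assumed location

attached_pending = False   # saw "message from master: attached", no "slave is ready" yet
incomplete = 0

for line in open(LOG):
    if "message from master: attached" in line:
        attached_pending = True
    elif "slave is ready" in line:
        attached_pending = False
    elif "Lost connection to" in line and attached_pending:
        incomplete += 1
        attached_pending = False

print("%s: %d attach(es) that never reached 'slave is ready'" % (LOG, incomplete))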
Investigating the incomplete sign-on to the masters further:

* bm12/13/25 are quite possibly the only masters to be restarted since their buildbot.tac's were modified to double MAX_BROKER_REFS to 2048 (bug 712244)
* non-moz2-darwin10 slaves are connecting OK and doing jobs with this change, but not moz2-darwin10
* taking bm25 and setting 'twisted.spread.pb.MAX_BROKER_REFS = 1024' in the buildbot.tac (ie reverting the master side of 712244) and restarting doesn't help - still no list of builders and no 'slave is ready'. This is with moz2-darwin10-slave47 connecting with 2048 set in its own tac file
* setting 1024 in both tac files doesn't work either (I launched buildbot by hand with '/usr/bin/python /usr/local/bin/runslave.py --verbose --allocator-url http://example.org' and a hacked tac file)

So that rules out MAX_BROKER_REFS (I've reverted my modifications), and this looks like a network issue. The strange thing is that we can create the initial connection OK:

  $ nc -vz buildbot-master25.build.scl1.mozilla.com 9001
  Connection to buildbot-master25.build.scl1.mozilla.com 9001 port [tcp/etlservicemgr] succeeded!

but then it appears to hang.
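A rough sketch of what the nc check above shows, taken one step further: the TCP handshake succeeds, but no application data ever comes back. The assumption here is that a healthy Twisted PB server would send its protocol-negotiation bytes shortly after accepting the connection; the timeout value is arbitrary.

# Sketch of the "connects but hangs" check: the handshake to the master's PB
# port succeeds (so nc -vz reports success), but no bytes ever flow back.
# Host/port come from the comment above; 10s timeout is arbitrary.
import socket

HOST = "buildbot-master25.build.scl1.mozilla.com"
PORT = 9001

s = socket.create_connection((HOST, PORT), timeout=10)
print("connected to %s:%d" % (HOST, PORT))    # this part works, like nc -vz

s.settimeout(10)
try:
    # Assumption: a working PB/banana server sends negotiation data soon after
    # connect, so recv() should return something on a healthy path.
    data = s.recv(1024)
    print("received %d bytes" % len(data))
except socket.timeout:
    print("connection open but no data within 10s - matches the hang seen with nc/curl")
finally:
    s.close()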
FTR, this is affecting the Thunderbird-Beta builders (comm-beta) as well.
Looking at this further, I've found that all of the moz2 slaves get this error very soon after attaching, on the master side:

2012-05-21 09:25:47-0700 [Broker,184,10.250.50.159] Unhandled Error
	Traceback (most recent call last):
	Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
	]
2012-05-21 09:25:47-0700 [Broker,184,10.250.50.159] Peer will receive following PB traceback:
2012-05-21 09:25:47-0700 [Broker,184,10.250.50.159] Unhandled Error
	Traceback (most recent call last):
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/defer.py", line 363, in unpause
	    self._runCallbacks()
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
	    self.result = callback(self.result, *args, **kw)
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/defer.py", line 397, in _continue
	    self.unpause()
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/defer.py", line 363, in unpause
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
	    self.result = callback(self.result, *args, **kw)
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/spread/pb.py", line 763, in serialize
	    return jelly(object, self.security, None, self)
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/spread/jelly.py", line 1122, in jelly
	    return _Jellier(taster, persistentStore, invoker).jelly(object)
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/spread/jelly.py", line 475, in jelly
	    return obj.jellyFor(self)
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/spread/flavors.py", line 127, in jellyFor
	    return "remote", jellier.invoker.registerReference(self)
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/spread/pb.py", line 658, in registerReference
	    luid = self.luids.get(puid)
	exceptions.AttributeError: 'NoneType' object has no attribute 'get'
2012-05-21 09:25:47-0700 [Broker,184,10.250.50.159] BuildSlave.detached(moz2-darwin10-slave48)
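The traceback reads like a teardown race: the connection is lost mid-handshake, the broker's reference table has already been cleared to None, and a callback still in flight then calls .get() on it. A toy illustration of that shape (not Twisted code itself, just an assumed model of the failure):

# Toy sketch (not Twisted) of the failure mode the traceback suggests:
# connection teardown clears the broker's puid->luid table, and a callback
# that is still in flight then tries to use it.
class FakeBroker(object):
    def __init__(self):
        self.luids = {}            # stands in for pb.Broker.luids

    def connection_lost(self):
        self.luids = None          # teardown drops the table

    def register_reference(self, puid):
        # mirrors the failing line: luid = self.luids.get(puid)
        return self.luids.get(puid)

b = FakeBroker()
b.connection_lost()                # the mtv1 <-> scl1 link dies mid-attach
try:
    b.register_reference(id(object()))
except AttributeError as e:
    print("same failure as the master log: %s" % e)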
Depends on: 757193
re-opening the Mozilla-Beta and Mozilla-Esr10 tinderboxen as Nick is attaching some of the darwin10 boxes to masters in scl3
Summary: Welfare check in the moz2-darwin10-slave ward → moz2-darwin10 machines are not staying connected to masters in scl1
OK, so since last week the slaves in mtv1 can't talk to scl1 - neither the masters nor slavealloc. They open a connection, so nc looks OK, but no data flows; e.g. curl'ing an HTTP request to slavealloc hangs until you disconnect, with 0 bytes received. I'll manually point the slaves at scl3 masters to get them processing work, and they'll come back to scl1 once the network flow is fixed and they can talk to slavealloc again. That's bug 757193.
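For reference, a minimal sketch of that slavealloc reachability check with a timeout, so it fails fast instead of hanging like the plain curl did. The URL below is a placeholder, not the real slavealloc endpoint, and this assumes the Python 2 environment the slaves run.

# Sketch of the slavealloc check described above (Python 2, urllib2).
# The URL is a placeholder; the 15s timeout is arbitrary.
import socket
import urllib2

URL = "http://slavealloc.example.com/gettac/moz2-darwin10-slave47"  # placeholder

try:
    body = urllib2.urlopen(URL, timeout=15).read()
    print("got %d bytes back from slavealloc" % len(body))
except (urllib2.URLError, socket.timeout) as e:
    print("no data from slavealloc within 15s (%s) - same symptom as the hung curl" % e)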
Assignee: bear → nrthomas
Severity: blocker → major
Priority: -- → P2
(In reply to Nick Thomas [:nthomas] from comment #8)
> I'm not sure what's going on here, but this is what I found
>
> * the slaves (moz2-darwin10-slave40-50, 53-56) are all accessible via ssh,
> except for moz2-darwin10-slave45 and 53 which are down

These are now split fairly evenly between bm30 and bm32, and busy clearing the backlog of pending builds.
Mac builds are working against scl1 masters.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard