Closed Bug 638309 Opened 13 years ago Closed 13 years ago

slaves in mtv with masters elsewhere don't work

Categories

(Release Engineering :: General, defect)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

(Whiteboard: [slaveduty])

Attachments

(4 files)

Someone else can add more detail here, but we're having lots of network failures with cross-dc master-slave connections.

We're going to take down most of the slaves in mtv, except try slaves will be re-parented to the try master in mtv.
Once we get the prod slaves down we can look at spinning up a moz2 master on buildbot-master3, possibly with a reduced set of branches to keep the VM from falling over. That box was previously test-master02.
pm02 is currently green.
We need to either have an 0.7.x master in MV or keep the MV Linux slaves where they are.
All the win32 machines in MTV for non-try branches are now disabled. m-c should be OK to open.
Attached file failures.txt
Let me add some of the detail I didn't add before.

Today we saw a bunch of windows slaves fail at about the same time - see attached, ignoring the tegras.  All of the slaves that failed were in mtv, and the masters (pm01, pm02, pm03) are in mpt.

An analysis of the logs on mw32-ix-slave03 shows that it was working on a build, with the most recent step started at 13:21.  While sending log data, it encountered an _ackFailed error - a failure in communication with the master - at 15:33.  It aborted the step, and began reconnecting to the master.

Over on the master, there's no evidence of a lost connection.  At 15:33, we do see logging of the reconnects.  The master believes it still has an old slave connected, and begins "pinging" it (via an RPC call over the connection, not ICMP), while refusing connections from the new slave.  Almost 20 minutes later, at 15:51, the master finally acknowledges that the old slave is dead, and accepts the new slave's connection.  At this point, the master marks the build as failed (purple), and assigns new work to the slave.

From everything I can tell, this looks like the TCP connection is being split into two isolated half-open connections: one for the slave and one for the master.  Note that the network does not actually *fail* between the hosts - when the slave reconnects, its SYN packets are delivered directly to the master and everything works as expected.

Since the slave is busily sending log data to the master, once its segments go un-acknowledged for 20m or so, it gives up and starts reconnecting.  That reconnection succeeds, and triggers the master to send some segments to the "old" slave on the severed TCP connection.  Another 20 minutes later, the master gives up on that connection, and begins using the new connection.

All of this takes place via a firewall that we know has a 12-hour maximum for TCP connections.  On that basis it shouldn't be expected to work, although a master-slave connection that's been up for 12 hours almost always belongs to an idle slave, so neither the slave nor the master ever notices the other is gone until nagios alerts us about the idle slave and we reboot it.  
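
For what it's worth, the standard mitigation for this "both ends sitting on a dead connection" failure mode is TCP keepalive: it can't stop the firewall from expiring the session, but it lets each end notice the break within minutes instead of waiting ~20 minutes for retransmissions or the master's ping to give up.  A minimal sketch using plain Python sockets (this is not buildbot's actual connection code, and the host/port in the usage comment are placeholders):

  import socket

  def enable_keepalive(sock, idle=600, interval=60, count=5):
      # Ask the kernel to probe the connection once it has been idle,
      # so a silently-expired firewall session is detected quickly
      # instead of leaving a half-open connection behind.
      sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
      # Linux-specific tunables; guarded because other platforms differ.
      if hasattr(socket, "TCP_KEEPIDLE"):
          sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
      if hasattr(socket, "TCP_KEEPINTVL"):
          sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
      if hasattr(socket, "TCP_KEEPCNT"):
          sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

  # usage on a slave's connection to its master (placeholder host/port):
  # sock = socket.create_connection(("pm01.build.mozilla.org", 9010))
  # enable_keepalive(sock)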

I still can't explain why we had this sudden bout of failures today.  This was shortly after bear brought up a few of the hosts that had been shut down for bug 636462, and ravi didn't spot any other related changes.

So we are operating on very partial information.  Even so, the plans are:

 - remove as many master-slave connections crossing the mtv border as possible
   - disable build slaves
   - redirect try slaves to the local try master (try_master1)
   - if slaves are needed for releases, etc., we'll bring them up manually to avoid the 12h timeout

 - add a new builder master in mtv to handle the remaining mtv slaves

 - increase the priority of getting as many slaves as possible out of mtv, realizing that this is complex and correlated with a lot of other moving parts in IT

 - build into slavealloc a provision that slaves can only connect to a master in their own datacenter
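
To make that last point concrete, here is a rough, hypothetical sketch of the kind of datacenter check slavealloc could apply when handing out masters; the function, dict fields, and sample data are illustrative only, not slavealloc's actual schema:

  # Hypothetical allocation filter: never offer a slave a master
  # outside its own datacenter (e.g. no mtv slave -> mpt master).
  def masters_for_slave(slave, masters):
      local = [m for m in masters if m['datacenter'] == slave['datacenter']]
      if not local:
          raise RuntimeError('no master available in %s' % slave['datacenter'])
      return local

  # illustrative data
  masters = [{'name': 'pm01', 'datacenter': 'mpt'},
             {'name': 'try_master1', 'datacenter': 'mtv'}]
  slave = {'name': 'w32-ix-slave22', 'datacenter': 'mtv'}
  print(masters_for_slave(slave, masters))   # only try_master1 is offered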
This is tickling my memory about why we disabled slave-side keepalive...
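
(For context: on a stock 0.8-style buildslave, slave-side keepalive is just a number in the slave's buildbot.tac that gets passed to BuildSlave, and setting it to None disables it.  A representative tac follows, with placeholder host/name/password and illustrative values; 0.7.x slaves import BuildSlave from buildbot.slave.bot instead.)

  # buildbot.tac for a 0.8-style buildslave (placeholder values)
  from twisted.application import service
  from buildslave.bot import BuildSlave

  basedir = '/builds/slave'
  buildmaster_host = 'pm01.build.mozilla.org'   # placeholder
  port = 9010
  slavename = 'example-slave'                   # placeholder
  passwd = 'pass'                               # placeholder
  keepalive = 600   # application-level keepalive every 10 minutes;
                    # None turns it off entirely
  usepty = 0
  umask = None
  maxdelay = 300

  application = service.Application('buildslave')
  s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
                 keepalive, usepty, umask=umask, maxdelay=maxdelay)
  s.setServiceParent(application)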
These failures look *very* similar to those in bug 592490.  In that case, apparently due to load, the slave got errors sending data to the master and reconnected.  There the failures occurred in batches ("3 were at 15:44, 5 more at 14:35; two on pm03, 6 on pm01"), just like this bug.

I have a hard time believing that load on the mtv link would spike high enough to kill multiple relatively low-bandwidth sessions without causing massive panic as everyone's SSH sessions bailed out, so I'm looking for more complex answers.  Ravi assures me that we've never reached the session limit on a firewall.  Hmm.
See Also: → 592490
The following hosts have been moved, and will need puppet, ssh keys, and a master reconfig for those which are not on bm01.  I need to check the DNS too, but I think zandr fixed that.

MOVE THESE:
linux-ix-slave01 (up, not connected)
linux-ix-slave02 (up, not connected)
linux-ix-slave06 (up, not connected)
linux-ix-slave12 (up, not connected)
  bm01
linux-ix-slave13 (up, not connected)
  bm01
linux-ix-slave16 (up, not connected)
  bm01
 
w32-ix-slave01 (up, not connected)
w32-ix-slave02 (up, not connected)
w32-ix-slave03 (up, not connected)
w32-ix-slave04 (up, not connected)
w32-ix-slave22 (up, not connected)
  bm01
w32-ix-slave23 (up, not connected)
  bm01
w32-ix-slave24 (up, not connected)
  bm01
w32-ix-slave25 (up, not connected)
  bm01
move new linux hosts to the scl puppet server
Assignee: nobody → dustin
Attachment #516695 - Flags: review?(nrthomas)
Comment on attachment 516695 [details] [diff] [review]
m638309-puppet-manifests-r1.patch

r+ if you fix scl-production.pp to list linux-ix-slave01 and linux-ix-slave02, instead of linux-ix-slave01 twice.
Attachment #516695 - Flags: review?(nrthomas) → review+
Comment on attachment 516699 [details] [diff] [review]
m638309-buildbot-configs-r1.patch

Could you make a followup patch that fixes TRY_LINUX_IXS and TRY_WIN32_IXS so that we can't accidentally connect non-try prod slaves to try? e.g.
 TRY_WIN32_IXS  = ['mw32-ix-slave%02i' % x for x in range(22,26)] + \
                  ['w32-ix-slave%02i' % x for x in range(3,22)]
is wrong; it should be at least
  ['w32-ix-slave%02i' % x for x in range(6,22)]
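
(The boundary matters because Python ranges are half-open.  Purely as an illustration of what those expressions expand to - not the landed config:)

  # range(3, 22) yields slave03..slave21, which still includes
  # w32-ix-slave03-05 (no longer try slaves); range(6, 22) starts at 06.
  print(['w32-ix-slave%02i' % x for x in range(3, 22)][:4])
  # ['w32-ix-slave03', 'w32-ix-slave04', 'w32-ix-slave05', 'w32-ix-slave06']
  print(['w32-ix-slave%02i' % x for x in range(6, 22)][:2])
  # ['w32-ix-slave06', 'w32-ix-slave07']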
Attachment #516699 - Flags: review?(nrthomas) → review+
This is what I landed. It excludes linux-ix-slave06 (try -> prod) and w32-ix-slave01 through 05 (try -> staging -> mostly prod; w32-ix-slave05 is still in staging).

default:    http://hg.mozilla.org/build/buildbot-configs/rev/0b1f39216424
production: http://hg.mozilla.org/build/buildbot-configs/rev/3b5da1dd4f26
Attachment #516714 - Flags: review+
Attachment #516714 - Flags: checked-in+
bm3 (aka buildbot-master1:8010) is reconfig'd with that change at 15:36. Got an exception
  exceptions.ValueError: builder Linux places build uses undefined slave linux-ix-slave01
on the first reconfig (using fabric), and had to do a 'make reconfig' on the box to fix that up. Seemed pretty slow to reconfig too.
For clarity, "buildbot-master1:8010" is 'bm01', aka "Build Master 03".
This will stay open until we find a good disposition for all of the disabled slaves.
Depends on: 639630
The slaves run properly in staging.
We have started clobbering them and preparing to put them back into production in bug 639630.
mv-moz2-linux-ix-slave03 through 19 (except 12) are now pointed at builder-master1.build.mozilla.org:9010.

The windows boxes are clobbered but will be moved on Monday.
Severity: critical → normal
mw32-ix-slave[02-06,08,10-12,14-18] are now connected to:
builder-master1.build.mozilla.org:9010

Slaves 07, 09 and 13 had issues clobbering (circular directory issues) and I have set them to clobber again.

I will update this bug once they are connected.
Severity: normal → critical
All of the MV slaves are being moved back to MPT-based masters, because we need them there for release purposes.
mv-moz2-linux-ix-slave03-11, and 13-19 were moved back to a combination of pm01 and pm03. The others are all down for other reasons, noted in the slave tracking spreadsheet.
(In reply to comment #20)
> mw32-ix-slave[02-06,08,10-12,14-18] are now connected to:
> builder-master1.build.mozilla.org:9010
> 
> Slaves 07, 09 and 13 had issues clobbering (circular directory issues) and I
> have set them to clobber again.
> 
> I will update this bug once they are connected.

The remaining mw32-ix-slaves have been put back into production and the previous ones have been moved to pm01 and pm03.

Only slaves moz2-darwin10-slave40 through 50 are left to be put back.
Severity: critical → normal
There's also a bunch of lower-numbered linux-ix-slave and w32-ix-slave machines to move back
(In reply to comment #19)
> mv-moz2-linux-ix-slave03 through 19 (except 12) are now pointed at
> builder-master1.build.mozilla.org:9010.
> 
> The windows boxes are clobbered but will be moved on Monday.

It seems some of these slaves were missing the xrbld key.

See bug 642789.
The following are still running in staging and need to be moved back to a production master:

linux-ix-slave03
linux-ix-slave04 (currently signed out to lsblakk)
linux-ix-slave15
moz2-darwin10-slave40
moz2-darwin10-slave41
moz2-darwin10-slave42
moz2-darwin10-slave43
moz2-darwin10-slave44
moz2-darwin10-slave45
moz2-darwin10-slave46
moz2-darwin10-slave47
moz2-darwin10-slave48
moz2-darwin10-slave49
moz2-darwin10-slave50
moved:
moz2-darwin10-slave40 -> pm03
moz2-darwin10-slave41 -> pm01
moz2-darwin10-slave42 -> pm03
moz2-darwin10-slave43 -> pm01
moz2-darwin10-slave44 -> pm03
linux-ix-slave03 -> pm01
moz2-darwin10-slave45 -> pm01
moz2-darwin10-slave46 -> pm03
moz2-darwin10-slave47 -> pm01
moz2-darwin10-slave48 -> pm03
moz2-darwin10-slave49 -> pm01
moz2-darwin10-slave50 -> pm03

leaving just
linux-ix-slave04 (catlee is using in staging)
linux-ix-slave15 (taking forever to rm -rf)
linux-ix-slave15:
/dev/sda:
 Timing cached reads:   28000 MB in  1.99 seconds = 14065.90 MB/sec
 Timing buffered disk reads:  278 MB in  3.01 seconds =  92.48 MB/sec

so it just misses the trip-back-to-IX cutoff.  At any rate, it's now running on pm01.

Let's leave linux-ix-slave04 in staging for the time being, which means this bug is done.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering