Closed Bug 638309 Opened 13 years ago Closed 13 years ago

slaves in mtv with masters elsewhere don't work

Categories

(Release Engineering :: General, defect)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

(Whiteboard: [slaveduty])

Attachments

(4 files)

Someone else can add more detail here, but we're having lots of network failures with cross-dc master-slave connections.

We're going to take down most of the slaves in mtv, except try slaves will be re-parented to the try master in mtv.
Once we get the prod slaves down we can look at spinning up a moz2 master on buildbot-master3, possibly with a reduced set of branches to keep the VM from falling over. That box was previously test-master02.
pm02 is currently green.
We need to either have an 0.7.x master in MV or keep the MV Linux slaves where they are.
All the win32 machines in MTV for non-try branches are now disabled. m-c should be OK to open.
Attached file failures.txt
Let me add some of the detail I didn't add before.

Today we saw a bunch of windows slaves fail at about the same time - see attached, ignoring the tegras.  All of the slaves that failed were in mtv, and the masters (pm01, pm02, pm03) are in mpt.

An analysis of the logs on mw32-ix-slave03 shows that it was working on a build, with the most recent step started at 13:21.  While sending log data, it encountered an _ackFailed error - a failure in communication with the master - at 15:33.  It aborted the step, and began reconnecting to the master.

Over on the master, there's no evidence of a lost connection.  At 15:33, we do see logging of the reconnects.  The master believes it still has an old slave connected, and begins "pinging" it (via an RPC call over the connection, not ICMP), while refusing connections from the new slave.  Almost 20 minutes later, at 15:51, the master finally acknowledges that the old slave is dead, and accepts the new slave's connection.  At this point, the master marks the build as failed (purple), and assigns new work to the slave.

From everything I can tell, this looks like the TCP connection is being split into two isolated half-open connections: one for the slave and one for the master.  Note that the network does not actually *fail* between the hosts - when the slave reconnects, its SYN packets are delivered directly to the master and everything works as expected.

Since the slave is busily sending log data to the master, once its segments go un-acknowledged for 20m or so, it gives up and starts reconnecting.  That reconnection succeeds, and triggers the master to send some segments to the "old" slave on the severed TCP connection.  Another 20 minutes later, the master gives up on that connection, and begins using the new connection.

All of this takes place via a firewall that we know has a 12-hour maximum for TCP connections.  On that basis it shouldn't be expected to work, although a master-slave connection that's been up for 12 hours almost always belongs to an idle slave, so neither the slave nor the master ever notices the other is gone until nagios alerts us about the idle slave and we reboot it.  
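
For what it's worth, the standard mitigation for this "both ends sitting on a dead connection" failure mode is TCP keepalive: it can't stop the firewall from expiring the session, but it lets each end notice the break within minutes instead of waiting ~20 minutes for retransmissions or the master's ping to give up.  A minimal sketch using plain Python sockets (this is not buildbot's actual connection code, and the host/port in the usage comment are placeholders):

  import socket

  def enable_keepalive(sock, idle=600, interval=60, count=5):
      # Ask the kernel to probe the connection once it has been idle,
      # so a silently-expired firewall session is detected quickly
      # instead of leaving a half-open connection behind.
      sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
      # Linux-specific tunables; guarded because other platforms differ.
      if hasattr(socket, "TCP_KEEPIDLE"):
          sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
      if hasattr(socket, "TCP_KEEPINTVL"):
          sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
      if hasattr(socket, "TCP_KEEPCNT"):
          sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

  # usage on a slave's connection to its master (placeholder host/port):
  # sock = socket.create_connection(("pm01.build.mozilla.org", 9010))
  # enable_keepalive(sock)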

I still can't explain why we had this sudden bout of failures today.  This was shortly after bear brought up a few of the hosts that had been shut down for bug 636462, and ravi didn't spot any other related changes.

So we are operating on very partial information.  Even so, the plans are:

 - remove as many master-slave connections crossing the mtv border as possible
   - disable build slaves
   - redirect try slaves to the local try master (try_master1)
   - if slaves are needed for releases, etc., we'll bring them up manually to avoid the 12h timeout

 - add a new builder master in mtv to handle the remaining mtv slaves

 - increase the priority of getting as many slaves as possible out of mtv, realizing that this is complex and correlated with a lot of other moving parts in IT

 - build into slavealloc a provision that slaves can only connect to a master in their own datacenter
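
To make that last point concrete, here is a rough, hypothetical sketch of the kind of datacenter check slavealloc could apply when handing out masters; the function, dict fields, and sample data are illustrative only, not slavealloc's actual schema:

  # Hypothetical allocation filter: never offer a slave a master
  # outside its own datacenter (e.g. no mtv slave -> mpt master).
  def masters_for_slave(slave, masters):
      local = [m for m in masters if m['datacenter'] == slave['datacenter']]
      if not local:
          raise RuntimeError('no master available in %s' % slave['datacenter'])
      return local

  # illustrative data
  masters = [{'name': 'pm01', 'datacenter': 'mpt'},
             {'name': 'try_master1', 'datacenter': 'mtv'}]
  slave = {'name': 'w32-ix-slave22', 'datacenter': 'mtv'}
  print(masters_for_slave(slave, masters))   # only try_master1 is offered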
This is tickling my memory about why we disabled slave-side keepalive...
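
(For context: on a stock 0.8-style buildslave, slave-side keepalive is just a number in the slave's buildbot.tac that gets passed to BuildSlave, and setting it to None disables it.  A representative tac follows, with placeholder host/name/password and illustrative values; 0.7.x slaves import BuildSlave from buildbot.slave.bot instead.)

  # buildbot.tac for a 0.8-style buildslave (placeholder values)
  from twisted.application import service
  from buildslave.bot import BuildSlave

  basedir = '/builds/slave'
  buildmaster_host = 'pm01.build.mozilla.org'   # placeholder
  port = 9010
  slavename = 'example-slave'                   # placeholder
  passwd = 'pass'                               # placeholder
  keepalive = 600   # application-level keepalive every 10 minutes;
                    # None turns it off entirely
  usepty = 0
  umask = None
  maxdelay = 300

  application = service.Application('buildslave')
  s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
                 keepalive, usepty, umask=umask, maxdelay=maxdelay)
  s.setServiceParent(application)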
These failures look *very* similar to those in bug 592490.  In that case, apparently due to load, the slave got errors sending data to the master and reconnected.  There the failures occurred in batches ("3 were at 15:44, 5 more at 14:35; two on pm03, 6 on pm01"), just like this bug.

I have a hard time believing that load on the mtv link would spike high enough to kill multiple relatively low-bandwidth sessions without causing massive panic as everyone's SSH sessions bailed out, so I'm looking for more complex answers.  Ravi assures me that we've never reached the session limit on a firewall.  Hmm.
See Also: → 592490
The following hosts have been moved, and will need puppet, ssh keys, and a master reconfig for those which are not on bm01.  I need to check the DNS too, but I think zandr fixed that.

MOVE THESE:
linux-ix-slave01 (up, not connected)
linux-ix-slave02 (up, not connected)
linux-ix-slave06 (up, not connected)
linux-ix-slave12 (up, not connected)
  bm01
linux-ix-slave13 (up, not connected)
  bm01
linux-ix-slave16 (up, not connected)
  bm01
 
w32-ix-slave01 (up, not connected)
w32-ix-slave02 (up, not connected)
w32-ix-slave03 (up, not connected)
w32-ix-slave04 (up, not connected)
w32-ix-slave22 (up, not connected)
  bm01
w32-ix-slave23 (up, not connected)
  bm01
w32-ix-slave24 (up, not connected)
  bm01
w32-ix-slave25 (up, not connected)
  bm01
move new linux hosts to the scl puppet server
Assignee: nobody → dustin
Attachment #516695 - Flags: review?(nrthomas)
Comment on attachment 516695 [details] [diff] [review]
m638309-puppet-manifests-r1.patch

r+ if you fix scl-production.pp to list linux-ix-slave01 and linux-ix-slave02, instead of linux-ix-slave01 twice.
Attachment #516695 - Flags: review?(nrthomas) → review+
Comment on attachment 516699 [details] [diff] [review]
m638309-buildbot-configs-r1.patch

Could you make a followup patch that fixes TRY_LINUX_IXS and TRY_WIN32_IXS so that we can't accidentally connect non-try prod slaves to try? e.g.
 TRY_WIN32_IXS  = ['mw32-ix-slave%02i' % x for x in range(22,26)] + \
                  ['w32-ix-slave%02i' % x for x in range(3,22)]
is wrong; it should be at least
  ['w32-ix-slave%02i' % x for x in range(6,22)]
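
(The boundary matters because Python ranges are half-open.  Purely as an illustration of what those expressions expand to - not the landed config:)

  # range(3, 22) yields slave03..slave21, which still includes
  # w32-ix-slave03-05 (no longer try slaves); range(6, 22) starts at 06.
  print(['w32-ix-slave%02i' % x for x in range(3, 22)][:4])
  # ['w32-ix-slave03', 'w32-ix-slave04', 'w32-ix-slave05', 'w32-ix-slave06']
  print(['w32-ix-slave%02i' % x for x in range(6, 22)][:2])
  # ['w32-ix-slave06', 'w32-ix-slave07']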
Attachment #516699 - Flags: review?(nrthomas) → review+
This is what I landed. It excludes linux-ix-slave06 (try -> prod) and w32-ix-slave01 through 05 (try -> staging -> mostly prod; w32-ix-slave05 is still in staging).

default:    http://hg.mozilla.org/build/buildbot-configs/rev/0b1f39216424
production: http://hg.mozilla.org/build/buildbot-configs/rev/3b5da1dd4f26
Attachment #516714 - Flags: review+
Attachment #516714 - Flags: checked-in+
bm3 (aka buildbot-master1:8010) is reconfig'd with that change at 15:36. Got an exception
  exceptions.ValueError: builder Linux places build uses undefined slave linux-ix-slave01
on the first reconfig (using fabric), and had to do a 'make reconfig' on the box to fix that up. Seemed pretty slow to reconfig too.
For clarity, "buildbot-master1:8010" is 'bm01', aka "Build Master 03".
This will stay open until we find a good disposition for all of the disabled slaves.
Depends on: 639630
The slaves run properly in staging.
We have started clobbering them and preparing to put them back into production in bug 639630.
mv-moz2-linux-ix-slave03 through 19 (except 12) are now pointed at builder-master1.build.mozilla.org:9010.

The windows boxes are clobbered but will be moved on Monday.
Severity: critical → normal
mw32-ix-slave[02-06,08,10-12,14-18] are now connected to:
builder-master1.build.mozilla.org:9010

Slaves 07, 09 and 13 had issues clobbering (circular directory issues) and I have set them to clobber again.

I will update this bug once they are connected.
Severity: normal → critical
All of the MV slaves are being moved back to MPT-based masters, because we need them there for release purposes.
mv-moz2-linux-ix-slave03-11, and 13-19 were moved back to a combination of pm01 and pm03. The others are all down for other reasons, noted in the slave tracking spreadsheet.
(In reply to comment #20)
> mw32-ix-slave[02-06,08,10-12,14-18] are now connected to:
> builder-master1.build.mozilla.org:9010
> 
> Slaves 07, 09 and 13 had issues clobbering (circular directory issues) and I
> have set them to clobber again.
> 
> I will update this bug once they are connected.

The remaining mw32-ix-slaves have been put back into production and the previous ones have been moved to pm01 and pm03.

Only slaves moz2-darwin10-slave40 through 50 are left to be put back.
Severity: critical → normal
There's also a bunch of lower-numbered linux-ix-slave and w32-ix-slave machines to move back
(In reply to comment #19)
> mv-moz2-linux-ix-slave03 through 19 (except 12) are now pointed at
> builder-master1.build.mozilla.org:9010.
> 
> The windows boxes are clobbered but will be moved on Monday.

It seems some of these slaves were missing the xrbld key.

See bug 642789.
The following are still running in staging and need to be moved back to a production master:

linux-ix-slave03
linux-ix-slave04 (currently signed out to lsblakk)
linux-ix-slave15
moz2-darwin10-slave40
moz2-darwin10-slave41
moz2-darwin10-slave42
moz2-darwin10-slave43
moz2-darwin10-slave44
moz2-darwin10-slave45
moz2-darwin10-slave46
moz2-darwin10-slave47
moz2-darwin10-slave48
moz2-darwin10-slave49
moz2-darwin10-slave50
moved:
moz2-darwin10-slave40 -> pm03
moz2-darwin10-slave41 -> pm01
moz2-darwin10-slave42 -> pm03
moz2-darwin10-slave43 -> pm01
moz2-darwin10-slave44 -> pm03
linux-ix-slave03 -> pm01
moz2-darwin10-slave45 -> pm01
moz2-darwin10-slave46 -> pm03
moz2-darwin10-slave47 -> pm01
moz2-darwin10-slave48 -> pm03
moz2-darwin10-slave49 -> pm01
moz2-darwin10-slave50 -> pm03

leaving just
linux-ix-slave04 (catlee is using in staging)
linux-ix-slave15 (taking forever to rm -rf)
linux-ix-slave15:
/dev/sda:
 Timing cached reads:   28000 MB in  1.99 seconds = 14065.90 MB/sec
 Timing buffered disk reads:  278 MB in  3.01 seconds =  92.48 MB/sec

so it just misses the trip-back-to-IX cutoff.  At any rate, it's now running on pm01.

Let's leave linux-ix-slave04 in staging for the time being, which means this bug is done.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering