Closed Bug 638309 Opened 14 years ago Closed 14 years ago
slaves in mtv with masters elsewhere don't work
Categories: Release Engineering :: General, defect
Tracking: Not tracked
Status: RESOLVED FIXED
People
(Reporter: dustin, Assigned: dustin)
References
Details
(Whiteboard: [slaveduty])
Attachments
(4 files)
6.57 KB, text/plain
1.01 KB, patch, nthomas: review+
1.32 KB, patch, nthomas: review+
2.81 KB, patch, nthomas: review+ checked-in+
Someone else can add more detail here, but we're having lots of network failures with cross-dc master-slave connections.
We're going to take down most of the slaves in mtv, except try slaves will be re-parented to the try master in mtv.
Comment 1•14 years ago
Once we get the prod slaves down we can look at spinning up a moz2 master on buildbot-master3, possibly with a reduced set of branches to keep the VM from falling over. That box was previously test-master02.
Comment 2•14 years ago
pm02 is currently green.
We need to either have a 0.7.x master in MV or keep the MV linux slaves on it.
Comment 3•14 years ago
All the win32 machines in MTV for non-try branches are now disabled. m-c should be OK to open.
Assignee
Comment 4•14 years ago
Let me add some of the detail I didn't add before.
Today we saw a bunch of windows slaves fail at about the same time - see attached, ignoring the tegras. All of the slaves that failed were in mtv, and the masters (pm01, pm02, pm03) are in mpt.
An analysis of the logs on mw32-ix-slave03 shows that it was working on a build, with the most recent step started at 13:21. While sending log data, it encountered an _ackFailed error - a failure in communication with the master - at 15:33. It aborted the step, and began reconnecting to the master.
Over on the master, there's no evidence of a lost connection. At 15:33, we do see logging of the reconnects. The master believes it still has an old slave connected, and begins "pinging" it (via an RPC call over the connection, not ICMP), while refusing connections from the new slave. Almost 20 minutes later, at 15:51, the master finally acknowledges that the old slave is dead, and accepts the new slave's connection. At this point, the master marks the build as failed (purple), and assigns new work to the slave.
From everything I can tell, this looks like the TCP connection is being split into two isolated half-open connections: one for the slave and one for the master. Note that the network does not actually *fail* between the hosts - when the slave reconnects, its SYN packets are delivered directly to the master and everything works as expected.
Since the slave is busily sending log data to the master, once its segments go unacknowledged for 20 minutes or so, it gives up and starts reconnecting. That reconnection succeeds, and triggers the master to send some segments to the "old" slave on the severed TCP connection. Another 20 minutes later, the master gives up on that connection, and begins using the new connection.
All of this takes place via a firewall that we know to have a 12-hour maximum for TCP connections. On that basis, it shouldn't be expected to work, although a master-slave connection that's been up for 12 hours is more than likely an idle slave, so neither the slave nor the master ever notices the other is gone, until nagios alerts us of the idle slave and we reboot it.
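Since the failure mode here is a silently severed (or firewall-expired) connection that neither side notices until it tries to send, one mitigation is TCP keepalive, which probes an idle connection and errors out the socket when the peer stops answering. The following is a minimal Python sketch, not Buildbot's own keepalive mechanism, and the TCP_KEEP* options it uses are Linux-specific:

import socket

# Minimal sketch: enable keepalive on an already-connected socket so an idle
# but dead connection fails within minutes instead of lingering past the
# firewall's 12-hour session limit.
def enable_keepalive(sock, idle=120, interval=30, count=4):
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)       # seconds idle before the first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)       # unanswered probes before the socket errors out

# With these illustrative values, a dead idle connection is detected after
# roughly idle + interval * count = 120 + 30 * 4 = 240 seconds.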
I still can't explain why we had this sudden bout of failures today. This was shortly after bear brought up a few of the hosts that were shut down for bug 636462, and ravi didn't spot any other related changes.
So we are operating on very partial information. Even so, the plans are:
- remove as many master-slave connections crossing the mtv border as possible
- disable build slaves
- redirect try slaves to the local try master (try_master1)
- if slaves are needed for releases, etc., we'll bring them up manually to avoid the 12h timeout
- add a new builder master in mtv to handle the remaining mtv slaves
- increase the priority of getting as many slaves as possible out of mtv, realizing that this is complex and correlated with a lot of other moving parts in IT
- build into slavealloc a provision that slaves can only connect to a master in their own datacenter (see the sketch after this list)
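As a rough illustration of that last point, the check could look something like the sketch below; the dc_map inventory and the function name are hypothetical, since slavealloc's actual schema isn't part of this bug:

# Hypothetical sketch of the per-datacenter restriction.
def same_datacenter_masters(slave, masters, dc_map):
    """Return only the masters in the same datacenter as the slave."""
    slave_dc = dc_map.get(slave)
    return [m for m in masters if dc_map.get(m) == slave_dc]

# Example with hosts from this bug:
#   dc_map = {'w32-ix-slave22': 'mtv', 'try_master1': 'mtv', 'pm01': 'mpt', 'pm03': 'mpt'}
#   same_datacenter_masters('w32-ix-slave22', ['pm01', 'pm03', 'try_master1'], dc_map)
#   -> ['try_master1']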
Comment 5•14 years ago
This is tickling my memory about why we disabled slave-side keepalive...
Assignee
Comment 6•14 years ago
These failures look *very* similar to those in bug 592490. In that case, apparently due to load, the slave got errors sending data to the master and reconnected. There the failures occurred in batches ("3 were at 15:44, 5 more at 14:35; two on pm03, 6 on pm01"), just like this bug.
I have a hard time believing that load on the mtv link would spike high enough to kill multiple relatively low-bandwidth sessions without causing massive panic as everyone's SSH sessions bailed out, so I'm looking for more complex answers. Ravi assures me that we've never reached the session limit on a firewall. Hmm.
See Also: → 592490
Assignee
Comment 7•14 years ago
The following hosts have been moved, and will need puppet, ssh keys, and a master reconfig for those which are not on bm01. I need to check the DNS too, but I think zandr fixed that.
MOVE THESE:
linux-ix-slave01 (up, not connected)
linux-ix-slave02 (up, not connected)
linux-ix-slave06 (up, not connected)
linux-ix-slave12 (up, not connected) -> bm01
linux-ix-slave13 (up, not connected) -> bm01
linux-ix-slave16 (up, not connected) -> bm01
w32-ix-slave01 (up, not connected)
w32-ix-slave02 (up, not connected)
w32-ix-slave03 (up, not connected)
w32-ix-slave04 (up, not connected)
w32-ix-slave22 (up, not connected) -> bm01
w32-ix-slave23 (up, not connected) -> bm01
w32-ix-slave24 (up, not connected) -> bm01
w32-ix-slave25 (up, not connected) -> bm01
Assignee
Comment 8•14 years ago
move new linux hosts to the scl puppet server
Assignee: nobody → dustin
Attachment #516695 - Flags: review?(nrthomas)
Comment 9•14 years ago
Comment on attachment 516695 [details] [diff] [review]
m638309-puppet-manifests-r1.patch
r+ if you fix scl-production.pp to list linux-ix-slave01 and linux-ix-slave02, instead of linux-ix-slave01 twice.
Attachment #516695 - Flags: review?(nrthomas) → review+
Assignee
Comment 10•14 years ago
Attachment #516699 - Flags: review?(nrthomas)
Comment 11•14 years ago
Comment on attachment 516699 [details] [diff] [review]
m638309-buildbot-configs-r1.patch
Could you make a followup patch that fixes TRY_LINUX_IXS and TRY_WIN32_IXS so that we can't accidentally connect non-try prod slaves to try? e.g.
TRY_WIN32_IXS = ['mw32-ix-slave%02i' % x for x in range(22,26)] + \
['w32-ix-slave%02i' % x for x in range(3,22)]
is wrong, should be at least
['w32-ix-slave%02i' % x for x in range(6,22)]
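For reference (plain Python, so this can be checked at a prompt), the overlap being flagged is the w32-ix-slave03 through 05 entries that range(3,22) still picks up even though those machines are leaving the try pool (see comment 12):

# Current (wrong) definition: the second list starts at w32-ix-slave03.
TRY_WIN32_IXS = ['mw32-ix-slave%02i' % x for x in range(22, 26)] + \
                ['w32-ix-slave%02i' % x for x in range(3, 22)]
print(TRY_WIN32_IXS[3:7])
# ['mw32-ix-slave25', 'w32-ix-slave03', 'w32-ix-slave04', 'w32-ix-slave05']

# Suggested fix: start the second range at 06 so slaves 01-05 stay out of try.
TRY_WIN32_IXS = ['mw32-ix-slave%02i' % x for x in range(22, 26)] + \
                ['w32-ix-slave%02i' % x for x in range(6, 22)]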
Attachment #516699 - Flags: review?(nrthomas) → review+
Comment 12•14 years ago
This is what I landed. It excludes linux-ix-slave06 (try -> prod), and w32-ix-slave01 thru 05 (try -> staging -> mostly prod, w32-ix-slave05 still in staging).
default: http://hg.mozilla.org/build/buildbot-configs/rev/0b1f39216424
production: http://hg.mozilla.org/build/buildbot-configs/rev/3b5da1dd4f26
Attachment #516714 - Flags: review+
Attachment #516714 - Flags: checked-in+
Comment 13•14 years ago
bm3 (aka buildbot-master1:8010) is reconfig'd with that change at 15:36. Got an exception
exceptions.ValueError: builder Linux places build uses undefined slave linux-ix-slave01
on the first reconfig (using fabric), and had to do a 'make reconfig' on the box to fix that up. Seemed pretty slow to reconfig too.
Assignee
Comment 14•14 years ago
For clarity, "buildbot-master1:8010" is 'bm01', aka "Build Master 03".
Comment 15•14 years ago
That's what I thought too, until I couldn't find it in http://hg.mozilla.org/build/tools/file/default/buildfarm/maintenance/production-masters.json
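For doing that kind of lookup, a hedged helper like the one below works without hard-coding any field names; it assumes only that production-masters.json is a JSON list of flat records, which is an assumption here:

import json

def find_master(path, needle):
    # Return every record whose values mention the given hostname fragment.
    with open(path) as f:
        records = json.load(f)
    return [r for r in records if any(needle in str(v) for v in r.values())]

# e.g. find_master('production-masters.json', 'buildbot-master1')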
Assignee
Comment 16•14 years ago
Assignee
Comment 17•14 years ago
This will stay open until we find a good disposition for all of the disabled slaves.
Comment 18•14 years ago
The slaves run properly on staging.
We have started clobbering them and preparing to put them back into production in bug 639630.
Comment 19•14 years ago
mv-moz2-linux-ix-slave03 through 19 (except 12) are now pointed at builder-master1.build.mozilla.org:9010.
The windows boxes are clobbered but will be moved on Monday.
Severity: critical → normal
Comment 20•14 years ago
mw32-ix-slave[02-06,08,10-12,14-18] are now connected to:
builder-master1.build.mozilla.org:9010
Slaves 07, 09 and 13 had issues clobbering (circular directory issues) and I have set them to clobber again.
I will update this bug once they are connected.
Severity: normal → critical
Comment 21•14 years ago
All of the MV slaves are being moved back to MPT-based masters, because we need them there for release purposes.
Comment 22•14 years ago
mv-moz2-linux-ix-slave03-11, and 13-19 were moved back to a combination of pm01 and pm03. The others are all down for other reasons, noted in the slave tracking spreadsheet.
Comment 23•14 years ago
(In reply to comment #20)
> mw32-ix-slave[02-06,08,10-12,14-18] are now connected to:
> builder-master1.build.mozilla.org:9010
>
> Slaves 07, 09 and 13 had issues clobbering (circular directory issues) and I
> have set them to clobber again.
>
> I will update this bug once they are connected.
The remaining mw32-ix-slaves have been put back into production and the previous ones have been moved to pm01 and pm03.
Only slaves moz2-darwin10-slave40 through 50 are left to be put back.
Severity: critical → normal
Comment 24•14 years ago
There's also a bunch of lower-numbered linux-ix-slave and w32-ix-slave machines to move back.
Comment 25•14 years ago
(In reply to comment #19)
> mv-moz2-linux-ix-slave03 through 19 (except 12) are now pointed at
> builder-master1.build.mozilla.org:9010.
>
> The windows boxes are clobbered but will be moved on Monday.
It seems some of these slaves were missing the xrbld key.
See bug 642789.
Assignee
Comment 26•14 years ago
The following are still running in staging and need to be moved back to a production master:
linux-ix-slave03
linux-ix-slave04 (currently signed out to lsblakk)
linux-ix-slave15
moz2-darwin10-slave40
moz2-darwin10-slave41
moz2-darwin10-slave42
moz2-darwin10-slave43
moz2-darwin10-slave44
moz2-darwin10-slave45
moz2-darwin10-slave46
moz2-darwin10-slave47
moz2-darwin10-slave48
moz2-darwin10-slave49
moz2-darwin10-slave50
Assignee
Comment 27•14 years ago
moved:
moz2-darwin10-slave40 -> pm03
moz2-darwin10-slave41 -> pm01
moz2-darwin10-slave42 -> pm03
moz2-darwin10-slave43 -> pm01
moz2-darwin10-slave44 -> pm03
Assignee
Comment 28•14 years ago
linux-ix-slave03 -> pm01
moz2-darwin10-slave45 -> pm01
moz2-darwin10-slave46 -> pm03
moz2-darwin10-slave47 -> pm01
moz2-darwin10-slave48 -> pm03
moz2-darwin10-slave49 -> pm01
moz2-darwin10-slave50 -> pm03
leaving just
linux-ix-slave04 (catlee is using in staging)
linux-ix-slave15 (taking forever to rm -rf)
Assignee
Comment 29•14 years ago
linux-ix-slave15:
/dev/sda:
Timing cached reads: 28000 MB in 1.99 seconds = 14065.90 MB/sec
Timing buffered disk reads: 278 MB in 3.01 seconds = 92.48 MB/sec
so it just misses the trip-back-to-IX cutoff. At any rate, it's now running on pm01.
Let's leave linux-ix-slave04 in staging for the time being, meaning this bug is done.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•12 years ago
|
Product: mozilla.org → Release Engineering