Closed Bug 781860 Opened 13 years ago Closed 12 years ago

pesky keepalives for buildbot

Categories

(Release Engineering :: General, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cransom, Assigned: catlee)

Details

(Whiteboard: [buildbot][reit])

Attachments

(2 files, 1 obsolete file)

We keep butting heads against firewalls when they time out stale sessions (30 minutes with no traffic) while the buildbot connection is still alive. There are several provisions to keep this from happening; I was chatting with Dustin, and the one that's most useful is built into buildbot but broken in the currently deployed build.

The next option is enabling kernel/TCP-level keepalives, which is independent of the code (as long as the code makes the necessary syscall to enable it) and, from cursory glances, is being called in Twisted. If it is being set, there's a one-line change on the build masters to lower the keepalive to 10 minutes (or anything less than 30) and we'd never have a problem with a stateful device again. I tried verifying this on a buildmaster by checking whether SO_KEEPALIVE was set on the established connections, but lsof's -T f is invalid on this platform.

The extreme win would be enabling the same tweaks on the client side: since the client is driving the connection, even after a severe network disturbance (firewall reboot or something else that would drop all state) the sessions would be rebuilt on the firewall without dropping the connection to the buildmaster. But I would be ecstatic if we get server keepalives started, and I'm more than happy to give you whatever time I have next week (and beyond) to work on this.
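For concreteness, the kernel-level keepalive described above is just a couple of setsockopt calls. Here's a minimal Python sketch of what enabling it looks like; the idle/interval/count values are illustrative only (not what buildbot uses), and the TCP_KEEPIDLE-family options are Linux-specific:

```python
import socket

def enable_tcp_keepalive(sock, idle=600, interval=60, count=5):
    """Turn on TCP keepalives for an already-created socket."""
    # SO_KEEPALIVE is the portable switch; it's what Twisted's
    # setTcpKeepAlive(1) flips under the hood.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tuning is Linux-only; other platforms fall back to the
    # system-wide defaults, which is why the sysctl tweak matters too.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_tcp_keepalive(s)
    # A non-zero value here is what "lsof -T f" would have reported
    # as SO=KEEPALIVE on a platform where that flag works.
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    s.close()
```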
A read through the source on the master shows no calls to setTcpKeepAlive, although it is called on the slave. The problem with the sysctl tweak is that not all of the slaves are Linux! So, this may require a small code change, below. I'll verify this with the in-use version of Buildbot before suggesting the patch.

diff --git a/master/buildbot/buildslave.py b/master/buildbot/buildslave.py
index e7e1d65..65ed03a 100644
--- a/master/buildbot/buildslave.py
+++ b/master/buildbot/buildslave.py
@@ -270,6 +270,8 @@ class AbstractBuildSlave(config.ReconfigurableServiceMixin, pb.Avatar,
         if self.slave_status:
             self.slave_status.recordConnectTime()
+        # make sure we're using TCP keepalives
+        mind.broker.transport.setTcpKeepAlive(1)
         if self.isConnected():
             # duplicate slave - send it to arbitration

...in addition to a sysctl.conf addition, something like:

net.ipv4.tcp_keepalive_time = 240

I'm going to add something similar to the above patch in the latest buildbot. The backport should be simple.
Assignee: server-ops-releng → dustin
OK, I've verified that the masters are *not* setting keepalive, so the above patch would be required. Here's the patch I committed upstream: https://github.com/buildbot/buildbot/commit/29a30c7157ffc53f997b747894f4c3f8215285e0 So, my recommendation would be applying the patch in comment 1 to hgmo/build/buildbot, and applying the sysctl in comment 1 using puppet.
Assignee: dustin → nobody
Component: Server Operations: RelEng → Release Engineering: Automation (General)
QA Contact: arich → catlee
Is there any idea when we can get a few minutes of someone's time to apply the very tiny patch?
OS: Mac OS X → All
Priority: -- → P3
Hardware: x86 → All
Whiteboard: [buildbot]
Whiteboard: [buildbot] → [buildbot][reit]
Ping on this.
A few notes from my digging back into this: This bug is *strictly* about TCP keepalives, not application-level keepalives. We need at least one end of the connection to have SO_KEEPALIVE set *and* have the keepalive duration shorter than 30m (1800s). We know how to do that on Linux, but not really on OS X, much less Windows. The suggestion in comment 1 is to set SO_KEEPALIVE and change the duration (using sysctl) on the masters, since they're all Linux. 240 is less than 1800; in fact it's considerably smaller than strictly required.
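To make the arithmetic above explicit, here's a quick sanity-check sketch. The 7200/75/9 figures are the stock Linux defaults for tcp_keepalive_time/tcp_keepalive_intvl/tcp_keepalive_probes; the key point is that the firewall only needs to see the *first* probe within its idle window, so tcp_keepalive_time alone is what has to stay under 1800s:

```python
FIREWALL_IDLE_TIMEOUT = 1800  # 30 minutes of no traffic

def first_probe_after(keepalive_time):
    """Idle seconds before the kernel sends its first keepalive probe."""
    return keepalive_time

def dead_peer_after(keepalive_time, intvl=75, probes=9):
    """Idle seconds before the kernel gives up on an unresponsive peer."""
    return keepalive_time + intvl * probes

# Stock Linux default: first probe after two hours -- the firewall wins.
assert first_probe_after(7200) > FIREWALL_IDLE_TIMEOUT
# The comment 1 value: first probe at 4 minutes, well inside the window.
assert first_probe_after(240) < FIREWALL_IDLE_TIMEOUT
# Worst-case dead-peer detection with 240s: 240 + 75*9 seconds.
print(dead_peer_after(240))  # 915 (~15 minutes)
```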
(In reply to Ravi Pina [:ravi] from comment #4)
> Ping on this.

Ravi - due to bug 817597 I believe we now understand the entire picture better - planning for which long term solution we use will occur in bug 817838. (It sounds like we have some per datacenter configurations going on now, and it makes sense to sort this out before the next wave of test machines arrive.)
Attachment #688984 - Flags: review?(rail)
So I tried applying the given patch to my master. I stopped the slave partway through a build and waited to see what happened. The build step gets interrupted with:

[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.

However, the socket is still active according to netstat:

tcp        0    705 10.12.48.14:9049        10.134.65.114:34762     ESTABLISHED 26583/python
The socket finally closed itself 30 minutes later. /proc/sys/net/ipv4/tcp_keepalive_time is set to 240.
...which makes sense given that /proc/sys/net/ipv4/tcp_keepalive_probes is 9.
The patch for our version of buildbot looks like this:

diff -r 39c79564a0dd master/buildbot/master.py
--- a/master/buildbot/master.py Wed Dec 05 09:09:46 2012 -0500
+++ b/master/buildbot/master.py Wed Dec 05 19:31:31 2012 -0800
@@ -321,6 +321,12 @@
             # record when this connection attempt occurred
             sl.recordConnectTime()
+            # try to use TCP keepalives
+            try:
+                mind.broker.transport.setTcpKeepAlive(1)
+            except:
+                pass
+
             if sl.isConnected():
                 # uh-oh, we've got a duplicate slave. The most likely
                 # explanation is that the slave is behind a slow link, thinks we
Comment on attachment 688984 [details] [diff] [review]
hacky sysctl replacement for old puppet

Review of attachment 688984 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/sysctl/manifests/init.pp
@@ +1,5 @@
> +define sysctl($value) {
> +    exec {
> +        "sysctl-$name":
> +            command => "/sbin/sysctl $name=$value",
> +            onlyif => "/sbin/sysctl $name | grep -q -v $value";

A nit: you can use sysctl -n to print only the values.
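For reference, the define with that nit applied would read roughly like this (a sketch only, not the committed revision; sysctl -n prints just the value, so grep can't accidentally match on the setting's name):

```puppet
define sysctl($value) {
    exec {
        "sysctl-$name":
            command => "/sbin/sysctl $name=$value",
            # -n prints only the value, not "name = value"
            onlyif  => "/sbin/sysctl -n $name | grep -q -v $value";
    }
}
```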
Attachment #688984 - Flags: review?(rail) → review+
Comment on attachment 688984 [details] [diff] [review] hacky sysctl replacement for old puppet Landed (with nit addressed) as http://hg.mozilla.org/build/puppet-manifests/rev/a04b7a57a19f
Attachment #688984 - Flags: checked-in+
(In reply to Chris AtLee [:catlee] from comment #10)
> ...which makes sense given that /proc/sys/net/ipv4/tcp_keepalive_probes is 9.

That explains the mystery of why I suggested 240s. Thanks :)
committed http://hg.mozilla.org/build/buildbot/rev/878fe8b798d6 and merged to production-0.8 branch.
So far I gracefully restarted the following masters:

buildbot-master30.srv.releng.scl3.mozilla.com,scl3,build
buildbot-master35.srv.releng.scl3.mozilla.com,scl3,try
buildbot-master12.build.scl1.mozilla.com,scl1,build
buildbot-master32.srv.releng.scl3.mozilla.com,scl3,build
buildbot-master35.srv.releng.scl3.mozilla.com,scl3,build
Updated list of restarted masters:

buildbot-master34.srv.releng.scl3.mozilla.com,scl3,build
buildbot-master16.build.scl1.mozilla.com,scl1,tests
buildbot-master21.build.scl1.mozilla.com,scl1,tests
buildbot-master06.build.scl1.mozilla.com,scl1,tests
buildbot-master25.build.scl1.mozilla.com,scl1,build
buildbot-master04.build.scl1.mozilla.com,scl1,tests
buildbot-master31.srv.releng.scl3.mozilla.com,scl3,try
buildbot-master33.srv.releng.scl3.mozilla.com,scl3,try
buildbot-master49.srv.releng.scl3.mozilla.com,scl3,try
buildbot-master30.srv.releng.scl3.mozilla.com,scl3,build
buildbot-master35.srv.releng.scl3.mozilla.com,scl3,try
buildbot-master12.build.scl1.mozilla.com,scl1,build
buildbot-master32.srv.releng.scl3.mozilla.com,scl3,build
buildbot-master35.srv.releng.scl3.mozilla.com,scl3,build

Still to be done:

buildbot-master11.build.scl1.mozilla.com,scl1,tests
buildbot-master15.build.scl1.mozilla.com,scl1,tests
buildbot-master17.build.scl1.mozilla.com,scl1,tests
buildbot-master18.build.scl1.mozilla.com,scl1,tests
buildbot-master19.build.mtv1.mozilla.com,mtv1,tests
buildbot-master20.build.mtv1.mozilla.com,mtv1,tests
buildbot-master22.build.mtv1.mozilla.com,mtv1,tests
buildbot-master10.build.mtv1.mozilla.com,mtv1,tests
buildbot-master23.build.scl1.mozilla.com,scl1,tests
buildbot-master24.build.scl1.mozilla.com,scl1,tests
buildbot-master29.build.scl1.mozilla.com,scl1,tests
buildbot-master39.build.scl1.mozilla.com,scl1,tests
buildbot-master40.build.scl1.mozilla.com,scl1,tests
buildbot-master41.build.scl1.mozilla.com,scl1,tests
buildbot-master42.build.scl1.mozilla.com,scl1,tests
buildbot-master43.build.scl1.mozilla.com,scl1,tests
buildbot-master44.build.scl1.mozilla.com,scl1,tests
buildbot-master45.build.scl1.mozilla.com,scl1,tests
buildbot-master46.build.scl1.mozilla.com,scl1,tests
buildbot-master47.build.scl1.mozilla.com,scl1,tests
buildbot-master48.build.scl1.mozilla.com,scl1,tests
buildbot-master37.srv.releng.scl3.mozilla.com,scl3,tests
buildbot-master38.srv.releng.scl3.mozilla.com,scl3,tests
buildbot-master14.build.scl1.mozilla.com,scl1,try
buildbot-master13.build.scl1.mozilla.com,scl1,build
buildbot-master49.srv.releng.scl3.mozilla.com,scl3,build
I've just checked the looong running console. Looks like all masters except disabled 39-48 have been restarted. \o/
I checked the sessions for buildbot-master30.srv.releng.scl3 and all the sessions are being refreshed:

{primary:node0}
cransom@fw1.console.releng.scl3.mozilla.net> show security flow session destination-prefix 10.26.48.17 node 1 | grep Timeout
Session ID: 20008042, Policy name: srv--buildbot/639, State: Active, Timeout: 86400, Valid
Session ID: 20025542, Policy name: srv--buildbot/628, State: Active, Timeout: 86384, Valid
Session ID: 20026039, Policy name: srv--buildbot/639, State: Active, Timeout: 86252, Valid
Session ID: 20033856, Policy name: srv--buildbot/639, State: Active, Timeout: 86396, Valid
Session ID: 20045716, Policy name: srv--buildbot/639, State: Active, Timeout: 86384, Valid
Session ID: 20048100, Policy name: srv--buildbot/639, State: Active, Timeout: 86334, Valid
Session ID: 20052266, Policy name: srv--buildbot/628, State: Active, Timeout: 86398, Valid
Session ID: 20054563, Policy name: srv--buildbot/628, State: Active, Timeout: 86398, Valid
Session ID: 20055319, Policy name: srv--buildbot/639, State: Active, Timeout: 86378, Valid
Session ID: 20056382, Policy name: srv--buildbot/639, State: Active, Timeout: 86380, Valid
Session ID: 20057310, Policy name: srv--buildbot/628, State: Active, Timeout: 86392, Valid
Session ID: 20058677, Policy name: srv--buildbot/639, State: Active, Timeout: 86400, Valid
Session ID: 20061768, Policy name: srv--buildbot/628, State: Active, Timeout: 86400, Valid
Session ID: 20077167, Policy name: srv--buildbot/639, State: Active, Timeout: 86396, Valid
Session ID: 20086502, Policy name: srv--buildbot/639, State: Active, Timeout: 86226, Valid
Session ID: 20090456, Policy name: srv--buildbot/639, State: Active, Timeout: 86180, Valid
Session ID: 20104611, Policy name: srv--buildbot/639, State: Active, Timeout: 86386, Valid
Session ID: 20124304, Policy name: srv--buildbot/628, State: Active, Timeout: 86206, Valid
Session ID: 20128454, Policy name: srv--buildbot/628, State: Active, Timeout: 86388, Valid
Session ID: 20133859, Policy name: srv--buildbot/628, State: Active, Timeout: 86334, Valid
Session ID: 20134289, Policy name: srv--buildbot/639, State: Active, Timeout: 86258, Valid
Session ID: 20143284, Policy name: srv--buildbot/628, State: Active, Timeout: 86232, Valid
Session ID: 20159358, Policy name: srv--buildbot/639, State: Active, Timeout: 86350, Valid
Session ID: 20166053, Policy name: srv--buildbot/628, State: Active, Timeout: 86234, Valid
Session ID: 20197499, Policy name: srv--buildbot/639, State: Active, Timeout: 86166, Valid
Session ID: 20205137, Policy name: srv--buildbot/639, State: Active, Timeout: 86254, Valid
Session ID: 20227633, Policy name: srv--buildbot/628, State: Active, Timeout: 86192, Valid
Session ID: 20243742, Policy name: srv--buildbot/639, State: Active, Timeout: 86368, Valid
Session ID: 20251673, Policy name: srv--buildbot/639, State: Active, Timeout: 86236, Valid
Session ID: 20291262, Policy name: srv--buildbot/628, State: Active, Timeout: 86338, Valid
Session ID: 20295651, Policy name: srv--buildbot/628, State: Active, Timeout: 86400, Valid
Session ID: 20339685, Policy name: srv--buildbot/628, State: Active, Timeout: 86168, Valid
Session ID: 20344187, Policy name: srv--buildbot/628, State: Active, Timeout: 86354, Valid
Session ID: 20364402, Policy name: srv--buildbot/639, State: Active, Timeout: 86218, Valid
Session ID: 20365422, Policy name: srv--buildbot/628, State: Active, Timeout: 86400, Valid
Session ID: 20369174, Policy name: srv--buildbot/639, State: Active, Timeout: 86220, Valid
Session ID: 20372030, Policy name: srv--buildbot/639, State: Active, Timeout: 86186, Valid
Session ID: 20374719, Policy name: srv--buildbot/628, State: Active, Timeout: 86276, Valid
Session ID: 20382909, Policy name: srv--buildbot/628, State: Active, Timeout: 86368, Valid
Session ID: 20396374, Policy name: srv--buildbot/639, State: Active, Timeout: 86288, Valid

86400 is the max timeout value, so that looks good so far.
The catalyst for this bug was dev-master01.build.scl1.mozilla.com which hasn't been updated yet, as far as I can tell. Is that on the list to be done as well?
Can anyone comment if the releng side fixes are complete so I can verify and close this?
Yes, all the releng-side changes are complete. Users with instances on dev-master01 may need to update their buildbot master code with the fix, but we don't need to block on that.
Attachment #715926 - Flags: review?(dustin)
Attachment #715926 - Attachment is obsolete: true
Attachment #715926 - Flags: review?(dustin)
Attachment #715928 - Flags: review?(dustin)
Attachment #715928 - Flags: review?(dustin) → review+
Attachment #715928 - Flags: checked-in+
Assignee: nobody → catlee
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Is it possible this might have made bug 710942 worse? On 2/22 the incidence jumped from less than 0.5/day to 4, then on 2/23 to 9, and has stayed high since. philor says it might be load-related from the tests, but would we need to be starving the Windows kernel of CPU time for 180s for that to be plausible?
Flags: needinfo?(catlee)
Bug 844648 also lines up suspiciously well with attachment 715928 [details] [diff] [review] being deployed. I'm wondering if we need some larger values for the timeouts.
If this is happening often enough, could we capture a tcpdump of it from the master side? Most systems use TCP offload these days, so the ACK would come from the NIC, or at least require very little interaction from the slave kernel.
These masters are mostly KVM guests as well. I think that means that the kernel itself is subject to being CPU starved as well, especially if other guests on the same host are also very busy. Anything particular we should be looking for in tcpdump? FIN/RST packets?
Flags: needinfo?(catlee)
All of the TCP envelopes in the tcpdump, really - so do we see keepalives, what kind of latency is there on acks, do those stop at some point, etc. Basically, if you can capture the last hour or so of a connection that eventually fails, that'd be best.
Caught some that were disconnected as part of a reconfig. They're in root@bm35:/root/tcpdump-screen.log I'm not sure these are the types of disconnects we're looking for though.
OK, here's the capture of a failure on talos-r3-xp-034 (10.12.50.142). Note that communication here is within the VLAN, so no firewalls are involved.

> 21:59:48.624287 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3637152:3638612, ack 188153, win 65335, length 1460
> 21:59:48.624438 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3638612:3640072, ack 188153, win 65335, length 1460
> 21:59:48.624631 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3640072:3641532, ack 188153, win 65335, length 1460
> a 21:59:48.660293 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3641532, win 5840, length 0
> 21:59:48.660609 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3641532:3642992, ack 188153, win 65335, length 1460
> 21:59:48.660717 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3642992:3644452, ack 188153, win 65335, length 1460
> 21:59:48.661246 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3644452:3645912, ack 188153, win 65335, length 1460
> 21:59:48.661258 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3645912:3647372, ack 188153, win 65335, length 1460
> b 21:59:48.705296 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> c 21:59:48.957470 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3647372:3647373, ack 188153, win 65335, length 1
> 21:59:48.957484 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> 21:59:49.613745 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3647372:3647373, ack 188153, win 65335, length 1
> 21:59:49.613767 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> 21:59:50.816921 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3647372:3647373, ack 188153, win 65335, length 1
> 21:59:50.816954 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> 21:59:53.223199 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3647372:3647373, ack 188153, win 65335, length 1
> 21:59:53.223232 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> 21:59:58.035811 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3647372:3647373, ack 188153, win 65335, length 1
> 21:59:58.035829 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> 22:00:07.660991 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3647372:3647373, ack 188153, win 65335, length 1
> 22:00:07.661025 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> d 22:00:23.925080 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 59860, length 0
> e 22:00:24.036399 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 59860, length 16
> 22:00:24.360296 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 59860, length 16
> 22:00:25.008314 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 59860, length 16
> 22:00:26.304386 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 59860, length 16
> 22:00:28.896300 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 59860, length 16
> f 22:01:05.185304 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 65535, length 16
> 22:01:46.657318 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 65535, length 16
> g 22:01:46.657995 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [R], seq 188643936, win 0, length 0
> h 22:02:46.842839 IP 10.12.50.142.1053 > 10.12.49.16.9201: Flags [S], seq 2514350768, win 65535, options [mss 1460,nop,nop,sackOK], length 0

At (a), the slave is pushing data to the master, and the master is ack'ing it. The master's receive window is getting small, probably because the process is not reading from the socket.
At (b) that window shrinks to zero, but all transmitted data is ack'd. At (c), the slave sends a window probe. The timing is a bit odd - usually probes aren't sent until 5s after the window closes. The slave continues to send probes at intervals of 0.3, 0.6, 1.2, 2.5, 5, and 9 seconds, until the master's window opens up at (d).

At this point, the master begins sending 16 bytes of payload (e), and retransmits those six times. At (f), the master's window opens up a bit more - another read from the socket by the process. Given the timings of the earlier window probes, the next probe should have arrived sometime around 22:00:26, if the slave hadn't received any packets from the master.

Finally, at (g), the slave sends a RST, basically saying "what the heck are you talking about". The sequence number in that packet is unrelated to anything above, suggesting that the slave has no memory of the earlier connection. The slave immediately reconnects to the master at (h).

There are no keepalive packets here (those look like re-transmits of the last ack'd byte). My best guess as to what's happening here is that the slave rebooted unexpectedly.
Hmmm...it's odd that it reconnects almost exactly a minute later. I don't think the slaves can reboot that quickly. Maybe just the tcp connection died, and the buildslave retried a minute later. In any case, the log for this event on the slave is gone. How do we get some more insight into what is going on in the final moments of a job like this?
The slaves reboot pretty quickly - something around 45s. So it's not unreasonable. And if the TCP connection had died but the kernel stayed up, it would still have a record of the sequence number, so it's far more likely that the kernel also restarted.
Product: mozilla.org → Release Engineering
Component: General Automation → General
