Closed Bug 781860 Opened 13 years ago Closed 12 years ago

pesky keepalives for buildbot

Categories

(Release Engineering :: General, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cransom, Assigned: catlee)

Details

(Whiteboard: [buildbot][reit])

Attachments

(2 files, 1 obsolete file)

We keep butting heads against firewalls when they time out stale sessions (30 minutes with no traffic) while the buildbot connection is still alive. There are several provisions to keep this from happening; I was chatting with Dustin, and the one that's most useful is built into buildbot but broken in the currently deployed build.

The next option is enabling kernel/TCP-level keepalives, which is independent of the code (as long as the code makes the necessary syscall to enable it) and, from cursory glances, is being called in Twisted. If it is being set, there's a one-line change on the build masters to lower the keepalive to 10 minutes (or anything less than 30) and we'd never have a problem with a stateful device again. I tried verifying this on a buildmaster by checking whether SO_KEEPALIVE was set on the established connections, but lsof's -T f is invalid on this platform.

The extreme win would be enabling the same tweaks on the client side: since the client is driving the connection, even after a severe network disturbance (firewall reboot or something else that would drop all state) the sessions would be rebuilt on the firewall without dropping the connection to the buildmaster. But I would be ecstatic if we get server keepalives started, and I'm more than happy to give you whatever time I have next week (and beyond) to work on this.
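For concreteness, the kernel-level keepalive described above is just a couple of setsockopt calls. Here's a minimal Python sketch of what enabling it looks like; the idle/interval/count values are illustrative only (not what buildbot uses), and the TCP_KEEPIDLE-family options are Linux-specific:

```python
import socket

def enable_tcp_keepalive(sock, idle=600, interval=60, count=5):
    """Turn on TCP keepalives for an already-created socket."""
    # SO_KEEPALIVE is the portable switch; it's what Twisted's
    # setTcpKeepAlive(1) flips under the hood.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tuning is Linux-only; other platforms fall back to the
    # system-wide defaults, which is why the sysctl tweak matters too.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_tcp_keepalive(s)
    # A non-zero value here is what "lsof -T f" would have reported
    # as SO=KEEPALIVE on a platform where that flag works.
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    s.close()
```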
A read through the source on the master shows no calls to setTcpKeepAlive, although it is called on the slave. The problem with the sysctl tweak is that not all of the slaves are Linux! So, this may require a small code change, below. I'll verify this with the in-use version of Buildbot before suggesting the patch.

diff --git a/master/buildbot/buildslave.py b/master/buildbot/buildslave.py
index e7e1d65..65ed03a 100644
--- a/master/buildbot/buildslave.py
+++ b/master/buildbot/buildslave.py
@@ -270,6 +270,8 @@ class AbstractBuildSlave(config.ReconfigurableServiceMixin, pb.Avatar,
         if self.slave_status:
             self.slave_status.recordConnectTime()
+        # make sure we're using TCP keepalives
+        mind.broker.transport.setTcpKeepAlive(1)
         if self.isConnected():
             # duplicate slave - send it to arbitration

...in addition to a sysctl.conf addition, something like:

net.ipv4.tcp_keepalive_time = 240

I'm going to add something similar to the above patch in the latest buildbot. The backport should be simple.
Assignee: server-ops-releng → dustin
OK, I've verified that the masters are *not* setting keepalive, so the above patch would be required. Here's the patch I committed upstream: https://github.com/buildbot/buildbot/commit/29a30c7157ffc53f997b747894f4c3f8215285e0 So, my recommendation would be applying the patch in comment 1 to hgmo/build/buildbot, and applying the sysctl in comment 1 using puppet.
Assignee: dustin → nobody
Component: Server Operations: RelEng → Release Engineering: Automation (General)
QA Contact: arich → catlee
Is there any idea when we can get a few minutes of someone's time to apply the very tiny patch?
OS: Mac OS X → All
Priority: -- → P3
Hardware: x86 → All
Whiteboard: [buildbot]
Whiteboard: [buildbot] → [buildbot][reit]
Ping on this.
A few notes from my digging back into this: This bug is *strictly* about TCP keepalives, not application-level keepalives. We need at least one end of the connection to have SO_KEEPALIVE set *and* have the keepalive duration shorter than 30m (1800s). We know how to do that on Linux, but not really on OS X, much less Windows. The suggestion in comment 1 is to set SO_KEEPALIVE and change the duration (using sysctl) on the masters, since they're all Linux. 240 is less than 1800; in fact it's considerably smaller than strictly required.
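To make the arithmetic above explicit, here's a quick sanity-check sketch. The 7200/75/9 figures are the stock Linux defaults for tcp_keepalive_time/tcp_keepalive_intvl/tcp_keepalive_probes; the key point is that the firewall only needs to see the *first* probe within its idle window, so tcp_keepalive_time alone is what has to stay under 1800s:

```python
FIREWALL_IDLE_TIMEOUT = 1800  # 30 minutes of no traffic

def first_probe_after(keepalive_time):
    """Idle seconds before the kernel sends its first keepalive probe."""
    return keepalive_time

def dead_peer_after(keepalive_time, intvl=75, probes=9):
    """Idle seconds before the kernel gives up on an unresponsive peer."""
    return keepalive_time + intvl * probes

# Stock Linux default: first probe after two hours -- the firewall wins.
assert first_probe_after(7200) > FIREWALL_IDLE_TIMEOUT
# The comment 1 value: first probe at 4 minutes, well inside the window.
assert first_probe_after(240) < FIREWALL_IDLE_TIMEOUT
# Worst-case dead-peer detection with 240s: 240 + 75*9 seconds.
print(dead_peer_after(240))  # 915 (~15 minutes)
```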
(In reply to Ravi Pina [:ravi] from comment #4)
> Ping on this.

Ravi - due to bug 817597 I believe we now understand the entire picture better - planning for which long term solution we use will occur in bug 817838. (It sounds like we have some per datacenter configurations going on now, and it makes sense to sort this out before the next wave of test machines arrive.)
Attachment #688984 - Flags: review?(rail)
So I tried applying the given patch to my master. I stopped the slave partway through a build and waited to see what happened. The build step gets interrupted with:

[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.

However, the socket is still active according to netstat:

tcp        0    705 10.12.48.14:9049        10.134.65.114:34762     ESTABLISHED 26583/python
The socket finally closed itself 30 minutes later. /proc/sys/net/ipv4/tcp_keepalive_time is set to 240.
...which makes sense given that /proc/sys/net/ipv4/tcp_keepalive_probes is 9.
The patch for our version of buildbot looks like this:

diff -r 39c79564a0dd master/buildbot/master.py
--- a/master/buildbot/master.py Wed Dec 05 09:09:46 2012 -0500
+++ b/master/buildbot/master.py Wed Dec 05 19:31:31 2012 -0800
@@ -321,6 +321,12 @@
             # record when this connection attempt occurred
             sl.recordConnectTime()
+            # try to use TCP keepalives
+            try:
+                mind.broker.transport.setTcpKeepAlive(1)
+            except:
+                pass
+
             if sl.isConnected():
                 # uh-oh, we've got a duplicate slave. The most likely
                 # explanation is that the slave is behind a slow link, thinks we
Comment on attachment 688984 [details] [diff] [review]
hacky sysctl replacement for old puppet

Review of attachment 688984 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/sysctl/manifests/init.pp
@@ +1,5 @@
> +define sysctl($value) {
> +    exec {
> +        "sysctl-$name":
> +            command => "/sbin/sysctl $name=$value",
> +            onlyif => "/sbin/sysctl $name | grep -q -v $value";

A nit: you can use sysctl -n to print only the values.
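For reference, the define with that nit applied would read roughly like this (a sketch only, not the committed revision; sysctl -n prints just the value, so grep can't accidentally match on the setting's name):

```puppet
define sysctl($value) {
    exec {
        "sysctl-$name":
            command => "/sbin/sysctl $name=$value",
            # -n prints only the value, not "name = value"
            onlyif  => "/sbin/sysctl -n $name | grep -q -v $value";
    }
}
```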
Attachment #688984 - Flags: review?(rail) → review+
Comment on attachment 688984 [details] [diff] [review] hacky sysctl replacement for old puppet Landed (with nit addressed) as http://hg.mozilla.org/build/puppet-manifests/rev/a04b7a57a19f
Attachment #688984 - Flags: checked-in+
(In reply to Chris AtLee [:catlee] from comment #10)
> ...which makes sense given that /proc/sys/net/ipv4/tcp_keepalive_probes is 9.

That explains the mystery of why I suggested 240s. Thanks :)
committed http://hg.mozilla.org/build/buildbot/rev/878fe8b798d6 and merged to production-0.8 branch.
So far I gracefully restarted the following masters:

buildbot-master30.srv.releng.scl3.mozilla.com,scl3,build
buildbot-master35.srv.releng.scl3.mozilla.com,scl3,try
buildbot-master12.build.scl1.mozilla.com,scl1,build
buildbot-master32.srv.releng.scl3.mozilla.com,scl3,build
buildbot-master35.srv.releng.scl3.mozilla.com,scl3,build
Updated list of restarted masters:

buildbot-master34.srv.releng.scl3.mozilla.com,scl3,build
buildbot-master16.build.scl1.mozilla.com,scl1,tests
buildbot-master21.build.scl1.mozilla.com,scl1,tests
buildbot-master06.build.scl1.mozilla.com,scl1,tests
buildbot-master25.build.scl1.mozilla.com,scl1,build
buildbot-master04.build.scl1.mozilla.com,scl1,tests
buildbot-master31.srv.releng.scl3.mozilla.com,scl3,try
buildbot-master33.srv.releng.scl3.mozilla.com,scl3,try
buildbot-master49.srv.releng.scl3.mozilla.com,scl3,try
buildbot-master30.srv.releng.scl3.mozilla.com,scl3,build
buildbot-master35.srv.releng.scl3.mozilla.com,scl3,try
buildbot-master12.build.scl1.mozilla.com,scl1,build
buildbot-master32.srv.releng.scl3.mozilla.com,scl3,build
buildbot-master35.srv.releng.scl3.mozilla.com,scl3,build

Still to be done:

buildbot-master11.build.scl1.mozilla.com,scl1,tests
buildbot-master15.build.scl1.mozilla.com,scl1,tests
buildbot-master17.build.scl1.mozilla.com,scl1,tests
buildbot-master18.build.scl1.mozilla.com,scl1,tests
buildbot-master19.build.mtv1.mozilla.com,mtv1,tests
buildbot-master20.build.mtv1.mozilla.com,mtv1,tests
buildbot-master22.build.mtv1.mozilla.com,mtv1,tests
buildbot-master10.build.mtv1.mozilla.com,mtv1,tests
buildbot-master23.build.scl1.mozilla.com,scl1,tests
buildbot-master24.build.scl1.mozilla.com,scl1,tests
buildbot-master29.build.scl1.mozilla.com,scl1,tests
buildbot-master39.build.scl1.mozilla.com,scl1,tests
buildbot-master40.build.scl1.mozilla.com,scl1,tests
buildbot-master41.build.scl1.mozilla.com,scl1,tests
buildbot-master42.build.scl1.mozilla.com,scl1,tests
buildbot-master43.build.scl1.mozilla.com,scl1,tests
buildbot-master44.build.scl1.mozilla.com,scl1,tests
buildbot-master45.build.scl1.mozilla.com,scl1,tests
buildbot-master46.build.scl1.mozilla.com,scl1,tests
buildbot-master47.build.scl1.mozilla.com,scl1,tests
buildbot-master48.build.scl1.mozilla.com,scl1,tests
buildbot-master37.srv.releng.scl3.mozilla.com,scl3,tests
buildbot-master38.srv.releng.scl3.mozilla.com,scl3,tests
buildbot-master14.build.scl1.mozilla.com,scl1,try
buildbot-master13.build.scl1.mozilla.com,scl1,build
buildbot-master49.srv.releng.scl3.mozilla.com,scl3,build
I've just checked the looong running console. Looks like all masters except disabled 39-48 have been restarted. \o/
I checked the sessions for buildbot-master30.srv.releng.scl3 and all the sessions are being refreshed:

{primary:node0}
cransom@fw1.console.releng.scl3.mozilla.net> show security flow session destination-prefix 10.26.48.17 node 1 | grep Timeout
Session ID: 20008042, Policy name: srv--buildbot/639, State: Active, Timeout: 86400, Valid
Session ID: 20025542, Policy name: srv--buildbot/628, State: Active, Timeout: 86384, Valid
Session ID: 20026039, Policy name: srv--buildbot/639, State: Active, Timeout: 86252, Valid
Session ID: 20033856, Policy name: srv--buildbot/639, State: Active, Timeout: 86396, Valid
Session ID: 20045716, Policy name: srv--buildbot/639, State: Active, Timeout: 86384, Valid
Session ID: 20048100, Policy name: srv--buildbot/639, State: Active, Timeout: 86334, Valid
Session ID: 20052266, Policy name: srv--buildbot/628, State: Active, Timeout: 86398, Valid
Session ID: 20054563, Policy name: srv--buildbot/628, State: Active, Timeout: 86398, Valid
Session ID: 20055319, Policy name: srv--buildbot/639, State: Active, Timeout: 86378, Valid
Session ID: 20056382, Policy name: srv--buildbot/639, State: Active, Timeout: 86380, Valid
Session ID: 20057310, Policy name: srv--buildbot/628, State: Active, Timeout: 86392, Valid
Session ID: 20058677, Policy name: srv--buildbot/639, State: Active, Timeout: 86400, Valid
Session ID: 20061768, Policy name: srv--buildbot/628, State: Active, Timeout: 86400, Valid
Session ID: 20077167, Policy name: srv--buildbot/639, State: Active, Timeout: 86396, Valid
Session ID: 20086502, Policy name: srv--buildbot/639, State: Active, Timeout: 86226, Valid
Session ID: 20090456, Policy name: srv--buildbot/639, State: Active, Timeout: 86180, Valid
Session ID: 20104611, Policy name: srv--buildbot/639, State: Active, Timeout: 86386, Valid
Session ID: 20124304, Policy name: srv--buildbot/628, State: Active, Timeout: 86206, Valid
Session ID: 20128454, Policy name: srv--buildbot/628, State: Active, Timeout: 86388, Valid
Session ID: 20133859, Policy name: srv--buildbot/628, State: Active, Timeout: 86334, Valid
Session ID: 20134289, Policy name: srv--buildbot/639, State: Active, Timeout: 86258, Valid
Session ID: 20143284, Policy name: srv--buildbot/628, State: Active, Timeout: 86232, Valid
Session ID: 20159358, Policy name: srv--buildbot/639, State: Active, Timeout: 86350, Valid
Session ID: 20166053, Policy name: srv--buildbot/628, State: Active, Timeout: 86234, Valid
Session ID: 20197499, Policy name: srv--buildbot/639, State: Active, Timeout: 86166, Valid
Session ID: 20205137, Policy name: srv--buildbot/639, State: Active, Timeout: 86254, Valid
Session ID: 20227633, Policy name: srv--buildbot/628, State: Active, Timeout: 86192, Valid
Session ID: 20243742, Policy name: srv--buildbot/639, State: Active, Timeout: 86368, Valid
Session ID: 20251673, Policy name: srv--buildbot/639, State: Active, Timeout: 86236, Valid
Session ID: 20291262, Policy name: srv--buildbot/628, State: Active, Timeout: 86338, Valid
Session ID: 20295651, Policy name: srv--buildbot/628, State: Active, Timeout: 86400, Valid
Session ID: 20339685, Policy name: srv--buildbot/628, State: Active, Timeout: 86168, Valid
Session ID: 20344187, Policy name: srv--buildbot/628, State: Active, Timeout: 86354, Valid
Session ID: 20364402, Policy name: srv--buildbot/639, State: Active, Timeout: 86218, Valid
Session ID: 20365422, Policy name: srv--buildbot/628, State: Active, Timeout: 86400, Valid
Session ID: 20369174, Policy name: srv--buildbot/639, State: Active, Timeout: 86220, Valid
Session ID: 20372030, Policy name: srv--buildbot/639, State: Active, Timeout: 86186, Valid
Session ID: 20374719, Policy name: srv--buildbot/628, State: Active, Timeout: 86276, Valid
Session ID: 20382909, Policy name: srv--buildbot/628, State: Active, Timeout: 86368, Valid
Session ID: 20396374, Policy name: srv--buildbot/639, State: Active, Timeout: 86288, Valid

86400 is the max timeout value, so that looks good so far.
The catalyst for this bug was dev-master01.build.scl1.mozilla.com which hasn't been updated yet, as far as I can tell. Is that on the list to be done as well?
Can anyone comment if the releng side fixes are complete so I can verify and close this?
Yes, all the releng-side changes are complete. Users with instances on dev-master01 may need to update their buildbot master code with the fix, but we don't need to block on that.
Attachment #715926 - Flags: review?(dustin)
Attachment #715926 - Attachment is obsolete: true
Attachment #715926 - Flags: review?(dustin)
Attachment #715928 - Flags: review?(dustin)
Attachment #715928 - Flags: review?(dustin) → review+
Attachment #715928 - Flags: checked-in+
Assignee: nobody → catlee
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Is it possible this might have made bug 710942 worse? On 2/22 the incidence jumped from less than 0.5/day to 4, then on 2/23 to 9, and has stayed high since. philor says it might be load-related from the tests, but would we need to be starving the Windows kernel of CPU time for 180s for that to be plausible?
Flags: needinfo?(catlee)
Bug 844648 also lines up suspiciously well with attachment 715928 [details] [diff] [review] being deployed. I'm wondering if we need some larger values for the timeouts.
If this is happening often enough, could we capture a tcpdump of it from the master side? Most systems use TCP offload these days, so the ACK would come from the NIC, or at least require very little interaction from the slave kernel.
These masters are mostly KVM guests as well. I think that means that the kernel itself is subject to being CPU starved as well, especially if other guests on the same host are also very busy. Anything particular we should be looking for in tcpdump? FIN/RST packets?
Flags: needinfo?(catlee)
All of the TCP envelopes in the tcpdump, really - so do we see keepalives, what kind of latency is there on acks, do those stop at some point, etc. Basically, if you can capture the last hour or so of a connection that eventually fails, that'd be best.
Caught some that were disconnected as part of a reconfig. They're in root@bm35:/root/tcpdump-screen.log I'm not sure these are the types of disconnects we're looking for though.
OK, here's the capture of a failure on talos-r3-xp-034 (10.12.50.142). Note that communication here is within the VLAN, so no firewalls are involved.

> 21:59:48.624287 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3637152:3638612, ack 188153, win 65335, length 1460
> 21:59:48.624438 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3638612:3640072, ack 188153, win 65335, length 1460
> 21:59:48.624631 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3640072:3641532, ack 188153, win 65335, length 1460
> a 21:59:48.660293 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3641532, win 5840, length 0
> 21:59:48.660609 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3641532:3642992, ack 188153, win 65335, length 1460
> 21:59:48.660717 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3642992:3644452, ack 188153, win 65335, length 1460
> 21:59:48.661246 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3644452:3645912, ack 188153, win 65335, length 1460
> 21:59:48.661258 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3645912:3647372, ack 188153, win 65335, length 1460
> b 21:59:48.705296 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> c 21:59:48.957470 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3647372:3647373, ack 188153, win 65335, length 1
> 21:59:48.957484 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> 21:59:49.613745 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3647372:3647373, ack 188153, win 65335, length 1
> 21:59:49.613767 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> 21:59:50.816921 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3647372:3647373, ack 188153, win 65335, length 1
> 21:59:50.816954 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> 21:59:53.223199 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3647372:3647373, ack 188153, win 65335, length 1
> 21:59:53.223232 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> 21:59:58.035811 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3647372:3647373, ack 188153, win 65335, length 1
> 21:59:58.035829 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> 22:00:07.660991 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [.], seq 3647372:3647373, ack 188153, win 65335, length 1
> 22:00:07.661025 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 0, length 0
> d 22:00:23.925080 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [.], ack 3647372, win 59860, length 0
> e 22:00:24.036399 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 59860, length 16
> 22:00:24.360296 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 59860, length 16
> 22:00:25.008314 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 59860, length 16
> 22:00:26.304386 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 59860, length 16
> 22:00:28.896300 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 59860, length 16
> f 22:01:05.185304 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 65535, length 16
> 22:01:46.657318 IP 10.12.49.16.9201 > 10.12.50.142.1052: Flags [P.], seq 188153:188169, ack 3647372, win 65535, length 16
> g 22:01:46.657995 IP 10.12.50.142.1052 > 10.12.49.16.9201: Flags [R], seq 188643936, win 0, length 0
> h 22:02:46.842839 IP 10.12.50.142.1053 > 10.12.49.16.9201: Flags [S], seq 2514350768, win 65535, options [mss 1460,nop,nop,sackOK], length 0

At (a), the slave is pushing data to the master, and the master is ack'ing it. The master's receive window is getting small, probably because the process is not reading from the socket.
At (b) that window shrinks to zero, but all transmitted data is ack'd. At (c), the slave sends a window probe. The timing is a bit odd - usually probes aren't sent until 5s after the window closes. The slave continues to send probes at intervals of 0.3, 0.6, 1.2, 2.5, 5, and 9 seconds, until the master's window opens up at (d).

At this point, the master begins sending 16 bytes of payload (e), and retransmits those six times. At (f), the master's window opens up a bit more - another read from the socket by the process. Given the timings of the earlier window probes, the next probe should have arrived sometime around 22:00:26, if the slave hadn't received any packets from the master.

Finally, at (g), the slave sends a RST, basically saying "what the heck are you talking about". The sequence number in that packet is unrelated to anything above, suggesting that the slave has no memory of the earlier connection. The slave immediately reconnects to the master at (h).

There are no keepalive packets here (those look like re-transmits of the last ack'd byte). My best guess as to what's happening here is that the slave rebooted unexpectedly.
Hmmm...it's odd that it reconnects almost exactly a minute later. I don't think the slaves can reboot that quickly. Maybe just the tcp connection died, and the buildslave retried a minute later. In any case, the log for this event on the slave is gone. How do we get some more insight into what is going on in the final moments of a job like this?
The slaves reboot pretty quickly - something around 45s. So it's not unreasonable. And if the TCP connection had died but the kernel stayed up, it would still have a record of the sequence number, so it's far more likely that the kernel also restarted.
Product: mozilla.org → Release Engineering
Component: General Automation → General
