Bug 637541 (Closed): Opened 14 years ago; Closed 14 years ago

slavealloc: set keepalive

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Details

(Whiteboard: [slavealloc])

The application-level keepalive is designed to allow buildslaves to traverse NATs. Sadly, the symptom of *not* having this installed is that the slave patiently waits to hear from the master, while the master thinks the slave is disconnected. Since this only happens after enough idleness for the NAT to time out, it tends to make slaves that have been somewhat idle become idle *forever*. We should fix this by adding the keepalive value to the buildslaves' buildbot.tac via the slave allocator.
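For reference, a minimal sketch of what a slave-side buildbot.tac could look like with the keepalive set. The hostname, port, slave name, and password are placeholders, and the exact import path depends on the buildslave version; this is only meant to show where the value would land, not the literal file slavealloc would generate.

from twisted.application import service
from buildslave.bot import BuildSlave  # older slaves: from buildbot.slave.bot import BuildSlave

basedir = '/builds/slave'
buildmaster_host = 'buildbot-master.example.com'  # placeholder
port = 9010                                       # master PB port (9000s range)
slavename = 'linux-ix-slave00'                    # placeholder
passwd = 'secret'                                 # placeholder
keepalive = 900       # 15 minutes: half the 30-minute NAT session timeout
usepty = False
umask = None
maxdelay = 300

application = service.Application('buildslave')
s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
               keepalive, usepty, umask=umask, maxdelay=maxdelay)
s.setServiceParent(application)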
ravi: I believe we're only seeing this on connections between masters and slaves in different colos (which mostly happens in staging, but occasionally in production). In that case, do you know what the NAT timeout is? That is, how long will an idle TCP connection stay in the firewall's NAT tables before it expires? We'll set our keepalive to 50% of whatever value you specify.
From IRC:

19:49 < ravi> it depends on the protocol
19:49 < ravi> ssh is typically 12 hours
19:49 < ravi> everything else is 30m
19:49 < ravi> unless there was an exception created
19:49 < dustin> ok, cool - I'll comment that on the bug

(This falls under "everything else", I assume - ports 9010, 9012, etc.) So the keepalive should be 15*60 = 900 seconds (15m).
Point of clarification: the FW only has a session timer, not an idle timeout. TCP sessions default to a 30m timeout regardless of idleness unless a longer one was explicitly requested. SSH, for example, is typically set to 12 hours in our environment.
Does that apply to sessions between datacenters? Does it send FIN or RST to both sides, or just expire the NAT?
The NAT just expires.
And that's just between datacenters? Or within as well? (I assume "just between", but would like confirmation)
To avoid any unnecessary confusion assume it is for all traffic flows in all DCs.
Well, that means that Buildbot won't work at all, so that's not a good assumption. Buildbot needs a persistent connection.
Connections are persistent with a max age. Can you explain the data flows (source and destination), the ports, and how the communication works (duration, direction of flows, etc.)? This data will help me frame this in better context.
Sure. The slave connects via TCP to the master on a port in the 9000s (different ports for different masters). There is a brief negotiation, and then things are silent until the master has work for the slave, which may be hours or days later. For most builds, the slave reboots when the work is done, but this is not the case for all builds. So we may have an open TCP connection that lasts for days and is completely idle for hours at a time.

This bug, in particular, would have the slave initiate a bidirectional interaction (an application-level keepalive) every 15 minutes; at the moment, nothing is done to prevent long idle connections.

For the most part, we try to locate masters and slaves in the same datacenter, but this is not the case for staging slaves, nor, at the moment, for the production slaves that are slated to move from mtv to scl "any moment now" (which has been the case for the last several months).
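To make "application-level keepalive" concrete, here is a rough sketch of the pattern using Twisted (which Buildbot is built on). The 'keepalive' remote method name and the way the remote reference is obtained are illustrative assumptions, not a claim about Buildbot's actual internals.

from twisted.internet import task
from twisted.python import log

KEEPALIVE_INTERVAL = 15 * 60  # seconds: half the 30-minute NAT session timer

def start_keepalive(remote):
    # 'remote' is assumed to be the Perspective Broker reference the slave
    # holds for the master once it has connected.
    def ping():
        # A do-nothing round trip; its only purpose is to put traffic on the
        # wire in both directions so the firewall refreshes its NAT entry.
        d = remote.callRemote('keepalive')
        d.addErrback(log.err)
        return d
    loop = task.LoopingCall(ping)
    loop.start(KEEPALIVE_INTERVAL, now=False)
    return loop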
See bugs 476677, 592490 for why we disabled keepalive. Masters or networks that are too busy can cause slaves to disconnect and cause burning. Has the slave-side logic changed at all here? I seem to recall that the keepalive interval could be configured, but not the amount of time it would wait to receive the keepalive ack.
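For context on that last point, a sketch (again an assumption, not Buildbot's actual code) of a keepalive that also enforces a fixed deadline on the ack. If the interval is configurable but the deadline is not, then a master or network that is merely slow, not gone, will trip the deadline and drop the connection, which is the failure mode bugs 476677 and 592490 describe.

from twisted.internet import reactor, task
from twisted.python import log

KEEPALIVE_INTERVAL = 15 * 60  # configurable in buildbot.tac
KEEPALIVE_TIMEOUT = 30        # assumed fixed: how long to wait for the ack

def start_keepalive(remote, lose_connection):
    # 'remote' and 'lose_connection' are placeholders for the slave's remote
    # reference to the master and for whatever tears the connection down.
    def ping():
        d = remote.callRemote('keepalive')
        deadline = reactor.callLater(KEEPALIVE_TIMEOUT, lose_connection)
        def done(result):
            # Ack (or error) arrived in time; cancel the pending disconnect.
            if deadline.active():
                deadline.cancel()
            return result
        d.addBoth(done)
        d.addErrback(log.err)
        return d
    loop = task.LoopingCall(ping)
    loop.start(KEEPALIVE_INTERVAL, now=False)
    return loop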
We don't seem to be experiencing pain due to the lack of a keepalive, so I don't think we should mess with this.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering