Bug 637541 - slavealloc: set keepalive
Component: Release Engineering :: General (defect)
Status: RESOLVED FIXED
Opened 14 years ago • Closed 14 years ago
Reporter: dustin • Assignee: dustin
Whiteboard: [slavealloc]
The application-level keepalive is designed to allow buildslaves to traverse NATs. Sadly, the symptom of *not* having it enabled is that the slave patiently waits to hear from the master, while the master thinks the slave is disconnected. Since this only happens after the connection has been idle long enough for the NAT entry to time out, it tends to make slaves that have been somewhat idle become idle *forever*.
We should fix this by adding the keepalive value to the buildslaves' buildbot.tac, via the slave allocator.
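For reference, a minimal sketch of what the templated buildbot.tac could look like with the keepalive set (assuming the standard buildbot-slave tac layout; the host, port, slave name, password, and basedir below are placeholders):

from twisted.application import service
from buildslave.bot import BuildSlave

basedir = '/builds/slave'            # placeholder
buildmaster_host = 'bm.example.com'  # placeholder
port = 9010                          # one of the 9000-range master ports
slavename = 'slave001'               # placeholder
passwd = 'secret'                    # placeholder
keepalive = 600   # seconds; None disables the keepalive entirely.
                  # The right interval is worked out in the comments below.
usepty = 0
umask = None

application = service.Application('buildslave')

# The slave allocator would template this file per slave, filling in
# keepalive along with the other per-slave values.
s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
               keepalive, usepty, umask=umask)
s.setServiceParent(application)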
Comment 1•14 years ago (Assignee)
ravi: I believe we're only seeing this on connections between masters and slaves in different colos (which mostly happens in staging, but occasionally in production). In that case, do you know what the NAT timeout is? That is, how long will an idle TCP connection stay in the firewall's NAT tables before it expires? We'll set our keepalive to 50% of whatever value you specify.
Comment 2•14 years ago (Assignee)
From IRC:
19:49 < ravi> it depends on the protocol
19:49 < ravi> ssh is typically 12 hours
19:49 < ravi> everything else is 30m
19:49 < ravi> unless there was an exception created
19:49 < dustin> ok, cool - I'll comment that on the bug (this falls under "everything else", I assume - ports 9010, 9012, etc.)
So keepalives should be 50% of 30m = 15m (i.e. keepalive = 15 * 60 = 900 seconds).
Comment 3•14 years ago
Point of clarification: the FW only has a session timer, not an idle timeout. TCP sessions default to a 30m timeout regardless of idleness, unless a longer one was explicitly requested. SSH, for example, is typically set to 12 hours in our environment.
Comment 4•14 years ago (Assignee)
Does that apply to sessions between datacenters?
Does it send FIN or RST to both sides, or just expire the NAT?
Comment 5•14 years ago
The NAT just expires.
Comment 6•14 years ago (Assignee)
And that's just between datacenters? Or within as well? (I assume "just between", but would like confirmation)
Comment 7•14 years ago
To avoid any unnecessary confusion, assume it applies to all traffic flows in all DCs.
Comment 8•14 years ago (Assignee)
Well, that means that Buildbot won't work at all, so that's not a good assumption. Buildbot needs a persistent connection.
Comment 9•14 years ago
Connections are persistent with a max age. Can you explain the data flows (source, destination, and ports) and how the communication works (duration, direction of flows, etc.)? That data will help me frame this in better context.
Comment 10•14 years ago (Assignee)
Sure.
The slave connects via TCP to the master on a port in the 9000's (different ports for different masters). There is a brief negotiation, and then things are silent until the master has work for the slave, which may be hours or days. For most builds, when the work is done the slave reboots, but this is not the case for all builds. So we may have an open TCP connection that lasts for days, which is completely idle for hours at a time.
The fix proposed in this bug would have the slave initiate a bidirectional interaction (the application-level keepalive) every 15 minutes; at the moment, nothing is done to prevent long idle connections.
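As a rough illustration of what that application-level keepalive amounts to (a hypothetical sketch, not Buildbot's actual code; the real logic lives in the slave's PB connection factory, and the remote method name is an assumption):

from twisted.internet import task

KEEPALIVE_INTERVAL = 15 * 60  # seconds; half the 30m firewall session timeout

def start_keepalives(perspective):
    # Periodically make a cheap remote call to the master. The traffic it
    # generates keeps the connection's firewall entry fresh, and a failed
    # call tells the slave the connection is dead, so it can reconnect
    # instead of waiting forever.
    loop = task.LoopingCall(perspective.callRemote, "keepalive")
    loop.start(KEEPALIVE_INTERVAL, now=False)
    return loop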
For the most part, we try to locate masters and slaves in the same datacenter, but this is not the case for staging slaves, and at the moment is not the case for production slaves which are slated to move from mtv to scl "any moment now" (which has been the case for the last several months).
Comment 11•14 years ago
See bugs 476677 and 592490 for why we disabled keepalive. Masters or networks that are too busy can cause slaves to disconnect mid-build, burning the builds in progress.
Has the slave-side logic changed at all here? I seem to recall that the keepalive interval could be configured, but not the amount of time the slave would wait to receive the keepalive ack.
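To illustrate that distinction (a hypothetical sketch against the modern Twisted API; the remote method name and the 30-second figure are assumptions, not Buildbot's actual values):

from twisted.internet import defer, reactor

KEEPALIVE_INTERVAL = 15 * 60  # the part that could be configured
ACK_TIMEOUT = 30              # the part recalled above as not configurable

def keepalive_once(perspective):
    d = perspective.callRemote("keepalive")
    d.addTimeout(ACK_TIMEOUT, reactor)

    def on_timeout(f):
        f.trap(defer.TimeoutError)
        # A missed ack is treated as a dead connection and dropped, even
        # mid-build; that is the burning failure mode from bugs 476677
        # and 592490 when a master was merely busy, not dead.
        perspective.broker.transport.loseConnection()

    d.addErrback(on_timeout)
    return d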
Comment 12•14 years ago (Assignee)
We don't seem to be experiencing pain due to the lack of a keepalive, so I don't think we should mess with this.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•12 years ago
Product: mozilla.org → Release Engineering