Bug 637541 (Closed): Opened 14 years ago; Closed 14 years ago

slavealloc: set keepalive

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Details

(Whiteboard: [slavealloc])

The application-level keepalive is designed to allow buildslaves to traverse NATs. Sadly, the symptom of *not* having this installed is that the slave patiently waits to hear from the master, while the master thinks the slave is disconnected. Since this only happens after enough idleness for the NAT to time out, it tends to make slaves that have been somewhat idle become idle *forever*. We should fix this by adding the keepalive value to the buildslaves' buildbot.tac via the slave allocator.
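For reference, a minimal sketch of what a slave-side buildbot.tac could look like with the keepalive set. The hostname, port, slave name, and password are placeholders, and the exact import path depends on the buildslave version; this is only meant to show where the value would land, not the literal file slavealloc would generate.

from twisted.application import service
from buildslave.bot import BuildSlave  # older slaves: from buildbot.slave.bot import BuildSlave

basedir = '/builds/slave'
buildmaster_host = 'buildbot-master.example.com'  # placeholder
port = 9010                                       # master PB port (9000s range)
slavename = 'linux-ix-slave00'                    # placeholder
passwd = 'secret'                                 # placeholder
keepalive = 900       # 15 minutes: half the 30-minute NAT session timeout
usepty = False
umask = None
maxdelay = 300

application = service.Application('buildslave')
s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
               keepalive, usepty, umask=umask, maxdelay=maxdelay)
s.setServiceParent(application)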
ravi: I believe we're only seeing this on connections between masters and slaves in different colos (which mostly happens in staging, but occasionally in production). In that case, do you know what the NAT timeout is? That is, how long will an idle TCP connection stay in the firewall's NAT tables before it expires? We'll set our keepalive to 50% of whatever value you specify.
From IRC:

19:49 < ravi> it depends on the protocol
19:49 < ravi> ssh is typically 12 hours
19:49 < ravi> everything else is 30m
19:49 < ravi> unless there was an exception created
19:49 < dustin> ok, cool - I'll comment that on the bug

(This falls under "everything else", I assume - ports 9010, 9012, etc.) So the keepalive should be 15*60 = 900 seconds (15m).
Point of clarification: the FW only has a session timer, not an idle timeout. TCP sessions default to a 30m timeout regardless of idleness unless a longer one was explicitly requested. SSH, for example, is typically set to 12 hours in our environment.
Does that apply to sessions between datacenters? Does it send FIN or RST to both sides, or just expire the NAT?
The NAT just expires.
And that's just between datacenters? Or within as well? (I assume "just between", but would like confirmation)
To avoid any unnecessary confusion assume it is for all traffic flows in all DCs.
Well, that means that Buildbot won't work at all, so that's not a good assumption. Buildbot needs a persistent connection.
Connections are persistent with a max age. Can you explain the data flows (source and destination), the ports, and how the communication works (duration, direction of flows, etc.)? This data will help me frame this in better context.
Sure. The slave connects via TCP to the master on a port in the 9000s (different ports for different masters). There is a brief negotiation, and then things are silent until the master has work for the slave, which may be hours or days later. For most builds, the slave reboots when the work is done, but this is not the case for all builds. So we may have an open TCP connection that lasts for days and is completely idle for hours at a time.

This bug, in particular, would have the slave initiate a bidirectional interaction (an application-level keepalive) every 15 minutes; at the moment, nothing is done to prevent long idle connections.

For the most part, we try to locate masters and slaves in the same datacenter, but this is not the case for staging slaves, nor, at the moment, for the production slaves that are slated to move from mtv to scl "any moment now" (which has been the case for the last several months).
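To make "application-level keepalive" concrete, here is a rough sketch of the pattern using Twisted (which Buildbot is built on). The 'keepalive' remote method name and the way the remote reference is obtained are illustrative assumptions, not a claim about Buildbot's actual internals.

from twisted.internet import task
from twisted.python import log

KEEPALIVE_INTERVAL = 15 * 60  # seconds: half the 30-minute NAT session timer

def start_keepalive(remote):
    # 'remote' is assumed to be the Perspective Broker reference the slave
    # holds for the master once it has connected.
    def ping():
        # A do-nothing round trip; its only purpose is to put traffic on the
        # wire in both directions so the firewall refreshes its NAT entry.
        d = remote.callRemote('keepalive')
        d.addErrback(log.err)
        return d
    loop = task.LoopingCall(ping)
    loop.start(KEEPALIVE_INTERVAL, now=False)
    return loop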
See bugs 476677, 592490 for why we disabled keepalive. Masters or networks that are too busy can cause slaves to disconnect and cause burning. Has the slave-side logic changed at all here? I seem to recall that the keepalive interval could be configured, but not the amount of time it would wait to receive the keepalive ack.
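For context on that last point, a sketch (again an assumption, not Buildbot's actual code) of a keepalive that also enforces a fixed deadline on the ack. If the interval is configurable but the deadline is not, then a master or network that is merely slow, not gone, will trip the deadline and drop the connection, which is the failure mode bugs 476677 and 592490 describe.

from twisted.internet import reactor, task
from twisted.python import log

KEEPALIVE_INTERVAL = 15 * 60  # configurable in buildbot.tac
KEEPALIVE_TIMEOUT = 30        # assumed fixed: how long to wait for the ack

def start_keepalive(remote, lose_connection):
    # 'remote' and 'lose_connection' are placeholders for the slave's remote
    # reference to the master and for whatever tears the connection down.
    def ping():
        d = remote.callRemote('keepalive')
        deadline = reactor.callLater(KEEPALIVE_TIMEOUT, lose_connection)
        def done(result):
            # Ack (or error) arrived in time; cancel the pending disconnect.
            if deadline.active():
                deadline.cancel()
            return result
        d.addBoth(done)
        d.addErrback(log.err)
        return d
    loop = task.LoopingCall(ping)
    loop.start(KEEPALIVE_INTERVAL, now=False)
    return loop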
We don't seem to be experiencing pain due to the lack of a keepalive, so I don't think we should mess with this.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering