444328 - TCP-level keep alive timer

Reporter

Description

•

17 years ago

While reading other bugs, I saw this bug: https://bugzilla.mozilla.org/show_bug.cgi?id=31884

It got me thinking about half-open connections, and whether or not we should be concerned about them. The classic answer is "no". TCP/IP Illustrated, vol 1 says: "The keepalive feature is intended to detect these half-open connections from the server side." (Chapter 21). Clients are assumed to detect their half-open connections because they would send data down a connection and trip the "retransmit" timer. However, this assumption was made in the days when a client was statically connected to the network, and the abnormal state of the server was easily established.

These days, we have a lot of interfaces that are coming up and down. We have offline-online features, and VPNs and NATs as well. I'm not sure that we are cleaning up our open connections in all cases. HTTP has keep alive, FTP has a timeout, but we have a lot of other protocols as well.

We should consider using the OS-specific TCP keepalive, if it is available, because this would give us a timeout of last resort, which would drop connections to servers that are not answering after 2 hours (this should be true for most UNIX implementations; I tried to research Windows, but could not find any useful info).
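For reference, here is a minimal sketch of what "using the OS-specific TCP keepalive" amounts to at the socket level. It is Python rather than Necko code, and the helper name `enable_keepalive` is made up for illustration. It only flips `SO_KEEPALIVE` and leaves the probe timing at whatever the operating system defaults to.

```python
import socket


def enable_keepalive(sock: socket.socket) -> bool:
    """Turn on OS-level TCP keepalive for an existing socket.

    Hypothetical helper for illustration only. Probe timing is left at
    the operating system default.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Read the option back so the caller can confirm the OS accepted it.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("keepalive enabled:", enable_keepalive(s))
```

With only this flag set, a dead peer is noticed on the OS default schedule, which is the "timeout of last resort" behaviour described above.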

Rob Mueller

Comment 1

•

14 years ago

Based on some recent research I did, I'd really like to revive and prioritise this bug.

http://blog.fastmail.fm/2011/06/28/http-keep-alive-connection-timeouts/

Basically it seems Chrome is setting the SO_KEEPALIVE flag on connections and specifically lowering the ping time to 45 seconds. This allows them to reuse keepalive connections as long as the server keeps them open, unlike Firefox which by default seems to limit to 115 seconds. I imagine this is due to the NAT/Firewall timeout problems Chrome talks about here:

http://code.google.com/p/chromium/issues/detail?id=27400

On top of this, there's another reason this is going to be important. With EventSource connections, you basically are expected to connect to the server and just wait forever for events. Now if you do that on a laptop, and suspend the laptop, it can cause problems.

1. Browser connects to event source server, waits for events
2. User suspends the machine
3. Server sends event. Finds client isn't responding, TCP stack retries and eventually tells the server the connection is closed
4. User wakes up machine
5. Client still thinks it's connected, but never receives anything from the server, and because it doesn't think the connection is dead, doesn't try to reconnect automatically either

I haven't fully confirmed this, but anecdotally this appears to be happening on a test setup we have. This destroys the use of EventSource which is supposed to auto-reconnect, but doesn't in this case because it doesn't detect the connection is dead. Enabling SO_KEEPALIVE on the connection and setting low ping times would fix this as well.
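To make "setting low ping times" concrete, here is a rough Python sketch of lowering the per-socket keepalive timing. The helper name `set_keepalive_timing` is hypothetical, and using 45 seconds for both the idle time and the probe interval is my assumption; the thread only says the ping time is lowered to 45 seconds. The option names differ per platform, so each one is applied only if the `socket` module exposes it.

```python
import socket


def set_keepalive_timing(sock: socket.socket, idle_s: int, interval_s: int) -> dict:
    """Enable keepalive and lower the probe timing where the platform allows.

    Hypothetical helper. Returns the options that were actually applied,
    since the available constants differ between operating systems.
    """
    applied = {}
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied["SO_KEEPALIVE"] = 1

    # Idle time before the first probe: TCP_KEEPIDLE on Linux; Python
    # exposes TCP_KEEPALIVE for the same purpose on macOS.
    for name in ("TCP_KEEPIDLE", "TCP_KEEPALIVE"):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, idle_s)
            applied[name] = idle_s
            break

    # Interval between probes once the peer has stopped answering.
    opt = getattr(socket, "TCP_KEEPINTVL", None)
    if opt is not None:
        sock.setsockopt(socket.IPPROTO_TCP, opt, interval_s)
        applied["TCP_KEEPINTVL"] = interval_s
    return applied


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print(set_keepalive_timing(s, idle_s=45, interval_s=45))
```

Returning the applied options makes it easy to log which knobs a given platform actually honoured.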

Jason Duell

Comment 2

•

14 years ago

Will this also be an issue for Web sockets?

Patrick McManus [:mcmanus]

Comment 3

•

14 years ago

I'm skeptical that anything useful ever comes out of SO_KEEPALIVE. Websockets has its own application level ping/pong mechanism to address this problem.
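For anyone unfamiliar with the WebSocket mechanism referred to here, this is a small Python sketch of how a client-side ping control frame is laid out on the wire according to RFC 6455. The function name `build_client_control_frame` is made up for illustration, and this is not how any browser implements it.

```python
import os
import struct

OPCODE_PING = 0x9
OPCODE_PONG = 0xA


def build_client_control_frame(opcode: int, payload: bytes = b"") -> bytes:
    """Build a masked WebSocket control frame as a client would send it.

    Control frames carry at most 125 bytes of payload and must not be
    fragmented (RFC 6455, section 5.5).
    """
    if len(payload) > 125:
        raise ValueError("control frame payload must be 125 bytes or fewer")
    mask_key = os.urandom(4)
    # Client-to-server payloads are XOR-masked with a random 4-byte key.
    masked = bytes(b ^ mask_key[i % 4] for i, b in enumerate(payload))
    first = 0x80 | opcode         # FIN bit set: control frames are unfragmented
    second = 0x80 | len(payload)  # MASK bit set, 7-bit payload length
    return struct.pack("!BB", first, second) + mask_key + masked


if __name__ == "__main__":
    print(build_client_control_frame(OPCODE_PING, b"hi").hex())
```

The peer is expected to answer a ping with a pong carrying the same payload, so a missing pong within some application-chosen timeout is what signals a dead connection.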

Rob Mueller

Comment 4

•

14 years ago

SO_KEEPALIVE definitely has its place. I think it's why Firefox limits HTTP keepalive connections to 115 seconds by default, but Chrome has no limit. Again, read these two links for details.

http://blog.fastmail.fm/2011/06/28/http-keep-alive-connection-timeouts/
http://code.google.com/p/chromium/issues/detail?id=27400

I just tested my EventSource issue again, and I think it's unrelated to TCP keepalives. It appears that when you suspend a laptop, the connection is shut, but Firefox just reconnects when you wake up again. I'll open a separate bug about that.

Honza Bambas (:mayhemer)

Comment 5

•

14 years ago

Few cents.. As I understand, goals would be:
- to keep connections longer time and save a RTT due to need of connection establishment on next request for resources from the same server/host
- to quickly discover a connection is dead (silently dropped by the server or router on the way) and prevent connection hang times
- to recover (detect connection close) after wake from sleep

I personally don't think TCP keep alive is the right way to do it. TCP keep alives may cause false negatives when dropped. Keeping the connection with the server for a long time might waste resources (also on the client side), significantly for secure connections.

Unless it is an event channel, then keeping connection for a longer time is IMO bad. But keeping only one or two connections for longer time might mitigate that..

When we reuse a connection to make an HTTP request, and it takes a really long time to detect the connection is dead, we can restart idempotent transaction more quickly using our own timer. Non-idempotent methods may always go through a new connection when all idle connections are too "old" and there is a risk of a hang due to a silent drop.
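The reuse policy in the last paragraph could be sketched roughly like this in Python. Everything here is hypothetical: the names `IdleConnection` and `pick_connection`, and the `max_safe_idle` threshold, are illustrations of the idea rather than anything in the connection manager.

```python
from typing import List, NamedTuple, Optional

# Methods that are safe to restart on a fresh connection.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}


class IdleConnection(NamedTuple):
    conn_id: int
    idle_seconds: float  # time since the connection last carried traffic


def pick_connection(method: str, idle: List[IdleConnection],
                    max_safe_idle: float) -> Optional[IdleConnection]:
    """Return an idle connection to reuse, or None to open a new one."""
    if not idle:
        return None
    freshest = min(idle, key=lambda c: c.idle_seconds)
    if freshest.idle_seconds <= max_safe_idle:
        return freshest
    # Every idle connection is "old". An idempotent request can still risk
    # reuse, because an application timer can restart it elsewhere.
    if method.upper() in IDEMPOTENT_METHODS:
        return freshest
    # Non-idempotent requests avoid the risk of hanging on a silent drop.
    return None


if __name__ == "__main__":
    pool = [IdleConnection(1, 300.0), IdleConnection(2, 200.0)]
    print(pick_connection("GET", pool, max_safe_idle=115.0))
    print(pick_connection("POST", pool, max_safe_idle=115.0))
```

The restart timer itself is not shown; the sketch only covers the decision of whether an old idle connection is acceptable for a given method.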

Rob Mueller

Comment 6

•

14 years ago

> - to keep connections longer time and save a RTT due to need of connection
> establishment on next request for resources from the same server/host
> - to quickly discover a connection is dead (silently dropped by the server
> or router on the way) and prevent connection hang times
> - to recover (detect connection close) after wake from sleep

Right.

> I personally don't think TCP keep alive is the right way to do it. TCP keep
> alives may cause false negatives when dropped. Keeping the connection with
> the server for a long time might waste resources (also on the client side),
> significantly for secure connections.

IMHO the only real issue is server resources, I can't see that keeping one or a few connections open from the client could use that many resources. And servers can control how long to keep a connection open, so I think as long as the server is willing to keep the connection open, the client should support doing that.

> Unless it is an event channel, then

Obviously it's a serious problem for event channels. You end up with event channels that appear alive to the client, but are actually dead. The whole point of the eventsource spec is that it auto-reconnects if the connection is lost, but it can only do that if the OS actually tells it the connection is lost, which can only happen if you enable SO_KEEPALIVE.

> When we reuse a connection to make an HTTP request, and it takes a really
> long time to detect the connection is dead, we can restart idempotent
> transaction more quickly using our own timer. Non-idempotent methods may
> always go through a new connection when all idle connections are too "old"
> and there is a risk of a hang due to a silent drop.

Obviously you'll always need the application level timer. But with SO_KEEPALIVE, you'll actually need to use it less often because dead connections will be detected at the OS level "in the background" and that information sent up to the application level, rather than having to wait for the application level timeout when the next actual request occurs.

Patrick McManus [:mcmanus]

Comment 7

•

14 years ago

(In reply to comment #6)
> Obviously you'll always need the application level timer. But with
> SO_KEEPALIVE, you'll actually need to use it less often because dead
> connections will be detected at the OS level "in the background"

s/in the background/while sucking your battery

Rob Mueller

Comment 8

•

14 years ago

For regular http(s) connections, yes, there is a tradeoff there. On the other hand, I don't think it's a really big one. Most servers have low keepalive times, so in 99% of cases it'll be the server disconnecting. For those servers that are happy with very long keepalive connections, a packet each way every 45 seconds doesn't seem a huge deal. Chrome has decided it's ok, and a better tradeoff than the client having to drop connections after < 2 mins due to **** NATs/firewalls.

For eventsource connections though, it's clearly a problem. Maybe there should be a separate bug for that specific case?

Honza Bambas (:mayhemer)

Comment 9

•

14 years ago

(In reply to comment #6)
> the OS actually tells it the connection
> is lost, which can only happen if you enable SO_KEEPALIVE.
>

Do you have a practical test on all major platforms? My experience is to let an application detect stand by/hibernation itself. I believe some OSes support APIs to hook to detect stand by or at least wake up. We should use it to "restart" whatever is needed to be restarted after waking up. Skype might be an example.

> Obviously you'll always need the application level timer. But with
> SO_KEEPALIVE, you'll actually need to use it less often because dead
> connections will be detected at the OS level "in the background"

"In the background" is bad. Goal was to express I am against using SO_KEEPALIVE for idle http connections. It wastes resources, in general IMO.

If google says it's good to do it, doesn't mean it really is good.

Rob Mueller

Comment 10

•

14 years ago

> "In the background" is bad. Goal was to express I am against using
> SO_KEEPALIVE for idle http connections. It wastes resources, in general IMO.

If you're asking me for proof, I'd ask you for proof of this statement in turn.

> If google says it's good to do it, doesn't mean it really is good.

To be fair, I never exactly said "google says it's good to do it", I just pointed to what the problem is, what Chrome are doing to solve it, and suggested that doing the same might be a good idea.

The big problem I think is that I've mixed up too many things in the one ticket. So, reasons for SO_KEEPALIVE:

1. Sleep/wake - I was wrong on this, ignore this as a reason, the application generally knows when a sleep/wake cycle happens because the network goes down/comes back up
2. Keepalive connections - currently appear to be limited to 115 seconds on the client because of problematic NATs/firewalls with short memories. SO_KEEPALIVE would allow extending this, but benefit might be marginal, so not *important*, but might be useful
3. Eventsource - now this is a real problem. By definition eventsource connections are supposed to be long lived, and so problematic NATs/firewalls with short memories will definitely be a problem

But then maybe it's the server that should be sending regular application or TCP pings on eventsource connections, and not the client's job at all.

Seems I'm convincing myself out of all my own arguments today :)

Jason Duell

Comment 11

•

14 years ago

Note that http://code.google.com/p/chromium/issues/detail?id=27400#c15 seems to imply that Firefox was getting hit by the timeout issue there. Not sure if they tweaked our internal keepalive timer to be longer when they tested.

For HTTP connection, we have no way to do a ping, right? (barring sending some sort of bogus request and throwing away the answer).

Re: power issue. If we found a compelling reason to turn on SO_KEEPALIVE, we could presumably detect if the browser has gone idle for some period and turn the sockopt off/on if we want to save power?

CPuckett.Dynetics

Comment 12

•

12 years ago

I'm seeing an issue where the throbber spins for days trying to load a tab for some sites. I suspect a NAT/firewall has forgotten about the TCP connection, but Firefox is still waiting for data. This behavior isn't recent either. Firefox has been doing this for as long as I can remember, back to around the 3.x versions. The offending socket still shows as established in TCPView.

The problem is that the phantom connections keep hanging around until they are either manually closed by the user canceling the page load, or until Firefox hits the maximum number of connections it can have for that server. At that point, the server will be unavailable until the user manually cancels some page loads.

I believe Jason is correct: "For HTTP connection, we have no way to do a ping, right? (barring sending some sort of bogus request and throwing away the answer)." The alternative would be to set the SO_KEEPALIVE option on the socket and let the OS do that for you in a way that won't generate some sort of HTTP request for the server to have to process.

Jason Duell

Comment 13

•

12 years ago

As discussed with Patrick and others, the plan here now is to set TCP keepalive with a small ping count and short timeout, so that we'll quickly detect stalled connections if we hit a lame-network situation (where the connection is forgotten by a wifi gateway, etc.). Then after a minute or so we'll drop it down to a much less frequent ping, so we don't waste power/bandwidth for long lived idle websocket/Comet/EventSource connections.

Looks like we may not be able to change keepalive ping count on Windows (always 10): http://msdn.microsoft.com/en-us/library/windows/desktop/ee470551(v=vs.85).aspx

Oddly enough it looks like they also allow you to set the ping timeout, which other OSes (sensibly) leave up to the TCP stack's estimation of RTT.
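A rough Python sketch of the two-phase idea, to show how the pieces might fit together. All names (`KeepaliveConfig`, `config_for_age`, `apply_config`) and all numeric values are invented for illustration and are not taken from any patch on this bug. The worst-case figure is an approximation: idle time plus one probe interval per unanswered probe.

```python
import socket
from typing import NamedTuple


class KeepaliveConfig(NamedTuple):
    idle_s: int      # quiet time before the first probe
    interval_s: int  # gap between unanswered probes
    probes: int      # unanswered probes before the OS drops the connection


# Illustrative values only; the real numbers would need tuning.
SHORT_LIVED = KeepaliveConfig(idle_s=10, interval_s=1, probes=4)
LONG_LIVED = KeepaliveConfig(idle_s=600, interval_s=1, probes=4)
SHORT_LIVED_PHASE_S = 60  # "after a minute or so"


def config_for_age(connection_age_s: float) -> KeepaliveConfig:
    """Aggressive probing for young connections, relaxed probing afterwards."""
    return SHORT_LIVED if connection_age_s < SHORT_LIVED_PHASE_S else LONG_LIVED


def worst_case_detection_s(cfg: KeepaliveConfig) -> int:
    """Approximate time for the OS to declare an idle dead peer gone."""
    return cfg.idle_s + cfg.interval_s * cfg.probes


def apply_config(sock: socket.socket, cfg: KeepaliveConfig) -> None:
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "SIO_KEEPALIVE_VALS"):
        # Windows: times are in milliseconds; the probe count is not set here.
        sock.ioctl(socket.SIO_KEEPALIVE_VALS,
                   (1, cfg.idle_s * 1000, cfg.interval_s * 1000))
        return
    # Elsewhere, apply whichever per-socket options the platform exposes.
    for name, value in (("TCP_KEEPIDLE", cfg.idle_s),
                        ("TCP_KEEPINTVL", cfg.interval_s),
                        ("TCP_KEEPCNT", cfg.probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        for age in (5, 120):
            cfg = config_for_age(age)
            apply_config(s, cfg)
            print(age, cfg, worst_case_detection_s(cfg))
```

In practice something would have to call `apply_config` again once a connection crosses the phase boundary, for example from an existing timer; that scheduling is left out of the sketch.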

Assignee: nobody → hurley

u408661

Comment 14

•

12 years ago

Attached patch hurley's start of a keepalive patch (obsolete)

As discussed with Steve, he's going to take this over as part of his Q4 work. Here's the patch I have so far, Steve, feel free to take it or leave it as you see fit.

u408661

Updated

•

12 years ago

Assignee: hurley → sworkman

hurley's start of a keepalive patch, 12 years ago, u408661, 15.31 KB, patch
v1.0 Add PRFileDescAutoLock and LockedPRFileDesc to automate and enforce calls to Get|ReleaseFD_Locked, 11 years ago, Steve Workman [:sworkman] (INACTIVE), 16.02 KB, patch
v1.0 Add support for TCP keepalive in the Socket Transport Service, 11 years ago, Steve Workman [:sworkman] (INACTIVE), 27.12 KB, patch
v1.0 Enable TCP Keepalive for short and long-lived HTTP Connections, 11 years ago, Steve Workman [:sworkman] (INACTIVE), 16.08 KB, patch
v1.1 Add PRFileDescAutoLock and LockedPRFileDesc to automate and enforce calls to Get|ReleaseFD_Locked, 11 years ago, Steve Workman [:sworkman] (INACTIVE), 14.51 KB, patch
v1.1 Add support for TCP keepalive in the Socket Transport Service, 11 years ago, Steve Workman [:sworkman] (INACTIVE), 28.55 KB, patch
v1.2 Add PRFileDescAutoLock and LockedPRFileDesc to automate and enforce calls to Get|ReleaseFD_Locked, 11 years ago, Steve Workman [:sworkman] (INACTIVE), 14.86 KB, patch, mcmanus: review+
v2.0 Add support for TCP keepalive in the Socket Transport Service, 11 years ago, Steve Workman [:sworkman] (INACTIVE), 29.59 KB, patch
v2.0 Enable TCP Keepalive for short and long-lived HTTP Connections, 11 years ago, Steve Workman [:sworkman] (INACTIVE), 28.17 KB, patch
v2.1 Add support for TCP keepalive in the Socket Transport Service, 11 years ago, Steve Workman [:sworkman] (INACTIVE), 31.52 KB, patch
v2.1 Enable TCP Keepalive for short and long-lived HTTP Connections, 11 years ago, Steve Workman [:sworkman] (INACTIVE), 28.15 KB, patch
v2.2 Add support for TCP keepalive in the Socket Transport Service, 11 years ago, Steve Workman [:sworkman] (INACTIVE), 34.92 KB, patch, mcmanus: review+
v2.2 Enable TCP Keepalive for short and long-lived HTTP Connections, 11 years ago, Steve Workman [:sworkman] (INACTIVE), 23.89 KB, patch, mcmanus: review+
v1.0 Suppress spurious warnings in PRFileDescAutoLock constructor, 11 years ago, Steve Workman [:sworkman] (INACTIVE), 995 bytes, patch, mcmanus: review+
tcp-keepalive-traces.tar.gz, 11 years ago, Steve Workman [:sworkman] (INACTIVE), 9.96 MB, application/x-gzip, mcmanus: feedback+