Last Comment Bug 444328 - TCP-level keep alive timer
: TCP-level keep alive timer
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: Networking (show other bugs)
: unspecified
: x86 Mac OS X
: -- normal with 3 votes (vote)
: mozilla30
Assigned To: Steve Workman [:sworkman] (please use needinfo)
:
Mentors:
Depends on: 1157301 970550
Blocks: 970939
  Show dependency treegraph
 
Reported: 2008-07-09 06:21 PDT by benc
Modified: 2016-02-14 21:55 PST (History)
15 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
hurley's start of a keepalive patch (15.31 KB, patch)
2013-10-14 11:38 PDT, Nicholas Hurley [:nwgh][:hurley]
no flags Details | Diff | Splinter Review
v1.0 Add PRFileDescAutoLock and LockedPRFileDesc to automate and enforce calls to Get|ReleaseFD_Locked (16.02 KB, patch)
2013-12-05 14:40 PST, Steve Workman [:sworkman] (please use needinfo)
no flags Details | Diff | Splinter Review
v1.0 Add support for TCP keepalive in the Socket Transport Service (27.12 KB, patch)
2013-12-05 14:42 PST, Steve Workman [:sworkman] (please use needinfo)
no flags Details | Diff | Splinter Review
v1.0 Enable TCP Keepalive for short and long-lived HTTP Connections (16.08 KB, patch)
2013-12-05 14:42 PST, Steve Workman [:sworkman] (please use needinfo)
no flags Details | Diff | Splinter Review
v1.1 Add PRFileDescAutoLock and LockedPRFileDesc to automate and enforce calls to Get|ReleaseFD_Locked (14.51 KB, patch)
2013-12-11 14:01 PST, Steve Workman [:sworkman] (please use needinfo)
no flags Details | Diff | Splinter Review
v1.1 Add support for TCP keepalive in the Socket Transport Service (28.55 KB, patch)
2013-12-11 16:55 PST, Steve Workman [:sworkman] (please use needinfo)
no flags Details | Diff | Splinter Review
v1.2 Add PRFileDescAutoLock and LockedPRFileDesc to automate and enforce calls to Get|ReleaseFD_Locked (14.86 KB, patch)
2013-12-12 17:06 PST, Steve Workman [:sworkman] (please use needinfo)
mcmanus: review+
Details | Diff | Splinter Review
v2.0 Add support for TCP keepalive in the Socket Transport Service (29.59 KB, patch)
2013-12-16 16:31 PST, Steve Workman [:sworkman] (please use needinfo)
no flags Details | Diff | Splinter Review
v2.0 Enable TCP Keepalive for short and long-lived HTTP Connections (28.17 KB, patch)
2013-12-16 16:39 PST, Steve Workman [:sworkman] (please use needinfo)
no flags Details | Diff | Splinter Review
v2.1 Add support for TCP keepalive in the Socket Transport Service (31.52 KB, patch)
2013-12-17 23:03 PST, Steve Workman [:sworkman] (please use needinfo)
no flags Details | Diff | Splinter Review
v2.1 Enable TCP Keepalive for short and long-lived HTTP Connections (28.15 KB, patch)
2013-12-17 23:04 PST, Steve Workman [:sworkman] (please use needinfo)
no flags Details | Diff | Splinter Review
v2.2 Add support for TCP keepalive in the Socket Transport Service (34.92 KB, patch)
2014-01-15 15:29 PST, Steve Workman [:sworkman] (please use needinfo)
mcmanus: review+
Details | Diff | Splinter Review
v2.2 Enable TCP Keepalive for short and long-lived HTTP Connections (23.89 KB, patch)
2014-01-15 15:30 PST, Steve Workman [:sworkman] (please use needinfo)
mcmanus: review+
Details | Diff | Splinter Review
v1.0 Suppress spurious warnings in PRFileDescAutoLock constructor (995 bytes, patch)
2014-01-15 15:31 PST, Steve Workman [:sworkman] (please use needinfo)
mcmanus: review+
Details | Diff | Splinter Review
tcp-keepalive-traces.tar.gz (9.96 MB, application/x-gzip)
2014-01-15 15:32 PST, Steve Workman [:sworkman] (please use needinfo)
mcmanus: feedback+
Details

Description benc 2008-07-09 06:21:25 PDT
While reading other bugs, I saw this bug:

https://bugzilla.mozilla.org/show_bug.cgi?id=31884

It got me thinking about half-open connections, and whether or not we should be concerned about them.

The classic answer is "no". TCP/IP Illustrated, vol 1 says:

"The keepalive feature is intended to detect these half-open connections from the server side." (Chapter 21).

Clients are assumed to detect their half-open connections because they would send data down a connection, and trip off the "retransmit" timer.

However, this assumption was made in the days when a client was statically connected to the network, and the abnormal state of the server was easily established.

These days, we have a lot of interfaces that are coming up and down. We have offline-online features, and VPNs and NATs as well. I'm not sure that we are cleaning up our open connections in all cases. HTTP has keep alive, FTP has a timeout, but we have a lot of other protocols as well.

We should consider implementing using the OS-specific TCP keepalive, if it is available, because this would give us a timeout of last resort, which would drop connections to servers that are not answering after 2 hours (should be true for most UNIX implementations, I tried to research Windows, but could not find any useful info).
Comment 1 Rob Mueller 2011-07-27 01:54:05 PDT
Based on some recent research I did, I'd really like to revive and prioritise this bug.

http://blog.fastmail.fm/2011/06/28/http-keep-alive-connection-timeouts/

Basically it seems Chrome is setting the SO_KEEPALIVE flag on connections and specifically lowering the ping time to 45 seconds. This allows them to reuse keepalive connections as long as the server keeps them open, unlike Firefox which by default seems to limit to 115 seconds. I imagine this is due to the NAT/Firewall timeout problems Chrome talks about here:

http://code.google.com/p/chromium/issues/detail?id=27400

On top of this, there's another reason this is going to be important.

With EventSource connections, you basically are expected to connect to the server and just wait forever for events. Now if you do that on a laptop, and suspend the laptop, it can cause problems.

1. Browser connects to event source server, waits for events
2. User suspends the machine
3. Server sends event. Finds client isn't responding, TCP stack retries and eventually tells the server the connection is closed
4. User wakes up machine
5. Client still thinks it's connected, but never recieves anything from the server, and because it doesn't think the connection is dead, doesn't try and reconnect automatically either

I haven't fully confirmed this, but anecdotally this appears to be happening on a test setup we have.

This destroys the use of EventSource which is supposed to auto-reconnect, but doesn't in this case because it doesn't detect the connection is dead. Enabling SO_KEEPALIVE on the connection and setting low ping times would fix this as well.
Comment 2 Jason Duell [:jduell] (needinfo me) 2011-07-27 12:42:01 PDT
Will this also be an issue for Web sockets?
Comment 3 Patrick McManus [:mcmanus] 2011-07-27 12:56:19 PDT
I'm skeptical than anything useful ever comes out of SO_KEEPALIVE.

Websockets has its own application level ping/pong mechanism to address this problem.
Comment 4 Rob Mueller 2011-07-27 17:28:02 PDT
SO_KEEPALIVE definitely has it's place. I think it's why Firefox limits HTTP keepalive connections to 115 seconds by default, but Chrome has no limit. Again, read these two links for details.

http://blog.fastmail.fm/2011/06/28/http-keep-alive-connection-timeouts/
http://code.google.com/p/chromium/issues/detail?id=27400

I just tested my EventSource issue again, and I think it's unrelated to TCP keepalives. It appears that when you suspend a laptop, the connection is shut, but Firefox just reconnect when you wake up again. I'll open a separate bug about that.
Comment 5 Honza Bambas (:mayhemer) 2011-08-03 09:47:19 PDT
Few cents..

As I understand, goals would be:
- to keep connections longer time and save a RTT due to need of connection establishment on next request for resources from the same server/host
- to quickly discover a connection is dead (silently dropped by the server or router on the way) and prevent connection hang times
- to recover (detect connection close) after wake from sleep

I personally don't think TCP keep alive is the right way to do it.  TCP keep alives may cause false negatives when dropped.  Keeping the connection with the server for a long time might waste resources (also on the client side), significantly for secure connections.  Unless it is an event channel, then keeping connection for a longer time is IMO bad.  But keeping only one or two connections for longer time might mitigate that..

When we reuse a connection to make an HTTP request, and it takes a really long time to detect the connection is dead, we can restart idempotent transaction more quickly using our own timer.  Non-idempotent methods may always go through a new connection when all idle connections are too "old" and there is a risk of a hang due to a silent drop.
Comment 6 Rob Mueller 2011-08-03 17:47:51 PDT
> - to keep connections longer time and save a RTT due to need of connection
> establishment on next request for resources from the same server/host
> - to quickly discover a connection is dead (silently dropped by the server
> or router on the way) and prevent connection hang times
> - to recover (detect connection close) after wake from sleep

Right.

> I personally don't think TCP keep alive is the right way to do it.  TCP keep
> alives may cause false negatives when dropped.  Keeping the connection with
> the server for a long time might waste resources (also on the client side),
> significantly for secure connections. 

IMHO the only real issue is server resources, I can't see that keeping one or a few connections open from the client could use that many resources. And servers can control how long to keep a connection open, so I think as long as the server is willing to keep the connection open, the client should support doing that.

> Unless it is an event channel, then

Obviously it's a serious problem for event channels. You end up with event channels that appear alive to the client, but are actually dead. The whole point of the eventsource spec is that it auto-reconnects if the connection is lost, but it can only do that if the OS actually tells it the connection is lost, which can only happen if you enable SO_KEEPALIVE.

> When we reuse a connection to make an HTTP request, and it takes a really
> long time to detect the connection is dead, we can restart idempotent
> transaction more quickly using our own timer.  Non-idempotent methods may
> always go through a new connection when all idle connections are too "old"
> and there is a risk of a hang due to a silent drop.

Obviously you'll always need the application level timer. But with SO_KEEPALIVE, you'll actually need to use it less often because dead connections will be detected at the OS level "in the background" and that information sent up to the application level, rather than having to wait for the application level timeout when the next actual request occurs.
Comment 7 Patrick McManus [:mcmanus] 2011-08-03 17:54:08 PDT
(In reply to comment #6)
> Obviously you'll always need the application level timer. But with
> SO_KEEPALIVE, you'll actually need to use it less often because dead
> connections will be detected at the OS level "in the background"

s/in the background/while sucking your battery
Comment 8 Rob Mueller 2011-08-03 18:06:12 PDT
For regular http(s) connections, yes, there is a tradeoff there. On the other hand, I don't think it's a really big one. Most servers have low keepalive times, so in 99% of cases it'll be the server disconnecting.

For those servers that are happy with very long keepalive connections, a packet each way every 45 seconds doesn't seem a huge deal, Chrome has decided it's ok, and a better tradeoff than the client having to drop connections after < 2 mins due to **** NATs/firewalls.

For eventsource connections though, it's clearly a problem. Maybe there should be a separate bug for that specific case?
Comment 9 Honza Bambas (:mayhemer) 2011-08-04 04:55:53 PDT
(In reply to comment #6)
> the OS actually tells it the connection
> is lost, which can only happen if you enable SO_KEEPALIVE.
> 

Do you have a practical test on all major platforms?

My experience is to let an application detect stand by/hibernation it self.  I believe some OSes support APIs to hook to detect stand by or at least wake up.  We should use it to "restart" whatever is needed to be restarted after wakeing up.  Skype might be an example.

> Obviously you'll always need the application level timer. But with
> SO_KEEPALIVE, you'll actually need to use it less often because dead
> connections will be detected at the OS level "in the background" 

"In the background" is bad.  Goal was to express I am against using SO_KEEPALIVE for idle http connections.  It wastes resources, in general IMO.

If google says its good to do it, doesn't mean it really is good.
Comment 10 Rob Mueller 2011-08-04 05:54:57 PDT
> "In the background" is bad.  Goal was to express I am against using
> SO_KEEPALIVE for idle http connections.  It wastes resources, in general IMO.

If you're asking me for proof, I'd ask you for proof of this statement in turn.

> If google says its good to do it, doesn't mean it really is good.

To be fair, I never exactly said "google says it's good to do it", I just pointed to what the problem is, what Chrome are doing to solve it, and suggested that doing the same might be a good idea.

The big problem I think is that I've mixed up too many things in the one ticket.

So, reasons for SO_KEEPALIVE:

1. Sleep/wake - I was wrong on this, ignore this as a reason, the application generally knows when a sleep/wake cycle happens because the network goes down/comes back up

2. Keepalive connections - currently appear to be limited to 115 seconds on the client because of problematic NATs/firewalls with short memories. SO_KEEPALIVE would allow extending this, but benefit might be marginal, so not *important*, but might be useful

3. Eventsource - now this is a real problem. By definition eventsource connections are supposed to be long lived, and so problematic NATs/firewalls with short memories will definitely be a problem

But then maybe it's the server that should be sending regular application or TCP pings on eventsource connections, and not the clients job at all.

Seems I'm convincing myself out of all my own arguments today :)
Comment 11 Jason Duell [:jduell] (needinfo me) 2011-10-21 19:01:34 PDT
Note that the 

  http://code.google.com/p/chromium/issues/detail?id=27400#c15

seems to imply that firefox was getting hit by the timeout issue there.   Not sure if they tweaked the our internal keepalive timer to be longer when they tested.

For HTTP connection, we have no way to do a ping, right?  (barring sending some sort of bogus request and throwing away the answer).   

re: power issue.  if we found a compelling reason to turn on SO_KEEPALIVE, we could presumably detect if the browser has gone idle for some period and turn the sockopt off/on if we want to save power?
Comment 12 CPuckett.Dynetics 2013-01-06 16:24:53 PST
I'm seeing an issue where throbber spins for days trying to load a tab for some sites.  I suspect a NAT/firewall has forgotten about the TCP connection, but Firefox is still waiting for data.  This behavior isn't recent either.  Firefox has been doing this for as long as I can remember, back to around the 3.x versions.

The offending socket still shows as established in TCPView.  The problem is that the phantom connections keep hanging around until they are either manually closed by the user canceling the page load, or until Firefox hits the maximum number of connections it can have for that server.  At that point, the server will be unavailable until the user manually cancels some page loads.

I believe Jason is correct:
"For HTTP connection, we have no way to do a ping, right?  (barring sending some sort of bogus request and throwing away the answer)."

The alternative would be to set the SO_KEEPALIVE option on the socket and let the OS do that for you in a way that won't generate some sort of HTTP request for the server to have to process.
Comment 13 Jason Duell [:jduell] (needinfo me) 2013-08-12 17:48:44 PDT
As discussed with patrick and others, the plan here now is to set TCP keepalive with a small ping count and short timeout, so that we'll quickly detect stalled connections if we hit a lame-network situation (where the connection is forgotten by a wifi gateway, etc.).  Then after a minute or so we'll drop it down to a much less frequent ping, so we don't waste power/bandwidth for long lived idle websocket/Comet/EventSource connections.

Looks like we may not be able to change keepalive ping count on windows (always 10):

   http://msdn.microsoft.com/en-us/library/windows/desktop/ee470551(v=vs.85).aspx

Oddly enough it looks like they also allow you to set the ping timeout, which other OSes (sensibly) leave up to the TCP stack's estimation of RTT.
Comment 14 Nicholas Hurley [:nwgh][:hurley] 2013-10-14 11:38:02 PDT
Created attachment 816719 [details] [diff] [review]
hurley's start of a keepalive patch

As discussed with Steve, he's going to take this over as part of his Q4 work. Here's the patch I have so far, Steve, feel free to take it or leave it as you see fit.
Comment 15 Steve Workman [:sworkman] (please use needinfo) 2013-12-05 14:40:35 PST
Created attachment 8343332 [details] [diff] [review]
v1.0 Add PRFileDescAutoLock and LockedPRFileDesc to automate and enforce calls to Get|ReleaseFD_Locked

Pat, Honza, I know you're both busy, so whoever gets to this first can review it :)

Get|ReleaseFD_Locked are called in a lot of places, and required to protect nsSocketTransport::mFd. I wanted to add couple of private classes to automate some of these calls before adding the new functions which call setsockopt for keepalive. Those functions will be complicated enough and I want to add some defensive code to make sure I don't miss a mutex lock or ReleaseFD_Locked. Also, I think it makes nsSocketTransport2.cpp a little easier to read :)

LockedPRFileDesc:
-- Instance of this class replaces the member variable PRFileDesc mFd in nsSocketTransport.
-- Has |operator PRFileDesc*| which requires nsSocketTransport::mLock to be locked before allowing access to the underlying PRFileDesc - only effective in Debug builds.

PRFileDescAutoLock: 
-- Stack class only for use in nsSocketTransport function calls.
-- Auto locks nsSocketTransport::mLock and calls nsSocketTransport::GetFD_Locked during construction.
-- Auto locks the mutex and calls nsSocketTransport::ReleaseFD_Locked during destruction.
-- I added an optional var in the constructor to return nsSocketTransport::mCondition, so that the class can be used in a few more places.

Note 1: There are several places where it's awkward to use PRFileDescAutoLock because many things are done while the mutex is locked. So, I tried to cover the basic cases for now. However, LockedPRFileDesc should still ensure that the mutex is locked while accessing the PRFileDesc in these cases.

Note 2: If the socket is not attached, I don't require the lock to be held.
Comment 16 Steve Workman [:sworkman] (please use needinfo) 2013-12-05 14:42:05 PST
Created attachment 8343335 [details] [diff] [review]
v1.0 Add support for TCP keepalive in the Socket Transport Service

Based on Nick's original patch.

Add's APIs to support TCP keep alive in the socket transport service.

-- Makes calls to setsockopt and (WSAIOctl in Windows) to enable keep lives, set idle time before first keep alive ping, as well as ping interval and ping count where possible.
-- Adds some global prefs - note, this code is not complete yet, and the values are set so I can test it quickly locally - they are NOT the values that will be set in the final patch to be landed.\

Notes on platform support:
-- Linux supports setting idle time, interval and ping count.
-- Windows supports setting idle time and interval.
-- Max supports setting the idle time only.

For Windows and Mac, the other variables can be configured on a system wide basis only. Sucks.
Comment 17 Steve Workman [:sworkman] (please use needinfo) 2013-12-05 14:42:58 PST
Created attachment 8343338 [details] [diff] [review]
v1.0 Enable TCP Keepalive for short and long-lived HTTP Connections

Enables TCP keepalives for HTTP connections (non-SPDY).

Piggybacks on the TimeoutTick to check how long the connection has been established.
-- For initial connections, the idle time and ping interval should be set low for fast detection of downed connections.
-- For long-lived connections, idle timer and interval can be set higher for slow detection and hopefully less network use.
Note: Like the previous patch, the values here are not correct, and I need to set up some prefs for them. So, please ignore the values and just look at the code structure for now.
Comment 18 Honza Bambas (:mayhemer) 2013-12-09 09:57:51 PST
Comment on attachment 8343332 [details] [diff] [review]
v1.0 Add PRFileDescAutoLock and LockedPRFileDesc to automate and enforce calls to Get|ReleaseFD_Locked

Review of attachment 8343332 [details] [diff] [review]:
-----------------------------------------------------------------

Steve, this looks pretty sweet.  Just checked the patch, but not the whole code+patch nor I've tested the patch.  Leaving some more room to review this more carefully.

::: netwerk/base/src/nsSocketTransport2.h
@@ +22,5 @@
>  #include "mozilla/net/DNS.h"
>  #include "nsASocketHandler.h"
>  
>  #include "prerror.h"
> +#include "private/pprio.h"

bad :((
Comment 19 Honza Bambas (:mayhemer) 2013-12-10 08:19:09 PST
Comment on attachment 8343335 [details] [diff] [review]
v1.0 Add support for TCP keepalive in the Socket Transport Service

Review of attachment 8343335 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/src/nsSocketTransport2.cpp
@@ +2665,5 @@
> +    int
> +#endif
> +    optval = aEnable ? 1 : 0;
> +
> +    int err = setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &optval, sizeof(optval));

We have this in NSPR already.  Absolutely no need to reimpl it.
Comment 20 Patrick McManus [:mcmanus] 2013-12-11 10:53:26 PST
Comment on attachment 8343332 [details] [diff] [review]
v1.0 Add PRFileDescAutoLock and LockedPRFileDesc to automate and enforce calls to Get|ReleaseFD_Locked

Review of attachment 8343332 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm steve. I'm going to give you all socket transport bugs from here on out :)

::: netwerk/base/src/nsSocketTransport2.cpp
@@ +747,5 @@
>      , mOutputClosed(true)
>      , mResolving(false)
>      , mNetAddrIsSet(false)
>      , mLock("nsSocketTransport.mLock")
> +    , mFD(this)

MOZ_THIS_IN_INITIALIZER_LIST()

::: netwerk/base/src/nsSocketTransport2.h
@@ +22,5 @@
>  #include "mozilla/net/DNS.h"
>  #include "nsASocketHandler.h"
>  
>  #include "prerror.h"
> +#include "private/pprio.h"

which part of "private" was ambiguous? :)

@@ +139,5 @@
>      // nsASocketHandler methods:
>      void OnSocketReady(PRFileDesc *, int16_t outFlags);
>      void OnSocketDetached(PRFileDesc *);
>      void IsLocal(bool *aIsLocal);
> +    void OnKeepaliveEnabledPrefChange(bool aEnabled) MOZ_OVERRIDE MOZ_FINAL;

I don't think you meant to put this in this diff

@@ +180,5 @@
> +    public:
> +      typedef mozilla::MutexAutoLock MutexAutoLock;
> +
> +      PRFileDescAutoLock(nsSocketTransport *aSocketTransport,
> +                     nsresult *aConditionWhileLocked = nullptr)

indent nit

@@ +392,5 @@
> +
> +    bool mTCPKeepaliveEnabled;
> +    int32_t mTCPKeepaliveIdleTimeS;
> +    int32_t mTCPKeepaliveRetryIntervalS;
> +    static const int kTCPKeepalivePacketCount;

I'm pretty sure you don't want these 4 members in this patch
Comment 21 Patrick McManus [:mcmanus] 2013-12-11 10:54:12 PST
Comment on attachment 8343332 [details] [diff] [review]
v1.0 Add PRFileDescAutoLock and LockedPRFileDesc to automate and enforce calls to Get|ReleaseFD_Locked

lets see one more version
Comment 22 Patrick McManus [:mcmanus] 2013-12-11 11:03:36 PST
Comment on attachment 8343335 [details] [diff] [review]
v1.0 Add support for TCP keepalive in the Socket Transport Service

Review of attachment 8343335 [details] [diff] [review]:
-----------------------------------------------------------------

only had a quick read of this..  it seems generally sane. I'll leave it to honza to review in detail if he wants - if he would rather I do it that's fine too. We just don't both need to do it.

I think we should have a discussion of the policy and values to use, as well as the limits of the OS.

I would like the default to be no TCP keep-alive at the socket transport level.. and then let http and whatever else opt-in, rather than having things opt-out

::: netwerk/base/src/nsSocketTransport2.cpp
@@ +2417,5 @@
> +    *aResult = mTCPKeepaliveEnabled;
> +    return NS_OK;
> +}
> +
> +#if defined(XP_WIN)

if these can go at the top of the file - that's good
Comment 23 Steve Workman [:sworkman] (please use needinfo) 2013-12-11 14:01:30 PST
Created attachment 8346177 [details] [diff] [review]
v1.1 Add PRFileDescAutoLock and LockedPRFileDesc to automate and enforce calls to Get|ReleaseFD_Locked

Addressed Pat's comments.

Try push:
https://tbpl.mozilla.org/?tree=Try&rev=2ca629b9362e
Comment 24 Steve Workman [:sworkman] (please use needinfo) 2013-12-11 16:55:27 PST
Created attachment 8346260 [details] [diff] [review]
v1.1 Add support for TCP keepalive in the Socket Transport Service

So, setsockopt and "private/pprio.h": yes, PR_SetSockOpt exists, but it only supports en/disabling keepalive; it doesn't support keepalive config. I can file a separate bug to add that support in NSPR, but for now "private/pprio.h" needs to be included in order to get that native file desc handle, and setsockopt and WSAIOCtl need to be used.

However, I adjusted the patch to use PR_SetSockOpt in SetKeepaliveEnabled, which just en/disables keepalive. And I moved the functions around so that "private/pprio.h" only needs to be included in nsSocketTransport2.cpp and not the header.

I also moved LogOSError and LogNSPRError, so they are helper functions in nsSocketTransport2.cpp and no longer members of PRFileDescAutoLock. Probably better since it would allow them to be used throughout the file.

Let me know what you think.
Comment 25 Patrick McManus [:mcmanus] 2013-12-12 10:27:11 PST
Comment on attachment 8346177 [details] [diff] [review]
v1.1 Add PRFileDescAutoLock and LockedPRFileDesc to automate and enforce calls to Get|ReleaseFD_Locked

Review of attachment 8346177 [details] [diff] [review]:
-----------------------------------------------------------------

the try run for this review request is covered in orange. Clearing r? until that is cleared up.
Comment 26 Steve Workman [:sworkman] (please use needinfo) 2013-12-12 17:06:58 PST
Created attachment 8346933 [details] [diff] [review]
v1.2 Add PRFileDescAutoLock and LockedPRFileDesc to automate and enforce calls to Get|ReleaseFD_Locked

Needed to add a MutexAutoLock block in nsSocketTransport::InitWithConnectedSocket.

New try - will set r? later.
https://tbpl.mozilla.org/?tree=Try&rev=1757f1e03c5c
Comment 27 Steve Workman [:sworkman] (please use needinfo) 2013-12-13 09:12:05 PST
Comment on attachment 8346933 [details] [diff] [review]
v1.2 Add PRFileDescAutoLock and LockedPRFileDesc to automate and enforce calls to Get|ReleaseFD_Locked

Try looks good. r?mcmanus.
Comment 28 Patrick McManus [:mcmanus] 2013-12-16 07:53:58 PST
Comment on attachment 8346933 [details] [diff] [review]
v1.2 Add PRFileDescAutoLock and LockedPRFileDesc to automate and enforce calls to Get|ReleaseFD_Locked

Review of attachment 8346933 [details] [diff] [review]:
-----------------------------------------------------------------

beware 'random orange'
Comment 29 Steve Workman [:sworkman] (please use needinfo) 2013-12-16 16:31:09 PST
Created attachment 8348412 [details] [diff] [review]
v2.0 Add support for TCP keepalive in the Socket Transport Service

Updated patch; switching review to patch since Honza seems super busy with cache2. Feel free to re-assign the review.

Updated prefs/default config:
-- network.tcp.keepalives.enabled - enabled by default globally, but disabled by default for each socket. Socket owners must call nsSocketTransport:SetKeepaliveEnabled(true).
-- Idle time: 10 mins (until first probe; between successful probes).
-- Probe count: 8 (only configurable on Linux; Win is 10, Mac is 8).
-- Retry interval: 1 sec.

No big code changes.

Let me know what you think.
Comment 30 Steve Workman [:sworkman] (please use needinfo) 2013-12-16 16:39:32 PST
Created attachment 8348415 [details] [diff] [review]
v2.0 Enable TCP Keepalive for short and long-lived HTTP Connections

Enable Keepalives by default for HTTP connections (except SPDY).

HTTP Idle/'Short-lived' (i.e. first 60s after request is sent):
-- Idle time: 10s (until first probe; between successful probes).
-- Retry interval: 1 RTT (as measured at TCP connection time); minimum 1 sec.

WebSockets/'Long-lived' (i.e. after first 60s).
-- Idle time: 10 mins.
-- Retry interval: same as above.

Notes:
-- HTTP Idle connections always use the 'short-lived' keepalive config, unless they become reused.
-- WebSockets are converted to 'long-lived' upon the call to TakeTransport.
-- Keepalives always disabled for SPDY (disabled in StartSpdy).
Comment 31 Steve Workman [:sworkman] (please use needinfo) 2013-12-16 16:47:47 PST
(In reply to Patrick McManus [:mcmanus] from comment #28)
> Comment on attachment 8346933 [details] [diff] [review]
> v1.2 Add PRFileDescAutoLock and LockedPRFileDesc to automate and enforce
> calls to Get|ReleaseFD_Locked
> 
> Review of attachment 8346933 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> beware 'random orange'

Thanks for the r+ Pat. Those oranges look ok, but thanks for the warning.

Let's see what happens on inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ac84c96bc4df
Comment 32 Wes Kocher (:KWierso) 2013-12-16 21:37:47 PST
https://hg.mozilla.org/mozilla-central/rev/ac84c96bc4df
Comment 33 Steve Workman [:sworkman] (please use needinfo) 2013-12-17 23:03:28 PST
Created attachment 8349216 [details] [diff] [review]
v2.1 Add support for TCP keepalive in the Socket Transport Service

Updated:
-- Some fixes after testing.
-- Changed LogOSError and LogNSPRError to use NS_WARNING instead of SOCKET_LOG to get more detailed feedback on try runs.
-- Added maximums for setsockopt params: borrowed from Linux kernel header; don't think I can get the maxes for Win or Mac.
Comment 34 Steve Workman [:sworkman] (please use needinfo) 2013-12-17 23:04:38 PST
Created attachment 8349217 [details] [diff] [review]
v2.1 Enable TCP Keepalive for short and long-lived HTTP Connections

Some fixes after testing.
Comment 35 Patrick McManus [:mcmanus] 2013-12-18 12:24:30 PST
(In reply to Steve Workman [:sworkman] from comment #24)
> Created attachment 8346260 [details] [diff] [review]
> v1.1 Add support for TCP keepalive in the Socket Transport Service
> 
> So, setsockopt and "private/pprio.h": yes, PR_SetSockOpt exists, but it only
> supports en/disabling keepalive; it doesn't support keepalive config. I can
> file a separate bug to add that support in NSPR, but for now
> "private/pprio.h" needs to be included in order to get that native file desc
> handle, and setsockopt and WSAIOCtl need to be used.

it turns out this "private" interface is widely used.

https://mxr.mozilla.org/mozilla-central/ident?i=PR_FileDesc2NativeHandle
Comment 36 Patrick McManus [:mcmanus] 2013-12-18 13:17:57 PST
Comment on attachment 8349216 [details] [diff] [review]
v2.1 Add support for TCP keepalive in the Socket Transport Service

Review of attachment 8349216 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/libpref/src/init/all.js
@@ +4072,5 @@
> +pref("network.tcp.keepalive.idle_time", 600); // seconds; 10 mins
> +// Default timeout for retransmission of unack'd keepalive probes.
> +pref("network.tcp.keepalive.retry_interval", 1); // seconds
> +// Default maximum probe retransmissions.
> +pref("network.tcp.keepalive.probe_count", 8);

this should be considerably smaller. It is only sent on idle connections after all - if they get accidentally timed out it is not the end of the world. 4?

this is also a good place to document OS limitations

::: netwerk/base/src/nsSocketTransport2.cpp
@@ +768,5 @@
>      , mInput(MOZ_THIS_IN_INITIALIZER_LIST())
>      , mOutput(MOZ_THIS_IN_INITIALIZER_LIST())
>      , mQoSBits(0x00)
> +    , mKeepaliveEnabled(false)
> +    , mKeepaliveIdleTimeS(600)

did you mean to default this to -1

@@ +769,5 @@
>      , mOutput(MOZ_THIS_IN_INITIALIZER_LIST())
>      , mQoSBits(0x00)
> +    , mKeepaliveEnabled(false)
> +    , mKeepaliveIdleTimeS(600)
> +    , mKeepaliveRetryIntervalS(1)

did you mean to default this to -1 (I say this because of SetKeepaliveEnabled())

@@ +2414,5 @@
> +nsSocketTransport::OnKeepaliveEnabledPrefChange(bool aEnabled)
> +{
> +    MOZ_ASSERT(PR_GetCurrentThread() == gSocketThread, "wrong thread");
> +
> +    // Shouldn't need to do anything if keepalive is disabled for this socket.

// changing the global pref does not KA enable existing sockets

@@ +2421,5 @@
> +        if (NS_WARN_IF(!fd.IsInitialized())) {
> +            return;
> +        }
> +        DebugOnly<nsresult> rv = fd.SetKeepaliveEnabled(aEnabled);
> +        NS_WARN_IF_FALSE(NS_SUCCEEDED(rv), "Error en/disabling TCP keepalive!");

mKeepaliveEnabled = aEnabled ?

@@ +2446,5 @@
> +                    this, aEnable ? "enabled" : "disabled"));
> +        return NS_OK;
> +    }
> +
> +    mKeepaliveEnabled = aEnable;

I think you want to do this right before the final return, so the early returns don't make it inconsistent. You'll also want to change the checks of mKeepaliveEnabled to check aEnable

@@ +2479,5 @@
> +                mKeepaliveProbeCount,
> +                gSocketTransportService->IsKeepaliveEnabled() ?
> +                "enabled" : "disabled"));
> +
> +    // Only proceed if keepalive is enabled globally.

move this check up to after the thread assert at the beginning of the function (its a precondition)

@@ +2638,5 @@
>  }
>  
>  #endif
> +
> +void LogNSPRError(const char* aPrefix, const void *aObjPtr)

static

@@ +2660,5 @@
> +nsSocketTransport::PRFileDescAutoLock::SetKeepaliveEnabled(bool aEnable)
> +{
> +    MOZ_ASSERT(PR_GetCurrentThread() == gSocketThread, "wrong thread");
> +    MOZ_ASSERT(!(aEnable && !gSocketTransportService->IsKeepaliveEnabled()),
> +               "Cannot enable keepalive if global pref is disabled!");

I kinda feel like there is a race condition in that assert (prefs on main thread, setkeeapaliveenabled on socket thread..)

@@ +2668,5 @@
> +
> +    PRSocketOptionData opt;
> +
> +    opt.option = PR_SockOpt_Keepalive;
> +    opt.value.keep_alive = true;

= aEnable ?

@@ +2678,5 @@
> +    }
> +    return NS_OK;
> +}
> +
> +void LogOSError(const char *aPrefix, const void *aObjPtr)

static

@@ +2732,5 @@
> +    if (NS_WARN_IF(aProbeCount <= 0 || NS_MAX_TCP_KEEPCNT < aProbeCount)) {
> +        return NS_ERROR_INVALID_ARG;
> +    }
> +
> +    PROsfd sock = PR_FileDesc2NativeHandle(mFd);

mxr reminds me there is a lot of code that does this. hmm.

@@ +2733,5 @@
> +        return NS_ERROR_INVALID_ARG;
> +    }
> +
> +    PROsfd sock = PR_FileDesc2NativeHandle(mFd);
> +    if (NS_WARN_IF(sock == -1)) {

is that a valid check on windows?

::: netwerk/base/src/nsSocketTransportService2.cpp
@@ +955,5 @@
> +        rv = tmpPrefService->GetBoolPref(KEEPALIVE_ENABLED_PREF,
> +                                         &keepaliveEnabled);
> +        if (NS_SUCCEEDED(rv) && keepaliveEnabled != mKeepaliveEnabledPref) {
> +            mKeepaliveEnabledPref = keepaliveEnabled;
> +            OnKeepaliveEnabledPrefChange();

you should probably do this one last, so that the other members are updated before hand

::: netwerk/base/src/nsSocketTransportService2.h
@@ +37,5 @@
>  #define NS_SOCKET_POLL_TIMEOUT PR_INTERVAL_NO_TIMEOUT
>  
>  //-----------------------------------------------------------------------------
>  
> +// These maximums are borrowed from the linux kernel.

have you checked what happens on windows with these vals?

@@ +38,5 @@
>  
>  //-----------------------------------------------------------------------------
>  
> +// These maximums are borrowed from the linux kernel.
> +#define NS_MAX_TCP_KEEPIDLE        32767  // ~9 mins.

all of these static const uint32 in mozilla::net namespace please.
Comment 37 Patrick McManus [:mcmanus] 2013-12-18 13:18:00 PST
Comment on attachment 8349217 [details] [diff] [review]
v2.1 Enable TCP Keepalive for short and long-lived HTTP Connections

Review of attachment 8349217 [details] [diff] [review]:
-----------------------------------------------------------------

so before committing the code, I'd like to see it work :)

I think that means traces that show both short and long modes and detected failures in each.. mostly from windows, but a linux look would be good too. powering off a peer ought to be sufficient to make this work.

I think it would also be good to get a download in that took a lot longer than 10 seconds (but is continuously moving data) and search the trace for evidence of keep alives which shouldn't be there

and test google.com too for spdy - search the trace for keepalives.

what happens when a set of short parameters has been started by the kernel (i.e. it is on retry 4 of 8 because 10 seconds of idle have elapsed) and we transition to the long parameters? does it invalidate that existing probe because the long parameters are now in effect? does that delay us too much?

::: netwerk/protocol/http/nsHttpConnection.cpp
@@ +943,5 @@
>          mSocketIn = mInputOverflow.forget();
>  
> +    // Change TCP Keepalive frequency to long-lived if currently short-lived.
> +    if (mTCPKeepaliveConfig == TCP_KEEPALIVE_SHORT_LIVED_CONFIG) {
> +        nsresult rv = StartLongLivedTCPKeepalives();

let's get a positive case log message in here too so that we know taketransport() did this

@@ +1684,5 @@
> +        return NS_ERROR_NOT_INITIALIZED;
> +    }
> +    LOG(("nsHttpConnection::DisableTCPKeepalives [%p]", this));
> +    if (mTCPKeepaliveConfig != TCP_KEEPALIVE_DISABLED) {
> +        nsresult rv = mSocketTransport->SetKeepaliveEnabled(true);

um.. false, right?

::: netwerk/protocol/http/nsHttpConnection.h
@@ +15,5 @@
>  
>  #include "nsIAsyncInputStream.h"
>  #include "nsIAsyncOutputStream.h"
>  #include "nsIInterfaceRequestor.h"
> +#include "nsITimer.h"

I don't think you need this

@@ +297,5 @@
>      uint32_t                        mTransactionCaps;
> +
> +    // Flag to indicate connection is in inital keepalive period (fast detect).
> +    uint32_t mTCPKeepaliveConfig;
> +    PRIntervalTime mTCPKeepaliveStartTime;

mTCPKeealiveShortStartTime

::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ +222,5 @@
>  
>      if (!mTimeoutTickArmed)
>          return;
>  
> +    if (mNumActiveConns || mNumIdleConns)

I understand why you're doing this.. basically to transition "short" to "long" states - right?

Its too wasteful.. it means we'll pretty much never sleep. The connections can arm individual timers themselves for their transition points.

feel free to argue the point
Comment 38 Steve Workman [:sworkman] (please use needinfo) 2014-01-15 15:29:01 PST
Created attachment 8360708 [details] [diff] [review]
v2.2 Add support for TCP keepalive in the Socket Transport Service

Thanks for the review, Pat - sorry it's taken so long to respond.

(In reply to Patrick McManus [:mcmanus] from comment #36)
> Comment on attachment 8349216 [details] [diff] [review]
> v2.1 Add support for TCP keepalive in the Socket Transport Service
> 
> > +// Default maximum probe retransmissions.
> > +pref("network.tcp.keepalive.probe_count", 8);
> 
> this should be considerably smaller. It is only sent on idle connections
> after all - if they get accidentally timed out it is not the end of the
> world. 4?

Done.

> this is also a good place to document OS limitations

Done.

> ::: netwerk/base/src/nsSocketTransport2.cpp
> > +    , mKeepaliveEnabled(false)
> > +    , mKeepaliveIdleTimeS(600)
> > +    , mKeepaliveRetryIntervalS(1)

> did you mean to default [these] to -1
Yes - I've defaulted these to -1 in nsSocketTransport, but I have some defaults in nsSocketTransportService, so it becomes a sort of authoritative source of defaults should no prefs be set. Of course, keepalive defaults to disabled without prefs.

> @@ +2414,5 @@
> > +nsSocketTransport::OnKeepaliveEnabledPrefChange(bool aEnabled)
> > …
> > +    // Shouldn't need to do anything if keepalive is disabled for this socket.
> 
> // changing the global pref does not KA enable existing sockets

// The global pref toggles keepalive as a system feature; it only affects
// an individual socket if keepalive has been specifically enabled for it.
// So, ensure keepalive is configured correctly if previously enabled.

> @@ +2421,5 @@
> …
> > +        DebugOnly<nsresult> rv = fd.SetKeepaliveEnabled(aEnabled);
> > +        NS_WARN_IF_FALSE(NS_SUCCEEDED(rv), "Error en/disabling TCP keepalive!");
> 
> mKeepaliveEnabled = aEnabled ?

Nope - mKeepaliveEnabled is the config for the nsSocketTransport object only. Keepalives are only enabled for this socket if mKeepaliveEnabled AND the global pref are true.

> @@ +2446,5 @@
> > +                    this, aEnable ? "enabled" : "disabled"));
> > +        return NS_OK;
> > +    }
> > +
> > +    mKeepaliveEnabled = aEnable;
> 
> I think you want to do this right before the final return, so the early
> returns don't make it inconsistent. You'll also want to change the checks of
> mKeepaliveEnabled to check aEnable

Actually, I want this set early. It's possible that the fd is null, e.g. if we're about to retry connecting on a second address. In that case, I want keepalives to be set on OnSocketConnected.

> @@ +2479,5 @@
> > +                gSocketTransportService->IsKeepaliveEnabled() ?
> > ...
> > +    // Only proceed if keepalive is enabled globally.
> 
> move this check up to after the thread assert at the beginning of the
> function (its a precondition)

I'd like this to be after the vars are set. If keepalives are enabled globally, I want the sockets that have been configured to use keepalives to call setsockopt - see OnKeepaliveEnabledPrefChange.

> @@ +2638,5 @@
> > ...
> > +void LogNSPRError(const char* aPrefix, const void *aObjPtr)
> …
> @@ +2678,5 @@
> > ...
> > +void LogOSError(const char *aPrefix, const void *aObjPtr)
> 
> static

Done for both.

> @@ +2660,5 @@
> > +nsSocketTransport::PRFileDescAutoLock::SetKeepaliveEnabled(bool aEnable)
> > +{
> > +    MOZ_ASSERT(PR_GetCurrentThread() == gSocketThread, "wrong thread");
> > +    MOZ_ASSERT(!(aEnable && !gSocketTransportService->IsKeepaliveEnabled()),
> > +               "Cannot enable keepalive if global pref is disabled!");
> 
> I kinda feel like there is a race condition in that assert (prefs on main
> thread, setkeeapaliveenabled on socket thread..)

Hmm. nsSocketTransportService::OnKeepaliveEnabledPrefChange is dispatched to the socket thread, so nsSocketTransport::OnKeepaliveEnabledPrefChange should only be called on the socket thread too. If the pref change is called first, then SetKeepaliveEnabled should proceed to fd.SetKeepaliveEnabled. If SetKeepaliveEnabled is called first, then OnKeepaliveEnabledPrefChange should call fd.SetKeepaliveEnabled.

I think that should be ok, right?

> @@ +2668,5 @@
> > ...
> > +    opt.option = PR_SockOpt_Keepalive;
> > +    opt.value.keep_alive = true;
> 
> = aEnable ?

Ah indeed. Stupid mistake.

> @@ +2732,5 @@
> > ...
> > +    PROsfd sock = PR_FileDesc2NativeHandle(mFd);
> 
> mxr reminds me there is a lot of code that does this. hmm.

Yup - not pretty.

> @@ +2733,5 @@
> > ...
> > +    PROsfd sock = PR_FileDesc2NativeHandle(mFd);
> > +    if (NS_WARN_IF(sock == -1)) {
> 
> is that a valid check on windows?

I think so. I checked the code and it seems ok.

> ::: netwerk/base/src/nsSocketTransportService2.cpp
> @@ +955,5 @@
> > +        rv = tmpPrefService->GetBoolPref(KEEPALIVE_ENABLED_PREF,
> > +                                         &keepaliveEnabled);
> > +        if (NS_SUCCEEDED(rv) && keepaliveEnabled != mKeepaliveEnabledPref) {
> > +            mKeepaliveEnabledPref = keepaliveEnabled;
> > +            OnKeepaliveEnabledPrefChange();
> 
> you should probably do this one last, so that the other members are updated
> before hand

Done.

> ::: netwerk/base/src/nsSocketTransportService2.h
> @@ +37,5 @@
> ...
> > +// These maximums are borrowed from the linux kernel.
> 
> have you checked what happens on windows with these vals?

Yes, and everything looks fine.

> @@ +38,5 @@
> >  
> >  //-----------------------------------------------------------------------------
> >  
> > +// These maximums are borrowed from the linux kernel.
> > +#define NS_MAX_TCP_KEEPIDLE        32767  // ~9 mins.
> 
> all of these static const uint32 in mozilla::net namespace please.

Done.


I also added some code to enable keepalives in OnSocketConnected, if they were previously enabled. If the first address fails, and we're retrying, I still want what was set in nsSocketTransport to carry over to the new connection/fd. This includes the new function SetKeepaliveEnabledInternal which groups a block of common code to enable keepalives.
Comment 39 Steve Workman [:sworkman] (please use needinfo) 2014-01-15 15:30:00 PST
Created attachment 8360709 [details] [diff] [review]
v2.2 Enable TCP Keepalive for short and long-lived HTTP Connections

(In reply to Patrick McManus [:mcmanus] from comment #37)
> Comment on attachment 8349217 [details] [diff] [review]
> v2.1 Enable TCP Keepalive for short and long-lived HTTP Connections
> 
> Review of attachment 8349217 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> what happens when a set of short parameters has been started by the kernel
> (i.e. it is on retry 4 of 8 because 10 seconds of idle have elapsed) and we
> transition to the long parameters? does it invalidate that existing probe
> because the long parameters are now in effect? does that delay us too much?

It was a good idea to check this. The keepalives stop and no probes are sent until the first one should be sent for the long-lived phase. I have added code in StartShortLivedTCPKeepalives() to allow for a full set of unacked keepalive probes at the end of the short-lived phase, and I obtained traces to prove that it works.

> ::: netwerk/protocol/http/nsHttpConnection.cpp
> @@ +943,5 @@
> > …
> > +    // Change TCP Keepalive frequency to long-lived if currently short-lived.
> > +    if (mTCPKeepaliveConfig == TCP_KEEPALIVE_SHORT_LIVED_CONFIG) {
> > +        nsresult rv = StartLongLivedTCPKeepalives();
> 
> let's get a positive case log message in here too so that we know
> taketransport() did this

Done.

> @@ +1684,5 @@
> > +        return NS_ERROR_NOT_INITIALIZED;
> > +    }
> > +    LOG(("nsHttpConnection::DisableTCPKeepalives [%p]", this));
> > +    if (mTCPKeepaliveConfig != TCP_KEEPALIVE_DISABLED) {
> > +        nsresult rv = mSocketTransport->SetKeepaliveEnabled(true);
> 
> um.. false, right?

Aiee. Yes. Fixed.

> ::: netwerk/protocol/http/nsHttpConnection.h
> @@ +15,5 @@
> >  ...
> > +#include "nsITimer.h"
> 
> I don't think you need this

I didn't; it was an artifact from an earlier version. But now I do again because I switched to nsITimers per your request :)

> @@ +297,5 @@
> > ...
> > +    PRIntervalTime mTCPKeepaliveStartTime;
> 
> mTCPKeealiveShortStartTime

With the change to nsITimer, this was deleted.

> ::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
> @@ +222,5 @@
> >  
> >      if (!mTimeoutTickArmed)
> >          return;
> >  
> > +    if (mNumActiveConns || mNumIdleConns)
> 
> I understand why you're doing this.. basically to transition "short" to
> "long" states - right?
> 
> Its too wasteful.. it means we'll pretty much never sleep. The connections
> can arm individual timers themselves for their transition points.
> 
> feel free to argue the point

Don't want to argue :) Changed to nsITimers.
Comment 40 Steve Workman [:sworkman] (please use needinfo) 2014-01-15 15:31:22 PST
Created attachment 8360712 [details] [diff] [review]
v1.0 Suppress spurious warnings in PRFileDescAutoLock constructor

Adding this one as a ride-along.

The null fd warning in the constructor for PRFileDescAutoLock should be removed; currently, it warns in ALL cases if GetFD_Locked fails. But this function could fail correctly, e.g. if the socket is about to retry on a second address, but just hasn't got there yet.

It is better to add a warning in the function that owns the PRFileDescAutoLock, where it can be decided if a null fd is a true failure/error case.
Comment 41 Steve Workman [:sworkman] (please use needinfo) 2014-01-15 15:32:47 PST
Created attachment 8360714 [details]
tcp-keepalive-traces.tar.gz

(In reply to Patrick McManus [:mcmanus] from comment #37)
> so before committing the code, I'd like to see it work :)
> 
> I think that means traces that show both short and long modes and detected
> failures in each.. mostly from windows, but a linux look would be good too.
> powering off a peer ought to be sufficient to make this work.
> 
> I think it would also be good to get a download in that took a lot longer
> than 10 seconds (but is continuously moving data) and search the trace for
> evidence of keep alives which shouldn't be there
> 
> and test google.com too for spdy - search the trace for keepalives.

I think I have traces for all of these cases in the attachment. The gzip should expand with several folders for each platform tested - Win, Linux, Mac, Android and FFOS on Keon - with READMEs describing what the traces contain in each folder. There are also two trace files to show no keepalive probes for long downloads (around an hour) and SPDY connections (www.google.com). It's a huge bunch of data, so if it's not clear, or something might be missing, please let me know.
Comment 42 Patrick McManus [:mcmanus] 2014-02-05 09:48:09 PST
Comment on attachment 8360709 [details] [diff] [review]
v2.2 Enable TCP Keepalive for short and long-lived HTTP Connections

Review of attachment 8360709 [details] [diff] [review]:
-----------------------------------------------------------------

I'm still nervous about battery issues around idle http, even though that stage really only goes on for 115 seconds, so I guess I shouldn't worry too much.

::: netwerk/protocol/http/nsHttpConnection.cpp
@@ +1079,5 @@
> +    if (NS_WARN_IF(self->mUsingSpdyVersion)) {
> +        return;
> +    }
> +
> +    // Do not reduce keepalive probe frequency for idle connections.

hmm.

@@ +1655,5 @@
> +        return rv;
> +    }
> +
> +    // Start a timer to move to long-lived keepalive config.
> +    if(!mTCPKeepaliveTransitionTimer)

nit braces

::: netwerk/protocol/http/nsHttpConnection.h
@@ +146,5 @@
>      // When the connection is active this is called every 1 second
>      void  ReadTimeoutTick(PRIntervalTime now);
>  
> +    // For Active and Idle connections, this should be called every 1 second to
> +    // check if the TCP keepalive config should move from fast-detect to long-lived.

comment is out of date

@@ +172,5 @@
>  
>  private:
> +    // Value (set in mTCPKeepaliveConfig) indicates which set of prefs to use.
> +    enum TCPKeepaliveConfig {
> +      TCP_KEEPALIVE_DISABLED = 0,

nit - use your inside voice. (lets go away from CAPS when not using macros)
Comment 43 Patrick McManus [:mcmanus] 2014-02-05 09:48:38 PST
Comment on attachment 8360708 [details] [diff] [review]
v2.2 Add support for TCP keepalive in the Socket Transport Service

Review of attachment 8360708 [details] [diff] [review]:
-----------------------------------------------------------------

I deeply apologize for the delay. not good.

::: netwerk/base/src/nsSocketTransport2.cpp
@@ +2425,5 @@
> +    // The global pref toggles keepalive as a system feature; it only affects
> +    // an individual socket if keepalive has been specifically enabled for it.
> +    // So, ensure keepalive is configured correctly if previously enabled.
> +    if (mKeepaliveEnabled) {
> +        nsresult rv = SetKeepaliveEnabledInternal(true);

why is this true instead of aEnabled? r+ dependent on working this out

::: netwerk/base/src/nsSocketTransportService2.h
@@ +43,5 @@
> +// These maximums are borrowed from the linux kernel.
> +static const int32_t NS_MAX_TCP_KEEPIDLE  = 32767; // ~9 hours.
> +static const int32_t NS_MAX_TCP_KEEPINTVL = 32767;
> +static const int32_t NS_MAX_TCP_KEEPCNT   = 127;
> +static const int32_t NS_DEFAULT_TCP_KEEPCNT =

its really just a nit, but given that these are in a nice c++ namespace we don't need the awful NS_LOOKATME macro syntax.. mozilla::net::kMaxTCPBlah would be nicer.
Comment 44 Steve Workman [:sworkman] (please use needinfo) 2014-02-05 10:54:15 PST
Thanks Pat! Patches updated, repo rebased and pushed to try before landing:

https://tbpl.mozilla.org/?tree=Try&rev=f4d268d1e1e9

Also, I'm not 100% convinced about keepalives for HTTP Idle connections either, but let's land and see what happens.
Comment 45 Steve Workman [:sworkman] (please use needinfo) 2014-02-05 11:10:15 PST
Stopped last try job - forget to rename one of the constants in the 3rd patch. New try job:

https://tbpl.mozilla.org/?tree=Try&rev=6ee081e0ef47
Comment 46 Steve Workman [:sworkman] (please use needinfo) 2014-02-06 11:52:11 PST
And one try job with actual tests being run:
https://tbpl.mozilla.org/?tree=Try&rev=6a49b705e40f

Looks good to me. Going ahead with landing on inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4c04bfa359e4
https://hg.mozilla.org/integration/mozilla-inbound/rev/d4b5aac253d2
https://hg.mozilla.org/integration/mozilla-inbound/rev/82b1710a243f
Comment 48 Jacek Caban 2014-02-07 03:51:16 PST
I pushed a trivial fixup for cross compiling on case sensitive OSes:

https://hg.mozilla.org/integration/mozilla-inbound/rev/4877ea30474f
Comment 49 Steve Workman [:sworkman] (please use needinfo) 2014-02-07 10:49:01 PST
Thanks Jacek!
Comment 50 Ryan VanderMeulen [:RyanVM] 2014-02-07 11:27:36 PST
https://hg.mozilla.org/mozilla-central/rev/4877ea30474f
Comment 51 Mark Finkle (:mfinkle) (use needinfo?) 2014-02-18 11:48:44 PST
I'm worried that the keep alive timer will consume a lot of battery on mobile devices. Do we only need to flip the "network.tcp.keepalive.enabled" pref to | false | to disable this for Fennec? Or are there more prefs?
Comment 52 Steve Workman [:sworkman] (please use needinfo) 2014-02-18 12:14:58 PST
(In reply to Mark Finkle (:mfinkle) from comment #51)
> I'm worried that the keep alive timer will consume a lot of battery on
> mobile devices. Do we only need to flip the "network.tcp.keepalive.enabled"
> pref to | false | to disable this for Fennec? Or are there more prefs?

There are several prefs :) I think the pertinent ones are:

network.http.tcp_keepalive.short_lived_connections = true  --  this one means the keepalives for the first 60 secs of idle time are enabled. There are two other prefs to control the length of the short lived period, and the frequency of probes.

network.http.tcp_keepalive.long_lived_connections = true  --  this is one you might consider disabling. Or there is another pref to control the frequency of probes, which is currently set at 10 minutes. (None for length of long-lived period since this it's for the life of the connection).

As for "network.tcp.keepalive.enabled = true" this ALLOWS keepalives at the socket level, but they must be enabled on a per socket basis via API. I wouldn't change this one unless you want to disable all keepalives everywhere, which you might want to do, but I don't think you should start there.

If you want to take some measurements, I'd start with network.http.tcp_keepalive.long_lived_connections and compare setting it to true vs false.
Comment 53 Patrick McManus [:mcmanus] 2014-02-19 05:22:40 PST
(In reply to Mark Finkle (:mfinkle) from comment #51)
> I'm worried that the keep alive timer will consume a lot of battery on
> mobile devices. Do we only need to flip the "network.tcp.keepalive.enabled"
> pref to | false | to disable this for Fennec? Or are there more prefs?

Its a concern of mine too mark - but lets be careful to measure the impact. The way this is laid out there should actually be very little traffic. the biggest exposure is on an idle http-persistent socket. on desktop that is less than 2 minutes, steve pointed out to me on mobile that it is set at 10. One alternative might be to reduce that timeout rather than disabling the TCP KA, which is designed to give robustness advantages.

I have old desktop data that shows diminishing returns on the idle persistent connections over 2 minutes anyhow - (tho its old and desktop centric) so that might be the right thing to do.
Comment 54 Jason Duell [:jduell] (needinfo me) 2014-02-19 17:37:03 PST
> the biggest exposure is on an idle http-persistent socket.

websockets and long-lived COMET/AJAX are also a possible issue, right?  They'll ping every 10 minutes IIRC (and not at the same time unless the OS batches them, which I've seen no evidence of).
Comment 55 Patrick McManus [:mcmanus] 2014-02-19 19:07:19 PST
(In reply to Jason Duell (:jduell) from comment #54)
> > the biggest exposure is on an idle http-persistent socket.
> 
> websockets and long-lived COMET/AJAX are also a possible issue, right? 
> They'll ping every 10 minutes IIRC (and not at the same time unless the OS
> batches them, which I've seen no evidence of).

sure.. there are lots of cases. I said biggest. but of course if you craft the measurement to show those, that's what you'll see.

Note You need to log in before you can comment on or make changes to this bug.