Closed Bug 592284 Opened 14 years ago Closed 14 years ago

Accelerate HTTP Connection Timeout

Categories

(Core :: Networking: HTTP, enhancement)

enhancement
Not set
normal

Tracking


RESOLVED FIXED
mozilla2.0b8
Tracking Status
blocking2.0 --- betaN+
fennec 2.0+ ---

People

(Reporter: mcmanus, Assigned: mcmanus)

References

Details

(Whiteboard: [http-conn])

Attachments

(1 file, 6 obsolete files)

Some packets are more painful to lose than others. In particular, losing the SYN that sets up an HTTP connection is especially painful because the OS waits out large fixed timeouts before retransmitting it. The values differ between platforms, but a first retry after 3 seconds and a second at the 9 second mark (6 seconds after the first retry) is typical. A 3 second stall in reaction to a single lost packet is obviously very painful.

Chrome still lets the OS set the retransmit schedule, but if 250ms elapses after sending the first connection packet without a response, it opens a second parallel connection. If the problem was just ordinary packet loss that happened to hit the first SYN, things get going quickly (a standalone sketch of this race follows the traces below).

It makes a lot of sense - this is a very small amount of very useful traffic, so it is worth the risk of duplication.

This is the Firefox pattern:

00:00:00.000000 IP 192.168.16.214.53431 > 192.168.37.1.80:
00:00:02.999657 IP 192.168.16.214.53431 > 192.168.37.1.80: 
00:00:08.999659 IP 192.168.16.214.53431 > 192.168.37.1.80: 

This is the Chrome pattern:

00:00:00.000000 IP 192.168.16.214.53429 > 192.168.37.1.80:
00:00:00.251162 IP 192.168.16.214.53430 > 192.168.37.1.80:
00:00:02.991019 IP 192.168.16.214.53429 > 192.168.37.1.80: 
00:00:03.251023 IP 192.168.16.214.53430 > 192.168.37.1.80: 
00:00:08.991023 IP 192.168.16.214.53429 > 192.168.37.1.80: 
00:00:09.251025 IP 192.168.16.214.53430 > 192.168.37.1.80:
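
As a rough illustration of the race (a minimal standalone C++ sketch using POSIX non-blocking sockets - not Chrome's or Firefox's actual code; the 250ms threshold and the hand-to-pool step mirror the descriptions above, and error handling via SO_ERROR is omitted):

  #include <sys/socket.h>
  #include <sys/select.h>
  #include <netinet/in.h>
  #include <fcntl.h>
  #include <unistd.h>

  // Begin a non-blocking connect(); returns the socket fd immediately
  // (connect() reports EINPROGRESS while the SYN is in flight).
  static int StartConnect(const sockaddr_in &addr)
  {
      int fd = socket(AF_INET, SOCK_STREAM, 0);
      fcntl(fd, F_SETFL, O_NONBLOCK);
      connect(fd, reinterpret_cast<const sockaddr *>(&addr), sizeof(addr));
      return fd;
  }

  // Race two connection attempts: if the first socket is not writable
  // (handshake incomplete) within 250ms, launch a backup attempt and
  // use whichever completes first.
  int ConnectWithBackup(const sockaddr_in &addr)
  {
      int first = StartConnect(addr);

      fd_set wfds;
      FD_ZERO(&wfds);
      FD_SET(first, &wfds);
      timeval tv = {0, 250 * 1000};  // 250ms backup threshold
      if (select(first + 1, nullptr, &wfds, nullptr, &tv) > 0)
          return first;              // first SYN was answered in time

      // Probable SYN loss (or a slow path): race a second connection.
      int backup = StartConnect(addr);
      FD_ZERO(&wfds);
      FD_SET(first, &wfds);
      FD_SET(backup, &wfds);
      int maxfd = first > backup ? first : backup;
      select(maxfd + 1, nullptr, &wfds, nullptr, nullptr);

      int winner = FD_ISSET(first, &wfds) ? first : backup;
      int loser = (winner == first) ? backup : first;
      close(loser);                  // or hand it to a connection pool for reuse
      return winner;
  }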
Component: General → Networking: HTTP
Product: Firefox → Core
QA Contact: general → networking.http
This restart behavior can actually be applied to any idempotent transaction that is perceived to be frozen, not just to the handshake. Though the timeout on a live session should probably be balanced against how much data has already been transferred.
Blocks: network_perf
Accelerate TCP connection retries in HTTP
Bugzilla 592284

Losing a TCP SYN incurs a long, painful (typically 3 second) delay
before it is retried. This patch creates a second parallel connection
attempt for any nsHttpConnection which has not become writable before
a timeout occurs.

If you assume 0.5% packet loss, this converts a full 3 second delay
from a 1 in 200 event into a 1 in 40,000 event: both independent SYNs
must now be lost, and 0.005 * 0.005 = 1/40,000.

Whichever connection establishes itself first is used. If a second one
has been started and it also connects before the one in use is closed,
the extra one is handed to the connection manager for use by a
different transaction - essentially a persistent connection with 0
previous transactions on it.

The pref network.http.connection-retry-timeout controls the amount of
time in ms to wait for success on the initial connection before beginning
the second one. Setting it to 0 disables the parallel connection.

I have set the initial timeout, perhaps for experimentation purposes,
to an aggressive 100ms. This is sure to catch a fair number of cases
where the TCP handshake is just a little sluggish - but not in a full
loss-retransmit cycle. The effect of that is interesting - those
connections basically become a speculative TCP handshake which is put
in the connection manager for reuse. As we rarely connect to a host
just one time, this is a kind of prefetching of the connection. In a
completely unscientific test I see about 40% of these 'wasted syns'
come back to be used a little later on after a round trip through the
connection manager.
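
Condensed, the mechanism looks roughly like this (a paraphrased sketch, not the patch verbatim; mIdleSynTimer, IdleSynTimeout and mSocketTransport2 appear in the diff quoted in the review below, while ConnectionRetryTimeout() and CreateBackupTransport() are hypothetical stand-ins):

  // On activation, arm a one-shot timer alongside the first transport.
  PRUint32 timeout = gHttpHandler->ConnectionRetryTimeout(); // assumed accessor for the pref
  if (timeout) {
      nsresult rv;
      mIdleSynTimer = do_CreateInstance(NS_TIMER_CONTRACTID, &rv);
      if (NS_SUCCEEDED(rv))
          mIdleSynTimer->InitWithFuncCallback(IdleSynTimeout, this, timeout,
                                              nsITimer::TYPE_ONE_SHOT);
  }

  // Static timer callback: the first handshake is taking too long.
  void
  nsHttpConnection::IdleSynTimeout(nsITimer *timer, void *closure)
  {
      nsHttpConnection *self = static_cast<nsHttpConnection *>(closure);
      self->mIdleSynTimer = nsnull;
      // Open the second transport. Whichever stream first reports
      // writable in OnOutputStreamReady() is promoted to mSocketOut;
      // if the other also connects, it is handed to the connection
      // manager as an idle persistent connection.
      self->CreateBackupTransport(); // hypothetical wrapper around CreateTransport()
  }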
Assignee: nobody → mcmanus
Status: NEW → ASSIGNED
Attachment #475170 - Flags: review?(jduell.mcbugs)
blocking2.0: --- → ?
tracking-fennec: --- → ?
Great to see this work happening. Packet loss hitting SYN is still definitely an issue (certainly in my world, no thanks to AT&T :-P -- not to pick on them, it's a pervasive problem).

/be
Blocks: 599164
This could be a good bump in performance for our users, depending on the network they're on etc. Blocking to get this done for 2.0.
blocking2.0: ? → betaN+
tracking-fennec: ? → 2.0+
This is a good patch, but I can't imagine it blocking Firefox 4.
Attachment #475170 - Flags: review?(jduell.mcbugs) → review?(honzab.moz)
Remove an unintentional change of scope for one line of code in nsHttpConnection::Close()
Attachment #475170 - Attachment is obsolete: true
Attachment #482032 - Flags: review?(honzab.moz)
Attachment #475170 - Flags: review?(honzab.moz)
Blocks: 603503
Comment on attachment 482032 [details] [diff] [review]
restart http connections faster than tcp does v2

Patrick, in general, this patch looks very, very good.  My comments are mostly nits and a few small improvements.

>diff --git a/modules/libpref/src/init/all.js b/modules/libpref/src/init/all.js
>+pref("network.http.connection-retry-timeout", 100);

You probably want to have this at 250ms for the final patch

>diff --git a/netwerk/protocol/http/nsHttpConnection.cpp b/netwerk/protocol/http/nsHttpConnection.cpp
>+static PRUint64 sCreateTransport1 = 0;
>+static PRUint64 sCreateTransport2 = 0;
>+static PRUint64 sSuccessTransport1 = 0;
>+static PRUint64 sSuccessTransport2 = 0;
>+static PRUint64 sUnNecessaryTransport2 = 0;
>+static PRUint64 sWastedReuseCount = 0;
>+

I don't think you can use the ++ operator on PRUint64 on all platforms; maybe make these PRUint32 instead?

>+        mIdleSynTimer = do_CreateInstance(NS_TIMER_CONTRACTID, &rv);
>+        if (NS_FAILED(rv))
>+            return rv;
>+        
>+        rv = mIdleSynTimer->InitWithFuncCallback(IdleSynTimeout, this,
>+                                                 timeout,
>+                                                 nsITimer::TYPE_ONE_SHOT);
>+        if (NS_FAILED(rv))
>+            return rv;

I wouldn't necessarily fail on creating the timer.  It is not a fatal failure.
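
i.e., something like (a sketch of the suggested non-fatal handling):

  mIdleSynTimer = do_CreateInstance(NS_TIMER_CONTRACTID, &rv);
  if (NS_SUCCEEDED(rv)) {
      mIdleSynTimer->InitWithFuncCallback(IdleSynTimeout, this, timeout,
                                          nsITimer::TYPE_ONE_SHOT);
  }
  // on failure we just lose the backup-connection optimization and
  // proceed with the single transport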

>+nsHttpConnection::IdleSynTimeout(nsITimer *timer, void *closure)
>+                                            getter_AddRefs(self->mSocketTransport2),

80 chars please.

>@@ -487,19 +598,27 @@ nsHttpConnection::CreateTransport(PRUint
>-    mSocketTransport = strans;
>-    mSocketIn = do_QueryInterface(sin);
>-    mSocketOut = do_QueryInterface(sout);
>+    *sock = strans;
>+    (*sock)->AddRef();
>+
>+    nsCOMPtr<nsIAsyncInputStream> ain = do_QueryInterface(sin);
>+    *instream = ain;
>+    (*instream)->AddRef();
>+
>+    nsCOMPtr<nsIAsyncOutputStream> aout = do_QueryInterface(sout);
>+    *outstream = aout;
>+    (*outstream)->AddRef();
>+

Instead, do:
  
strans.forget(sock);
CallQueryInterface(sin, instream);
CallQueryInterface(sout, outstream);
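
(forget() transfers ownership into the out parameter without an extra AddRef/Release pair, and CallQueryInterface writes the QI result directly into the typed out parameter, replacing the manual AddRef calls.)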

> nsHttpConnection::OnSocketWritable()
>-
>+    

In general, in the whole patch, don't add new white spaces.

>+nsresult
>+nsHttpConnection::SupplementConnectionManager(nsISocketTransport *sock,
>+    nsHttpConnection *clone;

Use nsRefPtr here.

>+    if ((clone = new nsHttpConnection())) {

No need to check for result anymore.  new is infallible.

>+        NS_ADDREF(clone);

No need with nsRefPtr.

>+        rv = clone->Init(mConnInfo, mMaxHangTime);
>+        if (NS_SUCCEEDED(rv)) {
>+            clone->mIdleTimeout = gHttpHandler->IdleTimeout();

Please explain why you are setting this value (mIdleTimeout)

>+        NS_RELEASE(clone);

No need with nsRefPtr.

>+nsresult
>+nsHttpConnection::CheckTransport(nsIAsyncOutputStream *out)

>+        if (out == mSocketOut1) {
>+
>+            sSuccessTransport1++;

No blank line after if; this applies generally throughout the whole patch.

>+            mSocketTransport = mSocketTransport1;
>+            mSocketOut = mSocketOut1;
>+            mSocketIn = mSocketIn1;
>+
>+            mSocketTransport1 = nsnull;
>+            mSocketOut1 = nsnull;
>+            mSocketIn1 = nsnull;

You can do (also for mSocket*2):
  
mSocketTransport.swap(mSocketTransport1);
mSocketOut.swap(mSocketOut1);
mSocketIn.swap(mSocketIn1);
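
(nsCOMPtr::swap exchanges the underlying raw pointers without touching refcounts, so the six assignments and their AddRef/Release traffic become three cheap pointer swaps.)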

> NS_IMETHODIMP
> nsHttpConnection::OnOutputStreamReady(nsIAsyncOutputStream *out)
>+    nsresult rv;
>+    rv = (out == mSocketOut) ? NS_OK : CheckTransport(out);

I'd rather see this as:
  
  nsresult rv;
  if (out != mSocketOut)
    rv = CheckTransport(out);
  else
    rv = NS_OK;

>diff --git a/netwerk/protocol/http/nsHttpConnection.h b/netwerk/protocol/http/nsHttpConnection.h
>-    
>+

Again, no new white spaces please.

>+    nsresult     CheckTransport(nsIAsyncOutputStream *out);

Thinking of a better name for this... maybe SelectMainTransport would better convey that this method shuffles the members around.

>+    nsresult     SupplementConnectionManager(nsISocketTransport *sock,

Maybe call this HandSupplementTransportOver or NewConnectionFromSupplementTransport

>diff --git a/netwerk/protocol/http/nsHttpConnectionMgr.cpp b/netwerk/protocol/http/nsHttpConnectionMgr.cpp
> nsHttpConnectionMgr::GetConnection(nsConnectionEntry *ent, PRUint8 caps,
>-    if (AtActiveConnectionLimit(ent, caps)) {
>-        LOG(("  at active connection limit!\n"));
>-        return;
>-    }
>-

Please add a comment that explains why you are doing this check only when we create a new connection.

I believe I understand why you are doing this.  If we encounter a veeery slow server that makes us create a spare transport+connection on almost every new connection (i.e. we always receive its SYN ACK later than 250 ms), we will end up with at most (limit + limit - 1) connections.  Having the limit check before we try to find an idle connection would prevent us from actively using all the connections we made, which would be a waste, right?

Are we sure we cannot exceed this limit without bound in any way?

>@@ -952,32 +959,37 @@ nsHttpConnectionMgr::OnMsgReclaimConnect
>+        if (ent->mActiveConns.RemoveElement(conn)) {
>+            conn->Release();

Please rather use:

nsHttpConnection *temp = conn;
NS_RELEASE(temp);

And maybe it would be good to explain that there is at least one more reference kept by the method (event) context - that we addref the connection before we post the event and release it at the end of this method (the event handler).


Having mSocket*, mSocket*1 and mSocket*2 seems a bit breakneck to me.  Also, you duplicate the code for handling the mSocket*1 and mSocket*2 cases; maybe turn the new members into arrays?  But this is up to you, I don't require any such change.  The patch as it is now is quite transparent.

r=honzab with these changes.
Attachment #482032 - Flags: review?(honzab.moz) → review+
>+pref("network.http.connection-retry-timeout", 100);
>
> You probably want to have this at 250ms for the final patch

dougt:  for mobile, what's a sensible value for this?  I.e. how long to wait for the initial SYN to be ACK'd before we send out another one?
I'd hope Patrick would have some suggestions!  Patrick, any ideas?  We can run tests over simulated networks if that is helpful.
(In reply to comment #8)

> dougt:  for mobile, what's a sensible value for this?  I.e. how long to wait
> for the initial SYN to be ACK'd before we send out another one?

We have a couple options here..

1] use the timeout value to try and exclusively pre-empt actual SYN loss
2] use it for (1) AND "pre-fetching" clean connections for future use because extra successful connections are placed in the idle connection pool

No matter what, you are really doing #2 - but obviously, the lower the timeout, the stronger the #2 effect is.

The conservative thing to do is set a larger timeout (e.g. 250ms), which will pass muster with pretty much all well-functioning broadband and dialup links.

Mobile has such a massive range of RTTs.. many of them are going to be within that 250, but sometimes they are triple that.  The good news is that the ones that are normally higher gain the biggest benefit from prefetching the connection (because the connection is so painful to establish). 

So my gut says to just have one relatively conservative default in the 250ms range for all platforms and treat any accidental pre-fetches as serendipitous accidents. Then add code later on that explicitly tries to manage pre-fetches instead of making them a side effect of this timer - that could be done from the connection manager with a little more sense of the overall state of things (e.g. are there already idle connections to this host in the pool that have arrived since we decided to start a new one?).

Of course, this really needs a little experience in some betas and nightlies imo.

I'll update the patch with all the review feedback today and set it at 250.
> >diff --git a/netwerk/protocol/http/nsHttpConnectionMgr.cpp b/netwerk/protocol/http/nsHttpConnectionMgr.cpp
> > nsHttpConnectionMgr::GetConnection(nsConnectionEntry *ent, PRUint8 caps,
> >-    if (AtActiveConnectionLimit(ent, caps)) {
> >-        LOG(("  at active connection limit!\n"));
> >-        return;
> >-    }
> >-
> 
> Please add a comment that explains why you are doing this check only when we
> create a new connection.
>

I'll put this in a comment - but you've got it right. AtActiveConnectionLimit() governs the number of open connections, and reusing an idle connection never opens a new one, so it shouldn't really be prevented by that check. It didn't matter in the legacy code because idle+active could never be > max-conn-allowed, but as you note, that max can now be marginally exceeded if we successfully handshake twice on an occasional attempt to open a new connection.

We could close these extra connections instead of pooling them, at least when they exceeded the limit, but that would be a tragic waste of an expensive resource for us to create (i.e. a connection with a high RTT). 

However, I'm glad I thought this through for this comment - I think the call to PurgeOneIdleConnectionCB() needs to loop to deal with this "more than 1" concept (it only checks the global limit - not the per-host limit).. indeed, it should be called *before* the AtActiveConnectionLimit() check, as it could potentially free up room for a new connection by closing an idle one connected to another host instead of queueing this request. That would almost certainly be better.
Honza, thanks for the review.

This is an updated patch that addresses your review comments in c7.

I have also made the PurgeIdleConnection logic changes that I referred to in c11. Because of that substantive change I've asked for another (hopefully quickie) r?. Thanks.
Attachment #482032 - Attachment is obsolete: true
Attachment #482903 - Flags: review?(honzab.moz)
Comment on attachment 482903 [details] [diff] [review]
restart http connections faster than tcp does

Just check the 80-character line limit in nsHttpConnectionMgr::GetConnection.

r=honzab
Attachment #482903 - Flags: review?(honzab.moz) → review+
Is there any risk that pooled connections could get "stale"?  Do we give them a max lifetime, and/or are HTTP servers generally happy to wait arbitrarily long for an initial GET/POST to come over the wire?  (I could imagine some might consider idle empty connections a DoS attack and time them out aggressively, but that's just a guess.)  I assume we recover gracefully if we try to pick up an idle connection and it winds up being closed?
(In reply to comment #14)
> Is there any risk that pooled connections could get "stale"?  Do we give them a
> max lifetime, and/or are HTTP servers generally happy to wait arbitrarily long
> for an initial GET/POST to come over the wire?   (I could imagine some might
> consider idle empty connections a DOS attack and time them out aggressively,
> but that's just a guess).  I assume we recover gracefully if we try to pick up
> an idle connection and it winds up being closed?

If we pick up an idle connection, it gets tested for EOF first, and if that is positive we close it and try another one (or make a new connection if there aren't any more idle ones). I just tested that out to make sure it works - and it does.

It is certainly possible that servers will have shorter timeouts for transaction 1 than for transaction N>1 (I'm pretty sure I've implemented that myself on the server side), but they still won't be tiny amounts of time. Each time I have looked at the statistics, around 40%-50% of the sockets that get put in the idle connection pool without having had a transaction on them eventually run a transaction successfully. Anecdotally, the other half just seem to go unused within a reasonable period of time (i.e. they are not shut down immediately by the server).

One thing I did notice, and this is for a different bugzilla, is that when the server closes an idle connection we don't react to it in any way until either our timeout elapses or we happen to try to use that connection. During this time the socket sits in CLOSE-WAIT state and we don't generate a FIN - which is kinda rude to the server, which is trying to reclaim the connection in an orderly way, and it keeps our tables full of useless idle conns that still count against maxconn, etc. If some flavor of listener were established on the idle conns, each could just reap itself when that happened. (And I do generally think the servers time out before we do - which makes sense: we have little incentive to release that resource, but the server might want to do so for scaling reasons.)
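
For reference, the reuse-time EOF test described above amounts to roughly this (a sketch; nsISocketTransport::IsAlive() is the existing liveness check, the surrounding control flow is paraphrased):

  // Before handing an idle connection to a new transaction, check that
  // the server has not already closed it (a closed peer reads as EOF).
  PRBool alive = PR_FALSE;
  nsresult rv = mSocketTransport->IsAlive(&alive);
  if (NS_FAILED(rv) || !alive) {
      // stale: close it; the caller tries the next idle connection,
      // or creates a fresh one if the pool is empty
      Close(NS_ERROR_NET_RESET);
      return PR_FALSE;
  }
  return PR_TRUE;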
The patch for bug 591881 landed 10/14 and conflicted with this patch in a couple meaningful ways. I didn't know that was coming.

Because the merge wasn't just syntactic I'm going to err on the conservative side and bug you yet again for an r? - the only changes are in nsHttpConnectionMgr.cpp, where the new code for adjusting the delta timer had to be merged with the reaping changes this patch makes.. also note that rather than being the only caller of PurgeOneIdle() and calling it repeatedly, I've replaced it with a smaller routine that tries to get under quota by itself.
Attachment #482903 - Attachment is obsolete: true
Attachment #483637 - Flags: review?(honzab.moz)
Comment on attachment 483637 [details] [diff] [review]
restart http connections faster than tcp does

>r=honza

If you want to refer to this anywhere, then please use honzab (b at the end, to avoid confusion, as there is more than one Honza here)

>+nsHttpConnectionMgr::PurgeExcessIdleConnectionCB(nsHashKey *key, 
>+                                                 void *data, void *closure)

Shouldn't be PurgeExcessIdleConnectionsCB (plural) ?

> {
>     nsHttpConnectionMgr *self = (nsHttpConnectionMgr *) closure;
>     nsConnectionEntry *ent = (nsConnectionEntry *) data;
> 
>-    if (ent->mIdleConns.Length() > 0) {
>+    while ((self->mNumIdleConns + self->mNumActiveConns + 1 > self->mMaxConns) &&

Nit: this condition is relatively complex and used twice in the code.  It might get changed in the future.  What about changing the code a bit, like this:

{
  while ((self->mNumIdleConns + self->mNumActiveConns + 1) > self->mMaxConns) {
    if (!ent->mIdleConns.Length()) {
      // We cannot free up any more idle connections for this host, go to another one
      return kHashEnumerateNext;
    }

    ... close, release and remove the conn, do a next while round ...
  }

  return kHashEnumerateStop;
}
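
A concrete version of the elided loop body might be (hypothetical; member names follow the hunk above and the close/release idiom follows the earlier review comments):

  while ((self->mNumIdleConns + self->mNumActiveConns + 1) > self->mMaxConns) {
      if (!ent->mIdleConns.Length()) {
          // cannot free any more idle connections for this host
          return kHashEnumerateNext;
      }

      nsHttpConnection *conn = ent->mIdleConns[0];
      ent->mIdleConns.RemoveElementAt(0);
      self->mNumIdleConns--;
      conn->Close(NS_ERROR_ABORT);
      NS_RELEASE(conn);
  }

  return kHashEnumerateStop;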


Please remove the trailing new line added to nsHttpConnectionMgr.cpp in patch for bug 591881.

r=honzab
Attachment #483637 - Flags: review?(honzab.moz) → review+
new patch to reflect helpful suggestions in c17.

carrying over r=honzab, but bugzilla doesn't seem to let me do that directly (anymore?)
Attachment #483947 - Flags: review+
Attachment #483947 - Flags: approval2.0?
update due to bitrot from bug 606719 so it applies against moz-central
Attachment #483637 - Attachment is obsolete: true
Attachment #483947 - Attachment is obsolete: true
Attachment #486701 - Flags: review+
Attachment #486701 - Flags: approval2.0?
Attachment #483947 - Flags: approval2.0?
Comment on attachment 486701 [details] [diff] [review]
restart http connections faster than tcp does v6

this patch doesn't need approval because the bug is marked blocking-fennec, which clears you to land
Attachment #486701 - Flags: approval2.0?
we're not adding this to Firefox b7. Please hold off.
Is this bug the root cause of the slow "Connecting..." state on congested or otherwise lossy sites/networks?
(In reply to comment #22)
> Is this bug the root cause of slow connecting state on congested or otherwise
> lossy sites/networks?

The situation you describe is the one that this patch improves.
This is ready to land, right?
Keywords: checkin-needed
yes by me.. carries honzab's r+
(In reply to comment #25)
> yes by me.. carries honzab's r+

That's good, hope it makes b8 before it freezes.
patching file netwerk/protocol/http/nsHttpConnection.cpp
Hunk #2 succeeded at 134 with fuzz 2 (offset -1 lines).
patching file netwerk/protocol/http/nsHttpConnectionMgr.cpp
Hunk #5 FAILED at 1199
1 out of 5 hunks FAILED -- saving rejects to file netwerk/protocol/http/nsHttpConnectionMgr.cpp.rej
patching file netwerk/protocol/http/nsHttpConnectionMgr.h
Hunk #1 FAILED at 202
1 out of 1 hunks FAILED -- saving rejects to file netwerk/protocol/http/nsHttpConnectionMgr.h.rej
Keywords: checkin-needed
The patch took on irrelevant context from bug 595316, which is part of the series this comes from (though this patch can stand alone). I've just made a fresh version off moz-central and will attach it here after it builds and passes the netwerk unit tests. Nothing changes but some context in the diff.
updated diff context - applies against moz central as of 11/20
Attachment #486701 - Attachment is obsolete: true
Attachment #492098 - Flags: review+
Keywords: checkin-needed
http://hg.mozilla.org/mozilla-central/rev/be4b69a4babf
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla2.0b8
Blocks: 613977
It seems this causes some problems. It is hard to reproduce (it takes a while to manifest), but since this landed I have ended up with a sort of hung connection state a few times. When this happens, all loading tabs stay at "Connecting...". The only way to get things moving is to close the tab that caused the stall, but there is no way to find out which one it was, so it's a matter of luck (i.e. closing them one by one until it starts working again). Today it hung because of a download; the download itself was running fine (and I tried IE, and that was loading just fine too), but nothing else could be loaded in Minefield until the download finished (20 minutes).
khronos, could you please report a new bug about this and CC me?  Thank you!
No longer depends on: 614950
Blocks: 614958
No longer blocks: 613977
Depends on: 613977
Depends on: 614677
Blocks: 623921
Whiteboard: [http-conn]