Closed Bug 606719 Opened 14 years ago Closed 14 years ago

Browser stalls with connections left hanging in TCP close wait state

Categories

(Core :: Networking: HTTP, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- beta7+

People

(Reporter: transgressivebehaviour, Assigned: mcmanus)

Details

(Keywords: regression)

Attachments

(3 files, 4 obsolete files)

User-Agent:       Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b8pre) Gecko/20101022 Firefox/4.0b8pre
Build Identifier: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b8pre) Gecko/20101022 Firefox/4.0b8pre

This is a problem that appeared sometime after the changes to the pipelining system a few days ago; I first attributed it to GC.
When using the nightly or hourly builds I always end up with a stalled browser after a while, with the tab/loading indicators left hanging in the loading state. The problem happens faster when I set network.http.max-connections to more than 60, but it does not seem to hinge on that; I can reproduce it with far fewer connections.
TCPView shows that a lot of connections are left hanging in the CLOSE_WAIT state without ever actually getting terminated. Sometimes closing all tabs restores browsing ability for a while, only for it to stall again. Sometimes closing the browser then leads to a crash. This happens with pipelining both enabled and disabled.

Reproducible: Always
Component: General → Networking: HTTP
Product: Firefox → Core
QA Contact: general → networking.http
Version: unspecified → Trunk
Severity: critical → normal
Do you have a crash ID (see about:crashes)?
There is no crash; it just hangs. Setting network.http.max-connections to 60 and restarting the browser, then trying to load MSNBC, will hang/stall the browser, and other tabs will just stay in a 'connecting' state if you try to refresh them.
Hardware: x86 → x86_64
Status: UNCONFIRMED → NEW
Ever confirmed: true
There are two builds in

https://bugzilla.mozilla.org/show_bug.cgi?id=593386#c21

try with both of them and see if you see a difference between them, in order to rule that patch in or out. If you do see a difference, please be precise about the conditions you see.

But honestly, it doesn't seem likely that patch has anything to do with it if this can happen with pipelining turned off (that whole code path is a no-op in that case).
I think I can totally rule out the pipelining patch. Your tryserver build without it suffers from the same problems.
I seem to have found when the problem began. The nightly of October 14 does not show those hanging connections.
I can confirm msnbc.com causes all kinds of fits with high connection settings in this overnight build:

Build identifier: Mozilla/5.0 (X11; Linux x86_64; rv:2.0b8pre) Gecko/20101020 Firefox-4.0/4.0b8pre

Using this pref: ("network.http.max-connections", 60);

Set at its default of 30, everything is ok.. pipelining is off, other relevant prefs (max-connections per server, max keepalive, etc..) are set at their defaults.

It's kind of a shot in the dark, but you might try a build with the patch for https://bugzilla.mozilla.org/show_bug.cgi?id=591881 backed out. That code landed 10/14 and made some changes around connection handling.
That sounds like a close fit for what I was seeing in TCPView, namely a lot of dead connections that were never killed and seem to clog up the browser. Sadly I'm unable to try the backout myself.
(In reply to comment #7)
> That sounds like a close fit for what I was seeing in TCPView, namely a lot of
> not killed dead connections that seem to clog up the browser. Sadly I'm unable
> to try the backout myself.

can you try it if I make you a build?
Sure!
(In reply to comment #9)
> Sure!

this is mozilla-central from earlier today with c66b0f9777bb (the patch for 591881) backed out.

http://stage.mozilla.org/pub/mozilla.org/firefox/tryserver-builds/mcmanus@ducksong.com-dbf9bd371d86/tryserver-win32/firefox-4.0b8pre.en-US.win32.zip
OK, testing the try.zip build. Bumped max-connections to 60. As I reported above, MSNBC will now load. It's not loading any faster (in fact slower) than at the default setting of 30, but the point is it does load the site. CNN, which would not load at all, now loads.

I will run this for a few hours and see.  But it appears that the back-out has fixed the problem.
It may be premature, but looking at TCPView there aren't any CLOSE_WAIT entries showing, and there are a lot fewer TCP connections overall, even set at 60.

These are things I do not understand very well.
Jon, can you take a look at this? If we can't figure it out soon we'll have to back out the original patch.

Assigning to you for now.
Assignee: nobody → jon.hemming
For some reason, the magic number for idle connections is 50, so if we put max connections over that we will freeze. I found one bug in the patch that was created for bug 591881:

When the timer notification is observed, the time of the next wake-up needs to be reset like this:
mTimeOfNextWakeUp = LL_MAXUINT;

Before, it was possible that we never woke up to prune dead connections because this was missing. Now we will wake up eventually, but it may take a long time.

I found these things quite confusing:
http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpConnectionMgr.cpp#559:
558 
559     // use >= just to be safe
560     if (mNumActiveConns >= mMaxConns) {
561         LOG(("  num active conns == max conns\n"));
562         return PR_TRUE;
563     }
564 

and

http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpConnectionMgr.cpp#648
647         
648         // We created a new connection that will become active, purge the
649         // oldest idle connection if we've reached the upper limit.
650         // This only needs to be done if there is a idle connection.
651         if (0 < mNumIdleConns && mNumIdleConns + mNumActiveConns + 1 > mMaxConns)
652             mCT.Enumerate(PurgeOneIdleConnectionCB, this);
653 
654         // XXX this just purges a random idle connection.  we should instead
655         // enumerate the entire hash table to find the eldest idle connection.
656     }
657 

I modified the latter to add the check "0 < mNumIdleConns", but "mNumIdleConns + mNumActiveConns + 1 > mMaxConns" was already there. How come mMaxConns can be used for two different things?
Seems like this was really the guilty party. With https://bugzilla.mozilla.org/show_bug.cgi?id=591881 backed out I can't get Fx to stall no matter how high I set network.http.max-connections.
(In reply to comment #15)
> Seems like this was really the guilty party. With
> https://bugzilla.mozilla.org/show_bug.cgi?id=591881 backed out I can't get Fx
> to stall no matter how high I set network.http.max-connections.

This depends on how high you can get the real value of mNumIdleConns. Since the pruning of idle connections was previously done constantly, this was not as easily visible, but you can reproduce it with that build too. Just apply this patch to restore the old pruning with printouts, then browse heavily to get mNumIdleConns to 50, and notice how you can cause the same thing.
The browser won't freeze completely, only for some time, but it is basically the same thing. Bug 591881 won't freeze completely either if you add the patch I added for it now, but it stays in that constantly-loading state for a longer time. I think there is something logically wrong if the browser freezes when we are not constantly waking up to prune idle connections. Whatever goes wrong there should be noticed somewhere else before we freeze.
Attached patch prune idle on admission (obsolete) — Splinter Review
You can try this patch (not even compile tested).. I have something similar in my pipelining series.

Basically: 

1] only check AtConnectionLimits() when making a new connection, not when reusing an idle connection. Which makes perfect sense to me. (if we aren't adding connections, the connection limits aren't interesting as they can never be exceeded.) This fixes the problem where we have idle connections but are not using them.

2] move the prune-idle logic into the pre-create-connection code as well, instead of the post-create-connection code where it was.. there was a race I think you were seeing where the connection could become idle in between those two events (i.e. not get pruned in post because it wasn't idle yet, but it would be available to be idle in the next pre- round)..

.. but this is only tested in my pipelining connection mgr which looks very different than the trunk version.
dwitte has another patch for this.
Assignee: jon.hemming → dwitte
blocking2.0: --- → beta8+
Attached patch patch v1 (obsolete) — Splinter Review
I think this will do what we want. I merged Patrick's patch with Jon's, and made some tweaks.

* mTimeOfNextWakeUp should be reset both when we stop the timer, and before we perform a prune.
* GetConnection() should be stopping the connection timer if we don't have any idle connections left, which is consistent with everywhere else we remove one.
* It's OK for AtActiveConnectionLimit() to measure just whether active connections are over the limit. (If the idle + active total is over the limit, we can purge afterward.) If the active connections are over the limit, we're done, and there's no need to try a purge.
* The enumeration in PurgeOneIdleConnectionCB() will stop the prune timer if necessary, so there's no need to do it explicitly.

Can you both take a look and see what you think?

Also -- Patrick, I didn't really understand your comment about the race. How exactly would purging before initing the new connection be better than after? Are you saying the newly inited connection might be in an idle state, leading to purging it prematurely?
Attachment #485883 - Flags: review?(mcmanus)
Attachment #485883 - Flags: feedback?(jon.hemming)
(In reply to comment #20)

> Also -- Patrick, I didn't really understand your comment about the race. How
> exactly would purging before initing the new connection be better than after?
> Are you saying the newly inited connection might be in an idle state, leading
> to purging it prematurely?

the code said, roughly,

"if we just made a new connection and the total number of active + idle conns now bumps us up against max conns then try and prune an idle connection".

the problem is that there might not be an idle connection right then - they might all be active, and none get pruned. But suppose one of those active connections goes idle if we wait 200ms (which will probably be true, when it finishes its transaction). This newfound idleness is too late for the getconnection logic - it has to be picked up by the timer sweep. If a new connection attempt comes along before the timer sweep, it will

 a] fail (and its request will be queued) because the AtActiveConnectionLimit() is the first thing that is done.. this is tragic - we've got an idle connection and we just ignore it and queue instead...

 b] if the AtActive*() check is moved to only apply to the opening of a new connection then the newly-idle connection can be reused if the request is to the same host. But it obviously cannot be used if the host doesn't match, and AtActive*() will still reject and force a queueing because maxconns is a global limit.. but we don't need to queue - we could close an idle connection to some other host and serve the request in front of us right now, which is why I indicated moving the purgeoneidle logic to just before the atactive*() check - give that thing the maximum chance of succeeding..

does that make sense? (the "race" is when a connection becomes idle).
Comment on attachment 485883 [details] [diff] [review]
patch v1

As for the patch, I think you want all the bits from mine plus likely jon's.

missing:
* I think you need to remove the original call to AtActiveConnectionLimit(), not just add another one.. the original one is what causes the reliance on the sweeper (and big pauses) when we are at maxconnections with some idle ones because the idle ones are never considered.

* change the mCT.enumerate(purge) bit to be in front of the atactiveconnectionlimit() call for the reasons in my last comment.. make sure you get the +1 in there from my patch to anticipate the new connection.
Attached patch patch v2 (obsolete) — Splinter Review
(In reply to comment #22)
> * I think you need to remove the original call to AtActiveConnectionLimit(),

Right you are, I thought I did. :/

> * change the mCT.enumerate(purge) bit to be in front of the
> atactiveconnectionlimit() call for the reasons in my last comment.. make sure
> you get the +1 in there from my patch to anticipate the new connection.

Fair enough; the fact that the timer only has 1 sec resolution is the issue here. Sounds reasonable to me.

This is basically what you had, then; look OK?
Attachment #485883 - Attachment is obsolete: true
Attachment #485931 - Flags: review?(mcmanus)
Attachment #485883 - Flags: review?(mcmanus)
Attachment #485883 - Flags: feedback?(jon.hemming)
Comment on attachment 485931 [details] [diff] [review]
patch v2

I think this looks fine. I'm not sure why changing maxconns from 30 to 60 triggers the behavior, though; it would seem to have been a problem at 30 as well. But I guess I can imagine it having to do with the pattern of how many requests end up on each connection.

did you try a build with this (I have not) to confirm?
Attachment #485931 - Flags: review?(mcmanus) → review+
Comment on attachment 485931 [details] [diff] [review]
patch v2

Yes, all changes look good and should be made, but the stalling of the browser can still be reproduced, and the magic number for mNumIdleConns is still 50. For some reason you can't get more than 50 idle connections. Does anyone know the reason for this?
Try builds with this patch are at http://ftp.mozilla.org/pub/mozilla.org/firefox/tryserver-builds/dwitte@mozilla.com-569650a9a9b8/. Can you test whether it fixes the problem for the default number of connections (30)? If so, the idle problem at 50 or 60 may be a different issue...

I'm not sure why there would be a magic number of idle connections. What OS are you testing on?
Also, do you see the same number of idle connections with a build from before the original patch landed? I think they should be the same, since idle connections will never be purged (provided their keepalive is sufficiently long, which let's say it is) until we hit the max connections limit and start forcefully purging them. We shouldn't have changed that behavior, so if we have, something's wrong. Right?
Note for triage: this is a regression which I think should block beta7. The answer here might be "back out the original patch", which we can certainly do if we don't get a fix here in the next few days.
Moving to b7, but if this only happens when max.connections is non-standard, I don't think it's super urgent and can wait for b8.
blocking2.0: beta8+ → beta7+
Just tested the 'try' build and, sorry to report, no dice. Page loading stutters and hangs even at the default of 30, and some pages never complete.

Bumping to 60 gives pretty much the same results: hang/lag.

Performance was much better with the earlier try build that had the original patch backed out.
(In reply to comment #26)
> I'm not sure why there would be a magic number of idle connections. What OS are
> you testing on?
Up till now I have tested on Ubuntu Linux and with Fennec, but now I also tested http://stage.mozilla.org/pub/mozilla.org/firefox/tryserver-builds/mcmanus@ducksong.com-dbf9bd371d86/tryserver-win32/firefox-4.0b8pre.en-US.win32.zip, and the issue seems very severe with that one too, even at default settings. Try opening multiple tabs to heavy pages at the same time (cnn.com, msnbc.com, nbc.com, nba.com, espn.com) and you can see how this stalls the browser. The only difference with the pruning patch backed out is that it is harder to reach the point of stalling the browser, but you will still reach it easily. The only clue I have found so far is that it depends on the number of idle connections.

If you don't believe in that magic number of 50, try the next patch that I will attach. It won't allow mNumIdleConns to reach 50, which keeps the browser from stalling. Even this patch would be a better fix than backing out the pruning of idle connections, which shouldn't do anything wrong if the rest were working properly. Of course, we could still ponder whether pruning of idle connections should be done more often than after the shortest lifetime of a connection. This was tested in a Linux environment; hopefully it gives the same results on Windows, so that we can go ahead and try to fix the real issue, which is still unknown.
Almost seems like we're out of file descriptors or some other kind of fixed resource..
When we get into the locked-up condition and I try to open a new connection (i.e., I open a new tab and type in a new hostname), nsHttpConnection::Activate() completes OK, but we never hear back the expected call to OnSocketWritable().
heh. There is a hard limit of 50 open http connections. Not sure how to fix that real quickly.

nsSocketTransportService2.h:
#define NS_SOCKET_MAX_COUNT    50

    //
    // the number of sockets that can be attached at any given time is
    // limited.  this is done because some operating systems (e.g., Win9x)
    // limit the number of sockets that can be created by an application.
    // AttachSocket will fail if the limit is exceeded.  consumers should
    // call CanAttachSocket and check the result before creating a socket.
    //
    PRBool CanAttachSocket() {
        return mActiveCount + mIdleCount < NS_SOCKET_MAX_COUNT;
    }
(In reply to comment #35)
> ... some operating systems (e.g., Win9x)
Do we know exactly which operating systems are affected by this? Windows 2k is the oldest version of Windows still supported by Firefox, so if the problem is just a limitation in Win9x ...
FF has never really worked with max-connections > 50.. it sort of recovered because the idle-reclaim logic worked to get things moving again, but at no point did FF ever actually use more than 50 sockets.

So the expedient thing is simply to clamp max-connections at 50 as part of the fix to this bug..

I'll roll this into the bigger patch. (I've been testing this for a few minutes and it works well.)
#include "nsNetCID.h"
 #include "nsAutoLock.h"
 #include "prprf.h"
 #include "nsReadableUtils.h"
 #include "nsQuickSort.h"
 #include "nsNetUtil.h"
 #include "nsIOService.h"
 #include "nsAsyncRedirectVerifyHelper.h"
+#include "nsSocketTransportService2.h"
 
 #include "nsIXULAppInfo.h"
 
 #ifdef MOZ_IPC
 #include "mozilla/net/NeckoChild.h"
 #endif 
 
 #if defined(XP_UNIX) || defined(XP_BEOS)
@@ -849,17 +850,17 @@ nsHttpHandler::PrefsChanged(nsIPrefBranc
                 mConnMgr->UpdateParam(nsHttpConnectionMgr::MAX_REQUEST_DELAY,
                                       mMaxRequestDelay);
         }
     }
 
     if (PREF_CHANGED(HTTP_PREF("max-connections"))) {
         rv = prefs->GetIntPref(HTTP_PREF("max-connections"), &val);
         if (NS_SUCCEEDED(rv)) {
-            mMaxConnections = (PRUint16) NS_CLAMP(val, 1, 0xffff);
+            mMaxConnections = (PRUint16) NS_CLAMP(val, 1, NS_SOCKET_MAX_COUNT);
             if (mConnMgr)
                 mConnMgr->UpdateParam(nsHttpConnectionMgr::MAX_CONNECTIONS,
                                       mMaxConnections);
         }
     }
 
     if (PREF_CHANGED(HTTP_PREF("max-connections-per-server"))) {
         rv = prefs->GetIntPref(HTTP_PREF("max-connections-per-server"), &val);
Attached patch patch v3 (obsolete) — Splinter Review
Attachment #485747 - Attachment is obsolete: true
Attachment #485931 - Attachment is obsolete: true
Attachment #486410 - Flags: review?(dwitte)
It's only a Win9x limit. The default limit on 9x was 100, and it can be raised to 256 via a registry modification if I remember correctly. There is no such limit on NT or newer; well, it's limited by memory, but then we're talking about 1000+ connections on a 64 MB NT 4 machine.
(In reply to comment #36)
> (In reply to comment #35)
> > ... some operating systems (e.g., Win9x)
> Do we know exactly which operating systems are affected by this? Windows 2k is
> the oldest version of Windows still supported by Firefox, so if the problem is
> just a limitation in Win9x ...

I'm not really sure why we have a limit like that.. just try to create the socket and deal with it when it fails. I assume it is so we can be called back when one is freed up, but that is a false sense of security if you are bumped up against the real FD resource limit, because FDs are fungible between network sockets, file sockets, whatever (so a net socket being freed up doesn't really mean you will be able to open another, because it might be allocated to a file before you get there). Basically, all you can really do if you are truly out of FDs is to try again later. This code seems to avoid that problem by keeping FD use much lower than the actual resource limit on any OS, but that wastes the capabilities of most machines.

followup bug: 607741
OK, great. Let's deal with the limit problem in bug 607741, and take the fixes here. Is all this sufficient to fix the problem? Comment 30 indicates that with the default of 30 we still have issues. Which shouldn't be running into our max limit here -- unless protocols other than HTTP also count toward the socket limit and we're using more than 20.

Assigning to you since you've done most of the work here :)
Assignee: dwitte → mcmanus
Comment on attachment 486410 [details] [diff] [review]
patch v3

>diff --git a/netwerk/protocol/http/nsHttpConnectionMgr.cpp b/netwerk/protocol/http/nsHttpConnectionMgr.cpp
>@@ -619,45 +617,52 @@ nsHttpConnectionMgr::GetConnection(nsCon

>+        if (mNumIdleConns + mNumActiveConns + 1 >= mMaxConns)
>+            mCT.Enumerate(PurgeOneIdleConnectionCB, this);

My omission here (your original patch had this right) -- add an 'mNumIdleConns &&' condition to the front?

r=dwitte.
Attachment #486410 - Flags: review?(dwitte) → review+
Also, this results in a warning:

In file included from /home/dwitte/builds/mozilla/mozilla-central/netwerk/protocol/http/nsHttpHandler.cpp:79:
/home/dwitte/builds/mozilla/mozilla-central/netwerk/protocol/http/../../base/src/nsSocketTransportService2.h:62:1: warning: "LOG" redefined

I'd suggest you just shift the stuff in http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransportService2.h#60 to the cpp.
Nevermind, it's used in nsServerSocket.cpp and nsSocketTransport2.cpp.

We can't ignore the warning, so I'd recommend just renaming all uses of LOG to SOCKET_LOG in the relevant files.
Attached patch patch v3.1Splinter Review
I won't mention how hilarious this tit-for-tat reviewing is, but here we go.

Pushed to try. Will comment when builds are available.
Attachment #486410 - Attachment is obsolete: true
Attachment #486430 - Flags: review?(mcmanus)
Comment on attachment 486430 [details] [diff] [review]
patch v3.1

I think that's fine.

Should probably commit as 2 patches with the socket_log() one first, so as not to expand an already confusing 8KB patch to a 53KB one on the off chance it needs to be revised later... but I don't have checkin privs anyhow :)

I'll go test it
Attachment #486430 - Flags: review?(mcmanus) → review+
I cannot hang my linux-debug build using patch 3.1 no matter what I set max-connections at in about:config.. msnbc is kind of sluggish, but it is always that way on a debug build for me.

hopefully the forthcoming try build will help the windows folks (Jim?).
I was experiencing this issue with the nightlies on Win7. I use doubled network settings (60 max connections, 30 per server, 12 persistent). With the build from comment #48 I've had no problems so far.
Same for me; I haven't found any issues yet. I also tried with "#define NS_SOCKET_MAX_COUNT    100" and had no problems. The mNumIdleConns count always balances just under the maximum connections allowed.
Awesome. Let's roll with it then.

http://hg.mozilla.org/mozilla-central/rev/fa02ac41c650
http://hg.mozilla.org/mozilla-central/rev/16f18e41dcf3
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED