All details: http://blog.dennis-boldt.de/?p=177
I verified this issue with the Linux version under Ubuntu 10.04 and the Windows version under Windows XP.
Bug still exists in Firefox 13.0b4
It's not a bug, it's a feature... or at least a practical limit at some point. We can't support an infinite number of file descriptors/sockets.
I'm trying to get some consensus happening on the hybi list about what to do when a browser hits its WS count limits.
Of course I cannot open (and never close) an unlimited number of WebSocket connections. Have you read my blog article? The problem goes deeper than that.
In my experiment, I wrote a simple test application. It opens a WebSocket connection and closes it immediately, as soon as it is opened, so that there is never more than one WebSocket connection open at a time.
- Chrome 20 opens and closes more than 2000 without any problem
- Chromium 18 opens and closes more than 2800 connections
- Firefox 11 opens and closes more than 1300 WebSocket connections
- Firefox 12 opens and closes 196 WebSocket connections; the 197th WebSocket connection closed with error 1006.
- Firefox 13.0b4/nightly has the same problem, just after 198 connections
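For reference, the loop behind the numbers above can be sketched like this (a minimal sketch, not the original test page; `runCycle`, `url`, and `limit` are names I'm inventing here, and the WebSocket constructor is passed in so the loop can also be driven by a stub):

```javascript
// Open a WebSocket, close it as soon as it opens, and start the next
// connection from onclose -- so at most one connection is alive at a time.
// WS is the WebSocket constructor (window.WebSocket in a real page).
function runCycle(WS, url, limit, done) {
  let opened = 0;
  function next() {
    const ws = new WS(url);
    ws.onopen = () => ws.close();        // close immediately once open
    ws.onclose = (ev) => {
      if (ev.code === 1006) {            // abnormal closure: limit hit
        done({ opened: opened, failedAt: opened + 1 });
      } else if (++opened >= limit) {
        done({ opened: opened, failedAt: null });
      } else {
        next();                          // start the next connection
      }
    };
  }
  next();
}
```

In a real page this would be called as e.g. `runCycle(WebSocket, 'ws://localhost:8080/', 2000, console.log)` (hypothetical URL); in Firefox 12 with the default limit, `failedAt` would come back as 197.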
My idea: since version 7, Firefox has had the preference network.websocket.max-connections, which limits the number of connections that can be open at a time; its default value is 200 (see MDN: https://developer.mozilla.org/en/WebSockets#Gecko_7.0 ). I changed that value to 400. Afterwards, the test application opened and closed 396 (400 - 4) connections in Firefox 12 before error 1006 appeared.
Conclusion: Firefox 12 can open and close network.websocket.max-connections - 4 connections. The count of open WebSocket connections seems _not_ to be decremented in Firefox 12 when a WebSocket connection is closed.
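The arithmetic behind this conclusion can be modeled in a few lines (illustrative only; `cyclesBeforeFailure` and its parameters are made-up names, not Gecko identifiers, and `reserved` stands for the ~4 internal channels that apparently also count against the limit):

```javascript
// Model of the suspected accounting bug: each open increments a session
// counter checked against max-connections, but (in the buggy case) close
// never decrements it. Returns how many open/close cycles succeed before
// the next open would fail with error 1006, capped at `cap` iterations.
function cyclesBeforeFailure(maxConnections, reserved, decrementOnClose, cap) {
  let sessionCount = reserved;
  for (let i = 0; i < cap; i++) {
    if (sessionCount >= maxConnections) return i;  // next open fails
    sessionCount++;                                // connection opened
    if (decrementOnClose) sessionCount--;          // connection closed
  }
  return cap;  // never hit the limit within cap cycles
}
```

With no decrement this predicts exactly the observed 196 cycles at the default limit of 200, and 396 at a limit of 400; with the decrement, the limit is never hit.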
It's not a feature, it's a bug ;)
> It's not a feature, it's a bug ;)
Agreed! Thanks for the followup clarification. We definitely need to fix this.
Created attachment 628154 [details] [diff] [review]
slightly modified version of HTML test page
This is a slightly modified version of your test, which does not require the web console to see how many websockets have been created. Could you try running it *without* enabling the web console, and tell me if you still see errors?
I don't see any errors if I run this test with no web console. If I try it again, I hit the limit as you've described. So at least part of the problem here is that the web console is keeping WebSocketChannels alive for a very long time.
You mentioned that your MS thesis application was running into the WS connection limit, too. Do you see this only when the web console is on?
Thanks a lot for the reduced test case, by the way.
Created attachment 628218 [details] [diff] [review]
So I suspect the web console is the major culprit here, but this is both a quick workaround, and a code change that we probably ought to make anyway.
So we're not failing after a total of mMaxConcurrentConnections websockets have been opened during a browser session (good! That means this bug is actually much less serious than I thought). Rather, we are failing if mMaxConcurrentConnections WebSocketChannels are still alive at any one point. That really shouldn't be happening in this simple app, which opens a single websocket and then opens another during onclose, so logically only 1 or 2 websockets should ever be open at once. But the web console or something else is keeping them alive.
We make the problem worse, however, by counting every existing WebSocketChannel as a "session", even if it failed to open, or has closed its TCP connection. That seems to cause the test to go into a spin where, even as actually-connected WebSocketChannels are getting garbage collected, the failed, never-opened ones keep the session count too high for new ones to connect. This patch fixes that by incrementing the session count only during the actual TCP session (roughly speaking). With it, we no longer hit the connection limit, with or without the web console on.
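The shape of that change can be sketched as follows (a model only, not the actual patch; the class and method names are invented):

```javascript
// Tie the session count to the TCP connection's lifetime, not to the
// channel object's lifetime: a channel that never connected, or whose
// TCP connection already closed, contributes nothing to the count,
// even if GC keeps the object alive for a while.
class ChannelModel {
  constructor(counter) {
    this.counter = counter;  // shared { value: N } session counter
    this.counted = false;
  }
  onTcpConnected() {         // increment only once the TCP session starts
    this.counter.value++;
    this.counted = true;
  }
  onTcpClosed() {            // decrement as soon as the TCP session ends
    if (this.counted) {
      this.counter.value--;
      this.counted = false;
    }
  }
}
```

Under this scheme, failed or long-lived-but-closed channels no longer pin the session count at the limit while they wait to be garbage collected.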
Created attachment 628221 [details] [diff] [review]
diagnostic patch to track session count and # of websocket channels
Here's the patch I'm using to generate some stats on how many websockets we've got in existence at once, and how high the session count gets.
Created attachment 628222 [details]
Log with fix and no web console
With the patch and no web console we do great: session count is always 1 or 0, and we never have more than 3 websockets in existence (including the protocol handler).
Created attachment 628251 [details]
Log with fix and web console on
With the web console on we still do well with session count (0 or 1), but the number of live WebSocketChannels gets high--we don't delete the first one until we get to 452 live WebSocketChannels! (after that we keep hovering at about 220-250 WebSocketChannels: GC seems more frequent after the 1st collection).
The stack trace when we finally delete looks like this:
#1 0x00007f478187de0c in mozilla::net::WebSocketChannel::~WebSocketChannel
#2 0x00007f478187ad0e in mozilla::net::WebSocketChannel::Release
#3 0x00007f47816c1251 in nsCOMPtr<nsIRDFDelegateFactory>::~nsCOMPtr
#4 0x00007f4781828431 in mozilla::net::HttpBaseChannel::~HttpBaseChannel
#5 0x00007f478183085e in nsHttpChannel::~nsHttpChannel
#6 0x00007f4781830898 in nsHttpChannel::~nsHttpChannel
#7 0x00007f4782e4c87e in nsHashPropertyBag::Release
#8 0x00007f4781828d42 in mozilla::net::HttpBaseChannel::Release
#9 0x00007f478183f362 in nsHttpChannel::Release
#10 0x00007f47826a46cf in DoDeferredRelease<nsISupports*>
#11 0x00007f478269a828 in XPCJSRuntime::GCCallback
#12 0x00007f4783663f23 in Collect
#13 0x00007f47836640b4 in js::GCSlice
#14 0x00007f478364c9d0 in js::IncrementalGC
#15 0x00007f4782145f31 in nsJSContext::GarbageCollectNow
#16 0x00007f478214684b in GCTimerFired
So we're somehow being kept alive by the HTTP channel that we upgraded from (presumably), which is being kept alive by... something to do with the web console. I've checked and mListener in nsHttpChannel is set to null, so it's not that. Will investigate some more.
dcamp pointed me at you. Do you think we've got work to do to make the webconsole not hang onto nsHttpChannels? See the logs and comments in comment 9 vs comment 10. Or is this fine? (It's all getting garbage collected eventually).
I figured out the first issue was nsHttpChannel not releasing its Upgrade callback (a WebsocketChannel). Once that was fixed, I still got the same very late deletion, this time by nsStreamListenerTee, then nsStreamListenerWrapper. After these are fixed (patch is in bug 759819), I now get earlier collection: 121 websockets are created before the 1st one gets destroyed, with this stack trace:
#1 in mozilla::net::WebSocketChannel::Release (this=0x7f7df2db3000)
#2 in DoDeferredRelease<nsISupports*> (array=...)
#3 in XPCJSRuntime::GCCallback (rt=0x7f7dfd623000, status=JSGC_END)
#4 Collect (rt=0x7f7dfd623000, incremental=true, budget=10000, gckind=js::GC_SHRINK,
reason=js::gcreason::MAYBEGC) at /home/jj/hg/in/t/js/src/jsgc.cpp:3807
So whatever causes an object to get slurped into DoDeferredRelease is happening to the WebSocketChannel. Perhaps that's normal?
Comment on attachment 628218 [details] [diff] [review]
OK, the deferred increment probably doesn't matter much, but I can see how the earlier decrement is important.
> the deferred increment probably doesn't matter much but I can see how the
> earlier decrement is important.
Agree that the earlier decrement is the key, but I did seem to see an issue where failed websockets that hadn't been GC'd yet were also clogging the connection limit, so I do think both are useful.
OK, when this lands I'll consider this bug closed. There have been some improvements in the m-c web console and/or GC in the past few days: I'm now only seeing 60 websockets created before the first GC cleanup, and the average number in existence is much lower (20 or fewer). I'm still not clear how the web console code gets its hands on WebSocketChannels, which it has no direct access to, and/or otherwise somehow causes them to be cleaned up via DoDeferredRelease() instead of the usual immediate cleanup, but it doesn't seem to be a problem (http channels wind up getting deferred cleanup the same way, which makes more sense, since the web console JS directly refers to them). Not something I'm going to worry about now that we've fixed the session count issue, unless someone tells me otherwise.
Comment on attachment 628218 [details] [diff] [review]
Dennis: please let me know if you've seen this issue at any point when you *haven't* been using the web console--that would change our risk/benefit analysis here a lot.
[Approval Request Comment]
Bug caused by (feature/regressing bug #): long-standing
User impact if declined: users using the web console may run into an artificially low websocket count (and have their connections fail) if they create and close websockets rapidly. Not a common scenario, but for developers especially, not wildly unlikely.
Testing completed (on m-c, etc.): Tested by hand.
Risk to taking this patch (and alternatives if risky): very low risk. Just moves some counter increment/decrement to different code location.
String or UUID changes made by this patch: none.
First of all: nice work.
Sorry for not answering yet; I am delivering my thesis in a few hours, so I will look into this issue in detail over the weekend. Is your bugfix already in the nightly build, so that I can compare?
Btw: why didn't the web console have this effect in earlier versions, e.g. FF11?
Good luck with your thesis delivery :) And thanks for taking time to report this while you were busy with that.
The fix is in the tree, so it should be in the nightly build now.
> Btw: why didn't the web console have this effect in earlier versions, e.g. FF11?
Good question for which I don't have a clear answer. It could have been something the web console did, or perhaps a change in garbage collection completely unrelated to the web console.
Comment on attachment 628154 [details] [diff] [review]
slightly modified version of HTML test page
Only affects logged count, no chance for major user facing regressions. Approved for Aurora 14.
Just noticed after I checked in that you +a'd patch 1 instead of 2. Patch 1 isn't something that's meant to be in the tree, so I assume you meant patch 2, which is the bug fix we landed on m-c.
Fixed up the flags, thanks.