Via this comment on Reddit: http://j.mp/t3PmoT. Worth investigating?
Patrick, this may explain the issues that caused us to reduce that pref...
Ugh, yes, perhaps. What puzzles me is that I haven't seen any similar issue in the Chromium issue tracker, and they've used 256 connections for a while.
The comment is: "Recent versions of Firefox have increased the default value for network.http.max-connections to 256. I found that the virus scanner or firewall at work was detecting this as some sort of bad behaviour, since it could indicate file-sharing or some virus, and it was blocking outbound connections, causing Firefox to hang on occasion. I decreased the value back to 30 and it's been fine since."

I'd be interested in knowing what virus scanner or firewall this was and whether or not it was a default configuration. If it caused us to hang, that means it was just dropping packets and not even resetting streams.

Anyhow, it might be part of the story. Implementation issues with LSPs and select() are definitely part of the story too (we have a patch for that). If this is a prominent part of the story, then we probably need to find a way to heuristically detect such breakage and only back off for those users. Today's web already requires parallelism > 30, and we put ourselves at a disadvantage by limiting that.

Avira WebGuard had some issues with this, according to the Chrome team, but has been updated. FWIW, other than Avira I'm told Chrome has had very few problems (basically a few old broken home routers, and those reports are low in number).
Patrick, Do you think it's maybe worth backing off max-cxns for now from 256 to some lower number (128? 64?) until we have some sort of detection logic in place to deal with this kind of thing? It's conservative, but the perf we gain from say >128 sockets may not be worth the pain when it causes trouble. The SUMO summary this week reports that "Since Firefox 4, questions with the words "slow", "long time" and "forever" account for 26% of all forum traffic on SUMO... Pages loading slowly or timing out represents 40-55% of threads related to slowness." Obviously max-cxns is not the only possible cause of this--there's plugins, add-ons, and who knows what else. And we've only had max-cxns > 30 since FF 7.
By the way, have you guys considered reaching out to those users to see if they can help us reproduce the problem?
(In reply to Jason Duell (:jduell) from comment #4)
> Patrick,
>
> Do you think it's maybe worth backing off max-cxns for now from 256 to some
> lower number (128? 64?) until we have some sort of detection logic in place
> to deal with this kind of thing? It's conservative, but the perf we gain
> from say >128 sockets may not be worth the pain when it causes trouble.

I don't want FF 3.6 back. I'd rather run into interop challenges pushing the web forward than calcify on the safe path. Due to a high RTT-to-BW ratio, generally low IWs, and lots of small objects, the HTTP web needs lots of parallelism to scale, and content providers are working around the 6-per-server connection limit with all these hostnames for exactly this reason.

I'm not making it up. Loading www.nyt.com uses 83 connections, even maxing at 6 per host. https://plus.google.com/111091089527727420853/posts uses 64. http://www.facebook.com/media/set/?set=a.445727730658.236308.30530605658&type=3 can use over 75. Not allowing that kind of parallelism is a competitive disadvantage. Chrome doesn't have that problem, and I have read (but not personally confirmed) that IE doesn't have a limit at all. How much this matters is largely up to your RTT; let's say that mobile takes it on the chin the hardest by having a limit too low.

A limit of, say, 90 may seem sufficient given this data, but that is going to mean that at the slightest click valid idle persistent connections will need to be closed in order to serve the new click. It's not obvious what pattern we should use to do that, and I've watched us close connections to things like doubleclick or analytics just to reopen them again a moment later. We really want a reasonable working set size that accommodates everything still on a timer. Really busy multi-tab scenarios are a different kettle of fish (also an interesting one), but we push a need for triple digits using just 1 tab.
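To make the arithmetic in the comment above concrete, here is a rough sketch of how a per-host limit of 6 and a global cap interact. The page layout (hostnames and resource counts) is invented for illustration and is not a measurement of any real site; the helper names are hypothetical too.

```python
# Rough sketch: sockets a page load wants versus a global connection cap.
# The hostnames and resource counts are made up for illustration only.

PER_HOST_LIMIT = 6  # per-server connection limit mentioned in the thread


def connections_wanted(resources_per_host, per_host_limit=PER_HOST_LIMIT):
    """Connections the page would open if only the per-host limit applied."""
    return sum(min(count, per_host_limit) for count in resources_per_host.values())


def connections_deferred(resources_per_host, global_cap, per_host_limit=PER_HOST_LIMIT):
    """Connections that must wait, or evict an idle one, under a global cap."""
    wanted = connections_wanted(resources_per_host, per_host_limit)
    return max(0, wanted - global_cap)


# Hypothetical page sharded across 14 hostnames.
page = {f"static{i}.example.com": 10 for i in range(12)}
page["www.example.com"] = 6
page["ads.example.net"] = 5

wanted = connections_wanted(page)
for cap in (30, 90, 256):
    print(f"cap={cap} wanted={wanted} deferred={connections_deferred(page, cap)}")
```

With this made-up page the demand comes to 83 connections, so a cap of 30 defers 53 of them while a cap of 90 defers none. A cap of 90 leaves only 7 spare slots, though, so the next click would have to start closing idle persistent connections.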
I'd love to have an algorithm for backing off in case of silent NAT/firewall failure. But it's hard to see how, given that 1] its symptom can be shared with a real host failure, 2] the level at which it might happen is non-discrete because it is shared with other software on this host (or in this NAT key), and 3] due to TCP states like TIME_WAIT and various race conditions, it isn't clear that the connection count in a NAT would match any connection count we would implement, other than as an approximation. Suggestions?
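One possible shape for such a heuristic, purely as a sketch and not anything Firefox implements: halve the cap when connect timeouts hit several different hosts while many connections are open, and grow it back slowly on successes at the cap. The class name, thresholds, and step sizes are all assumptions. Counting distinct hosts only partly addresses caveat 1, and the sketch does nothing about caveats 2 and 3, so the resulting cap is at best an approximation of what a NAT or firewall actually sees.

```python
# Hypothetical backoff heuristic for a global connection cap (not Firefox code).
# All thresholds below are illustrative assumptions.


class AdaptiveConnectionCap:
    """Shrinks the cap after suspicious timeouts and regrows it slowly."""

    def __init__(self, ceiling=256, floor=30, hosts_to_trip=3, recovery_step=8):
        self.ceiling = ceiling              # never allow more than this
        self.floor = floor                  # never back off below this
        self.hosts_to_trip = hosts_to_trip  # distinct failing hosts needed
        self.recovery_step = recovery_step  # additive regrowth per success
        self.cap = ceiling
        self._timed_out_hosts = set()

    def on_connect_result(self, host, timed_out, open_connections):
        """Record one connect outcome and return the current cap."""
        if timed_out:
            # A timeout with few sockets open looks like an ordinary host
            # failure, so only count timeouts seen above the floor.
            if open_connections > self.floor:
                self._timed_out_hosts.add(host)
            if len(self._timed_out_hosts) >= self.hosts_to_trip:
                # Multiplicative decrease from the level where trouble began.
                self.cap = max(self.floor, min(self.cap, open_connections) // 2)
                self._timed_out_hosts.clear()
        else:
            self._timed_out_hosts.clear()
            # Only a success while running at the cap says more may be safe.
            if open_connections >= self.cap:
                self.cap = min(self.ceiling, self.cap + self.recovery_step)
        return self.cap


cap = AdaptiveConnectionCap()
for host in ("a.example", "b.example", "c.example"):
    cap.on_connect_result(host, timed_out=True, open_connections=200)
print(cap.cap)
```

Requiring several distinct hosts means a single dead server never shrinks the cap. The slow additive regrowth is a design choice that favors avoiding repeated hangs over recovering parallelism quickly.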