Open Bug 421128 Opened 15 years ago Updated 8 months ago

Independent windows/tabs should not starve each other for network connections

Categories: Core :: Networking, defect, P3
Platform: x86 / Linux
Tracking: Target Milestone Future
People: Reporter: BenB, Unassigned
References: Blocks 1 open bug
Whiteboard: [Snappy:p1][lame-network] repro see comment 13 [necko-backlog]
Attachments: 2 files

Reproduction:
Have a server or server group that accepts connections but responds very slowly or not at all. Open a page on these servers with many (a few dozen) images (e.g. the small icons of a typical heavy webpage).
Observe that the page doesn't load at all; get bored.
Open a new browser window and go to your favourite news site and/or Google.

Actual result:
The Google page does not load. It looks like your network connection (or browser) is entirely broken.

Once the connections to the slow/unresponsive server time out (which is by default after 60-120s, i.e. longer than the most patient user will wait), you can go to Google again.

Expected result:
A request to domain B in window B is entirely independent of a request to domain A in window A.

Implementation:
There is a limit on the number of connections per server and overall.

Group network connections by page and by domain, and restrict connections based only on those limits (e.g. 10 per page, 50 per domain). Remove the absolute overall connection limit for the whole browser, or seriously increase it (to e.g. 200).
See also bug 421125.
When I ran into this, I assumed that my Internet connection was broken or very flaky (coming and going), and didn't realize that only one server was at fault, which made diagnosing and solving the problem a lot more time-consuming. I can also see ISP customers calling their ISP hotline over this.
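The proposed partitioning can be sketched as follows. This is a minimal model of the idea, not Necko code; the class name, method names, and the per-group limit are all hypothetical illustrations:

```python
# Hypothetical sketch of a partitioned connection pool: a stalled site
# can exhaust its own per-group budget, but never the whole browser's.
from collections import defaultdict

GLOBAL_LIMIT = 256   # assumed overall cap
GROUP_LIMIT = 10     # assumed per-page (load group) cap
HOST_LIMIT = 6       # the usual per-host cap

class PartitionedPool:
    def __init__(self):
        self.per_group = defaultdict(int)
        self.per_host = defaultdict(int)
        self.total = 0

    def try_open(self, group, host):
        """Return True iff a new connection may be opened now."""
        if (self.total >= GLOBAL_LIMIT
                or self.per_group[group] >= GROUP_LIMIT
                or self.per_host[host] >= HOST_LIMIT):
            return False
        self.per_group[group] += 1
        self.per_host[host] += 1
        self.total += 1
        return True

    def close(self, group, host):
        self.per_group[group] -= 1
        self.per_host[host] -= 1
        self.total -= 1
```

With limits like these, a page holding dozens of stalled connections to a heavily sharded site stops at its group cap, and a new tab's requests to another domain can still get connections.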
Blocks: 384323
Patrick, is this still an issue, or can this bug be resolved?
Assignee: nobody → mcmanus
Whiteboard: [Snappy]
Please don't assign bugs to me that I'm not actively working on; as I understand it, that is what the assignment field means.

I suspect the original 'slow' site in this case was significantly sharded and consumed the entire 30-connection pool, leaving nothing for opening a new tab.

It's also possible that it simply exhausted HTTP's 6-per-host limit and the new tab used the same site, though that seems less likely from the description.

We have landed code that increased the total connection limit to 256 for just this reason, though it has been backed off to something smaller (64? I forget) on Windows due to some problems with LSPs.

If the higher overall limit was not sufficient we could build in a limit per load group.
Assignee: mcmanus → nobody
> site ... was significantly sharded and consumed the entire 30 connection pool,
> leaving nothing for opening a new tab.

That's what this bug is about: that should never happen. One site must not interfere with other sites.
> we could build in a limit per load group.

That sounds like the solution
(In reply to Ben Bucksch (:BenB) from comment #5)
> > we could build in a limit per load group.
> 
> That sounds like the solution

It would be a net positive. It still leaves in place the global 6-per-hostname limitation; changing that to 6 per tab would drift farther from the standard and probably isn't relevant to the observed behavior in most cases.

Recent changes have increased the size of the global pool, so, triaging this, it might not be a big problem currently.
Patrick, sorry about ruining your assignment flow. I'll avoid that in future.
Would it be hard to add telemetry so we know whether this is a problem that needs addressing?
I don't think any site, under any circumstances, should be allowed to exhaust all resources of a certain kind and thus block my browser. And I didn't even run into this with a deliberate attack scenario, but just with normal sites, when my internet connection broke.

When I ran into this, I thought my Internet connection was broken. I tried to fix the connection, but I still couldn't load any website at all, so I assumed the connection was still broken. I would never have thought that, when all sites in my browser are stalled, it could be caused by one single site. It took me considerable time to figure out what was going on.
Hey Ben, did you run into this in the field recently (the last release, maybe two), or is the anecdote older than that?

I ask because the overall pool size was recently increased significantly, but the algorithm hasn't changed. It's not clear how much of a problem it is in practice now, as it's harder for one page to saturate the total pool.

I agree a partition makes sense, at least for the global limit. But I ask for reasons of project prioritization.
> Hey Ben, did you run into this in the field recently (the last release,
> maybe two) or is the anecdote older that that?

No, because I never wanted to run into this again, so I set the limits in my profile very high:
network.http.max-connections = 500
network.http.max-connections-per-server = 50
network.http.max-persistent-connections-per-proxy = 200
network.http.max-persistent-connections-per-server = 50
When I had the values a little lower (but still higher than default), I was still running into it, IIRC.

But I considered that only a band-aid (not guaranteed to fix it), not a real fix; that's why I filed this bug.
I'm using a proxy server.

For the record, the defaults currently seem to be:
network.http.max-connections = 256
network.http.max-connections-per-server = 15
network.http.max-persistent-connections-per-proxy = 8
network.http.max-persistent-connections-per-server = 6
network.http.pipelining.maxrequests = 4
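For reference, overrides like Ben's can be persisted in a profile's user.js file using standard prefs syntax; the values below are just his example numbers, not a recommendation:

```javascript
// user.js -- applied to about:config at profile startup (example values only)
user_pref("network.http.max-connections", 500);
user_pref("network.http.max-connections-per-server", 50);
user_pref("network.http.max-persistent-connections-per-proxy", 200);
user_pref("network.http.max-persistent-connections-per-server", 50);
```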
Will mark this snappy:p3 until I see evidence that this is reproducible.
Whiteboard: [Snappy] → [Snappy:p3]
You can probably easily reproduce it like this:

1. Make a webpage with 1000 pictures, all different URLs, and from 10 different servers.
2.a. Make these webservers act like a black hole: they don't refuse the connection (!), they simply don't respond at all.
2.b. ALTERNATIVELY, break your network connection so that DNS works (cached) but all requests go into a black hole: they are not rejected at the IP or TCP level, they just go nowhere. This is a realistic scenario; it's how I ran into the bug.
3. Go to Google
4. Fix the broken servers or your broken network connection.
5. Go to Google

Actual result:
Steps 3 and 5: Google doesn't work.

Expected result:
Step 5: Google works.
Step 3 with case 2.a. (broken server): Google works.
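Case 2.a can be approximated with a tiny "black hole" server. This is a sketch for a local test setup (the function name and port choice are illustrative): it accepts TCP connections but never reads or writes, so every request to it hangs until the client's own timeout fires.

```python
# A minimal black-hole server: accepts connections, then never responds,
# so HTTP requests to it stall indefinitely.
import socket
import threading

def run_black_hole(port=0):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))   # port 0 = pick any free port
    srv.listen(128)
    held = []                        # keep accepted sockets open forever

    def accept_loop():
        while True:
            conn, _ = srv.accept()
            held.append(conn)        # accept, then simply never answer

    threading.Thread(target=accept_loop, daemon=True).start()
    return srv.getsockname()[1]      # the port actually bound
```

Pointing the test page's image URLs at 127.0.0.1:&lt;port&gt; reproduces step 2.a without touching real servers.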

Importance:
The real problem with this bug is:
- Even if you have case 2.a. (broken server), the user will assume case 2.b. (his network connection is broken), because Google doesn't work.
- Even if you really had case 2.b. (a broken network connection acting as a black hole) and have successfully fixed it, it looks as if it's still broken, because Google doesn't work. So you tear it down again, or try to change something else, and in the process of trying and trying you make a mistake and actually *really* break your network connection permanently.
True story! ;-)

Taras wrote:
> Will mark this snappy:p3 until I see evidence that this is reproducible.

Removing P3 marker for re-triage.
Whiteboard: [Snappy:p3] → [Snappy]
More of a [lame-network] bug (i.e. networking in very lame conditions) than a [snappy] one, imo. lame-network is a down-the-road project for me to even start (6 or 8 weeks, maybe); feel free to add the whiteboard tag to anything that might apply.
Whiteboard: [Snappy] → [Snappy][lame-network]
Perfect. Restoring [Snappy:p3].
Whiteboard: [Snappy][lame-network] → [Snappy:p3][lame-network]
Just to be clear: weirdly slow pageloading is on the snappy radar. Patrick, thanks for taking this on.
I think I'm seeing this right now as a side effect of bug 731130. It again seems associated with Google sites. Specifically, I am getting a busy-wait on an SSL connection to pz-in-f84.1e100.net, which appears to be part of google.com's CDN.

This did not happen yesterday when I was on a fast network. Today I am on a home DSL line.
This is the output of

perl -lne 'print $1 if /TCP \S*->([^:]*)/' /tmp/lsof.txt | sort | uniq -c | sort -n

on the previous attachment. It actually doesn't show a high connection count for any one server, which is now making me doubt this is the same thing. The total number of connections is 143, though.
Mine turned out to be bug 710176, which has nothing to do with connection exhaustion. Sorry for the noise.
This made firefox nearly useless on public wifi during our FOSDEM workweek.
Whiteboard: [Snappy:p3][lame-network] → [Snappy:p1][lame-network]
Blocks: 725023
Whiteboard: [Snappy:p1][lame-network] → [Snappy:p1][lame-network] repro see comment 13
Assignee: nobody → mcmanus
Target Milestone: --- → Future
Worth noting that the per-proxy limit is also a likely bottleneck.
Blocks: 778884
Duplicate of this bug: 953119
Whiteboard: [Snappy:p1][lame-network] repro see comment 13 → [Snappy:p1][lame-network] repro see comment 13[necko-backlog]
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P1 → P3

Unassigning bugs owned by Patrick.

Assignee: mcmanus → nobody

Hi Ben,
This is an old issue, but do you know whether it is still reproducible or relevant on the latest Firefox version?
If it is not, it should be closed. Please take a look when you have the time.

Flags: needinfo?(ben.bucksch)

Hi Timea, this happens in a specific situation with an overloaded web server, which in turn makes Firefox block all web requests, even to other servers. Testing it requires a complex setup to imitate an overloaded web server. I have a company to run and don't have time to re-test this right now. Can you please assign someone from your QA team to try to follow the reproduction steps?

Flags: needinfo?(ben.bucksch)