Closed Bug 220941 Opened 17 years ago Closed 5 years ago
Consider matching HTTP connections by resolved IP address instead of by host name
Consider matching HTTP connections by resolved IP address instead of by host name.

This change is motivated by the fact that it is very common for transparent proxies to be implemented by taking over DNS to map all hostnames to a single IP address. In such cases the browser will continue to open 2 persistent connections per host name, which can quickly add up to a large number of proxy server connections. It would be much better if the browser simply reused an existing socket to the same IP node.

I could not find anything in RFC 2616 that suggests this is not a valid thing to do. Section 8.1.4 talks about restricting the number of persistent connections per server or proxy, but it is not explicit about whether "server" may be interpreted as an IP node. I think it can be.

With a decent DNS cache (which we have), I think this should be trivial to implement. It would also benefit Mozilla on sites which do a lot of virtual hosting.

BTW: I learned about this problem from a research paper presented at the 2003 IWCW. Wireless networks (the non-802.11b flavor) like to use transparent proxies to enhance performance. Configuring clients to use an explicit proxy is often not easy or even possible.
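The proposed behaviour can be sketched as a persistent-connection pool keyed by resolved IP address rather than by host name, so that hosts which a transparent proxy maps to one address end up sharing sockets. This is a minimal illustrative sketch, not Mozilla code; the resolver table, `ConnectionPool`, and all names in it are invented for the example.

```python
# Hypothetical sketch: key the idle-connection pool by resolved IP,
# so two host names that resolve to the same address share sockets.

DNS_CACHE = {
    "news.example.com":  "10.0.0.1",   # a transparent proxy answers
    "shop.example.com":  "10.0.0.1",   # every lookup with its own IP
    "other.example.net": "192.0.2.7",
}

def resolve(host):
    return DNS_CACHE[host]

class ConnectionPool:
    def __init__(self):
        self._by_ip = {}              # resolved IP -> idle connections

    def get(self, host):
        ip = resolve(host)
        idle = self._by_ip.setdefault(ip, [])
        if idle:
            return idle.pop()         # reuse a socket to the same IP node
        return f"conn-to-{ip}"        # stand-in for opening a new socket

    def release(self, host, conn):
        self._by_ip.setdefault(resolve(host), []).append(conn)

pool = ConnectionPool()
c1 = pool.get("news.example.com")
pool.release("news.example.com", c1)
c2 = pool.get("shop.example.com")     # different Host, same IP: reused
assert c1 is c2
```

With host-name keying, the second `get` would have opened a second socket; keying by IP is what collapses the per-hostname connections into a shared set per proxy.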
Severity: normal → enhancement
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.6beta
I don't think a possible transparent proxy is a good reason to do this. You have to assume that the server admins know they'll get a large number of connections given what they're doing. If they have performance problems it's their problem, not mozilla's.

As to virtual hosting sites, some of them are just reverse proxies. What you're actually connected to may depend on the virtual host. Are you sure it's always safe to change a Host header in an established connection?

Wouldn't this idea adversely affect sites which have multiple IP addresses for load balancing?

I guess I don't like it. :-)
pablo (who originally mentioned this issue to me and wanted me to post his response to tenthumbs) had this to say:

>I don't think a possible transparent proxy is a good reason to do this.
>You have to assume that the server admins know they'll get a large
>number of connections given what they're doing. If they have performance
>problems it's their problem not mozilla's.

The problem is not the extra load generated on the server. The problem is that opening many connections causes extra delays, which are especially problematic over wireless links. The goal is to make sure that all HTTP requests are sent to the transparent proxy and that TCP connections to this proxy are reused, even for multiple domains. If that is not the case and the browser opens new connections for each domain name, the end-user experience is severely degraded.

>As to virtual hosting sites, some of them are just reverse proxies. What
>you're actually connected to may depend on the virtual host. Are you
>sure it's always safe to change a Host header in an established
>connection?

I cannot think of any problem with having a single TCP connection and different Host headers. Do you have any particular scenario in mind?

>Wouldn't this idea adversely affect sites which have multiple IP
>addresses for load balancing?

Load balancing is not a problem since different users can still be redirected to different IP addresses.
i agree with pablo here. i think it is unlikely that servers will have a problem receiving requests for different Host headers over the same socket connection. i think this bug is worth fixing.
I'm also not convinced this is a good idea, but if Darin thinks this is worth coding, we can find out the old-fashioned way.
> The problem is not the extra load generated in the server. The problem
> is that opening many connections causes extra delays, which are
> especially problematic over Wireless links. The goal is to make sure
> that all TCP requests are sent to the transparent proxy and that TCP
> connections to this proxy are re-used, even for multiple domains. If
> this is not the case and the browser opens new connections for each
> domain name, then the end-user experience gets severely decreased.

This sounds like you're trying to mask a network problem by fiddling with the clients. It would be one thing if mozilla had some simple tuning prefs so the user could say "I'm on a T1", or "I'm on a modem", or whatever, but mozilla doesn't like treating users as intelligent. You may see a performance improvement because of the reduced traffic, but I strongly suspect wired users will see a performance decrease. I can't quantify it (yet) but I see a slowdown when using a proxy server because of the relatively small number of connections. I don't see why everyone should be penalized because of an obscure issue.

> I cannot think of any problem with having a single TCP connection and
> different Host headers. Do you have any particular scenario in mind?

In the ideal world, a reverse proxy will examine every packet coming in from the outside and route it appropriately. The real world's another thing. I have no doubt that there are lots of reverse proxies that sometimes forget to do this, so packets will be mis-sent. This should be very carefully tested. Start visiting porn sites because they tend to do this sort of thing a lot. :-)

> Load balancing is not a problem since different users can still be
> redirected to different IP addresses.

How do you know a site isn't configured based on the assumption that each "transaction" (e.g. a web page and its goodies) will be a separate connection? You're assuming that you can send arbitrary amounts of data over one connection.
I have worked with Pablo on the design of a performance-enhancing proxy to improve performance over wireless links, which is when we noticed this problem with the Mozilla browser. As Pablo pointed out, multiple TCP connections and DNS requests affect performance across wireless links considerably. To eliminate this problem, we decided to rewrite URLs (similar to CDNs) to transparently redirect requests to a proxy. Using this technique, requests to multiple domain names will be resolved to a single IP address (that of the proxy) and fewer TCP connections will be used to download multiple web pages.

The problem that we saw with the Mozilla browser is that even if DNS resolves different domain names to a single IP address, multiple TCP connections will be opened, one for each domain name, which severely affected the wireless browsing experience. Modifying this behaviour to reuse connections per IP address boosts performance across wireless links. The performance improvement is not because of reduced traffic but because of the elimination of the multiple RTTs across the wireless link required to open multiple TCP connections.

For wireline clients I do not believe there is any adverse impact. The modification to the browser does not imply that all requests are somehow sent through a forward or reverse proxy. Only requests whose URLs have been rewritten will be sent to the proxy, and this will only occur when a performance-enhancing proxy is installed across a wireless link. Outside of that situation, this modification will only influence servers which host multiple domains under the same IP address. For these servers, multiple requests to different domains can be sent over the same TCP connection. I do not see how this could affect performance in general.
If the server wants to keep a transaction model where it accepts one GET per connection, it can close the connection after serving a GET. The modification at the browser does not prevent this from happening.

I do not see GETs being mis-sent and served by the incorrect "virtual host" either. The Host header is there precisely for this reason: to enable virtual hosts under one IP address and de-multiplex requests to the appropriate host based on the Host header. It does not matter whether the requests are sent over one TCP connection or multiple TCP connections.

The suggested modification helps wireless performance, and I do not believe it has any negative consequence for wireline users. It should definitely be incorporated.
sampath: thanks for the additional details. i agree with you that this is probably the right thing to do. my only concern is virtual hosting systems that assume a connection (and associated state stored on the virtual host) will be for one domain only. this could happen if the virtual host assumes the browser will open a new connection per domain (as mozilla and ie currently do). that said, i think it is also reasonable to say that such a virtual host is "looking for trouble" since RFC 2616 is vague about how connection matching should be done.
Darin, Section 19.6.1.1 of RFC 2616 states the following:

----------------------------------------------------------------------
.................................. Given the rate of growth of the Web, and
the number of servers already deployed, it is extremely important that
all implementations of HTTP (including updates to existing HTTP/1.0
applications) correctly implement these requirements:

  - Both clients and servers MUST support the Host request-header.

  - A client that sends an HTTP/1.1 request MUST send a Host header.

  - Servers MUST report a 400 (Bad Request) error if an HTTP/1.1
    request does not include a Host request-header.

  - Servers MUST accept absolute URIs.
----------------------------------------------------------------------

The third item states that servers MUST report a 400 error if the Host header does not exist. This means the Host header needs to be looked at by the server on every GET request, even if the requests are being received on a single persistent connection. The only reason a server might be tempted to look at the Host header only on the first GET, and tie that TCP connection to a domain name, would be to avoid the extra work of looking at the Host header on every one of the GETs on that connection. But according to the above statement in the RFC this needs to be done anyway, and once it is done I would assume a correct implementation will combine the Host header with the relative URL to get the absolute URL for every GET.

Given the above statement in the RFC, you are correct that a server is "looking for trouble" if it ties a TCP connection to a domain name. In fact, Apache does this correctly. See http://www.phpfreaks.com/apache_manual/page/vhosts/details.html. The following is from that link.
------------------------------------------------------------
Persistent connections

The IP lookup described above is only done once for a particular
TCP/IP session while the name lookup is done on every request during
a KeepAlive/persistent connection. In other words a client may request
pages from different name-based vhosts during a single persistent
connection.
------------------------------------------------------------
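The Apache behaviour quoted above amounts to re-reading the Host header on every request on a persistent connection and dispatching each request to its name-based vhost independently. A tiny illustrative sketch of that per-request demultiplexing (the request strings and vhost table are made up; this is not Apache's actual code):

```python
# Name-based virtual hosting: the Host header is examined on EVERY
# request, so one persistent connection may serve several vhosts.

VHOSTS = {
    "a.example.com": "site A",
    "b.example.com": "site B",
}

def dispatch(raw_request: str) -> str:
    """Pick the vhost for one request by its Host header."""
    for line in raw_request.split("\r\n")[1:]:
        name, _, value = line.partition(":")
        if name.lower() == "host":
            return VHOSTS.get(value.strip(), "400 Bad Request")
    return "400 Bad Request"   # HTTP/1.1 requires a Host header

# Two requests arriving back-to-back on the same persistent connection:
r1 = "GET / HTTP/1.1\r\nHost: a.example.com\r\n\r\n"
r2 = "GET / HTTP/1.1\r\nHost: b.example.com\r\n\r\n"
assert dispatch(r1) == "site A"
assert dispatch(r2) == "site B"
```

A server that bound the connection to the first request's Host would return "site A" for both, which is exactly the "looking for trouble" behaviour discussed above.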
>The third item states that the servers MUST report a 400 error if the HOST >header does not exist. actually, few servers do this by default. apache and iis are incredibly forgiving (by default). they accept any Host header by default. (which opens up a big security hole, but whatever!) anyways, i was thinking more in terms of other, random home-brew virtual hosts. it does not surprise me that this change will work with apache and iis. what concerns me is the other 5% of the servers :-(
i've given some thought to how i would implement this change, and it occurred to me that it would be much simpler to implement if i could assume that matching by canonical hostname was good enough. thoughts?

the alternate solution requires enumerating all socket connections, and matching the IP address of each socket connection against the list of IP addresses returned from a host lookup. that's an O(n*m) operation that i would prefer to avoid. if instead i continue to store the connections in a hash table, indexed by canonical name, then i can reuse the same data structures that we have today, and i get better lookup performance.
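The canonical-hostname alternative keeps today's hash-table structure and just changes the key: aliases whose DNS entries canonicalize to the same name share a bucket, giving an O(1) lookup instead of an O(n*m) IP-list comparison. A sketch under that assumption (the CNAME data and helper names are invented for illustration, not Mozilla internals):

```python
# Hypothetical sketch: same hash table as today, but keyed by the
# canonical name from the DNS entry (the CNAME target) rather than
# the name the user typed.

CANONICAL = {                     # alias -> canonical name from DNS
    "www.example.com":  "web.hosting.example.net",
    "blog.example.com": "web.hosting.example.net",
}

connections = {}                  # canonical name -> idle connection

def get_connection(host):
    cname = CANONICAL.get(host, host)     # fall back to the host itself
    if cname in connections:
        return connections[cname]         # O(1), same structure as today
    conn = connections[cname] = object()  # stand-in for a new socket
    return conn

# Two aliases of the same canonical host share one connection:
assert get_connection("www.example.com") is get_connection("blog.example.com")
```

The trade-off is that this only catches hosts whose sharing is visible in DNS canonicalization; a transparent proxy that maps unrelated names to one IP without a shared CNAME would still get one connection set per name.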
Darin,

Our suggestion would be to look at only the first IP address in the list that is returned and compare it with the IP addresses for the sockets that are already open. If a socket exists, use it. Otherwise, open a new socket. This will be an O(m) operation, m being the number of open sockets.

For example, if IP1, IP2 and IP3 are the addresses that are returned, check if a connection is open to IP1 and use it if it exists – otherwise open a connection to IP1. Ignore the other two IP addresses. Of course, if a connection to IP1 fails, the other IP addresses will be tried in order, and at each step you can check for an existing connection. For example, if a connection to IP1 fails and you try IP2, you can check whether a connection to IP2 is already open, and so on.

This is also consistent with round-robin DNS solutions, because round-robin DNS should return the IP address chosen for the current request first and the others after it. Our solution with the wireless proxy will work in this situation because we return the proxy IP address as the first IP address.
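The suggested lookup order can be sketched as a walk over the resolved address list, scanning the open sockets (the O(m) pass) at each step before opening anything new. `pick_connection`, `try_connect`, and the addresses below are illustrative stand-ins, not the actual Necko interfaces:

```python
# Sketch of the first-IP-with-failover matching strategy.

def pick_connection(addresses, open_sockets, try_connect):
    """addresses: resolved IPs in preference order.
    open_sockets: dict of ip -> existing idle connection.
    try_connect: ip -> new connection, or None on failure."""
    for ip in addresses:             # IP1 first; IP2, IP3 only on failover
        if ip in open_sockets:       # O(m) reuse check before reconnecting
            return open_sockets[ip]
        conn = try_connect(ip)
        if conn is not None:
            return conn
    return None                      # every address failed

open_socks = {"10.0.0.2": "existing-conn"}
# IP1 is unreachable here, so we fail over to IP2 and find an open socket:
result = pick_connection(
    ["10.0.0.1", "10.0.0.2"],
    open_socks,
    lambda ip: None if ip == "10.0.0.1" else f"new-conn-{ip}",
)
assert result == "existing-conn"
```

In the common (no-failover) case this inspects only the first address, which is what keeps it cheap and what makes it line up with round-robin DNS ordering.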
> Our suggestion would be to look at only the first IP address in the list
> that is returned and compare it with the IP addresses for the sockets
> that are already open....

Sampath,

Thanks for the suggestion. I can see why that might work... however, it is a rather involved change to implement what you are suggesting. The problem lies in the fact that the code which manages failover to the next IP address given in a DNS entry is not in the same module as the code which manages the list of HTTP connections. We would need to make some drastic changes to implement your suggestion :-(

Hmm.... Unfortunately, time has run out for the 1.6 cycle. We have frozen for Beta, and this isn't going to get fixed during this cycle. Pushing out to 1.7 alpha.
Target Milestone: mozilla1.6beta → mozilla1.7alpha
Darin,

If you could look at the first IP address that is returned and then search for a matching socket, that would be good enough. If there is a failover and you switch to the next IP address but are not able to search for a matching socket (because the code is in a different module), that is OK. Basically, this means that only on a failover might you fail to reuse an existing connection. Most of the time there should not be a failover and you will be able to use an existing connection.

Thanks for trying to get the change into 1.7 alpha.

Sampath
Priority: -- → P3
Target Milestone: mozilla1.7alpha → mozilla1.7beta
this isn't going to happen for 1.7 :-(
Target Milestone: mozilla1.7beta → mozilla1.8alpha
Darin, Hopefully you can get it in for 1.8. Thanks.... Sampath
Chances of this happening by me during the 1.8 cycle are slim. Patches are of course welcome! ;-) 1.8 alpha closes next week according to the Mozilla Roadmap, and I'm not sure that a change as risky as this should happen during the ensuing beta cycle.
Priority: P3 → P5
Priority: P5 → --
Target Milestone: mozilla1.8alpha1 → Future
-> default owner
Assignee: darin → nobody
Status: ASSIGNED → NEW
Component: Networking: HTTP → Networking
QA Contact: networking.http → networking
Target Milestone: Future → ---
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX