Closed Bug 220941 Opened 21 years ago Closed 8 years ago

Consider matching HTTP connections by resolved IP address instead of by host name

Categories

(Core :: Networking, enhancement)

x86
Linux
enhancement
normal

Tracking


RESOLVED WONTFIX

People

(Reporter: darin.moz, Unassigned)

Details

(Keywords: helpwanted, perf)

Consider matching HTTP connections by resolved IP address instead of by host name.

This change is motivated by the fact that it is very common for transparent
proxies to be implemented by taking over DNS to map all hostnames to a single IP
address.  In such cases the browser will continue to open 2 persistent
connections per host name, which can quickly add up to a large number of proxy
server connections.  It would be much better if the browser simply reused an
existing socket to the same IP node.

I could not find anything in RFC 2616 that seemed to suggest that this is not a
valid thing to do.  Section 8.1.4 talks about restricting the number of
persistent connections per server or proxy, but it is not explicit about whether
or not that may be interpreted as an IP-node.  I think it can be.

With a decent DNS cache (which we have), I think this should be trivial to
implement.  It would also benefit Mozilla on sites which do a lot of virtual
hosting.
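
A minimal sketch of the idea follows (the class and member names are invented
for illustration and are not the actual necko data structures); the only point
is that the pool of idle persistent connections is keyed by the connected IP
address rather than by host name:

  // Sketch only (hypothetical types, not Mozilla/necko code): idle persistent
  // connections are indexed by the IP address the socket is connected to, so
  // requests for host names that resolve to the same address can share a socket.
  #include <map>
  #include <string>
  #include <vector>

  struct Connection {
    int socketFd;
    std::string connectedIp;  // address this socket is actually connected to
  };

  class ConnectionPool {
   public:
    // Return an idle connection to any of the resolved addresses, or nullptr
    // if none exists (in which case the caller opens a new socket).
    Connection* FindReusable(const std::vector<std::string>& resolvedIps) {
      for (const std::string& ip : resolvedIps) {
        auto it = mIdleByIp.find(ip);
        if (it != mIdleByIp.end())
          return it->second;  // same IP node: reuse the existing socket
      }
      return nullptr;
    }

    void AddIdle(Connection* conn) { mIdleByIp[conn->connectedIp] = conn; }

   private:
    std::map<std::string, Connection*> mIdleByIp;  // keyed by IP, not host name
  };

Connection lifetime, HTTPS, and proxy configuration are ignored here; the
sketch only shows the change of lookup key.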

BTW: I learned about this problem from a research paper presented at the 2003
IWCW.  Wireless networks (the non-802.11b flavor) like to use transparent
proxies to enhance performance.  Configuring clients to use an explicit proxy is
often not easy or even possible.
Severity: normal → enhancement
Status: NEW → ASSIGNED
Keywords: perf
Target Milestone: --- → mozilla1.6beta
I don't think a possible transparent proxy is a good reason to do this.
You have to assume that the server admins know they'll get a large
number of connections given what they're doing. If they have performance
problems it's their problem not mozilla's.

As to virtual hosting sites, some of them are just reverse proxies. What
you're actually connected to may depend on the virtual host. Are you
sure it's always safe to change a Host header in an established
connection?

Wouldn't this idea adversely affect sites which have multiple IP
addresses for load balancing?

I guess I don't like it. :-)
pablo (who originally mentioned this issue to me and wanted me to post his
response to tenthumbs) had this to say:

>I don't think a possible transparent proxy is a good reason to do this.
>You have to assume that the server admins know they'll get a large
>number of connections given what they're doing. If they have performance
>problems it's their problem not mozilla's.
 
The problem is not the extra load generated on the server.  The problem is that
opening many connections causes extra delays, which are especially problematic
over wireless links.  The goal is to make sure that all requests are sent to
the transparent proxy and that TCP connections to this proxy are re-used, even
for multiple domains.  If this is not the case and the browser opens new
connections for each domain name, then the end-user experience is severely
degraded.
 
>As to virtual hosting sites, some of them are just reverse proxies. What
>you're actually connected to may depend on the virtual host. Are you
>sure it's always safe to change a Host header in an established
>connection?
 
I cannot think of any problem with having a single TCP connection and different
Host headers. Do you have any particular scenario in mind?
 
>Wouldn't this idea adversely affect sites which have multiple IP
>addresses for load balancing?
 
Load balancing is not a problem since different users can still be redirected to
different IP addresses.
i agree with pablo here.  i think it is unlikely that servers will have a
problem receiving requests for different Host headers over the same socket
connection.  i think this bug is worth fixing.
I'm also not convinced this is a good idea, but if Darin thinks this is worth
coding, we can find out the old-fashioned way.
> The problem is not the extra load generated on the server. The problem
> is that opening many connections causes extra delays, which are
> especially problematic over wireless links. The goal is to make sure
> that all requests are sent to the transparent proxy and that TCP
> connections to this proxy are re-used, even for multiple domains. If
> this is not the case and the browser opens new connections for each
> domain name, then the end-user experience is severely degraded.

This sounds like you're trying to mask a network problem by fiddling
with the clients. It would be one thing if mozilla had some simple
tuning prefs so the user could say "I'm on a T1", or "I'm on a modem",
or whatever but mozilla doesn't like treating users as intelligent.

You may see a performance improvement because of the reduced traffic but
I strongly suspect wired users will see a performance decrease. I can't
quantify it (yet) but I see a slowdown when using a proxy server
because of the relatively small number of connections.

I don't see why everyone should be penalized because of an obscure 
issue.

> I cannot think of any problem with having a single TCP connection and
> different Host headers. Do you have any particular scenario in mind?

In the ideal world, a reverse proxy will examine every packet coming in
from the outside and route it appropriately. The real world's another
thing. I have no doubt that there are lots of reverse proxies that
sometimes forget to do this so packets will be mis-sent.

This should be very carefully tested. Start visiting porn sites because
they tend to do this sort of thing a lot. :-)

> Load balancing is not a problem since different users can still be
> redirected to different IP addresses.

How do you know a site isn't configured based on the assumption that
each "transaction" (e.g. a web page and its goodies) will be a separate
connection?  You're assuming that you can send arbitrary amounts of data over
one connection.
I have worked with Pablo on the design of a performance enhancing proxy to 
improve performance over wireless links when we noticed this problem with the 
Mozilla browser. As Pablo pointed out, multiple TCP connections and DNS 
requests affect performance across wireless links considerably. To eliminate 
this problem, we decided to rewrite URLs (similar to CDNs) to transparently 
redirect requests to a proxy. Using this technique, requests to multiple domain 
names will be resolved to a single IP address (that of the proxy) and fewer 
TCP connections will be used to download multiple web pages.
 
The problem that we saw with the Mozilla browser is that even if the DNS 
resolves different domain names to a single IP address, multiple TCP 
connections will be opened, one for each domain name, which severely affects 
the wireless browsing experience. Modifying this behaviour to re-use connections 
per IP address helps boost performance across wireless links. The 
performance improvement is not because of reduced traffic but because of the 
elimination of multiple RTTs across the wireless links required to open 
multiple TCP connections. 
 
For wireline clients I do not believe that there is any adverse impact. The 
modification to the browser does not imply that all requests are somehow sent 
either through a forward proxy or a reverse proxy. Only those requests that are 
URL rewritten will be sent to a forward proxy, and that will only occur when a 
performance enhancing proxy is installed across wireless links. 
 
If you think about this issue, you will realize that outside of the situation 
where a performance enhancing proxy is installed, this modification to the 
browser will only influence servers which host multiple domains under the same 
IP address. For these servers, multiple requests to different domains can be 
sent over the same TCP connection. I do not see how this could affect 
performance in general. If the server wants to keep a transaction model where 
it accepts one GET over a connection, it can close the connection after serving 
a GET. The modification at the browser does not prevent this from happening. I 
do not see GETs being mis-sent and served by the incorrect "virtual host" 
either. The "HOST" header is there precisely for this reason: to 
enable "virtual hosts" under one IP address and de-multiplex requests to the 
appropriate host based on the "HOST" header. It does not matter whether the 
requests are being sent over one TCP connection or multiple TCP connections.
 
The suggested modification does help wireless performance, I do not believe it 
has any negative consequences for wireline users, and it should definitely be 
incorporated.

sampath: thanks for the additional details.  i agree with you that this is
probably the right thing to do.  my only concern is virtual hosting systems that
assume a connection (and associated state stored on the virtual host) will be
for one domain only.  this could happen if the virtual host assumes the browser
will open a new connection per domain (as mozilla and ie currently do).  that
said, i think it is also reasonable to say that such a virtual host is "looking
for trouble" since RFC 2616 is vague about how connection matching should be done.
Darin,


 Section 19.6.1.1 of RFC 2616 states the following
----------------------------------------------------------------------

   [...] Given the rate of growth of
   the Web, and the number of servers already deployed, it is extremely
   important that all implementations of HTTP (including updates to
   existing HTTP/1.0 applications) correctly implement these
   requirements:

      - Both clients and servers MUST support the Host request-header.

      - A client that sends an HTTP/1.1 request MUST send a Host header.

      - Servers MUST report a 400 (Bad Request) error if an HTTP/1.1
        request does not include a Host request-header.

      - Servers MUST accept absolute URIs.
-------------------------------------------------------------------------

   The third item states that the servers MUST report a 400 error if the HOST 
header does not exist. This would mean that the HOST header needs to be looked 
at by the server on every GET request, even if the requests are being received 
on a single persistent connection. The only reason that a server might be 
tempted to look at the HOST header only on the first GET and tie that TCP 
connection to a domain name would be to avoid the extra work of looking at the 
HOST header on every one of the GETs on that connection. But according to the 
above statement in the RFC, this needs to be done anyway. Once this is done, I 
would assume a correct implementation will combine the HOST header with the 
relative URL to get the absolute URL for every GET.
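
As a rough illustration of that last step (a hypothetical helper, not code from
any particular server), the per-request reconstruction is just a concatenation
of the Host header and the Request-URI:

  // Hypothetical helper: on a persistent connection, the effective absolute
  // URL is rebuilt per request from that request's HOST header and Request-URI
  // (plain http and an origin-form Request-URI assumed).
  #include <string>

  std::string EffectiveUrl(const std::string& hostHeader,
                           const std::string& requestUri) {
    // e.g. ("www.example-a.com", "/index.html")
    //   -> "http://www.example-a.com/index.html"
    return "http://" + hostHeader + requestUri;
  }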

 Given the above statement in the RFC you are correct that a server is "looking 
for trouble" if it ties a TCP connection to a domain name.
 

   In fact, Apache does this correctly. See 
http://www.phpfreaks.com/apache_manual/page/vhosts/details.html 

The following is from that link.

------------------------------------------------------------
Persistent connections

The IP lookup described above is only done once for a particular TCP/IP session 
while the name lookup is done on every request during a KeepAlive/persistent 
connection. In other words a client may request pages from different name-based 
vhosts during a single persistent connection.

---------------------------------------------------------------------------
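
To make the quoted behaviour concrete from the client side, here is a minimal
sketch (hypothetical host names; error handling and response reading omitted)
of two HTTP/1.1 requests for different name-based virtual hosts sent over one
already-established connection:

  // Illustration only: two requests for different name-based virtual hosts
  // sent over the same established socket.  Only the Host header changes; the
  // server demultiplexes each request using that header.
  #include <string>
  #include <sys/socket.h>

  void RequestTwoVhosts(int fd) {  // fd: socket already connected to the shared IP
    const std::string req1 =
        "GET /index.html HTTP/1.1\r\n"
        "Host: www.example-a.com\r\n"
        "Connection: keep-alive\r\n\r\n";
    const std::string req2 =
        "GET /index.html HTTP/1.1\r\n"
        "Host: www.example-b.com\r\n"
        "Connection: keep-alive\r\n\r\n";
    send(fd, req1.data(), req1.size(), 0);
    // ... read and consume the first response here, then reuse the socket ...
    send(fd, req2.data(), req2.size(), 0);
  }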

 

>The third item states that the servers MUST report a 400 error if the HOST 
>header does not exist.

actually, few servers do this by default.  apache and iis are incredibly
forgiving (by default).  they accept any Host header by default.  (which opens
up a big security hole, but whatever!)

anyways, i was thinking more in terms of other, random home-brew virtual hosts.
 it does not surprise me that this change will work with apache and iis.  what
concerns me is the other 5% of the servers :-(
i've given some thought to how i would implement this change, and it occurred to
me that it would be much simpler to implement it if i could assume that matching
by canonical hostname was good enough.  thoughts?  the alternate solution
requires enumerating all socket connections, and matching the IP address of each
socket connection to the list of IP addresses returned from a host lookup.
that's an O(n*m) operation that i would prefer to avoid.  if instead i continue
to store the connections in a hash table, indexed by canonical name, then i can
reuse the same data structures that we have today, and i get better lookup
performance.
Darin,

    Our suggestion would be to look at only the first IP address in the list 
that is returned and compare it with the IP addresses for the sockets that are 
already open. If a socket exists, use it. Otherwise, open a new socket. This 
will be an O(m) operation, m being the number of open sockets. For example, if 
IP1, IP2 and IP3 are the addresses that are returned, check if a connection is 
open to IP1 and use it if it exists – otherwise open a connection to IP1. 
Ignore the other two IP addresses. Of course, if a connection to IP1 fails, 
other IP addresses will be tried in order and at each step you can check the 
existence of a connection. For example, if a connection to IP1 fails and you 
try IP2, you can check if a connection to IP2 is already open or not and so on. 
This is also consistent with round-robin DNS solutions because round-robin DNS 
should return the IP address that is chosen for the current request as the 
first IP address and the others as the rest. Our solution with the wireless 
proxy will work in this situation because we return the proxy IP address as the 
first IP address.
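
In code, the suggested matching rule might look roughly like this (the map and
function names are invented for the example; failover to later addresses is
left to the existing connect path):

  // Sketch of the suggested rule: only the first address returned by the
  // resolver is used to look for a reusable socket; the remaining addresses
  // matter only if connecting to the first one fails.
  #include <map>
  #include <string>
  #include <vector>

  struct OpenSocket {
    int fd;
  };

  OpenSocket* FindReusable(std::map<std::string, OpenSocket*>& openByIp,
                           const std::vector<std::string>& resolvedIps) {
    if (resolvedIps.empty())
      return nullptr;
    auto it = openByIp.find(resolvedIps.front());  // first address only
    return it != openByIp.end() ? it->second : nullptr;  // nullptr => open a new socket
  }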

>    Our suggestion would be to look at only the first IP address in the list 
>that is returned and compare it with the IP addresses for the sockets that are 
>already open....

Sampath,

Thanks for the suggestion.  I can see why that might work... however, it is a
rather involved change to implement what you are suggesting.  The problem lies
in the fact that the code which manages failover, to the next IP address given
in a DNS entry, is not in the same module as the code which manages the list of
HTTP connections.  We would need to make some drastic changes to implement your
suggestion :-(  Hmm....

Unfortunately, time has run out for the 1.6 cycle.  We have frozen for Beta, and
this isn't going to get fixed during this cycle.  Pushing out to 1.7 alpha.
Target Milestone: mozilla1.6beta → mozilla1.7alpha
Darin,

   If you could look at the first IP address that is returned and then search 
for a matching socket, that would be good enough. If there is a failover and you 
switch to the next IP address and are not able to search for a matching socket 
(because the code is in a different module), that is ok. Basically, this means 
that only if there is a failover might you fail to reuse an existing connection. Most 
of the time there should not be a failover and you will be able to use an 
existing connection. Thanks for trying to get the change into 1.7 alpha.

Sampath
Priority: -- → P3
Target Milestone: mozilla1.7alpha → mozilla1.7beta
this isn't going to happen for 1.7 :-(
Target Milestone: mozilla1.7beta → mozilla1.8alpha
Darin,

 Hopefully you can get it in for 1.8. Thanks....

Sampath
Chances of this happening by me during the 1.8 cycle are slim.  Patches are of
course welcome! ;-)

1.8 alpha closes next week according to the Mozilla Roadmap, and I'm not sure
that a change as risky as this should happen during the ensuing beta cycle.
Keywords: helpwanted
Priority: P3 → P5
Priority: P5 → --
Target Milestone: mozilla1.8alpha1 → Future
-> default owner
Assignee: darin → nobody
Status: ASSIGNED → NEW
Component: Networking: HTTP → Networking
QA Contact: networking.http → networking
Target Milestone: Future → ---
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX