Closed Bug 783830 Opened 12 years ago Closed 8 years ago

Use multiple host addresses when available

Categories

(Core :: Networking: HTTP, enhancement)

RESOLVED WONTFIX

People

(Reporter: andershol, Unassigned)

Details

When multiple addresses, i.e. A or AAAA records, exist for a host name, the browser should, for increased availability and speed,
a) attempt to open connections to all (or at most e.g. 5) addresses concurrently
b) start the HTTP-requests on the best connection. For the first request the best connection is probably the one that connects first, but for subsequent requests availability, latency and bandwidth (e.g. if one address is much slower than the others, don't use it even if it is idle) can be considered. Statistics can probably be kept for addresses independent of host name. 
c) restart stalled requests on other connections (e.g. using a range request to pick up where the stalled one left off). The stalled request might be canceled after the new request is verified to be better.
d) for requests with a content-length over some limit (and perhaps only if an e-tag and/or a last-modified header also exist) use range requests to download multiple parts of the file simultaneously.
(c) and (d) can probably be combined by using the estimated time remaining as the criterion instead of the content-length.
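
A minimal sketch of points (a) and the first half of (b), in Python with asyncio rather than Gecko's networking code. The names `race_connect`, `fake_connect` and `demo_race` are hypothetical, the addresses come from the documentation range 192.0.2.0/24, and the per-address delays are made up for illustration. It starts a connection attempt to up to 5 addresses at once, ignores addresses that fail, and returns the first connection that succeeds.

```python
import asyncio

MAX_PARALLEL = 5  # cap taken from the report's "at most e.g. 5"


async def race_connect(addresses, connect, limit=MAX_PARALLEL):
    """Start connect(addr) for up to `limit` addresses concurrently.

    Returns (winning_address, connection, still_pending_tasks). Pending tasks
    are left running so a caller could keep the slower connections around
    for subsequent requests.
    """
    tasks = {asyncio.create_task(connect(addr)): addr for addr in addresses[:limit]}
    pending = set(tasks)
    while pending:
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            if task.exception() is None:  # failed addresses are simply skipped
                return tasks[task], task.result(), pending
    raise ConnectionError("no address accepted a connection")


async def fake_connect(addr):
    # Hypothetical stand-in for a TCP connect: two live hosts, one dead one.
    delays = {"192.0.2.1": 0.05, "192.0.2.2": 0.01, "192.0.2.3": None}
    if delays[addr] is None:
        raise ConnectionRefusedError(addr)
    await asyncio.sleep(delays[addr])
    return f"conn:{addr}"


def demo_race():
    async def run():
        # The dead address is listed first, as in the worst case described below.
        addr, conn, pending = await race_connect(
            ["192.0.2.3", "192.0.2.1", "192.0.2.2"], fake_connect)
        for task in pending:
            task.cancel()  # this demo drops the slower connections
        return addr, conn
    return asyncio.run(run())
```

In this sketch the order of the addresses does not affect which connection wins, which is the property the proposal is after. Choosing among established connections by latency or bandwidth for later requests is left out.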

As an example, suppose you set up two a-records for the domain *.a.example.com to point to two webservers, both serving that wildcard-domain and both with an image at the address /a.gif, one server serving a blue image and the other a green one. You also set up a third a-record for the domain pointing to an address not hosting a webserver. All of the addresses should be non-local, as it seems the current implementation prefers local addresses over remote ones. Now you would load e.g. an html page containing:
<style>img { width:5px; height:5px; border:1px solid #ccc; } </style>
<script>var t = new Date().getTime();
for (var i = 0; i < 100; i++)
 document.write('<img src="http://'+t+'.a.example.com/a.gif?'+t+'-'+i+'"> ');
</script>

In the current implementation you would get a page of all blue or all green images depending on the order of the records in the dns-response. If the record not pointing to a web-server happened to be first, the page would take a little longer to load as the bad address would be tried first. If the chosen web-server happened to be very slow or crashed halfway through, the page would just load slowly or not finish, since fail-over between the returned a-records only happens in the connection-phase of the request (I think).

With the proposed changes the expected behavior would be that the page contains a mix of blue and green images, and that the page would load with the same speed regardless of the order of the addresses in the dns-response (as all addresses are being tried simultaneously and the non-functional address is simply ignored). Also the page might load faster as images can be downloaded over two connections (but that would depend on connection speed for client and server and on server load). You could view it as failing over for each request (even successful ones), but it goes further in that multiple downloads are started simultaneously.

This proposal does not solve the same problem as e.g. domain sharding, which is what SPDY replaces (among other things). There, multiple host names are pointed to the same address to get more simultaneous downloads from the same host. This bug tries to address the problem where some hosts are unavailable, slow or far away, which sites might try to mitigate using fail-over (using load balancers, dns or bgp-routing) and geolocation-aware dns. Also, clients that cannot access ipv6 addresses would seamlessly and instantly fall back to using ipv4 addresses if available.

Bug 258456 seems related, but it only proposes that unavailable addresses should not be retried, not that all addresses should be used simultaneously if multiple downloads are pending or for pieces of large downloads.

Gecko already opens a connection to an A and another one to an AAAA address for Happy-Eyeballs (http://en.wikipedia.org/wiki/Happy_Eyeballs ). Switching the server in a "session" could lead to problems, for example if the content is dynamically generated for the user. See also bug 725587.

Opening a connection to all available A and AAAA addresses would be a waste of resources on the server.

(In reply to Matthias Versen (Matti) from comment #1)
> Gecko already opens a connection to an A and another one to an AAAA address
> for Happy-Eyeballs (http://en.wikipedia.org/wiki/Happy_Eyeballs ).
Thanks for the reference, it had slipped my mind. But it seems to work on the assumptions that (1) "every SYN is sacred" and (2) the goal is to find one optimal connection.

But even for small resources, the connection-setup packets are dwarfed by the content transfer (e.g. ~3 packets for the connection setup versus ~20 packets for a 20-30 kB file, which I think would be considered tiny). So the extra packets should not be a problem for the network. I assume that networking stacks in modern OSes are so optimized that the extra (idle) connections are not what will overload a web server.
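
The packet estimate above can be checked with a short calculation. This is a sketch under stated assumptions: a 1460-byte TCP payload per packet (a typical value on a 1500-byte MTU path, not something stated in the thread), decimal kilobytes, and no counting of ACKs, HTTP headers or retransmissions. The function names are hypothetical.

```python
import math

MSS_BYTES = 1460        # assumed TCP payload per packet
HANDSHAKE_PACKETS = 3   # SYN, SYN-ACK, ACK


def data_packets(size_bytes, mss=MSS_BYTES):
    """Minimum number of data segments needed to carry size_bytes."""
    return math.ceil(size_bytes / mss)


def handshake_share(size_bytes):
    """Packets of one extra handshake relative to the data packets of the transfer."""
    return HANDSHAKE_PACKETS / data_packets(size_bytes)
```

Under these assumptions a 20 kB file needs 14 data packets and a 30 kB file needs 21, so the "~20 packets" figure is in the right range and one extra handshake adds roughly 14 to 21 percent on top of such a transfer.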

I would assume that the number of resources fetched for a page is almost always higher than the number of addresses the site has set up (the counterexample is www.google.com, which when I looked it up had 6 A-records and 1 AAAA-record, but only 4 resources loaded from google.com -- but even there, there is a good chance that the user will perform a search and thus load more pages), and since the various "max...connections" settings (see https://developer.mozilla.org/Mozilla_Networking_Preferences ) are all higher than the 5 connections I suggested, there is a good chance that all 5 connections will be used for a normal page load (modulo keep-alive...). In the worst case, 4 connections are wasted, but only one per address and only for sites that have configured at least 5 addresses (which I assume is a tiny minority), and they have done so exactly because they want to make sure their site is available and fast.

> Switching the server in a "session" could lead to problems for example
> if the content is dynamically generated for the user.
I believe that only a tiny minority of sites have multiple address records (pointing to different servers) configured, and that only a tiny minority run applications where e.g. the script that generates the html also generates an image on disk which is referenced by the html (I think it was such a scenario you meant). The intersection of those sets, where it is furthermore assumed that the resource request will always hit the same server (which seems fragile), i.e. no replication between servers, seems very tiny. Some such sites probably exist, since every possible misconfiguration probably exists on the net. But it would break at some point.

> see also bug 725587
Skimming the bug, it seems to me that the main problem is that "Log-ins are bound to IP Addresses" and this causes problems when contacting the host over both ipv4 and ipv6, since the client's ipv4 and ipv6 addresses are of course different. This is a good point and means that the full proposed functionality of this bug should be grouped by address family.
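
Grouping by address family could look roughly like the sketch below. It uses Python's standard `ipaddress` module; `group_by_family` is a hypothetical helper name, and the idea that each group would then be raced separately is an assumption about how the proposal might be applied, not something the thread spells out.

```python
import ipaddress
from collections import defaultdict


def group_by_family(addresses):
    """Split textual IP addresses into {'ipv4': [...], 'ipv6': [...]}, keeping order."""
    groups = defaultdict(list)
    for text in addresses:
        ip = ipaddress.ip_address(text)  # raises ValueError on malformed input
        groups[f"ipv{ip.version}"].append(text)
    return dict(groups)
```

For example, `group_by_family(["192.0.2.1", "2001:db8::1", "192.0.2.2"])` keeps the two ipv4 addresses together and the ipv6 address on its own, so a login bound to the client's ipv4 address would only ever see connections from that family.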

It is not clear if they have also observed the browser trying to use the same ssl-session on both the ipv4 and ipv6 address. That would seem to me to be a bug if it did (although I have not read the ssl spec to see if such address migration is legal). Even if allowed by the spec it seems fragile. But ssl-session != http-session.

> Opening a connection to all available A and AAAA addresses would be a waste
> of resources on the server
As I try to argue above, I do not believe it will be, and in common cases it will not increase resource usage. But I think it would be very interesting if something like telemetry or testpilot could be used to test the impact.

andershol, thanks for taking the time to file a detailed feature request. There is a mix of things in here that in my opinion run the gamut from promising to skeptical to non-starter. But let's not let that get in the way of an interesting project.

If you're interested in writing code, let's talk about how you can structure it in a limited way so we can get some data to make informed decisions with - i.e. using a/b controls with small samples and telemetry and considering what factors might be interesting other than just mean page load time.. and maybe limiting to a pref-on context of only <= beta to get that info. Are you proposing writing the code for this?

There are several spots which you might run into real trouble with - I don't want to be all fuddish and reject the work before it happens because their outcome is really unknown - but I'm skeptical about a few things.

but first, in the spirit of optimism - I really like:

* new ideas and innovation

* syns are not sacred; I'm totally comfortable with low bandwidth speculative operations that have a good payoff. For the most part syns and dns are the places we have looked for that so far, but I'm interested in other ideas. We currently do create connections in asynchronous speculative ways (generally to pre-empt OS level timeouts), and it's likely we'll add more of this along the lines of "last time you needed 3 connections to serve foo.com, let's make 3 as soon as you click on foo.com this time".. but they can create server burden so their increased use should be thoughtful and deliberate.

* the browser is in the ideal spot to do locality discovery. It makes sense to do something.

but my initial concerns would be:

 * load balancing within a session as matti suggests. LBs go to great lengths to keep sessions on the same member of a cluster and generally rely on DNS to keep browsers pointed at the same cluster.

 * ignoring the ordering of the RRs is fraught with some danger.. there has been a problem in the past with hosts that perceive the 1st address to be the primary and the 2nd to be a backup - using the latter as more than a solution to "unreachable right now" ends up overloading the backup.. I can't lay my hands on the bug number right now, but it has been a barrier in the past. It's something I'm tempted to push back on but we should have a mechanism for understanding the scope of the problem (and backing off ourselves if necessary) in place.

 * substituting latency for max-bandwidth-for-these-particular-transactions-on-this-particular protocol isn't necessarily a sure thing. Especially in the case of overload (see last bullet), routers configured with short buffers won't show increased delay, but the transfer experience will be awful. Measuring bandwidth is hard, because it is so contextual and easily messed with by other local activity. I think latency is a good guide in the absence of any other data - but it's an open question whether record ordering is a better guide in practice.

 * there is a significant problem with server side state for some servers that scales per connection. It shouldn't be that way, and for lots of folks it isn't a problem, but I know privately of several large household name services that are still process-per-connection (not even thread per connection). The standard PHP implementation is a big contributor to the problem. It's not clear what the extent of the problem really is - so some cautious exploration is warranted. It's also clear to me that there are answers available on the server side - so some gentle pressure if we are looking at a real minority is probably warranted. I certainly have ideas different than yours that also result in increased connection use.

As a first step I'd support a testing project to do parallel connections to multiple addresses and select the fastest one from the set.. but maybe limit it to just that at first.

(In reply to Patrick McManus [:mcmanus] from comment #3)
> my opinion run the gamut from promising to skeptical to non-starter.
Thanks a lot for the feedback. Although your range of feelings seems a bit lopsided :) I actually find it encouraging in some way that such a counter-intuitive suggestion (we all know that "browsers do not use multiple a-records, so we have to use all sorts of tricks to be redundant") can't be (or at least isn't) dismissed out of hand.

> Are you proposing writing the code for this?
Well, I wasn't, but maybe I should take a stab at it. The pointer from Matti seems like a good place to start looking.

> We currently do create connections in asynchronous speculative
> ways (generally to pre-empt OS level timeouts), and it's likely we'll add
> more of this along the lines of "last time you needed 3 connections to serve
> foo.com, let's make 3 as soon as you click on foo.com this time".. but they
> can create server burden so their increased use should be thoughtful and
> deliberate.
I was just now trying to look for some "tcp-pre-warm"-thing I seem to remember being talked about for Firefox, but stumbled upon nsISpeculativeConnect instead (and bug 580117, though it doesn't discuss tcp-connections -- yet). Also note that this proposal won't hit all servers on the net, but only those that have chosen to set up multiple a-records for the same host.

> * the browser is in the ideal spot to do locality discovery. It makes sense
> to do something.
Yep, the current workaround (geolocation via dns and short TTLs -- which causes extra dns-lookups) needs to make the decision upfront and from afar (knowing the best path from server to dns and from client to dns doesn't mean you know the best path from client to server -- or by extension which server is the best).

> but my initial concerns would be:
>  * load balancing within a session as matti suggests. LBs go to great
> lengths to keep sessions on the same member of a cluster and generally rely
> on DNS to keep browsers pointed at the same cluster.
Yep, at least back when big-iron and java ruled the earth and state was stored in the front-end. Now that people have moved the state to memcache it is probably less of a concern.

But more importantly, if you rely on your load balancer for session affinity, you won't have multiple a-records pointing to multiple (non-connected) load balancers, since by the time the user next clicks a link, time might have passed and the browser might have decided to refresh dns. You might set up multiple a-records if you have multiple connections to the internet via different ISPs, and in this case everything will continue to work (just better) since the load balancer (or LB cluster) will continue to handle the session affinity.

And lastly, I believe, the common case is that on a page load one html-resource is loaded followed by the referenced image(/whatever)-resources, and that session affinity is almost solely relevant for the html-resources. If multiple a-records are set up it would already today be common to have the html-resource requests go to different addresses since the user needs to click a link in-between (so a dns refresh might have happened), so you already have the affinity problem where it matters most.

>  * ignoring the ordering of the RRs is fraught with some danger.. there has
> been a problem in the past with hosts that perceive the 1st address to be
> the primary and the 2nd to be a backup - using the latter as more than a
> solution to "unreachable right now" ends up overloading the backup.. I can't
> lay my hands on the bug number right now, but it has been a barrier in the
> past. It's something I'm tempted to push back on but we should have a
> mechanism for understanding the scope of the problem (and backing off
> ourselves if necessary) in place.
Sounds fragile (if it wasn't for MX-records :) ). I would suspect that geo-aware-dns-servers put a-records in a specific order, but if it really is the best server they put first, then nothing will change. But it would be very interesting if you could locate the bug (I only tried a very quick search myself). (see below for opting out)

But your point is inspiring: Perhaps the full functionality should only be turned on if the multiple addresses are advertised using SRV-records. They support both priority and weight, and there is a "http"-service name registered for services that "is served over HTTP, can be displayed by "typical" web browser client software, and is intended primarily to be viewed by a human user". There currently seems to be implementation going on in bug 735967 (and 545866 and 14328).
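
How priority and weight could drive address selection is sketched below. It is a simplified version of the ordering described in RFC 2782 (lowest priority value first, weighted random choice within a priority), not a full implementation: zero-weight records are only picked once no positive-weight record of the same priority remains. `SrvRecord`, `order_srv` and the host names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    priority: int  # lower value is tried first
    weight: int    # relative share among records of equal priority
    port: int
    target: str


def order_srv(records, rng=random):
    """Return records sorted by priority, weighted-random within each priority."""
    ordered = []
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        while group:
            weights = [r.weight for r in group]
            if sum(weights) > 0:
                pick = rng.choices(group, weights=weights)[0]
            else:
                pick = rng.choice(group)  # only zero-weight records are left
            group.remove(pick)
            ordered.append(pick)
    return ordered
```

With this ordering a site that wants a pure backup server can publish it at a higher priority value, which addresses the "1st is primary, 2nd is backup" concern explicitly instead of relying on record order.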

>  * substituting latency for
> max-bandwidth-for-these-particular-transactions-on-this-particular protocol
> isn't necessarily a sure thing. Especially in the case of overload (see last
> bullet), routers configured with short buffers won't show increased delay,
> but the transfer experience will be awful. Measuring bandwidth is hard,
> because it is so contextual and easily messed with by other local activity.
> I think latency is a good guide in the absence of any other data - but it's
> an open question whether record ordering is a better guide in practice.
I completely agree, which is also why I did some hand-waving above (talking about the "best" connection) and just mentioned that for the first request there is less information to choose from, so in some sense the choice is easier. But it is probably the part that needs most tweaking/tuning.

> * there is a significant problem with server side state for some servers
> that scales per connection. It shouldn't be that way, and for lots of folks
> it isn't a problem, but I know privately of several large household name
> services that are still process-per-connection (not even thread per
> connection). The standard PHP implementation is a big contributor to the
> problem. It's not clear what the extent of the problem really is - so some
> cautious exploration is warranted. It's also clear to me that there are
> answers available on the server side - so some gentle pressure if we are
> looking at a real minority is probably warranted. I certainly have ideas
> different than yours that also result in increased connection use.
- As previously argued this might not even increase the number of connections used, just spread them over more servers and connect upfront instead of seconds later when it turns out that the page has images.
- Initially, when only availability and latency can be checked, it would not make a difference for the measurement from the client-side if only the first two steps of the three-way handshake are completed (although for ssl-sites it would be nice to complete the ssl-handshake). So if it can be controlled, the third packet could be postponed, and then it will just look like a SYN-attack (in the very worst case), which all sites that choose to have multiple a-records should be able to handle on the front-end.
- A service that currently has a setup with 4 web-servers like www.example.com (a-record to)-> 1.1.1.[1-4] could probably just opt out by doing www.example.com (c-name to)-> www[1-4].example.com (a-record to one of)-> 1.1.1.[1-4], so the browser will only get one a-record at the "leaf" (the dns-servers will probably combine the responses, as additional info, so no extra lookups are needed).
- The reason that only a minority is using multiple a-records is exactly that browsers don't use them (chicken and egg). But I believe the abundance of geo-aware-dns, HA and CDN services shows that there is a need.

> As a first step I'd support a testing project to do parallel connections to
> multiple addresses and select the fastest one from the set.. but maybe limit
> it to just that at first.
It sounds like (a) and the first half of (b), from the first comment, if I understand correctly. I would probably also vote for doing the last half of (b) and perhaps save the others for when the servers are advertised using srv-records.

I guess some interesting numbers (or histogram), that could be done without an implementation, would be con#="the total number of opened connections to a hostname within the first minute from first connection" versus rec#="the number of a-records for the hostname". If con#>=min(rec#,5) then no resources will be wasted. Since min(rec#,5) is small, a histogram probably makes more sense than the average of min(con#/min(rec#,5),1) (and checking if it is close to 1).
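
The proposed measurement can be written down as a small sketch. The input format, a list of (con#, rec#) pairs per hostname, and the function names are assumptions for illustration; no such telemetry probe is described in the thread.

```python
from collections import Counter

CAP = 5  # the suggested maximum number of parallel connections


def utilisation(connections_opened, record_count, cap=CAP):
    """min(con# / min(rec#, cap), 1); 1.0 means no speculative connection is wasted."""
    would_open = min(record_count, cap)
    if would_open < 1:
        raise ValueError("a resolved host has at least one address record")
    return min(connections_opened / would_open, 1.0)


def waste_histogram(samples, cap=CAP):
    """Count hosts per (would_open, wasted) pair for samples of (con#, rec#)."""
    counts = Counter()
    for con, rec in samples:
        would_open = min(rec, cap)
        counts[(would_open, max(would_open - con, 0))] += 1
    return counts
```

Taking the www.google.com observation from earlier in the thread (7 address records) and assuming, purely for illustration, that its 4 resources used 4 connections, `utilisation(4, 7)` is 0.8 and the host lands in the `(5, 1)` bucket: five connections opened, one of them unused.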
(In reply to andershol from comment #4)
> (In reply to Patrick McManus [:mcmanus] from comment #3)
> > my opinion run the gamut from promising to skeptical to non-starter.
> Thanks a lot for the feedback. Although your range of feelings seems a
> bit lopsided :) I actually find it encouraging in some way, that such a

It was meant to be encouraging. There are always a million reasons why-not instead of why-yes. That just means it's not easy :)

> 
> > Are you proposing writing the code for this?
> Well, I wasn't, but maybe I should take a stab at it.

Realistically that's how something like this gets done - it needs a champion who will shepherd it through an evaluation, keep everyone else apprised of the results, and take on the burden of the various bug reports.


> > We currently do create connections in asynchronous speculative
> [..]
> > deliberate.
> I was just now trying to look for some "tcp-pre-warm"-thing I seem to
> remember being talked about for firefox, but stumbled upon
> nsISpeculativeConnect

right now speculative connect is part of it - we race connection and ssl establishment against cache IO and search box activity. Those connections might get used for something other than the event that kicked them off. The "syn-retry" code is another example.. after 200ms of waiting to connect we start another connection in parallel. The basic reason for that is to avoid waiting for an OS level syn-retry at 3000ms, but the net effect is that we start 2 connections in high latency environments. And as I said, I'm willing to experiment with more.
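
The "syn-retry" idea described above can be illustrated with a short asyncio sketch. This is not Gecko's code; `connect_with_backup` and `demo_backup` are hypothetical names, the connect function is a stand-in that just sleeps, and error handling is left out (a failed attempt simply raises).

```python
import asyncio

BACKUP_DELAY = 0.2  # the 200 ms figure mentioned above


async def connect_with_backup(connect, delay=BACKUP_DELAY):
    """Start one attempt; if it is still pending after `delay`, race a second one.

    Returns (connection, number_of_attempts_started).
    """
    first = asyncio.create_task(connect())
    done, _ = await asyncio.wait({first}, timeout=delay)
    if done:
        return first.result(), 1
    second = asyncio.create_task(connect())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # the loser is dropped in this sketch
    return done.pop().result(), 2


def demo_backup(attempt_delays, delay):
    # Each fake attempt takes the next delay from the list and returns it.
    remaining = iter(attempt_delays)

    async def connect():
        seconds = next(remaining)
        await asyncio.sleep(seconds)
        return seconds

    return asyncio.run(connect_with_backup(connect, delay))
```

On a fast path only one connection is opened; on a slow path the second attempt is started after the delay and whichever finishes first is used, which matches the observation that two connections get started in high latency environments.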

> instead (and bug 580117, though it doesn't discuss
> tcp-connections -- yet). Also note that this proposal won't hit all servers
> on the net, but only those that have chosen to setup multiple a-records for
> the same host.

That's true. I'm not sure it correlates to anything useful, but it's true :)

> 
> memcache it is probably less of a concern.
> 

the tail always bites you. and we have a huge userbase - so they will find the tail. And if we break their use case they will simply go to another browser. The question is always - how many people are affected and what's the net gain from the behavior change. The first part is always hard to answer, but we can make an attempt at quantifying the second.

> 
> relevant for the html-resources. If multiple a-records are set up it would
> already today be common to have the html-resource requests go to different
> addresses since the user needs to click a link in-between (so a dns refresh
> might have happened), so you already have the affinity problem where it
> matters most.

no - dns records are rarely reordered by their servers (for this reason) and we intentionally don't round robin through the set. 

> 
> But your point is inspiring: Perhaps the full functionality should only be
> turned on if the multiple addresses are advertised using SRV-records. 

srv is an interesting approach.. the trouble there is always figuring what to do in the no-srv case. You can make the A and SRV requests in parallel, but consider the case where you get the A back before the SRV. Do you add latency waiting for the SRV for the >99% of hosts that won't have one anyhow? How long will you wait?

If you can make some progress just using the fraction of cases where SRV comes back first (hopefully bundled with the A's) then it's a good thing. Of course we need to implement srv :)
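
One possible answer to the "how long will you wait?" question is a fixed grace period, sketched below. The 50 ms budget, the function names and the fake query functions are all assumptions for illustration; the thread does not settle on any value or mechanism.

```python
import asyncio

SRV_GRACE = 0.05  # hypothetical wait budget after the A answer arrives


async def resolve(query_a, query_srv, grace=SRV_GRACE):
    """Query A and SRV in parallel; once A is back, wait at most `grace` for SRV."""
    a_task = asyncio.create_task(query_a())
    srv_task = asyncio.create_task(query_srv())
    addresses = await a_task
    try:
        # Returns immediately if the SRV answer already arrived before the A answer.
        srv = await asyncio.wait_for(srv_task, timeout=grace)
    except asyncio.TimeoutError:
        srv = None  # fall back to plain A-record behaviour
    return addresses, srv


def demo_srv(a_delay, srv_delay, grace):
    async def query_a():
        await asyncio.sleep(a_delay)
        return ["192.0.2.1"]

    async def query_srv():
        await asyncio.sleep(srv_delay)
        return ["srv-record"]

    return asyncio.run(resolve(query_a, query_srv, grace))
```

The trade-off is visible in the two cases: when SRV arrives first it costs nothing, and when it does not, every host without an SRV record pays up to the grace period in added latency.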


> - Initially, when only availability and latency can be checked, it would not
> make a difference for the measurement from the client-side if only the first
> two steps of the three-way handshake are completed

clever thought - but how are you going to do that in a non-priv'd userspace application, hopefully on the full set of supported platforms? The other concern there is that you will trigger syncookie behavior on some servers by looking like a DoS and that will result in the loss of other desirable TCP features.. so there is a bad second order effect.

> 
> > As a first step I'd support a testing project to do parallel connections to
> > multiple addresses and select the fastest one from the set.. but maybe limit
> > it to just that at first.
> It sounds like (a) and the first half of (b), from the first comment,


sounds reasonable at least.

(In reply to Patrick McManus [:mcmanus] from comment #5)
> (In reply to andershol from comment #4)
> > (In reply to Patrick McManus [:mcmanus] from comment #3)
> > Also note that this proposal won't hit all servers
> > on the net, but only those that have chosen to setup multiple a-records for
> > the same host.
> That's true. I'm not sure it correlates to anything useful, but it's true :)
Why isn't it useful, that any potential harm is limited to a self-selected, small minority of sites?

> > memcache it is probably less of a concern.
> the tail always bites you. and we have a huge userbase - so they will find
> the tail. And if we break their use case they will simply go to another
> browser. The question is always - how many people are affected and what's
> the net gain from the behavior change. The first part is always hard to
> answer, but we can make an attempt at quantifying the second.
The potential harms we have discussed in this bug are to the sites, not the users. The benefit is to the users (faster connections) and the sites (happier users with a simpler setup).

> no - dns records are rarely reordered by their servers (for this reason) and
> we intentionally don't round robin through the set. 
I don't think I have ever seen an (authoritative) nameserver that didn't rotate the records. Looking around a bit:
- rfc 1035 ( http://www.ietf.org/rfc/rfc1035.txt ) - doesn't seem to mention or assign any significance to order of multiple records.
- Wikipedia ( http://en.wikipedia.org/wiki/Domain_Name_System ) - Says "The order of resource records in a set, returned by a resolver to an application, is undefined, but often servers implement round-robin ordering to achieve Global Server Load Balancing."
- Bind ( http://ftp.isc.org/isc/bind9/cur/9.9/doc/arm/Bv9ARM.ch06.html#rrset_ordering ) - "By default, all records are returned in random order." and "In this release of BIND 9, the rrset-order statement does not support "fixed" ordering by default."
- getaddrinfo ( http://linux.die.net/man/3/getaddrinfo ) - References RFC 3484 ( http://www.ietf.org/rfc/rfc3484.txt ) that suggests how resolver clients may reorder results before returning them to the application (including "Rules 9 and 10 may be superseded if the implementation has other means of sorting destination addresses.")
So I'm not saying that no one is doing it, just that a server that expects clients to honor the order (and in what cases would clients be allowed to use an address other than the first -- a client can never know for sure that a server is down) is walking on thin ice.

> If you can make some progress just using the fraction of cases where SRV
> comes back first (hopefully bundled with the A's) then it's a good thing.
Sure, but the safe bet would be to require srv. If you are setting up a redundant web-server, it is a small extra burden to use SRV-records instead of A-records.

> > > As a first step I'd support a testing project to do parallel connections to
> > > multiple addresses and select the fastest one from the set.. but maybe
> > > limit it to just that at first.
> > It sounds like (a) and the first half of (b), from the first comment,
> sounds reasonable at least.

The part of my comment that you think is "reasonable" is where I quote you? Nice :)

(In reply to andershol from comment #6)

> Why isn't it useful, that any potential harm is limited to a self-selected,
> small minority of sites?

Servers that have multiple A records aren't opting into this algorithm in particular so there is risk. That's my point. It's certainly nice that they represent a reasonable subset of all servers though.

> The potential harms we have discussed in this bug is to the sites, not the
> users.

Both users and sites have risks. Users risk using overloaded (slow) backup installations and having a bad experience, and having their sessions break because of unexpected load balancing.


> I don't think I have ever seen an (authoritative) nameserver that didn't
> rotate the records. Looking around a bit:

you're citing capabilities - I'm citing configurations.

> > If you can make some progress just using the fraction of cases where SRV
> > comes back first (hopefully bundled with the A's) then it's a good thing.

right.. so you require srv to do this client driven load balancing. You get back an A record at T=20ms. Do you wait for a SRV? If so, for how long? (you can query A and SRV in parallel, but not atomically - it just breaks nats).

Status: UNCONFIRMED → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX