+++ This bug was initially created as a clone of Bug #621558 +++
At the moment, many dual-stack environments are wholly or partially broken, presenting problems for seamless IPv6 deployment. Long timeouts are bad for users and make the system appear broken.
http://tools.ietf.org/html/draft-wing-http-new-tech-00 "Happy Eyeballs: Successful Introduction of New Technology to HTTP" suggests overlapping the attempts to resolve names for, and open IPv4 and IPv6 sockets to, HTTP-like network endpoints, as a way to autodetect IPv6 operation in a dual-stack environment without significantly degrading performance if that environment is broken.
Support for this, or something similar, in the core networking code would greatly ease the development of practical IPv6 interoperability in Mozilla products. The Internet-Draft describes a number of heuristics that may be possible to improve upon over time -- if this code were to be implemented centrally, this would also provide a good place to implement these new heuristics.
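To illustrate why the overlapped attempts help, here is a toy model of the race (the names, latencies, and the 300 ms head start are illustrative assumptions, not from the draft): the preferred family starts connecting immediately, the fallback starts after a short head start, and whichever completes first wins. The user-visible delay is bounded by the faster path plus the head start, instead of the full connect timeout of the broken path.

```cpp
// Toy model of the Happy Eyeballs race. Not real socket code; it only
// computes which attempt finishes first given hypothetical connect times.
struct Outcome {
    bool preferredWins;   // did the preferred (e.g. IPv6) attempt win?
    long completionMs;    // when the winning attempt completed
};

// The preferred family starts at t=0; the fallback starts after headStartMs.
Outcome Race(long preferredConnectMs, long fallbackConnectMs, long headStartMs) {
    long preferredDone = preferredConnectMs;
    long fallbackDone = headStartMs + fallbackConnectMs;
    if (preferredDone <= fallbackDone)
        return { true, preferredDone };
    return { false, fallbackDone };
}
```

For example, with a broken IPv6 path that would take 20 seconds to time out and a working IPv4 path that connects in 30 ms, a 300 ms head start means the page loads after roughly 330 ms rather than 20 seconds.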
I'll take a look at this.
It refers to what Firefox and Chrome have implemented.
Interesting. The draft above talks about calling getaddrinfo() to get all the IP addresses. But the same flaky-connectivity problem that can cause problems with TCP can also cause problems with DNS.
Perhaps DNS requests should be issued simultaneously via IPv4 and IPv6, letting the two race against one another, and possibly also act as a hint regarding the quality of IPv6 connectivity?
And then there's the case of TCP-based DNS lookups, which will provide yet another avenue for timeouts...
Unfortunately, we have no way to know which address the system prefers without calling getaddrinfo(), unless the system provides a way to get a prefix policy table.
DNS query response time is affected by dropped DNS query (or response) packets, and by whether the DNS server has cached the answer (that is, whether someone else already asked for that A or AAAA record). It isn't a useful indicator of the viability of the IPv6 path to the server, which is what we really want to know. Also, if we don't use getaddrinfo() and instead craft separate A and AAAA queries inside the application, it's impossible to follow the host's IP address preference -- unless you're on Windows, which provides an API to have the OS re-order IPv6 and IPv4 addresses into the host's preferred order. Perhaps other OSes provide a similar API; if they do, that would be a reasonable solution.
Remember that the question being asked (A vs AAAA) and the IP version of the DNS packets are orthogonal, so you can't reliably infer the quality of IPv6 connectivity just from the DNS lookup response time. For example, A lookups can be made by the resolver over IPv6 or IPv4, and AAAA lookups can be made over either IP version as well. The IP version from the browser to the resolver and the IP version from the resolver to the various authorities may also be different. (The IPv6/IPv4 performance from the browser to the resolver may also not be indicative of general connectivity, as that may be over localhost or over a LAN, while IPv6 network brokenness commonly exists when going beyond the LAN -- especially in cases such as a home gateway that provides dual-stack IPv6 within the home but then uses flaky 6to4 to reach the world.)
If we assume that the browser-to-resolver connectivity is good (since introducing IPv6 sites doesn't change anything here), then this becomes an issue for the resolver. Many resolvers are fairly good about failing over between DNS authorities, so they should be able to cope when some authorities have unreachable AAAA records for some of their NS records.
As such, this comes down to whether getaddrinfo() issues A and AAAA requests in-parallel (which is preferable from a performance perspective) as well as how the OS getaddrinfo() handles fail-over between multiple configured resolvers (which is independent of IPv4 vs IPv6).
At least both connections can share one getaddrinfo() result.
1. Call getaddrinfo() via nsHostResolver::ResolveHost() with address family AF_UNSPEC. It will return addresses in the system preferred order.
2. Pick the first address from the result.
3. Make a primary connection and start a backup timer.
4. If the primary connection has been established before the timer fires, cancel the timer and skip all remaining steps.
5. When the backup timer fires, call nsHostResolver::ResolveHost() with the same address family and flags as step 1. It will return the result from our internal cache without sending DNS queries.
6.1. If there's only one address in the result, choose that address regardless of the address family and go to step 7.
6.2. Search for IPv4 addresses in the result. If no IPv4 address is found, choose the second address from the list regardless of family and go to step 7. If there's only one IPv4 address in the result, choose that address and go to step 7.
6.3. If the first IPv4 address is at the top of the result (i.e., it is used by the primary connection), choose the second IPv4 address from the result and go to step 7.
6.4. Otherwise, choose the first IPv4 address from the result.
7. Make a backup connection using the chosen address.
8. When either connection is established, abort the other one.
Ugh, the above algorithm didn't consider the case when the system prefers IPv4 addresses. Read "IPv4" as "other address family".
Yeah, I was just writing about that. Easily fixed. Instead of "search for the first IPv4 address" in Steps 6.2 and 6.3, it needs to be "search for the 'other' IP address family from the first result".
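Incorporating that correction, the address-selection logic (steps 6.1-6.4) can be sketched as a small helper. This is an illustrative sketch, not the actual patch: `Family`, `PickBackupAddressIndex`, and the vector standing in for the getaddrinfo() result are all hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the address family of each entry in the getaddrinfo()
// result; real code would look at AF_INET / AF_INET6.
enum Family { V4, V6 };

// 'addrs' is the getaddrinfo() result in the system's preferred order;
// addrs[0] is the address the primary connection used. Returns the index
// of the address the backup connection should use.
std::size_t PickBackupAddressIndex(const std::vector<Family>& addrs) {
    // 6.1: only one address -> use it regardless of family.
    if (addrs.size() == 1)
        return 0;
    // 6.2-6.4: prefer the first address whose family differs from the
    // primary's. The primary used addrs[0], so start searching at index 1.
    for (std::size_t i = 1; i < addrs.size(); ++i) {
        if (addrs[i] != addrs[0])
            return i;
    }
    // No other-family address in the list: fall back to the second address.
    return 1;
}
```

Note that with the "other address family" correction, the special case in step 6.3 disappears: the other-family address can never be at index 0, because "other" is defined relative to the primary's family at index 0.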
Created attachment 569507 [details] [diff] [review]
Based on https://bugzilla.mozilla.org/show_bug.cgi?id=679090#c9 and https://bugzilla.mozilla.org/show_bug.cgi?id=621558#c16 this patch does:
- starts the backup timer in the CONNECTING_TO notification, which happens after DNS resolution; this is done just once, although there may be more CONNECTING_TO notifications due to recovery
- cancels the backup timer in the CONNECTED_TO notification; after that there is no need for a backup connection
- when a host cannot be resolved and the DISABLE_IPV6 flag had been set on the transport, drop that flag and try DNS resolution again
- when we have tried all IPs and the DISABLE_IPV6 flag had been set, drop that flag and try DNS resolution again
The first two fixes also mitigate an issue with too many unused connections being opened; I discovered that some time ago during the pipelining review (telemetry for that is missing right now...)
The remaining parts could be optimized: if there was no IPv4 address in the primary transport's list, bypass the second IPv4-including pass. That is vague and racy, but may save some resources. The primary goal is to let an IPv4 transport fall back to IPv4/IPv6 again when it is worthwhile.
Also, there is room to cache the information we obtain about the host. But I'd rather do that separately and stick to the happy-eyeballs spec here.
I'm thinking about preserving NS_ABORT_IF_FALSE(out == mStreamOut, "timer for non existant stream") somehow, if it is still applicable.
Pushed to try: https://tbpl.mozilla.org/?tree=Try&rev=76ebad540b33
The tryserver build seems to work for me. However, my IPv6 connection may be too fast to reproduce bug 679090. My ISP provides a native IPv6 connection along with IPv4.
Eduardo, could you confirm that this patch wouldn't regress the issue?
Created attachment 569768 [details] [diff] [review]
- fixed missing check for !mSynTimer before call to SetupBackupTimer
- added logs
Is there a place I can test a build with this patch without building my own from source? I tried the tbpl link above but it says "Loading failed: error".
(In reply to Lorenzo Colitti from comment #12)
> Is there a place I can test a build with this patch without building my own
> from source? I tried the tbpl link above but it says "Loading failed: error".
Here it should be:
The previous one was probably wiped out, there were some problems with the Try server recently.
I'll get to this review Monday or earlier.
Comment on attachment 569768 [details] [diff] [review]
Review of attachment 569768 [details] [diff] [review]:
The nsHttpConnectionMgr changes are good; I see the problem they solve. What do you think about landing them as a separate (first) patch so if the ipv6 specific stuff needs to be backed out we can keep the timer changes?
And the whole patch is fine (and smaller than I thought), so r=mcmanus
It took me a little while to figure out the root problem being solved from biesi's original patch - can you confirm that christian's patch disabled ipv6 on backup connections and therefore a backup connection made to a v6-only host (as opposed to a dual stack host) always failed. nsHalfOpenSocket::OnOutputStreamReady() passes any connection result (good or bad) through to the http stack - so these failed connections were actually being used (this is the part that wasn't obvious to my memory) and resulted in page load errors. Your patch makes the backup connection prefer v4, but still do v6 if the v4 path doesn't connect. So the backup connections will still work with v6 only stacks. Do I have that right?
@@ +1618,5 @@
> + case nsISocketTransport::STATUS_CONNECTING_TO:
> + // Passed DNS resolution, now trying to connect, start the backup timer
> + // only prevent creating another backup transport
> + if (!mBackupTransport && !mSynTimer)
> + SetupBackupTimer();
this is a good change
(In reply to Patrick McManus from comment #15)
> Comment on attachment 569768 [details] [diff] [review]
> Review of attachment 569768 [details] [diff] [review]:
> What do you think about landing them as a separate (first) patch so if the ipv6
> specific stuff needs to be backed out we can keep the timer changes?
Yes, I'll split the patch in two.
> It took me a little while to figure out the root problem being solved from
> biesi's original patch - can you confirm that christian's patch disabled
> ipv6 on backup connections and therefore a backup connection made to a
> v6-only host (as opposed to a dual stack host) always failed.
> nsHalfOpenSocket::OnOutputStreamReady() passes any connection result (good
> or bad) through to the http stack - so these failed connections were
> actually being used (this is the part that wasn't obvious to my memory) and
> resulted in page load errors. Your patch makes the backup connection prefer
> v4, but still do v6 if the v4 path doesn't connect. So the backup
> connections will still work with v6 only stacks. Do I have that right?
That's exactly it. I thought I had written that in the comment with the patch... But this is better; this way you've double-confirmed that the solution is correct.
Comment on attachment 569768 [details] [diff] [review]
Landed as two parts:
Unless you need it for your tracking purposes, the [inbound] status whiteboard entry is not mandatory anymore (we'd prefer not having to handle it, indeed).
*** Bug 669159 has been marked as a duplicate of this bug. ***