Open Bug 1533273 Opened 6 years ago Updated 3 months ago

network.http.speculative-parallel-limit = 0 breaks initial NAT64 connections

Categories

(Core :: Networking, defect, P2)

65 Branch
defect

People

(Reporter: nospam, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged][necko-priority-queue])

Attachments

(3 files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0

Steps to reproduce:

First make sure your internet connection has the following properties:

  • No IPv4 address
  • Globally routable IPv6 address
  • NAT64+DNS64
  1. Create a new Firefox profile
  2. Set network.http.speculative-parallel-limit to 0 in about:config
  3. Try to access a v4 only website (e.g. https://www.westpac.com.au)
  4. Observe it fails to load
  5. Refresh the page
  6. Observe it now loads
  7. Further refreshes also continue to load the page

To speed up reproducing the bug and not have to create a new profile each time, set Firefox to clear caches on shutdown and not remember history (so it acts like incognito mode). Then simply restarting Firefox and accessing the website again is enough to trigger the bug.
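The NAT64+DNS64 precondition above can be checked before testing. A minimal sketch, assuming the network uses a /96 NAT64 prefix and that the resolver synthesizes AAAA records for ipv4only.arpa (the RFC 7050 discovery name); the function names are hypothetical and this is not part of Firefox:

```python
import ipaddress
import socket

# Well-known IPv4 addresses of ipv4only.arpa (RFC 7050).
WELL_KNOWN_V4 = {
    int(ipaddress.IPv4Address("192.0.0.170")),
    int(ipaddress.IPv4Address("192.0.0.171")),
}


def find_nat64_prefix(aaaa_answers):
    """Return the /96 prefix implied by AAAA answers for ipv4only.arpa, or None.

    Assumption: only /96 prefixes are handled (IPv4 address in the low 32 bits).
    """
    for text in aaaa_answers:
        addr = int(ipaddress.IPv6Address(text))
        if addr & 0xFFFFFFFF in WELL_KNOWN_V4:
            # Clear the low 32 bits to recover the prefix.
            return ipaddress.IPv6Network((addr & ~0xFFFFFFFF, 96))
    return None


def resolve_aaaa(host="ipv4only.arpa"):
    """Resolve AAAA records via the system resolver (needs network access)."""
    try:
        infos = socket.getaddrinfo(host, None, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})
```

Calling `find_nat64_prefix(resolve_aaaa())` on the test machine should return a network such as `64:ff9b::/96` when DNS64 is active, and `None` otherwise.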

How might an end user end up setting network.http.speculative-parallel-limit to 0 without touching about:config? Setting networkPredictionEnabled to false in the privacy WebExtensions API does it. So privacy-focused extensions (e.g., uBlock Origin) might trigger it just by being installed.
(Related bug report on uBlock Origin: https://github.com/uBlockOrigin/uBlock-issues/issues/450)

At the time of writing, www.westpac.com.au resolves to 64:ff9b::13.32.161.141 aka 64:ff9b::d20:a18d under NAT64/DNS64.
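For reference, the two notations are the same address: DNS64 places the 32-bit IPv4 address in the low bits of the NAT64 prefix. A small sketch, assuming the well-known /96 prefix 64:ff9b::/96 (the helper name is hypothetical):

```python
import ipaddress

WKP = ipaddress.IPv6Network("64:ff9b::/96")  # RFC 6052 well-known prefix


def synthesize_nat64(ipv4_text, prefix=WKP):
    """Embed an IPv4 address in the low 32 bits of a /96 NAT64 prefix."""
    v4 = ipaddress.IPv4Address(ipv4_text)
    return ipaddress.IPv6Address(int(prefix.network_address) | int(v4))


print(synthesize_nat64("13.32.161.141"))  # 64:ff9b::d20:a18d
```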

Actual results:

Initial access to the v4-only website results in a "Server Not Found" and "Hmm. We’re having trouble finding that site." error.

Wireshark/tcpdump shows that Firefox in fact successfully opens a TCP connection to the website, but then ~4s later shuts it down cleanly without any actual data traffic.

Subsequent loads of the site work fine and the bug no longer appears for the domain.

Expected results:

First load of v4-only websites should work

Component: Untriaged → Networking
Product: Firefox → Core

Hi,

I tried to reproduce this issue in our NAT64+DNS64 environment, but it worked as expected. All connections to v4-only websites are made directly via the provided DNS64 IPv6 address. No timeouts or server errors.

I used Windows 7/10 and Ubuntu 18.04 with a freshly installed Firefox (65.0.2) and added uBlock Origin (which sets the mentioned setting to 0). I tried different v4-only websites, including the one you mentioned above, and had no problems. Even ad blocking from uBlock worked as expected.

Did you try other v4-only sites as well (twitch.tv, nytimes.com), and did you have the same problems there?

Best

So with further testing, it seems to only happen with a certain subset of v4-only sites. It seems like most v4-only sites work, with only a few with the issue.
For example, westpac.com.au and livejournal.com were both able to trigger the issue, but gitlab.com and github.com (and many others) were fine.

TL;DR:

It looks like if network.http.speculative-parallel-limit is set to 0, then for some sites Firefox opens just one TCP session and sends 'GET' after the TCP handshake completes. For other sites, however, it opens two sessions and uses the first one established to send 'GET'. The problem is that while on an IPv4-enabled network both such sessions to a v4-only site are established over IPv4, on a NAT64/DNS64 network FF seems to open one connection over IPv6 and tries to open the second one over IPv4 (!!!), which fails miserably...
Most likely the failure to reproduce is related to the fact that in your case it opens only one connection.

Details:

OK, so here are my findings...
We seem to have two types of IPv4-only sites: "good" ones (FF can open them on the first attempt) and "bad" ones (FF reports an error).

Bad site (in our test):
www.livejournal.com

Good site:
ipv4.google.com

Let's see if there is any difference.

Test 1: Opening the "bad" site (www.livejournal.com) for the first time from v6-only network.

What we see in the tcpdump
tcp handshake to the NAT64 address (64:ff9b::5113:4a03):
SYN, SYN/ACK, ACK - and that's it.

Test 2:
Second attempt to open the site:
This time FF still reports "We have trouble finding this site", but tcpdump shows:
two sessions to tcp/80,
handshake completed, then GET / over one of those, which gets a 301 redirecting to https,
and then one session is opened to 64:ff9b::5113:4a03 port 443.
TCP handshake completed, then TLS handshake - and that's it.

Test 3:
Another attempt to open the site (w/o restarting FF)
Success.
In the capture:
two sessions opened to tcp/443, handshake, all good (the page is loaded)

What is important here: after test #2 (when the 301 was received), my browser already had https://www.livejournal.com, so there was no redirect.

Test 4:
https to the "bad" site for the first time
Again, we see one TCP session to 64:ff9b::5113:4a03 port 443: the TCP handshake is successful, then the TLS exchange follows, but it does not look like the HTTP request was sent, and the browser reports an error.

Test 5:
Good site, opened from IPv6-only network

One TCP session to 64:ff9b::acd9:198e port 80, then after the handshake GET /, receiving a 301, and then opening a session to tcp/443 and loading the page after the TLS handshake.

So the summary of the tests above is:

  • "good" sites: the browser opens a TCP session and (after TCP handshake in case of HTTP and TLS handshake in case of HTTPS) sends GET / request.
  • "bad" sites: the browser opens a TCP session but in case of HTTP nothing happens after TCP handshake (and in case of HTTPS nothing happens after TLS handshake).

Let's do more tests and see how those sites get loaded in dual-stack network.

Test 6:
Good site, on the dual-stack network.

the client opens one TCP session to port 80,
after the handshake sends GET /, receives redirect and opens one session to the tcp/443

Test 7:
Bad site, on the dual-stack network

The client opens two TCP sessions to the server, after the first handshake completes, it sends GET /, and then after receiving the redirect - opens two HTTP sessions.

So, while it still might look like a red herring, for some sites FF opens two TCP sessions on the DS network and for some sites only one. When two sessions are expected on the DS network, on v6-only it opens only one and reports a failure, not even trying to use it.
So might the root cause be the failure to open the second session?

Test 8.
v6-only network, trying to open a "bad" site but with a trick:
I've added a static IPv4 address and the static arp entry for the fake default gateway (again, there is no IPv4 on that network)

furry@Wintermute:~>sudo ifconfig en0 inet 192.0.2.2 netmask 0xffffff00 alias

furry@Wintermute:~>netstat -rn -f inet | grep default
default 192.0.2.1 UGSc 54 0 en0
furry@Wintermute:~>

furry@Wintermute:~>arp -an
? (192.0.2.1) at 98:1:a7:b8:74:ab on en0 permanent [ethernet]

And - magic - in the tcpdump I see two TCP SYNs to port 80 - one over v6, one over v4!!!
And everything works.

Test 8.5
Remove the arp and the static IPv4 address:

furry@Wintermute:~>sudo arp -d 192.0.2.1
192.0.2.1 (192.0.2.1) deleted

furry@Wintermute:~>sudo ifconfig en0 inet 192.0.2.2 netmask 0xffffff00 -alias

Suddenly things stopped working again.

I have no idea why for some sites FF is trying to open two TCP sessions and for some - only one. It does not look like Happy Eyeballs as on the dual-stack network it opens both sessions for v4-only site over IPv4.

Bonus data:
I've compared how "normal" FF (network.http.speculative-parallel-limit set to 6) and "customised" FF (network.http.speculative-parallel-limit set to 0) open "good" and "bad" sites in two setups: an IPv6-only network and a dual-stack network which uses DNS64 and NAT64 (so ideally there should not be any v4 traffic, as all sites have AAAA).
What I'm seeing:

Bad sites:
Dual-stack network with DNS64:
"normal" FF: opens 2 TCP sessions over IPv6 and one session over IPv4 (page loads over v6)
"customised" FF: 1 TCP over IPv6, one TCP over IPv4 (page loads)

V6-Only, DNS64/NAT64:
"normal": opens 3 TCP sessions, all over ipv6 (page loads)
"customised" FF: opens 1 TCP over v6 and page load fails (allegedly because of failure to open a second session over v4)

Good sites:
Dual-stack network with DNS64:
"normal" FF: opens 2 TCP sessions over IPv6
"customised" FF: 1 TCP over IPv6

V6-Only, DNS64/NAT64:
"normal": opens 2 TCP sessions, all over ipv6 (page loads)
"customised" FF: opens 1 TCP over v6

Sorry for the long message ;)
Cheers, Jen

Also, I should probably have clarified: the troubleshooting described above was done on macOS High Sierra (10.13.6), FF 65.0.1 (unlike the original report, which was on Linux).

Hello,

HTTP logs and Wireshark captures attached (b1533273-http-pcap-logs.tgz)

I've got packet capture and HTTP logs (with the default settings for the current log modules) for the following scenarios on IPv6-only NAT64/DNS64 network:

  1. "Good" site (which loads OK) - http://ipv4.google.com
    HTTP log: goodsite-v6only-success.FF.log
    packet capture: goodsite-v6only-success.pcapng

In the capture you can see TCP/80 handshake (packets 3-5), then GET / (packet 6), then HTTP 1.1 302 (packet 8) and then one TCP/443 session opened (packets 10-12) and then TLS and data exchange.

  2. "Bad" site (www.livejournal.com) - first attempt to open it, when the page cannot be loaded.

HTTP log: badsite-v6only-failure.FF.log
packet capture: badsite-v6only-failure.pcapng

The capture shows TCP handshake and then the browser reports a failure.

  3. "Bad" site (www.livejournal.com) - first (failed) attempt to load it, then a second attempt (http OK this time but https fails), and then a 3rd attempt which is successful.

HTTP log: badsite-v6only-fail-then-success.FF.log
Packet capture: badsite-v6only-fail-then-success.pcapng

What we see in the capture:

  • First attempt: TCP/80 handshake (packets 1-3). At that point the browser complains that the site cannot be loaded - like Test #2. Then the connection is closed (packets 4-7).
  • Second attempt to load the site: TCP/80 handshake, then the browser sends GET / HTTP 1.1 and gets a 301 back (packets 8-13). As the redirect points to https://www.livejournal.com, the browser opens a TCP session to port 443 (packets 15-17), but after the TCP and TLS handshakes it still reports the failure.
  • Third attempt: successful (there are 4 more TCP handshakes - packets 86, 90, 161, 164 - not sure which one corresponds to the page being loaded).

  4. While the network is still IPv6-only, I added the IPv4 address 192.0.2.2/24, the fake default gateway 192.0.2.1 and a fake ARP entry for 192.0.2.1.

HTTP log: badsite-v6only-staticv4-success.FF.log
packet capture: badsite-v6only-staticv4-success.pcapng

In this case the site (www.livejournal.com) loads on the very first attempt. The most interesting packet in the capture is packet #2, where the browser tries to open a second TCP/80 connection to the site's IPv4 address (without success, of course, but the page loads over the IPv6 TCP session).

I hope it helps - please let me know if you need more data, looks like it's quite easy to reproduce in my network.

BTW I'm at IETF104 now and I tried the "bad" sites on the IETF NAT64 network, and www.livejournal.com works. This means that "bad" sites are network-specific - I still have no idea why for some sites Firefox opens two parallel connections and for some sites just one. It also means that it might be hard to reproduce in other networks, as we do not know which sites are going to fail in a specific network.

I found a "bad" site which is broken on this network: www.cityrail.com.au
furry@Wintermute:~>dig www.cityrail.com.au a +short
www.sydneytrains.info.
52.62.206.108
52.65.55.15
furry@Wintermute:~>dig www.cityrail.com.au aaaa +short
www.sydneytrains.info.
64:ff9b::343e:ce6c
64:ff9b::3441:370f
furry@Wintermute:~>
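
The AAAA answers above are just the A records embedded in the NAT64 prefix. A quick sketch to cross-check them, assuming a /96 prefix so the IPv4 address sits in the low 32 bits (the helper name is hypothetical):

```python
import ipaddress


def embedded_ipv4(nat64_text):
    """Return the IPv4 address carried in the low 32 bits of a /96 NAT64 address."""
    v6 = ipaddress.IPv6Address(nat64_text)
    return ipaddress.IPv4Address(int(v6) & 0xFFFFFFFF)


for aaaa in ("64:ff9b::343e:ce6c", "64:ff9b::3441:370f"):
    print(aaaa, "->", embedded_ipv4(aaaa))
# 64:ff9b::343e:ce6c -> 52.62.206.108
# 64:ff9b::3441:370f -> 52.65.55.15
```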

Attaching the captures for both tests described below (v6only-anothernet.tgz)

Test 1:
IPv6-only/NAT64 network.

furry@Wintermute:~/tmp/FF-v6only>ifconfig en0
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
ether 98:01:a7:b8:74:ad
inet6 fe80::cae:ec26:7ea7:f396%en0 prefixlen 64 secured scopeid 0x5
inet6 2001:67c:370:1998:92:df1b:7132:26fb prefixlen 64 autoconf secured
inet6 2001:67c:370:1998:1019:de2e:7af7:70cb prefixlen 64 autoconf temporary
inet 169.254.178.230 netmask 0xffff0000 broadcast 169.254.255.255
nd6 options=201<PERFORMNUD,DAD>
media: autoselect
status: active

(so the only IPv4 address is a self-assigned one)
Log file: FF-log-first-fail-then-work.log
Capture file: first-fail-then-succeeds.pcapng
(capture filter expression: (ipv6.addr == 64:ff9b::3441:370f or ipv6.addr == 64:ff9b::343e:ce6c) or (ip.version == 4 and tcp))

Attempt to open www.cityrail.com.au fails (12:58, the first 6 packets in the capture).

Then, at 12:59, I clicked 'refresh' and the page opened.

Test 2: the same network but now I have a faked IPv4 address and faked default gateway added:

furry@Wintermute:~>ifconfig en0
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
ether 98:01:a7:b8:74:ad
inet6 fe80::cae:ec26:7ea7:f396%en0 prefixlen 64 secured scopeid 0x5
inet6 2001:67c:370:1998:92:df1b:7132:26fb prefixlen 64 autoconf secured
inet6 2001:67c:370:1998:1019:de2e:7af7:70cb prefixlen 64 autoconf temporary
inet 192.0.2.2 netmask 0xffff0000 broadcast 192.0.255.255
nd6 options=201<PERFORMNUD,DAD>
media: autoselect
status: active
furry@Wintermute:~>
furry@Wintermute:~>arp -an
? (192.0.2.1) at 98:1:a7:b8:74:ab on en0 permanent [ethernet]
? (192.0.2.1) at (incomplete) on en0 ifscope [ethernet]
furry@Wintermute:~>

Log file: faked-v4-success.log

Capture file: faked_v4_added_success

As expected, FF sends TCP SYNs to the IPv4 address (those SYNs never get answered, obviously) and to the IPv6 address, and then everything works.

This bug was left forgotten because you didn't drop the needinfo flag.

Someone has to look at the logs now, thanks for them.

Flags: needinfo?(nospam) → needinfo?(honzab.moz)

Tentatively assigning to Honza for investigation.

Priority: -- → P2
Whiteboard: [necko-triaged]

this is happy eyeballs related. I was mostly looking into the badsite-v6only-fail-then-success.FF.log and what I see is that when the backup (disabled ipv6) socket fails to resolve (no ipv4 addresses for the host), we take it as the host can't be resolved and push this error on a waiting request (you see "host not found" in UI). but the primary socket is (with a delay ~400ms) able to reach the end server with ipv6, the idle conn is kept up to 5 secs, then killed from our side if no request is made to use it.
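
to illustrate the sequence described above, here is a toy model (not necko code; the event names and the 0 ms failure time of the backup socket are assumptions, the ~400 ms and 5 s figures come from the log):

```python
from collections import namedtuple

IDLE_TIMEOUT_MS = 5000  # "kept up to 5 secs", per the comment above

# socket is "primary" or "backup"; outcome is "connected" or an error label.
SocketEvent = namedtuple("SocketEvent", "at_ms socket outcome")


def first_outcome_wins(events):
    """Toy model: whichever socket finishes first decides the waiting request."""
    timeline = []
    request_pending = True
    for ev in sorted(events, key=lambda e: e.at_ms):
        if request_pending:
            request_pending = False
            if ev.outcome == "connected":
                timeline.append((ev.at_ms, f"request sent on {ev.socket} socket"))
            else:
                timeline.append((ev.at_ms, f"request failed: {ev.outcome}"))
        elif ev.outcome == "connected":
            timeline.append((ev.at_ms, f"{ev.socket} socket connected, no request waiting"))
            timeline.append((ev.at_ms + IDLE_TIMEOUT_MS, f"idle {ev.socket} connection closed"))
    return timeline


events = [
    SocketEvent(0, "backup", "host not found"),   # assumed: fails immediately
    SocketEvent(400, "primary", "connected"),     # ~400 ms, per the log
]
for at_ms, what in first_outcome_wins(events):
    print(f"{at_ms:>5} ms  {what}")
```

this matches the captures: a completed TCP handshake, no request on it, and a clean shutdown a few seconds later.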

so, when an unrecoverable error happens on one of the sockets, we have to ignore it until the second socket is done too. then we have to merge possible errors from both sockets. if both say "host not found", the final error will be "not found". if one socket says "not found" and other "connection reset" then the error is "reset".
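
a rough sketch of that merge rule, only to make the intent concrete (not the actual patch; the string labels and the tie-break between two different specific errors are assumptions):

```python
NOT_FOUND = "host not found"
CONNECTED = "connected"


def merge_socket_results(primary, backup):
    """Combine the final results of both sockets.

    Each argument is CONNECTED, an error label, or None while that socket
    is still in progress. Returns None if the request should keep waiting.
    """
    results = (primary, backup)
    if CONNECTED in results:
        return CONNECTED            # any usable connection wins
    if None in results:
        return None                 # one socket failed, the other is not done yet
    # Both failed: prefer a more specific error over "not found".
    specific = [r for r in results if r != NOT_FOUND]
    # Assumption: if both errors are specific, report the primary socket's.
    return specific[0] if specific else NOT_FOUND


print(merge_socket_results(None, NOT_FOUND))                # None (keep waiting)
print(merge_socket_results(CONNECTED, NOT_FOUND))           # connected
print(merge_socket_results(NOT_FOUND, NOT_FOUND))           # host not found
print(merge_socket_results(NOT_FOUND, "connection reset"))  # connection reset
```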

this is a good catch!
thanks for the logs.

Flags: needinfo?(honzab.moz)
Status: UNCONFIRMED → NEW
Ever confirmed: true

Thanks for the explanation. I'm not sure I understand why the issue happens for some sites only. So is it a matter of latency to the destination over IPv6?
Why does it work after I press 'reload' then?
Just curious ;)

I will try to retest this. This code has changed a lot recently.

Flags: needinfo?(dd.mozilla)

I will not have time to work on this.

Flags: needinfo?(dd.mozilla)
Severity: normal → S3

I've had an issue with this for months and finally decided to figure out what was going on. On some sites images fail to load unless you refresh; sometimes it would take a few refreshes to start working. I found this setting set to 0, and resetting it to 6 has fixed the issue. It's not a network issue, as using Chrome or even wget/curl on the images directly always loaded them properly.

I'm on an IPv4 connection but the sites I have issues with have both IPv4 and IPv6 connectivity.

Flags: needinfo?(kershaw)
Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-next]

Hi Reporters,

As said in comment #14, the happy eyeballs code has been changed a lot recently, so could you try to use the latest Firefox and record a new http log again? Please select Logging to a file and send the file to necko@mozilla.com.

Thanks.

Flags: needinfo?(verm)
Flags: needinfo?(nospam)
Flags: needinfo?(kershaw)
Flags: needinfo?(furry13)

Redirect needinfos that are pending on inactive users to the triage owner.
:edgul, since the bug has recent activity, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(verm)
Flags: needinfo?(furry13)
Flags: needinfo?(edgul)

Unfortunately I no longer have a v6-only network I can easily use while using a browser :(
I talked to furry13@gmail.com two weeks ago and they said they'll take another capture soon™, but it looks like it's not happened yet...

Flags: needinfo?(nospam)

This is already in necko-next

Flags: needinfo?(edgul)
Flags: needinfo?(furry13)

Redirect a needinfo that is pending on an inactive user to the triage owner.
:jesup, since the bug has recent activity, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(furry13) → needinfo?(rjesup)
Flags: needinfo?(rjesup)

I've had a chance to use a v6-only network again, so I've made and attached a capture (both firefox log and pcap) to http://anz.com.au, which presently resolves to 64:ff9b::2d3c:7c2e and 64:ff9b::2d3c:7e2e

The bug still reproduces, with basically the same error as before:

Unable to connect
An error occurred during a connection to anz.com.au.

Whiteboard: [necko-triaged][necko-priority-next] → [necko-triaged][necko-priority-queue]