Open Bug 1700703 Opened 1 year ago Updated 7 months ago

Recent Firefox Nightly with HTTP3 enabled has problems loading HTTPS sites on Cloudflare

Categories

(Core :: Networking: HTTP, defect, P2)

defect

Tracking

()

REOPENED

People

(Reporter: cks+mozilla, Unassigned)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [necko-triaged])

Attachments

(2 files)

For me, recent versions of Firefox Nightly have problems loading HTTPS sites hosted on/through Cloudflare; an example site that's affected for me is https://daringfireball.net/. Attempts to visit such sites either time out or proceed very slowly, often with resources such as CSS only partially loaded or not loaded at all. With mozregression, I have bisected this to changeset 76a285c237e9, "Bug 1699490 - Enable HTTP3 on Nightly and beta". Manually changing network.http.http3.enabled to 'false' on the current Nightly makes the problem go away.

I am using Fedora 33 Linux on x86_64 in Toronto, on a DSL connection with a local MTU of 1436 bytes. A typical request to daringfireball.net gets a cf-ray header of "6351d91fd97b0568-LAX". It's possible that this is related to bug #1699289. A tcpdump of UDP traffic on a connection attempt looks like this:

14:24:59.765741 IP6 [MYIP6] > 2606:4700:137:a360:6f11:807:b007:5395: frag (0|1232) 41495 > https: UDP, length 1337
14:24:59.765750 IP6 [MYIP6] > 2606:4700:137:a360:6f11:807:b007:5395: frag (1232|113)
14:24:59.774372 IP6 2606:4700:137:a360:6f11:807:b007:5395.https > [MYIP6].41495: UDP, length 1200
14:24:59.775114 IP6 2606:4700:137:a360:6f11:807:b007:5395.https > [MYIP6].41495: UDP, length 1200
14:24:59.775475 IP6 [MYIP6] > 2606:4700:137:a360:6f11:807:b007:5395: frag (0|1232) 41495 > https: UDP, length 1337
14:24:59.775483 IP6 [MYIP6] > 2606:4700:137:a360:6f11:807:b007:5395: frag (1232|113)
14:24:59.779861 IP6 2606:4700:137:a360:6f11:807:b007:5395.https > [MYIP6].41495: UDP, length 1196
14:24:59.780017 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 54
14:24:59.780649 IP6 2606:4700:137:a360:6f11:807:b007:5395.https > [MYIP6].41495: UDP, length 1197
14:24:59.780650 IP6 2606:4700:137:a360:6f11:807:b007:5395.https > [MYIP6].41495: UDP, length 463
14:24:59.781720 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 55
14:24:59.784180 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 88
14:24:59.784204 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 68
14:24:59.784323 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 328
14:24:59.806677 IP6 2606:4700:137:a360:6f11:807:b007:5395.https > [MYIP6].41495: UDP, length 1196
14:24:59.806900 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 93
14:24:59.836151 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 52
14:24:59.836174 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 52
14:24:59.858622 IP6 2606:4700:137:a360:6f11:807:b007:5395.https > [MYIP6].41495: UDP, length 1196
14:24:59.858763 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 93
14:24:59.859330 IP6 2606:4700:137:a360:6f11:807:b007:5395.https > [MYIP6].41495: UDP, length 1197
14:24:59.859402 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 93
14:24:59.917643 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 88
14:24:59.917679 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 52
14:24:59.962091 IP6 2606:4700:137:a360:6f11:807:b007:5395.https > [MYIP6].41495: UDP, length 179
14:24:59.962218 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 95
14:24:59.962863 IP6 2606:4700:137:a360:6f11:807:b007:5395.https > [MYIP6].41495: UDP, length 1196
14:24:59.962928 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 94
14:25:00.078264 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 52
14:25:00.078287 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 52
14:25:00.309102 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 88
14:25:00.309223 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 52
14:25:00.768756 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 88
14:25:00.768783 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 52
14:25:01.687462 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 88
14:25:01.687568 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 52
14:25:03.523129 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 52
14:25:03.523242 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 52
14:25:05.419402 IP6 [MYIP6].41495 > 2606:4700:137:a360:6f11:807:b007:5395.https: UDP, length 48

Dragana, this looks like something you'd be interested in.

Flags: needinfo?(dd.mozilla)
Assignee: nobody → dd.mozilla
Blocks: QUIC
Severity: -- → S2
Status: NEW → ASSIGNED
Flags: needinfo?(dd.mozilla)
Priority: -- → P1
Whiteboard: [necko-triaged]

This looks like IP fragmentation and reassembly is going on. That shouldn't happen in v6, but we know it does. What is bad here is that it works enough to get things up and running, but then it seems to break down later. If this is tcpdump on the local system, why is it fragmenting before it leaves the host? 1232 is much shorter than the local MTU. I wonder if the kernel is remembering a previous ICMP PTB that was received and forcing fragmentation.

The awkward thing here is that this is ending up with a timeout rather than a failure to connect or other hard stop. A hard failure would be easier to identify and work around.

This seems to be an example of a network doing exactly what the specification does not want (fragmentation and reassembly only sometimes). It does have the desired effect of breaking QUIC. I'm not sure yet what I think about addressing it though.

Duplicating this to bug 1699289 is probably the right answer either way.

At this stage, I want to make the change to decrease the packet size . Our plans for HTTP/3 may be at risk.

Although my DSL PPPoE link itself claims an MTU of 1436 octets, 'ip route get <cloudflare IP6>' reports a smaller path MTU:

2606:4700:13e:cfbd:5568:9a54:4c04:b93b from :: dev ppp0 src [MYIP6] metric 1024 mtu 1280 pref medium

I'm not sure where this path MTU is coming from (I generated an ip route cached entry by connecting to daringfireball.net in Firefox without HTTP3).

I think this is the same bug. This is also regressed by bug 1699490.

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0
Build ID: 20210325085523

Steps to reproduce

Preparation:

  1. Use a new profile. Enable HTTP/3, DoH (use either provider) in Nightly.
  2. Install uMatrix. Leave settings as default. uMatrix is optional but it amplifies the bug.
  3. Visit https://lihkg.com/category/25 (a Chinese forum with a Cloudflare certificate). At uMatrix toolbar panel, allow ajax.cloudflare.com script to lihkg.com. Reload the tab.

Reproducing the bug:

  1. In the page's left panel, click on a few threads to load them on the right panel. This is to ensure the site works properly and also prepare for the next steps
  2. Click on any loaded thread in left panel. Then do not interact with the page.
  3. Wait for about 1 minute to 1m10s.
  4. Click on another loaded thread in left panel.

Actual result

  1. The forum thread on step 7 does not load. On the right side on right panel, there is a forum notice popping out saying 線已斷, i.e. connection dropped.
  2. If I keep clicking on other threads, more 線已斷 will be shown.

Notes

  1. Without HTTP3 or DoH, I cannot reproduce the bug.
  2. Without uMatrix, 線已斷 only occurs once. Connection is restored at 2nd click.
  3. If I waited for too long at step 6, I cannot reproduce the bug.
  4. After the bug occurs, it will resolve itself after letting the page idle for a few minutes. The bug will surface again if I follow step 5~7 again.
  5. The bug can occur even without the waiting at step 6 but it is random. The STR above reproduces the bug 100% of the time, at least on my system.

Comment #5 is probably a different issue.
I have opened bug 1701110.

Can you make a http log, I want to double check our fallback code.

https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging

please use:
MOZ_LOG=timestamp,rotate:200,nsHttp:5,neqo_transport::*:5

please use a new profile and avoid logging into sites, because log may contain cookies. Also log contain URL you are visiting (i may contain your local IP address as well).
You can also send me the log via email.

Flags: needinfo?(cks+mozilla)

Can you also check if you see the same issue if you visit google.com, or any google property.

This doesn't happen if I visit google.com, maps.google.com, or ipv6.google.com (I checked the latter specifically to make sure I was using IPv6 for a connection that's hopefully HTTP/3). I'll send the log file in email; it comes from starting the latest Firefox Nightly in a clean profile with './firefox https://daringfireball.net/'. This time around the site actually loaded, but after a significant delay that's visible in the log.

Flags: needinfo?(cks+mozilla)

Maybe there are two issues here? I also get timeouts from connections to Cloudflare but I do not see any fragmentation (my pmtu is 1492, the 0rtt packets have 1385).

I've been experiencing the same problems described in comment #0 lately, although my ISP doesn't support IPv6. Let me know if you need more logs to investigate.

Sorry for a late update:

The fragmentation issue was on the server side and that is fix.

The issue from comment 10, I was able to reproduce as well. I am in contact with the CDN to try to debug the issue. I still do not know where the problem is.

Issue from comment 9, I cannot reproduce myself. Firefox receives some packet from the server, but it does not receive 1RTT packets. So the handshake is done from client side and the client sends request, but it does not receive any response from the server. his kind of late failure are hard to detect.
Chris, do you have any firewall? It is also interesting that google works.

Flags: needinfo?(cks+mozilla)

I forgot to write:
in the log from comment 9 google property safebrowsing.googleapis.com fails to finish the handshake so Firefox never uses H3 for safebrowsing.googleapis.com

on the other hand for https://daringfireball.net/ it succeeded to finish the handshake but does not receive any 1RTT packets.

safebrowsing.googleapis.com is blocked in a good way and https://daringfireball.net/ is not completely blocked which makes problems.

I don't have a firewall (and I don't think my ISP has a silent upstream one). My Linux machine directly runs PPPoE for my DSL connection and so sits directly on the public Internet without any general screening or NAT'ing, although I do have some rules to block inbound access to selected sensitive TCP and UDP ports.

Flags: needinfo?(cks+mozilla)
Duplicate of this bug: 1706586

An issue on the server side has been fix.

Chris, can you try to see how https://daringfireball.net/ behaves now?

Flags: needinfo?(cks+mozilla)

Things with daringfireball.net and other sites seem to work fine for me; I haven't noticed any glitches with HTTP3 enabled.

Flags: needinfo?(cks+mozilla)

(In reply to Chris Siebenmann from comment #17)

Things with daringfireball.net and other sites seem to work fine for me; I haven't noticed any glitches with HTTP3 enabled.

Thanks!

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED

This Firefox/Cloudflare HTTP/3 issue has resurfaced for me recently, starting last Friday or so. It affects Firefox 88 and Firefox 89, as well as Nightly, and sites such as https://prometheus.io/ (where I first noticed it). I have corresponded with some Cloudflare people about this, and they had me collect a Firefox network log with (if I did it right) "timestamp,sync,nsUDPSocket:5,neqo_transport::*:5"; I will attach that here.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

I've also seen some stalls recently that seem like this very bug again.

Dragana, do we need to reach out to cloudflare again?

Flags: needinfo?(dd.mozilla)

(In reply to Chris Siebenmann from comment #19)

This Firefox/Cloudflare HTTP/3 issue has resurfaced for me recently, starting last Friday or so. It affects Firefox 88 and Firefox 89, as well as Nightly, and sites such as https://prometheus.io/ (where I first noticed it). I have corresponded with some Cloudflare people about this, and they had me collect a Firefox network log with (if I did it right) "timestamp,sync,nsUDPSocket:5,neqo_transport::*:5"; I will attach that here.

Can you also add nsHttp5

timestamp,sync,nsHttp:5,nsUDPSocket:5,neqo_transport::*:5"

Thanks!

Flags: needinfo?(dd.mozilla)

Sorry, (just back from PTO) I have not notice that there is a log attached. I will first look at that log and I will let you know if we need nsHttp log as well.
Thanks.

(In reply to Julien Cristau [:jcristau] from comment #22)

Dragana, do we need to reach out to cloudflare again?

Cloudflare is aware of this issue. Thanks.

There are 3 QUIC connection in the log:

  1. uses draft-29 this is not cloudflare (this is my assumption, because I do not see URLs)
  2. uses draft-27 and successfully connects but it is not used for any request.
  3. this is interesting one:
  • the connection succeeds to connect (the client also receives HandshakeDone frame)
  • a lot of packets are lost: a some point the client receives an ack that only acks packets 28, 12, 9, 8, 7, and 3, other packets are lost. In the other direction no packet has been lost

This looks like a network problem , the network is dropping QUIC packets (I assume that TCP connection works fine).

If you make a log with timestamp,sync,nsHttp:5,nsUDPSocket:5,neqo_transport::*:5, I can double check that nothing strange is happening between getting he packet from neqo and sending to a UDP socket. I do not expect, since the problem is reproducible in Firefox 88 and the code has not change for some months.

I will also check with Cloudflare.

I took a log but it came out too large to upload once I let it run through to finishing and/or time out, so I've put this up at https://www.cs.toronto.edu/~cks/vendors/mozilla/log.txt-main.1932881.moz_log instead. This is done for daringfireball.net instead of prometheus.io, because something seems to have fixed the latter and it no longer reproduces there (prometheus.io is still Cloudflare and still seems to be offering HTTP/3, so I don't know what's going on).

I also made a log with dash.cloudflare.com which came out huge. I compressed it down to 18M and sent to Dragana.

The issue I saw is also caused by IPv6 fragmentation. Just before my comment I added the IPv6 tunnel from HE.net which has 1280 bytes MTU to my router.
I guess unless PMTUD is implemented (bug 1699289) I cannot use the IPv6 tunnel. I hope most native IPv6 has larger MTU so this is not an issue for them.

Correction. The HE tunnel actually can support up to 1480 bytes MTU but the openwrt's default MTU is 1280 for static tunnels. I'm going to change it to 1480 and see if the problem goes away.

I have this issue as well with HTTP3 connections since around the time of reopening this bug. Is any progress being made here? Do you need more logs?

Setting "network.dns.disableIPv6" to "true" fixes the issue for me. I dont have IPv6 connectivity. I have DoH enabled in Firefox.

This bug already covers a couple of issues. Comment 33 is a different issue and from he description it should just work.
mozbugz, can you make a http log? Is it only Cloudflare affected or google as well?

Flags: needinfo?(bugsgalore)

It affects several HTTP3 sites, hosted by cloudflare, cdns, google.com. I further narrowed the issue down to setting "network.dns.echconfig.enabled" to "true".

"network.dns.echconfig.enabled" "true" and "network.dns.disableIPv6" "true" no issue.

"network.dns.disableIPv6" "false" and "network.dns.echconfig.enabled" "false" no issue.

Sent log via email to :dragana with settings "network.dns.echconfig.enabled" "true" and "network.dns.disableIPv6" "false" and issue manifesting itself.

Flags: needinfo?(bugsgalore)

(In reply to mozbugz from comment #35)

It affects several HTTP3 sites, hosted by cloudflare, cdns, google.com. I further narrowed the issue down to setting "network.dns.echconfig.enabled" to "true".

"network.dns.echconfig.enabled" "true" and "network.dns.disableIPv6" "true" no issue.

"network.dns.disableIPv6" "false" and "network.dns.echconfig.enabled" "false" no issue.

Sent log via email to :dragana with settings "network.dns.echconfig.enabled" "true" and "network.dns.disableIPv6" "false" and issue manifesting itself.

Thank you for this investigation. his is very helpful.

I have open a separate bug for the particular issue from comment 35 -> Bug 1724240

I have the same problem as OP with that website and some other websites on Cloudflare and IPv6 (but not all websites on Cloudflare and IPv6). I'll post my findings. BTW, I'm in Europe with a different ISP than OP.

This problem affects Firefox stable on Windows and Android.

This also affects Chrome on Windows, but in a different way: Chrome, unlike Firefox, always loads these websites completely, but it takes too long, around 4 seconds or so. When it finally loads them, it's always over HTTP/2, not HTTP/3 (looking at the Network tab in inspector/F12).
Chrome on Android seems unaffected.

Overall, I found that this problem has three requirements:

  1. IPv6 enabled
  2. HTTP3/QUIC enabled
  3. Using my home ISP

If any of those is disabled/changed, the problem goes away. If I switch to my mobile ISP, the problem goes away.

Also, with my home ISP, on Windows, I can ping one of those Cloudflare website IPs with a max of 1232 bytes, anything over results in "Request timed out" (for example: ping 2606:4700:e4::ac40:a61c -l 1232).
With my mobile ISP, still on Windows, I can ping it successfully even with 65500 bytes aka the max allowed by the test.

So it looks like it's the fault of the ISP and some MTU limits (?), but I'd like to know for sure.

For those with small MTUs, see also bug 1734110.

Our users have also been experiencing intermittent problems over the past few months with pages not loading and it is getting worse. We are coming to the conclusion that it is since Cloudflare enabled HTTP/3 at the end of May 2021.

But more specifically, using <domain>/cdn-cgi/trace on the url for the site, the Cloudflare issue appears to be almost always when http=http/3 and colo=AMS (Amsterdam) and it does affect our users in The Netherlands more than others. colo=LHR or MAN (London or Manchester) don't have the problem.

My mobile phone ISP (EE) routes via AMS so I can switch between Cloudflare sites to see the problem (although it doesn't happen all the time even via AMS).

Firefox, Chrome and Edge all have problems but in different ways. By disabling http/3 in each of them the problem appears to have stopped although I cannot be 100% sure yet.

I am not a network engineer, but could Cloudflare Amsterdam have some faulty, unpatched or under-sized hardware which is causing the problem?

Assignee: dd.mozilla → nobody
Priority: P1 → P2
You need to log in before you can comment on or make changes to this bug.