
Updated

•

4 years ago

Assignee: nobody → dd.mozilla

Blocks: QUIC

Severity: -- → S2

Status: NEW → ASSIGNED

Flags: needinfo?(dd.mozilla)

Priority: -- → P1

Whiteboard: [necko-triaged]

Martin Thomson [:mt:]

Comment 2

•

4 years ago

This looks like IP fragmentation and reassembly is going on. That shouldn't happen in v6, but we know it does. What is bad here is that it works enough to get things up and running, but then it seems to break down later. If this is tcpdump on the local system, why is it fragmenting before it leaves the host? 1232 is much shorter than the local MTU. I wonder if the kernel is remembering a previous ICMP PTB that was received and forcing fragmentation.

The awkward thing here is that this is ending up with a timeout rather than a failure to connect or other hard stop. A hard failure would be easier to identify and work around.

This seems to be an example of a network doing exactly what the specification does not want (fragmentation and reassembly only sometimes). It does have the desired effect of breaking QUIC. I'm not sure yet what I think about addressing it though.

Duplicating this to bug 1699289 is probably the right answer either way.
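
For reference, the 1232 figure is what remains of the IPv6 minimum MTU (1280 bytes) after the 40-byte IPv6 header and the 8-byte UDP header. A minimal sketch of that arithmetic; the function name is illustrative only, and it assumes no IPv6 extension headers are present:

```python
IPV6_HEADER = 40  # fixed IPv6 header size in bytes
UDP_HEADER = 8    # UDP header size in bytes


def max_udp_payload(mtu: int) -> int:
    """Largest UDP payload that fits in one unfragmented IPv6 packet.

    Assumes no IPv6 extension headers (an assumption of this sketch).
    """
    if mtu < 1280:
        raise ValueError("IPv6 requires a link MTU of at least 1280 bytes")
    return mtu - IPV6_HEADER - UDP_HEADER


print(max_udp_payload(1280))  # 1232
print(max_udp_payload(1500))  # 1452
```

A 1232-byte datagram therefore already fits the smallest MTU that IPv6 permits, which is why fragmentation on the sending host is unexpected.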

Comment 3

•

4 years ago

At this stage, I want to make the change to decrease the packet size. Our plans for HTTP/3 may be at risk.

Reporter

Comment 4

•

4 years ago

Although my DSL PPPoE link itself claims an MTU of 1436 octets, 'ip route get <cloudflare IP6>' reports a smaller path MTU:

2606:4700:13e:cfbd:5568:9a54:4c04:b93b from :: dev ppp0 src [MYIP6] metric 1024 mtu 1280 pref medium

I'm not sure where this path MTU is coming from (I generated an ip route cached entry by connecting to daringfireball.net in Firefox without HTTP3).
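
A small sketch for pulling the cached path MTU out of a line like the one above, for anyone who wants to compare several destinations. The helper name is hypothetical, the sample address 2001:db8::1 is a documentation placeholder standing in for the redacted source address, and the optional "lock" keyword is handled on the assumption that some iproute2 output prints it before the value:

```python
import re


def route_mtu(route_line: str):
    """Return the MTU reported in one line of `ip route get` output, or None."""
    match = re.search(r"\bmtu\s+(?:lock\s+)?(\d+)\b", route_line)
    return int(match.group(1)) if match else None


# Sample line modelled on the output above; the src address is a placeholder.
line = ("2606:4700:13e:cfbd:5568:9a54:4c04:b93b from :: dev ppp0 "
        "src 2001:db8::1 metric 1024 mtu 1280 pref medium")
print(route_mtu(line))  # 1280
```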

Fanolian

Comment 5

•

4 years ago

I think this is the same bug. This is also regressed by bug 1699490.

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0
Build ID: 20210325085523

Steps to reproduce

Preparation:

1. Use a new profile. Enable HTTP/3 and DoH (use either provider) in Nightly.
2. Install uMatrix. Leave settings as default. uMatrix is optional but it amplifies the bug.
3. Visit https://lihkg.com/category/25 (a Chinese forum with a Cloudflare certificate). In the uMatrix toolbar panel, allow the ajax.cloudflare.com script for lihkg.com. Reload the tab.

Reproducing the bug:

4. In the page's left panel, click on a few threads to load them in the right panel. This is to ensure the site works properly and also to prepare for the next steps.
5. Click on any loaded thread in the left panel. Then do not interact with the page.
6. Wait for about 1 minute to 1m10s.
7. Click on another loaded thread in the left panel.

Actual result

The forum thread in step 7 does not load. On the right side of the right panel, a forum notice pops up saying 線已斷, i.e. "connection dropped".
If I keep clicking on other threads, more 線已斷 notices are shown.

Notes

Without HTTP3 or DoH, I cannot reproduce the bug.
Without uMatrix, 線已斷 only occurs once. The connection is restored on the 2nd click.
If I wait too long at step 6, I cannot reproduce the bug.
After the bug occurs, it resolves itself after letting the page idle for a few minutes. The bug will surface again if I follow steps 5 to 7 again.
The bug can occur even without the waiting at step 6, but then it is random. The STR above reproduces the bug 100% of the time, at least on my system.

Comment 6

•

4 years ago

Comment #5 is probably a different issue.
I have opened bug 1701110.

https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging

Comment 7

•

4 years ago

Can you make an HTTP log? I want to double-check our fallback code.

Please use:
MOZ_LOG=timestamp,rotate:200,nsHttp:5,neqo_transport::*:5

Please use a new profile and avoid logging into sites, because the log may contain cookies. The log also contains the URLs you are visiting (it may contain your local IP address as well).
You can also send me the log via email.
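
One way to start Firefox with these settings is to set the variables in the process environment before launching. The sketch below is only an illustration: the helper names, the binary path, and the log file location are assumptions, and MOZ_LOG_FILE is the companion variable described on the HTTP logging page linked above.

```python
import os
import subprocess

# Module list taken verbatim from the request above.
MOZ_LOG_MODULES = "timestamp,rotate:200,nsHttp:5,neqo_transport::*:5"


def logging_env(log_file: str, base_env=None) -> dict:
    """Copy an environment and add the Firefox logging variables to it."""
    env = dict(os.environ if base_env is None else base_env)
    env["MOZ_LOG"] = MOZ_LOG_MODULES
    env["MOZ_LOG_FILE"] = log_file  # where Firefox should write the log
    return env


def launch(firefox_path: str, url: str, log_file: str) -> subprocess.Popen:
    """Start Firefox on `url` with logging enabled (paths are caller-supplied)."""
    return subprocess.Popen([firefox_path, url], env=logging_env(log_file))
```

For example, `launch("./firefox", "https://daringfireball.net/", "/tmp/log.txt")` would start a local build with logging on; both paths here are placeholders.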

Flags: needinfo?(cks+mozilla)

Comment 8

•

4 years ago

Can you also check whether you see the same issue if you visit google.com or any other Google property?

Updated

•

4 years ago

Blocks: 1665621

Jan Alexander Steffens [:heftig]

Reporter

Comment 9

•

4 years ago

This doesn't happen if I visit google.com, maps.google.com, or ipv6.google.com (I checked the latter specifically to make sure I was using IPv6 for a connection that's hopefully HTTP/3). I'll send the log file in email; it comes from starting the latest Firefox Nightly in a clean profile with './firefox https://daringfireball.net/'. This time around the site actually loaded, but after a significant delay that's visible in the log.

Flags: needinfo?(cks+mozilla)

Comment 10

•

4 years ago

Maybe there are two issues here? I also get timeouts from connections to Cloudflare but I do not see any fragmentation (my pmtu is 1492, the 0rtt packets have 1385).

Selim Şumlu

Comment 11

•

4 years ago

I've been experiencing the same problems described in comment #0 lately, although my ISP doesn't support IPv6. Let me know if you need more logs to investigate.

Comment 12

•

4 years ago

Sorry for a late update:

The fragmentation issue was on the server side and that is fixed.

I was able to reproduce the issue from comment 10 as well. I am in contact with the CDN to try to debug it. I still do not know where the problem is.

I cannot reproduce the issue from comment 9 myself. Firefox receives some packets from the server, but it does not receive 1RTT packets. So the handshake is done from the client side and the client sends the request, but it does not receive any response from the server. This kind of late failure is hard to detect.
Chris, do you have any firewall? It is also interesting that Google works.

Flags: needinfo?(cks+mozilla)

Comment 13

•

4 years ago

I forgot to write:
In the log from comment 9, the Google property safebrowsing.googleapis.com fails to finish the handshake, so Firefox never uses H3 for safebrowsing.googleapis.com.

On the other hand, for https://daringfireball.net/ the handshake finishes successfully, but Firefox does not receive any 1RTT packets.

safebrowsing.googleapis.com is blocked in a good way (it fails cleanly), while https://daringfireball.net/ is not completely blocked, which causes problems.

Reporter

Comment 14

•

4 years ago

I don't have a firewall (and I don't think my ISP has a silent upstream one). My Linux machine directly runs PPPoE for my DSL connection and so sits directly on the public Internet without any general screening or NAT'ing, although I do have some rules to block inbound access to selected sensitive TCP and UDP ports.

Flags: needinfo?(cks+mozilla)

Comment 16

•

4 years ago

An issue on the server side has been fixed.

Chris, can you try to see how https://daringfireball.net/ behaves now?

Flags: needinfo?(cks+mozilla)

Reporter

Comment 17

•

4 years ago

Things with daringfireball.net and other sites seem to work fine for me; I haven't noticed any glitches with HTTP3 enabled.

Flags: needinfo?(cks+mozilla)

Julien Cristau [:jcristau]

Comment 18

•

4 years ago

(In reply to Chris Siebenmann from comment #17)

Things with daringfireball.net and other sites seem to work fine for me; I haven't noticed any glitches with HTTP3 enabled.

Thanks!

Status: ASSIGNED → RESOLVED

Closed: 4 years ago

Resolution: --- → FIXED

Updated

•

4 years ago

status-firefox89: affected → ---

Reporter

Comment 19

•

4 years ago

This Firefox/Cloudflare HTTP/3 issue has resurfaced for me recently, starting last Friday or so. It affects Firefox 88 and Firefox 89, as well as Nightly, and sites such as https://prometheus.io/ (where I first noticed it). I have corresponded with some Cloudflare people about this, and they had me collect a Firefox network log with (if I did it right) "timestamp,sync,nsUDPSocket:5,neqo_transport::*:5"; I will attach that here.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Jan Alexander Steffens [:heftig]

Reporter

Comment 20

•

4 years ago

Attached file log.txt-main.886416.moz_log — Details

Comment 21

•

4 years ago

I've also seen some stalls recently that seem like this very bug again.

Julien Cristau [:jcristau]

Comment 22

•

4 years ago

Dragana, do we need to reach out to cloudflare again?

Flags: needinfo?(dd.mozilla)

Comment 23

•

4 years ago

(In reply to Chris Siebenmann from comment #19)

This Firefox/Cloudflare HTTP/3 issue has resurfaced for me recently, starting last Friday or so. It affects Firefox 88 and Firefox 89, as well as Nightly, and sites such as https://prometheus.io/ (where I first noticed it). I have corresponded with some Cloudflare people about this, and they had me collect a Firefox network log with (if I did it right) "timestamp,sync,nsUDPSocket:5,neqo_transport::*:5"; I will attach that here.

Can you also add nsHttp:5?

"timestamp,sync,nsHttp:5,nsUDPSocket:5,neqo_transport::*:5"

Thanks!

Flags: needinfo?(dd.mozilla)

Comment 24

•

4 years ago

•

Edited

Sorry (just back from PTO), I had not noticed that there is a log attached. I will look at that log first and let you know if we need an nsHttp log as well.
Thanks.

Comment 25

•

4 years ago

(In reply to Julien Cristau [:jcristau] from comment #22)

Dragana, do we need to reach out to cloudflare again?

Cloudflare is aware of this issue. Thanks.

Comment 26

•

4 years ago

There are 3 QUIC connections in the log:

1. The first uses draft-29 and is not Cloudflare (this is my assumption, because I do not see URLs).
2. The second uses draft-27 and connects successfully, but it is not used for any request.
3. The third is the interesting one:
   - The connection is established successfully (the client also receives a HandshakeDone frame).
   - A lot of packets are lost: at some point the client receives an ACK that only acknowledges packets 28, 12, 9, 8, 7, and 3; the other packets are lost. In the other direction no packet has been lost.

This looks like a network problem: the network is dropping QUIC packets (I assume that TCP connections work fine).
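
To give a sense of the scale of the loss, here is a rough sketch that lists the packet numbers missing from such an ACK. It rests on assumptions that the log summary above does not confirm: that all of these packets belong to one packet number space, that numbering starts at 0, and that 28 was the largest packet sent at that point. The function name is illustrative.

```python
def unacked_packets(acked, largest_sent, first_sent=0):
    """Packet numbers in [first_sent, largest_sent] that were not acknowledged."""
    acked = set(acked)
    return [pn for pn in range(first_sent, largest_sent + 1) if pn not in acked]


# Packet numbers acknowledged in the ACK described above.
acked = [28, 12, 9, 8, 7, 3]
missing = unacked_packets(acked, largest_sent=28)
print(len(missing))  # 23
```

Under those assumptions, 23 of the 29 packets sent by the client went unacknowledged.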

If you make a log with timestamp,sync,nsHttp:5,nsUDPSocket:5,neqo_transport::*:5, I can double-check that nothing strange is happening between getting the packet from neqo and sending it to a UDP socket. I do not expect that to be the case, since the problem is reproducible in Firefox 88 and the code has not changed for some months.

I will also check with Cloudflare.

Reporter

Comment 27

•

4 years ago

I took a log but it came out too large to upload once I let it run through to finishing and/or time out, so I've put this up at https://www.cs.toronto.edu/~cks/vendors/mozilla/log.txt-main.1932881.moz_log instead. This is done for daringfireball.net instead of prometheus.io, because something seems to have fixed the latter and it no longer reproduces there (prometheus.io is still Cloudflare and still seems to be offering HTTP/3, so I don't know what's going on).

Comment 28

•

4 years ago

I also made a log with dash.cloudflare.com, which came out huge. I compressed it down to 18M and sent it to Dragana.

Comment 29

•

4 years ago

The issue I saw is also caused by IPv6 fragmentation. Just before my comment I added the IPv6 tunnel from HE.net, which has a 1280-byte MTU, to my router.
I guess that unless PMTUD is implemented (bug 1699289) I cannot use the IPv6 tunnel. I hope most native IPv6 connections have a larger MTU, so this is not an issue for them.

Comment 30

•

4 years ago

Attached file quic-ipv6-tunnelbroker-cf.pcapng — Details

Comment 31

•

4 years ago

Correction. The HE tunnel can actually support an MTU of up to 1480 bytes, but OpenWrt's default MTU is 1280 for static tunnels. I'm going to change it to 1480 and see if the problem goes away.

mozbugz

Comment 32

•

4 years ago

I have this issue as well with HTTP3 connections since around the time of reopening this bug. Is any progress being made here? Do you need more logs?

mozbugz

Comment 33

•

4 years ago

Setting "network.dns.disableIPv6" to "true" fixes the issue for me. I don't have IPv6 connectivity. I have DoH enabled in Firefox.

Comment 34

•

4 years ago

This bug already covers a couple of issues. Comment 33 is a different issue, and from the description it should just work.
mozbugz, can you make an HTTP log? Is only Cloudflare affected, or Google as well?

Flags: needinfo?(bugsgalore)

mozbugz

Comment 35

•

4 years ago

It affects several HTTP3 sites, hosted by cloudflare, cdns, google.com. I further narrowed the issue down to setting "network.dns.echconfig.enabled" to "true".

"network.dns.echconfig.enabled" "true" and "network.dns.disableIPv6" "true" no issue.

"network.dns.disableIPv6" "false" and "network.dns.echconfig.enabled" "false" no issue.

Sent log via email to :dragana with settings "network.dns.echconfig.enabled" "true" and "network.dns.disableIPv6" "false" and issue manifesting itself.

Flags: needinfo?(bugsgalore)

Comment 36

•

4 years ago

(In reply to mozbugz from comment #35)

It affects several HTTP3 sites, hosted by cloudflare, cdns, google.com. I further narrowed the issue down to setting "network.dns.echconfig.enabled" to "true".

"network.dns.echconfig.enabled" "true" and "network.dns.disableIPv6" "true" no issue.

"network.dns.disableIPv6" "false" and "network.dns.echconfig.enabled" "false" no issue.

Sent log via email to :dragana with settings "network.dns.echconfig.enabled" "true" and "network.dns.disableIPv6" "false" and issue manifesting itself.

Thank you for this investigation. This is very helpful.

Comment 37

•

4 years ago

I have opened a separate bug for the particular issue from comment 35: bug 1724240.

bugzee

Comment 38

•

4 years ago

I have the same problem as OP with that website and some other websites on Cloudflare and IPv6 (but not all websites on Cloudflare and IPv6). I'll post my findings. BTW, I'm in Europe with a different ISP than OP.

This problem affects Firefox stable on Windows and Android.

This also affects Chrome on Windows, but in a different way: Chrome, unlike Firefox, always loads these websites completely, but it takes too long, around 4 seconds or so. When it finally loads them, it's always over HTTP/2, not HTTP/3 (looking at the Network tab in inspector/F12).
Chrome on Android seems unaffected.

Overall, I found that this problem has three requirements:

IPv6 enabled
HTTP3/QUIC enabled
Using my home ISP

If any of those is disabled/changed, the problem goes away. If I switch to my mobile ISP, the problem goes away.

Also, with my home ISP, on Windows, I can ping one of those Cloudflare website IPs with a max of 1232 bytes, anything over results in "Request timed out" (for example: ping 2606:4700:e4::ac40:a61c -l 1232).
With my mobile ISP, still on Windows, I can ping it successfully even with 65500 bytes aka the max allowed by the test.

So it looks like it's the fault of the ISP and some MTU limits (?), but I'd like to know for sure.
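
As a rough check on the MTU theory: an ICMPv6 echo request carries an 8-byte ICMPv6 header plus the 40-byte IPv6 header on top of the ping payload. The sketch below assumes that the -l value is the payload size only and that no extension headers are added; the function name is illustrative.

```python
IPV6_HEADER = 40        # fixed IPv6 header size in bytes
ICMPV6_ECHO_HEADER = 8  # ICMPv6 echo request header size in bytes


def packet_size_for_ping(payload: int) -> int:
    """On-the-wire IPv6 packet size for an unfragmented echo request."""
    return payload + ICMPV6_ECHO_HEADER + IPV6_HEADER


print(packet_size_for_ping(1232))  # 1280
```

A 1232-byte payload therefore gives a packet of exactly 1280 bytes, the IPv6 minimum MTU. That is consistent with, though not proof of, a path that drops anything larger, including fragments. A 65500-byte ping can only ever be delivered as fragments, so its success on the mobile ISP suggests that fragments do get through on that path.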

David Vo [:auscompgeek]

Comment 39

•

4 years ago

For those with small MTUs, see also bug 1734110.

David

Comment 40

•

4 years ago

Our users have also been experiencing intermittent problems over the past few months with pages not loading and it is getting worse. We are coming to the conclusion that it is since Cloudflare enabled HTTP/3 at the end of May 2021.

But more specifically, using <domain>/cdn-cgi/trace on the url for the site, the Cloudflare issue appears to be almost always when http=http/3 and colo=AMS (Amsterdam) and it does affect our users in The Netherlands more than others. colo=LHR or MAN (London or Manchester) don't have the problem.
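
The trace page returns plain key=value lines, so the two fields mentioned above can be checked with a few lines of code. This is a minimal sketch: the helper names are made up for illustration, and the sample text is a hypothetical excerpt rather than real output.

```python
def parse_trace(text: str) -> dict:
    """Parse key=value lines from a /cdn-cgi/trace response into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            fields[key.strip()] = value.strip()
    return fields


def is_http3_via(fields: dict, colo: str = "AMS") -> bool:
    """True if the response came over HTTP/3 from the given colo."""
    return fields.get("http") == "http/3" and fields.get("colo") == colo


# Hypothetical excerpt of a trace response, for illustration only.
sample = "h=example.com\ncolo=AMS\nhttp=http/3\n"
print(is_http3_via(parse_trace(sample)))  # True
```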

My mobile phone ISP (EE) routes via AMS so I can switch between Cloudflare sites to see the problem (although it doesn't happen all the time even via AMS).

Firefox, Chrome and Edge all have problems but in different ways. By disabling http/3 in each of them the problem appears to have stopped although I cannot be 100% sure yet.

I am not a network engineer, but could Cloudflare Amsterdam have some faulty, unpatched or under-sized hardware which is causing the problem?

Updated

•

3 years ago

Assignee: dd.mozilla → nobody

Priority: P1 → P2