HTTP/3 connections do not survive network change, leading to resource loading failures
Categories
(Core :: Networking, defect, P2)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox136 | --- | fixed |
People
(Reporter: acreskey, Assigned: kershaw)
References
(Blocks 4 open bugs)
Details
(Whiteboard: [necko-priority-queue][necko-triaged])
Attachments
(3 files)
In bug 1706377, logic was added to move HTTP/2 connections to the pending list on a network change, so that subsequent transactions can complete gracefully.
See Always create new connection after network change
https://phabricator.services.mozilla.com/D189523
However, we don't yet do this for HTTP/3 connections; see ConnectionEntry::VerifyTraffic.
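A minimal sketch of the idea above, written in Python rather than necko's C++: on a network change, existing multiplexed connections stop accepting new transactions, while transactions already in flight are left to finish. Every class, field, and method name here is hypothetical; this is not ConnectionEntry's real interface.

```python
from dataclasses import dataclass, field


@dataclass
class Conn:
    """Hypothetical stand-in for an HTTP/2 or HTTP/3 connection."""
    protocol: str                          # "h2" or "h3"
    accepts_new_transactions: bool = True
    in_flight: int = 0                     # transactions handed to this connection


@dataclass
class Entry:
    """Hypothetical stand-in for a per-origin connection entry."""
    conns: list = field(default_factory=list)

    def on_network_change(self, protocols=("h2", "h3")):
        # Stop handing out existing connections for new transactions,
        # but leave in-flight transactions alone so they can complete.
        for conn in self.conns:
            if conn.protocol in protocols:
                conn.accepts_new_transactions = False

    def connection_for_new_transaction(self, protocol):
        # Reuse a connection only if it is still accepting work.
        for conn in self.conns:
            if conn.protocol == protocol and conn.accepts_new_transactions:
                conn.in_flight += 1
                return conn
        # Otherwise open a fresh connection on the current network.
        conn = Conn(protocol, in_flight=1)
        self.conns.append(conn)
        return conn
```

Calling `on_network_change(protocols=("h2",))` models the situation this bug describes, where only HTTP/2 connections are retired and an old HTTP/3 connection keeps being reused.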
| Reporter | ||
Comment 1•1 year ago
|
||
From Kershaw:
QUIC supports connection migration:
RFC 9000: QUIC: A UDP-Based Multiplexed and Secure Transport
| Reporter | ||
Comment 2•1 year ago
|
||
Based on discussions with Lars, I may have the wrong terminology in this bug.
But the intent is to handle scenarios where the user transitions networks as gracefully as possible (for HTTP/3 requests).
| Reporter | ||
Comment 3•1 year ago
|
||
Specifically, allowing the network request to complete even if the client changes network interfaces midway.
For example, when a mobile user migrates from Wifi to Cellular.
i.e. as described here:
https://www.debugbear.com/blog/http3-quic-protocol-guide
By attaching an unencrypted connection identifier (CID) to each QUIC packet header, QUIC doesn’t have to reset the connection like TCP if the device switches to a new network (for example, from a 4G network to Wi-Fi, or vice versa) or the IP addresses or port numbers change for any other reason.
With the help of connection migration, QUIC doesn’t have to redo the handshake under the new conditions and HTTP/3 doesn’t have to re-request the files that were being downloaded when the network migration happened — which can be a problem in the case of larger files or video streaming.
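To illustrate the quoted explanation, here is a toy model in Python of a server looking up connection state either by client address or by connection ID. The classes, addresses, and the connection ID value are made up for illustration. Real QUIC does considerably more (path validation, switching to fresh connection IDs on a new path), and a server may decline migration entirely.

```python
class AddressKeyedTable:
    """Hypothetical server table keyed by the client's (ip, port), TCP-style."""

    def __init__(self):
        self._conns = {}

    def add(self, client_addr, cid, state):
        self._conns[client_addr] = state

    def find(self, client_addr, cid):
        return self._conns.get(client_addr)


class CidKeyedTable:
    """Hypothetical server table keyed by the connection ID, QUIC-style."""

    def __init__(self):
        self._conns = {}

    def add(self, client_addr, cid, state):
        self._conns[cid] = state

    def find(self, client_addr, cid):
        return self._conns.get(cid)


def packet_after_network_switch(table):
    old_addr = ("192.0.2.10", 50000)      # client address on the first network
    new_addr = ("198.51.100.7", 41000)    # client address after switching networks
    cid = "c1d0"                          # made-up connection ID
    table.add(old_addr, cid, state="download in progress")
    # The next packet arrives from the new address but carries the same CID.
    return table.find(new_addr, cid)
```

With the address-keyed table the lookup finds nothing after the switch, so the transfer would have to restart; with the CID-keyed table the existing state is found.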
Comment 4•1 year ago
|
||
For what it is worth, QUIC Connection Migration is tested on Neqo's CI in:
https://github.com/mozilla/neqo/blob/main/neqo-transport/src/connection/tests/migration.rs
Do I understand correctly that this bug is about:
- Update ConnectionEntry::VerifyTraffic to include h3 connections.
- Update the test_verify_traffic.js test to ensure h3 connections are included in ConnectionEntry::VerifyTraffic, in other words test that h3 connections are requeued with change of network.
| Reporter | ||
Comment 5•1 year ago
|
||
(In reply to Max Inden from comment #4)
For what it is worth, QUIC Connection Migration is tested on Neqo's CI in:
https://github.com/mozilla/neqo/blob/main/neqo-transport/src/connection/tests/migration.rs
Good to know!
Do I understand correctly that this Bug is about:
- Update ConnectionEntry::VerifyTraffic to include h3 connections.
- Update the test_verify_traffic.js test to ensure h3 connections are included in ConnectionEntry::VerifyTraffic, in other words test that h3 connections are requeued with change of network.
So I'm not sure what changes are needed to the necko stack in order to accomplish this (very possibly the two areas you mentioned).
I'm going to write a test case within Firefox first to verify that this isn't already being handled gracefully.
| Reporter | ||
Comment 6•1 year ago
|
||
Here's a distilled scenario where I'm seeing what looks to be the problem:
- Load a website over HTTP/3.
  I've written a simple test which is hosted on Cloudflare:
  https://connectionreuse.pages.dev/connection_reuse
  Use dev tools to verify the connection was made over HTTP/3.
- Change networks (e.g. switch to a different access point).
  Wait until the connection change is complete.
- Have the site make a new request.
  On this site you can press the "Load New Image" button to request a new image.
Result:
Image request fails
Profile with nsHttp logs capture, https://share.firefox.dev/3TSdekj
At times I've seen the connection drop down to HTTP/2, which is better, but not connection migration.
| Reporter | ||
Comment 7•1 year ago
|
||
Seeing this behaviour on Android as well.
| Reporter | ||
Comment 8•1 year ago
|
||
An alternative way to reproduce this behaviour, related to a scenario Kershaw observed:
- Visit a site that is served over HTTP/3, like https://www.cloudflare.com (verify via devtools)
- Change networks: access point, interface etc
- Try to interact with the site, e.g. click on the Learn More button
Result:
Nothing happens, interaction silently fails
| Reporter | ||
Comment 9•1 year ago
|
||
This particular problem where the site becomes non-responsive after a change of networks may be limited to Cloudflare servers
The sites mentioned in comment 6 and comment 8 are both cloudflare HTTP/3 sites.
Seeing the same issue here, a Shopify site that looks to be hosted by cloudflare:
https://www.windsorstore.com/
And another site on cloudflare:
https://wordsift.org/
| Reporter | ||
Comment 10•1 year ago
|
||
[removed duplicate comment]
| Reporter | ||
Comment 11•1 year ago
|
||
Yes, it seems that the problem of requests failing after a network change is occurring only on Cloudflare's HTTP/3 servers and not on Google Cloud servers.
I deployed the same simple test site used in comment 6 to both platforms:
Cloudflare: https://connectionreuse.pages.dev/connection_reuse
Google Cloud: https://connectionmigration.ue.r.appspot.com/
Both sites are accessible over HTTP/3. After changing networks (e.g., switching from Wi-Fi to cellular), I can continue to perform requests against the Google Cloud site without any issues. However, the problem persists on the Cloudflare-hosted site, where requests fail following a network change.
Also reproduced the issue in Fx 131 Release, desktop and android.
Comment 12•1 year ago
|
||
For what it is worth, the QUIC Interop Runner does not currently support the connection migration testcase. In other words, we are not testing (active) connection migration between Neqo and e.g. quiche (Cloudflare's QUIC and HTTP/3 implementation) on CI today.
| Assignee | ||
Comment 13•1 year ago
|
||
Some updates:
- I can confirm Andrew's observation in comment #11: this issue seems to only occur on Cloudflare's site. Firefox works fine with Google’s site after a network change.
- In the transport parameters returned from the Cloudflare site, there is no disable_active_migration specified, so their server should support connection migration.
- The Wireshark trace shows that the Cloudflare site stops sending packets back to Firefox after a network change.
- Chrome appears to always establish new connections after a network change, which is why it works fine with Cloudflare’s site. I think we should consider doing the same.
Comment 14•1 year ago
|
||
(In reply to Kershaw Chang [:kershaw] from comment #13)
- Chrome appears to always establish new connections after a network change, which is why it works fine with Cloudflare’s site. I think we should consider doing the same.
Is that for any network change event, or only those that result in an IP change?
I suppose there could be network change events where the computer temporarily disconnects and the NAT clears existing port mappings, but I don't think that's common, is it?
If I understand correctly, we don't yet support active connection migration, so it kinda makes sense that Cloudflare stops sending back packets if suddenly they seem to be coming from a different source port.
When a network change happens, neqo should ensure that the QUIC connection is still alive before continuing to use it. If the local IP address changes, then we should definitely use a new connection.
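The policy suggested above could be sketched roughly as follows, in Python. The function name and the `Action` enum are hypothetical, and treating an unspecified address (0.0.0.0 or ::) as "unknown, so open a new connection" is an assumption of this sketch rather than something stated in this bug.

```python
import ipaddress
from enum import Enum


class Action(Enum):
    PROBE_THEN_REUSE = "verify the connection is alive, then reuse it"
    NEW_CONNECTION = "open a new connection"


def action_after_network_change(old_local_ip, new_local_ip):
    """Decide what to do with an existing QUIC connection after a network change."""
    old = ipaddress.ip_address(old_local_ip)
    new = ipaddress.ip_address(new_local_ip)
    # Assumption: an unspecified address means the local IP is unknown,
    # and unknown is handled like a change, to stay on the safe side.
    if old.is_unspecified or new.is_unspecified:
        return Action.NEW_CONNECTION
    # The local IP changed: do not reuse the old connection.
    if old != new:
        return Action.NEW_CONNECTION
    # Same local IP: the connection may still work, but check before reusing it.
    return Action.PROBE_THEN_REUSE
```

For example, `action_after_network_change("192.168.1.10", "10.0.0.5")` selects a new connection, while passing the same address twice selects the probe-then-reuse path.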
| Reporter | ||
Comment 15•1 year ago
|
||
I'm only seeing requests fail after a network change in which the IP changes.
So scenarios in which the next HTTP/3 request against cloudflare will succeed:
• disabling and re-enabling wifi
• changing wifi access points and then changing back to the original one
I have a work-in-progress test here*
https://phabricator.services.mozilla.com/D224261
Can be run with
./mach raptor --browsertime -t browsertime-connection-migration
*However, at the moment you have to manually change networks midway through the test.
| Reporter | ||
Comment 16•1 year ago
|
||
(Or run with Chrome via
./mach raptor --browsertime -t browsertime-connection-migration --app chrome --binary={path to Chrome binary}
)
| Reporter | ||
Comment 17•1 year ago
|
||
At the moment I'm having a hard time reproducing this issue.
i.e., I can connect via HTTP/3, but the change of IP does not break the connection.
I'm forcing DoH on to ensure that we wait for HTTPS RR, and thus ECH on cloudflare.
| Reporter | ||
Comment 18•1 year ago
|
||
Running the test from comment 6 I'm sometimes seeing the image load over an HTTP/2 connection (while the main document was loaded over HTTP/3).
Comment 19•1 year ago
|
||
I spent some time trying to figure out what's so special about cloudflare.
The thing that most bugged me was that wireshark would fail to decrypt the QUIC packets into HTTP3 for the cloudflare connection when setting SSLKEYLOGFILE=/tmp/keys.txt, but it worked for google HTTP/3 connections.
After doing mozregression multiple times it subsequently pointed me to:
Bug 1874464 - Turn on native HTTPS-RR DNS resolver on Nightly
Bug 1892528 - part 2: enable Xyber768 in Http/3 under a pref.
Bug 1910360 - network.http.http3.use_nspr_for_io
After each mozregression session I would fix the pref then continue.
My mozregression_prefs.json now looks like this:
{
"security.sandbox.content.level": 0,
"network.trr.mode": 5,
"network.dns.native_https_query": false,
"network.http.http3.enable_kyber": false,
"network.http.http3.use_nspr_for_io": true
}
With all these prefs set I can always decrypt HTTP3 in wireshark.
After a network change I can see that Firefox is still sending HTTP/3 packets using the DCID established before the IP change.
Will upload pcap.
Comment 20•1 year ago
|
||
| Reporter | ||
Comment 21•1 year ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #19)
My mozregression_prefs.json now looks like this:
{
"security.sandbox.content.level": 0,
"network.trr.mode": 5,
"network.dns.native_https_query": false,
"network.http.http3.enable_kyber": false,
"network.http.http3.use_nspr_for_io": true
}
With all these prefs set I can always decrypt HTTP3 in wireshark.
After a network change I can see that Firefox is still sending HTTP/3 packets using the DCID established before the IP change.
Will upload pcap.
As I believe can be surmised from this set of preferences, ECH does not appear to be a cause of this bug.
When I set network.dns.http3_echconfig.enabled to false, I can still reproduce the bug.
Oddly, I'm now only seeing the failed requests after the second change of networks (in fact, when switching back to the original network).
I'll try to find out why that's the case.
| Reporter | ||
Comment 22•1 year ago
|
||
In this profile, https://share.firefox.dev/3U8H8Ri, the load of the first image (other-image1.jpg) succeeds while the second (other-image2.jpg) fails.
As far as I can tell in the case of the first image load we've pruned idle connections and so make a new connection:
LogMessages — (nsHttp) GetH2orH3ActiveConn() request for ent 14d37cb20 .S........[tlsflags0x00000000]connectionreuse.pages.dev:443 <ROUTE-via connectionreuse.pages.dev:443> {NPN-TOKEN h3}^partitionKey=%28https%2Cconnectionreuse.pages.dev%29 did not find an active connection
For the second image load, which fails, we find an active connection and use it:
LogMessages — (nsHttp) GetH2orH3ActiveConn() request for ent 14d37cb20 .S........[tlsflags0x00000000]connectionreuse.pages.dev:443 <ROUTE-via connectionreuse.pages.dev:443> {NPN-TOKEN h3}^partitionKey=%28https%2Cconnectionreuse.pages.dev%29 found an active experienced connection 13ba09d80 in native connection entry
Comment 23•1 year ago
|
||
My current approach is looking at the following:
- mozilla::net::nsHttpConnectionMgr::DispatchTransaction calls mozilla::net::HttpConnectionUDP::Activate
- if it is experienced, we should also check if NeqoHttp3Conn::local_addr is the same as the machine's current IP (still trying to figure out how to do that).
| Assignee | ||
Comment 24•1 year ago
|
||
- if it is experienced, we should also check if [NeqoHttp3Conn::local_addr](https://searchfox.org/mozilla-central/rev/3265b390bd5d08a5be520253ef71835bcb715f27/netwerk/socket/neqo_glue/src/lib.rs#47) is the same as the machine's current IP (still trying to figure out how to do that).
The local address is initialized here and it seems that we always use 0.0.0.0. So, I think we can't use it to compare the current IP.
| Assignee | ||
Comment 25•1 year ago
|
||
- In the transport parameters returned from the Cloudflare site, there is no disable_active_migration specified, so their server should support connection migration.
FYI: Cloudflare confirmed that they don't support connection migration for now and they will roll out a change to set disable_active_migration.
Comment 26•1 year ago
|
||
(In reply to Kershaw Chang [:kershaw] from comment #24)
- if it is experienced, we should also check if [NeqoHttp3Conn::local_addr](https://searchfox.org/mozilla-central/rev/3265b390bd5d08a5be520253ef71835bcb715f27/netwerk/socket/neqo_glue/src/lib.rs#47) is the same as the machine's current IP (still trying to figure out how to do that).
The local address is initialized here and it seems that we always use 0.0.0.0. So, I think we can't use it to compare the current IP.
I think we should be able to get the local IP address if we call mozilla::net::nsUDPSocket::Connect.
We don't currently do that, so it might not be a trivial change.
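The general socket behaviour behind this suggestion can be shown with Python's standard `socket` module; this is only an illustration of the OS-level effect, not of nsUDPSocket itself, and the helper name is hypothetical. A UDP socket bound to the wildcard address reports 0.0.0.0 as its local address, but once it is connected to a peer the OS picks a concrete source address, which `getsockname()` then returns. Connecting a UDP socket does not send any packets.

```python
import socket


def local_ip_before_and_after_connect(remote_host, remote_port):
    """Return the socket's local IP before and after connect() on a UDP socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Bound to the wildcard address: the local IP is not yet decided.
        sock.bind(("0.0.0.0", 0))
        before = sock.getsockname()[0]
        # connect() on UDP only records the peer and lets the OS choose
        # the source address for that route; nothing is sent on the wire.
        sock.connect((remote_host, remote_port))
        after = sock.getsockname()[0]
    return before, after
```

Using the loopback address as the peer, for instance `local_ip_before_and_after_connect("127.0.0.1", 443)`, keeps the example independent of the machine's network configuration. With a real server address, the second value would be the interface address the OS routes through, which is the value one could compare after a network change.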
| Assignee | ||
Comment 27•1 year ago
|
||
Taking this from Valentin.
I'd like to try the approach mentioned in comment #0.
Comment 28•1 year ago
|
||
Do I understand correctly that we would thus always create a new QUIC connection on network change, even though, if supported by the server, we could use QUIC's connection migration instead?
Do I understand correctly that long term, we would want to make use of QUIC's connection migration, if available?
| Assignee | ||
Comment 29•1 year ago
|
||
| Assignee | ||
Comment 30•1 year ago
|
||
Comment 31•1 year ago
|
||
Comment 32•1 year ago
|
||
Backed out for causing bp-nu bustages in MockNetworkLayerController.h.
- Backout link
- Push with failures
- Failure Log
- Failure line: /builds/worker/checkouts/gecko/netwerk/base/MockNetworkLayerController.h(34,32): error: implicit instantiation of undefined template 'nsTBaseHashSet<nsCStringHashKey>'
Please also check these xpcshell failures.
| Assignee | ||
Comment 33•1 year ago
|
||
Comment 34•1 year ago
|
||
Comment 35•1 year ago
|
||
| bugherder | ||
https://hg.mozilla.org/mozilla-central/rev/7d93026bd93e
https://hg.mozilla.org/mozilla-central/rev/39bdda0a1574
| Reporter | ||
Comment 36•9 months ago
|
||
(In reply to Andrew Creskey [:acreskey] from comment #6)
Here's a distilled scenario where I'm seeing what looks to be the problem:
- Load a website over HTTP/3.
  I've written a simple test which is hosted on Cloudflare:
  https://connectionreuse.pages.dev/connection_reuse
  Use dev tools to verify the connection was made over HTTP/3.
- Change networks (e.g. switch to a different access point).
  Wait until the connection change is complete.
- Have the site make a new request.
  On this site you can press the "Load New Image" button to request a new image.
Result:
Image request fails
Profile with nsHttp logs capture, https://share.firefox.dev/3TSdekj
At times I've seen the connection drop down to HTTP/2, which is better, but not connection migration.
I was revisiting scenarios like this -- just noting that this is indeed now fixed in Fenix -- i.e. the new request (from "Load New Image") no longer fails.