fosstodon.org doesn't load with network.http.http3.use_nspr_for_io=false on ARM64 Windows
Categories
(Core :: Networking, defect, P2)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox-esr115 | --- | unaffected |
| firefox-esr128 | --- | unaffected |
| firefox130 | --- | disabled |
| firefox131 | --- | disabled |
| firefox132 | --- | disabled |
People
(Reporter: saschanaz, Assigned: mail)
References
(Regression)
Details
(Keywords: regression, Whiteboard: [necko-triaged][necko-priority-monitor])
Mozregression says it's in this range: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=4d36a2f69e7027dcc1e3535e563aa57e556d9f70&tochange=c38029641964591e518535856e2d7b3038b134ad
Could not pinpoint the commit because of mozregression aarch64 support issue (WARNING: Skipping build b94ec5ba05c9: Unable to find build info using the taskcluster route 'gecko.v2.autoland.shippable.revision.b94ec5ba05c96151a539e0c05403f7384829b008.firefox.win64-aarch64-opt) but I see bug 1910360 being the only network change there.
Bisected by mozregression --good 2024-07-03 --bad 2024-08-14 -a https://fosstodon.org/explore. On a bad build the page never finishes loading, and on a good build it loads instantly.
Edit: On MSEdge the page always loads instantly.
| Reporter | ||
Comment 1•1 year ago
|
||
Setting network.http.http3.use_nspr_for_io to true indeed fixes the issue.
Comment 2•1 year ago
|
||
Set release status flags based on info from the regressing bug 1910360
:mail, since you are the author of the regressor, bug 1910360, could you take a look?
For more information, please visit BugBot documentation.
Thank you Kagami Rosylight for tracking this down. I will take a look.
Given that network.http.http3.use_nspr_for_io is only set to false on Nightly, i.e. quinn-udp is only used on Nightly, I don't think this affects firefox130, firefox131 and firefox132. Am I missing something?
# Use NSPR for HTTP3 UDP IO
- name: network.http.http3.use_nspr_for_io
  type: RelaxedAtomicBool
# Always use NSPR on Android x86. On Android x86 sendmsg and recvmmsg, both used
# by quinn-udp, are disallowed by seccomp. Future fix and further details
# tracked in https://github.com/quinn-rs/quinn/issues/1947.
#if defined(ANDROID) && !defined(HAVE_64BIT_BUILD)
  value: true
#else
  value: @IS_NOT_NIGHTLY_BUILD@
#endif
  mirror: always
I am unable to reproduce this bug on x86-64 Linux and Arm M2 Mac.
Kagami Rosylight can you reproduce the bug once more, with the following log level enabled, and send the logs to necko@mozilla.com and minden@mozilla.com?
timestamp,sync,nsHttp:5,nsSocketTransport:5,nsHostResolver:5,neqo_http3::*:5,neqo_transport::*:5,quinn_udp::*:5,neqo_udp::*:5,neqo_glue::*:5
| Reporter | ||
Comment 6•1 year ago
•
|
||
Sure. A perf profile with a fresh firefox profile (select Socket Thread and see Marker Table to get the logs): https://share.firefox.dev/3XdSqo5
Comment 7•1 year ago
|
||
(In reply to Kagami Rosylight [:saschanaz] (they/them) from comment #6)
Sure. A perf profile with a fresh firefox profile (see Marker Table to get the logs): https://share.firefox.dev/3XdSqo5
(I see zero rust logs this way)
Could you choose logging to a file instead?
Thanks.
| Reporter | ||
Comment 8•1 year ago
|
||
LogMessages - (neqo_transport::) [neqo_transport] Crypto operation failed NssError { name: "SEC_ERROR_BAD_DATA", code: -8190, desc: "security library: received bad data." }
LogMessages - (neqo_transport::) [neqo_transport::stats] [Client ...] Dropped received packet: Decryption failure; Total: 27
This constantly happens, interesting.
| Reporter | ||
Comment 9•1 year ago
|
||
(In reply to Kershaw Chang [:kershaw] from comment #7)
(In reply to Kagami Rosylight [:saschanaz] (they/them) from comment #6)
Sure. A perf profile with a fresh firefox profile (see Marker Table to get the logs): https://share.firefox.dev/3XdSqo5
(I see zero rust logs this way)
Could you choose logging to a file instead?
Thanks.
You checked my comment too early! I just selected a wrong thread (main thread lol), select socket thread and you see all the logs.
Comment 10•1 year ago
|
||
FWIW I tried it in Nightly Version 132.0a1 (2024-09-03) (64-bit) in a Windows for ARM VM on my Macbook M3 and fosstodon.org loads fine.
Comment 11•1 year ago
•
|
||
(In reply to Kagami Rosylight [:saschanaz] (they/them) from comment #8)
LogMessages - (neqo_transport::) [neqo_transport] Crypto operation failed NssError { name: "SEC_ERROR_BAD_DATA", code: -8190, desc: "security library: received bad data." }
LogMessages - (neqo_transport::) [neqo_transport::stats] [Client ...] Dropped received packet: Decryption failure; Total: 27
This constantly happens, interesting.
This is possibly a red herring - this is the message that gets printed when QUIC packets are padded with garbage. We should probably lower the log level of that message, or only log it if it is not a coalesced garbage packet that triggers it.
Edit: But there sure are more of those than there should be...
| Reporter | ||
Comment 12•1 year ago
•
|
||
(In reply to Lars Eggert [:lars] from comment #10)
FWIW I tried it in Nightly Version 132.0a1 (2024-09-03) (64-bit) in a Windows for ARM VM on my Macbook M3 and fosstodon.org loads fine.
Hmm, could this be device specific? x86 32bit binary (by --arch 32 to mozregression) shows the same problem on my Surface Pro 11 (Qualcomm Snapdragon X) on the BER office network. (Edit: the profile above was recorded in a different network)
| Reporter | ||
Comment 13•1 year ago
|
||
And with the same configuration my AMD desktop has no problem.
Comment 14•1 year ago
•
|
||
@krosylight, am I understanding you correctly that you think this may happen on 32-bit builds/platforms but not 64-bit ones? Or only on 32-bit ARM?
| Reporter | ||
Comment 15•1 year ago
•
|
||
No, I wanted to say that builds for different architectures show the same problem on my device. All aarch64/x86/x86-64 nightly builds have the same issue.
| Assignee | ||
Comment 16•1 year ago
•
|
||
Thank you for the additional logs via mail!
Paraphrasing the above to make sure I understand correctly. On an ARM machine running Windows for ARM (Surface Pro 11) fosstodon.org does not load with aarch64, x86 and x86-64 Firefox Nightly builds with network.http.http3.use_nspr_for_io set to false. The latter two (x86, x86-64) are presumably run via Windows for ARM emulation layer.
Do I understand correctly that all other H3 pages work fine, e.g. https://quic.nginx.org/ or https://cloudflare-quic.com/ ?
| Reporter | ||
Comment 17•1 year ago
•
|
||
(In reply to Max Inden from comment #16)
Thank you for the additional logs via mail!
Paraphrasing the above to make sure I understand correctly. On an ARM machine running Windows for ARM (Surface Pro 11) fosstodon.org does not load with aarch64, x86 and x86-64 Firefox Nightly builds with network.http.http3.use_nspr_for_io set to false. The latter two (x86, x86-64) are presumably run via Windows for ARM emulation layer.
Yes.
Do I understand correctly that all other H3 pages work fine, e.g. https://quic.nginx.org/ or https://cloudflare-quic.com/ ?
I did not check others as I don't know who uses H3. https://quic.nginx.org/ shows "You're not using QUIC right now, but don't despair," with NSPR pref off, while it shows "Congratulations! You're connected over QUIC." with NSPR pref on. Tested with mozregression --launch 2024-09-04 -a https://quic.nginx.org/ --pref network.http.http3.use_nspr_for_io:true. Similar result on the cloudflare page.
But as I showed to Kershaw offline, the test result is a bit flaky. Sometimes it connects well with QUIC (and then breaks after a few reloads and gets permanently stuck on HTTP2), or it just goes straight to HTTP2 from the start.
Edit: And with the nspr pref on, things work consistently without flakiness.
| Assignee | ||
Comment 18•1 year ago
|
||
Thank you for your help!
Do you have Rust installed on this laptop? If so, would you mind trying the following and posting the output here?
# In some directory of your choice:
git clone https://github.com/quinn-rs/quinn.git
cd quinn/quinn-udp
cargo test
Context:
- When you set network.http.http3.use_nspr_for_io to false your Firefox instance uses quinn-udp instead of NSPR.
- Instead of debugging quinn-udp through Firefox, we might as well debug it directly, running its unit tests on your machine.
| Reporter | ||
Comment 19•1 year ago
|
||
Test seems clean:
> cargo test
warning: D:\quinn\quinn-udp\Cargo.toml: unused manifest key: target.cfg(any(target_os = "linux", target_os = "windows")).bench
Finished `test` profile [unoptimized + debuginfo] target(s) in 0.09s
Running unittests src\lib.rs (D:\quinn\target\debug\deps\quinn_udp-528859568f696d14.exe)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Running tests\tests.rs (D:\quinn\target\debug\deps\tests-9710717108edc830.exe)
running 6 tests
test basic ... ok
test ecn_v4_mapped_v6 ... ok
test ecn_v6 ... ok
test ecn_v6_dualstack ... ok
test ecn_v4 ... ok
test gso ... ok
test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01s
Doc-tests quinn_udp
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
| Assignee | ||
Comment 20•1 year ago
|
||
Thank you for trying this so quickly. This is surprising.
Do I understand correctly that you only alter the network.http.http3.use_nspr_for_io pref?
| Reporter | ||
Comment 21•1 year ago
•
|
||
Yes, as each time mozregression creates a new profile in a temp directory, which then only includes the default prefs unless told otherwise by --pref.
Comment 22•1 year ago
|
||
Is this only a problem from the Berlin office network, or does it happen elsewhere?
| Reporter | ||
Comment 23•1 year ago
|
||
It happens elsewhere too.
| Reporter | ||
Comment 24•1 year ago
|
||
So the connection tends to work initially (on https://quic.nginx.org/) and then breaks after a few reloads. Can we try making a test that simulates such a situation?
| Assignee | ||
Comment 25•1 year ago
|
||
If I understand Firefox's build system correctly, debug and trace level logging is not available in release builds.
Would you mind reproducing the issue at hand once more, this time in a debug build, and again send us the logs?
As far as I know, the following addition to mozconfig should do, but I assume you know better than me.
ac_add_options --enable-debug
ac_add_options --enable-debug-symbols
Please enable the following log level:
timestamp,sync,nsHttp:5,nsSocketTransport:5,nsHostResolver:5,neqo_http3::*:5,neqo_transport::*:5,quinn_udp::*:5,neqo_udp::*:5,neqo_glue::*:5,neqo_common::*:5
Thank you for your help!
| Reporter | ||
Comment 26•1 year ago
|
||
Oh hey, I can see tons of dropped packets in debug build. Sending the log!
| Assignee | ||
Comment 27•1 year ago
|
||
Kagami, can you still reproduce this issue with latest Firefox? If yes, could you do a packet capture with e.g. Wireshark and send the .pcap and SSLKEYLOGFILE to necko@mozilla.com?
You can instruct Firefox to persist its TLS keys via: SSLKEYLOGFILE=/tmp/keys.txt ./mach run.
Sorry for not making any progress here thus far. Thank you for your help.
| Reporter | ||
Comment 28•1 year ago
|
||
Yes, fosstodon still doesn't load with the default pref. I'm not familiar with Wireshark, does it log every packet from the OS?
| Assignee | ||
Comment 29•1 year ago
|
||
does it log every packet from the OS?
After starting Wireshark, you can first specify a capture filter and then choose an interface. Capture filter with udp should suffice for us here.
Are you running any anti-virus software? If so, does the problem continue, even when disabling the anti-virus software?
| Reporter | ||
Comment 30•1 year ago
|
||
Given it still sounds like it would capture ALL UDP packets, is there some better filter syntax to make sure it only captures traffic going to fosstodon, or traffic coming only from a fresh-profile Nightly?
Disabling Windows Security's realtime protection doesn't seem to help.
Comment 31•1 year ago
•
|
||
You can right click one of the quic/http3/udp packets > follow udp stream. Then select all the packets and go to File > export specified packets.
Then assuming you used SSLKEYLOGFILE=/tmp/keys.txt you can inject the secrets into the pcap with editcap --inject-secrets tls,/tmp/keys.txt capture.pcapng capwithsectrets.pcapng
Otherwise you can just attach your keys.txt file.
| Reporter | ||
Comment 32•1 year ago
|
||
Alright, sent the capture to necko@moz. I hope I did it right.
| Assignee | ||
Comment 33•1 year ago
|
||
Thank you Kagami. Very helpful.
I see UDP datagrams larger than 1500 bytes, with the largest larger than 15_000. Would you mind sharing your interface MTU? After a quick Google search, the following command should do it.
netsh interface ipv4 show subinterface
| Reporter | ||
Comment 34•1 year ago
|
||
MTU MediaSenseState Bytes In Bytes Out Interface
---------- --------------- ------------ ------------ -------------
4294967295 1 0 22145 Loopback Pseudo-Interface 1
1500 5 0 0 Bluetooth Network Connection
1500 1 4076196507 198252145 WiFi
1500 5 0 0 WiFi 2
1500 5 0 0 WiFi 4
1500 5 0 0 WiFi 5
That loopback interface should be Wireshark, but it is still odd for it to have such a high MTU.
| Assignee | ||
Comment 35•1 year ago
|
||
Thank you Kagami.
MTUs above look fine. Windows loopback is special, as it allows a MTU > 64k.
I have a theory, namely that with quinn-udp, we don't correctly read the UDP segment size on Windows on ARM. This is supported by the pcap you sent, where e.g. we receive a 3840 bytes UDP datagram containing a 1280 bytes QUIC packet only. The remaining bytes I assume contain additional QUIC packet(s), read in a single GRO read.
I will prepare a unit test next to confirm the above.
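To illustrate the theory above, here is a minimal sketch of how a coalesced receive would be split when the segment size is known, and what happens when it is not. The function name `split_segments` and the `Option<usize>` stride parameter are hypothetical and do not mirror quinn-udp's actual API; the 3840 and 1280 byte figures are taken from the pcap described above.

```rust
/// Split the bytes returned by one receive call into individual UDP datagrams.
/// `stride` is the per-segment size the OS reports for a coalesced read
/// (hypothetical parameter); `None` models the case where it was not read.
fn split_segments(buf: &[u8], stride: Option<usize>) -> Vec<&[u8]> {
    match stride {
        // Segment size known: every chunk is one original datagram.
        Some(s) if s > 0 => buf.chunks(s).collect(),
        // Segment size unknown: the whole buffer is mistaken for one datagram.
        _ => vec![buf],
    }
}

fn main() {
    // 3840 bytes read at once, as seen in the capture.
    let buf = vec![0u8; 3840];

    let with_stride = split_segments(&buf, Some(1280));
    println!("stride known: {} datagrams", with_stride.len());

    let without_stride = split_segments(&buf, None);
    println!(
        "stride unknown: {} datagram of {} bytes",
        without_stride.len(),
        without_stride[0].len()
    );
}
```

If the stride is lost, the QUIC stack is handed one oversized buffer and can only parse the first packet in it.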
| Assignee | ||
Comment 36•1 year ago
|
||
Would you mind running the following unit test?
https://github.com/mxinden/quinn/commit/7ce09c3f9523920682163c35fca17b4f3a4ed2f3
The steps below should work on your machine:
git clone https://github.com/mxinden/quinn.git
cd quinn/
git checkout large-gro
cd quinn-udp
cargo test
| Reporter | ||
Comment 37•1 year ago
|
||
running 7 tests
test ecn_v4 ... ok
test basic ... ok
test ecn_v6 ... ok
test large_gro ... FAILED
test ecn_v6_dualstack ... ok
test ecn_v4_mapped_v6 ... ok
test gso ... ok
failures:
---- large_gro stdout ----
thread 'large_gro' panicked at quinn-udp\tests\tests.rs:289:5:
assertion `left == right` failed
left: 1280
right: 3848
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
large_gro
test result: FAILED. 6 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01s
error: test failed, to rerun pass `--test tests`
| Reporter | ||
Comment 38•1 year ago
|
||
Added debug!("cmsg_iter {0} {1}", cmsg.cmsg_level, cmsg.cmsg_type); on top of https://github.com/mxinden/quinn/commit/83bafa51c9898a03bfc40410175cdb64e630a421, and the result was:
running 1 test
2024-11-08T21:36:24.430147Z DEBUG quinn_udp::imp: recv
2024-11-08T21:36:24.432055Z DEBUG quinn_udp::imp: cmsg_iter 0 19
2024-11-08T21:36:24.432335Z DEBUG quinn_udp::imp: cmsg_iter 0 50
thread 'large_gro' panicked at quinn-udp\tests\tests.rs:293:5:
assertion `left == right` failed
left: 1280
right: 3848
And 0 means IPPROTO_IP instead of IPPROTO_UDP?
| Assignee | ||
Comment 39•1 year ago
|
||
Good idea. That said, these values are as expected.
0 19 should be WinSock::IPPROTO_IP WinSock::IP_PKTINFO.
0 50 should be WinSock::IPPROTO_IP WinSock::IP_ECN.
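As a reading aid for the cmsg_iter log lines, the following sketch decodes (level, type) pairs into names. The values for IPPROTO_IP, IP_PKTINFO and IP_ECN match the ones stated in this thread. The values for IPPROTO_UDP and UDP_COALESCED_INFO are assumptions from memory of the Winsock headers and should be checked against the windows-sys crate before relying on them. `describe_cmsg` is a hypothetical helper, not part of quinn-udp.

```rust
// Winsock control message levels and types.
const IPPROTO_IP: i32 = 0;
const IPPROTO_UDP: i32 = 17; // assumption, not stated in the thread
const IP_PKTINFO: i32 = 19;
const IP_ECN: i32 = 50;
const UDP_COALESCED_INFO: i32 = 3; // assumption, not stated in the thread

/// Hypothetical helper: name a control message by its (level, type) pair.
fn describe_cmsg(level: i32, ty: i32) -> &'static str {
    match (level, ty) {
        (IPPROTO_IP, IP_PKTINFO) => "IPPROTO_IP / IP_PKTINFO",
        (IPPROTO_IP, IP_ECN) => "IPPROTO_IP / IP_ECN",
        (IPPROTO_UDP, UDP_COALESCED_INFO) => "IPPROTO_UDP / UDP_COALESCED_INFO",
        _ => "unknown",
    }
}

fn main() {
    // The two pairs from the debug log in this thread.
    let logged = [(0, 19), (0, 50)];
    for (level, ty) in logged {
        println!("cmsg_iter {level} {ty} -> {}", describe_cmsg(level, ty));
    }
    // Under the assumed constants, no logged message carries the segment size.
    let has_coalesced_info = logged
        .iter()
        .any(|&(l, t)| (l, t) == (IPPROTO_UDP, UDP_COALESCED_INFO));
    println!("coalesced info present: {has_coalesced_info}");
}
```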
Each of these are handled here:
My unit test above might fail to trigger a coalesced UDP receive (i.e. URO, the Windows GRO equivalent).
Still, the pcap you shared supports my theory above. Firefox seems to read multiple > 1_500 bytes UDP datagrams. Given that this is an Internet path (fosstodon.org is behind fastly), the single large UDP datagram is likely coalesced from multiple < 1_500 bytes UDP datagrams. Thus I assume we properly set the WinSock::UDP_RECV_MAX_COALESCED_SIZE, i.e. indicate to the OS that UDP datagrams should be coalesced, but fail to read the segment size on receive, i.e. mistake a coalesced UDP datagram for a single classic very large UDP datagram.
The above is problematic once the QUIC connection is established. QUIC will use short headers, not containing packet lengths:
Packets with short headers (Section 17.3) only include the Destination Connection ID and omit the explicit length.
https://www.rfc-editor.org/rfc/rfc9000.html#name-connection-id
Thus QUIC is not able to identify the QUIC packet boundaries within the coalesced large UDP datagram, and thereby drops the invalid packet. The high packet loss leads to spurious or total connection failure.
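The boundary problem can be sketched as follows. In RFC 9000 the most significant bit of the first byte distinguishes long headers from short headers, and a short header packet carries no length field, so a receiver has to treat it as running to the end of the datagram. The helper names below are hypothetical and this is not how neqo is structured; the sketch only shows why an unsplit 3840 byte buffer is handed to decryption as a single packet.

```rust
#[derive(Debug, PartialEq)]
enum HeaderForm {
    Long,
    Short,
}

/// RFC 9000: the high bit of the first byte is the Header Form bit.
fn header_form(first_byte: u8) -> HeaderForm {
    if first_byte & 0x80 != 0 {
        HeaderForm::Long
    } else {
        HeaderForm::Short
    }
}

/// Hypothetical helper: for a short header packet the only available
/// "length" is the remainder of the datagram it arrived in.
fn short_header_packet_len(datagram: &[u8]) -> Option<usize> {
    match datagram.first() {
        Some(&b) if header_form(b) == HeaderForm::Short => Some(datagram.len()),
        _ => None,
    }
}

fn main() {
    // Three 1280 byte short header packets delivered as one unsplit buffer.
    // 0x40 has the Header Form bit clear and the Fixed bit set.
    let coalesced = vec![0x40u8; 3840];
    println!(
        "receiver assumes one packet of {:?} bytes instead of three of 1280",
        short_header_packet_len(&coalesced)
    );
}
```

Decrypting 3840 bytes as one packet fails the authentication check, which is consistent with the "Decryption failure" drops logged earlier in this bug.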
I will give this more thought.
| Assignee | ||
Comment 40•1 year ago
|
||
To validate another theory raised in quinn#2041, namely that our control message buffer is too small, would you mind running the above unit test once more on this commit, Kagami?
https://github.com/mxinden/quinn/commit/eb2be2d9842fe2b1039229f104958eba186af9db
| Reporter | ||
Comment 41•1 year ago
|
||
> git log -1
commit eb2be2d9842fe2b1039229f104958eba186af9db (HEAD -> large-gro, mxinden/large-gro)
Author: Max Inden <mail@max-inden.de>
Date: Thu Nov 14 11:56:51 2024 +0100
Use large CMSG_LEN
> cargo test
Compiling winapi v0.3.9
Compiling quinn-udp v0.5.7 (D:\quinn\quinn-udp)
Compiling criterion v0.5.1
Compiling nu-ansi-term v0.46.0
Compiling tracing-subscriber v0.3.18
Finished `test` profile [unoptimized + debuginfo] target(s) in 7.11s
Running unittests src\lib.rs (D:\quinn\target\debug\deps\quinn_udp-0e126f761ea7a4f8.exe)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Running tests\tests.rs (D:\quinn\target\debug\deps\tests-705ba9a9199dc5c7.exe)
running 7 tests
test ecn_v4 ... ok
test ecn_v6 ... ok
test basic ... ok
test ecn_v6_dualstack ... ok
test ecn_v4_mapped_v6 ... ok
test gso ... ok
test large_gro ... FAILED
failures:
---- large_gro stdout ----
thread 'large_gro' panicked at quinn-udp\tests\tests.rs:293:5:
assertion `left == right` failed
left: 1280
right: 3848
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
large_gro
test result: FAILED. 6 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 1.01s
error: test failed, to rerun pass `--test tests`
| Assignee | ||
Comment 42•1 year ago
|
||
Sorry for not mentioning this. Please run with RUST_LOG=debug like you did in https://bugzilla.mozilla.org/show_bug.cgi?id=1916558#c38.
| Reporter | ||
Comment 43•1 year ago
•
|
||
Oh yes.
> $env:RUST_LOG="debug"; cargo test large_gro -- --nocapture
Finished `test` profile [unoptimized + debuginfo] target(s) in 0.17s
Running unittests src\lib.rs (D:\quinn\target\debug\deps\quinn_udp-0e126f761ea7a4f8.exe)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Running tests\tests.rs (D:\quinn\target\debug\deps\tests-705ba9a9199dc5c7.exe)
running 1 test
2024-11-14T11:10:15.247131Z DEBUG quinn_udp::imp: recv
2024-11-14T11:10:15.249245Z DEBUG quinn_udp::imp: cmsg_iter?
2024-11-14T11:10:15.249363Z DEBUG quinn_udp::imp: cmsg_iter 0 19
2024-11-14T11:10:15.249402Z DEBUG quinn_udp::imp: cmsg_iter 0 50
thread 'large_gro' panicked at quinn-udp\tests\tests.rs:293:5:
assertion `left == right` failed
left: 1280
right: 3848
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test large_gro ... FAILED
failures:
failures:
large_gro
test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 6 filtered out; finished in 1.01s
error: test failed, to rerun pass `--test tests`
(That iter? was added by me while I was confused about why I wasn't getting any debug log. Of course I don't get it without the env var.)
| Assignee | ||
Comment 44•1 year ago
|
||
Out-of-band you confirmed that you are running on the latest Windows on ARM version. Given that Windows' URO implementation might offload to your network interface card, can you confirm that you don't have any pending firmware updates, especially for the NIC?
Comment 45•1 year ago
|
||
And what NIC is it? Some are better than others in terms of drivers.
| Reporter | ||
Comment 46•1 year ago
|
||
It's "Qualcomm(R) FastConnect(TM) 7800 Mobile Connectivity System". I see no firmware update.
| Assignee | ||
Comment 47•1 year ago
|
||
Status Update:
- With the help of other Mozillians, I have tried to reproduce this bug on:
- Surface Pro X (yet to be tested on Windows 11, version 24H2)
- Dell Latitude 7455 (with Snapdragon X Elite) (maybe Windows 11, version 24H2)
- Thinkpad P1 x86_64 (Windows 11, version 24H2)
- Mac M3 Windows VM (Windows 11, version 24H2)
- Thus far without success.
- I have ordered a Surface Pro 11 (Qualcomm Snapdragon X) to try to reproduce the issue directly.
| Assignee | ||
Comment 48•1 year ago
|
||
I have ordered a Surface Pro 11 (Qualcomm Snapdragon X) to try to reproduce the issue directly.
Unfortunately I can not reproduce the bug with this new device.
- Surface Pro 11th Edition 2076
- Snapdragon X 12-core X1E80100
- Qualcomm(R) FastConnect(TM) 7800 Mobile Connectivity System
- Windows 11 Home 10.0.26100 Build 26100
- Tested with both Firefox Nightly 134.0a1 (2024-11-21)(aarch64) and Firefox Nightly 134.0a1 (2024-11-21)(64-Bit)
- network.http.http3.use_nspr_for_io: false
- https://fosstodon.org loads most resources over http3, some use http2.
- No connection issues noticeable. No HTTP request fails.
- fosstodon.org resolves to 151.101.131.52 (fastly) as expected.
- Glean metrics are as expected:
  - http_3_udp_datagram_segments_received shows that the majority of receive calls read multiple coalesced UDP datagrams (i.e. segments).
  - http_3_udp_datagram_size_received shows multiple > 1500 datagrams (presumably coalesced).
  - http_3_udp_datagram_segment_size_received shows no datagram segment > 1500.
- Wireshark shows (as expected) multiple > 1500 UDP datagrams (presumably coalesced).
| Assignee | ||
Comment 49•1 year ago
|
||
I found a way to reproduce the bug.
- Install Windows Subsystem for Linux in a Terminal via wsl --install.
- Restart Windows.
- As described above, navigate to https://fosstodon.org and hard reload (ctrl-shift-r) until you see http3 requests stalling.
Once reproduced you can undo via:
- Control Panel -> programs-features -> turn off "Virtual Machine Platform".
- Restart Windows.
- Navigate to https://fosstodon.org and you won't see any more http3 request stalls.
@Kagami I assume you are running WSL, correct? Would you mind testing out part two above to see whether that fixes the bug on your end as well? Note that I am not a Windows expert, e.g. I don't know whether turning off "Virtual Machine Platform" is easily revertible. It is on my machine.
| Reporter | ||
Comment 50•1 year ago
|
||
Ohhhhh. I have Docker so I'm not sure turning it off is safe, but then I don't really have important thing in Docker nor in WSL so maybe I can still try.
Interesting that it doesn't happen on my desktop, which also has Virtual Machine Platform enabled for WSL.
| Reporter | ||
Comment 51•1 year ago
|
||
Yes, I can confirm that turning off VMP fixes the issue and turning it back on reintroduces the issue.
| Assignee | ||
Comment 52•1 year ago
|
||
Bug 1935954 introduces an interim fix, disabling URO when detecting Windows on ARM.
Kagami, in case you have some time, would you mind confirming that https://phabricator.services.mozilla.com/D231505 fixes the issues described here? A simple debug build of the patch and visiting fosstodon.org should be enough.
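The shape of such an interim gate might look like the sketch below. This is an assumption for illustration and not the code in D231505; `should_enable_uro` is a hypothetical name. Note that a check based on the build's own target architecture only covers native aarch64 builds. The x86 and x86-64 builds running under emulation, which this thread shows are also affected, would need a runtime check of the host instead.

```rust
use std::env::consts::{ARCH, OS};

/// Hypothetical policy: URO is a Windows feature, and is skipped on ARM64
/// Windows. `os` and `arch` use the spellings of `std::env::consts`.
fn should_enable_uro(os: &str, arch: &str) -> bool {
    os == "windows" && arch != "aarch64"
}

fn main() {
    // ARCH reflects the architecture this binary was compiled for,
    // not the architecture of the machine it is running on.
    println!(
        "os={OS} arch={ARCH} enable_uro={}",
        should_enable_uro(OS, ARCH)
    );
}
```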
| Reporter | ||
Comment 53•1 year ago
|
||
Repeating ./mach run --temp-profile https://fosstodon.org shows no issue, compared to 2024-12-09 build which shows the issue.
Comment 54•1 year ago
|
||
I recently began to experience this issue after updating to Windows 11 24H2, but I am on x86 (I see the same issue on 2 devices, a desktop with an AMD CPU and a Realtek NIC, and a laptop with an Intel CPU and Intel Wi-Fi). Bisection pointed to bug 1910360 and I confirmed that setting network.http.http3.use_nspr_for_io back to true fixes the issue (I used a different website to bisect, but confirmed the difference with https://fosstodon.org as well).
Unfortunately turning off VMP did not help in my case (it was enabled, but disabling it did not make a difference).
Would you like me to file a separate bug, or investigate here?
Comment 55•1 year ago
|
||
I have reached out to Microsoft to determine if this is a known issue and whether a Windows fix is in preparation.
| Assignee | ||
Comment 56•1 year ago
|
||
Thank you for the investigation Emanuel.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #54)
Would you like me to file a separate bug, or investigate here?
Until we have any strong indication that the Windows on ARM URO failure is unrelated to the x86 URO failure, I suggest we track both in this Bug.
(In reply to Lars Eggert [:lars] from comment #55)
I have reached out to Microsoft to determine if this is a known issue and whether a Windows fix is in preparation.
Thank you Lars!
| Reporter | ||
Comment 57•1 year ago
|
||
NI to me to confirm whether this was wifi specific on my machine or wired connection was also affected
| Reporter | ||
Comment 58•1 year ago
|
||
Turns out the 2024-12-09 build (the build before URO was disabled) doesn't have a problem on a wired connection. The network info shows a different device for each connection:
- For wired, it says "Realtek USB GbE Family Controller"
- For wireless, it says "Qualcomm(R) FastConnect(TM) 7800 Mobile Connectivity System"
The Realtek controller doesn't appear in Device Manager without the USB cable connected, so perhaps it's an external device rather than one built into the SP11. (The wired connection is coming through a USB-C cable from a Dell U2723QE monitor to which a LAN cable is connected.)
Comment 59•1 year ago
|
||
I found that disabling "Recv Segment Coalescing" in the Realtek drivers also works around this problem (which would make sense given that URO is UDP Receive Segment Coalescing Offload), and found at least 1 comment online[1] suggesting that this setting may have been disabled by default in older drivers with power saving features enabled (it was enabled by default for me).
So this may be a case where you need to install the latest drivers[2] or manually enable the "Recv Segment Coalescing" setting in the drivers to reproduce the problem.
I have not found an equivalent setting in the Intel drivers. There's a setting called "Packet Coalescing", but disabling it does not work around the problem.
[1] https://www.elevenforum.com/t/latest-realtek-lan-driver-win11.9226/page-2#post-334360
[2] https://www.realtek.com/Download/List?cate_id=584
| Reporter | ||
Comment 60•1 year ago
|
||
Good point Emanuel, thanks!
So it was RTL8153 and the Windows builtin driver didn't have "Recv Segment Coalescing" at all in driver advanced menu. Manually installing the driver from Realtek added the entry enabled by default, and with that the issue happens again.
| Reporter | ||
Comment 61•1 year ago
|
||
And I can confirm that disabling "Recv Segment Coalescing" from driver menu solves the issue on 2024-12-09 build for both Qualcomm and Realtek driver.
Comment 62•1 year ago
|
||
Ah, I just spotted the equivalent settings in the Intel drivers: RSCv4 and RSCv6.
I've confirmed that setting these options to disabled works around the problem (as expected).
To reproduce the problem, these settings probably need to be available and enabled.
| Assignee | ||
Comment 63•1 year ago
|
||
Thank you for the recent investigations.
Given that this issue does not only apply to Windows on ARM and given that no solution is in sight, we are disabling URO on Windows altogether.
quinn#2092 disabled URO on Windows by default. It is part of quinn-udp v0.5.9 which has landed in mozilla-central with phabricator#D232475.
Comment 64•1 year ago
|
||
I'm no longer seeing issues with the latest Nightly, with default driver settings and network.http.http3.use_nspr_for_io == false :)
By the way, one more thing I noticed:
- On my desktop's Intel WiFi, I don't see the RSCv4 and RSCv6 options despite having the latest drivers (and Windows 11 24H2). It's an older chip (AX200) than the one in my laptop (BE200).
- On my laptop's Realtek NIC, I don't see the Recv Segment Coalescing options despite having the latest drivers (and Windows 11 24H2), and despite the fact that it's a newer revision of the same product (RTL8125, but REV_05 as opposed to REV_04 according to the compatible IDs).
So even having the offloading feature available in the first place seems almost random!
Comment 65•8 months ago
|
||
Max, should we keep this bug open?
I think we can close this bug, since URO is disabled.
| Assignee | ||
Comment 66•8 months ago
|
||
Long term I believe we should do URO (Windows equivalent to GRO). I suggest we track that work somewhere, for example here. In other words, I suggest keeping this open.
We are in touch with Microsoft. One of the engineers has been able to reproduce the failure with our reproducer. That is it for now.