Open Bug 1916558 Opened 1 year ago Updated 5 months ago

fosstodon.org doesn't load with network.http.http3.use_nspr_for_io=false on ARM64 Windows

Categories

(Core :: Networking, defect, P2)

ARM64
Windows 11
defect

Tracking

Tracking Status
firefox-esr115 --- unaffected
firefox-esr128 --- unaffected
firefox130 --- disabled
firefox131 --- disabled
firefox132 --- disabled

People

(Reporter: saschanaz, Assigned: mail)

References

(Regression)

Details

(Keywords: regression, Whiteboard: [necko-triaged][necko-priority-monitor])

Mozregression says it's in this range: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=4d36a2f69e7027dcc1e3535e563aa57e556d9f70&tochange=c38029641964591e518535856e2d7b3038b134ad

Could not pinpoint the commit because of a mozregression aarch64 support issue (WARNING: Skipping build b94ec5ba05c9: Unable to find build info using the taskcluster route 'gecko.v2.autoland.shippable.revision.b94ec5ba05c96151a539e0c05403f7384829b008.firefox.win64-aarch64-opt'), but I see bug 1910360 being the only network change there.

Bisected by mozregression --good 2024-07-03 --bad 2024-08-14 -a https://fosstodon.org/explore. On a bad build the page never finishes loading, and on a good build it loads instantly.

Edit: On MSEdge the page always loads instantly.

OS: Unspecified → Windows 11
Hardware: Unspecified → ARM64
Version: unspecified → Firefox 130

Setting network.http.http3.use_nspr_for_io to true indeed fixes the issue.

Summary: fosstodon.org doesn't load since nightly 2024-08-02 → fosstodon.org doesn't load with network.http.http3.use_nspr_for_io=false

Set release status flags based on info from the regressing bug 1910360

:mail, since you are the author of the regressor, bug 1910360, could you take a look?

For more information, please visit BugBot documentation.

Thank you Kagami Rosylight for tracking this down. I will take a look.

Given that network.http.http3.use_nspr_for_io is only set to false on Nightly, i.e. quinn-udp is only used on Nightly, I don't think this affects firefox130, firefox131 and firefox132. Am I missing something?

# Use NSPR for HTTP3 UDP IO
- name: network.http.http3.use_nspr_for_io
  type: RelaxedAtomicBool
# Always use NSPR on Android x86. On Android x86 sendmsg and recvmmsg, both used
# by quinn-udp, are disallowed by seccomp. Future fix and further details
# tracked in https://github.com/quinn-rs/quinn/issues/1947.
#if defined(ANDROID) && !defined(HAVE_64BIT_BUILD)
  value: true
#else
  value: @IS_NOT_NIGHTLY_BUILD@
#endif
  mirror: always

https://searchfox.org/mozilla-central/rev/8fffdc727aa507ee4955042ec2d6f71d23c9c2de/modules/libpref/init/StaticPrefList.yaml#13672-13683
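
Read as plain logic, the preprocessor conditions in the pref definition above amount to the following. This is only a sketch: the function name and its three boolean inputs are hypothetical stand-ins for the build-time conditions (ANDROID, HAVE_64BIT_BUILD, @IS_NOT_NIGHTLY_BUILD@), not anything that exists in Firefox.

```rust
/// Hypothetical model of the default value of
/// `network.http.http3.use_nspr_for_io`, mirroring the conditions in
/// StaticPrefList.yaml. The parameters are illustrative stand-ins for
/// build-time macros, not real Firefox APIs.
fn default_use_nspr_for_io(is_android: bool, is_64bit_build: bool, is_nightly: bool) -> bool {
    if is_android && !is_64bit_build {
        // `#if defined(ANDROID) && !defined(HAVE_64BIT_BUILD)`: always NSPR.
        true
    } else {
        // `@IS_NOT_NIGHTLY_BUILD@`: NSPR everywhere except Nightly.
        !is_nightly
    }
}
```

With these inputs, a 64-bit Windows Nightly build gets `false` (quinn-udp), while the same build on Beta or Release gets `true` (NSPR), which matches the reasoning that only Nightly is affected.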

I am unable to reproduce this bug on x86-64 Linux and Arm M2 Mac.

Kagami Rosylight can you reproduce the bug once more, with the following log level enabled, and send the logs to necko@mozilla.com and minden@mozilla.com?

timestamp,sync,nsHttp:5,nsSocketTransport:5,nsHostResolver:5,neqo_http3::*:5,neqo_transport::*:5,quinn_udp::*:5,neqo_udp::*:5,neqo_glue::*:5
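
For readers unfamiliar with the format of the log string above: it is a comma-separated list in which bare words (timestamp, sync) are options and `name:level` entries select a log module and its verbosity. The sketch below splits such a string apart. It is an illustration only, not Firefox's actual parser, and `parse_log_spec` is a made-up name.

```rust
/// Splits a MOZ_LOG-style spec into bare options (e.g. "timestamp") and
/// (module, level) pairs. Illustrative only; not Firefox's real parser.
fn parse_log_spec(spec: &str) -> (Vec<String>, Vec<(String, u8)>) {
    let mut options = Vec::new();
    let mut modules = Vec::new();
    for item in spec.split(',').map(str::trim).filter(|s| !s.is_empty()) {
        // Split at the LAST ':' so Rust-style paths such as `neqo_udp::*`
        // keep their `::` separators intact.
        match item.rsplit_once(':') {
            Some((module, level)) => match level.parse::<u8>() {
                Ok(level) => modules.push((module.to_string(), level)),
                // Not a numeric level: treat the whole item as an option.
                Err(_) => options.push(item.to_string()),
            },
            None => options.push(item.to_string()),
        }
    }
    (options, modules)
}
```

Applied to the string above, this yields the options `timestamp` and `sync` plus eight modules, all at level 5.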

Flags: needinfo?(krosylight)

Sure. A perf profile with a fresh firefox profile (select Socket Thread and see Marker Table to get the logs): https://share.firefox.dev/3XdSqo5

Flags: needinfo?(krosylight)

(In reply to Kagami Rosylight [:saschanaz] (they/them) from comment #6)

Sure. A perf profile with a fresh firefox profile (see Marker Table to get the logs): https://share.firefox.dev/3XdSqo5

(I see zero rust logs this way 🤔)

Could you choose logging to a file instead?
Thanks.

Flags: needinfo?(krosylight)

LogMessages - (neqo_transport::*) [neqo_transport] Crypto operation failed NssError { name: "SEC_ERROR_BAD_DATA", code: -8190, desc: "security library: received bad data." }
LogMessages - (neqo_transport::*) [neqo_transport::stats] [Client ...] Dropped received packet: Decryption failure; Total: 27

This constantly happens, interesting.

(In reply to Kershaw Chang [:kershaw] from comment #7)

(In reply to Kagami Rosylight [:saschanaz] (they/them) from comment #6)

Sure. A perf profile with a fresh firefox profile (see Marker Table to get the logs): https://share.firefox.dev/3XdSqo5

(I see zero rust logs this way 🤔)

Could you choose logging to a file instead?
Thanks.

You checked my comment too early! I just selected a wrong thread (main thread lol), select socket thread and you see all the logs.

Flags: needinfo?(krosylight)

FWIW I tried it in Nightly Version 132.0a1 (2024-09-03) (64-bit) in a Windows for ARM VM on my Macbook M3 and fosstodon.org loads fine.

(In reply to Kagami Rosylight [:saschanaz] (they/them) from comment #8)

LogMessages - (neqo_transport::*) [neqo_transport] Crypto operation failed NssError { name: "SEC_ERROR_BAD_DATA", code: -8190, desc: "security library: received bad data." }
LogMessages - (neqo_transport::*) [neqo_transport::stats] [Client ...] Dropped received packet: Decryption failure; Total: 27

This constantly happens, interesting.

This is possibly a red herring - this is the message that gets printed when QUIC packets are padded with garbage. We should probably lower the log level of that message, or only log it if it is not a coalesced garbage packet that triggers it.

Edit: But there sure are more of those than there should be...

(In reply to Lars Eggert [:lars] from comment #10)

FWIW I tried it in Nightly Version 132.0a1 (2024-09-03) (64-bit) in a Windows for ARM VM on my Macbook M3 and fosstodon.org loads fine.

Hmm, could this be device specific? x86 32bit binary (by --arch 32 to mozregression) shows the same problem on my Surface Pro 11 (Qualcomm Snapdragon X) on the BER office network. (Edit: the profile above was recorded in a different network)

And with the same configuration my AMD desktop has no problem.

@krosylight, am I understanding you correctly that you think this may happen on 32-bit builds/platforms but not 64-bit ones? Or only on 32-bit ARM?

No, I wanted to say that different archs show the same problem on my device. All aarch64/x86/x86-64 nightly builds have the same issue.

Thank you for the additional logs via mail!

Paraphrasing the above to make sure I understand correctly. On an ARM machine running Windows for ARM (Surface Pro 11) fosstodon.org does not load with aarch64, x86 and x86-64 Firefox Nightly builds with network.http.http3.use_nspr_for_io set to false. The latter two (x86, x86-64) are presumably run via Windows for ARM emulation layer.

Do I understand correctly that all other H3 pages work fine, e.g. https://quic.nginx.org/ or https://cloudflare-quic.com/ ?

Flags: needinfo?(krosylight)

(In reply to Max Inden from comment #16)

Thank you for the additional logs via mail!

Paraphrasing the above to make sure I understand correctly. On an ARM machine running Windows for ARM (Surface Pro 11) fosstodon.org does not load with aarch64, x86 and x86-64 Firefox Nightly builds with network.http.http3.use_nspr_for_io set to false. The latter two (x86, x86-64) are presumably run via Windows for ARM emulation layer.

Yes.

Do I understand correctly that all other H3 pages work fine, e.g. https://quic.nginx.org/ or https://cloudflare-quic.com/ ?

I did not check others as I don't know who uses H3. https://quic.nginx.org/ shows "You're not using QUIC right now, but don't despair," with NSPR pref off, while it shows "Congratulations! You're connected over QUIC." with NSPR pref on. Tested with mozregression --launch 2024-09-04 -a https://quic.nginx.org/ --pref network.http.http3.use_nspr_for_io:true. Similar result on the cloudflare page.

But as I showed to Kershaw offline, the test result is a bit flaky. Sometimes it connects fine over QUIC (then breaks after a few reloads and stays permanently stuck on HTTP2), and sometimes it goes straight to HTTP2 from the start.

Edit: And with the nspr pref on, things work consistently without flakiness.

Flags: needinfo?(krosylight)

Thank you for your help!

Do you have Rust installed on this laptop? If so, would you mind trying the following and posting the output here?

# In some directory of your choice:
git clone https://github.com/quinn-rs/quinn.git
cd quinn/quinn-udp 
cargo test

Context:

  • When you set network.http.http3.use_nspr_for_io to false your Firefox instance uses quinn-udp instead of NSPR.
  • Instead of debugging quinn-udp through Firefox, we might as well debug it directly, running its unit tests on your machine.
Flags: needinfo?(krosylight)

Test seems clean:

> cargo test
warning: D:\quinn\quinn-udp\Cargo.toml: unused manifest key: target.cfg(any(target_os = "linux", target_os = "windows")).bench
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.09s
     Running unittests src\lib.rs (D:\quinn\target\debug\deps\quinn_udp-528859568f696d14.exe)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running tests\tests.rs (D:\quinn\target\debug\deps\tests-9710717108edc830.exe)

running 6 tests
test basic ... ok
test ecn_v4_mapped_v6 ... ok
test ecn_v6 ... ok
test ecn_v6_dualstack ... ok
test ecn_v4 ... ok
test gso ... ok

test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01s

   Doc-tests quinn_udp

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Flags: needinfo?(krosylight)

Thank you for trying this so quickly. This is surprising.

Do I understand correctly that you only alter the network.http.http3.use_nspr_for_io pref?

Flags: needinfo?(krosylight)

Yes, as each time mozregression creates a new profile in a temp directory, which then only includes the default prefs unless told otherwise by --pref.

Flags: needinfo?(krosylight)

Is this only a problem from the Berlin office network, or does it happen elsewhere?

It happens elsewhere too.

So the connection tends to work initially (on https://quic.nginx.org/) and then breaks after a few reloads. Can we try making a test that simulates such a situation?

Assignee: nobody → mail
Severity: -- → S3
Priority: -- → P2
Whiteboard: [necko-triaged][necko-priority-next]

If I understand Firefox's build system correctly, debug and trace level logging is not available in release builds.

Would you mind reproducing the issue at hand once more, this time in a debug build, and again send us the logs?

As far as I know, the following addition to mozconfig should do, but I assume you know better than me.

ac_add_options --enable-debug
ac_add_options --enable-debug-symbols

Please enable the following log level:

timestamp,sync,nsHttp:5,nsSocketTransport:5,nsHostResolver:5,neqo_http3::*:5,neqo_transport::*:5,quinn_udp::*:5,neqo_udp::*:5,neqo_glue::*:5,neqo_common::*:5

Thank you for your help!

Flags: needinfo?(mail) → needinfo?(krosylight)

Oh hey, I can see tons of dropped packets in debug build. Sending the log!

Flags: needinfo?(krosylight)

Kagami, can you still reproduce this issue with latest Firefox? If yes, could you do a packet capture with e.g. Wireshark and send the .pcap and SSLKEYLOGFILE to necko@mozilla.com?

You can instruct Firefox to persist its TLS keys via: SSLKEYLOGFILE=/tmp/keys.txt ./mach run.

Sorry for not making any progress here thus far. Thank you for your help.

Flags: needinfo?(krosylight)

Yes, fosstodon still doesn't load with the default pref. I'm not familiar with Wireshark, does it log every packet from the OS?

Flags: needinfo?(krosylight)

does it log every packet from the OS?

After starting Wireshark, you can first specify a capture filter and then choose an interface. Capture filter with udp should suffice for us here.


Are you running any anti-virus software? If so, does the problem continue, even when disabling the anti-virus software?

Flags: needinfo?(krosylight)

Given it still sounds like it would capture ALL UDP packets, is there some better filter syntax to make sure it only captures traffic going to fosstodon, or traffic coming only from a fresh-profile Nightly?

Disabling Windows Security's realtime protection doesn't seem to help.

Flags: needinfo?(krosylight)

You can right click one of the quic/http3/udp packets > follow udp stream. Then select all the packets and go to File > export specified packets.
Then, assuming you used SSLKEYLOGFILE=/tmp/keys.txt, you can inject the secrets into the pcap with editcap --inject-secrets tls,/tmp/keys.txt capture.pcapng capwithsecrets.pcapng

Otherwise you can just attach your keys.txt file.

Alright, sent the capture to necko@moz. I hope I did it right 👍

Thank you Kagami. Very helpful.

I see UDP datagrams larger than 1500 bytes, with the largest larger than 15_000. Would you mind sharing your interface MTU? After a quick Google search, the following command should do it.

netsh interface ipv4 show subinterface
Flags: needinfo?(krosylight)

       MTU  MediaSenseState      Bytes In     Bytes Out  Interface
----------  ---------------  ------------  ------------  -------------
4294967295                1             0         22145  Loopback Pseudo-Interface 1
      1500                5             0             0  Bluetooth Network Connection
      1500                1    4076196507     198252145  WiFi
      1500                5             0             0  WiFi 2
      1500                5             0             0  WiFi 4
      1500                5             0             0  WiFi 5

That loopback interface should be Wireshark, but it's still weird for it to have such a high MTU.

Flags: needinfo?(krosylight)

Thank you Kagami.

MTUs above look fine. Windows loopback is special, as it allows an MTU > 64k.
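
As a back-of-the-envelope check on why datagrams above 1500 bytes stand out on a 1500-byte MTU interface: a single unfragmented IP packet can carry at most the MTU minus the IP and UDP headers. A small sketch of that arithmetic, assuming no IPv4 options or IPv6 extension headers (the helper names are made up for illustration):

```rust
const UDP_HEADER_LEN: usize = 8;
const IPV4_HEADER_LEN: usize = 20; // assumes no IPv4 options
const IPV6_HEADER_LEN: usize = 40; // assumes no extension headers

/// Largest UDP payload that fits in one unfragmented IP packet.
fn max_udp_payload(mtu: usize, is_ipv6: bool) -> usize {
    let ip_header = if is_ipv6 { IPV6_HEADER_LEN } else { IPV4_HEADER_LEN };
    mtu.saturating_sub(ip_header + UDP_HEADER_LEN)
}

/// True if a datagram of this size cannot have arrived as a single
/// unfragmented packet on a link with the given MTU.
fn exceeds_single_packet(datagram_len: usize, mtu: usize, is_ipv6: bool) -> bool {
    datagram_len > max_udp_payload(mtu, is_ipv6)
}
```

For an MTU of 1500 this gives 1472 bytes over IPv4 and 1452 over IPv6, so anything larger seen by the application was either fragmented at the IP layer or put together from several packets after arrival.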

I have a theory, namely that with quinn-udp, we don't correctly read the UDP segment size on Windows on ARM. This is supported by the pcap you sent, where e.g. we receive a 3840 bytes UDP datagram containing a 1280 bytes QUIC packet only. The remaining bytes I assume contain additional QUIC packet(s), read in a single GRO read.

I will prepare a unit test next to confirm the above.
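
To illustrate the theory: with a coalesced receive, the OS hands back one large buffer plus a segment size, and the caller is expected to cut the buffer at that stride. The sketch below shows that step in isolation. `split_coalesced` is a hypothetical helper, not quinn-udp's actual API, and treating a segment size of 0 as "none reported" is an assumption made for the example.

```rust
/// Splits the buffer from one coalesced receive into individual datagrams.
/// `segment_size` is the per-segment size reported alongside the data; the
/// last segment may be shorter. Hypothetical helper for illustration only.
fn split_coalesced(buf: &[u8], segment_size: usize) -> Vec<&[u8]> {
    if segment_size == 0 {
        // Assumption: 0 means no segment size was reported, so the whole
        // buffer is handed on as one datagram. This is the failure mode
        // suspected above.
        return vec![buf];
    }
    buf.chunks(segment_size).collect()
}
```

Note that 3840 is exactly 3 x 1280, so a 3840-byte read with a 1280-byte segment size would split into three equal datagrams, whereas without the segment size it is passed on as a single 3840-byte datagram.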

Would you mind running the following unit test?

https://github.com/mxinden/quinn/commit/7ce09c3f9523920682163c35fca17b4f3a4ed2f3

The steps below should work on your machine:

git clone https://github.com/mxinden/quinn.git
cd quinn/
git checkout large-gro
cd quinn-udp
cargo test
Flags: needinfo?(krosylight)
running 7 tests
test ecn_v4 ... ok
test basic ... ok
test ecn_v6 ... ok
test large_gro ... FAILED
test ecn_v6_dualstack ... ok
test ecn_v4_mapped_v6 ... ok
test gso ... ok

failures:

---- large_gro stdout ----
thread 'large_gro' panicked at quinn-udp\tests\tests.rs:289:5:
assertion `left == right` failed
  left: 1280
 right: 3848
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    large_gro

test result: FAILED. 6 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01s

error: test failed, to rerun pass `--test tests`

👀

Flags: needinfo?(krosylight)

Added debug!("cmsg_iter {0} {1}", cmsg.cmsg_level, cmsg.cmsg_type); on top of https://github.com/mxinden/quinn/commit/83bafa51c9898a03bfc40410175cdb64e630a421, and the result was:

running 1 test
2024-11-08T21:36:24.430147Z DEBUG quinn_udp::imp: recv
2024-11-08T21:36:24.432055Z DEBUG quinn_udp::imp: cmsg_iter 0 19
2024-11-08T21:36:24.432335Z DEBUG quinn_udp::imp: cmsg_iter 0 50
thread 'large_gro' panicked at quinn-udp\tests\tests.rs:293:5:
assertion `left == right` failed
  left: 1280
 right: 3848

And 0 means IPPROTO_IP instead of IPPROTO_UDP?

Good idea. That said, these values are as expected.

0 19 should be WinSock::IPPROTO_IP WinSock::IP_PKTINFO.

0 50 should be WinSock::IPPROTO_IP WinSock::IP_ECN.

Each of these is handled here:

https://github.com/quinn-rs/quinn/blob/a0d8985021cfd45665da38f17376ba335fd44bb4/quinn-udp/src/windows.rs#L217-L233
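
For reference, the (level, type) pairs from the debug output can be decoded as in the sketch below. The values 0/19 and 0/50 are the ones confirmed above. The UDP-level entry (17/3 for IPPROTO_UDP / UDP_COALESCED_INFO) is my assumption of where a coalesced segment size would show up and should be checked against the Windows headers. The function names are made up for illustration.

```rust
// Winsock control message constants. (0, 19) and (0, 50) are confirmed in
// this thread; UDP_COALESCED_INFO = 3 is an assumption to be verified.
const IPPROTO_IP: i32 = 0;
const IPPROTO_UDP: i32 = 17;
const IP_PKTINFO: i32 = 19;
const IP_ECN: i32 = 50;
const UDP_COALESCED_INFO: i32 = 3; // assumption

/// Human-readable name for a (cmsg_level, cmsg_type) pair.
fn describe_cmsg(level: i32, cmsg_type: i32) -> &'static str {
    match (level, cmsg_type) {
        (IPPROTO_IP, IP_PKTINFO) => "IPPROTO_IP / IP_PKTINFO",
        (IPPROTO_IP, IP_ECN) => "IPPROTO_IP / IP_ECN",
        (IPPROTO_UDP, UDP_COALESCED_INFO) => "IPPROTO_UDP / UDP_COALESCED_INFO",
        _ => "unknown",
    }
}

/// True if any control message carries coalescing info (a segment size).
fn has_coalesced_info(cmsgs: &[(i32, i32)]) -> bool {
    cmsgs
        .iter()
        .any(|&(level, cmsg_type)| level == IPPROTO_UDP && cmsg_type == UDP_COALESCED_INFO)
}
```

Under that assumption, the two logged pairs `[(0, 19), (0, 50)]` contain no UDP-level message, i.e. no segment size was delivered with that receive.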


My unit test above might fail to trigger a coalesced UDP receive (i.e. URO, the Windows GRO equivalent).

Still, the pcap you shared supports my theory above. Firefox seems to read multiple > 1_500 bytes UDP datagrams. Given that this is an Internet path (fosstodon.org is behind fastly), the single large UDP datagram is likely coalesced from multiple < 1_500 bytes UDP datagrams. Thus I assume we properly set the WinSock::UDP_RECV_MAX_COALESCED_SIZE, i.e. indicate to the OS that UDP datagrams should be coalesced, but fail to read the segment size on receive, i.e. mistake a coalesced UDP datagram for a single classic very large UDP datagram.

The above is problematic once the QUIC connection is established. QUIC will use short headers, not containing packet lengths:

Packets with short headers (Section 17.3) only include the Destination Connection ID and omit the explicit length.

https://www.rfc-editor.org/rfc/rfc9000.html#name-connection-id

Thus QUIC is not able to identify the QUIC packet boundaries within the coalesced large UDP datagram, and thereby drops the invalid packet. The high packet loss leads to spurious or total connection failure.
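
To make the header distinction concrete: in RFC 9000 the most significant bit of a packet's first byte is the Header Form bit, set for long headers and clear for short headers. A short header packet has no length field, so it runs to the end of the UDP datagram it sits in. The sketch below only classifies the first byte; the type and function names are made up for illustration and this is not neqo's parsing code.

```rust
/// QUIC header form, taken from the most significant bit of the first byte.
#[derive(Debug, PartialEq)]
enum HeaderForm {
    /// Bit set: long header, used during connection establishment.
    Long,
    /// Bit clear: short header, no length field, so the packet extends
    /// to the end of the UDP datagram.
    Short,
}

/// Classifies the first QUIC packet in a datagram; `None` if it is empty.
fn header_form(datagram: &[u8]) -> Option<HeaderForm> {
    let first = *datagram.first()?;
    Some(if first & 0x80 != 0 {
        HeaderForm::Long
    } else {
        HeaderForm::Short
    })
}
```

So if several short header packets arrive glued together in one buffer and the segment size is lost, the receiver has nothing in the packets themselves to cut them apart with. It attempts to decrypt the whole buffer as one packet, which fails.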


I will give this more thought.

Summary: fosstodon.org doesn't load with network.http.http3.use_nspr_for_io=false → fosstodon.org doesn't load with network.http.http3.use_nspr_for_io=false on ARM64 Windows

To validate another theory raised in quinn#2041, namely that our control message buffer is too small, would you mind running the above unit test once more on this commit, Kagami?

https://github.com/mxinden/quinn/commit/eb2be2d9842fe2b1039229f104958eba186af9db

> git log -1
commit eb2be2d9842fe2b1039229f104958eba186af9db (HEAD -> large-gro, mxinden/large-gro)
Author: Max Inden <mail@max-inden.de>
Date:   Thu Nov 14 11:56:51 2024 +0100

    Use large CMSG_LEN

> cargo test
   Compiling winapi v0.3.9
   Compiling quinn-udp v0.5.7 (D:\quinn\quinn-udp)
   Compiling criterion v0.5.1
   Compiling nu-ansi-term v0.46.0
   Compiling tracing-subscriber v0.3.18
    Finished `test` profile [unoptimized + debuginfo] target(s) in 7.11s
     Running unittests src\lib.rs (D:\quinn\target\debug\deps\quinn_udp-0e126f761ea7a4f8.exe)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running tests\tests.rs (D:\quinn\target\debug\deps\tests-705ba9a9199dc5c7.exe)

running 7 tests
test ecn_v4 ... ok
test ecn_v6 ... ok
test basic ... ok
test ecn_v6_dualstack ... ok
test ecn_v4_mapped_v6 ... ok
test gso ... ok
test large_gro ... FAILED

failures:

---- large_gro stdout ----
thread 'large_gro' panicked at quinn-udp\tests\tests.rs:293:5:
assertion `left == right` failed
  left: 1280
 right: 3848
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    large_gro

test result: FAILED. 6 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 1.01s

error: test failed, to rerun pass `--test tests`

Sorry for not mentioning this. Please run with RUST_LOG=debug like you did in https://bugzilla.mozilla.org/show_bug.cgi?id=1916558#c38.

Flags: needinfo?(krosylight)

Oh yes.

> $env:RUST_LOG="debug"; cargo test large_gro -- --nocapture
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.17s
     Running unittests src\lib.rs (D:\quinn\target\debug\deps\quinn_udp-0e126f761ea7a4f8.exe)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running tests\tests.rs (D:\quinn\target\debug\deps\tests-705ba9a9199dc5c7.exe)

running 1 test
2024-11-14T11:10:15.247131Z DEBUG quinn_udp::imp: recv
2024-11-14T11:10:15.249245Z DEBUG quinn_udp::imp: cmsg_iter?
2024-11-14T11:10:15.249363Z DEBUG quinn_udp::imp: cmsg_iter 0 19
2024-11-14T11:10:15.249402Z DEBUG quinn_udp::imp: cmsg_iter 0 50
thread 'large_gro' panicked at quinn-udp\tests\tests.rs:293:5:
assertion `left == right` failed
  left: 1280
 right: 3848
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test large_gro ... FAILED

failures:

failures:
    large_gro

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 6 filtered out; finished in 1.01s

error: test failed, to rerun pass `--test tests`

(That iter? is added by me while I was confused why I'm not getting any debug log. Of course I don't get it without env var. 😂)

Flags: needinfo?(krosylight)

Out-of-band you confirmed that you are running on the latest Windows on ARM version. Given that Windows' URO implementation might offload to your network interface card, can you confirm that you don't have any pending firmware updates, especially for the NIC?

Flags: needinfo?(krosylight)

And what NIC is it? Some are better than others in terms of drivers.

It's "Qualcomm(R) FastConnect(TM) 7800 Mobile Connectivity System". I see no firmware update.

Flags: needinfo?(krosylight)

Status Update:

  • With the help of other Mozillians, I have tried to reproduce this bug on:
    • Surface Pro X (yet to be tested on Windows 11, version 24H2)
    • Dell Latitude 7455 (with Snapdragon X Elite) (maybe Windows 11, version 24H2)
    • Thinkpad P1 x86_64 (Windows 11, version 24H2)
    • Mac M3 Windows VM (Windows 11, version 24H2)
  • Thus far without success.
  • I have ordered a Surface Pro 11 (Qualcomm Snapdragon X) to try to reproduce the issue directly.

I have ordered a Surface Pro 11 (Qualcomm Snapdragon X) to try to reproduce the issue directly.

Unfortunately I can not reproduce the bug with this new device.

  • Surface Pro 11th Edition 2076
  • Snapdragon X 12-core X1E80100
  • Qualcomm(R) FastConnect(TM) 7800 Mobile Connectivity System
  • Windows 11 Home 10.0.26100 Build 26100
  • Tested with both Firefox Nightly 134.0a1 (2024-11-21)(aarch64) and Firefox Nightly 134.0a1 (2024-11-21)(64-Bit)
  • network.http.http3.use_nspr_for_io false
  • https://fosstodon.org loads most resources over http3, some use http2.
  • No connection issues noticeable. No HTTP request fails.
  • fosstodon.org resolves to 151.101.131.52 (fastly) as expected.
  • Glean metrics are as expected:
    • http_3_udp_datagram_segments_received shows that the majority of receive calls read multiple coalesced UDP datagrams (i.e. segments).
    • http_3_udp_datagram_size_received shows multiple > 1500 datagrams (presumably coalesced).
    • http_3_udp_datagram_segment_size_received shows no datagram segment > 1500.
  • Wireshark shows (as expected) multiple > 1500 UDP datagrams (presumably coalesced).

I found a way to reproduce the bug 🎉

  1. Install Windows subsystem for Linux in a Terminal via wsl --install.
  2. Restart Windows.
  3. As described above, navigate to https://fosstodon.org and hard reload (ctrl-shift-r) until you see http3 requests stalling.

Once reproduced you can undo via:

  1. Control Panel -> programs-features -> turn off "Virtual Machine Platform".
  2. Restart Windows.
  3. Navigate to https://fosstodon.org and you won't see any more http3 request stalls.

@Kagami I assume you are running WSL, correct? Would you mind testing out part two above to see whether that fixes the bug on your end as well? Note that I am not a Windows expert, e.g. I don't know whether turning off "Virtual Machine Platform" is easily revertible. It is on my machine.

Flags: needinfo?(krosylight)

Ohhhhh. I have Docker so I'm not sure turning it off is safe, but then I don't really have anything important in Docker or in WSL, so maybe I can still try.

Interesting that it doesn't happen on my desktop, which also has Virtual Machine Platform for WSL.

Flags: needinfo?(krosylight)

Yes, I can confirm that turning off VMP fixes the issue and turning it back on reintroduces the issue. 😲

See Also: → 1935954

Bug 1935954 introduces an interim fix, disabling URO when detecting Windows on ARM.

Kagami, in case you have some time, would you mind confirming that https://phabricator.services.mozilla.com/D231505 fixes the issues described here? A simple debug build of the patch and visiting fosstodon.org should be enough.

Flags: needinfo?(krosylight)

Repeating ./mach run --temp-profile https://fosstodon.org shows no issue, compared to 2024-12-09 build which shows the issue. 👍🏻

Flags: needinfo?(krosylight)

I recently began to experience this issue after updating to Windows 11 24H2, but I am on x86 (I see the same issue on 2 devices, a desktop with an AMD CPU and a Realtek NIC, and a laptop with an Intel CPU and Intel Wi-Fi). Bisection pointed to bug 1910360 and I confirmed that setting network.http.http3.use_nspr_for_io back to true fixes the issue (I used a different website to bisect, but confirmed the difference with https://fosstodon.org as well).

Unfortunately turning off VMP did not help in my case (it was enabled, but disabling it did not make a difference).

Would you like me to file a separate bug, or investigate here?

I have reached out to Microsoft to determine if this is a known issue and whether a Windows fix is in preparation.

Thank you for the investigation Emanuel.

(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #54)

Would you like me to file a separate bug, or investigate here?

Until we have any strong indication that the Windows on ARM URO failure is unrelated to the x86 URO failure, I suggest we track both in this Bug.

(In reply to Lars Eggert [:lars] from comment #55)

I have reached out to Microsoft to determine if this is a known issue and whether a Windows fix is in preparation.

Thank you Lars!

NI to myself to confirm whether this was Wi-Fi specific on my machine or whether the wired connection was also affected.

Flags: needinfo?(krosylight)

Turns out the 2024-12-09 build (the build before URO was disabled) doesn't have a problem on a wired connection. The network info shows a different device for each connection:

  • For wired, it says "Realtek USB GbE Family Controller"
  • For wireless, it says "Qualcomm(R) FastConnect(TM) 7800 Mobile Connectivity System"

The Realtek controller doesn't appear in Device Manager without the USB cable connected, so perhaps it's an external device rather than one inside the SP11. (The wired connection comes through a USB-C cable from a Dell U2723QE monitor to which a LAN cable is connected.)

Flags: needinfo?(krosylight)

I found that disabling "Recv Segment Coalescing" in the Realtek drivers also works around this problem (which would make sense given that URO is UDP Receive Segment Coalescing Offload), and found at least 1 comment online[1] suggesting that this setting may have been disabled by default in older drivers with power saving features enabled (it was enabled by default for me).

So this may be a case where you need to install the latest drivers[2] or manually enable the "Recv Segment Coalescing" setting in the drivers to reproduce the problem.

I have not found an equivalent setting in the Intel drivers. There's a setting called "Packet Coalescing", but disabling it does not work around the problem.

[1] https://www.elevenforum.com/t/latest-realtek-lan-driver-win11.9226/page-2#post-334360
[2] https://www.realtek.com/Download/List?cate_id=584

Good point Emanuel, thanks!

So it was an RTL8153, and the Windows built-in driver didn't have "Recv Segment Coalescing" at all in the driver's advanced menu. Manually installing the driver from Realtek added the entry, enabled by default, and with that the issue happens again.

And I can confirm that disabling "Recv Segment Coalescing" in the driver menu solves the issue on the 2024-12-09 build for both the Qualcomm and Realtek drivers.

Ah, I just spotted the equivalent settings in the Intel drivers: RSCv4 and RSCv6.

I've confirmed that setting these options to disabled works around the problem (as expected).

To reproduce the problem, these settings probably need to be available and enabled.

Thank you for the recent investigations.

Given that this issue does not only apply to Windows on ARM, and given that no solution is in sight, we are disabling URO on Windows altogether.

quinn#2092 disabled URO on Windows by default. It is part of quinn-udp v0.5.9 which has landed in mozilla-central with phabricator#D232475.

I'm no longer seeing issues with the latest Nightly, with default driver settings and network.http.http3.use_nspr_for_io == false :)

By the way, one more thing I noticed:

  1. On my desktop's Intel WiFi, I don't see the RSCv4 and RSCv6 options despite having the latest drivers (and Windows 11 24H2). It's an older chip (AX200) than the one in my laptop (BE200).
  2. On my laptop's Realtek NIC, I don't see the Recv Segment Coalescing options despite having the latest drivers (and Windows 11 24H2), and despite the fact that it's a newer revision of the same product (RTL8125, but REV_05 as opposed to REV_04 according to the compatible IDs).

So even having the offloading feature available in the first place seems almost random!

See Also: → 1937771

Max, should we keep this bug open?
I think we can close this bug, since URO is disabled.

Flags: needinfo?(mail)

Long term I believe we should do URO (Windows equivalent to GRO). I suggest we track that work somewhere, for example here. In other words, I suggest keeping this open.

We are in touch with Microsoft. One of the engineers has been able to reproduce the failure with our reproducer. That is it for now.

Flags: needinfo?(mail)
Whiteboard: [necko-triaged][necko-priority-next] → [necko-triaged][necko-priority-monitor]