Open Bug 1916558 Opened 17 days ago Updated 4 days ago

fosstodon.org doesn't load with network.http.http3.use_nspr_for_io=false

Categories

(Core :: Networking, defect, P2)

ARM64
Windows 11
defect

Tracking

()

Tracking Status
firefox-esr115 --- unaffected
firefox-esr128 --- unaffected
firefox130 --- disabled
firefox131 --- disabled
firefox132 --- disabled

People

(Reporter: saschanaz, Assigned: mail)

References

(Regression)

Details

(Keywords: regression, Whiteboard: [necko-triaged][necko-priority-next])

Mozregression says it's in this range: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=4d36a2f69e7027dcc1e3535e563aa57e556d9f70&tochange=c38029641964591e518535856e2d7b3038b134ad

Could not pinpoint the commit because of mozregression aarch64 support issue (WARNING: Skipping build b94ec5ba05c9: Unable to find build info using the taskcluster route 'gecko.v2.autoland.shippable.revision.b94ec5ba05c96151a539e0c05403f7384829b008.firefox.win64-aarch64-opt) but I see bug 1910360 being the only network change there.

Bisected by mozregression --good 2024-07-03 --bad 2024-08-14 -a https://fosstodon.org/explore. On a bad build the page never finishes loading, and on a good build it loads instantly.

Edit: On MSEdge the page always loads instantly.

OS: Unspecified → Windows 11
Hardware: Unspecified → ARM64
Version: unspecified → Firefox 130

Setting network.http.http3.use_nspr_for_io to true indeed fixes the issue.

Summary: fosstodon.org doesn't load since nightly 2024-08-02 → fosstodon.org doesn't load with network.http.http3.use_nspr_for_io=false

Set release status flags based on info from the regressing bug 1910360

:mail, since you are the author of the regressor, bug 1910360, could you take a look?

For more information, please visit BugBot documentation.

Thank you Kagami Rosylight for tracking this down. I will take a look.

Given that network.http.http3.use_nspr_for_io is only set to false on Nightly, i.e. quinn-udp only used Nightly, I don't think this affects firefox130, firefox131 and firefox132. Am I missing something?

# Use NSPR for HTTP3 UDP IO
- name: network.http.http3.use_nspr_for_io
  type: RelaxedAtomicBool
# Always use NSPR on Android x86. On Android x86 sendmsg and recvmmsg, both used
# by quinn-udp, are disallowed by seccomp. Future fix and further details
# tracked in https://github.com/quinn-rs/quinn/issues/1947.
#if defined(ANDROID) && !defined(HAVE_64BIT_BUILD)
  value: true
#else
  value: @IS_NOT_NIGHTLY_BUILD@
#endif
  mirror: always

https://searchfox.org/mozilla-central/rev/8fffdc727aa507ee4955042ec2d6f71d23c9c2de/modules/libpref/init/StaticPrefList.yaml#13672-13683

I am unable to reproduce this bug on x86-64 Linux and Arm M2 Mac.

Kagami Rosylight can you reproduce the bug once more, with the following log level enabled, and send the logs to necko@mozilla.com and minden@mozilla.com?

timestamp,sync,nsHttp:5,nsSocketTransport:5,nsHostResolver:5,neqo_http3::*:5,neqo_transport::*:5,quinn_udp::*:5,neqo_udp::*:5,neqo_glue::*:5

Flags: needinfo?(krosylight)

Sure. A perf profile with a fresh firefox profile (select Socket Thread and see Marker Table to get the logs): https://share.firefox.dev/3XdSqo5

Flags: needinfo?(krosylight)

(In reply to Kagami Rosylight [:saschanaz] (they/them) from comment #6)

Sure. A perf profile with a fresh firefox profile (see Marker Table to get the logs): https://share.firefox.dev/3XdSqo5

(I see zero rust logs this way 🤔)

Could you choose logging to a file instead?
Thanks.

Flags: needinfo?(krosylight)

LogMessages — (neqo_transport::) [neqo_transport] Crypto operation failed NssError { name: "SEC_ERROR_BAD_DATA", code: -8190, desc: "security library: received bad data." }
LogMessages — (neqo_transport::
) [neqo_transport::stats] [Client ...] Dropped received packet: Decryption failure; Total: 27

This constantly happens, interesting.

(In reply to Kershaw Chang [:kershaw] from comment #7)

(In reply to Kagami Rosylight [:saschanaz] (they/them) from comment #6)

Sure. A perf profile with a fresh firefox profile (see Marker Table to get the logs): https://share.firefox.dev/3XdSqo5

(I see zero rust logs this way 🤔)

Could you choose logging to a file instead?
Thanks.

You checked my comment too early! I just selected a wrong thread (main thread lol), select socket thread and you see all the logs.

Flags: needinfo?(krosylight)

FWIW I tried it in Nightly Version 132.0a1 (2024-09-03) (64-bit) in a Windows for ARM VM on my Macbook M3 and fosstodon.org loads fine.

(In reply to Kagami Rosylight [:saschanaz] (they/them) from comment #8)

LogMessages — (neqo_transport::) [neqo_transport] Crypto operation failed NssError { name: "SEC_ERROR_BAD_DATA", code: -8190, desc: "security library: received bad data." }
LogMessages — (neqo_transport::
) [neqo_transport::stats] [Client ...] Dropped received packet: Decryption failure; Total: 27

This constantly happens, interesting.

This is possibly a red herring - this is the message that gets printed when QUIC packets are padded with garbage. We should probably lower the log level of that message, or only log it if it is not a coalesced garbage packet that triggers it.

Edit: But there sure are more of those than there should be...

(In reply to Lars Eggert [:lars] from comment #10)

FWIW I tried it in Nightly Version 132.0a1 (2024-09-03) (64-bit) in a Windows for ARM VM on my Macbook M3 and fosstodon.org loads fine.

Hmm, could this be device specific? x86 32bit binary (by --arch 32 to mozregression) shows the same problem on my Surface Pro 11 (Qualcomm Snapdragon X) on the BER office network. (Edit: the profile above was recorded in a different network)

And with the same configuration my AMD desktop has no problem.

@krosylight, am I understanding you correctly that you think this may happen on 32-bit builds/platforms but not 64-bit ones? Or only on 32-bit ARM?

No, I wanted to say that different archs shows the same problem on my device. All aarch64/x86/x86-64 nightly builds have the same issue.

Thank you for the additional logs via mail!

Paraphrasing the above to make sure I understand correctly. On an ARM machine running Windows for ARM (Surface Pro 11) fosstodon.org does not load with aarch64, x86 and x86-64 Firefox Nightly builds with network.http.http3.use_nspr_for_io set to false. The latter two (x86, x86-64) are presumably run via Windows for ARM emulation layer.

Do I understand correctly that all other H3 pages work fine, e.g. https://quic.nginx.org/ or https://cloudflare-quic.com/ ?

Flags: needinfo?(krosylight)

(In reply to Max Inden from comment #16)

Thank you for the additional logs via mail!

Paraphrasing the above to make sure I understand correctly. On an ARM machine running Windows for ARM (Surface Pro 11) fosstodon.org does not load with aarch64, x86 and x86-64 Firefox Nightly builds with network.http.http3.use_nspr_for_io set to false. The latter two (x86, x86-64) are presumably run via Windows for ARM emulation layer.

Yes.

Do I understand correctly that all other H3 pages work fine, e.g. https://quic.nginx.org/ or https://cloudflare-quic.com/ ?

I did not check others as I don't know who uses H3. https://quic.nginx.org/ shows "You're not using QUIC right now, but don't despair," with NSPR pref off, while it shows "Congratulations! You're connected over QUIC." with NSPR pref on. Tested with mozregression --launch 2024-09-04 -a https://quic.nginx.org/ --pref network.http.http3.use_nspr_for_io:true. Similar result on the cloudflare page.

But as I showed to Kershaw offline, the test result is a bit flaky. Sometimes it connects well with QUIC (and then breaks on a few reload and then permanently stuck on HTTP2), or just goes straight to HTTP2 from the start.

Edit: And with the nspr pref on, things work consistently without flakyness.

Flags: needinfo?(krosylight)

Thank you for your help!

Do you have Rust installed on this laptop? If so, would you mind trying the following and posting the output here?

# In some directory of your choice:
git clone https://github.com/quinn-rs/quinn.git
cd quinn/quinn-udp 
cargo test

Context:

  • When you set network.http.http3.use_nspr_for_io to false your Firefox instance uses quinn-udp instead of NSPR.
  • Instead of debugging quinn-udp through Firefox, we might as well debug it directly, running its unit tests on your machine.
Flags: needinfo?(krosylight)

Test seems clean:

> cargo test
warning: D:\quinn\quinn-udp\Cargo.toml: unused manifest key: target.cfg(any(target_os = "linux", target_os = "windows")).bench
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.09s
     Running unittests src\lib.rs (D:\quinn\target\debug\deps\quinn_udp-528859568f696d14.exe)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running tests\tests.rs (D:\quinn\target\debug\deps\tests-9710717108edc830.exe)

running 6 tests
test basic ... ok
test ecn_v4_mapped_v6 ... ok
test ecn_v6 ... ok
test ecn_v6_dualstack ... ok
test ecn_v4 ... ok
test gso ... ok

test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01s

   Doc-tests quinn_udp

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Flags: needinfo?(krosylight)

Thank you for trying this so quickly. This is surprising.

Do I understand correctly that you only alter the network.http.http3.use_nspr_for_io pref?

Flags: needinfo?(krosylight)

Yes, as each time mozregression creates a new profile in a temp directory, which then only includes the default prefs unless told otherwise by --pref.

Flags: needinfo?(krosylight)

Is this only a problem from the Berlin office network, or does it happen elsewhere?

It happens elsewhere too.

So the connection tend to work initially (in https://quic.nginx.org/) and then breaks in a few reload. Can we try making a test simulating such situation?

Assignee: nobody → mail
Severity: -- → S3
Priority: -- → P2
Whiteboard: [necko-triaged][necko-priority-next]

If I understand Firefox's build system correctly, debug and trace level logging is not available in release builds.

Would you mind reproducing the issue at hand once more, this time in a debug build, and again send us the logs?

As far as I know, the following addition to mozconfig should do, but I assume you know better than me.

ac_add_options --enable-debug
ac_add_options --enable-debug-symbols

Please enable the following log level:

timestamp,sync,nsHttp:5,nsSocketTransport:5,nsHostResolver:5,neqo_http3::*:5,neqo_transport::*:5,quinn_udp::*:5,neqo_udp::*:5,neqo_glue::*:5,neqo_common::*:5

Thank you for your help!

Flags: needinfo?(mail) → needinfo?(krosylight)

Oh hey, I can see tons of dropped packets in debug build. Sending the log!

Flags: needinfo?(krosylight)
You need to log in before you can comment on or make changes to this bug.