Closed Bug 1772057 Opened 3 years ago Closed 1 year ago

On Windows 10/11, HTTP/3 connections fail with "ErrorAccordingToNSPR [in=-5987 out=80004005]"

Categories

(Core :: Networking, defect, P3)

defect

Tracking

()

RESOLVED DUPLICATE of bug 1734110

People

(Reporter: aldwulf, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged])

Attachments

(7 files, 1 obsolete file)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0

Steps to reproduce:

Open Firefox, go to cloudflare-quic.com, Google's services, or any other site that should be using HTTP/3.

First observed with Firefox 98.0.2 on Windows 11, then confirmed on another machine with Windows 10, and with all subsequent versions of Firefox, both my actual profile with all the extensions as well as a clean profile.

Actual results:

Firefox fails to connect using HTTP/3 to any of the tested sites, and adds them to an exception list so that it doesn't make any further (futile) attempts.

Expected results:

AFAIK, HTTP/3 should be there. It does work on Linux, with a much similar set of extensions.

I should mention that I experience the same issue on Android (LineageOS 18). Both on Windows and Android, HTTP/3 work perfectly with Chromium, so it's unlikely the issue is with my system configuration or my network.

The Bugbug bot thinks this bug should belong to the 'Core::Networking' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Networking
Product: Firefox → Core

Dragana, could you take a look?
Thanks.

Flags: needinfo?(dd.mozilla)
Blocks: QUIC

The log shows that the speculative connection we used to verify the altsvc mapping is failed.

2022-04-01 15:55:26.770000 UTC - [Parent 22732: Socket Thread]: V/nsHttp Http3Session::ProcessOutput sending packet with 1357 bytes to 34.247.69.99 port=443 [this=17d0a78e400].
2022-04-01 15:55:26.770000 UTC - [Parent 22732: Socket Thread]: D/nsSocketTransport ErrorAccordingToNSPR [in=-5987 out=80004005]
2022-04-01 15:55:26.770000 UTC - [Parent 22732: Socket Thread]: V/nsHttp Http3Session::ProcessOutput sending packet rv=-2147467259
2022-04-01 15:55:26.770000 UTC - [Parent 22732: Socket Thread]: V/nsHttp HttpConnectionUDP::OnInputReady 17d0b3bb4d0 rv=80004005

-5987 is PR_INVALID_ARGUMENT_ERROR.

Can you still reproduce the issue?

Flags: needinfo?(dd.mozilla) → needinfo?(aldwulf)

As far as the observable symptoms go, QUIC is still not working on 105.2.

Should I recreate a network log to confirm that it's the same issue?

Still happening in 106.0.5.

Flags: needinfo?(aldwulf)

(In reply to aldwulf from comment #6)

Created attachment 9303022 [details]
networklog-106-main.13828.moz_log.gz

Still happening in 106.0.5.

Thanks for the log. It shows the same problem as comment #3, the http/3 connection fails because of the error PR_INVALID_ARGUMENT_ERROR.
It seems that the error is coming from NSPR layer, so I'd like to change the component and hope someone might know how to debug this.

Component: Networking → NSPR
Product: Core → NSPR
Version: Firefox 101 → other

The severity field is not set for this bug.
:KaiE, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(kaie)

Could you use a debugger to identify the NSPR function that returns PR_INVALID_ARGUMENT_ERROR, and possibly get a call stack?

Flags: needinfo?(kaie)
Attached file call stack (obsolete) —

Here's what I managed to collect, to the best of my ability (I've hardly ever used a debugger before). For convenience I used the latest Developer Edition since it uses a separate profile and doesn't require me to restart Firefox. The debugger was VS2022.

Attachment #9320510 - Attachment is obsolete: true

Seems I misinterpreted the "paste a link" wording. Anyway, the file is far too large to attach it here, so here's a link to my own server (148 MB). If there's any other preferred way of sharing large files, I'll comply immediately.

I don't know what to do with file linked from comment 11. I downloaded it, but Visual Studio doesn't know how to open it.

Kershaw I think this should be debugged by someone who works on the application level networking code that you identified (QUIC?).

Someone should diagnose which place in Necko executes the call into NSPR that gets the unexpected error code,
including the call stack.

I think this is a Firefox networking bug until you can provide evidence that NSPR is doing something wrong (and explain why).

Component: NSPR → Networking
Product: NSPR → Core
Version: other → unspecified

I compressed the file using zstd to make the download faster, especially overseas. I'm truly sorry that it ended up wasting your time instead.

If you don't have the means to decompress it, here's (http://plain.hylobat.es/PR_INVALID_ARGUMENT_ERROR.dmp)[the uncompressed original].

I mean, http://plain.hylobat.es/PR_INVALID_ARGUMENT_ERROR.dmp, with the correct order of brackets.

Thanks, I wasn't aware that zst is a compression algorithm. I was able to decompress, and Visual Studio is able to open the memory dump you have provided.

Still doesn't help much.

We need a developer who can identify the right place in the code where the failure ocurrs, and the developer needs to use the debugger to STOP execution at exactly that point, and then look at the call stack at this point of time, and mention it here.

Necko developers:
You could review in which scenarios NSPR sets error code PR_INVALID_ARGUMENT_ERROR when performing a socket operation:
https://searchfox.org/mozilla-central/source/nsprpub/pr/src/io/prsocket.c

Or you could set breakpoints on all those locations, and try to reproduce the bug scenario, to confirm which scenario is reached.

Looking at the codepath from sendto that return an error, I think PR_INVALID_ARGUMENT_ERROR is returned from here: this.
I think the solution here is to also log the value of PR_GetOSError in Http3Session::ProcessOutput.

Then we can ask :aldwulf to capture the logs again.

Severity: -- → S3
Status: UNCONFIRMED → NEW
Ever confirmed: true
Priority: -- → P2
Whiteboard: [necko-triaged][necko-priority-review]
Assignee: nobody → valentin.gosu
Status: NEW → ASSIGNED
Pushed by valentin.gosu@gmail.com: https://hg.mozilla.org/integration/autoland/rev/98aa97ba9e9f Add OSError to log when an error occurs in Http3Session::ProcessOutput calling SendWithAddress r=necko-reviewers,kershaw

So what I need to do is download the installer built from this commit by treeherder, run it and collect the network logs?

Sounds right, or you could wait for tomorrow's nightly build, which should contain it. (Assuming you'll also see a commit to mozilla-central posted in this bug prior to tomorrow's build.)

The most relevant seems to be osError=10040.

(In reply to aldwulf from comment #23)

Created attachment 9321429 [details]
98aa97ba9e9f-main.18600.moz_log

The most relevant seems to be osError=10040.

That's WSAEMSGSIZE:
https://learn.microsoft.com/en-us/windows/win32/winsock/windows-sockets-error-codes-2

Message too long.
A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram was smaller than the datagram itself.

We're maybe hitting the MTU size here? @awulf, are you maybe using a VPN?
In any case, could you capture some more logs for us?
Could you also add neqo_transport::*:5 to the list of log modules in about:logging?

Thanks!

Flags: needinfo?(aldwulf)
Attached file neqo_transport::*:5

Here you go.

No, I'm not using a VPN, and the MTU is the usual 1500 bytes.

Flags: needinfo?(aldwulf)
Flags: needinfo?(valentin.gosu)
Flags: needinfo?(kershaw)

I can't find anything wrong in the neqo layer.
Despite the MTU being 1500, we still encounter the same OS error (10040) when attempting to send only 1357 bytes.

2023-03-07 16:35:09.728000 UTC - [Parent 5916: Socket Thread]: V/nsHttp Http3Session::ProcessOutput sending packet with 1357 bytes to 172.67.9.235 port=443 [this=123deef6e00].
2023-03-07 16:35:09.728000 UTC - [Parent 5916: Socket Thread]: D/nsSocketTransport ErrorAccordingToNSPR [in=-5987 out=80004005]
2023-03-07 16:35:09.728000 UTC - [Parent 5916: Socket Thread]: V/nsHttp Http3Session::ProcessOutput sending packet rv=-2147467259 osError=10040
2023-03-07 16:35:09.728000 UTC - [Parent 5916: Socket Thread]: V/nsHttp HttpConnectionUDP::OnInputReady 123d985e080 rv=80004005
2023-03-07 16:35:09.728000 UTC - [Parent 5916: Socket Thread]: V/nsHttp HttpConnectionUDP::CloseTransaction[this=123d985e080 trans=123deef6e00 reason=80004005]
Flags: needinfo?(kershaw)

Hi,

Would it be possible to capture a wireshark trace while reproducing the issue?
https://firefox-source-docs.mozilla.org/networking/submitting_networking_bugs.html#wireshark-dump

Another cap when trying this with Chrome would also be useful to compare.
We may be able to provide a special Firefox build that uses a smaller MTU.

Flags: needinfo?(valentin.gosu) → needinfo?(aldwulf)
Attached file chromium.pcapng
Flags: needinfo?(aldwulf)
Attached file firefox.pcapng

Sure.

It seems the firefox pcap doesn't contain any quic traffic.
@aldwulf, could you restart Firefox and try again? After a HTTP/3 failure, Firefox avoids using QUIC for that hostname, so you'd have to be recording the first time you try it. Also make sure you have DNS over HTTPS enabled (network.trr.mode = 3) in order to try HTTP/3 right away.

At the same time, in Chrome ethernet frames containing the QUIC packet are 1242 bytes, whil the UDP length is 1208.

Flags: needinfo?(aldwulf)

I started the capture right before running Firefox with a clean profile. And even after setting network.trr.mode to 3 and restarting yet again, with a recreated profile - not a single QUIC packet is ever sent.

Flags: needinfo?(aldwulf)

I also went around various sites I haven't ever visited, neither on my default profile nor on my 'test' one, and I can confirm that DoH is working and no UDP-based DNS queries are being made. Still no QUIC traffic though.

The leave-open keyword is there and there is no activity for 6 months.
:valentin, maybe it's time to close this bug?
For more information, please visit BugBot documentation.

Flags: needinfo?(moz.valentin)

I think similar to bug 1734110 we should ultimately consider adding a configuration option for MTU (or PMTU detection).

Assignee: moz.valentin → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(moz.valentin)
Keywords: leave-open
Priority: P2 → P3
Whiteboard: [necko-triaged][necko-priority-review] → [necko-triaged]

By sheer coincidence I managed to find that Windows was, for some inexplicable reason, lowering my MTU to ~1200. It still showed up as 1500 in the GUI (network card settings, but not in the "Modern" settings, which I didn't look at up to this point), which led me to reporting this as a Firefox bug. Manually forcing the MTU to normal values (my link is Ethernet all the way) made QUIC work for me.

So, it was a low MTU issue all along, which was already reported many times. I therefore apologize for having wasted your time investigating this.

No worries, thank you for confirming the problem. It's good to know for sure.

Status: NEW → RESOLVED
Closed: 1 year ago
Duplicate of bug: 1734110
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: