On Windows 10/11, HTTP/3 connections fail with "ErrorAccordingToNSPR [in=-5987 out=80004005]"
Categories
(Core :: Networking, defect, P3)
Tracking
()
People
(Reporter: aldwulf, Unassigned)
References
(Blocks 1 open bug)
Details
(Whiteboard: [necko-triaged])
Attachments
(7 files, 1 obsolete file)
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0
Steps to reproduce:
Open Firefox, go to cloudflare-quic.com, Google's services, or any other site that should be using HTTP/3.
First observed with Firefox 98.0.2 on Windows 11, then confirmed on another machine with Windows 10, and with all subsequent versions of Firefox, both my actual profile with all the extensions as well as a clean profile.
Actual results:
Firefox fails to connect using HTTP/3 to any of the tested sites, and adds them to an exception list so that it doesn't make any further (futile) attempts.
Expected results:
AFAIK, HTTP/3 should be there. It does work on Linux, with a much similar set of extensions.
I should mention that I experience the same issue on Android (LineageOS 18). Both on Windows and Android, HTTP/3 work perfectly with Chromium, so it's unlikely the issue is with my system configuration or my network.
Comment 1•3 years ago
|
||
The Bugbug bot thinks this bug should belong to the 'Core::Networking' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.
Comment 3•3 years ago
|
||
The log shows that the speculative connection we used to verify the altsvc mapping is failed.
2022-04-01 15:55:26.770000 UTC - [Parent 22732: Socket Thread]: V/nsHttp Http3Session::ProcessOutput sending packet with 1357 bytes to 34.247.69.99 port=443 [this=17d0a78e400].
2022-04-01 15:55:26.770000 UTC - [Parent 22732: Socket Thread]: D/nsSocketTransport ErrorAccordingToNSPR [in=-5987 out=80004005]
2022-04-01 15:55:26.770000 UTC - [Parent 22732: Socket Thread]: V/nsHttp Http3Session::ProcessOutput sending packet rv=-2147467259
2022-04-01 15:55:26.770000 UTC - [Parent 22732: Socket Thread]: V/nsHttp HttpConnectionUDP::OnInputReady 17d0b3bb4d0 rv=80004005
-5987 is PR_INVALID_ARGUMENT_ERROR.
Comment 4•3 years ago
|
||
Can you still reproduce the issue?
As far as the observable symptoms go, QUIC is still not working on 105.2.
Should I recreate a network log to confirm that it's the same issue?
Comment 7•3 years ago
|
||
(In reply to aldwulf from comment #6)
Created attachment 9303022 [details]
networklog-106-main.13828.moz_log.gzStill happening in 106.0.5.
Thanks for the log. It shows the same problem as comment #3, the http/3 connection fails because of the error PR_INVALID_ARGUMENT_ERROR.
It seems that the error is coming from NSPR layer, so I'd like to change the component and hope someone might know how to debug this.
Comment 8•2 years ago
|
||
The severity field is not set for this bug.
:KaiE, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 9•2 years ago
|
||
Could you use a debugger to identify the NSPR function that returns PR_INVALID_ARGUMENT_ERROR, and possibly get a call stack?
| Reporter | ||
Comment 10•2 years ago
|
||
Here's what I managed to collect, to the best of my ability (I've hardly ever used a debugger before). For convenience I used the latest Developer Edition since it uses a separate profile and doesn't require me to restart Firefox. The debugger was VS2022.
| Reporter | ||
Comment 11•2 years ago
|
||
Seems I misinterpreted the "paste a link" wording. Anyway, the file is far too large to attach it here, so here's a link to my own server (148 MB). If there's any other preferred way of sharing large files, I'll comply immediately.
Comment 12•2 years ago
|
||
I don't know what to do with file linked from comment 11. I downloaded it, but Visual Studio doesn't know how to open it.
Comment 13•2 years ago
•
|
||
Kershaw I think this should be debugged by someone who works on the application level networking code that you identified (QUIC?).
Someone should diagnose which place in Necko executes the call into NSPR that gets the unexpected error code,
including the call stack.
I think this is a Firefox networking bug until you can provide evidence that NSPR is doing something wrong (and explain why).
| Reporter | ||
Comment 14•2 years ago
|
||
I compressed the file using zstd to make the download faster, especially overseas. I'm truly sorry that it ended up wasting your time instead.
If you don't have the means to decompress it, here's (http://plain.hylobat.es/PR_INVALID_ARGUMENT_ERROR.dmp)[the uncompressed original].
| Reporter | ||
Comment 15•2 years ago
|
||
I mean, http://plain.hylobat.es/PR_INVALID_ARGUMENT_ERROR.dmp, with the correct order of brackets.
Comment 16•2 years ago
|
||
Thanks, I wasn't aware that zst is a compression algorithm. I was able to decompress, and Visual Studio is able to open the memory dump you have provided.
Still doesn't help much.
We need a developer who can identify the right place in the code where the failure ocurrs, and the developer needs to use the debugger to STOP execution at exactly that point, and then look at the call stack at this point of time, and mention it here.
Comment 17•2 years ago
|
||
Necko developers:
You could review in which scenarios NSPR sets error code PR_INVALID_ARGUMENT_ERROR when performing a socket operation:
https://searchfox.org/mozilla-central/source/nsprpub/pr/src/io/prsocket.c
Or you could set breakpoints on all those locations, and try to reproduce the bug scenario, to confirm which scenario is reached.
Comment 18•2 years ago
|
||
Looking at the codepath from sendto that return an error, I think PR_INVALID_ARGUMENT_ERROR is returned from here: this.
I think the solution here is to also log the value of PR_GetOSError in Http3Session::ProcessOutput.
Then we can ask :aldwulf to capture the logs again.
Comment 19•2 years ago
|
||
Updated•2 years ago
|
Updated•2 years ago
|
Comment 20•2 years ago
|
||
| Reporter | ||
Comment 21•2 years ago
|
||
So what I need to do is download the installer built from this commit by treeherder, run it and collect the network logs?
Comment 22•2 years ago
|
||
Sounds right, or you could wait for tomorrow's nightly build, which should contain it. (Assuming you'll also see a commit to mozilla-central posted in this bug prior to tomorrow's build.)
| Reporter | ||
Comment 23•2 years ago
|
||
The most relevant seems to be osError=10040.
Comment 24•2 years ago
|
||
(In reply to aldwulf from comment #23)
Created attachment 9321429 [details]
98aa97ba9e9f-main.18600.moz_logThe most relevant seems to be
osError=10040.
That's WSAEMSGSIZE:
https://learn.microsoft.com/en-us/windows/win32/winsock/windows-sockets-error-codes-2
Message too long.
A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram was smaller than the datagram itself.
Comment 25•2 years ago
|
||
| bugherder | ||
Comment 26•2 years ago
|
||
We're maybe hitting the MTU size here? @awulf, are you maybe using a VPN?
In any case, could you capture some more logs for us?
Could you also add neqo_transport::*:5 to the list of log modules in about:logging?
Thanks!
| Reporter | ||
Comment 27•2 years ago
|
||
Here you go.
No, I'm not using a VPN, and the MTU is the usual 1500 bytes.
Updated•2 years ago
|
Updated•2 years ago
|
Comment 28•2 years ago
|
||
I can't find anything wrong in the neqo layer.
Despite the MTU being 1500, we still encounter the same OS error (10040) when attempting to send only 1357 bytes.
2023-03-07 16:35:09.728000 UTC - [Parent 5916: Socket Thread]: V/nsHttp Http3Session::ProcessOutput sending packet with 1357 bytes to 172.67.9.235 port=443 [this=123deef6e00].
2023-03-07 16:35:09.728000 UTC - [Parent 5916: Socket Thread]: D/nsSocketTransport ErrorAccordingToNSPR [in=-5987 out=80004005]
2023-03-07 16:35:09.728000 UTC - [Parent 5916: Socket Thread]: V/nsHttp Http3Session::ProcessOutput sending packet rv=-2147467259 osError=10040
2023-03-07 16:35:09.728000 UTC - [Parent 5916: Socket Thread]: V/nsHttp HttpConnectionUDP::OnInputReady 123d985e080 rv=80004005
2023-03-07 16:35:09.728000 UTC - [Parent 5916: Socket Thread]: V/nsHttp HttpConnectionUDP::CloseTransaction[this=123d985e080 trans=123deef6e00 reason=80004005]
Comment 29•2 years ago
|
||
Hi,
Would it be possible to capture a wireshark trace while reproducing the issue?
https://firefox-source-docs.mozilla.org/networking/submitting_networking_bugs.html#wireshark-dump
Another cap when trying this with Chrome would also be useful to compare.
We may be able to provide a special Firefox build that uses a smaller MTU.
| Reporter | ||
Comment 30•2 years ago
|
||
| Reporter | ||
Comment 31•2 years ago
|
||
Sure.
Comment 32•2 years ago
|
||
It seems the firefox pcap doesn't contain any quic traffic.
@aldwulf, could you restart Firefox and try again? After a HTTP/3 failure, Firefox avoids using QUIC for that hostname, so you'd have to be recording the first time you try it. Also make sure you have DNS over HTTPS enabled (network.trr.mode = 3) in order to try HTTP/3 right away.
At the same time, in Chrome ethernet frames containing the QUIC packet are 1242 bytes, whil the UDP length is 1208.
| Reporter | ||
Comment 33•2 years ago
|
||
I started the capture right before running Firefox with a clean profile. And even after setting network.trr.mode to 3 and restarting yet again, with a recreated profile - not a single QUIC packet is ever sent.
| Reporter | ||
Comment 34•2 years ago
|
||
I also went around various sites I haven't ever visited, neither on my default profile nor on my 'test' one, and I can confirm that DoH is working and no UDP-based DNS queries are being made. Still no QUIC traffic though.
Comment 35•1 year ago
|
||
The leave-open keyword is there and there is no activity for 6 months.
:valentin, maybe it's time to close this bug?
For more information, please visit BugBot documentation.
Comment 36•1 year ago
|
||
I think similar to bug 1734110 we should ultimately consider adding a configuration option for MTU (or PMTU detection).
Updated•1 year ago
|
| Reporter | ||
Comment 37•1 year ago
|
||
By sheer coincidence I managed to find that Windows was, for some inexplicable reason, lowering my MTU to ~1200. It still showed up as 1500 in the GUI (network card settings, but not in the "Modern" settings, which I didn't look at up to this point), which led me to reporting this as a Firefox bug. Manually forcing the MTU to normal values (my link is Ethernet all the way) made QUIC work for me.
So, it was a low MTU issue all along, which was already reported many times. I therefore apologize for having wasted your time investigating this.
Comment 38•1 year ago
|
||
No worries, thank you for confirming the problem. It's good to know for sure.
Description
•