Requests fail with PR_BAD_DESCRIPTOR_ERROR
Categories
(Core :: Networking, defect, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr128 | --- | unaffected |
firefox-esr140 | --- | unaffected |
firefox141 | --- | unaffected |
firefox142 | --- | wontfix |
firefox143 | --- | fixed |
People
(Reporter: robwu, Assigned: kershaw)
References
(Blocks 1 open bug, Regression)
Details
(Keywords: regression, Whiteboard: [necko-triaged][necko-priority-queue])
Attachments
(4 files, 1 obsolete file)
Nightly 143.0a1, buildID 20250726091124
STR:
- Have a Firefox open.
- Open the Firefox in another instance, e.g.
mkdir /tmp/prof && "/Applications/Firefox Nightly.app/Contents/MacOS/firefox" -profile /tmp/prof
- EDIT: I am now in a state where this step is not even needed. The necessity of this step is unclear.
- Go back to the Firefox from step 1. Try to visit a domain that you have not visited before (in the past few minutes). E.g. search for something in DuckDuckGo.
Expected:
- Should load.
Actual:
- Often fails to load. While trying to reproduce, I found that it fails multiple times in succession, but sometimes succeeds (on retry).
- Quitting the other Firefox instance (from step 3) makes the issue go away.
- In the devtools, I see
NS_ERROR_FAILURE
.
I visited about:logging
and logged to disk with the network preset. In the logs that fail, I consistently observe the presence of ErrorAccordingToNSPR [in=-5999 out=80004005]
logged from https://searchfox.org/mozilla-central/rev/ab26427a8d31be475be11bbae0e04c84cc7f20ef/netwerk/base/nsSocketTransport2.cpp#227-230,251-252
-5999
there isPR_BAD_DESCRIPTOR_ERROR
.80004005
is the hex code of 2147500037, akaNS_ERROR_FAILURE
2025-07-30 12:49:35.929894 UTC - [Parent 54986: Main Thread]: D/nsHttp sending progress and status notification [this=424580800 status=4b0007 progress=0/0]
2025-07-30 12:49:35.929980 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport ErrorAccordingToNSPR [in=-5999 out=80004005]
2025-07-30 12:49:35.929991 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport connection failed! [reason=80004005]
2025-07-30 12:49:35.929992 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransportService::DetachSocket [handler=41405d600]
2025-07-30 12:49:35.929995 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketDetached [this=41405d600 cond=80004005]
2025-07-30 12:49:35.929997 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransport::RecoverFromError [this=41405d600 state=3 cond=80004005]
2025-07-30 12:49:35.930000 UTC - [Parent 54986: Socket Thread]: D/nsHostResolver Adding address to blocklist for host [duckduckgo.com], host record [4525ecdf0].used trr=0
2025-07-30 12:49:35.930003 UTC - [Parent 54986: Socket Thread]: D/nsHostResolver Successfully adding address [52.142.124.215] to blocklist for host [duckduckgo.com].
Reporter | ||
Comment 1•3 months ago
•
|
||
Note: I am also observing the issue despite having quit the other Firefox.
In any case I know that my current browser is in some bad state, and I can make it more likely to trigger the bug for investigation.
Comment 2•3 months ago
|
||
Rob, can you go back to about:logging
log to the Firefox Profile and then share the profile?
Reporter | ||
Comment 3•3 months ago
|
||
Logs from about:logging with Networking preset, redacted cookies and some URLs.
Searched for "testanother" in DuckDuckGo in a private browsing tab. Result: navigation was aborted.
Reporter | ||
Comment 4•3 months ago
|
||
Logs from about:logging with Networking preset.
Searched for "testanother" in DuckDuckGo in a private browsing tab. Result: navigation succeeded.
The only difference vs the previous step is that I quit the other Firefox instance.
Reporter | ||
Comment 5•3 months ago
|
||
(In reply to Andrew Creskey [:acreskey] from comment #2)
Rob, can you go back to
about:logging
log to the Firefox Profile and then share the profile?
I attached the log files. I also have a profiler capture from other attempts to reproduce, but it contains sensitive information. Is there anything you'd like to see in that is not already available in the MOZ_LOG log files?
Reporter | ||
Comment 6•3 months ago
|
||
about:logging
with Network preset + pipnss added. I attempted to load http://searchfox.org/mozilla-central/search?q=PR_BAD_DESCRIPTOR_ERROR&path=
timestamp,sync,nsHttp:5,cache2:5,nsSocketTransport:5,nsHostResolver:5,EarlyHint:5,pipnss:5
Reporter | ||
Comment 7•3 months ago
•
|
||
I traced the network request in the log from comment 6, and selected the very specific part where it transitions from an okay state (MSG_DNS_LOOKUP_COMPLETE) to ultimately a bad state. To avoid incorrect interpretations, I explicitly annotated the parts I cut from the log in the log extract below.
nsSocketTransportService::DoPollIteration
callsDisengageTimeout()
andOnSocketReady
(both appear in the log).nsSocketTransport::OnSocketReady
ends up loggingErrorAccordingToNSPR
, withErrorAccordingToNSPR [in=-5999 out=80004005]
(where-5999
isPR_BAD_DESCRIPTOR_ERROR
).
2025-07-30 14:45:13.368939 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketEvent [this=3736cfd00 type=1 status=0 param=0]
2025-07-30 14:45:13.368940 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport MSG_DNS_LOOKUP_COMPLETE
2025-07-30 14:45:13.368943 UTC - [Parent 54986: Socket Thread]: D/nsHostResolver Checking unusable list for host [searchfox.org], host record [415894c60].
2025-07-30 14:45:13.368945 UTC - [Parent 54986: Socket Thread]: D/nsHostResolver Address [35.82.221.129] is blocklisted for host [searchfox.org].
2025-07-30 14:45:13.368947 UTC - [Parent 54986: Socket Thread]: D/nsHostResolver Checking unusable list for host [searchfox.org], host record [415894c60].
2025-07-30 14:45:13.368950 UTC - [Parent 54986: Socket Thread]: D/nsHostResolver Address [35.155.250.20] is blocklisted for host [searchfox.org].
2025-07-30 14:45:13.368952 UTC - [Parent 54986: Socket Thread]: D/nsHostResolver Checking unusable list for host [searchfox.org], host record [415894c60].
2025-07-30 14:45:13.368954 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransport::InitiateSocket [this=3736cfd00]
2025-07-30 14:45:13.368956 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransport::BuildSocket [this=3736cfd00]
2025-07-30 14:45:13.368958 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport pushing io layer [0:ssl]
2025-07-30 14:45:13.368990 UTC - [Parent 54986: Socket Thread]: D/pipnss [16468d900] nsSSLIOLayerSetOptions: using TLS version range (0x0303,0x0304)
2025-07-30 14:45:13.368998 UTC - [Parent 54986: Socket Thread]: D/pipnss [16468d900] nsSSLIOLayerSetOptions: enabling TLS ECH Grease
2025-07-30 14:45:13.369007 UTC - [Parent 54986: Socket Thread]: D/pipnss [16468d900] Socket set up
2025-07-30 14:45:13.369010 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport [tlsSocketControl=44ffa5900 callbacks=15385bf10]
2025-07-30 14:45:13.369016 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransportService::AttachSocket [handler=3736cfd00]
2025-07-30 14:45:13.369018 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransportService::AddToIdleList 16ba72360 [handler=3736cfd00]
2025-07-30 14:45:13.369020 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport active=19 idle=1
2025-07-30 14:45:13.369022 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport advancing to STATE_CONNECTING
2025-07-30 14:45:13.369023 UTC - [Parent 54986: Socket Thread]: E/nsSocketTransport nsSocketTransport::SendStatus [this=3736cfd00 status=4b0007]
(... removed 3 lines from seemingly unrelated socket to unrelated domain ...)
2025-07-30 14:45:13.369159 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport STS poll iter
(... removed statuses of 19 active unrelated sockets ...)
2025-07-30 14:45:13.369265 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport idle [0] { handler=3736cfd00 condition=0 pollflags=6 }
2025-07-30 14:45:13.369267 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransportService::MoveToPollList 126508888 [handler=3736cfd00]
2025-07-30 14:45:13.369269 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransportService::AddToPollList 126508888 [handler=3736cfd00]
2025-07-30 14:45:13.369271 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport SocketContext::EnsureTimeout socket=3736cfd00
2025-07-30 14:45:13.369273 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport engaging
2025-07-30 14:45:13.369274 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport active=20 idle=1
2025-07-30 14:45:13.369276 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransportService::RemoveFromIdleList [handler=0]
2025-07-30 14:45:13.369278 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport active=20 idle=0
2025-07-30 14:45:13.369280 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport calling PR_Poll [active=20 idle=0]
(... removed 19x message like the next two lines for 19 unrelated sockets ...)
2025-07-30 14:45:13.369348 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport SocketContext::TimeoutIn socket=3736cfd00, timeout=65535s
2025-07-30 14:45:13.369350 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport not engaged
2025-07-30 14:45:13.369351 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport poll timeout: none
2025-07-30 14:45:13.369353 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport timeout = -1 milliseconds
(... removed 18x "poll SSL socket using lower 5" + "poll SSL socket returned 5" ...)
2025-07-30 14:45:13.369421 UTC - [Parent 54986: Socket Thread]: V/pipnss [16468d900] poll SSL socket using lower 6
2025-07-30 14:45:13.369423 UTC - [Parent 54986: Socket Thread]: V/pipnss [16468d900] poll SSL socket returned 6
2025-07-30 14:45:13.619900 UTC - [Parent 54986: Timer]: D/nsSocketTransport STS dispatch [142f21390]
2025-07-30 14:45:13.619967 UTC - [Parent 54986: Timer]: D/nsSocketTransport PollableEvent::Signal
2025-07-30 14:45:13.619982 UTC - [Parent 54986: Timer]: D/nsSocketTransport PollableEvent::MarkFirstSignalTimestamp
2025-07-30 14:45:13.620001 UTC - [Parent 54986: Timer]: D/nsSocketTransport PollableEvent::Signal PR_Write 1
2025-07-30 14:45:13.620044 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport ...returned after 251 milliseconds
(... removed 19x "SocketContext::TimeoutIn" + "not engaged" messages ...)
2025-07-30 14:45:13.620432 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport SocketContext::DisengageTimeout socket=3736cfd00
2025-07-30 14:45:13.620443 UTC - [Parent 54986: Socket Thread]: E/nsSocketTransport nsSocketTransport::OnSocketReady [this=3736cfd00 outFlags=16]
2025-07-30 14:45:13.620468 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport ErrorAccordingToNSPR [in=-5999 out=80004005]
2025-07-30 14:45:13.620480 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport connection failed! [reason=80004005]
2025-07-30 14:45:13.620490 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransportService::DetachSocket [handler=3736cfd00]
2025-07-30 14:45:13.620500 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketDetached [this=3736cfd00 cond=80004005]
2025-07-30 14:45:13.620511 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransport::RecoverFromError [this=3736cfd00 state=3 cond=80004005]
2025-07-30 14:45:13.620524 UTC - [Parent 54986: Socket Thread]: D/nsHostResolver Adding address to blocklist for host [searchfox.org], host record [415894c60].used trr=0
2025-07-30 14:45:13.620536 UTC - [Parent 54986: Socket Thread]: D/nsHostResolver Successfully adding address [44.226.105.90] to blocklist for host [searchfox.org].
2025-07-30 14:45:13.620548 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport not a recoverable error 80004005
2025-07-30 14:45:13.620558 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketInputStream::OnSocketReady [this=53b6cd040 cond=80004005]
2025-07-30 14:45:13.620568 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketOutputStream::OnSocketReady [this=53b6cee80 cond=80004005]
2025-07-30 14:45:13.620582 UTC - [Parent 54986: Socket Thread]: V/nsHttp DnsAndConnectSocket::OnOutputStreamReady [this=4f1b47ec0 ent=searchfox.org primary]
(if you want to see more, just look at the log from comment 6).
Reporter | ||
Comment 8•3 months ago
|
||
(shortening title because I can now consistently experience the problem even without another Firefox around)
Comment 9•3 months ago
|
||
Rob, if you can still reproduce after bug 1964292 is reverted, can I ask for an about:logging Firefox Profiler capture, this time with "Enable stack traces for log messages" enabled?
I've been finding that helpful in understanding what's happening.
Reporter | ||
Comment 10•3 months ago
|
||
(In reply to Andrew Creskey [:acreskey] from comment #9)
Rob, if you can still reproduce after bug 1964292 is reverted, can I ask for an about:logging Firefox Profiler capture, this time with "Enable stack traces for log messages" enabled?
I've been finding that helpful in understanding what's happening.
I'm afraid that I won't be able to reproduce after a restart, so I'll try debugging with what I have now.
Also, I don't think that bug 1964292 is going to affect this bug in any way; the "Adding address to blocklist for host" log entry is after the "ErrorAccordingToNSPR" entry (see comment 7 for easy Ctrl+F).
Just to make sure that it was not cached way earlier, I created a new STR, with a website that I have never visited before (https://overheid.nl/testhere
). I put a x
before the domain before I started profiling, to make sure that there is no preloading or anything that could have happened. After starting the profiler, I removed the x, pressed enter, and stopped the profile. Here is the result: https://share.firefox.dev/3H28ntA
Comment 11•3 months ago
|
||
(In reply to Rob Wu [:robwu] from comment #10)
(In reply to Andrew Creskey [:acreskey] from comment #9)
Rob, if you can still reproduce after bug 1964292 is reverted, can I ask for an about:logging Firefox Profiler capture, this time with "Enable stack traces for log messages" enabled?
I've been finding that helpful in understanding what's happening.I'm afraid that I won't be able to reproduce after a restart, so I'll try debugging with what I have now.
Also, I don't think that bug 1964292 is going to affect this bug in any way; the "Adding address to blocklist for host" log entry is after the "ErrorAccordingToNSPR" entry (see comment 7 for easy Ctrl+F).
Just to make sure that it was not cached way earlier, I created a new STR, with a website that I have never visited before (
https://overheid.nl/testhere
). I put ax
before the domain before I started profiling, to make sure that there is no preloading or anything that could have happened. After starting the profiler, I removed the x, pressed enter, and stopped the profile. Here is the result: https://share.firefox.dev/3H28ntA
Thank you, Rob, that's quite interesting, and I only see the -5999, PR_BAD_DESCRIPTOR_ERROR
, ErrorAccordingToNSPR in your logs, but not PR_HOST_UNREACHABLE_ERROR
, -5927.
Could I ask for one more profile of a repro? -- This last one is missing some threads like the DNS resolver.
When you go to upload the profile, make sure "Include Hidden Threads" is checked - that's my best guess as to why those threads are missing.
Reporter | ||
Comment 12•3 months ago
|
||
Here is a new profile: https://share.firefox.dev/4fgqQPv
Description of STR:
- Visiting a site I never visited before, navigation from a new private browsing tab to it.
- URL: http://coolgoodtranscendentjoke.neverssl.com/online
- Preferences
dom.security.https_first_pbm
set to false,dom.security.https_only_mode_send_http_background_request
set to false.
(I did this to rule out https as the cause of this issue). - I tried to avoid noise; when the request was captured, I saw a long-polling request from google in one of my tabs (to
prod-dynamite-prod-00-us-signaler-pa.clients6.google.com
). The request was already started before so I don't think that this should matter much, but mentioning for completeness. - I redacted the tracks not related to the issue, but still included all network threads. In the profile I kept the threads without markers as inactive, and showed those that did have markers indicating activity.
P.S. I am writing this comment from the same Firefox profile. I have to repeat Bugzilla a couple of times before the navigation can even succeed.
P.S. profiler.firefox.com
can be used to export the initial profile. It fails to symbolize, so I used a separate Firefox instance to symbolize before sharing here.
Reporter | ||
Comment 13•3 months ago
|
||
(In reply to Rob Wu [:robwu] from comment #7)
nsSocketTransportService::DoPollIteration
callsDisengageTimeout()
andOnSocketReady
(both appear in the log).nsSocketTransport::OnSocketReady
ends up loggingErrorAccordingToNSPR
, withErrorAccordingToNSPR [in=-5999 out=80004005]
(where-5999
isPR_BAD_DESCRIPTOR_ERROR
).
I continued reading the code. nsSocketTransport::OnSocketReady
only gets to the state above if state != PR_SUCCESS
.
That happens if PR_ConnectContinue(fd, outFlags)
returns PR_FAILURE
. Now PR_ConnectContinue
delegates its implementation elsewhere. Searchfox is unable to find the right symbol, so I looked up all definitions of PRIOMethods
types and think that either tcpMethods
in prsocket.c or _pr_tcp_methods
in ptio.c` are the most likely candidates.
Both SocketConnectContinue
and pt_ConnectContinue
call PR_SetError(PR_BAD_DESCRIPTOR_ERROR, 0);
if out_flags & PR_POLL_NVAL
, which is defined as:
#define PR_POLL_NVAL 0x10 /* only in out_flags when fd is bad */
From the log file (comment 7), I see that outFlags
is 16: E/nsSocketTransport nsSocketTransport::OnSocketReady [this=3736cfd00 outFlags=16]
which comes from mPollList[i + 1]
, only modified by PR_Poll
. PR_Poll
here is _pr_poll_with_poll
, which only has two ways to set PR_POLL_NVAL
:
- https://searchfox.org/mozilla-central/rev/ab26427a8d31be475be11bbae0e04c84cc7f20ef/nsprpub/pr/src/pthreads/ptio.c#3918
- https://searchfox.org/mozilla-central/rev/f2da682709d6dcdb1e5fd3573bc269e22db7e11e/nsprpub/pr/src/pthreads/ptio.c#3992
Question is why these are triggered.
Reporter | ||
Comment 14•3 months ago
•
|
||
My Nightly instance where the issue happened has crashed without reason.
Way before that happened, I recorded syscalls by launching Instruments from Xcode and selecting the existing Nightly process.
In my test case, I disabled automatic https requests (see comment 12), and loaded http://httpbin.org/?httponly
. I chose this domain because it responds with plain HTTP, without forcing HTTPS upgrades. Also, it only has IPv4 addresses, to rule out IPv6 issues.
I'm observing the following:
- I confirmed that there are no other threads that interacts with the fd returned by
socket()
for this request. The remaining calls/logs in this list are all on the Socket Thread. - After the socket is created,
connect()
is called on it and we start polling viaPR_Poll
, which ends up callingssl_Poll
, which ultimately triggers thegetpeername
syscall that appears in the syscall trace. This tells that the socket is not connected (this is normal; at this point the socket is new and we want to wait until it is connected). - The next syscall is
select
, still fromPR_Poll
. This blocks the thread untilselect
returns. - The next syscall is
semaphore_signal_trap
triggered via some Glean code, rooted in anOnSocketReady
call immediately after thePR_ConnectContinue
call returned.- Note that this
PR_ConnectContinue
was already mentioned in comment 13. From the log analysis in comment 13, it is obvious that the internal state is already bad here.
- Note that this
- The next syscall is
sys_ulock_wait2
, from amalloc
, indirectly called viansSocketTransportService::DetachSocket
.nsSocketTransportService::DetachSocket
logsnsSocketTransportService::DetachSocket [handler=
. This log appears in the initial report and the other logs and profiles that I have shared in this bug so far, and always consistently afterErrorAccordingToNSPR [in=-5999 out=80004005]
.
- The next syscall (ignoring syscalls related to memory management rooted in
DetachSocket
) issys_close
, and passed the fd.- This happens immediatly after the log call above (after the
nsSocketTransport::OnSocketDetached
log entry).
- This happens immediatly after the log call above (after the
So, based on these observations, there is no indication from syscalls that told Firefox that the fd was bad, yet Firefox somehow thought otherwise.
Reporter | ||
Comment 15•3 months ago
•
|
||
(In reply to Rob Wu [:robwu] from comment #14)
So, based on these observations, there is no indication from syscalls that told Firefox that the fd was bad, yet Firefox somehow thought otherwise.
That conclusion may not be accurate; select()
modifies the set of fds that it receives. It is possible that the bad fd status was actually communicated through that.
I see that select()
on the Socket Thread was passed 0x767 (two separate recordings) and 0x4b3 (two separate recordings, after the first one, with many hours apart - one night of sleep) as the first parameter. This implies that the Socket Thread was polling for 1895 and 1203 sockets. I have many tabs open, but nothing this observation in case it is relevant.
Assignee | ||
Comment 16•3 months ago
|
||
Assignee | ||
Comment 17•3 months ago
|
||
Well, this is what I found:
The PR_POLL_NVAL
error appears to come from this check that treats a socket as invalid if its osfd is greater than FD_SETSIZE
. In Firefox, FD_SETSIZE is defined as 4096.
From the system trace Rob provided, all the failed sockets had file descriptors greater than 4096.
Note that we increased FD_SETSIZE to 4096 in bug 1516325. However, weβre now seeing that socket()
can return a file descriptor larger than 4096.
Does this mean we should increase FD_SETSIZE again, or is the check itself incorrect? Iβm not sure.
Jed, do you have any thoughts on this?
Comment 18•3 months ago
|
||
We should stop using PR_Poll
and rely on the polling crate.
Comment hidden (obsolete) |
Reporter | ||
Comment 20•3 months ago
|
||
(In reply to Kershaw Chang [:kershaw] from comment #17)
From the system trace Rob provided, all the failed sockets had file descriptors greater than 4096.
In the traces I recorded, the fd number was in the 6000 range.
I saved the output of lsof -p $(pgrep -f firefox)
from around that time, and noticed that there are many UNIX domain sockets:
$ awk '{print $5}' lsof.txt | grep -v TYPE | sort | uniq -c | sort -rn
5564 unix
638 REG
11 KQUEUE
8 PIPE
8 IPv6
7 IPv4
6 CHR
1 systm
1 NPOLICY
1 DIR
The vast majority of these domain sockets point to (none)
. Could that be a sign of something not getting cleaned up properly?
$ grep ' unix ' lsof.txt | grep '(none)' | wc -l
5353
For future reference, the Firefox instance that this was captured from:
- 143.0a1 buildid 20250726091124 (about one week old)
- Session duration: about one week.
- Other info: Updater has not downloaded an update. My browsing habits were not significantly different from before. As usual I visited lots of Searchfox and Bugzilla (in private browsing and some non-private browsing) and some Google Docs. I often have many tabs open, it is possible that I had even more open than usual (mostly Searchfox and a few others in PBM), but I have also done that before without issues.
Comment 22•3 months ago
|
||
I'm probably missing something, but how does FD_SETSIZE
matter here? That is the maximum size of the descriptor array passed into poll
(or actually select
, which had limits). But the descriptor values in the array can obviously be larger than FD_SETSIZE
- you just can't pass more elements in.
Reporter | ||
Comment 23•3 months ago
|
||
(In reply to Lars Eggert [:lars] from comment #22)
I'm probably missing something, but how does
FD_SETSIZE
matter here? That is the maximum size of the descriptor array passed intopoll
(or actuallyselect
, which had limits). But the descriptor values in the array can obviously be larger thanFD_SETSIZE
- you just can't pass more elements in.
Are you asking in relation to the linked code, to understand why we have the check? When I looked into it (for the first time ever - I never worked with this before), I read that the numeric value of the fd is used as an index into the file descriptor set. If we don't skip fds whose numeric value exceeds FD_SETSIZE
, then we would be accessing memory out of bounds. Buffer overrun on the stack in this case since the implementation declares the fd_set
structs on the stack.
Comment 24•3 months ago
|
||
Oh, so fd#0
would always be at index 0, and so on? That seems like a strange design choice. But given that it's NSPR, I'm not surprised.
I still think the correct longterm fix is to move to the polling crate.
Reporter | ||
Comment 25•2 months ago
|
||
Following my observation in comment 20, I created an extension that opens and closes tabs automatically. I can reliably reproduce the issue here.
The easiest thing to see if it reproduces is to:
- Run the STR from bug 1981002 with many (somewhere between 4000 and 5000) tabs
- Start a syscall trace with Instruments
- Open a new private browsing window and navigate to example.com (and if that happens to work, a few other sites - in case a lower fd happened to have become available in the meantime)
- Stop the recording of Instruments.
- Search for the
connect
syscall and see if its arg1 is a high number (0x1000 or higher).
(In reply to Lars Eggert [:lars] from comment #24)
Oh, so
fd#0
would always be at index 0, and so on? That seems like a strange design choice. But given that it's NSPR, I'm not surprised.
This is not a design choice by NSPR, but from select
.
Comment 26•2 months ago
|
||
PR_Poll() uses either select or poll, depending on the system. (There are comments about HPUX, so that gives an idea how old this code is - probably originally circa 1998-2000, though I'm sure quire a few tweaks along the way).
Comment 27•2 months ago
|
||
Rob Wu's mozregression points at bug 1964600
Comment 28•2 months ago
|
||
Set release status flags based on info from the regressing bug 1964600
Assignee | ||
Comment 29•2 months ago
|
||
Since bug 1964600 has been identified as the regression bug, needinfo
can be cleared.
Updated•2 months ago
|
Comment 30•2 months ago
|
||
:gsvelto, since you are the author of the regressor, bug 1964600, could you take a look?
For more information, please visit BugBot documentation.
Updated•2 months ago
|
Comment 31•2 months ago
|
||
Should we close this as a symptom/duplicate of bug 1981002?
Reporter | ||
Comment 32•2 months ago
|
||
Is there some logging or special handling that we want to add to improve the visibility of this issue in the future?
Right now the user doesn't get any feedback, other than the navigation immediately be aborted.
If there is nothing else to do here, then this can be closed as a duplicate of 1981002 indeed.
Comment 33•2 months ago
|
||
My suggestion is that we just crash on file descriptor exhaustion. I'm pretty sure we do this in other places: https://searchfox.org/mozilla-central/rev/4fd0d5e4669bfa2d0888b730684d8adea061fd30/ipc/glue/DataPipe.cpp#491. This seems like a better outcome for the users because they don't end up with a non-functional browser and we get much better visibility into the problem.
Comment 34•2 months ago
|
||
(In reply to Jeff Muizelaar [:jrmuizel] from comment #33)
My suggestion is that we just crash on file descriptor exhaustion. I'm pretty sure we do this in other places: https://searchfox.org/mozilla-central/rev/4fd0d5e4669bfa2d0888b730684d8adea061fd30/ipc/glue/DataPipe.cpp#491. This seems like a better outcome for the users because they don't end up with a non-functional browser and we get much better visibility into the problem.
This would be good. At one point we reserved a few file descriptors to prevent exhaustion from stopping us from creating a minidump, however with the crash handling done out-of-process I think(?) we removed that. Though I think :robwu said that he had the crashreporter come up without a minidump (which might imply that we couldn't get a file descriptor). All the same, explicitly crashing might be a more stable path forward. It'd also be pretty neat if we could create a test to check whether crash reporting works when file descriptors are exhausted (either using a simulated exhaustion, which might be difficult to pull off, or setting low file descriptor limits and pushing against them).
Comment 35•2 months ago
|
||
Also, if we could have some integration test that basically does these STR to check for fd leaks, that would catch things a lot earlier.
Updated•2 months ago
|
Assignee | ||
Comment 36•2 months ago
|
||
(In reply to Jeff Muizelaar [:jrmuizel] from comment #33)
My suggestion is that we just crash on file descriptor exhaustion. I'm pretty sure we do this in other places: https://searchfox.org/mozilla-central/rev/4fd0d5e4669bfa2d0888b730684d8adea061fd30/ipc/glue/DataPipe.cpp#491. This seems like a better outcome for the users because they don't end up with a non-functional browser and we get much better visibility into the problem.
I think this is the right approach. The issue is that socket() returns a valid file descriptor, but itβs the poll implementation that considers the fd invalid due to being out of range.
It seems more appropriate to crash when we detect an out-of-range fd directly, rather than crashing only when poll fails.
Assignee | ||
Comment 37•2 months ago
|
||
Downgrade this bug to S3 since the real issue will be fixed in bug 1981002.
Updated•2 months ago
|
Assignee | ||
Comment 38•2 months ago
|
||
Bug 1981002 is fixed, so clear the needinfo.
Assignee | ||
Comment 39•2 months ago
|
||
Updated•2 months ago
|
Comment 40•2 months ago
|
||
Comment 41•2 months ago
|
||
Comment 42•2 months ago
|
||
Backed out for causing build bustages @ nsSocketTransportService2.cpp
Updated•2 months ago
|
Assignee | ||
Updated•2 months ago
|
Comment 43•2 months ago
|
||
Comment 44•2 months ago
|
||
bugherder |
Comment 45•2 months ago
|
||
The patch landed in nightly and beta is affected.
:kershaw, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.
- See https://wiki.mozilla.org/Release_Management/Requesting_an_Uplift for documentation on how to request an uplift.
- If no, please set
status-firefox142
towontfix
.
For more information, please visit BugBot documentation.
Assignee | ||
Updated•2 months ago
|
Comment 46•2 months ago
|
||
(In reply to Rob Wu [:robwu] from comment #25)
(In reply to Lars Eggert [:lars] from comment #24)
Oh, so
fd#0
would always be at index 0, and so on? That seems like a strange design choice. But given that it's NSPR, I'm not surprised.This is not a design choice by NSPR, but from
select
.
I think I've mentioned this out-of-band, but: the traditional select
implementation seen on macOS (and Linux) uses flat bitsets like that, but Winsock implements fd_set
as an array of fds (so in that case FD_SETSIZE
is a limit only on the cardinality of the set), and that might explain some of the confusion about this.
The other thing that might be useful is: typically, when fd_set
is a bitset, the kernel doesn't care about FD_SETSIZE
, only the nfds
value passed in the syscall, so it's possible for userland to redefine FD_SETSIZE
like we're doing on macOS, but in theory it's also possible to dynamically allocate a large enough buffer and cast it to fd_set*
, and libevent even has some code for this (which we're not using in IPC, because we use kqueue
instead, but I noticed it a while ago while auditing for uses of select
).
Updated•2 months ago
|
Description
•