Closed Bug 1980171 Opened 3 months ago Closed 2 months ago

Requests fail with PR_BAD_DESCRIPTOR_ERROR

Categories

(Core :: Networking, defect, P2)

defect

Tracking

()

RESOLVED FIXED
143 Branch
Tracking Status
firefox-esr128 --- unaffected
firefox-esr140 --- unaffected
firefox141 --- unaffected
firefox142 --- wontfix
firefox143 --- fixed

People

(Reporter: robwu, Assigned: kershaw)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: regression, Whiteboard: [necko-triaged][necko-priority-queue])

Attachments

(4 files, 1 obsolete file)

Nightly 143.0a1, buildID 20250726091124

STR:

  1. Have a Firefox open.
  2. Open the Firefox in another instance, e.g. mkdir /tmp/prof && "/Applications/Firefox Nightly.app/Contents/MacOS/firefox" -profile /tmp/prof
    • EDIT: I am now in a state where this step is not even needed. The necessity of this step is unclear.
  3. Go back to the Firefox from step 1. Try to visit a domain that you have not visited before (in the past few minutes). E.g. search for something in DuckDuckGo.

Expected:

  • Should load.

Actual:

  • Often fails to load. While trying to reproduce, I found that it fails multiple times in succession, but sometimes succeeds (on retry).
  • Quitting the other Firefox instance (from step 3) makes the issue go away.
  • In the devtools, I see NS_ERROR_FAILURE.

I visited about:logging and logged to disk with the network preset. In the logs that fail, I consistently observe the presence of ErrorAccordingToNSPR [in=-5999 out=80004005] logged from https://searchfox.org/mozilla-central/rev/ab26427a8d31be475be11bbae0e04c84cc7f20ef/netwerk/base/nsSocketTransport2.cpp#227-230,251-252

  • -5999 there is PR_BAD_DESCRIPTOR_ERROR.
  • 80004005 is the hex code of 2147500037, aka NS_ERROR_FAILURE
2025-07-30 12:49:35.929894 UTC - [Parent 54986: Main Thread]: D/nsHttp sending progress and status notification [this=424580800 status=4b0007 progress=0/0]
2025-07-30 12:49:35.929980 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport ErrorAccordingToNSPR [in=-5999 out=80004005]
2025-07-30 12:49:35.929991 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport   connection failed! [reason=80004005]
2025-07-30 12:49:35.929992 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransportService::DetachSocket [handler=41405d600]
2025-07-30 12:49:35.929995 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketDetached [this=41405d600 cond=80004005]
2025-07-30 12:49:35.929997 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransport::RecoverFromError [this=41405d600 state=3 cond=80004005]
2025-07-30 12:49:35.930000 UTC - [Parent 54986: Socket Thread]: D/nsHostResolver Adding address to blocklist for host [duckduckgo.com], host record [4525ecdf0].used trr=0
2025-07-30 12:49:35.930003 UTC - [Parent 54986: Socket Thread]: D/nsHostResolver Successfully adding address [52.142.124.215] to blocklist for host [duckduckgo.com].

Note: I am also observing the issue despite having quit the other Firefox.

In any case I know that my current browser is in some bad state, and I can make it more likely to trigger the bug for investigation.

Rob, can you go back to about:logging log to the Firefox Profile and then share the profile?

Flags: needinfo?(rob)

Logs from about:logging with Networking preset, redacted cookies and some URLs.

Searched for "testanother" in DuckDuckGo in a private browsing tab. Result: navigation was aborted.

Logs from about:logging with Networking preset.

Searched for "testanother" in DuckDuckGo in a private browsing tab. Result: navigation succeeded.
The only difference vs the previous step is that I quit the other Firefox instance.

(In reply to Andrew Creskey [:acreskey] from comment #2)

Rob, can you go back to about:logging log to the Firefox Profile and then share the profile?

I attached the log files. I also have a profiler capture from other attempts to reproduce, but it contains sensitive information. Is there anything you'd like to see in that is not already available in the MOZ_LOG log files?

Flags: needinfo?(rob)

about:logging with Network preset + pipnss added. I attempted to load http://searchfox.org/mozilla-central/search?q=PR_BAD_DESCRIPTOR_ERROR&path=

timestamp,sync,nsHttp:5,cache2:5,nsSocketTransport:5,nsHostResolver:5,EarlyHint:5,pipnss:5

I traced the network request in the log from comment 6, and selected the very specific part where it transitions from an okay state (MSG_DNS_LOOKUP_COMPLETE) to ultimately a bad state. To avoid incorrect interpretations, I explicitly annotated the parts I cut from the log in the log extract below.

2025-07-30 14:45:13.368939 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketEvent [this=3736cfd00 type=1 status=0 param=0]
2025-07-30 14:45:13.368940 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport   MSG_DNS_LOOKUP_COMPLETE
2025-07-30 14:45:13.368943 UTC - [Parent 54986: Socket Thread]: D/nsHostResolver Checking unusable list for host [searchfox.org], host record [415894c60].
2025-07-30 14:45:13.368945 UTC - [Parent 54986: Socket Thread]: D/nsHostResolver Address [35.82.221.129] is blocklisted for host [searchfox.org].
2025-07-30 14:45:13.368947 UTC - [Parent 54986: Socket Thread]: D/nsHostResolver Checking unusable list for host [searchfox.org], host record [415894c60].
2025-07-30 14:45:13.368950 UTC - [Parent 54986: Socket Thread]: D/nsHostResolver Address [35.155.250.20] is blocklisted for host [searchfox.org].
2025-07-30 14:45:13.368952 UTC - [Parent 54986: Socket Thread]: D/nsHostResolver Checking unusable list for host [searchfox.org], host record [415894c60].
2025-07-30 14:45:13.368954 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransport::InitiateSocket [this=3736cfd00]
2025-07-30 14:45:13.368956 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransport::BuildSocket [this=3736cfd00]
2025-07-30 14:45:13.368958 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport   pushing io layer [0:ssl]
2025-07-30 14:45:13.368990 UTC - [Parent 54986: Socket Thread]: D/pipnss [16468d900] nsSSLIOLayerSetOptions: using TLS version range (0x0303,0x0304)
2025-07-30 14:45:13.368998 UTC - [Parent 54986: Socket Thread]: D/pipnss [16468d900] nsSSLIOLayerSetOptions: enabling TLS ECH Grease
2025-07-30 14:45:13.369007 UTC - [Parent 54986: Socket Thread]: D/pipnss [16468d900] Socket set up
2025-07-30 14:45:13.369010 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport   [tlsSocketControl=44ffa5900 callbacks=15385bf10]
2025-07-30 14:45:13.369016 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransportService::AttachSocket [handler=3736cfd00]
2025-07-30 14:45:13.369018 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransportService::AddToIdleList 16ba72360 [handler=3736cfd00]
2025-07-30 14:45:13.369020 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport   active=19 idle=1
2025-07-30 14:45:13.369022 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport   advancing to STATE_CONNECTING
2025-07-30 14:45:13.369023 UTC - [Parent 54986: Socket Thread]: E/nsSocketTransport nsSocketTransport::SendStatus [this=3736cfd00 status=4b0007]
(... removed 3 lines from seemingly unrelated socket to unrelated domain ...)
2025-07-30 14:45:13.369159 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport STS poll iter
(... removed statuses of 19 active unrelated sockets ...)
2025-07-30 14:45:13.369265 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport   idle [0] { handler=3736cfd00 condition=0 pollflags=6 }
2025-07-30 14:45:13.369267 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransportService::MoveToPollList 126508888 [handler=3736cfd00]
2025-07-30 14:45:13.369269 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransportService::AddToPollList 126508888 [handler=3736cfd00]
2025-07-30 14:45:13.369271 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport SocketContext::EnsureTimeout socket=3736cfd00
2025-07-30 14:45:13.369273 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport   engaging
2025-07-30 14:45:13.369274 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport   active=20 idle=1
2025-07-30 14:45:13.369276 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransportService::RemoveFromIdleList [handler=0]
2025-07-30 14:45:13.369278 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport   active=20 idle=0
2025-07-30 14:45:13.369280 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport   calling PR_Poll [active=20 idle=0]
(... removed 19x message like the next two lines for 19 unrelated sockets ...)
2025-07-30 14:45:13.369348 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport SocketContext::TimeoutIn socket=3736cfd00, timeout=65535s
2025-07-30 14:45:13.369350 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport   not engaged
2025-07-30 14:45:13.369351 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport poll timeout: none
2025-07-30 14:45:13.369353 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport     timeout = -1 milliseconds
(... removed 18x "poll SSL socket using lower 5" + "poll SSL socket returned 5" ...)
2025-07-30 14:45:13.369421 UTC - [Parent 54986: Socket Thread]: V/pipnss [16468d900] poll SSL socket using lower 6
2025-07-30 14:45:13.369423 UTC - [Parent 54986: Socket Thread]: V/pipnss [16468d900] poll SSL socket returned 6
2025-07-30 14:45:13.619900 UTC - [Parent 54986: Timer]: D/nsSocketTransport STS dispatch [142f21390]
2025-07-30 14:45:13.619967 UTC - [Parent 54986: Timer]: D/nsSocketTransport PollableEvent::Signal
2025-07-30 14:45:13.619982 UTC - [Parent 54986: Timer]: D/nsSocketTransport PollableEvent::MarkFirstSignalTimestamp
2025-07-30 14:45:13.620001 UTC - [Parent 54986: Timer]: D/nsSocketTransport PollableEvent::Signal PR_Write 1
2025-07-30 14:45:13.620044 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport     ...returned after 251 milliseconds
(... removed 19x "SocketContext::TimeoutIn" + "not engaged" messages ...)
2025-07-30 14:45:13.620432 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport SocketContext::DisengageTimeout socket=3736cfd00
2025-07-30 14:45:13.620443 UTC - [Parent 54986: Socket Thread]: E/nsSocketTransport nsSocketTransport::OnSocketReady [this=3736cfd00 outFlags=16]
2025-07-30 14:45:13.620468 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport ErrorAccordingToNSPR [in=-5999 out=80004005]
2025-07-30 14:45:13.620480 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport   connection failed! [reason=80004005]
2025-07-30 14:45:13.620490 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransportService::DetachSocket [handler=3736cfd00]
2025-07-30 14:45:13.620500 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketDetached [this=3736cfd00 cond=80004005]
2025-07-30 14:45:13.620511 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketTransport::RecoverFromError [this=3736cfd00 state=3 cond=80004005]
2025-07-30 14:45:13.620524 UTC - [Parent 54986: Socket Thread]: D/nsHostResolver Adding address to blocklist for host [searchfox.org], host record [415894c60].used trr=0
2025-07-30 14:45:13.620536 UTC - [Parent 54986: Socket Thread]: D/nsHostResolver Successfully adding address [44.226.105.90] to blocklist for host [searchfox.org].
2025-07-30 14:45:13.620548 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport   not a recoverable error 80004005
2025-07-30 14:45:13.620558 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketInputStream::OnSocketReady [this=53b6cd040 cond=80004005]
2025-07-30 14:45:13.620568 UTC - [Parent 54986: Socket Thread]: D/nsSocketTransport nsSocketOutputStream::OnSocketReady [this=53b6cee80 cond=80004005]
2025-07-30 14:45:13.620582 UTC - [Parent 54986: Socket Thread]: V/nsHttp DnsAndConnectSocket::OnOutputStreamReady [this=4f1b47ec0 ent=searchfox.org primary]

(if you want to see more, just look at the log from comment 6).

(shortening title because I can now consistently experience the problem even without another Firefox around)

Summary: Requests fail with PR_BAD_DESCRIPTOR_ERROR, when another Firefox instance is open → Requests fail with PR_BAD_DESCRIPTOR_ERROR

Rob, if you can still reproduce after bug 1964292 is reverted, can I ask for an about:logging Firefox Profiler capture, this time with "Enable stack traces for log messages" enabled?
I've been finding that helpful in understanding what's happening.

Flags: needinfo?(rob)

(In reply to Andrew Creskey [:acreskey] from comment #9)

Rob, if you can still reproduce after bug 1964292 is reverted, can I ask for an about:logging Firefox Profiler capture, this time with "Enable stack traces for log messages" enabled?
I've been finding that helpful in understanding what's happening.

I'm afraid that I won't be able to reproduce after a restart, so I'll try debugging with what I have now.

Also, I don't think that bug 1964292 is going to affect this bug in any way; the "Adding address to blocklist for host" log entry is after the "ErrorAccordingToNSPR" entry (see comment 7 for easy Ctrl+F).

Just to make sure that it was not cached way earlier, I created a new STR, with a website that I have never visited before (https://overheid.nl/testhere). I put a x before the domain before I started profiling, to make sure that there is no preloading or anything that could have happened. After starting the profiler, I removed the x, pressed enter, and stopped the profile. Here is the result: https://share.firefox.dev/3H28ntA

Flags: needinfo?(rob)

(In reply to Rob Wu [:robwu] from comment #10)

(In reply to Andrew Creskey [:acreskey] from comment #9)

Rob, if you can still reproduce after bug 1964292 is reverted, can I ask for an about:logging Firefox Profiler capture, this time with "Enable stack traces for log messages" enabled?
I've been finding that helpful in understanding what's happening.

I'm afraid that I won't be able to reproduce after a restart, so I'll try debugging with what I have now.

Also, I don't think that bug 1964292 is going to affect this bug in any way; the "Adding address to blocklist for host" log entry is after the "ErrorAccordingToNSPR" entry (see comment 7 for easy Ctrl+F).

Just to make sure that it was not cached way earlier, I created a new STR, with a website that I have never visited before (https://overheid.nl/testhere). I put a x before the domain before I started profiling, to make sure that there is no preloading or anything that could have happened. After starting the profiler, I removed the x, pressed enter, and stopped the profile. Here is the result: https://share.firefox.dev/3H28ntA

Thank you, Rob, that's quite interesting, and I only see the -5999, PR_BAD_DESCRIPTOR_ERROR, ErrorAccordingToNSPR in your logs, but not PR_HOST_UNREACHABLE_ERROR, -5927.

Could I ask for one more profile of a repro? -- This last one is missing some threads like the DNS resolver.

When you go to upload the profile, make sure "Include Hidden Threads" is checked - that's my best guess as to why those threads are missing.

Flags: needinfo?(rob)

Here is a new profile: https://share.firefox.dev/4fgqQPv

Description of STR:

  • Visiting a site I never visited before, navigation from a new private browsing tab to it.
  • URL: http://coolgoodtranscendentjoke.neverssl.com/online
  • Preferences dom.security.https_first_pbm set to false, dom.security.https_only_mode_send_http_background_request set to false.
    (I did this to rule out https as the cause of this issue).
  • I tried to avoid noise; when the request was captured, I saw a long-polling request from google in one of my tabs (to prod-dynamite-prod-00-us-signaler-pa.clients6.google.com). The request was already started before so I don't think that this should matter much, but mentioning for completeness.
  • I redacted the tracks not related to the issue, but still included all network threads. In the profile I kept the threads without markers as inactive, and showed those that did have markers indicating activity.

P.S. I am writing this comment from the same Firefox profile. I have to repeat Bugzilla a couple of times before the navigation can even succeed.
P.S. profiler.firefox.com can be used to export the initial profile. It fails to symbolize, so I used a separate Firefox instance to symbolize before sharing here.

Flags: needinfo?(rob)
See Also: → 1964292

(In reply to Rob Wu [:robwu] from comment #7)

I continued reading the code. nsSocketTransport::OnSocketReady only gets to the state above if state != PR_SUCCESS.
That happens if PR_ConnectContinue(fd, outFlags) returns PR_FAILURE. Now PR_ConnectContinue delegates its implementation elsewhere. Searchfox is unable to find the right symbol, so I looked up all definitions of PRIOMethods types and think that either tcpMethods in prsocket.c or _pr_tcp_methods in ptio.c` are the most likely candidates.

Both SocketConnectContinue and pt_ConnectContinue call PR_SetError(PR_BAD_DESCRIPTOR_ERROR, 0); if out_flags & PR_POLL_NVAL, which is defined as:

#define PR_POLL_NVAL    0x10        /* only in out_flags when fd is bad */

From the log file (comment 7), I see that outFlags is 16: E/nsSocketTransport nsSocketTransport::OnSocketReady [this=3736cfd00 outFlags=16]
which comes from mPollList[i + 1], only modified by PR_Poll. PR_Poll here is _pr_poll_with_poll, which only has two ways to set PR_POLL_NVAL:

Question is why these are triggered.

My Nightly instance where the issue happened has crashed without reason.

Way before that happened, I recorded syscalls by launching Instruments from Xcode and selecting the existing Nightly process.

In my test case, I disabled automatic https requests (see comment 12), and loaded http://httpbin.org/?httponly . I chose this domain because it responds with plain HTTP, without forcing HTTPS upgrades. Also, it only has IPv4 addresses, to rule out IPv6 issues.

I'm observing the following:

  • I confirmed that there are no other threads that interacts with the fd returned by socket() for this request. The remaining calls/logs in this list are all on the Socket Thread.
  • After the socket is created, connect() is called on it and we start polling via PR_Poll, which ends up calling ssl_Poll, which ultimately triggers the getpeername syscall that appears in the syscall trace. This tells that the socket is not connected (this is normal; at this point the socket is new and we want to wait until it is connected).
  • The next syscall is select, still from PR_Poll. This blocks the thread until select returns.
  • The next syscall is semaphore_signal_trap triggered via some Glean code, rooted in an OnSocketReady call immediately after the PR_ConnectContinue call returned.
    • Note that this PR_ConnectContinue was already mentioned in comment 13. From the log analysis in comment 13, it is obvious that the internal state is already bad here.
  • The next syscall is sys_ulock_wait2, from a malloc, indirectly called via nsSocketTransportService::DetachSocket.
    • nsSocketTransportService::DetachSocket logs nsSocketTransportService::DetachSocket [handler=. This log appears in the initial report and the other logs and profiles that I have shared in this bug so far, and always consistently after ErrorAccordingToNSPR [in=-5999 out=80004005].
  • The next syscall (ignoring syscalls related to memory management rooted in DetachSocket) is sys_close, and passed the fd.

So, based on these observations, there is no indication from syscalls that told Firefox that the fd was bad, yet Firefox somehow thought otherwise.

(In reply to Rob Wu [:robwu] from comment #14)

So, based on these observations, there is no indication from syscalls that told Firefox that the fd was bad, yet Firefox somehow thought otherwise.

That conclusion may not be accurate; select() modifies the set of fds that it receives. It is possible that the bad fd status was actually communicated through that.

I see that select() on the Socket Thread was passed 0x767 (two separate recordings) and 0x4b3 (two separate recordings, after the first one, with many hours apart - one night of sleep) as the first parameter. This implies that the Socket Thread was polling for 1895 and 1203 sockets. I have many tabs open, but nothing this observation in case it is relevant.

Attached file WIP: Bug 1980171 - Add logs (obsolete) β€”

Well, this is what I found:

The PR_POLL_NVAL error appears to come from this check that treats a socket as invalid if its osfd is greater than FD_SETSIZE. In Firefox, FD_SETSIZE is defined as 4096.
From the system trace Rob provided, all the failed sockets had file descriptors greater than 4096.

Note that we increased FD_SETSIZE to 4096 in bug 1516325. However, we’re now seeing that socket() can return a file descriptor larger than 4096.
Does this mean we should increase FD_SETSIZE again, or is the check itself incorrect? I’m not sure.
Jed, do you have any thoughts on this?

Flags: needinfo?(jld)
See Also: → 1516325

We should stop using PR_Poll and rely on the polling crate.

(In reply to Kershaw Chang [:kershaw] from comment #17)

From the system trace Rob provided, all the failed sockets had file descriptors greater than 4096.

In the traces I recorded, the fd number was in the 6000 range.
I saved the output of lsof -p $(pgrep -f firefox) from around that time, and noticed that there are many UNIX domain sockets:

$ awk '{print $5}' lsof.txt | grep -v TYPE | sort | uniq -c | sort -rn
5564 unix
 638 REG
  11 KQUEUE
   8 PIPE
   8 IPv6
   7 IPv4
   6 CHR
   1 systm
   1 NPOLICY
   1 DIR

The vast majority of these domain sockets point to (none). Could that be a sign of something not getting cleaned up properly?

$ grep ' unix ' lsof.txt | grep '(none)' | wc -l
    5353

For future reference, the Firefox instance that this was captured from:

  • 143.0a1 buildid 20250726091124 (about one week old)
  • Session duration: about one week.
  • Other info: Updater has not downloaded an update. My browsing habits were not significantly different from before. As usual I visited lots of Searchfox and Bugzilla (in private browsing and some non-private browsing) and some Google Docs. I often have many tabs open, it is possible that I had even more open than usual (mostly Searchfox and a few others in PBM), but I have also done that before without issues.
Duplicate of this bug: 1980425

I'm probably missing something, but how does FD_SETSIZE matter here? That is the maximum size of the descriptor array passed into poll (or actually select, which had limits). But the descriptor values in the array can obviously be larger than FD_SETSIZE - you just can't pass more elements in.

(In reply to Lars Eggert [:lars] from comment #22)

I'm probably missing something, but how does FD_SETSIZE matter here? That is the maximum size of the descriptor array passed into poll (or actually select, which had limits). But the descriptor values in the array can obviously be larger than FD_SETSIZE - you just can't pass more elements in.

Are you asking in relation to the linked code, to understand why we have the check? When I looked into it (for the first time ever - I never worked with this before), I read that the numeric value of the fd is used as an index into the file descriptor set. If we don't skip fds whose numeric value exceeds FD_SETSIZE, then we would be accessing memory out of bounds. Buffer overrun on the stack in this case since the implementation declares the fd_set structs on the stack.

Oh, so fd#0 would always be at index 0, and so on? That seems like a strange design choice. But given that it's NSPR, I'm not surprised.

I still think the correct longterm fix is to move to the polling crate.

Depends on: 1981002

Following my observation in comment 20, I created an extension that opens and closes tabs automatically. I can reliably reproduce the issue here.

The easiest thing to see if it reproduces is to:

  1. Run the STR from bug 1981002 with many (somewhere between 4000 and 5000) tabs
  2. Start a syscall trace with Instruments
  3. Open a new private browsing window and navigate to example.com (and if that happens to work, a few other sites - in case a lower fd happened to have become available in the meantime)
  4. Stop the recording of Instruments.
  5. Search for the connect syscall and see if its arg1 is a high number (0x1000 or higher).

(In reply to Lars Eggert [:lars] from comment #24)

Oh, so fd#0 would always be at index 0, and so on? That seems like a strange design choice. But given that it's NSPR, I'm not surprised.

This is not a design choice by NSPR, but from select.

PR_Poll() uses either select or poll, depending on the system. (There are comments about HPUX, so that gives an idea how old this code is - probably originally circa 1998-2000, though I'm sure quire a few tweaks along the way).

Rob Wu's mozregression points at bug 1964600

Keywords: regression
Regressed by: 1964600

Set release status flags based on info from the regressing bug 1964600

Since bug 1964600 has been identified as the regression bug, needinfo can be cleared.

Flags: needinfo?(jld)
Severity: -- → S2
Priority: -- → P1
Whiteboard: [necko-triaged]

:gsvelto, since you are the author of the regressor, bug 1964600, could you take a look?

For more information, please visit BugBot documentation.

Flags: needinfo?(gsvelto)
Flags: needinfo?(afranchuk)

Should we close this as a symptom/duplicate of bug 1981002?

Flags: needinfo?(afranchuk)

Is there some logging or special handling that we want to add to improve the visibility of this issue in the future?

Right now the user doesn't get any feedback, other than the navigation immediately be aborted.

If there is nothing else to do here, then this can be closed as a duplicate of 1981002 indeed.

Flags: needinfo?(kershaw)

My suggestion is that we just crash on file descriptor exhaustion. I'm pretty sure we do this in other places: https://searchfox.org/mozilla-central/rev/4fd0d5e4669bfa2d0888b730684d8adea061fd30/ipc/glue/DataPipe.cpp#491. This seems like a better outcome for the users because they don't end up with a non-functional browser and we get much better visibility into the problem.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #33)

My suggestion is that we just crash on file descriptor exhaustion. I'm pretty sure we do this in other places: https://searchfox.org/mozilla-central/rev/4fd0d5e4669bfa2d0888b730684d8adea061fd30/ipc/glue/DataPipe.cpp#491. This seems like a better outcome for the users because they don't end up with a non-functional browser and we get much better visibility into the problem.

This would be good. At one point we reserved a few file descriptors to prevent exhaustion from stopping us from creating a minidump, however with the crash handling done out-of-process I think(?) we removed that. Though I think :robwu said that he had the crashreporter come up without a minidump (which might imply that we couldn't get a file descriptor). All the same, explicitly crashing might be a more stable path forward. It'd also be pretty neat if we could create a test to check whether crash reporting works when file descriptors are exhausted (either using a simulated exhaustion, which might be difficult to pull off, or setting low file descriptor limits and pushing against them).

Also, if we could have some integration test that basically does these STR to check for fd leaks, that would catch things a lot earlier.

Attachment #9504607 - Attachment is obsolete: true

(In reply to Jeff Muizelaar [:jrmuizel] from comment #33)

My suggestion is that we just crash on file descriptor exhaustion. I'm pretty sure we do this in other places: https://searchfox.org/mozilla-central/rev/4fd0d5e4669bfa2d0888b730684d8adea061fd30/ipc/glue/DataPipe.cpp#491. This seems like a better outcome for the users because they don't end up with a non-functional browser and we get much better visibility into the problem.

I think this is the right approach. The issue is that socket() returns a valid file descriptor, but it’s the poll implementation that considers the fd invalid due to being out of range.
It seems more appropriate to crash when we detect an out-of-range fd directly, rather than crashing only when poll fails.

Flags: needinfo?(kershaw)

Downgrade this bug to S3 since the real issue will be fixed in bug 1981002.

Severity: S2 → S3
Priority: P1 → P2
Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-queue]

Bug 1981002 is fixed, so clear the needinfo.

Flags: needinfo?(gsvelto)
Assignee: nobody → kershaw
Status: NEW → ASSIGNED
Pushed by amarc@mozilla.com: https://github.com/mozilla-firefox/firefox/commit/90a41b4f9a5b https://hg.mozilla.org/integration/autoland/rev/b19993d2b6f2 Revert "Bug 1980171 - Crash if we get our-of-range fd, r=necko-reviewers,valentin" for causing build bustages @ nsSocketTransportService2.cpp

Backed out for causing build bustages @ nsSocketTransportService2.cpp

Flags: needinfo?(kershaw)
Attachment #9506857 - Attachment description: Bug 1980171 - Crash if we get our-of-range fd, r=#necko → Bug 1980171 - Crash if we get out-of-range fd, r=#necko
Flags: needinfo?(kershaw)
Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → 143 Branch

The patch landed in nightly and beta is affected.
:kershaw, is this bug important enough to require an uplift?

For more information, please visit BugBot documentation.

Flags: needinfo?(kershaw)
Flags: needinfo?(kershaw)
See Also: → 1983245

(In reply to Rob Wu [:robwu] from comment #25)

(In reply to Lars Eggert [:lars] from comment #24)

Oh, so fd#0 would always be at index 0, and so on? That seems like a strange design choice. But given that it's NSPR, I'm not surprised.

This is not a design choice by NSPR, but from select.

I think I've mentioned this out-of-band, but: the traditional select implementation seen on macOS (and Linux) uses flat bitsets like that, but Winsock implements fd_set as an array of fds (so in that case FD_SETSIZE is a limit only on the cardinality of the set), and that might explain some of the confusion about this.

The other thing that might be useful is: typically, when fd_set is a bitset, the kernel doesn't care about FD_SETSIZE, only the nfds value passed in the syscall, so it's possible for userland to redefine FD_SETSIZE like we're doing on macOS, but in theory it's also possible to dynamically allocate a large enough buffer and cast it to fd_set*, and libevent even has some code for this (which we're not using in IPC, because we use kqueue instead, but I noticed it a while ago while auditing for uses of select).

QA Whiteboard: [qa-triage-done-c144/b143]
See Also: → 1986095
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: