Crash in [@ mozilla::net::nsSocketTransportService::AttachSocket] with osfd < kFDs
Categories
(Core :: Networking, defect, P2)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox-esr115 | --- | unaffected |
| firefox-esr128 | --- | unaffected |
| firefox-esr140 | --- | unaffected |
| firefox142 | --- | unaffected |
| firefox143 | --- | wontfix |
| firefox144 | --- | wontfix |
| firefox146 | --- | wontfix |
| firefox147 | + | fixed |
| firefox148 | + | fixed |
People
(Reporter: mccr8, Assigned: afranchuk)
References
Details
(Keywords: crash, topcrash, Whiteboard: [necko-triaged][necko-priority-next])
Crash Data
Attachments
(2 files)
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
phab-bot
:
approval-mozilla-release+
|
Details | Review |
Crash report: https://crash-stats.mozilla.org/report/index/e9d6cc98-9359-48c1-a9d0-b71eb0250829
MOZ_CRASH Reason:
MOZ_RELEASE_ASSERT(osfd < kFDs)
Top 10 frames:
0 XUL MOZ_CrashSequence(void*, long) mfbt/Assertions.h:253
0 XUL mozilla::net::nsSocketTransportService::AttachSocket(PRFileDesc*, nsASocketHa... netwerk/base/nsSocketTransportService2.cpp:405
1 XUL mozilla::net::nsSocketTransport::InitiateSocket() netwerk/base/nsSocketTransport2.cpp:1464
2 XUL mozilla::net::nsSocketTransport::OnSocketEvent(unsigned int, nsresult, nsISup... netwerk/base/nsSocketTransport2.cpp
2 XUL mozilla::net::nsSocketEvent::Run() netwerk/base/nsSocketTransport2.cpp:98
3 XUL nsThread::ProcessNextEvent(bool, bool*) xpcom/threads/nsThread.cpp:1151
3 XUL NS_ProcessNextEvent(nsIThread*, bool) xpcom/threads/nsThreadUtils.cpp:461
4 XUL mozilla::net::nsSocketTransportService::Run() netwerk/base/nsSocketTransportService2.cpp:1214
5 XUL {virtual override thunk({offset(-32)}, mozilla::net::nsSocketTransportService... netwerk/base/nsSocketTransportService2.cpp:0
6 XUL nsThread::ProcessNextEvent(bool, bool*) xpcom/threads/nsThread.cpp:1151
This assertion was added in bug 1980171 but I assume whatever problem it indicates is separate, so not really a regression.
Updated•6 months ago
|
| Reporter | ||
Comment 1•6 months ago
|
||
It looks like this is only on MacOS.
| Reporter | ||
Updated•6 months ago
|
| Reporter | ||
Comment 2•6 months ago
|
||
I didn't see any other signatures where the crash reason contains osfd < kFDs.
Updated•6 months ago
|
Comment 3•6 months ago
|
||
(In reply to Andrew McCreight [:mccr8] from comment #1)
It looks like this is only on MacOS.
Several things are going on here. The issue that the assertion is protecting against is macOS-only; see also bug 1982486.
The assertion itself is enabled on other Unixes, and it actually doesn't need to be (real poll doesn't have a limit other than the range of the int type, and I don't think we're using the fake select-based one anywhere other than macOS), but it's unlikely to ever fail because the assertion's cutoff is 64k in that case and the browser would have to be started with the OS's fd resource limit set higher than that (which I believe is very uncommon).
Comment 5•6 months ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #4)
Also possibly fixed by bug 1983245.
Isn't that just a work-around? If we leak fds somewhere, that would be a bug, even if we lower the impact.
Perhaps worth calling out: over one third of the crash reports are happening shortly after startup.
83 out of 221 crash (of the crashes in the past month) crash within 30 seconds (of which 62 within 3 seconds!). Some of these are from the same user who crash repeatedly. That looks very unusual, a potential bug somewhere.
Comment 6•6 months ago
|
||
Considering it only happens on MacOS, it's possible that leaking FDs or not reusing closed ones may be a bug in MacOS.
Regardless, replacing that outdated PR_Poll with something more modern is something we've been pursuing for a while.
Comment 7•5 months ago
|
||
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 5 desktop browser crashes on Mac on release
:jesup, could you consider increasing the severity of this top-crash bug?
For more information, please visit BugBot documentation.
Comment 8•4 months ago
|
||
Perhaps worth calling out: over one third of the crash reports are happening shortly after startup.
83 out of 221 crash (of the crashes in the past month) crash within 30 seconds (of which 62 within 3 seconds!). Some of these are from the same user who crash repeatedly.
I might be such a user.
Crash reports:
bp-c66d520e-c315-4186-8332-c03080251030bp-9fc0cb2f-3297-403e-b789-ee80f0251030bp-96f2da31-36e7-45b6-8fda-20c060251030
It's not clear what caused the first one, but the latter two happened immediately after restarting Firefox from the crash reporter. It seemed that what stopped the loop was un-selecting the option to restart Firefox after sending the crash report, and launching Firefox manually.
Comment 9•4 months ago
|
||
:gsvelto Does the described interaction between this bug and the crash reporter ring any bell towards a potential resolution? Here is the relevant part from comment 8 (removed code formatting around crash ID so that Bugzilla can automatically turn these into links):
- bp-c66d520e-c315-4186-8332-c03080251030
- bp-9fc0cb2f-3297-403e-b789-ee80f0251030
- bp-96f2da31-36e7-45b6-8fda-20c060251030
It's not clear what caused the first one, but the latter two happened immediately after restarting Firefox from the crash reporter. It seemed that what stopped the loop was un-selecting the option to restart Firefox after sending the crash report, and launching Firefox manually.
Comment 10•4 months ago
|
||
I wonder if this is caused by us sending crash pings via background tasks, Alex can you have a look?
| Assignee | ||
Comment 11•4 months ago
•
|
||
I would have thought so a month ago, however since bug 1991491 landed, we disable crash reporting in the background task when launched. On the contrary, this appears to have increased a lot in the past month. Also as a stronger indication: none of these crashes have the BackgroundTaskName annotation set.
It's interesting that restarting from the crash reporter client was failing repeatedly, but launching manually worked. The main difference between the two is that Firefox would be launched with the same initial args when launched from the crash reporter client.
Comment 12•4 months ago
|
||
I'm wondering... we're using posix_spawnp() to launch the crash reporter client so the new Firefox should not inherit the file descriptors of the crashed one, but what if it did? It might be worth checking if we still have some unexpected file descriptors after a restart. I'm NI?ing myself so I remember to check on my mac tomorrow.
Updated•4 months ago
|
Comment 13•4 months ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #12)
I'm wondering... we're using
posix_spawnp()to launch the crash reporter client so the new Firefox should not inherit the file descriptors of the crashed one, but what if it did? It might be worth checking if we still have some unexpected file descriptors after a restart. I'm NI?ing myself so I remember to check on my mac tomorrow.
That's a very good theory.
https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/posix_spawn.2.html (not sure this is the best link for the docs though)
File descriptors open in the calling process image remain open in the new
process image, except for those for which the close-on-exec flag is set
(see close(2) and fcntl(2)). Descriptors that remain open are unaffected
by posix_spawn() unless their behaviour is modified by a file action; see
posix_spawn_file_actions_init(3) for more information.
As it happens we have a few open bugs about setting the flag - See bug 1463873, 1667748 and 1840206
Comment 14•4 months ago
|
||
Ah yes, that's definitely it then. Given we're starting a new fresh instance of Firefox we should be closing all spurious files but doing so might be tricky. We launch the crash reporter client here which is in a crashed context, so iterating over the existing files and adding file actions is not possible as we're in a crashed context. The crash reporter client uses Rust's process facilities to launch the new instance, and that also doesn't give controls over the files. So the only reasonable place where we could clean up whatever file descriptors we've inherited from the crashed Firefox is during the crash reporter client's startup. We could iterate over the open files right on startup and close everything we don't know about (stdin/stdout/stderr and whatever else might have been opened for launching the client). We should do that on Linux too given we use fork() to launch the client. Alex, does this sound reasonable?
Naturally all of this would be moot if I had already finished fixing bug 587729 and we could open the crash reporter client from there... but that's unfortunately still a few months away due to unforeseen issues :(
| Assignee | ||
Comment 15•4 months ago
|
||
Yeah, that sounds simple enough to me. I'm curious why there's an uptick in crashes now, though. I know we have been making some changes to which APIs we use on Mac when dealing with fds in certain contexts.
Comment 16•4 months ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #12)
I'm wondering... we're using
posix_spawnp()to launch the crash reporter client so the new Firefox should not inherit the file descriptors of the crashed one, but what if it did?
On macOS you can use POSIX_SPAWN_CLOEXEC_DEFAULT for that. But that's an Apple extension, so it's not usable on Linux or regular BSD; we have to use CloseSuperfluousFds with fork+exec, and avoid posix_spawn.
Our nsIProcess and JS Subprocess have been modified to use IPC LaunchApp which takes care of this, but as for Rust… this fix for a similar issue involving GDK and sandboxed image decoders (part of the chain of things that went wrong in bug 1986254) mentions that Rust libraries generally set cloexec when creating fds, so the Rust ecosystem probably doesn't encounter this problem much.
Comment 17•3 months ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #14)
We could iterate over the open files right on startup and close everything we don't know about (stdin/stdout/stderr and whatever else might have been opened for launching the client). We should do that on Linux too given we use
fork()to launch the client. Alex, does this sound reasonable?
Naturally all of this would be moot if I had already finished fixing bug 587729 and we could open the crash reporter client from there... but that's unfortunately still a few months away due to unforeseen issues :(
Gabriele, has there been any progress on this task? Are we tracking it somewhere?
Comment 18•3 months ago
|
||
None of us is working on this yet, but we might prioritize if the volume is big enough to justify it.
Comment 19•3 months ago
|
||
The bug is linked to a topcrash signature, which matches the following criteria:
- Top 20 desktop browser crashes on release (startup)
- Top 5 desktop browser crashes on Mac on release
For more information, please visit BugBot documentation.
Comment 20•2 months ago
|
||
[Tracking Requested - why for this release]:
Most frequent macOS crash signature.
Updated•2 months ago
|
Comment 21•2 months ago
|
||
The bug is marked as tracked for firefox148 (nightly). We have limited time to fix this, the soft freeze is in a day. However, the bug still isn't assigned and has low severity.
:ghess, could you please find an assignee and increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.
For more information, please visit BugBot documentation.
Comment 22•2 months ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #18)
None of us is working on this yet, but we might prioritize if the volume is big enough to justify it.
Hi Gabriele,
Should we consider increasing the priority of this bug now?
As a short-team workaround, is increasing the current limit helpful?
| Assignee | ||
Updated•2 months ago
|
| Assignee | ||
Comment 23•2 months ago
|
||
| Assignee | ||
Updated•2 months ago
|
Comment 25•2 months ago
|
||
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Comment 26•2 months ago
|
||
Comment 27•2 months ago
|
||
| bugherder | ||
Comment 28•2 months ago
|
||
The patch landed in nightly and beta is affected.
:afranchuk, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.
- See https://wiki.mozilla.org/Release_Management/Requesting_an_Uplift for documentation on how to request an uplift.
- If no, please set
status-firefox147towontfix.
For more information, please visit BugBot documentation.
| Assignee | ||
Comment 29•2 months ago
|
||
Original Revision: https://phabricator.services.mozilla.com/D278155
Updated•2 months ago
|
Comment 30•2 months ago
|
||
firefox-beta Uplift Approval Request
- User impact if declined: MacOS users may experience crashes as a result of having too many consecutive crashes (from fd exhaustion).
- Code covered by automated testing: no
- Fix verified in Nightly: yes
- Needs manual QE test: no
- Steps to reproduce for manual QE testing:
- Risk associated with taking this patch: low
- Explanation of risk level: The patch doesn't produce any new error paths (at worst, a warning will be written to the log).
- String changes made/needed: No
- Is Android affected?: no
| Assignee | ||
Updated•2 months ago
|
Updated•2 months ago
|
Updated•2 months ago
|
Updated•2 months ago
|
Comment 31•2 months ago
|
||
| uplift | ||
Comment 32•4 days ago
|
||
Looking at the crash stats, this ain't fixed.
Comment 33•4 days ago
|
||
Agreed, but probably better tracked in a new bug at this point.
| Assignee | ||
Comment 34•4 days ago
|
||
Looking at the crash counts, there's been a steady decline since 147 was released to a new baseline. The patch here fixes an issue which exacerbates the problem, however it cannot fix all instances of the issue.
Description
•