Closed Bug 1986095 Opened 6 months ago Closed 2 months ago

Crash in [@ mozilla::net::nsSocketTransportService::AttachSocket] with osfd < kFDs

Categories

(Core :: Networking, defect, P2)

All
macOS
defect
Points:
3

Tracking

()

RESOLVED FIXED
148 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox-esr128 --- unaffected
firefox-esr140 --- unaffected
firefox142 --- unaffected
firefox143 --- wontfix
firefox144 --- wontfix
firefox146 --- wontfix
firefox147 + fixed
firefox148 + fixed

People

(Reporter: mccr8, Assigned: afranchuk)

References

Details

(Keywords: crash, topcrash, Whiteboard: [necko-triaged][necko-priority-next])

Crash Data

Attachments

(2 files)

Crash report: https://crash-stats.mozilla.org/report/index/e9d6cc98-9359-48c1-a9d0-b71eb0250829

MOZ_CRASH Reason:

MOZ_RELEASE_ASSERT(osfd < kFDs)

Top 10 frames:

0  XUL  MOZ_CrashSequence(void*, long)  mfbt/Assertions.h:253
0  XUL  mozilla::net::nsSocketTransportService::AttachSocket(PRFileDesc*, nsASocketHa...  netwerk/base/nsSocketTransportService2.cpp:405
1  XUL  mozilla::net::nsSocketTransport::InitiateSocket()  netwerk/base/nsSocketTransport2.cpp:1464
2  XUL  mozilla::net::nsSocketTransport::OnSocketEvent(unsigned int, nsresult, nsISup...  netwerk/base/nsSocketTransport2.cpp
2  XUL  mozilla::net::nsSocketEvent::Run()  netwerk/base/nsSocketTransport2.cpp:98
3  XUL  nsThread::ProcessNextEvent(bool, bool*)  xpcom/threads/nsThread.cpp:1151
3  XUL  NS_ProcessNextEvent(nsIThread*, bool)  xpcom/threads/nsThreadUtils.cpp:461
4  XUL  mozilla::net::nsSocketTransportService::Run()  netwerk/base/nsSocketTransportService2.cpp:1214
5  XUL  {virtual override thunk({offset(-32)}, mozilla::net::nsSocketTransportService...  netwerk/base/nsSocketTransportService2.cpp:0
6  XUL  nsThread::ProcessNextEvent(bool, bool*)  xpcom/threads/nsThread.cpp:1151

This assertion was added in bug 1980171 but I assume whatever problem it indicates is separate, so not really a regression.

It looks like this is only on MacOS.

OS: Unspecified → macOS
Hardware: Unspecified → All

I didn't see any other signatures where the crash reason contains osfd < kFDs.

Severity: -- → S3
Points: --- → 3
Rank: 2
Priority: -- → P2
Whiteboard: [necko-triaged][necko-priority-next]

(In reply to Andrew McCreight [:mccr8] from comment #1)

It looks like this is only on MacOS.

Several things are going on here. The issue that the assertion is protecting against is macOS-only; see also bug 1982486.

The assertion itself is enabled on other Unixes, and it actually doesn't need to be (real poll doesn't have a limit other than the range of the int type, and I don't think we're using the fake select-based one anywhere other than macOS), but it's unlikely to ever fail because the assertion's cutoff is 64k in that case and the browser would have to be started with the OS's fd resource limit set higher than that (which I believe is very uncommon).

See Also: → 1982486

Also possibly fixed by 1983245.

See Also: → 1983245

(In reply to Valentin Gosu [:valentin] (he/him) from comment #4)

Also possibly fixed by bug 1983245.

Isn't that just a work-around? If we leak fds somewhere, that would be a bug, even if we lower the impact.

Perhaps worth calling out: over one third of the crash reports are happening shortly after startup.
83 out of 221 crash (of the crashes in the past month) crash within 30 seconds (of which 62 within 3 seconds!). Some of these are from the same user who crash repeatedly. That looks very unusual, a potential bug somewhere.

Considering it only happens on MacOS, it's possible that leaking FDs or not reusing closed ones may be a bug in MacOS.
Regardless, replacing that outdated PR_Poll with something more modern is something we've been pursuing for a while.

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 5 desktop browser crashes on Mac on release

:jesup, could you consider increasing the severity of this top-crash bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(rjesup)
Keywords: topcrash

Perhaps worth calling out: over one third of the crash reports are happening shortly after startup.
83 out of 221 crash (of the crashes in the past month) crash within 30 seconds (of which 62 within 3 seconds!). Some of these are from the same user who crash repeatedly.

I might be such a user.

Crash reports:

  • bp-c66d520e-c315-4186-8332-c03080251030
  • bp-9fc0cb2f-3297-403e-b789-ee80f0251030
  • bp-96f2da31-36e7-45b6-8fda-20c060251030

It's not clear what caused the first one, but the latter two happened immediately after restarting Firefox from the crash reporter. It seemed that what stopped the loop was un-selecting the option to restart Firefox after sending the crash report, and launching Firefox manually.

:gsvelto Does the described interaction between this bug and the crash reporter ring any bell towards a potential resolution? Here is the relevant part from comment 8 (removed code formatting around crash ID so that Bugzilla can automatically turn these into links):

It's not clear what caused the first one, but the latter two happened immediately after restarting Firefox from the crash reporter. It seemed that what stopped the loop was un-selecting the option to restart Firefox after sending the crash report, and launching Firefox manually.

Flags: needinfo?(gsvelto)

I wonder if this is caused by us sending crash pings via background tasks, Alex can you have a look?

Flags: needinfo?(gsvelto) → needinfo?(afranchuk)

I would have thought so a month ago, however since bug 1991491 landed, we disable crash reporting in the background task when launched. On the contrary, this appears to have increased a lot in the past month. Also as a stronger indication: none of these crashes have the BackgroundTaskName annotation set.

It's interesting that restarting from the crash reporter client was failing repeatedly, but launching manually worked. The main difference between the two is that Firefox would be launched with the same initial args when launched from the crash reporter client.

Flags: needinfo?(afranchuk)

I'm wondering... we're using posix_spawnp() to launch the crash reporter client so the new Firefox should not inherit the file descriptors of the crashed one, but what if it did? It might be worth checking if we still have some unexpected file descriptors after a restart. I'm NI?ing myself so I remember to check on my mac tomorrow.

Flags: needinfo?(gsvelto)
Flags: needinfo?(rjesup)

(In reply to Gabriele Svelto [:gsvelto] from comment #12)

I'm wondering... we're using posix_spawnp() to launch the crash reporter client so the new Firefox should not inherit the file descriptors of the crashed one, but what if it did? It might be worth checking if we still have some unexpected file descriptors after a restart. I'm NI?ing myself so I remember to check on my mac tomorrow.

That's a very good theory.
https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/posix_spawn.2.html (not sure this is the best link for the docs though)

File descriptors open in the calling process image remain open in the new
process image, except for those for which the close-on-exec flag is set
(see close(2) and fcntl(2)). Descriptors that remain open are unaffected
by posix_spawn() unless their behaviour is modified by a file action; see
posix_spawn_file_actions_init(3) for more information.

As it happens we have a few open bugs about setting the flag - See bug 1463873, 1667748 and 1840206

Ah yes, that's definitely it then. Given we're starting a new fresh instance of Firefox we should be closing all spurious files but doing so might be tricky. We launch the crash reporter client here which is in a crashed context, so iterating over the existing files and adding file actions is not possible as we're in a crashed context. The crash reporter client uses Rust's process facilities to launch the new instance, and that also doesn't give controls over the files. So the only reasonable place where we could clean up whatever file descriptors we've inherited from the crashed Firefox is during the crash reporter client's startup. We could iterate over the open files right on startup and close everything we don't know about (stdin/stdout/stderr and whatever else might have been opened for launching the client). We should do that on Linux too given we use fork() to launch the client. Alex, does this sound reasonable?

Naturally all of this would be moot if I had already finished fixing bug 587729 and we could open the crash reporter client from there... but that's unfortunately still a few months away due to unforeseen issues :(

Flags: needinfo?(gsvelto)

Yeah, that sounds simple enough to me. I'm curious why there's an uptick in crashes now, though. I know we have been making some changes to which APIs we use on Mac when dealing with fds in certain contexts.

(In reply to Gabriele Svelto [:gsvelto] from comment #12)

I'm wondering... we're using posix_spawnp() to launch the crash reporter client so the new Firefox should not inherit the file descriptors of the crashed one, but what if it did?

On macOS you can use POSIX_SPAWN_CLOEXEC_DEFAULT for that. But that's an Apple extension, so it's not usable on Linux or regular BSD; we have to use CloseSuperfluousFds with fork+exec, and avoid posix_spawn.

Our nsIProcess and JS Subprocess have been modified to use IPC LaunchApp which takes care of this, but as for Rust… this fix for a similar issue involving GDK and sandboxed image decoders (part of the chain of things that went wrong in bug 1986254) mentions that Rust libraries generally set cloexec when creating fds, so the Rust ecosystem probably doesn't encounter this problem much.

See Also: → 1989596

(In reply to Gabriele Svelto [:gsvelto] from comment #14)

We could iterate over the open files right on startup and close everything we don't know about (stdin/stdout/stderr and whatever else might have been opened for launching the client). We should do that on Linux too given we use fork() to launch the client. Alex, does this sound reasonable?
Naturally all of this would be moot if I had already finished fixing bug 587729 and we could open the crash reporter client from there... but that's unfortunately still a few months away due to unforeseen issues :(

Gabriele, has there been any progress on this task? Are we tracking it somewhere?

Flags: needinfo?(gsvelto)

None of us is working on this yet, but we might prioritize if the volume is big enough to justify it.

Flags: needinfo?(gsvelto)

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on release (startup)
  • Top 5 desktop browser crashes on Mac on release

For more information, please visit BugBot documentation.

[Tracking Requested - why for this release]:
Most frequent macOS crash signature.

The bug is marked as tracked for firefox148 (nightly). We have limited time to fix this, the soft freeze is in a day. However, the bug still isn't assigned and has low severity.

:ghess, could you please find an assignee and increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(ghess)

(In reply to Gabriele Svelto [:gsvelto] from comment #18)

None of us is working on this yet, but we might prioritize if the volume is big enough to justify it.

Hi Gabriele,

Should we consider increasing the priority of this bug now?
As a short-team workaround, is increasing the current limit helpful?

Flags: needinfo?(ghess) → needinfo?(gsvelto)
Assignee: nobody → afranchuk
Status: NEW → ASSIGNED

I've put up a patch for this.

Flags: needinfo?(gsvelto)

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → 148 Branch

The patch landed in nightly and beta is affected.
:afranchuk, is this bug important enough to require an uplift?

For more information, please visit BugBot documentation.

Flags: needinfo?(afranchuk)
Attachment #9536654 - Flags: approval-mozilla-beta?

firefox-beta Uplift Approval Request

  • User impact if declined: MacOS users may experience crashes as a result of having too many consecutive crashes (from fd exhaustion).
  • Code covered by automated testing: no
  • Fix verified in Nightly: yes
  • Needs manual QE test: no
  • Steps to reproduce for manual QE testing:
  • Risk associated with taking this patch: low
  • Explanation of risk level: The patch doesn't produce any new error paths (at worst, a warning will be written to the log).
  • String changes made/needed: No
  • Is Android affected?: no
Flags: needinfo?(afranchuk)
Attachment #9536654 - Flags: approval-mozilla-beta? → approval-mozilla-release?
Attachment #9536654 - Flags: approval-mozilla-release? → approval-mozilla-release+

Looking at the crash stats, this ain't fixed.

Agreed, but probably better tracked in a new bug at this point.

Looking at the crash counts, there's been a steady decline since 147 was released to a new baseline. The patch here fixes an issue which exacerbates the problem, however it cannot fix all instances of the issue.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: