Closed Bug 1986095 Opened 6 months ago Closed 2 months ago

Crash in [@ mozilla::net::nsSocketTransportService::AttachSocket] with osfd < kFDs

Tracking

()

Status:

RESOLVED FIXED

Milestone:

148 Branch

Tracking Flags:

Tracking

Status

firefox-esr115

---

unaffected

firefox-esr128

---

unaffected

firefox-esr140

---

unaffected

firefox142

---

unaffected

firefox143

---

wontfix

firefox144

---

wontfix

firefox146

---

wontfix

firefox147

fixed

firefox148

fixed

People

(Reporter: mccr8, Assigned: afranchuk)

References

Details

(Keywords: crash, topcrash, Whiteboard: [necko-triaged][necko-priority-next])

Crash Data

Attachments

(2 files)

Bug 1986095 - Close unused fds when the crashreporter client starts r=gsvelto 2 months ago Alex Franchuk [:afranchuk] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1986095 - Close unused fds when the crashreporter client starts 2 months ago Alex Franchuk [:afranchuk] 48 bytes, text/x-phabricator-request	phab-bot : approval-mozilla-release+	Details \| Review

Andrew McCreight [:mccr8]

Reporter

Description

•

6 months ago

Crash report: https://crash-stats.mozilla.org/report/index/e9d6cc98-9359-48c1-a9d0-b71eb0250829

MOZ_CRASH Reason:

MOZ_RELEASE_ASSERT(osfd < kFDs)

Top 10 frames:

0  XUL  MOZ_CrashSequence(void*, long)  mfbt/Assertions.h:253
0  XUL  mozilla::net::nsSocketTransportService::AttachSocket(PRFileDesc*, nsASocketHa...  netwerk/base/nsSocketTransportService2.cpp:405
1  XUL  mozilla::net::nsSocketTransport::InitiateSocket()  netwerk/base/nsSocketTransport2.cpp:1464
2  XUL  mozilla::net::nsSocketTransport::OnSocketEvent(unsigned int, nsresult, nsISup...  netwerk/base/nsSocketTransport2.cpp
2  XUL  mozilla::net::nsSocketEvent::Run()  netwerk/base/nsSocketTransport2.cpp:98
3  XUL  nsThread::ProcessNextEvent(bool, bool*)  xpcom/threads/nsThread.cpp:1151
3  XUL  NS_ProcessNextEvent(nsIThread*, bool)  xpcom/threads/nsThreadUtils.cpp:461
4  XUL  mozilla::net::nsSocketTransportService::Run()  netwerk/base/nsSocketTransportService2.cpp:1214
5  XUL  {virtual override thunk({offset(-32)}, mozilla::net::nsSocketTransportService...  netwerk/base/nsSocketTransportService2.cpp:0
6  XUL  nsThread::ProcessNextEvent(bool, bool*)  xpcom/threads/nsThread.cpp:1151

This assertion was added in bug 1980171 but I assume whatever problem it indicates is separate, so not really a regression.

Ryan VanderMeulen [:RyanVM]

Updated

•

6 months ago

status-firefox142: --- → unaffected

status-firefox143: --- → affected

status-firefox144: --- → affected

status-firefox-esr115: --- → unaffected

status-firefox-esr128: --- → unaffected

status-firefox-esr140: --- → unaffected

Andrew McCreight [:mccr8]

Reporter

Comment 1

•

6 months ago

It looks like this is only on MacOS.

status-firefox142: unaffected → ---

status-firefox143: affected → ---

status-firefox144: affected → ---

status-firefox-esr115: unaffected → ---

status-firefox-esr128: unaffected → ---

status-firefox-esr140: unaffected → ---

OS: Unspecified → macOS

Hardware: Unspecified → All

Andrew McCreight [:mccr8]

Reporter

Updated

•

6 months ago

status-firefox142: --- → unaffected

status-firefox143: --- → affected

status-firefox144: --- → affected

status-firefox-esr115: --- → unaffected

status-firefox-esr128: --- → unaffected

status-firefox-esr140: --- → unaffected

Andrew McCreight [:mccr8]

Reporter

Comment 2

•

6 months ago

I didn't see any other signatures where the crash reason contains osfd < kFDs.

Sunil Mayya

Updated

•

6 months ago

Severity: -- → S3

Points: --- → 3

Rank: 2

Priority: -- → P2

Whiteboard: [necko-triaged][necko-priority-next]

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Comment 3

•

6 months ago

(In reply to Andrew McCreight [:mccr8] from comment #1)

It looks like this is only on MacOS.

Several things are going on here. The issue that the assertion is protecting against is macOS-only; see also bug 1982486.

The assertion itself is enabled on other Unixes, and it actually doesn't need to be (real poll doesn't have a limit other than the range of the int type, and I don't think we're using the fake select-based one anywhere other than macOS), but it's unlikely to ever fail because the assertion's cutoff is 64k in that case and the browser would have to be started with the OS's fd resource limit set higher than that (which I believe is very uncommon).

Comment 4

•

6 months ago

Also possibly fixed by 1983245.

Comment 5

•

6 months ago

(In reply to Valentin Gosu [:valentin] (he/him) from comment #4)

Also possibly fixed by bug 1983245.

Isn't that just a work-around? If we leak fds somewhere, that would be a bug, even if we lower the impact.

Perhaps worth calling out: over one third of the crash reports are happening shortly after startup.
83 out of 221 crash (of the crashes in the past month) crash within 30 seconds (of which 62 within 3 seconds!). Some of these are from the same user who crash repeatedly. That looks very unusual, a potential bug somewhere.

Valentin Gosu [:valentin]

Comment 6

•

6 months ago

Considering it only happens on MacOS, it's possible that leaking FDs or not reusing closed ones may be a bug in MacOS.
Regardless, replacing that outdated PR_Poll with something more modern is something we've been pursuing for a while.

BugBot [:suhaib / :marco/ :calixte]

Comment 7

•

5 months ago

The bug is linked to a topcrash signature, which matches the following criterion:

Top 5 desktop browser crashes on Mac on release

:jesup, could you consider increasing the severity of this top-crash bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(rjesup)

Keywords: topcrash

Calum Smith [:cpmsmith]

Comment 8

•

4 months ago

Perhaps worth calling out: over one third of the crash reports are happening shortly after startup.
83 out of 221 crash (of the crashes in the past month) crash within 30 seconds (of which 62 within 3 seconds!). Some of these are from the same user who crash repeatedly.

I might be such a user.

Crash reports:

bp-c66d520e-c315-4186-8332-c03080251030
bp-9fc0cb2f-3297-403e-b789-ee80f0251030
bp-96f2da31-36e7-45b6-8fda-20c060251030

It's not clear what caused the first one, but the latter two happened immediately after restarting Firefox from the crash reporter. It seemed that what stopped the loop was un-selecting the option to restart Firefox after sending the crash report, and launching Firefox manually.

Rob Wu [:robwu]

Comment 9

•

4 months ago

:gsvelto Does the described interaction between this bug and the crash reporter ring any bell towards a potential resolution? Here is the relevant part from comment 8 (removed code formatting around crash ID so that Bugzilla can automatically turn these into links):

bp-c66d520e-c315-4186-8332-c03080251030

bp-9fc0cb2f-3297-403e-b789-ee80f0251030

bp-96f2da31-36e7-45b6-8fda-20c060251030

It's not clear what caused the first one, but the latter two happened immediately after restarting Firefox from the crash reporter. It seemed that what stopped the loop was un-selecting the option to restart Firefox after sending the crash report, and launching Firefox manually.

Flags: needinfo?(gsvelto)

Gabriele Svelto [:gsvelto]

Comment 10

•

4 months ago

I wonder if this is caused by us sending crash pings via background tasks, Alex can you have a look?

Flags: needinfo?(gsvelto) → needinfo?(afranchuk)

Alex Franchuk [:afranchuk]

Assignee

Comment 11

•

4 months ago

•

Edited

I would have thought so a month ago, however since bug 1991491 landed, we disable crash reporting in the background task when launched. On the contrary, this appears to have increased a lot in the past month. Also as a stronger indication: none of these crashes have the BackgroundTaskName annotation set.

It's interesting that restarting from the crash reporter client was failing repeatedly, but launching manually worked. The main difference between the two is that Firefox would be launched with the same initial args when launched from the crash reporter client.

Flags: needinfo?(afranchuk)

Gabriele Svelto [:gsvelto]

Comment 12

•

4 months ago

I'm wondering... we're using posix_spawnp() to launch the crash reporter client so the new Firefox should not inherit the file descriptors of the crashed one, but what if it did? It might be worth checking if we still have some unexpected file descriptors after a restart. I'm NI?ing myself so I remember to check on my mac tomorrow.

Flags: needinfo?(gsvelto)

Randell Jesup [:jesup] (needinfo me)

Updated

•

4 months ago

Flags: needinfo?(rjesup)

Valentin Gosu [:valentin]

Comment 13

•

4 months ago

(In reply to Gabriele Svelto [:gsvelto] from comment #12)

I'm wondering... we're using posix_spawnp() to launch the crash reporter client so the new Firefox should not inherit the file descriptors of the crashed one, but what if it did? It might be worth checking if we still have some unexpected file descriptors after a restart. I'm NI?ing myself so I remember to check on my mac tomorrow.

That's a very good theory.
https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/posix_spawn.2.html (not sure this is the best link for the docs though)

File descriptors open in the calling process image remain open in the new
process image, except for those for which the close-on-exec flag is set
(see close(2) and fcntl(2)). Descriptors that remain open are unaffected
by posix_spawn() unless their behaviour is modified by a file action; see
posix_spawn_file_actions_init(3) for more information.

As it happens we have a few open bugs about setting the flag - See bug 1463873, 1667748 and 1840206

Gabriele Svelto [:gsvelto]

Comment 14

•

4 months ago

Ah yes, that's definitely it then. Given we're starting a new fresh instance of Firefox we should be closing all spurious files but doing so might be tricky. We launch the crash reporter client here which is in a crashed context, so iterating over the existing files and adding file actions is not possible as we're in a crashed context. The crash reporter client uses Rust's process facilities to launch the new instance, and that also doesn't give controls over the files. So the only reasonable place where we could clean up whatever file descriptors we've inherited from the crashed Firefox is during the crash reporter client's startup. We could iterate over the open files right on startup and close everything we don't know about (stdin/stdout/stderr and whatever else might have been opened for launching the client). We should do that on Linux too given we use fork() to launch the client. Alex, does this sound reasonable?

Naturally all of this would be moot if I had already finished fixing bug 587729 and we could open the crash reporter client from there... but that's unfortunately still a few months away due to unforeseen issues :(

Flags: needinfo?(gsvelto)

Alex Franchuk [:afranchuk]

Assignee

Comment 15

•

4 months ago

Yeah, that sounds simple enough to me. I'm curious why there's an uptick in crashes now, though. I know we have been making some changes to which APIs we use on Mac when dealing with fds in certain contexts.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Comment 16

•

4 months ago

(In reply to Gabriele Svelto [:gsvelto] from comment #12)

I'm wondering... we're using posix_spawnp() to launch the crash reporter client so the new Firefox should not inherit the file descriptors of the crashed one, but what if it did?

On macOS you can use POSIX_SPAWN_CLOEXEC_DEFAULT for that. But that's an Apple extension, so it's not usable on Linux or regular BSD; we have to use CloseSuperfluousFds with fork+exec, and avoid posix_spawn.

Our nsIProcess and JS Subprocess have been modified to use IPC LaunchApp which takes care of this, but as for Rust… this fix for a similar issue involving GDK and sandboxed image decoders (part of the chain of things that went wrong in bug 1986254) mentions that Rust libraries generally set cloexec when creating fds, so the Rust ecosystem probably doesn't encounter this problem much.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Updated

•

4 months ago

Updated

•

4 months ago

Comment 17

•

3 months ago

(In reply to Gabriele Svelto [:gsvelto] from comment #14)

We could iterate over the open files right on startup and close everything we don't know about (stdin/stdout/stderr and whatever else might have been opened for launching the client). We should do that on Linux too given we use fork() to launch the client. Alex, does this sound reasonable?
Naturally all of this would be moot if I had already finished fixing bug 587729 and we could open the crash reporter client from there... but that's unfortunately still a few months away due to unforeseen issues :(

Gabriele, has there been any progress on this task? Are we tracking it somewhere?

Flags: needinfo?(gsvelto)

Gabriele Svelto [:gsvelto]

Comment 18

•

3 months ago

None of us is working on this yet, but we might prioritize if the volume is big enough to justify it.

Flags: needinfo?(gsvelto)

BugBot [:suhaib / :marco/ :calixte]

Comment 19

•

3 months ago

The bug is linked to a topcrash signature, which matches the following criteria:

Top 20 desktop browser crashes on release (startup)
Top 5 desktop browser crashes on Mac on release

For more information, please visit BugBot documentation.

Keywords: topcrash-startup

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 20

•

2 months ago

[Tracking Requested - why for this release]:
Most frequent macOS crash signature.

status-firefox146: --- → wontfix

status-firefox147: --- → affected

status-firefox148: --- → affected

tracking-firefox147: --- → ?

Ryan VanderMeulen [:RyanVM]

Updated

•

2 months ago

status-firefox143: affected → wontfix

status-firefox144: affected → wontfix

status-firefox147: affected → fix-optional

tracking-firefox147: ? → +

tracking-firefox148: --- → +

BugBot [:suhaib / :marco/ :calixte]

Comment 21

•

2 months ago

The bug is marked as tracked for firefox148 (nightly). We have limited time to fix this, the soft freeze is in a day. However, the bug still isn't assigned and has low severity.

:ghess, could you please find an assignee and increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(ghess)

Kershaw Chang [:kershaw]

Comment 22

•

2 months ago

(In reply to Gabriele Svelto [:gsvelto] from comment #18)

None of us is working on this yet, but we might prioritize if the volume is big enough to justify it.

Hi Gabriele,

Should we consider increasing the priority of this bug now?
As a short-team workaround, is increasing the current limit helpful?

Flags: needinfo?(ghess) → needinfo?(gsvelto)

Alex Franchuk [:afranchuk]

Assignee

Updated

•

2 months ago

Assignee: nobody → afranchuk

Status: NEW → ASSIGNED

Alex Franchuk [:afranchuk]

Assignee

Comment 23

•

2 months ago

Attached file Bug 1986095 - Close unused fds when the crashreporter client starts r=gsvelto — Details

Alex Franchuk [:afranchuk]

Assignee

Comment 24

•

2 months ago

I've put up a patch for this.

Alex Franchuk [:afranchuk]

Assignee

Updated

•

2 months ago

Flags: needinfo?(gsvelto)

BugBot [:suhaib / :marco/ :calixte]

Comment 25

•

2 months ago

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash-startup

Pulsebot

Comment 26

•

2 months ago

Pushed by afranchuk@mozilla.com: https://github.com/mozilla-firefox/firefox/commit/13698e897857 https://hg.mozilla.org/integration/autoland/rev/1e18bf07619d Close unused fds when the crashreporter client starts r=gsvelto

Serban Stanca [:SerbanS]

Comment 27

•

2 months ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/1e18bf07619d

Status: ASSIGNED → RESOLVED

Closed: 2 months ago

status-firefox148: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → 148 Branch

BugBot [:suhaib / :marco/ :calixte]

Comment 28

•

2 months ago

The patch landed in nightly and beta is affected.
:afranchuk, is this bug important enough to require an uplift?

If yes, please nominate the patch for beta approval.
- See https://wiki.mozilla.org/Release_Management/Requesting_an_Uplift for documentation on how to request an uplift.
If no, please set status-firefox147 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(afranchuk)

Alex Franchuk [:afranchuk]

Assignee

Comment 29

•

2 months ago

Attached file Bug 1986095 - Close unused fds when the crashreporter client starts — Details

Original Revision: https://phabricator.services.mozilla.com/D278155

Phabricator Automation

Updated

•

2 months ago

Attachment #9536654 - Flags: approval-mozilla-beta?

Phabricator Automation

Comment 30

•

2 months ago

firefox-beta Uplift Approval Request

User impact if declined: MacOS users may experience crashes as a result of having too many consecutive crashes (from fd exhaustion).
Code covered by automated testing: no
Fix verified in Nightly: yes
Needs manual QE test: no
Steps to reproduce for manual QE testing:
Risk associated with taking this patch: low
Explanation of risk level: The patch doesn't produce any new error paths (at worst, a warning will be written to the log).
String changes made/needed: No
Is Android affected?: no

Alex Franchuk [:afranchuk]

Assignee

Updated

•

2 months ago

Flags: needinfo?(afranchuk)

Ryan VanderMeulen [:RyanVM]

Updated

•

2 months ago

Attachment #9536654 - Flags: approval-mozilla-beta? → approval-mozilla-release?

Phabricator Automation

Updated

•

2 months ago

Attachment #9536654 - Flags: approval-mozilla-release? → approval-mozilla-release+

Lando Automation

Updated

•

2 months ago

status-firefox147: fix-optional → fixed

Pulsebot

Comment 31

•

2 months ago

uplift

https://github.com/mozilla-firefox/firefox/commit/f6d603573037 https://hg.mozilla.org/releases/mozilla-release/rev/0682ccaadf15

Steven Michaud [:smichaud] (Retired)

Comment 32

•

4 days ago

Looking at the crash stats, this ain't fixed.

Ryan VanderMeulen [:RyanVM]

Comment 33

•

4 days ago

Agreed, but probably better tracked in a new bug at this point.

Alex Franchuk [:afranchuk]

Assignee

Comment 34

•

4 days ago

Looking at the crash counts, there's been a steady decline since 147 was released to a new baseline. The patch here fixes an issue which exacerbates the problem, however it cannot fix all instances of the issue.

You need to log in before you can comment on or make changes to this bug.