Closed Bug 1833312 Opened 1 year ago Closed 1 year ago

Crash in [@ __aarch64_ldadd8_relax] in mozilla::dom::RemoteWorkerChild::SharedWorkerOp::StartOnMainThread(RefPtr<mozilla::dom::RemoteWorkerChild>&)

Tracking

()

Status:

RESOLVED FIXED

Milestone:

116 Branch

Tracking Flags:

Tracking

Status

firefox-esr102

---

unaffected

firefox113

---

wontfix

firefox114

---

wontfix

firefox115

---

fixed

firefox116

---

fixed

People

(Reporter: cpeterson, Assigned: asuth)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

Bug 1833312 - MPIR should not steal the caller's RefPtr. r=edenchuang! 1 year ago Andrew Sutherland [:asuth] (he/him) 48 bytes, text/x-phabricator-request	dmeehan : approval-mozilla-beta+	Details \| Review

Chris Peterson [:cpeterson]

Reporter

Description

•

1 year ago

This crash looks like a regression starting in Fenix Nightly 113.0a1.

Crash report: https://crash-stats.mozilla.org/report/index/3749953e-b5dd-49b2-87f7-e34860230516

Reason: SIGSEGV / SEGV_MAPERR

Top 10 frames of crashing thread:

0  libmozglue.so  __aarch64_ldadd8_relax  
1  libxul.so  mozilla::dom::RemoteWorkerChild::SharedWorkerOp::StartOnMainThread  dom/workers/remoteworkers/RemoteWorkerChild.cpp:872
2  libxul.so  mozilla::dom::RemoteWorkerChild::SharedWorkerOp::MaybeStart  dom/workers/remoteworkers/RemoteWorkerChild.cpp:823
2  libxul.so  mozilla::detail::RunnableFunction<mozilla::dom::RemoteWorkerChild::SharedWorkerOp::MaybeStart  xpcom/threads/nsThreadUtils.h:548
3  libxul.so  mozilla::SchedulerGroup::Runnable::Run  xpcom/threads/SchedulerGroup.cpp:114
4  libxul.so  mozilla::RunnableTask::Run  xpcom/threads/TaskController.cpp:555
5  libxul.so  mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal  xpcom/threads/TaskController.cpp:879
6  libxul.so  mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal  xpcom/threads/TaskController.cpp:702
6  libxul.so  mozilla::TaskController::ProcessPendingMTTask  xpcom/threads/TaskController.cpp:491
6  libxul.so  mozilla::TaskController::TaskController const  xpcom/threads/TaskController.cpp:218

Eden Chuang[:edenchuang]

Comment 1

•

1 year ago

It is a noticed issue that WorkerRunnable refactoring will fix. Set it as blocked by bug 1800659.

Assignee: nobody → echuang

Depends on: 1800659

Ryan VanderMeulen [:RyanVM]

Comment 2

•

1 year ago

First crash reports came in with the 20230322211554 Nightly, but the volume is low and may not be reliable. Is this maybe a regression from bug 1775784?

Flags: needinfo?(bugmail)

Ryan VanderMeulen [:RyanVM]

Comment 3

•

1 year ago

Regardless, if this is blocked on bug 1800659, it doesn't seem likely that we'll be able to address this with an uplift for older releases.

status-firefox113: affected → wontfix

status-firefox114: affected → wontfix

status-firefox115: affected → fix-optional

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 4

•

1 year ago

We believe bug 1800659 will fix this but unfortunately it is going to land in Fx116 and will not be appropriate to uplift. That said, we should be able to create an upliftable version of the fix for just this bug that does not include the more extensive changes in bug 1800659, although it seems like the crashes have tapered off somewhat, not sure what that means.

Flags: needinfo?(bugmail)

Andrew Sutherland [:asuth] (he/him)

Assignee

Updated

•

1 year ago

Severity: -- → S3

Priority: -- → P3

Chris Peterson [:cpeterson]

Reporter

Comment 5

•

1 year ago

This crash's volume is pretty low. We've received fewer than 400 crash reports. It's only #45 on our list of top content process crashes. So if creating an upliftable fix is a lot of work, then feel free close this bug as fixed in 116 and skip the uplift.

Crash Signature: [@ __aarch64_ldadd8_relax] → [@ __aarch64_ldadd4_acq_rel] [@ __aarch64_ldadd4_rel] [@ __aarch64_ldadd4_relax] [@ __aarch64_ldadd8_acq_rel] [@ __aarch64_ldadd8_rel] [@ __aarch64_ldadd8_relax]

Flags: needinfo?(bugmail)

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 6

•

1 year ago

•

Edited

I dug into this a little more.

Restating:

The bug that's causing the nullptr crash in this stack is that MessagePortIdentifierRunnable steals the aOwner RefPtr we pass in by reference using std::move, so it's not safe to call aOwner->ErrorPropagationDispatch in the failed dispatch case.
The weird "hey, I'm stealing your strong reference, muahahaha" idiom I think is an outgrowth of when RemoteWorkerChild was not threadsafe refcounted. I fixed that in bug 1775784.
The reason bug 1800659 fixes the problem is that it makes it impossible that dispatch will fail.
- The only possible situation where dispatch could have failed is if SharedWorkerGlobalScope.close was called because this would make dispatch fail but would not have advanced the mState; mState only would advance when the parent receives the self-cancellation request.
Edited: The explicit calls to ForceClose that are newly introduced in that patch are important and necessary because until we have transferred the MessagePortIdentifier from IPC into a UniqueMessagePortId (which wraps MessagePortIdentifier) we don't have any RAII magic that will call ForceClose for us automatically. But they don't factor in to the crash.

So I think the most correct fix for this crash is to stop MessagePortIdentifierRunnable's constructor from stealing the refcount which is very unhygienic and just let it have its own freshly allocated refcount. We can land this on m-c and then uplift it. I'll quickly provide a non-risky patch for this.

Assignee: echuang → bugmail

Status: NEW → ASSIGNED

Flags: needinfo?(bugmail)

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 7

•

1 year ago

Attached file Bug 1833312 - MPIR should not steal the caller's RefPtr. r=edenchuang! — Details

Before bug 1775784, RemoteWorkerChild's refcount was not threadsafe and
so RemoteWorkerChild-related logic went to great contortions to avoid
trying to create new refcounts off the worker launcher thread, leading
to the logic stealing the caller's RefPtr. However, this was unsound
in the case that the RefPtr would be used after the creation of the
MessagePortIdentifierRuynnable, which was the case if dispatch of the
runnable failed. This fix changes us to the standard XPCOM calling
convention of passing a raw pointer which the MPIR then acquires a
fresh strong refcount. This code is not remotely performance
sensitive and the prior behavior was surprising, so there's no reason
to try and maintain the transfer of the refcount.

Eden Chuang[:edenchuang]

Updated

•

1 year ago

Duplicate of this bug: 1833791

BugBot [:suhaib / :marco/ :calixte]

Comment 9

•

1 year ago

Copying crash signatures from duplicate bugs.

Crash Signature: [@ __aarch64_ldadd4_acq_rel] [@ __aarch64_ldadd4_rel] [@ __aarch64_ldadd4_relax] [@ __aarch64_ldadd8_acq_rel] [@ __aarch64_ldadd8_rel] [@ __aarch64_ldadd8_relax] → [@ __aarch64_ldadd4_acq_rel] [@ __aarch64_ldadd4_rel] [@ __aarch64_ldadd4_relax] [@ __aarch64_ldadd8_acq_rel] [@ __aarch64_ldadd8_rel] [@ __aarch64_ldadd8_relax] [@ mozilla::dom::RemoteWorkerChild::AddRef]

Pulsebot

Comment 10

•

1 year ago

Pushed by bugmail@asutherland.org: https://hg.mozilla.org/integration/autoland/rev/fbeeda98d792 MPIR should not steal the caller's RefPtr. r=edenchuang

BugBot [:suhaib / :marco/ :calixte]

Comment 11

•

1 year ago

The severity field for this bug is set to S3. However, the following bug duplicate has higher severity:

Bug 1833791: S2

:asuth, could you consider increasing the severity of this bug to S2?

For more information, please visit BugBot documentation.

Flags: needinfo?(bugmail)

Andrew Sutherland [:asuth] (he/him)

Assignee

Updated

•

1 year ago

Crash Signature: [@ __aarch64_ldadd4_acq_rel] [@ __aarch64_ldadd4_rel] [@ __aarch64_ldadd4_relax] [@ __aarch64_ldadd8_acq_rel] [@ __aarch64_ldadd8_rel] [@ __aarch64_ldadd8_relax] [@ mozilla::dom::RemoteWorkerChild::AddRef] → [@ __aarch64_ldadd4_acq_rel] [@ __aarch64_ldadd4_rel] [@ __aarch64_ldadd4_relax] [@ __aarch64_ldadd8_acq_rel] [@ __aarch64_ldadd8_rel] [@ __aarch64_ldadd8_relax] [@ mozilla::dom::RemoteWorkerChild::AddRef]

Flags: needinfo?(bugmail)

Natalia Csoregi [:nataliaCs]

Comment 12

•

1 year ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/fbeeda98d792

Status: ASSIGNED → RESOLVED

Closed: 1 year ago

status-firefox116: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 116 Branch

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 13

•

1 year ago

Comment on attachment 9337624 [details]
Bug 1833312 - MPIR should not steal the caller's RefPtr. r=edenchuang!

Beta/Release Uplift Approval Request

User impact if declined: Some rare crashes. This correlates with use of SharedWorkers which is not a common thing but is something that can easily change if popular sites change how they use workers. (NB: This just landed on m-c for nightly, but I want to get the request in sooner rather than later; it could be reasonable to wait a day for baking.)
Is this code covered by automated tests?: No
Has the fix been verified in Nightly?: Yes
Needs manual test from QE?: No
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): Turns a definite nullptr crash without the patching into definitely not a crash with the patch. The change in behavior is an additional addref/release pair of a threadsafe object where we don't expect any fallout from potential changes in where the last release occurs.
String changes made/needed:
Is Android affected?: Yes

Attachment #9337624 - Flags: approval-mozilla-beta?

Donal Meehan [:dmeehan]

Comment 14

•

1 year ago

Comment on attachment 9337624 [details]
Bug 1833312 - MPIR should not steal the caller's RefPtr. r=edenchuang!

Approved for 115.0b4 and Fenix/Focus 115.0b4

Attachment #9337624 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Donal Meehan [:dmeehan]

Comment 15

•

1 year ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/30b87dd0216b

status-firefox115: fix-optional → fixed

You need to log in before you can comment on or make changes to this bug.