Closed Bug 1833312 Opened 1 year ago Closed 1 year ago

Crash in [@ __aarch64_ldadd8_relax] in mozilla::dom::RemoteWorkerChild::SharedWorkerOp::StartOnMainThread(RefPtr<mozilla::dom::RemoteWorkerChild>&)

Categories

(Core :: DOM: Workers, defect, P3)

ARM64
Android
defect

Tracking

()

RESOLVED FIXED
116 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox113 --- wontfix
firefox114 --- wontfix
firefox115 --- fixed
firefox116 --- fixed

People

(Reporter: cpeterson, Assigned: asuth)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

This crash looks like a regression starting in Fenix Nightly 113.0a1.

Crash report: https://crash-stats.mozilla.org/report/index/3749953e-b5dd-49b2-87f7-e34860230516

Reason: SIGSEGV / SEGV_MAPERR

Top 10 frames of crashing thread:

0  libmozglue.so  __aarch64_ldadd8_relax  
1  libxul.so  mozilla::dom::RemoteWorkerChild::SharedWorkerOp::StartOnMainThread  dom/workers/remoteworkers/RemoteWorkerChild.cpp:872
2  libxul.so  mozilla::dom::RemoteWorkerChild::SharedWorkerOp::MaybeStart  dom/workers/remoteworkers/RemoteWorkerChild.cpp:823
2  libxul.so  mozilla::detail::RunnableFunction<mozilla::dom::RemoteWorkerChild::SharedWorkerOp::MaybeStart  xpcom/threads/nsThreadUtils.h:548
3  libxul.so  mozilla::SchedulerGroup::Runnable::Run  xpcom/threads/SchedulerGroup.cpp:114
4  libxul.so  mozilla::RunnableTask::Run  xpcom/threads/TaskController.cpp:555
5  libxul.so  mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal  xpcom/threads/TaskController.cpp:879
6  libxul.so  mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal  xpcom/threads/TaskController.cpp:702
6  libxul.so  mozilla::TaskController::ProcessPendingMTTask  xpcom/threads/TaskController.cpp:491
6  libxul.so  mozilla::TaskController::TaskController const  xpcom/threads/TaskController.cpp:218

It is a noticed issue that WorkerRunnable refactoring will fix. Set it as blocked by bug 1800659.

Assignee: nobody → echuang
Depends on: 1800659

First crash reports came in with the 20230322211554 Nightly, but the volume is low and may not be reliable. Is this maybe a regression from bug 1775784?

Flags: needinfo?(bugmail)

Regardless, if this is blocked on bug 1800659, it doesn't seem likely that we'll be able to address this with an uplift for older releases.

We believe bug 1800659 will fix this but unfortunately it is going to land in Fx116 and will not be appropriate to uplift. That said, we should be able to create an upliftable version of the fix for just this bug that does not include the more extensive changes in bug 1800659, although it seems like the crashes have tapered off somewhat, not sure what that means.

Flags: needinfo?(bugmail)
Severity: -- → S3
Priority: -- → P3

This crash's volume is pretty low. We've received fewer than 400 crash reports. It's only #45 on our list of top content process crashes. So if creating an upliftable fix is a lot of work, then feel free close this bug as fixed in 116 and skip the uplift.

Crash Signature: [@ __aarch64_ldadd8_relax] → [@ __aarch64_ldadd4_acq_rel] [@ __aarch64_ldadd4_rel] [@ __aarch64_ldadd4_relax] [@ __aarch64_ldadd8_acq_rel] [@ __aarch64_ldadd8_rel] [@ __aarch64_ldadd8_relax]
Flags: needinfo?(bugmail)

I dug into this a little more.

Restating:

  • The bug that's causing the nullptr crash in this stack is that MessagePortIdentifierRunnable steals the aOwner RefPtr we pass in by reference using std::move, so it's not safe to call aOwner->ErrorPropagationDispatch in the failed dispatch case.
  • The weird "hey, I'm stealing your strong reference, muahahaha" idiom I think is an outgrowth of when RemoteWorkerChild was not threadsafe refcounted. I fixed that in bug 1775784.
  • The reason bug 1800659 fixes the problem is that it makes it impossible that dispatch will fail.
    • The only possible situation where dispatch could have failed is if SharedWorkerGlobalScope.close was called because this would make dispatch fail but would not have advanced the mState; mState only would advance when the parent receives the self-cancellation request.
  • Edited: The explicit calls to ForceClose that are newly introduced in that patch are important and necessary because until we have transferred the MessagePortIdentifier from IPC into a UniqueMessagePortId (which wraps MessagePortIdentifier) we don't have any RAII magic that will call ForceClose for us automatically. But they don't factor in to the crash.

So I think the most correct fix for this crash is to stop MessagePortIdentifierRunnable's constructor from stealing the refcount which is very unhygienic and just let it have its own freshly allocated refcount. We can land this on m-c and then uplift it. I'll quickly provide a non-risky patch for this.

Assignee: echuang → bugmail
Status: NEW → ASSIGNED
Flags: needinfo?(bugmail)

Before bug 1775784, RemoteWorkerChild's refcount was not threadsafe and
so RemoteWorkerChild-related logic went to great contortions to avoid
trying to create new refcounts off the worker launcher thread, leading
to the logic stealing the caller's RefPtr. However, this was unsound
in the case that the RefPtr would be used after the creation of the
MessagePortIdentifierRuynnable, which was the case if dispatch of the
runnable failed. This fix changes us to the standard XPCOM calling
convention of passing a raw pointer which the MPIR then acquires a
fresh strong refcount. This code is not remotely performance
sensitive and the prior behavior was surprising, so there's no reason
to try and maintain the transfer of the refcount.

Duplicate of this bug: 1833791

Copying crash signatures from duplicate bugs.

Crash Signature: [@ __aarch64_ldadd4_acq_rel] [@ __aarch64_ldadd4_rel] [@ __aarch64_ldadd4_relax] [@ __aarch64_ldadd8_acq_rel] [@ __aarch64_ldadd8_rel] [@ __aarch64_ldadd8_relax] → [@ __aarch64_ldadd4_acq_rel] [@ __aarch64_ldadd4_rel] [@ __aarch64_ldadd4_relax] [@ __aarch64_ldadd8_acq_rel] [@ __aarch64_ldadd8_rel] [@ __aarch64_ldadd8_relax] [@ mozilla::dom::RemoteWorkerChild::AddRef]
Pushed by bugmail@asutherland.org: https://hg.mozilla.org/integration/autoland/rev/fbeeda98d792 MPIR should not steal the caller's RefPtr. r=edenchuang

The severity field for this bug is set to S3. However, the following bug duplicate has higher severity:

:asuth, could you consider increasing the severity of this bug to S2?

For more information, please visit BugBot documentation.

Flags: needinfo?(bugmail)
Crash Signature: [@ __aarch64_ldadd4_acq_rel] [@ __aarch64_ldadd4_rel] [@ __aarch64_ldadd4_relax] [@ __aarch64_ldadd8_acq_rel] [@ __aarch64_ldadd8_rel] [@ __aarch64_ldadd8_relax] [@ mozilla::dom::RemoteWorkerChild::AddRef] → [@ __aarch64_ldadd4_acq_rel] [@ __aarch64_ldadd4_rel] [@ __aarch64_ldadd4_relax] [@ __aarch64_ldadd8_acq_rel] [@ __aarch64_ldadd8_rel] [@ __aarch64_ldadd8_relax] [@ mozilla::dom::RemoteWorkerChild::AddRef]
Flags: needinfo?(bugmail)
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 116 Branch

Comment on attachment 9337624 [details]
Bug 1833312 - MPIR should not steal the caller's RefPtr. r=edenchuang!

Beta/Release Uplift Approval Request

  • User impact if declined: Some rare crashes. This correlates with use of SharedWorkers which is not a common thing but is something that can easily change if popular sites change how they use workers. (NB: This just landed on m-c for nightly, but I want to get the request in sooner rather than later; it could be reasonable to wait a day for baking.)
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Turns a definite nullptr crash without the patching into definitely not a crash with the patch. The change in behavior is an additional addref/release pair of a threadsafe object where we don't expect any fallout from potential changes in where the last release occurs.
  • String changes made/needed:
  • Is Android affected?: Yes
Attachment #9337624 - Flags: approval-mozilla-beta?

Comment on attachment 9337624 [details]
Bug 1833312 - MPIR should not steal the caller's RefPtr. r=edenchuang!

Approved for 115.0b4 and Fenix/Focus 115.0b4

Attachment #9337624 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: