Crash in [@ __aarch64_ldadd8_relax] in mozilla::dom::RemoteWorkerChild::SharedWorkerOp::StartOnMainThread(RefPtr<mozilla::dom::RemoteWorkerChild>&)
Categories
(Core :: DOM: Workers, defect, P3)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr102 | --- | unaffected |
firefox113 | --- | wontfix |
firefox114 | --- | wontfix |
firefox115 | --- | fixed |
firefox116 | --- | fixed |
People
(Reporter: cpeterson, Assigned: asuth)
References
Details
(Keywords: crash, regression)
Crash Data
Attachments
(1 file)
48 bytes,
text/x-phabricator-request
|
dmeehan
:
approval-mozilla-beta+
|
Details | Review |
This crash looks like a regression starting in Fenix Nightly 113.0a1.
Crash report: https://crash-stats.mozilla.org/report/index/3749953e-b5dd-49b2-87f7-e34860230516
Reason: SIGSEGV / SEGV_MAPERR
Top 10 frames of crashing thread:
0 libmozglue.so __aarch64_ldadd8_relax
1 libxul.so mozilla::dom::RemoteWorkerChild::SharedWorkerOp::StartOnMainThread dom/workers/remoteworkers/RemoteWorkerChild.cpp:872
2 libxul.so mozilla::dom::RemoteWorkerChild::SharedWorkerOp::MaybeStart dom/workers/remoteworkers/RemoteWorkerChild.cpp:823
2 libxul.so mozilla::detail::RunnableFunction<mozilla::dom::RemoteWorkerChild::SharedWorkerOp::MaybeStart xpcom/threads/nsThreadUtils.h:548
3 libxul.so mozilla::SchedulerGroup::Runnable::Run xpcom/threads/SchedulerGroup.cpp:114
4 libxul.so mozilla::RunnableTask::Run xpcom/threads/TaskController.cpp:555
5 libxul.so mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:879
6 libxul.so mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:702
6 libxul.so mozilla::TaskController::ProcessPendingMTTask xpcom/threads/TaskController.cpp:491
6 libxul.so mozilla::TaskController::TaskController const xpcom/threads/TaskController.cpp:218
Comment 1•1 year ago
|
||
It is a noticed issue that WorkerRunnable refactoring will fix. Set it as blocked by bug 1800659.
Comment 2•1 year ago
|
||
First crash reports came in with the 20230322211554 Nightly, but the volume is low and may not be reliable. Is this maybe a regression from bug 1775784?
Comment 3•1 year ago
|
||
Regardless, if this is blocked on bug 1800659, it doesn't seem likely that we'll be able to address this with an uplift for older releases.
Assignee | ||
Comment 4•1 year ago
|
||
We believe bug 1800659 will fix this but unfortunately it is going to land in Fx116 and will not be appropriate to uplift. That said, we should be able to create an upliftable version of the fix for just this bug that does not include the more extensive changes in bug 1800659, although it seems like the crashes have tapered off somewhat, not sure what that means.
Assignee | ||
Updated•1 year ago
|
Reporter | ||
Comment 5•1 year ago
|
||
This crash's volume is pretty low. We've received fewer than 400 crash reports. It's only #45 on our list of top content process crashes. So if creating an upliftable fix is a lot of work, then feel free close this bug as fixed in 116 and skip the uplift.
Assignee | ||
Comment 6•1 year ago
•
|
||
I dug into this a little more.
Restating:
- The bug that's causing the nullptr crash in this stack is that MessagePortIdentifierRunnable steals the aOwner RefPtr we pass in by reference using std::move, so it's not safe to call
aOwner->ErrorPropagationDispatch
in the failed dispatch case. - The weird "hey, I'm stealing your strong reference, muahahaha" idiom I think is an outgrowth of when RemoteWorkerChild was not threadsafe refcounted. I fixed that in bug 1775784.
- The reason bug 1800659 fixes the problem is that it makes it impossible that dispatch will fail.
- The only possible situation where dispatch could have failed is if SharedWorkerGlobalScope.close was called because this would make dispatch fail but would not have advanced the mState; mState only would advance when the parent receives the self-cancellation request.
- Edited: The explicit calls to ForceClose that are newly introduced in that patch are important and necessary because until we have transferred the MessagePortIdentifier from IPC into a UniqueMessagePortId (which wraps MessagePortIdentifier) we don't have any RAII magic that will call ForceClose for us automatically. But they don't factor in to the crash.
So I think the most correct fix for this crash is to stop MessagePortIdentifierRunnable's constructor from stealing the refcount which is very unhygienic and just let it have its own freshly allocated refcount. We can land this on m-c and then uplift it. I'll quickly provide a non-risky patch for this.
Assignee | ||
Comment 7•1 year ago
|
||
Before bug 1775784, RemoteWorkerChild's refcount was not threadsafe and
so RemoteWorkerChild-related logic went to great contortions to avoid
trying to create new refcounts off the worker launcher thread, leading
to the logic stealing the caller's RefPtr. However, this was unsound
in the case that the RefPtr would be used after the creation of the
MessagePortIdentifierRuynnable, which was the case if dispatch of the
runnable failed. This fix changes us to the standard XPCOM calling
convention of passing a raw pointer which the MPIR then acquires a
fresh strong refcount. This code is not remotely performance
sensitive and the prior behavior was surprising, so there's no reason
to try and maintain the transfer of the refcount.
Comment 9•1 year ago
|
||
Copying crash signatures from duplicate bugs.
Comment 10•1 year ago
|
||
Comment 11•1 year ago
|
||
The severity field for this bug is set to S3
. However, the following bug duplicate has higher severity:
- Bug 1833791: S2
:asuth, could you consider increasing the severity of this bug to S2
?
For more information, please visit BugBot documentation.
Assignee | ||
Updated•1 year ago
|
Comment 12•1 year ago
|
||
bugherder |
Assignee | ||
Comment 13•1 year ago
|
||
Comment on attachment 9337624 [details]
Bug 1833312 - MPIR should not steal the caller's RefPtr. r=edenchuang!
Beta/Release Uplift Approval Request
- User impact if declined: Some rare crashes. This correlates with use of SharedWorkers which is not a common thing but is something that can easily change if popular sites change how they use workers. (NB: This just landed on m-c for nightly, but I want to get the request in sooner rather than later; it could be reasonable to wait a day for baking.)
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Turns a definite nullptr crash without the patching into definitely not a crash with the patch. The change in behavior is an additional addref/release pair of a threadsafe object where we don't expect any fallout from potential changes in where the last release occurs.
- String changes made/needed:
- Is Android affected?: Yes
Comment 14•1 year ago
|
||
Comment on attachment 9337624 [details]
Bug 1833312 - MPIR should not steal the caller's RefPtr. r=edenchuang!
Approved for 115.0b4 and Fenix/Focus 115.0b4
Comment 15•1 year ago
|
||
bugherder uplift |
Description
•