Hit MOZ_CRASH(Found dangling CheckedUnsafePtr) at /builds/worker/workspace/obj-build/dist/include/mozilla/dom/quota/CheckedUnsafePtr.h:247
Categories
(Core :: DOM: Service Workers, defect, P2)
Tracking
()
People
(Reporter: jkratzer, Assigned: jstutte)
References
(Blocks 1 open bug, Regression)
Details
(4 keywords, Whiteboard: [bugmon:bisected,confirmed][adv-main99+r])
Attachments
(3 files, 1 obsolete file)
Testcase found while fuzzing mozilla-central rev 504105450146 (built with: --enable-address-sanitizer --enable-fuzzing).
Testcase can be reproduced using the following commands:
$ pip install fuzzfetch grizzly-framework
$ python -m fuzzfetch --build 504105450146 --asan --fuzzing -n firefox
$ python -m grizzly.replay ./firefox/firefox testcase.zip
Hit MOZ_CRASH(Found dangling CheckedUnsafePtr) at /builds/worker/workspace/obj-build/dist/include/mozilla/dom/quota/CheckedUnsafePtr.h:247
=================================================================
==77459==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000001 (pc 0x7f2fb3daf2e2 bp 0x7ffd97b81940 sp 0x7ffd97b81920 T0)
==77459==The signal is caused by a WRITE memory access.
==77459==Hint: address points to the zero page.
#0 0x7f2fb3daf2e2 in NotifyCheckFailure /builds/worker/workspace/obj-build/dist/include/mozilla/dom/quota/CheckedUnsafePtr.h:247:31
#1 0x7f2fb3daf2e2 in NotifyCheckFailure<mozilla::CrashOnDanglingCheckedUnsafePtr> /builds/worker/workspace/obj-build/dist/include/mozilla/dom/quota/CheckedUnsafePtr.h:215:13
#2 0x7f2fb3daf2e2 in mozilla::CheckCheckedUnsafePtrs<mozilla::CrashOnDanglingCheckedUnsafePtr>::Check(nsTArray<mozilla::detail::CheckedUnsafePtrBaseCheckingEnabled*>&) /builds/worker/workspace/obj-build/dist/include/mozilla/dom/quota/CheckedUnsafePtr.h:239:7
#3 0x7f2fb3daf1fd in mozilla::detail::SupportCheckedUnsafePtrImpl<mozilla::CrashOnDanglingCheckedUnsafePtr, (mozilla::CheckingSupport)1>::~SupportCheckedUnsafePtrImpl() /builds/worker/workspace/obj-build/dist/include/mozilla/dom/quota/CheckedUnsafePtr.h:287:13
#4 0x7f2fb455b4d1 in mozilla::dom::WorkerPrivate::Release() /builds/worker/workspace/obj-build/dist/include/mozilla/dom/WorkerPrivate.h:126:3
#5 0x7f2fb45a3cd0 in mozilla::dom::(anonymous namespace)::TopLevelWorkerFinishedRunnable::Run() /dom/workers/WorkerPrivate.cpp:314:22
#6 0x7f2fac994952 in mozilla::ThrottledEventQueue::Inner::ExecuteRunnable() /xpcom/threads/ThrottledEventQueue.cpp:254:22
#7 0x7f2fac98c92f in mozilla::ThrottledEventQueue::Inner::Executor::Run() /xpcom/threads/ThrottledEventQueue.cpp:81:15
#8 0x7f2fac98e7c2 in mozilla::RunnableTask::Run() /xpcom/threads/TaskController.cpp:467:16
#9 0x7f2fac9540bd in mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) /xpcom/threads/TaskController.cpp:770:26
#10 0x7f2fac951618 in mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) /xpcom/threads/TaskController.cpp:606:15
#11 0x7f2fac951d29 in mozilla::TaskController::ProcessPendingMTTask(bool) /xpcom/threads/TaskController.cpp:390:36
#12 0x7f2fac996bf4 in operator() /xpcom/threads/TaskController.cpp:127:37
#13 0x7f2fac996bf4 in mozilla::detail::RunnableFunction<mozilla::TaskController::InitializeInternal()::$_1>::Run() /xpcom/threads/nsThreadUtils.h:531:5
#14 0x7f2fac9743b7 in nsThread::ProcessNextEvent(bool, bool*) /xpcom/threads/nsThread.cpp:1195:16
#15 0x7f2fac97f59c in NS_ProcessNextEvent(nsIThread*, bool) /xpcom/threads/nsThreadUtils.cpp:467:10
#16 0x7f2fb451ca17 in mozilla::dom::workerinternals::RuntimeService::Cleanup() /dom/workers/RuntimeService.cpp:1716:14
#17 0x7f2fb4525729 in mozilla::dom::workerinternals::RuntimeService::Observe(nsISupports*, char const*, char16_t const*) /dom/workers/RuntimeService.cpp:2048:5
#18 0x7f2fac81853e in nsObserverList::NotifyObservers(nsISupports*, char const*, char16_t const*) /xpcom/ds/nsObserverList.cpp:70:19
#19 0x7f2fac8215d1 in nsObserverService::NotifyObservers(nsISupports*, char const*, char16_t const*) /xpcom/ds/nsObserverService.cpp:292:19
#20 0x7f2fac71bc69 in mozilla::AdvanceShutdownPhaseInternal(mozilla::ShutdownPhase, bool, char16_t const*, nsCOMPtr<nsISupports> const&) /xpcom/base/AppShutdown.cpp:360:21
#21 0x7f2fac9ea387 in mozilla::ShutdownXPCOM(nsIServiceManager*) /xpcom/build/XPCOMInit.cpp:638:5
#22 0x7f2fb99d734c in XRE_TermEmbedding() /toolkit/xre/nsEmbedFunctions.cpp:222:3
#23 0x7f2fadedefe5 in mozilla::ipc::ScopedXREEmbed::Stop() /ipc/glue/ScopedXREEmbed.cpp:90:5
#24 0x7f2fb99d7c43 in XRE_InitChildProcess(int, char**, XREChildData const*) /toolkit/xre/nsEmbedFunctions.cpp:711:16
#25 0x5617866f20ad in content_process_main(mozilla::Bootstrap*, int, char**) /browser/app/../../ipc/contentproc/plugin-container.cpp:57:28
#26 0x5617866f24d8 in main /browser/app/nsBrowserApp.cpp:327:18
#27 0x7f2fd09990b2 in __libc_start_main /build/glibc-eX1tMB/glibc-2.31/csu/../csu/libc-start.c:308:16
#28 0x561786641179 in _start (/home/jkratzer/builds/mc-asan/firefox+0x5d179)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /builds/worker/workspace/obj-build/dist/include/mozilla/dom/quota/CheckedUnsafePtr.h:247:31 in NotifyCheckFailure
==77459==ABORTING
Reporter | ||
Comment 1•4 years ago
|
||
Comment 3•4 years ago
|
||
This looks like an intentional opt crash catching a problem, but I guess diagnostic crashes don't get shipped to Release. Can be this some bad value other than null? since this is essentially a "weak pointer" class it is a bit worrying.
Comment 4•4 years ago
|
||
Bugmon Analysis
Verified bug as reproducible on mozilla-central 20220126163134-35652548842a.
The bug appears to have been introduced in the following build range:
Start: d376f1776505991972805ddddc079822184c014e (20220125080215)
End: a47b76ed47bf1aa3868b83dfe7fcec44dfe08332 (20220125094815)
Pushlog: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=d376f1776505991972805ddddc079822184c014e&tochange=a47b76ed47bf1aa3868b83dfe7fcec44dfe08332
Assignee | ||
Comment 5•4 years ago
|
||
It is a "regression" from bug 1744025 which introduced this diagnostic assert but it indicates a pre-existing potential lifecycle issue with a weak pointer, so sec-high seems appropriate for now.
Assignee | ||
Comment 6•4 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #5)
It is a "regression" from bug 1744025 which introduced this diagnostic assert but it indicates a pre-existing potential lifecycle issue with a weak pointer, so sec-high seems appropriate for now.
:asuth, after we talked I am a bit less sure if this merits a sec-high, actually. IIUC this indicates only, that we leave a weak pointer around that could be accessed later, but we have no evidence that this actually happens.
Comment 7•4 years ago
|
||
The specific circumstances this is happening under are unlikely to be exploitable (we don't perform XPCOM shutdown of content processes on release), but something interesting is potentially happening here. Unfortunately although the new CheckedUnsafePtr assertion is helpfully indicative of a problem going on, it's also on the level of the famous Simpsons Car Gone! car alarm in terms of providing next steps.
:jkratzer, is it possible to get a pernosco trace of this?
:saschanaz, this might be also be in your area of interest/expertise as the test case is primarily a SW (sw.js) that involves weblocks and AbortController (gonna delay setting a needinfo until we get word back about pernosco-ability):
;(async () => {
const callback = async function (e) {
const opts = { mode: "shared", signal: abort.signal }
navigator.locks.request("weblock_0", opts, callback)
e = null
}
const abort = new AbortController()
await navigator.locks.request("weblock_0", callback)
})()
Reporter | ||
Comment 8•4 years ago
|
||
You can find a pernosco session for this bug, here.
Note for me: Check if explicitly __delete__ing from https://searchfox.org/mozilla-central/rev/8a08700a0392a783296a0940edc09c1cc3dd515c/dom/locks/LockManagerParent.cpp#45 helps
Edit: Hmm no, AbortFollower does not keep a strong reference to signal anymore... 🤔
A more minified repro.
Some weird points:
- If I move
const abort
to the worker global, then no failure - If I remove
await
, then no failure
I don't see how they can affect here 🤔
but something interesting is potentially happening here.
Yes! (although maybe not in security point of view)
So the failure here is caused by WorkerGlobalScopeBase::mWorkerPrivate
being still alive after WorkerPrivate destruction. The scope object is alive because LockRequestChild is keeping the objects alive (CallbackObject which agains keeps AbortSignal here). The IPC layer prepares to kill PLockRequest as it's life is over (observed by LockRequestChild::ActorDestroy
), but WorkerPrivate destruction happens earlier, and boom.
So this is why a mere cycle collection issue like bug 1752075 observed dangling WorkerPrivate.
Restoring WorkerGlobalScopeBase::mWorkerPrivate
to raw WorkerPrivate*
"workarounds" this issue, but not sure it's really a good fix. Thoughts?
Can ThreadSafeWeakPtr<WorkerPrivate>
be used there, btw?
Comment 13•4 years ago
•
|
||
It sounds like the StrongWorkerRef is being released too early. It's not necessary to drop the StrongWorkerRef when the callback happens (and in fact should not be dropped until it's safe to do so). Note that StrongWorkerRef has the implication that it treats the worker as definitely active and not eligible for GC. If we want to make sure that we don't shutdown without doing some cleanup, but don't want to prevent GC, an IPCWorkerRef can be used instead. (And CacheWorkerRef is an example of having a WorkerRef wrapper thing that transitions between StrongWorkerRef and IPCWorkerRef under the hood.)
Comment 14•4 years ago
|
||
(In reply to Kagami :saschanaz from comment #12)
Can
ThreadSafeWeakPtr<WorkerPrivate>
be used there, btw?
We should probably discuss this further in https://docs.google.com/document/d/1GRRcZapr5_Ue0Jbq_Imed9IBWmhhSAvwPSuUilVe2io/ but the general situation is (and some of this may still want to be reflected into that document):
- CheckedUnsafePtr was just the fastest thing we had to be able to get explosions on misuse in debug builds without potentially introducing new types of failures. (And in particular, moving to holding strong references instead was not okay with current semantics and would hide serious logic bogs.)
- That said, the failure mode in non-release builds of potentially being a UAF if used is concerning a real problem. We're not trying to say that was okay. ThreadSafeWeakPtr turning it into a nullptr crash or throwing an error code is definitely more desirable if an attempt is made to be dereferenced.
- However we still likely want to be able to catch coding logic errors under debug builds that indicate if there were any ThreadSafeWeakPtrs in existence when we free the WorkerPrivate.
- There's definitely a continuum related to worker shutdown between "everything needs to be 100% perfectly cleaned up when the worker fully terminates" versus "things can still be happening but they need to turn into no-ops if something dangerous would happen". I'm leaning towards the former because in my experience the latter approach frequently results in moving parts that don't handle the error edge cases resulting in very complex failure modes that are hard to reason about.
- That said, I think it is potentially quite reasonable for us to only be strict about WorkerPrivate* but to potentially be more forgiving for nsIGlobalObject. That said, I do think we probably want nsIGlobalObject::CreateKeepAlive or something that maps to a StrongWorkerRef/others so that we can also raise the bar for object lifecycle cleanup everywhere in general, not just on workers. Noting that for windows things are more complex where the global can be dead but the agent is still alive and DOM objects potentially still can be used for stuff in that situation.
It sounds like the StrongWorkerRef is being released too early. It's not necessary to drop the StrongWorkerRef when the callback happens
I can make StrongWorkerRef live longer, but it's not that LockRequestChild itself needs WorkerRef, and doing so also does not solves cases like bug 1752075. (There was an issue, but having a dangling WorkerPrivate whenever WorkerGlobalScope lives longer is still confusing).
Edit: I'll try IPCWorkerRef, thanks! Reading https://searchfox.org/mozilla-central/rev/8b46752d1e583b2f817c451f93ba515fb865554d/dom/workers/WorkerRef.h#83-98 ...
Updated•4 years ago
|
Updated•4 years ago
|
Comment 17•4 years ago
|
||
:saschanaz, since this bug contains a bisection range, could you fill (if possible) the regressed_by field?
For more information, please visit auto_nag documentation.
Comment 18•4 years ago
|
||
This is a similar issue when landing bug 1744025.
In basic, we assume that WorkerPrivate::mSelfRef is the last RefPtr and will be released in WorkerFinishedRunnable/TopLevelWorkerFinishedRunnable::Run().
WorkerGlobalScopeBase has a CheckUnsafePtr<WorkerPrivate> member. However, WorkerGlobalScopeBase's life cycle is out of WorkerPrivate.
So everywhere code keeps a WorkerGlobalScopeBase, especially when it binds to web API implementations, is potential that has the same problem if it doesn't release the WorkerGlobalScopeBase before WorkerFinishedRunnable::Run().
According to the code here, before the WorkerThreadPrimaryRunnable schedule a TopLevelWorkerFinishedRunnable/WorkerFinishedRunnable, it would schedule a GC to break things related JS objects. And it seems to be the last chance to make sure the corresponding WorkerGlobalScopeBase::mWorkerPrivate is set as nullptr, otherwise, we will meet the same core-dump stack again.
Since the WorkerGlobalScopeBase's life cycle is out of WorkerPrivate, I am not sure if it is a correct idea, but it makes me consider that it should not be a member of WorkerGlobalScopeBase. We probably should query it by GetCurrentThreadWorkerPrivate() when needed and keep it with RefPtr to make sure the WorkerPrivate keeps alive during the functionality.
Comment 19•4 years ago
•
|
||
(In reply to Eden Chuang[:edenchuang] from comment #18)
Since the WorkerGlobalScopeBase's life cycle is out of WorkerPrivate, I am not sure if it is a correct idea, but it makes me consider that it should not be a member of WorkerGlobalScopeBase. We probably should query it by GetCurrentThreadWorkerPrivate() when needed and keep it with RefPtr to make sure the WorkerPrivate keeps alive during the functionality.
When the WorkerPrivate is getting deleted, no one should have any references to the WorkerGlobalScopeBase because there will no longer be any (safe) place for code to run. The conceptual event target is gone. It's only because of other bugs/invariant violations right now that runnables can end up running on the bare WorkerThread nsThread like what we witnessed in bug 1743671. (There are some follow-ups identified after discussion with :nika that we can use to help address this problem, but the underlying lifecycle issues still stand.)
This problem cascades into GetCurrentThreadWorkerPrivate() which under the hood uses CycleCollectedJSContext::Get
's thread-local-storage that assumes we don't have the above problem. Because workers can get misused, if we used GetCurrentThreadWorkerPrivate there's then a potential for an entirely different WorkerPrivate global to be (mis-)used.
Aside: I have a reply on the phabricator revision that I'm still working on and hope to finish up after lunch.
Updated•4 years ago
|
When the WorkerPrivate is getting deleted, no one should have any references to the WorkerGlobalScopeBase
Sounds like it can have better assertion - Check the existence of scope when WorkerPrivate is being destroyed? Could be even better if the current assertion message can say which classes have the dangling reference, but not sure it's plausible.
Assignee | ||
Comment 21•4 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #6)
:asuth, after we talked I am a bit less sure if this merits a sec-high, actually. IIUC this indicates only, that we leave a weak pointer around that could be accessed later, but we have no evidence that this actually happens.
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #7)
The specific circumstances this is happening under are unlikely to be exploitable (we don't perform XPCOM shutdown of content processes on release), but something interesting is potentially happening here. Unfortunately although the new CheckedUnsafePtr assertion is helpfully indicative of a problem going on, it's also on the level of the famous Simpsons Car Gone! car alarm in terms of providing next steps.
Hi Daniel, in this light would sec-moderate be more appropriate here for triage?
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 22•4 years ago
|
||
Assignee | ||
Comment 23•4 years ago
•
|
||
After talking with :asuth, this patch would just help to:
- moot the CheckedUnsafePtr check for dangling
WorkerGlobalScope
objects - get a nullptr crash in case someone tries to really access that
mWorkerPrivate
after the worker died, indicating a real abuse
It is not meant to be a solution for the WorkerGlobalScope
lifecycle issues themselves.
Updated•4 years ago
|
Updated•4 years ago
|
![]() |
||
Comment 24•3 years ago
|
||
Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker ends. r=dom-worker-reviewers,asuth
https://hg.mozilla.org/integration/autoland/rev/c0922a11dfc958b24a5febb219dae8e556beab4c
https://hg.mozilla.org/mozilla-central/rev/c0922a11dfc9
Comment 25•3 years ago
|
||
Bugmon Analysis
Verified bug as fixed on rev mozilla-central 20220216214005-d0676cb0864b.
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.
Assignee | ||
Comment 26•3 years ago
|
||
Though we mitigated the consequences I think we still want to reason about the other patch here: https://phabricator.services.mozilla.com/D137792
:saschanaz, or do you prefer to spin of a new bug for that one?
Assignee | ||
Comment 27•3 years ago
|
||
Pascal, it might be worth uplifting the mitigation patch to beta? Even though we have no proof that there is a real issue, it gives us some more confidence that nothing bad can happen here. Give me a sign if we should ask for it.
(In reply to Jens Stutte [:jstutte] from comment #26)
Though we mitigated the consequences I think we still want to reason about the other patch here: https://phabricator.services.mozilla.com/D137792
:saschanaz, or do you prefer to spin of a new bug for that one?
If :asuth's argument that "nothing should grab the global scope object while the worker is dying" still holds, I can go rebase and land it here together.
Comment 29•3 years ago
|
||
I responded on https://phabricator.services.mozilla.com/D137792. The tl;dr is that LockManager::mOwner holds a strong reference to the worker global scope without holding a WorkerRef to justify holding that reference, which results in the worker global scope outliving the WorkerPrivate destructor. It's probably appropriate to upgrade the WeakWorkerRef to an IPCWorkerRef and have LockManager shutdown drop the mOwner ref.
Given all the time that's passed, it may be that some additional cycle collection traversal is now in place that might also help clean up the LockManager, but I think the changes mentioned in the PR likely still make sense.
Updated•3 years ago
|
Updated•3 years ago
|
Closing again per https://phabricator.services.mozilla.com/D137792#4648737.
Updated•3 years ago
|
Comment 31•3 years ago
•
|
||
IIUC, this didn't technically regress in 98, it was an underlying issue exposed during that timeframe. What does that mean for ESR? This patch grafts cleanly there, but we already missed our window to get it into the 91.8esr release :. I guess we'd also need bug 1756172 if we did want to uplift?
Jens, thoughts? I don't think I understand your patch well enough, so just passing NI.
Assignee | ||
Comment 33•3 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #31)
IIUC, this didn't technically regress in 98, it was an underlying issue exposed during that timeframe.
Yes and no. This specific crash had been introduced by the patch stack that added CheckedUnsafePtr to mWorkerPrivate, but that is just a diagnostic thing without effect in release. These patches are not in ESR. The underlying potential lifecycle problem is older, but it is unknown if it ever caused crashes or other trouble.
What does that mean for ESR? This patch grafts cleanly there, but we already missed our window to get it into the 91.8esr release :. I guess we'd also need bug 1756172 if we did want to uplift?
Yes, without bug 1756172 this patch could cause real crashes in release that weren't present before. So if also those graft cleanly we could think about uplifting, but as said we have no concrete sign that there were ever bad things happening. Moving the ni? to asuth as he might have a better feeling for the importance here.
Comment 34•3 years ago
•
|
||
I think the motivating criteria for uplift would be web locks. The basic web locks impl bug 1725734 landed in 93 and web locks was enabled in nightly by bug 1739233 in 96. Since they're not in 91 and will never be uplifted to 91, and we're already quite close to ESR 102, I'd say we shouldn't uplift.
Generally this bug with bug 1756172 is a very nice set of lifecycle improvements, but given our distance from 91 and that we've been iteratively improving things and, as :jstutte says, don't have concrete evidence of bad things, it seems like there's only risk in uplifting these changes as there could be other things we haven't uplifted which would mean ESR91 would be a new thing behaviorally that we haven't seen before.
Comment 35•3 years ago
|
||
Thanks for the thoughtful responses everyone!
Updated•3 years ago
|
Updated•3 years ago
|
Comment 36•3 years ago
|
||
Comment on attachment 9262178 [details]
Bug 1752120 - Release WorkerRef in LockRequestChild::ActorDestroy r=asuth,smaug
Revision D137792 was moved to bug 1764921. Setting attachment 9262178 [details] to obsolete.
Updated•3 years ago
|
Description
•