1756172 - Crash [@ IsOnWorkerThread]

Reporter

Description

•

4 years ago

Testcase found while fuzzing mozilla-central rev 2b187d932ca1 (built with: --enable-debug --enable-fuzzing).

Testcase can be reproduced using the following commands:

$ pip install fuzzfetch grizzly-framework
$ python -m fuzzfetch --build 2b187d932ca1 --debug --fuzzing -n firefox
$ python -m grizzly.replay ./firefox/firefox testcase.zip

[@ IsOnWorkerThread]

    ==25029==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address 0x000000000480 (pc 0x7f416630ef4e bp 0x7f4159ae9280 sp 0x7f4159ae9270 T25043)
    ==25029==The signal is caused by a READ memory access.
    ==25029==Hint: address points to the zero page.
        #0 0x7f416630ef4e in IsOnWorkerThread /dom/workers/WorkerPrivate.cpp:5347:3
        #1 0x7f416630ef4e in AssertIsOnWorkerThread /dom/workers/WorkerPrivate.cpp:5355:3
        #2 0x7f416630ef4e in mozilla::dom::WorkerGlobalScopeBase::GetGlobalJSObjectPreserveColor() const /dom/workers/WorkerScope.cpp:259:19
        #3 0x7f4166521780 in HasJSGlobal /dom/base/nsIGlobalObject.h:123:37
        #4 0x7f4166521780 in mozilla::dom::ScriptSettingsStackEntry::ScriptSettingsStackEntry(nsIGlobalObject*, mozilla::dom::ScriptSettingsStackEntry::Type) /dom/script/ScriptSettings.cpp:171:3
        #5 0x7f41664f998a in mozilla::dom::AutoJSAPI::AutoJSAPI(nsIGlobalObject*, bool, mozilla::dom::ScriptSettingsStackEntry::Type) /dom/script/ScriptSettings.cpp:423:7
        #6 0x7f41664f96d6 in mozilla::dom::AutoEntryScript::AutoEntryScript(nsIGlobalObject*, char const*, bool) /dom/script/AutoEntryScript.cpp:66:7
        #7 0x7f41618f925e in void mozilla::dom::Promise::MaybeSomething<JS::Handle<JS::Value> const&>(JS::Handle<JS::Value> const&, void (mozilla::dom::Promise::*)(JSContext*, JS::Handle<JS::Value>)) /builds/worker/workspace/obj-build/dist/include/mozilla/dom/Promise.h:326:21
        #8 0x7f416631fea6 in InvokeMethod<(lambda at /dom/workers/WorkerScope.cpp:1129:14), void ((lambda at /dom/workers/WorkerScope.cpp:1129:14)::*)(const mozilla::MozPromise<bool, nsresult, true>::ResolveOrRejectValue &) const, mozilla::MozPromise<bool, nsresult, true>::ResolveOrRejectValue> /builds/worker/workspace/obj-build/dist/include/mozilla/MozPromise.h:630:12
        #9 0x7f416631fea6 in InvokeCallbackMethod<false, (lambda at /dom/workers/WorkerScope.cpp:1129:14), void ((lambda at /dom/workers/WorkerScope.cpp:1129:14)::*)(const mozilla::MozPromise<bool, nsresult, true>::ResolveOrRejectValue &) const, mozilla::MozPromise<bool, nsresult, true>::ResolveOrRejectValue, RefPtr<mozilla::MozPromise<bool, nsresult, true>::Private> > /builds/worker/workspace/obj-build/dist/include/mozilla/MozPromise.h:661:5
        #10 0x7f416631fea6 in mozilla::MozPromise<bool, nsresult, true>::ThenValue<mozilla::dom::ServiceWorkerGlobalScope::SkipWaiting(mozilla::ErrorResult&)::$_3>::DoResolveOrRejectInternal(mozilla::MozPromise<bool, nsresult, true>::ResolveOrRejectValue&) /builds/worker/workspace/obj-build/dist/include/mozilla/MozPromise.h:914:7
        #11 0x7f4162438602 in mozilla::MozPromise<bool, nsresult, true>::ThenValueBase::ResolveOrRejectRunnable::Run() /builds/worker/workspace/obj-build/dist/include/mozilla/MozPromise.h:487:21
        #12 0x7f41619cdb3e in OnDiscard /xpcom/threads/nsThreadUtils.cpp:94:9
        #13 0x7f41619cdb3e in non-virtual thunk to mozilla::CancelableRunnable::OnDiscard() /xpcom/threads/nsThreadUtils.cpp
        #14 0x7f4166319f07 in mozilla::dom::(anonymous namespace)::ExternalRunnableWrapper::Cancel() /dom/workers/WorkerPrivate.cpp:219:13
        #15 0x7f416630bb5f in mozilla::dom::WorkerRunnable::Run() /dom/workers/WorkerRunnable.cpp:247:5
        #16 0x7f41619c10a9 in nsThread::ProcessNextEvent(bool, bool*) /xpcom/threads/nsThread.cpp:1165:16
        #17 0x7f41619bd574 in NS_ProcessPendingEvents(nsIThread*, unsigned int) /xpcom/threads/nsThreadUtils.cpp:432:19
        #18 0x7f41662fea22 in mozilla::dom::WorkerPrivate::ClearMainEventQueue(mozilla::dom::WorkerPrivate::WorkerRanOrNot) /dom/workers/WorkerPrivate.cpp:3750:5
        #19 0x7f41662da980 in mozilla::dom::workerinternals::(anonymous namespace)::WorkerThreadPrimaryRunnable::Run() /dom/workers/RuntimeService.cpp:2193:23
        #20 0x7f41619c10a9 in nsThread::ProcessNextEvent(bool, bool*) /xpcom/threads/nsThread.cpp:1165:16
        #21 0x7f41619c84fa in NS_ProcessNextEvent(nsIThread*, bool) /xpcom/threads/nsThreadUtils.cpp:467:10
        #22 0x7f4162475dcb in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) /ipc/glue/MessagePump.cpp:300:20
        #23 0x7f4162391da7 in MessageLoop::RunInternal() /ipc/chromium/src/base/message_loop.cc:331:10
        #24 0x7f4162391cb2 in RunHandler /ipc/chromium/src/base/message_loop.cc:324:3
        #25 0x7f4162391cb2 in MessageLoop::Run() /ipc/chromium/src/base/message_loop.cc:306:3
        #26 0x7f41619bc7de in nsThread::ThreadFunc(void*) /xpcom/threads/nsThread.cpp:387:10
        #27 0x7f41760afa57 in _pt_root /nsprpub/pr/src/pthreads/ptthread.c:201:5
        #28 0x7f4176e2b608 in start_thread /build/glibc-eX1tMB/glibc-2.31/nptl/pthread_create.c:477:8
        #29 0x7f41769f3292 in __clone /build/glibc-eX1tMB/glibc-2.31/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95
    
    UndefinedBehaviorSanitizer can not provide additional info.
    SUMMARY: UndefinedBehaviorSanitizer: SEGV /dom/workers/WorkerPrivate.cpp:5347:3 in IsOnWorkerThread
    ==25029==ABORTING

Jason Kratzer [:jkratzer]

Reporter

Comment 1

•

4 years ago

Attached file Testcase — Details

Jason Kratzer [:jkratzer]

Reporter

Updated

•

4 years ago

Component: DOM: Core & HTML → DOM: Service Workers

Jens Stutte [:jstutte]

Assignee

Updated

•

4 years ago

Flags: needinfo?(echuang)

Bugmon [:jkratzer for issues]

Comment 2

•

4 years ago

Bugmon Analysis
Verified bug as reproducible on mozilla-central 20220218090822-6938c3b26d14.
The bug appears to have been introduced in the following build range:

Start: 85fb6ffa2887b21084c20349e999b1ec46db1993 (20220216131952)
End: c0922a11dfc958b24a5febb219dae8e556beab4c (20220216170132)
Pushlog: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=85fb6ffa2887b21084c20349e999b1ec46db1993&tochange=c0922a11dfc958b24a5febb219dae8e556beab4c

Keywords: regression

Whiteboard: [bugmon:confirm] → [bugmon:bisected,confirmed]

Jens Stutte [:jstutte]

Assignee

Comment 3

•

4 years ago

This is actually not very concerning, we just assert here (so there is no potential UAF in old code pre-bug 1752120 in release here). But we should probably be careful with too much information on how to reproduce invariant violations like this, and the regressing bug is hidden, too.

Group: core-security

Regressed by: 1752120

BugBot [:suhaib / :marco/ :calixte]

Updated

•

4 years ago

status-firefox97: --- → unaffected

status-firefox98: --- → unaffected

status-firefox99: --- → affected

status-firefox-esr91: --- → unaffected

BugBot [:suhaib / :marco/ :calixte]

Updated

•

3 years ago

Has Regression Range: --- → yes

Andrew McCreight [:mccr8]

Updated

•

3 years ago

Group: core-security → dom-core-security

Daniel Veditz [:dveditz]

Updated

•

3 years ago

Keywords: sec-other

Olli Pettay [:smaug][bugs@pettay.fi]

Updated

•

3 years ago

Severity: -- → S3

Priority: -- → P3

Jens Stutte [:jstutte]

Assignee

Comment 5

•

3 years ago

Hi Jason, re-reading comment 0 I see: SEGV on unknown address 0x000000000480 which is not a plain nullptr so this is not the assert triggering but we want to read some member from a WorkerPrivate nullptr, IIUC. As the stack alone does not really reveal where that comes from, would it be possible to have a pernosco session here? Thanks!

Flags: needinfo?(echuang) → needinfo?(jkratzer)

Jason Kratzer [:jkratzer]

Reporter

Comment 6

•

3 years ago

You can find a pernosco session for this bug here.

Flags: needinfo?(jkratzer)

Jens Stutte [:jstutte]

Assignee

Comment 7

•

3 years ago

Thanks a lot!

I put some notes. So IIUC we should unroot the global only after having done mWorkerPrivate->ClearMainEventQueue(WorkerPrivate::WorkerRan); ?

Flags: needinfo?(bugmail)

Jens Stutte [:jstutte]

Assignee

Comment 8

•

3 years ago

Attached file Bug 1756172: Unroot the global scopes only after the last worker event ran. r?#dom-worker-reviewers — Details

Phabricator Automation

Updated

•

3 years ago

Assignee: nobody → jstutte

Status: NEW → ASSIGNED

Andrew Sutherland [:asuth] (he/him)

Comment 9

•

3 years ago

•

Edited

(In reply to Jens Stutte [:jstutte] from comment #7)

I put some notes. So IIUC we should unroot the global only after having done mWorkerPrivate->ClearMainEventQueue(WorkerPrivate::WorkerRan); ?

This is one of those cases where the thing that's crashing is indicating a potentially concerning lifetime problem but maybe we also want the thing that's catching this by virtue of crashing to assert under fuzzing but silently not crash or do bad things in production.

In terms of the things of what's catching this: In the pernosco trace, we don't actually care about the not-taken call to WorkerPrivateSaysForbidScript because the global was already marked as dying (nsIGlobalObject::StartDying) previously during the transitions to Canceling and Killing. So if we ever got to CallbackObject::CallSetup::CallSetup's invocation of IsScriptForbidden we'd say it's not okay. But we're not getting there because it's the call to nsIGlobalObject::HasJSGlobal during AutoEntryScript construction for the Promise that's crashing earlier. (Edited) Perhaps we could have the subclassed WorkerGlobalScopeBase::GetGlobalJSObjectPreserveColor that HasJSGlobal calls assert hard and return null when mWorkerPrivate is null. I'm not immediately clear on whether there will be runtime bails from this somewhere or just a change in where the crash is. :smaug NI-ing you to see if you have any thoughts about HasJSGlobal's behavior unrooting; pre-edit I was wondering if we should return false for HasJSGlobal when mDying, but that sounds too dramatic given that the HasJSGlobal was in an assert, not a runtime path. Note that I have not yet translated the Peers meeting about doing better about poisoning globals/JSContexts and part of that discussion did involve the behavior of AutoEntryScript/etc., so that might potentially be a relevant place to interact with the global (for new AES's).

In terms of the lifetime problem, the problem is that ServiceWorkerGlobalScope::SkipWaiting has a lambda that holds a promise that's basically entirely unaware of the lifecycle stuff going on. It's (through a helper wrapper) calling RemoteWorkerChild::MaybeSendSetServiceWorkerSkipWaitingFlag that involves bouncing to the worker launcher thread (bypassing the main thread). Arguably this scenario could be simpler if we were sending the SkipWaiting request over a PBackground path from the worker, which is something we'd discussed in https://bugzilla.mozilla.org/show_bug.cgi?id=1672493#c1. Specifically, async MozPromise/callbacks from the IPC actor will automatically reject when the actor is torn down, which is a fairly straightforward lifecycle step.

Note that skipWaiting is explicitly NewObject which means that we can't just have a dedicated holder slot for skipWaiting. The simplest fix for this might just be to acquire and hold a StrongWorkerRef for the duration of the skipwaiting promise since the request has a finite/bounded lifetime (which is likely why we were cavalier with the promise's ownership).

Flags: needinfo?(bugmail) → needinfo?(bugs)

Jens Stutte [:jstutte]

Assignee

Comment 10

•

3 years ago

•

Edited

Trying to translate this into a more local reasoning about the lifetime-relation between WorkerPrivate and WorkerGlobalScopeBase:

The "thing that's crashing" here is an access to the mWorkerPrivate of WorkerGlobalScopeBase after it has already been nulled out. I might be simplifying things, but IMHO we have only two options here regarding the lifetime-relation between WorkerPrivate and WorkerGlobalScopeBase derived objects:

We expect WorkerGlobalScopeBase derived objects to never outlive their mWorkerPrivate.
This seems to be kind of the current state, only that evidently there are violations to this. If this is what we want, we should just continue to enforce this (for example by adding more asserts before any access to mWorkerPrivate) and adjust any violations (actually we already catch them now by having nulled out the mWorkerPrivate). Apparently one violation can be caused by the mWorkerPrivate->ClearMainEventQueue(WorkerPrivate::WorkerRan); done after we nulled out the global's mWorkerPrivate and thus keeping this pointer alive until the latest possible moment in the WorkerPrivate's live seems advisable to me, as proposed by the patch here.
We expect the global's mWorkerPrivate to possibly go away at any time, introducing a further (most definite) way of telling the global "this worker is dead".
This would imply to null-check mWorkerPrivate before accessing it at runtime in all places and to act accordingly.

IIUC the state of the worker is already well modeled by its internal state machine, so until the global can access its mWorkerPrivate it knows the worker's state and acts accordingly and having mWorkerPrivate set to nullptr does not add new information. It just guards us from accessing a potentially dead object. And until today we kind of expected that 1. is true, so I would expect to be there only few violations we can find and should handle.

This is one of those cases where the thing that's crashing is indicating a potentially concerning lifetime problem but maybe we also want the thing that's catching this by virtue of crashing to assert under fuzzing but silently not crash or do bad things in production.

Not sure if I second this. I'd rather prefer to first have a clear decision between 1. and 2. and then adjust the code accordingly. To crash only while fuzzing reads like "we want 1. but we are lazy to find all violations". Now that can be the right call if we expect those violations to be very hard to find.

In terms of the things of what's catching this: In the pernosco trace, we don't actually care about the not-taken call to WorkerPrivateSaysForbidScript because the global was already marked as dying (nsIGlobalObject::StartDying) previously during the transitions to Canceling and Killing. So if we ever got to CallbackObject::CallSetup::CallSetup's invocation of IsScriptForbidden we'd say it's not okay. But we're not getting there because it's the call to nsIGlobalObject::HasJSGlobal during AutoEntryScript construction for the Promise that's crashing earlier.

Should I read this as "on the way to check if we should actually execute script here we died earlier due to the missing mWorkerPrivate" ? Such a check should definitely just fail but not crash in the state we are in here, IIUC. I assume (but I did not check) that if the mWorkerPrivate was still set in this situation, everything would have just been fine. And the latest possible access to it seems to be triggered by the final ClearMainEventQueue, so ensuring that we still can access our-self during that clear still seems advisable to me. It might not solve other violations, but we can be kind of sure that the worker thread will not try to use the global and its mWorkerPrivate anymore only after that point, and that any other access must come from a different thread then.

Long story short, IMHO we should:

decide that we want 1.
repair this specific violation, potentially with attachment 9268485 [details]
enforce the invariant from 1. further (adding some asserts), though this is kind of optional and mostly documents our assumption everywhere (setting mWorkerPrivate to nullptr technically is sufficient to catch them).

Jens Stutte [:jstutte]

Assignee

Comment 11

•

3 years ago

•

Edited

repair this specific violation

More precisely actually this specific violation can be seen just as a false-positive as we null out the mWorkerPrivate too early. D138442 was wrong in claiming "Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker ends." as the worker really ceases to process events only later.

Andrew Sutherland [:asuth] (he/him)

Comment 12

•

3 years ago

Situation 1 is the only case that makes sense given that the event target should go away when the worker shuts down and the WorkerPrivate goes away. How can something legally be referencing a global which has a same-thread-usage invariant when the thread no longer exists?

Any event loop processing can reveal bugs. ClearMainEventQueue is not special in this specific case; it's generally plausible that we could see the exact same crash scenario on the bare worker thread after the primary runnable has completed. (And we could probably shift the reproduction to behave like this if we removed the call to ClearMainEventQueue.) https://bugzilla.mozilla.org/show_bug.cgi?id=1743671#c19 is some recent discussion in this space about us potentially tightening up the lack of meaningful invariants around our worker event targets.

My concern about crashing is primarily to avoid crashing on the release channel. It's very easy for sites to change their behavior (ex: using our new-ish implementations of things) and uncover longstanding correctness issues that we haven't addressed. While it would be nice to have the crash reports from the field, it feels like the wrong trade-off if the net result might be to cause non-stop crashing of content processes for that origin. It's a situation where I'd much rather have the discussed ability to effectively trigger a crash report without actually taking down the browser process.

Jens Stutte [:jstutte]

Assignee

Comment 13

•

3 years ago

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #12)

Any event loop processing can reveal bugs. ClearMainEventQueue is not special in this specific case; it's generally plausible that we could see the exact same crash scenario on the bare worker thread after the primary runnable has completed.

Ah, that was probably what I was missing. Thanks! Still I think we should move unrooting after ClearMainEventQueue, simply because it is a well known and expected case.

My concern about crashing is primarily to avoid crashing on the release channel. It's very easy for sites to change their behavior (ex: using our new-ish implementations of things) and uncover longstanding correctness issues that we haven't addressed. While it would be nice to have the crash reports from the field, it feels like the wrong trade-off if the net result might be to cause non-stop crashing of content processes for that origin. It's a situation where I'd much rather have the discussed ability to effectively trigger a crash report without actually taking down the browser process.

That would mean to go through all accesses of mWorkerPrivate and add a MOZ_ASSERT and an if (!mWorkerPrivate) bail out;, I assume. If you know this to be such sensible that we cannot just rely on tests and fuzzers, then we should probably do so.

Andrew Sutherland [:asuth] (he/him)

Comment 14

•

3 years ago

(In reply to Jens Stutte [:jstutte] from comment #13)

That would mean to go through all accesses of mWorkerPrivate and add a MOZ_ASSERT and an if (!mWorkerPrivate) bail out;, I assume. If you know this to be such sensible that we cannot just rely on tests and fuzzers, then we should probably do so.

I guess my specific proposal would be:

We shouldn't be adding new MOZ_CRASH() calls related to this. We should favor MOZ_ASSERT or MOZ_DIAGNOSTIC_ASSERT followed by an attempt to bail and not crash.
We should make AutoEntryScript not crash when mWorkerPrivate is null, but it definitely should at least MOZ_ASSERT with not having the global. But it would be nice if in production the case did not crash out of an abundance of caution.

In terms of transforming all existing mWorkerPrivate calls... I don't know? I think AutoEntryScript should cover most of the scenarios I would imagine are still lingering right now. I think it would probably be most beneficial right now to:

audit the cases like this specific scenario where the WorkerPrivate is involved in creating promises that involve runnables involving RemoteWorkerChild and fix those.
try and help improve the event target situation as discussed in https://bugzilla.mozilla.org/show_bug.cgi?id=1743671#c19 so that we could potentially assert when we're seeing runnables dispatched late, but this is a more complex issue

Jens Stutte [:jstutte]

Assignee

Comment 15

•

3 years ago

Attached file Bug 1756172: Add paranoid lifecycle checks in WorkerGlobalScopeBase::GetGlobalJSObject(PreserveColor). r?#dom-worker-reviewers (obsolete) — Details

Depends on D141507

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 16

•

3 years ago

"We expect WorkerGlobalScopeBase derived objects to never outlive their mWorkerPrivate" sounds reasonable, but it means also that we assume there aren't ever any leaks. But leaks do occur, unfortunately, and we shouldn't crash because of them.

The stack trace in comment AutoEntryScript::AutoEntryScript is about the generic issue we've had with AutoEntryScript forever that it doesn't have Init() method similarly to AutoJSAPI. Whoever calls AutoJSAPI::Init() is supposed to check the return value, and that returns false for example if the
global doesn't have JS object anymore.
I commented about this also in https://phabricator.services.mozilla.com/D141719

Flags: needinfo?(bugs)

Jens Stutte [:jstutte]

Assignee

Comment 17

•

3 years ago

•

Edited

(In reply to Olli Pettay [:smaug] from comment #16)

"We expect WorkerGlobalScopeBase derived objects to never outlive their mWorkerPrivate" sounds reasonable, but it means also that we assume there aren't ever any leaks. But leaks do occur, unfortunately, and we shouldn't crash because of them.

That kind of brings us back to the options from comment 10. I understood Andrew is saying "this is an invariant we should enforce" and it is also what our code assumed before (risking UAF on the mWorkerPrivate in case of leaks which then we wanted to mitigate in bug 1752120 by nulling them out). So the first patch of this stack basically restores the previous behavior before bug 1752120 (through some re-ordering of calls) but keeping the UAF-safety and adding the asserts to enforce the (presumed) invariant, which might turn out difficult to achieve.

If we really cannot prevent from leaks and must accept them as a normal and expected state of operations, we will never be able to guarantee the invariant of option 1. That would automatically mean, we loosen the strict coupling of WorkerPrivate's lifecycle with WorkerGlobalScopeBase derived objects and need to always check mWorkerPrivate before using it, bringing us straight to option 2.

The stack trace in comment AutoEntryScript::AutoEntryScript is about the generic issue we've had with AutoEntryScript forever that it doesn't have Init() method similarly to AutoJSAPI. Whoever calls AutoJSAPI::Init() is supposed to check the return value, and that returns false for example if the global doesn't have JS object anymore.
I commented about this also in https://phabricator.services.mozilla.com/D141719

Well, this is one of the probably many consequences of 2. - we must add a meaningful and robust error handling if we always need to check mWorkerPrivate before using - if it is not there we need to act accordingly without crashing subsequently. And if we can repair this long standing issue together, even better.

I'd then propose to land https://bugzilla.mozilla.org/attachment.cgi?id=9268485 minus the sentinel asserts asap in order to repair the regression from bug 1752120 which makes us crash much more often than in the past before it hits release and then add additional patches to harden incrementally all accesses to mWorkerPrivate. But I'd want to hear also asuth' opinion on the (non-)possibility to enforce the invariant from option 1, of course.

Flags: needinfo?(bugmail)

Jens Stutte [:jstutte]

Assignee

Comment 18

•

3 years ago

•

Edited

Just adding some pieces of my (patchwork) understanding here:

IIUC, WorkerGlobalScopeBase asserts in most of its functions (AFAICS all functions that access mWorkerPrivate) that we are on the worker thread. So a better wording of the invariant we want to enforce could be:

"We do not expect anyone to use worker-thread-only functions on the global scope after we exit from WorkerThreadPrimaryRunnable::Run".

I assume this describes quite well the assumed state of things before we tried to improve them here - we just trusted that nobody would ever want to use mWorkerPrivate afterwards.

This is mainly achieved by emptying the event queue(s) of the worker thread and killing all pending timers before leaving. After that point we would not expect other events that want to access the global scopes' worker-thread-only functions to come in on the worker thread before it is completely torn down. If they do, it is probably an error and should be corrected.

We enforce this invariant already by setting mWorkerPrivate to nullptr, but we did it just too early in bug 1752120. Obviously this is a very hard enforcement, as we will always crash now. That could be ok if we expect this to happen rarely enough to not matter. In fact such use is equivalent to a UAF on the code pre bug 1752120 and we did not see those happen very often (but they are obviously harder to detect than a nullptr crash). If instead we want to soften the consequences, we need to treat all accesses to m̀WorkerPrivate` well just like scenario 2.

Now the main driver for all these changes was to audit the uses of WorkerPrivate* by the introduction of CheckedUnsafePtr<WorkerPrivate>. This resulted in the detection of the dangling WorkerPrivate* of WorkerGlobalScopeBase in some cases. This diagnostic assert is actually the equivalent of the sentinel asserts we wanted to introduce here - just triggered later on destruction of the global scope object and detecting a leak. It might thus give us a hint (a lower boundary actually, as it is checked later) on how often this can be expected to happen and if it is worth to try to enforce the stronger invariant: "We expect WorkerGlobalScopeBase derived objects to never outlive their mWorkerPrivate".

All this seems to tell me, that:

I'd then propose to land https://bugzilla.mozilla.org/attachment.cgi?id=9268485 minus the sentinel asserts asap in order to repair the regression from bug 1752120 which makes us crash much more often than in the past before it hits release and then add additional patches to harden incrementally all accesses to mWorkerPrivate.

is the right path forward here, unless we want to rewind all the CheckedUnsafePtr patches and start something completely different - which sounds bad to me as they actually helped to detect real problems (like here) and are also a step towards our ability to transform them to real smart pointers (of whatever flavor, strong or weak) one day.

The scope of this bug here should just be "repair the regression from bug 1752120". All the rest can be done in normal, non-secure bugs, I think, as we should then be on the safe side from the UAF point of view. And if this fix here really causes many crashes, we will have the right drive to finally improve things.

(I assume, Eden is already following this discussion, too, but in case - fyi)

Flags: needinfo?(echuang)

Donal Meehan [:dmeehan]

Updated

•

3 years ago

status-firefox100: --- → affected

Jens Stutte [:jstutte]

Assignee

Comment 19

•

3 years ago

•

Edited

(asuth and Eden are well involved, so clearing their ni?s)

Jason, would it be possible to get a check if this test case crashes on an optimized non-debug build (simulation as release) ? At least at this specific line the crash might be just debug only. Thank you!

Flags: needinfo?(jkratzer)

Flags: needinfo?(echuang)

Flags: needinfo?(bugmail)

Jason Kratzer [:jkratzer]

Reporter

Comment 20

•

3 years ago

:jstutte, it does not crash for me using an optimized build of mozilla-central rev 607fa621b6f2.

Flags: needinfo?(jkratzer)

Phabricator Automation

Updated

•

3 years ago

Attachment #9268485 - Attachment description: Bug 1756172: Unroot the globals only after ClearMainEventQueue ran. r?#dom-worker-reviewers → Bug 1756172: Unroot the globals only after last worker event ran. r?#dom-worker-reviewers

Andrew Sutherland [:asuth] (he/him)

Comment 21

•

3 years ago

Attached file Bug 1756172 - Eliminate skipWaiting warnings. r=#dom-workers! (obsolete) — Details

Phabricator Automation

Updated

•

3 years ago

Attachment #9268485 - Attachment description: Bug 1756172: Unroot the globals only after last worker event ran. r?#dom-worker-reviewers → Bug 1756172: Unroot the global scopes only after the last worker event ran. r?#dom-worker-reviewers

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 22

•

3 years ago

Attached file Bug 1756172, Make also DebuggerScope support weak references, r=jstutte — Details

Depends on D141507

Phabricator Automation

Updated

•

3 years ago

Attachment #9269388 - Attachment description: Bug 1756172, Make also DebuggerScope support weak referencese, r=jstutte → Bug 1756172, Make also DebuggerScope support weak references, r=jstutte

Jens Stutte [:jstutte]

Assignee

Updated

•

3 years ago

Keywords: leave-open

Jens Stutte [:jstutte]

Assignee

Updated

•

3 years ago

Keywords: leave-open

Phabricator Automation

Updated

•

3 years ago

Attachment #9269332 - Attachment is obsolete: true

Phabricator Automation

Updated

•

3 years ago

Attachment #9268824 - Attachment is obsolete: true

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 23

•

3 years ago

Unroot the global scopes only after the last worker event ran. r=dom-worker-reviewers,asuth,smaug
https://hg.mozilla.org/integration/autoland/rev/867ebd62b5ebba33e1fcd9a62cfb755d0065d1c2
https://hg.mozilla.org/mozilla-central/rev/867ebd62b5eb

Make also DebuggerScope support weak references, r=asuth
https://hg.mozilla.org/integration/autoland/rev/15d87a15fe819a5039dc98d38d17b580a293b215
https://hg.mozilla.org/mozilla-central/rev/15d87a15fe81

Group: dom-core-security → core-security-release

Status: ASSIGNED → RESOLVED

Closed: 3 years ago

status-firefox100: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → 100 Branch

Bugmon [:jkratzer for issues]

Comment 24

•

3 years ago

Bugmon Analysis
Bug appears to be fixed on mozilla-central 20220325214737-2b624fdb002e but BugMon was unable to find a usable build for 2b187d932ca1.
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.

status-firefox100: fixed → verified

Keywords: bugmon

BugBot [:suhaib / :marco/ :calixte]

Comment 25

•

3 years ago

The patch landed in nightly and beta is affected.
:jstutte, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(jstutte)

Jens Stutte [:jstutte]

Assignee

Comment 26

•

3 years ago

Andrew, wdyt? We are already in late beta, IIUC, and release seems not affected by these specific crashes. So just riding the trains might be ok here?

Flags: needinfo?(jstutte) → needinfo?(bugmail)

Andrew Sutherland [:asuth] (he/him)

Comment 27

•

3 years ago

Yes, this likely should just ride the trains given that this landed after the last beta in the cycle and CheckedUnsafePtr is behaving like the pre-CheckedUnsafePtr transition in release.

That said, wacky crashes are possible if there are changes in what the web is doing, so if we see a spike in crashes in workers on release at any point during worker shutdown, this would then want a dot release. But it's not worth the risk to try and uplift it at this very late date.

status-firefox99: affected → wontfix

Flags: needinfo?(bugmail)

Frederik Braun [:freddy] (OOO until August 18th)

Updated

•

3 years ago

Whiteboard: [bugmon:bisected,confirmed] → [bugmon:bisected,confirmed][adv-main100+r]

Jens Stutte [:jstutte]

Assignee

Updated

•

3 years ago

Regressions: 1769878

Daniel Veditz [:dveditz]

Updated

•

3 years ago

Group: core-security-release

Kagami Rosylight [:saschanaz] (they/them)

Updated

•

3 years ago

Regressions: 1764921

Testcase 4 years ago Jason Kratzer [:jkratzer] 17.59 KB, application/octet-stream		Details
Bug 1756172: Unroot the global scopes only after the last worker event ran. r?#dom-worker-reviewers 3 years ago Jens Stutte [:jstutte] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1756172: Add paranoid lifecycle checks in WorkerGlobalScopeBase::GetGlobalJSObject(PreserveColor). r?#dom-worker-reviewers 3 years ago Jens Stutte [:jstutte] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1756172 - Eliminate skipWaiting warnings. r=#dom-workers! 3 years ago Andrew Sutherland [:asuth] (he/him) 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1756172, Make also DebuggerScope support weak references, r=jstutte 3 years ago Olli Pettay [:smaug][bugs@pettay.fi] 48 bytes, text/x-phabricator-request		Details \| Review