<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Bugmon [:jkratzer for issues]

Assignee

Comment 2

•

4 years ago

The chase is opened!

Flags: needinfo?(echuang)

Daniel Veditz [:dveditz]

Comment 3

•

4 years ago

This looks like an intentional opt crash catching a problem, but I guess diagnostic crashes don't get shipped to Release. Can be this some bad value other than null? since this is essentially a "weak pointer" class it is a bit worrying.

Group: core-security → dom-core-security

Keywords: csectype-uaf, sec-high

Comment 4

•

4 years ago

Bugmon Analysis
Verified bug as reproducible on mozilla-central 20220126163134-35652548842a.
The bug appears to have been introduced in the following build range:

Start: d376f1776505991972805ddddc079822184c014e (20220125080215)
End: a47b76ed47bf1aa3868b83dfe7fcec44dfe08332 (20220125094815)
Pushlog: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=d376f1776505991972805ddddc079822184c014e&tochange=a47b76ed47bf1aa3868b83dfe7fcec44dfe08332

Whiteboard: [bugmon:confirm] → [bugmon:bisected,confirmed]

Assignee

Comment 5

•

4 years ago

It is a "regression" from bug 1744025 which introduced this diagnostic assert but it indicates a pre-existing potential lifecycle issue with a weak pointer, so sec-high seems appropriate for now.

Assignee

Updated

•

4 years ago

Comment 6

•

4 years ago

(In reply to Jens Stutte [:jstutte] from comment #5)

It is a "regression" from bug 1744025 which introduced this diagnostic assert but it indicates a pre-existing potential lifecycle issue with a weak pointer, so sec-high seems appropriate for now.

:asuth, after we talked I am a bit less sure if this merits a sec-high, actually. IIUC this indicates only, that we leave a weak pointer around that could be accessed later, but we have no evidence that this actually happens.

Flags: needinfo?(bugmail)

Jason Kratzer [:jkratzer]

Comment 7

•

4 years ago

The specific circumstances this is happening under are unlikely to be exploitable (we don't perform XPCOM shutdown of content processes on release), but something interesting is potentially happening here. Unfortunately although the new CheckedUnsafePtr assertion is helpfully indicative of a problem going on, it's also on the level of the famous Simpsons Car Gone! car alarm in terms of providing next steps.

:jkratzer, is it possible to get a pernosco trace of this?

:saschanaz, this might be also be in your area of interest/expertise as the test case is primarily a SW (sw.js) that involves weblocks and AbortController (gonna delay setting a needinfo until we get word back about pernosco-ability):

;(async () => {
  const callback = async function (e) {
    const opts = { mode: "shared", signal: abort.signal }
    navigator.locks.request("weblock_0", opts, callback)
    e = null
  }
  const abort = new AbortController()
  await navigator.locks.request("weblock_0", callback)
})()

Flags: needinfo?(jkratzer)

Flags: needinfo?(echuang)

Flags: needinfo?(bugmail)

Reporter

Comment 8

•

4 years ago

You can find a pernosco session for this bug, here.

Flags: needinfo?(jkratzer)

Kagami Rosylight [:saschanaz] (they/them) (OOO until 2025-08-18)

Comment 9

•

4 years ago

•

Edited

Note for me: Check if explicitly __delete__ing from https://searchfox.org/mozilla-central/rev/8a08700a0392a783296a0940edc09c1cc3dd515c/dom/locks/LockManagerParent.cpp#45 helps

Edit: Hmm no, AbortFollower does not keep a strong reference to signal anymore... 🤔

Kagami Rosylight [:saschanaz] (they/them) (OOO until 2025-08-18)

Updated

•

4 years ago

Flags: needinfo?(krosylight)

Kagami Rosylight [:saschanaz] (they/them) (OOO until 2025-08-18)

Comment 10

•

4 years ago

•

Edited

Attached file 1752120.html — Details

A more minified repro.

Some weird points:

If I move const abort to the worker global, then no failure
If I remove await, then no failure

I don't see how they can affect here 🤔

Kagami Rosylight [:saschanaz] (they/them) (OOO until 2025-08-18)

Comment 11

•

4 years ago

•

Edited

but something interesting is potentially happening here.

Yes! (although maybe not in security point of view)

So the failure here is caused by WorkerGlobalScopeBase::mWorkerPrivate being still alive after WorkerPrivate destruction. The scope object is alive because LockRequestChild is keeping the objects alive (CallbackObject which agains keeps AbortSignal here). The IPC layer prepares to kill PLockRequest as it's life is over (observed by LockRequestChild::ActorDestroy), but WorkerPrivate destruction happens earlier, and boom.

So this is why a mere cycle collection issue like bug 1752075 observed dangling WorkerPrivate.

Restoring WorkerGlobalScopeBase::mWorkerPrivate to raw WorkerPrivate* "workarounds" this issue, but not sure it's really a good fix. Thoughts?

Flags: needinfo?(krosylight) → needinfo?(bugmail)

Kagami Rosylight [:saschanaz] (they/them) (OOO until 2025-08-18)

Comment 12

•

4 years ago

Can ThreadSafeWeakPtr<WorkerPrivate> be used there, btw?

Kagami Rosylight [:saschanaz] (they/them) (OOO until 2025-08-18)

Updated

•

4 years ago

Comment 13

•

4 years ago

•

Edited

It sounds like the StrongWorkerRef is being released too early. It's not necessary to drop the StrongWorkerRef when the callback happens (and in fact should not be dropped until it's safe to do so). Note that StrongWorkerRef has the implication that it treats the worker as definitely active and not eligible for GC. If we want to make sure that we don't shutdown without doing some cleanup, but don't want to prevent GC, an IPCWorkerRef can be used instead. (And CacheWorkerRef is an example of having a WorkerRef wrapper thing that transitions between StrongWorkerRef and IPCWorkerRef under the hood.)

Flags: needinfo?(bugmail)

Kagami Rosylight [:saschanaz] (they/them) (OOO until 2025-08-18)

Comment 14

•

4 years ago

(In reply to Kagami :saschanaz from comment #12)

Can ThreadSafeWeakPtr<WorkerPrivate> be used there, btw?

We should probably discuss this further in https://docs.google.com/document/d/1GRRcZapr5_Ue0Jbq_Imed9IBWmhhSAvwPSuUilVe2io/ but the general situation is (and some of this may still want to be reflected into that document):

CheckedUnsafePtr was just the fastest thing we had to be able to get explosions on misuse in debug builds without potentially introducing new types of failures. (And in particular, moving to holding strong references instead was not okay with current semantics and would hide serious logic bogs.)
That said, the failure mode in non-release builds of potentially being a UAF if used is concerning a real problem. We're not trying to say that was okay. ThreadSafeWeakPtr turning it into a nullptr crash or throwing an error code is definitely more desirable if an attempt is made to be dereferenced.
However we still likely want to be able to catch coding logic errors under debug builds that indicate if there were any ThreadSafeWeakPtrs in existence when we free the WorkerPrivate.
There's definitely a continuum related to worker shutdown between "everything needs to be 100% perfectly cleaned up when the worker fully terminates" versus "things can still be happening but they need to turn into no-ops if something dangerous would happen". I'm leaning towards the former because in my experience the latter approach frequently results in moving parts that don't handle the error edge cases resulting in very complex failure modes that are hard to reason about.
That said, I think it is potentially quite reasonable for us to only be strict about WorkerPrivate* but to potentially be more forgiving for nsIGlobalObject. That said, I do think we probably want nsIGlobalObject::CreateKeepAlive or something that maps to a StrongWorkerRef/others so that we can also raise the bar for object lifecycle cleanup everywhere in general, not just on workers. Noting that for windows things are more complex where the global can be dead but the agent is still alive and DOM objects potentially still can be used for stuff in that situation.

Comment 15

•

4 years ago

•

Edited

It sounds like the StrongWorkerRef is being released too early. It's not necessary to drop the StrongWorkerRef when the callback happens

I can make StrongWorkerRef live longer, but it's not that LockRequestChild itself needs WorkerRef, and doing so also does not solves cases like bug 1752075. (There was an issue, but having a dangling WorkerPrivate whenever WorkerGlobalScope lives longer is still confusing).

Edit: I'll try IPCWorkerRef, thanks! Reading https://searchfox.org/mozilla-central/rev/8b46752d1e583b2f817c451f93ba515fb865554d/dom/workers/WorkerRef.h#83-98 ...

Kagami Rosylight [:saschanaz] (they/them) (OOO until 2025-08-18)

Comment 16

•

4 years ago

Attached file Bug 1752120 - Release WorkerRef in LockRequestChild::ActorDestroy r=asuth,smaug (obsolete) — Details

Updated

•

4 years ago

Assignee: nobody → krosylight

Status: NEW → ASSIGNED

BugBot [:suhaib / :marco/ :calixte]

Updated

•

4 years ago

Attachment #9262178 - Attachment description: Bug 1752120 - Use IPCWorkerRef for LockRequest r=asuth,smaug → Bug 1752120 - Release WorkerRef in LockRequestChild::ActorDestroy r=asuth,smaug

Comment 17

•

4 years ago

:saschanaz, since this bug contains a bisection range, could you fill (if possible) the regressed_by field?
For more information, please visit auto_nag documentation.

Flags: needinfo?(krosylight)

Eden Chuang[:edenchuang]

Comment 18

•

4 years ago

This is a similar issue when landing bug 1744025.

In basic, we assume that WorkerPrivate::mSelfRef is the last RefPtr and will be released in WorkerFinishedRunnable/TopLevelWorkerFinishedRunnable::Run().

WorkerGlobalScopeBase has a CheckUnsafePtr<WorkerPrivate> member. However, WorkerGlobalScopeBase's life cycle is out of WorkerPrivate.
So everywhere code keeps a WorkerGlobalScopeBase, especially when it binds to web API implementations, is potential that has the same problem if it doesn't release the WorkerGlobalScopeBase before WorkerFinishedRunnable::Run().

According to the code here, before the WorkerThreadPrimaryRunnable schedule a TopLevelWorkerFinishedRunnable/WorkerFinishedRunnable, it would schedule a GC to break things related JS objects. And it seems to be the last chance to make sure the corresponding WorkerGlobalScopeBase::mWorkerPrivate is set as nullptr, otherwise, we will meet the same core-dump stack again.

Since the WorkerGlobalScopeBase's life cycle is out of WorkerPrivate, I am not sure if it is a correct idea, but it makes me consider that it should not be a member of WorkerGlobalScopeBase. We probably should query it by GetCurrentThreadWorkerPrivate() when needed and keep it with RefPtr to make sure the WorkerPrivate keeps alive during the functionality.

Kagami Rosylight [:saschanaz] (they/them) (OOO until 2025-08-18)

Updated

•

4 years ago

Rank: 1

Flags: needinfo?(krosylight)

Regressed by: 1744025

Kagami Rosylight [:saschanaz] (they/them) (OOO until 2025-08-18)

Updated

•

4 years ago

Rank: 1

BugBot [:suhaib / :marco/ :calixte]

Comment 19

•

4 years ago

•

Edited

(In reply to Eden Chuang[:edenchuang] from comment #18)

Since the WorkerGlobalScopeBase's life cycle is out of WorkerPrivate, I am not sure if it is a correct idea, but it makes me consider that it should not be a member of WorkerGlobalScopeBase. We probably should query it by GetCurrentThreadWorkerPrivate() when needed and keep it with RefPtr to make sure the WorkerPrivate keeps alive during the functionality.

When the WorkerPrivate is getting deleted, no one should have any references to the WorkerGlobalScopeBase because there will no longer be any (safe) place for code to run. The conceptual event target is gone. It's only because of other bugs/invariant violations right now that runnables can end up running on the bare WorkerThread nsThread like what we witnessed in bug 1743671. (There are some follow-ups identified after discussion with :nika that we can use to help address this problem, but the underlying lifecycle issues still stand.)

This problem cascades into GetCurrentThreadWorkerPrivate() which under the hood uses CycleCollectedJSContext::Get's thread-local-storage that assumes we don't have the above problem. Because workers can get misused, if we used GetCurrentThreadWorkerPrivate there's then a potential for an entirely different WorkerPrivate global to be (mis-)used.

Aside: I have a reply on the phabricator revision that I'm still working on and hope to finish up after lunch.

Updated

•

4 years ago

Has Regression Range: --- → yes

Kagami Rosylight [:saschanaz] (they/them) (OOO until 2025-08-18)

Comment 20

•

4 years ago

•

Edited

When the WorkerPrivate is getting deleted, no one should have any references to the WorkerGlobalScopeBase

Sounds like it can have better assertion - Check the existence of scope when WorkerPrivate is being destroyed? Could be even better if the current assertion message can say which classes have the dangling reference, but not sure it's plausible.

Assignee

Comment 21

•

4 years ago

(In reply to Jens Stutte [:jstutte] from comment #6)

:asuth, after we talked I am a bit less sure if this merits a sec-high, actually. IIUC this indicates only, that we leave a weak pointer around that could be accessed later, but we have no evidence that this actually happens.

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #7)

The specific circumstances this is happening under are unlikely to be exploitable (we don't perform XPCOM shutdown of content processes on release), but something interesting is potentially happening here. Unfortunately although the new CheckedUnsafePtr assertion is helpfully indicative of a problem going on, it's also on the level of the famous Simpsons Car Gone! car alarm in terms of providing next steps.

Hi Daniel, in this light would sec-moderate be more appropriate here for triage?

Flags: needinfo?(dveditz)

Daniel Veditz [:dveditz]

Updated

•

4 years ago

Flags: needinfo?(dveditz)

Keywords: sec-high → sec-moderate

Assignee

Updated

•

4 years ago

Severity: -- → S3

Priority: -- → P2

Assignee

Comment 22

•

4 years ago

Attached file Bug 1752120: Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker ends. r?#dom-worker-reviewers — Details

Assignee

Comment 23

•

4 years ago

•

Edited

After talking with :asuth, this patch would just help to:

moot the CheckedUnsafePtr check for dangling WorkerGlobalScope objects
get a nullptr crash in case someone tries to really access that mWorkerPrivate after the worker died, indicating a real abuse

It is not meant to be a solution for the WorkerGlobalScope lifecycle issues themselves.

Updated

•

4 years ago

Attachment #9263298 - Attachment description: Bug 1752120: Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker ends. r?#dom-worker-reviewers → WIP: Bug 1752120: Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker ends.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

4 years ago

Attachment #9263298 - Attachment description: WIP: Bug 1752120: Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker ends. → Bug 1752120: Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker ends. r?#dom-worker-reviewers

Comment 24

•

3 years ago

Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker ends. r=dom-worker-reviewers,asuth
https://hg.mozilla.org/integration/autoland/rev/c0922a11dfc958b24a5febb219dae8e556beab4c
https://hg.mozilla.org/mozilla-central/rev/c0922a11dfc9

Group: dom-core-security → core-security-release

Status: ASSIGNED → RESOLVED

Closed: 3 years ago

status-firefox99: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 99 Branch

Bugmon [:jkratzer for issues]

Comment 25

•

3 years ago

Bugmon Analysis
Verified bug as fixed on rev mozilla-central 20220216214005-d0676cb0864b.
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.

Status: RESOLVED → VERIFIED

status-firefox99: fixed → verified

Keywords: bugmon

Assignee

Comment 26

•

3 years ago

Though we mitigated the consequences I think we still want to reason about the other patch here: https://phabricator.services.mozilla.com/D137792

:saschanaz, or do you prefer to spin of a new bug for that one?

Status: VERIFIED → REOPENED

Flags: needinfo?(krosylight)

Resolution: FIXED → ---

Kagami Rosylight [:saschanaz] (they/them) (OOO until 2025-08-18)

Assignee

Comment 27

•

3 years ago

Pascal, it might be worth uplifting the mitigation patch to beta? Even though we have no proof that there is a real issue, it gives us some more confidence that nothing bad can happen here. Give me a sign if we should ask for it.

Flags: needinfo?(pascalc)

Comment 28

•

3 years ago

(In reply to Jens Stutte [:jstutte] from comment #26)

Though we mitigated the consequences I think we still want to reason about the other patch here: https://phabricator.services.mozilla.com/D137792

:saschanaz, or do you prefer to spin of a new bug for that one?

If :asuth's argument that "nothing should grab the global scope object while the worker is dying" still holds, I can go rebase and land it here together.

Flags: needinfo?(krosylight) → needinfo?(bugmail)

Assignee

Updated

•

3 years ago

Regressions: 1756172

Comment 29

•

3 years ago

I responded on https://phabricator.services.mozilla.com/D137792. The tl;dr is that LockManager::mOwner holds a strong reference to the worker global scope without holding a WorkerRef to justify holding that reference, which results in the worker global scope outliving the WorkerPrivate destructor. It's probably appropriate to upgrade the WeakWorkerRef to an IPCWorkerRef and have LockManager shutdown drop the mOwner ref.

Given all the time that's passed, it may be that some additional cycle collection traversal is now in place that might also help clean up the LockManager, but I think the changes mentioned in the PR likely still make sense.

Flags: needinfo?(bugmail)

Tom Ritter [:tjr]

Updated

•

3 years ago

Whiteboard: [bugmon:bisected,confirmed] → [bugmon:bisected,confirmed][adv-main99+r]

Kagami Rosylight [:saschanaz] (they/them) (OOO until 2025-08-18)

Updated

•

3 years ago

Attachment #9262178 - Attachment is obsolete: true

Comment 30

•

3 years ago

Closing again per https://phabricator.services.mozilla.com/D137792#4648737.

Status: REOPENED → RESOLVED

Closed: 3 years ago → 3 years ago

Resolution: --- → FIXED

Pascal Chevrel:pascalc (PTO until Sept 2)

Updated

•

3 years ago

Flags: needinfo?(pascalc)

Ryan VanderMeulen [:RyanVM]

Comment 31

•

3 years ago

•

Edited

IIUC, this didn't technically regress in 98, it was an underlying issue exposed during that timeframe. What does that mean for ESR? This patch grafts cleanly there, but we already missed our window to get it into the 91.8esr release :. I guess we'd also need bug 1756172 if we did want to uplift?

status-firefox98: --- → wontfix

status-firefox-esr91: --- → ?

Flags: needinfo?(krosylight)

Flags: in-testsuite+

Kagami Rosylight [:saschanaz] (they/them) (OOO until 2025-08-18)

Comment 32

•

3 years ago

Jens, thoughts? I don't think I understand your patch well enough, so just passing NI.

Assignee: krosylight → jstutte

Flags: needinfo?(krosylight) → needinfo?(jstutte)

Assignee

Comment 33

•

3 years ago

(In reply to Ryan VanderMeulen [:RyanVM] from comment #31)

IIUC, this didn't technically regress in 98, it was an underlying issue exposed during that timeframe.

Yes and no. This specific crash had been introduced by the patch stack that added CheckedUnsafePtr to mWorkerPrivate, but that is just a diagnostic thing without effect in release. These patches are not in ESR. The underlying potential lifecycle problem is older, but it is unknown if it ever caused crashes or other trouble.

What does that mean for ESR? This patch grafts cleanly there, but we already missed our window to get it into the 91.8esr release :. I guess we'd also need bug 1756172 if we did want to uplift?

Yes, without bug 1756172 this patch could cause real crashes in release that weren't present before. So if also those graft cleanly we could think about uplifting, but as said we have no concrete sign that there were ever bad things happening. Moving the ni? to asuth as he might have a better feeling for the importance here.

Flags: needinfo?(jstutte) → needinfo?(bugmail)

Ryan VanderMeulen [:RyanVM]

Comment 34

•

3 years ago

•

Edited

I think the motivating criteria for uplift would be web locks. The basic web locks impl bug 1725734 landed in 93 and web locks was enabled in nightly by bug 1739233 in 96. Since they're not in 91 and will never be uplifted to 91, and we're already quite close to ESR 102, I'd say we shouldn't uplift.

Generally this bug with bug 1756172 is a very nice set of lifecycle improvements, but given our distance from 91 and that we've been iteratively improving things and, as :jstutte says, don't have concrete evidence of bad things, it seems like there's only risk in uplifting these changes as there could be other things we haven't uplifted which would mean ESR91 would be a new thing behaviorally that we haven't seen before.

Flags: needinfo?(bugmail)

Comment 35

•

3 years ago

Thanks for the thoughtful responses everyone!

status-firefox-esr91: ? → wontfix

Kagami Rosylight [:saschanaz] (they/them) (OOO until 2025-08-18)

Updated

•

3 years ago

Attachment #9262178 - Attachment is obsolete: false

Updated

•

3 years ago

Updated

•

3 years ago

Group: core-security-release