Closed Bug 1744025 Opened 2 years ago Closed 8 months ago

Refactor WorkerPrivate to support CheckedUnsafePtr and convert existing raw class member pointers

Categories

(Core :: DOM: Workers, task)

task

Tracking

()

RESOLVED FIXED

People

(Reporter: jstutte, Assigned: edenchuang)

References

(Blocks 1 open bug)

Details

Attachments

(11 files, 1 obsolete file)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review

We currently have many long-living WorkerPrivate* mWorkerPrivate; class variables in our code. It is difficult to always oversee their lifecycle and to guarantee that the referred WorkerPrivate instance is still alive.

Adding thread-safe ref-counting to WorkerPrivate will help to get rid of some if not all of them and might eventually help us to remove some flavor of WorkerRef, too.

This is delicate as adding ref-counts will increase the probability of leaks but it might help to detect unwanted/unhandled circles in our references that are currently invisible.

Keywords: leave-open
Assignee: nobody → jstutte
Status: NEW → ASSIGNED
Attachment #9253470 - Attachment description: Bug 1744025: Use RefPtr for WorkerGlobalScopeBase::mWorkerPrivate. r=#dom-worker-reviewers → Bug 1744025: Use RefPtr<WorkerPrivate> in WorkerGlobalScopeBase. r=#dom-worker-reviewers

Many of the WorkerPrivate::GetParent uses look like this:

    // Walk up to our containing page
    WorkerPrivate* wp = aWorkerPrivate;
    while (wp->GetParent()) {
      wp = wp->GetParent();
    }

    nsCOMPtr<nsPIDOMWindowInner> window = wp->GetWindow();

while we already have WorkerPrivate::GetAncestorWindow. I assume we can use this everywhere even though it does an additional:

  if (mLoadInfo.mWindow) {
    return mLoadInfo.mWindow;
  }

before walking the parents?

Flags: needinfo?(echuang)

Short proposal: Let's use CheckedUnsafePtr with checking always enabled in nightly to replace the raw WorkerPrivate uses.

Longer Proposal and Rationale

Currently we canonically expect there to be 2 strong references to WorkerPrivate:

  1. The WorkerPrivate held by the owner of the worker so that it can message the worker and so that it can terminate it. This strong reference is dropped immediately upon requesting termination.
  2. The self-reference held by the worker as long as it is alive.
    • This self-reference also then gets registered with the runtime service largely for being able to expose the worker for debugging. Because of the existence of nested workers, this gets ugly and arguably it could be appropriate for the runtime service to be given its own strong reference. Attaching a debugger could then also use that existing strong reference to acquire a specific strong reference for the debugger.

Where I think we've historically run into trouble is that the implementation has held raw pointers to the WorkerPrivate which have conceptually been borrowing the self-reference's strong reference. The fatal flaw in this approach is that any kind of logic bug where something outlives the global potentially turns into a UAF.

We faced a similar situation with IndexedDB, which led to the creation of CheckedUnsafePtr and which seems suitable for use in this case as well. Here we are primarily concerned that it's indicative of a logic bug if we use these borrowed pointers beyond their use point.

This is preferable to converting the raw pointers to RefPtrs, as

  1. It's very much our intent that all of the accesses that are being done are being done under the auspices of a specific owning strong-reference.
  2. Marking the WorkerPrivate as threadsafe refcounted and standardizing use of RefPtr's to WorkerPrivate when there should only be the two strong references potentially throws away a lot of the good intent behind the current strategy. (Reiterating that the current strategy has serious issues because of UAFs).

SafeRefPtr could also potentially accomplish this as it requires intent to create new references, but as you remarked in chat, it has the logistical problem that it's a much larger initial set of changes.

Another option would be to introduce some other kind of holding class (or reuse something like nsMainThreadPtrHolder in nsProxyRelease.h that supports other threads) that could allow ref-counted pointers that reuse the underlying strong reference so that we can keep that strong reference alive to avoid UAFs, but to also allow us to add some assertions/checks. Because it's definitely the case that there are logic bugs where these borrowed pointers are used too late and we want to detect that, not just paper over them.

Meta: More things should be using nsCOMPtr<nsIGlobalObject>

nsIGlobalObject exists to allow code that wants to work with a global to not have to have a separate code path for windows and workers (plus there are more global types than just those!). There are a lot of cases where code is accessing WorkerPrivate when it really wants to be accessing nsIGlobalObject.

But it should be noted that in the nsIGlobalObject case, all the reasons that code already needs to use WorkerRefs still apply. Which is why in https://phabricator.services.mozilla.com/D125555#4135011 I proposed (and I think brought up elsewhere with :smaug? Maybe Matrix?) that nsIGlobalObject should likely have a generic concept that uses WorkerRefs on workers and something else on windows so that:

  1. Code can indicate that it currently needs the global / its underpinnings to be kept alive, at least until...
  2. It's notified that the global is terminating, at which point the code is responsible for cleaning up as quickly as possible and dropping its reference to the keepalive.

(Note that windows and nsIGlobalObject do already somewhat have a concept of this through DOMEventTargetHelper, but its semantics don't line up exactly.)

When performing cleanups where it might seem like we want more code to be holding a strong reference to WorkerPrivate, we really want them to be holding a reference to nsIGlobalObject and/or a keepalive.

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #6)

Short proposal: Let's use CheckedUnsafePtr with checking always enabled in nightly to replace the raw WorkerPrivate uses.

OK, changing this bug to this scope then.

Meta: More things should be using nsCOMPtr<nsIGlobalObject>

nsIGlobalObject exists to allow code that wants to work with a global to not have to have a separate code path for windows and workers (plus there are more global types than just those!). There are a lot of cases where code is accessing WorkerPrivate when it really wants to be accessing nsIGlobalObject.

But it should be noted that in the nsIGlobalObject case, all the reasons that code already needs to use WorkerRefs still apply. Which is why in https://phabricator.services.mozilla.com/D125555#4135011 I proposed (and I think brought up elsewhere with :smaug? Maybe Matrix?) that nsIGlobalObject should likely have a generic concept that uses WorkerRefs on workers and something else on windows so that:

  1. Code can indicate that it currently needs the global / its underpinnings to be kept alive, at least until...
  2. It's notified that the global is terminating, at which point the code is responsible for cleaning up as quickly as possible and dropping its reference to the keepalive.

(Note that windows and nsIGlobalObject do already somewhat have a concept of this through DOMEventTargetHelper, but its semantics don't line up exactly.)

When performing cleanups where it might seem like we want more code to be holding a strong reference to WorkerPrivate, we really want them to be holding a reference to nsIGlobalObject and/or a keepalive.

This sounds like an additional action to take and merits probably its own bug. This bug might want to note suspicious uses with // XXX: comments, though.

Summary: Refactor WorkerPrivate to be thread-safe ref-counted → Refactor WorkerPrivate to support CheckedUnsafePtr and convert existing raw class member pointers
Attachment #9253469 - Attachment description: Bug 1744025: Make WorkerPrivate thread-safe ref-counted. r=#dom-worker-reviewers → Bug 1744025: Make WorkerPrivate support CheckedUnsafePtr. r=#dom-worker-reviewers
Attachment #9253470 - Attachment description: Bug 1744025: Use RefPtr<WorkerPrivate> in WorkerGlobalScopeBase. r=#dom-worker-reviewers → Bug 1744025: Use CheckedUnsafePtr<WorkerPrivate> in WorkerGlobalScopeBase. r=#dom-worker-reviewers
Attachment #9253471 - Attachment description: Bug 1744025: Use RefPtr for WorkerPrivate in WorkerPrivate.h almost everywhere. r=#dom-worker-reviewers → Bug 1744025: Use CheckedUnsafePtr<WorkerPrivate> in WorkerPrivate.h. r=#dom-worker-reviewers
Attachment #9253472 - Attachment description: Bug 1744025: Use RefPtr for WorkerPrivate in WorkerDebugger. r=#dom-worker-reviewers → Bug 1744025: Use CheckedUnsafePtr<WorkerPrivate> in WorkerDebugger. r=#dom-worker-reviewers
Attachment #9253471 - Attachment description: Bug 1744025: Use CheckedUnsafePtr<WorkerPrivate> in WorkerPrivate.h. r=#dom-worker-reviewers → Bug 1744025: Use CheckedUnsafePtr<WorkerPrivate> in WorkerPrivate. r=#dom-worker-reviewers

(In reply to Jens Stutte [:jstutte] from comment #7)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #6)

Meta: More things should be using nsCOMPtr<nsIGlobalObject>

This sounds like an additional action to take and merits probably its own bug. This bug might want to note suspicious uses with // XXX: comments, though.

I filed bug 1744251 for this.

See Also: → 1744251
Attachment #9253471 - Attachment description: Bug 1744025: Use CheckedUnsafePtr<WorkerPrivate> in WorkerPrivate. r=#dom-worker-reviewers → Bug 1744025: Use CheckedUnsafePtr<WorkerPrivate> in WorkerPrivate where possible. r=#dom-worker-reviewers
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2c8941fdcee5
Make WorkerPrivate support CheckedUnsafePtr. r=dom-worker-reviewers,asuth
https://hg.mozilla.org/integration/autoland/rev/26d3176a00ef
Use CheckedUnsafePtr<WorkerPrivate> in WorkerGlobalScopeBase. r=dom-worker-reviewers,asuth
https://hg.mozilla.org/integration/autoland/rev/e9b749f80ec2
Use CheckedUnsafePtr<WorkerPrivate> in WorkerPrivate where possible. r=dom-worker-reviewers,asuth
https://hg.mozilla.org/integration/autoland/rev/e6b99af4b7ca
Use CheckedUnsafePtr<WorkerPrivate> in WorkerDebugger. r=dom-worker-reviewers,asuth
https://hg.mozilla.org/integration/autoland/rev/e1bc4e2066c7
Remove only assertion-relevant mWorkerPrivate from FetchBody. r=dom-worker-reviewers,edenchuang
https://hg.mozilla.org/integration/autoland/rev/913cdee4824f
Use CheckedUnsafePtr<WorkerPrivate> in Notification. r=dom-worker-reviewers,edenchuang
https://hg.mozilla.org/integration/autoland/rev/46e9f06636da
Use CheckedUnsafePtr<WorkerPrivate> in PerformanceWorker. r=dom-worker-reviewers,edenchuang
https://hg.mozilla.org/integration/autoland/rev/b6452430763d
Use CheckedUnsafePtr<WorkerPrivate> in WorkerEventTarget. r=dom-worker-reviewers,edenchuang

Backed out for causing bp hybrid build bustages in dom/clients/api/Clients.cpp

Backout link: https://hg.mozilla.org/integration/autoland/rev/d675762911a779c1765e144795c2a5294454ee09

Push with failures

Failure log build bustage

ERROR -  /builds/worker/checkouts/gecko/dom/clients/api/Clients.cpp:225:36: error: member access into incomplete type 'mozilla::dom::WorkerGlobalScope'
[task 2021-12-07T23:15:47.594Z] 23:15:47     INFO -    if (!workerPrivate->GlobalScope()->WindowInteractionAllowed()) {
[task 2021-12-07T23:15:47.595Z] 23:15:47     INFO -                                     ^
[task 2021-12-07T23:15:47.595Z] 23:15:47     INFO -  /builds/worker/workspace/obj-build/dist/include/mozilla/dom/WorkerPrivate.h:71:7: note: forward declaration of 'mozilla::dom::WorkerGlobalScope'
[task 2021-12-07T23:15:47.595Z] 23:15:47     INFO -  class WorkerGlobalScope;

LE:
Failure log wpt

Flags: needinfo?(jstutte)

Those bustages need some reasoning about the include order. I will come to this next week, keeping the ni?.

The build error is caused by the patch https://phabricator.services.mozilla.com/D132709.
#include "mozilla/dom/WorkerScope.h" is removed from WorkerPrivate.h. It make Client.cpp has no complete type information about WorkerScope.

New patch submitted for the build errors by replacing #include "mozilla/dom/WorkerPrivate.h" with #include "mozilla/dom/WorkerScope.h" in cpp files where using WorkerPrivate::GlobalScope()

Flags: needinfo?(echuang)
Pushed by echuang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/43186fbbf8b4
Make WorkerPrivate support CheckedUnsafePtr. r=dom-worker-reviewers,asuth
https://hg.mozilla.org/integration/autoland/rev/5598943a94fd
Use CheckedUnsafePtr<WorkerPrivate> in WorkerGlobalScopeBase. r=dom-worker-reviewers,asuth
https://hg.mozilla.org/integration/autoland/rev/cfb822df5b3f
Use CheckedUnsafePtr<WorkerPrivate> in WorkerPrivate where possible. r=dom-worker-reviewers,asuth
https://hg.mozilla.org/integration/autoland/rev/8abd6ba69815
Use CheckedUnsafePtr<WorkerPrivate> in WorkerDebugger. r=dom-worker-reviewers,asuth
https://hg.mozilla.org/integration/autoland/rev/a5edfa1c9cd6
Remove only assertion-relevant mWorkerPrivate from FetchBody. r=dom-worker-reviewers,edenchuang
https://hg.mozilla.org/integration/autoland/rev/af171636a87f
Use CheckedUnsafePtr<WorkerPrivate> in Notification. r=dom-worker-reviewers,edenchuang
https://hg.mozilla.org/integration/autoland/rev/1580a4ea1a85
Use CheckedUnsafePtr<WorkerPrivate> in PerformanceWorker. r=dom-worker-reviewers,edenchuang
https://hg.mozilla.org/integration/autoland/rev/7aa395dcdbe4
Use CheckedUnsafePtr<WorkerPrivate> in WorkerEventTarget. r=dom-worker-reviewers,edenchuang
https://hg.mozilla.org/integration/autoland/rev/89dca4fc5940
Replace include "mozilla/dom/WorkerPrivate.h" with include "mozilla/dom/WorkerScope.h" where WorkerPrivate->GlobalScope() is called. r=dom-worker-reviewers,smaug,jstutte

Backed out for causing wpt failures on CheckedUnsafePtr.h

Backout link: https://hg.mozilla.org/integration/autoland/rev/ddf7a6c0a118b33abaab02b44522612d02a61b48

Push with failures: https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=T3yGSafuS4i-OpLAig3yFQ.0&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel&revision=89dca4fc59400792ca29993f169a5acbc025d9ed

Failure log: https://treeherder.mozilla.org/logviewer?job_id=361346297&repo=autoland&lineNumber=7875

Failure line: SUMMARY: AddressSanitizer: SEGV /builds/worker/workspace/obj-build/dist/include/mozilla/dom/quota/CheckedUnsafePtr.h:247:31 in NotifyCheckFailure

Flags: needinfo?(echuang)

Whohoo, that is at least confirming that CheckedUnsafePtr helps finding issues. I am a bit puzzled why this did not happen on our earlier try runs, though. The Push seems to be quite buggy, too (on several jobs I read Log parsing failed. Unable to generate failure summary.). Not sure if we should just give it a second try?

Flags: needinfo?(jstutte)

The Push seems to be quite buggy, too (on several jobs I read Log parsing failed. Unable to generate failure summary.).

I guess that's something to do with today's AWS outage.

test-linux1804-64-asan-qr/opt-web-platform-tests-e10s wpt1 happened five times with proper logs and I think that was enough for a backout.

It definitely looks like CheckedUnsafePtr caught a case where a CheckedUnsafePtr outlived the worker dropping its self-ref. The fact that this happened in xpcom-shutdown-threads suggests this might not be a content global but instead something like the OS.File global and that ASAN performance characteristics slowed things down to shuffle thread scheduling. (edit: Although this was a content process which inherently has different shutdown characteristics than the parent, so maybe that's also related and maybe it was a content global that just didn't have a lot of time to shutdown before XPCOM shutdown.) We definitely need to investigate and/or add more instrumentation before relanding, and should probably target a bunch of ASAN jobs like this as good candidates to be introducing a chaos mode factor.

There are some r+ patches which didn't land and no activity in this bug for 2 weeks.
:jstutte, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(jstutte)

(In reply to Release mgmt bot [:sylvestre / :calixte / :marco for bugbug] from comment #23)

There are some r+ patches which didn't land and no activity in this bug for 2 weeks.
:jstutte, could you have a look please?
For more information, please visit auto_nag documentation.

Eden?

Flags: needinfo?(jstutte)

Since these promises are also holding the GlobalObject(ServiceWorkerGlobalScope), they are needed to be added into the CycleCollection checking.
Otherwise, ServiceWorkerGlobalScope can not be unrooted as expected.

Depends on D133483

Flags: needinfo?(echuang)
Pushed by echuang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/dfb7e4b76f92
Make WorkerPrivate support CheckedUnsafePtr. r=dom-worker-reviewers,asuth
https://hg.mozilla.org/integration/autoland/rev/cbe7670b4ae5
Use CheckedUnsafePtr<WorkerPrivate> in WorkerGlobalScopeBase. r=dom-worker-reviewers,asuth
https://hg.mozilla.org/integration/autoland/rev/968efcd33efb
Use CheckedUnsafePtr<WorkerPrivate> in WorkerPrivate where possible. r=dom-worker-reviewers,asuth
https://hg.mozilla.org/integration/autoland/rev/c56f48bc403a
Use CheckedUnsafePtr<WorkerPrivate> in WorkerDebugger. r=dom-worker-reviewers,asuth
https://hg.mozilla.org/integration/autoland/rev/06f11a04cc8a
Remove only assertion-relevant mWorkerPrivate from FetchBody. r=dom-worker-reviewers,edenchuang
https://hg.mozilla.org/integration/autoland/rev/38ebc27679a4
Use CheckedUnsafePtr<WorkerPrivate> in Notification. r=dom-worker-reviewers,edenchuang
https://hg.mozilla.org/integration/autoland/rev/27c8e30bf969
Use CheckedUnsafePtr<WorkerPrivate> in PerformanceWorker. r=dom-worker-reviewers,edenchuang
https://hg.mozilla.org/integration/autoland/rev/b31f930a6e32
Use CheckedUnsafePtr<WorkerPrivate> in WorkerEventTarget. r=dom-worker-reviewers,edenchuang
https://hg.mozilla.org/integration/autoland/rev/a362ed940cd7
Replace include "mozilla/dom/WorkerPrivate.h" with include "mozilla/dom/WorkerScope.h" where WorkerPrivate->GlobalScope() is called. r=dom-worker-reviewers,smaug,jstutte
https://hg.mozilla.org/integration/autoland/rev/35a19eee470b
Adding FetchEvent.mHandled FetchEvent.mPreloadResponse into CycleCollection checking. r=dom-worker-reviewers,smaug
Blocks: 1751790

In an intermediate version of D132710 we wanted to use RefPtr and in AutoSyncLoopHolder::Run the local WP
variable would have kept alive the WorkerPrivate during the sync loop. But downgrading for now to CheckedUnsafePtr
does not give us any lifecycle guarantees here anymore, so the naming was inappropriate.

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2ae62a58f031
Restore original naming of local WP variable in AutoSyncLoopHolder::Run r=dom-worker-reviewers,smaug
Depends on: 1752856
Regressions: 1752856

I think that the remaining work has been/will be done by Eden.

Assignee: jstutte → echuang
See Also: → 1741869

Comment on attachment 9304871 [details]
Bug 1744025 - Print out the creation stack and the last assignment stack of CheckedUnsafePtr when it is unsafe. r=#dom-worker-reviewers

Revision D162823 was moved to bug 1789399. Setting attachment 9304871 [details] to obsolete.

Attachment #9304871 - Attachment is obsolete: true
Blocks: 1744251

The leave-open keyword is there and there is no activity for 6 months.
:edenchuang, maybe it's time to close this bug?
For more information, please visit BugBot documentation.

Flags: needinfo?(echuang)

Remove the leave-open keyword and set it as resolved-worksforme.

We don't have any further implementation for CheckedUnsafePtr here.

Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Flags: needinfo?(echuang)
Keywords: leave-open
Resolution: --- → WORKSFORME

(Changing to fixed since a number of patches did land here.)

Resolution: WORKSFORME → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: