Closed Bug 1808820 Opened 1 year ago Closed 1 year ago

Assertion failure: !globalScopeAlive, at /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2106

Categories

(Core :: Graphics: WebGPU, defect)

Tracking

VERIFIED FIXED
112 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox108 --- disabled
firefox109 --- disabled
firefox110 --- disabled
firefox111 --- disabled
firefox112 --- fixed

People

(Reporter: tsmith, Assigned: jimb)

References

(Blocks 2 open bugs, Regression)

Details

(4 keywords, Whiteboard: [bugmon:bisected,confirmed][fuzzblocker])

Attachments

(3 files)

Attached file testcase.html

Found while fuzzing m-c 20221217-3ccb0b86ab11 (--enable-debug --enable-fuzzing)

To reproduce via Grizzly Replay:

$ pip install fuzzfetch grizzly-framework
$ python -m fuzzfetch -d --fuzzing -n firefox
$ python -m grizzly.replay ./firefox/firefox testcase.html

Assertion failure: !globalScopeAlive, at /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2106

#0 0x7fee8f8c1c2a in mozilla::dom::workerinternals::(anonymous namespace)::WorkerThreadPrimaryRunnable::Run() /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2106:7
#1 0x7fee8ad85598 in nsThread::ProcessNextEvent(bool, bool*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:1191:16
#2 0x7fee8ad8ba1d in NS_ProcessNextEvent(nsIThread*, bool) /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:476:10
#3 0x7fee8b97d4ea in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:300:20
#4 0x7fee8b89fe28 in MessageLoop::RunInternal() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:381:10
#5 0x7fee8b89fd31 in RunHandler /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:374:3
#6 0x7fee8b89fd31 in MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:356:3
#7 0x7fee8ad80a97 in nsThread::ThreadFunc(void*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:383:10
#8 0x7fee9db23c86 in _pt_root /builds/worker/checkouts/gecko/nsprpub/pr/src/pthreads/ptthread.c:201:5
#9 0x7fee9e3ccb42 in start_thread nptl/pthread_create.c:442:8
#10 0x7fee9e45e9ff  misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Flags: in-testsuite?

Jens, what does this assertion mean? Thanks.

Flags: needinfo?(jstutte)

That assertion means that something living outside the worker kept a reference to the worker's global scope, so it was not freed before the worker ended. If we could get a Pernosco session for this, we would hopefully discover who. Thanks Tyson!
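For illustration, here is a minimal standalone sketch of what the assertion checks, using std::weak_ptr as a stand-in for the weak reference the worker runtime holds; the type and names are invented for the sketch, not Gecko's:

#include <cassert>
#include <memory>

// Illustrative stand-in for the worker's global scope object.
struct GlobalScope {};

int main() {
  // The worker thread holds what should be the last strong reference.
  auto globalScope = std::make_shared<GlobalScope>();

  // Take a weak reference so that, after shutdown, we can check whether
  // anything outside the worker is still keeping the scope alive.
  std::weak_ptr<GlobalScope> globalScopeAlive = globalScope;

  // Worker shutdown: the worker drops its own strong reference.
  globalScope.reset();

  // If an outside object leaked a strong reference, the weak pointer
  // would still resolve and this debug assertion would fire, the
  // analogue of `Assertion failure: !globalScopeAlive`.
  assert(globalScopeAlive.expired());
  return 0;
}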

Flags: needinfo?(jstutte) → needinfo?(twsmith)

Verified bug as reproducible on mozilla-central 20230106214742-7968ae37c117.
The bug appears to have been introduced in the following build range:

Start: e6e2286d2ac25001127a1cf54a87a95fb435c734 (20220708093332)
End: 807e95cd9956aa4967ddddc80f8ccab4ad370e8d (20220708081410)
Pushlog: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=e6e2286d2ac25001127a1cf54a87a95fb435c734&tochange=807e95cd9956aa4967ddddc80f8ccab4ad370e8d

Keywords: regression
Whiteboard: [bugmon:bisected,confirmed]

(In reply to Jens Stutte [:jstutte] from comment #2)

If we could get a Pernosco session for this, we would hopefully discover who. Thanks Tyson!

You got it :)

A Pernosco session is available here: https://pernos.co/debug/lWL9CW6z65G5DwmTN5-pdQ/index.html

Flags: needinfo?(twsmith)

:nical, since you are the author of the regressor, bug 1750576, could you take a look? Also, could you set the severity field?

For more information, please visit auto_nag documentation.

Flags: needinfo?(nical.bugzilla)

edit: Ignore the analysis here; it was way too cursory and is wrong.

Wrong analysis for posterity:

I've only dug into this a little bit, but it seems like the IPC flow with the promise in https://phabricator.services.mozilla.com/D146817 may be okay, such that https://searchfox.org/mozilla-central/source/dom/base/DOMMozPromiseRequestHolder.h isn't necessary. Instead, the problem may be that the PCanvasManager top-level protocol isn't being torn down, so the IPC promise is not being automatically rejected; this may potentially depend on :janv's work on bug 1809044. (Unless a WorkerRef is being used for the protocol lifecycle maintenance?)

It looks like, past the assert, we do try to safely handle the case where globalScopeAlive exists:
https://searchfox.org/mozilla-central/rev/fb9a504ca73529fa550efe488db2a012a4bf5169/dom/workers/RuntimeService.cpp#2111-2116

Is that enough to make us safe, or does the fact that it was alive here mean it might be causing problems elsewhere?

Flags: needinfo?(bugmail)

I think the failure mode will be one of a safe error/no-op/early return, a safe null de-ref crash, a safe explicit MOZ_CRASH/MOZ_RELEASE_ASSERT, or a safe leak, yes. In particular, because it seems like the top-level protocol is not getting torn down but the event target WorkerThread will be gone, I mainly expect this to result in leaks.

Flags: needinfo?(bugmail)

Was there any specific opt-in step taken to run WebGPU on a worker? For the first Nightly milestone we aim to allow WebGPU only on the main thread.

I won't have time to look into this soon, so I'm forwarding the needinfo to Jim to make sure someone on the team keeps an eye on it.

Flags: needinfo?(nical.bugzilla) → needinfo?(jimb)
Attached file prefs.js

(In reply to Nicolas Silva [:nical] from comment #9)

Was there any specific opt-in step taken to run WebGPU on a worker?

This is the prefs.js file that was in use when the issue was found by fuzzers.

I'm not familiar with WebGPU at all, but just looking at the WebIDL for GPUProvider and the code for Instance::RequestAdapter(), I can't see anything at a glance that would prevent this from being exposed on workers when dom.webgpu.enabled is set to true.

There has been an increase in reports recently (maybe another fix unblocked this?). Marking as fuzzblocker.

Whiteboard: [bugmon:bisected,confirmed] → [bugmon:bisected,confirmed][fuzzblocker]

(In reply to Andrew McCreight [:mccr8] from comment #11)

I'm not familiar with WebGPU at all, but just looking at the WebIDL for GPUProvider and the code for Instance::RequestAdapter(), I can't see anything at a glance that would prevent this from being exposed on workers when dom.webgpu.enabled is set to true.

Flags: needinfo?(nical.bugzilla)

It seems we now also get this as an intermittent; see bug 1809668. Can we just unhide this to allow for easier duping and such?

WebGPU shouldn't be accessible in workers at all, for the time being.
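Given the webidl reviewers on the patch that landed below, the fix presumably adjusted the interface's WebIDL exposure rather than adding a runtime check. The standalone C++ sketch below models only the resulting behavior; all names are invented here, not Gecko's binding machinery:

#include <iostream>

// Illustrative model of the decision the fix encodes: an interface is
// offered in a global only if the pref is on AND the global is a Window.
enum class GlobalKind { Window, DedicatedWorker, ServiceWorker };

bool OfferWebGPU(GlobalKind kind, bool domWebGpuEnabled) {
  return domWebGpuEnabled && kind == GlobalKind::Window;
}

int main() {
  // Even with dom.webgpu.enabled flipped on (as the fuzzer's prefs.js
  // did), WebGPU is only offered on the main thread after the fix.
  std::cout << OfferWebGPU(GlobalKind::Window, true) << '\n';           // 1
  std::cout << OfferWebGPU(GlobalKind::DedicatedWorker, true) << '\n';  // 0
  return 0;
}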

Assignee: nobody → jimb
Flags: needinfo?(jimb)
Attachment #9318848 - Attachment description: WIP: Bug 1808820: Don't offer WebGPU in worker scopes. → Bug 1808820: Don't offer WebGPU in worker scopes.
Flags: needinfo?(nical.bugzilla)

Unhiding to make tracking dupes easier, and Jim has just queued the patch for landing.

Group: gfx-core-security
Pushed by jblandy@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/445044d219ad
Don't offer WebGPU in worker scopes. r=webgpu-reviewers,webidl,nical,saschanaz
See Also: → webgpu-workers
Duplicate of this bug: 1809668
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 112 Branch

Verified bug as fixed on rev mozilla-central 20230222094403-3408467a0885.
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.

Status: RESOLVED → VERIFIED
Keywords: bugmon
Regressions: 1818379
Regressions: 1818918
No longer regressions: 1818918
Duplicate of this bug: 1817557

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #8)

I think the failure mode will be one of a safe error/no-op/early return, a safe null de-ref crash, a safe explicit MOZ_CRASH/MOZ_RELEASE_ASSERT, or a safe leak, yes. In particular, because it seems like the top-level protocol is not getting torn down but the event target WorkerThread will be gone, I mainly expect this to result in leaks.

Is this not sufficient to tear down the protocol when the worker is shut down? CanvasManagerChild holds an IPCWorkerRef and calls Destroy in the worker-ref callback, which in turn calls Close on the IPDL actor and drops its own reference:
https://searchfox.org/mozilla-central/rev/e9b338c2d597067f99e96d5f20769f41f312fa8f/gfx/ipc/CanvasManagerChild.cpp#112

See Also: → 1858731

(In reply to Andrew Osmond [:aosmond] (he/him) from comment #25)

Is this not sufficient to tear down the protocol when the worker is shut down? CanvasManagerChild holds an IPCWorkerRef and calls Destroy in the worker-ref callback, which in turn calls Close on the IPDL actor and drops its own reference:
https://searchfox.org/mozilla-central/rev/e9b338c2d597067f99e96d5f20769f41f312fa8f/gfx/ipc/CanvasManagerChild.cpp#112

Apologies, my investigation in comment 6 was way too cursory and seems completely wrong; the Pernosco trace (it's now rebuilt if anyone wants to look at it; you won't have to wait) shows that the IPCWorkerRef is indeed synchronously tearing down PCanvasManager, which also tears down the PWebGPU it manages.

In general it can be an anti-pattern to drop a non-WeakWorkerRef synchronously inside a callback because usually the reason the StrongWorkerRef/IPCWorkerRef exists is to keep the worker alive beyond receiving the notification. That's usually because the holder wants to keep doing IPC for a bit (dom/cache does this, but should not), or because the holder has bounced runnables to other threads and needs to wait for them all to finish up or be mooted. For classes that can clean up synchronously when the worker starts shutting down, and that only hold an IPCWorkerRef because they don't want to prevent the worker from being GC'ed, a WeakWorkerRef may be more appropriate.

That said, it's not immediately clear to me what would have been retaining the reference to the global object, then. Maybe there's a Promise somewhere, and the promise chaining needs to use AddCallbacksWithCycleCollectedArgs so that when the global gets unrooted the callbacks can be cycle collected and drop their reference to the global? If one can figure out which of the WorkerGlobalScopeBase::AddRef calls don't have paired WorkerGlobalScopeBase::Release calls in the Pernosco trace, there's the answer. Unfortunately it would take a bit of work for me to get pernosco-bridge back into working order to figure that out, but someone eyeballing the stacks of the AddRef/Release calls who knows what WebGPU is doing may be able to figure it out. (Also, there may be other tooling options.)
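As a standalone sketch of the retention hypothesis above (a pending promise reaction capturing a strong reference to the global), with all names invented for illustration:

#include <functional>
#include <iostream>
#include <memory>
#include <vector>

// Illustrative stand-in; not a Gecko class.
struct GlobalScope {};

int main() {
  auto global = std::make_shared<GlobalScope>();
  std::weak_ptr<GlobalScope> watcher = global;

  // A pending "promise reaction" whose lambda captures a strong
  // reference to the global. If this list is never drained or cleared
  // (e.g. the promise never settles before worker teardown), the
  // capture alone keeps the global alive.
  std::vector<std::function<void()>> pendingReactions;
  pendingReactions.push_back([global] { /* would use the global here */ });

  global.reset();  // the worker drops its own reference...
  std::cout << std::boolalpha << watcher.expired() << '\n';  // false: leaked

  pendingReactions.clear();  // dropping the reaction releases the global
  std::cout << watcher.expired() << '\n';  // true
  return 0;
}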

Ah, I see you've moved dropping the WorkerRef to ActorDestroy in https://phabricator.services.mozilla.com/D190831, which is best practice for that, yeah.

Just to be clear, though, I don't think that will fix whatever was the cause of the problem in comment 0 and it's probably worth testing the fuzzing test-case again or having the fuzzer team otherwise see if this problem recurs, etc.
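For illustration, a minimal sketch of that ordering: the ref is released from ActorDestroy rather than from the worker-ref shutdown callback, so the actor's IPC teardown completes first. The class shape and names are invented, not the actual D190831 patch:

#include <iostream>
#include <memory>
#include <utility>

// Illustrative stand-ins; not Gecko's IPDL or WorkerRef types.
struct WorkerRef {};

class ActorSketch {
 public:
  explicit ActorSketch(std::shared_ptr<WorkerRef> ref)
      : mWorkerRef(std::move(ref)) {}

  // Called from the worker-ref shutdown callback: start async IPC
  // teardown, but keep holding the ref so the worker stays alive
  // until the actor is actually destroyed.
  void OnWorkerShutdown() { Close(); }

  // Called once IPC teardown for this actor has completed. Now it is
  // safe to release the ref that was pinning the worker.
  void ActorDestroy() {
    mWorkerRef.reset();
    std::cout << "worker ref released after actor teardown\n";
  }

 private:
  void Close() { /* would begin IPDL shutdown; ActorDestroy follows */ }
  std::shared_ptr<WorkerRef> mWorkerRef;
};

int main() {
  ActorSketch actor(std::make_shared<WorkerRef>());
  actor.OnWorkerShutdown();
  actor.ActorDestroy();  // in real code, invoked by the IPC layer
  return 0;
}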

Right, that's why I wrote another patch in the same stack to allow WebGPU to run on DOM workers with a pref flip. The fuzzer can flip it and see if it happens again.

In general it can be an anti-pattern to drop a non-WeakWorkerRef synchronously inside a callback because usually the reason the StrongWorkerRef/IPCWorkerRef exists is to keep the worker alive beyond receiving the notification.

I've been working under the belief that I was supposed to free the worker ref synchronously when I get the shutdown callback, because otherwise, if I dispatch, I tend to trip this assert in CI:
https://searchfox.org/mozilla-central/rev/7d77ff808f8407a3e4fc0911779da446c050f9ee/dom/workers/WorkerPrivate.h#549

This happens when holding onto a ThreadSafeWorkerRef.

To be clear, the objects holding onto the ThreadSafeWorkerRef are not DOM objects, just various gfx-related objects, so they can in theory live slightly longer than average, and they don't have any dependencies on GC/JS/etc.

(In reply to Andrew Osmond [:aosmond] (he/him) from comment #29)

I've been working under the belief that I was supposed to free the worker ref synchronously when I get the shutdown callback, because otherwise, if I dispatch, I tend to trip this assert in CI:
https://searchfox.org/mozilla-central/rev/7d77ff808f8407a3e4fc0911779da446c050f9ee/dom/workers/WorkerPrivate.h#549

If this assertion is firing, it means the WorkerRef is being dropped during cycle collection. (The flag that activates the assertion is only set during the call to GC.) WorkerRefs do need to be dropped deterministically, without involving GC/CC.
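A toy model of that mechanism, with invented names; the flag here is a simplified stand-in for the one described above, not WorkerPrivate's actual member:

#include <cassert>

// Set while the (simulated) GC runs; releasing a worker ref while it
// is set trips the assert, mirroring the linked WorkerPrivate.h check.
bool gInGC = false;

struct WorkerRefSketch {
  void Release() {
    // WorkerRefs must be dropped deterministically, not as a side
    // effect of GC/CC tearing down a cycle.
    assert(!gInGC && "WorkerRef dropped during GC/CC");
  }
};

int main() {
  WorkerRefSketch ref;
  ref.Release();  // fine: explicit, deterministic release

  gInGC = true;   // simulate being inside the GC callback
  // ref.Release();  // would assert, as described above
  gInGC = false;
  return 0;
}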

I'd be interested in seeing any stacks or jobs with logs where you're seeing this, if it comes up again or they're readily available. We're definitely concerned about WorkerRef ergonomics, so understanding where they fall short is helpful.

(In reply to Andrew Osmond [:aosmond] (he/him) from comment #30)

To be clear, the objects holding onto the ThreadSafeWorkerRef are not DOM objects, just various gfx-related objects, so they can in theory live slightly longer than average, and they don't have any dependencies on GC/JS/etc.

This should be fine, then.
