Assertion failure: !globalScopeAlive, at /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2106
Categories
(Core :: Graphics: WebGPU, defect)
People
(Reporter: tsmith, Assigned: jimb)
References
(Blocks 2 open bugs, Regression)
Details
(4 keywords, Whiteboard: [bugmon:bisected,confirmed][fuzzblocker])
Attachments
(3 files)
Found while fuzzing m-c 20221217-3ccb0b86ab11 (--enable-debug --enable-fuzzing)
To reproduce via Grizzly Replay:
$ pip install fuzzfetch grizzly-framework
$ python -m fuzzfetch -d --fuzzing -n firefox
$ python -m grizzly.replay ./firefox/firefox testcase.html
Assertion failure: !globalScopeAlive, at /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2106
#0 0x7fee8f8c1c2a in mozilla::dom::workerinternals::(anonymous namespace)::WorkerThreadPrimaryRunnable::Run() /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2106:7
#1 0x7fee8ad85598 in nsThread::ProcessNextEvent(bool, bool*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:1191:16
#2 0x7fee8ad8ba1d in NS_ProcessNextEvent(nsIThread*, bool) /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:476:10
#3 0x7fee8b97d4ea in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:300:20
#4 0x7fee8b89fe28 in MessageLoop::RunInternal() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:381:10
#5 0x7fee8b89fd31 in RunHandler /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:374:3
#6 0x7fee8b89fd31 in MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:356:3
#7 0x7fee8ad80a97 in nsThread::ThreadFunc(void*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:383:10
#8 0x7fee9db23c86 in _pt_root /builds/worker/checkouts/gecko/nsprpub/pr/src/pthreads/ptthread.c:201:5
#9 0x7fee9e3ccb42 in start_thread nptl/pthread_create.c:442:8
#10 0x7fee9e45e9ff misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Reporter
Updated•2 years ago
Comment 2•2 years ago
That assertion means that something living outside the worker kept a reference to the worker's global scope, so it has not been freed by the time the worker ends. If we could get a Pernosco session for this, we will hopefully discover who. Thanks Tyson!
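To make the failure concrete, here is a minimal standalone analogy (not the actual Gecko code; GlobalScope, watcher, and outsideHolder are made-up names, and std::shared_ptr stands in for the refcounted WorkerGlobalScope): the assertion fires when, at the point the worker is shutting down, something outside the worker still holds a strong reference to its global scope.

```cpp
// Standalone analogy, not Gecko code: std::shared_ptr stands in for the
// refcounted WorkerGlobalScope, and `outsideHolder` plays the role of the
// "something living outside the worker" described above.
#include <cassert>
#include <memory>

struct GlobalScope {};  // stand-in for the worker's global scope

int main() {
  auto globalScope = std::make_shared<GlobalScope>();
  std::weak_ptr<GlobalScope> watcher = globalScope;

  // A stray strong reference retained by something outside the worker.
  std::shared_ptr<GlobalScope> outsideHolder = globalScope;

  // Worker teardown: the worker drops its own reference to the global...
  globalScope.reset();

  // ...and then checks that nobody else kept it alive. With `outsideHolder`
  // still in scope, globalScopeAlive is true and the assert fires, which is
  // the analogue of `Assertion failure: !globalScopeAlive`.
  bool globalScopeAlive = !watcher.expired();
  assert(!globalScopeAlive);
  return 0;
}
```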
Comment 3•2 years ago
Verified bug as reproducible on mozilla-central 20230106214742-7968ae37c117.
The bug appears to have been introduced in the following build range:
Start: e6e2286d2ac25001127a1cf54a87a95fb435c734 (20220708093332)
End: 807e95cd9956aa4967ddddc80f8ccab4ad370e8d (20220708081410)
Pushlog: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=e6e2286d2ac25001127a1cf54a87a95fb435c734&tochange=807e95cd9956aa4967ddddc80f8ccab4ad370e8d
Reporter
Comment 4•2 years ago
(In reply to Jens Stutte [:jstutte] from comment #2)
> If we could get a Pernosco session for this, we will hopefully discover who. Thanks Tyson!
You got it :)
A Pernosco session is available here: https://pernos.co/debug/lWL9CW6z65G5DwmTN5-pdQ/index.html
Updated•2 years ago
Comment 5•2 years ago
:nical, since you are the author of the regressor, bug 1750576, could you take a look? Also, could you set the severity field?
For more information, please visit auto_nag documentation.
Comment 6•2 years ago
edit: Ignore the analysis here; it was way too cursory and is wrong.
Wrong analysis for posterity:
I've only dug into this a little, but it seems like the IPC flow with the promise in https://phabricator.services.mozilla.com/D146817 may be okay on its own, such that https://searchfox.org/mozilla-central/source/dom/base/DOMMozPromiseRequestHolder.h isn't necessary. The problem instead appears to be that the PCanvasManager top-level protocol isn't being torn down, so the IPC promise is never automatically rejected. That may mean this depends on :janv's work on bug 1809044 (unless a WorkerRef is being used for the protocol lifecycle maintenance?).
Comment 7•2 years ago
It looks like, past the assert, we do try to safely handle the case where globalScopeAlive is still set:
https://searchfox.org/mozilla-central/rev/fb9a504ca73529fa550efe488db2a012a4bf5169/dom/workers/RuntimeService.cpp#2111-2116
Is that enough to make us safe, or does the fact that it was alive here mean it might be causing problems elsewhere?
Comment 8•2 years ago
I think the failure mode will be one of a safe error/no-op/early return, a safe null de-ref crash, a safe explicit MOZ_CRASH/MOZ_RELEASE_ASSERT, or a safe leak, yes. In particular, because it seems like the top-level protocol is not getting torn down but the event target WorkerThread will be gone, I mainly expect this to result in leaks.
Comment 9•2 years ago
Was there any specific opt-in step taken to run WebGPU on a worker? For the first Nightly milestone we aim to allow WebGPU only on the main thread.
I won't have time to look into this soon, so I'm forwarding the needinfo to Jim to make sure someone on the team has a reminder to keep an eye on it.
Reporter
Comment 10•2 years ago
(In reply to Nicolas Silva [:nical] from comment #9)
> Was there any specific opt-in step taken to run WebGPU on a worker?
This is the prefs.js file that was in use when the issue was found by fuzzers.
Comment 11•2 years ago
I'm not familiar with WebGPU at all, but just looking at the WebIDL for GPUProvider and the code for Instance::RequestAdapter() I can't at a glance see anything that would prevent this from being exposed on workers when dom.webgpu.enabled is set to true.
Reporter
Comment 12•2 years ago
There has been an increase in reports recently (maybe another fix unblocked this?). Marking as fuzzblocker.
Comment 13•2 years ago
(In reply to Andrew McCreight [:mccr8] from comment #11)
> I'm not familiar with WebGPU at all, but just looking at the WebIDL for GPUProvider and the code for Instance::RequestAdapter() I can't at a glance see anything that would prevent this from being exposed on workers when dom.webgpu.enabled is set to true.
Comment 14•2 years ago
It seems we now also get this as an intermittent failure; see bug 1809668. Can we just unhide this to allow for easier duping and such?
Assignee
Comment 15•2 years ago
WebGPU shouldn't be accessible in workers at all, for the time being.
Assignee
Comment 16•2 years ago
Assignee
Comment 17•2 years ago
Updated•2 years ago
Updated•2 years ago
Comment 18•2 years ago
Unhiding to make tracking dupes easier, and Jim has just queued the patch for landing.
Comment 19•2 years ago
Assignee
Updated•2 years ago
Comment 21•2 years ago
bugherder
Comment 22•2 years ago
Verified bug as fixed on rev mozilla-central 20230222094403-3408467a0885.
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.
Comment hidden (Intermittent Failures Robot)
Updated•2 years ago
Comment 25•1 year ago
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #8)
> I think the failure mode will be one of a safe error/no-op/early return, a safe null de-ref crash, a safe explicit MOZ_CRASH/MOZ_RELEASE_ASSERT, or a safe leak, yes. In particular, because it seems like the top-level protocol is not getting torn down but the event target WorkerThread will be gone, I mainly expect this to result in leaks.
Is this not sufficient to tear down the protocol when the worker is shut down? CanvasManagerChild has an IPCWorkerRef and calls Destroy in the worker ref callback, which in turn calls Close on the IPDL actor, along with destroying its own reference:
https://searchfox.org/mozilla-central/rev/e9b338c2d597067f99e96d5f20769f41f312fa8f/gfx/ipc/CanvasManagerChild.cpp#112
Comment 26•1 year ago
(In reply to Andrew Osmond [:aosmond] (he/him) from comment #25)
> Is this not sufficient to tear down the protocol when the worker is shut down? CanvasManagerChild has an IPCWorkerRef and calls Destroy in the worker ref callback, which in turn calls Close on the IPDL actor, along with destroying its own reference:
> https://searchfox.org/mozilla-central/rev/e9b338c2d597067f99e96d5f20769f41f312fa8f/gfx/ipc/CanvasManagerChild.cpp#112
Apologies, my investigation in comment 6 was way too cursory and seems completely wrong; the Pernosco trace (it has now been rebuilt, if anyone wants to look at it; you won't have to wait) shows that the IPCWorkerRef is indeed synchronously tearing down PCanvasManager, which also tears down the PWebGPU it manages.
In general it can be an anti-pattern to drop a non-WeakWorkerRef synchronously inside its callback, because usually the reason the StrongWorkerRef/IPCWorkerRef exists is to keep the worker alive beyond receiving the notification. For example, the holder may want to keep doing IPC for a bit (dom/cache does this, but should not), or it may have bounced runnables to other threads and need to wait for them all to finish up or be mooted. For classes that are able to clean up synchronously when the worker starts shutting down, and that only use an IPCWorkerRef because they don't want to prevent the worker from being GC'ed, WeakWorkerRef may be more appropriate.
That said, it's not immediately clear to me what would have been retaining the reference to the global object, then. Maybe there's a Promise somewhere, and the promise chaining needs to use AddCallbacksWithCycleCollectedArgs so that when the global gets unrooted the callbacks can be cycle collected and drop their reference to the global? If one can figure out which WorkerGlobalScopeBase::AddRef calls don't have paired WorkerGlobalScopeBase::Release calls in the Pernosco trace, there's the answer. Unfortunately it would take a bit of work for me to get pernosco-bridge back into working order to figure that out, but someone eyeballing the stacks of the AddRef/Release calls who knows what WebGPU is doing may be able to figure it out. (Also, there may be other tooling options.)
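To illustrate the Promise hypothesis above, here is another standalone analogy (again not Gecko code; the names and the std::function queue are inventions): a pending reaction that captured a strong reference keeps the global rooted, and if whatever would settle the promise has gone away, that reference is never dropped.

```cpp
// Standalone analogy, not Gecko code: the queued reaction stands in for a
// promise callback that captured the worker global; nothing ever settles the
// promise, so the strong reference it holds is never released.
#include <functional>
#include <iostream>
#include <memory>
#include <vector>

struct GlobalScope {};  // stand-in for WorkerGlobalScope

int main() {
  auto global = std::make_shared<GlobalScope>();
  std::weak_ptr<GlobalScope> watcher = global;

  // Pending reactions for a promise that will never be settled because the
  // other side (say, an abandoned IPC channel) went away.
  std::vector<std::function<void()>> pendingReactions;
  pendingReactions.push_back([strongGlobal = global]() {
    // Would use strongGlobal when the request resolved.
  });

  global.reset();  // the worker drops its own reference at teardown

  // The queued reaction still roots the global: this prints "still alive".
  std::cout << (watcher.expired() ? "released" : "still alive") << "\n";
  return 0;
}
```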
Comment 27•1 year ago
Ah, I see you've moved dropping the WorkerRef to ActorDestroy in https://phabricator.services.mozilla.com/D190831, which is best practice for that, yeah.
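As a rough sketch of that best practice (hedged: the class name is hypothetical, the IPCWorkerRef::Create signature is an approximation, and this is not the actual CanvasManagerChild change from D190831), the idea is that the worker-shutdown callback only starts actor teardown, while the WorkerRef itself is released from ActorDestroy once IPC is really finished:

```cpp
// Hedged sketch, not real Gecko code: assumes a creation API along the lines
// of IPCWorkerRef::Create(workerPrivate, name, callback) and an IPDL-generated
// base class providing Close()/ActorDestroy(); it will not compile standalone.
#include "mozilla/RefPtr.h"
#include "mozilla/dom/WorkerPrivate.h"
#include "mozilla/dom/WorkerRef.h"

class MyCanvasActorChild /* : public PMyCanvasActorChild, refcounted */ {
 public:
  bool Init(mozilla::dom::WorkerPrivate* aWorkerPrivate) {
    mWorkerRef = mozilla::dom::IPCWorkerRef::Create(
        aWorkerPrivate, "MyCanvasActorChild",
        [self = RefPtr{this}]() {
          // Worker is shutting down: begin tearing down the actor, but keep
          // holding mWorkerRef so the worker stays alive until IPC completes.
          self->Close();  // Close() comes from the IPDL-generated base.
        });
    return !!mWorkerRef;
  }

  void ActorDestroy(/* ActorDestroyReason aWhy */) {
    // No further IPC can reach this actor now, so it is safe to let the
    // worker go.
    mWorkerRef = nullptr;
  }

 private:
  RefPtr<mozilla::dom::IPCWorkerRef> mWorkerRef;
};
```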
Comment 28•1 year ago
Just to be clear, though, I don't think that will fix whatever caused the problem in comment 0, so it's probably worth testing the fuzzing test case again, or having the fuzzing team otherwise check whether this problem recurs.
Comment 29•1 year ago
Right, that's why I wrote another patch in the same stack to allow WebGPU to run on DOM workers with a pref flip. The fuzzer can flip it and see if it happens again.
> In general it can be an anti-pattern to drop a non-WeakWorkerRef synchronously inside its callback, because usually the reason the StrongWorkerRef/IPCWorkerRef exists is to keep the worker alive beyond receiving the notification.
I've been working under the belief that I was supposed to free the worker ref synchronously when I get the shutdown callback, because otherwise, if I dispatch, I tend to trip this assert in CI:
https://searchfox.org/mozilla-central/rev/7d77ff808f8407a3e4fc0911779da446c050f9ee/dom/workers/WorkerPrivate.h#549
This happens when holding onto a ThreadSafeWorkerRef.
Comment 30•1 year ago
To be clear, the objects holding onto the ThreadSafeWorkerRef are not DOM objects, just various gfx-related objects, so they can in theory live slightly longer than average, and they don't have any dependencies on GC/JS/etc.
Comment 31•1 year ago
(In reply to Andrew Osmond [:aosmond] (he/him) from comment #29)
> I've been working under the belief that I was supposed to free the worker ref synchronously when I get the shutdown callback, because otherwise, if I dispatch, I tend to trip this assert in CI:
> https://searchfox.org/mozilla-central/rev/7d77ff808f8407a3e4fc0911779da446c050f9ee/dom/workers/WorkerPrivate.h#549
If this assertion is firing it means the WorkerRef is being dropped during cycle collection. (The flag that activates the assertion is only set during the call to GC.) WorkerRefs do need to be dropped deterministically without involving GC/CC.
I'd be interested in seeing any stacks / jobs with logs where you're seeing this if it comes up again or they're readily available. We're definitely concerned about WorkerRef ergonomics so understanding where those aren't good is helpful.
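As a hedged sketch of that rule (the class is hypothetical; only RefPtr, ThreadSafeWorkerRef, and MOZ_ASSERT are real Gecko names), the holder drops its worker ref from an explicit teardown method on the shutdown path, so the release never depends on when GC/CC happens to destroy the holder:

```cpp
// Hedged sketch, not real Gecko code: GfxResourceHolder is hypothetical.
// The worker ref is dropped from an explicit, deterministic Destroy() call
// on the shutdown path; by the time GC/CC finally destroys the holder, the
// ref must already be gone, so no WorkerRef release happens inside GC/CC.
#include "mozilla/Assertions.h"     // MOZ_ASSERT
#include "mozilla/RefPtr.h"
#include "mozilla/dom/WorkerRef.h"  // ThreadSafeWorkerRef

class GfxResourceHolder {
 public:
  void Destroy() {
    // Called deterministically (e.g. from the worker ref callback or from
    // actor teardown), never from garbage/cycle collection.
    mWorkerRef = nullptr;
  }

  ~GfxResourceHolder() {
    // If this fires, the ref is being dropped non-deterministically, which
    // is what the WorkerPrivate.h assertion linked above guards against.
    MOZ_ASSERT(!mWorkerRef);
  }

 private:
  RefPtr<mozilla::dom::ThreadSafeWorkerRef> mWorkerRef;
};
```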
(In reply to Andrew Osmond [:aosmond] (he/him) from comment #30)
> To be clear, the objects holding onto the ThreadSafeWorkerRef are not DOM objects, just various gfx-related objects, so they can in theory live slightly longer than average, and they don't have any dependencies on GC/JS/etc.
This should be fine, then.