Closed Bug 1808820 Opened 1 year ago Closed 1 year ago

Assertion failure: !globalScopeAlive, at /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2106

Categories

(Core :: Graphics: WebGPU, defect)

Tracking

VERIFIED FIXED
112 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox108 --- disabled
firefox109 --- disabled
firefox110 --- disabled
firefox111 --- disabled
firefox112 --- fixed

People

(Reporter: tsmith, Assigned: jimb)

References

(Blocks 2 open bugs, Regression)

Details

(4 keywords, Whiteboard: [bugmon:bisected,confirmed][fuzzblocker])

Attachments

(3 files)

Attached file testcase.html

Found while fuzzing m-c 20221217-3ccb0b86ab11 (--enable-debug --enable-fuzzing)

To reproduce via Grizzly Replay:

$ pip install fuzzfetch grizzly-framework
$ python -m fuzzfetch -d --fuzzing -n firefox
$ python -m grizzly.replay ./firefox/firefox testcase.html

Assertion failure: !globalScopeAlive, at /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2106

#0 0x7fee8f8c1c2a in mozilla::dom::workerinternals::(anonymous namespace)::WorkerThreadPrimaryRunnable::Run() /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2106:7
#1 0x7fee8ad85598 in nsThread::ProcessNextEvent(bool, bool*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:1191:16
#2 0x7fee8ad8ba1d in NS_ProcessNextEvent(nsIThread*, bool) /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:476:10
#3 0x7fee8b97d4ea in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:300:20
#4 0x7fee8b89fe28 in MessageLoop::RunInternal() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:381:10
#5 0x7fee8b89fd31 in RunHandler /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:374:3
#6 0x7fee8b89fd31 in MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:356:3
#7 0x7fee8ad80a97 in nsThread::ThreadFunc(void*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:383:10
#8 0x7fee9db23c86 in _pt_root /builds/worker/checkouts/gecko/nsprpub/pr/src/pthreads/ptthread.c:201:5
#9 0x7fee9e3ccb42 in start_thread nptl/pthread_create.c:442:8
#10 0x7fee9e45e9ff  misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Flags: in-testsuite?

Jens, what does this assertion mean? Thanks.

Flags: needinfo?(jstutte)

That assertion means that something living outside the worker kept a reference to the worker's global scope, so it was not freed before the worker ended. If we could get a Pernosco session for this, we would hopefully discover who. Thanks Tyson!
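For illustration, here is a minimal standalone sketch of what the assertion checks, using std::weak_ptr as a stand-in for the weak reference the worker runtime holds; the type and names are invented for the sketch, not Gecko's:

#include <cassert>
#include <memory>

// Illustrative stand-in for the worker's global scope object.
struct GlobalScope {};

int main() {
  // The worker thread holds what should be the last strong reference.
  auto globalScope = std::make_shared<GlobalScope>();

  // Take a weak reference so that, after shutdown, we can check whether
  // anything outside the worker is still keeping the scope alive.
  std::weak_ptr<GlobalScope> globalScopeAlive = globalScope;

  // Worker shutdown: the worker drops its own strong reference.
  globalScope.reset();

  // If an outside object leaked a strong reference, the weak pointer
  // would still resolve and this debug assertion would fire, the
  // analogue of `Assertion failure: !globalScopeAlive`.
  assert(globalScopeAlive.expired());
  return 0;
}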

Flags: needinfo?(jstutte) → needinfo?(twsmith)

Verified bug as reproducible on mozilla-central 20230106214742-7968ae37c117.
The bug appears to have been introduced in the following build range:

Start: e6e2286d2ac25001127a1cf54a87a95fb435c734 (20220708093332)
End: 807e95cd9956aa4967ddddc80f8ccab4ad370e8d (20220708081410)
Pushlog: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=e6e2286d2ac25001127a1cf54a87a95fb435c734&tochange=807e95cd9956aa4967ddddc80f8ccab4ad370e8d

Keywords: regression
Whiteboard: [bugmon:bisected,confirmed]

(In reply to Jens Stutte [:jstutte] from comment #2)

If we could get a Pernosco session for this, we would hopefully discover who. Thanks Tyson!

You got it :)

A Pernosco session is available here: https://pernos.co/debug/lWL9CW6z65G5DwmTN5-pdQ/index.html

Flags: needinfo?(twsmith)

:nical, since you are the author of the regressor, bug 1750576, could you take a look? Also, could you set the severity field?

For more information, please visit auto_nag documentation.

Flags: needinfo?(nical.bugzilla)

edit: Ignore the analysis here; it was way too cursory and is wrong.

Wrong analysis for posterity:

I've only dug into this a little bit, but it seems like the IPC flow with the promise in https://phabricator.services.mozilla.com/D146817 may be okay, such that https://searchfox.org/mozilla-central/source/dom/base/DOMMozPromiseRequestHolder.h isn't necessary. Instead, the problem may be that the PCanvasManager top-level protocol isn't being torn down, so the IPC promise is not being automatically rejected; this may potentially depend on :janv's work on bug 1809044. (Unless a WorkerRef is being used for the protocol lifecycle maintenance?)

It looks like, past the assert, we do try to safely handle the case where globalScopeAlive exists:
https://searchfox.org/mozilla-central/rev/fb9a504ca73529fa550efe488db2a012a4bf5169/dom/workers/RuntimeService.cpp#2111-2116

Is that enough to make us safe, or does the fact that it was alive here mean it might be causing problems elsewhere?

Flags: needinfo?(bugmail)

I think the failure mode will be one of a safe error/no-op/early return, a safe null de-ref crash, a safe explicit MOZ_CRASH/MOZ_RELEASE_ASSERT, or a safe leak, yes. In particular, because it seems like the top-level protocol is not getting torn down but the event target WorkerThread will be gone, I mainly expect this to result in leaks.

Flags: needinfo?(bugmail)

Was there any specific opt-in step taken to run WebGPU on a worker? For the first Nightly milestone we aim to allow WebGPU only on the main thread.

I won't have time to look into this soon, so I'm forwarding the needinfo to Jim to make sure someone on the team keeps an eye on it.

Flags: needinfo?(nical.bugzilla) → needinfo?(jimb)
Attached file prefs.js

(In reply to Nicolas Silva [:nical] from comment #9)

Was there any specific opt-in step taken to run WebGPU on a worker?

This is the prefs.js file that was in use when the issue was found by fuzzers.

I'm not familiar with WebGPU at all, but just looking at the WebIDL for GPUProvider and the code for Instance::RequestAdapter(), I can't see anything at a glance that would prevent this from being exposed on workers when dom.webgpu.enabled is set to true.

There has been an increase in reports recently (maybe another fix unblocked this?). Marking as fuzzblocker.

Whiteboard: [bugmon:bisected,confirmed] → [bugmon:bisected,confirmed][fuzzblocker]

(In reply to Andrew McCreight [:mccr8] from comment #11)

I'm not familiar with WebGPU at all, but just looking at the WebIDL for GPUProvider and the code for Instance::RequestAdapter(), I can't see anything at a glance that would prevent this from being exposed on workers when dom.webgpu.enabled is set to true.

Flags: needinfo?(nical.bugzilla)

It seems we now also get this as an intermittent; see bug 1809668. Can we just unhide this to allow for easier duping and such?

WebGPU shouldn't be accessible in workers at all, for the time being.
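Given the webidl reviewers on the patch that landed below, the fix presumably adjusted the interface's WebIDL exposure rather than adding a runtime check. The standalone C++ sketch below models only the resulting behavior; all names are invented here, not Gecko's binding machinery:

#include <iostream>

// Illustrative model of the decision the fix encodes: an interface is
// offered in a global only if the pref is on AND the global is a Window.
enum class GlobalKind { Window, DedicatedWorker, ServiceWorker };

bool OfferWebGPU(GlobalKind kind, bool domWebGpuEnabled) {
  return domWebGpuEnabled && kind == GlobalKind::Window;
}

int main() {
  // Even with dom.webgpu.enabled flipped on (as the fuzzer's prefs.js
  // did), WebGPU is only offered on the main thread after the fix.
  std::cout << OfferWebGPU(GlobalKind::Window, true) << '\n';           // 1
  std::cout << OfferWebGPU(GlobalKind::DedicatedWorker, true) << '\n';  // 0
  return 0;
}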

Assignee: nobody → jimb
Flags: needinfo?(jimb)
Attachment #9318848 - Attachment description: WIP: Bug 1808820: Don't offer WebGPU in worker scopes. → Bug 1808820: Don't offer WebGPU in worker scopes.
Flags: needinfo?(nical.bugzilla)

Unhiding to make tracking dupes easier, and Jim has just queued the patch for landing.

Group: gfx-core-security
Pushed by jblandy@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/445044d219ad
Don't offer WebGPU in worker scopes. r=webgpu-reviewers,webidl,nical,saschanaz
See Also: → webgpu-workers
Duplicate of this bug: 1809668
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 112 Branch

Verified bug as fixed on rev mozilla-central 20230222094403-3408467a0885.
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.

Status: RESOLVED → VERIFIED
Keywords: bugmon
Regressions: 1818379
Regressions: 1818918
No longer regressions: 1818918
Duplicate of this bug: 1817557

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #8)

I think the failure mode will be one of a safe error/no-op/early return, a safe null de-ref crash, a safe explicit MOZ_CRASH/MOZ_RELEASE_ASSERT, or a safe leak, yes. In particular, because it seems like the top-level protocol is not getting torn down but the event target WorkerThread will be gone, I mainly expect this to result in leaks.

Is this not sufficient to tear down the protocol when the worker is shut down? CanvasManagerChild holds an IPCWorkerRef and calls Destroy in the worker-ref callback, which in turn calls Close on the IPDL actor and drops its own reference:
https://searchfox.org/mozilla-central/rev/e9b338c2d597067f99e96d5f20769f41f312fa8f/gfx/ipc/CanvasManagerChild.cpp#112

See Also: → 1858731

(In reply to Andrew Osmond [:aosmond] (he/him) from comment #25)

Is this not sufficient to tear down the protocol when the worker is shut down? CanvasManagerChild holds an IPCWorkerRef and calls Destroy in the worker-ref callback, which in turn calls Close on the IPDL actor and drops its own reference:
https://searchfox.org/mozilla-central/rev/e9b338c2d597067f99e96d5f20769f41f312fa8f/gfx/ipc/CanvasManagerChild.cpp#112

Apologies, my investigation in comment 6 was way too cursory and seems completely wrong; the Pernosco trace (it's now rebuilt if anyone wants to look at it; you won't have to wait) shows that the IPCWorkerRef is indeed synchronously tearing down PCanvasManager, which also tears down the PWebGPU it manages.

In general it can be an anti-pattern to drop a non-WeakWorkerRef synchronously inside a callback because usually the reason the StrongWorkerRef/IPCWorkerRef exists is to keep the worker alive beyond receiving the notification. That's usually because the holder wants to keep doing IPC for a bit (dom/cache does this, but should not), or because the holder has bounced runnables to other threads and needs to wait for them all to finish up or be mooted. For classes that can clean up synchronously when the worker starts shutting down, and that only hold an IPCWorkerRef because they don't want to prevent the worker from being GC'ed, a WeakWorkerRef may be more appropriate.

That said, it's not immediately clear to me what would have been retaining the reference to the global object, then. Maybe there's a Promise somewhere, and the promise chaining needs to use AddCallbacksWithCycleCollectedArgs so that when the global gets unrooted the callbacks can be cycle collected and drop their reference to the global? If one can figure out which of the WorkerGlobalScopeBase::AddRef calls don't have paired WorkerGlobalScopeBase::Release calls in the Pernosco trace, there's the answer. Unfortunately it would take a bit of work for me to get pernosco-bridge back into working order to figure that out, but someone eyeballing the stacks of the AddRef/Release calls who knows what WebGPU is doing may be able to figure it out. (Also, there may be other tooling options.)
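As a standalone sketch of the retention hypothesis above (a pending promise reaction capturing a strong reference to the global), with all names invented for illustration:

#include <functional>
#include <iostream>
#include <memory>
#include <vector>

// Illustrative stand-in; not a Gecko class.
struct GlobalScope {};

int main() {
  auto global = std::make_shared<GlobalScope>();
  std::weak_ptr<GlobalScope> watcher = global;

  // A pending "promise reaction" whose lambda captures a strong
  // reference to the global. If this list is never drained or cleared
  // (e.g. the promise never settles before worker teardown), the
  // capture alone keeps the global alive.
  std::vector<std::function<void()>> pendingReactions;
  pendingReactions.push_back([global] { /* would use the global here */ });

  global.reset();  // the worker drops its own reference...
  std::cout << std::boolalpha << watcher.expired() << '\n';  // false: leaked

  pendingReactions.clear();  // dropping the reaction releases the global
  std::cout << watcher.expired() << '\n';  // true
  return 0;
}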

Ah, I see you've moved dropping the WorkerRef to ActorDestroy in https://phabricator.services.mozilla.com/D190831, which is best practice for that, yeah.

Just to be clear, though, I don't think that will fix whatever was the cause of the problem in comment 0 and it's probably worth testing the fuzzing test-case again or having the fuzzer team otherwise see if this problem recurs, etc.
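For illustration, a minimal sketch of that ordering: the ref is released from ActorDestroy rather than from the worker-ref shutdown callback, so the actor's IPC teardown completes first. The class shape and names are invented, not the actual D190831 patch:

#include <iostream>
#include <memory>
#include <utility>

// Illustrative stand-ins; not Gecko's IPDL or WorkerRef types.
struct WorkerRef {};

class ActorSketch {
 public:
  explicit ActorSketch(std::shared_ptr<WorkerRef> ref)
      : mWorkerRef(std::move(ref)) {}

  // Called from the worker-ref shutdown callback: start async IPC
  // teardown, but keep holding the ref so the worker stays alive
  // until the actor is actually destroyed.
  void OnWorkerShutdown() { Close(); }

  // Called once IPC teardown for this actor has completed. Now it is
  // safe to release the ref that was pinning the worker.
  void ActorDestroy() {
    mWorkerRef.reset();
    std::cout << "worker ref released after actor teardown\n";
  }

 private:
  void Close() { /* would begin IPDL shutdown; ActorDestroy follows */ }
  std::shared_ptr<WorkerRef> mWorkerRef;
};

int main() {
  ActorSketch actor(std::make_shared<WorkerRef>());
  actor.OnWorkerShutdown();
  actor.ActorDestroy();  // in real code, invoked by the IPC layer
  return 0;
}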

Right, that's why I wrote another patch in the same stack to allow WebGPU to run on DOM workers with a pref flip. The fuzzer can flip it and see if it happens again.

In general it can be an anti-pattern to drop a non-WeakWorkerRef synchronously inside a callback because usually the reason the StrongWorkerRef/IPCWorkerRef exists is to keep the worker alive beyond receiving the notification.

I've been working under the belief that I was supposed to free the worker ref synchronously when I get the shutdown callback, because otherwise, if I dispatch, I tend to trip this assert in CI:
https://searchfox.org/mozilla-central/rev/7d77ff808f8407a3e4fc0911779da446c050f9ee/dom/workers/WorkerPrivate.h#549

This happens when holding onto a ThreadSafeWorkerRef.

To be clear, the objects holding onto the ThreadSafeWorkerRef are not DOM objects, just various gfx-related objects, so they can in theory live slightly longer than average, and they don't have any dependencies on GC/JS/etc.

(In reply to Andrew Osmond [:aosmond] (he/him) from comment #29)

I've been working under the belief that I was supposed to free the worker ref synchronously when I get the shutdown callback, because otherwise, if I dispatch, I tend to trip this assert in CI:
https://searchfox.org/mozilla-central/rev/7d77ff808f8407a3e4fc0911779da446c050f9ee/dom/workers/WorkerPrivate.h#549

If this assertion is firing, it means the WorkerRef is being dropped during cycle collection. (The flag that activates the assertion is only set during the call to GC.) WorkerRefs do need to be dropped deterministically, without involving GC/CC.
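A toy model of that mechanism, with invented names; the flag here is a simplified stand-in for the one described above, not WorkerPrivate's actual member:

#include <cassert>

// Set while the (simulated) GC runs; releasing a worker ref while it
// is set trips the assert, mirroring the linked WorkerPrivate.h check.
bool gInGC = false;

struct WorkerRefSketch {
  void Release() {
    // WorkerRefs must be dropped deterministically, not as a side
    // effect of GC/CC tearing down a cycle.
    assert(!gInGC && "WorkerRef dropped during GC/CC");
  }
};

int main() {
  WorkerRefSketch ref;
  ref.Release();  // fine: explicit, deterministic release

  gInGC = true;   // simulate being inside the GC callback
  // ref.Release();  // would assert, as described above
  gInGC = false;
  return 0;
}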

I'd be interested in seeing any stacks or jobs with logs where you're seeing this, if it comes up again or they're readily available. We're definitely concerned about WorkerRef ergonomics, so understanding where they fall short is helpful.

(In reply to Andrew Osmond [:aosmond] (he/him) from comment #30)

To be clear, the objects holding onto the ThreadSafeWorkerRef are not DOM objects, just various gfx-related objects, so they can in theory live slightly longer than average, and they don't have any dependencies on GC/JS/etc.

This should be fine, then.
