Bug 1837186 Comment 2 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

Original comment by

Andrew Sutherland [:asuth] (he/him)

on 2023-06-11 18:31:46 PDT

Hm, yeah, the [tier 1 failure log](https://treeherder.mozilla.org/logviewer?job_id=418906142&repo=mozilla-central&lineNumber=17882) (and also the tier 2 log which seems identical) definitely seems to suggest there are some lifecycle problems relating to the worker runtime embedding and that the test at https://searchfox.org/mozilla-central/source/dom/promise/tests/test_webassembly_compile.html is great at showing this and in particular [terminateCompileInWorker](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/promise/tests/test_webassembly_compile.html#95-117) is an amazing stress test to create this situation.

Specifically, the log shows:
- [Dispatch to the worker is failing because we've reached mStatus=Dead](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/workers/WorkerPrivate.cpp#1589-1591)
- [We're doing our event loop draining after exiting WorkerPrivate::DoRunLoop](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/workers/RuntimeService.cpp#2167) (reported as line 2171, but I think that's an optimization artifact) where the WorkerJSContext is still alive.
- The JS runtime is using a JSDispatchRunnable which [we create in DispatchToEventLoop](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/workers/RuntimeService.cpp#651) which we [provide to the JSRuntime in  InitJSContextForWorker](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/workers/RuntimeService.cpp#694-697).
- The assertion is firing because [we nulled out WorkerPrivate::mJSContext in WorkerPrivate::DoRunLoop right after we set mStatus to Dead](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/workers/WorkerPrivate.cpp#3268) despite our willingness above to try and create and dispatch runnables.

It seems like:
- We're doing the right thing already as it relates to aborting the WASM compilations:
  - [CompileStreamTask::streamFailed_ may be the right place for the signaling per this comment](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/js/src/wasm/WasmJS.cpp#4892-4893) "Set on stream thread and read racily on helper thread to abort compilation"
  - That [gets set in rejectAndDestroyAfterHelperThreadStarted](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/js/src/wasm/WasmJS.cpp#4945)
  - The [cool searchfox diagram of calls to that method (across multiple platforms that make the symbols weird)](https://searchfox.org/mozilla-central/query/default?q=calls-to%3A%27CompileStreamTask%3A%3ArejectAndDestroyAfterHelperThreadStarted%27%20depth%3A4) comes from JSStreamConsumer.
  - We [expose](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/workers/RuntimeService.cpp#699-700) the [ConsumeStream](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/workers/RuntimeService.cpp#655) method which uses [FetchUtil::StreamResponseToJS](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/fetch/FetchUtil.cpp#676) which creates a [JSStreamConsumer](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/fetch/FetchUtil.cpp#335) which [holds a WorkerStreamOwner](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/fetch/FetchUtil.cpp#344) which [holds a StrongWorkerRef](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/fetch/FetchUtil.cpp#331) which [gets ProxyReleased in the JSStreamConsumer destructor](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/fetch/FetchUtil.cpp#396) and whose [StrongWorkerRef callback closes the stream but leaves the release to that proxy release](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/fetch/FetchUtil.cpp#302-306).
- There is an explicit way to shutdown the async JS tasks via [JS::ShutdownAsyncTasks](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/js/public/Promise.h#586-595) which [nsCycleCollector::ShutdownCollect calls](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/xpcom/base/nsCycleCollector.cpp#3412) and which we end up calling for workers via [our call to nsCycleCollector_shutdown in WorkerThreadPrimaryRunnable::Run](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/workers/RuntimeService.cpp#2203) which is after the place where we're crashing.
- The [comments in the underlying implementation for JS::ShutdownAsyncTasks](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/js/src/vm/OffThreadPromiseRuntimeState.cpp#241-278) suggest quite reasonably that we definitely need to have a barrier where we make DispatchToEventLoop start failing/rejecting at some point and guarantee that we run all the runnables that were enqueued prior to that point.
- We definitely have an ordering problem here as it relates to honoring the DispatchToEventLoop contract.  There's probably also some meta question about where we clear `mJSContext` and its relationship to where we call `nsCycleCollector_shutdown`, but that's probably something to ponder after addressing the DispatchToEventLoop contract and considering where else we use mJSContext.
  - In general, the problem is that we don't spin the event loop [between transitioning to dead from canceling and clearing mJSContext](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/workers/WorkerPrivate.cpp#3238-3268).  If we set the status to `mDead` with the mutex held, drop the mutex, spin the event loop, then re-acquire the mutex and clear `mJSContext`, maybe everything becomes correct and starts working.

I'm setting a needinfo on myself to revisit this after letting my brain sleep on this, but others should feel free to chime in.

Revision 1 by

Andrew Sutherland [:asuth] (he/him)

on 2023-06-11 18:34:27 PDT

Hm, yeah, the [tier 1 failure log](https://treeherder.mozilla.org/logviewer?job_id=418906142&repo=mozilla-central&lineNumber=17882) (and also the tier 2 log which seems identical) definitely seems to suggest there are some lifecycle problems relating to the worker runtime embedding and that the test at https://searchfox.org/mozilla-central/source/dom/promise/tests/test_webassembly_compile.html is great at showing this and in particular [terminateCompileInWorker](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/promise/tests/test_webassembly_compile.html#95-117) is an amazing stress test to create this situation.

Specifically, the log shows:
- [Dispatch to the worker is failing because we've reached mStatus=Dead](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/workers/WorkerPrivate.cpp#1589-1591)
- [We're doing our event loop draining after exiting WorkerPrivate::DoRunLoop](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/workers/RuntimeService.cpp#2167) (reported as line 2171, but I think that's an optimization artifact) where the WorkerJSContext is still alive.
- The JS runtime is using a JSDispatchRunnable which [we create in DispatchToEventLoop](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/workers/RuntimeService.cpp#651) which we [provide to the JSRuntime in  InitJSContextForWorker](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/workers/RuntimeService.cpp#694-697).
- The assertion is firing because [we nulled out WorkerPrivate::mJSContext in WorkerPrivate::DoRunLoop right after we set mStatus to Dead](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/workers/WorkerPrivate.cpp#3268) but didn't drain the event loop in between, so the runnable actually runs after we cleared mJSContext.

It seems like:
- We're doing the right thing already as it relates to aborting the WASM compilations:
  - [CompileStreamTask::streamFailed_ may be the right place for the signaling per this comment](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/js/src/wasm/WasmJS.cpp#4892-4893) "Set on stream thread and read racily on helper thread to abort compilation"
  - That [gets set in rejectAndDestroyAfterHelperThreadStarted](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/js/src/wasm/WasmJS.cpp#4945)
  - The [cool searchfox diagram of calls to that method (across multiple platforms that make the symbols weird)](https://searchfox.org/mozilla-central/query/default?q=calls-to%3A%27CompileStreamTask%3A%3ArejectAndDestroyAfterHelperThreadStarted%27%20depth%3A4) comes from JSStreamConsumer.
  - We [expose](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/workers/RuntimeService.cpp#699-700) the [ConsumeStream](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/workers/RuntimeService.cpp#655) method which uses [FetchUtil::StreamResponseToJS](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/fetch/FetchUtil.cpp#676) which creates a [JSStreamConsumer](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/fetch/FetchUtil.cpp#335) which [holds a WorkerStreamOwner](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/fetch/FetchUtil.cpp#344) which [holds a StrongWorkerRef](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/fetch/FetchUtil.cpp#331) which [gets ProxyReleased in the JSStreamConsumer destructor](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/fetch/FetchUtil.cpp#396) and whose [StrongWorkerRef callback closes the stream but leaves the release to that proxy release](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/fetch/FetchUtil.cpp#302-306).
- There is an explicit way to shutdown the async JS tasks via [JS::ShutdownAsyncTasks](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/js/public/Promise.h#586-595) which [nsCycleCollector::ShutdownCollect calls](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/xpcom/base/nsCycleCollector.cpp#3412) and which we end up calling for workers via [our call to nsCycleCollector_shutdown in WorkerThreadPrimaryRunnable::Run](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/workers/RuntimeService.cpp#2203) which is after the place where we're crashing.
- The [comments in the underlying implementation for JS::ShutdownAsyncTasks](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/js/src/vm/OffThreadPromiseRuntimeState.cpp#241-278) suggest quite reasonably that we definitely need to have a barrier where we make DispatchToEventLoop start failing/rejecting at some point and guarantee that we run all the runnables that were enqueued prior to that point.
- We definitely have an ordering problem here as it relates to honoring the DispatchToEventLoop contract.  There's probably also some meta question about where we clear `mJSContext` and its relationship to where we call `nsCycleCollector_shutdown`, but that's probably something to ponder after addressing the DispatchToEventLoop contract and considering where else we use mJSContext.
  - In general, the problem is that we don't spin the event loop [between transitioning to dead from canceling and clearing mJSContext](https://searchfox.org/mozilla-central/rev/f60c919b9c9959573d7e7a2d86f07d4d7053ac79/dom/workers/WorkerPrivate.cpp#3238-3268).  If we set the status to `mDead` with the mutex held, drop the mutex, spin the event loop, then re-acquire the mutex and clear `mJSContext`, maybe everything becomes correct and starts working.

I'm setting a needinfo on myself to revisit this after letting my brain sleep on this, but others should feel free to chime in.

Back to Bug 1837186 Comment 2

Bugzilla

Quick Search

Bug 1837186 Comment 2 Edit History