1984739 - Another Atomics.waitAsync tsan failure in atomics-wait-async.https.any.worker.html

Reporter

Description

•

9 months ago

Another attempt at landing Atomics.waitAsync has run aground on TSAN warnings about deadlock in worker threads.

Here's the relevant part of the log for the failing job:

Main thread
#4 js::FutexThread::lock()
#5 JSContext::requestInterrupt(js::InterruptReason)
#6 JS_RequestInterruptCallback(JSContext*) 
#7 mozilla::dom::WorkerPrivate::DispatchControlRunnable(already_AddRefed<mozilla::dom::WorkerRunnable>) 
#8 mozilla::dom::WorkerPrivate::Dispatch(already_AddRefed<mozilla::dom::WorkerRunnable>, nsIEventTarget*) <-- WORKER LOCK CLAIMED HERE
#9 mozilla::dom::WorkerThreadRunnable::DispatchInternal(mozilla::dom::WorkerPrivate*) 
#10 mozilla::dom::WorkerRunnable::Dispatch(mozilla::dom::WorkerPrivate*)

vs

DOM Worker
#4 BaseAutoLock
#5 mozilla::dom::WorkerPrivate::Dispatch(already_AddRefed<mozilla::dom::WorkerRunnable>, nsIEventTarget*) 
#6 mozilla::dom::WorkerThreadRunnable::DispatchInternal(mozilla::dom::WorkerPrivate*) 
#7 mozilla::dom::WorkerRunnable::Dispatch(mozilla::dom::WorkerPrivate*) 
#8 mozilla::dom::workerinternals::(anonymous namespace)::DispatchToEventLoop(void*, std::unique_ptr<JS::Dispatchable, JS::DeletePolicy<JS::Dispatchable>>&&) 
#9 DispatchResolveAndDestroy 
#10 js::OffThreadPromiseTask::removeFromCancellableListAndDispatch(js::AutoLockHelperThreadState const&) 
#11 js::WaitAsyncTimeoutTask::run(JSContext*, JS::Dispatchable::MaybeShuttingDown)                          <-- FUTEX LOCK CLAIMED HERE
#12 JS::Dispatchable::Run(JSContext*, std::unique_ptr<JS::Dispatchable, JS::DeletePolicy<JS::Dispatchable>>&&, JS::Dispatchable::MaybeShuttingDown) 
#13 mozilla::dom::DelayedJSDispatchableHandler::Call(char const*) 
#14 mozilla::dom::WorkerGlobalScopeBase::RunTimeoutHandler(mozilla::dom::Timeout*) 
#15 non-virtual thunk to mozilla::dom::WorkerGlobalScopeBase::RunTimeoutHandler(mozilla::dom::Timeout*) 
#16 mozilla::dom::TimeoutManager::RunTimeout(mozilla::TimeStamp const&, mozilla::TimeStamp const&, bool) 
#17 mozilla::dom::TimeoutExecutor::MaybeExecute()

There's once again a potential deadlock, this time as the AsyncTimeoutTask attempts to run.

I think we can once again solve this in the waitAsync code by moving this call to removeFromCancellableListAndDispatch outside the scope of this AutoLockFutexAPI. It would be nice if the worker code didn't force us to do this, though.

Iain Ireland [:iain]

Reporter

Comment 1

•

9 months ago

Ugh. This code in the FutexWaiterListHead destructor makes everything more complicated. We want to callRemoveFromCancellableListAndDispatch in a loop, but we're holding the futex lock across the entire loop because we're also going to remove the node from the linked list.

And this code might be similarly problematic.

Dan Minor [:dminor]

Comment 2

•

9 months ago

Hey Eden, we're running into problems with Atomics.waitAsync and potential deadlocks in worker code, I was wondering if you'd have time for a quick look to see if you have any ideas for us to help resolve this.

Flags: needinfo?(echuang)

Eden Chuang[:edenchuang]

Assignee

Comment 3

•

9 months ago

According to the stack, it looks like

Main thread: Dispatch a WorkerControlRunnable -> Lock the Worker mutex -> call into JS_RequestInterruptCallback() -> try to acquire FutexThread mutex

Worker thread: call into the js::WaitAsyncTimeoutTask -> Lock the FutexThread mutex -> dispatch some WorkerThreadRunnable to Worker -> try to acquire Worker mutex

According to WorkerPrivate::DispatchControlRunnable(), we seem to be able to move the JS_RequestInterruptCallback into a MutexAutoUnlock block to decouple WorkerPrivate::mMutex and the js::mutex under FutexThread::Lock.

I mean WorkerPrivate::DispatchControlRunnable() can be rewirte as followings.

nsresult WorkerPrivate::DispatchControlRunnable(
    already_AddRefed<WorkerRunnable> aWorkerRunnable) {
  ...
  {
    MutexAutoLock lock(mMutex);
    ...
    if (JSContext* cx = mJSContext) {
      MOZ_ASSERT(mThread);
      {
         MutexAutoUnlock unlock(mMutex);
         JS_RequestInterruptCallback(cx);
      }
    }
    mCondVar.Notify();
  }
  ...
}

Andrew, do you have any concerns about releasing the mutex while executing the JS_RequestInterruptCallback()?

Flags: needinfo?(echuang) → needinfo?(bugmail)

Andrew Sutherland [:asuth] (he/him)

Comment 4

•

9 months ago

•

Edited

tl;dr: I think it's probably fine if we believe StrongWorkerRefs cover the usage, in which case I suggest some assertions below. If we don't want to depend on that, I think we need to otherwise ensure the Worker is limited in its ability to progress in its shutdown progress.

Analysis

UAF concerns

I think our primary concern is that we need to make sure our use of the cx is not a UAF. Once we drop the mutex, we need to have confidence that there's no way the worker can exit its run loop and destroy the context. Currently, holding the mutex provides that guarantee because the worker can't change its state with the mutex held.

For WorkerThread, we have mozilla::dom::WorkerThread::mOtherThreadsDispatchingViaEventTarget for a similar purpose. We could adopt that approach in which case we would want to make sure we increment the count before dropping the mutex, then reacquire the mutex and decrement the count before notifying the condvar. (And then make sure to check that value and potentially wait on the condvar before we go tearing down the WorkerJSContext.)

That said, if whoever is calling the dispatch method is covered by a StrongWorkerRef and it cannot be released until we return control to the caller, we're fine. (But a WorkerPrivate refcount is not sufficient on its own, it's mozilla::dom::WorkerPrivate::HasActiveWorkerRefs that we need to return true.) In general, we already depend on people to have correctly done that. Bug 1837283 could potentially help with that if we also required that for OMT control runnable dispatch too.

If we depend on that, I think we would ideally want the equivalent of MOZ_ASSERT_DEBUG_OR_FUZZING(mWorkerPrivate->HasActiveWorkerRefs()); both before and after the call to JS interrupt so that we could get fuzzing coverage to detect if that invariant is being violated. Because that function depends on looking into mozilla::dom::WorkerPrivate::mWorkerThreadAccessible it requires a little extra legwork to maintain for off-worker-thread access, but we could do it via a field that conditionally exists #if DEBUG || FUZZING and have the field maintained by mozilla::dom::WorkerPrivate::AddWorkerRef and mozilla::dom::WorkerPrivate::RemoveWorkerRef.

That said, it wouldn't be shocking for there to be cases like the RuntimeService shutdown code where we ask the workers why they look like they're hanging using control runnables that might be angry about that, so maybe the mOtherThreadsDispatchingViaEventTarget counter idiom is better.

Ordering concerns

One question is if we even need to re-acquire the lock and so we could move the CondVar notification up above the call to the JS-interrupt. If we do the mOtherThreadsDispatchingViaEventTarget idiom, we definitely need to re-acquire the lock, decrement the variable, and notify the condvar afterwards. If we do the StrongWorkerRef checks, that also would imply also reacquiring the lock at least in the debug-or-fuzzing case. So it sorta sounds like either way we might as well stick with the existing ordering and intentionally reacquire the lock and notify afterwards even if it's more wasteful.

For posterity, since I already wrote it up in the process of thinking this through, I understand the current rationales for the calls to be:

The call to notify is to handle the case where the worker is parked in mozilla::dom::WorkerPrivate::WaitForWorkerEvents. Its strict ordering constraint is we must have put the runnable in the control queue already (which we will have).
The call to JS_RequestInterruptCallback is to handle the case where the worker is busy in JS code and potentially could be in an infinite loop, so we have to interrupt. As noted above, we just need to have put the runnable in the control queue.
- In the event we're not running JS or we're on the way out of JS back to our event loop or our nested event loop, we'll check the control queue regardless of the call to notify above, and it's possible the notify could make us go through an extra no-op turn of the loop.

Flags: needinfo?(bugmail)

Andrew Sutherland [:asuth] (he/him)

Comment 5

•

9 months ago

Actually, forget the StrongWorkerRef stuff, I think we just need to do the mOtherThreadsDispatchingViaEventTarget-style thing because ThreadSafeWorkerRefs use control runnables to release.

Eden Chuang[:edenchuang]

Assignee

Comment 6

•

9 months ago

Thank you for the detailed explanation.

I will try mOtherThreadsDispatchingViaEventTarget mechanism to break the Lock sequence between WorkerPrivate::mMutex and FutexThread::Lock.
This could help resolve the WorkerRunnable dispatching and JS execution deadlock situation.

However, to avoid the deadlock perfectly, we should not introduce any lock sequence or ensure that the lock sequence is always in the same order, regardless of the situation. However, it seems that it is not easy to determine the correct sequence between WorkerPrivate::mMutex and FutexThread::Lock. I would also like to ask Iain to figure out how to decouple the locks in the JS code, so we will not easily introduce a deadlock again.

Flags: needinfo?(iireland)

Iain Ireland [:iain]

Reporter

Comment 7

•

9 months ago

There are two relevant locks here. There are three possible orderings that we could enforce:

The worker private mutex must be claimed before the futex lock.
The worker private mutex must be claimed after the futex lock.
Neither lock should be claimed while the other is held.

In JS_RequestInterruptCallback, we have to claim the futex lock to wake up any threads that are currently suspended inside of Atomics.wait. I don't think there's anything we can change about that. This means that we either to pick option #1 above, or to release the worker lock before calling JS_RequestInterruptFallback.

In AtomicsWaitAsyncCriticalSection, we claim the futex lock (as specified here), do some work, and then release the lock. If there is a timeout, we call delayedDispatchToEventLoop to enqueue a timeout job with the embedder. The specification says that this should happen before releasing the lock, but the worker implementation of delayedDispatchToEventLoop claims the WorkerPrivate mutex. Here, our options are to pick option #2, or release the futex lock before dispatching the timeout task. In this patch, we went with the latter option, because we couldn't come up with a way for anything to go wrong if the timeout task is enqueued outside the critical section.

In this bug, we ran into a second, similar issue. In this case, when a timeout task runs, we have to resolve the promise created by Atomics.waitAsync with the string "timed-out". To do so, we dispatch a task to the worker that originally called Atomics.waitAsync. We also update some of the futex-guarded state. Again, this is specified to do all the work inside the critical section, but we could also choose to move removeFromCancellableListAndDispatch outside the critical section in the same way as above. If we don't, we have to go with option #2 above.

As linked in comment 1, there are two other cases where we call removeFromCancellableListAndDispatch:

In Atomics.notify, we may have to resolve an arbitrary number of promises created by other threads with the string "ok". This involves dispatching a task to those threads. It's a little annoying to move this out of the loop, but we already have to do something similar for promises that we want to resolve on our own thread.
If a SharedArrayBuffer dies, any promise that was created by a (timeout-free) waitAsync on an index in that buffer can no longer be resolved. In this code we clean up the tasks that would resolve those promises. Moving removeFromCancellableListAndDispatch outside of the critical section here is annoying because there can be many waiters, so we would have to build a list of waiters inside the critical section and then free them outside.

We can't explicitly claim the worker lock in any of this code because it's generic across embedders; the main thread and worklets don't have the same locking problems.

So revisiting the options above:

Worker before futex: the three remaining cases where SM dispatches tasks to the embedding must be rewritten to move those calls outside the futex critical section.
Futex before worker: DispatchControlRunnable must drop the worker mutex before calling JS_RequestInterruptCallback.
Mutually exclusive: Both sets of changes must be made.

Miscellaneous thoughts:

I should maybe point out that all these tasks are cancelled as part of context shutdown; I'm not sure how that interacts with the concerns about workers exiting their run loops.

Given that all of our problems are around dispatching tasks, I suppose another option would be to split out a separate mutex that only covers the dispatch queue. I don't know enough about the worker private to say whether this would be any good.

Flags: needinfo?(iireland)

Eden Chuang[:edenchuang]

Assignee

Comment 8

•

8 months ago

Attached file Bug 1984739 - Decoupling the dependency between WorkerPrivate::mMutex and FutexThread::Lock in WorkerPrivate::DispatchControlRunnables. r=asuth — Details

When dispatching a WorkerControlRunnable, it first requires WorkerPrivate::mMutex to push the runnable into the ControlQueue, and then calls JS_RequestInterruptCallback(), which requires FutexThread::Lock, to interrupt the JS execution for ControlRunnable handling.
However, the required lock sequence could cause a deadlock when JS codes holds FutexThread::Lock and tries to dispatch a runnable to the Worker.

This patch introduces WorkerPrivate::mDispatchingControlRunnables to decouple the WorkerPrivate::mMutex and FutexThread::Lock, and also to prevent the JSContext from being destroyed while dispatching a ControlRunnable.

Phabricator Automation

Updated

•

8 months ago

Assignee: nobody → echuang

Status: NEW → ASSIGNED

Pulsebot

Comment 9

•

8 months ago

Pushed by echuang@mozilla.com: https://github.com/mozilla-firefox/firefox/commit/20a94d0def3a https://hg.mozilla.org/integration/autoland/rev/eddd19f1deb9 Decoupling the dependency between WorkerPrivate::mMutex and FutexThread::Lock in WorkerPrivate::DispatchControlRunnables. r=asuth

agoloman

Comment 10

•

8 months ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/eddd19f1deb9

Status: ASSIGNED → RESOLVED

Closed: 8 months ago

status-firefox145: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 145 Branch

Eden Chuang[:edenchuang]

Assignee

Comment 11

•

8 months ago

Since https://bugzilla.mozilla.org/show_bug.cgi?id=1884148 is marked as fixed, I assume the patch effectively resolves the deadlock issue.

Iain, please let me know if you are still experiencing issues with this matter. Thank you.

Flags: needinfo?(iireland)

Iain Ireland [:iain]

Reporter

Comment 12

•

8 months ago

Bug 1884148 has landed successfully, so it appears that we have resolved the deadlock issue. Thanks!

Flags: needinfo?(iireland)

Mihai Boldan, Desktop QA [:mboldan]

Updated

•

7 months ago

QA Whiteboard: [qa-triage-done-c146/b145]

Bugzilla

Another Atomics.waitAsync tsan failure in atomics-wait-async.https.any.worker.html

Categories

(Core :: JavaScript Engine, task, P1)

Tracking

()

People

(Reporter: iain, Assigned: edenchuang)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Analysis

UAF concerns

Ordering concerns

Comment 5

Comment 6

Comment 7

Comment 8

Updated

Comment 9

Comment 10

Comment 11

Comment 12

Updated

Attachment

General

Description

File Name

Content Type