Closed Bug 1984739 Opened 9 months ago Closed 8 months ago

Another Atomics.waitAsync tsan failure in atomics-wait-async.https.any.worker.html

Categories

(Core :: JavaScript Engine, task, P1)

task

Tracking

()

RESOLVED FIXED
145 Branch
Tracking Status
firefox145 --- fixed

People

(Reporter: iain, Assigned: edenchuang)

References

Details

Attachments

(1 file)

Another attempt at landing Atomics.waitAsync has run aground on TSAN warnings about deadlock in worker threads.

Here's the relevant part of the log for the failing job:

Main thread
#4 js::FutexThread::lock()
#5 JSContext::requestInterrupt(js::InterruptReason)
#6 JS_RequestInterruptCallback(JSContext*) 
#7 mozilla::dom::WorkerPrivate::DispatchControlRunnable(already_AddRefed<mozilla::dom::WorkerRunnable>) 
#8 mozilla::dom::WorkerPrivate::Dispatch(already_AddRefed<mozilla::dom::WorkerRunnable>, nsIEventTarget*) <-- WORKER LOCK CLAIMED HERE
#9 mozilla::dom::WorkerThreadRunnable::DispatchInternal(mozilla::dom::WorkerPrivate*) 
#10 mozilla::dom::WorkerRunnable::Dispatch(mozilla::dom::WorkerPrivate*)

vs

DOM Worker
#4 BaseAutoLock
#5 mozilla::dom::WorkerPrivate::Dispatch(already_AddRefed<mozilla::dom::WorkerRunnable>, nsIEventTarget*) 
#6 mozilla::dom::WorkerThreadRunnable::DispatchInternal(mozilla::dom::WorkerPrivate*) 
#7 mozilla::dom::WorkerRunnable::Dispatch(mozilla::dom::WorkerPrivate*) 
#8 mozilla::dom::workerinternals::(anonymous namespace)::DispatchToEventLoop(void*, std::unique_ptr<JS::Dispatchable, JS::DeletePolicy<JS::Dispatchable>>&&) 
#9 DispatchResolveAndDestroy 
#10 js::OffThreadPromiseTask::removeFromCancellableListAndDispatch(js::AutoLockHelperThreadState const&) 
#11 js::WaitAsyncTimeoutTask::run(JSContext*, JS::Dispatchable::MaybeShuttingDown)                          <-- FUTEX LOCK CLAIMED HERE
#12 JS::Dispatchable::Run(JSContext*, std::unique_ptr<JS::Dispatchable, JS::DeletePolicy<JS::Dispatchable>>&&, JS::Dispatchable::MaybeShuttingDown) 
#13 mozilla::dom::DelayedJSDispatchableHandler::Call(char const*) 
#14 mozilla::dom::WorkerGlobalScopeBase::RunTimeoutHandler(mozilla::dom::Timeout*) 
#15 non-virtual thunk to mozilla::dom::WorkerGlobalScopeBase::RunTimeoutHandler(mozilla::dom::Timeout*) 
#16 mozilla::dom::TimeoutManager::RunTimeout(mozilla::TimeStamp const&, mozilla::TimeStamp const&, bool) 
#17 mozilla::dom::TimeoutExecutor::MaybeExecute()

There's once again a potential deadlock, this time as the AsyncTimeoutTask attempts to run.

I think we can once again solve this in the waitAsync code by moving this call to removeFromCancellableListAndDispatch outside the scope of this AutoLockFutexAPI. It would be nice if the worker code didn't force us to do this, though.

Ugh. This code in the FutexWaiterListHead destructor makes everything more complicated. We want to callRemoveFromCancellableListAndDispatch in a loop, but we're holding the futex lock across the entire loop because we're also going to remove the node from the linked list.

And this code might be similarly problematic.

Hey Eden, we're running into problems with Atomics.waitAsync and potential deadlocks in worker code, I was wondering if you'd have time for a quick look to see if you have any ideas for us to help resolve this.

Flags: needinfo?(echuang)

According to the stack, it looks like

Main thread: Dispatch a WorkerControlRunnable -> Lock the Worker mutex -> call into JS_RequestInterruptCallback() -> try to acquire FutexThread mutex

Worker thread: call into the js::WaitAsyncTimeoutTask -> Lock the FutexThread mutex -> dispatch some WorkerThreadRunnable to Worker -> try to acquire Worker mutex

According to WorkerPrivate::DispatchControlRunnable(), we seem to be able to move the JS_RequestInterruptCallback into a MutexAutoUnlock block to decouple WorkerPrivate::mMutex and the js::mutex under FutexThread::Lock.

I mean WorkerPrivate::DispatchControlRunnable() can be rewirte as followings.

nsresult WorkerPrivate::DispatchControlRunnable(
    already_AddRefed<WorkerRunnable> aWorkerRunnable) {
  ...
  {
    MutexAutoLock lock(mMutex);
    ...
    if (JSContext* cx = mJSContext) {
      MOZ_ASSERT(mThread);
      {
         MutexAutoUnlock unlock(mMutex);
         JS_RequestInterruptCallback(cx);
      }
    }
    mCondVar.Notify();
  }
  ...
}

Andrew, do you have any concerns about releasing the mutex while executing the JS_RequestInterruptCallback()?

Flags: needinfo?(echuang) → needinfo?(bugmail)

tl;dr: I think it's probably fine if we believe StrongWorkerRefs cover the usage, in which case I suggest some assertions below. If we don't want to depend on that, I think we need to otherwise ensure the Worker is limited in its ability to progress in its shutdown progress.

Analysis

UAF concerns

I think our primary concern is that we need to make sure our use of the cx is not a UAF. Once we drop the mutex, we need to have confidence that there's no way the worker can exit its run loop and destroy the context. Currently, holding the mutex provides that guarantee because the worker can't change its state with the mutex held.

For WorkerThread, we have mozilla::dom::WorkerThread::mOtherThreadsDispatchingViaEventTarget for a similar purpose. We could adopt that approach in which case we would want to make sure we increment the count before dropping the mutex, then reacquire the mutex and decrement the count before notifying the condvar. (And then make sure to check that value and potentially wait on the condvar before we go tearing down the WorkerJSContext.)

That said, if whoever is calling the dispatch method is covered by a StrongWorkerRef and it cannot be released until we return control to the caller, we're fine. (But a WorkerPrivate refcount is not sufficient on its own, it's mozilla::dom::WorkerPrivate::HasActiveWorkerRefs that we need to return true.) In general, we already depend on people to have correctly done that. Bug 1837283 could potentially help with that if we also required that for OMT control runnable dispatch too.

If we depend on that, I think we would ideally want the equivalent of MOZ_ASSERT_DEBUG_OR_FUZZING(mWorkerPrivate->HasActiveWorkerRefs()); both before and after the call to JS interrupt so that we could get fuzzing coverage to detect if that invariant is being violated. Because that function depends on looking into mozilla::dom::WorkerPrivate::mWorkerThreadAccessible it requires a little extra legwork to maintain for off-worker-thread access, but we could do it via a field that conditionally exists #if DEBUG || FUZZING and have the field maintained by mozilla::dom::WorkerPrivate::AddWorkerRef and mozilla::dom::WorkerPrivate::RemoveWorkerRef.

That said, it wouldn't be shocking for there to be cases like the RuntimeService shutdown code where we ask the workers why they look like they're hanging using control runnables that might be angry about that, so maybe the mOtherThreadsDispatchingViaEventTarget counter idiom is better.

Ordering concerns

One question is if we even need to re-acquire the lock and so we could move the CondVar notification up above the call to the JS-interrupt. If we do the mOtherThreadsDispatchingViaEventTarget idiom, we definitely need to re-acquire the lock, decrement the variable, and notify the condvar afterwards. If we do the StrongWorkerRef checks, that also would imply also reacquiring the lock at least in the debug-or-fuzzing case. So it sorta sounds like either way we might as well stick with the existing ordering and intentionally reacquire the lock and notify afterwards even if it's more wasteful.

For posterity, since I already wrote it up in the process of thinking this through, I understand the current rationales for the calls to be:

  • The call to notify is to handle the case where the worker is parked in mozilla::dom::WorkerPrivate::WaitForWorkerEvents. Its strict ordering constraint is we must have put the runnable in the control queue already (which we will have).
  • The call to JS_RequestInterruptCallback is to handle the case where the worker is busy in JS code and potentially could be in an infinite loop, so we have to interrupt. As noted above, we just need to have put the runnable in the control queue.
    • In the event we're not running JS or we're on the way out of JS back to our event loop or our nested event loop, we'll check the control queue regardless of the call to notify above, and it's possible the notify could make us go through an extra no-op turn of the loop.
Flags: needinfo?(bugmail)

Actually, forget the StrongWorkerRef stuff, I think we just need to do the mOtherThreadsDispatchingViaEventTarget-style thing because ThreadSafeWorkerRefs use control runnables to release.

Thank you for the detailed explanation.

I will try mOtherThreadsDispatchingViaEventTarget mechanism to break the Lock sequence between WorkerPrivate::mMutex and FutexThread::Lock.
This could help resolve the WorkerRunnable dispatching and JS execution deadlock situation.

However, to avoid the deadlock perfectly, we should not introduce any lock sequence or ensure that the lock sequence is always in the same order, regardless of the situation. However, it seems that it is not easy to determine the correct sequence between WorkerPrivate::mMutex and FutexThread::Lock. I would also like to ask Iain to figure out how to decouple the locks in the JS code, so we will not easily introduce a deadlock again.

Flags: needinfo?(iireland)

There are two relevant locks here. There are three possible orderings that we could enforce:

  1. The worker private mutex must be claimed before the futex lock.
  2. The worker private mutex must be claimed after the futex lock.
  3. Neither lock should be claimed while the other is held.

In JS_RequestInterruptCallback, we have to claim the futex lock to wake up any threads that are currently suspended inside of Atomics.wait. I don't think there's anything we can change about that. This means that we either to pick option #1 above, or to release the worker lock before calling JS_RequestInterruptFallback.

In AtomicsWaitAsyncCriticalSection, we claim the futex lock (as specified here), do some work, and then release the lock. If there is a timeout, we call delayedDispatchToEventLoop to enqueue a timeout job with the embedder. The specification says that this should happen before releasing the lock, but the worker implementation of delayedDispatchToEventLoop claims the WorkerPrivate mutex. Here, our options are to pick option #2, or release the futex lock before dispatching the timeout task. In this patch, we went with the latter option, because we couldn't come up with a way for anything to go wrong if the timeout task is enqueued outside the critical section.

In this bug, we ran into a second, similar issue. In this case, when a timeout task runs, we have to resolve the promise created by Atomics.waitAsync with the string "timed-out". To do so, we dispatch a task to the worker that originally called Atomics.waitAsync. We also update some of the futex-guarded state. Again, this is specified to do all the work inside the critical section, but we could also choose to move removeFromCancellableListAndDispatch outside the critical section in the same way as above. If we don't, we have to go with option #2 above.

As linked in comment 1, there are two other cases where we call removeFromCancellableListAndDispatch:

  1. In Atomics.notify, we may have to resolve an arbitrary number of promises created by other threads with the string "ok". This involves dispatching a task to those threads. It's a little annoying to move this out of the loop, but we already have to do something similar for promises that we want to resolve on our own thread.
  2. If a SharedArrayBuffer dies, any promise that was created by a (timeout-free) waitAsync on an index in that buffer can no longer be resolved. In this code we clean up the tasks that would resolve those promises. Moving removeFromCancellableListAndDispatch outside of the critical section here is annoying because there can be many waiters, so we would have to build a list of waiters inside the critical section and then free them outside.

We can't explicitly claim the worker lock in any of this code because it's generic across embedders; the main thread and worklets don't have the same locking problems.

So revisiting the options above:

  1. Worker before futex: the three remaining cases where SM dispatches tasks to the embedding must be rewritten to move those calls outside the futex critical section.
  2. Futex before worker: DispatchControlRunnable must drop the worker mutex before calling JS_RequestInterruptCallback.
  3. Mutually exclusive: Both sets of changes must be made.

Miscellaneous thoughts:

I should maybe point out that all these tasks are cancelled as part of context shutdown; I'm not sure how that interacts with the concerns about workers exiting their run loops.

Given that all of our problems are around dispatching tasks, I suppose another option would be to split out a separate mutex that only covers the dispatch queue. I don't know enough about the worker private to say whether this would be any good.

Flags: needinfo?(iireland)

When dispatching a WorkerControlRunnable, it first requires WorkerPrivate::mMutex to push the runnable into the ControlQueue, and then calls JS_RequestInterruptCallback(), which requires FutexThread::Lock, to interrupt the JS execution for ControlRunnable handling.
However, the required lock sequence could cause a deadlock when JS codes holds FutexThread::Lock and tries to dispatch a runnable to the Worker.

This patch introduces WorkerPrivate::mDispatchingControlRunnables to decouple the WorkerPrivate::mMutex and FutexThread::Lock, and also to prevent the JSContext from being destroyed while dispatching a ControlRunnable.

Assignee: nobody → echuang
Status: NEW → ASSIGNED
Pushed by echuang@mozilla.com: https://github.com/mozilla-firefox/firefox/commit/20a94d0def3a https://hg.mozilla.org/integration/autoland/rev/eddd19f1deb9 Decoupling the dependency between WorkerPrivate::mMutex and FutexThread::Lock in WorkerPrivate::DispatchControlRunnables. r=asuth
Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Target Milestone: --- → 145 Branch

Since https://bugzilla.mozilla.org/show_bug.cgi?id=1884148 is marked as fixed, I assume the patch effectively resolves the deadlock issue.

Iain, please let me know if you are still experiencing issues with this matter. Thank you.

Flags: needinfo?(iireland)

Bug 1884148 has landed successfully, so it appears that we have resolved the deadlock issue. Thanks!

Flags: needinfo?(iireland)
QA Whiteboard: [qa-triage-done-c146/b145]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: