Bug 1268793 (Open) · Opened 4 years ago · Updated 2 years ago
(This is related to bug 1260432, but really orthogonal.)

Atomics.wait and Atomics.wake are implemented internally as a combination of a single global lock plus a per-agent condition variable: when an agent needs to wait, it waits on its condition variable, and a wakeup signals the condition variables of waiting agents. Both wait and wake need to acquire the global lock to do almost anything, so that lock guards the critical region for all the futex state for all workers in all domains. This is problematic for several reasons:

- Disjoint subsets of the agents in an application that coordinate through disjoint locations in shared memory will still contend on that one lock, and it is not possible to reason about lock contention in one subset independently of the others.

- Fast paths are difficult to construct because of the contended lock: everyone needs access to it even to check whether any work needs to be done. For example, on simple benchmarks I see tremendous slowdowns if the application calls wake() blindly, without any mechanism to first check whether anyone is waiting. The reason is that wait() has a large critical region (a fair amount of code on its own, plus the wait on the condition variable) and can retain ownership of the lock for a fairly long time.

- Since there is a single global lock, multiple applications in multiple domains impact each other in the same way as disjoint subsets of agents within one application.

A more sophisticated system is really called for. This boils down to a concurrent hash table where the location to be waited on is mapped (in a lock-free way) to a dedicated or quasi-dedicated lock/condition-variable pair that is used solely for implementing the sleeping of a particular agent. (Certainly one condition variable per agent, but possibly a shared lock per hash bucket or some such, if that simplifies anything.)

Benchmarks to test this will appear in https://github.com/lars-t-hansen/parlib-simple/tree/master/bench.
> Benchmarks to test this will appear in https://github.com/lars-t-hansen/parlib-simple/tree/master/bench.

Specifically, see the comments in lock.js for the various configuration options. The most relevant benchmark is contended-lock4.html; it can also be configured in the .html file and in contended-lock-worker.js.

In general the finding is that it pays to avoid calling Atomics.wake() in several ways: first, call it only when the state of the lock indicates that it is necessary (the lock is contended); second, track the number of waiters and do not call Atomics.wake() if there are none; third, avoid calling Atomics.wait() (and thus incrementing the waiter count) by spinning with Atomics.pause().

On that note: Atomics.pause() is only beneficial if the contended critical region is "sufficiently long". If the critical region is very short then chances are that wait() will not actually block, because the holder of the region will already have left it; if the critical region is of "medium" length then pausing is a wash. But if the critical region is "long enough" yet not "extremely long", then pausing can provide a nice boost in performance. (This is with a prototype implementation of Atomics.pause for MacOS.) The calculus might change if Atomics.pause can be inlined and if the pauses for early iterations are fairly short, but it really depends on the program, on the cost of Atomics.wait and of Atomics.wake, and probably on the operating system.
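The lock pattern described above can be sketched in C++ (the JS benchmarks express it with Atomics on an Int32Array; this is an illustrative translation, with a bounded spin using std::this_thread::yield standing in for Atomics.pause, and a condition variable standing in for Atomics.wait/wake):

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Illustrative sketch of the pattern: state 0 = unlocked, 1 = locked
// with no waiters, 2 = locked and contended. unlock() only signals when
// the state says a waiter may exist, and lock() spins briefly before
// committing to a blocking wait.
class ContendedLock {
  std::atomic<int32_t> state_{0};
  std::mutex m_;                 // backs the blocking slow path only
  std::condition_variable cv_;

 public:
  void lock() {
    int32_t expected = 0;
    if (state_.compare_exchange_strong(expected, 1)) return;  // fast path

    // Bounded spin (the Atomics.pause() analogue): with a short enough
    // critical region the holder often leaves before we need to block.
    for (int i = 0; i < 100; i++) {
      expected = 0;
      if (state_.compare_exchange_strong(expected, 1)) return;
      std::this_thread::yield();
    }

    // Slow path: mark the lock contended, then sleep while it stays so.
    while (state_.exchange(2) != 0) {
      std::unique_lock<std::mutex> g(m_);
      // Sleep only while the contended marker is still in place
      // (mirrors futex-wait-if-value-is-2); re-mark on wakeup.
      if (state_.load() == 2) cv_.wait(g);
    }
    // exchange(2) returned 0: we hold the lock, with state 2; the next
    // unlock will then signal, a harmless over-wake.
  }

  void unlock() {
    if (state_.exchange(0) == 2) {
      // Only signal when a waiter may exist; uncontended unlocks
      // (the common case) never touch the condvar.
      std::lock_guard<std::mutex> g(m_);
      cv_.notify_one();
    }
  }
};
```

The spin count of 100 is an arbitrary placeholder; as the comment notes, tuning it depends on the program, the cost of the wait and wake operations, and the operating system.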