Bug 985865 (Closed) - PJS: Investigate synchronization overhead in work item scheduling
Opened 10 years ago · Closed 9 years ago
Categories: Core :: JavaScript Engine, defect
Status: RESOLVED WONTFIX
People: Reporter lth, Unassigned
Attachments (2 files):
- patch, 9.01 KB
- patch, 23.27 KB
The current PJS scheduler may have avoidable synchronization overhead. This bug tracks that work.

- The pendingSlices_ member of ThreadPool is hammered on by all the workers.
- The worker structures contain atomic variables, and they are small enough that there may be false sharing if they are allocated next to each other.

General approaches here will be to introduce padding, distribute the shared counter, and, in the best of all worlds, make synchronized variables unsynchronized. If the work items are small, the synchronization overhead could be a large part of the cost (TBD).
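For the false-sharing bullet, one common remedy is padding each worker's hot atomics out to a full cache line. A minimal sketch, assuming a 64-byte line; kCacheLineSize and PaddedCounter are illustrative names, not the actual ThreadPoolWorker layout:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; real code would query or configure this.
constexpr std::size_t kCacheLineSize = 64;

// alignas pads and aligns the struct so that adjacent counters in an
// array never land on the same cache line, avoiding false sharing.
struct alignas(kCacheLineSize) PaddedCounter {
    std::atomic<std::uint32_t> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLineSize,
              "one counter per cache line");
```

The cost is wasted memory (60 of every 64 bytes here), which is why padding pays off only for counters that are genuinely contended.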
Comment 1•10 years ago
I'm working on a patch to remove pendingSlices_. The idea is that since the number of workers is fixed for a workload, each worker structure would have a thread-local array of ThreadPoolWorker pointers. To steal work, each worker chooses a worker at random from its local array, removing workers with no more work as it goes. The worker finishes when its local array of workers is empty. This should solve the first bullet above.

This array of pointers will also be allocated contiguously after the ThreadPoolWorker itself, making each ThreadPoolWorker bigger than a cache line, so it should coincidentally solve the second bullet too.
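A single-threaded sketch of that victim-array scheme, with hypothetical names (Worker, trySteal); the real slice deques would of course be accessed with atomics:

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for ThreadPoolWorker and its slice deque.
struct Worker {
    std::vector<int> slices;
    bool hasWork() const { return !slices.empty(); }
};

// Pick a random victim from the local array; victims found empty are
// removed, so the loop (and the worker) terminates when none remain.
bool trySteal(std::vector<Worker*>& victims, int& out) {
    while (!victims.empty()) {
        std::size_t i = std::rand() % victims.size();
        Worker* v = victims[i];
        if (v->hasWork()) {
            out = v->slices.back();
            v->slices.pop_back();
            return true;
        }
        victims[i] = victims.back();  // drop the drained victim
        victims.pop_back();
    }
    return false;  // local array empty: this worker is finished
}
```

The point of the per-worker victim array is termination without a shared counter: when a worker's own array empties, it knows it is done, without ever reading pool-global state.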
Reporter
Comment 2 • 10 years ago
This patch has two fixes:

- Padding for ThreadPoolWorker. I don't think this will matter in any case, since those structures are heap-allocated rather than allocated in an array, but it's good to be sure.
- Distributing the pendingSlices_ counter by moving it from the ThreadPool into the workers, and making the hasWork test more complex. The key to this hack is to un-inline ForkJoinGetSlice.

These changes don't appear to move the needle at all on either the 4-core AMD or the 4x2 Intel i7, with reasonable thread count settings. It is possible that what really needs to be targeted is the synchronized updates to pendingSlices_ and sliceBounds_, but that would change how work stealing and the workers' agreement to stop computing are implemented in a major way. It would be useful to have some indication first that going after that will be worth it.
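The distributed-counter fix can be sketched like this. This is a hedged illustration, not the patch's actual code; WorkerSlices, gWorkers, and hasPendingSlices are assumed names:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kNumWorkers = 4;  // fixed per workload, as above

// Each worker owns its own padded pending count instead of all workers
// hammering one shared ThreadPool::pendingSlices_.
struct alignas(64) WorkerSlices {
    std::atomic<std::uint32_t> pending{0};
};

std::array<WorkerSlices, kNumWorkers> gWorkers;

// The "more complex" hasWork test: scan every worker's counter. Each
// scan is costlier, but the hot decrement path touches only local state.
bool hasPendingSlices() {
    for (const auto& w : gWorkers)
        if (w.pending.load(std::memory_order_relaxed) > 0)
            return true;
    return false;
}
```

The trade is O(1) shared-counter updates for an O(workers) scan on the test side, which only wins if the test is much rarer than the update.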
Comment 3 • 10 years ago
This doesn't have the inlining change, but I don't think that matters, at least for the displace benchmark. It doesn't improve anything all that much.

What I am noticing is that on my 4-core HT machine, a thread count > 4 results in a lot more stealing happening. Contention on sliceBounds is probably the culprit. I'm going to look into splitting the lower and upper bounds into two separate fields that can be CAS'd separately. Steal-half should also be worth looking into.
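The split-bounds idea might look like the sketch below. The layout is hypothetical, and note that a real implementation must handle the owner and a thief racing for the last slice, which is one reason bounds are often packed into a single word and CAS'd together:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical split of sliceBounds_ into two independently CAS'd fields.
struct SliceBounds {
    std::atomic<std::uint16_t> lower{0};  // advanced by the owning worker
    std::atomic<std::uint16_t> upper{0};  // claimed downward by thieves
};

// Thief side: try to claim the topmost remaining slice with a CAS on
// `upper` only, without touching the owner's `lower` field.
bool stealOne(SliceBounds& b, std::uint16_t& slice) {
    std::uint16_t up = b.upper.load();
    while (up > b.lower.load()) {
        if (b.upper.compare_exchange_weak(
                up, static_cast<std::uint16_t>(up - 1))) {
            slice = static_cast<std::uint16_t>(up - 1);
            return true;
        }
        // CAS failure reloads `up` with the current value; retry.
    }
    return false;  // no slices left to steal
}
```

The hoped-for win is that the owner's bump of lower and the thieves' CAS on upper no longer invalidate each other's CAS attempts on a shared word.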
Updated • 10 years ago
Assignee: nobody → shu
Status: NEW → ASSIGNED
Comment 4 • 10 years ago
> I'm going to look into splitting the lower and upper bounds into 2 separate
> fields that can be CAS'd separately.
I just experimented with this, and it didn't seem to have any effect.
Comment 5 • 10 years ago
The more-threads slowdown probably isn't caused by work stealing: I see the same slowdown pattern even when I hardcode work stealing to off in ThreadPool::workStealing. Hmm...
Updated • 10 years ago
Assignee: shu → nobody
Comment 6 • 10 years ago
On my 4-core HT machine, I manually clocked the time spent waiting for worker threads to join, and recorded all the wait times for convolve-mapPar with thread-count=4 and =8, respectively. Note that I'm just benchmarking the time it takes for this block in ThreadPool.cpp:

    {
        AutoLockMonitor lock(*this);
        waitForWorkers(lock);
    }

By the time they get to this point, all slices have been taken by some worker thread, but may not have finished. The sums of the wait times are:

    thread-count=4:  34.6476 ms
    thread-count=8: 319.308 ms

What is going on here? Is it the joinBarrier_ condvar?
Comment 7 • 10 years ago
FWIW, there is no jump in the summed wait time from tc=3 to tc=4, but from tc=4 to tc=5 it jumps almost 6x, from 34 ms to 198 ms.
Comment 8 • 10 years ago
Busy waiting for workers instead of using a condvar gives us a 1.2x improvement on convolve, but a negligible improvement on deform. I imagine deform is serialized on the GC lock in refillList more than anything.
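A minimal sketch of the busy-wait variant, under the assumption that the pool can expose an atomic count of active workers (activeWorkers_ and waitForWorkersBusy are illustrative names):

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

std::atomic<std::uint32_t> activeWorkers_{0};

// Spin until every worker has checked out, instead of sleeping on a
// condvar. This trades a burned core for lower wakeup latency, which
// can pay off when joins are frequent and short.
void waitForWorkersBusy() {
    while (activeWorkers_.load(std::memory_order_acquire) != 0)
        std::this_thread::yield();
}
```

The acquire load pairs with a release store by the last worker, so slice results written before checkout are visible to the joining thread.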
Reporter
Updated • 9 years ago
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX