Closed Bug 985865 Opened 10 years ago Closed 9 years ago

PJS: Investigate synchronization overhead in work item scheduling

Categories

(Core :: JavaScript Engine, defect)

Platform: x86 macOS
Priority: Not set
Severity: normal

Tracking

RESOLVED WONTFIX

People

(Reporter: lth, Unassigned)

References

Details

Attachments

(2 files)

The current PJS scheduler may have synchronization overhead that can be avoided.  This bug tracks that work.

- The pendingSlices_ member of ThreadPool is hammered on by all the workers
- The worker structures contain atomic variables, and they are so small
  that there may be false sharing if they are allocated next to each other

General approaches here will be to introduce padding, distribute the shared counter, and, in the best of all worlds, make synchronized variables unsynchronized.  If the work items are small, the synchronization overhead could be a large part of the cost (TBD).
I'm working on a patch to remove pendingSlices_.

The idea is that since the # of workers is fixed for a workload, each worker structure would have a thread-local array of ThreadPoolWorker pointers. To steal work, each worker chooses at random a worker from its local worker array, removing workers with no more work at the same time. The worker finishes when its local array of workers is empty. This should solve the first bullet above.

This array of pointers will also be allocated contiguously after the ThreadPoolWorker itself, making each ThreadPoolWorker bigger than a cache line, so it should coincidentally solve the second bullet too.
This patch has two fixes:

- Padding for ThreadPoolWorker.  I don't think this will matter in any case
  since those structures are heap-allocated, not allocated in an array, but
  it's good to be sure.

- Distributing the counter for pendingSlices_ by moving it from the ThreadPool
  into the workers, and making the hasWork test more complex.  The key to this
  hack is to un-inline ForkJoinGetSlice.

These changes don't appear to move the needle at all on either the 4-core AMD or the 4x2 Intel i7, with reasonable thread count settings.

It is possible that what really needs to be targeted is the synchronized updates to pendingSlices_ and sliceBounds_, but that will impact how work stealing and agreement among workers to stop computing are implemented in a major way.  It would be useful to have some indication first that it'll be worth it to go after that.
This doesn't have the inlining, but I don't think that matters, at least for the
displace benchmark. It doesn't improve anything all that much.

What I am noticing is on my 4-core HT machine, thread count > 4 results in a
lot more stealing happening. Contention on sliceBounds is probably the culprit.

I'm going to look into splitting the lower and upper bounds into 2 separate
fields that can be CAS'd separately.

Steal-half should also be worth looking into.
Assignee: nobody → shu
Status: NEW → ASSIGNED
> I'm going to look into splitting the lower and upper bounds into 2 separate
> fields that can be CAS'd separately.

I just experimented with this, and it didn't seem to have any effect.
The more-threads slowdown probably isn't caused by workstealing, as I see the slowdown pattern even when I hardcode workstealing to off in ThreadPool::workStealing. Hmm...
Assignee: shu → nobody
On my 4-core HT machine, I manually clocked the time around waiting for worker threads to join, and recorded all the wait times for convolve-mapPar with thread-count=4 and =8, respectively.

Note that I'm just benchmarking the time it takes for this block in ThreadPool.cpp:

    {
        AutoLockMonitor lock(*this);
        waitForWorkers(lock);
    }

By the time they get to this point, all slices have been taken by some worker thread, though they may not all have finished.

The sum of the wait times is:

 thread-count=4: 34.6476 ms
 thread-count=8: 319.308 ms

What is going on here? Is it the joinBarrier_ condvar?
FWIW, there is no jump in sum wait time from tc=3 to tc=4, but from tc=4 to tc=5 it is almost 6x: from 34 ms to 198 ms.
Busy waiting for workers instead of using a condvar gives us a 1.2x improvement on convolve, but a negligible improvement on deform. I imagine deform is serialized on the GC lock in refillList more than anything.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
