Closed Bug 985865 Opened 10 years ago Closed 9 years ago

PJS: Investigate synchronization overhead in work item scheduling

Categories

(Core :: JavaScript Engine, defect)

Platform: x86 macOS
Priority: Not set
Severity: normal

Tracking

RESOLVED WONTFIX

People

(Reporter: lth, Unassigned)

References

Details

Attachments

(2 files)

The current PJS scheduler may have synchronization overhead that can be avoided.  This bug tracks that work.

- The pendingSlices_ member of ThreadPool is hammered on by all the workers
- The worker structures contain atomic variables, and they are so small
  that there may be false sharing if they are allocated next to each other

General approaches here will be to introduce padding, distribute the shared counter, and, in the best of all worlds, make synchronized variables unsynchronized.  If the work items are small, the synchronization overhead could be a large part of the cost (TBD).
I'm working on a patch to remove pendingSlices_.

The idea is that since the # of workers is fixed for a workload, each worker structure would have a thread-local array of ThreadPoolWorker pointers. To steal work, each worker chooses at random a worker from its local worker array, removing workers with no more work at the same time. The worker finishes when its local array of workers is empty. This should solve the first bullet above.

This array of pointers will also be allocated contiguously after the ThreadPoolWorker itself, making each ThreadPoolWorker bigger than a cache line, so it should coincidentally solve the second bullet too.
This patch has two fixes:

- Padding for ThreadPoolWorker.  I don't think this will matter in any case
  since those structures are heap-allocated, not allocated in an array, but
  it's good to be sure.

- Distributing the counter for pendingSlices_ by moving it from the ThreadPool
  into the workers, and making the hasWork test more complex.  The key to this
  hack is to un-inline ForkJoinGetSlice.

These changes don't appear to move the needle at all on either the 4-core AMD or the 4x2 Intel i7, with reasonable thread count settings.

It is possible that what really needs to be targeted is the synchronized updates to pendingSlices_ and sliceBounds_, but that will impact how work stealing and agreement among workers to stop computing are implemented in a major way.  It would be useful to have some indication first that it'll be worth it to go after that.
This doesn't have the inlining, but I don't think that matters, at least for the
displace benchmark. It doesn't improve anything all that much.

What I am noticing is on my 4-core HT machine, thread count > 4 results in a
lot more stealing happening. Contention on sliceBounds is probably the culprit.

I'm going to look into splitting the lower and upper bounds into 2 separate
fields that can be CAS'd separately.

Steal-half should also be worth looking into.
Assignee: nobody → shu
Status: NEW → ASSIGNED
> I'm going to look into splitting the lower and upper bounds into 2 separate
> fields that can be CAS'd separately.

I just experimented with this, and it didn't seem to have any effect.
The more-threads slowdown probably isn't caused by workstealing, as I see the slowdown pattern even when I hardcode workstealing to off in ThreadPool::workStealing. Hmm...
Assignee: shu → nobody
On my 4-core HT machine, I manually clocked the time around waiting for worker threads to join, and recorded all the wait times for convolve-mapPar with thread-count=4 and =8, respectively.

Note that I'm just benchmarking the time it takes for this block in ThreadPool.cpp:

    {
        AutoLockMonitor lock(*this);
        waitForWorkers(lock);
    }

By the time they get to this point, all slices have been taken by some worker thread, though they may not all have finished.

The sum of the wait times is:

 thread-count=4: 34.6476 ms
 thread-count=8: 319.308 ms

What is going on here? Is it the joinBarrier_ condvar?
FWIW, there is no jump in sum wait time from tc=3 to tc=4, but from tc=4 to tc=5 it is almost 6x: from 34 ms to 198 ms.
Busy waiting for workers instead of using a condvar gives us a 1.2x improvement on convolve, but a negligible improvement on deform. I imagine deform is serialized on the GC lock in refillList more than anything.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
