Closed Bug 985887 (Opened 10 years ago, Closed 9 years ago)

PJS: Investigate synchronization overhead in job management

Categories

(Core :: JavaScript Engine, defect)

Hardware: x86
OS: macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED WONTFIX

People

(Reporter: lth, Assigned: shu)

References

Details

Attachments

(2 files)

There's a suspicion that the condition variable that is used to start a parallel job and synchronize when the job is done creates a serializing bottleneck.
There are two different approaches to this problem (if it is real):

- A separate lock / condition variable per worker, to avoid serializing on the lock after notifyAll();
  in general, distributing locks to keep the workers from interacting more than necessary (see the
  sketch after this list).

- Keeping the workers running (ie not blocked on a condition variable) for a while after a
  parallel section, in case more work comes along.
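
To make the first idea concrete, here is a minimal sketch (in terms of std:: primitives; the names are invented and this is not a patch) of per-worker lock/condvar pairs, so that starting a parallel section signals each worker on its own lock instead of doing notifyAll() on one shared lock:

#include <condition_variable>
#include <memory>
#include <mutex>

// Sketch only: one lock/condvar pair per worker, so dispatching a parallel
// section signals each worker on its own lock instead of notify_all() on a
// single shared condvar. All names here are invented for illustration.
struct PerWorkerSignal {
    std::mutex lock;
    std::condition_variable cond;
    bool pending = false;
};

class Dispatcher {
    size_t count_;
    std::unique_ptr<PerWorkerSignal[]> signals_;

  public:
    explicit Dispatcher(size_t workers)
      : count_(workers), signals_(new PerWorkerSignal[workers]) {}

    // Main thread: start a parallel section by waking each worker separately.
    void dispatchAll() {
        for (size_t i = 0; i < count_; i++) {
            PerWorkerSignal& s = signals_[i];
            std::lock_guard<std::mutex> guard(s.lock);
            s.pending = true;
            s.cond.notify_one();
        }
    }

    // Worker i: block on its own condvar only, so wakeups don't all contend
    // on the same mutex.
    void waitFor(size_t i) {
        PerWorkerSignal& s = signals_[i];
        std::unique_lock<std::mutex> guard(s.lock);
        s.cond.wait(guard, [&] { return s.pending; });
        s.pending = false;
    }
};

The point is only that a woken worker never has to reacquire the lock that the dispatcher broadcast on, so wakeups don't queue up behind each other.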

I've toyed with a busy-wait loop in the worker helper and tried to avoid some locking.  This moves the needle slightly on my test case, but it has a bad side effect: once the number of threads grows too large, performance degrades sharply.  The non-busy-waiting solution does not have that problem.
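
For reference, the shape of what I tried, as a hedged sketch in terms of std:: primitives rather than the actual ThreadPool code; the spin count is arbitrary and the names are made up:

#include <atomic>
#include <condition_variable>
#include <mutex>

// Sketch only: spin briefly on an atomic "work available" flag before
// falling back to blocking on the condition variable.
struct WorkerSlot {
    std::atomic<bool> hasWork{false};
    std::mutex lock;
    std::condition_variable wakeup;

    void waitForWork() {
        // Busy-wait phase: cheap acquire loads, no lock taken.
        for (int spins = 0; spins < 10000; spins++) {
            if (hasWork.load(std::memory_order_acquire))
                return;
        }
        // Blocking phase: fall back to the condvar as before.
        std::unique_lock<std::mutex> guard(lock);
        wakeup.wait(guard, [this] {
            return hasWork.load(std::memory_order_acquire);
        });
    }

    void postWork() {
        hasWork.store(true, std::memory_order_release);
        std::lock_guard<std::mutex> guard(lock);
        wakeup.notify_one();
    }
};

The downside shows up exactly as in the numbers below: once there are more spinning threads than cores, the spinners steal cycles from the threads that have real work.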

MacBook Pro, times in ms (cat-convolve-mapPar-wide.js):

  with patch   without patch
        2539            2555
        1446            1510
        1072            1244
         886             895
         982             952
        1568             940
Attached patch busywait.patch (Splinter Review)
Re comment #1, the timings are for --thread-count=1/2/3/4/5/6 on a 4-core hyperthreaded machine.

The best straightforward serial time is 910ms; the best clever serial algorithm is 570ms.
Also see https://bugzilla.mozilla.org/show_bug.cgi?id=981547#c15.

Observations:

- With GC more or less ruled out from the convolve workload (short vectors), the sequential mapTo()
  is still 2.5x as fast as the parallel mapParTo().  This drops to 2x if we introduce busy-waiting.

- Without busy-waiting, Instruments shows 40% of the time spent waiting for the condition variable.
  With busy-waiting this drops to something smallish (less than 5% IIRC).  The time is now charged
  to the worker's "helper" function, as it should be.

Ergo, we can wait more or less efficiently, but we still have to wait.

Given that so much time is spent waiting for work, the most reasonable conclusion is that work is slow to arrive, i.e., there is a serializing bottleneck.  Since the serial user code in this benchmark does virtually nothing, it is reasonable to suppose that the path from the user code to the fork point really is quite long (longer than it might look).

(I'm currently running with patches applied for TypedObject array's map() (sequential) and for the TypedObject predicates.)

There are 60000 rows.  The serial code runs at a speed of 100 rows/ms.  Assume the parallel workers can process 200 rows/ms with four cores (50 rows/ms/core), taking into account overhead for work stealing synchronization.  Then the work takes 300ms, and there are 900ms unaccounted for.  900ms / 60000 rows = 15 us/row overhead.  At 2.5GHz that's 15 * 1000 * 2.5 = 37500 clocks per row.

There's no evidence yet, from all our experiments, that the per-core speed is particularly low, but it's an assumption that might be wrong.  If the speed is 25 rows/ms/core then the time in parallel execution is 600 ms and serial overhead is 10 us/row = 25000 clocks per row.
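
As a sanity check on the arithmetic (the inputs are the assumptions above, not measurements):

#include <cstdio>

int main() {
    const double rows = 60000;
    const double ghz = 2.5;

    // Scenario 1: 200 rows/ms parallel throughput (50 rows/ms/core * 4 cores).
    // Scenario 2: 100 rows/ms parallel throughput (25 rows/ms/core * 4 cores).
    const double rates[] = { 200, 100 };
    const double overheadMs[] = { 900, 600 };  // the "unaccounted for" time above

    for (int i = 0; i < 2; i++) {
        double workMs = rows / rates[i];                   // 300 ms, 600 ms
        double usPerRow = overheadMs[i] * 1000.0 / rows;   // 15 us, 10 us
        double clocksPerRow = usPerRow * 1000.0 * ghz;     // 37500, 25000
        printf("work %.0f ms, overhead %.1f us/row = %.0f clocks/row\n",
               workMs, usPerRow, clocksPerRow);
    }
    return 0;
}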

When I run an optimized build through Instruments (30s and 15s runs), the top hits are all in synchronization functions but the deeper picture is unclear.
This patch by itself isn't a win for us -- we still need to implement a way to get the number of physical cores (vs. hardware threads) to avoid spinning on more threads than we have cores.
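
For what it's worth, a hedged sketch of the gap: std::thread::hardware_concurrency() (like most of the easy APIs) reports hardware threads, not physical cores, so sizing the spinning pool from it oversubscribes a hyperthreaded machine. Getting physical cores needs a platform-specific query; the macOS sysctl below is illustrative, not the plan of record.

#include <cstdio>
#include <thread>
#ifdef __APPLE__
#include <sys/sysctl.h>
#endif

// Illustrative only: hardware_concurrency() returns hardware threads
// (8 on a 4-core hyperthreaded MacBook Pro); the busy-wait heuristic
// wants physical cores (4), which needs a platform-specific query.
int main() {
    printf("hardware threads: %u\n", std::thread::hardware_concurrency());
#ifdef __APPLE__
    int physical = 0;
    size_t len = sizeof(physical);
    if (sysctlbyname("hw.physicalcpu", &physical, &len, nullptr, 0) == 0)
        printf("physical cores:   %d\n", physical);
#endif
    return 0;
}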
Attachment #8396096 - Flags: review?(lhansen)
Assignee: nobody → shu
Status: NEW → ASSIGNED
Depends on: 987466
Comment on attachment 8396096 [details] [diff] [review]
Busy wait before waiting on condvar for more work; busy wait when joining threads

Review of attachment 8396096 [details] [diff] [review]:
-----------------------------------------------------------------

Nit(?): The explanation of the manipulation of pool->activeWorkers_ in helperLoop() feels opaque.  The central invariant is that threads wait by checking for activeWorkers_ to reach zero; once that happens other threads may run ahead to do something else, and it would be bad for the present thread to start working.  That might usefully be documented somewhere; the wording here about "inconsistent state" is probably insufficient.

Nit: Of secondary concern is the terminology "switching to another thread" in the explanation, since true concurrency is involved and no thread switching need take place, only another thread getting in ahead of us.
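
To spell out what I mean, roughly this (a simplified sketch; activeWorkers_ is the name from the patch, everything else is invented and this is not the ThreadPool code):

#include <condition_variable>
#include <cstdint>
#include <mutex>

// Simplified sketch of the join invariant: the main thread considers the
// parallel section finished once activeWorkers_ reaches zero, so a worker
// must not touch shared job state after making that final decrement.
class JoinState {
    std::mutex lock_;
    std::condition_variable done_;
    uint32_t activeWorkers_ = 0;

  public:
    void beginSection(uint32_t workers) {
        std::lock_guard<std::mutex> guard(lock_);
        activeWorkers_ = workers;
    }

    // Called by a worker when it has completely finished its slice.
    // After the decrement that reaches zero, the main thread (or another
    // worker) may already be running ahead, so do no further shared work.
    void workerDone() {
        std::lock_guard<std::mutex> guard(lock_);
        if (--activeWorkers_ == 0)
            done_.notify_all();
    }

    // Called by the main thread to wait for the section to finish.
    void join() {
        std::unique_lock<std::mutex> guard(lock_);
        done_.wait(guard, [this] { return activeWorkers_ == 0; });
    }
};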
Attachment #8396096 - Flags: review?(lhansen) → review+
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX