Closed Bug 981547 Opened 9 years ago Closed 8 years ago

Work required to remove overhead from mapPar

Categories

(Core :: JavaScript Engine, defect)

x86_64
Linux
defect
Not set
normal

Tracking


RESOLVED WONTFIX

People

(Reporter: lth, Unassigned)

References

Details

Attachments

(1 file)

(Follow-on work to bug 972581; this may eventually be split into several work items.)

At present, the inner loop of TypedObjectArray.mapPar carries significant overhead: it contains run-time tests for conditions that are constant for the whole loop (e.g., simple type vs. complex type), it may compute arguments that the kernel function does not need (input index, output offset, output cursor), and it checks conditions whose outcome may be apparent from the kernel itself (the kernel's result type).  This overhead will have to be removed if parallel computation is to show a worthwhile speedup over serial computation on low-core-count systems.
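To make the shape of that overhead concrete, here is a minimal, self-contained sketch (plain JS, not the actual self-hosted source; the flag names and object shapes are made up for illustration):

  function mapSequentialSketch(input, output, kernel, inIsComplex, outIsComplex) {
    for (var i = 0; i < input.length; i++) {
      // Both flags are fixed for the whole call, yet tested on every iteration.
      var element = inIsComplex ? { array: input, index: i } : input[i];
      var cursor = outIsComplex ? { array: output, index: i } : undefined;
      // The index and cursor are computed even if the kernel ignores them.
      var r = kernel(element, i, cursor);
      // Whether the kernel returns a value is a property of the kernel itself,
      // yet it is re-checked per element.
      if (r !== undefined)
        output[i] = r;
    }
    return output;
  }

  // Simple-grain example: both flags are false and the branches above are dead.
  mapSequentialSketch([1, 2, 3, 4], new Array(4),
                      function (x) { return x * 2; }, false, false);  // [2, 4, 6, 8]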

Niko's summary on IRC, edited:

mapPar is (modulo bugs) callsite cloned; this basically means we make
a copy of the fn per callsite so that the type information is not
polluted.  Constant propagation also improves.

For example, simple vs complex type is totally predictable based on
TI, and I think most of the choices are.  We'll have to inline the
accessor though.

So I think what we have to change to make that happen is: make
TypeDescrIsSimpleType an intrinsic and inline it, or maybe just
constant fold the DESCR_KIND() access, or both.

And then I think we probably will have to move the computation of
inGrainTypeIsComplex into the kernel so that Ion can see it better.
(Sometimes Ion is conservative around upvars.)  That gets rid of most
of the ifs, except the one for the return value.  For that I was
hoping that we'd inline the call to the kernel.  Actually, TI should
help here too because (I think) if the value is always undefined (or
never undefined) we'll predict that and eliminate the
`if (r !== undefined)` check.

(end summary)
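To make the last point of the summary concrete, two illustrative kernel shapes (not taken from any benchmark):

  // Illustrative kernels only.  TI can observe that the first always returns a
  // number and the second always returns undefined, so the per-element
  // `if (r !== undefined)` test can be folded either way once mapPar is
  // specialized per callsite.
  function kernelReturning(x) { return x * 2; }          // r is never undefined
  function kernelInPlace(x, i, out) { out[i] = x * 2; }  // r is always undefined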
I don't think I set the flags properly to get the callsite cloning we need. In particular, helper fns generally need to be tagged as callsite cloned, unless they are declared inside another callsite cloned fn.
Depends on: 983577
Some early investigations:

- My benchmark is once again the cat convolver, mapping an array of indices to implement the inner
  loop of a matrix algorithm where the matrix is stored in a 1D array.  (A plain-JS sketch of the
  row kernel appears below the timings.)  There are four cases:
    
    - Sequential map, index array is Array:        750ms
    - Sequential map, index array is TypedObject: 3194ms
    - Parallel map, index array is Array:         3360ms
    - Parallel map, index array is TypedObject:   3714ms

  One run, quad-core MacBook Pro, 16GB RAM.  I'll attach sources etc. later.

  The slowdown for the sequential map has been filed as bug 983577.
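  For reference, a plain-JS stand-in for the row kernel (assumed shape; the real benchmark uses
  TypedObject arrays and mapPar, and its kernel weights differ; this box-filter version is only
  meant to show the access pattern):

    function convolveRow(matrix, width, row, indices) {
      var height = Math.floor(matrix.length / width);
      return indices.map(function (col) {
        // Clamp so the 3x3 neighbourhood stays inside the matrix.
        var c = Math.min(width - 2, Math.max(1, col));
        var r = Math.min(height - 2, Math.max(1, row));
        var sum = 0;
        for (var dr = -1; dr <= 1; dr++)
          for (var dc = -1; dc <= 1; dc++)
            sum += matrix[(r + dr) * width + (c + dc)];
        return sum / 9;  // box filter; the real kernel's weights differ
      });
    }

    // One row of a 4-wide, 3-high matrix:
    // convolveRow([1,1,1,1, 2,2,2,2, 3,3,3,3], 4, 1, [0, 1, 2, 3])  -> [2, 2, 2, 2]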

The rest of these notes pertain to the difference between the TypedObject cases, but they should be seen as preliminary, since other effects (notably the sequential slowdown) appear to matter most.

- Specializing the mapThread function within the parallel map implementation with mapThreadSimple()
  when both grain types are simple does not make much of a difference, even though the body of the
  latter is very straightforward (a sketch follows this list).  This could be because the time goes
  elsewhere after all, or because the slice manipulation is more costly than anticipated or has some
  sort of nasty side effect on performance.  (Or because the change is not observable given the
  slowness of typed arrays.)

- Specializing for the number of arguments in the mapped function also results in a small speedup.
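For reference, a hypothetical reconstruction (names assumed) of what the simple-simple specialization boils down to:

  // Per-slice loop for the case where both grain types are simple: no derived
  // typed objects, no output cursor, just loads, a kernel call, and stores.
  function mapThreadSimpleSketch(input, output, start, end, kernel) {
    for (var i = start; i < end; i++)
      output[i] = kernel(input[i], i);
  }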
Re comment #2: The slowness of the TypedObject sequential map could largely (though not entirely) be dealt with by specializing the inner loop for the simple-typed case.  The corresponding fixes for the parallel case had, as observed earlier, little effect.  Nor did hacks that avoid initializing the complex-case variables when the simple branch is taken have much effect.  The performance loss is elsewhere.
This is a quad-core Intel-based machine with 2-way hyperthreading.

Setting --thread-count=3 at the shell reduces the running time of the benchmark by half, which suggests a problem having to do with worker coordination.

The default thread count is supposed to be 7, yet explicitly passing --thread-count=7 yields a faster run than whatever the actual default is.

3 seems like the sweet spot, too.
I tried hacking on a few aspects of the work scheduling algorithm - the slice size on the self-hosted side, and the work-stealing in the thread pool (stealing half the available work from the victim instead of just one slice) - but none of this moved the needle in any meaningful way.
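For the record, the "steal half" tweak amounts to the following arithmetic (a sketch only; the real code manipulates the workers' atomic sliceBounds_ in C++):

  function stealHalf(victim) {
    var remaining = victim.end - victim.start;
    if (remaining <= 1)
      return null;                      // nothing worth stealing
    var take = remaining >> 1;          // half, rounded down
    victim.end -= take;                 // victim keeps the front of its range
    return { start: victim.end, end: victim.end + take };  // thief takes the back
  }

  // var v = { start: 0, end: 10 };
  // stealHalf(v)  -> { start: 5, end: 10 }; v is now { start: 0, end: 5 }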
The other half of the CAS overhead is that there's a CAS for every call to getSlice() from the worker.  For a balanced algorithm this overhead is mostly wasted; for an algorithm with a relatively fine grain the overhead can be substantial.  Some research I did last winter informally suggested costs for *uncontended* CAS in the 200-500 clock range.  It's not hard to imagine realistic mapped functions whose cost falls below that, and certainly ones whose cost falls below 10x that.

It seems to me that each worker could be a little more aggressive about biting off work for itself in an effort to keep the CAS cost under control.  Of course, if that leaves workers idle because it reduces the efficacy of work stealing then that's bad, but it may be a worthwhile experiment anyway.
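Some back-of-the-envelope arithmetic (all numbers assumed) for how much amortizing the CAS could matter:

  function casOverheadFraction(casCycles, elemsPerSlice, cyclesPerElem, slicesPerCas) {
    var work = slicesPerCas * elemsPerSlice * cyclesPerElem;
    return casCycles / (casCycles + work);
  }

  casOverheadFraction(400, 64, 50, 1);  // ~0.11: one CAS per 64-element slice, cheap kernel
  casOverheadFraction(400, 64, 50, 8);  // ~0.015: same work, one CAS per 8 slices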
Re comment #6: an experiment that grabs slices aggressively and caches them locally in the worker to reduce CAS cost moved the needle perhaps a tiny bit, but no more than that.
"Instruments" suggests that a lot of time is spent waiting for condition variables, which is somewhat credible.  The program contains 60,000 parallel sections, all fairly short (640-element arrays).  So I tried a little busy-waiting before blocking, but to no avail.

(It's probably time to go after this in earnest with a profiler and an optimized build with symbols.)
I believe that the culprit here is more likely condvars than CAS contention. For your CAS experiment though, did you change just the C++ function? We inline that, so what you want to change is the asm here: http://dxr.mozilla.org/mozilla-central/source/js/src/jit/shared/CodeGenerator-x86-shared.cpp#1876
(In reply to Shu-yu Guo [:shu] from comment #9)

I missed that.  Thanks.

I also think the condition variables are more likely the problem, given how short the parallel sections are and how many there are of them.  I think my next approach (beyond trying to get some numbers with a proper profiler) will be to mangle the benchmark to increase the grain of concurrency to see what that does.
So with a larger work increment size (20-40x what I had before) and a carefully selected number of workers, the parallel implementation is running as fast as the sequential implementation, on a quad-core Intel, dropping from 1500ms to about 900ms.

All that really proves - taking into account that there isn't anything interesting going on between parallel sections - is that there is quite a bit of per-operation overhead in the parallel engine.

I'm seeing some periodic bailouts due to bitwise operations and other arithmetic.
For now, benchmarks are here:
https://github.com/lars-t-hansen/moz-sandbox

Look in "convolve", the "larger work increment" program is cat-convolve-mapPar-wide.js.
Depends on: 985865
So far we've found a few things; this goes beyond mapPar, really:

- On several hyperthreaded machines, on several benchmarks, we've found that setting --thread-count to
  the number of cores minus one significantly boosts performance over the default, which is the
  number of hardware threads minus one.  On a 4x2 MacBook Pro, setting thread-count=3 doubled the speed
  of the cat convolver.  On an 8x2 Xeon, setting thread-count=7 (I think - Niko did this) improved the
  speed of the deform benchmark sixfold.  (In the latter case it also improved the serial performance.)

- There is some slightly hazy evidence that the main thread does a disproportionate amount of the work,
  suggesting that there's quite a spin-up lag after the condition variable is notified.  (The notifyAll
  will basically require all threads to wake up and then contend for the same lock, thus serializing all
  the workers at the start of the run.)

- The pendingSlices_ counter of the ThreadPool is hotly contended with atomic ops; everyone decrements
  it when grabbing a slice.  The performance impact is unclear.  (A sketch of the slice partitioning
  these counters relate to follows this list.)

- The sliceBounds_ variables in the ThreadPoolWorkers are hit heavily with atomic ops, and they may
  become contended when there's significant work stealing going on.  The performance impact is unclear.
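For context, a sketch (assumed, not the actual ThreadPool code) of statically pre-partitioning the slice range into per-worker bounds, which is the structure the pendingSlices_ and sliceBounds_ observations refer to:

  // Pre-partition numSlices among numWorkers so each worker owns a contiguous
  // [start, end) range (sliceBounds_-style); only stealing then needs to touch
  // another worker's range.
  function partitionSlices(numSlices, numWorkers) {
    var perWorker = Math.floor(numSlices / numWorkers);
    var extra = numSlices % numWorkers;
    var bounds = [], start = 0;
    for (var w = 0; w < numWorkers; w++) {
      var count = perWorker + (w < extra ? 1 : 0);  // spread the remainder
      bounds.push({ start: start, end: start + count });
      start += count;
    }
    return bounds;
  }

  // partitionSlices(10, 3) -> [{start:0,end:4}, {start:4,end:7}, {start:7,end:10}]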
Depends on: 985887
Attached patch: mapTo.patch
This patch represents a simple experiment: in-place 'mapTo' and 'mapParTo' methods that allow us to factor out allocation and GC overhead.

Mapping across a single row at a time (640 elements), this reduces both sequential and parallel running times by 250-300ms.  For the sequential code that's a big win (over 30%) but for the parallel code it is not (10-15%).  The parallel code now runs at barely more than a third of the speed of the equivalent sequential code.

The interesting result is that factoring out allocation/GC does not create a dramatic difference for the parallel code.  Time is lost elsewhere.

(4x2 MacBook Pro, --thread-count=4, benchmarks are cat-convolve-map{,Par}To.js at github/lars-t-hansen/moz-sandbox/convolve)
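The JS-level shape of the in-place variants is roughly this (a sketch; the patched methods live on TypedObject array instances, and the helper names below are assumptions):

  function mapToSketch(input, output, kernel) {
    for (var i = 0; i < input.length; i++)
      output[i] = kernel(input[i], i);
    return output;
  }

  // Preallocate the output once and reuse it across all 60,000 rows, e.g.
  // mapToSketch(rowIndices, preallocatedRow, convolveElement).
  mapToSketch([1, 2, 3], new Array(3), function (x) { return x + 1; });  // [2, 3, 4]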
Re comment #14: the busy-wait work in bug 985887 has a visible impact here, since the parallel sections are short and there is a lot of start/stop overhead.

Without busy-waiting, Apple's Instruments shows about 40% of worker-thread time being spent waiting on condition variables; this could be misleading because the entire run was probably taken into account (though it was a long run).

(Instruments also suggests a lot of time is spent in _pthread_mutex_check_init; this could be misleading or it could be a symptom of creating locks at a high rate or creating locks that have error checking enabled.)
(In reply to Lars T Hansen [:lth] from comment #15)

> (Instruments also suggests a lot of time is spent in
> _pthread_mutex_check_init; this could be misleading or it could be a symptom
> of creating locks at a high rate or creating locks that have error checking
> enabled.)

Locks are created with DEFAULT attributes, which is probably OK (PosixNSPR.cpp).  One lock and one condition variable are created per Monitor instance, and there's one Monitor per parallel job.  Lock and condvar creation and destruction don't seem very expensive (judging from Apple's source, at least).
Blocks: 983864
See bug 983864 for another benchmark, where the problem is that mapPar is not implemented for Sized TypedObject arrays.
No longer blocks: 983864
Depends on: 983864
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX