Closed Bug 827768 Opened 10 years ago Closed 9 years ago

[RiverTrail] ParallelArray scatter is slow

Categories

(Core :: JavaScript Engine, defect)

Not set
normal

Tracking


RESOLVED DUPLICATE of bug 965609

People

(Reporter: pnkfelix, Unassigned)

References

Details

(Whiteboard: PJS)

The ParallelArray implementation of scatter could probably be much faster.  The implementations of both strategies for parallel scatter are sometimes much slower (by factors of 3x to 4x) than the sequential version on pnkfelix's scatter-strategies benchmark [1], running on his 4-core laptop (fklock-Oenone).

(Caveat: don't read too much into the results presented below.  I am cutting my analysis of my benchmark short because there are some details that do not make sense to me [2].)

pnkfelix has, for example, observed (from the data in [4]):

  * par/divide-scatter-vector is nearly 3x slower (relative time 0.36) than sequential on an identity scatter of size 10,000; par/divide-output-range is 2x slower there (relative time 0.5).

  * For the same index construction but with input size 1,000,000, the respective relative times are 0.17 and 0.28, i.e. even more severe slowdowns.

  * There are sometimes speedups as well, for some inputs, but it is not clear whether this is reliable; another index construction produced speedups (relative times 1.8 and 2.0) for input size 10,000, but slowdowns when the input was pumped up to 1,000,000.

  * (Some of the slowdowns seem likely attributable to GC overhead; this is to be expected for e.g. par/divide-scatter-vector, as it currently allocates a separate full-sized output array for each thread to act as a write log.)
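For context, the scatter operation under discussion can be modeled sequentially in plain JavaScript.  This is an illustrative sketch of the semantics, not the SpiderMonkey self-hosted code, and the defaultValue/combine parameters are an assumption based on the ParallelArray scatter API:

```javascript
// Illustrative model of sequential scatter semantics (NOT the actual
// SpiderMonkey implementation).  out[targets[i]] receives source[i];
// collisions are resolved with an optional combine function, and slots
// that are never written keep defaultValue.
function scatterSeq(source, targets, defaultValue, length, combine) {
  const out = new Array(length).fill(defaultValue);
  const filled = new Array(length).fill(false);  // the "conflict array"
  for (let i = 0; i < source.length; i++) {
    const t = targets[i];
    if (filled[t]) {
      if (!combine)
        throw new Error("scatter: collision at output index " + t);
      out[t] = combine(out[t], source[i]);
    } else {
      out[t] = source[i];
      filled[t] = true;
    }
  }
  return out;
}
```

An identity scatter (targets[i] === i) simply copies the input, which is why it makes a convenient baseline in the benchmark.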

----

So, what to do.

Suggestions:

  * Timing of individual phases (see [3]) indicated that the merge step often takes more time than the fill step.

  * Timing of individual phases also indicates that allocation is taking the bulk of the time in many cases (often more than fill or merge individually, and sometimes more than their sum).  This can be addressed by a couple different changes:

    ** Revise the conflict array to be a bitvector, especially if it can be a typed array of integers.  (Note that a parallel merge will need to take care to chunk its work accordingly here; I think niko already did something similar for filter.)
 
    ** When no combine function is given to resolve collisions, do not bother allocating separate output arrays for each thread; just write directly into the output.  (Note that in this case one still needs to detect collisions, so a per-thread conflict array is still required; but as noted in the previous bullet, this could be made much cheaper.)

    ** There may be other potential fixes for the case when the user does provide a combine function.  I considered at the outset using a specialized representation for the write log, e.g. an array of (index, value) pairs, which one would sort and then parallel-merge.  However, this would require sorting and seems hard to justify without a proper prototype, which may take too much time to develop.
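As a rough sketch of the bitvector idea from the first bullet (class and method names are hypothetical, not from any actual patch): a typed-array-backed bitvector replaces the one-entry-per-slot conflict array, shrinking the allocation by roughly a factor of 32 and keeping it out of the object heap.

```javascript
// Hypothetical sketch: a conflict bitvector backed by a Uint32Array,
// standing in for a full boolean conflict array.  One bit per output slot.
class ConflictBits {
  constructor(length) {
    this.bits = new Uint32Array((length + 31) >>> 5);
  }
  // Reports whether slot i was already marked, and marks it.
  testAndSet(i) {
    const word = i >>> 5;
    const mask = 1 << (i & 31);
    const wasSet = (this.bits[word] & mask) !== 0;
    this.bits[word] |= mask;
    return wasSet;
  }
}
```

As the bullet notes, a parallel merge over such a structure would have to chunk its work on 32-slot (word) boundaries so that two workers never write the same word.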

Okay, those are my notes.  (Maybe this should have been a blog post, but I wanted to make sure this work-item would be tracked for the future.)

[1]: https://github.com/pnkfelix/iontrail/blob/4907c482ee407be1048c8cc6b45594597528f212/js/src/jit-test/benchmarks/scatter-strategies.js

[2]: Namely on how the behavior could differ, and by such a large degree, between index:makeIdentityIdx and index:makeRotateIdx when the output lengths are equivalent.  Something may be wrong in the benchmark itself.

[3]: https://github.com/pnkfelix/iontrail/commit/9cc07c2e14620429d4e3b2c3993476fad521a605

[4]: Table of values that was basis of analysis above:

k:0 s:(1e4->1e4)*5    mode:seq                        time:     4 (1)     GCs:0
k:0 s:(1e4->1e4)*5    mode:par/divide-scatter-vector  time:    11 (0.36)  GCs:0
k:0 s:(1e4->1e4)*5    mode:par/divide-output-range    time:     8 (0.5)   GCs:0

k:1 s:(1e6->1e6)*5    mode:seq                        time:   162 (1)     GCs:2
k:1 s:(1e6->1e6)*5    mode:par/divide-scatter-vector  time:   916 (0.17)  GCs:20
k:1 s:(1e6->1e6)*5    mode:par/divide-output-range    time:   568 (0.28)  GCs:2

k:2 s:(1e4->1e4)*5    mode:seq                        time:    18 (1)     GCs:0
k:2 s:(1e4->1e4)*5    mode:par/divide-scatter-vector  time:    10 (1.8)   GCs:0
k:2 s:(1e4->1e4)*5    mode:par/divide-output-range    time:     9 (2)     GCs:0

k:3 s:(1e4->10 )*5    mode:seq                        time:     6 (1)     GCs:0
k:3 s:(1e4->10 )*5    mode:par/divide-scatter-vector  time:     4 (1.5)   GCs:0
k:3 s:(1e4->10 )*5    mode:par/divide-output-range    time:    10 (0.6)   GCs:0

k:4 s:(1e6->1e6)*5    mode:seq                        time:   155 (1)     GCs:2
k:4 s:(1e6->1e6)*5    mode:par/divide-scatter-vector  time:   910 (0.17)  GCs:20
k:4 s:(1e6->1e6)*5    mode:par/divide-output-range    time:   566 (0.27)  GCs:2

k:5 s:(1e6->1e4)*5    mode:seq                        time:   205 (1)     GCs:0
k:5 s:(1e6->1e4)*5    mode:par/divide-scatter-vector  time:   156 (1.31)  GCs:0
k:5 s:(1e6->1e4)*5    mode:par/divide-output-range    time:   493 (0.41)  GCs:0

k:6 s:(1e6->10 )*5    mode:seq                        time:   201 (1)     GCs:0
k:6 s:(1e6->10 )*5    mode:par/divide-scatter-vector  time:   136 (1.47)  GCs:0
k:6 s:(1e6->10 )*5    mode:par/divide-output-range    time:   547 (0.36)  GCs:0
Summary: [meta] [RiverTrail] ParallelArray scatter is slow → [RiverTrail] ParallelArray scatter is slow
Blocks: PJS
>   * Timing of individual phases ... indicated that the merge step often takes more time than the fill step.

I meant to add after this bullet: "An obvious way to address this is to do the merge in parallel as well, by partitioning according to the length of the intermediate arrays."

I started work on this [1], but got mired in the implementation bailing out too much on my test cases [2], and I was not able to determine whether it was working properly.  I need to shift gears now, so I'm leaving this code in its not-ready-for-prime-time state.

[1]: https://github.com/pnkfelix/iontrail/commits/scatter-parallelize-merge 

[2]: https://github.com/pnkfelix/iontrail/commit/21e9a89eeb44c6e9a9e0204fcce28c533a6b701f
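The partition-the-merge idea can be sketched as follows.  This is a sequential simulation with illustrative names: each slice would really run on its own worker (the slices touch disjoint ranges of the output, so no synchronization is needed), and using undefined as the "not yet merged" marker is a shortcut that assumes source values are never undefined.

```javascript
// Hypothetical sketch of parallelizing the merge step by partitioning the
// output index range across workers.  `intermediates` holds each thread's
// full-sized write log; `filled[t][i]` says whether thread t wrote slot i.
function mergeByOutputRange(intermediates, filled, length, combine, numSlices) {
  const out = new Array(length);
  const sliceSize = Math.ceil(length / numSlices);
  for (let s = 0; s < numSlices; s++) {   // conceptually: one worker per slice
    const lo = s * sliceSize;
    const hi = Math.min(lo + sliceSize, length);
    for (let i = lo; i < hi; i++) {
      for (let t = 0; t < intermediates.length; t++) {
        if (!filled[t][i]) continue;      // thread t never wrote slot i
        out[i] = (out[i] === undefined)
          ? intermediates[t][i]
          : combine(out[i], intermediates[t][i]);
      }
    }
  }
  return out;
}
```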
(In reply to Felix S Klock II [:pnkfelix, :fklock] from comment #0)
>   * Timing of individual phases also indicates that allocation is taking the
> bulk of the time in many cases (often more than fill or merge individually,
> and sometimes more than their sum).  This can be addressed by a couple
> different changes:
> 
>     ** Revise the conflict array to be a bitvector.  Especially if it can be
> a typed array of integers.  (Note that a parallel-merge will need to take
> care to chunk its work accordingly here; I think niko already did something
> similar for filter.)


If we are constructing an array of arbitrary objects (as opposed to a typed array), and if we continue to allocate thread-local arrays, then we do not need a conflict array at all and could avoid allocating it entirely; we can instead use the 'hole' value to indicate that a slot has not yet been filled.

(Note that this is incompatible with the other suggestion, for the case when no combine function is provided, to 'not bother allocating separate output arrays for each thread'.)
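A minimal sketch of the hole-based bookkeeping (illustrative names; assumes the thread-local output is created with `new Array(n)`, so every slot starts as a hole and `i in arr` distinguishes "never written" from any written value, including undefined):

```javascript
// Hypothetical sketch: use array holes in place of a separate conflict
// array.  `new Array(n)` is all holes, so `index in local` stays false
// until the slot is written.
function scatterIntoLocal(local, index, value, combine) {
  if (index in local) {                 // slot already filled: a collision
    if (!combine)
      throw new Error("scatter: collision at output index " + index);
    local[index] = combine(local[index], value);
  } else {
    local[index] = value;               // first write fills the hole
  }
}
```

The trade-off is as stated above: this only works for ordinary object arrays, since typed arrays have no holes.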
See Also: → 964741
I believe shu is going to just remove the current parallel scatter code, so the data collected here is probably going to be irrelevant.  Closing this ticket in favor of Bug 965609.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE