stylo: Parallel traversal significantly weakens style sharing

RESOLVED FIXED

Status


Core
CSS Parsing and Computation
P1
normal
RESOLVED FIXED
2 months ago
a month ago

People

(Reporter: bholley, Assigned: bholley)

Tracking

(Blocks: 2 bugs)

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(6 attachments)

Julian and I discussed this in the office. This may shed some light on other parallelism issues as well, though it's less urgent on its own (since "only" a 3x win wouldn't stop us from shipping).

Julian is going to look at this, but has other more-pressing work to do first (i.e. bug 1365681 and bug 1365682).
More of a concern (to me) is that transitioning to the parallel version
seems to give a big performance hit.  For a balanced tree of 5903 nodes
and 100 iterations, I have the following times in milliseconds

STYLO_THREADS=1   666  667  668
              2  1258 1274 1264
              3   899  906  895
              4   742  709  756
I profiled with Callgrind, collecting data for 200 iterations, for the 1- and
2- thread cases.  Data collection was active only whilst Stylo was iterating
and so the profiles are free from almost all extraneous junk.  Summary:

1 thread:   7.85G insns, 25.2M locked bus cycles
2 threads: 19.55G insns, 64.7M locked bus cycles

There is no sign of Rayon burning instructions in spins.  This is with
the fix for bug 1365681 applied.

Top 4 functions by call count:

1 thread:
   16.64M selectors::matching::matches_simple_selector
   14.14M <bit_vec::BitVec<B>>::push
   13.05M selectors::matching::matches_complex_selector
   13.05M selectors::matching::matches_complex_selector_internal

2 threads:
   44.19M selectors::matching::matches_simple_selector
   14.14M <bit_vec::BitVec<B>>::push
   30.05M selectors::matching::matches_complex_selector
   30.05M selectors::matching::matches_complex_selector_internal

I find it strange that the basic work functions
(selectors::matching::matches_*) should be called different numbers of times
depending on how the traversal is done (parallel vs sequential).  Is it
possible that the parallel traversal is somehow duplicating work?
(In reply to Julian Seward [:jseward] from comment #1)
> [..] For a balanced tree of 5903 nodes and 100 iterations [..]

I should add, this is with 10 selectors, not the 10000 selectors that
the original bloom-basic test uses.
Created attachment 8872567 [details]
callgrind.out.one-thread-200-iters

Callgrind results for one thread.
Created attachment 8872568 [details]
callgrind.out.two-threads-200-iters

Callgrind results for two threads.
(In reply to Julian Seward [:jseward] from comment #2)
>    14.14M <bit_vec::BitVec<B>>::push

So surprised this one is not actually inlined.
(In reply to Emilio Cobos Álvarez [:emilio] from comment #6)
> So surprised this one is not actually inlined.

OT, but .. maybe because it is mostly called from a different
compilation unit, function style::traversal::recalc_style_at.

Even more OT, but .. I sure hope the compiler can reduce |B::bits()|
at compile time, otherwise performance of ::push will be miserable,
since div and mod against B::bits() looks like it happens on every call.
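As an aside on that last point, here's a minimal sketch (not bit-vec's actual code; `BITS` and `bit_position` are invented names for illustration) of why a compile-time-constant block width matters on this hot path: with a power-of-two constant, the div and mod reduce to a shift and a mask, whereas an opaque `B::bits()` call forces a real division on every push.

```rust
// Hypothetical sketch of the hot path in a BitVec-style push: each call
// computes a storage-block index and a bit offset from the current length.
const BITS: usize = 32; // assumed storage-block width, known at compile time

fn bit_position(len: usize) -> (usize, usize) {
    let block = len / BITS;  // with BITS constant, this compiles to len >> 5
    let offset = len % BITS; // with BITS constant, this compiles to len & 31
    (block, offset)
}

fn main() {
    assert_eq!(bit_position(0), (0, 0));
    assert_eq!(bit_position(31), (0, 31));
    assert_eq!(bit_position(70), (2, 6));
    println!("ok");
}
```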
(In reply to Emilio Cobos Álvarez [:emilio] from comment #6)
> (In reply to Julian Seward [:jseward] from comment #2)
> >    14.14M <bit_vec::BitVec<B>>::push
> 
> So surprised this one is not actually inlined.

Filed https://github.com/contain-rs/bit-vec/pull/47 , which may or may not help.
(In reply to Julian Seward [:jseward] from comment #1)
> More of a concern (to me) is that transitioning to the parallel version
> seems to give a big performance hit.  For a balanced tree of 5903 nodes
> and 100 iterations, I have the following times in milliseconds
> 
> STYLO_THREADS=1   666  667  668
>               2  1258 1274 1264
>               3   899  906  895
>               4   742  709  756

Interesting! Can you check what happens with STYLO_THREAD=1 FORCE_STYLO_THREAD_POOL=1? This will give you a single thread but still use rayon. Just want to see which side of the fence this falls on.

It would also be interesting to remeasure with DISABLE_STYLE_SHARING_CACHE=1. That is the one place I can imagine us doing more work with multiple threads, since sharing is thread-local and therefore we can get less sharing depending on how the elements are split up across the threads.
Flags: needinfo?(jseward)
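For what it's worth, the thread-locality effect described above can be shown with a toy model (this is not stylo's actual sharing cache; `shared_count` and the round-robin split are invented for illustration): sharing only happens against entries in the same thread's cache, so splitting matching siblings across threads forfeits hits a single sequential cache would have found.

```rust
use std::collections::HashSet;

// Toy model: each element is keyed by its "style"; an element whose key
// is already in its thread's cache counts as a shared style.
fn shared_count(elements: &[u32], num_threads: usize) -> usize {
    // Round-robin split of elements across per-thread caches.
    let mut caches: Vec<HashSet<u32>> = vec![HashSet::new(); num_threads];
    let mut shared = 0;
    for (i, key) in elements.iter().enumerate() {
        let cache = &mut caches[i % num_threads];
        if !cache.insert(*key) {
            shared += 1; // cache hit: style can be shared
        }
    }
    shared
}

fn main() {
    // Eight elements alternating between two distinct "styles" in runs.
    let elements = [1, 1, 2, 2, 1, 1, 2, 2];
    assert_eq!(shared_count(&elements, 1), 6); // one cache: 6 of 8 shared
    assert_eq!(shared_count(&elements, 2), 4); // two caches: only 4 shared
    println!("ok");
}
```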
(In reply to Julian Seward [:jseward] from comment #1)
> More of a concern (to me) is that transitioning to the parallel version
> seems to give a big performance hit.  For a balanced tree of 5903 nodes
> and 100 iterations, I have the following times in milliseconds
> 
> STYLO_THREADS=1   666  667  668
>               2  1258 1274 1264
>               3   899  906  895
>               4   742  709  756

So, I can't reproduce these results on a clean build. I also applied your iteration patch in bug 1367962, and see the console spew of 200 iterations, but adding more iterations doesn't seem to affect the timing of the test (it always takes ~30ms), so I'm not sure what's going on there. Do you get similar results without your iteration stuff?

Looking at the profiles in kcachegrind, the single-threaded profile seems to spend almost no time doing real selector matching (the callsites tracing back to get_matching_rules), whereas the 2-thread profile seems to do the expected amount. Both profiles seem to spend an equivalent amount of time matching revalidation selectors (the callstacks from recalc_style_at).
Here are some numbers with a clean build from about 6 hours ago, and no
patches (hence no iteration).  This is using the testcases from bug 1368415,
in particular the tree "tree_general_5000_50percent", a balanced tree
containing 5903 nodes, and 10 selectors per node, not 10000 as the original
bloom-basic test case has.

  1s = STYLO_THREADS=1
  1p = STYLO_THREADS=1 and FORCE_STYLO_THREAD_POOL=1

  CACHE?=no means DISABLE_STYLE_SHARING_CACHE=1

  THREADS CACHE?   BEST    // MEASURED

  1s      yes      19.62   // 20.08 19.62 19.71

  1p      yes      30.02   // 30.33 30.02 30.48
  2       yes      22.61   // 22.61 23.48 22.64
  3       yes      19.62   // 20.69 19.81 19.62
  4       yes      18.44   // 18.44 19.64 19.61
  6       yes      18.12   // 18.12 19.40 19.10
  8       yes      17.92   // 18.69 17.92 21.50


  1s      no       33.68   // 34.68 34.04 33.68

  1p      no       34.61   // 34.88 35.02 34.61
  2       no       25.91   // 25.91 26.19 26.66
  3       no       22.47   // 24.63 24.47 22.47
  4       no       20.40   // 21.76 20.40 21.38
  6       no       19.84   // 21.69 19.84 20.80
  8       no       19.48   // 20.65 19.48 19.90

Takeaways are:

* the cache is effective, to varying degrees, in all configurations

* with the cache disabled, 1s and 1p perform similarly, as we'd expect

* with the cache enabled, 1p performs much worse than 1s

* with the cache enabled, 2 performs significantly worse than 1s,
  which is quite troubling.

Given the relatively modest median-case hardware we expect, I think we
should at least try to ensure that the 2-thread case is significantly
faster than the 1s case.

bholley, can you reproduce any of this?  Is it worth chasing?
Flags: needinfo?(jseward) → needinfo?(bobbyholley)
I can reproduce with your modified testcase. Investigating.
It looks like a style sharing issue. I set dom to tree_general_5000_50percent, and bumped the selector count to 50000. Setting DUMP_STYLE_STATISTICS=1, I get the following:

Without FORCE_STYLO_THREAD_POOL:
[PERF] perf block start
[PERF],traversal,sequential
[PERF],elements_traversed,5906
[PERF],elements_styled,5906
[PERF],elements_matched,16
[PERF],styles_shared,5890
[PERF],selectors,50634
[PERF],revalidation_selectors,248
[PERF],dependency_selectors,364
[PERF],declarations,1998
[PERF],stylist_rebuilds,2
[PERF],traversal_time_ms,8.221144991694018
[PERF] perf block end


With FORCE_STYLO_THREAD_POOL:
[PERF] perf block start
[PERF],traversal,parallel
[PERF],elements_traversed,5906
[PERF],elements_styled,5906
[PERF],elements_matched,2955
[PERF],styles_shared,2951
[PERF],selectors,50634
[PERF],revalidation_selectors,248
[PERF],dependency_selectors,364
[PERF],declarations,1998
[PERF],stylist_rebuilds,2
[PERF],traversal_time_ms,239.97544002486393
[PERF] perf block end

I have some hunches, give me a few.
So this reproduces on style-sharing.html from [1] as well. The reason it appears on Julian's testcase and not on bloom-basic in [1] is that Julian removed the div/span distinction, which exists to sabotage the style sharing cache (and give a better measure of raw performance).

So the upshot is that the issue described here is about the parallel traversal hampering the style sharing cache, even when there's only one thread.

I'm going to morph this bug to look at that, since we already have so much discussion on it, and will file another bug for the issue I wanted Julian to look into.

[1] https://github.com/heycam/style-perf-tests
Blocks: 1369066
Filed bug 1369066 for the original issue. I'll investigate this one after breakfast.
Assignee: jseward → bobbyholley
No longer blocks: 1369066
Priority: P2 → P1
Summary: stylo: Investigate why parallelism win on bloom-basic.html is 3x and not 4x → stylo: Parallel traversal significantly weakens style sharing
Back from breakfast, paged this back in, and the answer is obvious: the tail call optimization is screwing us. I think we'll need to remove it.
Blocks: 1368302
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #16)
> [..] the tail call optimization is screwing us. I think we'll need to
> remove it.

Are you still of that opinion?  If so, how should we move this forward?
(In reply to Julian Seward [:jseward] from comment #17)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #16)
> > [..] the tail call optimization is screwing us. I think we'll need to
> > remove it.
> 
> Are you still of that opinion?

Yes - the problem there is that the tail-call optimization can cause us to process elements from level N+1 despite still having elements from level N in our queue. This messes up style sharing, which relies on locality of processing for siblings and same-level cousins.

Further investigation with Niko revealed an additional problem, which is that the rayon Queue is FIFO for the calling thread (and FILO for stealers). We need it to be FILO to maintain our depth-at-a-time ordering.

> If so, how should we move this forward?

Last week Niko and I came up with some patches to rayon+servo, but I got side-tracked with bug 1370107 before I finished measuring them. I'm going to do that next, and hopefully get it all landed shortly.

In the mean time on your end, please continue to poke at parallel performance in VTune (on bloom-basic and elsewhere - note that bug 1370107 should improve bloom-basic performance). Feel free to pass DISABLE_STYLE_SHARING_CACHE=1 if you want to eliminate style sharing from any comparison.
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #18)
> Further investigation with Niko revealed an additional problem, which is
> that the rayon Queue is FIFO for the calling thread (and FILO for stealers).
> We need it to be FILO to maintain our depth-at-a-time ordering.

Sorry, I got the acronyms backwards as usual. It's currently FILO, we need it to be FIFO.
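With the acronyms straightened out, here's a toy simulation (plain Rust, not rayon; the two-children-per-node tree is invented) of why the queue discipline matters: with a FIFO queue, every node at depth N is processed before any node at depth N+1, keeping siblings and same-level cousins adjacent for the sharing cache; with a LIFO stack, the traversal goes depth-first and interleaves levels.

```rust
use std::collections::VecDeque;

// Each "node" is represented by just its depth; every node up to depth 2
// spawns two children. Returns the order in which depths are processed.
fn traverse(fifo: bool) -> Vec<u32> {
    let mut queue: VecDeque<u32> = VecDeque::from([0]);
    let mut order = Vec::new();
    while let Some(depth) = if fifo { queue.pop_front() } else { queue.pop_back() } {
        order.push(depth);
        if depth < 2 {
            queue.push_back(depth + 1);
            queue.push_back(depth + 1);
        }
    }
    order
}

fn main() {
    // FIFO: each level completes before the next begins.
    assert_eq!(traverse(true), vec![0, 1, 1, 2, 2, 2, 2]);
    // LIFO: levels interleave, hurting same-level locality.
    assert_eq!(traverse(false), vec![0, 1, 2, 2, 1, 2, 2]);
    println!("ok");
}
```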
So, I finally made time to do the measurements. I used the iteration patch from bug 1367962 comment 2, with 1000 iterations, and 6 threads on my MBP. I sometimes repeated the measurements 3 or 5 times (full results listed where available), but the results are pretty stable.

The basic conclusion of these numbers is that:
* There is no significant performance impact of the changes to rayon on their own.
* Without style sharing, the traversal order changes are performance-neutral.
* With style sharing, the traversal order changes are a significant (~20%) performance win. Memory presumably improves as well given the increased sharing.

This means that we should go ahead with releasing a new rayon with the breadth-first option. Niko, can you do that?

Here are the measurements from Wikipedia:

Base:
Mean time: 21.29, Median time: 21.03, Best time: 20.49
Mean time: 21.59, Median time: 21.30, Best time: 20.79
Mean time: 21.12, Median time: 20.84, Best time: 20.34

Updated Rayon:
Mean time: 21.75, Median time: 21.49, Best time: 20.99

Updated Rayon, Breadth-first traversal, No Tail call:
Mean time: 17.32, Median time: 17.21, Best time: 15.69

Base, DISABLE_STYLE_SHARING_CACHE=1
Mean time: 23.52, Median time: 23.22, Best time: 22.69
Mean time: 23.91, Median time: 23.53, Best time: 22.92
Mean time: 23.66, Median time: 23.35, Best time: 22.70
Mean time: 23.83, Median time: 23.51, Best time: 22.90
Mean time: 23.54, Median time: 23.21, Best time: 22.57

Updated Rayon, DISABLE_STYLE_SHARING_CACHE=1
Mean time: 23.74, Median time: 23.41, Best time: 22.87
Mean time: 23.98, Median time: 23.69, Best time: 23.23
Mean time: 23.46, Median time: 23.32, Best time: 22.83
Mean time: 24.42, Median time: 23.63, Best time: 23.07
Mean time: 23.88, Median time: 23.58, Best time: 22.95

Updated Rayon, Breadth-first traversal, No Tail call, DISABLE_STYLE_SHARING_CACHE=1:
Mean time: 23.92, Median time: 23.63, Best time: 22.81
Mean time: 24.34, Median time: 24.14, Best time: 23.48
Mean time: 23.69, Median time: 23.50, Best time: 22.80
Flags: needinfo?(nmatsakis)
Some more numbers, all pointing towards this being a win:

=====Myspace====
Base:
Mean time: 3.65, Median time: 3.49, Best time: 3.17
Mean time: 3.65, Median time: 3.52, Best time: 3.20

New:
Mean time: 3.43, Median time: 3.36, Best time: 3.00
Mean time: 3.38, Median time: 3.33, Best time: 3.05

====Github diff====
Base: Mean time: 51.68, Median time: 50.81, Best time: 48.07
New: Mean time: 31.90, Median time: 31.86, Best time: 23.92
I'm going to upload my patches for this so we can get them reviewed in parallel to the rayon release + upgrade. They obviously won't compile until the rayon bump happens.
Created attachment 8875930 [details] [diff] [review]
Part 1 - Enable breadth-first traversal. v1

MozReview-Commit-ID: KJA2drcLTb5
Attachment #8875930 - Flags: review?(emilio+bugs)
Created attachment 8875931 [details] [diff] [review]
Part 2 - Don't do recursive tail calls if there's work in the queue. v1

MozReview-Commit-ID: 8JEdjqAIYmQ
Attachment #8875931 - Flags: review?(emilio+bugs)
Created attachment 8875932 [details] [diff] [review]
Part 3 - Eliminate an unnecessary heap allocation. v1

MozReview-Commit-ID: 4cleAH0JsKK
Attachment #8875932 - Flags: review?(emilio+bugs)
Comment on attachment 8875930 [details] [diff] [review]
Part 1 - Enable breadth-first traversal. v1

Review of attachment 8875930 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/style/gecko/global_style_data.rs
@@ +84,5 @@
> +                // queue to be always FIFO, rather than FIFO for stealers and
> +                // FILO for the owner (which is what rayon does by default). This
> +                // ensures that we process all the elements at a given depth before
> +                // proceeding to the next depth, which is important for style sharing.
> +                .breadth_first()

Hmm, so this is a per-pool config? I guess that's fine.
Attachment #8875930 - Flags: review?(emilio+bugs) → review+
Attachment #8875931 - Flags: review?(emilio+bugs) → review+
Comment on attachment 8875932 [details] [diff] [review]
Part 3 - Eliminate an unnecessary heap allocation. v1

Review of attachment 8875932 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/style/parallel.rs
@@ +58,1 @@
>  type NodeList<N> = SmallVec<[SendNode<N>; WORK_UNIT_MAX]>;

You should use ArrayVec now, I think :)
Attachment #8875932 - Flags: review?(emilio+bugs) → review+
(In reply to Emilio Cobos Álvarez [:emilio] from comment #26)
> Comment on attachment 8875930 [details] [diff] [review]
> Part 1 - Enable breadth-first traversal. v1
> 
> Review of attachment 8875930 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: servo/components/style/gecko/global_style_data.rs
> @@ +84,5 @@
> > +                // queue to be always FIFO, rather than FIFO for stealers and
> > +                // FILO for the owner (which is what rayon does by default). This
> > +                // ensures that we process all the elements at a given depth before
> > +                // proceeding to the next depth, which is important for style sharing.
> > +                .breadth_first()
> 
> Hmm, so this is a per-pool config? I guess that's fine.

Yeah, Niko said it was easier to use with the rayon architecture for some reason.

(In reply to Emilio Cobos Álvarez [:emilio] from comment #27)
> Comment on attachment 8875932 [details] [diff] [review]
> Part 3 - Eliminate an unnecessary heap allocation. v1
> 
> Review of attachment 8875932 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: servo/components/style/parallel.rs
> @@ +58,1 @@
> >  type NodeList<N> = SmallVec<[SendNode<N>; WORK_UNIT_MAX]>;
> 
> You should use ArrayVec now, I think :)

Ok, I agree that the current setup could use a bit of tweaking. I'll attach another patch.
Created attachment 8876263 [details] [diff] [review]
Part 4 - Use ArrayVec and tweak the SmallVec sizes. v1

MozReview-Commit-ID: 1tEZiPdp9WQ
Attachment #8876263 - Flags: review?(emilio+bugs)
Attachment #8876263 - Flags: review?(emilio+bugs) → review+
Rayon PR is here: https://github.com/nikomatsakis/rayon/pull/360

In that PR, I did raise one question concerning an alternative to the `spawn_tail` API that seems superior to me, and on which I would appreciate some feedback:

https://github.com/nikomatsakis/rayon/pull/360#issuecomment-307557564
Flags: needinfo?(nmatsakis)
(In reply to Niko Matsakis [:nmatsakis] from comment #30)
> Rayon PR is here: https://github.com/nikomatsakis/rayon/pull/360
> 
> In that PR, I did raise one question concerning an alternative to the
> `spawn_tail` API that seems superior to me, and on which I would appreciate
> some feedback:
> 
> https://github.com/nikomatsakis/rayon/pull/360#issuecomment-307557564

As mentioned in the PR, I'm all for a bare accessor in favor of spawn_tail.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=eca44be52ee94a5345cfa94e34a7e11d8a9365a0
https://github.com/servo/servo/pull/17334
I forgot that I needed to rework the patches a bit for the different API semantics that Niko eventually settled on.

Here's an updated green try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=24c20213823e79b3837bd16d37e93692e3b92252

I also reverified with a local build that the style sharing improves enormously with these patches (~2000 -> ~7000 styles shared on obama).
https://hg.mozilla.org/integration/autoland/rev/f6f65219155e
Status: NEW → RESOLVED
Last Resolved: a month ago
Resolution: --- → FIXED