stylo: Parallel traversal significantly weakens style sharing

RESOLVED FIXED

Status


Core
CSS Parsing and Computation
P1
normal
RESOLVED FIXED
2 months ago
a month ago

People

(Reporter: bholley, Assigned: bholley)

Tracking

(Blocks: 2 bugs)

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(6 attachments)

Julian and I discussed this in the office. This may shed some light on other parallelism issues as well, though it's less urgent on its own (since "only" a 3x win wouldn't stop us from shipping).

Julian is going to look at this, but has other more-pressing work to do first (i.e. bug 1365681 and bug 1365682).
More of a concern (to me) is that transitioning to the parallel version
seems to give a big performance hit.  For a balanced tree of 5903 nodes
and 100 iterations, I have the following times in milliseconds

STYLO_THREADS=1   666  667  668
              2  1258 1274 1264
              3   899  906  895
              4   742  709  756
I profiled with Callgrind, collecting data for 200 iterations, for the 1- and
2- thread cases.  Data collection was active only whilst Stylo was iterating
and so the profiles are free from almost all extraneous junk.  Summary:

1 thread:   7.85G insns, 25.2M locked bus cycles
2 threads: 19.55G insns, 64.7M locked bus cycles

There is no sign of Rayon burning instructions in spins.  This is with
the fix for bug 1365681 applied.

Top 4 functions by call count:

1 thread:
   16.64M selectors::matching::matches_simple_selector
   14.14M <bit_vec::BitVec<B>>::push
   13.05M selectors::matching::matches_complex_selector
   13.05M selectors::matching::matches_complex_selector_internal

2 threads:
   44.19M selectors::matching::matches_simple_selector
   14.14M <bit_vec::BitVec<B>>::push
   30.05M selectors::matching::matches_complex_selector
   30.05M selectors::matching::matches_complex_selector_internal

I find it strange that the basic work functions
(selectors::matching::matches_*) should be called different numbers of times
depending on how the traversal is done (parallel vs sequential).  Is it
possible that the parallel traversal is somehow duplicating work?
(In reply to Julian Seward [:jseward] from comment #1)
> [..] For a balanced tree of 5903 nodes and 100 iterations [..]

I should add, this is with 10 selectors, not the 10000 selectors that
the original bloom-basic test uses.
Created attachment 8872567 [details]
callgrind.out.one-thread-200-iters

Callgrind results for one thread.
Created attachment 8872568 [details]
callgrind.out.two-threads-200-iters

Callgrind results for two threads.
(In reply to Julian Seward [:jseward] from comment #2)
>    14.14M <bit_vec::BitVec<B>>::push

So surprised this one is not actually inlined.
(In reply to Emilio Cobos Álvarez [:emilio] from comment #6)
> So surprised this one is not actually inlined.

OT, but .. maybe because it is mostly called from a different
compilation unit, function style::traversal::recalc_style_at.

Even more OT, but .. I sure hope the compiler can reduce |B::bits()|
at compile time, otherwise performance of ::push will be miserable,
since div and mod against B::bits() looks like it happens on every call.
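As an aside on that last point, here's a minimal sketch (not bit-vec's actual code; `BITS` and `bit_position` are invented names for illustration) of why a compile-time-constant block width matters on this hot path: with a power-of-two constant, the div and mod reduce to a shift and a mask, whereas an opaque `B::bits()` call forces a real division on every push.

```rust
// Hypothetical sketch of the hot path in a BitVec-style push: each call
// computes a storage-block index and a bit offset from the current length.
const BITS: usize = 32; // assumed storage-block width, known at compile time

fn bit_position(len: usize) -> (usize, usize) {
    let block = len / BITS;  // with BITS constant, this compiles to len >> 5
    let offset = len % BITS; // with BITS constant, this compiles to len & 31
    (block, offset)
}

fn main() {
    assert_eq!(bit_position(0), (0, 0));
    assert_eq!(bit_position(31), (0, 31));
    assert_eq!(bit_position(70), (2, 6));
    println!("ok");
}
```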
(In reply to Emilio Cobos Álvarez [:emilio] from comment #6)
> (In reply to Julian Seward [:jseward] from comment #2)
> >    14.14M <bit_vec::BitVec<B>>::push
> 
> So surprised this one is not actually inlined.

Filed https://github.com/contain-rs/bit-vec/pull/47 , which may or may not help.
(In reply to Julian Seward [:jseward] from comment #1)
> More of a concern (to me) is that transitioning to the parallel version
> seems to give a big performance hit.  For a balanced tree of 5903 nodes
> and 100 iterations, I have the following times in milliseconds
> 
> STYLO_THREADS=1   666  667  668
>               2  1258 1274 1264
>               3   899  906  895
>               4   742  709  756

Interesting! Can you check what happens with STYLO_THREAD=1 FORCE_STYLO_THREAD_POOL=1? This will give you a single thread but still use rayon. Just want to see which side of the fence this falls on.

It would also be interesting to remeasure with DISABLE_STYLE_SHARING_CACHE=1. That is the one place I can imagine us doing more work with multiple threads, since sharing is thread-local and therefore we can get less sharing depending on how the elements are split up across the threads.
Flags: needinfo?(jseward)
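For what it's worth, the thread-locality effect described above can be shown with a toy model (this is not stylo's actual sharing cache; `shared_count` and the round-robin split are invented for illustration): sharing only happens against entries in the same thread's cache, so splitting matching siblings across threads forfeits hits a single sequential cache would have found.

```rust
use std::collections::HashSet;

// Toy model: each element is keyed by its "style"; an element whose key
// is already in its thread's cache counts as a shared style.
fn shared_count(elements: &[u32], num_threads: usize) -> usize {
    // Round-robin split of elements across per-thread caches.
    let mut caches: Vec<HashSet<u32>> = vec![HashSet::new(); num_threads];
    let mut shared = 0;
    for (i, key) in elements.iter().enumerate() {
        let cache = &mut caches[i % num_threads];
        if !cache.insert(*key) {
            shared += 1; // cache hit: style can be shared
        }
    }
    shared
}

fn main() {
    // Eight elements alternating between two distinct "styles" in runs.
    let elements = [1, 1, 2, 2, 1, 1, 2, 2];
    assert_eq!(shared_count(&elements, 1), 6); // one cache: 6 of 8 shared
    assert_eq!(shared_count(&elements, 2), 4); // two caches: only 4 shared
    println!("ok");
}
```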
(In reply to Julian Seward [:jseward] from comment #1)
> More of a concern (to me) is that transitioning to the parallel version
> seems to give a big performance hit.  For a balanced tree of 5903 nodes
> and 100 iterations, I have the following times in milliseconds
> 
> STYLO_THREADS=1   666  667  668
>               2  1258 1274 1264
>               3   899  906  895
>               4   742  709  756

So, I can't reproduce these results on a clean build. I also applied your iteration patch in bug 1367962, and see the console spew of 200 iterations, but adding more iterations doesn't seem to affect the timing of the test (it always takes ~30ms), so I'm not sure what's going on there. Do you get similar results without your iteration stuff?

Looking at the profiles in kcachegrind, the single-threaded profile seems to spend almost no time doing real selector matching (the callsites tracing back to get_matching_rules), whereas the 2-thread profile seems to do the expected amount. Both profiles seem to spend an equivalent amount of time matching revalidation selectors (the callstacks from recalc_style_at).
Here are some numbers with a clean build from about 6 hours ago, and no
patches (hence no iteration).  This is using the testcases from bug 1368415,
in particular the tree "tree_general_5000_50percent", a balanced tree
containing 5903 nodes, and 10 selectors per node, not 10000 as the original
bloom-basic test case has.

  1s = STYLO_THREADS=1
  1p = STYLO_THREADS=1 and FORCE_STYLO_THREAD_POOL=1

  CACHE?=no means DISABLE_STYLE_SHARING_CACHE=1

  THREADS CACHE?   BEST    // MEASURED

  1s      yes      19.62   // 20.08 19.62 19.71

  1p      yes      30.02   // 30.33 30.02 30.48
  2       yes      22.61   // 22.61 23.48 22.64
  3       yes      19.62   // 20.69 19.81 19.62
  4       yes      18.44   // 18.44 19.64 19.61
  6       yes      18.12   // 18.12 19.40 19.10
  8       yes      17.92   // 18.69 17.92 21.50


  1s      no       33.68   // 34.68 34.04 33.68

  1p      no       34.61   // 34.88 35.02 34.61
  2       no       25.91   // 25.91 26.19 26.66
  3       no       22.47   // 24.63 24.47 22.47
  4       no       20.40   // 21.76 20.40 21.38
  6       no       19.84   // 21.69 19.84 20.80
  8       no       19.48   // 20.65 19.48 19.90

Takeaways are:

* the cache is effective, to varying degrees, in all configurations

* with the cache disabled, 1s and 1p perform similarly, as we'd expect

* with the cache enabled, 1p performs much worse than 1s

* with the cache enabled, 2 performs significantly worse than 1s,
  which is quite troubling.

Given the relatively modest median-case hardware we expect, I think we
should at least try to ensure that the 2-thread case is significantly
faster than the 1s case.

bholley, can you reproduce any of this?  Is it worth chasing?
Flags: needinfo?(jseward) → needinfo?(bobbyholley)
I can reproduce with your modified testcase. Investigating.
It looks like a style sharing issue. I set dom to tree_general_5000_50percent, and bumped the selector count to 50000. Setting DUMP_STYLE_STATISTICS=1, I get the following:

Without FORCE_STYLO_THREAD_POOL:
[PERF] perf block start
[PERF],traversal,sequential
[PERF],elements_traversed,5906
[PERF],elements_styled,5906
[PERF],elements_matched,16
[PERF],styles_shared,5890
[PERF],selectors,50634
[PERF],revalidation_selectors,248
[PERF],dependency_selectors,364
[PERF],declarations,1998
[PERF],stylist_rebuilds,2
[PERF],traversal_time_ms,8.221144991694018
[PERF] perf block end


With FORCE_STYLO_THREAD_POOL:
[PERF] perf block start
[PERF],traversal,parallel
[PERF],elements_traversed,5906
[PERF],elements_styled,5906
[PERF],elements_matched,2955
[PERF],styles_shared,2951
[PERF],selectors,50634
[PERF],revalidation_selectors,248
[PERF],dependency_selectors,364
[PERF],declarations,1998
[PERF],stylist_rebuilds,2
[PERF],traversal_time_ms,239.97544002486393
[PERF] perf block end

I have some hunches, give me a few.
So this reproduces on style-sharing.html from [1] as well. The reason it appears on Julian's testcase and not on bloom-basic in [1] is that Julian removed the div/span distinction, which exists to sabotage the style sharing cache (and give a better measure of raw performance).

So the upshot is that the issue described here is about the parallel traversal hampering the style sharing cache, even when there's only one thread.

I'm going to morph this bug to look at that, since we already have so much discussion on it, and will file another bug for the issue I wanted Julian to look into.

[1] https://github.com/heycam/style-perf-tests
Blocks: 1369066
Filed bug 1369066 for the original issue. I'll investigate this one after breakfast.
Assignee: jseward → bobbyholley
No longer blocks: 1369066
Priority: P2 → P1
Summary: stylo: Investigate why parallelism win on bloom-basic.html is 3x and not 4x → stylo: Parallel traversal significantly weakens style sharing
Back from breakfast, paged this back in, and the answer is obvious: the tail call optimization is screwing us. I think we'll need to remove it.
Blocks: 1368302
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #16)
> [..] the tail call optimization is screwing us. I think we'll need to
> remove it.

Are you still of that opinion?  If so, how should we move this forward?
(In reply to Julian Seward [:jseward] from comment #17)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #16)
> > [..] the tail call optimization is screwing us. I think we'll need to
> > remove it.
> 
> Are you still of that opinion?

Yes - the problem there is that the tail-call optimization can cause us to process elements from level N+1 despite still having elements from level N in our queue. This messes up style sharing, which relies on locality of processing for siblings and same-level cousins.

Further investigation with Niko revealed an additional problem, which is that the rayon Queue is FIFO for the calling thread (and FILO for stealers). We need it to be FILO to maintain our depth-at-a-time ordering.

> If so, how should we move this forward?

Last week Niko and I came up with some patches to rayon+servo, but I got side-tracked with bug 1370107 before I finished measuring them. I'm going to do that next, and hopefully get it all landed shortly.

In the mean time on your end, please continue to poke at parallel performance in VTune (on bloom-basic and elsewhere - note that bug 1370107 should improve bloom-basic performance). Feel free to pass DISABLE_STYLE_SHARING_CACHE=1 if you want to eliminate style sharing from any comparison.
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #18)
> Further investigation with Niko revealed an additional problem, which is
> that the rayon Queue is FIFO for the calling thread (and FILO for stealers).
> We need it to be FILO to maintain our depth-at-a-time ordering.

Sorry, I got the acronyms backwards as usual. It's currently FILO, we need it to be FIFO.
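With the acronyms straightened out, here's a toy simulation (plain Rust, not rayon; the two-children-per-node tree is invented) of why the queue discipline matters: with a FIFO queue, every node at depth N is processed before any node at depth N+1, keeping siblings and same-level cousins adjacent for the sharing cache; with a LIFO stack, the traversal goes depth-first and interleaves levels.

```rust
use std::collections::VecDeque;

// Each "node" is represented by just its depth; every node up to depth 2
// spawns two children. Returns the order in which depths are processed.
fn traverse(fifo: bool) -> Vec<u32> {
    let mut queue: VecDeque<u32> = VecDeque::from([0]);
    let mut order = Vec::new();
    while let Some(depth) = if fifo { queue.pop_front() } else { queue.pop_back() } {
        order.push(depth);
        if depth < 2 {
            queue.push_back(depth + 1);
            queue.push_back(depth + 1);
        }
    }
    order
}

fn main() {
    // FIFO: each level completes before the next begins.
    assert_eq!(traverse(true), vec![0, 1, 1, 2, 2, 2, 2]);
    // LIFO: levels interleave, hurting same-level locality.
    assert_eq!(traverse(false), vec![0, 1, 2, 2, 1, 2, 2]);
    println!("ok");
}
```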
So, I finally made time to do the measurements. I used the iteration patch from bug 1367962 comment 2, with 1000 iterations, and 6 threads on my MBP. I sometimes repeated the measurements 3 or 5 times (full results listed where available), but the results are pretty stable.

The basic conclusion of these numbers is that:
* There is no significant performance impact of the changes to rayon on their own.
* Without style sharing, the traversal order changes are performance-neutral.
* With style sharing, the traversal order changes are a significant (~20%) performance win. Memory presumably improves as well given the increased sharing.

This means that we should go ahead with releasing a new rayon with the breadth-first option. Niko, can you do that?

Here are the measurements from Wikipedia:

Base:
Mean time: 21.29, Median time: 21.03, Best time: 20.49
Mean time: 21.59, Median time: 21.30, Best time: 20.79
Mean time: 21.12, Median time: 20.84, Best time: 20.34

Updated Rayon:
Mean time: 21.75, Median time: 21.49, Best time: 20.99

Updated Rayon, Breadth-first traversal, No Tail call:
Mean time: 17.32, Median time: 17.21, Best time: 15.69

Base, DISABLE_STYLE_SHARING_CACHE=1
Mean time: 23.52, Median time: 23.22, Best time: 22.69
Mean time: 23.91, Median time: 23.53, Best time: 22.92
Mean time: 23.66, Median time: 23.35, Best time: 22.70
Mean time: 23.83, Median time: 23.51, Best time: 22.90
Mean time: 23.54, Median time: 23.21, Best time: 22.57

Updated Rayon, DISABLE_STYLE_SHARING_CACHE=1
Mean time: 23.74, Median time: 23.41, Best time: 22.87
Mean time: 23.98, Median time: 23.69, Best time: 23.23
Mean time: 23.46, Median time: 23.32, Best time: 22.83
Mean time: 24.42, Median time: 23.63, Best time: 23.07
Mean time: 23.88, Median time: 23.58, Best time: 22.95

Updated Rayon, Breadth-first traversal, No Tail call, DISABLE_STYLE_SHARING_CACHE=1:
Mean time: 23.92, Median time: 23.63, Best time: 22.81
Mean time: 24.34, Median time: 24.14, Best time: 23.48
Mean time: 23.69, Median time: 23.50, Best time: 22.80
Flags: needinfo?(nmatsakis)
Some more numbers, all pointing towards this being a win:

=====Myspace====
Base:
Mean time: 3.65, Median time: 3.49, Best time: 3.17
Mean time: 3.65, Median time: 3.52, Best time: 3.20

New:
Mean time: 3.43, Median time: 3.36, Best time: 3.00
Mean time: 3.38, Median time: 3.33, Best time: 3.05

====Github diff====
Base: Mean time: 51.68, Median time: 50.81, Best time: 48.07
New: Mean time: 31.90, Median time: 31.86, Best time: 23.92
I'm going to upload my patches for this so we can get them reviewed in parallel to the rayon release + upgrade. They obviously won't compile until the rayon bump happens.
Created attachment 8875930 [details] [diff] [review]
Part 1 - Enable breadth-first traversal. v1

MozReview-Commit-ID: KJA2drcLTb5
Attachment #8875930 - Flags: review?(emilio+bugs)
Created attachment 8875931 [details] [diff] [review]
Part 2 - Don't do recursive tail calls if there's work in the queue. v1

MozReview-Commit-ID: 8JEdjqAIYmQ
Attachment #8875931 - Flags: review?(emilio+bugs)
Created attachment 8875932 [details] [diff] [review]
Part 3 - Eliminate an unnecessary heap allocation. v1

MozReview-Commit-ID: 4cleAH0JsKK
Attachment #8875932 - Flags: review?(emilio+bugs)
Comment on attachment 8875930 [details] [diff] [review]
Part 1 - Enable breadth-first traversal. v1

Review of attachment 8875930 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/style/gecko/global_style_data.rs
@@ +84,5 @@
> +                // queue to be always FIFO, rather than FIFO for stealers and
> +                // FILO for the owner (which is what rayon does by default). This
> +                // ensures that we process all the elements at a given depth before
> +                // proceeding to the next depth, which is important for style sharing.
> +                .breadth_first()

Hmm, so this is a per-pool config? I guess that's fine.
Attachment #8875930 - Flags: review?(emilio+bugs) → review+
Attachment #8875931 - Flags: review?(emilio+bugs) → review+
Comment on attachment 8875932 [details] [diff] [review]
Part 3 - Eliminate an unnecessary heap allocation. v1

Review of attachment 8875932 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/style/parallel.rs
@@ +58,1 @@
>  type NodeList<N> = SmallVec<[SendNode<N>; WORK_UNIT_MAX]>;

You should use ArrayVec now, I think :)
Attachment #8875932 - Flags: review?(emilio+bugs) → review+
(In reply to Emilio Cobos Álvarez [:emilio] from comment #26)
> Comment on attachment 8875930 [details] [diff] [review]
> Part 1 - Enable breadth-first traversal. v1
> 
> Review of attachment 8875930 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: servo/components/style/gecko/global_style_data.rs
> @@ +84,5 @@
> > +                // queue to be always FIFO, rather than FIFO for stealers and
> > +                // FILO for the owner (which is what rayon does by default). This
> > +                // ensures that we process all the elements at a given depth before
> > +                // proceeding to the next depth, which is important for style sharing.
> > +                .breadth_first()
> 
> Hmm, so this is a per-pool config? I guess that's fine.

Yeah, Niko said it was easier to use with the rayon architecture for some reason.

(In reply to Emilio Cobos Álvarez [:emilio] from comment #27)
> Comment on attachment 8875932 [details] [diff] [review]
> Part 3 - Eliminate an unnecessary heap allocation. v1
> 
> Review of attachment 8875932 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: servo/components/style/parallel.rs
> @@ +58,1 @@
> >  type NodeList<N> = SmallVec<[SendNode<N>; WORK_UNIT_MAX]>;
> 
> You should use ArrayVec now, I think :)

Ok, I agree that the current setup could use a bit of tweaking. I'll attach another patch.
Created attachment 8876263 [details] [diff] [review]
Part 4 - Use ArrayVec and tweak the SmallVec sizes. v1

MozReview-Commit-ID: 1tEZiPdp9WQ
Attachment #8876263 - Flags: review?(emilio+bugs)
Attachment #8876263 - Flags: review?(emilio+bugs) → review+
Rayon PR is here: https://github.com/nikomatsakis/rayon/pull/360

In that PR, I did raise one question concerning an alternative to the `spawn_tail` API that seems superior to me, and on which I would appreciate some feedback:

https://github.com/nikomatsakis/rayon/pull/360#issuecomment-307557564
Flags: needinfo?(nmatsakis)
(In reply to Niko Matsakis [:nmatsakis] from comment #30)
> Rayon PR is here: https://github.com/nikomatsakis/rayon/pull/360
> 
> In that PR, I did raise one question concerning an alternative to the
> `spawn_tail` API that seems superior to me, and on which I would appreciate
> some feedback:
> 
> https://github.com/nikomatsakis/rayon/pull/360#issuecomment-307557564

As mentioned in the PR, I'm all for a bare accessor in favor of spawn_tail.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=eca44be52ee94a5345cfa94e34a7e11d8a9365a0
https://github.com/servo/servo/pull/17334
I forgot that I needed to rework the patches a bit for the different API semantics that Niko eventually settled on.

Here's an updated green try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=24c20213823e79b3837bd16d37e93692e3b92252

I also reverified with a local build that the style sharing improves enormously with these patches (~2000 -> ~7000 styles shared on obama).
https://hg.mozilla.org/integration/autoland/rev/f6f65219155e
Status: NEW → RESOLVED
Last Resolved: a month ago
Resolution: --- → FIXED