Closed Bug 1369066 Opened 7 years ago Closed 7 years ago

stylo: Investigate why parallelism win on bloom-basic.html is 3x and not 4x

Categories

(Core :: CSS Parsing and Computation, defect, P2)

Tracking

RESOLVED WORKSFORME

People

(Reporter: bholley, Assigned: jseward)

References

(Depends on 1 open bug)

Details

+++ This bug was initially created as a clone of Bug #1365692 +++

This was the original thing I wanted to investigate in bug 1365692. You should be able to reproduce it by using the bloom-basic testcase from [1], or alternatively by passing DISABLE_STYLE_SHARING_CACHE=1.

[1] https://github.com/heycam/style-perf-tests
No longer depends on: 1365692
Some baseline measurements on a quiet machine:

    (four runs each; "best" is the fastest run, speedup is vs best 1p)

1s  223 199 203 202   best 199  1.02 x

1p  202 202 222 207   best 202  1.00 x
2   114 118 116 120   best 114  1.77 x
3    85  85  84  84   best  84  2.40 x
4    68  72  73  73   best  68  2.97 x
6    63  64  62  63   best  62  3.25 x  // HT, unreliable
8    60  59  60  59   best  59  3.42 x  // HT, unreliable


1p is with FORCE_STYLO_THREAD_POOL=1

General STR:
STYLO_ITERS=1 STYLO_THREADS=<N> DISPLAY=:2.0 \
  ./gcc-Og-nondebug-stylo/dist/bin/firefox-bin -P dev -no-remote \
    /home/sewardj/MOZ/STYLO/style-perf-tests/perf-reftest/bloom-basic.html

(e10s disabled in profile)

The 1s->1p transition doesn't seem to be much of a problem, which is
good.  I plan to use 1p rather than 1s as a baseline, since the 1s-vs-1p
difference adds complicating factors (different code paths).
Ach, the multi-iteration patch (bug 1367962) doesn't work for this
case, and I can't imagine how to get reliable profile numbers
without it.
Depends on: 1367962
For a single iteration, VTune can't get enough samples to show anything
that would clearly distinguish the 1p and 4p cases.

Running on Callgrind shows that there's a very high D1 miss rate in
<style::selector_map::SelectorMap<style::stylist::Rule>>::get_matching_rules,
or really in the inlined function may_match (matching.rs:182), here:

        if *hash == 0 {

apparently 100.146M misses from 100.161M references (not sure if I
believe that).  Is that expected?  What's the structure/locality at
this point?
Flags: needinfo?(bobbyholley)
(In reply to Julian Seward [:jseward] from comment #2)
> Ach, the multi-iteration patch (bug 1367962) doesn't work for this
> case, and I can't imagine how to get reliable profile numbers
> without it.

I'm going to write up a more reliable multi-iteration patch to test some other things this morning, and I can share that with you.

(In reply to Julian Seward [:jseward] from comment #3)
> For a single iteration, VTune can't get enough samples to show anything
> that would clearly distinguish the 1p and 4p cases.
> 
> Running on Callgrind shows that there's a very high D1 miss rate in
> <style::selector_map::SelectorMap<style::stylist::Rule>>::get_matching_rules,
> or really in the inlined function may_match (matching.rs:182), here:
> 
>         if *hash == 0 {
> 
> apparently 100.146M misses from 100.161M references (not sure if I
> believe that).  Is that expected?  What's the structure/locality at
> this point?

So, for each element we iterate over a very large array of Rule objects, one per selector. Assuming we have enough selectors (set in the testcase), this should be large enough to avoid sharing cache lines between elements (since each iteration through the array blows the cache).

So we have Rule objects [1], each of which has a Selector object [2] followed by two other words. Each Selector object has a SelectorInner [3], followed by a u32. Each SelectorInner has two words (for the ArcSlice in ComplexSelector) and then four u32 hashes. Those hashes are where we're getting the cache miss.

Counting up on a 64-bit machine, that gives us 16 + 8 (the u32, rounded up from 4) + 16 + 16 = 56. So that's almost a complete cache line, but not quite, whereas your measurements show it being a complete cache line (otherwise we'd have ~10% fewer misses than accesses).

I might be mis-counting though, so let me actually check size_of after breakfast.

Either way, I wouldn't think the number of cache misses should affect the win we see on parallelism (though it might explain why sequential stylo is slightly slower on this testcase than Chrome).

[1] http://searchfox.org/mozilla-central/rev/1a0d9545b9805f50a70de703a3c04fc0d22e3839/servo/components/style/stylist.rs#1341
[2] http://searchfox.org/mozilla-central/rev/1a0d9545b9805f50a70de703a3c04fc0d22e3839/servo/components/selectors/parser.rs#195
[3] http://searchfox.org/mozilla-central/rev/1a0d9545b9805f50a70de703a3c04fc0d22e3839/servo/components/selectors/parser.rs#156

Yeah, I think that makes sense. The general structure here is that we have an array of
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #4)
> Yeah, I think that makes sense. The general structure here is that we have
> an array of

(Ignore this last part).

So I just measured and the size of Rule is indeed 64 bytes, which explains exactly what you're seeing. I'm looking into why, which might be fruitful for boosting sequential performance on this testcase. However, we still have the question of why parallel performance isn't at least 4x.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #5)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #4)
> > Yeah, I think that makes sense. The general structure here is that we have
> > an array of
> 
> (Ignore this last part).
> 
> So I just measured and the size of Rule is indeed 64 bytes, which explains
> exactly what you're seeing.

Oh I see - ArcSlice is 3 words, not 2.

I think I can shave 2 words off the size of Rule. Will file a bug about that.
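
For the record, here's a quick size_of sanity check with stand-in types --
definitely not the real servo definitions, just the word counts I described
above -- that reproduces both tallies:

  #![allow(dead_code)] // illustration only; the fields are never read
  use std::mem::size_of;

  struct ArcSlice2(usize, usize);        // ArcSlice as I first counted it: 2 words
  struct ArcSlice3(usize, usize, usize); // as it actually is: 3 words

  struct AncestorHashes([u32; 4]);       // four precomputed u32 hashes

  struct SelectorInner<S> { slice: S, hashes: AncestorHashes }
  struct Selector<S> { inner: SelectorInner<S>, spec_and_flags: u32 }
  struct Rule<S> { selector: Selector<S>, word1: usize, word2: usize }

  fn main() {
      // 16 + 16 + 4 (padded to 8) + 16 = 56: almost, but not quite, a cache line.
      println!("Rule with 2-word slice: {}", size_of::<Rule<ArcSlice2>>());
      // 24 + 16 + 4 (padded to 8) + 16 = 64: exactly one 64-byte line, which
      // matches the misses-per-access you measured.
      println!("Rule with 3-word slice: {}", size_of::<Rule<ArcSlice3>>());
  }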
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #4)
> I'm going to write up a more reliable multi-iteration patch to test some
> other things this morning, and I can share that with you.

That would be excellent.  Once we have this iterating properly, we might be in
with a chance of figuring out what's going on.

> >         if *hash == 0 {
> > 
> > apparently 100.146M misses from 100.161M references (not sure if I
> 
> Either way, I wouldn't think the number of cache misses should affect the
> win we see on parallelism

Oh, I agree (re not explaining the parallelism lack-of-win).  I mentioned the
cache misses only because they stood out, and I wanted to check that they
correspond with your understanding, which it looks like they do.  They might
be pretty harmless in that even a simple hardware prefetcher should take care
of them -- so Callgrind is, in that sense, unrepresentative of real hardware.
(In reply to Julian Seward [:jseward] from comment #7)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #4)
> > I'm going to write up a more reliable multi-iteration patch to test some
> > other things this morning, and I can share that with you.
> 
> That would be excellent.  Once we have this iterating properly, we might be
> in
> with a chance of figuring out what's going on.

Ok. Might be a few hours, FWIW.

Note that in the mean time, you can try just bumping up the number of selectors (increasing the per-element work) to make the testcase take longer.

> 
> > >         if *hash == 0 {
> > > 
> > > apparently 100.146M misses from 100.161M references (not sure if I
> > 
> > Either way, I wouldn't think the number of cache misses should affect the
> > win we see on parallelism
> 
> Oh, I agree (re not explaining the parallelism lack-of-win).  I mentioned the
> cache misses only because they stood out, and I wanted to check that they
> correspond with your understanding, which it looks like they do.  They might
> be pretty harmless in that even a simple hardware prefetcher should take care
> of them -- so Callgrind is, in that sense, unrepresentative of real hardware.

So I just wrote the patch to remove one of the words, and there's a significant performance improvement (100000 selectors, 1000 elements, which is a configuration where sequential stylo is worse than Chrome). Removing the other word is harder, but I have a plan.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #8)
> (In reply to Julian Seward [:jseward] from comment #7)
> > (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #4)
> > > I'm going to write up a more reliable multi-iteration patch to test some
> > > other things this morning, and I can share that with you.
> > 
> > That would be excellent.  Once we have this iterating properly, we might be
> > in
> > with a chance of figuring out what's going on.
> 
> Ok. Might be a few hours, FWIW.

I didn't get around to this today, but I did take a moment to look into why your existing patch wasn't working for bloom-basic. It looks like your patch only works if there's a subtree restyle hint on the root, which there isn't in this case. You can force there to be one by applying the following patch to util.js:

diff --git a/perf-reftest/util.js b/perf-reftest/util.js
index 45f756c..7291a74 100644
--- a/perf-reftest/util.js
+++ b/perf-reftest/util.js
@@ -35,6 +35,10 @@ function build_rule(selector, selectorRepeat, declaration, ruleRepeat) {
 }
 
 function flush_style(element) {
+  var s = document.createElement('style');
+  s.innerHTML = '*|* { color: red }';
+  document.head.appendChild(s);
+  s.remove();
   getComputedStyle(element || document.documentElement).color;
 }
 


I verified that the iteration machinery works on bloom-basic with this patch. So you should be unblocked to continue profiling with VTune.
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #9)
> You can force there to be one by applying the following patch to util.js:

Thanks!

So, just for everybody's entertainment and annoyance :-), with 100 iterations
the slowdown isn't so marked -- 3.51 x with 4 threads, vs 2.97 x for the single
iteration case (per comment 1).  Does the machinery that measures the time
and produces the popup box have a sequential section outside the loop?

Anyway, no matter.  I continue to investigate.
(In reply to Julian Seward [:jseward] from comment #10)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #9)
> > You can force there to be one by applying the following patch to util.js:
> 
> Thanks!
> 
> So, just for everybody's entertainment and annoyance :-), with 100 iterations
> the slowdown isn't so marked -- 3.51 x with 4 threads, vs 2.97 x for the
> single
> iteration case (per comment 1).  Does the machinery that measures the time
> and produces the popup box have a sequential section outside the loop?

Not that I can think of offhand.
Though I guess one major difference is that the restyles will not allocate (or at least will not allocate as much), whereas the original styling will. I wouldn't think this would affect things though, because the profiles I looked at showed the lion's share of the time happening in selector matching.
I finally managed to get VTune to show me inter-core line-transfer counts,
using the events OFFCORE_REQUESTS.DEMAND_RFO and possibly L2_RQSTS.ALL_RFO.
I'm pretty hazy on what these mean and didn't find good documentation on them,
but testing on a toy program that causes a lot of such transfers seems to
confirm that they are the right events to measure.
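
(For reference, a sketch of the kind of toy program I mean -- two threads doing
read-modify-writes on the same cache line, so that ownership of the line has to
ping-pong between the cores; illustrative only, not the exact program I used:)

  use std::sync::atomic::{AtomicU64, Ordering};
  use std::thread;

  // One shared counter, hence one heavily contended cache line.
  static SHARED: AtomicU64 = AtomicU64::new(0);

  fn main() {
      let workers: Vec<_> = (0..2)
          .map(|_| thread::spawn(|| {
              for _ in 0..100_000_000u64 {
                  // Each fetch_add needs the line in Exclusive/Modified state,
                  // so this drives the RFO counters hard.
                  SHARED.fetch_add(1, Ordering::Relaxed);
              }
          }))
          .collect();
      for w in workers {
          w.join().unwrap();
      }
      println!("total = {}", SHARED.load(Ordering::Relaxed));
  }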

The next step is to figure out if they are actually significant, and what
their causes are.  I'm looking at them because I didn't see other obvious
microarchitectural bad interactions (that would explain the slowdown) in the
VTune profiles.
Related:
MESI protocol profiling toolery at https://bugs.kde.org/show_bug.cgi?id=380942
Despite considerable profiling, I don't have a clear explanation for the perf
lossage.  My least-worst (and only) theory is that it is due to L3 cache misses
associated with accesses of AncestorHashes objects in may_match, but the
evidence isn't clear-cut.

  fn may_match<E>(hashes: &AncestorHashes,
                  bf: &BloomFilter)
                  -> bool
      where E: Element,
  {
      // Check against the list of precomputed hashes.
      for hash in hashes.0.iter() {
          // If we hit the 0 sentinel hash, that means the rest are [..] 
          if *hash == 0 {  // <--------------- HERE
              break;
          }

          if !bf.might_contain_hash(*hash) {
              return false;
          }
      }

      true
  }

Profiling with Callgrind, for a single iteration, and only collecting
information for the styling pass, shows that the misses noted in comment #3 in
D1 also occur in the LL cache, at the point marked above.  The LL cache is
shared amongst all cores, so it's not completely implausible that adding
threads would increase cache contention.

The "bf.might_contain_hash(*hash)" bit contributes no cache misses at all.
This is consistent with there being few Bloom filters (one per thread) but
many AncestorHashes objects (one per DOM node?)

Fiddling around with Callgrind runs with different numbers of threads and
LL cache sizes of 1, 2, 4 and 8MB shows an LL miss "cliff" somewhere
between 1MB and 2MB.  For 4 threads that is as follows:

  1MB LL  1894M insns, 35.516M LLmisses
  2MB LL  1811M insns,  0.586M LLmisses
  4MB LL  1823M insns,  0.244M LLmisses
  8MB LL  1884M insns,  0.345M LLmisses

I don't know if that is significant or not.  I'd be marginally more convinced
if the cliff was between 4MB and 8MB, but it isn't.

A more serious problem is that VTune doesn't provide supporting evidence.
Using it to profile multiple iterations of styling, it shows a LL miss rate of
around 1 per 10000 reads, which is too low to give any significant performance
loss.  VTune also doesn't show anything else that seems like bad interaction
with the microarchitecture.

Additionally, using a build from today, with recently landed patches that
(AIUI) reduce space use, the lossage is much smaller.  For 100 iterations:

  1p  1.00 x
  2   1.96 x
  3   2.89 x
  4   3.80 x

As a side comment, the idiom of iterating over 4 32-bit words (the ancestor
hashes) to find the first zero, and having this be a hot path, looks like
something waiting for a clever bit of vector code to come along.  Couldn't we
do this with a mov[ua]ps, cmppd against zero, movmskb, clz?  Maybe that would
be slower.

Also -- unless I misunderstood the counting Bloom filter stuff -- we're only
using 24 bits out of each 32 bit hash value.  So at least in theory,
AncestorHashes could be reduced to 12 bytes.
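
Something along these lines, purely as a sketch of the packing (names invented,
not a proposal for the real code): keep the low 24 bits of each of three u32s
for three of the hashes, and stash the fourth hash's three bytes in the spare
top bytes.

  // Sketch only: four 24-bit hashes packed into three u32s (12 bytes).
  fn pack(h: [u32; 4]) -> [u32; 3] {
      debug_assert!(h.iter().all(|&x| x < (1 << 24)));
      [
          h[0] | ((h[3] & 0x0000ff) << 24),
          h[1] | ((h[3] & 0x00ff00) << 16),
          h[2] | ((h[3] & 0xff0000) << 8),
      ]
  }

  // Hash i for i in 0..4; the first three are cheap, the fourth needs
  // reassembling from the three spare bytes.
  fn nth_hash(p: &[u32; 3], i: usize) -> u32 {
      if i < 3 {
          p[i] & 0x00ffffff
      } else {
          (p[0] >> 24) | ((p[1] >> 24) << 8) | ((p[2] >> 24) << 16)
      }
  }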

I'm not sure where to go with this now, given that I can't show much of a
parallel slowdown any more.  Does this test case, with its huge Rule object
arrays, typify real workloads?
(In reply to Julian Seward [:jseward] from comment #15)
> Despite considerable profiling, I don't have a clear explanation for the perf
> lossage.  My least-worst (and only) theory is that it is due to L3 cache misses
> associated with accesses of AncestorHashes objects in may_match, but the
> evidence isn't clear-cut.
> 
>   fn may_match<E>(hashes: &AncestorHashes,
>                   bf: &BloomFilter)
>                   -> bool
>       where E: Element,
>   {
>       // Check against the list of precomputed hashes.
>       for hash in hashes.0.iter() {
>           // If we hit the 0 sentinel hash, that means the rest are [..] 
>           if *hash == 0 {  // <--------------- HERE
>               break;
>           }
> 
>           if !bf.might_contain_hash(*hash) {
>               return false;
>           }
>       }
> 
>       true
>   }
> 
> Profiling with Callgrind, for a single iteration, and only collecting
> information for the styling pass, shows that the misses noted in comment #3
> in
> D1 also occur in the LL cache, at the point marked above.  The LL cache is
> shared amongst all cores, so it's not completely implausible that adding
> threads would increase cache contention.

Even in the case that it's entirely read-only? I'd think that'd be the "shared" part of MESI.

> 
> The "bf.might_contain_hash(*hash)" bit contributes no cache misses at all.
> This is consistent with there being few Bloom filters (one per thread) but
> many AncestorHashes objects (one per DOM node?)

One per selector.

This testcase is almost completely cache-bound. For each element, we match against a giant array of many thousands of Rule objects. Each one has an inline set of hashes. The first hash will always be rejected by the bloom filter, and then we'll move on to the next Rule. This is why shrinking Rule speeds up the testcase - we can fit more in each cache line. However, there are still enough Rules to totally blow all the caches before moving on to the next element and repeating the process.

So it's cache-bound. That said, it's cache-bound on read-only data, so I'd think it should be equally cache-bound for either the single or multi-threaded case. But it's probably not worth spending any more time on; see below.

> 
> Fiddling around with Callgrind runs with different numbers of threads and
> LL cache sizes of 1, 2, 4 and 8MB shows an LL miss "cliff" somewhere
> between 1MB and 2MB.  For 4 threads that is as follows:
> 
>   1MB LL  1894M insns, 35.516M LLmisses
>   2MB LL  1811M insns,  0.586M LLmisses
>   4MB LL  1823M insns,  0.244M LLmisses
>   8MB LL  1884M insns,  0.345M LLmisses
> 
> I don't know if that is significant or not.  I'd be marginally more convinced
> if the cliff was between 4MB and 8MB, but it isn't.
> 
> A more serious problem is that VTune doesn't provide supporting evidence.
> Using it to profile multiple iterations of styling, it shows a LL miss rate
> of
> around 1 per 10000 reads, which is too low to give any significant
> performance
> loss.  VTune also doesn't show anything else that seems like bad interaction
> with the microarchitecture.
> 
> Additionally, using a build from today, with recently landed patches that
> (AIUI) reduce space use, the lossage is much smaller.  For 100 iterations:
> 
>   1p  1.00 x
>   2   1.96 x
>   3   2.89 x
>   4   3.80 x
> 
> As a side comment, the idiom of iterating over 4 32-bit words (the ancestor
> hashes) to find the first zero, and having this be a hot path, looks like
> something waiting for a clever bit of vector code to come along.  Couldn't we
> do this with a mov[ua]ps, cmppd against zero, movmskb, clz?  Maybe that would
> be slower.

The hot path isn't really finding the first zero. The hot path is checking each hash against the bloom filter until we get a rejection _or_ find the first zero. If we run out of hashes before we get a rejection, we'll have to take the slow path anyway.

I could imagine it being possible to somehow do some crazy vector stuff here, but I doubt it's worth it. And it wouldn't help on this particular (synthetic) testcase, where we always reject on the first hash.

> 
> Also -- unless I misunderstood the counting Bloom filter stuff -- we're only
> using 24 bits out of each 32 bit hash value.  So at least in theory,
> AncestorHashes could be reduced to 12 bytes.

This is a wonderful observation, which sparked bug 1371949, which made things even faster. Well done!

> 
> I'm not sure where to go with this now, given that I can't show much of a
> parallel slowdown any more.  Does this test case, with its huge Rule object
> arrays, typify real workloads?

Certainly not in any complete sense. I think we've gleaned enough useful observations from this testcase, and banging on it further is probably diminishing returns. I think it would be most productive for you to move on to analyzing parallel performance on more realistic testcases. Obama wikipedia (saved locally as Web Page (Complete)) and myspace-orig [1] are good places to start. Stylo does quite well on those in general, but they're testcases we understand pretty well, and thus good places to measure the parallelism doing its thing and see if there's anything we can improve.

When working with real sites, I'd suggest profiling with DISABLE_STYLE_SHARING_CACHE=1, so that the different style sharing characteristics of different numbers of worker threads don't confound your measurements.

[1] https://www.dropbox.com/s/h51fspacftf1lcq/myspace.tar.bz2?dl=0
I find this interesting in that it makes me think more carefully about the
relationship between the memory system and parallelism.

Imagine the limit case here, where we have infinite cores.  Even if the tree
traversal is arbitrarily parallelisable (no communication or synchronisation
between threads), we are going to be limited by how fast the DOM can be pulled
out of main memory up to the per-core private caches (the L2s).  So we're
limited by the bandwidth from main memory to the L3, by capacity conflicts in
the L3, and by bandwidth from the L3 to the L2s.

So, in the limit, getting exactly N speedup from N cores requires that the DOM is
already cached completely in the core-private caches (L1s and L2s), and that the
right bits are cached for each core.  Neither of those is ever really going to be
the case.

Maybe this is stating the obvious.  Anyway.  Apart from the improved
understanding (for me, at least) I am pleased that this investigation produced
some useful speedup (bug 1371949 comment 1).  As suggested I will now continue
by analysing the Obama and MySpace test cases.
Hm, that makes sense - though note that it's really the Rule list rather than the DOM we're cache missing on. Each thread is going to be grinding over that same list - would we be seeing L3 cache thrashing as a result? What is the L3 cache line size?

Another question - is it more expensive to fetch a cache line as shared vs exclusive?

All this does seem to jibe with the evidence that shrinking Rule made more of a relative difference on parallel than on sequential, which implies that cache misses hurt us more in the parallel case.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #18)
> Hm, that makes sense - though note that it's really the Rule list rather
> than the DOM we're cache missing on. Each thread is going to be grinding
> over that same list

Interesting.  Can you guesstimate how many bytes the list would occupy?

If it's less than 8MB then the whole list will fit in the L3 and so only needs
to be pulled in once.  But if it's larger than 256KB then it won't fit in the
private per-core caches (L2 and L1) and so the more cores we have running, the
higher the bandwidth demand between L2s and L3.  It would be kinda interesting
to know if that is a plausible explanation.

> What is the L3 cache line size?

On this machine (Skylake, and I suspect many earlier Intels) 64 bytes,
same as the L2 and L1D line sizes.

> Another question - is it more expensive to fetch a cache line as shared vs
> exclusive?

I don't know, but I would guess not.

> which implies that cache misses hurt us more in the parallel case.

Yes .. I'd phrase it somewhat differently though.  I'd say that increasing
parallelism increases the bandwidth and capacity demands on those parts of the
memory system that are not duplicated per core -- the L3 and main memory --
and so can increase the time it takes to service an L2 miss.

For example, a miss in a core-private cache (L2) might take longer to service
in the parallel case, because either the data is less likely to be in the L3
due to it having to also hold data for the other (active) cores, or because
the L2s<->L3 or L3<->memory connections are busier, for the same reason.
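
(If it's worth anything, the kind of experiment that would show this: have N
threads each repeatedly scan the same read-only buffer and watch the per-pass
time as N grows, once with a buffer that fits in L2 and once with one that only
fits in L3.  A throwaway sketch, with made-up sizes:)

  use std::sync::Arc;
  use std::thread;
  use std::time::Instant;

  fn scan(buf: &[u64]) -> u64 {
      buf.iter().copied().sum()
  }

  fn main() {
      // 8MB of u64s: too big for a 256KB L2, should still fit in an 8MB L3.
      // Shrink to e.g. 128KB to see the L2-resident behaviour instead.
      let words = 8 * 1024 * 1024 / 8;
      let buf: Arc<Vec<u64>> = Arc::new((0..words as u64).collect());

      for nthreads in [1usize, 2, 3, 4] {
          let start = Instant::now();
          let handles: Vec<_> = (0..nthreads)
              .map(|_| {
                  let buf = Arc::clone(&buf);
                  thread::spawn(move || {
                      let mut acc = 0u64;
                      for _ in 0..50 {
                          acc = acc.wrapping_add(scan(&buf));
                      }
                      acc
                  })
              })
              .collect();
          for h in handles {
              h.join().unwrap();
          }
          println!("{} threads: {:?} for 50 passes each", nthreads, start.elapsed());
      }
  }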
(In reply to Julian Seward [:jseward] from comment #19)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #18)
> > Hm, that makes sense - though note that it's really the Rule list rather
> > than the DOM we're cache missing on. Each thread is going to be grinding
> > over that same list
> 
> Interesting.  Can you guesstimate how many bytes the list would occupy?

40 bytes before bug 1371949, 32 bytes after. This gets multiplied by the number of selectors we put in the testcase, which for the in-tree bloom-basic.html is 20,000. So we're looking at 800K with what you were measuring, and 640K now.

> If it's less than 8MB then the whole list will fit in the L3 and so only
> needs
> to be pulled in once.

Sounds like that's the case.

>  But if it's larger than 256KB then it won't fit in the
> private per-core caches (L2 and L1) and so the more cores we have running,
> the
> higher the bandwidth demand between L2s and L3.  It would be kinda
> interesting
> to know if that is a plausible explanation.

It might be. The problem with verifying this is that if I lower the number of selectors to, say, 2000, which should fit entirely in L2, the testcase ends up less bound by bloom filter rejections, and other noise from the traversal starts to creep in.

> 
> > What is the L3 cache line size?
> 
> On this machine (Skylake, and I suspect many earlier Intels) 64 bytes,
> same as the L2 and L1D line sizes.
> 
> > Another question - is it more expensive to fetch a cache line as shared vs
> > exclusive?
> 
> I don't know, but I would guess not.
> 
> > which implies that cache misses hurt us more in the parallel case.
> 
> Yes .. I'd phrase it somewhat differently though.  I'd say that increasing
> parallelism increases the bandwidth and capacity demands on those parts of
> the
> memory system that are not duplicated per core -- the L3 and main memory --
> and so can increase the time it takes to service an L2 miss.
> 
> For example, a miss in a core-private cache (L2) might take longer to service
> in the parallel case, because either the data is less likely to be in the L3
> due to it having to also hold data for the other (active) cores, or because
> the L2s<->L3 or L3<->memory connections are busier, for the same reason.

I see - that makes lots of sense. Anyway, I think we're good here.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME