Closed Bug 1357304 Opened 8 years ago Closed 8 years ago

stylo: Store bloom filter hashes inline in the selector

Categories

(Core :: CSS Parsing and Computation, defect, P1)

defect

Tracking

()

RESOLVED FIXED

People

(Reporter: bholley, Assigned: bholley)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

I have patches for this. This improves stylo's sequential+uncached performance on [1] and [2] by about 4x, making stylo clock in at about 25% faster than Gecko when all is said and done. Caching and parallelism will give additional wins here, but for the next few days I'm just focusing on raw selector matching performance. Emilio and I worked out some nice plans on Saturday, so there's more to come in that area. [1] http://searchfox.org/mozilla-central/source/testing/talos/talos/tests/perf-reftest/bloom-basic.html [2] http://searchfox.org/mozilla-central/source/testing/talos/talos/tests/perf-reftest/bloom-basic-ref.html
MozReview-Commit-ID: Lm25v2K21v5
Attachment #8859024 - Flags: review?(emilio+bugs)
MozReview-Commit-ID: DxG6USsPIkh
Attachment #8859025 - Flags: review?(emilio+bugs)
MozReview-Commit-ID: F07JkdduLaI
Attachment #8859026 - Flags: review?(emilio+bugs)
Attachment #8859024 - Flags: review?(emilio+bugs) → review+
Comment on attachment 8859025 [details] [diff] [review] Part 2 - Introduce SelectorInner and use it for top-level matching. v1 Review of attachment 8859025 [details] [diff] [review]: ----------------------------------------------------------------- ::: servo/components/selectors/parser.rs @@ +121,5 @@ > } > } > > #[derive(PartialEq, Eq, Hash, Clone)] > +pub struct SelectorInner<Impl: SelectorImpl> { docs, please state the intention of why this type should exist, and all that :)
Attachment #8859025 - Flags: review?(emilio+bugs) → review+
Comment on attachment 8859026 [details] [diff] [review] Part 3 - Store bloom filter hashes inline. v1 Review of attachment 8859026 [details] [diff] [review]: ----------------------------------------------------------------- Just curious, have you made any memory usage measurements? ::: servo/components/selectors/parser.rs @@ +135,5 @@ > + let mut hashes = [0; NUM_ANCESTOR_HASHES]; > + { > + // Compute ancestor hashes for the bloom filter. We precompute these > + // and store them inline to optimize cache performance during selector > + // matching. This matters a lot. Put this in the member definition as a doc comment? @@ +267,5 @@ > + > +impl<'a, Impl: SelectorImpl> Iterator for AncestorIterator<'a, Impl> { > + type Item = &'a Arc<ComplexSelector<Impl>>; > + fn next(&mut self) -> Option<Self::Item> { > + while self.curr.is_some() { use |while let Some(selector) = self.cur.take()| instead.
Attachment #8859026 - Flags: review?(emilio+bugs) → review+
(In reply to Emilio Cobos Álvarez [:emilio] from comment #6) > Comment on attachment 8859026 [details] [diff] [review] > Part 3 - Store bloom filter hashes inline. v1 > > Review of attachment 8859026 [details] [diff] [review]: > ----------------------------------------------------------------- > > Just curious, have you made any memory usage measurements? I have not. I think this is probably worth it in any case.
So the key here is to not have to pointer-chase down the selector chain to find the relevant hashes, plus not having to pointer-chase into the atoms and whatnot, right?
Correct - we don't have to examine the combinators at all. It's worth noting that there wasn't all _that_ much pointer-chasing in the path we were hitting in this testcase. The selector in question was "span > div", which means that the steps we were doing before was: * Deference the pointer on the Selector to get the ComplexSelector corresponding to |div| and everything to the left. * Match on the combinator, and determine that the next selector corresponded to an ancestor. * Dereference the |next| pointer to get the ComplexSelector corresponding to |span|. * Dereference the |compound_selector| Vec<SimpleSelector> on the ComplexSelector to get the first SimpleSelector. * Match on the SimpleSelector to determine it to be a LocalName. * Dereference both the name and the lower_name. * Check those two hashes against the bloom filter. So ignoring the matches, we eliminated 5 dereferences, though the last two were always the same atom across all the selectors and thus likely in L1. So let's call it 3, which amounted to the difference between 1200ms and 270ms. Cache locality matters.
Right. So on that testcase, we end up doing about 50e6 bloom filter tests, I think. 930ms/50e6 is about 19ns if I'm counting my 0s right. I can totally believe that 5 derefs could take that long, unless they're all in L1... We should remeasure bug 1348935 once this lands.
Blocks: 1348935
(In reply to Boris Zbarsky [:bz] (still a bit busy) (if a patch has no decent message, automatic r-) from comment #11) > Right. So on that testcase, we end up doing about 50e6 bloom filter tests, > I think. 930ms/50e6 is about 19ns if I'm counting my 0s right. I can > totally believe that 5 derefs could take that long, unless they're all in > L1... > > We should remeasure bug 1348935 once this lands. I'm planning on doing more refactoring to improve cache locality in selector matching in general, so I'll remeasure that testcase once I've optimized the heck out of all my microbenchmarks. :-)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: