Component: General → CSS Parsing and Computation
Ugh, so this is all time spent inserting into the rule tree, because of the shape it has... The rule tree is kinda needed because of transitions and animations, and because of style reparenting of ::first-line... So this is a very annoying case. Bug 1483963 is a very similar thing, for transitions instead... I don't really know how to remove the rule tree without re-running selector matching for transitions and animations, or regressing our behavior to something closer to what Blink or WebKit do... Maybe that's worth a shot, but it's surely a lot of work.

Does it sound reasonable to get rid of the rule tree? We get bitten by it very frequently, and a parallel hashmap sounds like a nightmare.

Tradeoffs are:

* Need to re-run selector matching for style attribute changes and animations. _But_ we can avoid cloning the rule tree over and over, which causes at least part of bug 1478953.
* Will regress some animation edge cases which we already get wrong for CSS rules; for example, transitioning and changing a style attribute at the same time might not trigger the transition (bug 1393323). This will regress our behavior, but it will still be better than the other browsers IMO.
* Will make the inspector APIs actually run selector matching, but I think we don't care much about that, and GetCSSStyleRules is already pretty expensive.
* Will make first-line reparenting run selector matching, but again I don't think we care much about it because it's already quite broken.

Alternatives are:

* Do nothing. This particular test case is very artificial, but this certainly comes up; see the bug mentioned above.
* A lock-free hashmap for rule tree insertion with many kids (let's not, please).
* Make rule tree insertion not lock-free. I think this may actually be easier than getting rid of the rule tree entirely. We'd basically have a lock (Mutex / RwLock?) for the children, in a Vec<> or something, or maybe a Vec or a hashmap depending on size.

Opinions?
I can't promise I'll get to any of these short-term, but given how often this comes up, it might be worth it...
Bug 1493507 is another example where the rule tree is just really slow.
If we can avoid writing a lock free HashMap implementation that would be good. Can we switch to a RwLock-protected HashMap when the child list gets too big, but keep the lock free linked list that we currently use for small child lists?
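A minimal sketch of what that hybrid child list could look like. All names and the threshold are hypothetical, and a plain `Vec` stands in for the current lock-free linked list; the real thing would also need to be wrapped in a lock (or keep the atomics) for concurrent insertion:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real child key and rule node reference.
type ChildKey = u64;
type Node = String;

// Assumed cutoff, not a measured value.
const UPGRADE_THRESHOLD: usize = 8;

// Small child lists stay a linear-scan list (standing in for the
// lock-free linked list); big ones get upgraded to a HashMap.
enum Children {
    Small(Vec<(ChildKey, Node)>),
    Big(HashMap<ChildKey, Node>),
}

impl Children {
    fn insert(&mut self, key: ChildKey, node: Node) {
        match self {
            Children::Small(vec) => {
                if vec.len() < UPGRADE_THRESHOLD {
                    vec.push((key, node));
                } else {
                    // Upgrade: migrate the existing children into a map,
                    // then insert the new one.
                    let mut map: HashMap<_, _> = vec.drain(..).collect();
                    map.insert(key, node);
                    *self = Children::Big(map);
                }
            }
            Children::Big(map) => {
                map.insert(key, node);
            }
        }
    }

    fn get(&self, key: &ChildKey) -> Option<&Node> {
        match self {
            Children::Small(vec) => {
                vec.iter().find(|(k, _)| k == key).map(|(_, n)| n)
            }
            Children::Big(map) => map.get(key),
        }
    }
}
```

Lookups stay O(n) only while n is below the threshold, so the common case (few children) keeps its small footprint while degenerate cases get hashmap behavior.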
As you know, I'm not too keen on regressing our behavior for transitions, though it sounds like you're saying we already get those cases wrong? The cases I'm particularly concerned about are the ones here: https://bug1192592.bmoattachments.org/attachment.cgi?id=8843824 which we currently get right. I need to write WPTs for those to encourage other vendors to match the spec here. I'd prefer we don't regress those cases.
I don't think I have more to say here... I guess heycam's idea may be worth a shot.
(In reply to Cameron McCormack (:heycam) from comment #3)
> If we can avoid writing a lock free HashMap implementation that would be
> good. Can we switch to a RwLock-protected HashMap when the child list gets
> too big, but keep the lock free linked list that we currently use for small
> child lists?

I'm trying the obvious thing first and switching to RwLock<HashMap> unconditionally; I'll get numbers and see if we really need the linked list at all.
This doesn't seem to regress anything significantly; if anything, it improves tabpaint performance somehow: https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=863e03c8ab61668af22fae57c0c3bf9d7a46ee15
Assignee: nobody → emilio
I need to profile this a bit more, but talos was pretty happy with it, and it solves the known performance issues here, such as the test case from bug 1483963. This also gets rid of a bunch of unsafe code, which is nice. This still keeps the same GC scheme, removing the key from the hashmap when needed. I kept those checks as release assertions, but they should probably be turned into debug-only assertions.
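A toy sketch of the patched shape, to make the GC point concrete. All names are hypothetical stand-ins for the real rule tree types; this ignores refcounting and only shows the locking discipline and the key removal the assertions guard:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical stand-ins for the real child key (derived from the rule
// source) and strong rule-node reference.
type ChildKey = u64;
type Node = String;

// Every rule node guards its children with a RwLock<HashMap> instead of
// a lock-free linked list.
struct RuleNodeChildren {
    map: RwLock<HashMap<ChildKey, Node>>,
}

impl RuleNodeChildren {
    fn new() -> Self {
        RuleNodeChildren { map: RwLock::new(HashMap::new()) }
    }

    // The common path takes only the read lock; on a miss we take the
    // write lock and re-check via the entry API, since another thread
    // may have inserted the same child in between.
    fn ensure_child(&self, key: ChildKey, make: impl FnOnce() -> Node) -> Node {
        if let Some(n) = self.map.read().unwrap().get(&key) {
            return n.clone();
        }
        self.map.write().unwrap().entry(key).or_insert_with(make).clone()
    }

    // GC: when a node dies, remove its key from its parent's map. The
    // assertion mirrors the release assertions mentioned above.
    fn remove_child(&self, key: &ChildKey) {
        let removed = self.map.write().unwrap().remove(key);
        assert!(removed.is_some(), "GC'd a node missing from its parent");
    }
}
```

Note that with this shape every GC'd node costs a write lock plus a hashmap removal on its parent, which is the per-node GC overhead showing up in the benchmarks below.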
Yes, would be great to see a perf-reftest that exercises contention for the hashmap. But I can imagine that any regression from locking would be made up for by the gains of not having to iterate the linked list.
This is a benchmark for the best case of the single linked list, over 30k rules. The new code is of course slower here, but this benchmark is orders of magnitude heavier than the other tests that this code makes faster.
(In reply to Cameron McCormack (:heycam) from comment #9)
> Yes, would be great to see a perf-reftest that exercises contention for the
> hashmap. But I can imagine that any regression from locking would be made
> up for by the gains of not having to iterate the linked list.

I wrote some benchmarks, in addition to the ones that I had already written.

Unpatched:

$ ./mach test-unit --bench -p style rule_tree
    Finished release [optimized] target(s) in 0.58s
     Running target/release/deps/style_tests-a46fb4e60f238ef7

running 7 tests
test stylist::test_stylist_rule_tree_accessors ... ignored
test rule_tree::bench::bench_expensive_insertion ... bench: 47,394,689 ns/iter (+/- 5,973,365)
test rule_tree::bench::bench_expensive_insertion_parallel ... bench: 15,846,818 ns/iter (+/- 934,040)
test rule_tree::bench::bench_expensive_single_insertion ... bench: 4,028,905 ns/iter (+/- 251,460)
test rule_tree::bench::bench_insertion_basic ... bench: 484,423 ns/iter (+/- 37,998)
test rule_tree::bench::bench_insertion_basic_parallel ... bench: 924,268 ns/iter (+/- 53,865)
test rule_tree::bench::bench_insertion_basic_per_element ... bench: 313 ns/iter (+/- 32)

test result: ok. 0 passed; 0 failed; 1 ignored; 6 measured; 146 filtered out

Patched:

$ ./mach test-unit --bench -p style rule_tree
   Compiling style v0.0.1 (file:///home/emilio/src/moz/servo/components/style)
   Compiling style_tests v0.0.1 (file:///home/emilio/src/moz/servo/tests/unit/style)
    Finished release [optimized] target(s) in 1m 22s
     Running target/release/deps/style_tests-a46fb4e60f238ef7

running 7 tests
test stylist::test_stylist_rule_tree_accessors ... ignored
test rule_tree::bench::bench_expensive_insertion ... bench: 2,429,111 ns/iter (+/- 861,055)
test rule_tree::bench::bench_expensive_insertion_parallel ... bench: 9,964,591 ns/iter (+/- 5,418,138)
test rule_tree::bench::bench_expensive_single_insertion ... bench: 22,649,583 ns/iter (+/- 10,432,404)
test rule_tree::bench::bench_insertion_basic ... bench: 674,705 ns/iter (+/- 57,066)
test rule_tree::bench::bench_insertion_basic_parallel ... bench: 1,292,533 ns/iter (+/- 775,756)
test rule_tree::bench::bench_insertion_basic_per_element ... bench: 569 ns/iter (+/- 93)

So there's definitely some locking and hashmap overhead (I poked a bit with perf). Note that this is also measuring the GC, which does a hashmap lookup per node GC'd, and that's all the nodes in these benchmarks (that's 20% of each benchmark here). We could try to be smarter about GCing, by only putting parents on the child list, for example.

Also note that the *_parallel tests are expected to be slower, since we're locking, but in practice the rule tree is not really as contended as these micro-benchmarks make it (most of the time on the style threads is spent doing the cascade / selector matching).

So overall I think this is a fair trade-off, but I'm happy to listen to other opinions.
I emailed emilio and cameron about this last night because I wasn't logged into bugzilla on my phone. Moving the discussion back here.

Bobby wrote:
> Thoughts on the hybrid approach? If it's not too complex to add, it seems like the best of both worlds.

Cameron wrote:
> I'm wondering how important rule tree performance is for non-degenerate cases.
>
> For example, when I profile page load of our old favorite Obama Wikipedia page in single threaded mode,
> I get 12 ms under rule_tree functions and 295.4 ms under ServoStyleSet functions, so around 4%.
> Just guessing since I don't think they've been uploaded, but from the comment 11 benchmark results maybe
> we have a 50-80% regression in insertion performance? I can't really tell from the profile how much of
> the page load rule_tree function work is insertion versus lookup. Probably worth actually profiling a
> page load with the patch applied.
>
> If it's a 15ms regression in total styling time, is that an acceptable trade-off for losing the complexity
> of the lock-free rule tree?
>
> We should also measure memory usage with the unconditional hash maps.
>
> In terms of complexity of the hybrid approach, I feel like it shouldn't be much harder than counting how
> many children we iterate through in the linked list, and if we fail to find the rule node we want, and it's
> greater than the threshold, then we do the CAS with a pointer to a boxed RwLock<HashMap> (with appropriate
> pointer tagging to distinguish it from a regular rule node child), with the HashMap already locked, and if
> we win that CAS we can grab all the rule nodes in the list and insert them into the HashMap, and release the
> lock. But Emilio would have a better idea.

Emilio wrote:
> I think I'd prefer to avoid the complexity unless we can prove the
> difference is measurable in the real world.
>
> The micro-benchmarks we regress are the ones that contend for the rule
> tree very heavily, while that's not something that I expect to see in
> pages, where most of the time is actually going to be spent either
> selector-matching or doing the cascade.
>
> On 9/27/18 8:35 AM, Cameron McCormack wrote:
>> We should also measure memory usage with the unconditional hash maps.
>
> That's true, my latest patch should report that memory correctly.
>
> (quoting Cameron's last paragraph about the implementation):
>
> Well, you also need to teach the children iterator code (if you don't
> want to rely on assumptions that may stop holding) about it, which means
> that you at least need to have that logic in two places, I think...
>
> Also, this would make performance not quite as predictable. If you make
> a restyle where you need to upgrade a gazillion rule nodes it'd end up
> being super-slow.
>
> As I said I think it's not impossible, but I'm not quite sure it's worth
> it, and I think I prefer to avoid it if possible.
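The upgrade idea above hinges on low-bit pointer tagging to distinguish a regular rule-node child from the boxed map in the same atomic slot. A toy sketch of the tagging and the one-winner CAS, with all names hypothetical; this omits the actual list migration, the pre-locked HashMap, and uses a `Vec<u32>` as a placeholder payload:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// The low bit is free because heap allocations are aligned to more
// than one byte: tag clear = regular child pointer, tag set = pointer
// to the boxed map.
const MAP_TAG: usize = 0b1;

fn tag_map<T>(p: *mut T) -> usize {
    debug_assert_eq!(p as usize & MAP_TAG, 0, "pointer must be aligned");
    p as usize | MAP_TAG
}

fn is_map(word: usize) -> bool {
    word & MAP_TAG != 0
}

fn untag<T>(word: usize) -> *mut T {
    (word & !MAP_TAG) as *mut T
}

// Exactly one thread wins the CAS from the old list head to the tagged
// map pointer; the winner migrates the existing children into the map,
// losers retry the lookup and see the map.
fn try_upgrade(slot: &AtomicUsize, list_head: usize, map_ptr: *mut Vec<u32>) -> bool {
    slot.compare_exchange(
        list_head,
        tag_map(map_ptr),
        Ordering::AcqRel,
        Ordering::Acquire,
    )
    .is_ok()
}
```

This is also where the "logic in two places" cost Emilio mentions shows up: every reader of the child slot now has to check `is_map` before interpreting the pointer.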
(In reply to Bobby Holley (:bholley) from comment #12)
> > If it's a 15ms regression in total styling time, is that an acceptable trade-off for losing the complexity
> > of the lock-free rule tree?

I don't think it would be, even for a fraction of that. Yes, the rule tree is complex, but we've invested a lot in getting it right, and it's quite well-proven and well-optimized at this point. I don't feel a lot of urgency to simplify it if that comes at a measurable performance cost across style-heavy pages.

> Emilio wrote:
> > I think I'd prefer to avoid the complexity unless we can prove the
> > difference is measurable in the real world.

It would be great to measure time spent in rule tree code with and without your patch on some of our usual suspects (obama, html5, amazon, github, facebook, etc.), both in parallel and with STYLO_THREADS=1.

> > The micro-benchmarks we regress are the ones that contend for the rule
> > tree very heavily, while that's not something that I expect to see in
> > pages, where most of the time is actually going to be spent either
> > selector-matching or doing the cascade.

It looks like bench_expensive_single_insertion is the one that suffers the most, and that one is uncontended, right?

> > Well, you also need to teach the children iterator code (if you don't
> > want to rely on assumptions that may stop holding) about it, which means
> > that you at least need to have that logic in two places, I think...
> >
> > Also, this would make performance not quite as predictable. If you make
> > a restyle where you need to upgrade a gazillion rule nodes it'd end up
> > being super-slow.

Those both seem like relatively small costs. Basically, the rule tree is already complex, and this adds some additional complexity, but not a game-changing amount. So unless we can make a strong case that the microbenchmark results don't translate into anything more than ~1ms on real-world style traversals, I'm inclined to try the hybrid approach.
(In reply to Bobby Holley (:bholley) from comment #13)
> It looks like bench_expensive_single_insertion is the one that suffers the
> most, and that one is uncontended, right?

Yes, though the magnitude of that test is way higher than the others'. In particular, I had to do 30k rule insertions in a row to get it to be so slow.
(In reply to Emilio Cobos Álvarez (:emilio) from comment #14)
> Yes, though the magnitude of the test is way higher than the others. In
> particular, I had to do 30k rule insertions in a row to get it to be so slow.

Is the curve non-linear? The way I read that is that simple, vanilla insertions get 5x slower. The magnitude of the test isn't really relevant here - what's relevant is the measurable amount of time spent in the rule tree on real sites, and the fact that we don't want that to go up 5x (or even 1.5x).
Attachment #9011854 - Attachment description: Bug 1493420 - Use a RwLock'd HashMap instead of a lock-free linked list for rule node children. → Bug 1493420 - Use a RwLock'd HashMap instead of a lock-free linked list for rule node children. r=heycam
Pushed by email@example.com: https://hg.mozilla.org/integration/autoland/rev/2428766459c6 Use a RwLock'd HashMap instead of a lock-free linked list for rule node children. r=heycam
Pushed by firstname.lastname@example.org: https://hg.mozilla.org/integration/autoland/rev/4dc9bcee48b2 followup: Bump max allowed stack size of RuleNode. irc-r=heycam