Closed Bug 1367635 Opened 7 years ago Closed 7 years ago

stylo: share reset structs during the style traversal

Categories

(Core :: CSS Parsing and Computation, defect, P4)

Tracking

RESOLVED FIXED
Tracking Status
firefox57 --- fixed

People

(Reporter: n.nethercote, Assigned: heycam)

References

(Blocks 2 open bugs)

Details

Attachments

(13 files, 4 obsolete files)

28.21 KB, patch
Details | Diff | Splinter Review
179.74 KB, patch
Details | Diff | Splinter Review
59 bytes, text/x-review-board-request (11 review requests: 1 with heycam: review+, 10 with emilio: review+)
Details
I did a DMD run on the Obama Wikipedia page, which included these measurements:

> Unreported {
>   7,836 blocks in heap block record 2 of 3,899
>   2,507,520 bytes (2,507,520 requested / 0 slop)
>   Individual block sizes: 320 x 7,836
>   3.18% of the heap (19.67% cumulative)
>   7.02% of unreported (43.38% cumulative)
>   Allocated at {
>     #01: replace_malloc (/home/njn/moz/mi2/memory/replace/dmd/DMD.cpp:1278 (discriminator 2))
>     #02: alloc::heap::exchange_malloc (/checkout/src/liballoc/heap.rs:138)
>     #03: alloc::boxed::{{impl}}::new<style::stylearc::ArcInner<style::gecko_properties::ComputedValues>> (/checkout/src/liballoc/boxed.rs:238)
>     #04: style::matching::PrivateMatchMethods::cascade_with_rules<style::gecko::wrapper::GeckoElement> (/home/njn/moz/mi2/servo/components/style/matching.rs:225)
>     #05: style::matching::PrivateMatchMethods::cascade_internal<style::gecko::wrapper::GeckoElement> (/home/njn/moz/mi2/servo/components/style/matching.rs:274)
>     #06: style::matching::PrivateMatchMethods::cascade_primary<style::gecko::wrapper::GeckoElement> (/home/njn/moz/mi2/servo/components/style/matching.rs:294)
>     #07: style::matching::MatchMethods::match_and_cascade<style::gecko::wrapper::GeckoElement> (/home/njn/moz/mi2/servo/components/style/matching.rs:621)
>     #08: style::traversal::compute_style<style::gecko::wrapper::GeckoElement,style::gecko::traversal::RecalcStyleOnly> (/home/njn/moz/mi2/servo/components/style/traversal.rs:766)
>     #09: style::traversal::recalc_style_at<style::gecko::wrapper::GeckoElement,style::gecko::traversal::RecalcStyleOnly> (/home/njn/moz/mi2/servo/components/style/traversal.rs:629)
>     #10: style::gecko::traversal::{{impl}}::process_preorder (/home/njn/moz/mi2/servo/components/style/gecko/traversal.rs:47)
>   }
> }
>
> Unreported {
>   4,402 blocks in heap block record 3 of 3,899
>   1,408,640 bytes (1,408,640 requested / 0 slop)
>   Individual block sizes: 320 x 4,402
>   1.79% of the heap (21.46% cumulative)
>   3.94% of unreported (47.32% cumulative)
>   Allocated at {
>     #01: replace_malloc (/home/njn/moz/mi2/memory/replace/dmd/DMD.cpp:1278 (discriminator 2))
>     #02: alloc::heap::exchange_malloc (/checkout/src/liballoc/heap.rs:138)
>     #03: alloc::boxed::{{impl}}::new<style::stylearc::ArcInner<style::gecko_properties::ComputedValues>> (/checkout/src/liballoc/boxed.rs:238)
>     #04: geckoservo::glue::Servo_ComputedValues_Inherit (/home/njn/moz/mi2/servo/ports/geckolib/glue.rs:1256)
>     #05: mozilla::ServoStyleSet::ResolveStyleForText(nsIContent*, nsStyleContext*) (/home/njn/moz/mi2/layout/style/ServoStyleSet.cpp:380)
>     #06: mozilla::ServoRestyleManager::TextPostTraversalState::ComputeStyle(nsIContent*) (/home/njn/moz/mi2/layout/base/ServoRestyleManager.cpp:179 (discriminator 2))
>     #07: mozilla::ServoRestyleManager::ProcessPostTraversalForText(nsIContent*, nsStyleChangeList&, mozilla::ServoRestyleManager::TextPostTraversalState&) (/home/njn/moz/mi2/layout/base/ServoRestyleManager.cpp:458 (discriminator 1))
>     #08: mozilla::ServoRestyleManager::ProcessPostTraversal(mozilla::dom::Element*, nsStyleContext*, mozilla::ServoStyleSet*, nsStyleChangeList&) (/home/njn/moz/mi2/layout/base/ServoRestyleManager.cpp:433)
>     #09: mozilla::ServoRestyleManager::ProcessPostTraversal(mozilla::dom::Element*, nsStyleContext*, mozilla::ServoStyleSet*, nsStyleChangeList&) (/home/njn/moz/mi2/layout/base/ServoRestyleManager.cpp:430 (discriminator 1))
>     #10: mozilla::ServoRestyleManager::ProcessPostTraversal(mozilla::dom::Element*, nsStyleContext*, mozilla::ServoStyleSet*, nsStyleChangeList&) (/home/njn/moz/mi2/layout/base/ServoRestyleManager.cpp:430 (discriminator 1))
>   }
> }

That's about 4 MiB of Arc<ComputedValues>. The above output shows that each Arc<ComputedValues> allocation is 320 bytes. This includes:

- 23 Arc<> fields, each 8 bytes, giving 184 bytes

- The cached_system_font field, which is Option<ComputedSystemFont>.  ComputedSystemFont has 16 fields, all with type Au(?), so that's another 64 (for the Au fields) + 8 (for the Option tag) bytes.

- The other four fields look like they're about another 8 words, so about 64 bytes.

All that adds up to 320, roughly. That's a lot more than nsStyleContext, which is the equivalent Gecko type.
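For reference, here is the rough arithmetic behind that 320-byte figure, as a minimal sketch assuming 64-bit pointers and 4-byte Au values (the exact field layout above is a guess):

fn main() {
    // 23 Arc<_> fields, one machine word each on 64-bit.
    let arc_fields = 23 * 8;
    // cached_system_font: Option<ComputedSystemFont>, taken as 16 Au (i32)
    // fields plus a word for the Option discriminant/padding.
    let cached_system_font = 16 * 4 + 8;
    // The remaining handful of fields, roughly another 8 words.
    let other_fields = 8 * 8;
    assert_eq!(arc_fields + cached_system_font + other_fields, 320);
}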
Flags: needinfo?(manishearth)
https://github.com/servo/servo/pull/17030 incidentally fixes the ComputedSystemFont stuff. Emilio moved it to the context and we don't use it anyway.

We're probably not sharing Arcs as much as we could. Unsure.

How large is nsStyleContext?
Flags: needinfo?(manishearth)
nsStyleContext itself is about 20 words.  If you have any reset structs that are not cacheable in the rule tree, you add another 15 words when you allocate the mCachedResetData.  But I'm pretty sure that in most cases mCachedResetData is null.
So, the lion's share of the issue here is, as bz points out in comment 2, the fact that we don't cache reset structs in the rule tree. We could consider doing this, which would improve perf and memory to some degree at the expense of complexity. But note that we only save on memory to the extent that we have a lot of RuleNode sharing.

Marking this as something that doesn't block us from shipping, but as a source of memory or perf improvements if we're scrounging.
Priority: -- → P4
Summary: stylo: ComputedValues is too big → stylo: consider caching reset structs in the rule tree and shrinking ServoComputedValues
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #4)
> So, the lion's share of the issue here is, as bz points out in comment 2,
> the fact that we don't cache reset structs in the rule tree. We could
> consider doing this, which would improve perf and memory to some degree at
> the expense of complexity. But note that we only save on memory to the
> extent that we have a lot of RuleNode sharing.

Note also that this would force us to recreate the rule tree on every resize of any page with viewport units, and probably in a lot of other situations where we don't need to right now.
See bug 1367854 comment 19, which suggests (but does not yet prove) that we may need to do this.
Bobby and I discussed an approach to this after the call today.

Gecko makes the decision to cache data in the rule tree per struct.  For Stylo, we could do this, but making the decision for all reset structs at once (i.e. cache all or none) will be easier.

So we could do this in two stages.

The first stage would be to support caching of reset data with no conditions (i.e., no dependency on font-size, writing mode, etc.).  Instead of having a separate struct/allocation for the reset structs that the rule node can point to (allocation during the cascade will hurt, and we'd need to addref all structs separately when caching), or storing the 15 struct pointers inline in the rule node (bloating rule nodes that can't cache any data, and still needing to addref all structs separately when caching), we can have a single pointer to a ComputedValues object that has the reset data we are considering caching.  This can just be the same ComputedValues object that we just computed and decided has reset data that is safe to cache in the rule node.  This avoids the allocation and the 15 refcount bumps when caching.  When we decide we have reset data that is safe to cache, we just CAS it into the field on the rule node.
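To make the first stage concrete, here is a minimal sketch of the CAS publication step, with simplified stand-in types; the field and method names are hypothetical rather than the actual Servo API, and the refcount/cycle bookkeeping discussed in the next paragraph is left out.

use std::ptr;
use std::sync::Arc;
use std::sync::atomic::{AtomicPtr, Ordering};

struct ComputedValues; // stand-in for the real Servo type

struct RuleNode {
    // Hypothetical field: set at most once, pointing at a ComputedValues
    // whose reset structs are safe to cache with no conditions.
    cached_reset_style: AtomicPtr<ComputedValues>,
}

impl RuleNode {
    // Try to publish `style` as this node's cached reset data.  The winning
    // thread donates its reference; losers just drop theirs and keep going.
    fn try_cache_reset_style(&self, style: Arc<ComputedValues>) {
        let new = Arc::into_raw(style) as *mut ComputedValues;
        let result = self.cached_reset_style.compare_exchange(
            ptr::null_mut(), new, Ordering::AcqRel, Ordering::Acquire);
        if result.is_err() {
            // Someone else cached a value first; reclaim our reference.
            unsafe { drop(Arc::from_raw(new)) };
        }
    }

    fn cached_style(&self) -> Option<&ComputedValues> {
        let ptr = self.cached_reset_style.load(Ordering::Acquire);
        if ptr.is_null() { None } else { Some(unsafe { &*ptr }) }
    }
}

fn main() {
    let node = RuleNode { cached_reset_style: AtomicPtr::new(ptr::null_mut()) };
    node.try_cache_reset_style(Arc::new(ComputedValues));
    assert!(node.cached_style().is_some());
}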

One tricky thing is that if we hold a strong reference to the ComputedValues, then this causes a cycle.  But we could probably make the rule tree GC aware of this, and still add rule nodes to the free list if they only have a single refcount and we know that it is due to the cached ComputedValues' pointer back to the rule node itself.

The second stage would be to look at caching with conditions.  Like Gecko, we could have a linked list or some other structure to store the different variations of cached data.  In many cases I think we would still only have a single cached ComputedValues (perhaps with conditions).  So we could still probably avoid additional allocations until we have more than one cached on the rule node.  The field on the rule node would become a tagged pointer.  The conditions for the "only one cached ComputedValues" case should probably be a separate field on the rule node then too.  Having multiple cached computed values does make the rule tree GC more complicated, though, as it's not just a single possible strong reference that we need to ignore.

We could have a separate Arc-like object that supports two separate refcounts.  But maybe the rule tree free node tracking hack is good enough.
Emilio, given your concerns in bug 1367854 comment 21 and onward, what do you think about the proposed scheme in comment 7?
Flags: needinfo?(emilio+bugs)
Summary: stylo: consider caching reset structs in the rule tree and shrinking ServoComputedValues → stylo: consider caching style structs in the rule tree
Assignee: nobody → cam
I've talked with Cameron a bit (need to run now): http://logs.glob.uno/?c=mozilla%23servo&s=8+Aug+2017&e=8+Aug+2017#c727021

Basically, I think a separate cache would be a better option. If the rule node cache were optimally implemented, the out-of-band sharing potentially gets us a bit less sharing, as discussed in the conversation linked above, but I think it's worth it because we can get both more sharing in the shorter term and fewer headaches in general.

Advantages:

 * More self-contained.
 * No need to care about rebuilding / clearing cached data from the rule tree on viewport resizes / root font size changes / etc.
 * No need to track cyclic references on the rule tree, which is simpler and seems harder to mess up.
 * We can implement restrictions (caching stuff depending on font-size / writing-mode / etc.) trivially, which would allow way more sharing short term, and also experimenting with other improvements if we can think of them.

As mentioned, it has some potential drawbacks:

 * You share per-thread, not globally, so potentially slightly less sharing. I don't think it's really relevant, and with the other approach you can end up doing wasted work anyway when racing to insert cached data on the rule tree.

 * No sharing across traversals. Seems hard to tell how much it matters in practice, but I suspect not much given we restyle less than Gecko in response to dynamic changes.

The source for Blink disabling this sharing (and style sharing in general) for some elements matching hover / active / etc. is [1], fwiw.

[1]: https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/css/SelectorChecker.cpp?l=924&rcl=242e2067b136691816302d52dd63ac1614f9d843
Flags: needinfo?(emilio+bugs)
One disadvantage I forgot to mention of the in-TLS-cache approach is that it means we can't avoid the extra allocation when adding an entry to the cache.

(In reply to Emilio Cobos Álvarez [:emilio] from comment #9)
> Advantages:
> 
> [...]

All these simplicity-related advantages are attractive.

> As mentioned, it has some potential drawbacks:
> 
>  * You share per-thread, not globally, so potentially slightly less sharing.
> I don't think it's really relevant, and you can end up doing wasted work
> when trying to race to insert cached data on the rule tree with the other
> approach too anyway.

I guess that's a question of how much sharing we lose due to contention, racing to insert cached data, versus uncontended lookups that would've found cached data but don't because they're not on the same thread.

>  * No sharing across traversals. Seems hard to tell how much it matters in
> practice, but I suspect not much given we restyle less than Gecko in
> response to dynamic changes.

We can probably measure this in Gecko without too much trouble.

So I'm not too sure about which approach to try.  Maybe I'll do that measurement first.
(In reply to Cameron McCormack (:heycam) from comment #10)
> One disadvantage I forgot to mention of the in-TLS-cache approach is that it
> means we can't avoid the extra allocation when adding an entry to the cache.

Which extra allocation do you mean? If it's in a HashMap... Sure, I guess. But if it's an LRUCache like we use for a few other things, we can definitely make that an ArrayVec/SmallVec to avoid allocations in most cases.

> (In reply to Emilio Cobos Álvarez [:emilio] from comment #9)
> > Advantages:
> > 
> > [...]
> 
> All these simplicity-related advantages are attractive.

I personally think that they outweigh the drawbacks by a fair amount, but... :P
(In reply to Emilio Cobos Álvarez [:emilio] from comment #11)
> Which extra allocation do you mean? If it's in a HashMap... Sure, I guess.
> But if it's a LRUCache like we do for a few other things, we definitely can
> make that an ArrayVec/SmallVec to avoid allocations in most cases.

Yeah, I was thinking it would be a HashMap.  I'm not sure how much caching ability we'd lose by using an LRUCache instead.
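For comparison, here is a minimal sketch of the kind of thread-local cache being discussed, using a fixed-size inline buffer with round-robin eviction as a stand-in for the LRUCache/ArrayVec mentioned above; all type, field, and key names are hypothetical.

use std::cell::RefCell;
use std::sync::Arc;

const CACHE_SIZE: usize = 32;

// Stand-in for the shared reset structs an entry would hold.
struct CachedResets;

// What an entry is keyed on: the rule node plus the inputs that conditional
// caching depends on (parent font-size and writing mode).
#[derive(Clone, Copy, PartialEq)]
struct Key {
    rule_node_addr: usize,
    parent_font_size: i32, // app units
    writing_mode: u8,
}

struct ResetStructCache {
    // Fixed inline storage, so lookups and insertions never allocate.
    entries: [Option<(Key, Arc<CachedResets>)>; CACHE_SIZE],
    next: usize, // round-robin eviction; a real LRU would reorder on hit
}

impl ResetStructCache {
    fn new() -> Self {
        ResetStructCache { entries: std::array::from_fn(|_| None), next: 0 }
    }

    fn lookup(&self, key: &Key) -> Option<Arc<CachedResets>> {
        self.entries
            .iter()
            .flatten()
            .find(|(k, _)| k == key)
            .map(|(_, v)| v.clone())
    }

    fn insert(&mut self, key: Key, value: Arc<CachedResets>) {
        self.entries[self.next] = Some((key, value));
        self.next = (self.next + 1) % CACHE_SIZE;
    }
}

thread_local! {
    // One cache per style worker thread; nothing is shared across threads.
    static RESET_CACHE: RefCell<ResetStructCache> = RefCell::new(ResetStructCache::new());
}

fn main() {
    let key = Key { rule_node_addr: 0x1000, parent_font_size: 16 * 60, writing_mode: 0 };
    RESET_CACHE.with(|cache| {
        let mut cache = cache.borrow_mut();
        if cache.lookup(&key).is_none() {
            cache.insert(key, Arc::new(CachedResets));
        }
        assert!(cache.lookup(&key).is_some());
    });
}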
I spent some time pondering this, and came to the conclusion that we should cache in the rule tree.

Our current memory regression relative to Gecko is probably the biggest question mark hanging over what we've built. Any areas where we don't achieve Gecko parity will require us to do a lot of extra work to justify the tradeoff, and I want to minimize that.

So memory is important, and I'm wary of under-powering the optimization that we're currently hoping will be the silver bullet that closes the gap. There are two reasons why I'm concerned that a TLS cache would under-power us:
(1) Not sharing across threads. We've measured a significant difference in ComputedValues sharing between sequential and parallel, and that gap is measurable on AWSY. It might be that the impact of doing ComputedValues sharing per-thread is worse because of the whole transitive cousin sharing thing, but it's hard to be very certain about that.
(2) Not sharing across traversals. We can't quite do something similar to bug 1370604, because we won't be traversing the entire tree and the location of reset structs is mostly independent of tree structure. Maybe the particular subtree will be enough, or maybe not.

So I can't say with certainty that the TLS cache won't be good enough. But we have reasons why it might not be, and our "competition" (i.e. Gecko) is using precise sharing. We don't have time to do this twice, so I find it hard to justify not using precise sharing in stylo if we have the opportunity to.

My take on the downsides:
* Complexity/self-contained-ness: Definitely some of this, though adding another caching mechanism isn't free of complexity either, and we already have the rule node when we're doing the cascade. Checking whether the cache field is non-null is easy, and CASing the resulting ComputedValues into the cache field is easy too. The refcounting complexity is just adding an is_zero_refcount accessor that treats one as zero if the cache field is non-null (see the sketch after this list). I do agree that multiple cache entries will make this trickier though.
* Clearing stale cache data out of the rule tree: This is certainly annoying and unfortunate, but only because we're trying to share across traversals, which the other solution doesn't give us. If we didn't care about that, we could solve this by just tagging the cache with the restyle generation.
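A minimal sketch of that is_zero_refcount idea, with hypothetical field names standing in for the real rule tree types:

use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

struct ComputedValues; // stand-in

struct RuleNode {
    refcount: AtomicUsize,
    cached_reset_style: AtomicPtr<ComputedValues>, // hypothetical cache field
}

impl RuleNode {
    // The GC would treat a node as unreferenced if the only remaining strong
    // reference is the one its own cached ComputedValues holds back to it.
    fn is_effectively_unreferenced(&self) -> bool {
        let refs = self.refcount.load(Ordering::Acquire);
        let has_cached_self_ref = !self.cached_reset_style.load(Ordering::Acquire).is_null();
        refs == 0 || (refs == 1 && has_cached_self_ref)
    }
}

fn main() {
    let node = RuleNode {
        refcount: AtomicUsize::new(0),
        cached_reset_style: AtomicPtr::new(std::ptr::null_mut()),
    };
    assert!(node.is_effectively_unreferenced());
}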
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #13)
> I spent some time pondering this, and came to the conclusion that we should
> cache in the rule tree.
> 
> Our current memory regression relative to Gecko is probably the biggest
> question mark hanging over what we've built. Any areas where we don't
> achieve Gecko parity will require us to do a lot of extra work to justify
> the tradeoff, and I want to minimize that.
> 
> So memory is important, and I'm wary of under-powering the optimization that
> we're currently hoping will be the silver bullet that closes the gap. There
> are two reason why I'm concerned that a TLS cache would under-power us:
> (1) Not sharing across threads. We've measured a significant difference in
> ComputedValues sharing between sequential and parallel, and that gap is
> measurable on AWSY. It might be that the impact of doing ComputedValues
> sharing per-thread is worse because of the whole transitive cousin sharing
> thing, but it's hard to be very certain about that.
> (2) Not sharing across traversals. We can't quite do something similar to be
> 1370604, because we won't be traversing the entire tree and the location of
> reset structs is mostly independent of tree structure. Maybe the particular
> subtree will be enough, or maybe not.

However, I think there's a point that you haven't considered enough (maybe you have, but you haven't said so in your comment), and it's that we don't have a great plan to cache with a given set of restrictions in the rule tree. This may mean that we get better sharing with a thread-local cache that supports keying on the parent font-size and writing-mode than with a rule-node cache that doesn't.

This last bit depends on the content, of course, but font-size-relative units aren't rare at all.
(In reply to Emilio Cobos Álvarez [:emilio] from comment #14)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #13)
> > I spent some time pondering this, and came to the conclusion that we should
> > cache in the rule tree.
> > 
> > Our current memory regression relative to Gecko is probably the biggest
> > question mark hanging over what we've built. Any areas where we don't
> > achieve Gecko parity will require us to do a lot of extra work to justify
> > the tradeoff, and I want to minimize that.
> > 
> > So memory is important, and I'm wary of under-powering the optimization that
> > we're currently hoping will be the silver bullet that closes the gap. There
> > are two reason why I'm concerned that a TLS cache would under-power us:
> > (1) Not sharing across threads. We've measured a significant difference in
> > ComputedValues sharing between sequential and parallel, and that gap is
> > measurable on AWSY. It might be that the impact of doing ComputedValues
> > sharing per-thread is worse because of the whole transitive cousin sharing
> > thing, but it's hard to be very certain about that.
> > (2) Not sharing across traversals. We can't quite do something similar to be
> > 1370604, because we won't be traversing the entire tree and the location of
> > reset structs is mostly independent of tree structure. Maybe the particular
> > subtree will be enough, or maybe not.
> 
> However, I think there's a point that you haven't considered enough (maybe
> you have, but you haven't said that in your comment), and it's that we don't
> have a great plan to cache with a given set of restrictions in the rule
> tree. This may mean that we probably get better sharing with a thread-local
> cache that supports keying on the parent font-size and writing-mode, than
> with a rule-node cache that doesn't.

I think this is heycam's "second stage" in comment 7. It seems to me that the usual singly-linked list allocate+CAS would work fine here. It would be some complexity, but again I'm willing to pay complexity here for more-precise sharing.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #15)
> > However, I think there's a point that you haven't considered enough (maybe
> > you have, but you haven't said that in your comment), and it's that we don't
> > have a great plan to cache with a given set of restrictions in the rule
> > tree. This may mean that we probably get better sharing with a thread-local
> > cache that supports keying on the parent font-size and writing-mode, than
> > with a rule-node cache that doesn't.
> 
> I think this is heycam's "second stage" in comment 7. It seems to me that
> the usual singly-linked list allocate+CAS would work fine here. It would be
> some complexity, but again I'm willing to pay complexity here for
> more-precise sharing.

Also, once we measure how much we need the conditional sharing and in what situations, we _could_ consider a hybrid approach where the conditional sharing lived in TLS, and the unconditional sharing lived in the rule tree. It would all depend on the measurements, which are easier to do when unconditional sharing is done.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #15)
> I think this is heycam's "second stage" in comment 7. It seems to me that
> the usual singly-linked list allocate+CAS would work fine here. It would be
> some complexity, but again I'm willing to pay complexity here for
> more-precise sharing.

Sure, but I'm not sure how easy this is at all...

To be honest I'm not sure how exactly the parallel aggregation of references, depending on whether we have cached data or not, is going to work... Seems like there are a fair amount of situations to handle, and a lot of things that could go wrong, race, etc... (leaking being the least bad of the possible "bad" outcomes).

If you suddenly also need to track how many cached entries you have (which would be the "second stage"), it becomes considerably more complicated...

But even supposing we have a clear design and algorithm for that, getting it exactly right isn't trivial, and even less if we're talking about getting it exactly right for 57, with all the other stuff we need to do.

I'm just saying that getting that "second stage" working before 57 seems fairly complicated to me (happy to be proven wrong!), and that the thread-local cache is likely to give better sharing (and more consistent sharing across pages) than only the "first stage" of it. Plus all the other advantages.

But anyway. Another thing I thought we need to check is whether our root font-size setup (storing it on the device, etc.) works fine with the rule tree cache... At first I thought it wouldn't in the cross-doc getComputedStyle case (because we pull it from the global state instead of from the parent style context chain), but after thinking a bit more I don't think the rule tree cache would impact anything in that sense... So I think we're fine in that regard.
(In reply to Emilio Cobos Álvarez [:emilio] from comment #17)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #15)
> > I think this is heycam's "second stage" in comment 7. It seems to me that
> > the usual singly-linked list allocate+CAS would work fine here. It would be
> > some complexity, but again I'm willing to pay complexity here for
> > more-precise sharing.
> 
> Sure, but I'm not sure how easy is this at all...
> 
> To be honest I'm not sure how exactly the parallel aggregation of references
> depending on whether we have cached data or not going to work... Seems like
> there are a fair amount of situations to handle, and a lot of things that
> could go wrong, race, etc... (leaking being the least bad of the possible
> "bad" outcomes).
> 
> If you suddenly also need to track how many cached entries do you have
> (which would be the "second stage"), it becomes fairly more complicated...
> 
> But even supposing we have a clear design and algorithm for that, getting it
> exactly right isn't trivial, and even less if we're talking about getting it
> exactly right for 57, with all the other stuff we need to do.
> 
> I'm just saying that getting that "second stage" working before 57 seems
> fairly complicated to me (happy to be proven wrong!), and that the
> thread-local cache is likely to give better sharing (and more regular across
> pages) than only the "first stage" of it. Plus all the other advantages.

I still have a different gut feeling about the complexity level.

But how about this: we do precise sharing for the non-conditional case, and then measure how much we would gain by the conditional sharing, and try to estimate whether TLS sharing would be acceptable for that case. If it is, we could consider the hybrid approach in comment 16.

> 
> But anyway. Other thing I though we need to check is whether our root
> font-size setup (storing it on the device, etc.) works fine with the rule
> tree cache... At first I thought it wouldn't in the cross-doc
> getComputedStyle case (because we pull it from the global state instead of
> from the parent style context chain), but after thinking a bit more I don't
> think the rule tree cache would impact anything in that sense... So I think
> we're fine in that regard.

Great!
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #18)
> I still have a different gut feeling about the complexity level.
> 
> But how about this: we do precise sharing for the non-conditional case, and
> then measure how much we would gain by the conditional sharing, and try to
> estimate whether TLS sharing would be acceptable for that case. If it is, we
> could consider the hybrid approach in comment 16.

Maintaining two different cache systems doesn't make me super-happy, and I think it'd be better to measure first...

But if you really really want to do the rule tree cache, I guess that's an option.
I have a preliminary patch for unconditional caching.  On the full page HTML spec, I get style struct memory usage like the following.  Without the patch:

│  │  │  │  ├───31.08 MB (09.32%) -- servo-style-structs
│  │  │  │  │   ├──10.82 MB (03.25%) ── Display
│  │  │  │  │   ├───8.14 MB (02.44%) -- (14 tiny)
│  │  │  │  │   │   ├──2.79 MB (00.84%) ── Position
│  │  │  │  │   │   ├──2.01 MB (00.60%) ── TextReset
│  │  │  │  │   │   ├──1.24 MB (00.37%) ── Border
│  │  │  │  │   │   ├──0.44 MB (00.13%) ── Background
│  │  │  │  │   │   ├──0.37 MB (00.11%) ── UserInterface
│  │  │  │  │   │   ├──0.37 MB (00.11%) ── Padding
│  │  │  │  │   │   ├──0.32 MB (00.10%) ── Margin
│  │  │  │  │   │   ├──0.27 MB (00.08%) ── Color
│  │  │  │  │   │   ├──0.11 MB (00.03%) ── Effects
│  │  │  │  │   │   ├──0.09 MB (00.03%) ── Content
│  │  │  │  │   │   ├──0.07 MB (00.02%) ── UIReset
│  │  │  │  │   │   ├──0.05 MB (00.01%) ── List
│  │  │  │  │   │   ├──0.02 MB (00.01%) ── sundries
│  │  │  │  │   │   └──0.01 MB (00.00%) ── Column
│  │  │  │  │   ├───6.52 MB (01.95%) ── Text
│  │  │  │  │   └───5.59 MB (01.68%) ── Font

With the patch:

│  │  │  │  ├───24.43 MB (07.51%) -- servo-style-structs
│  │  │  │  │   ├───6.79 MB (02.09%) -- (12 tiny)
│  │  │  │  │   │   ├──2.53 MB (00.78%) ── Position
│  │  │  │  │   │   ├──1.19 MB (00.37%) ── Border
│  │  │  │  │   │   ├──1.08 MB (00.33%) ── TextReset
│  │  │  │  │   │   ├──0.51 MB (00.16%) ── Background
│  │  │  │  │   │   ├──0.37 MB (00.11%) ── UserInterface
│  │  │  │  │   │   ├──0.31 MB (00.10%) ── Padding
│  │  │  │  │   │   ├──0.29 MB (00.09%) ── Margin
│  │  │  │  │   │   ├──0.27 MB (00.08%) ── Color
│  │  │  │  │   │   ├──0.11 MB (00.04%) ── Effects
│  │  │  │  │   │   ├──0.08 MB (00.02%) ── Content
│  │  │  │  │   │   ├──0.05 MB (00.01%) ── List
│  │  │  │  │   │   └──0.02 MB (00.01%) ── sundries
│  │  │  │  │   ├───6.58 MB (02.02%) ── Text
│  │  │  │  │   ├───5.68 MB (01.75%) ── Font
│  │  │  │  │   └───5.37 MB (01.65%) ── Display
I still think caching them is a bad idea...

How are we going to handle style attribute changes and such, given we don't create new rule nodes for them?

Should we start doing that? That'd be unfortunate I think...

I think it'd be nice not having to throw away that optimization, at least for animation rule nodes and such (though note that we currently do create different rule nodes because we create different declaration blocks).
Flags: needinfo?(cam)
(In reply to Emilio Cobos Álvarez [:emilio] from comment #21)
> How are we going to do for style attribute changes and such, given we don't
> create new rule nodes for them?

We could disable caching in style attribute rule nodes or their descendants.
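A sketch of that bail-out check, assuming a hypothetical per-rule source tag; the real code would presumably consult the rule node's cascade level or style source instead:

// Hypothetical source tag per entry on the rule chain.
#[derive(PartialEq)]
enum Source {
    UserAgentRule,
    AuthorRule,
    StyleAttribute,
}

struct RuleChainNode<'a> {
    source: Source,
    parent: Option<&'a RuleChainNode<'a>>,
}

// Reset structs would only be cached if no rule on the chain came from a
// style attribute, since those rule nodes are reused when the attribute
// changes and the cached data would go stale.
fn may_cache_reset_structs<'a>(mut node: &'a RuleChainNode<'a>) -> bool {
    loop {
        if node.source == Source::StyleAttribute {
            return false;
        }
        match node.parent {
            Some(parent) => node = parent,
            None => return true,
        }
    }
}

fn main() {
    let root = RuleChainNode { source: Source::UserAgentRule, parent: None };
    let author = RuleChainNode { source: Source::AuthorRule, parent: Some(&root) };
    let style_attr = RuleChainNode { source: Source::StyleAttribute, parent: Some(&author) };
    assert!(may_cache_reset_structs(&author));
    assert!(!may_cache_reset_structs(&style_attr));
}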
Flags: needinfo?(cam)
I just did a quick prototype of the TLS cache, since I really think it's the right tradeoff now, and it should give better caching than the unconditional rule node cache approach. The results seem to be better, even in the parallel case on an 8-core machine. It took me a few hours to prototype it.

Patch:

│  │  │  │  ├───21.74 MB (06.70%) -- servo-style-structs
│  │  │  │  │   ├───6.97 MB (02.15%) ── Text
│  │  │  │  │   ├───6.04 MB (01.86%) ── Display
│  │  │  │  │   ├───5.95 MB (01.83%) ── Font
│  │  │  │  │   └───2.78 MB (00.86%) -- (14 tiny)
│  │  │  │  │       ├──0.71 MB (00.22%) ── Position
│  │  │  │  │       ├──0.55 MB (00.17%) ── TextReset
│  │  │  │  │       ├──0.39 MB (00.12%) ── Border
│  │  │  │  │       ├──0.38 MB (00.12%) ── UserInterface
│  │  │  │  │       ├──0.28 MB (00.09%) ── Color
│  │  │  │  │       ├──0.11 MB (00.03%) ── Background
│  │  │  │  │       ├──0.09 MB (00.03%) ── Margin
│  │  │  │  │       ├──0.08 MB (00.03%) ── Padding
│  │  │  │  │       ├──0.06 MB (00.02%) ── UIReset
│  │  │  │  │       ├──0.05 MB (00.02%) ── Effects
│  │  │  │  │       ├──0.05 MB (00.01%) ── List
│  │  │  │  │       ├──0.02 MB (00.01%) ── sundries
│  │  │  │  │       ├──0.01 MB (00.00%) ── Content
│  │  │  │  │       └──0.01 MB (00.00%) ── Column

Patch + STYLO_THREADS=1:

│  │  │  │  ├───13.46 MB (04.61%) -- servo-style-structs
│  │  │  │  │   ├───4.13 MB (01.42%) ── Display
│  │  │  │  │   ├───3.96 MB (01.36%) ── Text
│  │  │  │  │   ├───3.78 MB (01.30%) ── Font
│  │  │  │  │   └───1.60 MB (00.55%) -- (12 tiny)
│  │  │  │  │       ├──0.43 MB (00.15%) ── Position
│  │  │  │  │       ├──0.27 MB (00.09%) ── TextReset
│  │  │  │  │       ├──0.26 MB (00.09%) ── Border
│  │  │  │  │       ├──0.26 MB (00.09%) ── UserInterface
│  │  │  │  │       ├──0.14 MB (00.05%) ── Color
│  │  │  │  │       ├──0.05 MB (00.02%) ── Margin
│  │  │  │  │       ├──0.05 MB (00.02%) ── Padding
│  │  │  │  │       ├──0.04 MB (00.01%) ── Effects
│  │  │  │  │       ├──0.03 MB (00.01%) ── Background
│  │  │  │  │       ├──0.03 MB (00.01%) ── UIReset
│  │  │  │  │       ├──0.03 MB (00.01%) ── List
│  │  │  │  │       └──0.02 MB (00.01%) ── sundries

No patch:

│  │  │  │  ├───31.39 MB (09.19%) -- servo-style-structs
│  │  │  │  │   ├──11.04 MB (03.23%) ── Display
│  │  │  │  │   ├───8.28 MB (02.43%) -- (14 tiny)
│  │  │  │  │   │   ├──2.88 MB (00.85%) ── Position
│  │  │  │  │   │   ├──2.01 MB (00.59%) ── TextReset
│  │  │  │  │   │   ├──1.27 MB (00.37%) ── Border
│  │  │  │  │   │   ├──0.46 MB (00.13%) ── Background
│  │  │  │  │   │   ├──0.38 MB (00.11%) ── Padding
│  │  │  │  │   │   ├──0.34 MB (00.10%) ── UserInterface
│  │  │  │  │   │   ├──0.32 MB (00.09%) ── Margin
│  │  │  │  │   │   ├──0.26 MB (00.08%) ── Color
│  │  │  │  │   │   ├──0.11 MB (00.03%) ── Effects
│  │  │  │  │   │   ├──0.11 MB (00.03%) ── Content
│  │  │  │  │   │   ├──0.07 MB (00.02%) ── UIReset
│  │  │  │  │   │   ├──0.05 MB (00.01%) ── List
│  │  │  │  │   │   ├──0.02 MB (00.01%) ── sundries
│  │  │  │  │   │   └──0.01 MB (00.00%) ── Column
│  │  │  │  │   ├───6.52 MB (01.91%) ── Text
│  │  │  │  │   └───5.55 MB (01.62%) ── Font

I don't know what Cameron's patch looks like; maybe I'm worrying about complexity that just isn't there, but I suspect that's not the case, so I'd really, really prefer to take the thread-local cache approach if the rule node thing is complex.
(In reply to Emilio Cobos Álvarez [:emilio] from comment #23)
> Created attachment 8897909 [details] [diff] [review]
> Quick-n-dirty TLS cache prototype (with restrictions)
> 
> I just did a quick prototype of TLS cache, since I really think it's the
> right tradeoff now, and should give better caching that the unconditional
> rule node cache approach, and the results seem to be better, even in the
> parallel case on an 8-core machine.

So to be clear, it's not quite a fair comparison, since Cameron's patches only handle unconditional sharing.

That said, having this patch is a great benchmark to compare against. Cameron, can you try applying emilio's patch, and:
(1) reproduce his numbers from comment 23.
(2) disable his conditional sharing, and see how the unconditional sharing performance compares with your patch?

To recap from comment 13, my concerns with the TLS approach were (a) no sharing across threads, and (b) no sharing across traversals. Now that we have emilio's patches, it's probably relatively straightforward to measure how problematic (a) is in practice. (b) is harder, because we haven't really dug into the cross-restyle sharing case (and haven't fixed bug 1370604). But if the TLS approach looks good for (a), we can potentially investigate that.

Any other considerations I'm missing?
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #24)
> (In reply to Emilio Cobos Álvarez [:emilio] from comment #23)
> > Created attachment 8897909 [details] [diff] [review]
> > Quick-n-dirty TLS cache prototype (with restrictions)
> > 
> > I just did a quick prototype of TLS cache, since I really think it's the
> > right tradeoff now, and should give better caching that the unconditional
> > rule node cache approach, and the results seem to be better, even in the
> > parallel case on an 8-core machine.
> 
> So to be clear, it's not quite a fair comparison, since Cameron's patches
> only handle unconditional sharing.

Right, but that's the point I've been trying to make since comment 9, really: implementing this in a way that works better than the unconditional sharing is easier and cleaner. Especially if we want it for 57. Like, this is reasonably close to green[1] (it seems from that link that I just forgot to mark a few things as uncacheable on style fixups, and when using CSS variables; yay for that test!).

Again, I really may be overestimating the complexity of caching stuff conditionally on rule nodes in parallel, but to know that I'd need to see the code.

> To recap from comment 13, my concerns with the TLS approach were (a) no
> sharing across threads, and (b) no sharing across traversals. Now that we
> have emilio's patches, it's probably relatively straightforward to measure
> how problematic (a) is in practice. (b) is harder, because we haven't really
> dug into the cross-restyle sharing case (and haven't fixed bug 1370604). But
> if the TLS approach looks good for (a), we can potentially investigate that.

I think that given style sharing is a considerable memory win without (a) or (b), there's no reason this wouldn't also be true for this case. And again, I'd rather do it this way if it's easier to tweak, maintain, and prove correct than the rule node cache.

I also think we still have more sizable memory wins to implement, like bug 1385154.

And again, really sorry if I'm really overestimating the complexity of the rule node caching, but doing it conditionally, in parallel, before 57, and correctly (without any kind of tricky race) looks kinda hard to me.

Note that there were a few races on the initial rule tree that (even though harmless in that particular case) weren't caught until months after the initial landing.

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=00631bac21ab20fd862db6cbd531426b25f7640f&selectedJob=123627958
Thanks for writing that prototype patch to compare against, Emilio.  It's good to make a decision with numbers and code to compare. :-)

FWIW I think the unconditional caching that I have so far is not particularly complex.  Adding the conditions support will be more complex.  Let's see how it compares once I write it today.  These are the numbers I'm getting for the same patch, including one where I just make sharing happen in cases where we'd have conditions, as an approximation.


Without patch:

│  │  │  │  │  ├───33.78 MB (09.87%) -- servo-style-structs
│  │  │  │  │  │   ├──11.98 MB (03.50%) ── Display
│  │  │  │  │  │   ├───6.76 MB (01.98%) ── Text
│  │  │  │  │  │   ├───5.95 MB (01.74%) -- (13 tiny)
│  │  │  │  │  │   │   ├──2.06 MB (00.60%) ── TextReset
│  │  │  │  │  │   │   ├──1.39 MB (00.41%) ── Border
│  │  │  │  │  │   │   ├──0.68 MB (00.20%) ── Background
│  │  │  │  │  │   │   ├──0.45 MB (00.13%) ── Padding
│  │  │  │  │  │   │   ├──0.36 MB (00.10%) ── Margin
│  │  │  │  │  │   │   ├──0.32 MB (00.09%) ── UserInterface
│  │  │  │  │  │   │   ├──0.27 MB (00.08%) ── Color
│  │  │  │  │  │   │   ├──0.15 MB (00.04%) ── Content
│  │  │  │  │  │   │   ├──0.12 MB (00.04%) ── Effects
│  │  │  │  │  │   │   ├──0.07 MB (00.02%) ── UIReset
│  │  │  │  │  │   │   ├──0.05 MB (00.02%) ── List
│  │  │  │  │  │   │   ├──0.02 MB (00.01%) ── sundries
│  │  │  │  │  │   │   └──0.01 MB (00.00%) ── Column
│  │  │  │  │  │   ├───5.66 MB (01.65%) ── Font
│  │  │  │  │  │   └───3.43 MB (01.00%) ── Position

With patch:

│  │  │  │  ├───24.52 MB (07.53%) -- servo-style-structs
│  │  │  │  │   ├───6.73 MB (02.07%) -- (12 tiny)
│  │  │  │  │   │   ├──2.36 MB (00.73%) ── Position
│  │  │  │  │   │   ├──1.32 MB (00.41%) ── Border
│  │  │  │  │   │   ├──1.12 MB (00.34%) ── TextReset
│  │  │  │  │   │   ├──0.44 MB (00.14%) ── Background
│  │  │  │  │   │   ├──0.38 MB (00.12%) ── UserInterface
│  │  │  │  │   │   ├──0.31 MB (00.10%) ── Padding
│  │  │  │  │   │   ├──0.28 MB (00.09%) ── Margin
│  │  │  │  │   │   ├──0.27 MB (00.08%) ── Color
│  │  │  │  │   │   ├──0.11 MB (00.04%) ── Effects
│  │  │  │  │   │   ├──0.07 MB (00.02%) ── Content
│  │  │  │  │   │   ├──0.05 MB (00.01%) ── List
│  │  │  │  │   │   └──0.02 MB (00.01%) ── sundries
│  │  │  │  │   ├───6.71 MB (02.06%) ── Text
│  │  │  │  │   ├───5.74 MB (01.76%) ── Font
│  │  │  │  │   └───5.34 MB (01.64%) ── Display

Ignoring conditions:

│  │  │  │  ├───19.74 MB (06.13%) -- servo-style-structs
│  │  │  │  │   ├───7.04 MB (02.19%) ── Text
│  │  │  │  │   ├───6.68 MB (02.08%) -- (12 tiny)
│  │  │  │  │   │   ├──2.41 MB (00.75%) ── Display
│  │  │  │  │   │   ├──1.34 MB (00.42%) ── Border
│  │  │  │  │   │   ├──1.05 MB (00.33%) ── Position
│  │  │  │  │   │   ├──0.71 MB (00.22%) ── TextReset
│  │  │  │  │   │   ├──0.39 MB (00.12%) ── UserInterface
│  │  │  │  │   │   ├──0.28 MB (00.09%) ── Color
│  │  │  │  │   │   ├──0.14 MB (00.04%) ── Padding
│  │  │  │  │   │   ├──0.12 MB (00.04%) ── Background
│  │  │  │  │   │   ├──0.10 MB (00.03%) ── Effects
│  │  │  │  │   │   ├──0.07 MB (00.02%) ── Margin
│  │  │  │  │   │   ├──0.04 MB (00.01%) ── List
│  │  │  │  │   │   └──0.02 MB (00.01%) ── sundries
│  │  │  │  │   └───6.02 MB (01.87%) ── Font

Ignoring conditions + STYLO_THREADS=1:

│  │  │  │  ├────9.51 MB (03.34%) -- servo-style-structs
│  │  │  │  │    ├──6.41 MB (02.25%) -- (13 tiny)
│  │  │  │  │    │  ├──2.57 MB (00.90%) ── Font
│  │  │  │  │    │  ├──1.39 MB (00.49%) ── Display
│  │  │  │  │    │  ├──0.88 MB (00.31%) ── Position
│  │  │  │  │    │  ├──0.74 MB (00.26%) ── Border
│  │  │  │  │    │  ├──0.26 MB (00.09%) ── TextReset
│  │  │  │  │    │  ├──0.17 MB (00.06%) ── UserInterface
│  │  │  │  │    │  ├──0.11 MB (00.04%) ── Color
│  │  │  │  │    │  ├──0.08 MB (00.03%) ── Effects
│  │  │  │  │    │  ├──0.06 MB (00.02%) ── Background
│  │  │  │  │    │  ├──0.05 MB (00.02%) ── Padding
│  │  │  │  │    │  ├──0.05 MB (00.02%) ── Margin
│  │  │  │  │    │  ├──0.03 MB (00.01%) ── List
│  │  │  │  │    │  └──0.02 MB (00.01%) ── sundries
│  │  │  │  │    └──3.10 MB (01.09%) ── Text



Regardless of which approach we use, I think both our numbers show that we get good additional savings (50% more) from conditional caching.


As an aside, I realised this morning that with cached reset structs, we can easily implement Gecko's "start struct" optimization, where we begin with cached reset structs from an ancestor rule node, and just apply/cascade the declarations from the descendants.  Whether this is a win or not I'm not sure, but might be worth measuring.
I've now got a very messy patch with a lock-free linked list for storing the conditional data in the rule tree.  The numbers I'm getting are:

STYLO_THREADS=4, no caching

│  │  │  │  ├───29.33 MB (09.02%) -- servo-style-structs
│  │  │  │  │   ├──10.22 MB (03.14%) ── Display
│  │  │  │  │   ├───7.83 MB (02.41%) -- (13 tiny)
│  │  │  │  │   │   ├──2.86 MB (00.88%) ── Position
│  │  │  │  │   │   ├──1.81 MB (00.56%) ── TextReset
│  │  │  │  │   │   ├──1.12 MB (00.35%) ── Border
│  │  │  │  │   │   ├──0.46 MB (00.14%) ── Background
│  │  │  │  │   │   ├──0.35 MB (00.11%) ── Padding
│  │  │  │  │   │   ├──0.35 MB (00.11%) ── UserInterface
│  │  │  │  │   │   ├──0.28 MB (00.09%) ── Margin
│  │  │  │  │   │   ├──0.25 MB (00.08%) ── Color
│  │  │  │  │   │   ├──0.12 MB (00.04%) ── Content
│  │  │  │  │   │   ├──0.11 MB (00.03%) ── Effects
│  │  │  │  │   │   ├──0.06 MB (00.02%) ── UIReset
│  │  │  │  │   │   ├──0.04 MB (00.01%) ── List
│  │  │  │  │   │   └──0.02 MB (00.01%) ── sundries
│  │  │  │  │   ├───6.06 MB (01.86%) ── Text
│  │  │  │  │   └───5.23 MB (01.61%) ── Font

STYLO_THREADS=4, caching with no conditions

│  │  │  │  ├───24.03 MB (07.14%) -- servo-style-structs
│  │  │  │  │   ├───6.86 MB (02.04%) -- (12 tiny)
│  │  │  │  │   │   ├──2.65 MB (00.79%) ── Position
│  │  │  │  │   │   ├──1.21 MB (00.36%) ── Border
│  │  │  │  │   │   ├──1.01 MB (00.30%) ── TextReset
│  │  │  │  │   │   ├──0.54 MB (00.16%) ── Background
│  │  │  │  │   │   ├──0.35 MB (00.11%) ── UserInterface
│  │  │  │  │   │   ├──0.32 MB (00.09%) ── Padding
│  │  │  │  │   │   ├──0.26 MB (00.08%) ── Margin
│  │  │  │  │   │   ├──0.26 MB (00.08%) ── Color
│  │  │  │  │   │   ├──0.11 MB (00.03%) ── Effects
│  │  │  │  │   │   ├──0.09 MB (00.03%) ── Content
│  │  │  │  │   │   ├──0.04 MB (00.01%) ── List
│  │  │  │  │   │   └──0.02 MB (00.00%) ── sundries
│  │  │  │  │   ├───6.37 MB (01.89%) ── Text
│  │  │  │  │   ├───5.47 MB (01.62%) ── Font
│  │  │  │  │   └───5.33 MB (01.58%) ── Display


STYLO_THREADS=4, caching with conditions

│  │  │  │  ├───16.12 MB (05.18%) -- servo-style-structs
│  │  │  │  │   ├───5.80 MB (01.86%) ── Text
│  │  │  │  │   ├───5.28 MB (01.69%) -- (12 tiny)
│  │  │  │  │   │   ├──1.76 MB (00.56%) ── Display
│  │  │  │  │   │   ├──0.99 MB (00.32%) ── Border
│  │  │  │  │   │   ├──0.97 MB (00.31%) ── Position
│  │  │  │  │   │   ├──0.57 MB (00.18%) ── TextReset
│  │  │  │  │   │   ├──0.34 MB (00.11%) ── UserInterface
│  │  │  │  │   │   ├──0.24 MB (00.08%) ── Color
│  │  │  │  │   │   ├──0.10 MB (00.03%) ── Background
│  │  │  │  │   │   ├──0.09 MB (00.03%) ── Effects
│  │  │  │  │   │   ├──0.09 MB (00.03%) ── Padding
│  │  │  │  │   │   ├──0.06 MB (00.02%) ── Margin
│  │  │  │  │   │   ├──0.04 MB (00.01%) ── List
│  │  │  │  │   │   └──0.02 MB (00.01%) ── sundries
│  │  │  │  │   └───5.05 MB (01.62%) ── Font

STYLO_THREADS=1, no caching

│  │  │  │  ├───19.74 MB (06.65%) -- servo-style-structs
│  │  │  │  │   ├───8.73 MB (02.94%) -- (14 tiny)
│  │  │  │  │   │   ├──2.96 MB (01.00%) ── Font
│  │  │  │  │   │   ├──2.67 MB (00.90%) ── Position
│  │  │  │  │   │   ├──0.86 MB (00.29%) ── TextReset
│  │  │  │  │   │   ├──0.76 MB (00.26%) ── Border
│  │  │  │  │   │   ├──0.45 MB (00.15%) ── Background
│  │  │  │  │   │   ├──0.26 MB (00.09%) ── Padding
│  │  │  │  │   │   ├──0.19 MB (00.06%) ── UserInterface
│  │  │  │  │   │   ├──0.15 MB (00.05%) ── Margin
│  │  │  │  │   │   ├──0.13 MB (00.04%) ── Content
│  │  │  │  │   │   ├──0.12 MB (00.04%) ── Color
│  │  │  │  │   │   ├──0.10 MB (00.03%) ── Effects
│  │  │  │  │   │   ├──0.03 MB (00.01%) ── UIReset
│  │  │  │  │   │   ├──0.03 MB (00.01%) ── List
│  │  │  │  │   │   └──0.02 MB (00.01%) ── sundries
│  │  │  │  │   ├───7.55 MB (02.54%) ── Display
│  │  │  │  │   └───3.46 MB (01.16%) ── Text

STYLO_THREADS=1, caching with no conditions

│  │  │  │  ├───15.26 MB (05.19%) -- servo-style-structs
│  │  │  │  │   ├───5.02 MB (01.71%) -- (12 tiny)
│  │  │  │  │   │   ├──2.43 MB (00.83%) ── Position
│  │  │  │  │   │   ├──0.76 MB (00.26%) ── Border
│  │  │  │  │   │   ├──0.51 MB (00.17%) ── Background
│  │  │  │  │   │   ├──0.42 MB (00.14%) ── TextReset
│  │  │  │  │   │   ├──0.22 MB (00.07%) ── Padding
│  │  │  │  │   │   ├──0.19 MB (00.07%) ── UserInterface
│  │  │  │  │   │   ├──0.14 MB (00.05%) ── Margin
│  │  │  │  │   │   ├──0.13 MB (00.04%) ── Color
│  │  │  │  │   │   ├──0.10 MB (00.04%) ── Effects
│  │  │  │  │   │   ├──0.09 MB (00.03%) ── Content
│  │  │  │  │   │   ├──0.03 MB (00.01%) ── List
│  │  │  │  │   │   └──0.01 MB (00.00%) ── sundries
│  │  │  │  │   ├───3.58 MB (01.22%) ── Display
│  │  │  │  │   ├───3.58 MB (01.22%) ── Text
│  │  │  │  │   └───3.08 MB (01.05%) ── Font

STYLO_THREADS=1, caching with conditions

│  │  │  │  ├───10.69 MB (03.70%) -- servo-style-structs
│  │  │  │  │   ├───4.16 MB (01.44%) -- (12 tiny)
│  │  │  │  │   │   ├──1.61 MB (00.56%) ── Display
│  │  │  │  │   │   ├──0.91 MB (00.31%) ── Position
│  │  │  │  │   │   ├──0.76 MB (00.26%) ── Border
│  │  │  │  │   │   ├──0.26 MB (00.09%) ── TextReset
│  │  │  │  │   │   ├──0.19 MB (00.07%) ── UserInterface
│  │  │  │  │   │   ├──0.13 MB (00.04%) ── Color
│  │  │  │  │   │   ├──0.08 MB (00.03%) ── Effects
│  │  │  │  │   │   ├──0.07 MB (00.02%) ── Background
│  │  │  │  │   │   ├──0.06 MB (00.02%) ── Padding
│  │  │  │  │   │   ├──0.05 MB (00.02%) ── Margin
│  │  │  │  │   │   ├──0.03 MB (00.01%) ── List
│  │  │  │  │   │   └──0.02 MB (00.01%) ── sundries
│  │  │  │  │   ├───3.52 MB (01.22%) ── Text
│  │  │  │  │   └───3.01 MB (01.04%) ── Font


Obviously the reset structs are the ones to look at.  And for my patch the number of threads shouldn't affect the caching.  (The small differences are probably due to the style sharing cache, I guess.)

Anyway, I'll clean the patch up tomorrow and attach it so we can make a determination, before putting in the effort to get try green with it.
Attached patch rule node cache with conditions (obsolete) — Splinter Review
So this is what I've got after cleaning things up.  One thing still doesn't work, which is the clearing of the cached data during rule node gc.  And since this is really my first time writing code that needs to care about memory ordering in more than a trivial way, what I'm using is probably wrong.  But the patch should give you an idea of the complexity to weigh against the memory (and by extension time) savings.

Feedback welcome, Emilio and Bobby.
Comment on attachment 8898720 [details] [diff] [review]
rule node cache with conditions

Review of attachment 8898720 [details] [diff] [review]:
-----------------------------------------------------------------

FWIW, I asked Cam about the rest of the patch (since this doesn't call put_cached_style anywhere), and he told me that he only cleaned up the cache storage bit of the patch and not the others, which is fine. The rest should look similar enough I guess :)

I took a quick look at the patch before lunch... I fail to see how it works right now, but that's probably due to lack of context, I guess?

Overall looks reasonable. There are a few things that are going to be funny to prove correct though, which is the kind of thing I'm worried about... :(

::: servo/components/style/rule_tree/mod.rs
@@ +1065,3 @@
>              debug!("GC'ing {:?}", weak.ptr());
> +            let node = &*weak.ptr();
> +            node.clear_cached_styles();

should probably assert that the refcount is zero by here...

@@ +1087,5 @@
> +
> +        // FIXME(heycam): Can't we do this iteration without bumping the
> +        // refcounts of all rule nodes?
> +        for child in self.get().iter_children() {
> +            child.upgrade().clear_cached_styles();

Yeah, we kinda cheat for insertion to avoid this too, and presumably you can too, as long as you're sure they don't get deleted first... I'd need to think a bit hard about the ordering properties of the free list to see if this is fine to call during the GC though...

@@ +1424,5 @@
> +                      "cached ComputedValues must be for the same rule node");
> +
> +        let (cv, stored) = self.get().cached_style.put(style);
> +        if stored {
> +            self.get().refcount.fetch_sub(1, Ordering::Relaxed);

Well, this is fine because we don't do anything more complex in StrongRuleNode::drop, if the refcount is > 1, I suppose... Still seems somewhat nasty.

::: servo/components/style/rule_tree/style_cache.rs
@@ +89,5 @@
> +                StyleCacheValue::Single(cv) => {
> +                    // We attempted to store a value, but there was already a
> +                    // single value stored.
> +                    unsafe {
> +                        if (*cv).cache_condition_flags == flags {

I'm a bit unsure of how this works, as of right now, without seeing the rest of the patch. In particular, `cache_condition_flags` seems to be a bitfield with HAVE_FONT_SIZE / HAVE_WRITING_MODE / etc.

Also, that relies on storing something else in the computed values, but I suppose that's fine (though slightly more invasive...).

Does that field also contain the actual font-size / writing-mode / etc? Otherwise, how do the conditions work at all?

@@ +150,5 @@
> +                    }
> +
> +                    // Create a new linked list node to point to the existing
> +                    // one, and attempt to store it.
> +                    let head = Box::new(StyleCacheList {

This may make a lot of very short-lived allocations, but maybe / probably that's ok.

@@ +198,5 @@
> +                    unsafe {
> +                        // The transmute is needed since we're dealing with const
> +                        // pointers, but Box::{from_raw,into_raw} only deals with
> +                        // mut pointers.
> +                        item = Box::<StyleCacheList>::from_raw(mem::transmute(list));

FYI, you can do `as *mut _` instead, I think.

@@ +293,5 @@
> +
> +impl Iterator for StyleCacheListIter {
> +    type Item = *const ComputedValues;
> +
> +    fn next(&mut self) -> Option<Self::Item> {

So this list is iterated without any kind of synchronization to look for the same cache_flags during insertion, and nothing prevents any other inserter from mutating the list.

If I'm right this is kinda fine (would need to check memory orderings and such I guess), because we only insert a new head of the list, right?

If so, your patch doesn't guarantee that we insert each set of conditions just once, is that also right? (but that's probably ok).

I guess that means that insertion can get somewhat expensive when multiple people are racing to insert in a list, since you need to iterate the whole list each insertion attempt... But hopefully that's rare enough I suppose?
A couple of quick answers before I move to the kitchen...

(In reply to Emilio Cobos Álvarez [:emilio] from comment #29)
> ::: servo/components/style/rule_tree/style_cache.rs
> @@ +89,5 @@
> > +                StyleCacheValue::Single(cv) => {
> > +                    // We attempted to store a value, but there was already a
> > +                    // single value stored.
> > +                    unsafe {
> > +                        if (*cv).cache_condition_flags == flags {
> 
> I'm a bit unsure of how does this work, as of right now, without seeing the
> rest of the patch. In particular, `cache_condition_flags` seems to be a
> bitfield with HAVE_FONT_SIZE / HAVE_WRITING_MODE / etc.
> 
> Also, that relies on storing something else in the computed values, but I
> suppose that's fine (though slightly more invasive...).
> 
> Does that field also contain the actual font-size / writing-mode / etc?
> Otherwise, how do the conditions work at all?

I was originally planning to have a RuleNodeCacheConditions object, just like in your patch (and in Gecko), but I figured in the end that it would be nice if I could embed the conditions in the ComputedValues object itself.  The two values we can depend on, the font-size property value and the WritingMode, can just be read off the Box struct and the ComputedValues object itself.  (In Gecko, we potentially have a dependency on *inherited* font-size, but that's only for the font-size property itself, and since we are only caring about reset structs here, we don't need to worry about that.)  So yes, cache_condition_flags is just a u8 of UNCACHEABLE / HAVE_FONT_SIZE / HAVE_WRITING_MODE flags.
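As a sketch, the flag field could look something like this; the constant names follow the ones mentioned above, the values and the helper are illustrative, and a real lookup would additionally compare the actual font-size / writing-mode values read off the cached ComputedValues:

// Illustrative values; names follow the flags mentioned above.
const UNCACHEABLE: u8 = 1 << 0;
const HAVE_FONT_SIZE: u8 = 1 << 1;
const HAVE_WRITING_MODE: u8 = 1 << 2;

// Two entries were computed under the same set of conditions only if neither
// is uncacheable and the flag sets are identical.
fn same_condition_set(a: u8, b: u8) -> bool {
    a & UNCACHEABLE == 0 && b & UNCACHEABLE == 0 && a == b
}

fn main() {
    assert!(same_condition_set(HAVE_FONT_SIZE, HAVE_FONT_SIZE));
    assert!(!same_condition_set(HAVE_FONT_SIZE, HAVE_FONT_SIZE | HAVE_WRITING_MODE));
    assert!(!same_condition_set(UNCACHEABLE, UNCACHEABLE));
}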

> @@ +150,5 @@
> > +                    }
> > +
> > +                    // Create a new linked list node to point to the existing
> > +                    // one, and attempt to store it.
> > +                    let head = Box::new(StyleCacheList {
> 
> This may make a lot of very short-lived allocations, but maybe / probably
> that's ok.

Yeah, we can do the XXX comment thing just below if that's an issue.

> 
> @@ +198,5 @@
> > +                    unsafe {
> > +                        // The transmute is needed since we're dealing with const
> > +                        // pointers, but Box::{from_raw,into_raw} only deals with
> > +                        // mut pointers.
> > +                        item = Box::<StyleCacheList>::from_raw(mem::transmute(list));
> 
> FYI, you can do `as *mut _` instead, I think.

Huh, didn't realize you could "as" from const to mut pointers.

> @@ +293,5 @@
> > +
> > +impl Iterator for StyleCacheListIter {
> > +    type Item = *const ComputedValues;
> > +
> > +    fn next(&mut self) -> Option<Self::Item> {
> 
> So this list is iterated without any kind of synchronization to look for the
> same cache_flags during insertion, and nothing prevents any other inserter
> to mutate the list.
> 
> If I'm right this is kinda fine (would need to check memory orderings and
> such I guess), because we only insert a new head of the list, right?

That's right.

> If so, your patch doesn't guarantee that we insert each set of conditions
> just once, is that also right? (but that's probably ok).

I think it is guaranteed that we only get the same set of conditions once.  Before we attempt to write a new value, we check if the existing value (or any of the list items) has a ComputedValues with matching condition flags.

> I guess that means that insertion can get somewhat expensive when multiple
> people are racing to insert in a list, since you need to iterate the whole
> list each insertion attempt... But hopefully that's rare enough I suppose?

Given the check for identical conditions, I think it shouldn't be a problem.  If we wanted to be clever, we could, the next time around the loop, only check up until the list node that was the previous iteration's current value.
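To make the insertion scheme concrete, here is a minimal self-contained sketch of the prepend-only list with a re-scan before each CAS attempt, which is what gives the at-most-once guarantee for each set of conditions; the types, field names, and memory orderings are simplified assumptions, not the code from the attached patch.

use std::sync::atomic::{AtomicPtr, Ordering};

struct ComputedValues {
    cache_condition_flags: u8, // as in the patch discussion
}

struct ListNode {
    value: *const ComputedValues,
    next: *mut ListNode,
}

struct StyleCacheList {
    head: AtomicPtr<ListNode>,
}

impl StyleCacheList {
    // Store `value` unless an entry with the same condition flags already
    // exists.  Only the head is ever replaced, so readers can walk the list
    // with no extra synchronization; a failed CAS means another thread
    // prepended something, and the re-scan before the next attempt is what
    // keeps each set of conditions stored at most once.
    fn put(&self, value: *const ComputedValues) -> *const ComputedValues {
        let flags = unsafe { (*value).cache_condition_flags };
        loop {
            let head = self.head.load(Ordering::Acquire);

            // Scan the current list for an entry with matching conditions.
            let mut cur = head;
            while !cur.is_null() {
                let existing = unsafe { (*cur).value };
                if unsafe { (*existing).cache_condition_flags } == flags {
                    return existing; // someone beat us to it; reuse theirs
                }
                cur = unsafe { (*cur).next };
            }

            // No match: try to prepend a new node pointing at the old head.
            let new = Box::into_raw(Box::new(ListNode { value, next: head }));
            match self.head.compare_exchange(head, new, Ordering::AcqRel, Ordering::Acquire) {
                Ok(_) => return value,
                // Lost the race: free our node and go around again.
                Err(_) => unsafe { drop(Box::from_raw(new)) },
            }
        }
    }
}

fn main() {
    let list = StyleCacheList { head: AtomicPtr::new(std::ptr::null_mut()) };
    let a = Box::into_raw(Box::new(ComputedValues { cache_condition_flags: 0 }));
    let b = Box::into_raw(Box::new(ComputedValues { cache_condition_flags: 0 }));
    // A second put with the same flags returns the first entry.
    assert_eq!(list.put(a), a as *const _);
    assert_eq!(list.put(b), a as *const _);
}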
Originally I had hoped that we could store a whole ComputedValues in the RuleNode's cache, and when we first produce a ComputedValues and decide to store in the cache, if we don't need to do any StyleAdjuster work on it, then we could return another reference to that same ComputedValues from apply_declarations.  Because the ComputedValues objects held in the RuleNode's cache would have a strong reference to the RuleNode itself, it seemed like we could just mess with the RuleNode's refcount when adding and removing entries from the cache to counteract this, and have refcount = 0 mean "there are no StrongRuleNode references aside from any in the cache".

Unfortunately, I don't think this scheme works, because we are caching ComputedValues objects, and really what we want to know (when messing with the refcount) is "is this a unique ComputedValues object", because only then do we want to discount its own strong RuleNode reference.  But if we can return a reference to this same ComputedValues object, then that means we can have Arc<ComputedValues> out in the wild referencing a RuleNode with refcount = 0.  This would mean during GC we would need to check whether this 0 really meant that we could destroy the RuleNode or not.  That check would involve looking at the current cache entries on the RuleNode and counting the number that have unique Arc<ComputedValues>s.  This is do-able, but would complicate the free list, since we would have many more RuleNodes in there with refcount = 0 which we would then determine aren't eligible for deletion.

It's an additional complication that the visited values have their own StrongRuleNode reference.

Anything more clever than this would need coordination with Arc<ComputedValues>, I guess, which would be even less clean.  And would probably exceed my desired cleverness limit anyway.

So for now I'm going to store ComputedValues in the RuleNode, but without a RuleNode reference (or visited_style) in it.  And measure the memory impact of that.  Note that my measurements from comment 27 did not try to use the same ComputedValues object for storing in the cache and immediately returning it from apply_declarations (when there were no style adjustments), so I don't think this changes the memory saving of the patch so far.

(Changing to store a special cache entry object instead of a ComputedValues object, which would just have the 15 reset struct pointers plus the dependent conditional data, could save a bit more.  This is probably a more marginal saving, though, so I'll save that for later.  Or I might do it, depending on how quickly I can flush out these last oranges.)
Attached patch rule node cache with conditions (obsolete) — Splinter Review
This patch, which I reworked a bit so it doesn't store entire ComputedValues objects in the cache, doesn't do as well as I expected.  I haven't managed to work out why it does worse than the TLS cache with STYLO_THREADS=4.
Attachment #8898720 - Attachment is obsolete: true
I dumped some stats from my patch on the full page HTML spec:

    227 cacheable 
    130 cacheable HAVE_FONT_SIZE
     17 cacheable HAVE_FONT_SIZE | HAVE_WRITING_MODE
      8 cacheable HAVE_WRITING_MODE
   1932 UNCACHEABLE due to border-bottom-color: currentcolor
    256 UNCACHEABLE due to border-right-color: currentcolor
   5117 UNCACHEABLE due to is_root()
      4 UNCACHEABLE due to margin-right: 5rem
   1071 UNCACHEABLE due to mutable style source
  98250 UNCACHEABLE due to pseudo
  11821 UNCACHEABLE due to text-decoration-color: currentcolor
    383 UNCACHEABLE due to vertical-align: inherit
  43702 UNCACHEABLE due to visited-dependent
  51083 used cached reset style

So 382 unique sets of reset structs, and 51083 times we re-use them.  That seems pretty good.  We don't know what specific structs were mutated there, so we can't directly correlate those to memory savings.

As Emilio mentions in a comment in his patch, it might be worth also caching visited dependent contexts.  The standout one above is probably the pseudo, where I'm faulting out of caching if we're computing for any pseudo.  The fact that we have all the eager pseudos to cascade probably doesn't help.  But really I'm only skipping pseudos because of pseudo restrictions, where we'd get different cascade results.  But now that I think about that, we shouldn't need to worry about that at all, since the fact that we have a rule node for the pseudo rule should be enough to distinguish it from other, non-pseudo things... so without that I get:

    248 cacheable 
    162 cacheable HAVE_FONT_SIZE
     17 cacheable HAVE_FONT_SIZE | HAVE_WRITING_MODE
     10 cacheable HAVE_WRITING_MODE
     99 UNCACHEABLE due to background-color: inherit
   1919 UNCACHEABLE due to border-bottom-color: currentcolor
   2405 UNCACHEABLE due to border-right-color: currentcolor
      1 UNCACHEABLE due to display: inherit
   4878 UNCACHEABLE due to is_root()
      4 UNCACHEABLE due to margin-right: 5rem
   1020 UNCACHEABLE due to mutable style source
      2 UNCACHEABLE due to opacity: inherit
   7915 UNCACHEABLE due to overflow-clip-box: inherit
  11526 UNCACHEABLE due to text-decoration-color: currentcolor
     15 UNCACHEABLE due to text-decoration-line: inherit
    136 UNCACHEABLE due to text-overflow: inherit
     82 UNCACHEABLE due to vector-effect: inherit
    353 UNCACHEABLE due to vertical-align: inherit
  40738 UNCACHEABLE due to visited-dependent
 130826 used cached reset style

I also checked how much it's wastefully caching all-default reset structs, but it's not that often:

    420 cached all default structs

I realize that I have some restrictions that Emilio's patch doesn't (e.g. the currentcolor ones).  (Also it looks like there's a small mistake in that patch where "initial" rather than "inherit" for reset properties is treated as implying uncacheability.  But given the relatively few "inherit" values I see above, that probably doesn't make a material difference to the memory usage.)


I'll do one more round of memory measurements with the pseudo thing fixed (for a comparison), and one where we're also handling visited-dependent style, and see where we stand.  If my patch still isn't clearly better at that point I think we should go for Emilio's.
(In reply to Cameron McCormack (:heycam) from comment #33)
>    1071 UNCACHEABLE due to mutable style source

An interesting thing to note is that this one is required in my patch (to handle style="" declarations changing underneath us), because we retain the cached data for subsequent traversals.  Emilio's patch does not need to worry about that.
(In reply to Cameron McCormack (:heycam) from comment #33)
> I'll do one more round of memory measurements with the pseudo thing fixed
> (for a comparison), and one where we're also handling visited-dependent
> style, and see where we stand.  If my patch still isn't clearly better at
> that point I think we should go for Emilio's.

Ok. If we do, I think we should make sure we understand why we can't achieve the same degree of sharing as gecko with the precise approach. The 7MB delta between emilio's patch and gecko accounts for 25-35% of our AWSY regression, and would put a bound on how close we can get.
(In reply to Cameron McCormack (:heycam) from comment #33)
> But now that I think about that, we
> shouldn't need to worry about that at all, since the fact that we have a
> rule node for the pseudo rule should be enough to distinguish it from other,
> non-pseudo things...

This is wrong, per the nine-year-old bug 469227 comment 9, since we have the same rule node for multiple selectors on the one rule, and one selector can have a restricting pseudo while the other does not.  Still, we can handle this straightforwardly by turning off caching if we need to apply any restrictions.  (Or store which restricting pseudo we had in our conditions, but this isn't common enough to worry about.)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #35)
> Ok. If we do, I think we should make sure we understand why we can't achieve
> the same degree of sharing as gecko with the precise approach. The 7MB delta
> between emilio's patch and gecko accounts for 25-35% of our AWSY regression,
> and would put a bound on how close we can get.

Some more guesses:

* the fact that we don't decide to cache per struct, but for all reset structs at once (might be a factor)
* the fact that we can't cache entirely-specified inherited structs in the rule tree (less likely)
* Gecko's lazy struct resolution (might be...)


And here are some HTML spec measurements as of now:

STYLO_THREADS=4 heycam: 14.83 MB
STYLO_THREADS=4 emilio: 22.09 MB
STYLO_THREADS=1 heycam: 11.11 MB
STYLO_THREADS=1 emilio: 12.42 MB
                 Gecko:  2.87 MB

This is after I switched the initial/inherit handling in Emilio's patch.  Since we're basically on par with STYLO_THREADS=1, I think I've found the cases where I was missing sharing in my patch.  No AWSY measurements on the latest patches yet.

End of my day now, but tomorrow I'm going to do some instrumentation of Gecko to see whether any of those guesses above might be correct.
(In reply to Cameron McCormack (:heycam) from comment #37)
> * the fact that we don't decide to cache per struct, but for all reset
> structs at once (might be a factor)

With my current patch we are failing to cache reset data due to explicit inherit values in 9,589 cases.  I haven't recorded whether we would have been able to successfully cache all the other reset structs on the same rule node.  The total weight of all reset structs is 1,992 bytes.  The average weight of all reset structs but one, if we assume only one is uncacheable, is 1,860 bytes.  So, under that assumption, that would give an average total of ~17 MB due to missed caching.  The real value is probably going to be a bit less.

> * the fact that we can't cache entirely-specified inherited structs in the
> rule tree (less likely)

In the HTML spec, Gecko caches 48,171 fully specified inherited structs in the rule tree, for a total saving of 192,684 bytes.  My guess is that these are all Color structs.

> * Gecko's lazy struct resolution (might be...)

In the HTML spec, forcing all style structs to be computed in the document only caused around 300 more to be created.


Caching per struct seems like the right thing to look at.
Well, that helped slightly, but not nearly as much as I'd hoped.

Another place where I think we are missing out on some sharing is anonymous box style data.  The HTML spec has a lot of them:

   4204 ResolveInheritingAnonymousBoxStyle :-moz-table-column
   3180 ResolveInheritingAnonymousBoxStyle :-moz-cell-content
   1278 ResolveInheritingAnonymousBoxStyle :-moz-table-row
    947 ResolveInheritingAnonymousBoxStyle :-moz-table-wrapper
    947 ResolveInheritingAnonymousBoxStyle :-moz-table-column-group
    908 ResolveInheritingAnonymousBoxStyle :-moz-table-row-group
     78 ResolveInheritingAnonymousBoxStyle :-moz-svg-text
        ...

They are unique ComputedValues objects, i.e. we don't use a cached anonymous box style in ServoStyleSet::ResolveNonInheritingAnonymousBoxStyle.

A bunch of these just have a |display: whatever| on them, so should be shareable.  But notably, the ::-moz-cell-content rule also has |text-overflow: inherit; overflow-clip-box: inherit;|, which causes Text and Box not to be shareable.  Text is 80 bytes, which is 254,400 bytes of lost sharing, and Box is 416 bytes, which is 1,322,880 bytes of lost sharing.

::-moz-table-wrapper is also interesting:

  http://searchfox.org/mozilla-central/rev/89e125b817c5d493babbc58ea526be970bd3748e/layout/style/res/ua.css#24

So that's all of Box, Margin, Position, and Effects that aren't cacheable due to inherit values.  That's a total of 936 bytes per ::-moz-table-wrapper, so 886,392 bytes of missed sharing.

Gecko is probably sharing a bunch of these styles with its style context sharing mechanism.  In Stylo, the style sharing cache won't be used for anonymous boxes.

The other top anonymous boxes other than ::-moz-cell-content and ::-moz-table-wrapper don't have uncacheable reset structs, but the inability to share the ComputedValues themselves (assuming they could be -- they might have different ancestor chains that wouldn't be cousin-shareable) could be up to ~2 MB.

Why is bug 1367904 ineffective here?  Is it because we don't have cousin shared ServoStyleContext objects?
(In reply to Cameron McCormack (:heycam) from comment #39)
> Why is bug 1367904 ineffective here?  Is it because we don't have cousin
> shared ServoStyleContext objects?

I mean bug 1368290.
My current patch, which adds conditions for the explicitly inherited reset properties mentioned in ::-moz-cell-content, only gets STYLO_THREADS=1 down to 10.21 MB.

│  │  │  │  ├───10.21 MB (03.29%) -- servo-style-structs
│  │  │  │  │   ├───3.69 MB (01.19%) ── Text
│  │  │  │  │   ├───3.29 MB (01.06%) ── Font
│  │  │  │  │   └───3.24 MB (01.04%) -- (12 tiny)
│  │  │  │  │       ├──1.54 MB (00.50%) ── Display
│  │  │  │  │       ├──0.70 MB (00.22%) ── Border
│  │  │  │  │       ├──0.43 MB (00.14%) ── Position
│  │  │  │  │       ├──0.21 MB (00.07%) ── UserInterface
│  │  │  │  │       ├──0.13 MB (00.04%) ── Color
│  │  │  │  │       ├──0.05 MB (00.02%) ── Background
│  │  │  │  │       ├──0.05 MB (00.01%) ── Margin
│  │  │  │  │       ├──0.04 MB (00.01%) ── Effects
│  │  │  │  │       ├──0.03 MB (00.01%) ── TextReset
│  │  │  │  │       ├──0.03 MB (00.01%) ── List
│  │  │  │  │       ├──0.03 MB (00.01%) ── sundries
│  │  │  │  │       └──0.01 MB (00.00%) ── Padding

With Nick's help in bug 1393636, we now have a breakdown of Gecko style struct usage per struct.  This is what I get currently in Gecko:

│  │  │  │  ├────2.99 MB (01.06%) -- (4 tiny)
│  │  │  │  │    ├──2.78 MB (00.99%) -- gecko-style-structs
│  │  │  │  │    │  ├──0.80 MB (00.29%) ── Display
│  │  │  │  │    │  ├──0.73 MB (00.26%) ── Font
│  │  │  │  │    │  ├──0.62 MB (00.22%) ── Text
│  │  │  │  │    │  ├──0.23 MB (00.08%) ── Position
│  │  │  │  │    │  ├──0.12 MB (00.04%) ── TextReset
│  │  │  │  │    │  ├──0.11 MB (00.04%) ── UserInterface
│  │  │  │  │    │  ├──0.04 MB (00.01%) ── Border
│  │  │  │  │    │  ├──0.03 MB (00.01%) ── Background
│  │  │  │  │    │  ├──0.02 MB (00.01%) ── sundries
│  │  │  │  │    │  ├──0.02 MB (00.01%) ── List
│  │  │  │  │    │  ├──0.02 MB (00.01%) ── Margin
│  │  │  │  │    │  ├──0.01 MB (00.00%) ── Padding
│  │  │  │  │    │  └──0.01 MB (00.00%) ── Effects

And writing out Stylo's usage for the top structs (with my current patch) as a percentage of Gecko struct usage, it's:

  Text:     595.2%
  Font:     450.7%
  Display:  192.5%
  Position: 187.0%
  UI:       190.9%

Display and Position would probably go down a bit if we allowed rule node cache conditions to depend on arbitrary explicitly inherited properties (which would handle ::-moz-table-wrapper).

But why is the usage of Text and Font (two inherited structs) so high?
(In reply to Cameron McCormack (:heycam) from comment #41)
> │  │  │  │  │   ├───3.29 MB (01.06%) ── Font

To put that in perspective, that's ~5.6 Font structs per element in the document...
Ah, that's because there's a bare

  ::before { font: ... }

rule in the style sheet.  I haven't yet rebased over bug 1385154, which will help with this.
The current patch, along with bug 1385154 and the bug 1367854 comment 24 idea, gets us down to 5.25 MB with STYLO_THREADS=1.  Still no decrease in explicit/resident sizes as reported by AWSY.  I need to look into that and understand what's going on.
(In reply to Cameron McCormack (:heycam) from comment #39)
> A bunch of these just have a |display: whatever| on them, so should be
> shareable.  But notably, the ::-moz-cell-content rule also has
> |text-overflow: inherit; overflow-clip-box: inherit;|, which causes Text and
> Box not to be shareable.  Text is 80 bytes, which is 254,500 bytes of lost
> sharing, and Box is 416 bytes, which is 1,322,880 bytes of lost sharing.
> 
> ::-moz-table-wrapper is also interesting:
> 
>  
> http://searchfox.org/mozilla-central/rev/
> 89e125b817c5d493babbc58ea526be970bd3748e/layout/style/res/ua.css#24
> 
> So that's all of Box, Margin, Position, and Effects that aren't cacheable
> due to inherit values.  That's a total of 936 bytes per
> ::-moz-table-wrapper, so 886,392 bytes of missed sharing.
> 
> Gecko is probably sharing a bunch of these styles with its style context
> sharing mechanism.  In Stylo, the style sharing cache won't be used for
> anonymous boxes.

IIUC, Gecko checks whether all properties for a struct are inherit or initial, regardless of whether the struct itself is inherit or initial. That mechanism may work especially well for this kind of case, I suppose.
For these anonymous boxes, not all properties in the one (reset) struct are set to inherit, so unfortunately it doesn't help.
Attached patch rule node cache with conditions (obsolete) — Splinter Review
Current patch.
Attachment #8900555 - Attachment is obsolete: true
Attached patch rule node cache with conditions (obsolete) — Splinter Review
Rebased.
Attachment #8901735 - Attachment is obsolete: true
We're going to land the TLS struct sharing patch.  Will push that (for me to review) and some followup patches (for Emilio to review) in a moment.
In terms of AWSY effect, it performs slightly better after these fixes.  (Presumably due to not faulting out of sharing for values of inherited properties.)
Attachment #8902610 - Attachment is obsolete: true
Comment on attachment 8907459 [details]
Bug 1367635 - Part 1: Add a TLS-based style struct caching mechanism.

https://reviewboard.mozilla.org/r/179144/#review184264
Attachment #8907459 - Flags: review?(cam) → review+
Comment on attachment 8907460 [details]
Bug 1367635 - Part 2: Set writing mode dependency and uncacheable state only for non-inherited properties.

https://reviewboard.mozilla.org/r/179146/#review184286
Attachment #8907460 - Flags: review?(emilio) → review+
Comment on attachment 8907463 [details]
Bug 1367635 - Part 5: Make structs uncacheable if logical float/clear property values are used.

https://reviewboard.mozilla.org/r/179152/#review184294

::: servo/components/style/properties/longhand/box.mako.rs:281
(Diff revision 1)
>          fn to_computed_value(&self, context: &Context) -> computed_value::T {
>              let ltr = context.style().writing_mode.is_bidi_ltr();
>              // https://drafts.csswg.org/css-logical-props/#float-clear
>              match *self {
> -                SpecifiedValue::inline_start if ltr => computed_value::T::left,
> -                SpecifiedValue::inline_start => computed_value::T::right,
> +                SpecifiedValue::inline_start => {
> +                    context.rule_cache_conditions.borrow_mut()

Maybe we could reduce the number of borrow_mut() calls by adding methods on `Context` instead... your call.
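
Something like this, roughly, with stand-in types (the real Context and RuleCacheConditions live in the style crate, and the method/field names here are guesses):

    use std::cell::RefCell;

    #[derive(Default)]
    struct RuleCacheConditions { depends_on_writing_mode: bool }

    struct Context { rule_cache_conditions: RefCell<RuleCacheConditions> }

    impl Context {
        // One forwarding method so callers don't repeat borrow_mut() at
        // every use site.
        fn note_writing_mode_dependency(&self) {
            self.rule_cache_conditions.borrow_mut().depends_on_writing_mode = true;
        }
    }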
Attachment #8907463 - Flags: review?(emilio) → review+
Comment on attachment 8907459 [details]
Bug 1367635 - Part 1: Add a TLS-based style struct caching mechanism.

https://reviewboard.mozilla.org/r/179144/#review184298

::: servo/components/style/rule_cache.rs:71
(Diff revision 1)
> +}
> +
> +/// A TLS cache from rules matched to computed values.
> +pub struct RuleCache {
> +    // FIXME(emilio): Consider using LRUCache or something like that?
> +    map: FnvHashMap<StrongRuleNode, Vec<(RuleCacheConditions, Arc<ComputedValues>)>>,

While at it, given how we use this, this could be a SmallVec.
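
Something along these lines, say (the inline capacity is a guess, and the types here are just stand-ins for the real ones):

    use std::sync::Arc;
    use fnv::FnvHashMap;
    use smallvec::SmallVec;

    struct StrongRuleNode;       // stand-ins for the real types
    struct RuleCacheConditions;
    struct ComputedValues;

    // Keep the first few (conditions, values) pairs inline per rule node;
    // most nodes only ever see a handful of distinct condition sets.
    type Entries = SmallVec<[(RuleCacheConditions, Arc<ComputedValues>); 3]>;
    type RuleCacheMap = FnvHashMap<StrongRuleNode, Entries>;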
Comment on attachment 8907464 [details]
Bug 1367635 - Part 6: Add writing mode dependency if special MozLength keywords are used.

https://reviewboard.mozilla.org/r/179154/#review184302

::: servo/components/style/values/computed/length.rs:731
(Diff revision 2)
>          match *self {
>              specified::MozLength::LengthOrPercentageOrAuto(ref lopoa) => {
>                  MozLength::LengthOrPercentageOrAuto(lopoa.to_computed_value(context))
>              }
>              specified::MozLength::ExtremumLength(ref ext) => {
> +                debug_assert!(context.for_non_inherited_property.is_some(),

Huh, too bad, this could've been derived :/
Attachment #8907464 - Flags: review?(emilio) → review+
Comment on attachment 8907465 [details]
Bug 1367635 - Part 7: Make structs uncacheable if currentcolor is used on properties not using ComplexColor storage.

https://reviewboard.mozilla.org/r/179156/#review184310

I wonder if your derive approach would be cleaner (just calling it from `cascade`, and tweaking the right RuleCacheConditions)... If it's too much of a hassle to change it at this point, maybe you could file an issue on it?
Attachment #8907465 - Flags: review?(emilio) → review+
(In reply to Emilio Cobos Álvarez [:emilio] from comment #76)
> I wonder if your derive approach would be cleaner (just calling it from
> `cascade`, and tweaking the right RuleCacheConditions)... If it's too much
> of a hassle to change it at this point, maybe you could file an issue on it?

I will file a followup for moving to a derive-based thing.  I think the currentcolor changes aren't very different in the derived version though.
Comment on attachment 8907466 [details]
Bug 1367635 - Part 8: Don't cache structs if they have been adjusted.

https://reviewboard.mozilla.org/r/179158/#review184314

::: servo/components/style/properties/properties.mako.rs:2741
(Diff revision 2)
>          % if property.ident == "display":
>          self.flags.insert(::properties::computed_value_flags::INHERITS_DISPLAY);
>          % endif
>  
> +        % if not property.style_struct.inherited:
> +        self.modified_reset = true;

This could move to the `INHERITS_RESET_STYLE` block. Or even just use that flag.

::: servo/components/style/properties/properties.mako.rs:2912
(Diff revision 2)
> +        self.modified_reset = false;
> +    }
> +
> +    /// Returns whether we have mutated any reset structs since the the last
> +    /// time `clear_modified_reset` was called.
> +    pub fn modified_reset(&self) -> bool {

no need to be `pub`, here and everywhere else.
Attachment #8907466 - Flags: review?(emilio) → review+
Comment on attachment 8907459 [details]
Bug 1367635 - Part 1: Add a TLS-based style struct caching mechanism.

https://reviewboard.mozilla.org/r/179144/#review184322

::: servo/components/style/properties/gecko.mako.rs:266
(Diff revision 1)
>  impl ComputedValuesInner {
> +    /// Whether we're a visited style.
> +    pub fn is_style_if_visited(&self) -> bool {
> +        self.flags.contains(IS_STYLE_IF_VISITED)
> +    }
> +

self-nit: there's an extra newline here.

::: servo/components/style/stylesheets/viewport_rule.rs:712
(Diff revision 1)
>          let provider = get_metrics_provider_for_product();
>  
>          let default_values = device.default_computed_values();
>  
> +        // FIXME(emilio): remove Default::default(),
> +        let mut conditions = Default::default();

Can you just address the fixme (here and above) and use `RuleCacheConditions::default`? I was being somewhat lazy because I wanted to get it done fast.
Comment on attachment 8907467 [details]
Bug 1367635 - Part 9: Don't use rule cache for property-restricted pseudo-elements.

https://reviewboard.mozilla.org/r/179160/#review184320

::: servo/components/style/properties/properties.mako.rs:3224
(Diff revision 2)
>              custom_properties, &inherited_custom_properties);
>  
> +    // A pseudo-element with property restrictions can result in different
> +    // computed values if it's also used for a non-pseudo. Don't cache any
> +    // structs in this case, or use existing cached structs.
> +    if pseudo.and_then(|p| p.property_restriction()).is_some() {

Should this be done instead in `RuleCache::find` (where we do the :visited handling)?

What prevents us from inserting them into the cache? That's done outside of `cascade`, so this probably also needs handling in `insert_if_possible`.

The builder already tracks the pseudo, so it should be easy.
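
Roughly the shape I have in mind, with stand-in types (the real RuleCache/StyleBuilder live in the style crate, and the names here are guesses):

    struct PseudoElement { property_restriction: Option<u32> }
    struct StyleBuilder<'a> { pseudo: Option<&'a PseudoElement> }

    struct RuleCache;

    impl RuleCache {
        // Restricted pseudo-elements neither read from nor write to the cache.
        fn applies_to(builder: &StyleBuilder) -> bool {
            builder.pseudo.map_or(true, |p| p.property_restriction.is_none())
        }

        fn find(&self, builder: &StyleBuilder) -> Option<()> {
            if !Self::applies_to(builder) {
                return None;
            }
            // ... look up by rule node and matching conditions ...
            None
        }

        fn insert_if_possible(&mut self, builder: &StyleBuilder) {
            if !Self::applies_to(builder) {
                return;
            }
            // ... insert the newly cascaded structs ...
        }
    }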
Attachment #8907467 - Flags: review?(emilio)
Comment on attachment 8907461 [details]
Bug 1367635 - Part 3: Record the property we are computing on computed::Context, if it's a non-inherited one.

https://reviewboard.mozilla.org/r/179148/#review184324

I wasn't sure why you needed the property, which is why I reviewed the following patches first. Looks good.

::: servo/components/style/properties/helpers.mako.rs:317
(Diff revision 1)
>                      panic!("variables should already have been substituted")
>                  }
>                  _ => panic!("entered the wrong cascade_property() implementation"),
>              };
>  
> +            context.for_non_inherited_property =

not sure if I love the `for_`. But I guess it's fine.
Attachment #8907461 - Flags: review?(emilio) → review+
Comment on attachment 8907462 [details]
Bug 1367635 - Part 4: Make structs uncacheable if ex/ch units are used.

https://reviewboard.mozilla.org/r/179150/#review184326
Attachment #8907462 - Flags: review?(emilio) → review+
(In reply to Emilio Cobos Álvarez [:emilio] from comment #80)
> ::: servo/components/style/properties/properties.mako.rs:3224
> (Diff revision 2)
> >              custom_properties, &inherited_custom_properties);
> >  
> > +    // A pseudo-element with property restrictions can result in different
> > +    // computed values if it's also used for a non-pseudo. Don't cache any
> > +    // structs in this case, or use existing cached structs.
> > +    if pseudo.and_then(|p| p.property_restriction()).is_some() {
> 
> Should this be done instead in `RuleCache::find` (where we do the :visited
> handling)?

Sounds good.

> What prevents us from inserting them into the cache? That's done outside of
> `cascade`, so this probably also needs handling in `insert_if_possible`.

Oops, yes.  Anyway, makes sense to put these checks in the RuleCache functions themselves.
Comment on attachment 8907468 [details]
Bug 1367635 - Part 11: Don't re-cache structs if we just got them from the cache.

https://reviewboard.mozilla.org/r/179162/#review184328

I don't understand why this is needed. This is not a big allocation. Is it just to save the `malloc` of the `HashMap`'s buffer? I believe that may not necessarily always be a win.

If this were a huge stack type that we boxed, like the style sharing cache, I'd get it (though I still don't like it).
Attachment #8907468 - Flags: review?(emilio)
Comment on attachment 8907469 [details]
Bug 1367635 - Part 10: Don't cache structs with custom property references.

https://reviewboard.mozilla.org/r/179164/#review184330
Attachment #8907469 - Flags: review?(emilio) → review+
(In reply to Emilio Cobos Álvarez [:emilio] from comment #84)
> I don't understand why this is needed. This is not a big allocation. Is it
> just to save the `malloc` of the `HashMap`'s buffer? I believe that may not
> necessarily always be a win.
> 
> If this were a huge stack type that we boxed, like the style sharing cache,
> I'd get it (though I still don't like it).

Bobby (since you requested this)?
Flags: needinfo?(bobbyholley)
(In reply to Cameron McCormack (:heycam) from comment #86)
> (In reply to Emilio Cobos Álvarez [:emilio] from comment #84)
> > I don't understand why this is needed. This is not a big allocation. Is it
> > just to save the `malloc` of the `HashMap`'s buffer? I believe that may not
> > necessarily always be a win.
> > 
> > If this were a huge stack type that we boxed, like the style sharing cache,
> > I'd get it (though I still don't like it).
> 
> Bobby (since you requested this)?

HashMap buffers aren't small. The initial allocation size is 32 entries, which includes both key and value, so especially if we have some SmallVecs in there (which we totally should), the buffer would be > 1k.

That said, this is probably less of an issue now that we have the adaptive traversal. When I originally suggested this I imagined that we could just stick it in our already-persisted storage, but Cameron pointed out that it would mess up the abstractions a bit.

So just measure the impact on tiny-traversals-singleton. If the number doesn't move, I'm fine to skip this. If it does, we should do this or something similar.
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #87)
> So just measure the impact on tiny-traversals-singleton. If the number
> doesn't move, I'm fine to skip this. If it does, we should do this or
> something similar.

It didn't make much difference.  On my machine, 1935 without that patch, 1924 with it.  I'll drop it.
Comment on attachment 8907468 [details]
Bug 1367635 - Part 11: Don't re-cache structs if we just got them from the cache.

https://reviewboard.mozilla.org/r/179162/#review184838
Attachment #8907468 - Flags: review?(emilio) → review+
Comment on attachment 8907467 [details]
Bug 1367635 - Part 9: Don't use rule cache for property-restricted pseudo-elements.

https://reviewboard.mozilla.org/r/179160/#review184842
Attachment #8907467 - Flags: review?(emilio) → review+
This landed: https://github.com/servo/servo/commit/874cb0d9df44e62a78d427f22f234a13227d07f8
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Renaming this bug, since we're not actually caching the structs in the rule tree anymore.
Summary: stylo: consider caching style structs in the rule tree → stylo: share reset structs during the style traversal