Closed Bug 1367862 Opened 7 years ago Closed 7 years ago

stylo: Not much ComputedValues sharing going on on the Obama wikipedia page

Categories

(Core :: CSS Parsing and Computation, enhancement, P5)

53 Branch
enhancement

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: bzbarsky, Assigned: bzbarsky)

References

()

Details

(Keywords: meta)

Attachments

(2 files, 2 obsolete files)

Attached patch Logging patch (obsolete) — Splinter Review
I'm using rev a040a09aa7a1 with the attached patch applied to do logging.

Summary: The page, as of today, has about 32k nodes, of which there are about 15k elements; the rest are textnodes, comments, etc.  Gecko allocates about 2500 nsStyleContexts.  Stylo allocates about 22.5k nsStyleContexts and about 21700 ComputedValues.

The larger number of nsStyleContexts is expected, since stylo turns off Gecko style context sharing.  The large number of ComputedValues, and in particular the fact that it's not much smaller than the number of style contexts, indicates that we're not getting much sharing via the style sharing cache.

So as a start we should figure out what's going on with style sharing here.
This patch logs how many extant style contexts for text we have.  This matters because we do no servo-side sharing for textnode styles.  While I was here I also logged style contexts for pseudos in general.

Looks like out of 22.5k nsStyleContexts, about 7k are for nsCSSAnonBoxes::mozText.  About 2200 are for other pseudos (yikes; that's way more than I expected).  If we assume that none of those can get any sharing, we're still looking at 13k style contexts for elements of which only 800 are managing to share afaict.
I looked into those "other pseudos" style contexts.  Sorting them by number left at the end, we have:

796 ::-moz-list-number (7960 created, 7164 destroyed)
628 ::-moz-scrolled-content (4830 created, 4202 destroyed)
214 ::-moz-list-bullet (2140 created, 1926 destroyed)
142 ::-moz-cell-content (1393 created, 1251 destroyed)
112 ::after (1159 created, 1047 destroyed)
 82 ::-moz-table-column (all created, no destructions)
 44 ::-moz-table-wrapper (170 created, 126 destroyed)
 44 ::-moz-table-column-group (all created, no destructions)
 32 ::-moz-table-row-group (all created, no destructions)
 31 ::-moz-table-row (all created, no destructions)
 24 ::before (231 created, 207 destroyed)
 
and some long-tail bits where we allocated 20 or fewer of a few things (:-moz-viewport, :placeholder, :-moz-range-*, etc).
How about hanging a linked list off of Element style contexts to cache the style context for text and various pseudos? That should at least get us similar sharing characteristics for text and pseudos.

It would also be good to understand how much of the memory impact is from the duplicated style contexts per se, how much of it is for the duplicated ComputedValues, and how much of it is from duplicated style structs. I can think of different solutions for each issue.

Accurately measuring the memory footprint of style structs in Servo's copy-on-write problem is going to be an interesting problem for njn, btw.
> That should at least get us similar sharing characteristics for text and pseudos.

That wouldn't help us much with the number above.  A given element has at most one ::-moz-list-number or ::-moz-scrolled-content or ::-moz-list-bullet or ::-moz-cell-content or ::after.  So sharing them on that one element doesn't help.  You need to enable cousin sharing to be effective here.

> It would also be good to understand how much of the memory impact is from the duplicated
> style contexts per se, how much of it is for the duplicated ComputedValues, and how much
> of it is from duplicated style structs.

In my tree, stylo is using about 20MB more memory, as I said in bug 1367854.  That breaks down as:

  3.5MB - extra style contexts (number from about:memory)
    7MB - ComputedValues (21700 * 320, where 320 is the number from bug 1367635)

I would guess the remaining 10MB is style structs, but I'm not sure.  https://pastebin.mozilla.org/9022740 (from IRC yesterday) says about 700KB of just GeckoBox structs for stylo, and I _think_ was on the same testcase; for comparison about:memory claims 600KB for all structs taken together in gecko.  We have 23 struct types, so if they're all in the hundreds of KB range then 10MB is not impossible.

All these numbers are 64-bit.

Note that ComputedValues got shrunk somewhat, from 320 bytes to 216 bytes, earlier today, so that would save us about 2.2MB.  My build predates that change.

> Accurately measuring the memory footprint of style structs in Servo's copy-on-write problem

Fwiw, Gecko has a shared-struct setup too.  We just arena-allocate them, which makes counting them simple: examine the arena.

Given how much thrashing we have for these structs (see above created/destroyed numbers as we go through and do restyles), arena allocation, if possible, might be really nice to avoid malloc traffic...
Oh, and if the above pastebin is in fact the same testcase, it's showing 1692 display structs, which is a good bit less than our nearly 22k ComputedValues.  So we're getting some struct sharing sharing there, for sure.
> Oh, and if the above pastebin is in fact the same testcase

It is. All my measurements yesterday were from the content process with two tabs open: the Obama Wikipedia page and about:memory.
I tried poking at this a bit more.

One thing going on with this page is we have 1189 elements with an id.  Specifically, there are are two elements with ids for every footnote (the footnote itself, and the link to it), and almost 500 footnotes.  These ids are used for linking, not styling, but we ban all these elements from sharing styles anyway.

Each footnote has 12 elements under it, so that right there is about 6000 elements that can't do any style sharing at all.

Another thing that is going on is that any time we have a "parent" miss we clear the cache.  I don't really see how we can end up with any cousin sharing as a result...

If I do the following:

1)  Bump the cache size to 50
2)  Disable the "clear on parent failure" bit, and also disable the "stop trying on parent mismatch" bit.
3)  Manually disable the "don't share if the ids are different" bit.

then I end up with 14k ComputedValues instead of 22k.  Given that 9200 or so are for pseudos, that's a pretty decent drop.  Note that I haven't measured the performance impact of doing so many more revalidations.

If I leave the cache size at 8 but make the other two changes (note that this is not going to let us effective share across the many identical-ish 12-element subtrees in the footnotes), then we get 19k ComputedValues instead of 14k....
We should also look into sharing styles for elements with ::before/::after pseudos. There's nothing special preventing us to do that AFAICT.
This is really excellent analysis - thank you for looking into it!

(In reply to Boris Zbarsky [:bz] (work week until 5/26) (if a patch has no decent message, automatic r-) from comment #7)
> I tried poking at this a bit more.
> 
> One thing going on with this page is we have 1189 elements with an id. 
> Specifically, there are are two elements with ids for every footnote (the
> footnote itself, and the link to it), and almost 500 footnotes.  These ids
> are used for linking, not styling, but we ban all these elements from
> sharing styles anyway.

Hm, interesting. Here's an idea - how about we allow style sharing for elements _if_ the id rulehash has no entries for that id? That should be quick to check, and would likely make a difference here.

> Each footnote has 12 elements under it, so that right there is about 6000
> elements that can't do any style sharing at all.

You mean cousin sharing, right? Presumably the direct siblings can still share styles with each other.

> Another thing that is going on is that any time we have a "parent" miss we
> clear the cache.  I don't really see how we can end up with any cousin
> sharing as a result...

Because the 'parent' check will also succeed if the parents shared styles, right? Or is the problem that this only allows sharing between directly adjacent cousins? I guess that makes sense, and we should try to get wider sharing. In that case, we should probably change to tagging the cache at given DOM depth, and clearing it when the depth changes (we know this already in the PerLevelTraversalData).

> 
> If I do the following:
> 
> 1)  Bump the cache size to 50
> 2)  Disable the "clear on parent failure" bit, and also disable the "stop
> trying on parent mismatch" bit.
> 3)  Manually disable the "don't share if the ids are different" bit.
> 
> then I end up with 14k ComputedValues instead of 22k.  Given that 9200 or so
> are for pseudos, that's a pretty decent drop.  Note that I haven't measured
> the performance impact of doing so many more revalidations.
> 
> If I leave the cache size at 8 but make the other two changes (note that
> this is not going to let us effective share across the many identical-ish
> 12-element subtrees in the footnotes), then we get 19k ComputedValues
> instead of 14k....

Would be good to know the spectrum of the cache size tradeoff. What do 16 and 32 look like?

Note that it would also be interesting to measure parallel vs sequential. With parallel the style sharing is going to be less exact, which could hurt us with cousin sharing, or maybe it wouldn't be that bad, especially with work stealing locality. Worth measuring.

Finally, I'd be very interested in the impact on performance in addition to memory. The revalidation selectors mean it isn't a guaranteed win, but my intuition is that it's probably worth it to have a bigger cache and share more.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #9)
> This is really excellent analysis - thank you for looking into it!
> 
> (In reply to Boris Zbarsky [:bz] (work week until 5/26) (if a patch has no
> decent message, automatic r-) from comment #7)
> > I tried poking at this a bit more.
> > 
> > One thing going on with this page is we have 1189 elements with an id. 
> > Specifically, there are are two elements with ids for every footnote (the
> > footnote itself, and the link to it), and almost 500 footnotes.  These ids
> > are used for linking, not styling, but we ban all these elements from
> > sharing styles anyway.
> 
> Hm, interesting. Here's an idea - how about we allow style sharing for
> elements _if_ the id rulehash has no entries for that id? That should be
> quick to check, and would likely make a difference here.

Note that in particular, this preserves the property described at [1]. If we stopped preserving that property, we would need more complexity to deal with that.

[1] http://searchfox.org/mozilla-central/rev/a14524a72de6f4ff738a5e784970f0730cea03d8/servo/components/style/stylist.rs#1006
> Here's an idea - how about we allow style sharing for elements _if_ the id rulehash has no entries for that id?

That seems plausible.  I'll try it out.

> You mean cousin sharing, right? Presumably the direct siblings can still share styles with each other.

In theory, sure.  In practice, on this testcase, each footnote looks like this in the markup:

  <li id="cite_note-467">
    <span class="mw-cite-backlink">
      <b>
        <a href="#cite_ref-467">
          <span class="cite-accessibility-label">Jump up </span>^
        </a>
      </b>
    </span>
    <span class="reference-text">
      <cite class="citation web">author-and-date 
        <a rel="nofollow" class="external text" href="whatever">article-title</a>. 
        <i>source</i>.
      </cite>
      <span title="stuff" class="Z3988">
        <span style="display:none;">&nbsp;</span>
      </span>
    </span>
  </li>

and if you look at that you will see that there are only two siblings that have the same localname (the two toplevel spans inside the li) and they have different classes.  So there is no sibling sharing happening here; it all needs to be cousin sharing.

> Because the 'parent' check will also succeed if the parents shared styles, right?

The point is that if there is anything in the cache that has a non-matching parent before a thing that has a matching parent, we will completely clear the cache.  Ok, so "any cousin sharing" was a bit strong, but...

For example, say we have a parent.  It has a list of kids with two different styles for odd and even (e.g. alternating table rows).  We put those two styles in the cache, great.  Now we start processing the cells.  Say each row has one cell.  We process the first row's cell, cache it.  We process the second row's cell, parent does not match, clear the cache, cache it.  We process the third row's cell, parent does not match, clear the cache, cache it.  And so on.  Unless I'm missing something about how this works?

In practical terms, if I put back the "clear the cache on parent mismatch" thing in my testcase above where we ended up with 14K ComputedValues (cache size 50), then the count jumps to 20K.

I'm not sure what you mean by "directly adjacent cousins".  Are the cells in my example above directly adjacent?

> and clearing it when the depth changes

This would make sense to me, yes.  A lot more than clearing on first detected parent mismatch.

> Would be good to know the spectrum of the cache size tradeoff. What do 16 and 32 look like?

For this particular testcase, in the "14K" ComputedValues condition (bailing out on parent mismatch disabled, id check disabled), I end up with closer to 15.5K at cache size 16 and 14.3K at cache size 32.  This last is pretty close to what happens at size 50; that one is at 14.0K fairly reliably.

> Note that it would also be interesting to measure parallel vs sequential.

With STYLO_THREADS=1 I do get more sharing.  For example, with the parent and id bits disabled and cache size 32 I get 12.7K ComputedValues instead of 14.2K.  Again, our lower bound right now due to text and anon stuff is 9.2K, and Gecko is at 2.5K, so we're certainly in that same ballpark for elements by this point.

> Finally, I'd be very interested in the impact on performance in addition to memory.

Yes.  I need to do some measurements here.  The edit-test cycle is much longer for a build in which I can measure performance (I use --enable-rust-debug in my normal stylo build), so I haven't gotten to it yet...
> Here's an idea - how about we allow style sharing for elements _if_ the id rulehash has no entries for that id?

This gives us equivalent sharing in this case, not surprisingly.  I'll try to get some perf numbers.
OK, some performance numbers.  Steps to generate these:

1)  Load https://en.wikipedia.org/wiki/Barack_Obama
2)  Evaluate this bookmarklet:
  javascript:var%20e%20=%20document.documentElement;%20var%20s%20=%20e.style;%20s.display%20=%20"none";%20e.offsetWidth;%20var%20d%20=%20new%20Date;%20s.display%20=%20"";%20window.getComputedStyle(e,"").color;%20var%20d2%20=%20new%20Date;%20alert(d2%20-%20d);

This basically measures how long a top-down restyle and frame construction takes.  For comparison, on my machine Gecko is at about 88ms.  I tested this in the following scenarios:

* Vanilla stylo:                     81, 78, 79, 85, 76, 78
* With id style sharing allowed:     83, 78, 76, 80, 79, 85, 83
* With stop/clear-on-parent removed: 77, 73, 74, 77, 73, 76, 74
* With the cache size set to 32:     65, 71, 74, 80, 72, 71, 69

where each row includes all the changes from the previous ones.  That is, that last row has style sharing allowed for ids that have no rules for them, has the break and "should_clear_cache = true" bit on parent miss removed, and has the cache size set to 32.

Given that, for now I've created https://github.com/servo/servo/pull/17055 to do the id thing.  Still thinking about the right way to do the parent thing.  And we should probably bump the cache size; what I'm not sure about is what, if anything, this means for our work unit size.
(In reply to Boris Zbarsky [:bz] (work week until 5/26) (if a patch has no decent message, automatic r-) from comment #11)
> I'm not sure what you mean by "directly adjacent cousins".  Are the cells in
> my example above directly adjacent?

I mean that the parents are adjacent siblings. My point is that styles would be shared without the even-odd business. Anyway, I think we understand each other here.

> > and clearing it when the depth changes
> 
> This would make sense to me, yes.  A lot more than clearing on first
> detected parent mismatch.

Yep, agreed. FWIW the current code predated cousin sharing, so it made sense back then.

> 
> > Would be good to know the spectrum of the cache size tradeoff. What do 16 and 32 look like?
> 
> For this particular testcase, in the "14K" ComputedValues condition (bailing
> out on parent mismatch disabled, id check disabled), I end up with closer to
> 15.5K at cache size 16 and 14.3K at cache size 32.  This last is pretty
> close to what happens at size 50; that one is at 14.0K fairly reliably.

Ok, that (combined with your subsequent performance measurements) seems like a reasonable argument for a cache size of 32. Might be worth measuring on a few other pages.

> 
> > Note that it would also be interesting to measure parallel vs sequential.
> 
> With STYLO_THREADS=1 I do get more sharing.  For example, with the parent
> and id bits disabled and cache size 32 I get 12.7K ComputedValues instead of
> 14.2K.

Ok, that's probably a pretty fundamental tradeoff then. The more we restrict things to one core, the more we have the possibility of reusing work from the same core.

> Again, our lower bound right now due to text and anon stuff is 9.2K,
> and Gecko is at 2.5K, so we're certainly in that same ballpark for elements
> by this point.

Ok. We _are_ fixing the text and anon stuff too, right?


(In reply to Boris Zbarsky [:bz] (work week until 5/26) (if a patch has no decent message, automatic r-) from comment #12)
> > Here's an idea - how about we allow style sharing for elements _if_ the id rulehash has no entries for that id?
> 
> This gives us equivalent sharing in this case, not surprisingly.

Define "equivalent sharing"?


(In reply to Boris Zbarsky [:bz] (work week until 5/26) (if a patch has no decent message, automatic r-) from comment #13)
> Given that, for now I've created https://github.com/servo/servo/pull/17055
> to do the id thing.  Still thinking about the right way to do the parent
> thing.

Just cache the DOM depth alongside the LRUCache, and clear the cache if the DOM depth ever changes, no?

> And we should probably bump the cache size; what I'm not sure about
> is what, if anything, this means for our work unit size.

I'm guessing that increasing the work unit size to increase style sharing is probably false economy here (though I could be wrong). With work stealing, we should get a fair amount of locality even across work units, so long as the other threads are not idle (in which case we probably do want to sacrifice sharing to boost parallelism). I'm guessing the the issue with wikipedia is that there's an extremely wide section of the DOM which gets split across the threads, and then there are lots of isomorphic things between those sections which can be shared on sequential but not on parallel. I'm not sure I see any way around that fundamental tradeoff.
> Might be worth measuring on a few other pages.

Yep.  I need to do that, but not until Tuesday.

> Ok, that's probably a pretty fundamental tradeoff then.

Yes, agreed.

> Ok. We _are_ fixing the text and anon stuff too, right?

I'd hope so.  I just filed bug 1368290 and bug 1368291 on those bits.

> Define "equivalent sharing"?

"The same amount of sharing as just sharing no matter what's going on with ids".  As in, it does what we want, on this testcase.  The hard part is making it sound; I'm still working on that.

> Just cache the DOM depth alongside the LRUCache, and clear the cache if the DOM depth ever changes, no?

OK.  I just hadn't gotten far enough into figuring out how the pieces fit together to say something that specific.  ;)

> I'm guessing that increasing the work unit size to increase style sharing is probably false economy here 

I suspect you're right.
Depends on: 1369584
Depends on: 1369620
Depends on: 1369621
Attachment #8871381 - Attachment is obsolete: true
Attachment #8871402 - Attachment is obsolete: true
Boris, is there anything else we need to track in this bug, or should we resolve it?
Flags: needinfo?(bzbarsky)
Priority: -- → P1
We should fix bug 1368290 and bug 1368291 and remeasure.
Depends on: 1368291, 1368290
Flags: needinfo?(bzbarsky)
Blocks: 1373362
Keywords: meta
Priority: P1 → P5
Assignee: nobody → bzbarsky
Status: NEW → ASSIGNED
bz, can you make sure your tree has both bug 1387116 and bug 1368290 applied and remeasure this?
Flags: needinfo?(bzbarsky)
Took the measurements requested in comment 20. I get output like [1] on obama-noscript. bz, does that look to you like sharing is working?

[1]
OTHER PSEUDO COUNT DEC: 1532
COUNT INC: 4461
COUNT DEC: 4460
COUNT INC: 4461
OTHER PSEUDO COUNT INC: 1533
COUNT INC: 4462
OTHER PSEUDO COUNT INC: 1534
COUNT INC: 4463
COUNT INC: 4464
OTHER PSEUDO COUNT INC: 1535
COUNT INC: 4465
COUNT INC: 4466
STYLO DEC: 4423
COUNT DEC: 4465
COUNT INC: 4466
COUNT INC: 4467
COUNT INC: 4468
TEXT COUNT INC: 923
COUNT DEC: 4467
TEXT COUNT DEC: 922
COUNT INC: 4468
TEXT COUNT INC: 923
COUNT DEC: 4467
TEXT COUNT DEC: 922
OK, some numbers with a build that should have all the bits applied, on a 4-core machine:

obama-noscript:
  Stylo: 4938 elements, 1760 non-text pseudos, 2552 text
  Stylo-sequential: 2021 elements, 1535 non-text pseudos, 923 text
  Gecko: 1335 elements, 349 non-text pseudos, 951 text

Obama live (on Wikipedia as of today):
  Stylo: 5020 elements, 1876 non-text pseudos, 2398 text
  Stylo-sequential: 2088 elements, 1660 non-text pseudos, 901 text
  Gecko: 1389 elements, 256 non-text pseudos, 890 text

Looks to me like sharing is working ok (about as we expected) for elements and text.  The non-text pseudos is a lot higher with stylo; I expect that's in large part due to the :-moz-list-bullet pseudo-elements on the page.  Those numbers are _without_ the patch in bug 1368291.

With that patch applied, I get numbers like so:

obama-noscript:
  Stylo: 4491 elements, 1077 non-text pseudos, 2316 text
  Stylo-sequential: 1967 elements, 841 non-text pseudos, 922 text

Obama live:
  Stylo: 5127 elements, 1142 non-text pseudos, 2412 text
  Stylo-sequential: 2089 elements, 909 non-text pseudos, 900 text

so the patch in bug 1368291 helps a good bit with the non-text pseudos.  Still nowhere close to the Gecko numbers, but that's not unexpected given the worse element sharing; I was surprised the text numbers are as good as they are.
Flags: needinfo?(bzbarsky)
Ok, sounds like we can resolve this bug.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: