Closed Bug 1358806 Opened 7 years ago Closed 7 years ago

stylo: Stylo triggers a lot more frame construction and reflow than Gecko does

Categories

(Core :: CSS Parsing and Computation, defect, P1)

defect

Tracking

()

RESOLVED FIXED

People

(Reporter: bholley, Assigned: bzbarsky)

References

(Blocks 1 open bug)

Details

Attachments

(4 obsolete files)

This seems to be the main driver behind the slow Talos runs in bug 1331578. We're presumably just doing something dump here.

Gecko profile: https://perfht.ml/2oA9YZ4
Stylo profile: https://perfht.ml/2ozMinU

STR:
(1) Grab the myspace-orig testcase in https://www.dropbox.com/s/h51fspacftf1lcq/myspace.tar.bz2?dl=0 , and find albumart.html
(2) Load the page in a Gecko Nightly build with the Gecko profiler.
(3) Snip the timeline to the section where the content process is doing work.
(4) Load a stylo build with STYLO_THREADS=1 DISABLE_STYLE_SHARING_CACHE=1 , and do the same.

On my machine, the Gecko window is 570ms, and the Stylo window is 650ms. There are a few small things to optimize:
(1) Stylo spends 17ms rebuilding the styleset (FlushStyleSheet), while Gecko (IIUC) spends 4ms doing the same (RefreshRuleCascade). I'll dig into this later.
(2) Stylo spends 40ms Parsing stylesheets, Gecko spends 29. This is bug 1331843.

But the big difference is that Stylo somehow spends a lot more time in Frame construction and Reflow:
(1) Stylo spends 91ms under ProcessReflowCommands, Gecko spends 57ms.
(2) Stylo spends 66ms in nssCSSFrameConstructor. Gecko spends 47ms, but 14 of that is spent resolving styles (which happens elsewhere in Stylo), so the real cost is 33ms.

So the frame constructor and reflow stuff accounts for ((91 - 57) + (66 - 33)) = 67ms of the difference, which is the lion's share of the delta here.
Boris, do you have any cycles to dig into this this week? I still have a few days of work on my plate to finish optimizing the style system proper, and I'm certainly less familiar with the frame constructor and reflow code than you are.

It'd be great to start telling people about our performance numbers soon, but it's hard to do that when Talos looks bad.
Flags: needinfo?(bzbarsky)
One possible point for investigation would be the lazy frame construction stuff, since in bug 1338921 we switched that to use change hints, rather than the CreateNeededFrames stuff that Gecko does.
Also the extra reframes we do for pseudos right now, I guess. Would be cool to profile with https://github.com/servo/servo/pull/16207 and see if there's any difference.
(In reply to Emilio Cobos Álvarez [:emilio] from comment #3)
> Also the extra reframes we do for pseudos right now, I guess. Would be cool
> to profile with https://github.com/servo/servo/pull/16207 and see if there's
> any difference.

That only affects restyle, right? All these profiles are with initial styling only.
I can take a look into this, yes.
Thanks Boris!
Assignee: nobody → bzbarsky
OK, so what's going on here is that ServoStyleSet::ProbePseudoElementStyle is broken in the lazy pseudo-element case.  It calls Servo_ResolvePseudoStyle with isProbe = true, which calls get_pseudo_style() and returns null if isProbe and None comes back.

But get_pseudo_style() calls lazily_compute_pseudo_element_style() in this case, which only returns None if         "self.pseudos_map.get(pseudo).is_none()" tests true.

So as long as there is _any_ rule for the relevant pseudo-element we will get back a style from ProbePseudoElementStyle.

This particular page has this rule in its stylesheets:

  .imTray .moreOptions div.status:first-letter ,.onlineTray .moreOptions div.status:first-letter {text-transform:uppercase}

This matches exactly one element.  But in stylo we end up creating first-letter frames for every single block on the page.  And then of course reflowing them, etc.  If I remove that one rule, then relayout (exclusive of restyling) of the page matches Gecko; with the rule present relayout is 2-2.5x slower than gecko.

If I leave the rule but fix lazily_compute_pseudo_element_style to return None when there are no declarations, relayout ends up about the same as Gecko as far as I can tell.

I guess this would get automagically fixed for this testcase once we moved first-letter away from being lazy (bug 1324618 or whatever is tracking it).  But worth fixing anyway, in case someone probes for some other lazy pseudo for some reason.  I put up https://github.com/servo/servo/pull/16637
Flags: needinfo?(bzbarsky)
Nice find bz! Looking forward to seeing what this does to our Talos numbers. :-)
OK, landed on m-c: https://hg.mozilla.org/mozilla-central/rev/e82ca7f0565b

How do I remeasure the Talos bits?
Flags: needinfo?(bobbyholley)
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-central,80984697abf1f1ff2b058e2d9f0b351fd9d12ad9,1,1%5D&series=%5Bmozilla-central,ae68c64ef8bfa104fded89971f1c2c6c90926dca,1,1%5D

Looks like it didn't fix it, or at least not entirely. Stylo is still 50ms slower on the myspace testcase.
Flags: needinfo?(bobbyholley)
Just to make sure, that 50ms is not comparable to the times from the profiler in comment 0, right?  What was that 50ms number before?

Anyway, I just modified my measurement bookmarklet like so: javascript:var%20r%20%20=%20document.documentElement;%20r.style.display%20=%20"none";%20r.offsetWidth;%20r.style.display%20=%20"";var%20start%20=%20performance.now();%20%20getComputedStyle(r).color;%20var%20mid%20=%20performance.now();%20r.offsetWidth;%20alert((mid-start)%20+%20"%20"%20+%20(performance.now()%20-%20mid))

This kills off the frame tree via display:none, then sets display to "" again, flushes style+frames, flushes layout.  It alerts two numbers the time the style and layout flushes took, in that order.

What that shows is that while layout takes about the same amount of time now the "style+frame flush" bit takes over 2x longer with stylo compared to Gecko.  And that no longer seems to depend on the :first-letter style, so there's something else going on here.  Will dig into it.
OK, so I just redid the actual pageload measurement on tip (well, on yesterday's tip).  I get numbers like so under ProcessReflowCommands:

Gecko:  216ms, 34ms, 24ms, 32ms, 34ms
Stylo: 82ms, 41ms, 34ms, 32ms, 34ms

I expect that first number is bigger because images aren't in the imagelib cache yet or something.  Doesn't obviously look to me like stylo is slower than Gecko here...

Looking at callstacks that contain nsCSSFrameConstructor in them, I see:

Gecko: 106ms, 29ms, 40ms, 44ms, 140ms
Stylo: 66ms, 43ms, 38ms, 43ms, 52ms

Again, not obvious to me that stylo is slower (and _boy_ is this noisy data).  For both stylo and gecko, the vast majority of the time (95% or more) is under the nsCSSFrameConstructor::ContentRangeInserted call that PresShell::Initialize makes.  This obviously includes the ServoStyleSet::StyleDocument() call that nsCSSFrameConstructor::ConstructDocElementFrame makes, so this is a pretty fair apples to apples comparison against Gecko's frame construction, since both include all the styling work.

I'm not quite sure why the data is so noisy.  Comparing the Gecko 140ms and 44ms times, the 140ms time shows 17ms in RuleHash::EnumerateAllRules and 11ms in AncestorFilter::MightHaveMatchingAncestor while the 44ms time shows 4ms and 4ms respectively...

I tried also just timing the full pageload (from the point when we start parsing the HTML to onload firing).  Those number are, in ms:

Gecko: 263, 182, 240, 229, 197
Stylo: 260, 254, 236, 309, 209

which is definitely looking a bit slower than Gecko, but again _really_ noisy for both.

So I tried comparing two of those profiles head to head for the whole pageload: https://perf-html.io/public/044181aaa877d66032e934aa57fa00a7553b2622/calltree/?invertCallstack&thread=3 for Gecko and https://perf-html.io/public/1a62b6632b2331ebc4ad3d0dcbfaba4e8118ee01/calltree/?invertCallstack&thread=3 for stylo.  Some observations, from an inverted profile:

1)  The times in _ARGBSetRow_X86 are just about the same, which is a useful calibration.
2)  stylo spends 1.5x as long (an extra 7ms) under _platform_bzero as called from _arena_dalloc
    and _chunk_alloc.  So just more memory traffic?
3)  Both spend about the same amount of time under _je_malloc_usable_size.
4)  stylo has an extra 7ms of memmove compared to Gecko.  Some of this is from
    vec reallocs (a bit over 1ms), some is from
    style::properties::declaration_block::parse_property_declaration_list (2ms), some
    from SmallVec's into_vec() as called from
    selectors::parser::parse_complex_selector_and_pseudo_element (1.2ms), some
    from various other longer-tail stuff.
5)  Stylo has 15ms in style::stylesheets::effective_rules.  But some of this is actually
    teardown of the previous page (!).  It looks like any sheet addition/removal immediately calls
    Stylist::update.  We spend a total of 30ms under Stylist::update.  I think this is the moral
    equivalent of Gecko's nsCSSRuleProcessor::RefreshRuleCascade which takes about 3ms in the
    above-linked profile.  I filed bug 1361843 on this.

I'll poke more at the profile later, I guess, unless someone gets to it before me.
(In reply to Boris Zbarsky [:bz] (still a bit busy) (if a patch has no decent message, automatic r-) from comment #12)
> OK, so I just redid the actual pageload measurement on tip (well, on
> yesterday's tip).  I get numbers like so under ProcessReflowCommands:
> 
> Gecko:  216ms, 34ms, 24ms, 32ms, 34ms
> Stylo: 82ms, 41ms, 34ms, 32ms, 34ms
> 
> I expect that first number is bigger because images aren't in the imagelib
> cache yet or something.  Doesn't obviously look to me like stylo is slower
> than Gecko here...
> 
> Looking at callstacks that contain nsCSSFrameConstructor in them, I see:
> 
> Gecko: 106ms, 29ms, 40ms, 44ms, 140ms
> Stylo: 66ms, 43ms, 38ms, 43ms, 52ms
> 
> Again, not obvious to me that stylo is slower (and _boy_ is this noisy
> data).  For both stylo and gecko, the vast majority of the time (95% or
> more) is under the nsCSSFrameConstructor::ContentRangeInserted call that
> PresShell::Initialize makes.  This obviously includes the
> ServoStyleSet::StyleDocument() call that
> nsCSSFrameConstructor::ConstructDocElementFrame makes, so this is a pretty
> fair apples to apples comparison against Gecko's frame construction, since
> both include all the styling work.
> 
> I'm not quite sure why the data is so noisy.  Comparing the Gecko 140ms and
> 44ms times, the 140ms time shows 17ms in RuleHash::EnumerateAllRules and
> 11ms in AncestorFilter::MightHaveMatchingAncestor while the 44ms time shows
> 4ms and 4ms respectively...
> 
> I tried also just timing the full pageload (from the point when we start
> parsing the HTML to onload firing).  Those number are, in ms:
> 
> Gecko: 263, 182, 240, 229, 197
> Stylo: 260, 254, 236, 309, 209
> 
> which is definitely looking a bit slower than Gecko, but again _really_
> noisy for both.
> 
> So I tried comparing two of those profiles head to head for the whole
> pageload:
> https://perf-html.io/public/044181aaa877d66032e934aa57fa00a7553b2622/
> calltree/?invertCallstack&thread=3 for Gecko and
> https://perf-html.io/public/1a62b6632b2331ebc4ad3d0dcbfaba4e8118ee01/
> calltree/?invertCallstack&thread=3 for stylo.  Some observations, from an
> inverted profile:
> 
> 1)  The times in _ARGBSetRow_X86 are just about the same, which is a useful
> calibration.
> 2)  stylo spends 1.5x as long (an extra 7ms) under _platform_bzero as called
> from _arena_dalloc
>     and _chunk_alloc.  So just more memory traffic?

Bug 1360772 is about reducing the unnecessary bzero-ing, and bug 1347852 tracks arena allocation.

> 3)  Both spend about the same amount of time under _je_malloc_usable_size.
> 4)  stylo has an extra 7ms of memmove compared to Gecko.  Some of this is
> from
>     vec reallocs (a bit over 1ms), some is from
>     style::properties::declaration_block::parse_property_declaration_list
> (2ms), some
>     from SmallVec's into_vec() as called from
>     selectors::parser::parse_complex_selector_and_pseudo_element (1.2ms),
> some
>     from various other longer-tail stuff.
> 5)  Stylo has 15ms in style::stylesheets::effective_rules.  But some of this
> is actually
>     teardown of the previous page (!).  It looks like any sheet
> addition/removal immediately calls
>     Stylist::update.  We spend a total of 30ms under Stylist::update.  I
> think this is the moral
>     equivalent of Gecko's nsCSSRuleProcessor::RefreshRuleCascade which takes
> about 3ms in the
>     above-linked profile.  I filed bug 1361843 on this.
> 
> I'll poke more at the profile later, I guess, unless someone gets to it
> before me.
I just reprofiled this testcase, and realized it's actually triggering restyles, which I didn't realize before. We're spending almost all of that restyle time computing restyle hints, which is presumably due to the lack of rule hashing. Bug 1360088 should fix this.
So I suspect that as originally filed this is fixed by the commit measured in comment 9, fwiw.
(In reply to Boris Zbarsky [:bz] (still a bit busy) (if a patch has no decent message, automatic r-) from comment #15)
> So I suspect that as originally filed this is fixed by the commit measured
> in comment 9, fwiw.

Agreed.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Attachment #8866396 - Attachment is obsolete: true
Attachment #8866396 - Flags: review?(emilio+bugs)
Attachment #8866397 - Attachment is obsolete: true
Attachment #8866397 - Flags: review?(emilio+bugs)
Attachment #8866398 - Attachment is obsolete: true
Attachment #8866398 - Flags: review?(emilio+bugs)
Attachment #8866399 - Attachment is obsolete: true
Attachment #8866399 - Flags: review?(emilio+bugs)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: