Closed Bug 1418161 Opened 7 years ago Closed 6 years ago

stylo: Huge perf hit on painting and MotionMark benchmarks w/ Stylo and uBlock Origin

Categories

(Core :: CSS Parsing and Computation, defect, P3)

57 Branch
defect

Tracking

()

RESOLVED FIXED
mozilla60
Performance Impact high
Tracking Status
firefox-esr52 --- unaffected
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- fixed

People

(Reporter: dvander, Assigned: bholley)

References

Details

(Keywords: perf)

Attachments

(4 files, 1 obsolete file)

Attached file testcase
I was trying to come up with a painting benchmark and hit some kind of corner case in Stylo. The attached test case gets 50fps with Stylo disabled and 18fps with Stylo enabled. Stylo appears to be spending about 25ms per frame restyling.

Chrome, Edge, Fx56 have no issues.

(I uh, recommend not looking directly at the test case for too long, the patterns are kind of nauseating.)

Profile, look at Content process #3: https://perfht.ml/2jyWwrJ
Priority: -- → P1
Summary: Huge perf hit on test case w/ Stylo → stylo: Huge perf hit on painting benchmark test case w/ Stylo
Version: 40 Branch → 57 Branch
This happens to spend a lot of time on style invalidation, which is so weird because all that testcase is doing is tweaking the style attribute.

David, I've seen similar stuff when people have uBlock installed, which happens to add tons of rules with [style] selectors, which happen to make us take the slow path. Any chance you have any addon suspicious of inserting stylesheets?

This is my profile for example, which looks quite different: https://perfht.ml/2A66buc
Flags: needinfo?(dvander)
This is the relevant #layout log when investigating https://github.com/servo/webrender/issues/2045:

  https://mozilla.logbot.info/layout/20171116#c13860644

If style is still a more significant part of this than with Stylo disabled or than other browsers I can still look into it.
At least in my microbenchmark, Emilio points out that Gecko ends up caching the serialization of the style attr at https://searchfox.org/mozilla-central/rev/33c90c196bc405e628bc868a4f4ba29b992478c0/dom/base/nsAttrValue.cpp#657 but we don't do that in stylo because this cache is not threadsafe.

If the original situation also involves lots of [style] selectors, that may be all that's going on here....
Yup, I do have uBlock installed. This bug should still be fixed though - especially if per comment #1 it's a recurring problem.
Flags: needinfo?(dvander)
The old style system caches the serialized [style] selectors. Can we cache in Stylo?

Can we just fix the uBlock extension?
Priority: P1 → P3
Summary: stylo: Huge perf hit on painting benchmark test case w/ Stylo → stylo: Huge perf hit on painting and MotionMark benchmarks w/ Stylo and uBlock Origin
Whiteboard: [qf]
ni? uBlock developer for comment #6
Flags: needinfo?(rhill)
What uBO does, all top blockers[1] using EasyList do it: cosmetic filtering, which is really CSS rules injection.

The issue here affects a core feature of all blockers. "Fixing" uBO would mean to remove a core feature, and this won't solve the issue for other blockers -- which in aggregate affects even more users. The profiling in comment #0 barely shows any javascript code from uBO (1ms for whole profile), meaning there is nothing to fix code-wise on uBO's side, aside removing CSS rules injection, aka removing a core feature -- which would not need to be removed for same version of uBO on other browsers.

Other blockers inject *all* generic cosmetic filters from EasyList. uBO injects only the *highly* generic ones -- so theoretically it is even less affected by the issue here than the other blockers.


[1] Adblock Plus, AdBlock, Adguard, Adblocker Ultimate (rip off of Adguard).
Flags: needinfo?(rhill)
Last time I checked with :mstange (who had the same problem with bug 1423369), he couldn't repro after disabling and re-enabling uBlock, which looks weird... I can try to investigate it a bit today.
Flags: needinfo?(emilio)
At the Austin All Hands, I think someone proposed caching the original CSS rule strings instead of the serialized rules (because, IIUC, some of the original rule strings had extra spaces that were not round-tripped through serialization/deserialization and thus those rules were not found in the cache of serialized rules).
Bobby, 
The QF team triage this and consider this an important bug to fix on Stylo that would impact users especially with the increasing use of add blocking. Would you be able to address this in the F60 timeframe?
Flags: needinfo?(bobbyholley)
Whiteboard: [qf] → [qf:f60][qf:p1]
(In reply to Jean Gong :jgong from comment #11)
> Bobby, 
> The QF team triage this and consider this an important bug to fix on Stylo
> that would impact users especially with the increasing use of add blocking.
> Would you be able to address this in the F60 timeframe?

Just curious, do we have any evidence of this actually impacting something else other than benchmarks? The attribute serialization will usually already be cached.
Flags: needinfo?(jgong)
One of the ideas thrown around when discussing this bug was just invalidating the style for the element if the stylesheets had [style] attribute selectors.

That works and should be easy to fix, though doesn't handle descendant combinators of course...
I do think we should prioritize performance with uBlock/ABP installed, since those are increasingly common among our target demographic. I'm also not totally convinced this is QF:P1, since it primarily manifests in synthetic benchmarks.

I think the first step is to decide on how we want to fix this, which is something generally worth doing. We discussed this in Austin, but my memory of that conversation is fuzzy, so hopefully others can help refresh it.

To summarize the issue here: the canonical storage of style attributes is in parsed form, because that's how it's normally used. We serialize lazily when needed, and cache the result - unless we're off-main-thread, which is the case when styling in the presence of attribute selectors on the |style| attribute. uBlock and other AdBlockers inject such selectors, largely because the style attribute is often the only uniquely-identifying feature of ad-related content.

One thing we discussed was to introduce a special internal selector syntax that would allow adblockers to match properly against style attributes. But that would be a lot of complexity and coordination that I'd rather avoid.

Emilio, was there any reason we can't just atomically CAS the string bits in during the parallel traversal, which would give us roughly the same performance characteristics of the old style system? I vaguely recall discuss this but can't remember a reason why it wouldn't work.
Flags: needinfo?(bobbyholley)
Or Xidorn?
Flags: needinfo?(xidorn+moz)
Err, sorry... The issue was that we couldn't CAS the string and the declaration block atomically because they're two different fields. But it's not clear to me now why we couldn't swap out the whole MiscContainer...
(In reply to Emilio Cobos Álvarez [:emilio] from comment #16)
> Err, sorry... The issue was that we couldn't CAS the string and the
> declaration block atomically because they're two different fields. But it's
> not clear to me now why we couldn't swap out the whole MiscContainer...

Why would we even need to do that? It seems like nsAttrValue::SetMiscAtomOrString asserts the correct mType and then only modifies mStringBits.

I'll write up a patch.
Assignee: nobody → bobbyholley
MozReview-Commit-ID: 1izMRY2bmzE
Attachment #8946728 - Flags: review?(emilio)
Flags: needinfo?(xidorn+moz)
Flags: needinfo?(jgong)
Flags: needinfo?(emilio)
This patch gives us parity with gecko on the microbenchmark. Atomics and stuff, so please review carefully. :-)

https://treeherder.mozilla.org/#/jobs?repo=try&revision=40e5f57f372040788a44c00c3f69fc5f16ba7eb6
Comment on attachment 8946728 [details] [diff] [review]
Cache style attribute serialization during the parallel traversal. v1

Review of attachment 8946728 [details] [diff] [review]:
-----------------------------------------------------------------

r- mainly for the Atomize thing. Probably smaug should stamp the change to NS_Atomize, I don't have a good sense of how hot is this function.

::: dom/base/nsAttrValue.cpp
@@ +633,5 @@
>        // don't cache the string. The TLS overhead should't hurt us here, since
>        // main thread consumers will subsequently use the cache, and
>        // off-main-thread consumers only reach this in the rare case of selector
>        // matching on the "style" attribute.
> +      const_cast<nsAttrValue*>(this)->SetMiscAtomOrString(&aResult);

Let's point out that as of this writer this is the only caller that requires the machinery?

@@ +1780,5 @@
>      //   string corresponds to a particular enumerated value, especially
>      //   for enumerated values that are not limited enumerated.
>      // Add other types as needed.
>      NS_ASSERTION(len || Type() == eCSSDeclaration || Type() == eEnum,
>                   "Empty string?");

The assertions at the top of this function are racy now I believe, in particular the "Trying to re-set atom or string", so that probably needs to change, doesn't it?

@@ +1787,2 @@
>      if (len <= NS_ATTRVALUE_MAX_STRINGLENGTH_ATOM) {
> +      nsAtom* atom = NS_AtomizeMainThread(*aValue).take();

There's no way AtomizeMainThread can be called off-main-thread.

@@ +1805,5 @@
> +      // In the common case we're not in the servo traversal, and we can just
> +      // set the bits normally. The parallel case requires more care.
> +      if (MOZ_LIKELY(!ServoStyleSet::IsInServoTraversal())) {
> +        cont->SetStringBitsMainThread(bits);
> +      } else if (!cont->mStringBits.compareExchange(0, bits)) {

Is the branch + store really cheaper than the compareExchange? I guess it can be, but worth pointing it out / measuring, because it's not trivial. Alternatively, worth pointing out the lack of measurements just in case somebody wants to measure.
Attachment #8946728 - Flags: review?(emilio) → review-
(In reply to Emilio Cobos Álvarez [:emilio] from comment #20)
> Comment on attachment 8946728 [details] [diff] [review]
> Cache style attribute serialization during the parallel traversal. v1
> 
> Review of attachment 8946728 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> r- mainly for the Atomize thing. Probably smaug should stamp the change to
> NS_Atomize, I don't have a good sense of how hot is this function.

Hah, good catch (assertions would have caught it, but I only did an opt build - also had to fix some include screwiness).

I think it's better to just conditionally call NS_Atomize on the servo path (see my comments conditional branches below).

> 
> ::: dom/base/nsAttrValue.cpp
> @@ +633,5 @@
> >        // don't cache the string. The TLS overhead should't hurt us here, since
> >        // main thread consumers will subsequently use the cache, and
> >        // off-main-thread consumers only reach this in the rare case of selector
> >        // matching on the "style" attribute.
> > +      const_cast<nsAttrValue*>(this)->SetMiscAtomOrString(&aResult);
> 
> Let's point out that as of this writer this is the only caller that requires
> the machinery?

Done (comment was also out of date).

> 
> @@ +1780,5 @@
> >      //   string corresponds to a particular enumerated value, especially
> >      //   for enumerated values that are not limited enumerated.
> >      // Add other types as needed.
> >      NS_ASSERTION(len || Type() == eCSSDeclaration || Type() == eEnum,
> >                   "Empty string?");
> 
> The assertions at the top of this function are racy now I believe, in
> particular the "Trying to re-set atom or string", so that probably needs to
> change, doesn't it?

Good point.

> 
> @@ +1787,2 @@
> >      if (len <= NS_ATTRVALUE_MAX_STRINGLENGTH_ATOM) {
> > +      nsAtom* atom = NS_AtomizeMainThread(*aValue).take();
> 
> There's no way AtomizeMainThread can be called off-main-thread.

Yeah, I'll fix it with a branch, which the compiler will hopefully roll into the other one.

> 
> @@ +1805,5 @@
> > +      // In the common case we're not in the servo traversal, and we can just
> > +      // set the bits normally. The parallel case requires more care.
> > +      if (MOZ_LIKELY(!ServoStyleSet::IsInServoTraversal())) {
> > +        cont->SetStringBitsMainThread(bits);
> > +      } else if (!cont->mStringBits.compareExchange(0, bits)) {
> 
> Is the branch + store really cheaper than the compareExchange? I guess it
> can be, but worth pointing it out / measuring, because it's not trivial.
> Alternatively, worth pointing out the lack of measurements just in case
> somebody wants to measure.

I am quite sure that predictable branches are way cheaper than RMU operations like CAS (see the benchmarks at [1]). A predictable branch is basically free.

[1] https://github.com/bholley/simple_benchmarks
This will allow us to invoke it from nsAttrValueInlines.h, which can't
include ServoStyleSet.h due to circular dependencies.

MozReview-Commit-ID: BgC7ExyWRn7
Attachment #8946792 - Flags: review?(emilio)
Attachment #8946728 - Attachment is obsolete: true
MozReview-Commit-ID: 1izMRY2bmzE
Attachment #8946793 - Flags: review?(emilio)
Attachment #8946792 - Flags: review?(emilio) → review+
Comment on attachment 8946793 [details] [diff] [review]
Part 2 - Cache style attribute serialization during the parallel traversal. v2

Review of attachment 8946793 [details] [diff] [review]:
-----------------------------------------------------------------

r=me. We should probably sprinkle some NS_IsMainThread assertions around nsAttrValue, specially on Reset, mind doing that one at least (and other at your will)?
Attachment #8946793 - Flags: review?(emilio) → review+
(In reply to Emilio Cobos Álvarez [:emilio] from comment #25)
> r=me. We should probably sprinkle some NS_IsMainThread assertions around
> nsAttrValue, specially on Reset, mind doing that one at least (and other at
> your will)?

Per IRC discussion this isn't needed because SetStringBitsMainThread() already asserts.

Thanks for the reviews!
Pushed by bholley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5dec0813e8a3
Hoist IsInServoTraversal into ServoUtils. r=emilio
https://hg.mozilla.org/integration/autoland/rev/a4cb461ced3b
Cache style attribute serialization during the parallel traversal. r=emilio
Pushed by bholley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7fb6f3ce4dea
Followup fix for fuzz build bustage. r=me
Depends on: 1434681
I'm assuming this can ride the trains to 60.
Performance Impact: --- → P1
Whiteboard: [qf:f60][qf:p1]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: