4.74 - 10.73% dromaeo_css (linux64, osx-10-10, windows10-64, windows7-32) regression on push ab10dee6754602dc43038618032c2a6b06396661 (Sat Nov 4 2017)

RESOLVED INCOMPLETE

Status

()

defect
P3
normal
RESOLVED INCOMPLETE
2 years ago
Last year

People

(Reporter: igoldan, Unassigned)

Tracking

({perf, regression, talos-regression})

unspecified
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox57 unaffected, firefox58 disabled, firefox59 disabled, firefox60 affected)

Details

Attachments

(3 attachments)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=be1d69e97c5a7d389348ae92ab7361109f23cd8c&tochange=ab10dee6754602dc43038618032c2a6b06396661

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 11%  dromaeo_css summary osx-10-10 opt e10s     9,493.44 -> 8,474.75
  9%  dromaeo_css summary linux64 opt e10s       8,628.04 -> 7,869.46
  9%  dromaeo_css summary linux64 pgo e10s       9,256.02 -> 8,466.51
  8%  dromaeo_css summary windows7-32 pgo e10s   8,370.83 -> 7,679.77
  8%  dromaeo_css summary windows10-64 pgo e10s  8,508.53 -> 7,830.43
  6%  dromaeo_css summary windows10-64 opt e10s  7,736.21 -> 7,309.80
  5%  dromaeo_css summary windows7-32 opt e10s   7,385.05 -> 7,034.82


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=10351

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Component: Untriaged → CSS Parsing and Computation
Product: Firefox → Core
:emilio Please look over these performance regressions. Can we fix them or should we accept them?
Flags: needinfo?(emilio)
Yeah, so this brings regressions in Dromaeo, but huge wins in other benchmarks, like bug 988740... I talked with Boris and we'll look at this.
In the following benchmark, I get consistently:

 * Stylo: ~900ms, ~200ms
 * Gecko: ~12000ms, ~120ms
Can we conclude this bug?
Unfortunately this still needs investigation, it's on my queue of things to look at, and I think Boris wanted to take a look too.
I plan to look for sure.  I took a few days to get an opt build working due to bug 1415686...
Priority: -- → P3
So some numbers, for a testcase like emilio's but doing "querySelector('nosuchtag:only-child')" 100 times and with all the querySelector fast-paths disabled.

If I make matches_complex_selector_internal always return NotMatchedGlobally (or rather, copy all the matching methods with _querySelector appended to their names, and make the _querySelector version do that), then I get:

  186 192 206 203 225 179

If I just replace the part of matches_complex_selector_internal_querySelector that does selector.iter.all(matches_simple_selector_querySelector) with literal false:

  273 252 265 320 282 305

If I leave all that in place but just make matches_simple_selector_querySelector return false, I get:

  255 262 267 281 286 295

If I make matches_simple_selector_querySelector look like this:

    match *selector {
        Component::LocalName(LocalName { ref name, ref lower_name }) => {
            let is_html = element.is_html_element_in_html_document();
            element.get_local_name() == select_name(is_html, name, lower_name).borrow()
        },
        _ => false
        
    }

I get:

 316 325 325 319 385 393

so there's overhead at all the steps here.  :(  For comparison, the Gecko numbers on the same testcase are in the ~200 range.
Should we probably hold back on qS and qSA with stylo on 58, and we would have more time to fix the regression on 59? We are not removing the old style system in 58 anyway, so it feels bad that we have performance regression there.
So Boris poked at this yesterday and noticed that a bunch of methods weren't getting inlined at all because they're defined in the style crate, but instantiated from the Gecko crate.

I sprinkled some of them based on a patch of him, and it seemed to be a slight improvement:

  https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=83ece9d96d570bb81cf22d72e9c51959dfccd9b9&newProject=try&newRevision=c9c565f284b970467608bb2c85eef48a6bf1494a&framework=1

So I'm landing those annotations in https://github.com/servo/servo/pull/19226.
(In reply to Xidorn Quan [:xidorn] UTC-8 (less responsive Nov 5 ~ Dec 16) from comment #8)
> Should we probably hold back on qS and qSA with stylo on 58, and we would
> have more time to fix the regression on 59? We are not removing the old
> style system in 58 anyway, so it feels bad that we have performance
> regression there.

Looks like we need some more time to keep improving, avoid regression in 58 seems sensible to me.
Emilio, thoughts?
(In reply to Emilio Cobos Álvarez [:emilio] from comment #9)
> So I'm landing those annotations in
> https://github.com/servo/servo/pull/19226.

That patch landed, is it worth re-testing in 59 to consider uplift? Or is that only a partial fix.
(In reply to Mike Taylor [:miketaylr] (58 Regression Engineering Owner) from comment #11)
> That patch landed, is it worth re-testing in 59 to consider uplift? Or is
> that only a partial fix.

Jet, WDYT?
Flags: needinfo?(bugs)
(In reply to Mike Taylor [:miketaylr] (58 Regression Engineering Owner) from comment #11)
> (In reply to Emilio Cobos Álvarez [:emilio] from comment #9)
> > So I'm landing those annotations in
> > https://github.com/servo/servo/pull/19226.
> 
> That patch landed, is it worth re-testing in 59 to consider uplift? Or is
> that only a partial fix.

It's probably only a partial fix for this regression.

(In reply to Astley Chen [:astley] (UTC+8) from comment #10)
> Looks like we need some more time to keep improving, avoid regression in 58
> seems sensible to me.
> Emilio, thoughts?

So, this is a regression, but in a very specific microbenchmark. While I still want to look at it and figure out how to avoid it, it's not clear to me that it's worth the hassle of disabling / re-enabling, or uplifting the inlining fix. As you can see from the test-case, in other benchmarks we're much faster than Gecko, these are just not run on automation.
(In reply to Emilio Cobos Álvarez [:emilio] from comment #13)
> So, this is a regression, but in a very specific microbenchmark.

Sorry, for jumping into discussion here, but I have to disagree. I see performance regression all over the board. More importantly new code struggle with simple, very often used selectors. Those "very specific" microbenchmark yo refer to, are actually quite good performance measure. 

> As you can see from the test-case, in other benchmarks we're
> much faster than Gecko, these are just not run on automation.

In complex, very specific case that Gecko struggle with, we are indeed multiple times faster, but in more everyday usage we are slower. I'm not only talking about benchmarks, uBlock's procedural filters that rely heavily on querySelectorAll are also slower and this affects all users of the add-on. And this is only one example.

I understand you want to push Stylo forward. But if this regression is allowed to ride the train, I feel that everyone should see the whole picture. I personally don't see any advantage of the new code atm, except those specific benchmarks where it is faster, but those doesn't translate to real word usage. I stand to be corrected, but I wanted to share my concerns about this bug.

Thanks.
Approval Request Comment
[Feature/Bug causing the regression]: bug 1410624
[User impact if declined]: performance regression
[Is this code covered by automated tests?]: yes
[Has the fix been verified in Nightly?]: no, but it's a backout fix
[Needs manual test from QE? If yes, steps to reproduce]: bug 1422522 is a good test-case I guess.
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: no
[Why is the change risky/not risky?]: it just unconditionally takes the Gecko code-path for CSSOM APIs, it's a one-line change which makes us take a known, well-tested path.
[String changes made/needed]: none
Attachment #8933937 - Flags: review?(xidorn+moz)
Attachment #8933937 - Flags: approval-mozilla-beta?
Attachment #8933937 - Flags: review?(xidorn+moz) → review+
(In reply to Astley Chen [:astley] (UTC+8) from comment #12)
> (In reply to Mike Taylor [:miketaylr] (58 Regression Engineering Owner) from
> comment #11)
> > That patch landed, is it worth re-testing in 59 to consider uplift? Or is
> > that only a partial fix.
> 
> Jet, WDYT?

The backout in 58, fix in 59 plan SGTM.
Flags: needinfo?(bugs)
Comment on attachment 8933937 [details] [diff] [review]
Back out Stylo DOM APIs from beta.

Backout to fix the perf regression. Beta58+.
Attachment #8933937 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
The regression looks better than what it used to, but it's still of ~6%... Still work to do here.
I captured a profile for the testcase described in comment 7, and it seems to be weird that we also spend non-trivial time in each level.

In my profile, I load the testcase several times, and sum the result together, and filter the call tree with "nsINode::QuerySelector". In the part called from optimized JS, the time spent under nsINode::QuerySelector is 1415ms, and there are three levels of time consumption:
* query_selector_slow [self time: 800ms]
* matches_complex_selector_internal [self time: 282ms]
* matches_simple_selector [self time: 332ms]

On comparison, for a profile with Gecko doing the same thing, the time spent under nsINode::QuerySelector is 1238ms, and there are four levels of time consumption:
* <lambda>::operator() [self time: 700ms]
* SelectorListMatches [self time: 218ms]
* SelectorMatches [self time: 286ms]
* _security_check_cookie [self time: 33ms] (I don't know what is this...)
So, if I understand correctly, the part of work done by SelectorListMatches in Gecko's profile is actually part of query_selector_slow in stylo, so stylo is actually faster in this part.

And the SelectorMatches part of Gecko then includes matches_complex_selector_internal and matches_simple_selector in Stylo, so Stylo is way slower in this part (over 2x time there).
I thought about whether it could be because of our approach for storing selectors which makes cache less efficient. However, it seems to me that Gecko's nsCSSSelector takes 9 words, which is 72 bytes, while Stylo's Component<Impl> may take 5 words, so a two items array should take 80 bytes. That doesn't sound too bad (although it is still surprisingly large).
Depends on: 1430502
(In reply to Xidorn Quan [:xidorn] UTC+10 (PTO Jan 19 ~ 29) from comment #22)
> I thought about whether it could be because of our approach for storing
> selectors which makes cache less efficient. However, it seems to me that
> Gecko's nsCSSSelector takes 9 words, which is 72 bytes, while Stylo's
> Component<Impl> may take 5 words, so a two items array should take 80 bytes.
> That doesn't sound too bad (although it is still surprisingly large).

Oh, hmmm, actually it's worse than that. The given selector would produce a three-item list due to the implicit default namespace one. So it's 120 bytes.
(In reply to Xidorn Quan [:xidorn] UTC+10 (PTO Jan 19 ~ 29) from comment #23)
> (In reply to Xidorn Quan [:xidorn] UTC+10 (PTO Jan 19 ~ 29) from comment #22)
> > I thought about whether it could be because of our approach for storing
> > selectors which makes cache less efficient. However, it seems to me that
> > Gecko's nsCSSSelector takes 9 words, which is 72 bytes, while Stylo's
> > Component<Impl> may take 5 words, so a two items array should take 80 bytes.
> > That doesn't sound too bad (although it is still surprisingly large).
> 
> Oh, hmmm, actually it's worse than that. The given selector would produce a
> three-item list due to the implicit default namespace one. So it's 120 bytes.

Are you sure? We only push a DefaultNamespace component when it's implicit but there's a default namespace, which is not the case I'd think.

Anyway, the thing regarding it's due to the storage is interesting... The other thing I have some patches for measuring (though haven't got around to profile it yet because I was away-ish this week doing university stuff) is the order we branch. In particular, Gecko is helped by PGO, probably ordering the frequent branches together and early, while Stylo doesn't have this.

I don't know in which order does rustc emit the branches in our case, but it could be very different to the order we care about.
Flags: needinfo?(emilio)
Err...
Flags: needinfo?(emilio)
(In reply to Emilio Cobos Álvarez [:emilio] from comment #24)
> (In reply to Xidorn Quan [:xidorn] UTC+10 (PTO Jan 19 ~ 29) from comment #23)
> > (In reply to Xidorn Quan [:xidorn] UTC+10 (PTO Jan 19 ~ 29) from comment #22)
> > > I thought about whether it could be because of our approach for storing
> > > selectors which makes cache less efficient. However, it seems to me that
> > > Gecko's nsCSSSelector takes 9 words, which is 72 bytes, while Stylo's
> > > Component<Impl> may take 5 words, so a two items array should take 80 bytes.
> > > That doesn't sound too bad (although it is still surprisingly large).
> > 
> > Oh, hmmm, actually it's worse than that. The given selector would produce a
> > three-item list due to the implicit default namespace one. So it's 120 bytes.
> 
> Are you sure? We only push a DefaultNamespace component when it's implicit
> but there's a default namespace, which is not the case I'd think.

I'm... not sure. I just read the code, and didn't really inspect the value there.

> Anyway, the thing regarding it's due to the storage is interesting...

Well it's just a guess work. We may want to use some finer grained profiler to do instruction-level measurement to understand...

> The other thing I have some patches for measuring (though haven't got around to
> profile it yet because I was away-ish this week doing university stuff) is
> the order we branch. In particular, Gecko is helped by PGO, probably
> ordering the frequent branches together and early, while Stylo doesn't have
> this.
> 
> I don't know in which order does rustc emit the branches in our case, but it
> could be very different to the order we care about.

Gecko has its own order that compare namespace, then tag, then id then class, etc. so it's just a binary conditional jump for most of the cases. Stylo uses enum so its a dispatch. I'm not sure how much that would make a difference.
(In reply to Xidorn Quan [:xidorn] UTC+10 (PTO Jan 19 ~ 29) from comment #26)
> (In reply to Emilio Cobos Álvarez [:emilio] from comment #24)
> > (In reply to Xidorn Quan [:xidorn] UTC+10 (PTO Jan 19 ~ 29) from comment #23)
> > > (In reply to Xidorn Quan [:xidorn] UTC+10 (PTO Jan 19 ~ 29) from comment #22)
> > > > I thought about whether it could be because of our approach for storing
> > > > selectors which makes cache less efficient. However, it seems to me that
> > > > Gecko's nsCSSSelector takes 9 words, which is 72 bytes, while Stylo's
> > > > Component<Impl> may take 5 words, so a two items array should take 80 bytes.
> > > > That doesn't sound too bad (although it is still surprisingly large).
> > > 
> > > Oh, hmmm, actually it's worse than that. The given selector would produce a
> > > three-item list due to the implicit default namespace one. So it's 120 bytes.
> > 
> > Are you sure? We only push a DefaultNamespace component when it's implicit
> > but there's a default namespace, which is not the case I'd think.
> 
> I'm... not sure. I just read the code, and didn't really inspect the value
> there.
> 
> > Anyway, the thing regarding it's due to the storage is interesting...
> 
> Well it's just a guess work. We may want to use some finer grained profiler
> to do instruction-level measurement to understand...

Yeah, I may try to talk with jseward about this bug tomorrow in the office... This is the kind of thing he knows about :)
As a record, I also tried to pull tag comparison from the big match and put it onto the very beginning of matches_simple_selector, but that doesn't seem to help at all, so it's probably not the issue of the order.
Looking at the assembly, I suspect the main overhead is from function call, or more specifically, the register backup of the function call.

matches_complex_selector_internal and matches_simple_selector each saves and restores seven general purpose registers. SelectorMatches also saves and restores seven registers, but it's only one level.

Probably we should try to inlinize matches_simple_selector somehow.
OK, so I confirmed that the reason is really that matches_simple_selector is not inlined.

I pulled out the LocalName branch from matches_simple_selector and put it to matches_complex_selector_internal just before calling into matches_simple_selector, the performance is significantly improved, and becomes even faster than Gecko.

The time consumption becomes just two level:
* query_selector_slow [self time: 748ms]
* matches_complex_selector_internal [self time: 410ms]

matches_complex_selector_internal is still not as fast as SelectorMatches (which from comment 20, takes 319ms in total), but because query_selector_slow is significantly faster than <lambda>::operator() + SelectorListMatches (takes 918ms), the overall speed becomes faster.

So the problem here is then that we need to inlinize matches_simple_selector.

This function is marked #[inline], but probably because it recursively calls itself for :not(), the compiler doesn't really inlinize it. We should probably make :not() to call something else (match_complex_selector?) to allow inlining.
Depends on: 1430521
This is a testcase based on comment 7 that I use for checking performance improvement.
I think the gap here has almost been closed. There may still be some small regression, though...

However, given bug 1422522, we probably want to land the back out again on beta 59 to disable stylo on qS / qSA.
Somehow I cannot see data points further before, but according to my previous testing, the optimization landed in several dependencies of this bug should have (gradually) brought us to roughly the same level as Gecko. There may still be remaining gap like some extra FFI calls, but they should be minor, and they are tricky to fix in general.

I'd like to call this bug resolved for now.
Status: NEW → RESOLVED
Closed: Last year
Flags: needinfo?(emilio)
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.