1420423 - [meta] stylo: Large regression on some talos tests when enabling stylo-chrome

Reporter

Description

•

8 years ago

See the comparison in https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=46f657fa7299c1f09323bf8b9e489d2f21267a1a&newProject=try&newRevision=783915f36ea74b64daf5794b4b2d6f1f326aed96&framework=1 There are some wins, but also some very large regressions (especially sessionrestore, tpaint, tresize, and ts_paint). They could have the same root cause. It probably needs to block enabling stylo-chrome.

Emilio Cobos Álvarez (:emilio)

Comment 1

•

8 years ago

I'm looking into this, I think it's extra-dumb stuff going on with XBL bindings... building patches now.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 2

•

8 years ago

Assigning to emilio given comment 1. If you don't have enough bandwidth, feel free to tell me where to look at, and I can probably try taking care of it.

Assignee: nobody → emilio

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 3

•

8 years ago

Here are profiles of a (modified) tpaint on Nightly 20171123100420 macOS opt: * stylo-chrome enabled: https://perfht.ml/2hQrgAy * stylo-chrome disabled: https://perfht.ml/2hOxiSb From this profile, I suspect XBL binding is probably not the only reason (if it is one of the reasons). It seems that we are spending lots of time under cascading. I'm looking at a 29ms style phase, whose counterpart in the non-stylo-chrome seems to be only about 15ms. In that style phase, 32% of time is under cascade_primary_style, and under that, 21% of time is spent in substitute_variables.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 4

•

8 years ago

Attached file modified tpaint file for local profiling (obsolete) — Details

You would need the following prefs: * "dom.send_after_paint_to_content": true, * "browser.link.open_newwindow": 2, * "security.data_uri.block_toplevel_data_uri_navigations": false And also allow the file to open new window without user input (need to click the permission popup in the first time). This file automatically runs the test 10 times and record the time in the page. In the profiles above, I pick the shortest one in the 10 to investigate (because that usually indicates least noisy).

Emilio Cobos Álvarez (:emilio)

Comment 5

•

8 years ago

(In reply to Xidorn Quan [:xidorn] UTC-6 (less responsive Nov 5 ~ Dec 16) from comment #3) > Here are profiles of a (modified) tpaint on Nightly 20171123100420 macOS opt: > * stylo-chrome enabled: https://perfht.ml/2hQrgAy > * stylo-chrome disabled: https://perfht.ml/2hOxiSb > > From this profile, I suspect XBL binding is probably not the only reason (if > it is one of the reasons). It seems that we are spending lots of time under > cascading. > > I'm looking at a 29ms style phase, whose counterpart in the non-stylo-chrome > seems to be only about 15ms. In that style phase, 32% of time is under > cascade_primary_style, and under that, 21% of time is spent in > substitute_variables. Right, my point is that as of right now, when we create frames for an element without an XBL binding, we style the subtree more than twice, scaling with the level of nestedness. That could also explain those numbers, since we're effectively doing triple the work. From looking at where the Servo_TraverseSubtree calls come from, it looks like this is the case (30% of the time is the initial styling, 60% come from nsXBLService::LoadBindings). I have a patch, but I need to debug some stuff, and I haven't been able so far because of bug 1420301.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 6

•

8 years ago

(In reply to Emilio Cobos Álvarez [:emilio] from comment #5) > From looking at where the Servo_TraverseSubtree calls come from, it looks > like this is the case (30% of the time is the initial styling, 60% come from > nsXBLService::LoadBindings). Are we looking at the same profile? In my profile in comment 4, for the last style phase in the time slice, Servo_TraverseSubtree is only called from ServoStyleSet::StyleDocument directly invoked from DoFlushPendingNotifications, which should all be the initial styling, right? In most of the style phases, nsXBLService::LoadBindings doesn't even appear. So I'm confused how that can lead to your conclusion.

Emilio Cobos Álvarez (:emilio)

Comment 7

•

8 years ago

(In reply to Xidorn Quan [:xidorn] UTC-6 (less responsive Nov 5 ~ Dec 16) from comment #6) > (In reply to Emilio Cobos Álvarez [:emilio] from comment #5) > > From looking at where the Servo_TraverseSubtree calls come from, it looks > > like this is the case (30% of the time is the initial styling, 60% come from > > nsXBLService::LoadBindings). > > Are we looking at the same profile? > > In my profile in comment 4, for the last style phase in the time slice, > Servo_TraverseSubtree is only called from ServoStyleSet::StyleDocument > directly invoked from DoFlushPendingNotifications, which should all be the > initial styling, right? In your profile, if I focus in nsXBLService::LoadBindings, on the first styling, I get: https://screenshots.firefox.com/54O2j9so7Wudf5fx/perfhtml.io If I look at the TraverseSubtree that comes from ConstructDocElementContainingBlock, which is effectively the initial style, I get: https://screenshots.firefox.com/IcZkZSk9pUqyazES/perfhtml.io (only the dark blue is relevant) So we're spending triple the time resolving bindings, that's what I based my previous comment on.

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

8 years ago

Depends on: 1420496

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

8 years ago

Assignee: emilio → nobody

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 8

•

8 years ago

So in the profiles, there is an 81ms total difference on processing delay between stylo and gecko (on the displayed measurement, the difference is 70ms, maybe window closing and focus switching contribute the other 11ms). Bug 1420496 can explain 38ms in it. There is still 43ms to be investigated. I suspect it is related to the outstanding time in cascade.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 9

•

8 years ago

So another part from the profiles: excluding the time spent in LoadBindings, we spend 30.6ms inside substitute_variables in stylo, while the gecko counterpart ResolveVariableReferences takes only 6.4ms in total, so it sounds like we do very poor on variable resolving. Maybe it's parsing performance? Anyway, it should explain another 24ms, so we have 19ms left to explain.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 10

•

8 years ago

In substitute_variables, there are lots of parsing functions show up in the profile, while in Gecko's profile, there doesn't seem to have many individual parsing functions show up. This could be either due to slow parsing, or we call it more times than necessary.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 11

•

8 years ago

parse_value in stylo vs. ParseProperty in gecko is 15.2ms vs. 5.8ms, so it feels like the parsing may be a problem. Not sure if Simon has any insight on if there's anything optimizable in that area.

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

8 years ago

Depends on: 1420509

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 12

•

8 years ago

Matching seems to be a bit slower as well. 32.4ms in "::match_" for Stylo, while 26.6ms in "EnumRulesMatching" for Gecko. I guess we should see the comparison after bug 1420496 and bug 1420509 get merged.

Emilio Cobos Álvarez (:emilio)

Comment 13

•

8 years ago

(In reply to Xidorn Quan [:xidorn] UTC-6 (less responsive Nov 5 ~ Dec 16) from comment #12) > Matching seems to be a bit slower as well. 32.4ms in "::match_" for Stylo, > while 26.6ms in "EnumRulesMatching" for Gecko. > > I guess we should see the comparison after bug 1420496 and bug 1420509 get > merged. Right, as of right now at least, we're not doing the same amount of selector-matching at all :)

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 14

•

8 years ago

(In reply to Emilio Cobos Álvarez [:emilio] from comment #13) > Right, as of right now at least, we're not doing the same amount of > selector-matching at all :) That measurement has already excluded those from time slice under LoadBindings, so it's independent extra time.

Emilio Cobos Álvarez (:emilio)

Comment 15

•

8 years ago

(In reply to Xidorn Quan [:xidorn] UTC-6 (less responsive Nov 5 ~ Dec 16) from comment #14) > (In reply to Emilio Cobos Álvarez [:emilio] from comment #13) > > Right, as of right now at least, we're not doing the same amount of > > selector-matching at all :) > > That measurement has already excluded those from time slice under > LoadBindings, so it's independent extra time. Not really, note that the point of the LoadBindings bug is that we also do extra time _before_ entering in LoadBindings (styling the whole subtree beforehand).

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 16

•

8 years ago

(In reply to Emilio Cobos Álvarez [:emilio] from comment #15) > Not really, note that the point of the LoadBindings bug is that we also do > extra time _before_ entering in LoadBindings (styling the whole subtree > beforehand). So... I meant, this measurement actually only includes the part after LoadBindings. I didn't include the first sixth under the red bar, so it should be independent. Are you aware of anything we may be doing slower on matching?

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 17

•

8 years ago

(In reply to Xidorn Quan [:xidorn] UTC-6 (less responsive Nov 5 ~ Dec 16) from comment #12) > Matching seems to be a bit slower as well. 32.4ms in "::match_" for Stylo, > while 26.6ms in "EnumRulesMatching" for Gecko. There is 9.6ms in SelectorMatches with Gecko, and 19.8ms in selector::matching with Stylo, both numbers are only counting part after binding loading. Actually, SelectorMatches doesn't shown up at all in binding loading part, while Stylo has 5.4ms under selector::matching in that part. It's not clear to me whether it is a fair comparison, though.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 18

•

8 years ago

(In reply to Xidorn Quan [:xidorn] UTC-6 (less responsive Nov 5 ~ Dec 16) from comment #17) > There is 9.6ms in SelectorMatches with Gecko, and 19.8ms in > selector::matching with Stylo, both numbers are only counting part after > binding loading. Actually, SelectorMatches doesn't shown up at all in > binding loading part, while Stylo has 5.4ms under selector::matching in that > part. > > It's not clear to me whether it is a fair comparison, though. So, I guess it's not fair. There are some logic in ContentEnumFunc. Also SelectorMatches and SelectorMatchesTree may be running for too short to be caught. For ContentEnumFunc, then, Gecko takes 17.8ms, so it's not such large difference then.

Jet Villegas (inactive)

Updated

•

8 years ago

Priority: -- → P3

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 19

•

8 years ago

Two known regressions have been fixed. I'll do some try push and see how things look like now.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 20

•

8 years ago

Talos tests haven't finished, but number from some finished tasks showed that there are improvements from what we had before. The regression has been cut about half, but there is still some regression. Here is a new profile for tpaint on stylo-chrome with the current m-c tip: https://perfht.ml/2BkKtlD

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 21

•

8 years ago

And a profile without style-chrome on the same revision: https://perfht.ml/2BkLPgd

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

8 years ago

Depends on: 1421305

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

8 years ago

Depends on: 1421310

Chris Peterson [:cpeterson]

Comment 22

•

8 years ago

P1 because we can't enable stylo-chrome if it regresses Talos performance too much.

status-firefox57: --- → unaffected

status-firefox58: --- → unaffected

status-firefox-esr52: --- → unaffected

Priority: P3 → P1

Summary: stylo-chrome: Large regression on some talos tests when enabling stylo-chrome → stylo: Large regression on some talos tests when enabling stylo-chrome

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 23

•

8 years ago

Here's the new comparison: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=af1d0a3e3a6a&newProject=try&newRevision=11876e34b9c4&framework=1 So we are still regressing tpaint by ~30% on Windows pgo builds, which is still the largest one at the moment. (But has been much better than before, where we regressed by 50%.) I'm going to do profiling on windows pgo build to see if I can find anything helpful.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 24

•

8 years ago

A profile with the stylo-chrome win64 pgo build: https://perfht.ml/2iYJbpu

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 25

•

8 years ago

For comparison, a profile without stylo-chrome: https://perfht.ml/2j0miSs

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

8 years ago

Depends on: 1421408

Emilio Cobos Álvarez (:emilio)

Comment 26

•

8 years ago

So, I did a profile with a smaller sampling interval, and there's tons of XBL overhead and similar stuff too... https://perfht.ml/2Brsn19

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 27

•

8 years ago

Jet suggested me to do a try push with thread pool enabled for stylo-chrome, and see what the talos result would look like. I just did so, and it seems that stylo-chrome with parallel is significantly better than single-threaded stylo-chrome. There is a 4%~12% improvement among various tests we see regressions on. The regression percent of the worst regression, tpaint win32 pgo, is down from 33.5% to 16.6%. Result here: gecko vs. single-thread stylo-chrome: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=fb09869fa688&newProject=try&newRevision=2a56c4aea5e0&framework=1 gecko vs. multi-thread stylo-chrome: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=fb09869fa688&newProject=try&newRevision=2b6888c17efc&framework=1 So it seems we should enable thread pool for stylo-chrome eventually. However, we are still regressing a lot even with thread pool enabled, so we still need to try optimizing single-thread stylo-chrome further. (Also single-thread stylo is easier to profile, so probably we should stick on single-thread stylo-chrome for now until the regression is down to a reasonable range.)

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

8 years ago

Depends on: 1423964

Emilio Cobos Álvarez (:emilio)

Updated

•

8 years ago

Depends on: 1424022

Liz Henry (:lizzard) (relman/hg->git project)

Comment 28

•

8 years ago

Tracking for 59, since this is a blocker for enabling stylo-chrome on nightly by default. I realize that might mean only some of the dependent bugs are blocking, so we can move the tracking flag if that makes more sense.

tracking-firefox59: --- → +

Cameron McCormack (:heycam)

Comment 29

•

8 years ago

I added some instrumentation and found that for Xidorn's modified tpaint test, on average the old style system ends up resolving 1567 style contexts in browser.xul, and stylo ends up resolving 2238.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 30

•

8 years ago

(In reply to Cameron McCormack (:heycam) from comment #29) > I added some instrumentation and found that for Xidorn's modified tpaint > test, on average the old style system ends up resolving 1567 style contexts > in browser.xul, and stylo ends up resolving 2238. Is that an accumulated number or the number when settled? For the latter, I think I had a look at the memory report and didn't notice anything weird from there...

Cameron McCormack (:heycam)

Comment 31

•

8 years ago

No, it's total number of resolved style contexts. So it may be an indication that we're restyling more with stylo.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 32

•

8 years ago

(In reply to Cameron McCormack (:heycam) from comment #31) > No, it's total number of resolved style contexts. So it may be an > indication that we're restyling more with stylo. We do restyle one pass more than gecko (and that pass takes ~0.5% of the total time), which is bug 1421310 which we need the cause stack to analyze the reason... So I guess you mean that is the accumulated number of style context we ever allocate during that?

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 33

•

8 years ago

So I did some counting on number of style structs allocated during restyling (via increasing atomic counter from their constructors), and it seems for the restyle when handling load event (which is usually the second cyan bar in a stylo-chrome profile), Stylo allocates much more style structs than Gecko. The numbers are listed below: Style struct | Stylo | Gecko --------------+-------+------ Font | 43 | 0 Color | 98 | 0 List | 48 | 19 Text | 57 | 6 Visibility | 30 | 17 UserInterface | 215 | 170 TableBorder | 0 | 0 SVG | 28 | 0 Variables | 0 | 4 Background | 26 | 6 Position | 162 | 7 TextReset | 16 | 10 Display | 236 | 16 Content | 4 | 0 UIReset | 154 | 0 Table | 0 | 0 Margin | 66 | 1 Padding | 67 | 0 Border | 47 | 14 Outline | 0 | 0 XUL | 81 | 6 SVGReset | 0 | 0 Column | 9 | 3 Effects | 24 | 6 To be clear, this is the number from single thread Stylo, so multi-thread would be worse on this. It's probably not very surprising that we allocate more reset style structs since the rule cache is coarse-grained comparing to Gecko, although such big difference may still be somehow surprising. It is more interesting that we even do bad on inherited ones like Font, Color, etc.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 34

•

8 years ago

And... for the restyle during handling of "activate" event, stylo allocates roughly the same number of style structs again (which is much higher than Gecko)... That makes me suspect that we are probably doing a whole page restyle every time (rather than some incremental restyle). Maybe related to :-moz-window-inactive which emilio mentioned before which could trigger some large invalidation?

Cameron McCormack (:heycam)

Comment 35

•

8 years ago

I thought that the old style system also restyled everything when NS_DOCUMENT_STATE_WINDOW_INACTIVE changes. Still, maybe you can make some estimates on how much time we'd save by doing better when that changes?

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 36

•

8 years ago

Discussed with heycam in IRC, and it seems we are currently posting a RESTYLE_SUBTREE for WINDOW_INACTIVE change, and that would make stylo to recompute style for every element in the document, which is very unfortunate. Gecko is currently doing the same thing, however, Gecko may get help from the persisted cache on the rule node, and thus it doesn't need to do computation for elements which are not affected by the state change (since their rule node will keep the same). I guess we are not going to go Gecko's approach to persist the rule cache anyway, so maybe we should try to use invalidation for window state change.

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

8 years ago

Depends on: 1428978

Emilio Cobos Álvarez (:emilio)

Updated

•

8 years ago

Depends on: 1429125

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

8 years ago

Depends on: 1409672

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

8 years ago

Depends on: 1429959

Emilio Cobos Álvarez (:emilio)

Updated

•

8 years ago

Depends on: 1430844

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 37

•

8 years ago

After several patches landed over the last weeks, the performance of stylo-chrome seems to be improved for quite a bit. In my local test, it seems that stylo-chrome is still ~10% slower. I also did some profiling, and here is a comparison: * Stylo: https://perfht.ml/2ElGx6G * Gecko: https://perfht.ml/2EkZy9g It seems that we are now much faster on some of the restyles (e.g. the one for "activate" event and the one after "MozAfterPaint" event), but we still take longer in some of the restyles, specifically the one under "load" event. Also, bug 1421310 still seems to be a problem that we have an additional large restyle between "load" event and "activate" event which Gecko doesn't have. The restyle immediately after "DOMContentLoaded" also takes much longer in Stylo.

Emilio Cobos Álvarez (:emilio)

Comment 38

•

8 years ago

(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #37) > After several patches landed over the last weeks, the performance of > stylo-chrome seems to be improved for quite a bit. > > In my local test, it seems that stylo-chrome is still ~10% slower. I also > did some profiling, and here is a comparison: > * Stylo: https://perfht.ml/2ElGx6G > * Gecko: https://perfht.ml/2EkZy9g > > It seems that we are now much faster on some of the restyles (e.g. the one > for "activate" event and the one after "MozAfterPaint" event), but we still > take longer in some of the restyles, specifically the one under "load" > event. Also, bug > 1421310 still seems to be a problem that we have an additional large restyle > between "load" event and "activate" event which Gecko doesn't have. The > restyle immediately after "DOMContentLoaded" also takes much longer in Stylo. Yeah, I have some theory about how to make it better changing how the XBL stuff works, but may need to check with bgrins and such before. In particular, if we restrict ::-moz-binding to document sheets, we can avoid all this servo -> gecko -> servo ping-ponging for XBL bindings.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 39

•

8 years ago

So the first restyle (immediately after DOMContentLoaded) is triggered by a style attribute change on the root element, so we have to restyle everything. The long one under load event handler is triggered for CSS rule changes. The one between "load" and "activate" event has "RESTYLE_SELF | RESTYLE_DESCENDANTS" on the root element, and thus we again restyle everything. It is not currently clear to me why's that, though. So basically it seems they are long because we restyle the whole chrome document. Gecko takes less time on the whole document restyle probably because it takes advantage of its persisted rule tree cache, but I'm not sure.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 40

•

8 years ago

So, for the question (from emilio last night asked me on IRC) what the style attribute change it is. The answer is, it is custom property "--tab-min-width" set by this: https://searchfox.org/mozilla-central/rev/eeb7190f9ad6f1a846cd6df09986325b3f2c3117/browser/base/content/tabbrowser.xml#6022-6026 As I mentioned before somewhere, it may be a common pattern to set custom property on root element nowadays, and we may want to somehow optimize for it before the environment variable actually comes.

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

8 years ago

Depends on: 1435122

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

8 years ago

Depends on: 1417970

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

8 years ago

Depends on: 1435139

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 41

•

8 years ago

(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #39) > The one between "load" and "activate" event has "RESTYLE_SELF | > RESTYLE_DESCENDANTS" on the root element, and thus we again restyle > everything. It is not currently clear to me why's that, though. So, this one is because we post a media feature values change from window resizing. Gecko does so as well, but it does so in a synchronous restyle, which doesn't show up in the profile timeline. So this is basically another unavoidable full document restyle, which we are slower than Gecko.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 42

•

8 years ago

That... makes me think... why do we have three full document restyle for a single window open after the initial style... which sounds ridiculous... Photon and Quantum Flow should probably put some energy on collapsing them together... Anyway, so the problem left is that, we are slow on full document restyle. This is a known issue, actually, likely because Gecko has a persisted cache on rule tree while we don't. Gecko can do reasonably fast full document restyle based on that cache, while we cannot. We may need to either implement that cache as well, or make every effort to make use of invalidation and avoid full document restyle. (BTW, given that the main slowdown is still from full document restyle, I suspect the display:none may still help to some extent since we do have lots of display:none root.)

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 43

•

8 years ago

So, the second full document restyle is caused by Pocket loading its stylesheet here: https://searchfox.org/mozilla-central/rev/eeb7190f9ad6f1a846cd6df09986325b3f2c3117/browser/extensions/pocket/bootstrap.js#504

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 44

•

8 years ago

So, the first full document restyle can probably be optimized either with some special cascading mechanism mentioned in bug 1417970, or by fixing both bug 1435122 and bug 1435139. Per discussion with emilio on IRC, the stylesheet mentioned in comment 43 doesn't meet the criteria for invalidation added in bug 1357583 because it contains @keyframes rules. This is probably optimizable, and web content in general can potentially also benefit from having this optimization. The third full document restyle from viewport change... we can probably also use invalidation to only restyle element affected by media rules changed their condition result... Not sure how helpful that would be. But I feel frustrated to see we have three full document restyle... Couldn't we somehow collapse them altogether?

Emilio Cobos Álvarez (:emilio)

Updated

•

8 years ago

Depends on: 1435214

:Gijs (he/him)

Updated

•

8 years ago

Depends on: 1435217

Comment hidden (mozreview-request)

Emilio Cobos Álvarez (:emilio)

Comment 47

•

8 years ago

Comment on attachment 8947765 [details] style: Make insertion of @font-face rules not necessarily restyle the whole document. Wrong bug... Should've been bug 1435214.

Attachment #8947765 - Attachment is obsolete: true

Attachment #8947765 - Flags: review?(xidorn+moz)

Emilio Cobos Álvarez (:emilio)

Updated

•

8 years ago

Attachment #8947766 - Attachment is obsolete: true

Attachment #8947766 - Flags: review?(xidorn+moz)

Attachment #8947766 - Flags: review?(hikezoe)

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 48

•

8 years ago

So after the fix in bug 1435217, it seems we are roughly on par with gecko now. I cannot distinguish between a Gecko result and a Stylo-chrome result from a brief look. (Actually, when I saw the number initially, I thought, oh this fast result must be from Gecko. But after I saw the profile as well as about:support, I realized that we have made stylo-chrome fast enough.)

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 49

•

8 years ago

Hmmm, although it seems to be on par on my machine, it is still not on infra. It seems we are still regressing 8~12% on tpaint. And tpaint is not the most regressed one now. The sessionrestore ones have 8~16% regression, and on tresize we have 19% regression on some platforms. tsvg_static seems to also regress a lot on linux specifically. I expect the fix of bug 1435122 and bug 1435139 can remove some more regression on tpaint, but probably we should shift our focus to tresize now. Also there are lots of serious regression on pgo platforms on stylo_disabled tasks, which we probably don't care. It could be because pgo consider the Gecko path to be cold. I'm not sure whether any non-stylo-disabled test can be affected by this, though.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 50

•

8 years ago

The comparison can be found in: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=805b50280b3d&newProject=try&newRevision=e37e29fcca1e&framework=1

Hiroyuki Ikezoe (:hiro)

Comment 51

•

8 years ago

One thing I noticed about tpaint is that the test supposes that window.resizeTo() is processed synchronously. So in stylo case it also measures the time that we are waiting for the next vsync tick, IIUC.

Hiroyuki Ikezoe (:hiro)

Comment 52

•

8 years ago

(In reply to Hiroyuki Ikezoe (:hiro) from comment #51) > One thing I noticed about tpaint is that the test supposes that I meant 'tresize'.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 53

•

8 years ago

So, for tpaint, the time collected also seems to be bimodal which has a difference of ~16ms, which may indicate that we are slipping around a vsync boundary. Probably if we can optimize a bit more, we can consistently fit into the previous vsync and get a bump on this benchmark. I'm not quite sure how to run tresize. Anyone has experience about how to run that? It seems to require installing a leagcy unsigned addon, which I have no idea how to...

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 54

•

8 years ago

Oh, it just reminds me... Wouldn't tresize basically be triggering resize reflow several times? IIRC that means repeatedly whole page restyle? Sounds terrible :/ Maybe we should try to optimize for viewport change, then... It seems we do have lots of @media (max-width) in browser.css, so we would need restyle. But maybe it's possible to use invalidation... Not sure how much it would help. I'll try to profile and see whether that's the real problem.

Hiroyuki Ikezoe (:hiro)

Comment 55

•

8 years ago

FWIW, I did a try run with eliminating waiting for the vsync tick; https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=71e243d83e769d992f4ed38630edeb6aaf14b05a&framework=1&filter=tresize%20pgo%20windows10&selectedTimeRange=172800 It didn't help at all. :/

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 56

•

8 years ago

So here is profiles for tresize. Gecko: https://perfht.ml/2C08X3s Stylo: https://perfht.ml/2BY4bmY With this profiles, Gecko reports 10.027ms, and Stylo reports 11.456ms for the talos result, which translates to 14% regression. It seems to mostly match what we see on infra considering that profiler can reduce regression on overall number given its constant overhead. The tresize test is basically resizing the window 300 times, each time 2px on both width and height, from 425x425 to 1025x1025. Given this, the above numbers can be translated to 3008ms vs. 3437ms total time taken, which 429ms difference in total. In the profiles, the time captured is 2950ms and 3348ms, which is 398ms difference. Majority of the time in both runs is taken in the painting rather than style, but style does make some difference. I tried to focus on function "PresShell::DoFlushPendingNotifications Style" in the stack, and the difference becomes 84ms vs. 386ms, so 300ms of the 400ms difference can be explained here. I'll focus on analyzing this 300ms first then try to figure out where are the additional ~100ms is from.

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

8 years ago

Depends on: 1435939

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

8 years ago

Depends on: 1435940

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 57

•

8 years ago

Oh, and it seems pretty easy to profile tresize, actually. For a profiling build, after mach buildsymbols, you can run the following command: > ./mach talos-test --cycles 1 -a tresize --geckoProfile --symbolsPath objdir/dist/firefox-60.0a1.en-US.win64.crashreporter-symbols.zip --setpref=layout.css.servo.chrome.enabled=true Note that, it seems to me the symbolsPath needs to be the full path, and it needs to be a zip rather than a directory (which I complained in bug 1435913). And you can later find the result profiles in srcdir/testing/mozharness/build/blobber_upload_dir/profile_tresize.zip.

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

8 years ago

Depends on: 1435944

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

8 years ago

Depends on: 1435949

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 58

•

8 years ago

So, in the 302ms regressed time under flush pending style notifications, bug 1435939 explains 212ms, bug 1435940 explains 56ms, bug 1435944 explains 8ms. Bug 1435949 explains at least 18ms, maybe more. So there remains ~8ms to be explained, which is probably small enough that we can ignore for now (especially given that bug 1435949 may actually count more than 18ms). I probably should instead trying to look at where is the remaining ~100ms outside flush pending style notifications from.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 59

•

8 years ago

The majority of the rest time difference are on CondVar::Wait under PCompositorBridgeChild::SendFlushRendering, which in this profile has ~70ms difference. I have no idea what's happening there... The majority of time in the compositor thread is also WaitForMessage...

Ryan VanderMeulen [:RyanVM]

Comment 60

•

8 years ago

This isn't shipping in 59, setting the status accordingly.

status-firefox59: affected → disabled

status-firefox60: --- → affected

tracking-firefox59: + → -

tracking-firefox60: --- → +

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 61

•

8 years ago

It looks like after fixing bug 1435939, regression on tresize disappears on talos, although my local test still shows the regression (~212ms less than the previous profiling). In that case, maybe bug 1435940 and bug 1435949 can become less important, although it would still be good it have since restyle on resize also affect tpaint. tpaint on Linux still regresses 12%, while on Win64 it has almost gone (maybe because I profiled based on Win64). It may be worth looking at what's happening on Linux. I can probably still profile the Win32 build for this, which is also regression 7% at the moment. Regressions on sessionrestore are reduced a lot as well, maybe also thanks to bug 1435939. And I cannot really reproduce this regression locally. tsvg_static is regressed a lot (18%) on Linux specifically. Other platforms seem to be even faster than gecko (no enough confidence, though). I have no idea why as well. This may be related to bug 1430706.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 62

•

8 years ago

(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #61) > tpaint on Linux still regresses 12%, while on Win64 it has almost gone > (maybe because I profiled based on Win64). It may be worth looking at what's > happening on Linux. I can probably still profile the Win32 build for this, > which is also regression 7% at the moment. So on Linux, there is still a pretty long restyle under load event handler which is probably triggered by something else. Also the restyle after DOMContentLoaded is also much longer in Stylo than in Gecko for unknown reason. Maybe setting custom properties on root is not the only reason for triggering that. This one is reproducible on Windows as well. Stylo is now much faster for the second half of tpaint (deactivate / activate on the windows) where Gecko has several non-trivial restyles, while with stylo-chrome those don't show up at all. We are still slower on the resize reflow one, but we would probably fix it in bug 1435940. > tsvg_static is regressed a lot (18%) on Linux specifically. Other platforms > seem to be even faster than gecko (no enough confidence, though). I have no > idea why as well. This may be related to bug 1430706. This seems to be pretty reliable on talos push, but doesn't seem to reproduce well in a talos with profiler push... Actually, it seems tsvg_static is a page load test in the content process, how can it be affected by stylo-chrome in a negative way?

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 63

•

7 years ago

This is the regressions we would see for enabling stylo-chrome: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=d26229a79b7f&newProject=try&newRevision=606fd27ed9c4&framework=1&showOnlyConfident=1

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 64

•

7 years ago

Attached file modified tpaint file for local profiling, v2 — Details

Add performance mark to understand the actual range tpaint measures.

Attachment #8931735 - Attachment is obsolete: true

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 65

•

7 years ago

It seems on a win64 pgo build, stylo is about 5~10ms slower on the part before the DOMContentLoaded event comes. It indicates that we are slower on the initial style. (Actually, we didn't regress on win64 pgo that much in my earlier try push... Is there something landed recently slowed down the initial style? :( Anyway, that's probably something worth looking at. Gecko: https://perfht.ml/2sBnJ14 Stylo: https://perfht.ml/2syrJ2d

Emilio Cobos Álvarez (:emilio)

Updated

•

7 years ago

Depends on: 1438911

Emilio Cobos Álvarez (:emilio)

Updated

•

7 years ago

No longer depends on: 1438911

Emilio Cobos Álvarez (:emilio)

Comment 66

•

7 years ago

So apart from bug 1438911, which doesn't seem to be relevant anymore, the other full-doc restyle in the parent process that happens is a window resize from 30x30px to the actual size after the chrome document has loaded. This causes a full-doc restyle because of these media queries: https://searchfox.org/mozilla-central/rev/74b7ffee403c7ffd05b8b476c411cbf11d134eb9/browser/base/content/browser.css#729 I guess we need to optimize that out somehow, either avoiding the resize, or using invalidations. I'm looking into that now. FTR, the resize comes from nsXULWindow::OnChromeLoaded -> nsXULWindow::SetSize -> ResizeReflowIgnoringOverride.

Boris Zbarsky [:bzbarsky]

Comment 67

•

7 years ago

I think we should move the size-setting to run under XULDocument::DoneWalking (in case overlays mess with the window attrs), before we call StartLayout. At that point the size sets should be pretty no-op in terms of the style system, I would think.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 68

•

7 years ago

bholley, for the part I mentioned that stylo is slower than gecko but I have no idea why, you can see the profiles in comment 65. The black lines in content 2 are the measurement ranges. You can pick a random one from each profile, and check the range from the start of the range, to the start of the first black line in Main Thread (which should be for DOMContentLoaded event). That's the part I mentioned in today's meeting. We are consistently 5~10ms slower on that part.

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

7 years ago

Flags: needinfo?(bobbyholley)

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 69

•

7 years ago

(In reply to Boris Zbarsky [:bz] (no decent commit message means r-) from comment #67) > I think we should move the size-setting to run under > XULDocument::DoneWalking (in case overlays mess with the window attrs), > before we call StartLayout. At that point the size sets should be pretty > no-op in terms of the style system, I would think. It looks like almost the whole thing under nsXULWindow::OnChromeLoaded can be moved to XULDocument::DoneWalking? I suppose all the attributes should have been available at that point?

Emilio Cobos Álvarez (:emilio)

Updated

•

7 years ago

Depends on: 1439875

Bobby Holley (:bholley)

Comment 70

•

7 years ago

(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #68) > The black lines in content 2 are the measurement ranges. You can pick a > random one from each profile, and check the range from the start of the > range, to the start of the first black line in Main Thread (which should be > for DOMContentLoaded event). That's the part I mentioned in today's meeting. > We are consistently 5~10ms slower on that part. It's very hard to tell what's going on with 1ms profiler resolution. Can you crank it up to 0.1ms or 0.2ms and reprofile?

Flags: needinfo?(bobbyholley)

Boris Zbarsky [:bzbarsky]

Comment 71

•

7 years ago

> I suppose all the attributes should have been available at that point? Depends on whether async-loading scripts set them, I'd think. If we are willing to assume they do not, then the attributes are set by the time DoneWalking() is called. In fact, they're set by the time the ApplyPersistentAttributes() call near the end of ResumeWalk() returns.

Zibi Braniecki [:zbraniecki][:gandalf]

Comment 72

•

7 years ago

Fluent L10n is done in a microtask (Promise) triggered by https://searchfox.org/mozilla-central/rev/9a8d3f8191b15cdca89a7ce044c7bea2dd0462dc/dom/xul/XULDocument.cpp#2692 and it'll definitely affect content and attributes.

Boris Zbarsky [:bzbarsky]

Comment 73

•

7 years ago

Is it going to affect the attributes that nsXULWindow::OnChromeLoaded cares about?

Boris Zbarsky [:bzbarsky]

Comment 74

•

7 years ago

(Note that we _could_ run the current contents of OnChromeLoaded between that code you point to and StartLayout, which would also run it after that microtask once we fix promises to actually run off microtasks. But it would be pretty odd from my point of view if l10n messes with the "width", "height", "screenX", "screenY", "hidechrome", "chromemargin", "windowtype", "id", "drawtitle", "toggletoolbar", "fullscreenbutton", "macanimationtype", "sizemode", or "zlevel" attributes on the <xul:window> element, which are the relevant attributes here.)

Zibi Braniecki [:zbraniecki][:gandalf]

Comment 75

•

7 years ago

Ah, apologies, didn't know that we're talking about that. In theory l10n could affect dialog window width/height but it wouldn't be on the main window, but rather some preferences window outside of the startup path.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 76

•

7 years ago

New profiles with 0.1ms interval: Gecko: https://perfht.ml/2F0eE76 Stylo: https://perfht.ml/2F0HrbC

Flags: needinfo?(bobbyholley)

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

7 years ago

Depends on: 1441102

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

7 years ago

Depends on: 1441389

Bobby Holley (:bholley)

Comment 77

•

7 years ago

(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #76) > New profiles with 0.1ms interval: > Gecko: https://perfht.ml/2F0eE76 > Stylo: https://perfht.ml/2F0HrbC Thanks for posting the profiles Xidorn. I spent some time digging into them, and came up with various interesting bits. The time interval you mentioned (between window open and DOMContentLoaded) is dominated by PresShell::Initialize, so I focused the profiler on on those samples [1], [2] (for Gecko, Servo respectively). We can divide the total time by 10 to get the average time per run, with less noise than looking at individual runs. Sure enough, we see 53.7 ms vs 66.0 ms per run - a difference of 12.3ms, which is significant. A large part of this difference is in cascading. I see 4.0 ms in WalkRuleTree vs 8.4 ms in properties::cascade. I believe this is the result of the eager vs lazy cascade architectures. Specifically, the profiles show a big restyle happening between frame construction and the first call to DoReflow, which means that much/all of the style data was thrown away before layout actually used it. So the lazy computation performs much better in this case, because it avoids a lot of the work that would be wasted anyway. The right fix here, of course, is to not do the wasted work at all. IIUC, bug 1439875 should fix this. Accounting for that, we still have an 7.9 ms discrepancy. Filtering the cascade functions out, we get [3], [4]. Next we look at selector matching. On average per run, I see 4.1 ms in nsStyleSet::FileRules, vs 5.4 ms in match_{primary,pseudo}. This is a bit strange to me, because I generally expect to see less selector matching overhead in Stylo. One obvious thing to check would be the number of elements matched in each case - is selector matching slower, or are we matching more elements? In fact, we really should be matching _fewer_ elements given the style sharing cache. So another thing would be to verify whether the style sharing cache is working correctly. Perhaps something about XUL elements/documents is tripping it up? That leaves us with 6.6 ms discrepancy. Filtering the match functions out, we get [5], [6]. The next interesting thing I see is time spent in LoadBindings, 16.4 ms vs 20.5 ms. 1.6 ms of this is traversal overhead (Servo_TraverseSubtree minus the actual styling work we've already filtered out) from all the extra work we do to style XBL subtrees, which may be unavoidable until the frontend moves to Shadow DOM. But the rest of it is roughly evenly distributed across the callees of LoadBindings, most of which are not style-related (InstallImplementation, InstallEventHandlers, etc). This makes my strongly suspect that we're invoking LoadBindings more times in total under Stylo compared to Gecko. We should measure this and figure out why it's happening. That leaves us with 2.5 ms discrepancy. Filtering out LoadBindings, we get [7], [8]. In that, the obvious difference is handling of URLValueData, where stylo spends 2.3 ms and Gecko doesn't appear to do any equivalent work. We should brainstorm ways to fix this. NI xidorn for comments and further investigation on the above. [1] https://perf-html.io/public/e6c4161edc3bfe0e44f6df4c5745e8df459d13bb/calltree/?hiddenThreads=1&range=2.9904_16.0063&thread=0&threadOrder=0-1-2&transforms=ff-6958&v=2 [2] https://perf-html.io/public/1a564c903fd4ffdb20be117a3490eb2f13a8663b/calltree/?hiddenThreads=1&range=2.9831_16.9969&search=presshell%3A%3AInit&thread=0&threadOrder=0-1-2&transforms=ff-6773&v=2 [3] https://perf-html.io/public/e6c4161edc3bfe0e44f6df4c5745e8df459d13bb/calltree/?hiddenThreads=1&range=2.9904_16.0063&thread=0&threadOrder=0-1-2&transforms=ff-6958~df-19878&v=2 [4] https://perf-html.io/public/1a564c903fd4ffdb20be117a3490eb2f13a8663b/calltree/?hiddenThreads=1&range=2.9831_16.9969&thread=0&threadOrder=0-1-2&transforms=ff-6773~df-21200&v=2 [5] https://perf-html.io/public/e6c4161edc3bfe0e44f6df4c5745e8df459d13bb/calltree/?hiddenThreads=1&range=2.9904_16.0063&thread=0&threadOrder=0-1-2&transforms=ff-6958~df-19878~df-271&v=2 [6] https://perf-html.io/public/1a564c903fd4ffdb20be117a3490eb2f13a8663b/calltree/?hiddenThreads=1&range=2.9831_16.9969&thread=0&threadOrder=0-1-2&transforms=ff-6773~df-21200~df-25969~df-38739&v=2 [7] https://perf-html.io/public/e6c4161edc3bfe0e44f6df4c5745e8df459d13bb/calltree/?hiddenThreads=1&range=2.9904_16.0063&thread=0&threadOrder=0-1-2&transforms=ff-6958~df-19878~df-271~df-32246&v=2 [8] https://perf-html.io/public/1a564c903fd4ffdb20be117a3490eb2f13a8663b/calltree/?hiddenThreads=1&range=2.9831_16.9969&thread=0&threadOrder=0-1-2&transforms=ff-6773~df-21200~df-25969~df-38739~df-38129&v=2

Flags: needinfo?(bobbyholley) → needinfo?(xidorn+moz)

Emilio Cobos Álvarez (:emilio)

Comment 78

•

7 years ago

FWIW, I haven't checked the profiles, but I put a patch in bug 1442246 to handle urls more cheaply today :).

Bobby Holley (:bholley)

Comment 79

•

7 years ago

(In reply to Emilio Cobos Álvarez [:emilio] from comment #78) > FWIW, I haven't checked the profiles, but I put a patch in bug 1442246 to > handle urls more cheaply today :). Nice! So to be clear, the overhead here comes from the NS_NewURI calls in mozilla::css::URLValueData::GetURI. Will your patches cause us to do that less? [1] https://perf-html.io/public/1a564c903fd4ffdb20be117a3490eb2f13a8663b/calltree/?hiddenThreads=1&range=2.9831_16.9969&thread=0&threadOrder=0-1-2&transforms=ff-6773~df-21200~df-25969~df-38739~df-38129~ff-7197&v=2

Emilio Cobos Álvarez (:emilio)

Comment 80

•

7 years ago

(In reply to Bobby Holley (:bholley) from comment #79) > (In reply to Emilio Cobos Álvarez [:emilio] from comment #78) > > FWIW, I haven't checked the profiles, but I put a patch in bug 1442246 to > > handle urls more cheaply today :). > > Nice! > > So to be clear, the overhead here comes from the NS_NewURI calls in > mozilla::css::URLValueData::GetURI. Will your patches cause us to do that > less? > > [1] > https://perf-html.io/public/1a564c903fd4ffdb20be117a3490eb2f13a8663b/ > calltree/?hiddenThreads=1&range=2.9831_16.9969&thread=0&threadOrder=0-1- > 2&transforms=ff-6773~df-21200~df-25969~df-38739~df-38129~ff-7197&v=2 Nope, that just makes us to refcount URLs more instead of copying them + converting them to UTF16.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 81

•

7 years ago

(In reply to Bobby Holley (:bholley) from comment #77) > (In reply to Xidorn Quan [:xidorn] UTC+10 from comment #76) > > New profiles with 0.1ms interval: > > Gecko: https://perfht.ml/2F0eE76 > > Stylo: https://perfht.ml/2F0HrbC > > Thanks for posting the profiles Xidorn. I spent some time digging into them, > and came up with various interesting bits. Thanks for help analyzing. > A large part of this difference is in cascading. I see 4.0 ms in > WalkRuleTree vs 8.4 ms in properties::cascade. I believe this is the result > of the eager vs lazy cascade architectures. Specifically, the profiles show > a big restyle happening between frame construction and the first call to > DoReflow, which means that much/all of the style data was thrown away before > layout actually used it. So the lazy computation performs much better in > this case, because it avoids a lot of the work that would be wasted anyway. > The right fix here, of course, is to not do the wasted work at all. IIUC, > bug 1439875 should fix this. I suppose by the reflow, and big restyle, you mean the one under DOMContentLoaded? That probably wouldn't be fixed by bug 1439875, because that restyle is caused by updating the value of a custom property. That custom property was moved down from root to #tabbrowser-tabs in bug 1435122, so the restyle has been much shorter than before, but it's still outstanding. Unless we can move the value's setting to somewhere before the frame construction, we cannot fix the waste here. IIRC that code is in some binding initialization code, so it's probably not fixable at all. > That leaves us with 2.5 ms discrepancy. Filtering out LoadBindings, we get > [7], [8]. In that, the obvious difference is handling of URLValueData, where > stylo spends 2.3 ms and Gecko doesn't appear to do any equivalent work. We > should brainstorm ways to fix this. So, Gecko does the same thing. It seems the biggest difference is that, Gecko shares URLValueData between specified value and computed value, which means for each rule, the url only needs to be resolved once, while for Stylo, we create a new URLValueData each time, [1] so we have to resolve them for each element rather than each used rule. This is probably the most important reason why this is happening. Across the whole range in profile (including those outside tpaint measurement), URLValueData::GetURI() takes 3.7ms in Stylo vs. 0.3ms in Gecko, and the ctor takes 0.4ms in Stylo vs. 0.0ms in Gecko, dtor takes 0.5ms in Stylo vs. 0.0ms in Gecko. It seems this is a pretty big difference we should fix. Keeping ni? myself for investigating the LoadBindings and selector matching issues. [1] https://github.com/servo/servo/blob/e3f69668aefaaac36e59509bd978dfda05018486/components/style/properties/gecko.mako.rs#L942-L949

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

7 years ago

Depends on: 1443046

Bobby Holley (:bholley)

Comment 82

•

7 years ago

Do you have instructions for generating the profiles from comment 76?

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 83

•

7 years ago

The instructions are in comment 4, and the new version of the file in comment 64 added the measurement range to the profile.

:Gijs (he/him)

Comment 84

•

7 years ago

(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #81) > (In reply to Bobby Holley (:bholley) from comment #77) > > (In reply to Xidorn Quan [:xidorn] UTC+10 from comment #76) > > > New profiles with 0.1ms interval: > > > Gecko: https://perfht.ml/2F0eE76 > > > Stylo: https://perfht.ml/2F0HrbC > > > > Thanks for posting the profiles Xidorn. I spent some time digging into them, > > and came up with various interesting bits. > > Thanks for help analyzing. > > > A large part of this difference is in cascading. I see 4.0 ms in > > WalkRuleTree vs 8.4 ms in properties::cascade. I believe this is the result > > of the eager vs lazy cascade architectures. Specifically, the profiles show > > a big restyle happening between frame construction and the first call to > > DoReflow, which means that much/all of the style data was thrown away before > > layout actually used it. So the lazy computation performs much better in > > this case, because it avoids a lot of the work that would be wasted anyway. > > The right fix here, of course, is to not do the wasted work at all. IIUC, > > bug 1439875 should fix this. > > I suppose by the reflow, and big restyle, you mean the one under > DOMContentLoaded? > > That probably wouldn't be fixed by bug 1439875, because that restyle is > caused by updating the value of a custom property. That custom property was > moved down from root to #tabbrowser-tabs in bug 1435122, so the restyle has > been much shorter than before, but it's still outstanding. > > Unless we can move the value's setting to somewhere before the frame > construction, we cannot fix the waste here. IIRC that code is in some > binding initialization code, so it's probably not fixable at all. Tabbrowser code is now mostly outside of a XBL binding. It's going to be entirely outside of a XBL binding after bug 1442651. The value just comes from a pref, so I think we could set it sooner fairly easily, if that had significant performance benefits.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 85

•

7 years ago

(In reply to :Gijs (under the weather; responses will be slow) from comment #84) > Tabbrowser code is now mostly outside of a XBL binding. It's going to be > entirely outside of a XBL binding after bug 1442651. The value just comes > from a pref, so I think we could set it sooner fairly easily, if that had > significant performance benefits. That doesn't seem to be the only thing here... Probably need more investigation into how can we avoid restyle between the initial styling and reflow. That generally means we want to commit as many changes as possible before we start styling / frame construction.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 86

•

7 years ago

(In reply to Bobby Holley (:bholley) from comment #77) > That leaves us with 6.6 ms discrepancy. Filtering the match functions out, > we get [5], [6]. The next interesting thing I see is time spent in > LoadBindings, 16.4 ms vs 20.5 ms. 1.6 ms of this is traversal overhead > (Servo_TraverseSubtree minus the actual styling work we've already filtered > out) from all the extra work we do to style XBL subtrees, which may be > unavoidable until the frontend moves to Shadow DOM. But the rest of it is > roughly evenly distributed across the callees of LoadBindings, most of which > are not style-related (InstallImplementation, InstallEventHandlers, etc). > This makes my strongly suspect that we're invoking LoadBindings more times > in total under Stylo compared to Gecko. We should measure this and figure > out why it's happening. It doesn't seem that we invoke LoadBindings more times. I added some logging code to track all calls into LoadBindings. There is some slightly different order of bindings loaded in the first window (but not the rest window opened by the test), but everything else seems to work exactly the same in Stylo and Gecko.

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 87

•

7 years ago

(In reply to Bobby Holley (:bholley) from comment #77) > Accounting for that, we still have an 7.9 ms discrepancy. Filtering the > cascade functions out, we get [3], [4]. Next we look at selector matching. > On average per run, I see 4.1 ms in nsStyleSet::FileRules, vs 5.4 ms in > match_{primary,pseudo}. This is a bit strange to me, because I generally > expect to see less selector matching overhead in Stylo. One obvious thing to > check would be the number of elements matched in each case - is selector > matching slower, or are we matching more elements? In fact, we really should > be matching _fewer_ elements given the style sharing cache. So another thing > would be to verify whether the style sharing cache is working correctly. > Perhaps something about XUL elements/documents is tripping it up? It seems we do have a decent style sharing on the top level browser.xul document, but we cannot share style across subtree of different bound element because we explicit treat them as having different style scope, [1] but the comment in that code seems to be wrong, in the sense I do see many "Miss: Different style scopes" in the log. Maybe we can remove that check and allow element under different bound element share style as far as they pass the parent check? [1] https://searchfox.org/mozilla-central/rev/bffd3e0225b65943364be721881470590b9377c1/servo/components/style/sharing/mod.rs#659-666

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 88

•

7 years ago

So that's responses for all potential issues bholley found from the profiles. Clearing ni? now.

Flags: needinfo?(xidorn+moz)

Bobby Holley (:bholley)

Comment 89

•

7 years ago

(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #87) > Maybe we can remove that check and allow element under different bound > element share style as far as they pass the parent check? > > [1] > https://searchfox.org/mozilla-central/rev/ > bffd3e0225b65943364be721881470590b9377c1/servo/components/style/sharing/mod. > rs#659-666 NI emilio, who has a much better sense of how style sharing and XBL interact. Also note that, in profiling with bug 1439875 applied, it appears that a large amount of the remaining difference comes from Servo_InvalidateStyleForDocStateChanges. Emilio was wondering if we could get better performance by matching selectors right to left, and was going to send me a patch tomorrow to measure.

Flags: needinfo?(emilio)

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 90

•

7 years ago

IIUC, we currently match complex selector from right to left, but each compound selector for left to right? Is he proposing to change the latter to right to left as well? I think we intentionally changed that order to this direction with the assumption that authors may put cheaper selectors to the left. If it is our chrome stylesheets which have a different assumption in mind, we probably should try to rewrite them to the order we like. Maybe we can invent a tool to rewrite selector so that it works the best with the style engine?

Emilio Cobos Álvarez (:emilio)

Comment 91

•

7 years ago

(In reply to Bobby Holley (:bholley) from comment #89) > (In reply to Xidorn Quan [:xidorn] UTC+10 from comment #87) > > Maybe we can remove that check and allow element under different bound > > element share style as far as they pass the parent check? > > > > [1] > > https://searchfox.org/mozilla-central/rev/ > > bffd3e0225b65943364be721881470590b9377c1/servo/components/style/sharing/mod. > > rs#659-666 > > NI emilio, who has a much better sense of how style sharing and XBL interact. > > Also note that, in profiling with bug 1439875 applied, it appears that a > large amount of the remaining difference comes from > Servo_InvalidateStyleForDocStateChanges. Emilio was wondering if we could > get better performance by matching selectors right to left, and was going to > send me a patch tomorrow to measure. We can indeed _almost_ remove that check. That check is needed because of Shadow DOM, where the style of the parent doesn't change but the rules applying to children do. We didn't use to differentiate Shadow DOM and XBL, so we needed to use the binding parent. This is all effectively bug 1422651. Fortunately Shadow DOM and XBL now are independent (bug 1425759), so we should be able to fix this checking containing_shadow instead. I'll write a patch.

Flags: needinfo?(emilio)

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

7 years ago

Depends on: 1422651

Emilio Cobos Álvarez (:emilio)

Comment 92

•

7 years ago

(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #90) > IIUC, we currently match complex selector from right to left, but each > compound selector for left to right? > > Is he proposing to change the latter to right to left as well? I think we > intentionally changed that order to this direction with the assumption that > authors may put cheaper selectors to the left. The issue is that we reorder selectors during parsing to match this, so when you match in the usual order, right to left, you're usually matching in the "fast" order, that is, if you wrote "div:first-child", you first match div, then :first-child, because we store it as [Component(:first-child), Component(div)]. However, for invalidation, we match them on the opposite order (because we don't know where the next combinator is, so we look for it). It may be actually more efficient just lookahead to the next combinator, then go back, that is, if we're invalidating a selector like: [:-moz-window-inactive, [foo="bar"], :root] As of right now we'd match them in that order for invalidation, though for normal selector-matching we'd match :root first. The hypothesis is that it may be faster actually look ahead until :root, then go back. Does that make sense?

No longer depends on: 1422651

Flags: needinfo?(xidorn+moz)

Emilio Cobos Álvarez (:emilio)

Comment 93

•

7 years ago

Err, sorry, mid-air, didn't intend to remove the dependency.

Depends on: 1422651

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 94

•

7 years ago

Per discussion with emilio on IRC, so the problem in invalidation here is that we match the selectors in the reverse order of how we store it. We store the selectors in a mixed order, that means we are effectively using a different order of simple selectors inside one compound selector for invalidation than what we do in normal selector matching (rtl vs. ltr). His proposal is to make invalidation use the same order of simple selectors in compound selector. That makes sense. It could be tricky, though.

Flags: needinfo?(xidorn+moz)

Jet Villegas (inactive)

Comment 95

•

7 years ago

(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #94) > Per discussion with emilio on IRC, so the problem in invalidation here is > that we match the selectors in the reverse order of how we store it. I can see how that might be slower. > We store the selectors in a mixed order, that means we are effectively using a > different order of simple selectors inside one compound selector for > invalidation than what we do in normal selector matching (rtl vs. ltr). > > His proposal is to make invalidation use the same order of simple selectors > in compound selector. That makes sense. It could be tricky, though. Trickier than fixing storage serialization? Looking for the quick and safe fix here :-)

Jet Villegas (inactive)

Updated

•

7 years ago

Flags: needinfo?(emilio)

Emilio Cobos Álvarez (:emilio)

Comment 96

•

7 years ago

Yup, I'm looking into it, no worries. I just got side-tracked this morning with an annoying style sharing bug while trying to get https://github.com/servo/servo/pull/20223 landable (which should be now).

Emilio Cobos Álvarez (:emilio)

Updated

•

7 years ago

Depends on: 1443814

Emilio Cobos Álvarez (:emilio)

Comment 97

•

7 years ago

I filed bug 1443814 for comment 94, with a patch as soon as mozreview reacts.

Flags: needinfo?(emilio)

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 98

•

7 years ago

So in tabpaint, we do have some restyles much longer than gecko. These are profiles for (modified) tabpaint: * Gecko: https://perfht.ml/2Di15ej * Stylo: https://perfht.ml/2Dk7CoJ (both captured with STYLO_THREADS=1) Like before, the short black bars for UserTiming is roughly what is measured. (To be exact, they are different. tabpaint uses timestamp from MozAfterPaint, but the performance mark is set in handler of MozAfterPaint, and they do show difference. But it's probably enough for analysis.)

Xidorn Quan [:xidorn] UTC+10

Reporter

Comment 99

•

7 years ago

(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #98) > Like before, the short black bars for UserTiming is roughly what is > measured. (To be exact, they are different. tabpaint uses timestamp from > MozAfterPaint, but the performance mark is set in handler of MozAfterPaint, > and they do show difference. But it's probably enough for analysis.) I mean, the short black bars in Content (2 of 2), but the issue we want to investigate is mostly in the Main Thread.

Bobby Holley (:bholley)

Comment 100

•

7 years ago

(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #98) > So in tabpaint, we do have some restyles much longer than gecko. These are > profiles for (modified) tabpaint: > * Gecko: https://perfht.ml/2Di15ej > * Stylo: https://perfht.ml/2Dk7CoJ > (both captured with STYLO_THREADS=1) Yeah, the parallelism may be responsible for the tabpaint wins we see in bug 1438775 comment 14. Definitely worth having a look to see if there are any obvious ways we can speed things up more. Taking a quick look, I see more time cascading during the timed interval, but it's closer to parity when you consider the entire timeline (because Gecko's lazy cascading ends up costing more later). I also see some time in ProcessPostTraversal, which is understandable given that we're restyling here.

Xidorn Quan [:xidorn] UTC+10

Reporter

Updated

•

7 years ago

Depends on: 1445566

Julien Cristau [:jcristau] (back August 18)

Comment 101

•

7 years ago

The situation in 60 has been deemed good enough (bug 1438775 comment 17 and 18), and work is still ongoing for further improvements in 61.

status-firefox60: affected → wontfix

status-firefox61: --- → affected

Maire Reavy [:mreavy]

Updated

•

7 years ago

Keywords: meta

Maire Reavy [:mreavy]

Updated

•

7 years ago

Summary: stylo: Large regression on some talos tests when enabling stylo-chrome → [meta] stylo: Large regression on some talos tests when enabling stylo-chrome

Sylvestre Ledru [:Sylvestre]

Comment 102

•

7 years ago

Moving to p3 because no activity for at least 24 weeks.

Priority: P1 → P3

BMO Automation

Updated

•

3 years ago

Severity: normal → S3

modified tpaint file for local profiling 8 years ago Xidorn Quan [:xidorn] UTC+10 1.32 KB, text/html		Details
style: Make insertion of @font-face rules not necessarily restyle the whole document. 8 years ago Emilio Cobos Álvarez (:emilio) 59 bytes, text/x-review-board-request		Details
Bug 1420423: Optimize @keyframes rule insertions. 8 years ago Emilio Cobos Álvarez (:emilio) 59 bytes, text/x-review-board-request		Details
modified tpaint file for local profiling, v2 7 years ago Xidorn Quan [:xidorn] UTC+10 1.43 KB, text/html		Details