Closed Bug 1441037 Opened 2 years ago Closed 2 years ago
stylo-chrome significantly regresses Fluent
This is a spin-off from bug 1441035. I strongly suspect that the recent landing of stylo-chrome (bug 1417138) significantly regressed Fluent performance on the startup path. Quoting from bug 1441035 comment 1: > The status as of January 2018 was  (Windows 10 PGO): > > * sessionrestore - ~2.95% hit (11ms) > * tpaint - ~2.69% hit (6ms) > * ts_paint - ~0.38% win (1.9ms) > > As you can see the improvement was pretty drastic and that's without the C++ `document.localize` patch. > > Unfortunately, since then we regressed again. February 23rd: > > * sessionrestore - ~8.62% hit (34ms) > * tpaint - ~15.93% hit (35.8ms) > * ts_paint - ~6.23% hit (31.7ms) > > What's worse, the C++ DOM patch doesn't seem to help at all anymore. > I suspect the landing of stylo-chrome (bug 1417138) to be responsible, so I'll file a separate bug to investigate that. High level overview of what Fluent-Gecko does during browser.xul startup is: 1) It loads l10n.js synchronously and immediately synchronously querySelectorAll the document for "link[rel=localization]"  2) It asynchronously triggers the fetch of I/O resource needed for the l10n context (in this case, a single file from resource:// protocol)  3) It reacts to `MozBeforeInitialXULLayout` which is fired right before `StartLayout`  4) It querySelectorAll("[data-l10n-id]"), formats all messages required and applies translation onto their textContent and attributes  The naive logic would expect that (4) is the most expensive step, but the profiles show that it actually doesn't take much time and what's more, the experiment with the C++ DOM patch indicates that moving that code to between JS and C++ doesn't affect the talos tests at all anymore (plus the code is heavily cached so tpaint should not be affected) In our testing more than a year ago we were surprised to discover that the biggest performance hit was actually caused by (1) - because the querySelectorAll happened so early, we were the first code that required those elements to become available in JS thus creating their reflections, and that was very costly. That seemed to get optimized in the old style engine for Firefox Quantum reducing the penalty. Now with stylo, maybe we're back to that issue? The other idea is that bug 1422653 may be related since we know that attaching XBL bindings blocks frame creation and it was one of the issues back in 2016. Once again, there's a chance that stylo brings that back (or even makes it worse?).  https://searchfox.org/mozilla-central/rev/b469db5c90d618f4b202d5ef280e1a78224eb43b/intl/l10n/l10n.js#54  https://searchfox.org/mozilla-central/rev/b469db5c90d618f4b202d5ef280e1a78224eb43b/intl/l10n/l10n.js#59  https://searchfox.org/mozilla-central/rev/b469db5c90d618f4b202d5ef280e1a78224eb43b/dom/xul/XULDocument.cpp#2690  https://searchfox.org/mozilla-central/rev/b469db5c90d618f4b202d5ef280e1a78224eb43b/intl/l10n/l10n.js#67
Xidorn - can you profile the startup path with patch from bug 1441035 and see if there's anything new that may be causing this regression or any low hanging fruits that would make stylo work better with fluent?
One thing what I am bit concerned is that documentReady() uses Promises inside it. So translateRoot() might delay. bug 1193394 will fix such delays, FWIW.
> One thing what I am bit concerned is that documentReady() uses Promises inside it. So translateRoot() might delay. bug 1193394 will fix such delays, FWIW. Yeah, we're aware of bug 1193394. In fact, the current code has been landed very recently (bug 1437427) to workaround the limitation. I profiled the result and currently we do trigger the `translateRoots` synchronously after `BeforeXULInitialLayout` which you can see in the profile. I'll remove the workaround as soon as bug 1193394 lands.
I didn't notice that at all. That's good to hear. :) translateElements()s inside Promise.all() in translateRoot() are also synchronously processed? I am not convinced that they are synchronously processed.
> translateElements()s inside Promise.all() in translateRoot() are also synchronously processed? I am not convinced that they are synchronously processed. Interesting observation. I can't reason about why it is, but I can confirm that in Gecko Profiler I see all of the translateRoots code within the MozBeforeInitialXULLayout event start/end boundaries. Maybe it's because its just one root at the moment? If you want to test that you can either apply the browser-menubar patch or even without that just profile the Preferences tab opening. It contains fluent l10n already and you can find the `MozBeforeInitialXULLayout` there and all the code called via `l10n.js` or `DOMLocalization.jsm`.
I measured the 274ms slice on the first profile and 332ms one on the second profile (which are the shortest ones in their profiles respectively). This has a total of 58ms difference. Before DOMContentLoaded, the only difference seems to be the time taken in MozBeforeInitialXULLayout, which is 25ms. In that, Element::WrapObject alone takes 22.6ms. Inside Element::WrapObject, Element::GetBindingURL takes 17.5ms, and nsXBLService::LoadBindings takes 3.9ms. So Element::GetBindingURL does seem to be a problem here.
Here are my profiles as per emilio's request: Revisions: 1) m-c: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f43e170bb075 2) fluent: https://treeherder.mozilla.org/#/jobs?repo=try&revision=856f6e64e1ea 3) fluent + C++ patch: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5bcb19bea5b5 Talos: 1 vs 2: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f43e170bb075&newProject=try&newRevision=856f6e64e1ea450b5e11e23a91248df20a9549bd&framework=1 1 vs 3: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f43e170bb075&newProject=try&newRevision=5bcb19bea5b5a7f994087c8e52c08233d3b75589&framework=1 2 vs 3: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=856f6e64e1ea&newProject=try&newRevision=5bcb19bea5b5a7f994087c8e52c08233d3b75589&framework=1 Profiles (Windows 10, same m-c base revision): Command: `MOZ_PROFILER_STARTUP=1 MOZ_PROFILER_STARTUP_ENTRIES=40000000 MOZ_PROFILER_STARTUP_INTERVAL=0.3 ./mach run` 1) Testing (1) https://perfht.ml/2CjQCCZ 2) Testing (2) https://perfht.ml/2or9fMT 3) Testing (3) https://perfht.ml/2ossaXz Seems like the C++ patch doesn't change the fact that we're still doing the Element::GetBindingURL. (protip: any easy way to filter out to just see what Fluent is doing is to switch to JS stack on the main thread in the profile and filter for `l10n`)
So... I think the C++ patch in bug 1363862 should still work. From the profile, it seems the reason it doesn't help is because the patched function is not used at all. The patch there changes DOMLocalization.translateFragment to use the new API, but the code actually calls to DOMLocalization.translateElements via DOMLocalization.translateRoots directly.
Oh! You're right! I carried over the code from last year, but the Fluent logic slightly changed in the mean time! I updated the patch in bug 1363862 and my initial profiling indicates that the time spent in MozBeforeInitialXULLayout dropped from 17ms to 1.1ms! I pushed it talos as https://treeherder.mozilla.org/#/jobs?repo=try&revision=a74e16e7ce6ce9afee26397d26283b375fed9001 Expect the m-c to fluent+DOM patch in https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f43e170bb075&newProject=try&newRevision=a74e16e7ce6c&framework=1 Thank you emilio and xidorn for help! Once the talos numbers come back and if they look good, we'll be able to close this bug or at least identify new targets to improve in stylo/fluent.
And the first talos tests coming back indicate that with patch from bug 1363862 the stylo-chrome regression is gone! To summarize: stylo-chrome did regress Fluent performance on the startup path, most likely because stylo computes all style structs, while gecko only lazily do so. Most of the regression was located in `Element::GetBindingURL` which is where Stylo attempts to resolve XBL attached via CSS. The good news is that all of the regression is in the phase of Fluent localization that is covered by bug 1363862 API! This means that with the patch to add `document.localize` we are not affected by the stylo-chrome regression anymore, and what's even more impressive - we're not regressing startup at all! That's a great news because that means that instead of overoptimizing Stylo for the XUL+XBL scenario to prevent Fluent regressions, we can use the C++ API, and also it means that this particular regression should not happen without XBL and in the HTML (and Web Components) model. The only downside here, is that non-privileged XUL is going to be still affected, but I believe that all non-privileged XUL is not on any performance critical path! We'll have to separately profile HTML performance and decide if we still want the C++ API there, or is the Stylo+JS in the HTML scenario good enough. Xidorn mentioned several potential improvements based on his profiling, but the only one I was able to extract from the IRC conversation is that we may want to optimize the early reflection issue at some point. :xidorn - is there a bug for that? Are there any other open stylo bugs that may be worth keeping track on in the context of fluent?
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #11) > stylo-chrome did regress Fluent performance on the startup path, most likely > because stylo computes all style structs, while gecko only lazily do so. > Most of the regression was located in `Element::GetBindingURL` which is > where Stylo attempts to resolve XBL attached via CSS. It is where the DOM queries the style system to resolve XBL. > Xidorn mentioned several potential improvements based on his profiling, but > the only one I was able to extract from the IRC conversation is that we may > want to optimize the early reflection issue at some point. > > :xidorn - is there a bug for that? Are there any other open stylo bugs that > may be worth keeping track on in the context of fluent? So... what I mentioned is that we may want to skip querying the style system about binding url attached on the element when we know there definitely cannot have one for HTML, so that we don't waste time on computing anything. But I just had a look at the code, and it seems we already skip that for all HTML element except <object> and <embed>, so I guess there isn't much for stylo to optimize further here.
Thank you! Marking as fixed!
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.