Closed Bug 1441035 Opened 2 years ago Closed 11 months ago

[tracking] Improve performance of Fluent on the startup path

Categories

(Core :: Internationalization, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED

People

(Reporter: zbraniecki, Assigned: zbraniecki)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(Keywords: meta)

Attachments

(3 obsolete files)

Fluent is a new localization system introduced to Gecko to replace DTD/StringBundles.

It brings a lot of features, provides better quality and security and ties better into internationalization layer, but at the moment it bring regression in several talos tests.

Let's investigate those.
Blocks: 1365426
Depends on: 1363862, 1384236, 1437717
Priority: -- → P3
Summary: Improve performance of Fluent on the startup path → [tracking] Improve performance of Fluent on the startup path
We did a pretty complete investigating of the performance on tpaint and ts_paint a year ago, and documented it at https://wiki.mozilla.org/L20n/Firefox/Performance (that's late 2016).

Since then we spent almost a year focused on extracting Fluent out of L20n. When we started looking into performance again it was much better, with our hypothesis being that a combination of improvements in the DOM, JS engine and style systems around Firefox Quantum helped us reduce the performance hit.

The main patch we're using to test the startup performance is called "browser-menubar" and moves the main menubar which is part of browser.xul [0] to Fluent.
The patch is around 100 strings which is a vast majority of strings currently on the startup path (whether the browser menubar should be on the startup path is a separate conversation, but it helps us with testing).

The status as of December 2016 was:

* sessionrestore - ~5.1% hit (44ms)
* sessionrestore(e10s) - ~6.3% hit (47ms)
* tpaint - ~14% hit (40ms)
* tpaint(e10s) - ~13% hit (36ms)
* ts_paint - ~1.5% hit (14ms)
* ts_paint(e10s) - ~1.7% hit (15ms)

At the time, with the help of :smaug we did some profiling and Olli wrote a POC patch that finds all nodes that have `data-l10n-id` attribute, and translates them into C++ based on the assumption that creating JS reflection of those nodes is what makes us slower.
(the patch can be found in bug 1363862)

At that time the patch helped a lot reducing hits on tpaint from ~13% to 3% and on ts_paint to zero.

The status as of January 2018 was [1] (Windows 10 PGO):

* sessionrestore - ~2.95% hit (11ms)
* tpaint - ~2.69% hit (6ms)
* ts_paint - ~0.38% win (1.9ms)

As you can see the improvement was pretty drastic and that's without the C++ `document.localize` patch.

Unfortunately, since then we regressed again. February 23rd[2]:

* sessionrestore - ~8.62% hit (34ms)
* tpaint - ~15.93% hit (35.8ms)
* ts_paint - ~6.23% hit (31.7ms)

What's worse, the C++ DOM patch doesn't seem to help at all anymore[3].
I suspect the landing of stylo-chrome (bug 1417138) to be responsible, so I'll file a separate bug to investigate that.

Once that's fixed, we'll have to look at what else is delaying us and see which of the dependencies of this bug can help us get to a green (or at least white) talos.


[0] https://searchfox.org/mozilla-central/source/browser/base/content/browser-menubar.inc
[1] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=5e6ef7eae125&newProject=try&newRevision=afa7f521620fecb85e6d428ad607a9f6c9bc2b5d&framework=1
[2] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1a95c22f842b&newProject=try&newRevision=d40dbf52dd092c134a87b936b76ab9ce5c221ec7&framework=1
[3] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1a95c22f842b&newProject=try&newRevision=ada74456f5af6104abf44ffed7e5f059431b72a0&framework=1
Attached patch browser-menubar.patch (obsolete) — Splinter Review
This is the patch you can apply onto m-c to test impact of Fluent on the startup path (mostly for the DTD->Fluent migration)
Depends on: 1441037
Based on bug 1441037 comment 11: with the (updated) patch from bug 1363862 the talos looks good: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f43e170bb075&newProject=try&newRevision=a74e16e7ce6c&framework=1

All other dependencies are now optional, but we will keep this bug open and monitor until we land in the startup path.
Here's a comparative view between m-c, browser-menubar(without Node.localize) and browser-menubar(with Node-localize) - https://pike.github.io/talos-compare/?revision=efb326bbd665&revision=8b5ebddb2069&revision=e2c48508d0aa
In bug 1437921 bz added a ChromeOnly Promise on `document` that resolves when DOMContentLoaded or Layout is done. We could use a similar technique instead of MozBeforeInitialXULLayout to unify the trigger between chrome-only HTML and XUL.
Attached patch browser-menubar.patch (obsolete) — Splinter Review
Attachment #8953881 - Attachment is obsolete: true
Here's a startup profile from today's m-c with the attached patch: https://perfht.ml/2I8miLj

The things I can identify in the profile:

* I/O takes ~10ms
* nsBrowserGlue.js registering L10nRegistry source takes ~13ms
* parsing takes ~2.4ms
* translateRoots takes ~1.5ms

The last one is a bit surprising because the whole `MozBeforeInitialXULLayout` takes ~5.9ms and it should be just translation, so my guess is that the actual that the translation takes is the 6ms.

Things that look good:

* The total cost doesn't seem very high
* We seem to be doing the right things at the right time
* 

Things worth investigating:

* Why is L10nRegistation so expensive?
* Why is loading DOMContentLoaded taking 9ms?
* Why is l10n.js running time 30ms but self time is just 10ms?
* Why is there DOMContentLoaded before MozBeforeXULInitialLayout?

There may be other things in the profile that are worth investigating. I'll redo it after 60 branches off with symbols to look more into C++ side of things.
Depends on: 1317481
Depends on: 1456388
This becomes a P1 as of today since it's a requirement for the Capstone project and, overall, migration to Fluent. We've seen some minor wins with bug 1455649 and SpiderMonkey has been optimizing a lot of codepaths that we care about (generators, strings, concatenations, etc.), so I'd like to remeasure the impact.

This time, instead of migrating the whole menubar, I'll just inject a single string from FTL and monitor the impact.

The hypothesis is that we will be slightly slower in tests because DTD goes into the XUL FastLoad cache which makes that system unobservable in all runs after the first. ts_paint and tpaint take 20/40 runs, so for 39 runs the document is not getting localized.

Fluent caches the FTL resource parsing, but on each run has to apply the localization onto the DOM. We may have to decide if we want to add Fluent to the cache (and who would be qualified to touch that), swallow the perf impact for the greater good, or see if we can minimize the penalty.
Assignee: nobody → gandalf
Status: NEW → ASSIGNED
Priority: P3 → P1
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #10)
> Talos run between central from today and a single string migrated to Fluent:
> https://treeherder.mozilla.org/perf.html#/
> compare?originalProject=try&originalRevision=13affc067c58&newProject=try&newR
> evision=5a5aa2504bcf3e79144563c353c874a982d03935&framework=1
> 
> The patch:
> https://hg.mozilla.org/try/rev/905ede35bdf415c8109081b0acc1e96dc90cc946

Yeah, this doesn't look great:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=13affc067c58&newProject=try&newRevision=5a5aa2504bcf3e79144563c353c874a982d03935&framework=1&showOnlyImportant=1&showOnlyConfident=1

2-6% regressions across ts_paint, tpaint, sessionrestore and tabpaint.
As of today, the talos shows:

 - 4ms on ts_paint across the platforms (pgo)
 - 1ms on tpaint across the platforms (pgo)
 - 4-6ms on sessionrestore accross the platforms (pgo)

It is not as bad as I expected, and certainly better than in the past.

I'll now try to disable XUL fastload cache to see how much of that difference can be explained with it.
> 2-6% regressions across ts_paint, tpaint, sessionrestore and tabpaint.

I would expect tabpaint to be noise tbh.
I collected the profiles. Let me know if that's a good start.
Flags: needinfo?(bugs)
Ok, so TriggerInitialDocumentTranslation() is 2ms or so, and it is mostly about wrapping XUL elements. That is about DOMLocalization.jsm. So, rewriting DOMLocalization.jsm in C++ would be good from performance perspective.
addResourceIds is 1ms. That is rather deep JS stack. All that would be better in C++ too.

So, as far as I see, most of the overhead comes from backend implementation being JS.
Flags: needinfo?(bugs)
Thanks Olli!

We had a longer conversation on #content on IRC yesterday about strategies to approach this. I'll try sum them up today, but first, Brian asked for a talos run with initialization but TriggerInitialDocumentTranslation returning early.

Here's the talos: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=317ddd31da97&newProject=try&newRevision=5f9023f430c5&framework=1

Ignore the about_preferences result and focus on sessionrestore, tpaint, ts_paint - all within 1% of the central.


Here's talos-compare: https://pike.github.io/talos-compare/?revision=317ddd31da97&revision=5f9023f430c5&revision=ba16bad555a1&revision=9d2d72e91bb4

And the zoom-in on ts_paint: https://screenshots.firefox.com/yqHTu3WmJJg4SRJL/pike.github.io
1) central
2) early return from TriggerInitialDocumentTranslation
3) early return from translateRoots
4) translate elements in browser.xul

Reading the results - I don't think there's a big difference between (3) and (4), which is unexpected. There is some cost between (1) and (2), but it stays within 1% and doesn't trigger "significant" regression on talos. There's also some regression between (2) and (3). Together they constitute a significant regression (1)->(3).
I updated the three patches from the last bug to today's m-c: (1), (2) and (4).


1) mozilla-central
https://treeherder.mozilla.org/#/jobs?repo=try&revision=72eed19d131e8994d9abdbae2cfefe962f6e63b7

2) disable translations, added fluent init
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cf48b65831b99d326bdd1752b87e8510edab449f

3) single string translated
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3879decb5bed954832f0c8e3257c802adb7b59e8

The performance is actually quite good! Here's a compare between (1) and (3):
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=72eed19d131e&newProject=try&newRevision=3879decb5bed&framework=1

I still claim there is a regression, I'd estimate 3-4ms on ts_paint, 1-5ms on tpaint and ~5ms on sessionrestore, but it's not marked as significant and they translate to ~1% of the number which makes it tricky to try to hunt down due to small band in which we're looking.

I'm still interested in moving most of the DOMLocalization to C++ (DocumentL10n), since smaug says that we're likely mostly regressing because of when we have to touch DOM during DOMLocalization, but it may be that we're close to be ready to enable Fluent on the startup path!
I pushed a patch that migrated 150 strings on the startup path to Fluent. Let's see what's the talos perf of that.
Attachment #8956709 - Attachment is obsolete: true
I'm now analyzing 4 scenarios:

1) mozilla-central
2) Fluent initialized on the startup path, but TriggerInitialDocumentTranslation disabled
3) Fluent initialized and a single string localized
4) Fluent initialized and 150 strings (browser-menubar) localized


Here's a talos-compare for those 4 runs: https://pike.github.io/talos-compare/?revision=72eed19d131e&revision=cf48b65831b9&revision=3879decb5bed&revision=c6f68b69aaf2

The only tests that are important to us here are: tpaint, ts_paint and sessionrestore.

Looking at them I see two jumps:

a) between (1) and (2)
b) between (3) and (4)

That indicates to me that there is some static cost associated with the initialization of Fluent, and then there's a per-string cost.

The total regressions as of today:
 - ts_paint 1-2%
 - tpaint 1-5%
 - sessionrestore 1-4%
I collected profiles for the 150 strings scenario by disabling everything in DocumentL10n.cpp and re-enabling one by one.

The following scenarios were tested now:

1) m-c
2) browser-menubar with everything in DocumentL10n.cpp disabled except of do_CreateInstance of mozIDOMLocalization
3) browser-menubar with everything disabled and AddResources reenabled (triggers I/O, parsing etc.)
4) browser-menubar with everything disabled and AddResources + RegisterObservers reenabled
5) browser-menubar with everything disabled and AddResources + RegisterObservers + ConnectRoot reenabled
6) browser-menubar with everything enabled

Here is a talos-compare for them: https://pike.github.io/talos-compare/?revision=72eed19d131e&revision=658da3c3bbfd&revision=3f6f64513a42&revision=578c680b5785&revision=e07bdbfcd483&revision=c6f68b69aaf2

What I think is interesting is that there seem to be a difference in when the "jump" happens

- in sessionrestore the perf jump happens between (2) and (3), and for ts_paint it happens
- in ts_paint there are two, one at (2)->(3) and the other at (5)->(6)
- in tpaint it only happens in (5)->(6)

(you may have to remove the other tests to see the jumps in the graphs)

Here are profiles:

(1):
https://perfht.ml/2yaDZWp
https://perfht.ml/2yeIorh
https://perfht.ml/2ybCaIT
https://perfht.ml/2y9eCnQ
https://perfht.ml/2ycUlh3

(6):
https://perfht.ml/2y9eaWG
https://perfht.ml/2yeI5N9
https://perfht.ml/2ygjtDG
https://perfht.ml/2ycxceN
https://perfht.ml/2yh5Ow6


Olli - my next step is to try to move some pieces of DOMLocalization to C++ and see if there's any impact, but I hope the profiles are ready for profiling!
Flags: needinfo?(bugs)
L10nRegistry.jsm handling the fetch shows up, the parsing part. I think this is actually the biggest single thing here.
Could we load and parse in some other thread and/or use native code here.
And put ftl to startup cache?  (This stuff isn't too trivial, but I think needs to be done eventually)
Actually, just using startup cache should be enough in practice. No need to load and parse stuff all the time.

There is something I don't understand. We call translateRoots way before L10nRegistry has loaded and parsed some file. Is that expected?



DOMLocalization stuff is mostly just it being JS. Many small things taking totally a bit too much time.

Would it be possible to not handle mutations if MozBeforeInitialXULLayout listener triggers mutations (or something before it). Could we deal with mutations - just guessing here - in an rAF callback?
Handling mutations forces creating wrappers at that point.

Just loading DOMLocalization takes a tiny bit time.



So, nothing new here. Having too much JS in very hot code paths shows always up in the profiles.
Flags: needinfo?(bugs)
Thanks Olli!

I'm going to investigate other things you noticed, but initially, I was concerned about your report about mutations - since the `browser-menubar` patch translates 150 strings in browser.xul without any injections from JS I'd expect MutationObserver to never kick in on the startup path (it's used in Preferences).

I looked myself and found it, so I modified the code we're testing against to only trigger `TranslateFragment(doc)` from DocumentL10n::TriggerInitialDocumentTranslation rather than whole `ConnectRoot(doc`).

This means that we're not connecting any MutationObserver and we're not settings directionality.

Here's the patch compared to "full" (6) - https://hg.mozilla.org/try/rev/3aa5da028d668d2393be4d05ccb159e68d760b42

I tested the perf impact of this change comparing it to (6): https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=c6f68b69aaf2&newProject=try&newRevision=e13cf0203a85&framework=1

There doesn't seem to be any impact on sessionrestore, tpaint or ts_paint. The win on about_preferences is to be expected (since they use dynamically injected localizable elements at startup).

So, my conclusion is that this modification still contains all the code that causes the perf cost, but the profile is cleaner, so maybe easier to evaluate.

Olli, can you look at:
https://perfht.ml/2OEePcX
https://perfht.ml/2OE84rq
https://perfht.ml/2OAZoC9
https://perfht.ml/2OEyPfs
https://perfht.ml/2OyA1RL

and check if it still seems the same as your conclusion from comment 24?
Flags: needinfo?(bugs)
Looks the same, except that small mutation observer bit being not there (so less element wrapping).
Flags: needinfo?(bugs)
Mike,

Joe suggested I ping you about this effort. Olli did the profiling of the paths between the patch and m-c and it seems like based on that the next step is to get Fluent or DOM localized by Fluent into the startup cache.

Do you see anything else here that we could do to unblock us?
Flags: needinfo?(mconley)
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #23)
> Here are profiles:
> 
> (1):
> https://perfht.ml/2yaDZWp
> https://perfht.ml/2yeIorh
> https://perfht.ml/2ybCaIT
> https://perfht.ml/2y9eCnQ
> https://perfht.ml/2ycUlh3
> 
> (6):
> https://perfht.ml/2y9eaWG
> https://perfht.ml/2yeI5N9
> https://perfht.ml/2ygjtDG
> https://perfht.ml/2ycxceN
> https://perfht.ml/2yh5Ow6
> 
> 

Taking advantage of the start-up cache makes perfect sense. Let's definitely do that.

I'm a little confused by these profiles. As I understand from comment 23, the first set (1) should be the "good" case, and the second set (2) should be the "bad" case.

I opened the last of each set, and then I zoomed in on the parent process main thread from process start to the firstPaint marker for each.

1: "good" - https://perfht.ml/2Ewk9eG
2: "bad" - https://perfht.ml/2EzaljZ

What I find confusing is that in these profiles, the "bad" case is actually reaching firstPaint faster (224ms vs 294ms).

I find that pretty confusing.

From a glance, it doesn't look like these profiles are coming from a Talos test. Can we get some Talos test profiles from try posted? Pick the worst-performing one, and then put

mozharness: --geckoProfile

into the try syntax to make the try push generate profiles.
Flags: needinfo?(mconley) → needinfo?(gandalf)
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #27)
> Joe suggested I ping you about this effort. Olli did the profiling of the
> paths between the patch and m-c and it seems like based on that the next
> step is to get Fluent or DOM localized by Fluent into the startup cache.

I think I've mentioned this elsewhere as well, but do we have a sense for the relative win between putting Fluent in startup cache vs putting DOM localized by Fluent in the startup cache? The former seems pretty low-cost and maintenance (I assume that means making something like nsXULPrototypeCache::PutStyleSheet and nsXULPrototypeCache::GetStyleSheet but for FTL). For the latter, I'm generally concerned that if we tie better perf to XUL documents only through the document prototype cache that it'll mean chrome HTML docs (which we are starting to use more of) don't get the speedup, and it'll make shipping the browser window as HTML harder.
With the landing of bug 1488973, I decided to redo the talos profiles so that we can see if the pattern holds.

I did four different 20 cycle runs:

(a) mozilla-central
(b) single string translated using FTL on the startup path
(c) whole menubar translated using FTL
(d) same as (d) but with the FTL AST JSON parsed and inlined to skip I/O and parsing

== (a) central:

talos: fffffd7c5a52
1: https://perfht.ml/2CD35B4
2: https://perfht.ml/2CEloWD
3: https://perfht.ml/2qbUKgl
4: https://perfht.ml/2qbnmq0
5: https://perfht.ml/2q9VVNp

== (b) one-string:

talos: ca436ca69ae4
1: https://perfht.ml/2CDeeBQ
2: https://perfht.ml/2CELHvV
3: https://perfht.ml/2CEM1L9
4: https://perfht.ml/2CD9uvW
5: https://perfht.ml/2CD62Sa

== (c) browser-menubar:

talos: 09fdaabb36dd
1: https://perfht.ml/2qbRLVa
2: https://perfht.ml/2CDZ7rS
3: https://perfht.ml/2qaMLQM
4: https://perfht.ml/2CDGtQW
5: https://perfht.ml/2CFXCt5

== (d) browser-menubar with hardcoded FTL AST:

talos: 790819444a8a
1: http://bit.ly/2qbGAM1
2: http://bit.ly/2q8Qy0N
3: http://bit.ly/2qcHIz0
4: http://bit.ly/2CDLfOo
5: http://bit.ly/2qaS7vd

Comparisons:

(a) vs (b) - https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=fffffd7c5a52&newProject=try&newRevision=ca436ca69ae4

tpaint warnings and one red ~2.49%

(a) vs (c) - https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=fffffd7c5a52&newProject=try&newRevision=09fdaabb36dd&framework=1

some sessionrestore (2.5%), tpaint (3-6%) and tspaint (2-3%).

(a) vs (d) - https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=fffffd7c5a52&newProject=try&newRevision=790819444a8a&framework=1

seems very similar to (c)

talos-compare - https://pike.github.io/talos-compare/?revision=fffffd7c5a52&revision=ca436ca69ae4&revision=09fdaabb36dd&revision=790819444a8a

My read of the results indicate that there's some regression coming from even a single string being localized on the startup path - which is likely related to us loading the Fluent jsm, and initializing the system, searching for translatable strings etc. and then there's a cost of translating a high number of strings.
The former (a vs b) places us on the verge of visible talos regressions with 1-2% impact. The latter pushes us beyond that into 3-6% territory.

=== Caching FTL ===

In particular, we wanted to test the hypothesis that placing parsed FTL into the startup cache would help us get talos wins.

(c) vs (d) - https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=09fdaabb36dd&newProject=try&newRevision=790819444a8a&framework=1

I believe there are no significant differences (although some tests are still running). I suspect that the reason for lack of visible win is that we already cache FTL during runtime, so the most sensitive tests like tpaint will only see a win in (d) scenario on the first of twenty runs.

:mossop, :mconley, :smaug - if I read the talos results correctly (and please, take a spin with the profiles I attached), caching FTL in the startup path will not, on its own help us.

We either need to cache localized DOM, or we need to remove the JS code from the startup path (so, migrate DOMLocalization to C++, and switch Fluent.jsm to fluent-rs?), or maybe even both.

How does it look to you?
Flags: needinfo?(mconley)
Flags: needinfo?(gandalf)
Flags: needinfo?(dtownsend)
Flags: needinfo?(bugs)
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #30)
> We either need to cache localized DOM,
This, or at least cache the fluent files in binary format

> or we need to remove the JS code from
> the startup path (so, migrate DOMLocalization to C++, and switch Fluent.jsm
> to fluent-rs?), or maybe even both.
And this definitely, as far as I see.

I was looking at b/3. translateRoots isn't super slow, but whatever it is doing, is spending time mostly just running JS and crossing JS->C++ and C++ boundaries and compiling JS. Having that all in C++/Rust should be significantly more light weight.
Same with the other parts in the profile doing l10n stuff.


So, nothing new here. We shouldn't really run any extra JS in critical paths.
(this is a broader question in general - which parts of the browser UI can be implemented in JS and which can't be.)
Flags: needinfo?(bugs)
Also, the regressions here aren't horribly bad, but over the time such not-horribly-bad regressions accumulate and affect to the user experience.
One of the hard parts of doing analysis like this is that the profiles are different even within the same run. For example, I compared the (a)2 and (c)1, and one difference that stood out to me is that (c)1 seems to have a harder time setting the tab min width value here:

https://searchfox.org/mozilla-central/rev/d5fc6f3d4746279a7c6fd4f7a98b1bc5d05d6b01/browser/base/content/tabbrowser.xml#129

which is a setter defined here:

https://searchfox.org/mozilla-central/rev/d5fc6f3d4746279a7c6fd4f7a98b1bc5d05d6b01/browser/base/content/tabbrowser.xml#179

in (a)2, it takes 4ms: http://bit.ly/2zfvgTx
in (c)1, it takes 22ms: http://bit.ly/2z2oxMq

But in (c)3, it takes 4ms again: http://bit.ly/2z7ayFd

so it's pretty easy to fall down blind alley's that don't actually amount to much when doing the analysis.

Certainly I trust smaug's assessment about getting JS out of the critical path - crossing the native / JS boundaries is not at all free, and we get to avoid a bunch of overhead if we avoid crossing into JS.
Flags: needinfo?(mconley)
Depends on: 1503657
Attachment #9013979 - Attachment is obsolete: true
Depends on: 1507008
Depends on: 1512674
Depends on: 1517880
Flags: needinfo?(dtownsend)
Type: defect → enhancement

(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #35)

With bug 1552714 fixed, I'm testing the perf impact of landing a single string on the startup path: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=c1a9623c2c56f38cfcef106d2a41b1c8fa058f3d&newProject=try&newRevision=2b5aaba72a5856b507b5741faf4c4ac2ce28ec98

This still shows regressions in ts_paint and twinopen. There's some noise but the overall picture still looks like this would be a regression.

Are we still intending to fix bug 1517880?

This still shows regressions in ts_paint and twinopen. There's some noise but the overall picture still looks like this would be a regression.

Yeah, before windows results showed up it looked better :/

Are we still intending to fix bug 1517880?

Yes. I believe so. Or switch to fluent-rs, which is likely a larger task, so my focus will be on bug 1517880 first.

Depends on: 1558602

Update on this. With all of the Fluent DOM now in C++, and with changes in bug 1558602, we're getting to the point where we may want to consider starting to move away from DTD in browser.xhtml.

As an experiment, I migrated the whole menubar and some bits and pieces from browser.xhtml and panelUI - total of 250 strings. I didn't remove them from DTD and I didn't remove the DTD yet, just added Fluent on top.

Here are talos results: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=6fa61d1c9e669fd5f1f156201dff5ec0d001ceae&newProject=try&newRevision=068bfbc7c468903dd6b9318e2228ad49f7ca4db5

I'm still interested in bug 1517880 and will now work on that, but I think we're at the point where we can start working on bug 1501886 in parallel.

I'd like to verify that claim with someone from the perf team - mconley. Can you take a look at the talos and tell me what you think? Is there anything else you'd like me to test?

In my initial landing, I'd like to migrate just the menubar, which is ~100 strings.

Flags: needinfo?(mconley)

Thanks for tagging me in, gandalf.

So the comparison still shows a regression - but as you say, we've effectively doubled the work in your treatment build; we're loading both the Fluent strings and the DTD strings. Does the regression go away if you remove the DTDs? If so, that's pretty compelling.

Flags: needinfo?(mconley)

Hah, actually, let me provide more context first!

So, starting here, I'd already want to ask a question - what's an acceptable performance penalty kick for the value we're getting. What's the process, who are the stakeholders, and how can we evaluate that hit against the things the change unblocks us to do.

To reiterate on why we want to unblock Fluent on startup:

  • Removal of Yellow Screen of Death https://docs.google.com/document/d/1A2wQI8tdEe0wIFK2KQZniBlQOjSrGoZglQUO_Xfvx2U/edit#
  • Introducing of Fluent L10n Context in browser.xhtml enables plethora of cleanups in all browser.xhtml's JS
  • And unification of that JS l10n with XUL/XHTML l10n into a single context
  • Final deprecation of DTD in Firefox
  • Runtime locale switching without restart
  • Increased security in result of DTD removal
  • Improved l10n experience with better error fallbacking
  • Step on the path to remove all of the old StringBundle, PluralRules etc. codebase

All of those things, (and a couple others) are basically blocked on this for the last 6 months. I of course understand that we want to prevent any perf regressions, but I believe this is not the first case, where reality requires us to make compromises and we have two conflicting needs.

On top of that, some of those changes, such as JS cleanups around browser.xhtml are likely to bring small perf improvements over time. They won't show up as alerts in talos, and it's impossible to evaluate the claim except of anegdotal experience of porting Preferences to Fluent leading to code cleanups and easier startup JS code.

I don't know how much longer will I need to shave off the final milliseconds, but I'd like to understand what are the criteria on evaluating accepting some performance hit against those sought after goals.

Maybe the answer is "zero" - as in, if 6 months from now we're still 3ms away from green talos on windows7-32, then none of that can be unblocked. But I hope that the answer is more nuanced.

Now, let's take a dive into the numbers as of today:

So the comparison still shows a regression

We have 4 "important" changes in this talos result:

sessionrestore opt e10s stylo / windows10-64-shippable - 2.37%
ts_paint opt e10s stylo / windows10-64-shippable - 2.03%
twinopen ext+twinopen:twinopen.html opt e10s stylo / linux64-shippable - 2.30%
twinopen ext+twinopen:twinopen.html opt e10s stylo / macosx1014-64-shippable - 3.72%

In isolation, that may look like "just a regression", but if we look at a couple similar tests, the conclusion may be slightly different:

sessionrestore opt e10s stylo / windows10-64-shippable-qr - -0.47%
ts_paint opt e10s stylo / windows10-64-shippable-qr - -0.20%

In multiple other cases the results for windows*-qr variant show better performance than non-qr. Since we are turning on webrender everywhere. Is there a reason to argue that this is either noise or not very impactful for us?

The two others are one yellow on linux64-shippable, but not significant on -qr variant, and the macos1064 twinopen.
The macos doesn't have the -qr variant yet, but my guess is that once we add it, it'll show no impact.

My best guess is that there's something about the event loop that quantifies the events into some frames - you can notice that if you hover over twinopen/ts_paint results on windows - sometimes they have quasi-normal distribution, sometimes bi-modal.

Does the regression go away if you remove the DTDs?

I can't do that yet. Unfortunately the state of DTD in browser.xhtml is a patchwork over last 15 years. Strings are used all over the place and a lot of strings from menubar are used elsewhere. I'd have to migrate all of browser.xhtml all at once which would be daunting to work on and would take much longer.

My question is how far those results are from something acceptable.

Irrelevant of the answer, I'm going to work on the next steps:

  • fluent-minicache (bug 1517880)
  • migration to fluent-rs

The mini-cache should be close to being landable, but the fluent-rs will take longer.

I'd prefer to work on the cache and fluent-rs in parallel with unblocking the front-end and l10n teams to migrate browser.xhtml away from DTD, so I'm trying to understand who and how can make such a decision.

Flags: needinfo?(mconley)

for the record, I got the mini-cache to work (bug 1517880). Talos numbers here: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=6fa61d1c9e669fd5f1f156201dff5ec0d001ceae&newProject=try&newRevision=d8de254ce2afb8540f9c603797b8925b68849432

I still think it would be nice to understand what are the criteria and who are the stakeholders for enabling Fluent on the startup path

Depends on: 1560038

I really apologize for not having gotten back to this sooner - Fission crunch for Whistler and then Whistler kinda took their toll.

what's an acceptable performance penalty kick for the value we're getting. What's the process, who are the stakeholders, and how can we evaluate that hit against the things the change unblocks us to do.

Great question. I think we have to look at what we're paying for with the cost of the performance regression - you've done a good job of enumerating the performance benefits. Naturally, the ideal scenario is that we wouldn't regress at all, so the aspirational target is 0 or negative performance regression.

That having been said, I believe it's generally understood that if you've picked most of the low-hanging fruit, that the effort and time cost in order to shave off the last few milliseconds to gain parity will be high.

So the equation roughly is: is the value that we're getting (which you've enumerated) greater than the cost of the regression? And which costs more - the regression, or the effort to address the regression?

I seem to recall a meeting during the All Hands in Orlando where we discussed the Fluent minicache as the thing that will likely "solve" the performance regression. Have you performed any experiments or built any prototypes to demonstrate that this is indeed so?

Flags: needinfo?(mconley) → needinfo?(gandalf)

I seem to recall a meeting during the All Hands in Orlando where we discussed the Fluent minicache as the thing that will likely "solve" the performance regression. Have you performed any experiments or built any prototypes to demonstrate that this is indeed so?

I'm still working on bug 1558602 and bug 1517880. I'm just getting to the point where things become tradeoffs. The cache, for example, increases code complexity and depending on how we implement it may be hard to maintain. I'm still optimistic about getting to "no regression" but I would like to understand what's my room to at some point say "ok, adding this amount of convulsed code is not worth shaving those X ms" and who can I consult on that decision.

Flags: needinfo?(gandalf)

Olli, I profiled three release builds: m-c (with patch from bug 1558602), menubar on startup path (bug 1501886) async, and sync.

== central ==

https://perfht.ml/2NiFhcR
https://perfht.ml/320ofn8
https://perfht.ml/2NgpYBe
https://perfht.ml/2NjlcTG
https://perfht.ml/2NlJFaM

https://treeherder.mozilla.org/#/jobs?repo=try&revision=708b775e66b3e27a98e13a554766f851c29f02d9

== menubar FTL async ==

https://perfht.ml/2NhhA4w
https://perfht.ml/31XhrGQ
https://perfht.ml/2NhOYIh
https://perfht.ml/2NhUwCL
https://perfht.ml/2NiZ0c4

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b35b97e3b5e96f84463629897eb2e7a80e391b7e

perfherder (vs m-c): https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=708b775e66b3e27a98e13a554766f851c29f02d9&newProject=try&newRevision=b35b97e3b5e96f84463629897eb2e7a80e391b7e

== menubar FTL sync ==

https://perfht.ml/2NsP2Fs
https://perfht.ml/2NfZaRv
https://perfht.ml/2Njpx9q
https://perfht.ml/2NhiuOs
https://perfht.ml/2Ndhc6T

https://treeherder.mozilla.org/#/jobs?repo=try&revision=64737f98524938c9fcaa4ec353a20f37d2208bde

perfherder (vs m-c) - https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=708b775e66b3e27a98e13a554766f851c29f02d9&newProject=try&newRevision=64737f98524938c9fcaa4ec353a20f37d2208bde

Can you tell me what you see in the profiles? Is it still mostly the JS engine initialization?

:mconley - you can also notice from the perfherder results that some swings are wild. +16% here, -33% there. I'm worried that at some point we'll have to try to read the profiles without statistical significance and then the interpretation will be prone to "confirmation bias" fallacy - if you want to find regression in the results, you'll notice the +2.0% here, if you want to believe it's faster, you'll notice -2.0% there. I'd like to understand who'se go-ahead am I looking for.

Flags: needinfo?(mconley)
Flags: needinfo?(bugs)

(looking https://perfht.ml/2Njpx9q)
There is definitely couple of milliseconds (~5) coming from js component loading.
mozilla::intl::Localization::Init overall takes ~10ms, half of it is js module loading, another half is getLocalization call.
Does getLocalization use generators heavily - in such way which doesn't get jit'ed?

translateElements is called elsewhere (DocumentL10n::TriggerInitialDocumentTranslation), and takes 3ms.

So yes, JS usage does show up.

Flags: needinfo?(bugs)

Does getLocalization use generators heavily - in such way which doesn't get jit'ed?

Idk, is it possible to see it from the profile?

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #44)

:mconley - you can also notice from the perfherder results that some swings are wild. +16% here, -33% there. I'm worried that at some point we'll have to try to read the profiles without statistical significance and then the interpretation will be prone to "confirmation bias" fallacy - if you want to find regression in the results, you'll notice the +2.0% here, if you want to believe it's faster, you'll notice -2.0% there. I'd like to understand who'se go-ahead am I looking for.

Hey gandalf,

At that point, where you've (admirably!) boiled down to the noise, you'll want to probably have esmyth make the decision on whether or not the "regression" (real or imagined) is acceptable, based on what we're buying with it.

Flags: needinfo?(mconley)

Since the cache looks to require more wrangling between prototypes and documents and fluent, I decided to try to push bug 1560038 a bit.

I now have a patch that replaces Fluent.jsm with FluentBundle and FluentResource backed by fluent-rs and here are talos numbers: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9cf343dbb979f2429785f51be2c44e61e1254286&newProject=try&newRevision=49b4a81baecee9ea35d75dec902963fc331ba196

I think it's pretty promising and I'll continue evaluating it.

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #49)

With the latest set of performance improvements to the Arc branch of fluent-rs - https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=2812d08947cbcfb966be0d5d148cd8c5cc492ef2

This still regresses ts_paint on windows pgo builds by nearly 2%. :-(

This still regresses ts_paint on windows pgo builds by nearly 2%. :-(

This is statisticially insignificant. I'm happy to run more reps to try to get better p=95, but I think we're in the "trying to read into random data" point. If you look for regressions, you'll find them, if you look for wins, you'll find a 25% win in twinopen on the same platform.

I believe both to be flukes.

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #51)

This still regresses ts_paint on windows pgo builds by nearly 2%. :-(

This is statisticially insignificant.

I don't think it's necessarily insignificant (we're doing our darndest to shave off fractions of a percent over here), but like I wrote in comment 47, you should double-check with esmyth to see if the possible cost is worth what we're buying.

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #51)

This still regresses ts_paint on windows pgo builds by nearly 2%. :-(

This is statisticially insignificant.

Like mconley, I disagree with this assessment. The spread of numbers for ts_paint on win64 (that you can see when hovering over the data points on both sides of the graph) is almost completely disjoint (ie no overlap), and perfherder claims high confidence in the regression based on a t-test.

I'm happy to run more reps to try to get better p=95, but I think we're in the "trying to read into random data" point. If you look for regressions, you'll find them, if you look for wins, you'll find a 25% win in twinopen on the same platform.

I see a 13.26% regression on win10-64. On win10-64-qr, there's an "improvement", but it is low-confidence, and if you look at the spread of the numbers that are being used, you can see that on the baseline (m-c) dataset, there are 2 outliers at about 380ms each, and the rest of the numbers actually come in below the 123.82 average on your trypush, ie there, too, this actually looks likely to be a regression. The outliers look like an irregular occurence (but not unprecedented) based on the graph.

So no, I don't think this just depends on what you're looking for. I think the data points out that fluent is slower.

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #48)

Since the cache looks to require more wrangling between prototypes and documents and fluent, I decided to try to push bug 1560038 a bit.

Is this a temporary or fundamental hold-up to the caching work?

The spread of numbers for ts_paint on win64 (that you can see when hovering over the data points on both sides of the graph) is almost completely disjoint (ie no overlap), and perfherder claims high confidence in the regression based on a t-test.

This is a bimodal distribution which unfortunately means that the t-test used by talos to calculate significance is not the proper test to calculate it.

It seems that for ts_paint and twinopen all windows results are bimodal (maybe collapsing on frames?) which makes it more counter-intuitive to try to read their significance.

I see a 13.26% regression on win10-64. On win10-64-qr, there's an "improvement", but it is low-confidence, and if you look at the spread of the numbers that are being used, you can see that on the baseline (m-c) dataset, there are 2 outliers at about 380ms each, and the rest of the numbers actually come in below the 123.82 average on your trypush, ie there, too, this actually looks likely to be a regression.

I think that the way you analyze the data is not in line with how I was instructed to in statistics classes.

As I stated before, t-test is (quoting wikipedia) "most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. ".

In particular, it is a common mistake to apply t-test to a non-normal distributions, like bi-modal or multimodal.

There are tests that can verify if the distribution is normal (it is visibly not, but if you prefer I can also calculate it), and in such a case, Mann–Whitney U test would be more appropriate [0], as it is (quoting Wikipedia): "Unlike the t-test it does not require the assumption of normal distributions."

What you do past that point, dropping outliers, is a risky operation that has far fetching consequences on validity of the interpretation. I believe if you apply the same manual removal of outliers on twinopen "windows7-32" you'll get a perf win, but I'd be very cautious about its validity.

In other words, if we don't trust talos results significance, I believe we should not try to "read between the lines" and try to deduct what we believe to be true, because data will naturally give us ability to select bits that fit our preconception.

Instead, we should revisit the tools we use (t-test?) and if we want to remove outliers, we should do this using statistical tools, not our human intuition. One such tool is IQR score [2].

Finally, as you see from the graphs, we're comparing multimodal and normal distributions which are bound to have different characteristics.
That means that we have a pretty high chance of chasing ghosts and trying to normalize the numbers to behave in line with baseline, even if it doesn't match actual impact on user experience.

As I stated in comment 44, at some point, assuming my patches change the behavior and distribution, we'll have to face a balance of gains/costs that are not proven to be statistically significant and try to deduct what it means.
Gijs in the comment above, removes outliers in a selected result when the outcome doesn't match the hypothesis, until it does. At the same time, he doesn't remove the outlier in "ts_paint/windows10-64-shippable-qr" which might drop the significance from mid to low.
I'm afraid that such strategy will always be available to prove a point.

We can also dwell into outliers. If in twinopen/windows10-64-shippable-qr m-c has outliers, and those outliers also happen in real life on user machines, and my patch removes those outliers, but regresses the median by 0.2%, is it a win or loss?

My question is - if we will never get to the point where the data in talos shows exactly what we want to see, and instead it'll remain chaotic and fluctuating - whos call it is to make a call that it is good enough?
Gijs? Mconley? esmyth?

And if we evaluate that there indeed is a hidden regression, say 0.3% on ts_paint. Whos call it is to evaluate it against YSOD or overall removal of DTD?

Because I'm concerned that we use the wrong tools to read the data and keep delaying migration of the startup path to conform to a wrong test.

[0] https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test
[1] https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/
[2] https://en.wikipedia.org/wiki/Interquartile_range#Outliers

Is this a temporary or fundamental hold-up to the caching work?

I'm not sure. I'm trying to verify it with smaug and bdahl this week.

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #54)

My question is - if we will never get to the point where the data in talos shows exactly what we want to see, and instead it'll remain chaotic and fluctuating - whos call it is to make a call that it is good enough?
Gijs? Mconley? esmyth?

esmyth.

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #54)

Gijs in the comment above, removes outliers in a selected result when the outcome doesn't match the hypothesis, until it does. At the same time, he doesn't remove the outlier in "ts_paint/windows10-64-shippable-qr" which might drop the significance from mid to low.
I'm afraid that such strategy will always be available to prove a point.

FWIW, I didn't look at the win10 qr result in detail as it had only medium confidence, only the non-qr win10-64 shippable, which afaict doesn't have significant outliers and has a "high" confidence t-test, not medium. Runs before are 314-318 (11 samples), runs after are 317-324 (19), with a 1.98% regression marked. I don't see a bimodal distribution in the graphs for ts_paint win10-shippable (non-qr) either (neither before nor after), but perhaps I'm not looking at the same thing as you?

To get more reliable distributions it's probably worth doing a trypush off the base revision instead of just taking the data from the last 2 days of m-c runs, to reduce the potential for additional noise caused by other patches. For the general point of whether specific tests are bimodal or not, it seems worth having a conversation with the perf infra team folks about stabilizing those tests and/or having perfherder employ better tests when describing the "results" of comparing two revisions -- in a separate bug, of course.

With the landing of bug 1517880 and now bug 1501886 we now have Fluent on the startup path with no talos regressions.

Hooray!

Status: ASSIGNED → RESOLVED
Closed: 11 months ago
Resolution: --- → FIXED

For the general point of whether specific tests are bimodal or not, it seems worth having a conversation with the perf infra team folks about stabilizing those tests and/or having perfherder employ better tests when describing the "results" of comparing two revisions -- in a separate bug, of course.

I still intend to file a bug about it, just failed to prioritize so far. Setting NI

Flags: needinfo?(gandalf)

This is now handled in bug 1581533 and bug 1601952! Hooray!

Flags: needinfo?(gandalf)
You need to log in before you can comment on or make changes to this bug.