[tracking] Improve performance of Fluent on the startup path

a year ago
2 months ago


(Reporter: gandalf, Assigned: gandalf)


(Depends on 4 bugs, Blocks 3 bugs, {meta})

Firefox Tracking Flags

(Not tracked)



(3 obsolete attachments)

Fluent is a new localization system introduced to Gecko to replace DTD/StringBundles.

It brings a lot of features, provides better quality and security, and ties better into the internationalization layer, but at the moment it brings regressions in several talos tests.

Let's investigate those.


a year ago
Blocks: 1365426
Depends on: 1363862, 1384236, 1437717
Priority: -- → P3
Summary: Improve performance of Fluent on the startup path → [tracking] Improve performance of Fluent on the startup path

Comment 1

a year ago
We did a pretty complete investigation of the performance on tpaint and ts_paint a year ago, and documented it at https://wiki.mozilla.org/L20n/Firefox/Performance (that's late 2016).

Since then we spent almost a year focused on extracting Fluent out of L20n. When we started looking into performance again it was much better, with our hypothesis being that a combination of improvements in the DOM, JS engine and style systems around Firefox Quantum helped us reduce the performance hit.

The main patch we're using to test the startup performance is called "browser-menubar" and moves the main menubar which is part of browser.xul [0] to Fluent.
The patch is around 100 strings which is a vast majority of strings currently on the startup path (whether the browser menubar should be on the startup path is a separate conversation, but it helps us with testing).

The status as of December 2016 was:

* sessionrestore - ~5.1% hit (44ms)
* sessionrestore(e10s) - ~6.3% hit (47ms)
* tpaint - ~14% hit (40ms)
* tpaint(e10s) - ~13% hit (36ms)
* ts_paint - ~1.5% hit (14ms)
* ts_paint(e10s) - ~1.7% hit (15ms)

At the time, with the help of :smaug, we did some profiling, and Olli wrote a POC patch that finds all nodes that have a `data-l10n-id` attribute and translates them in C++, based on the assumption that creating JS reflections of those nodes is what makes us slower.
(the patch can be found in bug 1363862)
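
For readers unfamiliar with the approach, the idea behind the POC can be modelled as a single pass that finds every element carrying `data-l10n-id` and fills in its label from a message table. The sketch below uses Python's `xml.etree.ElementTree` purely as a stand-in for the DOM. The `translate_tree` helper, the sample markup and the `messages` table are hypothetical and are not taken from the patch in bug 1363862.

```python
import xml.etree.ElementTree as ET


def translate_tree(root, messages):
    """Set the label of every element that has a data-l10n-id attribute.

    `messages` maps l10n ids to translated strings (a hypothetical table
    standing in for a parsed FTL resource). Returns the number of elements
    that were translated.
    """
    translated = 0
    # iter() walks the element itself and all descendants in document order.
    for node in root.iter():
        l10n_id = node.get("data-l10n-id")
        if l10n_id is None:
            continue
        value = messages.get(l10n_id)
        if value is not None:
            node.set("label", value)
            translated += 1
    return translated


# Hypothetical menubar fragment; not the real browser-menubar.inc markup.
SAMPLE = """
<menubar>
  <menu data-l10n-id="menu-file"/>
  <menu data-l10n-id="menu-edit"/>
  <menu id="untranslated"/>
</menubar>
"""

root = ET.fromstring(SAMPLE)
count = translate_tree(root, {"menu-file": "File", "menu-edit": "Edit"})
```

The point of doing this natively rather than in JS is that no JS wrapper has to be created for each matched node; the sketch only illustrates the traversal, not that cost.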

At that time the patch helped a lot, reducing the hit on tpaint from ~13% to 3% and on ts_paint to zero.

The status as of January 2018 was [1] (Windows 10 PGO):

* sessionrestore - ~2.95% hit (11ms)
* tpaint - ~2.69% hit (6ms)
* ts_paint - ~0.38% win (1.9ms)

As you can see the improvement was pretty drastic and that's without the C++ `document.localize` patch.

Unfortunately, since then we regressed again. February 23rd[2]:

* sessionrestore - ~8.62% hit (34ms)
* tpaint - ~15.93% hit (35.8ms)
* ts_paint - ~6.23% hit (31.7ms)
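
As a reading aid: each line above pairs a relative hit with an absolute one, so the implied baseline can be recovered as `ms / (percent / 100)`. The helper below is only arithmetic on the numbers quoted in this comment; `implied_baseline_ms` is a name introduced here for illustration.

```python
def implied_baseline_ms(percent_hit, hit_ms):
    """Return the baseline (in ms) implied by a relative and an absolute hit."""
    return hit_ms / (percent_hit / 100.0)


# Figures quoted above for February 23rd: (percent hit, hit in ms).
february = {
    "sessionrestore": (8.62, 34.0),
    "tpaint": (15.93, 35.8),
    "ts_paint": (6.23, 31.7),
}

baselines = {
    name: implied_baseline_ms(percent, ms)
    for name, (percent, ms) in february.items()
}
```

With these inputs the implied baselines come out at roughly 394ms for sessionrestore, 225ms for tpaint and 509ms for ts_paint, which gives a sense of how narrow the band is when hunting for a few milliseconds.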

What's worse, the C++ DOM patch doesn't seem to help at all anymore[3].
I suspect the landing of stylo-chrome (bug 1417138) to be responsible, so I'll file a separate bug to investigate that.

Once that's fixed, we'll have to look at what else is delaying us and see which of the dependencies of this bug can help us get to a green (or at least white) talos.

[0] https://searchfox.org/mozilla-central/source/browser/base/content/browser-menubar.inc
[1] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=5e6ef7eae125&newProject=try&newRevision=afa7f521620fecb85e6d428ad607a9f6c9bc2b5d&framework=1
[2] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1a95c22f842b&newProject=try&newRevision=d40dbf52dd092c134a87b936b76ab9ce5c221ec7&framework=1
[3] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1a95c22f842b&newProject=try&newRevision=ada74456f5af6104abf44ffed7e5f059431b72a0&framework=1

Comment 2

a year ago
Posted patch browser-menubar.patch (obsolete) — Splinter Review
This is the patch you can apply onto m-c to test impact of Fluent on the startup path (mostly for the DTD->Fluent migration)


a year ago
Depends on: 1441037

Comment 3

a year ago
Based on bug 1441037 comment 11: with the (updated) patch from bug 1363862 the talos looks good: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f43e170bb075&newProject=try&newRevision=a74e16e7ce6c&framework=1

All other dependencies are now optional, but we will keep this bug open and monitor until we land in the startup path.

Comment 5

a year ago
Here's a comparative view between m-c, browser-menubar(without Node.localize) and browser-menubar(with Node-localize) - https://pike.github.io/talos-compare/?revision=efb326bbd665&revision=8b5ebddb2069&revision=e2c48508d0aa

Comment 6

a year ago
In bug 1437921 bz added a ChromeOnly Promise on `document` that resolves when DOMContentLoaded or Layout is done. We could use a similar technique instead of MozBeforeInitialXULLayout to unify the trigger between chrome-only HTML and XUL.

Comment 7

a year ago
Posted patch browser-menubar.patch (obsolete) — Splinter Review
Attachment #8953881 - Attachment is obsolete: true

Comment 8

a year ago
Here's a startup profile from today's m-c with the attached patch: https://perfht.ml/2I8miLj

The things I can identify in the profile:

* I/O takes ~10ms
* nsBrowserGlue.js registering L10nRegistry source takes ~13ms
* parsing takes ~2.4ms
* translateRoots takes ~1.5ms

The last one is a bit surprising because the whole `MozBeforeInitialXULLayout` takes ~5.9ms and it should be just translation, so my guess is that the actual time the translation takes is the ~6ms.

Things that look good:

* The total cost doesn't seem very high
* We seem to be doing the right things at the right time

Things worth investigating:

* Why is the L10nRegistry source registration so expensive?
* Why is loading DOMContentLoaded taking 9ms?
* Why is l10n.js running time 30ms but self time is just 10ms?
* Why is there DOMContentLoaded before MozBeforeInitialXULLayout?

There may be other things in the profile that are worth investigating. I'll redo it after 60 branches off with symbols to look more into C++ side of things.


a year ago
Depends on: 1317481


a year ago
Depends on: 1456388

Comment 9

7 months ago
This becomes a P1 as of today since it's a requirement for the Capstone project and, overall, migration to Fluent. We've seen some minor wins with bug 1455649 and SpiderMonkey has been optimizing a lot of codepaths that we care about (generators, strings, concatenations, etc.), so I'd like to remeasure the impact.

This time, instead of migrating the whole menubar, I'll just inject a single string from FTL and monitor the impact.

The hypothesis is that we will be slightly slower in tests because DTD goes into the XUL FastLoad cache which makes that system unobservable in all runs after the first. ts_paint and tpaint take 20/40 runs, so for 39 runs the document is not getting localized.

Fluent caches the FTL resource parsing, but on each run has to apply the localization onto the DOM. We may have to decide if we want to add Fluent to the cache (and who would be qualified to touch that), swallow the perf impact for the greater good, or see if we can minimize the penalty.
Assignee: nobody → gandalf
Priority: P3 → P1

Comment 11

7 months ago
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #10)
> Talos run between central from today and a single string migrated to Fluent:
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=13affc067c58&newProject=try&newRevision=5a5aa2504bcf3e79144563c353c874a982d03935&framework=1
> The patch:
> https://hg.mozilla.org/try/rev/905ede35bdf415c8109081b0acc1e96dc90cc946

Yeah, this doesn't look great:


2-6% regressions across ts_paint, tpaint, sessionrestore and tabpaint.

Comment 12

7 months ago
As of today, the talos shows:

 - 4ms on ts_paint across the platforms (pgo)
 - 1ms on tpaint across the platforms (pgo)
 - 4-6ms on sessionrestore across the platforms (pgo)

It is not as bad as I expected, and certainly better than in the past.

I'll now try to disable XUL fastload cache to see how much of that difference can be explained with it.

Comment 13

7 months ago
> 2-6% regressions across ts_paint, tpaint, sessionrestore and tabpaint.

I would expect tabpaint to be noise tbh.

Comment 15

7 months ago
I collected the profiles. Let me know if that's a good start.
Flags: needinfo?(bugs)
Ok, so TriggerInitialDocumentTranslation() is 2ms or so, and it is mostly about wrapping XUL elements. That is about DOMLocalization.jsm. So, rewriting DOMLocalization.jsm in C++ would be good from performance perspective.
addResourceIds is 1ms. That is rather deep JS stack. All that would be better in C++ too.

So, as far as I see, most of the overhead comes from backend implementation being JS.
Flags: needinfo?(bugs)

Comment 17

7 months ago
Thanks Olli!

We had a longer conversation on #content on IRC yesterday about strategies to approach this. I'll try to sum them up today, but first, Brian asked for a talos run with initialization but TriggerInitialDocumentTranslation returning early.

Here's the talos: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=317ddd31da97&newProject=try&newRevision=5f9023f430c5&framework=1

Ignore the about_preferences result and focus on sessionrestore, tpaint, ts_paint - all within 1% of the central.

Here's talos-compare: https://pike.github.io/talos-compare/?revision=317ddd31da97&revision=5f9023f430c5&revision=ba16bad555a1&revision=9d2d72e91bb4

And the zoom-in on ts_paint: https://screenshots.firefox.com/yqHTu3WmJJg4SRJL/pike.github.io
1) central
2) early return from TriggerInitialDocumentTranslation
3) early return from translateRoots
4) translate elements in browser.xul

Reading the results - I don't think there's a big difference between (3) and (4), which is unexpected. There is some cost between (1) and (2), but it stays within 1% and doesn't trigger "significant" regression on talos. There's also some regression between (2) and (3). Together they constitute a significant regression (1)->(3).

Comment 18

7 months ago
I updated the three patches from the last bug to today's m-c: (1), (2) and (4).

1) mozilla-central

2) disable translations, added fluent init

3) single string translated

The performance is actually quite good! Here's a compare between (1) and (3):

I still claim there is a regression, I'd estimate 3-4ms on ts_paint, 1-5ms on tpaint and ~5ms on sessionrestore, but it's not marked as significant and they translate to ~1% of the number which makes it tricky to try to hunt down due to small band in which we're looking.

I'm still interested in moving most of the DOMLocalization to C++ (DocumentL10n), since smaug says that we're likely mostly regressing because of when we have to touch DOM during DOMLocalization, but it may be that we're close to being ready to enable Fluent on the startup path!

Comment 20

7 months ago
I pushed a patch that migrated 150 strings on the startup path to Fluent. Let's see what's the talos perf of that.


7 months ago
Attachment #8956709 - Attachment is obsolete: true

Comment 22

7 months ago
I'm now analyzing 4 scenarios:

1) mozilla-central
2) Fluent initialized on the startup path, but TriggerInitialDocumentTranslation disabled
3) Fluent initialized and a single string localized
4) Fluent initialized and 150 strings (browser-menubar) localized

Here's a talos-compare for those 4 runs: https://pike.github.io/talos-compare/?revision=72eed19d131e&revision=cf48b65831b9&revision=3879decb5bed&revision=c6f68b69aaf2

The only tests that are important to us here are: tpaint, ts_paint and sessionrestore.

Looking at them I see two jumps:

a) between (1) and (2)
b) between (3) and (4)

That indicates to me that there is some static cost associated with the initialization of Fluent, and then there's a per-string cost.

The total regressions as of today:
 - ts_paint 1-2%
 - tpaint 1-5%
 - sessionrestore 1-4%
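
The two jumps described in this comment suggest a simple linear reading of the data: a fixed initialization cost plus a cost per translated string. The sketch below fits that model from two data points, scenario (3) with one string and scenario (4) with 150 strings. The function name and the input values are hypothetical placeholders, not measured numbers, and the assumption that the cost grows linearly with the string count is a simplification.

```python
def fit_cost_model(cost_one_string_ms, cost_many_strings_ms, many=150):
    """Split a regression into a fixed init cost and a per-string cost.

    Assumes cost(n) = init + per_string * n, which is a simplification.
    Inputs are regressions over mozilla-central for 1 and `many` strings.
    """
    per_string = (cost_many_strings_ms - cost_one_string_ms) / (many - 1)
    init = cost_one_string_ms - per_string
    return init, per_string


# Hypothetical regressions over mozilla-central, in ms (not measured values).
init_ms, per_string_ms = fit_cost_model(4.0, 10.0)
```

If the per-string term dominates, reducing the work done per element matters most; if the fixed term dominates, the cost of loading and initializing Fluent is the better target.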

Comment 23

7 months ago
I collected profiles for the 150 strings scenario by disabling everything in DocumentL10n.cpp and re-enabling one by one.

The following scenarios were tested now:

1) m-c
2) browser-menubar with everything in DocumentL10n.cpp disabled except of do_CreateInstance of mozIDOMLocalization
3) browser-menubar with everything disabled and AddResources reenabled (triggers I/O, parsing etc.)
4) browser-menubar with everything disabled and AddResources + RegisterObservers reenabled
5) browser-menubar with everything disabled and AddResources + RegisterObservers + ConnectRoot reenabled
6) browser-menubar with everything enabled

Here is a talos-compare for them: https://pike.github.io/talos-compare/?revision=72eed19d131e&revision=658da3c3bbfd&revision=3f6f64513a42&revision=578c680b5785&revision=e07bdbfcd483&revision=c6f68b69aaf2

What I think is interesting is that there seems to be a difference in when the "jump" happens:

- in sessionrestore the perf jump happens between (2) and (3)
- in ts_paint there are two, one at (2)->(3) and the other at (5)->(6)
- in tpaint it only happens in (5)->(6)

(you may have to remove the other tests to see the jumps in the graphs)
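
One way to make the "where does the jump happen" reading less dependent on eyeballing the graphs is to compute the difference between consecutive scenarios and report the steps that exceed a threshold. This is a minimal sketch; `find_jumps`, the threshold and the sample medians are hypothetical and are only shaped like the ts_paint description above.

```python
def find_jumps(medians_ms, threshold_ms):
    """Return (from_step, to_step, delta) for each consecutive increase
    larger than threshold_ms. Steps are numbered from 1, like the scenarios.
    """
    jumps = []
    for i in range(1, len(medians_ms)):
        delta = medians_ms[i] - medians_ms[i - 1]
        if delta > threshold_ms:
            jumps.append((i, i + 1, delta))
    return jumps


# Hypothetical medians for scenarios (1)..(6); not real talos data.
hypothetical_ts_paint = [500.0, 500.5, 504.0, 504.2, 504.5, 508.0]
jumps = find_jumps(hypothetical_ts_paint, threshold_ms=2.0)
```

For the sample values this reports the steps (2)->(3) and (5)->(6). With real data the threshold would need to be chosen with the run-to-run noise in mind.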

Here are profiles:



Olli - my next step is to try to move some pieces of DOMLocalization to C++ and see if there's any impact, but I hope the profiles are ready for profiling!
Flags: needinfo?(bugs)
L10nRegistry.jsm handling the fetch shows up, the parsing part. I think this is actually the biggest single thing here.
Could we load and parse in some other thread and/or use native code here?
And put ftl to startup cache?  (This stuff isn't too trivial, but I think needs to be done eventually)
Actually, just using startup cache should be enough in practice. No need to load and parse stuff all the time.

There is something I don't understand. We call translateRoots way before L10nRegistry has loaded and parsed some file. Is that expected?

DOMLocalization stuff is mostly just it being JS. Many small things that in total take a bit too much time.

Would it be possible to not handle mutations if the MozBeforeInitialXULLayout listener (or something before it) triggers mutations? Could we deal with mutations - just guessing here - in an rAF callback?
Handling mutations forces creating wrappers at that point.

Just loading DOMLocalization takes a tiny bit of time.

So, nothing new here. Having too much JS in very hot code paths always shows up in the profiles.
Flags: needinfo?(bugs)

Comment 25

7 months ago
Thanks Olli!

I'm going to investigate other things you noticed, but initially, I was concerned about your report about mutations - since the `browser-menubar` patch translates 150 strings in browser.xul without any injections from JS I'd expect MutationObserver to never kick in on the startup path (it's used in Preferences).

I looked myself and found it, so I modified the code we're testing against to only trigger `TranslateFragment(doc)` from DocumentL10n::TriggerInitialDocumentTranslation rather than the whole `ConnectRoot(doc)`.

This means that we're not connecting any MutationObserver and we're not setting directionality.

Here's the patch compared to "full" (6) - https://hg.mozilla.org/try/rev/3aa5da028d668d2393be4d05ccb159e68d760b42

I tested the perf impact of this change comparing it to (6): https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=c6f68b69aaf2&newProject=try&newRevision=e13cf0203a85&framework=1

There doesn't seem to be any impact on sessionrestore, tpaint or ts_paint. The win on about_preferences is to be expected (since they use dynamically injected localizable elements at startup).

So, my conclusion is that this modification still contains all the code that causes the perf cost, but the profile is cleaner, so maybe easier to evaluate.

Olli, can you look at:

and check if it still seems the same as your conclusion from comment 24?
Flags: needinfo?(bugs)
Looks the same, except that the small mutation observer bit is no longer there (so less element wrapping).
Flags: needinfo?(bugs)

Comment 27

6 months ago

Joe suggested I ping you about this effort. Olli did the profiling of the paths between the patch and m-c and it seems like based on that the next step is to get Fluent or DOM localized by Fluent into the startup cache.

Do you see anything else here that we could do to unblock us?
Flags: needinfo?(mconley)
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #23)
> Here are profiles:
> (1):
> https://perfht.ml/2yaDZWp
> https://perfht.ml/2yeIorh
> https://perfht.ml/2ybCaIT
> https://perfht.ml/2y9eCnQ
> https://perfht.ml/2ycUlh3
> (6):
> https://perfht.ml/2y9eaWG
> https://perfht.ml/2yeI5N9
> https://perfht.ml/2ygjtDG
> https://perfht.ml/2ycxceN
> https://perfht.ml/2yh5Ow6

Taking advantage of the start-up cache makes perfect sense. Let's definitely do that.

I'm a little confused by these profiles. As I understand from comment 23, the first set (1) should be the "good" case, and the second set (2) should be the "bad" case.

I opened the last of each set, and then I zoomed in on the parent process main thread from process start to the firstPaint marker for each.

1: "good" - https://perfht.ml/2Ewk9eG
2: "bad" - https://perfht.ml/2EzaljZ

What I find confusing is that in these profiles, the "bad" case is actually reaching firstPaint faster (224ms vs 294ms).

I find that pretty confusing.

From a glance, it doesn't look like these profiles are coming from a Talos test. Can we get some Talos test profiles from try posted? Pick the worst-performing one, and then put

mozharness: --geckoProfile

into the try syntax to make the try push generate profiles.
Flags: needinfo?(mconley) → needinfo?(gandalf)
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #27)
> Joe suggested I ping you about this effort. Olli did the profiling of the
> paths between the patch and m-c and it seems like based on that the next
> step is to get Fluent or DOM localized by Fluent into the startup cache.

I think I've mentioned this elsewhere as well, but do we have a sense for the relative win between putting Fluent in startup cache vs putting DOM localized by Fluent in the startup cache? The former seems pretty low-cost and maintenance (I assume that means making something like nsXULPrototypeCache::PutStyleSheet and nsXULPrototypeCache::GetStyleSheet but for FTL). For the latter, I'm generally concerned that if we tie better perf to XUL documents only through the document prototype cache that it'll mean chrome HTML docs (which we are starting to use more of) don't get the speedup, and it'll make shipping the browser window as HTML harder.


6 months ago
Blocks: 1501881


6 months ago
Blocks: 1501886

Comment 30

6 months ago
With the landing of bug 1488973, I decided to redo the talos profiles so that we can see if the pattern holds.

I did four different 20 cycle runs:

(a) mozilla-central
(b) single string translated using FTL on the startup path
(c) whole menubar translated using FTL
(d) same as (c) but with the FTL AST JSON parsed and inlined to skip I/O and parsing

== (a) central:

talos: fffffd7c5a52
1: https://perfht.ml/2CD35B4
2: https://perfht.ml/2CEloWD
3: https://perfht.ml/2qbUKgl
4: https://perfht.ml/2qbnmq0
5: https://perfht.ml/2q9VVNp

== (b) one-string:

talos: ca436ca69ae4
1: https://perfht.ml/2CDeeBQ
2: https://perfht.ml/2CELHvV
3: https://perfht.ml/2CEM1L9
4: https://perfht.ml/2CD9uvW
5: https://perfht.ml/2CD62Sa

== (c) browser-menubar:

talos: 09fdaabb36dd
1: https://perfht.ml/2qbRLVa
2: https://perfht.ml/2CDZ7rS
3: https://perfht.ml/2qaMLQM
4: https://perfht.ml/2CDGtQW
5: https://perfht.ml/2CFXCt5

== (d) browser-menubar with hardcoded FTL AST:

talos: 790819444a8a
1: http://bit.ly/2qbGAM1
2: http://bit.ly/2q8Qy0N
3: http://bit.ly/2qcHIz0
4: http://bit.ly/2CDLfOo
5: http://bit.ly/2qaS7vd


(a) vs (b) - https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=fffffd7c5a52&newProject=try&newRevision=ca436ca69ae4

tpaint warnings and one red ~2.49%

(a) vs (c) - https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=fffffd7c5a52&newProject=try&newRevision=09fdaabb36dd&framework=1

some sessionrestore (2.5%), tpaint (3-6%) and ts_paint (2-3%).

(a) vs (d) - https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=fffffd7c5a52&newProject=try&newRevision=790819444a8a&framework=1

seems very similar to (c)

talos-compare - https://pike.github.io/talos-compare/?revision=fffffd7c5a52&revision=ca436ca69ae4&revision=09fdaabb36dd&revision=790819444a8a

My read of the results indicates that there's some regression coming from even a single string being localized on the startup path - which is likely related to us loading the Fluent jsm, initializing the system, searching for translatable strings etc. - and then there's a cost of translating a high number of strings.
The former (a vs b) places us on the verge of visible talos regressions with 1-2% impact. The latter pushes us beyond that into 3-6% territory.

=== Caching FTL ===

In particular, we wanted to test the hypothesis that placing parsed FTL into the startup cache would help us get talos wins.

(c) vs (d) - https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=09fdaabb36dd&newProject=try&newRevision=790819444a8a&framework=1

I believe there are no significant differences (although some tests are still running). I suspect that the reason for lack of visible win is that we already cache FTL during runtime, so the most sensitive tests like tpaint will only see a win in (d) scenario on the first of twenty runs.
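
The reasoning above can be put into numbers. Assuming, as this comment suspects, that the parsed FTL is already cached after the first run, skipping I/O and parsing can only save time on that first run, and the saving is diluted across the whole series. The sketch below is a back-of-the-envelope model; `mean_win_ms` and the 12ms figure are hypothetical, not measured.

```python
def mean_win_ms(parse_cost_ms, runs=20, cached_after_first=True):
    """Average per-run saving from skipping I/O and parsing.

    If the parsed resource is already cached after the first run, only that
    first run pays parse_cost_ms, so only it can benefit from skipping it.
    """
    paying_runs = 1 if cached_after_first else runs
    return parse_cost_ms * paying_runs / runs


# Hypothetical 12ms I/O + parsing cost (not a measured value).
with_cache = mean_win_ms(12.0)
without_cache = mean_win_ms(12.0, cached_after_first=False)
```

Under these assumptions a 12ms saving shrinks to well under a millisecond on average over twenty runs, which would be hard to tell apart from noise in the comparison.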

:mossop, :mconley, :smaug - if I read the talos results correctly (and please, take a spin with the profiles I attached), caching FTL in the startup path will not, on its own, help us.

We either need to cache localized DOM, or we need to remove the JS code from the startup path (so, migrate DOMLocalization to C++, and switch Fluent.jsm to fluent-rs?), or maybe even both.

How does it look to you?
Flags: needinfo?(mconley)
Flags: needinfo?(gandalf)
Flags: needinfo?(dtownsend)
Flags: needinfo?(bugs)
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #30)
> We either need to cache localized DOM,
This, or at least cache the fluent files in binary format

> or we need to remove the JS code from
> the startup path (so, migrate DOMLocalization to C++, and switch Fluent.jsm
> to fluent-rs?), or maybe even both.
And this definitely, as far as I see.

I was looking at b/3. translateRoots isn't super slow, but whatever it is doing, it is mostly spending time just running JS, crossing JS->C++ and C++->JS boundaries, and compiling JS. Having all of that in C++/Rust should be significantly more lightweight.
Same with the other parts in the profile doing l10n stuff.

So, nothing new here. We shouldn't really run any extra JS in critical paths.
(this is a broader question in general - which parts of the browser UI can be implemented in JS and which can't be.)
Flags: needinfo?(bugs)
Also, the regressions here aren't horribly bad, but over time such not-horribly-bad regressions accumulate and affect the user experience.
One of the hard parts of doing analysis like this is that the profiles are different even within the same run. For example, I compared the (a)2 and (c)1, and one difference that stood out to me is that (c)1 seems to have a harder time setting the tab min width value here:


which is a setter defined here:


in (a)2, it takes 4ms: http://bit.ly/2zfvgTx
in (c)1, it takes 22ms: http://bit.ly/2z2oxMq

But in (c)3, it takes 4ms again: http://bit.ly/2z7ayFd

so it's pretty easy to fall down blind alleys that don't actually amount to much when doing the analysis.

Certainly I trust smaug's assessment about getting JS out of the critical path - crossing the native / JS boundaries is not at all free, and we get to avoid a bunch of overhead if we avoid crossing into JS.
Flags: needinfo?(mconley)


6 months ago
Depends on: 1503657
Attachment #9013979 - Attachment is obsolete: true


5 months ago
Depends on: 1507008
Depends on: 1512674


4 months ago
Depends on: 1517880
Flags: needinfo?(dtownsend)