Closed Bug 1240407 Opened 4 years ago Closed 4 years ago

2-5% ts_paint/tps osx talos regression on mozilla-inbound (v.46) on Jan 14th from push: 309e36d57832

Categories

(Core :: Graphics: Text, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla47
Tracking Status
firefox46 --- fixed
firefox47 --- fixed

People

(Reporter: jmaher, Assigned: jtd)

References

(Blocks 1 open bug)

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Attachments

(2 files)

Talos has detected a Firefox performance regression from your commit 309e36d57832210a041d1dafe161e47f300e187d in bug 1212731.  We need you to address this regression.

This is a list of all known regressions and improvements related to your bug:
http://alertmanager.allizom.org:8080/alerts.html?rev=309e36d57832210a041d1dafe161e47f300e187d&showAll=1

On the page above you can see Talos alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#ts_paint
To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#tps

Reproducing and debugging the regression:
If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p macosx64 -u none -t other,g2  # add "mozharness: --spsProfile" to generate profile data

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e <path>/firefox -a ts_paint:tps

Making a decision:
As the patch author we need your feedback to help us handle this regression.
*** Please let us know your plans by Thursday, or the offending patch will be backed out! ***

Our wiki page outlines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
here is a link to a compareview:

ts_paint (opt + e10s):
https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=cf0e0332832e&newProject=mozilla-inbound&newRevision=309e36d57832&framework=1&filter=ts_

tps (opt);
https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=cf0e0332832e&newProject=mozilla-inbound&newRevision=309e36d57832&framework=1&filter=tps

These links show the difference between a series of retriggers on the offending push and the previous push:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=34a6488a7b20&tochange=32a8c6a3be18&filter-searchStr=mozilla-inbound%20talos&selectedJob=19961301

:jtd, can you help us figure out why this is happening?  It is only on osx, so that is a good clue!  Also if you have ideas on how to fix this or need more info- speak up :)
Flags: needinfo?(jdaggett)
Assignee: nobody → jdaggett
Flags: needinfo?(jdaggett)
Taking a look at this today. Either the lookups aren't cached correctly or the simple fact of acting another n font families to the set of fonts used for a font group increases the font matching overhead.
I think the attribution isn't correct here, I pushed the current contents of mozcentral vs. mozcentral with this feature disabled and the tp times are almost identical. It looks like some other feature regressed, as the tp times have already returned to the pre-14th state (at least looking at mozilla-inbound numbers).

With feature explicitly disabled:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2be52269e19c&selectedJob=15585666

With feature reenabled:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ee7ac62a2ec1&selectedJob=15585791

tps: 67.4 vs. 68.9 (disabled vs. enabled)
ts_paint: 1397 vs. 1381
tp5o: 266 vs. 262

The graph server results show a big blip on Jan 14 with a fix on Jan 16:

http://graphs.mozilla.org/graph.html#tests=[[255,63,59]]&sel=1447802278508.4846,1453184299033,217.74193548387098,295.16129032258067&displayrange=90&datatype=geo

Let me know if you still think there's a regression issue here.
Flags: needinfo?(jmaher)
(In reply to John Daggett (:jtd) from comment #3)
> The graph server results show a big blip on Jan 14 with a fix on Jan 16:
> 
> http://graphs.mozilla.org/graph.html#tests=[[255,63,59]]&sel=1447802278508.
> 4846,1453184299033,217.74193548387098,295.
> 16129032258067&displayrange=90&datatype=geo

It does, but that's distinct from what's being reported here. Note that bug 1212731 didn't hit inbound until Jan 15. Comparing the graphs, it's clear that the Ts regression here occurs almost 24 hours after that tp5o blip you're pointing at; and it doesn't go away when the tp5o blip drops back on the 16th.

http://graphs.mozilla.org/graph.html#tests=[[255,63,59],[83,63,61],[83,63,59]]&sel=none&displayrange=7&datatype=geo
(In reply to John Daggett (:jtd) from comment #3)
> With feature explicitly disabled:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=2be52269e19c&selectedJob=15585666
> 
> With feature reenabled:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=ee7ac62a2ec1&selectedJob=15585791

I retriggered some jobs on these pushes, to see beyond the noise, and it's clear that there is a significant regression on the tps test (close to 6%) when this is enabled:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=2be52269e19c&newProject=try&newRevision=ee7ac62a2ec1&framework=1
the tp5o regression and partial fix is from bug 1236104, we are tracking the perf regressions related to that in bug 1239834.  

I agree with jkew that the try pushes for tps show a clear difference- and I think the try pushes show the ts_paint is not really related- let me hunt that down, it could be related to the previous push or something like that.

Here is a graph showing try and inbound together:
https://treeherder.allizom.org/perf.html#/graphs?series=[mozilla-inbound,f6fb7dcb89c26e9c7e18722fa9846f60583d2ef2,1]&series=[try,f6fb7dcb89c26e9c7e18722fa9846f60583d2ef2,1]&highlightedRevisions=cf0e0332832e&highlightedRevisions=ee7ac62a2ec1&timerange=604800
Flags: needinfo?(jmaher)
I looked back over the ts_paint regression and you can see it clearly:
https://treeherder.mozilla.org/perf.html#/graphs?series=[mozilla-inbound,fd481b27e806dff6494d6a3007b5b4973cd28869,1]&highlightedRevisions=cf0e0332832e&highlightedRevisions=309e36d57832&zoom=1452727334938.0166,1452968684400.8264,659.0148642160635,890.9851244391118

if you remove one or the other highlighted revision, you can see the clear distinction.  In addition, I did additional retriggers on the target revision and the previous revision and the number still come up showing a measureable difference.  This matches up well with the zoomed out view for historical and future data showing a sustained regression.

For the bug, how about we focus on tps as that is easily viewed with the try push- we will see if that affects ts_paint as well!
(In reply to Joel Maher (:jmaher) from comment #7)
> For the bug, how about we focus on tps as that is easily viewed with the try
> push- we will see if that affects ts_paint as well!

Note that the try job from comment 3 only disabled the cascade feature at font-matching time; there's still live code from bug 1212731 that was added to gfxMacPlatformFontList::InitSystemFonts(). That will run during startup.

So if the LookupFontCascadeForLang() method it calls ends up being a bit expensive, that could explain the ts_paint regression (and the "disabled" try job wouldn't have fixed it, as it didn't disable that part of the code).
thanks for clarifying.  It sounds very much like this is a bit more understandable.  Now for the challenge- to get a fix if possible without spending too many days on it!
(In reply to Jonathan Kew (:jfkthame) from comment #8)

> I agree with jkew that the try pushes for tps show a clear difference- and I
> think the try pushes show the ts_paint is not really related- let me hunt
> that down, it could be related to the previous push or something like that.

Ok, I'll investigate this some more tomorrow.
John, looking at the code in gfxMacPlatformFontList::InitSystemFonts(), I think it's a mistake to (unconditionally) call LookupFontCascadeForLang for lang="en" there. AFAICT, what actually gets used during rendering will likely be "en-us" or "en-gb" or something else (according to system locale? browser localization?), so the "en" cascade being eagerly cached here is probably redundant.
I ran through the tp5o testpages today with a local perf build. I think the problem is that appending the cascade to the fontlist for system fonts adds to the running time of gfxFontGroup::BuildFontList.

BuildFontList time for tp5o pages, 1 iteration:
60.5 ms vs. 189 ms (without and with cascade appended)

The top 8 BuildFontList times with the cascade take 120 ms and these all are including the cascade.

So maybe the way to tackle this is to shift the cascade to the fallback code, such that for simple UI text these cascades won't be added.
(In reply to Jonathan Kew (:jfkthame) from comment #11)
> John, looking at the code in gfxMacPlatformFontList::InitSystemFonts(), I
> think it's a mistake to (unconditionally) call LookupFontCascadeForLang for
> lang="en" there. AFAICT, what actually gets used during rendering will
> likely be "en-us" or "en-gb" or something else (according to system locale?
> browser localization?), so the "en" cascade being eagerly cached here is
> probably redundant.

The app locale will give the right flavor of en-ness here. But that's not so important with regards to the slowdown. The slowdown is caused by checking non-CJK cascade lists against the default. The cascade list is only ever different for CJK locales, so the time spent looking up cascades for those languages is what's adding a lot of overhead. With a new patch that assumes non-CJK cascade lists are the same, I can eliminate most of that added overhead.
Instead of verifying that the system font cascade for most locales is the same as the default one, only look up the system font cascade for CJK locales. This is the only place where Apple code jiggles the relative ordering of CJK fonts in the list. This eliminates most of the overhead of looking up system font cascades.
Attachment #8711612 - Flags: review?(m_kato)
Attachment #8711612 - Flags: review?(m_kato) → review+
https://hg.mozilla.org/mozilla-central/rev/92770b987b87
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
I am not seeing improvements on the graphs on inbound.  This landed roughly a day ago, we have had a lot of data in the last 24 hours.

Do we have other ideas on why this is a problem?
Flags: needinfo?(jd.bugzilla)
(In reply to Joel Maher (:jmaher) from comment #17)
> I am not seeing improvements on the graphs on inbound.  This landed roughly
> a day ago, we have had a lot of data in the last 24 hours.
> 
> Do we have other ideas on why this is a problem?

Just to confirm, you're talking about OSX 10.10 tps numbers right? Those look like they are now clustering slightly lower than before the fix here landed but still have not completely returned to levels before the patches on bug 1212731 were landed. So I think the culprit must be the cost of looking up the system font cascade the first time in gfxMacPlatformFontList::InitSystemFonts. I'll run some tests to confirm this.

The effect of not initializing the system font cascade at startup would be that UI text (menus, controls) would use the system font but normal system fallback for non-Latin script strings. This has the potential for UI that doesn't match the platform conventions.

One thing we could do is push the system font cascade initialization into a separate thread, such that for a brief number of milliseconds at startup the system font cascade would be empty. That adds the cost of creating and destroying a thread but I imagine it would eliminate the regression for this particular test.
Flags: needinfo?(jd.bugzilla)
I dug into the timing of gfxMacPlatformFontList::LookupFontCascadeForLang a little more today. The initial call on 10.10 takes 43ms (!!) which is basically not acceptable at startup time. Most of this is not calling the system to fetch the cascade itself but the time needed to dig the real family name out of the items in the list. Iterating over the list to get the family names takes 86% of the time within the loop.
ok, I am reopening this and we can continue investigations.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
this is seen on aurora
I see session restore no auto restore on osx 10.10.5 opt and e10s regressing with this as well.
Because of the regression in bug 1244017, the system font cascade code was backed out. That should resolve this regression for now. Any future solution will need to resolve/abate the large startup time regression.
Status: REOPENED → RESOLVED
Closed: 4 years ago4 years ago
Resolution: --- → FIXED
Is this still affecting 46? It is late to uplift, but I just noticed this was fixed in 47 but not 46.
Flags: needinfo?(jmaher)
Flags: needinfo?(jd.bugzilla)
good catch lizzard!

I see on beta what appears to be this change:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=31536000&series=%5Bmozilla-aurora,846e351e4a7d1d2694240ca420004fff6ab8ae1d,1%5D&series=%5Bmozilla-beta,846e351e4a7d1d2694240ca420004fff6ab8ae1d,1%5D&series=%5Bmozilla-central,846e351e4a7d1d2694240ca420004fff6ab8ae1d,1%5D&zoom=1449335823712.3745,1460222057216.5552,500,1599.4423791821564

I couldn't find other instances of osx or ts_paint that matched this, as we saw this on aurora.  Looking at the code to disable this, I saw it land on inbound Feb 7th, which was Firefox 47.

I am not sure of the risk of doing this for beta.
Flags: needinfo?(jmaher)
The talos regression here came from bug 1212731, which landed on mozilla-central on 2016-01-17 for mozilla46.

It was then backed out due to a behavioral regression in bug 1244017, which hit m-c on 2016-02-08, but by that time m-c was at mozilla47.

So the problem is that bug 1244017 wasn't noted as affecting 46, and nobody thought to uplift it to aurora (which it probably should have been).

Since then, the trains have all moved on, so the regression (both behavioral and talos) is now on beta, and the backout is on central+aurora.

Given that it's a pretty straightforward backout, and has been on central since early February (and on aurora for the whole of the aurora47 cycle), I think uplifting the backout (i.e. the patches in bug 1244017) to beta46, despite how late it is, should still be OK. This will regress the font-selection issues that bug 1212731 was originally aiming to solve, but we've accepted that for 47 and later anyway, until such time as someone revisits that bug and comes up with a new fix. So I see little benefit in shipping that "fix" (with its associated regressions) in 46, only to revert it in 47 anyway.

To sum up, I think we should uplift bug 1244017 to beta. (That will also involve uplifting the patch here (comment 15), as bug 1244017 landed on top of it.) It also needs (trivial) manual merging because of other changes (e.g. nsAutoTArray -> AutoTArray) that have landed in the same time period. I'll post a combined backout patch here after tryserver confirms that it builds OK.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9abc97746510
Tryserver mozilla-beta builds look fine; still waiting for it to get around to running reftests, but not expecting any issues there as the related tests are simply disabled by this patch. So this is what I propose we land for 46, as discussed above.
Comment on attachment 8741356 [details] [diff] [review]
[patch for mozilla-beta only] Back out bug 1212731 on mozilla46 by uplifting patches from bugs 1244017 and 1240407 to beta. Carrying over r=m_kato for the changesets as landed on trunk; rebased to current tip of mozilla-beta

Approval Request Comment
[Feature/regressing bug #]: Bug 1212731

[User impact if declined]: Small perf regression (this bug) and behavior regression (font selection problems for certain characters, bug 1244017)

[Describe test coverage new/current, TreeHerder]: This restores the pre-FF46 code (tested by existing reftests etc), and removes the new tests that were added for bug 1212731

[Risks and why]: Should be low risk, this just reverts to long-standing code that we've already been shipping, and has already been done on trunk/aurora.

[String/UUID change made/needed]: None
Attachment #8741356 - Attachment description: [patch for mozilla-beta only] Back out bug 1212731 on mozilla46 by uplifting patches from bugs 1244017 and 1240407 to beta. Carrying over for the changesets as landed on trunk; rebased to current tip of mozilla-beta → [patch for mozilla-beta only] Back out bug 1212731 on mozilla46 by uplifting patches from bugs 1244017 and 1240407 to beta. Carrying over r=m_kato for the changesets as landed on trunk; rebased to current tip of mozilla-beta
Attachment #8741356 - Flags: approval-mozilla-beta?
Comment on attachment 8741356 [details] [diff] [review]
[patch for mozilla-beta only] Back out bug 1212731 on mozilla46 by uplifting patches from bugs 1244017 and 1240407 to beta. Carrying over r=m_kato for the changesets as landed on trunk; rebased to current tip of mozilla-beta

Backout for perf regression. Please uplift to beta only.
Attachment #8741356 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
I see a bunch of improvements from talos on mozilla-beta:
https://treeherder.mozilla.org/perf.html#/alerts?id=921
Flags: needinfo?(jd.bugzilla)
You need to log in before you can comment on or make changes to this bug.