Closed
Bug 1240407
Opened 8 years ago
Closed 8 years ago
2-5% ts_paint/tps osx talos regression on mozilla-inbound (v.46) on Jan 14th from push: 309e36d57832
Categories
(Core :: Graphics: Text, defect)
Core
Graphics: Text
Tracking
()
RESOLVED
FIXED
mozilla47
People
(Reporter: jmaher, Assigned: jtd)
References
(Blocks 1 open bug)
Details
(Keywords: perf, regression, Whiteboard: [talos_regression])
Attachments
(2 files)
4.92 KB,
patch
|
m_kato
:
review+
|
Details | Diff | Splinter Review |
19.80 KB,
patch
|
lizzard
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
Talos has detected a Firefox performance regression from your commit 309e36d57832210a041d1dafe161e47f300e187d in bug 1212731. We need you to address this regression. This is a list of all known regressions and improvements related to your bug: http://alertmanager.allizom.org:8080/alerts.html?rev=309e36d57832210a041d1dafe161e47f300e187d&showAll=1 On the page above you can see Talos alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format. To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#ts_paint To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#tps Reproducing and debugging the regression: If you would like to re-run this Talos test on a potential fix, use try with the following syntax: try: -b o -p macosx64 -u none -t other,g2 # add "mozharness: --spsProfile" to generate profile data To run the test locally and do a more in-depth investigation, first set up a local Talos environment: https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code Then run the following command from the directory where you set up Talos: talos --develop -e <path>/firefox -a ts_paint:tps Making a decision: As the patch author we need your feedback to help us handle this regression. *** Please let us know your plans by Thursday, or the offending patch will be backed out! *** Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Reporter | ||
Comment 1•8 years ago
|
||
here is a link to a compareview: ts_paint (opt + e10s): https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=cf0e0332832e&newProject=mozilla-inbound&newRevision=309e36d57832&framework=1&filter=ts_ tps (opt); https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=cf0e0332832e&newProject=mozilla-inbound&newRevision=309e36d57832&framework=1&filter=tps These links show the difference between a series of retriggers on the offending push and the previous push: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=34a6488a7b20&tochange=32a8c6a3be18&filter-searchStr=mozilla-inbound%20talos&selectedJob=19961301 :jtd, can you help us figure out why this is happening? It is only on osx, so that is a good clue! Also if you have ideas on how to fix this or need more info- speak up :)
Flags: needinfo?(jdaggett)
Assignee | ||
Updated•8 years ago
|
Assignee: nobody → jdaggett
Flags: needinfo?(jdaggett)
Assignee | ||
Comment 2•8 years ago
|
||
Taking a look at this today. Either the lookups aren't cached correctly or the simple fact of acting another n font families to the set of fonts used for a font group increases the font matching overhead.
Assignee | ||
Comment 3•8 years ago
|
||
I think the attribution isn't correct here, I pushed the current contents of mozcentral vs. mozcentral with this feature disabled and the tp times are almost identical. It looks like some other feature regressed, as the tp times have already returned to the pre-14th state (at least looking at mozilla-inbound numbers). With feature explicitly disabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=2be52269e19c&selectedJob=15585666 With feature reenabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ee7ac62a2ec1&selectedJob=15585791 tps: 67.4 vs. 68.9 (disabled vs. enabled) ts_paint: 1397 vs. 1381 tp5o: 266 vs. 262 The graph server results show a big blip on Jan 14 with a fix on Jan 16: http://graphs.mozilla.org/graph.html#tests=[[255,63,59]]&sel=1447802278508.4846,1453184299033,217.74193548387098,295.16129032258067&displayrange=90&datatype=geo Let me know if you still think there's a regression issue here.
Flags: needinfo?(jmaher)
Comment 4•8 years ago
|
||
(In reply to John Daggett (:jtd) from comment #3) > The graph server results show a big blip on Jan 14 with a fix on Jan 16: > > http://graphs.mozilla.org/graph.html#tests=[[255,63,59]]&sel=1447802278508. > 4846,1453184299033,217.74193548387098,295. > 16129032258067&displayrange=90&datatype=geo It does, but that's distinct from what's being reported here. Note that bug 1212731 didn't hit inbound until Jan 15. Comparing the graphs, it's clear that the Ts regression here occurs almost 24 hours after that tp5o blip you're pointing at; and it doesn't go away when the tp5o blip drops back on the 16th. http://graphs.mozilla.org/graph.html#tests=[[255,63,59],[83,63,61],[83,63,59]]&sel=none&displayrange=7&datatype=geo
Comment 5•8 years ago
|
||
(In reply to John Daggett (:jtd) from comment #3) > With feature explicitly disabled: > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=2be52269e19c&selectedJob=15585666 > > With feature reenabled: > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=ee7ac62a2ec1&selectedJob=15585791 I retriggered some jobs on these pushes, to see beyond the noise, and it's clear that there is a significant regression on the tps test (close to 6%) when this is enabled: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=2be52269e19c&newProject=try&newRevision=ee7ac62a2ec1&framework=1
Reporter | ||
Comment 6•8 years ago
|
||
the tp5o regression and partial fix is from bug 1236104, we are tracking the perf regressions related to that in bug 1239834. I agree with jkew that the try pushes for tps show a clear difference- and I think the try pushes show the ts_paint is not really related- let me hunt that down, it could be related to the previous push or something like that. Here is a graph showing try and inbound together: https://treeherder.allizom.org/perf.html#/graphs?series=[mozilla-inbound,f6fb7dcb89c26e9c7e18722fa9846f60583d2ef2,1]&series=[try,f6fb7dcb89c26e9c7e18722fa9846f60583d2ef2,1]&highlightedRevisions=cf0e0332832e&highlightedRevisions=ee7ac62a2ec1&timerange=604800
Flags: needinfo?(jmaher)
Reporter | ||
Comment 7•8 years ago
|
||
I looked back over the ts_paint regression and you can see it clearly: https://treeherder.mozilla.org/perf.html#/graphs?series=[mozilla-inbound,fd481b27e806dff6494d6a3007b5b4973cd28869,1]&highlightedRevisions=cf0e0332832e&highlightedRevisions=309e36d57832&zoom=1452727334938.0166,1452968684400.8264,659.0148642160635,890.9851244391118 if you remove one or the other highlighted revision, you can see the clear distinction. In addition, I did additional retriggers on the target revision and the previous revision and the number still come up showing a measureable difference. This matches up well with the zoomed out view for historical and future data showing a sustained regression. For the bug, how about we focus on tps as that is easily viewed with the try push- we will see if that affects ts_paint as well!
Comment 8•8 years ago
|
||
(In reply to Joel Maher (:jmaher) from comment #7) > For the bug, how about we focus on tps as that is easily viewed with the try > push- we will see if that affects ts_paint as well! Note that the try job from comment 3 only disabled the cascade feature at font-matching time; there's still live code from bug 1212731 that was added to gfxMacPlatformFontList::InitSystemFonts(). That will run during startup. So if the LookupFontCascadeForLang() method it calls ends up being a bit expensive, that could explain the ts_paint regression (and the "disabled" try job wouldn't have fixed it, as it didn't disable that part of the code).
Reporter | ||
Comment 9•8 years ago
|
||
thanks for clarifying. It sounds very much like this is a bit more understandable. Now for the challenge- to get a fix if possible without spending too many days on it!
Assignee | ||
Comment 10•8 years ago
|
||
(In reply to Jonathan Kew (:jfkthame) from comment #8) > I agree with jkew that the try pushes for tps show a clear difference- and I > think the try pushes show the ts_paint is not really related- let me hunt > that down, it could be related to the previous push or something like that. Ok, I'll investigate this some more tomorrow.
Comment 11•8 years ago
|
||
John, looking at the code in gfxMacPlatformFontList::InitSystemFonts(), I think it's a mistake to (unconditionally) call LookupFontCascadeForLang for lang="en" there. AFAICT, what actually gets used during rendering will likely be "en-us" or "en-gb" or something else (according to system locale? browser localization?), so the "en" cascade being eagerly cached here is probably redundant.
Assignee | ||
Comment 12•8 years ago
|
||
I ran through the tp5o testpages today with a local perf build. I think the problem is that appending the cascade to the fontlist for system fonts adds to the running time of gfxFontGroup::BuildFontList. BuildFontList time for tp5o pages, 1 iteration: 60.5 ms vs. 189 ms (without and with cascade appended) The top 8 BuildFontList times with the cascade take 120 ms and these all are including the cascade. So maybe the way to tackle this is to shift the cascade to the fallback code, such that for simple UI text these cascades won't be added.
Assignee | ||
Comment 13•8 years ago
|
||
(In reply to Jonathan Kew (:jfkthame) from comment #11) > John, looking at the code in gfxMacPlatformFontList::InitSystemFonts(), I > think it's a mistake to (unconditionally) call LookupFontCascadeForLang for > lang="en" there. AFAICT, what actually gets used during rendering will > likely be "en-us" or "en-gb" or something else (according to system locale? > browser localization?), so the "en" cascade being eagerly cached here is > probably redundant. The app locale will give the right flavor of en-ness here. But that's not so important with regards to the slowdown. The slowdown is caused by checking non-CJK cascade lists against the default. The cascade list is only ever different for CJK locales, so the time spent looking up cascades for those languages is what's adding a lot of overhead. With a new patch that assumes non-CJK cascade lists are the same, I can eliminate most of that added overhead.
Assignee | ||
Comment 14•8 years ago
|
||
Instead of verifying that the system font cascade for most locales is the same as the default one, only look up the system font cascade for CJK locales. This is the only place where Apple code jiggles the relative ordering of CJK fonts in the list. This eliminates most of the overhead of looking up system font cascades.
Attachment #8711612 -
Flags: review?(m_kato)
Updated•8 years ago
|
Attachment #8711612 -
Flags: review?(m_kato) → review+
Comment 16•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/92770b987b87
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox47:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Reporter | ||
Comment 17•8 years ago
|
||
I am not seeing improvements on the graphs on inbound. This landed roughly a day ago, we have had a lot of data in the last 24 hours. Do we have other ideas on why this is a problem?
Flags: needinfo?(jd.bugzilla)
Assignee | ||
Comment 18•8 years ago
|
||
(In reply to Joel Maher (:jmaher) from comment #17) > I am not seeing improvements on the graphs on inbound. This landed roughly > a day ago, we have had a lot of data in the last 24 hours. > > Do we have other ideas on why this is a problem? Just to confirm, you're talking about OSX 10.10 tps numbers right? Those look like they are now clustering slightly lower than before the fix here landed but still have not completely returned to levels before the patches on bug 1212731 were landed. So I think the culprit must be the cost of looking up the system font cascade the first time in gfxMacPlatformFontList::InitSystemFonts. I'll run some tests to confirm this. The effect of not initializing the system font cascade at startup would be that UI text (menus, controls) would use the system font but normal system fallback for non-Latin script strings. This has the potential for UI that doesn't match the platform conventions. One thing we could do is push the system font cascade initialization into a separate thread, such that for a brief number of milliseconds at startup the system font cascade would be empty. That adds the cost of creating and destroying a thread but I imagine it would eliminate the regression for this particular test.
Flags: needinfo?(jd.bugzilla)
Assignee | ||
Comment 19•8 years ago
|
||
I dug into the timing of gfxMacPlatformFontList::LookupFontCascadeForLang a little more today. The initial call on 10.10 takes 43ms (!!) which is basically not acceptable at startup time. Most of this is not calling the system to fetch the cascade itself but the time needed to dig the real family name out of the items in the list. Iterating over the list to get the family names takes 86% of the time within the loop.
Reporter | ||
Comment 20•8 years ago
|
||
ok, I am reopening this and we can continue investigations.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter | ||
Comment 21•8 years ago
|
||
this is seen on aurora
Reporter | ||
Comment 22•8 years ago
|
||
I see session restore no auto restore on osx 10.10.5 opt and e10s regressing with this as well.
Assignee | ||
Comment 23•8 years ago
|
||
Because of the regression in bug 1244017, the system font cascade code was backed out. That should resolve this regression for now. Any future solution will need to resolve/abate the large startup time regression.
Reporter | ||
Updated•8 years ago
|
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Comment 24•8 years ago
|
||
Is this still affecting 46? It is late to uplift, but I just noticed this was fixed in 47 but not 46.
Flags: needinfo?(jmaher)
Flags: needinfo?(jd.bugzilla)
Reporter | ||
Comment 25•8 years ago
|
||
good catch lizzard! I see on beta what appears to be this change: https://treeherder.mozilla.org/perf.html#/graphs?timerange=31536000&series=%5Bmozilla-aurora,846e351e4a7d1d2694240ca420004fff6ab8ae1d,1%5D&series=%5Bmozilla-beta,846e351e4a7d1d2694240ca420004fff6ab8ae1d,1%5D&series=%5Bmozilla-central,846e351e4a7d1d2694240ca420004fff6ab8ae1d,1%5D&zoom=1449335823712.3745,1460222057216.5552,500,1599.4423791821564 I couldn't find other instances of osx or ts_paint that matched this, as we saw this on aurora. Looking at the code to disable this, I saw it land on inbound Feb 7th, which was Firefox 47. I am not sure of the risk of doing this for beta.
Flags: needinfo?(jmaher)
Comment 26•8 years ago
|
||
The talos regression here came from bug 1212731, which landed on mozilla-central on 2016-01-17 for mozilla46. It was then backed out due to a behavioral regression in bug 1244017, which hit m-c on 2016-02-08, but by that time m-c was at mozilla47. So the problem is that bug 1244017 wasn't noted as affecting 46, and nobody thought to uplift it to aurora (which it probably should have been). Since then, the trains have all moved on, so the regression (both behavioral and talos) is now on beta, and the backout is on central+aurora. Given that it's a pretty straightforward backout, and has been on central since early February (and on aurora for the whole of the aurora47 cycle), I think uplifting the backout (i.e. the patches in bug 1244017) to beta46, despite how late it is, should still be OK. This will regress the font-selection issues that bug 1212731 was originally aiming to solve, but we've accepted that for 47 and later anyway, until such time as someone revisits that bug and comes up with a new fix. So I see little benefit in shipping that "fix" (with its associated regressions) in 46, only to revert it in 47 anyway. To sum up, I think we should uplift bug 1244017 to beta. (That will also involve uplifting the patch here (comment 15), as bug 1244017 landed on top of it.) It also needs (trivial) manual merging because of other changes (e.g. nsAutoTArray -> AutoTArray) that have landed in the same time period. I'll post a combined backout patch here after tryserver confirms that it builds OK. https://treeherder.mozilla.org/#/jobs?repo=try&revision=9abc97746510
Comment 27•8 years ago
|
||
Tryserver mozilla-beta builds look fine; still waiting for it to get around to running reftests, but not expecting any issues there as the related tests are simply disabled by this patch. So this is what I propose we land for 46, as discussed above.
Comment 28•8 years ago
|
||
Comment on attachment 8741356 [details] [diff] [review] [patch for mozilla-beta only] Back out bug 1212731 on mozilla46 by uplifting patches from bugs 1244017 and 1240407 to beta. Carrying over r=m_kato for the changesets as landed on trunk; rebased to current tip of mozilla-beta Approval Request Comment [Feature/regressing bug #]: Bug 1212731 [User impact if declined]: Small perf regression (this bug) and behavior regression (font selection problems for certain characters, bug 1244017) [Describe test coverage new/current, TreeHerder]: This restores the pre-FF46 code (tested by existing reftests etc), and removes the new tests that were added for bug 1212731 [Risks and why]: Should be low risk, this just reverts to long-standing code that we've already been shipping, and has already been done on trunk/aurora. [String/UUID change made/needed]: None
Attachment #8741356 -
Attachment description: [patch for mozilla-beta only] Back out bug 1212731 on mozilla46 by uplifting patches from bugs 1244017 and 1240407 to beta. Carrying over for the changesets as landed on trunk; rebased to current tip of mozilla-beta → [patch for mozilla-beta only] Back out bug 1212731 on mozilla46 by uplifting patches from bugs 1244017 and 1240407 to beta. Carrying over r=m_kato for the changesets as landed on trunk; rebased to current tip of mozilla-beta
Attachment #8741356 -
Flags: approval-mozilla-beta?
Comment 29•8 years ago
|
||
Comment on attachment 8741356 [details] [diff] [review] [patch for mozilla-beta only] Back out bug 1212731 on mozilla46 by uplifting patches from bugs 1244017 and 1240407 to beta. Carrying over r=m_kato for the changesets as landed on trunk; rebased to current tip of mozilla-beta Backout for perf regression. Please uplift to beta only.
Attachment #8741356 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Reporter | ||
Comment 31•8 years ago
|
||
I see a bunch of improvements from talos on mozilla-beta: https://treeherder.mozilla.org/perf.html#/alerts?id=921
Assignee | ||
Updated•8 years ago
|
Flags: needinfo?(jd.bugzilla)
You need to log in
before you can comment on or make changes to this bug.
Description
•