Closed Bug 1603404 Opened 6 years ago Closed 4 years ago

Performance change between Firefox 70 and Firefox 71 Wikipedia

Categories

(Core :: Graphics: Text, defect, P3)

71 Branch
Unspecified
Linux
defect

Tracking

()

RESOLVED WONTFIX
Performance Impact high
Tracking Status
firefox-esr68 --- unaffected
firefox71 --- wontfix
firefox72 --- wontfix
firefox73 --- fix-optional

People

(Reporter: peter, Unassigned)

References

(Regression)

Details

(Keywords: perf:pageload, regression)

Attachments

(4 files)

Attached image 71-facebook.jpg

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Safari/605.1.15

Steps to reproduce:

Hi, I work in the Wikimedia Foundations performance team and I've seen some changes in performance between 70 and 71. Let me explain: We focus on first visual change since we know that is one of our most stable metrics. We run to two different tests: one using Browsertime with WebPageReplay to measure frontend performance and one set of tests using WebPageTest.

I got alerts from our Browsertime/WebPageReplay tests last days for first visual change (we test 12 wikis and three pages each) but I wasn't 100% sure because I also changed Firefox preferences during that time.

But when WebPageTest also switched to 71 I could see changes in the performance there too. However we got one page that got a lot faster in our testing with 71:
https://en.wikipedia.org/wiki/Barack_Obama

But we also got pages that got slower:
https://en.wikipedia.org/wiki/Facebook (I've attached a screenshot what the change look like for us).

Our bug for this is: https://phabricator.wikimedia.org/T240557

We also have RUM metrics but Firefox is usually a small part so we cannot see changes + miss first paint there.

Is there any change you all can backtrack to these changes? I want to know if there's something we can do different to avoid the pages getting slower.

Attached image facebook.png

Hi Peter,

Thanks for being clear about the bug. I used this add-on https://addons.mozilla.org/en-US/firefox/addon/load-time/#&gid=1&pid=1 to measure the loading time of the page https://en.wikipedia.org/wiki/Facebook. I'm a bit confused since some of the parameters have improved and some have worsened. Which one do you take as a reference?

Regards, Flor.

Flags: needinfo?(peter)

Hi Florencia, we focus on First Visual Change (when something first is painted on the screen) that we get from a video we record, but we also get your timeToNonBlankPaint that also have the same behavior in our tests.

However on the English Wikipedia we could see one page getting faster (and that I also think was tied to the Firefox upgrade). A couple of more pages where we could see a regression:
https://zh.wikipedia.org/wiki/Facebook
https://zh.wikipedia.org/wiki/北京市

But there are also a lot pages that we test where I don't see any change at all. So it's hard :) I'm curios if you know any change that could affect the first paint in the changes in 71? It's hard for me from the outside checking changes.

Status: UNCONFIRMED → NEW
Ever confirmed: true

I've repro'd this locally on Linux using browsertime (aka sitespeed). I see a ~20% degredation in FP/FVC and SpeedIndex/loadtime/etc.

$ ./mach browsertime --browser firefox --visualMetrics=true --firefox.binaryPath=obj-opt/dist/bin/firefox https://en.wikipedia.org/wiki/Facebook --iterations 10 --video=true

I also tried it with profiles on:
./mach browsertime --browser firefox --visualMetrics=true --firefox.binaryPath=obj-opt/dist/bin/firefox https://en.wikipedia.org/wiki/Facebook --iterations 5 --video=true --firefox.geckoProfiler --firefox.geckoProfilerParams.features=js,stackwalk,leaf,responsiveness,screenshots --firefox.geckoProfilerParams.threads="GeckoMain,Compositor,Socket Thread,SSL,Cert"

Note that to build FF70 locally if you have an up-to-date rust, you need to do "rustup override set 1.36.0" in the ff70 repo. To build 71, you need to apply a patch from bug 1602358 (which need minor conflict resolution to apply to 71)

Results from ff70:
[2019-12-16 16:22:44] INFO: https://en.wikipedia.org/wiki/Facebook 0, backEndTime: 91ms (±6.45ms), firstPaint: 1.75s (±61.45ms), firstVisualChange: 1.79s (±54.73ms), DOMContentLoaded: 1.81s (±55.54ms), Load: 1.84s (±56.58ms), speedIndex: 1984 (±152.65), perceptualSpeedIndex: 2027 (±161.28), contentfulSpeedIndex: 1837 (±73.54), visualComplete85: 2.11s (±225.25ms), lastVisualChange: 2.50s (±99.02ms), rumSpeedIndex: 668 (±20.33) (5 runs)

ff71:
[2019-12-16 12:42:34] INFO: https://en.wikipedia.org/wiki/Facebook 0, backEndTime: 92ms (±6.16ms), firstPaint: 1.13s (±9.65ms), firstVisualChange: 1.18s (±8.39ms), DOMContentLoaded: 1.18s (±9.34ms), Load: 1.22s (±10.15ms), speedIndex: 1189 (±8.42), perceptualSpeedIndex: 1212 (±8.51), contentfulSpeedIndex: 1185 (±8.34), visualComplete85: 1.18s (±8.39ms), lastVisualChange: 1.71s (±16.78ms), rumSpeedIndex: 208 (±9.73) (10 runs)

(those are with screenshots and profiling on; I see similar though faster results with profiling and screenshots off)

The primary difference I saw looking at the profiles of the median run from 70 and 71 is that Reflow takes about 25% longer (1024ms vs 745ms)

I'll attach the profiles, but note most symbols are missing since browsertime doesn't correctly set the directories for geckoprofiler to get symbols from. You should see similar results with local profiling runs. I have the raw numbers from 10 runs without profiling, and 5 with, for each of 70 and 71, as well as videos of each one (which can be combined with a script to do side-by-sides).

Profiles of Median FirstVisualChange run for each:
ff70: https://perfht.ml/2EupHDG ff71: https://perfht.ml/2Puygoi
I have 4 more of each profile if you want.

This page and the Barack Obama page, which did not regress apparently, are very similar. Also, we monitor a (recorded copy) of the Barack Obama page in perfherder/raptor-tp6, and saw no regressions.

Vicky indicated layout perf issues should start with Matt and Maire; NI-ing

Feel free to reach out to me with any questions or for assistance.

Component: Untriaged → Layout
Flags: needinfo?(matt.woodrow)
Product: Firefox → Core
Whiteboard: [qf:p1:pageload]

If this is reproducible, it should be feasible to run mozregression on this, right?

I asked Jesup, and, quoting:

jesup: you don't need profiler, --video or --videometrics to repro - you can do it off FP/FVC or loadtime I believe
jesup: not doing metrics makes it much faster
emilio: jesup so ideally I can get a central checkout and do ./mach browsertime with the right path to the mozregression-downloaded binary path?
emilio: So with --firefox.binaryPath=/path/to/wherever/mozregression/unpacked/stuff
jesup: Yes, I think that'll work

So is someone volunteering? Or did I just volunteer? ;)

(I can try to do this tomorrow or sometime this week if nobody does it before me)

Note to self: I just tried the above, and it almost works. I had to use MOZ_DISABLE_CONTENT_SANDBOX=1, because of bug 1597792 (which I forgot about). d'oh.

Mozregression pointed me to bug 1583707: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=28fcea56b3a3ee186aa736c80e1f95b62c7bcf74&tochange=566dce282e2c06afa4318ba35d471f5c2b8c85e2

If I understand correctly that change made our layout on Linux be similar to other OSes. But it'd be good to confirm.

Attached file MozRegression log

(for reference)

So the way I tested above (also for reference) is running mozregression normally, but then grabbing the binary name (like /tmp/yadayada/firefox/firefox) and using: ./mach browsertime --browser=firefox --iterations 5 https://en.wikipedia.org/wiki/Facebook --firefox.binaryPath=/tmp/tmph1u87yxm/firefox/firefox.

The page load / time to first paint times would be different enough for it to be noticeable. Se above.

So, can somebody reproduce this with mozregression --repo autoland --launch 28fcea56b3a3ee186aa736c80e1f95b62c7bcf74 (good) and mozregression --repo autoland --launch ebeb78d6ad375a58cab872a7d49ab712c6653b48 (bad)?

Also, would be good to confirm that the above does not reproduce on Mac / Windows.

Tentatively assuming I got the regression range right.

Regressed by: 1583707

So I guess the next thing to figure out is whether this is a Linux-only issue (as in, cairo being slow), or this affects other platforms more generally (as in, subpixel glyphs are slow) and we just noticed with the above patch. But I'm not familiar with that code at all so... :)

Flags: needinfo?(peter)
Component: Layout → Graphics: Text

Emilio - thanks. There might be more than one problem here - or it might be that the testing they're doing in their automation is on linux. Peter?
We should try to re-verify the regression on Windows or Mac

Lee, should we expect this type of perf regression on this page? Why do we see it here, and not on the Barack Obama page (which got faster)? Is it just luck-of-positioning?

Flags: needinfo?(peter)
Flags: needinfo?(lsalzman)

I had a look at our RUM and I couldn't see anything there (the main switch started the 6->7 of December) but since we don't have any paint timings there its hard to say. But we could see this on timeToNonBlankPaint on synthetic, so if in this case it would have helped us to have that turned on by default = we would know on other platforms too, however I understand if there are other considerations. Or are you able to see any difference in Navigations Timing metrics too, I didn't fully get that?

We test only on Linux but we do it on two different systems. Both with Browsertime/WebPageReplay and WebPageTest (without replaying traffic) and i could see the regression on those pages on both systems. We did the switch to 71 on different times so it's easy to see. We have no automatic setup for Windows/Mac, sorry.

Ok, so that would support this being a linux gfx issue (or if you prefer, a linux gfx bugfix that as a side-effect of slowing Reflow). Lee: any chances for improvements in the perf for this feature, or is this a hit we take on all platforms when sub-pixel rendering, and Linux just gets to take the hit now along with the others?

OS: Unspecified → Linux

So, with regards to subpixel positioning, the worst case scenario is we may be potentially adding up to 4x as many copies of a given glyph to the glyph cache, if there are instances of the same glyph but at 4 different subpixel locations. On average, though, I'd expect more in the range of 2-3x more FreeType rasterization overhead of initializing a glyph and putting into the glyph cache.

Now, another scenario is the glyph cache itself. Once things are all entered into the glyph cache for a given font, we are no longer actually rasterizing the glyphs in FreeType, and instead are just blitting them straight from the stored copy in the glyph cache. But this glyph cache is itself fixed in size, so if there are too many glyphs to actually fit within the cache limits, we may end up thrashing.

But if it's more first load that's the problem, I wouldn't necessarily be implicating thrashing of the glyph cache to be the issue, but more just that subpixel positioning requires us to punt to FreeType more often to initialize those different locations, and that overhead is an entirely expected trade-off for the improved quality in text layout.

Other issue, those profiles don't seem to have any symbols, so it's hard to really analyze them too heavily. Can we possibly get some better annotated ones?

Flags: needinfo?(lsalzman)

Lee: Manual profiles, with full symbols, local opt builds on linux:
FF70: https://perfht.ml/2rZ9oMp
FF71: https://perfht.ml/2S6box6

Flags: needinfo?(lsalzman)

Peter: Have there been any significant changes to the /Facebook page (the fonts/css/etc) in the last day or so, especially in the last several hours? I was working to get better profiles, and the regression seems to have disappeared in-between two sets of runs (one showed it, subsequent ones don't). I'm re-running test sequences to try to verify. I'll also check the page history: changes are minimal; someone added a sentence and a reference with a link.

Argh yes. Looking at our WebPageTest tests something changed for that pages the 17/12 -12 13 CET or around that time. The first visual change changed and become much more stable. Checked what changed, its "only" content: https://en.wikipedia.org/w/index.php?title=Facebook&type=revision&diff=931178682&oldid=931146866

There's no visual change between those changes:: https://s3.amazonaws.com/synthetic-tests-result-wikimedia/en.wikipedia.org/2019-12-17-13-01-38/pages/en.wikipedia.org/wiki/Facebook/data/screenshots/wpt-1-firstView.png
and https://s3.amazonaws.com/synthetic-tests-result-wikimedia/en.wikipedia.org/2019-12-17-12-16-06/pages/en.wikipedia.org/wiki/Facebook/data/screenshots/wpt-1-firstView.png

I've seen this before that First Visual Change has been really sensitive to small content changes in Firefox.

https://zh.wikipedia.org/wiki/北京市 seems to still have the issue.

Priority: -- → P3

Removing regression-window wanted based on Comment 11.

Flags: needinfo?(matt.woodrow)

Marking this as fix-optional for 73 to remove it from weekly regression triage. We can still take a patch for 73 or 74.

Flags: needinfo?(lsalzman)

Hey Peter, we're looking at this again. I see you closed the issue on your side. Do you see similar issues nowadays or has it vanished sometime between 71 and 72?
Thanks!

Hi Julien, I closed our issue since it was so long time ago and many releases. I usually see regressions/wins on some URLs per release, but its hard report/or fins the exact issue. I think you can close this issue, thanks!

Flags: needinfo?(peter)

Thanks for the feedback!

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX
Has Regression Range: --- → yes
Performance Impact: --- → P1
Keywords: perf:pageload
Whiteboard: [qf:p1:pageload]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: