Open Bug 1678935 Opened 10 months ago Updated 6 months ago

Extremely low fps with translateZ since 83.0

Categories

(Core :: Graphics: WebRender, defect, P3)

Firefox 83
Desktop
Windows
defect

Tracking

()

Tracking Status
firefox-esr78 --- unaffected
firefox83 --- wontfix
firefox84 --- wontfix
firefox85 --- wontfix
firefox86 --- wontfix
firefox87 --- wontfix
firefox88 --- wontfix

People

(Reporter: erik.faulhaber, Assigned: gw)

References

(Depends on 1 open bug)

Details

(Keywords: regression)

Attachments

(2 files)

5.38 MB, application/x-zip-compressed
87.71 KB, application/x-zip-compressed

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0

Steps to reproduce:

I created a minimal example to reproduce:
https://codepen.io/erik-f/pen/PozrrPb

I could reproduce the bug on multiple devices running Windows 10 with Firefox 83.0.
In Firefox 83.0 on Archlinux everything works fine.
In Firefox 82.0.3 on Windows everything works as expected too.

Actual results:

While developing a website with parallax scrolling (using translateZ), I experienced horrible lags a few days ago. While reducing the source code to a minimal example, I accidentally broke the parallax effect, but the performance issues still persist (that's what you can see in the codepen above).

Both the working parallax website and the codepen above, which doesn't have any visible parallax effect now, lag horribly while scrolling and resizing the window.

I did a performance analysis while slowly resizing the window in both 83.0 and 82.0.3 on Windows 10. I attached both results below.
In 82.0.3 the frame rate is constantly at 60 fps. In 83.0 the frame rate is mostly around 5 fps. GPU usage (GTX 1080) is around 10% on 82.0.3 and around 60% on 83.0.

The problem seems to scale with the screen resolution. In a maximized window on a 1440p monitor it's horrible, while it's not very noticeable in a small window.

Note: Even with the translateZ lines removed from the CSS file, resizing is still not as smooth as in 82.0.3. Also, the problem doesn't seem to be exclusive to SVG files: we tried PNGs instead, and while the performance was a lot better, there was still a noticeable lag with very large PNGs (again, everything works fine in 82.0.3).

Expected results:

I expect smooth scrolling and resizing at 60 fps like in 82.0.3.

I should mention that, in my experience, this issue only occurs when the window is maximized; it does not occur when the window is smaller than the whole screen.
Secondly, the lag while zooming in and out is still seen in versions as old as Nightly v78.0a1, so this bug should focus on the scrolling.
Thirdly, the same issue is seen in Firefox Release v82.0.1, but I went further back and observed that Nightly v78.0a1 scrolls smoothly, so I ran a regression bisection and these are my results:

2020-11-26T14:01:20: DEBUG : Found commit message:
Bug 1623715 - [8.2] Move media fullscreen event to JS and extend its metadata. r=geckoview-reviewers,snorp,alwu
Differential Revision: https://phabricator.services.mozilla.com/D86350
2020-11-26T14:01:20: DEBUG : Did not find a branch, checking all integration branches
2020-11-26T14:01:20: INFO : The bisection is done.
2020-11-26T14:01:20: INFO : Stopped

This issue is not observed on Mac OS 10.15.6 or Ubuntu 20.04.

I have chosen the (Core) Web Painting component for this issue. Please set a more appropriate one if incorrect.

Status: UNCONFIRMED → NEW
Component: Untriaged → Web Painting
Ever confirmed: true
Keywords: regression
OS: Unspecified → Windows
Product: Firefox → Core
Regressed by: 1623715
Hardware: Unspecified → Desktop

That is impossible: bug 1623715 is Android-only and has nothing to do with painting.

Can you please look at (or copy) the graphics section of about:support for the good and bad cases here?

Flags: needinfo?(erik.faulhaber)
Flags: needinfo?(daniel.bodea)

My profile on Windows for this: https://share.firefox.dev/37clRNh

It looks like translateZ is pushing each svg into a separate blob image, and WebRender is really struggling with giant uploads while scrolling.

Component: Web Painting → Graphics: WebRender

I also crashed trying to test this on MacOS, in tex_sub_image_2d_pbo.

No longer regressed by: 1623715

I spent the last days investigating this bug.
After a lot of manual bisecting (unfortunately I didn't know of the mozregression tool until today) I found that the extremely laggy resizing actually originated in 2019 (0e4d7f204a27).
However, the scrolling lag seems to be independent of this. The scrolling definitely works fine in 2020-01-01-09-29-38 and is definitely broken in 2020-09-01-09-45-42. I tried the regression tool multiple times, but it seems to me that it's not one particular revision that is causing the lag. It rather seems to me that there are several versions that are "a bit worse" and they build up to the very broken version 2020-09-01-09-45-42.

Both the scrolling and the resizing are definitely broken in 2020-09-01-09-45-42. We made another codepen where the parallax effect actually works and the scrolling seems to be even worse than in the other one: https://codepen.io/lucamarcelpeters/full/OJRJgBR

I wondered why it worked in 82.0.3 (and 82.0 and 82.0b1), but not in 83.0.
It must have been fixed somewhere after 2020-09-01-09-45-42 (which is 82.0a1), but it seems to me that this fix didn't make it into 83.0 for some reason. Shouldn't the release branch containing the fix have been merged back to central and beta?
I tried following the bug fix back from NIGHTLY_82_END (cecca8e30949) where it didn't work to 82.0b1 where everything works fine. Right after the merge revision (acc3d41c2c93) everything works fine. For me, it seems like the bug fix was already in the beta branch before 82 beta and that it didn't get merged back to central. I still don't know why it's bugged again in 83.0 though.

I hope this is useful somehow, as I spent way too much time on it. I'll leave this to someone who actually knows what they're doing now. Please let me know if I'm somehow right with my "the bug fix is in beta but didn't get merged back to nightly" theory or if that is all nonsense.

Flags: needinfo?(erik.faulhaber)

(In reply to erik.faulhaber from comment #6)

> I hope this is useful somehow, as I spent way too much time on it. I'll leave this to someone who actually knows what they're doing now. Please let me know if I'm somehow right with my "the bug fix is in beta but didn't get merged back to nightly" theory or if that is all nonsense.

Not all features that are enabled on Nightly remain enabled when a release goes to Beta. So what you're seeing is very possibly due to Nightly vs. Beta configuration issues (in particular differences in WebRender being enabled or not). In general, all code changes land on Nightly before being merged into Beta, so that's not likely to be the explanation here.

FWIW, on my Win10 system, I see a noticeable drop in scrolling performance on release between Fx80 and Fx81. I was able to bisect with mozregression:

 3:09.34 INFO: Last good revision: 4b8de762e09740f9d140a0a097922fbccc4d1406
 3:09.34 INFO: First bad revision: c8ca1d1866e7e3591d2df84c2a4f0204d43386ed
 3:09.34 INFO: Pushlog:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=4b8de762e09740f9d140a0a097922fbccc4d1406&tochange=c8ca1d1866e7e3591d2df84c2a4f0204d43386ed

This fits the regression range found in comment 1, with the notable difference of having some WebRender changes prior to the Android ones noted in that comment. I don't know whether bug 1623792 or bug 1658182 is more likely to be the culprit here, but those at least seem plausible.

Flags: needinfo?(daniel.bodea) → needinfo?(gwatson)

Thank you, I was really stupid. I can confirm, that's where it breaks.

> I tried the regression tool multiple times, but it seems to me that it's not one particular revision that is causing the lag. It rather seems to me that there are several versions that are "a bit worse" and they build up to the very broken version 2020-09-01-09-45-42.

It turned out that I had been launching the wrong version with mozregression --launch because I wasn't using --repo autoland. I did several regressions and always ended up with the same output as you. Unfortunately, I tried testing these two builds again without --repo autoland and couldn't find a difference (duh!).
However, I wasn't completely wrong. It is getting "a bit worse" before it's completely breaking. I did another regression and that's what I came up with (using the minimal example from my last comment, https://codepen.io/lucamarcelpeters/full/OJRJgBR):

  1. As stated in my first comment, the laggy resizing starts in 0e4d7f204a27. Scrolling is still 100% smooth here though.
  2. The first (subtle) drop in scrolling performance led me to these regression results:
     4:15.15 INFO: Last good revision: 2d55f2c0fc33eda6c995ea77bb7fe59b86bba6f0
     4:15.15 INFO: First bad revision: 8f4b47079a44eeea87caa560b3b072148551aa3c
     In 2d55f2c0fc33 I get a solid 60 fps minimum when scrolling. In 8f4b47079a44 it drops to 35 fps.
  3. The second regression led me to the same results as [:RyanVM]: in 4b8de762e097 I get 28 fps minimum, in c8ca1d1866e7 scrolling only runs at 10 fps.
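
For reference, the frame rates above can be turned into per-frame time budgets. This is only a small arithmetic sketch; the fps values are the ones reported in this comment, and the helper name `frame_time_ms` is made up for illustration.

```python
# Minimum fps while scrolling, as reported above, keyed by revision.
min_fps = {
    "2d55f2c0fc33": 60,
    "8f4b47079a44": 35,
    "4b8de762e097": 28,
    "c8ca1d1866e7": 10,
}


def frame_time_ms(fps: float) -> float:
    """Time spent per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


for rev, fps in min_fps.items():
    print(f"{rev}: {fps} fps = {frame_time_ms(fps):.1f} ms per frame")
```

So the first regression roughly doubles the time per frame (16.7 ms to 28.6 ms), and the second takes it from 35.7 ms to 100 ms.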

Flags: needinfo?(gwatson)

Jamie, I think you recently changed our behaviour around managing large textures in the texture cache. Did that fix this, and if not, should it?

Flags: needinfo?(jnicol)

Doesn't scroll any different for me on a current Nightly build.

I just tested 0ee685602a7f (as suggested by [:gw]) and I get 28 fps again. Same in today's Nightly build, @[:RyanVM].

So the issues introduced in c8ca1d1866e7 seem to be fixed now.
I still don't get near 60 fps like before 8f4b47079a44 though.

> Jamie, I think you recently changed our behaviour around managing large textures in the texture cache. Did that fix this, and if not, should it?

That's the patch Glenn mentions in comment 10, which seems to have helped. Presumably that part was due to bug 1658182 as identified in comment 7.

As for the earlier regression caused by bug 1616901, we no longer use texture arrays and have changed to 2048x2048 2D textures, but the effect is the same: the cache is now split into multiple fixed-size textures instead of fewer massive ones. On my computer I see some frames with really high draw call counts, so I suspect the remaining slowness is due to this. We definitely don't want to go back to larger textures, but maybe batching could be improved on this page somehow.
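
To illustrate why splitting the cache across more textures can raise the draw call count, here is a minimal sketch. It assumes a simplified model in which consecutive draws that sample the same texture merge into one batch; the page names and draw list are hypothetical, and this is not WebRender's actual batching code.

```python
from itertools import groupby


def count_batches(draws):
    """Count batches when only consecutive draws with the same texture merge."""
    return sum(1 for _ in groupby(draws))


# Hypothetical draw list: tiles alternating between two texture cache pages.
interleaved = ["page0", "page1", "page0", "page1", "page0", "page1"]

# Same draws, reordered so that draws sharing a page are adjacent.
grouped = sorted(interleaved)

print(count_batches(interleaved))  # 6
print(count_batches(grouped))      # 2
```

Reordering like this is only safe when the draws do not depend on each other's order (for example, when they do not overlap or are opaque), which is one reason batching is harder in practice than this sketch suggests.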

Flags: needinfo?(jnicol)
Severity: -- → S3
Priority: -- → P3
Flags: needinfo?(matt.woodrow)

Here's an updated profile from MacOS: https://share.firefox.dev/2JSOTtM

Flags: needinfo?(matt.woodrow)

The profile just matches Jamie's explanation from comment 14. We're spending a lot of time issuing draw calls, so improving batching would be the main thing we could do to fix this.

Flags: needinfo?(gwatson)

I tried to look into this, but for some reason I can't explain, the SVG files on that domain won't load for me. I tried two different internet connections, both of them fail. tracepath also times out on that domain, somewhere in the US.

It's probably some temporary routing issue? But if someone is able to attach the test case directly to the bug, that would be great.

Assignee: nobody → gwatson
Flags: needinfo?(gwatson) → needinfo?(erik.faulhaber)
Attached file parallax_svgs.zip
Flags: needinfo?(erik.faulhaber)

I attached the SVG files. Hopefully, that's just a temporary routing issue, as we will eventually deploy our application to this domain and server.

Yes, it seems like it was a temporary routing issue, the page is loading correctly for me here now. Thanks!

OK, there are multiple issues involved here, some of them related:

  1. The texture cache eviction bug referenced above (which is now fixed, and has improved things).
  2. The SVGs are rasterized at a very large size (12k x 4k) on my screen. I think this is due to the scale transform. There is some planned work to improve this by working out a better scale to rasterize SVG files at.
  3. We currently treat all rasterized SVG files as being possibly translucent - if we can detect that the background layer(s) are opaque, WR will use that to reduce the blending cost for those layers.
  4. Since the rasterized images are very large, they get split into tiles and spread across multiple texture cache pages. WR currently does a bad job of batching tiled images which span multiple texture pages.

I will look into a fix for 4 today - that should be a reasonably simple fix, and then we can hand this off to people who will be looking at 3 and 4 for further improvements.
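
As a rough sense of scale for points 2 and 4, the sketch below estimates how many 2048x2048 texture cache pages one rasterized SVG needs. It assumes "12k x 4k" means about 12000 x 4000 pixels and 4 bytes per pixel (8-bit RGBA); both are assumptions, not figures from the profile.

```python
import math

PAGE_SIZE = 2048       # texture cache page edge, from the 2048x2048 textures mentioned earlier
BYTES_PER_PIXEL = 4    # assumption: 8-bit RGBA


def min_pages(width: int, height: int) -> int:
    """Lower bound on pages needed, by area alone (ignores packing overhead)."""
    return math.ceil((width * height) / (PAGE_SIZE * PAGE_SIZE))


# Assumption: "12k x 4k" read as 12000 x 4000 pixels.
width, height = 12_000, 4_000
size_mib = width * height * BYTES_PER_PIXEL / 2**20

print(min_pages(width, height))  # 12
print(f"{size_mib:.0f} MiB")     # 183 MiB
```

Under those assumptions a single layer spans at least 12 texture pages and is on the order of 183 MiB uncompressed, which fits the earlier observation about giant uploads.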

Another issue causing high draw calls is that when we have a tiled image, we only calculate the visible tiles once, and then replay that visible image tile list across all picture cache tiles.

In this case, each SVG file has ~135 visible tiles, which are replayed across 5 to 6 picture cache tiles, multiplied by the number of parallax layers.
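
The multiplication above can be written out as a quick calculation. The layer count below is hypothetical (the actual number of parallax layers is not stated in this comment), and the result is the number of image tile draws before any batching, not a measured draw call count.

```python
VISIBLE_IMAGE_TILES = 135      # per SVG, from the comment above
PICTURE_CACHE_TILES = (5, 6)   # range from the comment above
PARALLAX_LAYERS = 4            # hypothetical; the real layer count is not given here


def replayed_tile_draws(image_tiles: int, picture_tiles: int, layers: int) -> int:
    """Image tile draws when the full visible tile list is replayed per picture cache tile."""
    return image_tiles * picture_tiles * layers


low, high = (
    replayed_tile_draws(VISIBLE_IMAGE_TILES, n, PARALLAX_LAYERS)
    for n in PICTURE_CACHE_TILES
)
print(low, high)  # 2700 3240
```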

Even with the fix for (4) above, this is still a high draw call count. To fix this, we need to calculate a visible set of image tiles per picture cache tile (I'm planning to implement this in the new year, for other reasons).
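
A minimal sketch of the difference between replaying one visible tile list everywhere and computing a visible set per picture cache tile. The tile sizes, the `Rect` type, and the `grid` helper are all hypothetical and chosen only to keep the numbers easy to check; this is not how WebRender represents tiles.

```python
from typing import List, NamedTuple


class Rect(NamedTuple):
    x0: int
    y0: int
    x1: int
    y1: int

    def intersects(self, other: "Rect") -> bool:
        # Strict comparisons: rectangles that only share an edge do not intersect.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


def grid(width: int, height: int, tile_w: int, tile_h: int) -> List[Rect]:
    """Split a width x height area into tiles of at most tile_w x tile_h."""
    return [Rect(x, y, min(x + tile_w, width), min(y + tile_h, height))
            for y in range(0, height, tile_h)
            for x in range(0, width, tile_w)]


# Hypothetical sizes: 8 x 4 = 32 image tiles, 2 x 2 = 4 picture cache tiles.
image_tiles = grid(2048, 1024, 256, 256)
picture_tiles = grid(2048, 1024, 1024, 512)

# Replaying the whole visible list for every picture cache tile.
replayed = len(image_tiles) * len(picture_tiles)

# Only the image tiles that actually overlap each picture cache tile.
per_tile = sum(1 for p in picture_tiles for t in image_tiles if t.intersects(p))

print(replayed, per_tile)  # 128 32
```

In this toy layout the per-tile visible set cuts the tile draws by a factor equal to the number of picture cache tiles; real layouts with unaligned or overlapping tiles would save less.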

Depends on: 1683962
No longer blocks: gfx-triage