Closed Bug 1528865 Opened 8 months ago Closed 8 months ago

Blitting picture cache tiles seems slower on Windows/ANGLE for the same hardware.

Categories

(Core :: Graphics: WebRender, defect)

Tracking

()

RESOLVED FIXED
mozilla67
Tracking Status
firefox-esr60 --- unaffected
firefox65 --- unaffected
firefox66 --- disabled
firefox67 --- fixed

People

(Reporter: gw, Assigned: sotaro)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

When a picture cache tile is copied from the framebuffer to the texture cache, it shows up in the WR GPU profiler as a bright green bar near the end of the frame (tag "Blit").

When using the same hardware (Intel HD4600) these blits seem to take much longer (I'd guess 5-6x) on Windows/ANGLE than they do on Linux.

There are a few possibilities here:

  • The GPU profiler is incorrect on Windows and/or Linux, reporting incorrect values.
  • The blits are much slower due to a slow driver path on Windows.
  • The blits are much slower due to a slow ANGLE code path being hit.

We could investigate this by:

  • Testing if the profile results are reproducible in other GPU profilers (e.g. GPA)
  • Testing if the results differ when running on Windows + native GL.

It'd also be interesting to know if this is GPU / vendor specific.

Does this blit use the same path as when resizing the texture cache? If so, Linux may be using glCopyImageSubData, which isn't supported on ANGLE.

Oh, no, it can't be, since the blit is from the framebuffer and glCopyImageSubData needs a texture.

If I disable ANGLE, I see reported GPU times for caching picture tiles that are much faster (the GPU profile looks around the same as on Linux with native GL).

So I currently suspect we are triggering a slow path in ANGLE (or that the GPU timers being reported when using ANGLE are inaccurate).

Although not as drastic, it does seem to be the same on a machine with an nVidia GTX 1050 - the reported tile blit times are much longer when running with ANGLE compared to native GL.

I will try some vendor profiling tools to see if they report the same differences.

FWIW, I have noticed that the GPU timer queries on Windows tend to have a large performance impact. Do you have a way to check if what you're seeing is just an artifact of enabling the timer queries or a real performance difference?

It's still possible it's a measurement error - but it does seem reproducible between ANGLE and non-ANGLE runs on different machines and GPUs. I'm planning to do some captures in wrench and then some profiling with the nVidia / Intel vendor profiling tools to see what they show.

A few notes from investigating today:

The code path we are hitting in ANGLE for glBlitFramebuffer results in a D3D draw call with vertex + pixel shader. It's unclear to me so far whether all blits are implemented as draws, or whether we are hitting a slow / uncommon path.

There is an ANGLE extension called ANGLE_framebuffer_blit. I haven't looked into it in detail, but I wonder if this is to work around some performance issues with implementing glBlitFramebuffer.

When running with the nvidia tools, it's not clear that there is a reported difference between ANGLE/D3D or native GL. However, this may just be because the GPU times are so fast on these test cases with a GTX 1050 - the GPU times are small enough that the noise in the profile timings is a significant percentage of the total time, let alone the blit times.

I'm going to try and run with the Intel GPU tools on a HD4600, where the difference seemed much more significant, and see if I can get more reliable numbers on that.

For the record, ANGLE_framebuffer_blit's glBlitFramebufferANGLE is just the ES2 version of the ES3-only glBlitFramebuffer.

Attached image Capture.PNG

I'm not making much progress on this - writing up some notes here to see if anyone else has ideas.

Problem:

  • When running on a mobile HD4600 GPU, blitting cache tiles from the framebuffer into the texture cache is sometimes very slow. See the attached Capture.PNG - the bright green blits are 90+% of the GPU time, when they are typically expected to be ~10% of the GPU time.
  • It does seem to be a real slow down (rather than a measurement error). Even with GPU timers disabled, it's very noticeably laggy scrolling compared to with ANGLE disabled.

Things I've found:

  • Only occurs when ANGLE is enabled in Firefox. If I disable ANGLE and run native GL, the GPU times look as expected.
  • Only occurs inside Firefox. If I take a wrench capture and replay it, the GPU times are as expected, with/without --angle enabled.

Random things I've tried in Gecko without any improvements:

  • Disabling the blocking present query.
  • Disabling triple buffering.
  • Tried to disable the DirectComposition path, but I just get a white screen with nothing rendered.

I initially thought it was because the glBlitFramebuffer implementation in ANGLE was going through the slow path (skipping CopySubresourceRegion and doing a draw call). However, on the Intel GPU, as best I can tell, it's going through the fast path. I also tried manually hacking ANGLE to go through the slow path and it didn't seem to help. It's possible I made a mistake here - might be worth verifying these claims.

Any ideas?

Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(dmalyshau)

In terms of other devices, I think it's also occurring on nVidia GTX 1050, but it's much harder to say for sure since it's fast enough that it's hard to notice the difference conclusively.

I haven't tried on other Intel GPUs, but I suspect it will occur on those too, not just the HD4600. Running the same hardware on Linux with native GL, the tile blit times are as expected, so the hardware itself is not the problem.

If you want to try to reproduce, it can be seen more easily by making the picture cache code always cache tiles:

  • Comment out https://searchfox.org/mozilla-central/rev/4587d146681b16ff9878a6fdcba53b04f76abe1d/gfx/wr/webrender/src/picture.rs#1569, so that tiles never become valid.
  • Change https://searchfox.org/mozilla-central/rev/4587d146681b16ff9878a6fdcba53b04f76abe1d/gfx/wr/webrender/src/picture.rs#1538 to if true { so that tiles are cached every frame.

Thanks Sotaro, I will do some testing with this tomorrow!

Oh, attachment 9045543 [details] was wrong, I updated the patch.

Looks like Sotaro saved the day \o/

Flags: needinfo?(dmalyshau)

I just tried this patch out on my HD4600 mobile device and it makes a massive difference - the GPU time when scrolling around on most pages drops from ~11ms to ~5ms.
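To put those numbers in context, here is a minimal sketch of the frame-budget arithmetic. It assumes a 60 Hz display (so a budget of 1000/60 ≈ 16.7 ms per frame), which the comment above does not state; FrameBudgetFraction and kAssumedRefreshHz are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Assumption: a 60 Hz display. The bug does not state the refresh rate.
constexpr double kAssumedRefreshHz = 60.0;

// Returns the fraction of one frame's time budget that `gpu_ms`
// milliseconds of GPU work would consume at the assumed refresh rate.
inline double FrameBudgetFraction(double gpu_ms) {
  assert(gpu_ms >= 0.0);
  const double budget_ms = 1000.0 / kAssumedRefreshHz;
  return gpu_ms / budget_ms;
}
```

Under that assumption, ~11ms is roughly two thirds of the frame budget and ~5ms is a bit under a third.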

The sooner we can get this merged the better, I think. I'm not sure the best way to address Dzmitry's concerns about doing it on other GPUs. Maybe we merge this and revisit if we encounter perf issues on any other GPUs later?

Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(dmalyshau)

Discussing this a bit more with Jeff, my thoughts are:

I think we should land it as-is, because (1) the performance penalty is so severe without it, and (2) we don't have any real insight into whether ANGLE will choose a fast / slow path at any time, so we have no guarantees that it won't need this on all platforms at some time anyway.

How about we land it as-is, and keep a close eye on the telemetry graphs for each GPU in the metrics dashboards over the next few days?

Sure, let's proceed

Flags: needinfo?(dmalyshau)

Marking this as affected for the 66 webrender experiments. I can still take an uplift here.

Sotaro, I think this will also fix the main cause of major motionmark and other slowdowns you previously reported with picture caching too.

I'd definitely like to get this uplifted to 66, if possible. Ideally we could let it sit in nightly for a day or two to make sure it doesn't regress elsewhere, but it certainly has the potential to be a big performance win on Intel GPUs at least.

(In reply to Glenn Watson [:gw] from comment #19)

The sooner we can get this merged the better, I think. I'm not sure the best way to address Dzmitry's concerns about doing it on other GPUs. Maybe we merge this and revisit if we encounter perf issues on any other GPUs later?

Yeah, I agree. It might not affect NVIDIA, since the patch did not affect the Talos results. And the MotionMark score improved on my Win10 Intel laptop.

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=7e95eefcad6b0e1e6c49bf6ff0b8e27c31bc337d&newProject=try&newRevision=1d96d0e4c99a8a2126077decbd5cd8b5ed55b0a3&framework=1

Flags: needinfo?(sotaro.ikeda.g)
Attachment #9045543 - Attachment description: Bug 1528865 - Change SwapChain's BufferUsage as same to Angle → Bug 1528865 - Change SwapChain's BufferUsage as to add DXGI_USAGE_SHADER_INPUT
Pushed by sikeda@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/20b3e9001dbc
Change SwapChain's BufferUsage as to add DXGI_USAGE_SHADER_INPUT r=kvark
Status: NEW → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67
Assignee: nobody → sotaro.ikeda.g
Blocks: 1487564
Blocks: 1453991
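
For readers unfamiliar with DXGI, the change named in the commit message amounts to OR-ing one more usage bit into the swap chain's BufferUsage. The following is a minimal sketch, not the landed patch: it uses stand-in types so that it compiles without the Windows SDK. SwapChainDescSketch, MakeSwapChainDesc and the k... constants are hypothetical names; the constant values are intended to mirror DXGI_USAGE_SHADER_INPUT and DXGI_USAGE_RENDER_TARGET_OUTPUT from the SDK headers, and the assumption that the previous usage was render-target output only is an illustration, not something stated in this bug.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the SDK constants (assumed values; on Windows, include
// <dxgi.h> and use the real DXGI_USAGE_* macros instead).
constexpr std::uint32_t kUsageShaderInput = 0x10;         // DXGI_USAGE_SHADER_INPUT
constexpr std::uint32_t kUsageRenderTargetOutput = 0x20;  // DXGI_USAGE_RENDER_TARGET_OUTPUT

// Hypothetical stand-in for the BufferUsage field of a swap chain description.
struct SwapChainDescSketch {
  std::uint32_t buffer_usage = 0;
};

// Builds a description whose back buffers can be rendered to and,
// optionally, also read as a shader input.
inline SwapChainDescSketch MakeSwapChainDesc(bool allow_shader_input) {
  SwapChainDescSketch desc;
  desc.buffer_usage = kUsageRenderTargetOutput;
  if (allow_shader_input) {
    desc.buffer_usage |= kUsageShaderInput;  // the bit the patch title refers to
  }
  assert((desc.buffer_usage & kUsageRenderTargetOutput) != 0);
  return desc;
}
```

In real code the field would be BufferUsage on DXGI_SWAP_CHAIN_DESC1 (or DXGI_SWAP_CHAIN_DESC), with the constants taken from the SDK headers.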

Would you like to uplift this now? It's had some time on nightly. Are you able to measure a performance gain?

Flags: needinfo?(sotaro.ikeda.g)

I think this is worth uplifting. Thoughts Jeff, Sotaro?

Flags: needinfo?(jmuizelaar)

Yeah, it seems worth uplifting :)

Flags: needinfo?(sotaro.ikeda.g)

(In reply to Liz Henry (:lizzard) (use needinfo) from comment #27)

Are you able to measure a performance gain?

No performance improvement showed up on Talos on try, so I tested locally with https://browserbench.org/MotionMark/ on my Intel PC (Lenovo P50). I used the following win64-pgo builds for MotionMark testing.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc6f06fbd9d4ef8fca2782a266f2faca5b65aa16
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7694cc64ee51e21cbaaa0ed810408bbde6b4091f

The https://browserbench.org/MotionMark/ scores were as follows. Without the DXGI_USAGE_SHADER_INPUT flag, the scores were unstable.

  • With DXGI_USAGE_SHADER_INPUT: 260-320
  • Without DXGI_USAGE_SHADER_INPUT: 180-260
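
As a rough way to compare the two ranges above, the sketch below computes each range's midpoint and width. ScoreRange, Midpoint, Spread and RelativeGain are hypothetical helpers; using the midpoint as a typical score is an assumption, since only the ranges (not individual runs) are given here.

```cpp
#include <cassert>

// A closed score range as reported above, e.g. 260-320.
struct ScoreRange {
  double low;
  double high;
};

// Midpoint of the range; a rough stand-in for a typical score.
inline double Midpoint(const ScoreRange& r) {
  assert(r.low <= r.high);
  return (r.low + r.high) / 2.0;
}

// Width of the range; a rough measure of run-to-run variation.
inline double Spread(const ScoreRange& r) {
  assert(r.low <= r.high);
  return r.high - r.low;
}

// Relative change of `after` over `before`, compared at the midpoints.
inline double RelativeGain(const ScoreRange& before, const ScoreRange& after) {
  return (Midpoint(after) - Midpoint(before)) / Midpoint(before);
}
```

With the ranges above, the midpoints are 290 (with the flag) and 220 (without), about 32% higher with the flag, and the range is narrower with the flag (60 vs 80 points).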

OK, please request uplift and this can likely make it into beta 13 next week. Thanks!

Ah, it is not necessary to uplift the patch, since we do not enable WebRender on beta on Intel GPUs yet.

I tested this change on an HD Graphics 530 in a desktop machine at 1080p and did not see a noticeable difference in MotionMark scores: 747 ± 5.23% vs 483 ± 4.69%.