Closed Bug 1595437 Opened 5 years ago Closed 5 years ago

Performance of os compositor is bad on Windows

Categories

(Core :: Graphics: WebRender, defect, P3)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: sotaro, Assigned: gw)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

Markus found that there is a lot of overhead inside the OS calls to create and destroy surfaces - I believe the plan there is to have a pool of CA surfaces that can be recycled.

On Mac, we were seeing times of > 6ms inside the create/destroy surface calls.

Is it possible something like that is occurring? Does anything show up as a major bottleneck in a Firefox profile recording?

OS: Unspecified → Windows
Priority: -- → P3
Blocks: 1592509

(In reply to Glenn Watson [:gw] from comment #1)

On Mac, we were seeing times of > 6ms inside the create/destroy surface calls.
Is it possible something like that is occurring? Does anything show up as a major bottleneck in a Firefox profile recording?

The create/destroy surface calls did not seem to spend much time, but the eglCreatePbufferFromClientBuffer() call in DCLayer::CreateEGLSurfaceForCompositionSurface() did. SwapChain11::resetOffscreenDepthBuffer() was slow; sometimes the duration was 1-5 ms on my Intel laptop. The EGLSurface is re-created for every DCLayer rendering.

And in general, when the OS compositor was enabled, GPU usage in the Windows task manager seemed higher than with default WebRender.
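
For context, here is a minimal sketch (not the actual Gecko code; the function and parameter names are illustrative) of the per-tile surface path described above, assuming ANGLE's EGL_ANGLE_d3d_texture_client_buffer extension. Recreating this surface for every DCLayer draw is what forces ANGLE to rebuild its offscreen depth buffer each time:

  #include <EGL/egl.h>
  #include <d3d11.h>

  // Token from ANGLE's eglext_angle.h (EGL_ANGLE_d3d_texture_client_buffer).
  #ifndef EGL_D3D_TEXTURE_ANGLE
  #define EGL_D3D_TEXTURE_ANGLE 0x33A3
  #endif

  EGLSurface CreateEGLSurfaceForCompositionSurface(EGLDisplay display,
                                                   EGLConfig config,
                                                   ID3D11Texture2D* dcBackBuffer,
                                                   EGLint width, EGLint height) {
    const EGLint attribs[] = {
        EGL_WIDTH,  width,
        EGL_HEIGHT, height,
        EGL_NONE,
    };
    // Wrap the DirectComposition surface's D3D texture as a pbuffer. Doing
    // this per tile, per frame, hits SwapChain11::resetOffscreenDepthBuffer
    // inside ANGLE, which is where the 1-5 ms goes.
    return eglCreatePbufferFromClientBuffer(
        display, EGL_D3D_TEXTURE_ANGLE,
        reinterpret_cast<EGLClientBuffer>(dcBackBuffer), config, attribs);
  }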

I found a couple of things today:


  1. We need to avoid rebuilding the visual tree each frame, otherwise a full redraw occurs (not super surprising, just implemented this way initially for simplicity).

I attached a hacky patch that improves this. It checks if the visual tree is the same as the previous frame, and if so, it skips clearing and re-adding the visuals to the root. This handles the common case, but we can do better.

This simple patch is not ideal when scrolling occurs and a small number of tiles get added / removed. I think we can build on the attached patch and do a simple array diff to work out which layers to add / remove from the existing visual tree - then we won't have to clear it when scrolling changes the tile configuration. A sketch of this direction follows the list below.


  2. DC seems to be doing extra GPU work because some of the background layer tiles are not being occluded. Once the scrollbar primitives are tagged as opaque, WR will occlusion-cull those background tiles, which should reduce the GPU work DC is doing. This is being tracked in https://bugzilla.mozilla.org/show_bug.cgi?id=1593574.
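
As referenced in (1), here is a minimal sketch of the visual tree reuse check and the diff direction. The IDCompositionVisual2 methods are the real DirectComposition API; the Layer struct and function name are hypothetical, not the actual patch:

  #include <dcomp.h>
  #include <vector>

  struct Layer {
    IDCompositionVisual2* mVisual;  // owned elsewhere
  };

  void UpdateVisualTree(IDCompositionVisual2* root,
                        std::vector<Layer*>& prevLayers,
                        const std::vector<Layer*>& currLayers) {
    // Common case (what the attached patch checks): the layer list is
    // identical to last frame, so touch nothing and avoid a full redraw.
    if (currLayers == prevLayers) {
      return;
    }
    // Fallback: clear and re-add everything. A follow-up could diff
    // prevLayers against currLayers here and only AddVisual / RemoveVisual
    // the tiles that scrolled in or out.
    root->RemoveAllVisuals();
    for (Layer* layer : currLayers) {
      // insertAbove=TRUE with a null reference places each new visual on
      // top of the previously added ones, preserving list order.
      root->AddVisual(layer->mVisual, TRUE, nullptr);
    }
    prevLayers = currLayers;
  }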

The patch for (1) reduced the amount of redraw in manual testing, but didn't have any noticeable effect on talos.

I'm going to keep investigating tomorrow, looking at the CPU performance of binding/unbinding surfaces and see if we can improve that.

Assignee: nobody → gwatson

I confirmed Sotaro's findings above.

With the two patches above, when scrolling on a page that is painting just the scrollbars (no content / UI paints), I can see that 26% of the total time on the renderer thread is inside Compositor::Bind.

Around half of that time is inside BeginDraw and half inside eglCreatePbufferFromClientBuffer.

It seems likely that this could explain most, if not all, of the remaining talos regressions.

I'll do some more research on how we can fix this.

I created a patch that uses the EGLImage extensions instead of the pbuffer APIs.

This allows us to cache and retain depth buffers between tiles, which fixes most of the regressions from comment 1.
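
Here is a hedged sketch of what that looks like, assuming ANGLE's EGL_ANGLE_image_d3d11_texture extension (the helper names are illustrative, not the exact WIP patch). The key change is that the tile's color target becomes an ordinary GL texture on an FBO, so one cached depth renderbuffer can be shared across all tiles instead of ANGLE recreating one per pbuffer:

  #include <EGL/egl.h>
  #include <EGL/eglext.h>
  #include <GLES2/gl2.h>
  #include <GLES2/gl2ext.h>
  #include <d3d11.h>

  // Token from ANGLE's eglext_angle.h (EGL_ANGLE_image_d3d11_texture).
  #ifndef EGL_D3D11_TEXTURE_ANGLE
  #define EGL_D3D11_TEXTURE_ANGLE 0x3484
  #endif

  GLuint WrapTileAsTexture(EGLDisplay display, ID3D11Texture2D* tileTexture) {
    // Extension entry points are looked up at runtime.
    auto createImage = reinterpret_cast<PFNEGLCREATEIMAGEKHRPROC>(
        eglGetProcAddress("eglCreateImageKHR"));
    auto imageTargetTexture =
        reinterpret_cast<PFNGLEGLIMAGETARGETTEXTURE2DOESPROC>(
            eglGetProcAddress("glEGLImageTargetTexture2DOES"));

    // Wrap the DirectComposition tile's D3D texture in an EGLImage...
    EGLImageKHR image = createImage(
        display, EGL_NO_CONTEXT, EGL_D3D11_TEXTURE_ANGLE,
        reinterpret_cast<EGLClientBuffer>(tileTexture), nullptr);

    // ...and bind it to a GL texture that an FBO can reference.
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    imageTargetTexture(GL_TEXTURE_2D, image);
    return tex;
  }

  void BindTileFBO(GLuint fbo, GLuint colorTex, GLuint sharedDepthRb) {
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, colorTex, 0);
    // sharedDepthRb is allocated once (e.g. at the maximum tile size) and
    // reused for every tile, avoiding the per-tile depth buffer recreation
    // cost from comment 1.
    glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                              GL_RENDERBUFFER, sharedDepthRb);
  }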

The WIP patch is [1] with current talos results [2].

This looks much better than the initial try run - there are high-confidence regressions in 3 test suites (glterrain, tart, and tscrollx), and a high-confidence 20% win in tresize.

Tomorrow, I'll profile the remaining regressions. Right now, I am recreating the FBO for every tile each time it's updated, and rebuilding the visual tree every time a scroll with new tiles occurs, so those might be simple wins to improve performance too.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=1e08271af244ef969b3f8e02e570941cd16ea805
[2] https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=1e08271af244ef969b3f8e02e570941cd16ea805&framework=1&selectedTimeRange=172800

Depends on: 1597887
Attached image profile.png

A profile of one of the remaining talos regressions.

What we commonly see is a large portion of the time in the renderer thread in:

  • WaitForPreviousPresentQuery / GetLastCompletedFrameId
  • Commit() [DirectComposition]

I suspect we might have a synchronization / ordering issue causing this.

Sotaro, some questions:

  • What is the previous present query used for in DC mode? Is it still required?
  • Do you have any ideas on what might cause the large amount of time spent in these functions?

I will do some profiling and experiments tomorrow with these parts of the code.

Flags: needinfo?(sotaro.ikeda.g)

(In reply to Glenn Watson [:gw] from comment #6)

Created attachment 9110181 [details]
profile.png

A profile of one of the remaining talos regressions.

What we commonly see is a large portion of the time in the renderer thread in:

  • WaitForPreviousPresentQuery / GetLastCompletedFrameId
  • Commit() [DirectComposition]

I suspect we might have a synchronization / ordering issue causing this.

Sotaro, some questions:

  • What is the previous present query used for in DC mode? Is it still required?

It is necessary to prevent too many tasks from being submitted to the GPU. It also works as a safety mechanism for video texture recycling.

  • Do you have any ideas on what might cause the large amount of time spent in these functions?

The wait happens when the GPU tasks submitted two frames ago have not completed yet. It indicates a large amount of GPU work.

I will do some profiling and experiments tomorrow with these parts of the code.
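
A minimal sketch, not the Gecko implementation, of the throttling scheme described above: issue a D3D11_QUERY_EVENT after each present, and before starting a new frame, block until the query from two frames ago has signalled:

  #include <d3d11.h>
  #include <deque>
  #include <wrl/client.h>

  using Microsoft::WRL::ComPtr;

  // Queries issued after each present, oldest first.
  static std::deque<ComPtr<ID3D11Query>> sPendingFrameQueries;

  void OnFramePresented(ID3D11Device* device, ID3D11DeviceContext* context) {
    D3D11_QUERY_DESC desc = {D3D11_QUERY_EVENT, 0};
    ComPtr<ID3D11Query> query;
    if (SUCCEEDED(device->CreateQuery(&desc, &query))) {
      context->End(query.Get());  // Signals once all prior GPU work completes.
      sPendingFrameQueries.push_back(query);
    }
  }

  void WaitForPreviousPresentQuery(ID3D11DeviceContext* context) {
    // Allow two frames in flight: block only while the GPU hasn't finished
    // the work submitted two frames ago. This wait is where the profiled
    // time in WaitForPreviousPresentQuery / GetLastCompletedFrameId shows up.
    while (sPendingFrameQueries.size() >= 2) {
      ComPtr<ID3D11Query>& oldest = sPendingFrameQueries.front();
      BOOL done = FALSE;
      while (context->GetData(oldest.Get(), &done, sizeof(done), 0) != S_OK) {
        // Real code would yield / sleep here instead of busy-waiting.
      }
      sPendingFrameQueries.pop_front();
    }
  }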

Flags: needinfo?(sotaro.ikeda.g)
Attached image fps.png

Performance of the example-compositor application running at different resolutions with a large invalidation area.

Attached image fps2.png

Performance of the example-compositor application running at different resolutions with a small invalidation area.

The graphs above show a couple of things:

  • There is some level of (almost fixed) overhead to using DirectComposition.
    • At very small resolutions, the memory bandwidth savings are outweighed by this overhead.
    • On discrete GPUs, the crossover point is typically at a higher resolution.
    • When running at "normal" frame rates (e.g. 60 - 144 Hz) this overhead is not generally noticeable, and the power / performance / bandwidth savings are worth it anyway.
  • The DirectComposition graphs show almost ideal performance. That is, the overall frame rate is mostly independent of the window size and scales with the invalidated area.
  • On integrated GPUs, the crossover point tends to occur at a low enough frame rate that it's probably worth enabling everywhere (sans driver blacklist issues).
  • The talos tests are running at a screen resolution of 1024 x 680. This probably explains why we don't see a huge number of reported wins on those tests when running with DC enabled.

Given that, I think we can close this bug as FIXED. Once we have the remaining reftest functionality sorted, I think it'd be reasonable to enable DC by default on Nightly, and see what telemetry says (it should give us a much better idea of frame rate / compositor time impacts).

Thoughts?

Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(mstange)
Flags: needinfo?(jmuizelaar)
Depends on: 1599965, 1599656

The following is the latest talos compare result with bug 1599656 and bug 1599965. The glterrain regression was 127%. I worry about the size of that regression. Is there a plan for addressing it?

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=79779b0841e3cfc7a74a96f57bc56a863237ce35&newProject=try&newRevision=281f216f639241830abc1e7406028a3f75db9211&framework=1

Flags: needinfo?(sotaro.ikeda.g)

Regarding enabling the DC compositor on nightly, I am not sure how to handle Bug 1597559. I enabled the OS compositor on 3 Win10 PCs and saw the problem on 2 of them.

(In reply to Sotaro Ikeda [:sotaro] from comment #11)

The following is the latest talos compare result with bug 1599656 and bug 1599965. The glterrain regression was 127%. I worry about the size of that regression. Is there a plan for addressing it?

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=79779b0841e3cfc7a74a96f57bc56a863237ce35&newProject=try&newRevision=281f216f639241830abc1e7406028a3f75db9211&framework=1

I think what's happening here is that DC has a higher overhead, which is noticeable on pages running in ASAP mode, where the frame time is already very low (often due to the low resolution of talos tests, which are 1024 x 600). It seems that when running with vsync and/or at a higher resolution, DC mode is faster (I did some testing of glterrain locally). I talked to jgilbert about this, and he was OK with landing with this regression. What we could do is enable it, keep a close eye on telemetry on nightly, and see if we notice any regressions there, as an experiment.

(In reply to Sotaro Ikeda [:sotaro] from comment #12)

Regarding enabling the DC compositor on nightly, I am not sure how to handle Bug 1597559. I enabled the OS compositor on 3 Win10 PCs and saw the problem on 2 of them.

Yes, that is slightly concerning. I haven't seen this issue on any machines I have tested on. Do we have any ideas on how to repro it or what hardware / driver it is affecting?

(In reply to Glenn Watson [:gw] from comment #14)

(In reply to Sotaro Ikeda [:sotaro] from comment #12)

Regarding enabling the DC compositor on nightly, I am not sure how to handle Bug 1597559. I enabled the OS compositor on 3 Win10 PCs and saw the problem on 2 of them.

Yes, that is slightly concerning. I haven't seen this issue on any machines I have tested on. Do we have any ideas on how to repro it or what hardware / driver it is affecting?

I do not have specific STR yet. The hardware is the following, as in Bug 1597559. I wonder if Bug 1600501 might be related.

  • Intel(R) HD Graphics P530
  • Driver Version: 26.20.100.7106
  • Driver Date: 8-7-2019

(In reply to Glenn Watson [:gw] from comment #13)

What we could do is enable it, keep a close eye on telemetry on nightly, and see if we notice any regressions there, as an experiment.

Since the Bug 1597887 fix, I have not seen a perf regression during actual use. So the talos regression is OK for me.

Let's consider this fixed then, since the noticeable regression has been solved. We can re-open this or create new bugs as we encounter specific problems.

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(mstange)
Flags: needinfo?(jmuizelaar)
Resolution: --- → FIXED