Open Bug 1617498 (wr-linux-wayland-compositing) Opened 5 years ago Updated 2 months ago

[meta] WR Wayland Compositing

Categories

(Core :: Graphics: WebRender, enhancement, P3)

Desktop
Linux
enhancement

Tracking

()

ASSIGNED

People

(Reporter: gw, Assigned: rmader)

References

(Depends on 6 open bugs, Blocks 4 open bugs)

Details

(Keywords: meta)

WebRender has a trait that can be implemented by Gecko which allows all rendering to occur in native compositor surfaces [1].

On Windows, we render directly into DirectComposition surfaces, while on Mac we render directly into CoreAnimation surfaces. It would be great if we could also do this on Linux, when supported by the underlying windowing system.

The advantage is that WebRender no longer composites the set of picture cache slices into a single buffer before handing to the OS. Instead, the OS compositor is able to composite the picture cache slices directly. This can result in significant performance and battery improvements. We're also able to support compositing video directly to a native compositor surface, which can provide further performance and power savings (this work is being tracked in [2]).

I don't believe this is feasible on X11, since there's no way that I'm aware of to draw into surface tiles with the GPU, and composite them with a single atomic transaction (if there is a way, please let me know!).

However, I believe that Wayland supports everything we need, so long as the wp_viewporter [3] or similar extension is supported. WebRender needs this in order to support clipping of the Wayland subsurfaces that the picture cache tiles would be rasterized into. It appears that this extension is available in GNOME [4] and also KWin / Plasma [5].

[1] https://searchfox.org/mozilla-central/rev/a37fc61f172b432e7ae0b6b4c4a12cac2a787a0f/gfx/wr/webrender/src/composite.rs#451

[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1579235

[3] https://cgit.freedesktop.org/wayland/wayland-protocols/tree/stable/viewporter/viewporter.xml

[4] https://gitlab.gnome.org/GNOME/mutter/issues/132

[5] https://phabricator.kde.org/D26171
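As a rough illustration of the clipping requirement described above, the following sketch computes the values a client could pass to wl_subsurface.set_position, wp_viewport.set_source and wp_viewport.set_destination for one tile. It is not taken from the WebRender or Gecko sources: the Rect and ViewportState types and the clip_tile helper are hypothetical, and it assumes a buffer scale of 1 and no buffer transform.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class ViewportState:
    # Values for wl_subsurface.set_position (parent surface coordinates).
    position: Tuple[int, int]
    # Values for wp_viewport.set_source (crop rectangle in buffer coordinates).
    source: Rect
    # Values for wp_viewport.set_destination (size in surface coordinates).
    destination: Tuple[int, int]


def clip_tile(tile: Rect, clip: Rect) -> Optional[ViewportState]:
    """Clip a tile against a clip rect, both given in parent coordinates.

    Hypothetical helper: returns None when nothing of the tile is visible,
    in which case the subsurface would simply not be shown.
    """
    left = max(tile.x, clip.x)
    top = max(tile.y, clip.y)
    right = min(tile.x + tile.width, clip.x + clip.width)
    bottom = min(tile.y + tile.height, clip.y + clip.height)
    if right <= left or bottom <= top:
        return None
    visible_w = right - left
    visible_h = bottom - top
    return ViewportState(
        position=(left, top),
        # The crop rectangle is expressed relative to the tile's own buffer.
        source=Rect(left - tile.x, top - tile.y, visible_w, visible_h),
        # Pure clipping, no scaling: destination size equals source size.
        destination=(visible_w, visible_h),
    )


state = clip_tile(Rect(0, 512, 1024, 512), Rect(0, 600, 800, 300))
print(state)
```

Without a viewport, a subsurface always shows its whole buffer, which is why the tile could not be cropped to the clip rect without re-rasterizing it.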

CCing a few people that might be interested in this work.

That can be done on Wayland by rendering to dmabuf as it's implemented for WebGL (Bug 1586696). Also cross-process fence synchronization is available (Bug 1614568).

It appears that this extension is available in GNOME [4] and also KWin / Plasma [5].

Weston also supports it well.

Author of the Gnome Viewport implementation here. I wouldn't be surprised if you run into bugs in Mutter when using subsurfaces in such an advanced way (we don't have any clients doing that yet). So it's great to see this and I'll be following this bug closely. Feel free to ping me any time.

Great, thanks Robert! We shouldn't need any cross-process synchronization for this case, I think - all surface allocation and rasterization occurs inside the GPU process.

Do we have a ticket for the GPU process on Wayland?

I believe GPU process is enabled on Linux now by default on nightly? I'm not sure if that's different when using Wayland?

Even if not using a dedicated GPU process, WR still exists in a single process as far as all allocation and rasterization are concerned.

(In reply to Glenn Watson [:gw] from comment #6)

I believe GPU process is enabled on Linux now by default on nightly? I'm not sure if that's different when using Wayland?

Wayland does not use GPU process. It's disabled because Wayland can't share plain surfaces/windows across processes. Wayland can only share the underlying GPU memory (by dmabuf) which can be mapped to EGLImage/framebuffer in different processes.

Priority: -- → P3
OS: Unspecified → Linux
Hardware: Unspecified → Desktop

Side note: the upcoming Sway version will have viewport support, too.

Sway 1.5 with viewporter support is out.

Using wl-viewports would apparently allow us to scale videos more efficiently. YUV conversion in the compositor is not mandatory in Wayland - the Mutter tracking bug for that is here: https://gitlab.gnome.org/GNOME/mutter/-/issues/1366 (hopefully available around 3.40 if everything works out).

See Also: → 1623530, 1653166

Yes - there are patches in progress for WR to make use of native OS compositor transforms where available to scale videos efficiently in the compositor / hardware (see https://phabricator.services.mozilla.com/D84328). We can make use of the viewport scaling functionality in wayland to achieve the same efficiency savings here as with DirectComposition and CoreAnimation.

Depends on: 1668805
Assignee: nobody → robert.mader
Status: NEW → ASSIGNED
Alias: WR-linux-wayland-compositing
Summary: Implement WebRender native compositor trait for Wayland → [meta] WR Windows Compositing
Summary: [meta] WR Windows Compositing → [meta] WR Wayland Compositing
Depends on: 1695500
Depends on: 1697673
Depends on: 1699754
Depends on: 1699985

Status update: the example compositor now works quite well and can be tested (see bug 1695500). So far Weston is the only compositor able to run it properly - compositor bugs are tracked in bug 1699754.

The main takeaway from implementing the example compositor Wayland backend for me is that:
1: Wayland seems to offer everything needed to map the features used on other platforms
2: We may want to use Wayland APIs directly instead of using the EGL-Wayland platform in order to have more control over buffers etc.

The second point is something for later when the basic functionality stands. However it may make sense to create a little library for that so it can be reused by other projects that want to do similar compositor integration.

Depends on: 1700151
Depends on: 1700684
Depends on: 1707202
See Also: → 1707209
Depends on: 1711214
Depends on: 1711224
Depends on: 1711244
Depends on: 1711461
See Also: 1623530
Depends on: 1712472
Depends on: 1713202
Depends on: 1714326
Depends on: 1714771
Depends on: 1716006

Little status update here: after the latest round of patches things seem to run quite stably for me. So I think this is now dogfoodable and if you run recent Gnome (40.1/3.38.5) or KDE (5.22), you're invited to give this a try. Simply switch on gfx.webrender.compositor.force-enabled on latest nightly (of course you also need to run with MOZ_ENABLE_WAYLAND=1).

Depends on: 1716044
Depends on: 1716108

I did some (not very scientific) performance profiling now on my Thinkpad T460p (Skylake). What immediately stands out is that we have heavily reduced GPU utilization when e.g. scrolling a static page. I tested this with intel_gpu_top: both reported utilization and frequencies drop by about 30%, while RC6 time increased by about 10%. This is on a FullHD screen - on 4K I'd expect even bigger differences. Reducing GPU overhead is the central idea behind this effort, so it's nice to see that it works out.

CPU-wise we seem to consume about the same in FF; however, at least Gnome-Shell consumes about twice as much CPU time as normal (still way less than FF). It is somewhat expected that we trade GPU vs CPU time to some extent. However, I think there's quite a bit of optimization potential, both in how FF uses the Wayland protocol and in the implementation in Gnome-Shell.

In terms of power consumption I didn't spot a significant difference on my machine yet. Apparently the lower GPU frequency gets compensated by the extra CPU time, or there are other things at play so that the package (I have an integrated Intel GPU) does not power down. This finding is a bit sad, as saving energy is the eventual main goal of the whole effort.

Note that I only looked for very obvious and easy-to-spot differences - nothing below a safe 10% change. Also, other hardware may be affected differently. Also, this was only for HW-WR, not SW-WR.

Robert, I have a 4K display running off Intel UHD 620 graphics (Whiskey Lake). Do you know of a good (scientific) profiling utility for GNOME/Fedora so I could do some testing? Perhaps there's a way of logging intel_gpu_top output to a file.

I see in this blog macOS has a tool to show the area being repainted. Are you aware of such a tool on Linux/Wayland?

Depends on: 1717902

Hi Vincent. Created bug 1717902 for discussions and findings around performance and profiling, let's continue there.

Depends on: 1718569
Depends on: 1718570
Depends on: 1720375
Depends on: 1718688

After bug 1718570 landed I now consider the compositor backend to be on feature parity with the default one. To my knowledge, there's no broken feature (I previously worried about e.g. screenshots, but they work) - and in many situations the compositor backend is already much faster. So while there is outstanding performance work and potentially some bugs will get discovered, we are getting closer to the point where we can enable compositor integration by default - at least for a subset of users using recent versions of their compositors.

@rmader sorry for asking in such a random place, but on my system (Arch Linux, GNOME Wayland, the 2021-07-11 Nightly, AMD GPU), with the compositor enabled I sometimes get rectangular parts of the window flickering with portions from another tab. I don't get along very well with the Bugzilla search, so if that's a known issue, can you please point me to it? Otherwise I'll try to update and file a bug.

(In reply to Laurențiu Nicola from comment #18)

@rmader sorry for asking in such a random place, but on my system (Arch Linux, GNOME Wayland, the 2021-07-11 Nightly, AMD GPU), with the compositor enabled I sometimes get rectangular parts of the window flickering with portions from another tab. I don't get along very well with the Bugzilla search, so if that's a known issue, can you please point me to it? Otherwise I'll try to update and file a bug.

No worries, this probably affected all users until bug 1718570 landed - so thanks for asking.
Despite its title about partial damage (thus better performance), its main achievement was actually to give much better guarantees about correctness. So if you update nightly to the latest version, my expectation would be that what you describe should not happen any more - buffer content should now always be correct (minus WebRender, system compositor or driver bugs of course). If you still see such issues, please file a new bug blocking this one.

Depends on: 1720850
Depends on: 1720874
No longer depends on: 1720874
Depends on: 1721036
Depends on: 1721298
Depends on: 1723012
Depends on: 1723940

Hello Robert, what's the status of this feature? Should it be enabled by default, or do we need to test it somehow?
It may be possible to run the test suite on the compositor to compare results; for instance, I use this locally:

MOZ_ENABLE_WAYLAND=1 ./mach mochitest dom/base/test --setpref widget.wayland.test-workarounds.enabled=true --enable-webrender

or, for the long version:

MOZ_ENABLE_WAYLAND=1 ./mach mochitest dom --setpref widget.wayland.test-workarounds.enabled=true --enable-webrender

You can use --setpref to enable the feature.
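A minimal sketch of how such a run could be assembled with the compositor switched on. The mochitest_command helper is hypothetical, and it assumes that gfx.webrender.compositor.force-enabled (the pref named earlier in this bug) is the one to set and that --setpref may be passed more than once.

```python
import shlex
from typing import Dict, List


def mochitest_command(test_path: str, prefs: Dict[str, str]) -> List[str]:
    """Build the argv for a `./mach mochitest` run with extra prefs."""
    argv = ["./mach", "mochitest", test_path]
    for name, value in prefs.items():
        # One --setpref per preference, in name=value form.
        argv += ["--setpref", f"{name}={value}"]
    argv.append("--enable-webrender")
    return argv


prefs = {
    "widget.wayland.test-workarounds.enabled": "true",
    # Assumption: this pref from the earlier status update enables the backend.
    "gfx.webrender.compositor.force-enabled": "true",
}

# Print a line that can be pasted into a shell inside the source checkout.
print("MOZ_ENABLE_WAYLAND=1 " + shlex.join(mochitest_command("dom/base/test", prefs)))
```

Running the same test directory once with and once without the extra pref gives two result sets to compare.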

Flags: needinfo?(robert.mader)
Depends on: 1725371

(In reply to Martin Stránský [:stransky] (ni? me) from comment #20)

Hello Robert, what's the status of this feature? Should it be enabled by default, or do we need to test it somehow?

I think it's quite close to being ready from the FF side, but it uncovered a lot of bugs in compositors (some of them listed in bug 1699754). It will still take some time until most/all of them are fixed and have reached users - the good thing is that this will benefit other applications as well that try to do similar things. Opened bug 1725372 to track things.

Flags: needinfo?(robert.mader)
Depends on: 1726807
Depends on: 1726954
Depends on: 1727936
Depends on: 1729233
Depends on: 1729613
Depends on: 1731450
Depends on: 1732051
Depends on: 1735494
Depends on: 1735560
Depends on: 1736205
Depends on: 1737821
Depends on: 1741081
Depends on: 1742990
Depends on: 1743631

On a Gemini Lake (Linux 5.16 and latest mesa git-master) system with Plasma/KWin 5.23.90 and 5.23 Wayland, this seems to be counter-productive:
With gfx.webrender.compositor & gfx.webrender.compositor.force-enabled = false, SoC power consumption while watching YT 720p 60fps VP9 VAAPI is ~4.4W. With both options = true, it's ~5.2W (double checked, with sufficiently long playback to rule out additional load from buffering etc.). Also, there is more stutter on light web sites while scrolling with it enabled.
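To put the two readings above into proportion, here is a small worked calculation. The relative_change helper is only illustrative; the inputs are the approximate figures quoted in this comment.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a fraction of `before`."""
    return (after - before) / before


# Approximate SoC power readings from the comment above, in watts.
disabled_watts = 4.4  # both compositor prefs set to false
enabled_watts = 5.2   # both compositor prefs set to true

increase = relative_change(disabled_watts, enabled_watts)
print(f"{increase:.1%}")  # prints "18.2%"
```

That is an increase of roughly 18% in package power on this device, well above the 10% threshold used in the earlier profiling comment.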

Rather vital information I forgot to mention: Used Firefox version was 97.0b3.

Interesting, thanks for sharing! Note: I opened bug 1717902 for performance measurements as this is now a meta bug. For me it would be great to know where that energy is spent: on the CPU or GPU (this backend generally trades less GPU time for slightly more CPU time).

I'd expect video playback to be slightly better (usually one less copy - as long as scanout doesn't kick in, which is more likely when using the default EGL backend, see bug 1743631), however real differences should only show up once bug 1711461 is implemented. As for scrolling: this is something where I'd expect this backend to be much better. However, as it moves a lot of work into the Wayland compositor, performance also depends on the compositor being optimized for this use-case. AFAIK this is the first and still only client to do this to such an extent, so I don't expect Wayland compositor devs to care that much (apart from Gnome, where I'm a dev myself).

Depends on: 1750373

CPU load and CPU core power consumption seem to be unchanged. However, intel_gpu_top reports roughly twice as high GPU load with WR compositor enabled vs. disabled and higher GPU power consumption accordingly.

I can give Sway (latest git-master) a try. I could also give Gnome a try. Slightly OT: However, it slows down that particular low end device too much, there are also continuous frame drops during playback with mpv etc. I suspect there might be some latency reduction active that works too aggressively by default for such a slow GPU. Just a shot in the dark, but that's also the case with KWin's latency reduction (that can be configured via UI to a less aggressive value). Might be worth a bug report (can do that if you think this would help). Sway also has a latency reduction, but it's disabled by default. Yet I also found the values it suggests as safe to be too aggressive, even with a faster dedicated GPU (frame drops in games with high GPU load).

intel_gpu_top reports roughly twice as high GPU load with WR compositor enabled vs. disabled and higher GPU power consumption accordingly.

To me that sounds like missing optimizations regarding opaque regions and subsurfaces in Kwin. Things should look quite different on Gnome and, more importantly, in theory (on a perfect compositor).
Regarding low end devices: I also test this on an old Thinkpad T400 and get quite good results. It was also reported that this improves performance on e.g. the Pinephone. That was on Gnome (which has dynamic latency reduction based on measurements) and Weston (which like Gnome should have proper optimizations for subsurfaces in place) though. Kwin and Sway are the compositors I know least about.

Anyway, please let's continue any performance related conversation either in bug 1717902 or open a new bug for compositor specific issues (such as "Higher GPU utilization on Kwin" / "Performance on Kwin"). From your report the latter sounds like a good idea.

Depends on: 1750443
Depends on: 1750457

I think that bug 1747481 should block this bug. For me, it occurs so often that firefox is unusable with the wayland compositor force enabled, but never occurs without it and therefore I thought it was clearly related. Sorry if this is not as clear as it seems to me.

Depends on: 1747481
Depends on: 1752469

For all interested parties: it may turn out that the approach here is a dead end with regard to the future development of Wayland. Most importantly, offloading composition to Wayland compositors may turn out to not be efficient in an HDR world. Doing composition within Firefox and relying on direct scanout by the Wayland compositor may be a better approach, so the work here stays experimental for the foreseeable future. See https://gitlab.freedesktop.org/pq/color-and-hdr/-/issues/6 for more information.

Depends on: 1752678
Depends on: 1761927
Depends on: 1767795
Depends on: 1770404
Depends on: 1775002
Depends on: 1786064
Depends on: 1791156
Severity: normal → S3
See Also: → 1798360
See Also: 1798360
Depends on: 1828323
Blocks: wr-projects

With GtkGraphicsOffload and Mutter changes and upcoming HDR support we may reconsider using it somehow. AFAIK Mutter supports direct rendering of fullscreen windows only right now, but that may change. It would be great to use a layer for video playback at least.

Robert, what do you think? I see your comment about the deprecation now (https://bugzilla.mozilla.org/show_bug.cgi?id=1617498#c28) but it looks to me that recent development is coming back to this concept, at least in some form, right?

Flags: needinfo?(robert.mader)

(In reply to Martin Stránský [:stransky] (ni? me) from comment #30)

Robert, what do you think? I see your comment about the deprecation now (https://bugzilla.mozilla.org/show_bug.cgi?id=1617498#c28) but it looks to me that recent development is coming back to this concept, at least in some form, right?

The crucial part that's different in what GTK4 and Chromium[1] do - and what we IMO should do as well in FF - is that they limit subsurface offloading to very few cases. Essentially to only one video subsurface - which, however, can be layered behind controls, with a hole punched into the main surface (see https://blog.gtk.org/files/2023/11/bbb-below.png).

So the main problem with the current state of the implementation here is that it unconditionally offloads everything. I think we could do something more similar to SW-WR-OGL (used on old Android?), which IIRC uses the "native" WR renderer with SW-tiles and then composites them into the window buffer. If we'd do the same - just with dmabuf tiles like already present here - then it should be relatively easy to offload special tiles like video or webgl ones (and again that's AFAIK pretty close to how Chromium works on Wayland).

[1] Note that LaCros (Wayland backend with ChromeOS-private protocols) tried to do something similar to what we have here - in fact way more radical, also trying to offload all kinds of CSS.
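The selective offloading idea above could be sketched roughly as follows. This is only an illustration of the policy, not Firefox, GTK or Chromium code: the Tile type, the tile kinds and the plan_composition helper are all hypothetical, and the default limit of one offloaded tile mirrors the "only one video subsurface" case described above.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: only these tile kinds are worth handing to the system compositor.
OFFLOADABLE_KINDS = {"video", "webgl"}


@dataclass(frozen=True)
class Tile:
    name: str
    kind: str  # hypothetical labels: "content", "video", "webgl"


def plan_composition(
    tiles: List[Tile], max_offloaded: int = 1
) -> Tuple[List[Tile], List[Tile]]:
    """Split tiles into (composited in-app, offloaded to subsurfaces)."""
    in_app: List[Tile] = []
    offloaded: List[Tile] = []
    for tile in tiles:
        if tile.kind in OFFLOADABLE_KINDS and len(offloaded) < max_offloaded:
            # Special tile: give it its own subsurface.
            offloaded.append(tile)
        else:
            # Everything else is drawn into the single window buffer.
            in_app.append(tile)
    return in_app, offloaded


tiles = [Tile("page", "content"), Tile("player", "video"), Tile("canvas", "webgl")]
in_app, offloaded = plan_composition(tiles)
print([t.name for t in in_app], [t.name for t in offloaded])
```

With the default limit, only the video tile gets its own subsurface and the rest, including the webgl tile, is composited into the window buffer.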
Flags: needinfo?(robert.mader)

Robert,

while implementing the video offload I hit an interesting finding. With Firefox video offload enabled via Wayland compositing (it uses your layers code + external image support and YUV direct offload to the compositor), it looks like Mutter is the only compositor which has issues with such a setup.

When running on Gnome shell, I see 100% CPU usage when layers are used (30% without it!). OTOH Sway has superior performance with layers, it takes only 8% of CPU even with YUV direct compositing.

Funnily enough, if I run Sway as a nested compositor inside Gnome, Sway composites Firefox fine, plays the video and still takes 8-10% CPU, while gnome-shell also takes 10-15% CPU, so nested Sway+Firefox is far better than Firefox on Mutter only (Firefox on Mutter takes 30% on my box).

And surprisingly KDE works even better, I don't see any CPU utilization at all when subsurfaces/compositing mode is used on KDE!

That brings me to the question of what mutter/gnome does so wrong with surface offload. I don't think it's worth implementing an extra Firefox-internal compositor just to work around a clear bug in Mutter; better to fix Mutter directly, I guess.

What do you think?

Flags: needinfo?(robert.mader)

(In reply to Martin Stránský [:stransky] (ni? me) from comment #32)

That brings me to the question of what mutter/gnome does so wrong with surface offload. I don't think it's worth implementing an extra Firefox-internal compositor just to work around a clear bug in Mutter; better to fix Mutter directly, I guess.

Of course, if this is a mutter issue, we should fix it in mutter, not work around it in Firefox.

Can you take a CPU profile while reproducing the mutter 100% CPU usage, e.g. with sysprof?

Yes, I'll look at it. Looks like nested mutter has better performance (uses 10% CPU) but I hit new bugs like image corruption during playback in such mode. I'll fix that on Firefox side first and then do the testing.

Providing builds and instructions on how to reproduce locally would be useful as well.

Thanks, will provide that when it's ready for testing.

I have done testing with a fixed Firefox version, and now there isn't any difference between composited and non-composited CPU usage on the Mutter side during YUV video offload (tested on Fedora 40). So it looks like it was caused by my FF patches and perhaps also by logging. Sorry for the noise.

But that also means Wayland compositing is suitable for use and the way to go, which is great news.

Well, I spoke too soon. There's a visible compositing penalty if blending is used. For instance, YT playback causes it, as the YT player has round corners over the video. I see 10% CPU if I play a plain clip and 20% CPU on YT with the round corners. OTOH composition on the Firefox side uses the same mutter CPU (10%).

(In reply to Martin Stránský [:stransky] (ni? me) from comment #37)

But that also means Wayland compositing is suitable for use and the way to go, which is great news.

Nice, great to hear!

(In reply to Martin Stránský [:stransky] (ni? me) from comment #38)

Well, I spoke too soon. There's a visible compositing penalty if blending is used. For instance, YT playback causes it, as the YT player has round corners over the video. I see 10% CPU if I play a plain clip and 20% CPU on YT with the round corners. OTOH composition on the Firefox side uses the same mutter CPU (10%).

Yeah - blending is an issue both practically and conceptually - especially with HDR.

Somewhat related: here's an issue about how to reduce bandwidth overhead for typical video player scenarios, which I still hope to get around to pushing forward: https://gitlab.freedesktop.org/wayland/wayland/-/issues/423

Flags: needinfo?(robert.mader)

(In reply to Martin Stránský [:stransky] (ni? me) from comment #38)

I see 10% CPU if I play a plain clip and 20% CPU on YT with the round corners. OTOH composition on the Firefox side uses the same mutter CPU (10%).

https://bugzilla.mozilla.org/show_bug.cgi?id=1617498#c33 / https://bugzilla.mozilla.org/show_bug.cgi?id=1617498#c35 still apply.

(In reply to Michel Dänzer from comment #40)

(In reply to Martin Stránský [:stransky] (ni? me) from comment #38)

I see 10% CPU if I play a plain clip and 20% CPU on YT with the round corners. OTOH composition on the Firefox side uses the same mutter CPU (10%).

https://bugzilla.mozilla.org/show_bug.cgi?id=1617498#c33 / https://bugzilla.mozilla.org/show_bug.cgi?id=1617498#c35 still apply.

I hope to get patches committed to Firefox this/next week so it can be tested with stock upstream binaries.

(Fixing the alias to match others)

Alias: WR-linux-wayland-compositing → wr-linux-wayland-compositing
Depends on: 1967250