[meta] WR Wayland Compositing
Categories
(Core :: Graphics: WebRender, enhancement, P3)
Tracking
()
People
(Reporter: gw, Assigned: rmader)
References
(Depends on 6 open bugs, Blocks 4 open bugs)
Details
(Keywords: meta)
WebRender has a trait that can be implemented by Gecko which allows all rendering to occur in native compositor surfaces [1].
On Windows, we render directly into DirectComposition surfaces, while on Mac we render directly into CoreAnimation surfaces. It would be great if we could also do this on Linux, when supported by the underlying windowing system.
The advantage is that WebRender no longer composites the set of picture cache slices into a single buffer before handing to the OS. Instead, the OS compositor is able to composite the picture cache slices directly. This can result in significant performance and battery improvements. We're also able to support compositing video directly to a native compositor surface, which can provide further performance and power savings (this work is being tracked in [2]).
I don't believe this is feasible on X11, since there's no way that I'm aware of to draw into surface tiles with the GPU, and composite them with a single atomic transaction (if there is a way, please let me know!).
However, I believe that Wayland supports everything we need, so long as the wp_viewporter [3] or a similar extension is supported. WebRender needs this in order to support clipping of the Wayland subsurfaces that the picture cache tiles would be rasterized into. It appears that this extension is available in GNOME [4] and also KWin / Plasma [5].
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1579235
[3] https://cgit.freedesktop.org/wayland/wayland-protocols/tree/stable/viewporter/viewporter.xml
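To illustrate the clipping requirement described above, here is a minimal sketch of the arithmetic a client might do before calling wp_viewport.set_source, wp_viewport.set_destination and wl_subsurface.set_position. It assumes integer coordinates, a buffer scale of 1 and no scaling; the Rect type and the viewport_for_tile helper are hypothetical names for illustration, not WebRender or Gecko code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    w: int
    h: int


def intersect(a: Rect, b: Rect) -> Optional[Rect]:
    """Return the overlap of two rectangles, or None if they do not overlap."""
    x0, y0 = max(a.x, b.x), max(a.y, b.y)
    x1, y1 = min(a.x + a.w, b.x + b.w), min(a.y + a.h, b.y + b.h)
    if x1 <= x0 or y1 <= y0:
        return None
    return Rect(x0, y0, x1 - x0, y1 - y0)


def viewport_for_tile(tile: Rect, clip: Rect) -> Optional[dict]:
    """Compute hypothetical viewport parameters for one clipped tile.

    tile and clip are both in window coordinates. The result holds:
      position:    where the subsurface would be placed in the window
      source:      the visible part of the tile, in tile-buffer coordinates
      destination: the on-screen size (equal to the source size, no scaling)
    """
    visible = intersect(tile, clip)
    if visible is None:
        return None  # fully clipped, the subsurface need not be shown
    source = Rect(visible.x - tile.x, visible.y - tile.y, visible.w, visible.h)
    return {
        "position": (visible.x, visible.y),
        "source": source,
        "destination": (visible.w, visible.h),
    }


if __name__ == "__main__":
    print(viewport_for_tile(Rect(512, 0, 512, 512), Rect(400, 100, 200, 200)))
```

Without a viewport, a subsurface always shows its whole buffer, which is why an extension of this kind is needed to crop a tile to its clip rectangle.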
Reporter
Comment 1•5 years ago
CCing a few people that might be interested in this work.
Comment 2•5 years ago
That can be done on Wayland by rendering to dmabuf as it's implemented for WebGL (Bug 1586696). Also cross-process fence synchronization is available (Bug 1614568).
Assignee
Comment 3•5 years ago
It appears that this extension is available in GNOME [4] and also KWin / Plasma [5].
Weston also supports it well.
Author of the Gnome Viewport implementation here. I wouldn't be surprised if you run into bugs in Mutter when using subsurfaces so advanced (we don't have any clients doing that yet). So great to see this and I'll be following this bug closely. Feel free to always ping me.
Reporter
Comment 4•5 years ago
Great, thanks Robert! We shouldn't need any cross-process synchronization for this case, I think - all surface allocation and rasterization occurs inside the GPU process.
Comment 5•5 years ago
Do we have a ticket for the GPU process on Wayland?
Reporter
Comment 6•5 years ago
I believe GPU process is enabled on Linux now by default on nightly? I'm not sure if that's different when using Wayland?
Even if not using a dedicated GPU process, WR still exists in a single process as far as all allocation and rasterization is concerned.
Comment 7•5 years ago
(In reply to Glenn Watson [:gw] from comment #6)
I believe GPU process is enabled on Linux now by default on nightly? I'm not sure if that's different when using Wayland?
Wayland does not use GPU process. It's disabled because Wayland can't share plain surfaces/windows across processes. Wayland can only share the underlying GPU memory (by dmabuf) which can be mapped to EGLImage/framebuffer in different processes.
Assignee
Comment 8•5 years ago
Side note: the upcoming Sway version will have viewport support, too.
Comment 9•5 years ago
Sway 1.5 with viewporter support is out.
Assignee
Comment 10•5 years ago
Using wl-viewports would apparently allow us to scale videos more efficiently. YUV conversion in the compositor is not mandatory in Wayland - the Mutter tracking bug for that is here: https://gitlab.gnome.org/GNOME/mutter/-/issues/1366 (hopefully available around 3.40 if everything works out).
Reporter
Comment 11•5 years ago
Yes - there are patches in progress for WR to make use of native OS compositor transforms where available to scale videos efficiently in the compositor / hardware (see https://phabricator.services.mozilla.com/D84328). We can make use of the viewport scaling functionality in wayland to achieve the same efficiency savings here as with DirectComposition and CoreAnimation.
Assignee
Comment 12•4 years ago
Status update: the example compositor now works quite well and can be tested (see bug 1695500). So far Weston is the only compositor able to run it properly - compositor bugs are tracked in bug 1699754.
The main takeaway from implementing the example compositor Wayland backend for me is that:
1: Wayland seems to offer everything needed to map the features used on other platforms
2: We may want to use Wayland APIs directly instead of using the EGL-Wayland platform in order to have more control over buffers etc.
The second point is something for later when the basic functionality stands. However it may make sense to create a little library for that so it can be reused by other projects that want to do similar compositor integration.
Assignee
Comment 13•4 years ago
Little status update here: after the latest round of patches things seem to run quite stably for me. So I think this is now dogfoodable and if you run recent Gnome (40.1/3.38.5) or KDE (5.22), you're invited to give this a try. Simply switch on gfx.webrender.compositor.force-enabled on latest nightly (of course you also need to run with MOZ_ENABLE_WAYLAND=1).
Assignee
Comment 14•4 years ago
I did some (not very scientific) performance profiling now on my Thinkpad T460p (Skylake). What immediately jumps out is that we have heavily reduced GPU utilization when e.g. scrolling a static page. I tested this with intel_gpu_top: both reported utilization and frequencies drop by about 30% while RC6 time increased by about 10%. This is on a FullHD screen - on 4K I'd expect even bigger differences. Reducing GPU overhead is the central idea behind this effort, so it's nice to see that it works out.
CPU wise we seem to also consume about the same in FF, however at least Gnome-Shell consumes about twice as much CPU time as normal (still way less than FF). It is somewhat expected that we trade GPU vs CPU time to some extent. However, I think there's quite a bit of optimization potential, both by how FF uses the Wayland protocol and by the implementation in Gnome-Shell.
Power consumption wise I didn't spot a significant difference on my machine yet. Apparently the lower GPU frequency gets compensated by the extra CPU time or there are other things at play so that the package (I have an integrated Intel GPU) does not power down. This finding is a bit sad as saving energy is the eventual main goal of the whole effort.
Note that I only looked for very obvious and easy to spot differences - nothing below a safe 10% change. Also, other hardware may be affected differently. Also, this was only for HW-WR, not SW-WR.
Comment 15•4 years ago
Robert I have a 4K display running off Intel UHD 620 graphics (Whiskey lake). Do you know of a good (scientific) profiling utility for GNOME/Fedora so I could do some testing? Perhaps there's a way of logging intel_gpu_top output to a file.
I see in this blog macOS has a tool to show the area being repainted. Are you aware of such a tool on Linux/Wayland?
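Regarding the question above about logging intel_gpu_top output to a file: a minimal sketch, assuming an intel_gpu_top build that supports the -J (JSON output), -s (sampling period in milliseconds) and -o (output file) options described in its man page. The helper name and the output file name are hypothetical, and the snippet only builds the command line rather than running it.

```python
import shlex
from typing import List


def intel_gpu_top_log_command(output_path: str, period_ms: int = 1000) -> List[str]:
    """Build an argument list that asks intel_gpu_top to write JSON samples to a file.

    Assumption: the installed intel_gpu_top supports -J, -s and -o.
    """
    return ["intel_gpu_top", "-J", "-s", str(period_ms), "-o", output_path]


if __name__ == "__main__":
    # Print the command so it can be copied into a terminal.
    print(shlex.join(intel_gpu_top_log_command("gpu-usage.json")))
```

The returned list could be passed to subprocess.run while scrolling or playing a video, once with the compositor pref enabled and once without, to compare the two logs.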
Assignee
Comment 16•4 years ago
Hi Vincent. Created bug 1717902 for discussions and findings around performance and profiling, let's continue there.
Assignee
Comment 17•4 years ago
After bug 1718570 landed I now consider the compositor backend to be on feature parity with the default one. To my knowledge, there's no broken feature (I previously worried about e.g. screenshots, but they work) - and in many situations the compositor backend is already much faster. So while there is outstanding performance work and potentially some bugs will get discovered, we are getting closer to the point where we can enable compositor integration by default - at least for a subset of users using recent versions of their compositors.
Comment 18•4 years ago
@rmader sorry for asking in such a random place, but on my system (Arch Linux, GNOME Wayland, the 2021-07-11 Nightly, AMD GPU), with the compositor enabled I sometimes get rectangular parts of the window flickering with portions from another tab. I don't get along very well with the Bugzilla search, so if that's a known issue, can you please point me to it? Otherwise I'll try to update and file a bug.
Assignee
Comment 19•4 years ago
(In reply to Laurențiu Nicola from comment #18)
@rmader sorry for asking in such a random place, but on my system (Arch Linux, GNOME Wayland, the 2021-07-11 Nightly, AMD GPU), with the compositor enabled I sometimes get rectangular parts of the window flickering with portions from another tab. I don't get along very well with the Bugzilla search, so if that's a known issue, can you please point me to it? Otherwise I'll try to update and file a bug.
No worries, this probably affected all users until bug 1718570 landed - so thanks for asking.
Despite its title about partial damage (thus better performance), its main achievement was actually to give much better guarantees about correctness. So if you update nightly to the latest version, my expectation would be that what you describe should not happen any more - buffer content should now always be correct (minus Webrender, system compositor or driver bugs of course). If you still see such issues please file a new bug blocking this one.
Comment 20•4 years ago
Hello Robert, what's the status of this feature? Should it be enabled by default, or do we need to test it somehow?
It may be possible to run the test suite with the compositor to compare results; for instance, I use locally:
MOZ_ENABLE_WAYLAND=1 ./mach mochitest dom/base/test --setpref widget.wayland.test-workarounds.enabled=true --enable-webrender
or, for the long version:
MOZ_ENABLE_WAYLAND=1 ./mach mochitest dom --setpref widget.wayland.test-workarounds.enabled=true --enable-webrender
you can use --setpref to enable the feature.
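As a sketch of how the feature pref could be added to the test run above: the snippet below builds a mochitest command line that combines the options quoted in this comment with gfx.webrender.compositor.force-enabled from comment 13. Combining them this way is an assumption, and mochitest_invocation is a hypothetical helper, not part of mach.

```python
import os
from typing import Dict, List, Tuple


def mochitest_invocation(
    test_path: str, extra_prefs: Dict[str, str]
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for a Wayland mochitest run with extra prefs set."""
    # Pref taken from the command line quoted in this comment.
    prefs = {"widget.wayland.test-workarounds.enabled": "true"}
    prefs.update(extra_prefs)

    argv = ["./mach", "mochitest", test_path]
    for name, value in prefs.items():
        argv += ["--setpref", f"{name}={value}"]
    argv.append("--enable-webrender")

    # Copy the current environment and enable the Wayland backend.
    env = dict(os.environ, MOZ_ENABLE_WAYLAND="1")
    return argv, env


if __name__ == "__main__":
    argv, env = mochitest_invocation(
        "dom/base/test",
        {"gfx.webrender.compositor.force-enabled": "true"},
    )
    print(" ".join(argv))
```

Running the same test directory once with and once without the extra pref would give two result sets to compare.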
Assignee
Comment 21•4 years ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #20)
Hello Robert, what's the status of this feature? Should it be enabled by default, or do we need to test it somehow?
I think it's quite close to being ready from the FF side, but it uncovered a lot of bugs in compositors (some of them listed in bug 1699754). It will still take some time until most/all of them are fixed and have reached users - the good thing is that this will benefit other applications as well that try to do similar things. Opened bug 1725372 to track things.
Comment 22•4 years ago
On a Gemini Lake (Linux 5.16 and latest mesa git-master) system with Plasma/KWin 5.23.90 and 5.23 Wayland, this seems to be counter-productive:
With gfx.webrender.compositor & gfx.webrender.compositor.force-enabled = false, SoC power consumption while watching YT 720p 60fps VP9 VAAPI is ~4.4W. With both options = true, it's ~5.2W (double checked, with playback long enough to rule out additional load by buffering etc.). Also, there is more stutter on light web sites while scrolling with it enabled.
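For scale, the two readings above can be turned into a relative figure. This is a small worked calculation using only the ~4.4W and ~5.2W values quoted in this comment; the function name is hypothetical.

```python
def relative_change_percent(baseline: float, measured: float) -> float:
    """Relative change of measured versus baseline, in percent."""
    return (measured - baseline) / baseline * 100.0


if __name__ == "__main__":
    increase = relative_change_percent(4.4, 5.2)
    print(f"{increase:.1f}%")  # prints 18.2%
```

So the reported difference of about 0.8W corresponds to roughly 18% higher SoC power draw with the compositor enabled on that system.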
Comment 23•4 years ago
Rather vital information I forgot to mention: Used Firefox version was 97.0b3.
Assignee
Comment 24•4 years ago
Interesting, thanks for sharing! Note: I opened bug 1717902 for performance measurements as this is now a meta bug. For me it would be great to know where that energy is spent: on the CPU or GPU (this backend generally trades less GPU time for slightly more CPU time).
I'd expect video playback to be slightly better (usually one less copy - as long as scanout doesn't kick in, which is more likely when using the default EGL backend, see bug 1743631), however real differences should only show up once bug 1711461 is implemented. As for scrolling: this is something where I'd expect this backend to be much better. However, as it moves a lot of work into the Wayland compositor, performance also depends on the compositor being optimized for this use-case. AFAIK this is the first and still only client to do this to such an extent so I don't expect Wayland compositor devs to care that much (apart from Gnome, where I'm a dev myself).
Comment 25•4 years ago
CPU load and CPU core power consumption seem to be unchanged. However, intel_gpu_top reports roughly twice as high GPU load with WR compositor enabled vs. disabled and higher GPU power consumption accordingly.
I can give Sway (latest git-master) a try. I could also give Gnome a try. Slightly OT: However, it slows down that particular low end device too much, there are also continuous frame drops during playback with mpv etc. I suspect there might be some latency reduction active that works too aggressively by default for such a slow GPU. Just a shot in the dark, but that's also the case with KWin's latency reduction (that can be configured via UI to a less aggressive value). Might be worth a bug report (can do that if you think this would help). Sway also has a latency reduction, but it's disabled by default. Yet I found the values it suggests as safe to be too aggressive even with a faster dedicated GPU (frame drops in games with high GPU load).
Assignee
Comment 26•4 years ago
intel_gpu_top reports roughly twice as high GPU load with WR compositor enabled vs. disabled and higher GPU power consumption accordingly.
To me that sounds like missing optimizations regarding opaque regions and subsurfaces in Kwin. Things should look quite different on Gnome and, more importantly, in theory (on a perfect compositor).
Regarding low end devices: I also test this on an old Thinkpad T400 and get quite good results. It was also reported that this improves performance on e.g. the Pinephone. That was on Gnome (which has dynamic latency reduction based on measurements) and Weston (which like Gnome should have proper optimizations for subsurfaces in place) though. Kwin and Sway are the compositors I know least about.
Anyway, please let's continue any performance related conversation either in bug 1717902 or open a new bug for compositor specific issues (such as "Higher GPU utilization on Kwin" / "Performance on Kwin"). From your report the latter sounds like a good idea.
Comment 27•3 years ago
I think that bug 1747481 should block this bug. For me, it occurs so often that firefox is unusable with the wayland compositor force enabled, but never occurs without it and therefore I thought it was clearly related. Sorry if this is not as clear as it seems to me.
Assignee
Comment 28•3 years ago
For all interested parties: it may turn out that the approach here is a dead end with regard to the future development of Wayland. Most importantly, offloading composition to Wayland compositors may turn out to not be efficient in an HDR world. Doing composition within Firefox and relying on direct scanout by the Wayland compositor may be a better approach, so the work here stays experimental for the foreseeable future. See https://gitlab.freedesktop.org/pq/color-and-hdr/-/issues/6 for more information.
Comment 29•10 months ago
With GtkGraphicsOffload and Mutter changes and upcoming HDR support we may reconsider using it somehow. AFAIK Mutter supports direct rendering of fullscreen windows only right now but that may change. It would be great to use a layer for video playback at least.
Comment 30•10 months ago
Robert, what do you think? I see your comment about the deprecation now (https://bugzilla.mozilla.org/show_bug.cgi?id=1617498#c28) but it looks to me that recent development is coming back to this concept, at least in some form, right?
Assignee
Comment 31•10 months ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #30)
Robert, what do you think? I see your comment about the deprecation now (https://bugzilla.mozilla.org/show_bug.cgi?id=1617498#c28) but it looks to me that recent development is coming back to this concept, at least in some form, right?
The crucial part that's different in what GTK4 and Chromium[1] do - and what we IMO should do as well in FF - is that they limit subsurface offloading to very few cases. Essentially to only one video subsurface - which, however, can be layered behind controls, with a hole punched into the main surface (see https://blog.gtk.org/files/2023/11/bbb-below.png).
So the main problem with the current state of the implementation here is that it unconditionally offloads everything. I think we could do something more similar to SW-WR-OGL (used on old Android?), which IIRC uses the "native" WR renderer with SW-tiles and then composites them into the window buffer. If we'd do the same - just with dmabuf tiles like already present here - then it should be relatively easy to offload special tiles like video or webgl ones (and again that's AFAIK pretty close to how Chromium works on Wayland).
[1] Note that LaCros (Wayland backend with ChromeOS-private protocols) tried to do something similar to what we have here - in fact way more radical, also trying to offload all kinds of CSS.
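The selective offloading idea described in this comment can be sketched as a simple partition of tiles. This is only an illustration under stated assumptions: the Tile type, the kind labels ("content", "video", "webgl") and the limit of one offloaded surface are hypothetical and do not reflect actual WebRender data structures.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Tile:
    name: str
    kind: str  # hypothetical labels: "content", "video" or "webgl"


# Assumption: only these kinds are worth handing to the Wayland compositor.
OFFLOADABLE_KINDS = {"video", "webgl"}


def plan_composition(
    tiles: List[Tile], max_offloaded: int = 1
) -> Tuple[List[Tile], List[Tile]]:
    """Split tiles into (composited internally, offloaded as subsurfaces).

    At most max_offloaded tiles are offloaded; everything else, including
    surplus video or webgl tiles, is composited into the window buffer.
    """
    internal: List[Tile] = []
    offloaded: List[Tile] = []
    for tile in tiles:
        if tile.kind in OFFLOADABLE_KINDS and len(offloaded) < max_offloaded:
            offloaded.append(tile)
        else:
            internal.append(tile)
    return internal, offloaded


if __name__ == "__main__":
    tiles = [Tile("page", "content"), Tile("player", "video"), Tile("ad", "video")]
    internal, offloaded = plan_composition(tiles)
    print([t.name for t in internal], [t.name for t in offloaded])
```

With the default limit, only the first video tile becomes a subsurface, which mirrors the "essentially only one video subsurface" behaviour attributed to GTK4 and Chromium above.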
Comment 32•9 months ago
Robert,
while implementing the video offload I hit an interesting finding. With Firefox video offload enabled via Wayland compositing (it uses your layers code + external image support and YUV direct offload to the compositor) it looks like Mutter is the only compositor which has issues with such a setup.
When running on Gnome Shell, I see 100% CPU usage when layers are used (30% without them!). OTOH Sway has superior performance with layers, it takes only 8% of CPU even with YUV direct compositing.
Funny enough, if I run Sway as a nested compositor inside Gnome, Sway composites Firefox fine and plays the video and still takes 8-10% CPU while gnome-shell also takes 10-15% CPU, so nested Sway+Firefox is far better than Firefox on Mutter only (Firefox on Mutter takes 30% on my box).
And surprisingly KDE works even better, I don't see any CPU utilization at all when subsurfaces/compositing mode is used on KDE!
That brings me to the question of what mutter/gnome does so wrongly with surface offload. I don't think it's worth implementing an extra Firefox internal compositor just to work around a clear bug in Mutter, better to fix Mutter directly I guess.
What do you think?
Comment 33•9 months ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #32)
That brings me to the question of what mutter/gnome does so wrongly with surface offload. I don't think it's worth implementing an extra Firefox internal compositor just to work around a clear bug in Mutter, better to fix Mutter directly I guess.
Of course, if this is a mutter issue, we should fix it in mutter, not work around it in Firefox.
Can you take a CPU profile while reproducing the mutter 100% CPU usage, e.g. with sysprof?
Comment 34•9 months ago
Yes, I'll look at it. Looks like nested mutter has better performance (uses 10% CPU) but I hit new bugs like image corruption during playback in such mode. I'll fix that on Firefox side first and then do the testing.
Comment 35•9 months ago
Providing builds and instructions on how to reproduce locally would be useful as well.
Comment 36•9 months ago
Thanks, will provide that when it's ready for testing.
Comment 37•9 months ago
Have done testing with a fixed Firefox version and now there isn't any difference between composited and non-composited CPU usage on the Mutter side during YUV video offload, tested on Fedora 40. So it looks like it was caused by my FF patches and perhaps also by logging. Sorry for the noise.
But that also means the Wayland Compositing is suitable for use and way to go which is great news.
Comment 38•9 months ago
Well I spoke too soon. There's visible compositing penalty if blending is used. For instance YT playback causes it as YT player has round corners over the video. I see 10% CPU if I play plain clip and 20% CPU on YT with the round corners. OTOH composition on Firefox side uses the same mutter CPU (10%).
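One factor in the blending cost mentioned above is how much of the main surface can be declared opaque when a video sits behind it. The following is a minimal sketch, assuming a rectangular hole in surface-local coordinates, of computing the opaque rectangles a client could add to a wl_region and pass to wl_surface.set_opaque_region. The function name is hypothetical, and rounded corners are assumed to be covered by the hole's bounding rectangle, so that area stays non-opaque.

```python
from typing import List, Tuple

RectTuple = Tuple[int, int, int, int]  # (x, y, width, height)


def opaque_rects_around_hole(
    surface_w: int, surface_h: int, hole: RectTuple
) -> List[RectTuple]:
    """Return up to four rectangles covering the surface except the hole."""
    # Clamp the hole to the surface bounds.
    hx0 = max(0, hole[0])
    hy0 = max(0, hole[1])
    hx1 = min(surface_w, hole[0] + hole[2])
    hy1 = min(surface_h, hole[1] + hole[3])
    if hx1 <= hx0 or hy1 <= hy0:
        # The hole lies outside the surface, so everything is opaque.
        return [(0, 0, surface_w, surface_h)]

    candidates = [
        (0, 0, surface_w, hy0),                   # band above the hole
        (0, hy1, surface_w, surface_h - hy1),     # band below the hole
        (0, hy0, hx0, hy1 - hy0),                 # strip left of the hole
        (hx1, hy0, surface_w - hx1, hy1 - hy0),   # strip right of the hole
    ]
    # Drop empty rectangles, e.g. when the hole touches a surface edge.
    return [r for r in candidates if r[2] > 0 and r[3] > 0]


if __name__ == "__main__":
    for rect in opaque_rects_around_hole(1280, 720, (100, 50, 640, 360)):
        print(rect)
```

Everything outside the returned rectangles, including any rounded-corner overlay, would still need to be blended by the compositor, which is consistent with the extra cost observed here.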
Assignee
Comment 39•9 months ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #37)
But that also means the Wayland Compositing is suitable for use and way to go which is great news.
Nice, great to hear!
(In reply to Martin Stránský [:stransky] (ni? me) from comment #38)
Well I spoke too soon. There's visible compositing penalty if blending is used. For instance YT playback causes it as YT player has round corners over the video. I see 10% CPU if I play plain clip and 20% CPU on YT with the round corners. OTOH composition on Firefox side uses the same mutter CPU (10%).
Yeah - blending is an issue both practically and conceptually - especially with HDR.
Somewhat related: here's an issue about how to reduce bandwidth overhead for typical video player scenarios that I still hope to get around to pushing forward: https://gitlab.freedesktop.org/wayland/wayland/-/issues/423
Comment 40•9 months ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #38)
I see 10% CPU if I play plain clip and 20% CPU on YT with the round corners. OTOH composition on Firefox side uses the same mutter CPU (10%).
https://bugzilla.mozilla.org/show_bug.cgi?id=1617498#c33 / https://bugzilla.mozilla.org/show_bug.cgi?id=1617498#c35 still apply.
Comment 41•9 months ago
(In reply to Michel Dänzer from comment #40)
(In reply to Martin Stránský [:stransky] (ni? me) from comment #38)
I see 10% CPU if I play plain clip and 20% CPU on YT with the round corners. OTOH composition on Firefox side uses the same mutter CPU (10%).
https://bugzilla.mozilla.org/show_bug.cgi?id=1617498#c33 / https://bugzilla.mozilla.org/show_bug.cgi?id=1617498#c35 still apply.
I hope to get patches committed to Firefox this/next week so it can be tested by stock upstream binaries.
Comment 42•8 months ago
(Fixing the alias to match others)