[wayland] Implement n-buffering for the software backend (Basic/SWWR)
Categories
(Core :: Widget: Gtk, enhancement)
Tracking
()
Tracking | Status | |
---|---|---|
firefox91 | --- | fixed |
People
(Reporter: rmader, Assigned: rmader)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
Following up on bug 1693472 comment 39 and some matrix discussions: Traditionally software rendering in FF was single buffered - fast and who cares about tearing.
For Wayland, in order to fully comply with the spec and at the same time stay fast, we should optimally implement double buffering. Optimally we can reuse EGL infrastructure here, which already has partial damage support for this case.
Comment 1•3 years ago
|
||
I'd rather wait here until we really need it. It means significant performance loss while page scrolling and video playback and generally an small screen updates line menu rendering.
Comment 2•3 years ago
|
||
If you keep a journal with last N surface damage regions and assign a buffer age or something like that to every buffer, there won't be huge performance regressions.
Assignee | ||
Comment 3•3 years ago
|
||
(In reply to Vlad Zahorodnii [:zzag] from comment #2)
If you keep a journal with last N surface damage regions and assign a buffer age or something like that to every buffer, there won't be huge performance regressions.
Yes, agree with that. Theoretically there should only be overhead when the damage region over the last 2-3 frames changed, which is not the case for scrolling or fullscreen video. For that cases we should be able to have equal performance, if implemented right.
Assignee | ||
Comment 4•3 years ago
|
||
Martin: just FYU, I had a longer chat with Lee Salzman and Markus Stange at #gfx-firefox about this. Apparently CA has similar limitations, but it's not clear if we can reuse their solution. In any case we can make SWWR work for this case and likely reuse the EGL damage tracking logic. Basic is about to be abandoned soon anyway, so no reason to worry about that.
Assignee | ||
Comment 5•3 years ago
|
||
In order to fully comply with the spec and work on all compositors.
This is somewhat inspired by SurfacePoolCA
, moving buffers between
buckets - in use or available. The idea is to make it easy to expand
this further to share a common buffer pool between multiple surfaces.
It works similar to EGL buffer handling and reuses the partial damage
logic in WR by implementing buffer age.
Notable changes to the current implementation:
- always draw directly into shm-buffer memory - no caching.
- drawing is purely driven by the compositor / the VsyncSource.
No frame callbacks or threads apart from buffer release
callbacks. - Buffer objects are construct only, i.e. no resizing. This will
hopefully simplify buffer sharing between surfaces.
One of the goals here is to handle different compositor behaviour
efficiently - if a compositor releases buffers early, we want to
reuse a single buffer whenever possible. If a compositor holds buffers
longer, we want to minimize the area we need to redraw to older buffers.
This is archived by always using the last released buffer. The expected
usual buffer age is 1
in the fast release case and 2
-3
in the
holding case.
This changes buffer handling quite significantly, so propper testing
is due. However, as the overall architecture is somewhat simpler than
before, it may improve stability in the long term.
Updated•3 years ago
|
Assignee | ||
Comment 6•3 years ago
|
||
Here's a try build: https://treeherder.mozilla.org/jobs?repo=try&revision=bfde80fd01d54484f3890a52998c40744cc86c7b
So far I only tested on Mutter with buffer holding enabled or disabled. First test are encouraging, however I still experience certain crashes which might need some additional changes in other parts of the code.
Martin: This would replace a lot of code you worked on a lot with a quite different approach - so I'd love to your opinion/thoughs if this looks reasonable to you.
Assignee | ||
Updated•3 years ago
|
Updated•3 years ago
|
Assignee | ||
Comment 7•3 years ago
|
||
I did a more extensive round of testing with the latest revision. Crashes are gone now as far as I can see, because of better refcounting. I compared two optimized but non-release builds with and without the patch (1, 2). On my main mashine (Thinkpad T460p, skylake) I tested all major compositors (Mutter with and without buffer-holding, KWin, Sway and Weston), on an old Thinkpad T400 I only tested stock Mutter. Each time I tested both Basic and SWWR and checked sides like testufo.com and vsynctester.com.
So far my findings are:
- performance appears to be pretty much equal as far as reported FPS etc. are concerned
- on the T400, I had the subjective impression that sometimes scrolling felt smoother with the old/current implementation, both on SWWR and basic. I'm don't know yet why that could be and if it's really the case - maybe the old implementation skips frames sometimes in smarter way?
- unrelated to this patch, the combination of SWWR, Wayland and Kwin was heavily stuttering on testufo.com, both on the old and new backend. It doesn't happen with the Basic or HWWR compositor (it's doing great there), with the Xwayland backend or any other compositor. Also, it doesn't appear to be related to frame callbacks (
WaylandVsyncSource
). This needs investigation in another bug - maybe I'm just hitting very unfortunate timings or so.
1: https://treeherder.mozilla.org/jobs?repo=try&revision=4fc6cc5b6e9145b92f50d379039e6a1cf862a862&selectedTaskRun=GYm046RnQHGIPxSckEBXJw.0
2: https://treeherder.mozilla.org/jobs?repo=try&revision=56a78262fd441ead241ae4ef4deead4515bc71cd&selectedTaskRun=QBqdy_oqQpWvZK3ypzD6RQ.0
Comment 8•3 years ago
|
||
the combination of SWWR, Wayland and Kwin was heavily stuttering on testufo.com
There are a couple of reasons why testufo.com stutters when running kwin
(a) with SWWR, the entire Firefox window is damaged on every frame. With the Basic renderer, only some parts of the window are damaged, therefore there are less pixels to upload to the gpu
(b) by default, kwin doesn't start compositing right after a vblank. usually, it will perform compositing somewhere around the middle of the vblank interval. As far as I know, most compositors don't do it (at least, not by default). KWin has a mechanism to move compositing closer to vblank, but in this instance, it fails. Ideally, data from wl_shm buffers needs to be uploaded to opengl textures asynchronously, which is on our todo list. In meanwhile, you can change the latency policy in system settings to "Force smoothest animations."
Comment 9•3 years ago
|
||
(c) kwin uploads data from wl_shm buffers to opengl textures during compositing cycle :/, not when a buffer is attached to a surface
Assignee | ||
Comment 10•3 years ago
|
||
(In reply to Vlad Zahorodnii [:zzag] from comment #8)
(a) with SWWR, the entire Firefox window is damaged on every frame. With the Basic renderer, only some parts of the window are damaged, therefore there are less pixels to upload to the gpu
Oof, indeed, WRs partial damage logic really doesn't like that site :(
Assignee | ||
Comment 11•3 years ago
|
||
Assignee | ||
Comment 12•3 years ago
|
||
That didn't go well, here's a new one: https://treeherder.mozilla.org/jobs?repo=try&revision=10317c75ac8ccb8d15f8edbad30393b399f02900
Assignee | ||
Comment 13•3 years ago
|
||
And another one, fixing a trivial crash. Conveniently it confirmed that at least on SW-WR, a buffer release callback can arrive after the WindowSurface
was otherwise unreffed (or, before bug 1709606, explicitly destroyed), making us access freed memory under certain circumstances.
https://treeherder.mozilla.org/jobs?repo=try&revision=e6084a335971bf4a98417bd3939a9fa95ac325e0
Assignee | ||
Comment 14•3 years ago
|
||
Some more observations from testing this on my Thinkpad T400:
- frame drops seem to heavily depend on when compositor settings. The device is heavily bandwidth limited, so fullscreen updates apparently often don't make it. This is for example visible on Weston, which by default starts compositing at the 7ms mark while Mutter usually at 2ms. Using the default settings of 7ms, both backends are similar. When changing it to 2ms, to make it behave more similar to Mutter, things go in favor of the old backend. IIUC this is because the new backend is "dumper", i.e. drawing and submitting frames is both driven by frame callback
WaylandVsyncSource
. The old backend usesWaylandVsyncSource
for drawing, but has its own frame callbacks for actually attaching and commiting the surface, potentially caching things in between. From my understanding, the old backend, while sometimes working better on compositors like Mutter, does kinda work around a compositor issue. I.e. compositors should optimize callback times smarter. In conclusion I think the new backend behaves more predictable and how compositors would clients assume to behave, and compositors should be able to make it behave similar well. - On my main mashine, both backends appear to perform very similar.
- With the old backed, I occasional observed freezes on SW-WR when e.g. playing videos for a longer time. The new backend appears to very solid, from all I can see, which due to it's lower complexity wouldn't be too surprising.
Assignee | ||
Comment 15•3 years ago
|
||
This just became more important/delicate after bug 1711118 landed, which would make this the default.
Comment 16•3 years ago
|
||
Comment 17•3 years ago
|
||
Backed out for non-unified failure on RenderCompositorSWGL.cpp
backout: https://hg.mozilla.org/integration/autoland/rev/d7166b8e97a98a98e5d3a5df6f975051dcd47c38
failure log: https://treeherder.mozilla.org/logviewer?job_id=340022121&repo=autoland&lineNumber=7558
[task 2021-05-18T13:43:06.337Z] 60:42.08 /builds/worker/fetches/clang/bin/clang++ -isysroot /builds/worker/fetches/MacOSX10.12.sdk -std=gnu++17 --target=x86_64-apple-darwin -o /dev/null -c -fvisibility=hidden -fvisibility-inlines-hidden -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -fstack-protector-strong -ftrivial-auto-var-init=pattern -DDEBUG=1 -DOS_POSIX=1 -DOS_MACOSX=1 -DMOZ_HAS_MOZGLUE -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -DSTATIC_EXPORTABLE_JS_API -I/builds/worker/checkouts/gecko/gfx/webrender_bindings -I/builds/worker/checkouts/gecko/obj-x86_64-apple-darwin/gfx/webrender_bindings -I/builds/worker/checkouts/gecko/obj-x86_64-apple-darwin/ipc/ipdl/_ipdlheaders -I/builds/worker/checkouts/gecko/ipc/chromium/src -I/builds/worker/checkouts/gecko/ipc/glue -I/builds/worker/checkouts/gecko/obj-x86_64-apple-darwin/dist/include -I/builds/worker/checkouts/gecko/obj-x86_64-apple-darwin/dist/include/nspr -I/builds/worker/checkouts/gecko/obj-x86_64-apple-darwin/dist/include/nss -DMOZILLA_CLIENT -include /builds/worker/checkouts/gecko/obj-x86_64-apple-darwin/mozilla-config.h -Qunused-arguments -Qunused-arguments -Wall -Wbitfield-enum-conversion -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wshadow-field-in-constructor-modified -Wsign-compare -Wtype-limits -Wunreachable-code -Wunreachable-code-return -Wwrite-strings -Wno-invalid-offsetof -Wclass-varargs -Wempty-init-stmt -Wfloat-overflow-conversion -Wfloat-zero-conversion -Wloop-analysis -Wno-range-loop-analysis -Wc++2a-compat -Wcomma -Wimplicit-fallthrough -Wunused-function -Wunused-variable -Wstring-conversion -Wtautological-overlap-compare -Wtautological-unsigned-enum-zero-compare -Wtautological-unsigned-zero-compare -Wno-error=tautological-type-limit-compare -Wno-inline-new-delete -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-error=backend-plugin -Wno-error=free-nonheap-object -Wno-error=return-std-move -Wno-error=atomic-alignment -Wno-error=deprecated-copy -Wformat -Wformat-security -Wno-gnu-zero-variadic-macro-arguments -Wno-psabi -Wno-unknown-warning-option -fno-sized-deallocation -fno-aligned-new -fno-exceptions -fno-strict-aliasing -stdlib=libc++ -fPIC -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -pthread -pipe -g -O3 -fno-omit-frame-pointer -funwind-tables -I/builds/worker/checkouts/gecko/obj-x86_64-apple-darwin/dist/include/cairo -Werror=switch -ferror-limit=0 /builds/worker/checkouts/gecko/gfx/webrender_bindings/RenderEGLImageTextureHost.cpp -fsyntax-only
[task 2021-05-18T13:43:07.130Z] /builds/worker/checkouts/gecko/gfx/webrender_bindings/RenderCompositorSWGL.cpp:263:15: error: no member named 'gfxVars' in namespace 'mozilla::gfx'
[task 2021-05-18T13:43:07.130Z] return gfx::gfxVars::WebRenderMaxPartialPresentRects();
[task 2021-05-18T13:43:07.131Z] ~~~~~^
[task 2021-05-18T13:43:07.280Z] 1 error generated.
Comment 18•3 years ago
|
||
Also failing several webdriver tests: https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=aWZXALyCRXWmx1GWF0NIZQ.0&searchStr=linux%2C18.04%2Cx64%2Casan%2Copt%2Cweb%2Cplatform%2Ctests%2Ctest-linux1804-64-asan%2Fopt-web-platform-tests-wdspec-headless-e10s%2Cwdh1&revision=ee3c15c9dba52306591473b3261e3bfce266cd71
and
https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=eAnDeWgHQae1h59zfBZEQQ.0&revision=84eb62a713ed2d8405ba800a9c930bac26d6cc0f&searchStr=wdh&group_state=expanded
Assignee | ||
Comment 20•3 years ago
|
||
So the approach here is currently on hold because it turned out that SW-WR uses the "native" compositor paths and thus can't easily reuse the buffer-age based paths of the "draw" compositor. Fixing the case here is most likely equal to solving partial damage for compositor integration (bug 1617498).
So far I assumed that there's only one solution for this problem, which is to make WR repaint the union of the current and previous dirty rects. However, it turns out that the CoreAnimation backend instead just blits back the needed areas from the previous frame[1]. In case of Wayland, a submitted/attached buffer is considered "locked" and the spec is not super clear about whether we are allowed to blit from an attached buffer:
Committing a pending wl_buffer allows the compositor to read the pixels in the wl_buffer. The compositor may access the pixels at any time after the wl_surface.commit request. When the compositor will not access the pixels anymore, it will send the wl_buffer.release event.[2]
In a quick chat with emersion at #wayland
we concluded that the spec should be clarified - with a tendency to only allow reading from the compositor, and not possible stuff like in-place mutations which get reversed before buffer release. This would allow us to blit/memcopy the missing parts. I'll investigate if this is a feasible solution.
1: https://phabricator.services.mozilla.com/D51760
2: https://wayland.freedesktop.org/docs/html/apa.html
Assignee | ||
Comment 21•3 years ago
•
|
||
The approach described above, blitting from the previously submitted buffer, seems to work pretty well. That's great news as it:
- allows us handle everything purely in the surface backend
- thus also works for
Basic
- is pretty fast:
- doesn't have any overhead (apart from a few calculations) while scrolling or fullscreen updates
- doesn't have any overhead on compositor that quickly release buffers again
Will post a new patch soon. This also opens the door to have a single backend again, instead of maintaining two.
Assignee | ||
Comment 22•3 years ago
|
||
In order to fully comply with the spec and work on all compositors.
This is heavily inspired by NativeLayerCA
:
- Buffer are managed within a pool, similar to
SurfacePool
. This
makes sure there's always a buffer available, even if the compositor
holds buffers for longer than usually expected. One difference is
that buffers are not shared between windows though, reducing
complexity. - Partial damage is handled by blitting from the previous buffer,
likely currently held by the Wayland compositor. While the spec is
not completely clear whether this is legal, it strongly suggests it
is - and other Wayland developers suggested it should be.
This is almost identical toNativeLayerCA::HandlePartialUpdate
,
with a small optimization for the common case of double buffering,
in witch case use a "damage history", inspired by the EGL partial
damage support. As single or double buffering are by far the most
common cases for shm-buffers, we limit this to double buffering to
avoid complexity.
Changes compared to the existing WindowSurfaceWayland
include:
- reduced complexity:
- less code
- no extra frame callbacks, avoiding extra steps like D117911
- no compositor specific modes - this solution should always produce
fully correct output with minimal overhead.
Note: a previous version of this patch only worked for SW-WR but not
Basic and thus did not replace WindowSurfaceWayland
. Now this patch
supports both, making it possible to fully replace WindowSurfaceWayland
Assignee | ||
Comment 23•3 years ago
•
|
||
https://treeherder.mozilla.org/jobs?repo=try&revision=7acaa12a06e15b144cfb022ded26e3064979027c
https://treeherder.mozilla.org/jobs?repo=try&revision=622fc15c840f50d6d1b49bfb3d8f21e2a3e7aef8
Try job which always damages the whole buffer, making it easy to check if contents are really valid.
Assignee | ||
Comment 24•3 years ago
|
||
I gave the build from the comment above some testing on Gnome or KDE. It always does full buffer damage and thus bypasses the optional damage hint, forcing the compositor to always show the full buffer content. Comparing the old and new backend shows that the old one often produces glitches, thus violating the spec. This may or may not be visible on a given compositor - on Mutter it currently would never show.
The glitches are especially visible on KDE with Basic, because of its different architecture. With the new backend I wasn't able to produce any glitches, neither on SW-WR nor Basic, on neither Gnome or KDE. Performance wise I wasn't able to spot a difference, but I also didn't make any measurements yet.
I think that speaks in favor of replacing the old with the new backend - Martin, opinions here?
Updated•3 years ago
|
Assignee | ||
Comment 25•3 years ago
|
||
Another try build, this time without any debug stuff: https://treeherder.mozilla.org/jobs?repo=try&revision=87f00af7f6095b406d5211f7a317042913554ce6
The new backend is enabled by default, both on SW-WR and Basic. The old one can be activated by setting widget.wayland.software-backend:1
Comment 27•3 years ago
|
||
(In reply to Robert Mader [:rmader] from comment #24)
I think that speaks in favor of replacing the old with the new backend - Martin, opinions here?
Yes, we can replace it when it's tested enough in nightly and/or mozilla test suite with Wayland backend.
Comment 28•3 years ago
|
||
Comment 29•3 years ago
|
||
bugherder |
Assignee | ||
Comment 30•3 years ago
|
||
FTR: the new backend has now landed, but for is disabled by default - it can be enabled via the setting widget.wayland.multi-buffer-software-backend.enabled
.
In theory it should/could perform better on compositors like Kwin and maybe on Gnome as well, but I haven't been able to verify this - and there may be further optimization potential. In any case it should be more correct - testing welcome :)
Concerning issues on sites like testufo.com and vsynctester.com (Comment 8): this has been fixed in bug 1713686, unfortunately only for Wayland.
Description
•