Closed Bug 1708416 Opened 1 year ago Closed 1 year ago

[wayland] Implement n-buffering for the software backend (Basic/SWWR)

Categories

(Core :: Widget: Gtk, enhancement)

Other Branch
enhancement

Tracking

()

RESOLVED FIXED
91 Branch
Tracking Status
firefox91 --- fixed

People

(Reporter: rmader, Assigned: rmader)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

Following up on bug 1693472 comment 39 and some matrix discussions: Traditionally software rendering in FF was single buffered - fast and who cares about tearing.

For Wayland, in order to fully comply with the spec and at the same time stay fast, we should optimally implement double buffering. Optimally we can reuse EGL infrastructure here, which already has partial damage support for this case.

I'd rather wait here until we really need it. It means significant performance loss while page scrolling and video playback and generally an small screen updates line menu rendering.

If you keep a journal with last N surface damage regions and assign a buffer age or something like that to every buffer, there won't be huge performance regressions.

(In reply to Vlad Zahorodnii [:zzag] from comment #2)

If you keep a journal with last N surface damage regions and assign a buffer age or something like that to every buffer, there won't be huge performance regressions.

Yes, agree with that. Theoretically there should only be overhead when the damage region over the last 2-3 frames changed, which is not the case for scrolling or fullscreen video. For that cases we should be able to have equal performance, if implemented right.

Martin: just FYU, I had a longer chat with Lee Salzman and Markus Stange at #gfx-firefox about this. Apparently CA has similar limitations, but it's not clear if we can reuse their solution. In any case we can make SWWR work for this case and likely reuse the EGL damage tracking logic. Basic is about to be abandoned soon anyway, so no reason to worry about that.

In order to fully comply with the spec and work on all compositors.
This is somewhat inspired by SurfacePoolCA, moving buffers between
buckets - in use or available. The idea is to make it easy to expand
this further to share a common buffer pool between multiple surfaces.

It works similar to EGL buffer handling and reuses the partial damage
logic in WR by implementing buffer age.

Notable changes to the current implementation:

  • always draw directly into shm-buffer memory - no caching.
  • drawing is purely driven by the compositor / the VsyncSource.
    No frame callbacks or threads apart from buffer release
    callbacks.
  • Buffer objects are construct only, i.e. no resizing. This will
    hopefully simplify buffer sharing between surfaces.

One of the goals here is to handle different compositor behaviour
efficiently - if a compositor releases buffers early, we want to
reuse a single buffer whenever possible. If a compositor holds buffers
longer, we want to minimize the area we need to redraw to older buffers.
This is archived by always using the last released buffer. The expected
usual buffer age is 1 in the fast release case and 2-3 in the
holding case.

This changes buffer handling quite significantly, so propper testing
is due. However, as the overall architecture is somewhat simpler than
before, it may improve stability in the long term.

Assignee: nobody → robert.mader
Status: NEW → ASSIGNED

Here's a try build: https://treeherder.mozilla.org/jobs?repo=try&revision=bfde80fd01d54484f3890a52998c40744cc86c7b

So far I only tested on Mutter with buffer holding enabled or disabled. First test are encouraging, however I still experience certain crashes which might need some additional changes in other parts of the code.

Martin: This would replace a lot of code you worked on a lot with a quite different approach - so I'd love to your opinion/thoughs if this looks reasonable to you.

See Also: → 1693472
Depends on: 1709606
Summary: [wayland] Implement double buffering for the software backend (Basic/SWWR) → [wayland] Implement n-buffering for the software backend (Basic/SWWR)
Attachment #9220347 - Attachment description: Bug 1708416 - RFC: Implement n-buffering for the Wayland software backend, r?stransky,lsalzman → Bug 1708416 - Implement n-buffering for the Wayland software backend, r?stransky,lsalzman

I did a more extensive round of testing with the latest revision. Crashes are gone now as far as I can see, because of better refcounting. I compared two optimized but non-release builds with and without the patch (1, 2). On my main mashine (Thinkpad T460p, skylake) I tested all major compositors (Mutter with and without buffer-holding, KWin, Sway and Weston), on an old Thinkpad T400 I only tested stock Mutter. Each time I tested both Basic and SWWR and checked sides like testufo.com and vsynctester.com.

So far my findings are:

  • performance appears to be pretty much equal as far as reported FPS etc. are concerned
  • on the T400, I had the subjective impression that sometimes scrolling felt smoother with the old/current implementation, both on SWWR and basic. I'm don't know yet why that could be and if it's really the case - maybe the old implementation skips frames sometimes in smarter way?
  • unrelated to this patch, the combination of SWWR, Wayland and Kwin was heavily stuttering on testufo.com, both on the old and new backend. It doesn't happen with the Basic or HWWR compositor (it's doing great there), with the Xwayland backend or any other compositor. Also, it doesn't appear to be related to frame callbacks (WaylandVsyncSource). This needs investigation in another bug - maybe I'm just hitting very unfortunate timings or so.

1: https://treeherder.mozilla.org/jobs?repo=try&revision=4fc6cc5b6e9145b92f50d379039e6a1cf862a862&selectedTaskRun=GYm046RnQHGIPxSckEBXJw.0
2: https://treeherder.mozilla.org/jobs?repo=try&revision=56a78262fd441ead241ae4ef4deead4515bc71cd&selectedTaskRun=QBqdy_oqQpWvZK3ypzD6RQ.0

the combination of SWWR, Wayland and Kwin was heavily stuttering on testufo.com

There are a couple of reasons why testufo.com stutters when running kwin

(a) with SWWR, the entire Firefox window is damaged on every frame. With the Basic renderer, only some parts of the window are damaged, therefore there are less pixels to upload to the gpu
(b) by default, kwin doesn't start compositing right after a vblank. usually, it will perform compositing somewhere around the middle of the vblank interval. As far as I know, most compositors don't do it (at least, not by default). KWin has a mechanism to move compositing closer to vblank, but in this instance, it fails. Ideally, data from wl_shm buffers needs to be uploaded to opengl textures asynchronously, which is on our todo list. In meanwhile, you can change the latency policy in system settings to "Force smoothest animations."

(c) kwin uploads data from wl_shm buffers to opengl textures during compositing cycle :/, not when a buffer is attached to a surface

(In reply to Vlad Zahorodnii [:zzag] from comment #8)

(a) with SWWR, the entire Firefox window is damaged on every frame. With the Basic renderer, only some parts of the window are damaged, therefore there are less pixels to upload to the gpu

Oof, indeed, WRs partial damage logic really doesn't like that site :(

See Also: → 1710120
Depends on: 1710532
Depends on: 1711094

And another one, fixing a trivial crash. Conveniently it confirmed that at least on SW-WR, a buffer release callback can arrive after the WindowSurface was otherwise unreffed (or, before bug 1709606, explicitly destroyed), making us access freed memory under certain circumstances.

https://treeherder.mozilla.org/jobs?repo=try&revision=e6084a335971bf4a98417bd3939a9fa95ac325e0

Some more observations from testing this on my Thinkpad T400:

  • frame drops seem to heavily depend on when compositor settings. The device is heavily bandwidth limited, so fullscreen updates apparently often don't make it. This is for example visible on Weston, which by default starts compositing at the 7ms mark while Mutter usually at 2ms. Using the default settings of 7ms, both backends are similar. When changing it to 2ms, to make it behave more similar to Mutter, things go in favor of the old backend. IIUC this is because the new backend is "dumper", i.e. drawing and submitting frames is both driven by frame callback WaylandVsyncSource. The old backend uses WaylandVsyncSource for drawing, but has its own frame callbacks for actually attaching and commiting the surface, potentially caching things in between. From my understanding, the old backend, while sometimes working better on compositors like Mutter, does kinda work around a compositor issue. I.e. compositors should optimize callback times smarter. In conclusion I think the new backend behaves more predictable and how compositors would clients assume to behave, and compositors should be able to make it behave similar well.
  • On my main mashine, both backends appear to perform very similar.
  • With the old backed, I occasional observed freezes on SW-WR when e.g. playing videos for a longer time. The new backend appears to very solid, from all I can see, which due to it's lower complexity wouldn't be too surprising.

This just became more important/delicate after bug 1711118 landed, which would make this the default.

See Also: → 1711118
See Also: → 1694038
Pushed by robert.mader@posteo.de:
https://hg.mozilla.org/integration/autoland/rev/ee3c15c9dba5
Implement n-buffering for the Wayland software backend, r=stransky

Backed out for non-unified failure on RenderCompositorSWGL.cpp

backout: https://hg.mozilla.org/integration/autoland/rev/d7166b8e97a98a98e5d3a5df6f975051dcd47c38

push: https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=aXabMa7CQdaGGRLBZkRa4Q.0&revision=ee3c15c9dba52306591473b3261e3bfce266cd71

failure log: https://treeherder.mozilla.org/logviewer?job_id=340022121&repo=autoland&lineNumber=7558

[task 2021-05-18T13:43:06.337Z] 60:42.08 /builds/worker/fetches/clang/bin/clang++ -isysroot /builds/worker/fetches/MacOSX10.12.sdk -std=gnu++17 --target=x86_64-apple-darwin -o /dev/null -c -fvisibility=hidden -fvisibility-inlines-hidden -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -fstack-protector-strong -ftrivial-auto-var-init=pattern -DDEBUG=1 -DOS_POSIX=1 -DOS_MACOSX=1 -DMOZ_HAS_MOZGLUE -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -DSTATIC_EXPORTABLE_JS_API -I/builds/worker/checkouts/gecko/gfx/webrender_bindings -I/builds/worker/checkouts/gecko/obj-x86_64-apple-darwin/gfx/webrender_bindings -I/builds/worker/checkouts/gecko/obj-x86_64-apple-darwin/ipc/ipdl/_ipdlheaders -I/builds/worker/checkouts/gecko/ipc/chromium/src -I/builds/worker/checkouts/gecko/ipc/glue -I/builds/worker/checkouts/gecko/obj-x86_64-apple-darwin/dist/include -I/builds/worker/checkouts/gecko/obj-x86_64-apple-darwin/dist/include/nspr -I/builds/worker/checkouts/gecko/obj-x86_64-apple-darwin/dist/include/nss -DMOZILLA_CLIENT -include /builds/worker/checkouts/gecko/obj-x86_64-apple-darwin/mozilla-config.h -Qunused-arguments -Qunused-arguments -Wall -Wbitfield-enum-conversion -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wshadow-field-in-constructor-modified -Wsign-compare -Wtype-limits -Wunreachable-code -Wunreachable-code-return -Wwrite-strings -Wno-invalid-offsetof -Wclass-varargs -Wempty-init-stmt -Wfloat-overflow-conversion -Wfloat-zero-conversion -Wloop-analysis -Wno-range-loop-analysis -Wc++2a-compat -Wcomma -Wimplicit-fallthrough -Wunused-function -Wunused-variable -Wstring-conversion -Wtautological-overlap-compare -Wtautological-unsigned-enum-zero-compare -Wtautological-unsigned-zero-compare -Wno-error=tautological-type-limit-compare -Wno-inline-new-delete -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-error=backend-plugin -Wno-error=free-nonheap-object -Wno-error=return-std-move -Wno-error=atomic-alignment -Wno-error=deprecated-copy -Wformat -Wformat-security -Wno-gnu-zero-variadic-macro-arguments -Wno-psabi -Wno-unknown-warning-option -fno-sized-deallocation -fno-aligned-new -fno-exceptions -fno-strict-aliasing -stdlib=libc++ -fPIC -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -pthread -pipe -g -O3 -fno-omit-frame-pointer -funwind-tables -I/builds/worker/checkouts/gecko/obj-x86_64-apple-darwin/dist/include/cairo -Werror=switch -ferror-limit=0 /builds/worker/checkouts/gecko/gfx/webrender_bindings/RenderEGLImageTextureHost.cpp -fsyntax-only
[task 2021-05-18T13:43:07.130Z] /builds/worker/checkouts/gecko/gfx/webrender_bindings/RenderCompositorSWGL.cpp:263:15: error: no member named 'gfxVars' in namespace 'mozilla::gfx'
[task 2021-05-18T13:43:07.130Z]   return gfx::gfxVars::WebRenderMaxPartialPresentRects();
[task 2021-05-18T13:43:07.131Z]          ~~~~~^
[task 2021-05-18T13:43:07.280Z] 1 error generated.
Flags: needinfo?(robert.mader)

Thanks, will look into it.

Flags: needinfo?(robert.mader)

So the approach here is currently on hold because it turned out that SW-WR uses the "native" compositor paths and thus can't easily reuse the buffer-age based paths of the "draw" compositor. Fixing the case here is most likely equal to solving partial damage for compositor integration (bug 1617498).

So far I assumed that there's only one solution for this problem, which is to make WR repaint the union of the current and previous dirty rects. However, it turns out that the CoreAnimation backend instead just blits back the needed areas from the previous frame[1]. In case of Wayland, a submitted/attached buffer is considered "locked" and the spec is not super clear about whether we are allowed to blit from an attached buffer:

Committing a pending wl_buffer allows the compositor to read the pixels in the wl_buffer. The compositor may access the pixels at any time after the wl_surface.commit request. When the compositor will not access the pixels anymore, it will send the wl_buffer.release event.[2]

In a quick chat with emersion at #wayland we concluded that the spec should be clarified - with a tendency to only allow reading from the compositor, and not possible stuff like in-place mutations which get reversed before buffer release. This would allow us to blit/memcopy the missing parts. I'll investigate if this is a feasible solution.

1: https://phabricator.services.mozilla.com/D51760
2: https://wayland.freedesktop.org/docs/html/apa.html

See Also: → 1713202

The approach described above, blitting from the previously submitted buffer, seems to work pretty well. That's great news as it:

  • allows us handle everything purely in the surface backend
  • thus also works for Basic
  • is pretty fast:
    • doesn't have any overhead (apart from a few calculations) while scrolling or fullscreen updates
    • doesn't have any overhead on compositor that quickly release buffers again

Will post a new patch soon. This also opens the door to have a single backend again, instead of maintaining two.

In order to fully comply with the spec and work on all compositors.
This is heavily inspired by NativeLayerCA:

  • Buffer are managed within a pool, similar to SurfacePool. This
    makes sure there's always a buffer available, even if the compositor
    holds buffers for longer than usually expected. One difference is
    that buffers are not shared between windows though, reducing
    complexity.
  • Partial damage is handled by blitting from the previous buffer,
    likely currently held by the Wayland compositor. While the spec is
    not completely clear whether this is legal, it strongly suggests it
    is - and other Wayland developers suggested it should be.
    This is almost identical to NativeLayerCA::HandlePartialUpdate,
    with a small optimization for the common case of double buffering,
    in witch case use a "damage history", inspired by the EGL partial
    damage support. As single or double buffering are by far the most
    common cases for shm-buffers, we limit this to double buffering to
    avoid complexity.

Changes compared to the existing WindowSurfaceWayland include:

  • reduced complexity:
    • less code
    • no extra frame callbacks, avoiding extra steps like D117911
  • no compositor specific modes - this solution should always produce
    fully correct output with minimal overhead.

Note: a previous version of this patch only worked for SW-WR but not
Basic and thus did not replace WindowSurfaceWayland. Now this patch
supports both, making it possible to fully replace WindowSurfaceWayland

I gave the build from the comment above some testing on Gnome or KDE. It always does full buffer damage and thus bypasses the optional damage hint, forcing the compositor to always show the full buffer content. Comparing the old and new backend shows that the old one often produces glitches, thus violating the spec. This may or may not be visible on a given compositor - on Mutter it currently would never show.
The glitches are especially visible on KDE with Basic, because of its different architecture. With the new backend I wasn't able to produce any glitches, neither on SW-WR nor Basic, on neither Gnome or KDE. Performance wise I wasn't able to spot a difference, but I also didn't make any measurements yet.

I think that speaks in favor of replacing the old with the new backend - Martin, opinions here?

Flags: needinfo?(stransky)
Attachment #9227434 - Attachment description: WIP: Bug 1708416 - Implement n-buffering for the Wayland software backend → Bug 1708416 - Implement n-buffering for the Wayland software backend, r=stransky

Another try build, this time without any debug stuff: https://treeherder.mozilla.org/jobs?repo=try&revision=87f00af7f6095b406d5211f7a317042913554ce6

The new backend is enabled by default, both on SW-WR and Basic. The old one can be activated by setting widget.wayland.software-backend:1

Sure, I'll look at it.

Flags: needinfo?(stransky)

(In reply to Robert Mader [:rmader] from comment #24)

I think that speaks in favor of replacing the old with the new backend - Martin, opinions here?

Yes, we can replace it when it's tested enough in nightly and/or mozilla test suite with Wayland backend.

Pushed by robert.mader@posteo.de:
https://hg.mozilla.org/integration/autoland/rev/f98734017be4
Implement n-buffering for the Wayland software backend, r=stransky
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 91 Branch

FTR: the new backend has now landed, but for is disabled by default - it can be enabled via the setting widget.wayland.multi-buffer-software-backend.enabled.

In theory it should/could perform better on compositors like Kwin and maybe on Gnome as well, but I haven't been able to verify this - and there may be further optimization potential. In any case it should be more correct - testing welcome :)

Concerning issues on sites like testufo.com and vsynctester.com (Comment 8): this has been fixed in bug 1713686, unfortunately only for Wayland.

See Also: → 1713686
You need to log in before you can comment on or make changes to this bug.