[Wayland] Software decoding into dmabuf surface
Categories
(Core :: Audio/Video: Playback, enhancement)
Tracking
()
Tracking | Status | |
---|---|---|
firefox79 | --- | fixed |
People
(Reporter: jan, Assigned: stransky)
References
(Blocks 1 open bug)
Details
(Keywords: nightly-community, perf, power)
Attachments
(3 files)
(gashish4all from bug 1580169 comment 4)
with ffmpeg there is not very a big difference between software decoding and vaapi decoding to dmabuf surface.
Does bug 1610199 already cover software decoding into dmabuf surfaces? Should it be behind its own pref?
Assignee | ||
Comment 1•5 years ago
|
||
SW decoding to dmabuf surface is pointless. It combines disadvantages of HW decoding and SW rendering...that may be the most inefficient variant :)
Reporter | ||
Comment 2•5 years ago
•
|
||
IIRC there was the open question of a fast upload path for software decoding to avoid a painful copy and I just haven't found the bug about it.
(I think it was something like "Software decode directly into OpenGL surface")
Assignee | ||
Comment 3•5 years ago
|
||
(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #2)
IIRC there was the open question of a fast upload path for software decoding to avoid a painful copy and I just haven't found the bug about it.
That's the recent implementation of SW decoding/playback by shm memory.
Reporter | ||
Comment 4•5 years ago
|
||
Thanks for the hint! I assume you mean bug 1539043 (bug 1561181 comment 2, bug 1618429, bug 1595994). Great! :)
Reporter | ||
Comment 5•5 years ago
•
|
||
(gashish4all from bug 1580169 comment 4)
with ffmpeg there is not very a big difference between software decoding and vaapi decoding to dmabuf surface.
(In reply to Martin Stránský [:stransky] from comment #1)
SW decoding to dmabuf surface is pointless. It combines disadvantages of HW decoding and SW rendering...that may be the most inefficient variant :)
You mentioned unrelated "sw rendering", so I'm still unsure about closing this bug, but I'm also not a developer.
A while ago Chromium introduced GpuMemoryBuffer to eliminate CPU to GPU copies. On the implementation side, GpuMemoryBuffer is a Linux DMA buffer on Chrome OS, IOSurface on Mac OS X, and a gralloc buffer on Android. The software fallback uses POSIX shared memory.
POSIX shared memory seems to mean Shm.
I assume current situation is "one copy" instead of "zero copy": In content or RDD process, software decoding is done into Shm memory, then shared with WebRender in the GPU process (X11) or parent process (Wayland) which copies video frames onto the GPU.
But with dmabuf you would decode directly into GPU memory?
(Jean-Yves Avenard [:jya] from bug 1561181 comment #2)
We will instead directly upload into a GPU surface.
Is there already a bug about this or should this bug be reopened? (Should this depend on bug 1595994?)
Dmabuf is already used for zero-copy Wayland WebGL (bug 1572697), Wayland VAAPI (bug 1610199) and X11 support is pending (bug 1580166).
Assignee | ||
Comment 6•5 years ago
|
||
Yes I know that article...it it possible to have some gain here but frankly I doubt it's worth the work. Even zero-copy vaapi playback over dmabuf does not always create significant difference. The recent SW decode path is pretty well optimized with WebRender/GL compositor. And Chrome uses the same way by default.
Reporter | ||
Comment 7•5 years ago
•
|
||
Wouldn't (U)HD video on desktop be what it would improve, perf- and power-wise? Memory bandwidth is WebRender's top problem:
While most recent discreet GPUs can stomach about anything we throw at them, integrated GPUs operate on a much tighter budget and compete with the CPU for memory bandwidth. 4k screens are real little memory bandwidth eating monsters.
Windows seems to have it already. In bug 1479792 it seems they want to do this for WebRender/Mac to improve YouTube (bug 1536515). Links added by me:
This is probably the path that gets taken for things like software-decoded video. We can improve this by having Gecko do the upload (which will be using ClientStorage from an existing code path) and then expose the texture to WebRender as an external-image-native-texture rather than as a bag of bytes. I'll file a separate bug about this. (I thought we already had one on it but I cannot find it at the moment.)
Comment 8•5 years ago
|
||
I don't think we have a bug specifically about reducing the number of copies with ffmpeg on linux. There's one for mac (bug 1403618)
We have implemented it on Windows only.
On Windows we get a software buffer out of the decoder, and the buffer is directly uploaded into a GPU surface before being returned.
Having the same improvement on mac and linux would bring great wins.
In response to:
SW decoding to dmabuf surface is pointless. It combines disadvantages of HW decoding and SW rendering...that may be the most inefficient variant :)
No, I believe this is the opposite, it's sw decoding with HW rendering; and useful across the line not just ffmpeg but also other video decoders (non-ffmpeg such as theora)
That may be useful in particular once usage becomes more broad and we have to start blacklisting HW decoders (it's almost certain this will happen, lots of HW decoder are buggy, return green frames or incorrectly decode some resolutions); we could instead use SW decoding and directly upload into a GPU surface.
Comment 9•5 years ago
|
||
(In reply to Martin Stránský [:stransky] from comment #6)
Yes I know that article...it it possible to have some gain here but frankly I doubt it's worth the work. Even zero-copy vaapi playback over dmabuf does not always create significant difference. The recent SW decode path is pretty well optimized with WebRender/GL compositor. And Chrome uses the same way by default.
With 4K content on Windows, it caused a significant gain in performance when we moved to direct GPU copy.
Rendering of SW decode on linux isn't well optimised at all.
We'll have 7 copies / allocations of the software image before it gets to the GPU if a GPU process is enabled.
It works well now with ffmpeg because it runs in the content process and if wayland the GPU process is currently disabled. In this case we only have one copy into a shareable array.
Add decoder process (RDD) and GPU process and it will go bad.
Reporter | ||
Updated•5 years ago
|
Assignee | ||
Comment 10•5 years ago
•
|
||
(In reply to Jean-Yves Avenard [:jya] from comment #8)
No, I believe this is the opposite, it's sw decoding with HW rendering; and useful across the line not just ffmpeg but also other video decoders (non-ffmpeg such as theora)
That may be useful in particular once usage becomes more broad and we have to start blacklisting HW decoders (it's almost certain this will happen, lots of HW decoder are buggy, return green frames or incorrectly decode some resolutions); we could instead use SW decoding and directly upload into a GPU surface.
Hi Jean-Yves,
I think there's a bit misunderstanding here. I surely agree that sw decoding + hw rendering is the way to go. But IIUC Jan wants to use dmabuf textures instead of shm textures which may be suboptimal. From my understanding the situation is:
ffmpeg -> cpu memory -> SW YUV -> RGB conversion -> upload to texture -> render
I'm not sure where and how SHM memory is used here, I suspect the RGB conversion result is stored as shm surface and transfered across processes and it's uploaded immediately as a texture just before rendering.
I think the optimal way may looks like:
ffmpeg -> cpu memory -> upload Y and UV textures as NV12 (or different) textures -> share as EGLImages -> do the rendering as with vaapi
OTOH dmabuf surfaces are really GPU pieces of memory. It's expensive/difficult to write into it from CPU. Also when we (firefox) writes directly to dmabuf we can't keep optimal texture format so the texture may be slow to render and may use more GPU memory than necessary. It's definitely better to keep GL implementation (or vaapi) to create and manage the GPU textures and use GL code to upload data to textures.
So this bug may be changed to "upload SW decoded frames to NV12 GL texture and do compositing from it directly", right?
Assignee | ||
Comment 11•5 years ago
|
||
Also note that dmabuf works only with intel+amd but the general way with the Y and UV textures would work with all (even proprietary) drivers.
Assignee | ||
Comment 12•5 years ago
|
||
btw. If the video playback on Linux is so badly optimized maybe the SW decoding + HW rendering over EGLImages would be even faster for low-res videos than the vaapi/dmabuf.
Reporter | ||
Comment 13•5 years ago
•
|
||
It would be good to have a fast upload path from ffmpeg to WebRender like Windows has. Improvements for legacy X11 should only be a byproduct of the Wayland work, for better long-term maintenance. As my "knowledge" is restricted to Bugzilla I only assume(d) DMABUF is the desired solution for cross-process sharing.
bug 1129492 comment 24: "Remove X11 connections from content processes in all cases, for security"
It seems X11 Pixmaps must be avoided for sandbox hardening, even if that means Mesa users get the zero-copy dmabuf improvement while proprietary Nvidia sticks by its own wish (corporate policy) to status quo by not supporting it. Firefox as open-source project with much idealistic unpaid labor should have its primary focus on getting the best experience with open-source Mesa.
If I haven't misunderstood recent chat und bugzilla conversations, X11 connections are the top security problem on Linux as they allow further desktop access, so that only WebRender should be allowed to have one. Therefore, there is additional pressure to get WebGL remoting done (bug 1638466). That's why it seemed so sad Rinat didn't ask ealier because his X11 VAAPI prototype is based on X11 Pixmaps, but the DMABUF code path should be used on X11 as well. However, it might help you with bug 1619882 as it demonstrates TextureClientRecycleAllocator
usage, which is used by Windows' fast upload path. Windows actually has two upload paths to WebRender (RenderDXGITextureHostOGL + RenderDXGIYCbCrTextureHostOGL).
WebRender's Native ExternalImage seems to expect YUV, conversion to RGB before handing over to WebRender seems unneccessary.
bug 1493198 added HDR YUV image support to WebRender and Windows' fast upload path.
Btw, jbonisteel wrote in chat last friday: "some Intel folks are interested in making HDR happen in Firefox"
https://gstreamer.freedesktop.org/data/events/gstreamer-conference/2012/omap-dmabuf-gstcon2012.pdf
With wl_drm protocol, we can push YUV buffers directly to server
Pages 26 (X11) and 27 (Wayland) look nearly identical. I don't know if dri2video is a desired option for X11 or if it was only a prototype.
It's totally fine if only Wayland gets certain improvements.
Assignee | ||
Comment 14•5 years ago
|
||
Well, to use fast upload path for SW decode with EGL is more a refactoring work than anything else and may be pretty straightforward, I'm going to look at it when Bug 1619882 is done.
Assignee | ||
Comment 15•5 years ago
|
||
I managed to do the SW decoding to dmabuf NV12 surfaces. It will use texture recycling by default as well as vaapi but it needs some testing if it really brings any advantage over the recent playback via shm.
Assignee | ||
Comment 16•5 years ago
|
||
Updated•5 years ago
|
Assignee | ||
Comment 17•5 years ago
|
||
- Use WaylandDMABufSurface for SW decoded frames when we can convert to NV12 format.
- Implement a cache of WaylandDMABufSurfaces and recycle them.
Depends on D78291
Assignee | ||
Comment 18•5 years ago
|
||
- widget.wayland-dmabuf-video-textures.enabled controls dmabuf texture backend for decoded video frames.
Depends on D78292
Assignee | ||
Comment 19•4 years ago
|
||
Jean-Yves, does gecko support compositing directly from YV12/YUV420p texture format? (that's the default ffmpeg SW decoder output). If so we don't need to compose it to NV12 format then.
Thanks.
Assignee | ||
Comment 20•4 years ago
|
||
PushYCbCrPlanarImage() is exactly what we need here. It allows to push YV12/YUV420p textures directly without any conversion.
Comment 21•4 years ago
|
||
Comment 22•4 years ago
|
||
Backed out 3 changesets (bug 1629788) for WaylandDMABufSurface.cpp related bustages.
Backout link: https://hg.mozilla.org/integration/autoland/rev/829b0316223ece283168b68d22c8ccde701faf8f
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=306208116&repo=autoland&lineNumber=43471
[task 2020-06-13T10:50:33.393Z] 10:50:33 INFO - make[4]: Entering directory '/builds/worker/workspace/obj-build/widget/gtk'
[task 2020-06-13T10:50:33.397Z] 10:50:33 INFO - /builds/worker/fetches/sccache/sccache /builds/worker/fetches/clang/bin/clang++ -std=gnu++17 -o Unified_cpp_widget_gtk0.o -c -I/builds/worker/workspace/obj-build/dist/stl_wrappers -I/builds/worker/workspace/obj-build/dist/system_wrappers -include /builds/worker/checkouts/gecko/config/gcc_hidden.h -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -fstack-protector-strong -ftrivial-auto-var-init=pattern -DDEBUG=1 -DOS_POSIX=1 -DOS_LINUX=1 -DCAIRO_GFX '-DMOZ_APP_NAME="firefox"' -DMOZ_HAS_MOZGLUE -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -DSTATIC_EXPORTABLE_JS_API -I/builds/worker/checkouts/gecko/widget/gtk -I/builds/worker/workspace/obj-build/widget/gtk -I/builds/worker/workspace/obj-build/ipc/ipdl/_ipdlheaders -I/builds/worker/checkouts/gecko/ipc/chromium/src -I/builds/worker/checkouts/gecko/ipc/glue -I/builds/worker/checkouts/gecko/layout/base -I/builds/worker/checkouts/gecko/layout/generic -I/builds/worker/checkouts/gecko/layout/xul -I/builds/worker/checkouts/gecko/other-licenses/atk-1.0 -I/builds/worker/checkouts/gecko/widget -I/builds/worker/checkouts/gecko/widget/headless -I/builds/worker/checkouts/gecko/widget/x11 -I/builds/worker/workspace/obj-build/dist/include -I/builds/worker/workspace/obj-build/dist/include/nspr -I/builds/worker/workspace/obj-build/dist/include/nss -fPIC -DMOZILLA_CLIENT -include /builds/worker/workspace/obj-build/mozilla-config.h -Qunused-arguments -Qunused-arguments -Wall -Wbitfield-enum-conversion -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wshadow-field-in-constructor-modified -Wsign-compare -Wtype-limits -Wunreachable-code -Wunreachable-code-return -Wwrite-strings -Wno-invalid-offsetof -Wclass-varargs -Wempty-init-stmt -Wfloat-overflow-conversion -Wfloat-zero-conversion -Wloop-analysis -Wc++2a-compat -Wcomma -Wimplicit-fallthrough -Wunused-function -Wunused-variable -Werror=non-literal-null-conversion -Wstring-conversion -Wtautological-overlap-compare -Wtautological-unsigned-enum-zero-compare -Wtautological-unsigned-zero-compare -Wno-error=tautological-type-limit-compare -Wno-inline-new-delete -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-error=backend-plugin -Wno-error=return-std-move -Wno-error=atomic-alignment -Wformat -Wformat-security -Wno-gnu-zero-variadic-macro-arguments -Wno-unknown-warning-option -D_GLIBCXX_USE_CXX11_ABI=0 -fno-sized-deallocation -fno-aligned-new -fcrash-diagnostics-dir=/builds/worker/artifacts -fno-exceptions -fno-strict-aliasing -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -pthread -pipe -g -Xclang -load -Xclang /builds/worker/workspace/obj-build/build/clang-plugin/libclang-plugin.so -Xclang -add-plugin -Xclang moz-check -Os -fno-omit-frame-pointer -funwind-tables -Werror -I/builds/worker/checkouts/gecko/widget/gtk/compat-gtk3 -pthread -I/usr/include/gtk-3.0/unix-print -I/usr/include/gtk-3.0 -I/usr/include/at-spi2-atk/2.0 -I/usr/include/at-spi-2.0 -I/usr/include/dbus-1.0 -I/usr/lib/x86_64-linux-gnu/dbus-1.0/include -I/usr/include/gtk-3.0 -I/usr/include/cairo -I/usr/include/pango-1.0 -I/usr/include/harfbuzz -I/usr/include/pango-1.0 -I/usr/include/atk-1.0 -I/usr/include/cairo -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng12 -I/usr/include/gdk-pixbuf-2.0 -I/usr/include/libpng12 -I/usr/include/gio-unix-2.0/ -I/usr/include/glib-2.0 -I/usr/lib/x86_64-linux-gnu/glib-2.0/include -pthread -I/usr/include/gtk-3.0 -I/usr/include/at-spi2-atk/2.0 -I/usr/include/at-spi-2.0 -I/usr/include/dbus-1.0 -I/usr/lib/x86_64-linux-gnu/dbus-1.0/include -I/usr/include/gtk-3.0 -I/usr/include/gio-unix-2.0/ -I/usr/include/cairo -I/usr/include/pango-1.0 -I/usr/include/harfbuzz -I/usr/include/pango-1.0 -I/usr/include/atk-1.0 -I/usr/include/cairo -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng12 -I/usr/include/gdk-pixbuf-2.0 -I/usr/include/libpng12 -I/usr/include/glib-2.0 -I/usr/lib/x86_64-linux-gnu/glib-2.0/include -I/usr/include/libdrm -I/usr/include/dbus-1.0 -I/usr/lib/x86_64-linux-gnu/dbus-1.0/include -I/usr/include/glib-2.0 -I/usr/lib/x86_64-linux-gnu/glib-2.0/include -Wno-error=shadow -fexperimental-new-pass-manager -MD -MP -MF .deps/Unified_cpp_widget_gtk0.o.pp Unified_cpp_widget_gtk0.cpp
[task 2020-06-13T10:50:33.398Z] 10:50:33 INFO - In file included from Unified_cpp_widget_gtk0.cpp:119:
[task 2020-06-13T10:50:33.398Z] 10:50:33 ERROR - /builds/worker/checkouts/gecko/widget/gtk/WaylandDMABufSurface.cpp:815:40: error: comparison of integers of different signs: 'uint32_t' (aka 'unsigned int') and 'int' [-Werror,-Wsign-compare]
[task 2020-06-13T10:50:33.398Z] 10:50:33 INFO - if (!yData || mMappedRegionStride[0] != mWidth[0]) {
[task 2020-06-13T10:50:33.400Z] 10:50:33 INFO - ~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~
[task 2020-06-13T10:50:33.400Z] 10:50:33 ERROR - /builds/worker/checkouts/gecko/widget/gtk/WaylandDMABufSurface.cpp:821:41: error: comparison of integers of different signs: 'uint32_t' (aka 'unsigned int') and 'int' [-Werror,-Wsign-compare]
[task 2020-06-13T10:50:33.400Z] 10:50:33 INFO - if (!nvData || mMappedRegionStride[1] != mWidth[1] * 2) {
[task 2020-06-13T10:50:33.400Z] 10:50:33 INFO - ~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~
[task 2020-06-13T10:50:33.400Z] 10:50:33 INFO - 2 errors generated.
[task 2020-06-13T10:50:33.400Z] 10:50:33 INFO - /builds/worker/checkouts/gecko/config/rules.mk:746: recipe for target 'Unified_cpp_widget_gtk0.o' failed
[task 2020-06-13T10:50:33.400Z] 10:50:33 ERROR - make[4]: *** [Unified_cpp_widget_gtk0.o] Error 1
[task 2020-06-13T10:50:33.400Z] 10:50:33 INFO - make[4]: Leaving directory '/builds/worker/workspace/obj-build/widget/gtk'
[task 2020-06-13T10:50:33.400Z] 10:50:33 INFO - make[4]: *** Waiting for unfinished jobs....
Assignee | ||
Comment 23•4 years ago
|
||
Updated, Thanks.
try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9400ced514e5519af73148d20e45d257f33d13e3
Comment 24•4 years ago
|
||
Comment 25•4 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/0c7df6f9b0c1
https://hg.mozilla.org/mozilla-central/rev/8a463700d169
https://hg.mozilla.org/mozilla-central/rev/b4fee14fe36a
Comment 26•4 years ago
|
||
Hi, as a Linux user, I thank you for this work !
Could you please sum up the final work since this is quite technical ?
I've understood that all these things were explored :
- reducing the number of copies
- avoiding conversion between color formats
- enabling hardware rendering in the absence of hardware decoding
and that the whole work will land in mozilla79 behind the pref widget.wayland-dmabuf-video-textures.enabled
Am I right ?
Assignee | ||
Comment 27•4 years ago
|
||
(In reply to antistress from comment #26)
Hi, as a Linux user, I thank you for this work !
Could you please sum up the final work since this is quite technical ?
I've understood that all these things were explored :
- reducing the number of copies
- avoiding conversion between color formats
- enabling hardware rendering in the absence of hardware decoding
and that the whole work will land in mozilla79 behind the pref widget.wayland-dmabuf-video-textures.enabled
This patch is not finished yet, I need to tweak how the dmabuf surfaces are created and also implement YUV420p direct rendering without format conversion. Right now we convert YUV420p to NV12.
Reporter | ||
Comment 28•4 years ago
|
||
bug 1653409 implements this on macOS.
Assignee | ||
Comment 29•4 years ago
|
||
This should be finished now as Bug 1656436 landed. I briefly test it (plain profile, enabled webrender and set media.ffmpeg.dmabuf-textures.enabled to true) but I don't see any performance improvements, at least on my box.
Reporter | ||
Comment 30•4 years ago
|
||
Desktop computer, Gnome Xwayland/MOZ_X11_EGL, Debian Testing, 2560x1440@60Hz, Intel HD Graphics 630 (KBL GT2)
https://www.youtube.com/watch?v=TvWcU3aztmo plays in 4320p mostly fluently:
- If media.ffmpeg.dmabuf-textures.enabled is false, my mouse pointer doesn't feel as reactive as usual and as long the video actually plays and doesn't pause there is a massive delay between hovering bookmark toolbar entries and being shown as highlighted.
- If media.ffmpeg.dmabuf-textures.enabled is true, nothing feels slow, the video just slightly hangs sometimes as it would anyway do.
So thank you very much, it has an effect! ;) I was so unsure about this, I tested it the last 15 minutes again and again and always restarted Nightly to apply the pref change.
Reporter | ||
Comment 31•4 years ago
|
||
Side note: What I wasn't aware of: At least my Intel CPU supports hardware video decoding apparently only up to 4K, not 8K. So it's even nicer that media.ffmpeg.dmabuf-textures.enabled can prevent a sluggish UI with 8K videos. With this command you can see GPU workload, including hardware video decoding:
$ sudo apt install intel-gpu-tools && sudo intel_gpu_top
Assignee | ||
Comment 32•4 years ago
|
||
Okay, thanks for testing. Jean-Yves, is there any way how to propagate actual media playback setup to about:support? Does Windows support for instance? I mean if the playback is HW accelerated and so.
Thanks.
Comment 33•4 years ago
|
||
(In reply to Martin Stránský [:stransky] from comment #32)
Okay, thanks for testing. Jean-Yves, is there any way how to propagate actual media playback setup to about:support? Does Windows support for instance? I mean if the playback is HW accelerated and so.
Thanks.
We used to, but that code got ripped out as about:support now runs in a different process and passing that data required changes.
Right now, the only way to check is per video, using the media devtools ; https://addons.mozilla.org/en-US/firefox/addon/devtools-media-panel/
Description
•