Closed Bug 1713202 Opened 3 years ago Closed 3 years ago

Investigate zero-copy uploads for SW-WR on unified memory GPUs / VGEM

Categories

(Core :: Graphics: WebRender, enhancement)

Tracking

RESOLVED WONTFIX

People

(Reporter: rmader, Unassigned)

References

(Blocks 1 open bug)

Details

On certain very common architectures (e.g. Intel) we may be able to avoid texture uploads/copies from SHM for SW-WR:

(21:06:05) robert_mader: emersion, daniels: talking about shm buffer handling, has one of you looked into https://software.intel.com/content/www/us/en/develop/articles/zero-copy-texture-uploads-in-chrome-os.html before? Looks like on intel and similar we could avoid texture uploads from shm altogether
(21:08:11) daniels: robert_mader: we can't really avoid uploads from shm, because you need to alloc a BO from vgem - it doesn't give you a way to wrap vaddr+len into a BO
(21:08:28) daniels: so clients can do that today if they alloc using vgem and send via dmabuf
(21:09:04) robert_mader: oh cool - so all compositors support this in theory?
(21:16:02) jadahl: daniels: sounds like something gtk3 could benefit from
(21:16:15) jadahl: as it'll likely never see any hw acceleration
(21:17:52) daniels: robert_mader: yeah, as long as your driver can import from vgem, which is ... not all of them
(21:17:55) daniels: jadahl: EFL does this too
(21:39:13) emersion: i don't see vgem loaded often
(21:40:51) daniels: a lot of people don't ship it because it will generally claim card0 and annoy dumb userspace
(21:41:27) daniels: also it's an unguarded interface which allows userspace to allocate memory outwith ulimit
(21:43:42) emersion: allocations not counted towards the allocating process?
(21:44:59) daniels: right, they don't accrue towards your memory limit
(21:45:13) daniels: (insert opinion about practical usefulness of ulimit here)

See Also: → 1708416

Actually, the article is about dmabuf surfaces; that's already implemented for the GL compositor as dmabuf textures.
I also removed it from WaylandSurfaces as it was significantly slower than shm. And I wouldn't spend time on the SW backends, as the improvements are tiny compared to the gain we get when we enable the fully accelerated backend.

Also, AFAIK Chrome itself does not use it, although it's implemented there and you can enable it.

IIUC it's not just a common DMABUF (which is what you tried, IIRC), but a special one using VGEM. Or did you already try that as well? Apparently it's not yet commonly available in Linux distros because of certain limitations.

But yeah, I agree that HW-WR is more important. I just like to test things on old hardware, including pre GL 3 :)

(In reply to Robert Mader [:rmader] from comment #3)

IIUC it's not just a common DMABUF (which is what you tried, IIRC), but a special one using VGEM. Or did you already try that as well? Apparently it's not yet commonly available in Linux distros because of certain limitations.

I have no idea what VGEM is. But reading the 'Chrome OS solution: VGEM and the hardware backed GpuMemoryBuffer' section, it looks like the regular dmabuf objects we already use (which are backed by GEM internally AFAIK, and you get the fd via PRIME).

Robert, I think the zero-copy concept makes sense for rapidly changing content - video playback, canvas rendering, WebGL. I don't think we'd get a significant gain for regular web pages / menus etc.

I was already considering it for video playback - we can allocate a dmabuf surface and let ffmpeg decode video frames directly into it. That would remove the intermediate copy from av_frame to the dmabuf. AFAIK ffmpeg already provides an API for this. Also, WebGL uses dmabuf by default (on Wayland/EGL).

Using dmabuf surfaces as textures directly (without the GL API) has some disadvantages: we can't use modifiers (texture tiling, extra planes, compression), so rendering from linear dmabuf surfaces may carry a performance penalty compared to textures created/uploaded by GL.

I filed Bug 1713276 for the zero-copy SW video decode.

Looks like reading from a dmabuf (at least on amdgpu, which I'm using) is very slow. The in-place decoding code (Bug 1713276) uses 700% CPU, while decoding to regular RAM uses 100% CPU. The only working scenario is a plain memmove (or, I suspect, any pure write) to the dmabuf, which seems to be fast.

So if you implement it, make sure you don't use any read operation over the dmabuf (like add/xor or similar), although I'm still skeptical about the real benefits here. In my experience the best and well-optimized path is to use EGL: create a texture / EGL image over the dmabuf and use EGL to upload the data there.

Thanks for having a look! IIUC you are right, and this approach only/mostly makes sense on systems that share CPU and GPU memory - and with proper driver support (probably only when VGEM is loaded). Simply mmapping dmabufs on AMD has been reported to be very slow for GNOME Shell screen sharing too. So it's probably not worth pursuing further ATM, until the driver situation improves.

Assignee: robert.mader → nobody
Status: ASSIGNED → NEW

Closing for now - the driver situation seems far from usable yet and the benefit is likely small; we already have zero-copy for HW accelerated rendering.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX