Closed Bug 1619882 Opened 4 years ago Closed 4 years ago

High CPU usage and Jittery/stuttering video playback when using VA-API

Categories

(Core :: Audio/Video: Playback, defect, P5)

Desktop
Linux
defect

Tracking

()

RESOLVED FIXED
mozilla78
Tracking Status
firefox-esr68 --- unaffected
firefox73 --- unaffected
firefox74 --- unaffected
firefox75 --- disabled
firefox76 --- disabled
firefox77 --- disabled
firefox78 --- fixed

People

(Reporter: yoasif, Assigned: stransky)

References

(Blocks 2 open bugs)

Details

(Keywords: correctness, nightly-community)

Attachments

(9 files)

Steps to reproduce:

  1. Navigate to https://twitter.com/thehill/status/1235022399298400256
  2. Play video

See video for what it looks like.

See attached log from MOZ_LOG="PlatformDecoderModule:5"

I am running GNOME 3.35.91 on Ubuntu Focal Fossa.

Attached file about:support

I see the same behaviour with the same GPU. I noticed that in my case it does not happen if I fullscreen the video.

I can confirm this with GPU Intel 520, Mesa 19.3.4, Manjaro OS

It's a stretch, but is everyone by any chance using Gnome 3.36 or Gnome 3.34 on Arch Linux?

I noticed that the workaround in bug 1609538 (restoring/un-maximizing the window) temporarily fixes the jitter problem, and it didn't seem to happen under Sway.

I'm using Pop!_OS 19.10 with gnome 3.34

I also see this on Ubuntu 19.10 when there is a changing video overlay, like when the player controls appear or when scanning across the seekbar to see thumbnails. If there is no overlay the video plays perfectly.

Arch Linux latest, Gnome Shell 3.36.0, Intel graphics from Skylake i7-6700HQ, Nightly 75.0a1 (2020-03-09) (64-bit): Can confirm that jittery video only occurs while an overlay is displayed and goes away (I think) immediately when the overlay is gone.

Bug 1619258 may help here, as it exports a flag indicating that HW playback is performed and that the compositor should treat it differently.

I also have this issue. Ubuntu 19.10 on Wayland, Skylake HD 515 graphics, GNOME 3.34.2. Not sure which logs, if any, would be helpful.

When I am watching a video in full screen, the flickering only happens when I hover over the controls/on-screen display. I notice it is a lot worse on twitch.tv.

I also see this on Ubuntu 19.10 when there is a changing video overlay, like when the player controls appear or when scanning across the seekbar to see thumbnails. If there is no overlay the video plays perfectly.

For me this issue is reproducible even when the YouTube controls are hidden.

Intel Core i3-6100U with Intel HD 520. Ubuntu 20.04 with Mesa 20.0.4 and Gnome Shell 3.36.0.

By the way, when using picture-in-picture, glitchy videos play normally once the video is floating by itself.

(In reply to Brandon Watkins from comment #21)

In 77 nightly, I no longer see this issue at all on youtube, however, I still see the issue on twitch.tv (but only when mousing over the video to bring up the controls)

Yeah, this is a bit odd. I noticed that when set to auto quality the video was jittery, but when I picked 1080p60 FPS (source) the jittering disappeared.
BUT the "workaround" is not that simple... It seems to depend on the stream a bit: Lirik's stream worked flawlessly on source quality, while Dansgaming's stream had some jitter.

Some followup... strangely, it sometimes seems to work for a while, then it starts to lag or jitter. I haven't really found any obvious reason for it.
When it lags it behaves kind of like a slow CPU doing software decoding. I increased EXTRA_HW_FRAMES to 12 instead of 6; it made no difference from what I could tell. I also tried 0, no difference.
For me, using Nightly and KDE Wayland, the video seems to jitter or lag regardless of whether the video overlay controls are shown or not.
Same behavior on two different distributions and laptops.

Same issue here but no difference between YouTube or not.
@Pedro, maybe you are using VP9 on YT and your hardware doesn't support decoding it?

The issue is present on both F76 release and nightly (today's version).
I have a ThinkPad T440 with i5-4200U on Archlinux.
No specific configuration other than text scaling factor in Gnome Tweaks (set to 1.25), which could be the cause.

I don't know which information could be interesting to help solve this bug.

I did a test with RX 5700 XT under stock Ubuntu 20.04 with Firefox 76 and results are stunning, no stuttering at all with 4k 60 fps videos. Doesn't matter if fullscreen or not. Tried same video with mpv and again no stutters.
```
mpv https://www.youtube.com/watch?v=LXb3EKWsInQ
 (+) Video --vid=1 () (vp9 3840x2160 59.940fps)
 (+) Audio --aid=1 --alang=eng () (opus 2ch 48000Hz)
[vo/gpu/wayland] GNOME's wayland compositor is known to have many serious issues with mpv. Switch to GNOME's xorg session for the best experience.
mesa: for the --simplifycfg-sink-common option: may only occur zero or one times!
mesa: for the --global-isel-abort option: may only occur zero or one times!
AO: [pulse] 48000Hz stereo 2ch float
Using hardware decoding (vaapi).
VO: [gpu] 3840x2160 vaapi[p010]
AV: 00:00:18 / 00:05:13 (6%) A-V: 0.000 Cache: 42s/150MB
```

But this is not the case on an Intel laptop with an i7-7500U. Video stutters in Firefox, while mpv only has some dropped frames (video is much smoother in it):
```
mpv https://www.youtube.com/watch?v=LXb3EKWsInQ
 (+) Video --vid=1 () (vp9 3840x2160 59.940fps)
 (+) Audio --aid=1 --alang=eng () (opus 2ch 48000Hz)
[vo/gpu/wayland] GNOME's wayland compositor is known to have many serious issues with mpv. Switch to GNOME's xorg session for the best experience.
AO: [pulse] 48000Hz stereo 2ch float
Using hardware decoding (vaapi).
VO: [gpu] 3840x2160 vaapi[p010]
AV: 00:00:20 / 00:05:13 (6%) A-V: 0.000 Dropped: 9 Cache: 42s/150MB
```
There are some tickets also mentioning this:
https://gitlab.freedesktop.org/xorg/xserver/-/issues/928
https://gitlab.freedesktop.org/drm/intel/-/issues/629

Mmm, what I think is weird is that it also sometimes works, at least for a while.
I have to mention that Chromium's "hacked in" VA-API support works fine all the time, at least under Xorg, no stuttering whatsoever.

There are so many potential sources of errors here, Intel Wayland drivers, Kernel drivers, Wayland itself, Firefox's implementation... etc.

Also, when playing a 1080p h264 video in VLC (Wayland) the CPU usage is very low... so maybe it isn't an intel video driver issue after all.

I never use 4k video (as I have a 1080p display), but I can definitely say that I never have issues with vaapi playback in mpv

Also, interestingly I've found that in firefox, I have no jittering at all on netflix with vaapi enabled, even when mousing over the controls etc...

(In reply to Brandon Watkins from comment #38)

Also, interestingly I've found that in firefox, I have no jittering at all on netflix with vaapi enabled, even when mousing over the controls etc...

The video stream on Netflix and other paid services is decoded by widevine, there's no hardware decoding available on Linux for these services.

(In reply to kurmikon from comment #39)

(In reply to Brandon Watkins from comment #38)

Also, interestingly I've found that in firefox, I have no jittering at all on netflix with vaapi enabled, even when mousing over the controls etc...

The video stream on Netflix and other paid services is decoded by widevine, there's no hardware decoding available on Linux for these services.

Widevine is just a drm service, I don't think that it has anything to do with either software/hardware decoding on any platform, APIs like vaapi, dxva and others take care of that... or am I wrong here?

@Gnomalex, my Intel Comet Lake can decode VP9. I had to enable it by disabling media.ffvpx.enabled; I see it is loaded in the logs when running MOZ_ENABLE_WAYLAND=1 firefox-trunk.

As @rickard says, it sometimes works normally for a while. In the latest nightlies you see less flickering, but it is still present.
Chromium VA-API and mpv VA-API work very well.

I also see the issue, with these observations:

  • Happens when video is played online from youtube for instance. When it's loaded from disk it works ok.
  • Tends to break when mouse is moved across the video and video control is rendered.
  • It's broken in optimized builds, works ok in debug builds.

All those symptoms look like the vaapi frame holder (Bug 1616892) releases the HW surfaces too early, and they're reused while they're still in the FF rendering queue.
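The suspected failure mode (a HW surface handed back to the decoder while the compositor still references it) can be sketched with shared ownership. The names below are hypothetical models, not the actual Gecko classes:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical model of a hardware surface pool: a surface may only be
// handed back to the decoder when nothing else holds a reference to it.
struct HwSurface {
  int id;
};

class SurfacePool {
 public:
  explicit SurfacePool(int aCount) {
    for (int i = 0; i < aCount; i++) {
      mSurfaces.push_back(std::make_shared<HwSurface>(HwSurface{i}));
    }
  }
  // Hand out a surface that is not referenced outside the pool.
  // Returning nullptr forces the decoder to wait instead of scribbling
  // over a frame that is still queued for compositing.
  std::shared_ptr<HwSurface> Acquire() {
    for (auto& s : mSurfaces) {
      if (s.use_count() == 1) {  // only the pool itself holds it
        return s;
      }
    }
    return nullptr;
  }
 private:
  std::vector<std::shared_ptr<HwSurface>> mSurfaces;
};
```

The bug corresponds to dropping the reference while the frame is still in the rendering queue: the decoder then reacquires and overwrites the surface, and frames appear out of order.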

Assignee: nobody → stransky

Also it affects both vp9 and h264 codecs.

(In reply to Martin Stránský [:stransky] from comment #42)

All those symptoms look like the vaapi frame holder (Bug 1616892) releases the HW surfaces too early, and they're reused while they're still in the FF rendering queue.

It doesn't seem to be related; when vaapi surfaces are wrongly reused, various color artifacts are shown as the YUV planes are mixed together.
This one looks like some problem with the texture cache/timing or so, where the whole frame is held. It is also reproducible with both the GL compositor and WebRender.

Sotaro, do you know if any special flags / config is used for texture clients from hw decoders to upload the textures asap or so?
Thanks.

Flags: needinfo?(sotaro.ikeda.g)
Attached file 4k_costa_rica_error

(In reply to Martin Stránský [:stransky] from comment #42)

I also see the issue, with these observations:

  • Happens when video is played online from youtube for instance. When it's loaded from disk it works ok.

Just for information, I can confirm this too.

Not all videos can be played from disk, attached 4k_costa_rica_error log with this video https://www.youtube.com/watch?v=LXb3EKWsInQ downloaded with youtube-dl.

(In reply to Martin Stránský [:stransky] from comment #44)

Sotaro, do you know if any special flags / config is used for texture clients from hw decoders to upload the textures asap or so?
Thanks.

Sorry, I do not know about it :( But before doing rendering, the texture needs to be ready. Is WaylandDMABufSurface::FenceWait() provided for it? At RenderDXGITextureHostOGL, the wait is done with IDXGIKeyedMutex on Windows. On Android, a Fence is used for it.

Flags: needinfo?(sotaro.ikeda.g)

(In reply to Sotaro Ikeda [:sotaro] from comment #47)

(In reply to Martin Stránský [:stransky] from comment #44)

Sotaro, do you know if any special flags / config is used for texture clients from hw decoders to upload the textures asap or so?
Thanks.

Sorry, I do not know about it :( But before doing rendering, the texture needs to be ready. Is WaylandDMABufSurface::FenceWait() provided for it? At RenderDXGITextureHostOGL, the wait is done with IDXGIKeyedMutex on Windows. On Android, a Fence is used for it.

Thanks, that may be the point. WaylandDMABufSurface::FenceWait() is not implemented.

Actually WaylandDMABufSurface::FenceWait() is implemented. I'll check how it's used.
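The role of the fence Sotaro describes can be illustrated with a CPU-side analogue (a sketch only; the real WaylandDMABufSurface::FenceWait() synchronizes against the GPU, not another thread):

```cpp
#include <cassert>
#include <future>
#include <thread>

// CPU-side analogue of a fence: the consumer (compositor) must not
// sample the surface until the producer (decoder) signals that the
// write has finished.
int DecodeThenSample() {
  int surface = 0;                 // stands in for the dmabuf contents
  std::promise<void> fence;
  std::future<void> waitable = fence.get_future();

  std::thread decoder([&] {
    surface = 42;                  // decoder finishes writing the frame
    fence.set_value();             // fence signaled: contents are valid
  });

  waitable.wait();                 // FenceWait(): block until signaled
  int sampled = surface;           // now safe to read the texture
  decoder.join();
  return sampled;
}
```

Without the wait, the consumer could read a half-written frame; the promise/future pair establishes the same happens-before ordering a GPU fence would.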

(In reply to Eduard Kachur from comment #46)

(In reply to Martin Stránský [:stransky] from comment #42)

I also see the issue, with these observations:

  • Happens when video is played online from youtube for instance. When it's loaded from disk it works ok.

Just for information, I can confirm this too.

Not all videos can be played from disk, attached 4k_costa_rica_error log with this video https://www.youtube.com/watch?v=LXb3EKWsInQ downloaded with youtube-dl.

Eduard, this is Bug 1632456 (a kind of out-of-GPU-memory bug).

Looks like there's also expensive CPU usage when the controls (on youtube for instance) are drawn. When I play video without the controls I see ~8% CPU usage, but with controls I see ~14%. I wonder what causes that; maybe the video is rendered to some temporary buffer or so.

Summary: Jittery/stuttering video playback when using VA-API → High CPU usage and Jittery/stuttering video playback when using VA-API and video control is shown

(In reply to Martin Stránský [:stransky] from comment #50)

Looks like there's also expensive CPU usage when the controls (on youtube for instance) are drawn. When I play video without the controls I see ~8% CPU usage, but with controls I see ~14%. I wonder what causes that; maybe the video is rendered to some temporary buffer or so.

Is bug 1617498 needed to get the improvements of bug 1579235?

This bug affects GL compositor as well.

Gnome Wayland, Debian Testing, Intel Iris 6100 (Broadwell GT3), 2560x1600
My STR for intermittently stuttering video: Repeatedly hover the progress bar to make stuttering more likely.
$ pip3 install --upgrade mozregression
$ MOZ_ENABLE_WAYLAND=1 MOZ_LOG="PlatformDecoderModule:5" mozregression --launch 2020-05-13 --pref gfx.webrender.all:true media.webm.enabled:false widget.wayland-dmabuf-vaapi.enabled:true security.sandbox.content.level:0 --process-output stdout -a https://www.youtube.com/embed/PvKiWRTSAzg

Disabling gfx.webrender.picture-caching does not further reduce the already low CPU usage of 4-8%, and it does not fix the problem.
It can stutter regardless of video resolution or video control visibility.
I have watched 1080p in a loop: "Double right click > Full screen" does not have video controls, but stutters as well if my cursor is not yet hidden.

Glenn, do you have any idea why there's such overhead when the video is played with controls? It happens with both the GL and WebRender compositors.
Thanks.

Flags: needinfo?(gwatson)

When SW video decoding with HW video playback is used, I see the CPU usage go up, but I don't see any jittering when the controls are shown.

When there are overlay controls on a video with WebRender enabled, we do draw the tiles that intersect the video in a (slightly) less efficient way than when the video has nothing on top of it.

The picture cache tiles that intersect the video (if it was promoted to a compositor surface) are switched from opaque blend to alpha blend and drawn on top of the video surface, with an alpha cutout. This doesn't add much overhead, but it is a difference when controls are present.

However, this has no effect on the GL compositor, so it seems unlikely to be the cause here (unless there are two separate causes with similar symptoms).

I suspect in the GL compositor mode, the layer config looks quite different, but I don't know much about how that code works, sorry.

Flags: needinfo?(gwatson)

(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #51)

(In reply to Martin Stránský [:stransky] from comment #50)

Looks like there's also expensive CPU usage when the controls (on youtube for instance) are drawn. When I play video without the controls I see ~8% CPU usage, but with controls I see ~14%. I wonder what causes that; maybe the video is rendered to some temporary buffer or so.

Is bug 1617498 needed to get the improvements of bug 1579235?

No, we shouldn't need that bug implemented - the improvements in 1579235 should allow the video to be composited directly into the main framebuffer without being drawn to the picture tiles.

We should add some debug information to the picture caching debug overlay about compositor surfaces.

In the interim, it should be possible to test that the video is being drawn as a compositor surface with WebRender enabled by enabling the picture cache debug overlay (gfx.webrender.debug.picture-caching) - what we would hope is that the picture cache tiles under / over the video are staying green and not constantly invalidating.

I wonder if there are any tools we can use on Linux to see exactly what hardware packets the GPU is scheduling / working on across processes, similar to how GPUView works on Windows. I wonder if the Intel GPA tools allow anything like that? Or maybe the nvidia nsight tools if running the proprietary driver?

(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #53)

Gnome Wayland, Debian Testing, Intel Iris 6100 (Broadwell GT3), 2560x1600
My STR for intermittently stuttering video: Repeatedly hover the progress bar to make stuttering more likely.
$ pip3 install --upgrade mozregression
$ MOZ_ENABLE_WAYLAND=1 MOZ_LOG="PlatformDecoderModule:5" mozregression --launch 2020-05-13 --pref gfx.webrender.all:true media.webm.enabled:false widget.wayland-dmabuf-vaapi.enabled:true security.sandbox.content.level:0 --process-output stdout -a https://www.youtube.com/embed/PvKiWRTSAzg

Top row is yellow, middle row is green, bottom is mostly yellow, but the progress bar and preview thumbnail are of course red. In rare cases the whole content is red for a short moment, but the video is smooth then. When the video stutters, the middle-row tiles are still green. The stuttering looks as if frames were shown in the wrong order for a second. The YouTube video stats say no frames were dropped. The WR slow-frame indicator never moves. The only thing I noticed was that Content Send Time is often briefly red around the time stuttering occurs.
https://perfht.ml/2LrYxk3 From hovering profiler screenshots it seems the most severe stuttering is between 7s and 11s.

You see these 3 red markers in "Web Content". Hover over the screenshots; it seems to be exactly the moment it stutters.

Interesting find!

It does seem to correlate with the skips. I don't know too much about the web content process, but if I expand the profile stacks, it looks like there may be some JS running that takes a really long time to complete? Markus might be able to help identify what's happening here, or point us to someone who would know more about that.

Flags: needinfo?(mstange)

(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #59)

https://perfht.ml/2LrYxk3 From hovering profiler screenshots it seems the most severe stuttering is between 7s and 11s.

Jan, Thanks a lot for the profile data!

I went through it and compared the pictures, and it looks like the video images are drawn in the wrong order. That points back to the vaapi decoder, as it's reproducible with vaapi only.

There's a great sample between 7.97 s and 7.99 s where the alternation is clearly visible and there isn't much other activity there.

It looks to me that this affects h264 streams only. I can't reproduce it with vp9 HW vaapi streams... can anyone confirm that?

(In reply to Martin Stránský [:stransky] from comment #63)

It looks to me that this affects h264 streams only. I can't reproduce it with vp9 HW vaapi streams... can anyone confirm that?

No, it affects vp9 too, just to a lesser extent. For me it affects vp9 only when the controls are enabled, and the frame skipping/bumped CPU usage is slight compared to H264, but it is there.

Is the problem caused by JS, content process priority, audio, animations or some upload of other textures? It could be interesting to test a try build of bug 1595994. (It looks like it was postponed to improve Jitsi WebRTC compatibility first?)
In comparison, playing https://4kmedia.org/lg-landscape-uhd-4k-demo/ (H264) from file:// is totally fine: https://perfht.ml/2T7Ve5P
VLC (DRM VAAPI) stutters sometimes, but Nightly is buttery smooth the whole video long.

I tested a patch where only decoding is done by VAAPI and then the decoded frames are downloaded/copied as with the SW path, and I see the jittering too.
So the difference is only vaapi decoding, not rendering.

When media is played over youtube, only media fragments are downloaded (not the whole video at once, AFAIK), which may cause the issue. Maybe vaapi needs more time ahead for decoding or so?

Jean, do you have any idea here? Can the media decoding process be measured somehow?
Thanks.

Flags: needinfo?(jyavenard)

There seems to be some variation in symptoms reported here, so I'll report mine to make sure we're all talking about the same bug. My APU is a Pentium G4560, which has 2 cores/4 threads, and the GPU includes support for VP9. I use GNOME on Arch Linux. On most sites, including twitch, but excluding youtube, I experience the jittering, whether or not the video is fullscreen or has controls overlaid. Switching between fullscreen, maximised etc doesn't fix it. I haven't really checked CPU usage, except to confirm there was a big improvement compared to software decoding the first time I tried it on youtube.

Didn't some other people here also report that VP9 was affected, but not H264? That's probably why I'm not seeing this problem on youtube. I'm fairly sure my PC is using VAAPI for VP9 though, so software rendering isn't the reason that's working properly here.

I tested vp9 but I can't say I noticed anything wrong with it. I didn't play it for very long tho'. The CPU usage was higher without MOZ_ENABLE_WAYLAND=1, but it wasn't drastically lowered with it enabled, which almost made me wonder if HW decoding was doing its thing. Is there any way to check that HW decoding is running on the currently played media? Something like Chrome's media-internals.

(In reply to rickard from comment #68)

I tested vp9 but I can't say I noticed anything wrong with it. I didn't play it for very long tho'. The CPU usage was higher without MOZ_ENABLE_WAYLAND=1, but it wasn't drastically lowered with it enabled, which almost made me wonder if HW decoding was doing its thing. Is there any way to check that HW decoding is running on the currently played media? Something like Chrome's media-internals.

Run on the command line with MOZ_LOG="PlatformDecoderModule:5" and examine the log. It should contain info about vaapi usage. The CPU savings aren't so big when you have a fairly powerful CPU and/or play low-res clips.

(In reply to Martin Stránský [:stransky] from comment #69)

Run on the command line with MOZ_LOG="PlatformDecoderModule:5" and examine the log. It should contain info about vaapi usage. The CPU savings aren't so big when you have a fairly powerful CPU and/or play low-res clips.

Mmm, for h264 I see VA-API is being used, but I don't see anything about it when it plays vp9 content.
I noticed the hardware decode support is only partial on Broadwell CPUs; it seems like full HW decoding isn't supported until Apollo Lake and above. vainfo shows the vp9 profile tho', but it might be a bit misleading.

Attached video youtube.mp4

(In reply to Martin Stránský [:stransky] from comment #66)

When media is played over youtube, only media fragments are downloaded (not the whole video at once, AFAIK), which may cause the issue. Maybe vaapi needs more time ahead for decoding or so?

I can reproduce this problem even with a downloaded YouTube video. I switched various prefs and it didn't help.

$ MOZ_ENABLE_WAYLAND=1 MOZ_LOG="PlatformDecoderModule:5" mozregression --launch 2020-05-13 --pref gfx.webrender.all:true media.webm.enabled:false widget.wayland-dmabuf-vaapi.enabled:true security.sandbox.content.level:0 media.mediasource.enabled:false layers.async-pan-zoom.enabled:false javascript.enabled:false widget.wayland-smooth-rendering:true widget.wayland-dmabuf-textures.enabled:true media.decoder.recycle.enabled:true media.decoder.skip-to-next-key-frame.enabled:false media.ffvpx.enabled:false media.media-capabilities.enabled:false media.memory_cache_max_size:65536 media.navigator.enabled:false media.peerconnection.enabled:false media.rdd-opus.enabled:false media.seekToNextFrame.enabled:false media.video-queue.default-size:50 media.video-queue.send-to-compositor-size:99999 media.videocontrols.picture-in-picture.enabled:false svg.disabled:true plugin.disable:true layout.interruptible-reflow.enabled:false layout.display-list.retain:false layout.display-list.retain.chrome:false layout.css.scroll-anchoring.enabled:false layers.omtp.enabled:false image.mem.shared:false gfx.webrender.enable-item-cache:false dom.animations.offscreen-throttling:false canvas.path.enabled:false canvas.focusring.enabled:false canvas.filters.enabled:false canvas.capturestream.enabled:false privacy.resistFingerprinting:true --process-output stdout -a file:///home/darkspirit/Downloads/youtube.mp4

It is more likely to happen if I set the play speed to 2x and continuously show and hide the progress bar by moving the mouse up and down.

(In reply to Marko from comment #40)

Widevine is just a drm service, I don't think that it has anything to do with either software/hardware decoding on any platform, APIs like vaapi, dxva and others take care of that... or am I wrong here?

Widevine does the video and audio decoding as well as decryption.

(In reply to Martin Stránský [:stransky] from comment #66)

I tested a patch where only decoding is done by VAAPI and then the decoded frames are downloaded/copied as with the SW path, and I see the jittering too.
So the difference is only vaapi decoding, not rendering.

When media is played over youtube, only media fragments are downloaded (not the whole video at once, AFAIK), which may cause the issue. Maybe vaapi needs more time ahead for decoding or so?

Jean, do you have any idea here? Can the media decoding process be measured somehow?
Thanks.

Decoding is currently done in the content process. Is that what you mean by process?

We used to have a tool to measure decoding speed, but it wasn't ported to the new MozPromise based API.

If the MediaDataDecoder::IsHardwareAccelerated returns true, then the MediaDecoderStateMachine will keep 3 frames decoded ahead of the current time. If it returns false, 10 frames will be kept in the queue ahead of current time.
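The queue-sizing rule just described can be stated as a tiny helper (the name is illustrative, not the actual Gecko identifier):

```cpp
#include <cassert>

// Frames kept decoded ahead of the current time, per the comment above:
// hardware decoding is assumed fast enough that a short queue suffices,
// while software decoding gets a deeper buffer to absorb slow frames.
constexpr int AmpleDecodedFrames(bool aIsHardwareAccelerated) {
  return aIsHardwareAccelerated ? 3 : 10;
}
```

The shallower hardware queue also means fewer surfaces are in flight at once, which matters when the decoder's surface pool is small.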

I doubt that being YouTube with MSE would lead to increased CPU usage over playing a file from disk. Quite the opposite: the demuxing process is much simpler with MSE, as we get pre-demuxed content.

Do you have a video of what the jittering looks like? Does it stutter, or does it seem that some frames are painted out of order?

If the latter, it could be the same issue we had on macOS when using a global IOSurface.

The decoding is performed in the content process, into a global GPU surface that is then transferred across to the GPU process for rendering.

At one point in time, when the IOSurface was sent across IPC, the refcount was decreased, which caused the decoder to re-use the surface too early. So when rendered, it looked like they were out of order.

Ok, I just looked at the video, and that certainly looks like out of order frames to me, and quite identical to the bug we had with IOSurface.

How is the VAAPI image telling the decoder that it's okay to re-use it once composited?

Flags: needinfo?(jyavenard)

(In reply to Jean-Yves Avenard [:jya] from comment #74)
[...]

Thanks a lot Jean! Yes, the issue is out of order frames.

At one point in time, when the IOSurface was sent across IPC, the refcount was decreased, which caused the decoder to re-use the surface too early. So when rendered, it looked like they were out of order.

So I should not see the bug when e10s is disabled, right?

How is the VAAPI image telling the decoder that it's okay to re-use it once composited?

We allocate the WaylandDMABUFSurfaceImage here:

https://searchfox.org/mozilla-central/rev/09b8072a543c145de2dc9bb76eddddd4a6c09adc/dom/media/platforms/ffmpeg/FFmpegVideoDecoder.cpp#676

and we use the WaylandDMABUFSurfaceImage destructor to tell the va-api decoder it's ok to reuse the surface:

https://searchfox.org/mozilla-central/rev/09b8072a543c145de2dc9bb76eddddd4a6c09adc/gfx/layers/WaylandDMABUFSurfaceImage.h#35

so when the Image we create at the decoder is released, we also release the va-api surface reference and it's reused by the va-api decoder.
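That destructor-driven release is an RAII pattern; a minimal sketch, with a hypothetical callback standing in for the actual va-api release call:

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Models WaylandDMABUFSurfaceImage: holds a hardware surface id and,
// on destruction, notifies the decoder that the surface may be reused.
class SurfaceImage {
 public:
  SurfaceImage(int aSurfaceId, std::function<void(int)> aOnRelease)
      : mSurfaceId(aSurfaceId), mOnRelease(std::move(aOnRelease)) {}
  ~SurfaceImage() {
    if (mOnRelease) {
      mOnRelease(mSurfaceId);  // surface handed back to the decoder
    }
  }
  SurfaceImage(const SurfaceImage&) = delete;
  SurfaceImage& operator=(const SurfaceImage&) = delete;

 private:
  int mSurfaceId;
  std::function<void(int)> mOnRelease;
};
```

The hazard in this bug: the destructor runs as soon as the content-process Image is dropped, even if the compositor side is still sampling the surface.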

Flags: needinfo?(jyavenard)

Yes, it's definitely a use-after-free scenario; here's a log from my debug build where the surface UID is used across the processes. It's 100% reproducible when CPU usage is at 100% and a video with an overlay is played (I use a Firefox compilation as the CPU load :-)). When the video overlay is not used, vaapi playback is ok even when 100% CPU is used.

Created dmabuf UID = 63 HW surface 4000014
Creating TextureClient(), UID = 63
Create WaylandDMABUFTextureData UID = 63
Creating WaylandDMABUFTextureHostOGL() UID = 63
WaylandDMABUFTextureHostOGL::CreateRenderTexture() UID = 63
Created RenderWaylandDMABUFTextureHostOGL() UID = 63
WaylandDMABUFTextureHostOGL::PushResourceUpdates() UID = 63
RenderWaylandDMABUFTextureHostOGL::Lock() UID = 63

Releasing WaylandDMABUFSurfaceImage(), UID = 63 <<< released HW surface

WaylandDMABUFTextureData::Forget UID = 63
WaylandDMABufSurfaceNV12 Release UID = 63
RenderWaylandDMABUFTextureHostOGL::Unlock() UID = 63
RenderWaylandDMABUFTextureHostOGL::Lock() UID = 63
RenderWaylandDMABUFTextureHostOGL::Unlock() UID = 63

Flags: needinfo?(mstange)

(In reply to Martin Stránský [:stransky] from comment #75)

(In reply to Jean-Yves Avenard [:jya] from comment #74)
[...]

Thanks a lot Jean! Yes, the issue is out of order frames.

At one point in time, when the IOSurface was sent across IPC, the refcount was decreased, which caused the decoder to re-use the surface too early. So when rendered, it looked like they were out of order.

So I should not see the bug when e10s is disabled, right?

correct

https://searchfox.org/mozilla-central/rev/09b8072a543c145de2dc9bb76eddddd4a6c09adc/gfx/layers/WaylandDMABUFSurfaceImage.h#35

so when the Image we create at the decoder is released, we also release the va-api surface reference and it's reused by the va-api decoder.

The image is allocated in the content process and composited in the GPU process.
How is the ownership being carried across processes?

The example with IOSurface I gave earlier isn't relevant here, because we don't yet have a GPU process on mac, so the situation is easier to manage.

I'm not familiar with WaylandDMABUFSurfaceImage, but ownership of the VAAPI buffer must be kept by the IPC serialiser and carried across to the GPU process.

It doesn't look like that's happening, from what you describe.

Flags: needinfo?(jyavenard)

BTW, this was precisely the question I asked on your original bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1616185#c13

I should have worded it a bit more clearly at the time

The image is allocated in the content process and composited in the GPU process.

Wayland doesn't have a GPU process yet (bug 1481405 comment 15), but X11 does, so it would be good to account for both.

Jean, what we need here is to release the client after the host. Do you think TextureFlags::WAIT_HOST_USAGE_END can solve that?

Flags: needinfo?(jyavenard)

(In reply to Jean-Yves Avenard [:jya] from comment #77)

I'm not familiar with WaylandDMABUFSurfaceImage, but ownership of the VAAPI buffer must be kept by the IPC serialiser and carried across to the GPU process.

It doesn't look like that's happening, from what you describe.

Yes. The situation here is that we can't transfer the ownership of the VAAPI buffer across Firefox processes, as we don't own it. The VAAPI buffer is owned by the ffmpeg library and must stay locked while it's used by any part of gecko. So we need to make sure it's released (unlocked) only when all operations on it are finished.

I may need to implement a kind of IPC monitor which can be serialized and passed as a part of the dmabuf surface, and release the VAAPI buffer only when it's truly unused.

(In reply to Martin Stránský [:stransky] from comment #81)

I may need to implement a kind of IPC monitor which can be serialized and passed as a part of the dmabuf surface, and release the VAAPI buffer only when it's truly unused.

There is a texture recycling concept in the codebase already, see TextureClientRecycleAllocator. It passes a message back when texture hosts are done with a texture. You may also need to change CreateImageVAAPI() to first allocate a recycled texture client, then wrap it with TextureWrapperImage, then make the VideoData from the TextureWrapperImage instance. I believe that overriding TextureClientRecycleAllocator::RecycleTextureClient will give you the point where the AVFrame is to be released.
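The recycling flow described above can be mocked abstractly (this is a hypothetical sketch, not the real TextureClientRecycleAllocator API): a texture returns to the free list, and the wrapped hardware frame is released, only once the host reports it unused.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Mock of a recycle allocator: textures are recycled only when the
// host side reports it no longer uses them (a NotifyNotUsed-style
// message), which is where a wrapped AVFrame could safely be freed.
class RecycleAllocator {
 public:
  using RecycleHook = std::function<void(int)>;
  explicit RecycleAllocator(RecycleHook aHook) : mHook(std::move(aHook)) {}

  int Allocate() {
    if (mFree.empty()) {
      return mNextId++;  // no recycled texture available: grow the pool
    }
    int id = mFree.front();
    mFree.pop();
    return id;
  }

  // Called when the host signals the texture is unused.
  void NotifyNotUsed(int aTextureId) {
    if (mHook) {
      mHook(aTextureId);  // e.g. release the wrapped hardware frame
    }
    mFree.push(aTextureId);
  }

 private:
  std::queue<int> mFree;
  int mNextId = 0;
  RecycleHook mHook;
};
```

The key property is ordering: a texture id never reappears from Allocate() until NotifyNotUsed() ran for it, so the decoder cannot overwrite a frame the host still reads.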

Summary: High CPU usage and Jittery/stuttering video playback when using VA-API and video control is shown → High CPU usage and Jittery/stuttering video playback when using VA-API

Bug 1539735 would provide a solution to this problem, but it's unlikely to be done anytime soon.

Flags: needinfo?(jyavenard)
See Also: → 1539735

(In reply to Martin Stránský [:stransky] from comment #80)

Jean, what we need here is to release client after host. Do you think TextureFlags::WAIT_HOST_USAGE_END can solve that?

I don't know.

Need to ask someone from the gfx team. :nical probably knows best about texture lifetime management.

Flags: needinfo?(nical.bugzilla)

Do you think TextureFlags::WAIT_HOST_USAGE_END can solve that?

This causes the client side to keep the texture data alive until the last reference to the texture is gone on the host side (at which point the host sends a NotifyNotUsed message which informs the client side that there are no more references to it on the other side).

The same system is used with the TextureClientRecycleAllocator mechanism mentioned by Rinat.

If the lifecycle of VAAPI frames works with the texture client recycling stuff, I would suggest using it so that platforms don't diverge too much. If it doesn't fit it would be worth figuring out if we can extend the system to make it work. Shared texture life cycle is quite complicated overall (especially when factoring in (unexpected) shutdown) so it would be nice if at some point it all works with a common system.

Flags: needinfo?(nical.bugzilla)

Thanks Nicolas, I'll investigate the WAIT_HOST_USAGE_END path.

After some investigation I ended up with a global mutex similar to IDXGIKeyedMutex (Bug 1066312) which is part of the dmabuf surface and keeps a global ref count. It's based on shm memory (opened by shm_open()) and Linux semaphores (the sem_* functions).

Unfortunately shm_open() fails in the content process with:

Sandbox: SandboxBroker: denied op=stat rflags=400000 perms=0 path=mozilla-shared-E2xfeo for pid=37667
Sandbox: Failed errno -13 op stat flags 0400000 path mozilla-shared-E2xfeo

Jed, I expect it's ok to use shared memory in the content process, right? What's the best way to deal with it? I think ShmMem is a bit of overkill here, as I just need a bit of shm memory which is attached to the surface and shared across processes as a file descriptor with the other dmabuf data.
Thanks.

Flags: needinfo?(jld)
Depends on: 1639855

(In reply to Martin Stránský [:stransky] from comment #88)

Jed, I expect it's ok to use shared memory in the content process, right? What's the best way to deal with it?

In general, yes, but I didn't want a sandboxed process to be able to access shared memory belonging to other processes. The way this works is that base::SharedMemory constructs a prefix that's unique to that process (and conforms to name restrictions imposed by external sandboxes like Snap), which is used in the sandbox policy. Note that when memfd support is added to base::SharedMemory (bug 1440203), no access to /dev/shm will be allowed under the current policy if the kernel supports memfd.

Is it possible for you to use base::SharedMemory? Alternately, the CrossProcessSemaphore class sounds similar to what you've done and is used to manage graphics resources, but I don't know if it would be useful in this context.

Flags: needinfo?(jld)

Ehm... Why are shared memory and mutexes necessary here? There is already a message channel between processes. The act of passing a message is a synchronization itself. Adding another synchronization channel may even introduce deadlocks in unexpected places.

Here is a PoC patch that uses TextureFlags::WAIT_HOST_USAGE_END. Looks like a clear way to a solution to me. Am I missing something?

Flags: needinfo?(stransky)

(In reply to Rinat from comment #90)

Created attachment 9151351 [details] [diff] [review]
PoC-vaapi-frame-lifetime.patch

Ehm... Why are shared memory and mutexes necessary here? There is already a message channel between processes. The act of passing a message is a synchronization itself. Adding another synchronization channel may even introduce deadlocks in unexpected places.

Here is a PoC patch that uses TextureFlags::WAIT_HOST_USAGE_END. Looks like a clear way to a solution to me. Am I missing something?

Rinat, thanks for the patch. Did you test it?

When I tested WAIT_HOST_USAGE_END, the frame release was too slow, which led to OOM on the GPU side, and sometimes the frames were also released too early (before unlock) on the WebRender side, so the flickering was still there. But maybe I did that wrong.

The mutex I implemented here is very fast, which makes it possible to recycle the GPU frames ASAP. That's critical because the number of HW VAAPI surfaces is limited.

Flags: needinfo?(stransky)

(In reply to Martin Stránský [:stransky] from comment #91)

(In reply to Rinat from comment #90)

Created attachment 9151351 [details] [diff] [review]
PoC-vaapi-frame-lifetime.patch

Ehm... Why are shared memory and mutexes necessary here? There is already a message channel between processes. The act of passing a message is a synchronization itself. Adding another synchronization channel may even introduce deadlocks in unexpected places.

Here is a PoC patch that uses TextureFlags::WAIT_HOST_USAGE_END. Looks like a clear way to a solution to me. Am I missing something?

When I tested WAIT_HOST_USAGE_END, the frame release was too slow, which led to OOM on the GPU side, and sometimes the frames were also released too early (before unlock) on the WebRender side, so the flickering was still there. But maybe I did that wrong.

This is actually caused by a missing link between WaylandDMABUFSurfaceImage and the texture client. WaylandDMABUFSurfaceImage is released early, when VideoData is deleted, and the TextureClient is released later. So WAIT_HOST_USAGE_END does not solve the WaylandDMABUFSurfaceImage/TextureClient dependency here.

We can use eventfd() for the IPC dmabuf surface refcount instead of shm memory plus a semaphore, and it does not interfere with the Firefox sandbox.

This patch implements:

  • A global surface UID for better tracking/debugging. The UID is passed across all processes so we can track
    surface usage.
  • A WaylandAllocateShmMemory() function which uses shm_open() instead of files in /tmp for shm memory creation.
  • A DMABufRefcount() class based on eventfd() which is used as a global surface reference count.

VA-API surfaces are owned by the ffmpeg library and we need to keep them locked while any part of the rendering pipeline uses them.
We use a global lock provided by WaylandDMABUFSurface for that, keep an array of active surfaces at FFmpegVideoDecoder,
and release them only when all references are cleared or the FFmpegVideoDecoder is deleted.

Depends on D76689

VAAPIFrameHolder was moved to the FFmpegVideoDecoder class and we use the WaylandDMABufSurface global lock to keep
the surface referenced while it's in use.

Depends on D76690

(In reply to Martin Stránský [:stransky] from comment #91)

Did you test it?

I've only tested the patch on Weston. I haven't seen out-of-order frames on a test video with the patch, while without the patch the issue can be reproduced quite easily.

WAIT_HOST_USAGE_END the frame release was too slow and which led to OOM

WAIT_HOST_USAGE_END behavior is currently broken. It makes the TextureClient go to the waiting list, but doesn't change TextureHost behavior, and that's why NotifyNotUsed is never sent. TextureClients keep accumulating without any limit. This is why the OOM you saw occurs.
In the patch, the TextureHost code is modified so it sends the required NotifyNotUsed messages.

WaylandDMABUFSurfaceImage is released early, when VideoData is deleted, and the TextureClient is released later. So WAIT_HOST_USAGE_END does not solve the WaylandDMABUFSurfaceImage/TextureClient dependency here.

Yeah, and that's why the PoC patch passes responsibility for the VAAPIFrameHolder from WaylandDMABUFSurfaceImage to a newly-created TextureClient backed by WaylandDMABUFTextureData. Since WAIT_HOST_USAGE_END is a TextureClient feature, it only controls the TextureClient lifetime.

The mutex I implemented here is very fast which enables to recycle the GPU frames ASAP.

Yeah, it probably is fast. It just looks deadlock-prone to me, since I can't understand its safety easily. On the other hand, message passing looks simple.

As I thought about it more, I must admit that another locking mechanism may be necessary here. As far as I understand, the video decoder produces frames on its own, without limitations. It's fine if the renderer manages to consume all produced frames in time. But if it doesn't, the video decoder may run out of available hw frames to use. Then it may lock up, and even if the renderer process releases frames, the decoder process can stay locked for good. So some waitable locking may be required here. But it looks so wrong...

As :jya mentioned earlier, compositor-driven decoding would solve this in a clean way. Unfortunately nobody is working on it at the moment, apparently.

May or may not be relevant here but: TextureClient has a "ReadLock" mechanism based on a reference count written in a shmem that is not used by default but can be used when calling TextureClient::EnableReadLock https://searchfox.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#752

The main idea with this system is that the client side increments the read count before sending and the host side decrements it when nothing references the texture anymore. The client side can query whether the texture is available for reuse but cannot block on it. This all happens under the hood. It uses a single shmem to host the shared reference counts of all textures, as shmems can be a bit expensive. Doing this potentially allows reusing a texture earlier than if we had to wait for a message back from the compositor.

It is currently mostly used for tiled painted layers so it may or may not work out of the box for video frames but I expect it wouldn't be hard to make it work if it doesn't.

(In reply to Rinat from comment #98)

video decoder may run out of available hw frames to use. Then it may lock up and even if renderer process releases frames, decoder process can be locked for good.

Oh, it may be solved by copying frames. When a single hw frame is left in the reserve, code may switch to making copies of frames. This may degrade performance a bit, but will prevent deadlocks. This fallback will probably never run, but would be nice to have this corner covered.

(In reply to Rinat from comment #100)

(In reply to Rinat from comment #98)

video decoder may run out of available hw frames to use. Then it may lock up and even if renderer process releases frames, decoder process can be locked for good.

Oh, it may be solved by copying frames. When a single hw frame is left in the reserve, code may switch to making copies of frames. This may degrade performance a bit, but will prevent deadlocks. This fallback will probably never run, but would be nice to have this corner covered.

No, it won't lock in any case. If that happens you'll see "video is corrupted" message and you can reload the video.

I think the plan for video playback may be:

  • finish the vaapi playback (this bug)
  • implement EGLimage backend for SW decoding (Bug 1629788)
  • implement texture recycling for SW decoding

If texture recycling turns out to be more effective than the direct dmabuf surface tracking, we can remove the latter, but as implemented here it's the most efficient variant AFAIK. It uses a single 64-bit integer created by eventfd(), which is stored in the kernel and shared by file descriptor along with the other dmabuf surface info. It does not use shared memory/semaphores, just simple I/O.

(In reply to Rinat from comment #100)

(In reply to Rinat from comment #98)

video decoder may run out of available hw frames to use. Then it may lock up and even if renderer process releases frames, decoder process can be locked for good.

Oh, it may be solved by copying frames. When a single hw frame is left in the reserve, code may switch to making copies of frames. This may degrade performance a bit, but will prevent deadlocks. This fallback will probably never run, but would be nice to have this corner covered.

Btw., we can't copy frames in this case. When all hw frames are exhausted, the HW decoder just fails to decode any further frames. What can be done here is to postpone decoding until the frames are released. With the patch here we can do that, as we track the usage of all VA-API hw frames used by ffmpeg.

(In reply to Martin Stránský [:stransky] from comment #101)

No, it won't lock in any case. If that happens you'll see "video is corrupted" message and you can reload the video.

What I tried for the test was to remove mLib->av_buffer_unref(&mHWFrame); from VAAPIFrameHolder::~VAAPIFrameHolder. That makes Firefox's part of the decoder keep references to all hw frames. And when I try to play a video on such a build, it plays the first 32 frames, then shows a loading spinner animation and the text "No video with supported format and MIME type found". I can still interact with the video controls, but changing the position on the timeline does nothing. It doesn't offer to reload the video or anything else. The remainder of the page and the browser UI are fine though.
Perhaps we are talking about different aspects.

btw. we can't copy frames in this case. When all hw frames are exhausted the HW decoder just fails to decode any further frames.

But the number of frames held in VAAPIFrameHolder instances can be tracked. And you are explicitly asking FFmpeg to allocate a known number of additional frames. It's known when a VAAPIFrameHolder instance is created and when it's destroyed. So, it should be possible to track the number of frames left?

What can be done here is to postpone decoding until the frames are released. With the patch here we can do that as we track usage of all vappi hw frames used by ffmpeg.

OK.

Pushed by rgurzau@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/594ac84515dc
[Wayland] Implement dmabuf global ref count, r=jhorak
https://hg.mozilla.org/integration/autoland/rev/45a693471a1c
[Wayland] Use VAAPIFrameHolder array at FFmpegVideoDecoder to hold active vaapi hw video surfaces, r=jya
https://hg.mozilla.org/integration/autoland/rev/5c5cdf8961e6
[Wayland] Use WaylandDMABufSurface global lock instead of VAAPIFrameHolder at WaylandDMABUFSurfaceImage, r=sotaro

Seems to work now. Good job.

However, I don't think this is related to the patch, but if I open Konsole in KDE, have it on top of Firefox with an open tab with a paused video, and start typing in Konsole, the Konsole window partly disappears.
But maybe that's a different bug report we need to make.

(In reply to rickard from comment #108)

However, I don't think this is related to the patch, but if I open Konsole in KDE, have it on top of Firefox with an open tab with a paused video, and start typing in Konsole, the Konsole window partly disappears.
But maybe that's a different bug report we need to make.

KDE Plasma has had many issues with Firefox Wayland, mainly https://bugs.kde.org/show_bug.cgi?id=387313 which caused issues like https://bugs.kde.org/show_bug.cgi?id=419797.

That fix will be part of Plasma 5.19. If it still doesn't work there, you should post a new bug report in the wayland-generic component of the kwin product.

Hmm. I thought that the inter-process semaphore approach was chosen to prevent the decoder from blocking in the case where all VASurfaces are in use. But ReleaseUnusedVAAPIFrames() in FFmpegVideoDecoder just peeks at all elements in mFrameHolders and never waits. So all this IPC mechanism is just a re-implementation of the existing texture recycling framework. Is there a reason for such complications that I failed to see?

(In reply to Rinat from comment #110)

So all this IPC mechanism is just a re-implementation of the existing texture recycling framework. Is there a reason for such complications that I failed to see?

According to https://bugzilla.mozilla.org/show_bug.cgi?id=1619882#c86 :

Shared texture life cycle is quite complicated overall (especially when factoring in (unexpected) shutdown) so it would be nice if at some point it all works with a common system.

But if you make it work with the recycling I'm happy to remove the frame holder there. Right now I find the frame holder solution a simplification here.

(In reply to Martin Stránský [:stransky] from comment #111)

But if you make it work with the recycle I'm happy to remove the frame holder there. Right now I find the frameholder solution as simplification here.

Frameholder is necessary. I wasn't arguing about that. Instead I was talking about how frame holders are destroyed. Your current solution is to add a shared flag. So when the decoder issues a frame, you create a WaylandDMABUFSurfaceImage, which sets the flag and creates a TextureClient, which is serialized and sent to the compositor where it becomes a TextureHost (WaylandDMABUFTextureHostOGL) which creates a WaylandDMABufSurface. The TextureClient is destroyed at some point in time, which is not relevant. Then, when the WaylandDMABUFTextureHostOGL is no longer necessary, it is destroyed, which destroys the WaylandDMABufSurface, which resets the shared flag. When another frame is decoded in the decoder process, it checks all backlog frame holders for their corresponding flags and destroys those which have no flag set. So there are the message-channel IPC and an ad-hoc IPC channel, these flags.

There is another way to solve the issue: to use texture recycling. In that scenario WaylandDMABUFSurfaceImage passes frameholder responsibility to a TextureClient, which is created with WAIT_HOST_USAGE_END flag. It's then serialized and passed to the compositor, where it becomes a TextureHost. When TextureHost is no longer needed, a NotifyNotUsed message is sent back to the TextureClient side, where it causes the corresponding TextureClient to be destroyed, which destroys a frameholder. So, the existing message channel IPC is used. No additional channel is necessary. Here is how it looks as a patch: https://bugzilla.mozilla.org/attachment.cgi?id=9151351&action=diff.

Both approaches have no idea what to do if the compositor for some reason decides to hold TextureHosts longer than usual, but I believe WAIT_HOST_USAGE_END approach is clearer, since it already uses what's there in the code.

I probably should submit the part that fixes WAIT_HOST_USAGE_END as a separate bug though.

(In reply to Rinat from comment #112)

Frameholder is necessary. I wasn't arguing about that. Instead I was talking about how frame holders are destroyed. Your current solution is to add a shared flag. So when the decoder issues a frame, you create a WaylandDMABUFSurfaceImage, which sets the flag and creates a TextureClient, which is serialized and sent to the compositor where it becomes a TextureHost (WaylandDMABUFTextureHostOGL) which creates a WaylandDMABufSurface. The TextureClient is destroyed at some point in time, which is not relevant. Then, when the WaylandDMABUFTextureHostOGL is no longer necessary, it is destroyed, which destroys the WaylandDMABufSurface, which resets the shared flag. When another frame is decoded in the decoder process, it checks all backlog frame holders for their corresponding flags and destroys those which have no flag set. So there are the message-channel IPC and an ad-hoc IPC channel, these flags.

Yes, you're right except for the last point - Mozilla IPC is not involved in setting the "busy" flag. WaylandDMABufSurface uses a global lock based on eventfd. So when the WebRender TextureHost releases its WaylandDMABufSurface reference, there's no Mozilla IPC involved.

There is another way to solve the issue: to use texture recycling. In that scenario WaylandDMABUFSurfaceImage passes frameholder responsibility to a TextureClient, which is created with WAIT_HOST_USAGE_END flag. It's then serialized and passed to the compositor, where it becomes a TextureHost. When TextureHost is no longer needed, a NotifyNotUsed message is sent back to the TextureClient side, where it causes the corresponding TextureClient to be destroyed, which destroys a frameholder. So, the existing message channel IPC is used. No additional channel is necessary. Here is how it looks as a patch: https://bugzilla.mozilla.org/attachment.cgi?id=9151351&action=diff.

I'm a bit worried about IPC channel delays here. The current solution (the internal WaylandDMABufSurface global lock) does not involve any IPC communication, so the change is instant. But maybe I'm wrong and this is not an issue at all.

(In reply to Rinat from comment #112)

channel is necessary. Here is how it looks as a patch: https://bugzilla.mozilla.org/attachment.cgi?id=9151351&action=diff.

I feel the VAAPIFrameHolder should not be exported from the media playback scope.

OTOH there's already an effort to use the SW decoder with recycled HW textures (Bug 1629788), so it would be great to have one solution for both cases (vaapi+dmabuf, SW decode + EGLImages/dmabuf) where the textures are recycled in the same way, so I suspect we'd need something like you posted there.

Regressions: 1618914

Everything works, but it feels like Firefox slows down quite a bit if it's running more than one video stream, or even a single 60fps Twitch stream.
Chromium seems to run smoother than Firefox in this regard.

Can things be optimized perhaps?
