Closed Bug 1713276 Opened 1 year ago Closed 4 months ago

[ffmpeg] Decode video frames directly into VideoData

Categories

(Core :: Audio/Video: Playback, enhancement, P2)

enhancement

Tracking

()

RESOLVED FIXED
98 Branch
Tracking Status
firefox98 --- fixed

People

(Reporter: stransky, Assigned: alwu)

References

(Depends on 1 open bug, Blocks 5 open bugs, Regressed 1 open bug)

Details

Attachments

(7 files, 3 obsolete files)

37.93 KB, patch
Details | Diff | Splinter Review
48 bytes, text/x-phabricator-request
Details | Review
1.53 MB, image/png
Details
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
Assignee: nobody → stransky
Depends on: 1715365

With Bug 1715365 (FFmpegVideoFramePool) we can extend it for frames located at ram to reduce AVFrame -> VideoData copy for all Linux backends - even Basic (at least for some well know pixel formats).

If we allocate the frames in shm memory we may even reduce copy during frame serializing (I'm not sure how the VideoData is processed in GLX/Basic case).

Summary: [Linux] Decode video frames directly into dmabuf surfaces → [Linux] Decode video frames directly into video surfaces
Summary: [Linux] Decode video frames directly into video surfaces → [Linux] Decode video frames directly into VideoData

If my understanding is correct, this will not only help to avoid one copy (moving data into AVFrame) on Linux, but also can help ffvpx decoding on other platforms?

Flags: needinfo?(stransky)

(In reply to Alastor Wu [:alwu] from comment #2)

If my understanding is correct, this will not only help to avoid one copy (moving data into AVFrame) on Linux, but also can help ffvpx decoding on other platforms?

Yes, any platform which uses ffvpx can use that (but I think it's used on Linux only).

Flags: needinfo?(stransky)

(In reply to Martin Stránský [:stransky] (ni? me) from comment #3)

Yes, any platform which uses ffvpx can use that (but I think it's used on Linux only).

We also use ffvpx on Windows and MacOS, so this would help on those platforms as well.

Summary: [Linux] Decode video frames directly into VideoData → [ffmpeg] Decode video frames directly into VideoData

Hi, Martin, I wonder if I could know your plan for this bug? Is there anything that I can help with? Because I feel that could probably help a lot on the performance of vpx decoding while using ffvpx, which will be beneficial on multiple platforms. Thank you!

Flags: needinfo?(stransky)

Hi Alastor, I keep it on my radar but I'm busy with more pressing Linux/Wayland issues right now. But I have patches WIP and I'm going to come back and look at it.

Blocks: video-perf

I started investigating how chromium does that today. Here is the place they set the get_buffer2 for ffmpeg in order to use their own buffer, and here is the place they create their own buffer for ffmpeg.

Flags: needinfo?(stransky)
Attached patch in-place.patchSplinter Review

Hi Alastor,

there's my WIP patch attached here, it may help you with the start. It's related to DMABuf but key parts may be reused for general in-place decoding as they are shared.

Key part is:

#if LIBAVCODEC_VERSION_MAJOR >= 58
if (mVideoFramePool /* && (mCodecContext->flags & AV_CODEC_CAP_DR1)*/) {
mCodecContext->opaque = this;
mCodecContext->get_buffer2 = GetVideoBufferWrapper;
mCodecContext->thread_safe_callbacks = 1;
}
#endif

where we set custom method to supply video buffer to ffmpeg:

static int GetVideoBufferWrapper(struct AVCodecContext* aCodecContext,
AVFrame* aFrame, int aFlags);
static void ReleaseVideoBufferWrapper(void* opaque, uint8_t* data);

At FFmpegVideoDecoder<LIBAV_VER>::GetVideoBuffer() you need to create buffer according to ffmpeg expectations, it's similar how we create the buffer for copy.

The patch also contains related changes to FFmpegLibWrapper which may be the same.

Note that you don't have to create custom buffer for every format - see call avcodec_default_get_buffer2(). You say ffmpeg to create the buffer and you need to copy data later as we do now.

Thank for your patch!

After a quick glance, this seems only work for Linux because it only handle DMAbuffer surface. As ffmpeg decoder can also be used on ffvpx for software decoding on Windows and Mac, we also need to handle those cases where we have to create a shmem to share with GPU/Chrome process, in order to let compositor to access the data directly.

(In reply to Alastor Wu [:alwu] from comment #10)

Thank for your patch!

After a quick glance, this seems only work for Linux because it only handle DMAbuffer surface. As ffmpeg decoder can also be used on ffvpx for software decoding on Windows and Mac, we also need to handle those cases where we have to create a shmem to share with GPU/Chrome process, in order to let compositor to access the data directly.

Yes, the patch is designed for dmabuf but can be updated for shm mem. You need to create SHM memory buffer at mCodecContext->get_buffer2 callback and pass it to VideoData instead of copy them:

https://searchfox.org/mozilla-central/rev/75e9d727ce5ba2c14653cf8fb0f1367f085271b7/dom/media/MediaData.cpp#356
https://searchfox.org/mozilla-central/rev/75e9d727ce5ba2c14653cf8fb0f1367f085271b7/gfx/layers/ImageContainer.cpp#246

Hi, Martin,
I'm still investigating the code to see how we can implement that, now I have some rough idea and I collect them into this doc. Would you mind to check if the direction of my thought is correct or not?
Thank you so much.

Flags: needinfo?(stransky)

(In reply to Alastor Wu [:alwu] from comment #12)

Hi, Martin,
I'm still investigating the code to see how we can implement that, now I have some rough idea and I collect them into this doc. Would you mind to check if the direction of my thought is correct or not?
Thank you so much.

Hello Alastor,

There are some facts which needs to be considered:

  • direct decode to dmabuf (on Linux) is very slow. It because ffmpeg also reads from the buffer during decode and it expects the buffer is zeroed. Direct decoding to dmabuf takes 700%!! CPU on my box while indirect decoding 40-60%. The GPU memory is generally very slow to read (if it's even supported). I don't know how is that handled on Windows/Mac but I expect the situation is similar (or the data are cached and then moved to GPU at once but that's what we do already). The correct (and only working) way here is to create a texture, upload video buffer into it and use such texture (which is done by gecko anyway).

  • recycled buffers can be hold for long time by different parts of gecko/different processes.

  • you may need to use avcodec_default_get_buffer2() for video buffer formats which are not supported by us (at least for the beginning).

Right now the decoding sequence is (on Linux):

  1. ffmpeg allocates video buffer and decodes video data into it
  2. we allocate shm buffer and copy video data there
  3. we allocate GL texture and copy data there

So there are 3 allocations and 3 video data copy.

Given the facts I'm suggesting to decode to shm memory buffers only. We can allocate every frame now and implement frame pool later if that's needed, I don't know how shm allocation is expensive (we allocate shm memory for every frame now).

Flags: needinfo?(stransky)

According to comment13, as ffmpeg will read (referenceing frames for decoding) and write the buffer (outputing decoded data), if we provide GPU memory for that, it would be pretty slow because of the need to access GPU memory frequently from CPU. Therefore, we should only allocate CPU memory (shmem buffer) to ffmpeg decoder.

If my above understanding is correct, then this improvement will only help on software composition?

Because for the hardware compostion and hardware decoding, the decoded data are already based on GPU memory, so we don't actually need to copy them [1], and can give them directly to the hardware compositor. For software decoding and hardware composition, we would still create a sharable hardware texture (eg. D3D11 or MacIOSurface ) to share the data to hardware compositor.

For above cases, we would eventually create a hardware based memory buffer to convey the data, so using shmem seems no help at all?

Sorry I still don't really understand how compostior works, so above comment might be wrong.

Thank you so much.

[1] Although now we still copy data on Windows, but from my understanding, that can be avoid (bug 1723207)
https://searchfox.org/mozilla-central/rev/a9ef6ad97d2d5f96d5ed51eda38f1a02700ccff7/dom/media/platforms/wmf/DXVA2Manager.cpp#906-931

Flags: needinfo?(stransky)

(In reply to Alastor Wu [:alwu] from comment #14)

According to comment13, as ffmpeg will read (referenceing frames for decoding) and write the buffer (outputing decoded data), if we provide GPU memory for that, it would be pretty slow because of the need to access GPU memory frequently from CPU. Therefore, we should only allocate CPU memory (shmem buffer) to ffmpeg decoder.

In addition, if this is the reason, then does running ffmpeg decoder on GPU solve the slow problem? (I know only Windows has GPU process, just thinking about its possibility)

Take vp9 on Windows as an example, when hardware decoding is enabled, we will use WMF decoder (Windows platform decoder) in GPU process, but once if we disable hardware decoding, we will fallback to ffvpx decoder in RDD. I imagine that if we can run ffvpx decoder in GPU process, then all read/write on decoded buffer would happen on GPU process and no longer "slow" (if the reason I mentioned above is correct)

(In reply to Alastor Wu [:alwu] from comment #14)

According to comment13, as ffmpeg will read (referenceing frames for decoding) and write the buffer (outputing decoded data), if we provide GPU memory for that, it would be pretty slow because of the need to access GPU memory frequently from CPU. Therefore, we should only allocate CPU memory (shmem buffer) to ffmpeg decoder.

If my above understanding is correct, then this improvement will only help on software composition?

I'm not familiar with Windows/Mac decoding so I can give you only rough picture:

HW decoding:
GPU (decoded ffmpeg video frame) -> GPU (decoded video frame in dmabuf) -> IPC -> GPU (texture used by webrender)

SW decoding with dmabuf backend / WebRender:
CPU (decoded ffmpeg video frame) -> GPU (decoded video frame in dmabuf) -> IPC -> GPU (texture used by webrender)

SW decoding with WebRender
CPU (decoded ffmpeg video frame) -> CPU (decoded video frame in shm) -> IPC -> GPU (texture used by webrender)

SW decoding with SW WebRender
CPU (decoded ffmpeg video frame) -> CPU (decoded video frame in shm) -> IPC -> CPU (SW rendering of video frame)

With in-place decoding we can transform:

'CPU (decoded ffmpeg video frame) -> CPU (decoded video frame in shm)'

to just one step

'CPU (decoded ffmpeg video frame in shm)'

So if we decode directly to shm buffer, the shm buffer can be send to another process for compositing (SW or HW) without additional copy, no matter if it's transformed to texure or used for SW composition.

Flags: needinfo?(stransky)

Update the current WIP, now on Linux I can decode vpx and h264 onto the shmem directly, but sometime it causes a crash (SIGSEGV) inside the libmozavcodec.so, which I suspect that I didn't set the size (or stride, linesize) well. I'm still debugging that issue.

Assignee: stransky → alwu
Severity: -- → N/A
Priority: -- → P2

Hi, Martin,

Currently my patch works for live streaming vp9, like this one and h264 video, like this, but it would still crash on this video in 1080p or 1440p (on other resolutions, it won't crash! but I don't know why)

I wonder if you could help me check my patch and provide me any feedback?
Thank you so much.

Flags: needinfo?(stransky)

Sure. I expect we need to align the frame sizes somehow as ffmpeg is a bit picky here and needs extra space.
Will look at it.

Indeed, I printed the data size (the second parameter for av_buffer_create() ) allocated by default allocator, which is different from the size I set.

  • H264 (640 * 368)
    Data size (by default allocator) = 247119, data size (by my custom allocator) = 235520 (640*368)
  • VP9 (1280 * 720)
    Data size (by default allocator) = 942159, data size (by my custom allocator) = 921600 (1280*760)
  • VP9 (2560 * 1440)
    Data size (by default allocator) = 942159, data size (by my custom allocator) = 921600 (1280*760)

I can found the default allocator always allocates larger size than my custom allocator. In addition, Chrome uses larger size for that parameter, which is calulated by adding the sizes of three planes. I also tried this method, but still got a crash on the last VP9 video.

(In reply to Alastor Wu [:alwu] from comment #21)

Indeed, I printed the data size (the second parameter for av_buffer_create() ) allocated by default allocator, which is different from the size I set.

  • H264 (640 * 368)
    Data size (by default allocator) = 247119, data size (by my custom allocator) = 235520 (640*368)
  • VP9 (1280 * 720)
    Data size (by default allocator) = 942159, data size (by my custom allocator) = 921600 (1280*760)
  • VP9 (2560 * 1440)
    Data size (by default allocator) = 942159, data size (by my custom allocator) = 921600 (1280*760)

I can found the default allocator always allocates larger size than my custom allocator. In addition, Chrome uses larger size for that parameter, which is calulated by adding the sizes of three planes. I also tried this method, but still got a crash on the last VP9 video.

I recall ffmpeg extends/align both width & height, not just width as one may expect. The default allocator allocation method may be the way how to get it correctly, perhaps we need to debug it.

Also the latest VP9 video is rotated - may that be the factor here? Flipped width/height?

(In reply to Martin Stránský [:stransky] (ni? me) from comment #22)

I recall ffmpeg extends/align both width & height, not just width as one may expect. The default allocator allocation method may be the way how to get it correctly, perhaps we need to debug it.

Wow you're right! Aligning the height did solving the crash! \o/
Thank you so much, I will keep working on this and see if there is any other issues or crashes.

Flags: needinfo?(stransky)
Attached image wrong-color-issue.png

Now I encounter another problem which is color showing incorrectly due to larger Cb/Cr surfaces size. As ffmpeg needs a special alignment on both height and width, I would create a image and allocate a corresponding buffer. Then, when we receive decoded data, if we give that surface directly to the compositor, it would cause the color mismatch, because the Cb/Cr stride is larger than the compositor expect.

For example, for 1920*1080, the Cb/Cr stride and height should be 960 and 540. But ffmpeg needs 1024 and 544 after doing the alignment, if I represent this texture directly to the compositor, then it would display a wrong green area like what this photo shows.

It seems that I will have to find a way to re-adjust the cb/cr surface region stored in the texture client after I receive decoded data?

How do you create the textures?

For instance eglCreateImage allows to specify strides for all planes which don't need to be power-of-two aligned:
https://searchfox.org/mozilla-central/rev/24fac1ad31fb9c6e9c4c767c6a7ff45d226078f3/widget/gtk/DMABufSurface.cpp#564

(It's also possible that the dmabuf code here is affected by this too).

First I create a PlanarYCbCrImage from the image container, and create a texture client from there. Then I create a new method to create an empty buffer by providing a PlanarYCbCrData, the new method would call SharedPlanarYCbCrImage::Allocate to create a shmem buffer.

As FFmpeg requires aligment, when setting the PlanarYCbCrData (which is used to allocate the shmem buffer), I would set the aligned stride and height on the data (see here, search CreateEmptyPlanarYCbCrData)

Looks like SharedPlanarYCbCrImage is incorrectly mapped to textures. I'd expect BufferTextureData::CreateForYCbCr() is called for the video frames but BufferTextureData::Create() is used instead with RGBA format. Looks like the YUV is mapped to RGB somewhere and we're uploading RGB textures which is wrong.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #28)

Looks like SharedPlanarYCbCrImage is incorrectly mapped to textures. I'd expect BufferTextureData::CreateForYCbCr() is called for the video frames but BufferTextureData::Create() is used instead with RGBA format. Looks like the YUV is mapped to RGB somewhere and we're uploading RGB textures which is wrong.

So no, I was wrong here. The texture seems to be created as YUV at RenderBufferTextureHost::Lock().

Yes, I do see the artifacts with https://www.youtube.com/watch?v=LXb3EKWsInQ
Looks like with and also height is wrong.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #30)

Yes, I do see the artifacts with https://www.youtube.com/watch?v=LXb3EKWsInQ
Looks like with and also height is wrong.

Yes, that's ture, that was caused by extra padding for the ffmpeg. If I watch that video on 4K, then this issue won't happen because we don't need the extra padding for linesize in that resulution.

I asked this quesiton on #gfx-firefox and the response from :jgilbert seems matching about what I thought, I have to dive deeper into the buffer creation, to create a larger size for shmem buffer, but keep the YCbCrDescriptor correct (use the original size which we don't modify for ffmpeg)

So no, I was wrong here. The texture seems to be created as YUV at RenderBufferTextureHost::Lock().

In the beginning, I was also thinking about the possiblility of creating the wrong format texture, but after my testing, that is not true. The problem is still I didn't provide correct YCbCrDescriptor so that the compositor would render the wrong image.

Now I follow this suggestion to resize the planes to useful-size before I create the video data, and the result is good! I can eliminate the green part and video in different resolutions could all be displayed correctly.

Yes, the blit can be another way although it adds another frame copy, but that's done on gfx card. I believe there is a way how to upload the textures directly with stride without additional blit (it's already done for dmabuf textures).

(In reply to Martin Stránský [:stransky] (ni? me) from comment #33)

Yes, the blit can be another way although it adds another frame copy, but that's done on gfx card. I believe there is a way how to upload the textures directly with stride without additional blit (it's already done for dmabuf textures).

Oh the current way I use won't add any copy, it's just simply readjust the size in the YCbCrDescriptor. The reason of not porting the logic to shmem buffer is that, this is the only use case where we would need a speicial stride on height as well. So instead of adding more explanation and logic on creating shmem buffer, I feel adding another method to readjust the descriptor would be easier and better for maintainence.

But now the interesting thing is that, even if I remove the need of copying the decoded data, the decoding time doesn't seem to be improved. I wonder if creating shmem on the ffmpeg decoder threads degrade the performance of decoding. Still in the investigation.

Attachment #9248959 - Attachment description: WIP: Bug 1713276 - part1 : allocate shmem for ffmpeg to store decoded data directly. → Bug 1713276 - part1 : allocate shmem for ffmpeg to store decoded data directly.
Attachment #9250317 - Attachment description: WIP: Bug 1713276 - part2 : directly use the shmem video buffer to create a video data which would remove a need to copy decoded data. → Bug 1713276 - part2 : directly use the shmem video buffer to create a video data which would remove a need to copy decoded data.

Here are some profiled results. I followed this way to turn ETW log to the format that Firefox profiler can accept. They are all recording while playing this 4K video in 10s.

  • Firefox using default buffer allocator (what we current have, will need to copy data into texture or shmem later) result

For vp9 decoding, I searched for vp9_decode_frame(), which happened in thread#270~277.
The total sample times for vp9_decode_frame() is 48111ms (=5929+6333+5961+6809+5920+5249+5648+6262) and the total sample amount is 20680 (=2559+2753+2488+2652+2532+2676+2386+2634). Therefore, we can say one sample took 2.3ms in average.

For ffmpeg video decoder, I searched for mozilla::FFmpegVideoDecoder::DoDecode(), which happened in thread#268,269 and #282.
The total sample times for mozilla::FFmpegVideoDecoder::DoDecode() is 43219ms (=16353+17186+9680) and the total sample amount is 2472 (=949+885+638). Therefore, one sample took 17.48ms in average.

  • Firefox using the customized buffer allocator (allocate shmem buffer) result

Same searching for vp9_decode_frame(), which happened in thread#302~309.
The total sample times for vp9_decode_frame() are 57393ms (=7606+6329+7865+7275+6327+7107+7563+7321) and the total sample amount is 28356 (=3454+3345+3565+3776+3445+3355+3565+3851). Therefore, we can say one sample took 2.02ms in average.

For ffmpeg video decoder, I searched for mozilla::FFmpegVideoDecoder::DoDecode(), which happened in MediaPdecoder#1~#3.
The total sample times for mozilla::FFmpegVideoDecoder::DoDecode() is 23295ms (=11242+10936+11117) and the total sample amount is 358 (=114+127+117). Therefore, one sample took 65.07ms in average. (?)


In term of decoding time in ffvpx decoder, it only improves very little. For the ffmpeg's DoDecode, it's actully getting worse, but I am not sure if I uses the correct way to analysize these profiles...

Hello Alastor,

I'm not a hardware expert but I think this article gives basic explanation what's going on:
https://tldp.org/HOWTO/Parallel-Processing-HOWTO-2.html

When we decode to shared memory, the memory is cached and needs to be synchronized with other cores (that's the Shared Everything Vs. Shared Something dilemma) as we use multi-threaded decoding. Also I guess the speed heavily depends on cache size (L1, L2, L3) and also DRAM speed. (when L* cache size is small then we may see better performance with direct decode to shm memory as cache may not be so involved there).

Would be interesting to compare single thread decoding (direct decode to shm vs. decode + copy) to see if the multi-thread decoding is a factor here.

Anyway, unless we find a way how to tell OS to not share/propagate shm changes during decode this may be a dead end (as well as it's for dmabuf direct decoding).

I'll try the direct decoding with some old hardware to check the cache theory.

What I feel is multi-thread decoding is not a factor, but the mechanism of synchronizing the shmem might be related. Following result are for multi-threads, and single thread (change this to 1) From below result, it shows that even if using single thread, decode to shmem is still slower.

  • Linux, Multi-threads, Decode + Copy
    [Child 579563: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 19340.897155 us in the average.

  • Linux, Single thread, Decode + Copy
    [Child 794861: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 44225.718321 us in the average.

  • Linux, Multi-threads, Decode to shmem
    [Child 579979: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 20269.318408 us in the average.

  • Linux, Single thread, Decode to shmem
    [Child 795783: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 53721.353087 us in the average.


By observing the stack in comment 37 (decode to shmem), it took a lot of time on ZwFreeVirtualMemory (on Windows) which is called from av_buffer_unref. In vp9_decode_update_thread_context, I did see ffvpx would start checking which frames are no longer needed and would unref those AVFrames. And then it seems triaggering resetting data on shmem, which would cost a lot of time. Also, on Linux, the time spend most are on __pthread_cond_wait which also looks like synchronizing shmeme data between two processes?

However, if those data got reset during decoding, how could we see the image on the compositor?! Because the images are still complete, the shmem buffer should suppose unchanged after we receive decoded video frames. If those data didn't get reset, why av_buffer_unref would trigger those cleaning methods and took a lot of time?

(In reply to Alastor Wu [:alwu] from comment #40)

What I feel is multi-thread decoding is not a factor, but the mechanism of synchronizing the shmem might be related. Following result are for multi-threads, and single thread (change this to 1) From below result, it shows that even if using single thread, decode to shmem is still slower.

  • Linux, Multi-threads, Decode + Copy
    [Child 579563: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 19340.897155 us in the average.

  • Linux, Single thread, Decode + Copy
    [Child 794861: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 44225.718321 us in the average.

  • Linux, Multi-threads, Decode to shmem
    [Child 579979: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 20269.318408 us in the average.

  • Linux, Single thread, Decode to shmem
    [Child 795783: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 53721.353087 us in the average.


By observing the stack in comment 37 (decode to shmem), it took a lot of time on ZwFreeVirtualMemory (on Windows) which is called from av_buffer_unref. In vp9_decode_update_thread_context, I did see ffvpx would start checking which frames are no longer needed and would unref those AVFrames. And then it seems triaggering resetting data on shmem, which would cost a lot of time. Also, on Linux, the time spend most are on __pthread_cond_wait which also looks like synchronizing shmeme data between two processes?

Hm, maybe the shm memory allocation/free takes so much time? I wonder if the original way (decode to regular memory and then copy to shm) uses different/better way to allocate shm memory chunks or are they recycled?

However, if those data got reset during decoding, how could we see the image on the compositor?! Because the images are still complete, the shmem buffer should suppose unchanged after we receive decoded video frames. If those data didn't get reset, why av_buffer_unref would trigger those cleaning methods and took a lot of time?

Is av_buffer_unref called along ReleaseVideoBufferWrapper()? Because that seems to be the only correct way how to call it, when your own buffer is released, right?

As for the images - the shm images are uploaded from shm to OpenGL textures (located at GPU) and that's a single operation - you can release the shm after that as GL keeps copy of the texture on GPU (and maybe also in RAM but that depends how the textures are uploaded). So it's possible that the underlying shm memory is released but the texture is still live and used for rendering.

Attachment #9248959 - Attachment description: Bug 1713276 - part1 : allocate shmem for ffmpeg to store decoded data directly. → WIP: Bug 1713276 - part1 : allocate shmem for ffmpeg to store decoded data directly.
Attachment #9248959 - Attachment description: WIP: Bug 1713276 - part1 : allocate shmem for ffmpeg to store decoded data directly. → Bug 1713276 - part1 : allocate shmem for ffmpeg to store decoded data directly.

Last time I discussed with :jrmuizel, and he pointed me out that when I do the measurement, I should use the build which is closer to the release build. Then I tried to measure them again on OPT build with RUSTC_OPT_LEVEL=2 (because release build doesn't have full RUST optimization), here are the result.


I tested on five different videos and on the different platforms, and the result shows that on Linux and MacOS, this improvement did help! (especially on my Linux machine, SW decoding + SW compositing) In my previous measurement, I didn't use OPT build and only test on Linux. This new result should be more precise than the old one I made.

But on Windows, the result is interesting. For some video, it shows very good result (53% improvement), but on other video, it shows very bad result (-77% improvement). For now I didn't figure out the reason, and I will keep investigating on this.

In general, if this result is correct, then this improvement is still worthy to do.

In the previous measurement, the result for Windows 11 Dev didn't show a stable improvement, I am not sure if that is because Windows 11 Dev is still not stable not there might be some unexpected behavior. In addition, yesterday after I restarted (and updated) my Windows 11 Dev, my Windows 11 became totallu usuable, so I couldn't do any measurement again, which forced me to reinstall Windows 10 on my machine.

Therefore, I did the measurement again on Windows 10, and the result did show a stable improvement on all videos. Therefore, I think it would be good to use shmem for decoded data for all sw decoding cases, even if we uses hardware composition. We can always share the decoded data to the compositior process, and then upload the texture to GPU later if necessary.

Attachment #9250317 - Attachment is obsolete: true
Attachment #9251652 - Attachment is obsolete: true
Attachment #9248959 - Attachment description: Bug 1713276 - part1 : allocate shmem for ffmpeg to store decoded data directly. → Bug 1713276 - allocate shmem for ffmpeg to store decoded data directly.

Alastor, while I'm looking at the patche, have you tried to run try on them? I see some failures there:
https://treeherder.mozilla.org/jobs?repo=try&revision=7074eecc4c93a9af5859a666cc4a8609c3ddc032

Sure, but the tests I ran are all high level integrated tests because I didn't expect this would break the gtest. I will look into it later, thanks.

Attachment #9248959 - Attachment description: Bug 1713276 - allocate shmem for ffmpeg to store decoded data directly. → Bug 1713276 - part1 : allocate shmem for ffmpeg to store decoded data directly.

Change some less important debugs to verbose so that we can use DEBUG level to focus more important things.

For AudioTrimmer, this will help to reduce huge useless logs if audio doesn't need to be trimmed.

Depends on D130220

ffmpeg decoder can also be used for VPX (and AV1 in the future), so remove 'h264' wording in the warning message.

Depends on D134350

Depends on D134351

Attachment #9256272 - Attachment description: Bug 1713276 - part4 : add fuzzy range for some reftests on OSX. → Bug 1713276 - part4 : modify fuzzy range for some reftests.
Depends on: 1748540
Attachment #9256272 - Attachment description: Bug 1713276 - part4 : modify fuzzy range for some reftests. → Bug 1713276 - part4 : adjust fuzzy range for some reftests.

There are some offline discussion about how to recycle the shmem buffer, here jya mentioned that we need a cross-process, atomic refcount. At the discussion, I didn't really understand the detail of how we release the shmem buffer from the compositor side, so I checked that again and make a record here.

tl;dr, from my investigation, we do have a cross-process, atomic refcount mechanism to release the shmem buffer.

First, when we create a shmem buffer, it would be stored inside a TextureClient. and the shmem buffer would only be destroyed when the texture client gets destroyed.

Texture client has an atomic refcount. After decoder returns a data to us, and we give the texture to the compostior, which in another process, there are two refcount that we can ensure at least happened. The first one is the one holding by ffmpeg (we hold a reference to image (done in patch1, in the end of GetVideoBuffer()), and image holds a reference to texture client. Another is in CompositorBridgeChild, where the texture would be added to a list, then when compositor finishes using the texture, the texture client would be removed from the list. As the refcount is notified from another process, so it's a cross process mechanism.

When ffmpeg doesn't need to keep the buffer, we would be notifted by ReleaseVideoBufferWrapper() callback (in patch1) in which we would release the refcount for texture. When compositor doesn't need to keep the buffer, we would be notified by CompositorBridge. Therefore, we can always ensure shmem would be released correctly.

Update the amount usage comparison in here, directly decoding to shmem significantly reduces the amount of memory usage on Linux and MacOS. But on Windows the result doesn't show any improvement or deterioration, I am not sure why. (Also, the memory usage is super low on Windows. Maybe profiler couldn't capture memory usage well on Windows?)

Do you compare on the same box? Is the decoding on Windows faster than on Linux on the same hardware with SW decoding?

No, they are different devices. You can check about:support for those devices in the part of Testing Devices in my document.

I see. I think it would be interesting to compare performance on the same device on Windows / Linux. You can compare Linux performance with/without shm direct decode and the Windows one to deduce if Windows behaves well or not.

Alastor, I have a laptop with both Windows 10 & Linux installed so I can do the Windows/Linux tests if there's a try build available.

Here it's, thank you so much!

(Alastor Wu [:alwu] from comment #55)

Here it's, thank you so much!

https://treeherder.mozilla.org/jobs?repo=try&revision=3f1c74e0c0f351c27ddaa906e40c5dc9d4d7155f
mozregression --repo try --launch 3f1c74e0c0f351c27ddaa906e40c5dc9d4d7155f

vs.
mozregression --repo try --launch 3f1c74e0c0f351c27ddaa906e40c5dc9d4d7155f --pref media.ffmpeg.customized-buffer-allocation:false

Thanks. I tested that with gecko profiler but I can't get any meaningful numbers. I'm going to try some system level profiler on Windows to get the difference (also on linux perf may be the best tool for it).

Hello Alastor, is there any video playback performance stats in Firefox? I mean decoded / dropped frames, fps etc.
I testes some clips (8K, 4K) on both Windows/Linux on the same box and:

  • CPU usage looks pretty much the same on Windows/Linux. media.ffmpeg.customized-buffer-allocation doesn't seem to have impact to CPU usage, even media.ffmpeg.customized-buffer-allocation:false is about ~1-2% CPU better.
  • Video playback seems to be smoother for media.ffmpeg.customized-buffer-allocation:true

Also number of decoding thread may be a factor here.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #59)

Hello Alastor, is there any video playback performance stats in Firefox? I mean decoded / dropped frames, fps etc.
I testes some clips (8K, 4K) on both Windows/Linux on the same box and:

  • CPU usage looks pretty much the same on Windows/Linux. media.ffmpeg.customized-buffer-allocation doesn't seem to have impact to CPU usage, even media.ffmpeg.customized-buffer-allocation:false is about ~1-2% CPU better.
  • Video playback seems to be smoother for media.ffmpeg.customized-buffer-allocation:true

Eh, sorry, the 8K clip was actually AV1 so I need to retest again :)

I re-tested a 4K VP9 clip and there are results:

Windows:
media.ffmpeg.customized-buffer-allocation:false
RDD process: 11-13% CPU, 306MB RAM (700MB all Firefox RAM)

media.ffmpeg.customized-buffer-allocation:true
RDD process: 12-15% CPU, 40MB RAM (400MB all Firefox RAM)

Linux:
The same numbers but CPU is 1-2% lower

Looks like direct decoding has minor performance penalty but huge RAM impact.
I used simple process monitor to see CPU%/RAM sized as I didn't get any useful numbers from VTune. Also I think the CPU/RAM is what we and users care most.

Sorry for interfering, but can I test this "flag" too with my UHD620 + 4K display combo on Win and Linux since I have issues on both OS with playing 4k@60fps videos? I just do not know how to make it. :)

(In reply to Ivan from comment #62)

Sorry for interfering, but can I test this "flag" too with my UHD620 + 4K display combo on Win and Linux since I have issues on both OS with playing 4k@60fps videos? I just do not know how to make it. :)

Yes - you'll need mozregression (pip install --user mozregression or so) , then you can run the commands from comment 56 and comment 57.

I didn't measure RAM usage, but in my showcase (Garuda Linux + 4K display + UHD620 + HW accel) I don't catch any benefits in terms of temps and framedrops behaviour compared to 97.0b2 (as I was hope). But could be I shouldn't))
But at the same time no problems to me, video works just fine.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #59)

Hello Alastor, is there any video playback performance stats in Firefox? I mean decoded / dropped frames, fps etc.

Yes, you can use VideoPlaybackQuality.droppedVideoFrames.

Looks like direct decoding has minor performance penalty but huge RAM impact.

Thanks for testing! That seems understandable, because now we don't need to allocate memory for copying decoded data, so the improvement should be more significant on high resolution video. Using higher CPU seems also understandable, because allocating shmem needs more work than allocating normal memory.

(In reply to Alastor Wu [:alwu] from comment #65)

Thanks for testing! That seems understandable, because now we don't need to allocate memory for copying decoded data, so the improvement should be more significant on high resolution video. Using higher CPU seems also understandable, because allocating shmem needs more work than allocating normal memory.

I see. I though you did some kind of memory frames recycling with shm. I wonder what causes such massive memory difference then.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #66)

I see. I though you did some kind of memory frames recycling with shm. I wonder what causes such massive memory difference then.

I think that is probably also true? I didn't trace the whole process of how those texture buffers would be used though, but the texture allocator supports recycling. The texture client would be pushed back to the pool when its reference drops to 1.

That does not makes much sense to me. I wonder why the decode to shm is slower. When the time is consumed by allocation, then what's recycled? I expect when we're recycling allocated shm memory (and everything looks like so) we don't allocate anything but just cycle few already allocated shm buffers which are mapped/unmapped as textures later during the cycle (that's the texture Lock/Unlock IMHO). Also the tiny memory usage indicates that.

I really think that direct decode to shm is slower due to random access to shm as ffmpeg do read/write to shm during decode and that may be handled by kernel to sync shm/flush cahe etc. I'll try to do some benchmarks to compare so. If that's confirmed we can do the decode to RAM as we did before but use the recycling shm image container to keep the RAM effect of this patch.

I'll also try to get detailed performance data from perf as this is really interesting problem :)

Got the performance data. I test playing 1 min of 4K clip. With direct decode to shm there are shm mapping failures on the stack (but rest is the same):

-    8.07%     0.67%  MediaPD~oder #1  [kernel.kallsyms]
   - 7.40% asm_exc_page_fault                                                                                                                                                                ▒
      - 7.33% exc_page_fault                                                                                                                                                                 ▒
         - 7.29% do_user_addr_fault                                                                                                                                                          ▒
            - 6.85% handle_mm_fault                                                                                                                                                          ▒
               - 6.56% __handle_mm_fault                                                                                                                                                     ▒
                  - 5.04% __do_fault                                                                                                                                                         ▒
                     - 4.98% shmem_fault                                                                                                                                                     ▒
                        - 4.91% shmem_getpage_gfp                                                                                                                                            ▒
                             1.58% clear_page_rep                                                                                                                                            ▒
                           - 1.43% shmem_alloc_and_acct_page                                                                                                                                 ▒
                              - 1.27% shmem_alloc_page                                                                                                                                       ▒
                                 - 1.19% alloc_pages_vma                                                                                                                                     ▒
                                    - 1.12% __alloc_pages                                                                                                                                    ▒
                                       - 0.96% get_page_from_freelist                                                                                                                        ▒
                                          - 0.71% rmqueue_bulk                                                                                                                               ▒
                                               0.58% __list_del_entry_valid                                                                                                                  ▒
                             1.19% shmem_add_to_page_cache                                                                                                                                   ▒
   - 0.66% start_thread                                                                                                                                                                      ▒
      - 0.59% frame_worker_thread                                                                                                                                                            ▒
         - vp9_decode_frame                                                                                                                                                                  ▒
              0.59% decode_sb                                                                                                                                                                ▒

So looks like we hit mapping failure so the process is suspended unless the page is loaded. May this be caused by multi-thread decode to the same shared memory page?

btw. Recorded on Linux as:

$perf record -g ./firefox

And analyzed as:

$perf report --call-graph

(In reply to Martin Stránský [:stransky] (ni? me) from comment #70)

So looks like we hit mapping failure so the process is suspended unless the page is loaded. May this be caused by multi-thread decode to the same shared memory page?

How about if you set thread_count = 1 to disable multi-threads decoding in ffmpeg? will that make any difference?

Linux: The same numbers but CPU is 1-2% lower

Forgot to reply this. So on Linux, memory usage between those two approach didn't have any difference? hmm but shmem should be able to reused because all buffers we want to allocate should be compatible with the shmem we have allocated.

In addition, you said CPU is lower when enabling shmem decoding, so enabling that doesn't degrade the performance on Linux? Even if we hit 8.07% page fault?

I will also try to test by using other profile tools tomorrow to see if I can get other interesting data or not.

The profile suggests we're not properly recycling the shared memory. (On Linux pages of allocations are usually allocated in the page fault handler when they're first touched. alloc_pages_vma)

We don't want to be doing any shared memory allocations during video playback once the initial buffers have been allocated.

(In reply to Alastor Wu [:alwu] from comment #72)

In addition, you said CPU is lower when enabling shmem decoding, so enabling that doesn't degrade the performance on Linux? Even if we hit 8.07% page fault?

To be clear I see:

Windows:
media.ffmpeg.customized-buffer-allocation:false
RDD process: 11-13% CPU, 306MB RAM (700MB all Firefox RAM)

media.ffmpeg.customized-buffer-allocation:true
RDD process: 12-15% CPU, 40MB RAM (400MB all Firefox RAM)

Linux:
media.ffmpeg.customized-buffer-allocation:false
RDD process: 9-11% CPU, 306MB RAM (700MB all Firefox RAM)

media.ffmpeg.customized-buffer-allocation:true
RDD process: 11-14% CPU, 40MB RAM (400MB all Firefox RAM)

(In reply to Alastor Wu [:alwu] from comment #72)

How about if you set thread_count = 1 to disable multi-threads decoding in ffmpeg? will that make any difference?

It's the same, i.e. asm_exc_page_fault is on the perf stack.

No longer blocks: 1739213
Blocks: 1739213
No longer blocks: egl-linux-vaapi

(In reply to Jeff Muizelaar [:jrmuizel] from comment #73)

The profile suggests we're not properly recycling the shared memory. (On Linux pages of allocations are usually allocated in the page fault handler when they're first touched. alloc_pages_vma)

We don't want to be doing any shared memory allocations during video playback once the initial buffers have been allocated.

That seems to be correct - we repeatedly allocate shm memory by VideoBridgeChild. Looking at it now.

So yes. The frames are not recycled because texture sizes are different (shrank). At YCbCrTextureClientAllocationHelper::IsCompatible():

aTextureClient->GetSize()
width = 3840,
height = 2160

mData.mYSize
width = 3840,
height = 2176

bufferData->GetCbCrSize().ref()
width = 1920,
height = 1080

mData.mCbCrSize
width = 1920,
height = 1088

(In reply to Martin Stránský [:stransky] (ni? me) from comment #77)

So yes. The frames are not recycled because texture sizes are different (shrank). At YCbCrTextureClientAllocationHelper::IsCompatible():

Ah, good finding! So it seems that we have to provide a cropped size if there are avaliable texture for recycling. But I am curious, if we didn't recylce textures correctly, why we could reduce such a significant amount of memory usage?

(In reply to Alastor Wu [:alwu] from comment #78)

(In reply to Martin Stránský [:stransky] (ni? me) from comment #77)

So yes. The frames are not recycled because texture sizes are different (shrank). At YCbCrTextureClientAllocationHelper::IsCompatible():

Ah, good finding! So it seems that we have to provide a cropped size if there are avaliable texture for recycling. But I am curious, if we didn't recylce textures correctly, why we could reduce such a significant amount of memory usage?

That may be some ffmpeg/ffvpx internal bug or we manage AV Buffers incorrectly (Bug 1743750).

See Also: → 1750858

I had some offline discussion with Jeff, and I think the texture recyling issue (comment 77) should be fixed along with bug 1750858. The reason is that I don't want to add any new tech debt. An easy way to fix recyling problem would be to modify YCbCrTextureClientAllocationHelper::IsCompatible() and let it check cropped size instead, but that introduces a new tech debt and definitely is a workaround.

The ultimate goal should be fixing bug 1750858 and then we don't need to crop our planes anymore. Therefore, all texture would be using a larger size for checking the compatibility so that those textures would be possible to be reused.

Even considering the worst case, we would allocate shmem for every video frame, that is actually equal to what we're current doing (create a shmem everytime and then copying decoded data into it). So landing the current patch won't cause any new problem. Considering this feature is an experimental feature which would only be enabled on Nightly first, 1~2% CPU gain in't a big deal and we also see a huge amount of memory usage reduction (although we're not clear why)

Therefore, I think I would still prefer to land current patches first (after them all r+), and bake them on Nightly to see if there is any new issues happening. In the meanwhile, we will also try to fix bug 1750858 in order to recycle textures correctly.

Blocks: 1751051
Pushed by alwu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/13ce7037ff27
part1 : allocate shmem for ffmpeg to store decoded data directly. r=stransky,gfx-reviewers,sotaro,jgilbert,media-playback-reviewers
https://hg.mozilla.org/integration/autoland/rev/b23498172031
part2 : add verbose log level for FFmpegLog and AudioTrimmerLog. r=stransky,bryce,media-playback-reviewers
https://hg.mozilla.org/integration/autoland/rev/cdc8216b3f18
part3 : correct warning message. r=stransky,media-playback-reviewers
https://hg.mozilla.org/integration/autoland/rev/12a9f3fad481
part4 : adjust fuzzy range for some reftests. r=gfx-reviewers,jgilbert
Flags: needinfo?(alwu)
Pushed by alwu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f8814abefa20
part1 : allocate shmem for ffmpeg to store decoded data directly. r=stransky,gfx-reviewers,sotaro,jgilbert,media-playback-reviewers
https://hg.mozilla.org/integration/autoland/rev/3d88e393a0cb
part2 : add verbose log level for FFmpegLog and AudioTrimmerLog. r=stransky,bryce,media-playback-reviewers
https://hg.mozilla.org/integration/autoland/rev/9a47b267032e
part3 : correct warning message. r=stransky,media-playback-reviewers
https://hg.mozilla.org/integration/autoland/rev/cee073e045a3
part4 : adjust fuzzy range for some reftests. r=gfx-reviewers,jgilbert

Backed out for causing build bustages and wpt failures

Backout link: https://hg.mozilla.org/integration/autoland/rev/a20b8d912b3f8cfe4e54143f3d708b549e17fe41

Push with failures: failure push wpt // failure push bustage

Failure log(s): failure log wpt // failure log bustage

Failure line(s): TEST-UNEXPECTED-FAIL | /html/canvas/element/manual/wide-gamut-canvas/canvas-display-p3-drawImage-ImageBitmap-video.html | sRGB-BB0000, Context srgb, ImageData srgb, cropSource=true - assert_true: Actual pixel value 0,0,0,0 is approximately equal to 187,0,0,255. expected true got false // gmake[4]: *** [/builds/worker/checkouts/gecko/config/rules.mk:660: Unified_cpp_ffmpeg_ffmpeg580.o] Error 1

Flags: needinfo?(alwu)

I don't think that wpt failure was caused by my patches, by looking at my last try-run before landing my patches, you can see that failure is intermittent. When the failure happened, the crash happened inside the web render, which is not something my code would touch.

I would say, probably my code changes the timing and exposes this failure and make it more happening more frequently, but my patches are not a root cause. But I will update my patch to avoid that build bustage (weird, it didn't show up in my try-run before!)

Flags: needinfo?(alwu)
Pushed by alwu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/002c71e3c3f7
part1 : allocate shmem for ffmpeg to store decoded data directly. r=stransky,gfx-reviewers,sotaro,jgilbert,media-playback-reviewers
https://hg.mozilla.org/integration/autoland/rev/5503198fc6b2
part2 : add verbose log level for FFmpegLog and AudioTrimmerLog. r=stransky,bryce,media-playback-reviewers
https://hg.mozilla.org/integration/autoland/rev/52fdcd1dc5ed
part3 : correct warning message. r=stransky,media-playback-reviewers
https://hg.mozilla.org/integration/autoland/rev/f4ca14151598
part4 : adjust fuzzy range for some reftests. r=gfx-reviewers,jgilbert

Backed out on suspicion of causing wpt failure in canvas + video test. Multiple wpt failures.

  • Backout link
  • Push with failures
  • Failure Log
  • Failure line 1: html/canvas/element/manual/wide-gamut-canvas/canvas-display-p3-drawImage-video.html | sRGB-FF0100, Context srgb, ImageData srgb, scaleImage=false - assert_true: Actual pixel value 0,0,0,0 is approximately equal to 255,1,0,255. expected true got fals
  • Failure line 2: TEST-UNEXPECTED-FAIL | /html/canvas/element/manual/wide-gamut-canvas/canvas-display-p3-drawImage-video.html | sRGB-BB0000, Context display-p3, ImageData srgb, scaleImage=false - assert_true: Actual pixel value 0,0,0,0 is approximately equal to 187,1,0,255. expected true got false
Flags: needinfo?(alwu)
Flags: needinfo?(alwu)
Blocks: 1751498
Pushed by alwu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b2676297c6ff
part1 : allocate shmem for ffmpeg to store decoded data directly. r=stransky,gfx-reviewers,sotaro,jgilbert,media-playback-reviewers
https://hg.mozilla.org/integration/autoland/rev/0a1be4875045
part2 : add verbose log level for FFmpegLog and AudioTrimmerLog. r=stransky,bryce,media-playback-reviewers
https://hg.mozilla.org/integration/autoland/rev/d8e319535a01
part3 : correct warning message. r=stransky,media-playback-reviewers
https://hg.mozilla.org/integration/autoland/rev/21a1cb173d50
part4 : adjust fuzzy range for some reftests. r=gfx-reviewers,jgilbert
https://hg.mozilla.org/integration/autoland/rev/9ac49fde0e53
part5 : don't use shmem textures for 10bits videos on Windows. r=media-playback-reviewers,stransky

Alastor, may you land it with media.ffmpeg.customized-buffer-allocation set to false to pass tests and enable it in another patch when all tests are fixed? That will save your time with rebases to latest trunk. It also helps to ensure it doesn't break anything when the pref if off.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #91)

Alastor, may you land it with media.ffmpeg.customized-buffer-allocation set to false to pass tests and enable it in another patch when all tests are fixed? That will save your time with rebases to latest trunk. It also helps to ensure it doesn't break anything when the pref if off.

It's ok, that new failure is in my expectation and I know why it failed. (See this, and that should be fixed in bug 1748540) I just forgot to handle that as we disable shmem-decoding for 10bit+ videos on Windows in the last update. After I modify patch4 again and do full testing, and ensure everything is ok, I will reland them again.

Flags: needinfo?(alwu)

I've updated my patch4 to remove the changes for 10+ bits video, which are all related with SW WR on Windows (but now we won't enable shmem decoding for 10+ bits video on Windows, see 1751498)

Partial try run and a completed try run both look good. Also, the new try run I just pushed which is based on the latest m-c after doing some rebasing.

Depends on D134352

Attachment #9260503 - Attachment is obsolete: true
Pushed by alwu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/555a3481ee4d
part1 : allocate shmem for ffmpeg to store decoded data directly. r=stransky,gfx-reviewers,sotaro,jgilbert,media-playback-reviewers
https://hg.mozilla.org/integration/autoland/rev/f032a3f3200a
part2 : add verbose log level for FFmpegLog and AudioTrimmerLog. r=stransky,bryce,media-playback-reviewers
https://hg.mozilla.org/integration/autoland/rev/4d0ac457af61
part3 : correct warning message. r=stransky,media-playback-reviewers
https://hg.mozilla.org/integration/autoland/rev/38eb80e3c143
part4 : adjust fuzzy range for some reftests. r=gfx-reviewers,jgilbert
https://hg.mozilla.org/integration/autoland/rev/3b1e43ee7774
part5 : don't use shmem textures for 10bits videos on Windows. r=media-playback-reviewers,stransky
Regressions: 1751913
Regressions: 1752388
Regressions: 1753575

With bug 1753575 fixed I can confirm that shm memory recycling works.

Blocks: 1757436
Blocks: 1757980
Regressions: 1758193
No longer regressions: 1758193
Regressions: 1758193
Regressions: 1768259
You need to log in before you can comment on or make changes to this bug.