1713276 - [ffmpeg] Decode video frames directly into VideoData

With Bug 1715365 (FFmpegVideoFramePool) we can extend it for frames located at ram to reduce AVFrame -> VideoData copy for all Linux backends - even Basic (at least for some well know pixel formats).

If we allocate the frames in shm memory we may even reduce copy during frame serializing (I'm not sure how the VideoData is processed in GLX/Basic case).

Summary: [Linux] Decode video frames directly into dmabuf surfaces → [Linux] Decode video frames directly into video surfaces

Martin Stránský [:stransky] (ni? me)

Reporter

Updated

•

4 years ago

Summary: [Linux] Decode video frames directly into video surfaces → [Linux] Decode video frames directly into VideoData

Alastor Wu [:alwu]

Assignee

Comment 2

•

4 years ago

If my understanding is correct, this will not only help to avoid one copy (moving data into AVFrame) on Linux, but also can help ffvpx decoding on other platforms?

Flags: needinfo?(stransky)

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 3

•

4 years ago

(In reply to Alastor Wu [:alwu] from comment #2)

If my understanding is correct, this will not only help to avoid one copy (moving data into AVFrame) on Linux, but also can help ffvpx decoding on other platforms?

Yes, any platform which uses ffvpx can use that (but I think it's used on Linux only).

Flags: needinfo?(stransky)

Alastor Wu [:alwu]

Assignee

Comment 4

•

4 years ago

(In reply to Martin Stránský [:stransky] (ni? me) from comment #3)

Yes, any platform which uses ffvpx can use that (but I think it's used on Linux only).

We also use ffvpx on Windows and MacOS, so this would help on those platforms as well.

Summary: [Linux] Decode video frames directly into VideoData → [ffmpeg] Decode video frames directly into VideoData

Alastor Wu [:alwu]

Assignee

Comment 5

•

4 years ago

Hi, Martin, I wonder if I could know your plan for this bug? Is there anything that I can help with? Because I feel that could probably help a lot on the performance of vpx decoding while using ffvpx, which will be beneficial on multiple platforms. Thank you!

Flags: needinfo?(stransky)

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 6

•

4 years ago

Hi Alastor, I keep it on my radar but I'm busy with more pressing Linux/Wayland issues right now. But I have patches WIP and I'm going to come back and look at it.

Darkspirit

Updated

•

4 years ago

Blocks: video-perf

Alastor Wu [:alwu]

Assignee

Comment 7

•

4 years ago

I started investigating how chromium does that today. Here is the place they set the get_buffer2 for ffmpeg in order to use their own buffer, and here is the place they create their own buffer for ffmpeg.

Flags: needinfo?(stransky)

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 8

•

4 years ago

Attached patch in-place.patch — Details — Splinter Review

Hi Alastor,

there's my WIP patch attached here, it may help you with the start. It's related to DMABuf but key parts may be reused for general in-place decoding as they are shared.

Key part is:

#if LIBAVCODEC_VERSION_MAJOR >= 58
if (mVideoFramePool /* && (mCodecContext->flags & AV_CODEC_CAP_DR1)*/) {
mCodecContext->opaque = this;
mCodecContext->get_buffer2 = GetVideoBufferWrapper;
mCodecContext->thread_safe_callbacks = 1;
}
#endif

where we set custom method to supply video buffer to ffmpeg:

static int GetVideoBufferWrapper(struct AVCodecContext* aCodecContext,
AVFrame* aFrame, int aFlags);
static void ReleaseVideoBufferWrapper(void* opaque, uint8_t* data);

At FFmpegVideoDecoder<LIBAV_VER>::GetVideoBuffer() you need to create buffer according to ffmpeg expectations, it's similar how we create the buffer for copy.

The patch also contains related changes to FFmpegLibWrapper which may be the same.

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 9

•

4 years ago

Note that you don't have to create custom buffer for every format - see call avcodec_default_get_buffer2(). You say ffmpeg to create the buffer and you need to copy data later as we do now.

Alastor Wu [:alwu]

Assignee

Comment 10

•

4 years ago

Thank for your patch!

After a quick glance, this seems only work for Linux because it only handle DMAbuffer surface. As ffmpeg decoder can also be used on ffvpx for software decoding on Windows and Mac, we also need to handle those cases where we have to create a shmem to share with GPU/Chrome process, in order to let compositor to access the data directly.

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 11

•

4 years ago

(In reply to Alastor Wu [:alwu] from comment #10)

Thank for your patch!

After a quick glance, this seems only work for Linux because it only handle DMAbuffer surface. As ffmpeg decoder can also be used on ffvpx for software decoding on Windows and Mac, we also need to handle those cases where we have to create a shmem to share with GPU/Chrome process, in order to let compositor to access the data directly.

Yes, the patch is designed for dmabuf but can be updated for shm mem. You need to create SHM memory buffer at mCodecContext->get_buffer2 callback and pass it to VideoData instead of copy them:

https://searchfox.org/mozilla-central/rev/75e9d727ce5ba2c14653cf8fb0f1367f085271b7/dom/media/MediaData.cpp#356
https://searchfox.org/mozilla-central/rev/75e9d727ce5ba2c14653cf8fb0f1367f085271b7/gfx/layers/ImageContainer.cpp#246

Alastor Wu [:alwu]

Assignee

Comment 12

•

4 years ago

Hi, Martin,
I'm still investigating the code to see how we can implement that, now I have some rough idea and I collect them into this doc. Would you mind to check if the direction of my thought is correct or not?
Thank you so much.

Flags: needinfo?(stransky)

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 13

•

4 years ago

•

Edited

(In reply to Alastor Wu [:alwu] from comment #12)

Hi, Martin,
I'm still investigating the code to see how we can implement that, now I have some rough idea and I collect them into this doc. Would you mind to check if the direction of my thought is correct or not?
Thank you so much.

Hello Alastor,

There are some facts which needs to be considered:

direct decode to dmabuf (on Linux) is very slow. It because ffmpeg also reads from the buffer during decode and it expects the buffer is zeroed. Direct decoding to dmabuf takes 700%!! CPU on my box while indirect decoding 40-60%. The GPU memory is generally very slow to read (if it's even supported). I don't know how is that handled on Windows/Mac but I expect the situation is similar (or the data are cached and then moved to GPU at once but that's what we do already). The correct (and only working) way here is to create a texture, upload video buffer into it and use such texture (which is done by gecko anyway).
recycled buffers can be hold for long time by different parts of gecko/different processes.
you may need to use avcodec_default_get_buffer2() for video buffer formats which are not supported by us (at least for the beginning).

Right now the decoding sequence is (on Linux):

ffmpeg allocates video buffer and decodes video data into it
we allocate shm buffer and copy video data there
we allocate GL texture and copy data there

So there are 3 allocations and 3 video data copy.

Given the facts I'm suggesting to decode to shm memory buffers only. We can allocate every frame now and implement frame pool later if that's needed, I don't know how shm allocation is expensive (we allocate shm memory for every frame now).

Flags: needinfo?(stransky)

Alastor Wu [:alwu]

Assignee

Comment 14

•

4 years ago

According to comment13, as ffmpeg will read (referenceing frames for decoding) and write the buffer (outputing decoded data), if we provide GPU memory for that, it would be pretty slow because of the need to access GPU memory frequently from CPU. Therefore, we should only allocate CPU memory (shmem buffer) to ffmpeg decoder.

If my above understanding is correct, then this improvement will only help on software composition?

Because for the hardware compostion and hardware decoding, the decoded data are already based on GPU memory, so we don't actually need to copy them [1], and can give them directly to the hardware compositor. For software decoding and hardware composition, we would still create a sharable hardware texture (eg. D3D11 or MacIOSurface ) to share the data to hardware compositor.

For above cases, we would eventually create a hardware based memory buffer to convey the data, so using shmem seems no help at all?

Sorry I still don't really understand how compostior works, so above comment might be wrong.

Thank you so much.

[1] Although now we still copy data on Windows, but from my understanding, that can be avoid (bug 1723207)
https://searchfox.org/mozilla-central/rev/a9ef6ad97d2d5f96d5ed51eda38f1a02700ccff7/dom/media/platforms/wmf/DXVA2Manager.cpp#906-931

Flags: needinfo?(stransky)

Alastor Wu [:alwu]

Assignee

Comment 15

•

4 years ago

(In reply to Alastor Wu [:alwu] from comment #14)

According to comment13, as ffmpeg will read (referenceing frames for decoding) and write the buffer (outputing decoded data), if we provide GPU memory for that, it would be pretty slow because of the need to access GPU memory frequently from CPU. Therefore, we should only allocate CPU memory (shmem buffer) to ffmpeg decoder.

In addition, if this is the reason, then does running ffmpeg decoder on GPU solve the slow problem? (I know only Windows has GPU process, just thinking about its possibility)

Take vp9 on Windows as an example, when hardware decoding is enabled, we will use WMF decoder (Windows platform decoder) in GPU process, but once if we disable hardware decoding, we will fallback to ffvpx decoder in RDD. I imagine that if we can run ffvpx decoder in GPU process, then all read/write on decoded buffer would happen on GPU process and no longer "slow" (if the reason I mentioned above is correct)

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 16

•

4 years ago

(In reply to Alastor Wu [:alwu] from comment #14)

According to comment13, as ffmpeg will read (referenceing frames for decoding) and write the buffer (outputing decoded data), if we provide GPU memory for that, it would be pretty slow because of the need to access GPU memory frequently from CPU. Therefore, we should only allocate CPU memory (shmem buffer) to ffmpeg decoder.

If my above understanding is correct, then this improvement will only help on software composition?

I'm not familiar with Windows/Mac decoding so I can give you only rough picture:

HW decoding:
GPU (decoded ffmpeg video frame) -> GPU (decoded video frame in dmabuf) -> IPC -> GPU (texture used by webrender)

SW decoding with dmabuf backend / WebRender:
CPU (decoded ffmpeg video frame) -> GPU (decoded video frame in dmabuf) -> IPC -> GPU (texture used by webrender)

SW decoding with WebRender
CPU (decoded ffmpeg video frame) -> CPU (decoded video frame in shm) -> IPC -> GPU (texture used by webrender)

SW decoding with SW WebRender
CPU (decoded ffmpeg video frame) -> CPU (decoded video frame in shm) -> IPC -> CPU (SW rendering of video frame)

With in-place decoding we can transform:

'CPU (decoded ffmpeg video frame) -> CPU (decoded video frame in shm)'

to just one step

'CPU (decoded ffmpeg video frame in shm)'

So if we decode directly to shm buffer, the shm buffer can be send to another process for compositing (SW or HW) without additional copy, no matter if it's transformed to texure or used for SW composition.

Flags: needinfo?(stransky)

Alastor Wu [:alwu]

Assignee

Comment 17

•

4 years ago

Attached file Bug 1713276 - part1 : allocate shmem for ffmpeg to store decoded data directly. — Details

Alastor Wu [:alwu]

Assignee

Comment 18

•

4 years ago

Update the current WIP, now on Linux I can decode vpx and h264 onto the shmem directly, but sometime it causes a crash (SIGSEGV) inside the libmozavcodec.so, which I suspect that I didn't set the size (or stride, linesize) well. I'm still debugging that issue.

Assignee: stransky → alwu

Severity: -- → N/A

Priority: -- → P2

Jeff Muizelaar [:jrmuizel]

Updated

•

4 years ago

Blocks: 1739213

Alastor Wu [:alwu]

Assignee

Comment 19

•

4 years ago

Hi, Martin,

Currently my patch works for live streaming vp9, like this one and h264 video, like this, but it would still crash on this video in 1080p or 1440p (on other resolutions, it won't crash! but I don't know why)

I wonder if you could help me check my patch and provide me any feedback?
Thank you so much.

Flags: needinfo?(stransky)

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 20

•

4 years ago

Sure. I expect we need to align the frame sizes somehow as ffmpeg is a bit picky here and needs extra space.
Will look at it.

Alastor Wu [:alwu]

Assignee

Comment 21

•

4 years ago

•

Edited

Indeed, I printed the data size (the second parameter for av_buffer_create() ) allocated by default allocator, which is different from the size I set.

H264 (640 * 368)
Data size (by default allocator) = 247119, data size (by my custom allocator) = 235520 (640*368)
VP9 (1280 * 720)
Data size (by default allocator) = 942159, data size (by my custom allocator) = 921600 (1280*760)
VP9 (2560 * 1440)
Data size (by default allocator) = 942159, data size (by my custom allocator) = 921600 (1280*760)

I can found the default allocator always allocates larger size than my custom allocator. In addition, Chrome uses larger size for that parameter, which is calulated by adding the sizes of three planes. I also tried this method, but still got a crash on the last VP9 video.

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 22

•

4 years ago

(In reply to Alastor Wu [:alwu] from comment #21)

Indeed, I printed the data size (the second parameter for av_buffer_create() ) allocated by default allocator, which is different from the size I set.

H264 (640 * 368)
Data size (by default allocator) = 247119, data size (by my custom allocator) = 235520 (640*368)

VP9 (1280 * 720)
Data size (by default allocator) = 942159, data size (by my custom allocator) = 921600 (1280*760)

VP9 (2560 * 1440)
Data size (by default allocator) = 942159, data size (by my custom allocator) = 921600 (1280*760)

I can found the default allocator always allocates larger size than my custom allocator. In addition, Chrome uses larger size for that parameter, which is calulated by adding the sizes of three planes. I also tried this method, but still got a crash on the last VP9 video.

I recall ffmpeg extends/align both width & height, not just width as one may expect. The default allocator allocation method may be the way how to get it correctly, perhaps we need to debug it.

Also the latest VP9 video is rotated - may that be the factor here? Flipped width/height?

Alastor Wu [:alwu]

Assignee

Comment 23

•

4 years ago

(In reply to Martin Stránský [:stransky] (ni? me) from comment #22)

I recall ffmpeg extends/align both width & height, not just width as one may expect. The default allocator allocation method may be the way how to get it correctly, perhaps we need to debug it.

Wow you're right! Aligning the height did solving the crash! \o/
Thank you so much, I will keep working on this and see if there is any other issues or crashes.

Flags: needinfo?(stransky)

Alastor Wu [:alwu]

Assignee

Comment 24

•

4 years ago

Attached file Bug 1713276 - part2 : directly use the shmem video buffer to create a video data which would remove a need to copy decoded data. (obsolete) — Details

Depends on D130220

Alastor Wu [:alwu]

Assignee

Comment 25

•

4 years ago

•

Edited

Attached image wrong-color-issue.png — Details

Now I encounter another problem which is color showing incorrectly due to larger Cb/Cr surfaces size. As ffmpeg needs a special alignment on both height and width, I would create a image and allocate a corresponding buffer. Then, when we receive decoded data, if we give that surface directly to the compositor, it would cause the color mismatch, because the Cb/Cr stride is larger than the compositor expect.

For example, for 1920*1080, the Cb/Cr stride and height should be 960 and 540. But ffmpeg needs 1024 and 544 after doing the alignment, if I represent this texture directly to the compositor, then it would display a wrong green area like what this photo shows.

It seems that I will have to find a way to re-adjust the cb/cr surface region stored in the texture client after I receive decoded data?

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 26

•

4 years ago

How do you create the textures?

For instance eglCreateImage allows to specify strides for all planes which don't need to be power-of-two aligned:
https://searchfox.org/mozilla-central/rev/24fac1ad31fb9c6e9c4c767c6a7ff45d226078f3/widget/gtk/DMABufSurface.cpp#564

(It's also possible that the dmabuf code here is affected by this too).

Alastor Wu [:alwu]

Assignee

Comment 27

•

4 years ago

First I create a PlanarYCbCrImage from the image container, and create a texture client from there. Then I create a new method to create an empty buffer by providing a PlanarYCbCrData, the new method would call SharedPlanarYCbCrImage::Allocate to create a shmem buffer.

As FFmpeg requires aligment, when setting the PlanarYCbCrData (which is used to allocate the shmem buffer), I would set the aligned stride and height on the data (see here, search CreateEmptyPlanarYCbCrData)

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 28

•

4 years ago

Looks like SharedPlanarYCbCrImage is incorrectly mapped to textures. I'd expect BufferTextureData::CreateForYCbCr() is called for the video frames but BufferTextureData::Create() is used instead with RGBA format. Looks like the YUV is mapped to RGB somewhere and we're uploading RGB textures which is wrong.

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 29

•

4 years ago

(In reply to Martin Stránský [:stransky] (ni? me) from comment #28)

Looks like SharedPlanarYCbCrImage is incorrectly mapped to textures. I'd expect BufferTextureData::CreateForYCbCr() is called for the video frames but BufferTextureData::Create() is used instead with RGBA format. Looks like the YUV is mapped to RGB somewhere and we're uploading RGB textures which is wrong.

So no, I was wrong here. The texture seems to be created as YUV at RenderBufferTextureHost::Lock().

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 30

•

4 years ago

Yes, I do see the artifacts with https://www.youtube.com/watch?v=LXb3EKWsInQ
Looks like with and also height is wrong.

Alastor Wu [:alwu]

Assignee

Comment 31

•

4 years ago

(In reply to Martin Stránský [:stransky] (ni? me) from comment #30)

Yes, I do see the artifacts with https://www.youtube.com/watch?v=LXb3EKWsInQ
Looks like with and also height is wrong.

Yes, that's ture, that was caused by extra padding for the ffmpeg. If I watch that video on 4K, then this issue won't happen because we don't need the extra padding for linesize in that resulution.

I asked this quesiton on #gfx-firefox and the response from :jgilbert seems matching about what I thought, I have to dive deeper into the buffer creation, to create a larger size for shmem buffer, but keep the YCbCrDescriptor correct (use the original size which we don't modify for ffmpeg)

So no, I was wrong here. The texture seems to be created as YUV at RenderBufferTextureHost::Lock().

In the beginning, I was also thinking about the possiblility of creating the wrong format texture, but after my testing, that is not true. The problem is still I didn't provide correct YCbCrDescriptor so that the compositor would render the wrong image.

Alastor Wu [:alwu]

Assignee

Comment 32

•

4 years ago

Now I follow this suggestion to resize the planes to useful-size before I create the video data, and the result is good! I can eliminate the green part and video in different resolutions could all be displayed correctly.

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 33

•

4 years ago

Yes, the blit can be another way although it adds another frame copy, but that's done on gfx card. I believe there is a way how to upload the textures directly with stride without additional blit (it's already done for dmabuf textures).

Alastor Wu [:alwu]

Assignee

Comment 34

•

4 years ago

(In reply to Martin Stránský [:stransky] (ni? me) from comment #33)

Yes, the blit can be another way although it adds another frame copy, but that's done on gfx card. I believe there is a way how to upload the textures directly with stride without additional blit (it's already done for dmabuf textures).

Oh the current way I use won't add any copy, it's just simply readjust the size in the YCbCrDescriptor. The reason of not porting the logic to shmem buffer is that, this is the only use case where we would need a speicial stride on height as well. So instead of adding more explanation and logic on creating shmem buffer, I feel adding another method to readjust the descriptor would be easier and better for maintainence.

Alastor Wu [:alwu]

Assignee

Comment 35

•

4 years ago

But now the interesting thing is that, even if I remove the need of copying the decoded data, the decoding time doesn't seem to be improved. I wonder if creating shmem on the ffmpeg decoder threads degrade the performance of decoding. Still in the investigation.

Phabricator Automation

Updated

•

4 years ago

Attachment #9248959 - Attachment description: WIP: Bug 1713276 - part1 : allocate shmem for ffmpeg to store decoded data directly. → Bug 1713276 - part1 : allocate shmem for ffmpeg to store decoded data directly.

Phabricator Automation

Updated

•

4 years ago

Attachment #9250317 - Attachment description: WIP: Bug 1713276 - part2 : directly use the shmem video buffer to create a video data which would remove a need to copy decoded data. → Bug 1713276 - part2 : directly use the shmem video buffer to create a video data which would remove a need to copy decoded data.

Alastor Wu [:alwu]

Assignee

Comment 36

•

4 years ago

Attached file Bug 1713276 - part3 : re-adjust ycbcr useful size to keep video frames being displayed correctly. (obsolete) — Details

Depends on D130944

Alastor Wu [:alwu]

Assignee

Comment 37

•

4 years ago

Here are some profiled results. I followed this way to turn ETW log to the format that Firefox profiler can accept. They are all recording while playing this 4K video in 10s.

Firefox using default buffer allocator (what we current have, will need to copy data into texture or shmem later) result

For vp9 decoding, I searched for vp9_decode_frame(), which happened in thread#270~277.
The total sample times for vp9_decode_frame() is 48111ms (=5929+6333+5961+6809+5920+5249+5648+6262) and the total sample amount is 20680 (=2559+2753+2488+2652+2532+2676+2386+2634). Therefore, we can say one sample took 2.3ms in average.

For ffmpeg video decoder, I searched for mozilla::FFmpegVideoDecoder::DoDecode(), which happened in thread#268,269 and #282.
The total sample times for mozilla::FFmpegVideoDecoder::DoDecode() is 43219ms (=16353+17186+9680) and the total sample amount is 2472 (=949+885+638). Therefore, one sample took 17.48ms in average.

Firefox using the customized buffer allocator (allocate shmem buffer) result

Same searching for vp9_decode_frame(), which happened in thread#302~309.
The total sample times for vp9_decode_frame() are 57393ms (=7606+6329+7865+7275+6327+7107+7563+7321) and the total sample amount is 28356 (=3454+3345+3565+3776+3445+3355+3565+3851). Therefore, we can say one sample took 2.02ms in average.

For ffmpeg video decoder, I searched for mozilla::FFmpegVideoDecoder::DoDecode(), which happened in MediaPdecoder#1~#3.
The total sample times for mozilla::FFmpegVideoDecoder::DoDecode() is 23295ms (=11242+10936+11117) and the total sample amount is 358 (=114+127+117). Therefore, one sample took 65.07ms in average. (?)

In term of decoding time in ffvpx decoder, it only improves very little. For the ffmpeg's DoDecode, it's actully getting worse, but I am not sure if I uses the correct way to analysize these profiles...

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 38

•

4 years ago

Hello Alastor,

I'm not a hardware expert but I think this article gives basic explanation what's going on:
https://tldp.org/HOWTO/Parallel-Processing-HOWTO-2.html

When we decode to shared memory, the memory is cached and needs to be synchronized with other cores (that's the Shared Everything Vs. Shared Something dilemma) as we use multi-threaded decoding. Also I guess the speed heavily depends on cache size (L1, L2, L3) and also DRAM speed. (when L* cache size is small then we may see better performance with direct decode to shm memory as cache may not be so involved there).

Would be interesting to compare single thread decoding (direct decode to shm vs. decode + copy) to see if the multi-thread decoding is a factor here.

Anyway, unless we find a way how to tell OS to not share/propagate shm changes during decode this may be a dead end (as well as it's for dmabuf direct decoding).

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 39

•

4 years ago

I'll try the direct decoding with some old hardware to check the cache theory.

Alastor Wu [:alwu]

Assignee

Comment 40

•

4 years ago

•

Edited

What I feel is multi-thread decoding is not a factor, but the mechanism of synchronizing the shmem might be related. Following result are for multi-threads, and single thread (change this to 1) From below result, it shows that even if using single thread, decode to shmem is still slower.

Linux, Multi-threads, Decode + Copy
[Child 579563: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 19340.897155 us in the average.
Linux, Single thread, Decode + Copy
[Child 794861: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 44225.718321 us in the average.
Linux, Multi-threads, Decode to shmem
[Child 579979: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 20269.318408 us in the average.
Linux, Single thread, Decode to shmem
[Child 795783: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 53721.353087 us in the average.

By observing the stack in comment 37 (decode to shmem), it took a lot of time on ZwFreeVirtualMemory (on Windows) which is called from av_buffer_unref. In vp9_decode_update_thread_context, I did see ffvpx would start checking which frames are no longer needed and would unref those AVFrames. And then it seems triaggering resetting data on shmem, which would cost a lot of time. Also, on Linux, the time spend most are on __pthread_cond_wait which also looks like synchronizing shmeme data between two processes?

However, if those data got reset during decoding, how could we see the image on the compositor?! Because the images are still complete, the shmem buffer should suppose unchanged after we receive decoded video frames. If those data didn't get reset, why av_buffer_unref would trigger those cleaning methods and took a lot of time?

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 41

•

4 years ago

•

Edited

(In reply to Alastor Wu [:alwu] from comment #40)

What I feel is multi-thread decoding is not a factor, but the mechanism of synchronizing the shmem might be related. Following result are for multi-threads, and single thread (change this to 1) From below result, it shows that even if using single thread, decode to shmem is still slower.

Linux, Multi-threads, Decode + Copy
[Child 579563: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 19340.897155 us in the average.

Linux, Single thread, Decode + Copy
[Child 794861: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 44225.718321 us in the average.

Linux, Multi-threads, Decode to shmem
[Child 579979: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 20269.318408 us in the average.

Linux, Single thread, Decode to shmem
[Child 795783: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took 53721.353087 us in the average.

By observing the stack in comment 37 (decode to shmem), it took a lot of time on ZwFreeVirtualMemory (on Windows) which is called from av_buffer_unref. In vp9_decode_update_thread_context, I did see ffvpx would start checking which frames are no longer needed and would unref those AVFrames. And then it seems triaggering resetting data on shmem, which would cost a lot of time. Also, on Linux, the time spend most are on __pthread_cond_wait which also looks like synchronizing shmeme data between two processes?

Hm, maybe the shm memory allocation/free takes so much time? I wonder if the original way (decode to regular memory and then copy to shm) uses different/better way to allocate shm memory chunks or are they recycled?

However, if those data got reset during decoding, how could we see the image on the compositor?! Because the images are still complete, the shmem buffer should suppose unchanged after we receive decoded video frames. If those data didn't get reset, why av_buffer_unref would trigger those cleaning methods and took a lot of time?

Is av_buffer_unref called along ReleaseVideoBufferWrapper()? Because that seems to be the only correct way how to call it, when your own buffer is released, right?

As for the images - the shm images are uploaded from shm to OpenGL textures (located at GPU) and that's a single operation - you can release the shm after that as GL keeps copy of the texture on GPU (and maybe also in RAM but that depends how the textures are uploaded). So it's possible that the underlying shm memory is released but the texture is still live and used for rendering.

Phabricator Automation

Updated

•

4 years ago

Attachment #9248959 - Attachment description: Bug 1713276 - part1 : allocate shmem for ffmpeg to store decoded data directly. → WIP: Bug 1713276 - part1 : allocate shmem for ffmpeg to store decoded data directly.

Phabricator Automation

Updated

•

4 years ago

Attachment #9248959 - Attachment description: WIP: Bug 1713276 - part1 : allocate shmem for ffmpeg to store decoded data directly. → Bug 1713276 - part1 : allocate shmem for ffmpeg to store decoded data directly.

Alastor Wu [:alwu]

Assignee

Comment 42

•

4 years ago

Last time I discussed with :jrmuizel, and he pointed me out that when I do the measurement, I should use the build which is closer to the release build. Then I tried to measure them again on OPT build with RUSTC_OPT_LEVEL=2 (because release build doesn't have full RUST optimization), here are the result.

I tested on five different videos and on the different platforms, and the result shows that on Linux and MacOS, this improvement did help! (especially on my Linux machine, SW decoding + SW compositing) In my previous measurement, I didn't use OPT build and only test on Linux. This new result should be more precise than the old one I made.

But on Windows, the result is interesting. For some video, it shows very good result (53% improvement), but on other video, it shows very bad result (-77% improvement). For now I didn't figure out the reason, and I will keep investigating on this.

In general, if this result is correct, then this improvement is still worthy to do.

Alastor Wu [:alwu]

Assignee

Comment 43

•

3 years ago

In the previous measurement, the result for Windows 11 Dev didn't show a stable improvement, I am not sure if that is because Windows 11 Dev is still not stable not there might be some unexpected behavior. In addition, yesterday after I restarted (and updated) my Windows 11 Dev, my Windows 11 became totallu usuable, so I couldn't do any measurement again, which forced me to reinstall Windows 10 on my machine.

Therefore, I did the measurement again on Windows 10, and the result did show a stable improvement on all videos. Therefore, I think it would be good to use shmem for decoded data for all sw decoding cases, even if we uses hardware composition. We can always share the decoded data to the compositior process, and then upload the texture to GPU later if necessary.

Phabricator Automation

Updated

•

3 years ago

Attachment #9250317 - Attachment is obsolete: true

Phabricator Automation

Updated

•

3 years ago

Attachment #9251652 - Attachment is obsolete: true

Phabricator Automation

Updated

•

3 years ago

Attachment #9248959 - Attachment description: Bug 1713276 - part1 : allocate shmem for ffmpeg to store decoded data directly. → Bug 1713276 - allocate shmem for ffmpeg to store decoded data directly.

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 44

•

3 years ago

•

Edited

Alastor, while I'm looking at the patche, have you tried to run try on them? I see some failures there:
https://treeherder.mozilla.org/jobs?repo=try&revision=7074eecc4c93a9af5859a666cc4a8609c3ddc032

Alastor Wu [:alwu]

Assignee

Comment 45

•

3 years ago

Sure, but the tests I ran are all high level integrated tests because I didn't expect this would break the gtest. I will look into it later, thanks.

Phabricator Automation

Updated

•

3 years ago

Attachment #9248959 - Attachment description: Bug 1713276 - allocate shmem for ffmpeg to store decoded data directly. → Bug 1713276 - part1 : allocate shmem for ffmpeg to store decoded data directly.

Alastor Wu [:alwu]

Assignee

Comment 46

•

3 years ago

Attached file Bug 1713276 - part2 : add verbose log level for FFmpegLog and AudioTrimmerLog. — Details

Change some less important debugs to verbose so that we can use DEBUG level to focus more important things.

For AudioTrimmer, this will help to reduce huge useless logs if audio doesn't need to be trimmed.

Depends on D130220

Alastor Wu [:alwu]

Assignee

Comment 47

•

3 years ago

Attached file Bug 1713276 - part3 : correct warning message. — Details

ffmpeg decoder can also be used for VPX (and AV1 in the future), so remove 'h264' wording in the warning message.

Depends on D134350

Alastor Wu [:alwu]

Assignee

Comment 48

•

3 years ago

Attached file Bug 1713276 - part4 : adjust fuzzy range for some reftests. — Details

Depends on D134351

Phabricator Automation

Updated

•

3 years ago

Attachment #9256272 - Attachment description: Bug 1713276 - part4 : add fuzzy range for some reftests on OSX. → Bug 1713276 - part4 : modify fuzzy range for some reftests.

Alastor Wu [:alwu]

Assignee

Updated

•

3 years ago

Depends on: 1748540

Phabricator Automation

Updated

•

3 years ago

Attachment #9256272 - Attachment description: Bug 1713276 - part4 : modify fuzzy range for some reftests. → Bug 1713276 - part4 : adjust fuzzy range for some reftests.

Alastor Wu [:alwu]

Assignee

Comment 49

•

3 years ago

There are some offline discussion about how to recycle the shmem buffer, here jya mentioned that we need a cross-process, atomic refcount. At the discussion, I didn't really understand the detail of how we release the shmem buffer from the compositor side, so I checked that again and make a record here.

tl;dr, from my investigation, we do have a cross-process, atomic refcount mechanism to release the shmem buffer.

First, when we create a shmem buffer, it would be stored inside a TextureClient. and the shmem buffer would only be destroyed when the texture client gets destroyed.

Texture client has an atomic refcount. After decoder returns a data to us, and we give the texture to the compostior, which in another process, there are two refcount that we can ensure at least happened. The first one is the one holding by ffmpeg (we hold a reference to image (done in patch1, in the end of GetVideoBuffer()), and image holds a reference to texture client. Another is in CompositorBridgeChild, where the texture would be added to a list, then when compositor finishes using the texture, the texture client would be removed from the list. As the refcount is notified from another process, so it's a cross process mechanism.

When ffmpeg doesn't need to keep the buffer, we would be notifted by ReleaseVideoBufferWrapper() callback (in patch1) in which we would release the refcount for texture. When compositor doesn't need to keep the buffer, we would be notified by CompositorBridge. Therefore, we can always ensure shmem would be released correctly.

Alastor Wu [:alwu]

Assignee

Comment 50

•

3 years ago

Update the amount usage comparison in here, directly decoding to shmem significantly reduces the amount of memory usage on Linux and MacOS. But on Windows the result doesn't show any improvement or deterioration, I am not sure why. (Also, the memory usage is super low on Windows. Maybe profiler couldn't capture memory usage well on Windows?)

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 51

•

3 years ago

Do you compare on the same box? Is the decoding on Windows faster than on Linux on the same hardware with SW decoding?

Alastor Wu [:alwu]

Assignee

Comment 52

•

3 years ago

No, they are different devices. You can check about:support for those devices in the part of Testing Devices in my document.

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 53

•

3 years ago

I see. I think it would be interesting to compare performance on the same device on Windows / Linux. You can compare Linux performance with/without shm direct decode and the Windows one to deduce if Windows behaves well or not.

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 54

•

3 years ago

Alastor, I have a laptop with both Windows 10 & Linux installed so I can do the Windows/Linux tests if there's a try build available.

Alastor Wu [:alwu]

Assignee

Comment 55

•

3 years ago

Here it's, thank you so much!

Darkspirit

Comment 56

•

3 years ago

(Alastor Wu [:alwu] from comment #55)

Here it's, thank you so much!

https://treeherder.mozilla.org/jobs?repo=try&revision=3f1c74e0c0f351c27ddaa906e40c5dc9d4d7155f
mozregression --repo try --launch 3f1c74e0c0f351c27ddaa906e40c5dc9d4d7155f

Darkspirit

Comment 57

•

3 years ago

vs.
mozregression --repo try --launch 3f1c74e0c0f351c27ddaa906e40c5dc9d4d7155f --pref media.ffmpeg.customized-buffer-allocation:false

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 58

•

3 years ago

Thanks. I tested that with gecko profiler but I can't get any meaningful numbers. I'm going to try some system level profiler on Windows to get the difference (also on linux perf may be the best tool for it).

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 59

•

3 years ago

Hello Alastor, is there any video playback performance stats in Firefox? I mean decoded / dropped frames, fps etc.
I testes some clips (8K, 4K) on both Windows/Linux on the same box and:

CPU usage looks pretty much the same on Windows/Linux. media.ffmpeg.customized-buffer-allocation doesn't seem to have impact to CPU usage, even media.ffmpeg.customized-buffer-allocation:false is about ~1-2% CPU better.
Video playback seems to be smoother for media.ffmpeg.customized-buffer-allocation:true

Also number of decoding thread may be a factor here.

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 60

•

3 years ago

(In reply to Martin Stránský [:stransky] (ni? me) from comment #59)

Hello Alastor, is there any video playback performance stats in Firefox? I mean decoded / dropped frames, fps etc.
I testes some clips (8K, 4K) on both Windows/Linux on the same box and:

CPU usage looks pretty much the same on Windows/Linux. media.ffmpeg.customized-buffer-allocation doesn't seem to have impact to CPU usage, even media.ffmpeg.customized-buffer-allocation:false is about ~1-2% CPU better.

Video playback seems to be smoother for media.ffmpeg.customized-buffer-allocation:true

Eh, sorry, the 8K clip was actually AV1 so I need to retest again :)

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 61

•

3 years ago

I re-tested a 4K VP9 clip and there are results:

Windows:
media.ffmpeg.customized-buffer-allocation:false
RDD process: 11-13% CPU, 306MB RAM (700MB all Firefox RAM)

media.ffmpeg.customized-buffer-allocation:true
RDD process: 12-15% CPU, 40MB RAM (400MB all Firefox RAM)

Linux:
The same numbers but CPU is 1-2% lower

Looks like direct decoding has minor performance penalty but huge RAM impact.
I used simple process monitor to see CPU%/RAM sized as I didn't get any useful numbers from VTune. Also I think the CPU/RAM is what we and users care most.

Ivan

Comment 62

•

3 years ago

Sorry for interfering, but can I test this "flag" too with my UHD620 + 4K display combo on Win and Linux since I have issues on both OS with playing 4k@60fps videos? I just do not know how to make it. :)

Robert Mader [:rmader]

Comment 63

•

3 years ago

(In reply to Ivan from comment #62)

Sorry for interfering, but can I test this "flag" too with my UHD620 + 4K display combo on Win and Linux since I have issues on both OS with playing 4k@60fps videos? I just do not know how to make it. :)

Yes - you'll need mozregression (pip install --user mozregression or so) , then you can run the commands from comment 56 and comment 57.

Ivan

Comment 64

•

3 years ago

I didn't measure RAM usage, but in my showcase (Garuda Linux + 4K display + UHD620 + HW accel) I don't catch any benefits in terms of temps and framedrops behaviour compared to 97.0b2 (as I was hope). But could be I shouldn't))
But at the same time no problems to me, video works just fine.

Alastor Wu [:alwu]

Assignee

Comment 65

•

3 years ago

(In reply to Martin Stránský [:stransky] (ni? me) from comment #59)

Hello Alastor, is there any video playback performance stats in Firefox? I mean decoded / dropped frames, fps etc.

Yes, you can use VideoPlaybackQuality.droppedVideoFrames.

Looks like direct decoding has minor performance penalty but huge RAM impact.

Thanks for testing! That seems understandable, because now we don't need to allocate memory for copying decoded data, so the improvement should be more significant on high resolution video. Using higher CPU seems also understandable, because allocating shmem needs more work than allocating normal memory.

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 66

•

3 years ago

(In reply to Alastor Wu [:alwu] from comment #65)

Thanks for testing! That seems understandable, because now we don't need to allocate memory for copying decoded data, so the improvement should be more significant on high resolution video. Using higher CPU seems also understandable, because allocating shmem needs more work than allocating normal memory.

I see. I though you did some kind of memory frames recycling with shm. I wonder what causes such massive memory difference then.

Alastor Wu [:alwu]

Assignee

Comment 67

•

3 years ago

(In reply to Martin Stránský [:stransky] (ni? me) from comment #66)

I see. I though you did some kind of memory frames recycling with shm. I wonder what causes such massive memory difference then.

I think that is probably also true? I didn't trace the whole process of how those texture buffers would be used though, but the texture allocator supports recycling. The texture client would be pushed back to the pool when its reference drops to 1.

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 68

•

3 years ago

That does not makes much sense to me. I wonder why the decode to shm is slower. When the time is consumed by allocation, then what's recycled? I expect when we're recycling allocated shm memory (and everything looks like so) we don't allocate anything but just cycle few already allocated shm buffers which are mapped/unmapped as textures later during the cycle (that's the texture Lock/Unlock IMHO). Also the tiny memory usage indicates that.

I really think that direct decode to shm is slower due to random access to shm as ffmpeg do read/write to shm during decode and that may be handled by kernel to sync shm/flush cahe etc. I'll try to do some benchmarks to compare so. If that's confirmed we can do the decode to RAM as we did before but use the recycling shm image container to keep the RAM effect of this patch.

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 69

•

3 years ago

I'll also try to get detailed performance data from perf as this is really interesting problem :)

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 70

•

3 years ago

Got the performance data. I test playing 1 min of 4K clip. With direct decode to shm there are shm mapping failures on the stack (but rest is the same):

-    8.07%     0.67%  MediaPD~oder #1  [kernel.kallsyms]
   - 7.40% asm_exc_page_fault                                                                                                                                                                ▒
      - 7.33% exc_page_fault                                                                                                                                                                 ▒
         - 7.29% do_user_addr_fault                                                                                                                                                          ▒
            - 6.85% handle_mm_fault                                                                                                                                                          ▒
               - 6.56% __handle_mm_fault                                                                                                                                                     ▒
                  - 5.04% __do_fault                                                                                                                                                         ▒
                     - 4.98% shmem_fault                                                                                                                                                     ▒
                        - 4.91% shmem_getpage_gfp                                                                                                                                            ▒
                             1.58% clear_page_rep                                                                                                                                            ▒
                           - 1.43% shmem_alloc_and_acct_page                                                                                                                                 ▒
                              - 1.27% shmem_alloc_page                                                                                                                                       ▒
                                 - 1.19% alloc_pages_vma                                                                                                                                     ▒
                                    - 1.12% __alloc_pages                                                                                                                                    ▒
                                       - 0.96% get_page_from_freelist                                                                                                                        ▒
                                          - 0.71% rmqueue_bulk                                                                                                                               ▒
                                               0.58% __list_del_entry_valid                                                                                                                  ▒
                             1.19% shmem_add_to_page_cache                                                                                                                                   ▒
   - 0.66% start_thread                                                                                                                                                                      ▒
      - 0.59% frame_worker_thread                                                                                                                                                            ▒
         - vp9_decode_frame                                                                                                                                                                  ▒
              0.59% decode_sb                                                                                                                                                                ▒

So looks like we hit mapping failure so the process is suspended unless the page is loaded. May this be caused by multi-thread decode to the same shared memory page?

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 71

•

3 years ago

btw. Recorded on Linux as:

$perf record -g ./firefox

And analyzed as:

$perf report --call-graph

Alastor Wu [:alwu]

Assignee

Comment 72

•

3 years ago

(In reply to Martin Stránský [:stransky] (ni? me) from comment #70)

So looks like we hit mapping failure so the process is suspended unless the page is loaded. May this be caused by multi-thread decode to the same shared memory page?

How about if you set thread_count = 1 to disable multi-threads decoding in ffmpeg? will that make any difference?

Linux: The same numbers but CPU is 1-2% lower

Forgot to reply this. So on Linux, memory usage between those two approach didn't have any difference? hmm but shmem should be able to reused because all buffers we want to allocate should be compatible with the shmem we have allocated.

In addition, you said CPU is lower when enabling shmem decoding, so enabling that doesn't degrade the performance on Linux? Even if we hit 8.07% page fault?

I will also try to test by using other profile tools tomorrow to see if I can get other interesting data or not.

Jeff Muizelaar [:jrmuizel]

Comment 73

•

3 years ago

The profile suggests we're not properly recycling the shared memory. (On Linux pages of allocations are usually allocated in the page fault handler when they're first touched. alloc_pages_vma)

We don't want to be doing any shared memory allocations during video playback once the initial buffers have been allocated.

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 74

•

3 years ago

(In reply to Alastor Wu [:alwu] from comment #72)

In addition, you said CPU is lower when enabling shmem decoding, so enabling that doesn't degrade the performance on Linux? Even if we hit 8.07% page fault?

To be clear I see:

Windows:
media.ffmpeg.customized-buffer-allocation:false
RDD process: 11-13% CPU, 306MB RAM (700MB all Firefox RAM)

media.ffmpeg.customized-buffer-allocation:true
RDD process: 12-15% CPU, 40MB RAM (400MB all Firefox RAM)

Linux:
media.ffmpeg.customized-buffer-allocation:false
RDD process: 9-11% CPU, 306MB RAM (700MB all Firefox RAM)

media.ffmpeg.customized-buffer-allocation:true
RDD process: 11-14% CPU, 40MB RAM (400MB all Firefox RAM)

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 75

•

3 years ago

(In reply to Alastor Wu [:alwu] from comment #72)

How about if you set thread_count = 1 to disable multi-threads decoding in ffmpeg? will that make any difference?

It's the same, i.e. asm_exc_page_fault is on the perf stack.

Martin Stránský [:stransky] (ni? me)

Reporter

Updated

•

3 years ago

No longer blocks: 1739213

Martin Stránský [:stransky] (ni? me)

Reporter

Updated

•

3 years ago

Blocks: 1739213
No longer blocks: egl-linux-vaapi

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 76

•

3 years ago

(In reply to Jeff Muizelaar [:jrmuizel] from comment #73)

The profile suggests we're not properly recycling the shared memory. (On Linux pages of allocations are usually allocated in the page fault handler when they're first touched. alloc_pages_vma)

We don't want to be doing any shared memory allocations during video playback once the initial buffers have been allocated.

That seems to be correct - we repeatedly allocate shm memory by VideoBridgeChild. Looking at it now.

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 77

•

3 years ago

So yes. The frames are not recycled because texture sizes are different (shrank). At YCbCrTextureClientAllocationHelper::IsCompatible():

aTextureClient->GetSize()
width = 3840,
height = 2160

mData.mYSize
width = 3840,
height = 2176

bufferData->GetCbCrSize().ref()
width = 1920,
height = 1080

mData.mCbCrSize
width = 1920,
height = 1088

Alastor Wu [:alwu]

Assignee

Comment 78

•

3 years ago

(In reply to Martin Stránský [:stransky] (ni? me) from comment #77)

So yes. The frames are not recycled because texture sizes are different (shrank). At YCbCrTextureClientAllocationHelper::IsCompatible():

Ah, good finding! So it seems that we have to provide a cropped size if there are avaliable texture for recycling. But I am curious, if we didn't recylce textures correctly, why we could reduce such a significant amount of memory usage?

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 79

•

3 years ago

(In reply to Alastor Wu [:alwu] from comment #78)

(In reply to Martin Stránský [:stransky] (ni? me) from comment #77)

So yes. The frames are not recycled because texture sizes are different (shrank). At YCbCrTextureClientAllocationHelper::IsCompatible():

Ah, good finding! So it seems that we have to provide a cropped size if there are avaliable texture for recycling. But I am curious, if we didn't recylce textures correctly, why we could reduce such a significant amount of memory usage?

That may be some ffmpeg/ffvpx internal bug or we manage AV Buffers incorrectly (Bug 1743750).

Alastor Wu [:alwu]

Assignee

Updated

•

3 years ago

Comment 80

•

3 years ago

I had some offline discussion with Jeff, and I think the texture recyling issue (comment 77) should be fixed along with bug 1750858. The reason is that I don't want to add any new tech debt. An easy way to fix recyling problem would be to modify YCbCrTextureClientAllocationHelper::IsCompatible() and let it check cropped size instead, but that introduces a new tech debt and definitely is a workaround.

The ultimate goal should be fixing bug 1750858 and then we don't need to crop our planes anymore. Therefore, all texture would be using a larger size for checking the compatibility so that those textures would be possible to be reused.

Even considering the worst case, we would allocate shmem for every video frame, that is actually equal to what we're current doing (create a shmem everytime and then copying decoded data into it). So landing the current patch won't cause any new problem. Considering this feature is an experimental feature which would only be enabled on Nightly first, 1~2% CPU gain in't a big deal and we also see a huge amount of memory usage reduction (although we're not clear why)

Therefore, I think I would still prefer to land current patches first (after them all r+), and bake them on Nightly to see if there is any new issues happening. In the meanwhile, we will also try to fix bug 1750858 in order to recycle textures correctly.

Alastor Wu [:alwu]

Assignee

Updated

•

3 years ago

Blocks: 1751051

Pulsebot

Comment 81

•

3 years ago

Pushed by alwu@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/13ce7037ff27 part1 : allocate shmem for ffmpeg to store decoded data directly. r=stransky,gfx-reviewers,sotaro,jgilbert,media-playback-reviewers https://hg.mozilla.org/integration/autoland/rev/b23498172031 part2 : add verbose log level for FFmpegLog and AudioTrimmerLog. r=stransky,bryce,media-playback-reviewers https://hg.mozilla.org/integration/autoland/rev/cdc8216b3f18 part3 : correct warning message. r=stransky,media-playback-reviewers https://hg.mozilla.org/integration/autoland/rev/12a9f3fad481 part4 : adjust fuzzy range for some reftests. r=gfx-reviewers,jgilbert

Narcis Beleuzu [:NarcisB]

Comment 82

•

3 years ago

Backed out for bustages on FFmpegVideoDecoder.cpp

Backout link: https://hg.mozilla.org/integration/autoland/rev/ae774ab3fdd80116907a1742bfeaa3b43f3a8caa
Log link: https://treeherder.mozilla.org/logviewer?job_id=364752223&repo=autoland&lineNumber=16074

Flags: needinfo?(alwu)

Alastor Wu [:alwu]

Assignee

Updated

•

3 years ago

Flags: needinfo?(alwu)

Pulsebot

Comment 83

•

3 years ago

Pushed by alwu@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f8814abefa20 part1 : allocate shmem for ffmpeg to store decoded data directly. r=stransky,gfx-reviewers,sotaro,jgilbert,media-playback-reviewers https://hg.mozilla.org/integration/autoland/rev/3d88e393a0cb part2 : add verbose log level for FFmpegLog and AudioTrimmerLog. r=stransky,bryce,media-playback-reviewers https://hg.mozilla.org/integration/autoland/rev/9a47b267032e part3 : correct warning message. r=stransky,media-playback-reviewers https://hg.mozilla.org/integration/autoland/rev/cee073e045a3 part4 : adjust fuzzy range for some reftests. r=gfx-reviewers,jgilbert

Norisz Fay [:noriszfay]

Comment 84

•

3 years ago

Backed out for causing build bustages and wpt failures

Backout link: https://hg.mozilla.org/integration/autoland/rev/a20b8d912b3f8cfe4e54143f3d708b549e17fe41

Push with failures: failure push wpt // failure push bustage

Failure log(s): failure log wpt // failure log bustage

Failure line(s): TEST-UNEXPECTED-FAIL | /html/canvas/element/manual/wide-gamut-canvas/canvas-display-p3-drawImage-ImageBitmap-video.html | sRGB-BB0000, Context srgb, ImageData srgb, cropSource=true - assert_true: Actual pixel value 0,0,0,0 is approximately equal to 187,0,0,255. expected true got false // gmake[4]: *** [/builds/worker/checkouts/gecko/config/rules.mk:660: Unified_cpp_ffmpeg_ffmpeg580.o] Error 1

Flags: needinfo?(alwu)

Alastor Wu [:alwu]

Assignee

Comment 85

•

3 years ago

I don't think that wpt failure was caused by my patches, by looking at my last try-run before landing my patches, you can see that failure is intermittent. When the failure happened, the crash happened inside the web render, which is not something my code would touch.

I would say, probably my code changes the timing and exposes this failure and make it more happening more frequently, but my patches are not a root cause. But I will update my patch to avoid that build bustage (weird, it didn't show up in my try-run before!)

Flags: needinfo?(alwu)

Pulsebot

Comment 86

•

3 years ago

Pushed by alwu@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/002c71e3c3f7 part1 : allocate shmem for ffmpeg to store decoded data directly. r=stransky,gfx-reviewers,sotaro,jgilbert,media-playback-reviewers https://hg.mozilla.org/integration/autoland/rev/5503198fc6b2 part2 : add verbose log level for FFmpegLog and AudioTrimmerLog. r=stransky,bryce,media-playback-reviewers https://hg.mozilla.org/integration/autoland/rev/52fdcd1dc5ed part3 : correct warning message. r=stransky,media-playback-reviewers https://hg.mozilla.org/integration/autoland/rev/f4ca14151598 part4 : adjust fuzzy range for some reftests. r=gfx-reviewers,jgilbert

Cristian Tuns

Comment 87

•

3 years ago

•

Edited

Backed out on suspicion of causing wpt failure in canvas + video test. Multiple wpt failures.

Backout link
Push with failures
Failure Log
Failure line 1: html/canvas/element/manual/wide-gamut-canvas/canvas-display-p3-drawImage-video.html | sRGB-FF0100, Context srgb, ImageData srgb, scaleImage=false - assert_true: Actual pixel value 0,0,0,0 is approximately equal to 255,1,0,255. expected true got fals
Failure line 2: TEST-UNEXPECTED-FAIL | /html/canvas/element/manual/wide-gamut-canvas/canvas-display-p3-drawImage-video.html | sRGB-BB0000, Context display-p3, ImageData srgb, scaleImage=false - assert_true: Actual pixel value 0,0,0,0 is approximately equal to 187,1,0,255. expected true got false

Flags: needinfo?(alwu)

Alastor Wu [:alwu]

Assignee

Updated

•

3 years ago

Flags: needinfo?(alwu)

Alastor Wu [:alwu]

Assignee

Updated

•

3 years ago

Blocks: 1751498

Alastor Wu [:alwu]

Assignee

Comment 88

•

3 years ago

Attached file Bug 1713276 - part5 : don't use shmem textures for 10bits videos on Windows. — Details

Pulsebot

Comment 89

•

3 years ago

Pushed by alwu@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b2676297c6ff part1 : allocate shmem for ffmpeg to store decoded data directly. r=stransky,gfx-reviewers,sotaro,jgilbert,media-playback-reviewers https://hg.mozilla.org/integration/autoland/rev/0a1be4875045 part2 : add verbose log level for FFmpegLog and AudioTrimmerLog. r=stransky,bryce,media-playback-reviewers https://hg.mozilla.org/integration/autoland/rev/d8e319535a01 part3 : correct warning message. r=stransky,media-playback-reviewers https://hg.mozilla.org/integration/autoland/rev/21a1cb173d50 part4 : adjust fuzzy range for some reftests. r=gfx-reviewers,jgilbert https://hg.mozilla.org/integration/autoland/rev/9ac49fde0e53 part5 : don't use shmem textures for 10bits videos on Windows. r=media-playback-reviewers,stransky

Atila Butkovits

Comment 90

•

3 years ago

Backed out for causing reftest failures at color_quads/720p.png.bt709.bt709.pc.yuv420p10.vp9.webm.

Backout link: https://hg.mozilla.org/integration/autoland/rev/9a0a0d21c4d3a0f19dae8fb56e317b88da06eb1b

Push with failures: https://treeherder.mozilla.org/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel&searchStr=win%2C10%2Copt&revision=9ac49fde0e532f3f97475fc4d20ba5c503cd4c5b&selectedTaskRun=NFP7ntZ9TJeAHq3alvRe-g.0

Failure log: https://treeherder.mozilla.org/logviewer?job_id=365096697&repo=autoland&lineNumber=34091

Flags: needinfo?(alwu)

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 91

•

3 years ago

Alastor, may you land it with media.ffmpeg.customized-buffer-allocation set to false to pass tests and enable it in another patch when all tests are fixed? That will save your time with rebases to latest trunk. It also helps to ensure it doesn't break anything when the pref if off.

Alastor Wu [:alwu]

Assignee

Comment 92

•

3 years ago

•

Edited

(In reply to Martin Stránský [:stransky] (ni? me) from comment #91)

Alastor, may you land it with media.ffmpeg.customized-buffer-allocation set to false to pass tests and enable it in another patch when all tests are fixed? That will save your time with rebases to latest trunk. It also helps to ensure it doesn't break anything when the pref if off.

It's ok, that new failure is in my expectation and I know why it failed. (See this, and that should be fixed in bug 1748540) I just forgot to handle that as we disable shmem-decoding for 10bit+ videos on Windows in the last update. After I modify patch4 again and do full testing, and ensure everything is ok, I will reland them again.

Flags: needinfo?(alwu)

Alastor Wu [:alwu]

Assignee

Comment 93

•

3 years ago

•

Edited

I've updated my patch4 to remove the changes for 10+ bits video, which are all related with SW WR on Windows (but now we won't enable shmem decoding for 10+ bits video on Windows, see 1751498)

Partial try run and a completed try run both look good. Also, the new try run I just pushed which is based on the latest m-c after doing some rebasing.

Alastor Wu [:alwu]

Assignee

Comment 94

•

3 years ago

Attached file Bug 1713276 - part5 : don't use shmem textures for 10bits videos on Windows. (obsolete) — Details

Depends on D134352

Phabricator Automation

Updated

•

3 years ago

Attachment #9260503 - Attachment is obsolete: true

Pulsebot

Comment 95

•

3 years ago

Pushed by alwu@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/555a3481ee4d part1 : allocate shmem for ffmpeg to store decoded data directly. r=stransky,gfx-reviewers,sotaro,jgilbert,media-playback-reviewers https://hg.mozilla.org/integration/autoland/rev/f032a3f3200a part2 : add verbose log level for FFmpegLog and AudioTrimmerLog. r=stransky,bryce,media-playback-reviewers https://hg.mozilla.org/integration/autoland/rev/4d0ac457af61 part3 : correct warning message. r=stransky,media-playback-reviewers https://hg.mozilla.org/integration/autoland/rev/38eb80e3c143 part4 : adjust fuzzy range for some reftests. r=gfx-reviewers,jgilbert https://hg.mozilla.org/integration/autoland/rev/3b1e43ee7774 part5 : don't use shmem textures for 10bits videos on Windows. r=media-playback-reviewers,stransky

Norisz Fay [:noriszfay]

Comment 96

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/555a3481ee4d
https://hg.mozilla.org/mozilla-central/rev/f032a3f3200a
https://hg.mozilla.org/mozilla-central/rev/4d0ac457af61
https://hg.mozilla.org/mozilla-central/rev/38eb80e3c143
https://hg.mozilla.org/mozilla-central/rev/3b1e43ee7774

Status: NEW → RESOLVED

Closed: 3 years ago

status-firefox98: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 98 Branch

Cristian Tuns

Updated

•

3 years ago

Regressions: 1751913

Alastor Wu [:alwu]

Assignee

Updated

•

3 years ago

Regressions: 1752388

Sotaro Ikeda [:sotaro]

Updated

•

3 years ago

Regressions: 1753575

Martin Stránský [:stransky] (ni? me)

Reporter

Comment 97

•

3 years ago

With bug 1753575 fixed I can confirm that shm memory recycling works.

Alastor Wu [:alwu]

Assignee

Updated

•

3 years ago

Blocks: 1757436

Alastor Wu [:alwu]

Assignee

Updated

•

3 years ago

Blocks: 1757980

Kevin Locke

Updated

•

3 years ago

Regressions: 1758193

Alastor Wu [:alwu]

Assignee

Updated

•

3 years ago

No longer regressions: 1758193

Ryan VanderMeulen [:RyanVM]

Updated

•

3 years ago

Regressions: 1758193

Olli Pettay [:smaug][bugs@pettay.fi]

Updated

•

3 years ago

Regressions: 1768259

Darkspirit

Updated

•

3 years ago

Depends on: 1772013

in-place.patch 4 years ago Martin Stránský [:stransky] (ni? me) 37.93 KB, patch		Details \| Diff \| Splinter Review
Bug 1713276 - part1 : allocate shmem for ffmpeg to store decoded data directly. 4 years ago Alastor Wu [:alwu] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1713276 - part2 : directly use the shmem video buffer to create a video data which would remove a need to copy decoded data. 4 years ago Alastor Wu [:alwu] 48 bytes, text/x-phabricator-request		Details \| Review
wrong-color-issue.png 4 years ago Alastor Wu [:alwu] 1.53 MB, image/png		Details
Bug 1713276 - part3 : re-adjust ycbcr useful size to keep video frames being displayed correctly. 4 years ago Alastor Wu [:alwu] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1713276 - part2 : add verbose log level for FFmpegLog and AudioTrimmerLog. 3 years ago Alastor Wu [:alwu] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1713276 - part3 : correct warning message. 3 years ago Alastor Wu [:alwu] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1713276 - part4 : adjust fuzzy range for some reftests. 3 years ago Alastor Wu [:alwu] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1713276 - part5 : don't use shmem textures for 10bits videos on Windows. 3 years ago Alastor Wu [:alwu] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1713276 - part5 : don't use shmem textures for 10bits videos on Windows. 3 years ago Alastor Wu [:alwu] 48 bytes, text/x-phabricator-request		Details \| Review