Bug 1713276 Comment 41 Edit History

(In reply to Alastor Wu [:alwu] from comment #40)
> My feeling is that multi-threaded decoding is not a factor, but that the mechanism of synchronizing the shmem might be related. The following results are for multiple threads and for a single thread (change [this](https://searchfox.org/mozilla-central/rev/55a826a9ef74e92988e56cd9615d4fc6a470695e/dom/media/platforms/ffmpeg/FFmpegVideoDecoder.cpp#404) to 1). They show that even with a single thread, decoding to shmem is still slower.
> 
> * Linux, Multi-threads, Decode + Copy
> [Child 579563: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took `19340.897155` us in the average.
> 
> * Linux, Single thread, Decode + Copy
> [Child 794861: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took `44225.718321` us in the average.
> 
> * Linux, Multi-threads, Decode to shmem 
> [Child 579979: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took `20269.318408` us in the average.
> 
> * Linux, Single thread, Decode to shmem 
> [Child 795783: Main Thread]: D/MediaAveragePerf 'RequestDecode' stage for 'V:1440<h<=2160' took `53721.353087` us in the average.
> 
> ---
> 
> Looking at the stack in comment 37 (decode to shmem), a lot of time is spent in `ZwFreeVirtualMemory` (on Windows), which is called from `av_buffer_unref`. In `vp9_decode_update_thread_context`, I did see ffvpx start checking which frames are no longer needed and unref those AVFrames. That then seems to trigger resetting data on the shmem, which costs a lot of time. Also, on Linux, most of the time is spent in `__pthread_cond_wait`, which also looks like synchronizing shmem data between the two processes?

Hm, maybe the shm memory allocation/free is what takes so much time? I wonder whether the original way (decode to regular memory and then copy to shm) uses a different/better way to allocate shm memory chunks, or whether they are recycled?
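
To illustrate what I mean by recycling, here is a minimal sketch using FFmpeg's own `AVBufferPool`: pool entries go back to the pool on the last unref instead of being returned to the OS allocator. This is only an assumption about what the decode-to-regular-memory path might be benefiting from; `GetPooledBuffer` below is a hypothetical helper, not code from the tree.

```cpp
// Minimal sketch (hypothetical helper): recycle fixed-size buffers with
// FFmpeg's AVBufferPool instead of allocating/freeing a fresh chunk per frame.
extern "C" {
#include <libavutil/buffer.h>
}

static AVBufferPool* sPool = nullptr;

AVBufferRef* GetPooledBuffer(size_t aSize) {
  if (!sPool) {
    // One-time pool creation; with a null alloc callback the pool uses
    // FFmpeg's default allocator for new entries.
    sPool = av_buffer_pool_init(aSize, /* alloc = */ nullptr);
  }
  // Hands back a previously released entry when one is available, so the
  // matching av_buffer_unref() returns it to the pool rather than freeing it.
  return av_buffer_pool_get(sPool);
}
```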
 
> However, if that data got reset during decoding, how could we see the image on the compositor?! Because the images are still complete, the shmem buffer is supposed to stay unchanged after we receive decoded video frames. And if that data didn't get reset, why would `av_buffer_unref` trigger those cleanup methods and take so much time?

Is `av_buffer_unref` called along with `ReleaseVideoBufferWrapper()`? That seems to be the only correct place to call it, i.e. when your own buffer is released, right?
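
To make the question concrete, here is a hedged sketch of the wiring I am assuming: if the decoder's `get_buffer2` callback wraps the shmem block via `av_buffer_create()`, then the free callback passed there is exactly what `av_buffer_unref()` ends up invoking once the last reference goes away. `GetShmemBuffer` and the callback body are hypothetical placeholders, not the actual tree code.

```cpp
// Hypothetical sketch: a get_buffer2 callback that backs an AVFrame with a
// caller-owned block, so av_buffer_unref() on the last ref runs our release
// callback (standing in for the real ReleaseVideoBufferWrapper).
extern "C" {
#include <libavcodec/avcodec.h>
#include <libavutil/imgutils.h>
#include <libavutil/mem.h>
}

static void ReleaseVideoBufferWrapper(void* /* aOpaque */, uint8_t* aData) {
  // Last AVBufferRef dropped -> hand the block back to its owner. The real
  // implementation would recycle/release the shmem; here we just free.
  av_free(aData);
}

int GetShmemBuffer(AVCodecContext* /* aCtx */, AVFrame* aFrame, int /* aFlags */) {
  auto format = static_cast<AVPixelFormat>(aFrame->format);
  int size = av_image_get_buffer_size(format, aFrame->width, aFrame->height, 32);
  // Stand-in for mapping a shmem block; real code would hand out shared memory.
  uint8_t* data = static_cast<uint8_t*>(av_malloc(size));
  if (!data) {
    return AVERROR(ENOMEM);
  }
  aFrame->buf[0] =
      av_buffer_create(data, size, ReleaseVideoBufferWrapper, nullptr, 0);
  if (!aFrame->buf[0]) {
    av_free(data);
    return AVERROR(ENOMEM);
  }
  av_image_fill_arrays(aFrame->data, aFrame->linesize, data, format,
                       aFrame->width, aFrame->height, 32);
  return 0;
}
```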

As for the images - the shm images are uploaded from shm to OpenGL textures (located on the GPU) and that's a single operation - you can release the shm after that, as GL keeps a copy of the texture on the GPU (and maybe also in RAM, but that depends on how the textures are uploaded). So it's possible that the underlying shm memory is released while the texture is still live and used for rendering.
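
A minimal sketch of that upload-then-release pattern (the function and parameter names are illustrative, not the actual compositor code): `glTexImage2D` copies the pixels out of the mapping synchronously during the call, so the shm mapping can be dropped immediately afterwards while the texture stays valid.

```cpp
// Illustrative sketch: upload pixels from a shm mapping into a GL texture,
// then unmap the shm; GL keeps its own copy of the texel data.
#include <GL/gl.h>
#include <sys/mman.h>
#include <cstddef>

void UploadAndReleaseShm(GLuint aTexture, void* aShmPixels, size_t aShmSize,
                         int aWidth, int aHeight) {
  glBindTexture(GL_TEXTURE_2D, aTexture);
  // Single synchronous operation: GL copies the pixel data during this call.
  glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, aWidth, aHeight, 0, GL_RGBA,
               GL_UNSIGNED_BYTE, aShmPixels);
  // Safe to release the shm now; the texture lives on independently.
  munmap(aShmPixels, aShmSize);
}
```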