Closed Bug 773440 Opened 8 years ago Closed 7 years ago

OMTC: Remove unnecessary frame copies with async-video

Categories: Core :: Graphics: Layers, defect
Status: RESOLVED FIXED
Target Milestone: mozilla17
People: (Reporter: nical, Assigned: nical)
Attachments: (1 file, 7 obsolete files)

Right now, with async-video, every video frame is copied into non-shared memory after it has been decoded (A) and the resulting image is later copied again into shared memory (B) before being sent to the compositor.

This is a waste of CPU and memory.

The copy (A) can be removed most of the time by picking an already allocated shared image in ImageBridgeChild's pool.

The copy (B) can be removed by modifying our decoder libraries to decode into buffers provided by the caller (and provide shared images directly).
Relevant comment from roc on the async-video bug (bug 598868):

> Here's the plan I've been carrying in my head for minimal-copying when using
> our builtin decoders:
> -- PlanarYCbCrImage gets a new method Allocate(Data* aData) where the data
> pointers are null when passed in, and the method allocates memory and fills
> in the data pointers. Then there are two modes of operation:
> 1) "caller-allocate", the current setup, where the caller already has data
> in memory and just calls SetData() to copy that data into the image
> 2) "callee-allocate", the new option, where the caller calls Allocate(),
> fills in the buffers allocated by the image, then calls SetData() when
> that's finished. The image must keep the buffers alive and readable by the
> caller as long as the Image object lives.
> -- The default implementation of Allocate() just allocates buffers in main
> memory.
> -- For IPC we implement Allocate() to allocate buffers in shared memory.
> -- Modify our decoder libraries to decode into buffers provided by the
> caller. I've already worked out a plan for this with Tim Terriberry.
> Basically we extend the codec API to allow the caller to pass in two
> callbacks: "create a frame buffer", and "release a frame buffer". The codec
> will need to keep some frames around longer than others to be used as
> references when decoding other frames.
> -- Implement those callbacks in our video decoder: create a frame buffer
> creating a PlanarYCbCrImage and calling Allocate(), release a frame buffer
> by derefing the PlanarYCbCrImage.
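The two modes of operation described above can be sketched like this. Everything here is a simplified, hypothetical stand-in for the real PlanarYCbCrImage interface (a single Y plane, a vector instead of shmem); only the Allocate/SetData shape follows the plan in the quoted comment:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative stand-in for the data descriptor; pointers are null on entry
// in callee-allocate mode and filled in by Allocate().
struct Data {
  uint8_t* mYChannel = nullptr;
  size_t mYSize = 0;
};

class PlanarYCbCrImage {
public:
  // Callee-allocate: allocate a buffer owned by the image and fill in the
  // data pointers. The default implementation uses main memory; an IPC
  // implementation would allocate in shared memory instead.
  void Allocate(Data* aData) {
    mBuffer.resize(aData->mYSize);
    aData->mYChannel = mBuffer.data();
  }

  // If the buffer is the one we handed out from Allocate(), adopt it with no
  // copy (callee-allocate); otherwise copy the caller's data in
  // (caller-allocate, the pre-existing setup).
  void SetData(const Data& aData) {
    if (aData.mYChannel == mBuffer.data()) {
      return; // data already lives in our buffer
    }
    mBuffer.assign(aData.mYChannel, aData.mYChannel + aData.mYSize);
  }

  const uint8_t* Buffer() const { return mBuffer.data(); }

private:
  std::vector<uint8_t> mBuffer;
};
```

In callee-allocate mode the decoder writes directly into the image's buffer, so the final SetData() is just a notification that the frame is complete.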
Blocks: 598868
Depends on: 774347
Depends on: 775244
Assignee: nobody → nsilva
Depends on: 782372
Work in progress. It works, but the code needs to be cleaned up and documented, and SharedPlanarYCbCrImage should maybe move to its own .h/.cpp.

Right now with async-video here is what happens:
The decoding libraries hand us a video frame that we copy into a PlanarYCbCrImage which is put in the MediaQueue. Later we take this image's data and copy it once again in shared memory to send it to the compositor.

With this patch, instead of using a PlanarYCbCrImage we use a SharedPlanarYCbCrImage, which copies the image into shared memory right away. This way we won't need to copy it again before sending it to the compositor, so this eliminates one copy when using async-video.
When the SharedPlanarYCbCrImage sends its data to the compositor it loses ownership of the data.

I ran into a few obstacles:
 - We sometimes call RenderVideoFrame with the same Image several times. This obviously breaks with the image losing ownership of its data when it is first sent to the compositor, so an image that does not have data is considered invalid, and trying to send an invalid image returns without sending anything.
 - The main thread sometimes wants to access the image data, for example when we take the screenshot (for the new tab page, etc.). When this happens it calls GetAsSurface on the Image, which crashes when the image has lost ownership of its data. So GetAsSurface always returns nullptr with SharedPlanarYCbCrImage.

There may be better solutions to these two problems. For the second one, I think we need to rethink and clarify the sharing model of video frames. The main thread should never access the image data directly like that. Instead, I think it should ask the decoder to hand it an image (in which case the decoder would sometimes have to re-decode a frame, I suppose).
A bit of cleanup and documentation
Attachment #652850 - Attachment is obsolete: true
Attachment #652882 - Attachment is obsolete: true
Attachment #652942 - Flags: review?(roc)
(In reply to Nicolas Silva [:nical] from comment #2)
> When the SharedPlanarYCbCrImage sends its data to the compositor it loses
> ownership of the data.

Images were designed to be immutable and shared. We shouldn't break that unless we absolutely have to, and I don't understand why we'd have to. Why can't we share the image data read-only between the content thread and the compositor?
We don't absolutely have to, although IMHO this sharing model is making memory management much more complicated than it should be with OMTC. With gralloc it is even harder, because you need to open/close on each side, with the data potentially showing up at a different address when you reopen, and I don't even know what the overhead of closing/opening is, but I suppose it is not free. The more we want to take advantage of GPU memory and such, the more complicated this model will be.

I think the current design fits non-OMTC better than OMTC. Maybe changing the design is complicated and not worth it, though.
We sometimes need to get at the image data from the main thread, efficiently if possible --- for drawing to canvas, printing and other users. So we chose a design that permits that. (When we decode directly into a gralloc buffer, the decoder itself will need read access to the image data to decode future frames, so this will become a much more common case.)

I think a design where we give up ownership of the data to the layer system and can't synchronously get access to the data again simply would not work, since you wouldn't be able to implement canvas.drawImage(video). Some other things would become much more difficult to implement.
I don't mean to defend a unique-owner sharing model too much in this case since the current model apparently has its reasons that you mentioned. I mostly raised the idea because I was wondering if it was something more relevant to before omtc than it is now.

Anyway, I changed the patch to keep sharing the data to the main thread after the image has been sent to the compositor.
It works like this: since sent images are kept around until their data is sent back (thanks to one of kanru's recent patches), we can just let the SharedPlanarYCbCrImage's destructor decide when to recycle the shared memory. So when ImageContainerChild receives the shared image back from the compositor, it first checks whether the image was originally a SharedPlanarYCbCrImage, in which case it does not put the image in the pool, since reference counting will take care of that.
To know whether the SharedImage comes from a SharedPlanarYCbCrImage, I added an ImageID field in YUVImage that is non-zero when the image originated from a SharedPlanarYCbCrImage.
Attachment #652942 - Attachment is obsolete: true
Attachment #652942 - Flags: review?(roc)
Attachment #653844 - Flags: review?(roc)
Small fix to remove a warning that got -Werror'ed on the try servers
Attachment #653844 - Attachment is obsolete: true
Attachment #653844 - Flags: review?(roc)
Attachment #653972 - Flags: review?(roc)
Comment on attachment 653972 [details] [diff] [review]
Copy the image out of the decoder directly into shared memory.

Review of attachment 653972 [details] [diff] [review]:
-----------------------------------------------------------------

Generally this looks great.

I think you missed some usage of PlanarYCbCrImage --- getUserMedia is on trunk and using that now. There's additional WebRTC code coming that will use it too. See MediaEngineDefaultVideoSource::Start and MediaEngineWebRTCVideoSource::DeliverFrame. I think they will need to behave like nsBuiltinDecoderReader currently does, copying frames into the allocated buffer. So, I think it might be a good idea to have a shared API that copies a source buffer into the allocated buffer. This could be the existing SetData, and you could have SetData internally handle the case where the buffer in aData is one we allocated via Allocate().

::: gfx/layers/ImageContainer.cpp
@@ +155,5 @@
>  ImageContainer::SetCurrentImage(Image *aImage)
>  {
>    ReentrantMonitorAutoEnter mon(mReentrantMonitor);
>  
> +  if (aImage && !aImage->IsValid()) {

Shouldn't this be an assertion?

::: gfx/layers/ImageContainer.h
@@ +72,5 @@
>  
>    ImageFormat GetFormat() { return mFormat; }
>    void* GetImplData() { return mImplData; }
>  
> +  virtual bool IsValid() { return true; }

Document this!

@@ +669,5 @@
>  
>    /**
> +   * Allocate image buffer(s) and give aData ownership of the image.
> +   * The information about the size , stride, etc. must be set in aData
> +   * prior to calling this method.

Expand this comment to indicate that Allocate fills in the buffer pointers in aData. Also, fix space before ','

Also, explain the requirements around which methods must be called and in what order.

::: gfx/layers/ipc/ImageContainerChild.cpp
@@ +65,5 @@
> +  }
> +
> +  bool IsSent() const
> +  {
> +    return mSent;

Document what this means

::: gfx/layers/ipc/ImageContainerChild.h
@@ +225,5 @@
> +  void AllocateSharedBufferForImageNow(Image* aImage);
> +
> +  /**
> +   * Calls AllocateSharedBufferForImageNow and notify the barrier monitor.
> +   * This is meant to be called with AllocateSharedBufferForImage if the

"called by" instead of "called with", I think

::: gfx/layers/ipc/LayersSurfaces.ipdlh
@@ +72,5 @@
>    Shmem Ydata;
>    Shmem Udata;
>    Shmem Vdata;
>    nsIntRect picture;
> +  uintptr_t imageID;

This should be documented
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #10) 
> I think you missed some usage of PlanarYCbCrImage --- getUserMedia is on
> trunk and using that now. There's additional WebRTC code coming that will
> use it too. See MediaEngineDefaultVideoSource::Start and
> MediaEngineWebRTCVideoSource::DeliverFrame. I think they will need to behave
> like nsBuiltinDecoderReader currently does, copying frames into the
> allocated buffer. So, I think it might be a good idea to have a shared API
> that copies a source buffer into the allocated buffer. This could be the
> existing SetData, and you could have SetData internally handle the case
> where the buffer in aData is one we allocated via Allocate().

In this version, SetData(data) either acquires the data if it is in a buffer allocated by Allocate(data), or allocates a buffer and copies into it otherwise. In the latter case the buffer passed in data should be deallocated by the caller (since the caller allocated it).

> 
> ::: gfx/layers/ImageContainer.cpp
> @@ +155,5 @@
> >  ImageContainer::SetCurrentImage(Image *aImage)
> >  {
> >    ReentrantMonitorAutoEnter mon(mReentrantMonitor);
> >  
> > +  if (aImage && !aImage->IsValid()) {
> 
> Shouldn't this be an assertion?
> 
> ::: gfx/layers/ImageContainer.h
> @@ +72,5 @@
> >  
> >    ImageFormat GetFormat() { return mFormat; }
> >    void* GetImplData() { return mImplData; }
> >  
> > +  virtual bool IsValid() { return true; }
> 
> Document this!

The IsValid stuff is a mistake on my part. It was used when the image was losing ownership of its buffers, and I forgot to remove it entirely when I went back to the current sharing model (now we use IsSent instead). I removed it.

I fixed your other items and rebased the patch.
Attachment #653972 - Attachment is obsolete: true
Attachment #653972 - Flags: review?(roc)
Attachment #654342 - Flags: review?(roc)
Comment on attachment 654342 [details] [diff] [review]
Copy the image out of the decoder directly into shared memory.

Review of attachment 654342 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with the nsBuiltinDecoderReader changes removed

::: content/media/nsBuiltinDecoderReader.cpp
@@ +249,5 @@
> +                            data.mCbCrSize,
> +                            Cr.mStride, data.mCbCrStride,
> +                            Cr.mOffset, Cr.mSkip);
> +
> +  // set the image's data to 'data'

We don't actually need to change nsBuiltinDecoderReader now, do we? The default SetData should handle it.
Attachment #654342 - Flags: review?(roc) → review+
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #12)
> Comment on attachment 654342 [details] [diff] [review]
> Copy the image out of the decoder directly into shared memory.
> 
> Review of attachment 654342 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> r+ with the nsBuiltinDecoderReader changes removed
> 
> ::: content/media/nsBuiltinDecoderReader.cpp
> @@ +249,5 @@
> > +                            data.mCbCrSize,
> > +                            Cr.mStride, data.mCbCrStride,
> > +                            Cr.mOffset, Cr.mSkip);
> > +
> > +  // set the image's data to 'data'
> 
> We don't actually need to change nsBuiltinDecoderReader now, do we? The
> default SetData should handle it.

Yes indeed. I removed the changes to nsBuiltinDecoderReader.
Backed out for apparently causing an orange on Android M2
https://hg.mozilla.org/integration/mozilla-inbound/rev/00dea115746d
It looks like it may also have caused a Linux64 reftest permaorange, opt and debug.

REFTEST TEST-UNEXPECTED-FAIL | file:///home/cltbld/talos-slave/test/build/reftest/tests/layout/reftests/webm-video/bug686957.html | image comparison (==)
Doh!
Looking at the M2 crash, it seems like we run out of shared memory. Shared memory consumption could be reduced by reducing the size of the MediaQueue in nsBuiltinDecoderReader.h. roc, does that sound reasonable (I mean reducing the number of elements in the MediaQueue)?

At first sight it looks like we don't have a way to recover from failing to allocate shared memory. It would probably be useful to be able to not crash, and to fall back to using PlanarYCbCrImage rather than SharedPlanarYCbCrImage when running out of shared memory.
If we run out of shmem, actor->AllocateShmem() will return null.  It happens all the time.
Can't SharedPlanarYCbCrImage fall back to behaving like PlanarYCbCrImage if we're out of shmem?
We could try.  If we're really out of shmem though, there'll be no way to efficiently send PlanarYCbCrImage to the compositor either.
Then it seems like we need a significantly smarter way to manage shared memory here.

Can we monitor shared memory usage and use SharedPlanarYCbCrImage when shared memory is plentiful (say more than 50% available) and fall back to PlanarYCbCrImage otherwise?

(I wonder why shared memory is a scarce resource anyway, but never mind.)
It's not scarce compared to normal memory, but it's like <img> --- a giant <video> will eat all our RAM.  It happens.

I don't understand what problem we're trying to solve here.  Are you proposing to write the PlanarYCbCrImage bytes directly to the IPC socket if we can't allocate shmem?
Sorry, I spoke too fast.

Actually, the crash was more likely caused by bad reference counting of the sent images using mImageQueue in ImageContainerChild.

I removed the queue; instead, I serialize the PlanarYCbCrImage pointer in the SharedImage, AddRef it manually when sending, and Release it when it comes back by deserializing the pointer rather than popping from mImageQueue. I can't reproduce the crash anymore on fennec, so it looks good, but I need to try it on b2g.


(In reply to Chris Jones [:cjones] [:warhammer] from comment #22)
> It's not scarce compared to normal memory, but it's like <img> --- a giant
> <video> will eat all our RAM.  It happens.

Ok, so I got it wrong. My fear of running out of shared memory comes from my (not very scientific) observation that I crashed way faster on Linux when leaking shmems than when leaking images allocated in "normal" memory. So I wrongly assumed that it was a scarcer resource than RAM. If shared memory is not a scarce resource, ignore the next block of text:

If shared memory happens to be a scarcer resource on some platforms and we run into a limit, we could switch to using non-shared YCbCrImages in the queue and do the extra copy to shared memory just before sending to the compositor. This would consume less shared memory (basically the 10 images sitting in the queue per video). We would also skip frames while we are out of shmem, so that the frames already in shared memory eventually get sent to the compositor, sent back, recycled into the pool and then only used at the last moment (at that point we would also reduce the max size of the pool).
The transition from eagerly using shared images to using them only at the last moment would most likely skip one or two frames, which is OK.
It might make sense to install a low-memory handler that drops video frames from the queues of video elements (e.g. every second frame?), and/or have the decoder throw away frames at the same rate if we think memory is low.
Ok, I take back what I said about running out of shared memory: with many open videos on the same page we get to a point where we have "too many open files (24)" and allocation fails, which aborts in the ImageBridgeChild code. So we don't really run out of shared memory, but we do run out of individual shmems.
I'm changing the code so that the frame is simply skipped when we get into this situation.
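The skip-on-failure behaviour can be sketched as below. `AllocShmem`, the global flag, and `QueueVideoFrame` are illustrative stand-ins that model a failing allocation, not the real Gecko API:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

struct Shmem {
  size_t size;
};

// Toggled here to simulate the "too many open files" condition.
static bool gOutOfShmems = false;

// Returns null on allocation failure instead of aborting.
std::unique_ptr<Shmem> AllocShmem(size_t aSize) {
  if (gOutOfShmems) {
    return nullptr;
  }
  return std::unique_ptr<Shmem>(new Shmem{aSize});
}

// Returns true if the frame was queued, false if it had to be skipped.
bool QueueVideoFrame(size_t aFrameBytes) {
  std::unique_ptr<Shmem> shmem = AllocShmem(aFrameBytes);
  if (!shmem) {
    return false; // skip this frame; later frames may succeed again
  }
  // ... copy the decoded frame into the shmem and enqueue it ...
  return true;
}
```

The key property is that a failed allocation drops one frame and leaves the pipeline able to recover, instead of aborting the process.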
Target Milestone: --- → mozilla17
Version: unspecified → Trunk
Yeah, this is a known problem exacerbated by mac having a low fd limit.
Now that mochitest content/media/tests/test_connections_closed.html does not crash anymore from creating 20 videos, the test still fails because it times out waiting for every video to get its first frame (which can't happen because we are already using all the shmems available in the other videos).

I feel like adding a flag parameter to ImageContainer::CreateImage allowing something like FORCE_NON_SHARED to force us not to place SharedImages in the media queue when we detect that we are out of shmems. Does it sound like a bad thing to do?
(In reply to Nicolas Silva [:nical] from comment #27)
> I feel like adding a flag parameter to ImageContainer::CreateImage allowing
> something like FORCE_NON_SHARED to force us not to place SharedImages in the
> media queue when we detect that we are out of shmems. Does it sound like a
> bad thing to do?

I think ImageContainerChild::CreateImage should fall back internally, which is more or less what I suggested in comment #21.
Depends on: 790716
Depends on: 790762
No longer depends on: 790762
Depends on: 792252
This is a new version rewritten on top of bug 790716, which makes it so that we use 1 shmem per image instead of 3. With this we don't run into the limit of open file descriptors we were hitting before in mochitest 2.

Bug 790716 also made it easier to allocate several images in one shmem if need be (then we would probably want to put the entire MediaQueue in one shmem per video). We don't do this right now, but we may decide to do it in a follow-up bug (though I am not sure we need it anymore).
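Packing the three YCbCr planes into a single shmem (one file descriptor instead of three) amounts to recording per-plane offsets into one contiguous buffer. This is a purely illustrative layout sketch, not the actual structure from bug 790716:

```cpp
#include <cassert>
#include <cstddef>

// One contiguous buffer holds Y, then Cb, then Cr; each plane is found by
// its offset rather than by a separate shmem handle.
struct YCbCrLayout {
  size_t ySize;
  size_t cbSize;
  size_t crSize;

  size_t YOffset() const { return 0; }
  size_t CbOffset() const { return ySize; }
  size_t CrOffset() const { return ySize + cbSize; }
  size_t TotalSize() const { return ySize + cbSize + crSize; }
};
```

Extending the same idea to several images per shmem would just add a per-image base offset in front of the per-plane ones.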

By the way, how did we decide on 10 images in the MediaQueue? It "feels" like a lot to me, but I am not knowledgeable in the matter; I just see that there are almost always 10 images sitting in the queue. Maybe we don't need that much buffering, and maybe we could lower it to reduce our memory consumption (and the number of open file descriptors)?
Attachment #654342 - Attachment is obsolete: true
Attachment #665247 - Flags: review?(roc)
nical, want to do a quick patch to use a pref for those 10 images? Would be nice to tune that on b2g. Our memory is pretty scarce.
Making it a pref is a good idea, but make sure not to use the pref API off the main thread. The value will have to be messaged over.

The value needs to be large enough to cover the latency of decoding large keyframes. 10 covers 300ms of latency at 30fps. We have observed some pretty big latencies in software decoders.

We could probably autotune the value based on the observed latencies of decoding a particular video stream, e.g. start conservatively and, once we've decoded a few seconds of video, make the queue length 50% bigger than the largest latency we've seen so far.
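The autotuning idea above can be sketched as follows. `QueueTuner` and its methods are hypothetical names; latencies are measured here in frames for simplicity:

```cpp
#include <algorithm>
#include <cassert>

// Start with a conservative queue length, track the worst decode latency
// observed, and grow the queue to 50% more than that worst case.
class QueueTuner {
public:
  explicit QueueTuner(int aInitialLength) : mLength(aInitialLength) {}

  void ObserveLatencyFrames(int aLatency) {
    mMaxLatency = std::max(mMaxLatency, aLatency);
    // 50% bigger than the largest latency seen so far, rounded up,
    // but never shrinking below the conservative initial value.
    mLength = std::max(mLength, (mMaxLatency * 3 + 1) / 2);
  }

  int Length() const { return mLength; }

private:
  int mLength;
  int mMaxLatency = 0;
};
```

This grows the queue only when a slow decode is actually observed, so a fast stream keeps the small conservative queue.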
I filed bug 794747 for the MediaQueue size
Comment on attachment 665247 [details] [diff] [review]
Copy the image out of the decoder directly into shared memory.

Review of attachment 665247 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/ImageContainerChild.cpp
@@ +436,5 @@
> +  p.mProtocol = this;
> +  p.mBufSize = aBufSize;
> +  p.mType = aType;
> +  p.mShmem = aShmem;
> +  p.mResult = false;

I think you can use a { } struct initializer here.
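A tiny illustration of that review note: when the parameter holder is a plain aggregate struct, the five member-by-member assignments collapse into one brace initializer. The struct here is a simplified stand-in for the one in the patch, not the real definition:

```cpp
#include <cassert>
#include <cstddef>

struct AllocShmemParams {
  void* mProtocol;
  size_t mBufSize;
  int mType;
  bool mResult;
};

// Instead of p.mProtocol = ...; p.mBufSize = ...; etc., initialize in one go.
AllocShmemParams MakeParams(void* aProtocol, size_t aBufSize, int aType) {
  AllocShmemParams p = { aProtocol, aBufSize, aType, false };
  return p;
}
```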
Attachment #665247 - Flags: review?(roc) → review+
Here is the final version, with the struct initializer as you said.

We need bug 792252 (which is still being reviewed) to land before we can land this one.
Attachment #665247 - Attachment is obsolete: true
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Whiteboard: [leave open]