Closed Bug 901224 Opened 11 years ago Closed 11 years ago

ImageBridgeChild returns gralloc buffers too early to GonkNativeWindow

Categories

(Firefox OS Graveyard :: General, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(blocking-b2g:koi+, firefox26 fixed)

RESOLVED FIXED
1.2 FC (16sep)
blocking-b2g koi+
Tracking Status
firefox26 --- fixed

People

(Reporter: sotaro, Assigned: nical)

References

Details

Attachments

(1 file, 17 obsolete files)

33.16 KB, patch
sotaro
: review+
Details | Diff | Splinter Review
On b2g master, during h.264 video playback, sometimes I saw out of order video frames are rendered on display.
blocking-b2g: --- → koi?
This sounds like the problem pchang told me about with the GonkNativeWindow mnaging a circular buffer of gralloc buffers without having ways to make sure the buffers are not being used by the layers, which results in sometimes the camera drawing directly into the buffer that is brought to the screen instead of the ones that are on there way in the queue.

Bug 858914 tries to build a saner foundation on top of which we can implement ways to notify correctly the camera about which buffers are free to be reused, but this part is not implemented yet. Also the new architecture from 858914 is preffed off on B2G right now.

If this bug is the problem I am referring to, it is probably possible to hack around it on top of the current (deprecated) textures, but that's tricky and very fragile. I would recommend waiting on new textures to arrive on B2G (that's what I am working on at the moment) and implement a sane mechanism on top of that.

If this is not the bug I am referring to, then nevermind what I just said.
(In reply to Nicolas Silva [:nical] from comment #1)
> This sounds like the problem pchang told me about with the GonkNativeWindow
> mnaging a circular buffer of gralloc buffers without having ways to make
> sure the buffers are not being used by the layers, which results in
> sometimes the camera drawing directly into the buffer that is brought to the
> screen instead of the ones that are on there way in the queue.

Do you know why we wouldn't see this on 1.1?
(In reply to Nicolas Silva [:nical] from comment #1)
> This sounds like the problem pchang told me about with the GonkNativeWindow
> mnaging a circular buffer of gralloc buffers without having ways to make
> sure the buffers are not being used by the layers, which results in
> sometimes the camera drawing directly into the buffer that is brought to the
> screen instead of the ones that are on there way in the queue.

Actually GonkNativeWindow do not work as a ciruclar buffer. Just a buffer pool by using slots.

> Bug 858914 tries to build a saner foundation on top of which we can
> implement ways to notify correctly the camera about which buffers are free
> to be reused, but this part is not implemented yet. Also the new
> architecture from 858914 is preffed off on B2G right now.

Yup, I am waiting Bug 858914 fixed. I feel the problem happens because gralloc buffers are returned to GonkNativeWindow too early.
> Do you know why we wouldn't see this on 1.1?

In v 1.1, buffer is returned to GonkNativeWindow when next rendering buffer is arrived to ImageBridgeParent.
(In reply to Sotaro Ikeda [:sotaro] from comment #4)
> > Do you know why we wouldn't see this on 1.1?
> 
> In v 1.1, buffer is returned to GonkNativeWindow when next rendering buffer
> is arrived to ImageBridgeParent.

What code does this?
> > In v 1.1, buffer is returned to GonkNativeWindow when next rendering buffer
> > is arrived to ImageBridgeParent.
> 
> What code does this?

sorry, it is ImageContainerParent. At ImageContainerParent::RecvPublishImage().
http://mxr.mozilla.org/mozilla-b2g18/source/gfx/layers/ipc/ImageContainerParent.cpp#33
But the code has one problem. The above code does not ensure previous buffer is not in the middle of the rendering.
Summary: Video frame order is not correct at H.264 video playback on b2g master → Video frame order is not correct at H.264 video playback on b2g master(ICS gonk)
Summary: Video frame order is not correct at H.264 video playback on b2g master(ICS gonk) → ImageBridgeChild returns gralloc buffers too early to GonkNativeWindow
Depends on: 907745
Even when applying the patches in Bug 907745, gralloc buffers are returned to  GonkNativeWindow too early timing. It happens because GrallocTextureClientOGL do not hold a reference to GraphicBufferLocked, but hold a reference to GraphicBuffer.
Yes - we need the patches from bug 907745, but that is just the prerequisite, it does not actually solve the problem.
This is a temporary patch. Confirmed on master unagi.
Upload attachment 796268 [details] [diff] [review], just to share current temporary patch. I created the patch to help development of Bug 871364.
Comment on attachment 796268 [details] [diff] [review]
temporary patch - release gralloc buffers when it is not used by Compositor

I forgot to add necessary part.
Attachment #796268 - Attachment is obsolete: true
Blocks: 909564
Add the missing part in v1. Added reference from GrallocTextureClientOGL to GraphicBufferLocked. The patch is just a temporary patch.
Blocks: 871364
unbitrot.
Attachment #796283 - Attachment is obsolete: true
No longer depends on: 858914
On nexus4, by applying attachment 798269 [details] [diff] [review], the video rendering distortion's problem becomes much better. But I can still sometimes see a few distortion during youtube video playback via web browser like the following. By the patch, gfx layers handle gralloc buffers correctly. To fix the remaining problem, another way of fixing might be necessary. It seems better he remaining problem to be handled in another bug. 

http://www.youtube.com/watch?v=I1L9-YALU6s
nexus4 occasionally freeze during youtube video playback. When the phone come into this status, the touch ui do not respond at all. Even "soft home button" did not respond. At first I thought that it is related to new GonkNativeWindow for JB make the problem. But from quick investigation, it seems to be caused by compositor side in b2g process. From application's point of view, transaction to LayerTransactionParent did not respond.

I checked the following combination on nexus4. The freeze happens in all cases.
- [1]
- [1]+[2]
- [1]+[2]+[3]
- [1]+[2]+[3]+[4]

[1] GonkNativeWindow support in JB gonk (Bug 871364)
[2] Enable new gralloc textures (Bug 907745)
[3] Fix memory leak at ImageClientSingle (Bug 910921)
[4] Hold a reference to Image during being used by compsitor(Bug 901224)
(In reply to Sotaro Ikeda [:sotaro] from comment #17)
> nexus4 occasionally freeze during youtube video playback. 

Bug 911560 is the bug for the youtube freeze on nexus4.
Assignee: nobody → sotaro.ikeda.g
Fix compile error in another platform than b2g.
Attachment #798269 - Attachment is obsolete: true
Move holding a reference to Image part from GrallocTectureClient to TextureClient.
Attachment #798342 - Attachment is obsolete: true
Comment on attachment 798841 [details] [diff] [review]
patch v6 - hold a reference to Image ony during TextureClient is added to CompositableForwarder

nical, can I have a feed back about the patch from you?
Attachment #798841 - Flags: feedback?(nical.bugzilla)
(In reply to Sotaro Ikeda [:sotaro] from comment #21)
> Comment on attachment 798841 [details] [diff] [review]
> patch v6 - hold a reference to Image ony during TextureClient is added to
> CompositableForwarder
> 
> nical, can I have a feed back about the patch from you?

How do you think about the way of implementing the patch?
(In reply to Sotaro Ikeda [:sotaro] from comment #21)
> Comment on attachment 798841 [details] [diff] [review]
> patch v6 - hold a reference to Image ony during TextureClient is added to
> CompositableForwarder
> 
> nical, can I have a feed back about the patch from you?

I need to spend some time thinking about the problem and your solution. My first gut feeling is that it it's a bit scary and that I would like to avoid doing that in the long run but it is a difficult problem so let me think about it for a few hours. I'll give you feedbacks before tomorrow.
Yeah, it is a bit scary. We need a way to extend the lifetime of Image during TextureClient/TextureHost using it.
(In reply to Sotaro Ikeda [:sotaro] from comment #24)
> Yeah, it is a bit scary. We need a way to extend the lifetime of Image
> during TextureClient/TextureHost using it.

Actually need to extend the life time of GraphicBufferLocked. A problem of current GraphicBufferLocked is that it can not call GraphicBufferLocked::Unlock() during GraphicBufferLocked's destructor.
(In reply to Sotaro Ikeda [:sotaro] from comment #25)
> Actually need to extend the life time of GraphicBufferLocked. A problem of
> current GraphicBufferLocked is that it can not call
> GraphicBufferLocked::Unlock() during GraphicBufferLocked's destructor.

GraphicBufferLocked is used only on b2g gonk, other Image(SharedPlanarYCbCrImage) also need to extend the allocated buffer during the buffer is used by TextureClient/TextureHost.
(In reply to Sotaro Ikeda [:sotaro] from comment #26)
> GraphicBufferLocked is used only on b2g gonk, other
> Image(SharedPlanarYCbCrImage) also need to extend the allocated buffer
> during the buffer is used by TextureClient/TextureHost.

SharedPlanarYCbCrImage (not the deprecated one) holds a strong ref to a TextureClient and it is the texture client that manages the life time of the shared data (the Image does not have a direct reference to the shmem). So even if the Image goes away, the shared data is still alive as long as someone else holds a reference to the texture client (theoretically that is, and will be true in practice when I fix bug 912040, for which I have a patch that looks good locally (but I need to test it extensively))

So we don't need to worry about non-gralloc stuff.
What is different with gralloc is that it is the GrallocBufferLocked that controls when we return the buffer to the GonkNativeWindow (if I understand correctly).
Ideally we want a mechanism to call GrallocBufferLocked::Unlock() here:
http://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/ClientLayerManager.cpp?from=ClientLayerManager.cpp#l335
because it is the first moment on the client side that we are guaranteed that nobody on the compositor side has a reference to the texture.

However, this happens when the texture client is already destroyed so we need to register a callback to the CompositableForwarder (the image bridge), that will be executed there.

Just extending the lifetime of the shared data to the one of the texture client won't be enough because the message that tells the compositor to remove the texture will be sent on the next transaction so at the moment of the texture client's destructor you do not have the guarantee that the compositor is not using the data.
attachment 798841 [details] [diff] [review] have 2 parts.
 - TextureClient hold reference to GraphicBufferLocked during the gralloc buffers are used by TextureHost.
 - Implement SetIdle()

It might be better to split this bug to 2 bugs.
>  - TextureClient hold reference to GraphicBufferLocked during the gralloc
> buffers are used by TextureHost.
>  - Implement SetIdle()

Bug 912186 is for SetIdle().
(In reply to Nicolas Silva [:nical] from comment #27)
> 
> So we don't need to worry about non-gralloc stuff.
> What is different with gralloc is that it is the GrallocBufferLocked that
> controls when we return the buffer to the GonkNativeWindow (if I understand
> correctly).
> Ideally we want a mechanism to call GrallocBufferLocked::Unlock() here:
> http://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/
> ClientLayerManager.cpp?from=ClientLayerManager.cpp#l335
> because it is the first moment on the client side that we are guaranteed
> that nobody on the compositor side has a reference to the texture.
> 

Thanks for the comment, I am going to update a patch focus only to gralloc buffer.
Saner approach than v6. Change ImageBridgeChild::RemoveTexture() as to use synchronous transaction. By this change, transaction end can ensure texture removal in Compositor(Host) side. From ANativeWindow's request, buffer removal needs to be synchronous from GonkNativeWindow. If it is asynchronous, delay could become significant if Compositor is busy(experience from Bug 826829). It can not acceptable for the product. And in current implementation, each video frame update include TextureClient add(sync) and removal. So, it should not affect to the performance.
Attachment #798841 - Attachment is obsolete: true
Attachment #798841 - Flags: feedback?(nical.bugzilla)
By applying attachment 799289 [details] [diff] [review], video playback seems not having rendering distortion on nexus-4 that is present until v6 patch.
Comment on attachment 799289 [details] [diff] [review]
patch v7 - hold a reference to Image ony during TextureClient is added to CompositableForwarder

Review of attachment 799289 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/GrallocImages.h
@@ +52,5 @@
>  protected:
>    SurfaceDescriptor mSurfaceDescriptor;
>  };
>  
> +class GraphicBufferLockedWrapper : public GraphicBufferLocked {

It would be nice to have a comment here explaining that we need this wrapper to avoid calling Unlock which is virtual in GraphicBufferLocked's destructor.
Otherwise six month from now people with good intentions may not see this and break it without knowing.

@@ +68,5 @@
> +      mGraphicBufferLocked = nullptr;
> +    }
> +  }
> +
> +  virtual void Unlock() {}

Would bad thing happen if users of GraphicBufferLocked call this method and expect unlocking to happen while under the hood they are using the wrapper and which unlocks in the destructor?
If so maybe we should put a warning in unlock. Otherwise just a comment maybe?

::: gfx/layers/client/ImageClient.cpp
@@ +243,5 @@
>  void
> +ImageClientBuffered::SetIdleNow()
> +{
> +    if (mFrontBuffer) {
> +      RemoveTextureClient(mFrontBuffer);

We need to not call RemoveTextureClient manually. This should be only called by TextureClient's destructor (See bug 912040).

@@ +247,5 @@
> +      RemoveTextureClient(mFrontBuffer);
> +      mFrontBuffer = nullptr;
> +    }
> +    if (mBackBuffer) {
> +      RemoveTextureClient(mBackBuffer);

Same thing

@@ +263,5 @@
> +void
> +ImageClientSingle::RemoveTextureClient(TextureClient* aClient)
> +{
> +  CompositableClient::RemoveTextureClient(aClient);
> +}

Why do we need to override here?
(In reply to Nicolas Silva [:nical] from comment #33)
> Comment on attachment 799289 [details] [diff] [review]
> patch v7 - hold a reference to Image ony during TextureClient is added to
> CompositableForwarder
> 
> Review of attachment 799289 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: gfx/layers/GrallocImages.h
> @@ +52,5 @@
> >  protected:
> >    SurfaceDescriptor mSurfaceDescriptor;
> >  };
> >  
> > +class GraphicBufferLockedWrapper : public GraphicBufferLocked {
> 
> It would be nice to have a comment here explaining that we need this wrapper
> to avoid calling Unlock which is virtual in GraphicBufferLocked's destructor.
> Otherwise six month from now people with good intentions may not see this
> and break it without knowing.

Will add a comment.

> 
> @@ +68,5 @@
> > +      mGraphicBufferLocked = nullptr;
> > +    }
> > +  }
> > +
> > +  virtual void Unlock() {}
> 
> Would bad thing happen if users of GraphicBufferLocked call this method and
> expect unlocking to happen while under the hood they are using the wrapper
> and which unlocks in the destructor?
> If so maybe we should put a warning in unlock. Otherwise just a comment
> maybe?

unlocks in the destructor. Just a comment seems enough.

> ::: gfx/layers/client/ImageClient.cpp
> @@ +243,5 @@
> >  void
> > +ImageClientBuffered::SetIdleNow()
> > +{
> > +    if (mFrontBuffer) {
> > +      RemoveTextureClient(mFrontBuffer);
> 
> We need to not call RemoveTextureClient manually. This should be only called
> by TextureClient's destructor (See bug 912040).

Will update.
 
> @@ +247,5 @@
> > +      RemoveTextureClient(mFrontBuffer);
> > +      mFrontBuffer = nullptr;
> > +    }
> > +    if (mBackBuffer) {
> > +      RemoveTextureClient(mBackBuffer);
> 
> Same thing

Will update.

> 
> @@ +263,5 @@
> > +void
> > +ImageClientSingle::RemoveTextureClient(TextureClient* aClient)
> > +{
> > +  CompositableClient::RemoveTextureClient(aClient);
> > +}
> 
> Why do we need to override here?

Forgot to remove the code from base other patch.
Apply the comment.
Attachment #799289 - Attachment is obsolete: true
Depends on: 912040
Fix build failure.
Attachment #799705 - Attachment is obsolete: true
From v8, the patch depends on bug 912040. Then the assumption of the TextureClient is changed. Prrevisously, the  Current pach does not take the change. Need to update the patch. RemoveTextureClient() was called before the destructor was called. But bug 912040 changed the code to call RemoveTextureClient() in the destructor.
Update the patch depend on Bug 912040. This patch is very dirty patch.
Attachment #799726 - Attachment is obsolete: true
(In reply to Sotaro Ikeda [:sotaro] from comment #39)
> Created attachment 799793 [details] [diff] [review]
> patch v10 - hold a reference to Image ony during TextureClient is added to
> CompositableForwarder
> 
> Update the patch depend on Bug 912040. This patch is very dirty patch.

On emulator, I still can see a problem...
Comment on attachment 799793 [details] [diff] [review]
patch v10 - hold a reference to Image ony during TextureClient is added to CompositableForwarder

This code has a deadlock problem. Set to obsolete.
Attachment #799793 - Attachment is obsolete: true
Attached file dead lock's stack (obsolete) —
Stack dump of the deadlock. The deadlock happen by ImageContainer's ReentrantMonitor. ImageContainer::SetCurrentImage() try to flush the buffer synchronously. But ImageClientSingle::UpdateImage() try to hold a ImageContainer's ReentrantMonitor by calling ImageContainer::LockCurrentImage(). Then fall into deadlock.
In Bug 912040, the way of implementation is changed. Then also need to change the patch in this bug.
I am going to revert to patch v7 and then apply necessary change.
(In reply to Sotaro Ikeda [:sotaro] from comment #43)
> In Bug 912040, the way of implementation is changed. Then also need to
> change the patch in this bug.
> I am going to revert to patch v7 and then apply necessary change.

Yes, sorry about that
update patch. Still have the following problem.
 - On emulator, camera preview frame order is still not correct.
 - Deadlock happens to change the camera mode(camera/video)
Here is roughly the idea that I have in mind to deal with deallocating or unlocking shared buffers:

When we send OpRemoveTexture, we ask the TextureClient to drop it's shared data as a TextureClientData object. we store this textureClientData in a map with an ID and when we receive the ReplyTextureRemoved message from the compositor we call DeallocateSharedData(...) on the TextureClientData.

In the case of GraphicBufferLockedTextureClientData, DeallocateSharedData just unlocks the GraphicBufferLocked (but for Shmem and friends it does a deallocation).

Sotaro, this is a very rough patch, there are missing parts and I didn't even test it. I just want to to know what you think about the idea before I spend more time on it.

If you apply it it will most likely not work (although it compiles on top of the patch from bug 912939).
Attachment #800264 - Flags: feedback?(sotaro.ikeda.g)
Idea looks good! From this bug's scope, ImageContainer::SetCurrentImage(nullptr) call should ensure the gralloc buffers returned to GonkNativeWindow synchronously. To to this, right now, I made following change to delete Texture synchronous. Is there no problem about the change?
-------------------------------
 void
 ImageBridgeChild::RemoveTexture(CompositableClient* aCompositable,
                                 uint64_t aTexture,
                                 TextureFlags aFlags)
 {
-  mTxn->AddNoSwapEdit(OpRemoveTexture(nullptr, aCompositable->GetIPDLActor(),
+  mTxn->AddEdit(OpRemoveTexture(nullptr, aCompositable->GetIPDLActor(),
                                       aTexture,
                                       aFlags));
 }
Flags: needinfo?(nical.bugzilla)
Comment on attachment 799959 [details]
dead lock's  stack

dead lock was already fixed.
Attachment #799959 - Attachment is obsolete: true
- fix deadlock and reduce the competition of ImageContainer's reentrant monitor.
- code is a little bit dirty.
Attachment #800223 - Attachment is obsolete: true
Even by this patch, camera mode change between camera mode and video mode is unacceptably slow. The cause of slowness seems to sync transaction of SendUpdate(). Camera preview update is handled by synchronously, this seems to block the gralloc buffer re-allocation via ImageBridge. In b2g18, the ImageBridge's video frame update was done asynchronously. The Blockage by the preview update in b2g18 was slow but acceptable range. But in master's reallocation slowness is way over the acceptable range.
comment #50 is big problem than I thought at the first time.
About comment #50, for Firefox OS v1.2 we need to implement another way of allocate/deallocate the gralloc buffers. My current thought is like the following. Any way this is out of scope of this bug.
- use android binder IPC +  android::IGraphicBufferAlloc like api
- The allocated gralloc buffers are identified by ID in IPDL ipc.
Update function name SetIdle() to FlushImage().
Attachment #800294 - Attachment is obsolete: true
Nical, if RemoveTexture is always synchronous, client side seems not to need to wait ReplyTextureRemoved. How do you think about it? In current ImageBridge, a new TextureClient is created and old one is destroyed in each rendering updates. Therefore almost all destroy operations are delivered to CompositableTransactionParent synchronously with other operations. And the corresponding TextureHosts are destroyed synchronously. Is my understanding correct?
Blocks: 911560
(In reply to Sotaro Ikeda [:sotaro] from comment #45)
> update patch. Still have the following problem.
>  - On emulator, camera preview frame order is still not correct.

Emulator's problem seems to be caused by other reasons. From debugging, it became clear that the gralloc buffer handling around ImageBridge, ImageHost and ImageClientBuffered is correct. And By applying the patch, H.264 video rendering becomes correct, no rendering distortion and no incorrect order rendering. So, it can be put off as a separate bug.
Blocks: 913279
Bug 913279 is for emulator camera preview problem.
(In reply to Sotaro Ikeda [:sotaro] from comment #50)
> Even by this patch, camera mode change between camera mode and video mode is
> unacceptably slow. The cause of slowness seems to sync transaction of
> SendUpdate(). Camera preview update is handled by synchronously, this seems
> to block the gralloc buffer re-allocation via ImageBridge. In b2g18, the
> ImageBridge's video frame update was done asynchronously. The Blockage by
> the preview update in b2g18 was slow but acceptable range. But in master's
> reallocation slowness is way over the acceptable range.

I tested it on unagi device, it seems that horrible delay seems unagi device specific problem. I tested on master hamachi(HWComposer disabled) and master nexus-4. They does not has such horrible delay. It might be affected by unagi's following problem.
- [Bug 880780] Camera - viewfinder is corrupted, mostly green, with random noise
camera mode change between camera mode and video mode was not horrible in hamachi and nexus-4. But hamachi's camera preview start is actually slow than b2g18. Need to be more quick than b2g18. Following reasons are possible.
- [1] gralloc buffers allocation via ImageBridge is slow
- [2] Connection of ImageClientBuffered class to ImageHost via ImageClientBridge is slow.

[1] is already a problem in b2g18.
[2] is recognized during debugging around ImageBridge in master.
On master hamachi, duration until camera preview start is not horrible as in unagi device. But still too long as a product. Need to fix it. This should be handled in a different bug.
Blocks: 913299
Bug 913299 is created for camera preview start lag problem on hamachi.
(In reply to Sotaro Ikeda [:sotaro] from comment #54)
> Nical, if RemoveTexture is always synchronous, client side seems not to need
> to wait ReplyTextureRemoved. How do you think about it? In current
> ImageBridge, a new TextureClient is created and old one is destroyed in each
> rendering updates. Therefore almost all destroy operations are delivered to
> CompositableTransactionParent synchronously with other operations. And the
> corresponding TextureHosts are destroyed synchronously. Is my understanding
> correct?

The way things currently work, as long as GraphicBufferLocked::Unlock() is called is called after the IPC round trip, it should be ok as you said.
I like the idea of explicitly using ReplyTextureRemoved though, because as the code evolves it is harder to maintain implicit invariants like "this will be safe to destroy/recycle at the end of the next transaction because of the way things work" than "This will be safe to destroy/recycle when you are notified explicitly that it is safe".

This is more of a way to prevent things from breaking as people add code. Because some people will not always know the specific requirements of B2G camera and will test their code on other platforms with different memory management requirements, and this kind of race conditions are hard to catch with try servers.
Flags: needinfo?(nical.bugzilla)
Comment on attachment 800264 [details] [diff] [review]
Register callbacks to deallocate/unlcok texture client data

Yeah, I agreed that the implementation is better than my current one. I am going to recreate the patch based on your patch.
Attachment #800264 - Flags: feedback?(sotaro.ikeda.g) → feedback+
I am also doing that just now (adding the FlushImage stuff from your patch to mine)
(In reply to Sotaro Ikeda [:sotaro] from comment #50)
> Even by this patch, camera mode change between camera mode and video mode is
> unacceptably slow. The cause of slowness seems to sync transaction of
> SendUpdate(). Camera preview update is handled by synchronously, this seems
> to block the gralloc buffer re-allocation via ImageBridge. In b2g18, the
> ImageBridge's video frame update was done asynchronously. The Blockage by
> the preview update in b2g18 was slow but acceptable range. But in master's
> reallocation slowness is way over the acceptable range.

I have a proposal for this performance issue: see bug 913461
(In reply to Nicolas Silva [:nical] from comment #64)
> I am also doing that just now (adding the FlushImage stuff from your patch
> to mine)

OK. Thanks.
Here is what I have so far, although I still have the crash after the last video frame, so there is something that I must be doing wrong or that I forgot to pull from sotaro's patch somewhere along the FlushImage code I suppose.
Attachment #800264 - Attachment is obsolete: true
(In reply to Nicolas Silva [:nical] from comment #67)
> Created attachment 800809 [details] [diff] [review]
> Register callbacks to deallocate/unlcok texture client data + stuff from
> Sotaro's patch
> 
> Here is what I have so far, although I still have the crash after the last
> video frame, so there is something that I must be doing wrong or that I
> forgot to pull from sotaro's patch somewhere along the FlushImage code I
> suppose.

Thanks for the patch! I am going to check it on my device.
I confirmed that the by attachment 800809 [details] [diff] [review], some buffers are not returned to OMXCodec. And OMXCodec wait forever at OMXCodec::stop().
I made some debuggings around attachment 800809 [details] [diff] [review]. I found 2 symptoms around the probmel
-[1] video thumbnail generation does not complete.
  + When TextureClient is added and then removed to/from CompositableHost
     before CompositorParent is set to CompositableHost, 
    op.textureID() is 0 within TOpRemoveTexture operation.
    Therefore, ReplyTextureRemoved is not delivered to the client side.
-[2] video playback closing does not complete.
  + Last Transaction is asynchronous(SendUpdateNoSwap()) is used. Therefore, client side can not receive transaction response. Therefore, some gralloc buffers are not returned to the client. This happens because, following implementation is used for the RemoveTexture(). Third argument is omitted then TEXTURE_FLAGS_DEFAULT is used for it. TEXTURE_FLAGS_DEFAULT select async trancaction in attachment 800809 [details] [diff] [review].

--------------------------------------------------------------------
void
CompositableClient::OnTransaction()
{
  for (unsigned i = 0; i < mTexturesToRemove.Length(); ++i) {
    mForwarder->RemoveTexture(this, mTexturesToRemove[i]);
  }
  mTexturesToRemove.Clear();
} 


  virtual void RemoveTexture(CompositableClient* aCompositable,
                             uint64_t aTextureID,
                             TextureFlags aFlags = TEXTURE_FLAGS_DEFAULT) = 0;
From comment #70, layers transaction seems very fragile.
In patch v13, [1] was already recognized. And [2] did not happen because remove texture set transaction always sync.
From comment #70, problem is in compositable transaction, it might be better the remaining problems to be solved by an enginneer who implemented these codes.
To speed up the fix, it might be better to fix only [2] in comment #70 in this bug. [1] is going to be fix in a new bug.
Bug 913821 is created for [1].
Sotaro is busy with other bugs at a work week so I am taking over this one.
Assignee: sotaro.ikeda.g → nical.bugzilla
For testing/landing you need to also apply patches from bug 913821 which fix subsequent problems.
Attachment #800367 - Attachment is obsolete: true
Attachment #800809 - Attachment is obsolete: true
Attachment #801716 - Flags: review?(sotaro.ikeda.g)
I created a ROM with  attachment 801716 [details] [diff] [review] and patches in bug 913821 [1] in comment #70 seems to be fixed. But H.264 video playback become very unstable. I tested by the following video.
  http://people.mozilla.org/~cpeterson/videos/H264_Baseline_Profile_Level_30_640x360p.mp4
Comment on attachment 801716 [details] [diff] [review]
Postpone deallocation/recycling of texture data to when it is safe to do so (after the next transaction)

>+void
> ImageContainer::SetCurrentImage(Image *aImage)
> {
>+  if (IsAsync() && !aImage) {
>+    printf_stderr(" ** FlushImage\n");
>+    // Set ImageClient to Idle only when an active image is present.
>+    // Let ImageClient to release all TextureClients.
>+    ImageBridgeChild::FlushImage(mImageClient, this);
>+    return;

Sorry, I wrote the above comment. It does not fit to current implementation.
Just "// Let ImageClient to release all TextureClients." is correct.


>diff --git a/gfx/layers/client/CompositableClient.cpp b/gfx/layers/client/CompositableClient.cpp
>--- a/gfx/layers/client/CompositableClient.cpp
>+++ b/gfx/layers/client/CompositableClient.cpp
>@@ -202,21 +202,40 @@ CompositableClient::AddTextureClient(Tex
>   mForwarder->AddTexture(this, aClient);
> }
> 
> void
> CompositableClient::RemoveTextureClient(TextureClient* aClient)
> {
>   MOZ_ASSERT(aClient);
>   mTexturesToRemove.AppendElement(aClient->GetID());
>+  if (!(aClient->GetFlags() & TEXTURE_DEALLOCATE_HOST)) {
>+    printf_stderr(" -- DropTextureData\n");
>+    TextureClientData* data = aClient->DropTextureData();
>+    if (data) {
>+      mTexturesToRemoveCallbacks[aClient->GetID()] = data;
>+    }
>+  }
>   aClient->ClearID();
>   aClient->MarkInvalid();

Need to remove "printf_stderr()".

>+void
> CompositableClient::OnTransaction()
> {
>   for (unsigned i = 0; i < mTexturesToRemove.Length(); ++i) {
>     mForwarder->RemoveTexture(this, mTexturesToRemove[i]);
>   }
>   mTexturesToRemove.Clear();
> }

It is already mentioned in Comment 70. Is there a reason not to change it?

>+
>+  virtual void DeallocateSharedData(ISurfaceAllocator*)
>+  {
>+    printf_stderr(" ** MemoryTextureClientData::Deallocate\n");
>+    delete[] mBuffer;
>+  }

printf_stderr() is not ncecessary.


> void
> ImageBridgeChild::RemoveTexture(CompositableClient* aCompositable,
>                                 uint64_t aTexture,
>                                 TextureFlags aFlags)
> {
>-  mTxn->AddNoSwapEdit(OpRemoveTexture(nullptr, aCompositable->GetIPDLActor(),
>-                                      aTexture,
>-                                      aFlags));
>+  if (aFlags & TEXTURE_DEALLOCATE_HOST) {
>+    // if deallocation happens on the host side, we don't need the transaction
>+    // to be synchronous.
>+    mTxn->AddNoSwapEdit(OpRemoveTexture(nullptr, aCompositable->GetIPDLActor(),
>+                                        aTexture,
>+                                        aFlags));
>+  } else {
>+    mTxn->AddEdit(OpRemoveTexture(nullptr, aCompositable->GetIPDLActor(),
>+                                  aTexture,
>+                                  aFlags));
>+  }
> }

This patch does not apply Comment 70. Then H.264 video playback go mTxn->AddNoSwapEdit() path.

>-  mTxn->AddEdit(OpRemoveTexture(nullptr,
>-                aCompositable->GetIPDLActor(),
>-                aTexture,
>-                aFlags));
>+  mTxn->AddEdit(OpRemoveTexture(nullptr, aCompositable->GetIPDLActor(),
>+                                aTexture,
>+                                aFlags));
>+  if (aFlags & TEXTURE_DEALLOCATE_HOST) {
>+    mTxn->MarkSyncTransaction();
>+  }
> }

Why is this change necessary?
(In reply to Sotaro Ikeda [:sotaro] from comment #79)
> Comment on attachment 801716 [details] [diff] [review]
> Postpone deallocation/recycling of texture data to when it is safe to do so
> (after the next transaction)
> >diff --git a/gfx/layers/client/CompositableClient.cpp b/gfx/layers/client/CompositableClient.cpp
> >--- a/gfx/layers/client/CompositableClient.cpp
> >+++ b/gfx/layers/client/CompositableClient.cpp
> >@@ -202,21 +202,40 @@ CompositableClient::AddTextureClient(Tex
> >   mForwarder->AddTexture(this, aClient);
> > }
> > 
> >+void
> > CompositableClient::OnTransaction()
> > {
> >   for (unsigned i = 0; i < mTexturesToRemove.Length(); ++i) {
> >     mForwarder->RemoveTexture(this, mTexturesToRemove[i]);
> >   }
> >   mTexturesToRemove.Clear();
> > }
> 
> It is already mentioned in Comment 70. Is there a reason not to change it?

Do you mean the NoSwap thing? If so, see, below, if you mean the zero textureID then I couldn't reproduce it so I assume it has changed with the rest of the patch.

> > void
> > ImageBridgeChild::RemoveTexture(CompositableClient* aCompositable,
> >                                 uint64_t aTexture,
> >                                 TextureFlags aFlags)
> > {
> >-  mTxn->AddNoSwapEdit(OpRemoveTexture(nullptr, aCompositable->GetIPDLActor(),
> >-                                      aTexture,
> >-                                      aFlags));
> >+  if (aFlags & TEXTURE_DEALLOCATE_HOST) {
> >+    // if deallocation happens on the host side, we don't need the transaction
> >+    // to be synchronous.
> >+    mTxn->AddNoSwapEdit(OpRemoveTexture(nullptr, aCompositable->GetIPDLActor(),
> >+                                        aTexture,
> >+                                        aFlags));
> >+  } else {
> >+    mTxn->AddEdit(OpRemoveTexture(nullptr, aCompositable->GetIPDLActor(),
> >+                                  aTexture,
> >+                                  aFlags));
> >+  }
> > }
> 
> This patch does not apply Comment 70. Then H.264 video playback go
> mTxn->AddNoSwapEdit() path.

If H264 goes through the NoSwap path it means the wrong flag was set in the TextureClient. If TEXTURE_DEALLOCATE_HOST is set, the shared buffer should be deallocated on the host side which means the client side doesn't need to wait for the transaction to end.
In the case of GonkNativeWindow, the texture should be deallocated/recycled on the client side so TEXTURE_DEALLOCATE_HOST should not be set. I'll check that.

> 
> >-  mTxn->AddEdit(OpRemoveTexture(nullptr,
> >-                aCompositable->GetIPDLActor(),
> >-                aTexture,
> >-                aFlags));
> >+  mTxn->AddEdit(OpRemoveTexture(nullptr, aCompositable->GetIPDLActor(),
> >+                                aTexture,
> >+                                aFlags));
> >+  if (aFlags & TEXTURE_DEALLOCATE_HOST) {
> >+    mTxn->MarkSyncTransaction();
> >+  }
> > }
> 
> Why is this change necessary?

Before this change, (main thread) transactions with OpRemoveTexture were always asynchronous making it impossible to deallocate properly (race-free) on the client side. For the scope of b2g h264 it doesn't matter since video playback is always using ImageBridge but this code was not correct and main thread transactions should be able to have client-side textures deallocation.
> > > 
> > >+void
> > > CompositableClient::OnTransaction()
> > > {
> > >   for (unsigned i = 0; i < mTexturesToRemove.Length(); ++i) {
> > >     mForwarder->RemoveTexture(this, mTexturesToRemove[i]);
> > >   }
> > >   mTexturesToRemove.Clear();
> > > }
> > 
> > It is already mentioned in Comment 70. Is there a reason not to change it?
> 
> Do you mean the NoSwap thing? If so, see, below, if you mean the zero
> textureID then I couldn't reproduce it so I assume it has changed with the
> rest of the patch.

RemoveTexture()'s third argument is not present. Then default parameter seems to be used.

  virtual void RemoveTexture(CompositableClient* aCompositable,
                             uint64_t aTextureID,
                             TextureFlags aFlags = TEXTURE_FLAGS_DEFAULT) = 0;

// the default flags
const TextureFlags TEXTURE_FLAGS_DEFAULT = TEXTURE_DEALLOCATE_HOST
                                         | TEXTURE_FRONT;


> > This patch does not apply Comment 70. Then H.264 video playback go
> > mTxn->AddNoSwapEdit() path.
> 
> If H264 goes through the NoSwap path it means the wrong flag was set in the
> TextureClient. If TEXTURE_DEALLOCATE_HOST is set, the shared buffer should
> be deallocated on the host side which means the client side doesn't need to
> wait for the transaction to end.
> In the case of GonkNativeWindow, the texture should be deallocated/recycled
> on the client side so TEXTURE_DEALLOCATE_HOST should not be set. I'll check
> that.

As described above, RemoveTexture()'s third argument is not present. Then default parameter is used. Then seems to go into wrong path.

> > >-  mTxn->AddEdit(OpRemoveTexture(nullptr,
> > >-                aCompositable->GetIPDLActor(),
> > >-                aTexture,
> > >-                aFlags));
> > >+  mTxn->AddEdit(OpRemoveTexture(nullptr, aCompositable->GetIPDLActor(),
> > >+                                aTexture,
> > >+                                aFlags));
> > >+  if (aFlags & TEXTURE_DEALLOCATE_HOST) {
> > >+    mTxn->MarkSyncTransaction();
> > >+  }
> > > }
> > 
> > Why is this change necessary?
> 
> Before this change, (main thread) transactions with OpRemoveTexture were
> always asynchronous making it impossible to deallocate properly (race-free)
> on the client side. For the scope of b2g h264 it doesn't matter since video
> playback is always using ImageBridge but this code was not correct and main
> thread transactions should be able to have client-side textures deallocation.

I feel it's weird that the way of doing is totally opposite between ImageBridgeChild and ShadowLayerForwarder. Why is it opposite?
- ImageBridgeChild::RemoveTexture() request synchronous remove when flag is not TEXTURE_DEALLOCATE_HOST
- ShadowLayerForwarder::RemoveTexture() request synchronous remove when flag is TEXTURE_DEALLOCATE_HOST
Flags: needinfo?(nical.bugzilla)
Oh gosh, this is just me submitting rubbish patches, I need to stop working 12h a day. I am really lucky that you are a careful reviewer, thanks.

(In reply to Sotaro Ikeda [:sotaro] from comment #81)
> > > > 
> > > >+void
> > > > CompositableClient::OnTransaction()
> > > > {
> > > >   for (unsigned i = 0; i < mTexturesToRemove.Length(); ++i) {
> > > >     mForwarder->RemoveTexture(this, mTexturesToRemove[i]);
> > > >   }
> > > >   mTexturesToRemove.Clear();
> > > > }
> > > 
> > > It is already mentioned in Comment 70. Is there a reason not to change it?
> > 
> > Do you mean the NoSwap thing? If so, see, below, if you mean the zero
> > textureID then I couldn't reproduce it so I assume it has changed with the
> > rest of the patch.
> 
> RemoveTexture()'s third argument is not present. Then default parameter
> seems to be used.

Yes, definitely my mistake, that'll teach me a lesson about using default arguments without special care :/

> 
>   virtual void RemoveTexture(CompositableClient* aCompositable,
>                              uint64_t aTextureID,
>                              TextureFlags aFlags = TEXTURE_FLAGS_DEFAULT) =
> 0;
> 
> // the default flags
> const TextureFlags TEXTURE_FLAGS_DEFAULT = TEXTURE_DEALLOCATE_HOST
>                                          | TEXTURE_FRONT;
> 
> 
> > > This patch does not apply Comment 70. Then H.264 video playback go
> > > mTxn->AddNoSwapEdit() path.
> > 
> > If H264 goes through the NoSwap path it means the wrong flag was set in the
> > TextureClient. If TEXTURE_DEALLOCATE_HOST is set, the shared buffer should
> > be deallocated on the host side which means the client side doesn't need to
> > wait for the transaction to end.
> > In the case of GonkNativeWindow, the texture should be deallocated/recycled
> > on the client side so TEXTURE_DEALLOCATE_HOST should not be set. I'll check
> > that.
> 
> As described above, RemoveTexture()'s third argument is not present. Then
> default parameter is used. Then seems to go into wrong path.
> 
> > > >-  mTxn->AddEdit(OpRemoveTexture(nullptr,
> > > >-                aCompositable->GetIPDLActor(),
> > > >-                aTexture,
> > > >-                aFlags));
> > > >+  mTxn->AddEdit(OpRemoveTexture(nullptr, aCompositable->GetIPDLActor(),
> > > >+                                aTexture,
> > > >+                                aFlags));
> > > >+  if (aFlags & TEXTURE_DEALLOCATE_HOST) {
> > > >+    mTxn->MarkSyncTransaction();
> > > >+  }
> > > > }
> > > 
> > > Why is this change necessary?
> > 
> > Before this change, (main thread) transactions with OpRemoveTexture were
> > always asynchronous making it impossible to deallocate properly (race-free)
> > on the client side. For the scope of b2g h264 it doesn't matter since video
> > playback is always using ImageBridge but this code was not correct and main
> > thread transactions should be able to have client-side textures deallocation.
> 
> I feel it's weird that the way of doing is totally opposite between
> ImageBridgeChild and ShadowLayerForwarder. Why is it opposite?
> - ImageBridgeChild::RemoveTexture() request synchronous remove when flag is
> not TEXTURE_DEALLOCATE_HOST
> - ShadowLayerForwarder::RemoveTexture() request synchronous remove when flag
> is TEXTURE_DEALLOCATE_HOST

Completely. Sorry about that, I totally did the opposite of what I meant to do.
Flags: needinfo?(nical.bugzilla)
No problem. Thanks for your effort.
Attachment #801716 - Attachment is obsolete: true
Attachment #801716 - Flags: review?(sotaro.ikeda.g)
Attachment #802277 - Flags: review?(sotaro.ikeda.g)
Comment on attachment 802277 [details] [diff] [review]
Postpone deallocation/recycling of texture data to when it is safe to do so (after the next transaction)

Looks good to me!
Attachment #802277 - Flags: review?(sotaro.ikeda.g) → review+
Comment on attachment 802277 [details] [diff] [review]
Postpone deallocation/recycling of texture data to when it is safe to do so (after the next transaction)

Review of attachment 802277 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ImageContainer.cpp
@@ +188,5 @@
>  {
> +  if (IsAsync() && !aImage) {
> +    // Let ImageClient to release all TextureClients.
> +    ImageBridgeChild::FlushImage(mImageClient, this);
> +    return;

This "return;" is causing M3 regressions on Fennec. I think it is correct to clear the Image held by the ImageContainer when we receive aImage==nullptr, which the "return;" here is preventing us from doing.

So I'm testing without the return statement and if it fixes the M3 regression we are good to land the patch and finally get h264 video frames in the right order \o/
ImageClientBuffered::FlushImage() calls ImageContainer::ClearCurrentImage(). The function clear the mActiveImage. If new texture is used, the "return" is correct behaviour. I think android disables new TextureClient. This difference make the texts failures in android.

------------------
ImageClientBuffered::FlushImage(ImageContainer* aContainer)
{
  if (aContainer) {
    aContainer->ClearCurrentImage();
(In reply to Sotaro Ikeda [:sotaro] from comment #88)
> ImageClientBuffered::FlushImage() calls ImageContainer::ClearCurrentImage().
> The function clear the mActiveImage. If new texture is used, the "return" is
> correct behaviour. I think android disables new TextureClient. This
> difference make the texts failures in android.
> 
> ------------------
> ImageClientBuffered::FlushImage(ImageContainer* aContainer)
> {
>   if (aContainer) {
>     aContainer->ClearCurrentImage();

Right, this code needs to work both with and without new textures because of fennec though;
Is it because of the deadlock you were experiencing that you had ClearCurrentImage() happen in the ImageBridgeChild thread ?
With this try push https://tbpl.mozilla.org/?tree=Try&rev=13ec62d48be6
ClearCurrentImage is moved to ImageBridgeChild::FlushImageNow so that it is exercised by both the new and deprecated code paths (and the return statement is still present).
(In reply to Nicolas Silva [:nical] from comment #89)
> 
> Right, this code needs to work both with and without new textures because of
> fennec though;
> Is it because of the deadlock you were experiencing that you had
> ClearCurrentImage() happen in the ImageBridgeChild thread ?

If ImageContainer holds mutex during ImageBridgeChild::FlushImage(), it could fall into deadlock. Just removing the "return" seems not cause deadlock. But seems to affect to the performance.
Blocks: 912134
meh, build error in the previous push, trying again https://tbpl.mozilla.org/?tree=Try&rev=dc72ffab8b3d
The previous try results made no sense. I pushed to try again and it looks greener: https://tbpl.mozilla.org/?tree=Try&rev=9be89c98918e
https://hg.mozilla.org/mozilla-central/rev/855f1df4acfb

Please try to land B2G patches on b2g-inbound in the future to lower the risk of merge conflicts.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → 1.2 FC (16sep)
blocking-b2g: koi? → koi+
You need to log in before you can comment on or make changes to this bug.