Closed Bug 833128 Opened 11 years ago Closed 11 years ago

[layers-refactoring] Implement gralloc TextureHost/Client

Categories

(Core :: Graphics: Layers, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: nical, Assigned: bjacob)

References

Details

(Whiteboard: [metro-gfx])

Attachments

(1 file, 10 obsolete files)

1.32 MB, text/plain
Details
We need a gralloc based TextureHost (and maybe TextureClient) for the layers refactoring to work on B2G.
Whiteboard: [metro-gfx]
Blocks: metro-omtc
+cc: jones.chris.g@gmail.com, snorp@snorp.net, kchen@mozilla.com, gal@uci.edu, blassey.bugs@lassey.us

We'll be pasting the latest debugging info here shortly. We may need your expertise to get past some tough issues we're seeing.
Attached patch WIP patch (not yet working) (obsolete) — Splinter Review
Here's my WIP patch. It's rebased off a recent state of the graphics branch.

It runs without any OpenGL error (as checked by MOZ_GL_DEBUG_VERBOSE_ABORT_ON_ERROR) and the gralloc path it adds is indeed used. However it has the following issues:

 - does not display anything - why? The texture uploads are made in TextureOGL.cpp, specifically in DirectExternalTextureHost::Lock, see the call to BindExternalBuffer. TODO: check the contents of the gralloc buffer we're passing there?

 - sometimes crashes in IPC gralloc code with "corrupted actor state" - and there the actor is filled with 0x5a bytes indicating it's been freed too early.

 - (this may or may not have been resolved in the last rebasing) Sometimes TextureHost::Lock is called while Update hasn't been called. The present patch has asserts to catch that case so it will be obvious if it happens again (mGraphicBuffer is then null).
Another issue is that in DirectExternalTextureHost::Unlock we destroy and recreate the GL texture everytime, as that is the only way I know to effectively unbind the gralloc buffer from it. GLContext::UnbindExternalBuffer is junk, tries to bind a EXTERNAL texture to the 2D target which fails.
Got a bunch of help from snorp; the not-displaying-anything issue was caused by using the wrong program type; now it displays correctly except with R and B channels swapped. The homescreen app (or whatever draws the 4 icons at the bottom) keeps crashing; looking.
This is would similar to my patch in Bug 837591. 

I have not read your change, but I guess the display nothing problem is when render to texture in content side, you have to bind the eglImage to a texture and attach the texture into FBO, where you can not use GL_TEXTURE_EXTERNAL the BindExternalBuffer does.

The R and B swap problem is due to all software renderer we use now will produce BGRA output. That's why I added a FrontendType in my patch in Bug 837591.

I think this one would similar to my patch except that I use PLayers to allocate GrallocBuffer for me :P
Thanks for the link. Yes, the display-nothing bug was fixed by setting the correct program type (RGBAExternalLayerProgramType). The R-B swap is definitely what you describe; I'll look at your patch to see how you fix it (I was thinking that the natural fix was to modify the fragment shader of RGBAExternalLayerProgramType into a new BGRAExternalLayerProgramType).

However I am running into more pressing issue --- crashes --- that likely do not have equivalents in your patch as they seem to pertain to the new layers system and how TextureHosts work.
This patch works with a self-built libgenlock, for the reason that by default if you build libgenlock yourself it does nothing.

With a real libgenlock it still runs into "trying to upgrade a read lock to a writelock".

See in TextureOGL.cpp, in DirectExternalTextureHost, Lock() and Unlock().

UnbindExternalBuffer doesn't work (plus, it generates INVALID_OPERATION as it tries to bind to TEXTURE_2D). Not clear what to put instead of it in Unlock(). The present patch tries deleting and recreating the texture, which doesn't help.
Attachment #717939 - Attachment is obsolete: true
(In reply to Benoit Jacob [:bjacob] from comment #7)
> Created attachment 719135 [details] [diff] [review]
> WIP patch (almost working, just libgenlock locking issues)
> 
> This patch works with a self-built libgenlock, for the reason that by
> default if you build libgenlock yourself it does nothing.
> 
> With a real libgenlock it still runs into "trying to upgrade a read lock to
> a writelock".
> 
> See in TextureOGL.cpp, in DirectExternalTextureHost, Lock() and Unlock().
> 
> UnbindExternalBuffer doesn't work (plus, it generates INVALID_OPERATION as
> it tries to bind to TEXTURE_2D). Not clear what to put instead of it in
> Unlock(). The present patch tries deleting and recreating the texture, which
> doesn't help.

Does a way in Bug 774530 work in this case?
Comment on attachment 719135 [details] [diff] [review]
WIP patch (almost working, just libgenlock locking issues)

Review of attachment 719135 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/TextureClient.cpp
@@ +59,4 @@
>    // The owning layer must be locked at some point in the chain of callers
>    // by calling Hold.
>    mLayerForwarder->DestroyedThebesBuffer(mDescriptor);
> +  mDescriptor = SurfaceDescriptor();

This TextureClient may still be used (locally only, this cleans up the IPC connection) after the Destroyed() call (ThebesLayerBuffer may copy the buffer contents into the newly created one). Why are you clearing the SD here?

@@ +99,4 @@
>    if (mSurface) {
>      mSurface = nullptr;
>      ShadowLayerForwarder::CloseDescriptor(mDescriptor);
> +    mDescriptor = SurfaceDescriptor();

This will leak, won't it? We're going to miss the DestroySharedSurface call which is required to free Shmem
The existing broken UnbindExternalBuffer code comes from bug 774530 where related issues have been discussed.
(In reply to Sotaro Ikeda [:sotaro] from comment #8)
> Does a way in Bug 774530 work in this case?

The code that was added in Bug 774530 is still present in UnbindExternalBuffer, right? Calling it doesn't make any difference (other than generate a GL_INVALID_OPERATION)
(In reply to Matt Woodrow (:mattwoodrow) from comment #9)
> Comment on attachment 719135 [details] [diff] [review]
> WIP patch (almost working, just libgenlock locking issues)
> 
> Review of attachment 719135 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: gfx/layers/TextureClient.cpp
> @@ +59,4 @@
> >    // The owning layer must be locked at some point in the chain of callers
> >    // by calling Hold.
> >    mLayerForwarder->DestroyedThebesBuffer(mDescriptor);
> > +  mDescriptor = SurfaceDescriptor();
> 
> This TextureClient may still be used (locally only, this cleans up the IPC
> connection) after the Destroyed() call (ThebesLayerBuffer may copy the
> buffer contents into the newly created one). Why are you clearing the SD
> here?

My concern here was that I was running into double-frees as the shmem was scheduled for destruction here and was subsequently manually destroyed in the TextureClientShmem destructor, if it was a valid descriptor there. That code just meant to avoid the second destruction. Please suggest the right way. It feels wrong that we have both the present scheduling-for-destruction and then the immediate destruction in the TextureClientShmem dtor. Should one of these two things go away?


> 
> @@ +99,4 @@
> >    if (mSurface) {
> >      mSurface = nullptr;
> >      ShadowLayerForwarder::CloseDescriptor(mDescriptor);
> > +    mDescriptor = SurfaceDescriptor();
> 
> This will leak, won't it? We're going to miss the DestroySharedSurface call
> which is required to free Shmem

Thanks for this catch, I'll look into this as soon as my ongoing genlock woes are over.
Good news! It works now with the following change: all occurences of LOCAL_GL_TEXTURE_EXTERNAL replaced by TEXTURE_2D. I only used TEXTURE_EXTERNAL accidentally because it was used in BindExternalbuffer.

It all works now, going to upload patch, then resume cleaning it up. Will also have to apply Matt's comments.
Depends on: 846580
I spent the WHOLE day rebasing and fixing the fallout. Seems like everything changed over the past week.

I am FRUSTRATED.
Attachment #719215 - Attachment is obsolete: true
This is basically how far I can go alone. What I'm going to do now is slice this into small 1-fix-each patches, get these reviewed and landed, and get more people to help with the remaining work.

Remaining issues:

 - gralloc doesn't like us doing same-process drawing. gralloc_register_buffer sees that we're same-process and decides to not do anything, so we don't call genlock_attach_lock, so the first genlock_lock_buffer made by the GL in EGLImageTargetTexture2D fails.

 - we're still double-deleting GrallocBufferActor's as two different TextureHost's reference the same GrallocBufferActor through their mBuffer pointers. These seem to be passed around by SwapTextures.
Attachment #720492 - Attachment is obsolete: true
jrmuizel and me investigated the gralloc same-process woes mentioned in comment 18 and at this point our working hypothesis is that this is just a manifestation of the other issue mentioned above, the double delete.
Attachment #721407 - Attachment is obsolete: true
Here's a GDB session showing a bit of what's happening with the double deleted GrallocBufferActor. The two Send__delete__ are called from under the TextureHost destructor. The problem is that two different TextureHost objects are referencing the same GrallocBufferActor through their respective mBuffer SurfaceDescriptor* pointers.

Below is just some summary -- everything is in the attached log.

The GrallocBufferActor in question is 0x4045c4d8: see the actor=0x4045c4d8 in the Send__delete__ calls at the end just before the crash.

The first TextureHost is 0x4394c520 and its mBuffer SurfaceDescriptor pointer is pointing to 0x4394c610. The log shows that that SurfaceDescriptor was set by SwapTextures to be a TSurfaceDescriptorGralloc (its former value was a Tnull_t): (see the top of the attached log file for the GDB commands run there on this breakpoint):

Breakpoint 4, mozilla::layers::TextureHost::SwapTextures (this=0x4394c520, aImage=..., aResult=0x46eff158, aIsInitialised=0x442b4b61, aNeedsReset=0x46eff187, aRegion=0x0) at /hack/mozilla-graphics/gfx/layers/Compositor.cpp:56
56      {
$76 = (mozilla::layers::GrallocTextureHostOGL *) 0x4394c520
$77 = (class mozilla::layers::SurfaceDescriptor *) 0x4394c610
$78 = mozilla::layers::SurfaceDescriptor::Tnull_t
$79 = mozilla::layers::SurfaceDescriptor::TSurfaceDescriptorGralloc
$80 = (class mozilla::layers::SurfaceDescriptor *) 0x46eff158
#0  mozilla::layers::TextureHost::SwapTextures (this=0x4394c520, aImage=..., aResult=0x46eff158, aIsInitialised=0x442b4b61, aNeedsReset=0x46eff187, aRegion=0x0) at /hack/mozilla-graphics/gfx/layers/Compositor.cpp:56


The second TextureHost, which is crashing as it tries to call Send__delete__ to the same actor, is 0x43957c40 and its mBuffer SurfaceDescriptor pointer points to 0x43957df0. The log shows that that SurfaceDescriptor was set by SwapTextures to be a TSurfaceDescriptorGralloc (its former value was a Tnull_t):

Breakpoint 4, mozilla::layers::TextureHost::SwapTextures (this=0x43957c40, aImage=..., aResult=0x46eff158, aIsInitialised=0x442b4b61, aNeedsReset=0x46eff187, aRegion=0x0) at /hack/mozilla-graphics/gfx/layers/Compositor.cpp:56
56      {
$81 = (mozilla::layers::GrallocTextureHostOGL *) 0x43957c40
$82 = (class mozilla::layers::SurfaceDescriptor *) 0x43957df0
$83 = mozilla::layers::SurfaceDescriptor::Tnull_t
$84 = mozilla::layers::SurfaceDescriptor::TSurfaceDescriptorGralloc
$85 = (class mozilla::layers::SurfaceDescriptor *) 0x46eff158
#0  mozilla::layers::TextureHost::SwapTextures (this=0x43957c40, aImage=..., aResult=0x46eff158, aIsInitialised=0x442b4b61, aNeedsReset=0x46eff187, aRegion=0x0) at /hack/mozilla-graphics/gfx/layers/Compositor.cpp:56
Some of the values in those stacks seem broken.

For example, the CompositableHost |this| pointers are the same between the two calls, but the ContentHostDirect |this| pointers are different (to both the CompositableHost, and each other).

This is making it very hard to draw conclusions from the stack traces.
Yeah, the |this| pointers are indeed sometimes wrong. However with some care it's still possible to figure what's going on, because there is enough redundancy there (each pointer is mentioned more than once) that one can see which ones are wrong.

This new GDB session shows how we end up calling SwapTextures on different TextureHosts with the same gralloc buffer: a single ContentHost here is doing AddTextureHost with two different TextureHosts and each time calls SwapTextures before switching to the next TextureHost.
Attached file GDB log showing what's going on here. (obsolete) —
Here's the story of what happened here.

Let's summarize the problem: we have a gralloc buffer being deleted twice. It's referenced by two different TextureHosts. The question is how do we end up with 2 TextureHosts referencing the same gralloc.

There is a SINGLE ContentHost involved here. We primary question to answer is: Why did this ContentHost recreate its TextureHost a second time?

The log shows that the TextureHosts are being created by TextureParent::EnsureTextureHost. This function starts with a check to see if we already have a TextureHost for the right SurfaceDescriptor Type:

  if (!SurfaceTypeChanged(aSurfaceType)) {
    return false;
  }

What TextureParent::SurfaceTypeChanged does is:

bool TextureParent::SurfaceTypeChanged(SurfaceDescriptor::Type aNewSurfaceType)
{
  return mLastSurfaceType != aNewSurfaceType;
}

The problem here is that while the existing TextureHost is stored by the ContentHost, the data of the existing TextureHost type used to check if we need a new one is stored by the TextureParent.

Here, while we have only one TextureHost, we have TWO TextureParents involved. The second TextureParent doesn't already know that the first TextureParent has already created a TextureHost of the right type on the same TextureHost.

So it seems like the fix is to kill mLastSurfaceType and instead query directly the Compositable for its existing TextureHost type?
Attachment #721490 - Attachment is obsolete: true
Attachment #721919 - Attachment is obsolete: true
mLastSurfaceType was added by:

# HG changeset patch
# User Nicolas Silva <nical.bugzilla@gmail.com>
# Date 1359662473 -3600
# Node ID 10168b7f753f696b7a3912d525ea7f12993244cd
# Parent  32fab3b568ae7f45688f22b81481dfdef01eb22e
Get (non-async) video to work again with yuv-rgb conversion on the gpu, and PictureRect passed Compositable
oops had attached wrong file.
Attachment #722378 - Attachment is obsolete: true
As discussed with Nical and Nick: the issue there is not that we have two TextureParent-TextureHost pairs, but that both are referencing the same gralloc surface, and that comes from those SwapTextures calls getting the thebeslayer's frontbuffer twice; It's our understanding that Nick's current work is likely to fix this, so I'll wait for him to land it, and will retry (hopefully tomorrow).
Blocks: 849261
No longer blocks: metro-omtc
Now (cset afa4f0576f0f) everything should be working.

Here's a quick summary of what's been going on over the past few days.

Our theory in comment 20, that the genlock failures were just a manifestation of the double deletes, turned out to be wrong.

The double deletes got fixes on the graphics branch several days ago, but the genlock failures persisted.

They were explained by logs showing that the content and compositor sides were both locking the same gralloc buffer. That was an issue with how we did double-buffering, so I proceeded first by a local hack disabling double buffering here, then by applying Nick's WIP double-buffering-rework patch, then with Nick's stuff as landed on the graphics branch; Nick's changes definitely did make things better, but we were still getting genlock failures, as the content and compositor sides were still trying to lock the same surfaces, with the content side trying to get a write lock.

It turned out (thanks Matt&Nick for the correct analysis of the logs!) that the surface in question was the front buffer, and the reason why the content side was trying to lock it was that it wants to read from the font buffer so that it can reuse any non-invalidated drawing from the previous frame, and the bug was that while it only needed a read lock to do that, it was effectively trying to get a write lock, and that causes genlock failures because while genlock's read locks are non-exclusive and reference-counted, genlock's write locks are completey exclusive.

This was fixed by http://hg.mozilla.org/projects/graphics/rev/64afda421c9d modulo a little compile error fixed in the next cset afa4f0576f0f.
Going to mark that FIXED now, which isn't to say that more improvements/fixes will follow, but at this point it appears to be working.

A caveat though: B2G with mozilla-central gecko is seriously broken at the moment, as a known bug makes the Gaia lockscreen impossible to unlock. So my testing was fairly limited. Still, during all that time fighting this bug, I couldn't get even that to run without generating errors, as it now does. (And I'm running with my instrumented libgenlock that aborts on any error, and with MOZ_GL_DEBUG_ABORT_ON_ERROR, which is why I catch quite a lot of possible issues already with just the Gaia lockscreen).
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: