Closed Bug 897452 (PTexture) Opened 7 years ago Closed 6 years ago

Implement the PTexture protocol

Categories

(Core :: Graphics: Layers, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla29

People

(Reporter: nical, Assigned: nical)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

Attachments

(17 files, 29 obsolete files)

2.97 KB, patch
bjacob
: review+
Details | Diff | Splinter Review
4.13 KB, patch
bjacob
: review+
Details | Diff | Splinter Review
55.50 KB, patch
bjacob
: review+
Details | Diff | Splinter Review
34.54 KB, patch
bjacob
: review+
Details | Diff | Splinter Review
33.93 KB, patch
bjacob
: review+
Details | Diff | Splinter Review
3.58 KB, patch
bjacob
: review+
Details | Diff | Splinter Review
3.63 KB, patch
bjacob
: review+
Details | Diff | Splinter Review
806 bytes, patch
bjacob
: review+
Details | Diff | Splinter Review
3.17 KB, patch
bjacob
: review+
Details | Diff | Splinter Review
1.36 KB, patch
bjacob
: review+
Details | Diff | Splinter Review
1.18 KB, patch
bjacob
: review+
Details | Diff | Splinter Review
20.02 KB, patch
nical
: review+
Details | Diff | Splinter Review
9.01 KB, patch
bjacob
: review+
Details | Diff | Splinter Review
16.28 KB, patch
bjacob
: review+
Details | Diff | Splinter Review
10.04 KB, patch
bjacob
: review+
Details | Diff | Splinter Review
1.26 KB, patch
nical
: review+
Details | Diff | Splinter Review
7.87 KB, patch
nical
: review+
Details | Diff | Splinter Review
With the new TextureClient/Host model, it now makes sense for textures to have their own IPDL protocol.

This will give us more flexibility with the lifetime of texture clients and hosts (right now it is limited by the lifetime of a compositable), and make it possible for us to share textures between layers.
Blocks: 912897
Assignee: nical.bugzilla → nobody
After make this protocol. I feel that it is better to implement as to remove TEXTURE_DEALLOCATE_HOST/TEXTURE_DEALLOCATE_CLIENT. This flag seems to make the architecture more fragile. If a buffer of the texture is correctly wrapped by a reference counted object. The buffer is automatically released when all reference to the buffer is removed. It could work also across a process boundary. 

It seems a protocol for a buffer allocation. I prefer a name of the protocol like PTextureBuffer.
(In reply to Sotaro Ikeda [:sotaro] from comment #1)
> After make this protocol. I feel that it is better to implement as to remove
> TEXTURE_DEALLOCATE_HOST/TEXTURE_DEALLOCATE_CLIENT. This flag seems to make
> the architecture more fragile. If a buffer of the texture is correctly
> wrapped by a reference counted object. The buffer is automatically released
> when all reference to the buffer is removed. It could work also across a
> process boundary.

Doing that without suffering from the cost of extra synchronization is not easy. It would be interesting to do though.
Before we commit to big changes like that, there is one thing that can be easily simplified: we currently support
1) TEXTURE_DEALLOCATE_HOST is set (layers deallocate the data on the host side)
2) TEXTURE_DEALLOCATE_CLIENT is set (layers deallocate the data on the client side)
3) none of the two flags are set (layers don't deallocate but notify the owner of the data)

We ended up implementing 2) and 3) the same way (with TextureClientData) so we can just have one bit for TEXTURE_DEALLOCATE_CLIENT (for 2) and 3)) and consider that when it is not set we deallocate on the host (the default behaviour which doesn't have the extra synchronization cost, 1)).
There is a bunch of code duplication that we can avoid by moving things into CompositableForwader. the next patch is going to add some more code that would have been duplicated if we don't do that.
Assignee: nobody → nical.bugzilla
Attached patch PTexture (WIP) (obsolete) — Splinter Review
beginning of the ptexture work. right now it mostly just replaces the pair {compositable, ID} by an actor as the way to match textures on both sides.
It's not finished yet.
The added feature is actually not used yet, but will be in subsequent patches for this bug.
Attachment #815661 - Flags: review?(bjacob)
I ran into cases where the texture host did not have a compositor which lead to crashes.
With this patch we make sure that when CompsoitableHost::UseTexture is called the compositor is properly set, as well as before compositing. In the very unlikely event of the compositor not being set when uploading the texture, BufferTextureHost will now shout a warning and return instead of crashing.
Attachment #815673 - Flags: review?(matt.woodrow)
Comment on attachment 815673 [details] [diff] [review]
Prelude 3 - Make sure the texture host has a compositor

Review of attachment 815673 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/composite/CompositableHost.cpp
@@ +54,5 @@
> +  }
> +  aTexture->SetCompositor(GetCompositor());
> +}
> +
> +

Only need one empty line here.
Attachment #815673 - Flags: review?(matt.woodrow) → review+
Attached patch Part 1 - PTexture (WIP) (obsolete) — Splinter Review
This patch puts in place the base of the PTexture stuff. It mostly replaces the pair {PCompositable, ID} by a proper PTexture actor in our IPC code to refer to textures.
The patch in itself can't be landed without some of the patches coming after.
I am trying to keep different steps of the conversion to PTexture in separate patch for ease of review, but I think the patch queue will have to be landed together to work properly (except for the "prelude" patches).
Attachment #815420 - Attachment is obsolete: true
TextureClient's destruction now triggers the deallocation protocol for the shared data, as opposed to relying on RemoveTextureClient to be called manually or on the Compositable to be flushed.
This could have been in part 1, but I kept it separate to keep part 1 smaller. This patch removes the unused code left after Part 1, that is the code around texture IDs.
try push to get a feel of potential regressions https://tbpl.mozilla.org/?tree=Try&rev=744e6d7f3483
I expect there will be some failures as at the moment these patches relax some of the constraints imposed by flushing the ImageClient that are needed due to some libstagefright quirks.
Whiteboard: [leave open]
The leaks in that try push are caused by texture removal being deferred to the next transaction, and sometimes there is nothing to trigger that last transaction.

one solution could be to take that out of the layer transaction, but I feel like it's not good to keep sending small messages too often.
Another way could be to do a last transaction just before shutting down ImageBridge/Shadowlayers, but that would most likely mean keeping the last frame for longer than we should.
I don't know how much overhead small IPDL messages involve, I have the feeling it might add up quickly.
Attachment #815661 - Attachment is patch: true
Comment on attachment 815661 [details] [diff] [review]
Prelude 2 - Make it possible fore TextureClient to call virtual methods before the destructor

Review of attachment 815661 [details] [diff] [review]:
-----------------------------------------------------------------

This is fine by itself; but I would like to see what precisely you want to do with that Finalize() trick before this actually lands. Indeed, this should only be done if there is a really good reason to.
Attachment #815661 - Flags: review?(bjacob) → review+
(In reply to Benoit Jacob [:bjacob] from comment #14)
> This is fine by itself; but I would like to see what precisely you want to
> do with that Finalize() trick before this actually lands. Indeed, this
> should only be done if there is a really good reason to.

In one of the  subsequent patches I need the texture client/host deallocation logic to be triggered by the ref count going to zero, which needs at least one virtual function call (DropTextureData). You can find this in the currently uploaded patches but they are a bit out of date. I am rebasing and reorganizing my patch queue now. Reviewable patches coming soon.
More recent version, almost final. I am not entirely done with those patches but very close. I still need to tidy up a few things.

bjacob and Sotaro, you will most likely be my reviewers for this patch, so you guys can start looking at this if you want to.

This is the patch that makes the deallocation logic of TextureClient/TextureHost use the IPDL actors rather than depending on CompositableCLient/Host.
Attachment #816930 - Attachment is obsolete: true
Attachment #816933 - Attachment is obsolete: true
This fixes a shutdown crash with async-video on Linux.
Attached patch Part 1 - PTexture (obsolete) — Splinter Review
This Patch replaces the {ID, compositable} pair with a PTexture actor to match texture clients and hosts.

This patch does not implement the deallocation logic (see Part 2), so it cannot be landed without the other patches.

This patch also leaves some garbage code like the TextureClient and Host ID stuff (which is cleaned up in Part 3)

The split between those patches is an attempt at making review easier but unfortunately it means that this one feels a bit incomplete. Just ignore the deallocation stuff and the unused methods.
Attachment #816927 - Attachment is obsolete: true
Attachment #824236 - Flags: review?(matt.woodrow)
Attached patch Part 2 - Deallocation logic (obsolete) — Splinter Review
This patch has the fun stuff: getting the deallocation logic to use PTexture.
Attachment #816929 - Attachment is obsolete: true
Attachment #823457 - Attachment is obsolete: true
Attachment #824239 - Flags: review?(sotaro.ikeda.g)
Attachment #824239 - Flags: review?(bjacob)
This patch removes the unused TextureID code. It's rather big because a lot of code was passing IDs around, but it does not have fancy logic.
Attachment #823458 - Attachment is obsolete: true
Attachment #824242 - Flags: review?(matt.woodrow)
This patch makes sure TextureHost never makes its GL texture outlive the widget. It fixes a shutdown crash on Linux. During the workweek Benoit showed me how to fix this at the GLContext level but I haven't had the time to write the patch yet so this will do for now.
Attachment #823460 - Attachment is obsolete: true
Attachment #824244 - Flags: review?(bjacob)
With the ptexture stuff, TextureClient determines that it has a TextureHost by checking whether it has an IPDL actor. In this test, we cheat by not using the IPDL actor which puts TextureClient in a state where it thinks it does not have a TextureHost and always deallocates the shared data in it's destructor. This change makes sure we do not deallocate the shared data in the host (as opposed to what we were previously doing).
Attachment #824247 - Flags: review?(bgirard)
Is there a reason you haven't landed the three r+ preludes?
I don't know exactly who are the most busy these days so don't hesitate to pass the reviews around. The Paris workweek should have gotten a lot of gfx folks up to speed with TextureClient.

The most important patch is Part 2, the other ones are fairly mechanical.
(In reply to Milan Sreckovic [:milan] from comment #24)
> Is there a reason you haven't landed the three r+ preludes?

Not really, the first one turned out to be not that useful so I am considering not doing it and rebasing the patch queue on top of it (would just move two functions from a .cpp to a .h)
The second one was reviewed very recently and makes little sense without Part 2. If I land it now people may wonder why I added a hook to an unimplemented method.
Attachment #824247 - Flags: review?(bgirard) → review+
Comment on attachment 824236 [details] [diff] [review]
Part 1 - PTexture

Review of attachment 824236 [details] [diff] [review]:
-----------------------------------------------------------------

Most of this look fairly mechanical, seems sane to me.

::: gfx/layers/client/CompositableClient.cpp
@@ +199,5 @@
>      ++mNextTextureID;
>    }
>    aClient->SetID(mNextTextureID);
> +
> +  return aClient->InitIPDLActor(mForwarder);

Won't this result in InitIPDLActor being called multiple times if we attach the texture client to multiple compositables?

Maybe the implementation of InitIPDLActor should handle this (just assert that its the same forwarder, and return?).
Attachment #824236 - Flags: review?(matt.woodrow) → review+
Comment on attachment 824242 [details] [diff] [review]
Part 3 - Cleanup some unused code

Review of attachment 824242 [details] [diff] [review]:
-----------------------------------------------------------------

Nice.
Attachment #824242 - Flags: review?(matt.woodrow) → review+
(In reply to Matt Woodrow (:mattwoodrow) from comment #27)
> Comment on attachment 824236 [details] [diff] [review]
> Part 1 - PTexture
> 
> Review of attachment 824236 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Most of this look fairly mechanical, seems sane to me.
> 
> ::: gfx/layers/client/CompositableClient.cpp
> @@ +199,5 @@
> >      ++mNextTextureID;
> >    }
> >    aClient->SetID(mNextTextureID);
> > +
> > +  return aClient->InitIPDLActor(mForwarder);
> 
> Won't this result in InitIPDLActor being called multiple times if we attach
> the texture client to multiple compositables?
> 
> Maybe the implementation of InitIPDLActor should handle this (just assert
> that its the same forwarder, and return?).

Right, at the moment we can't attach multiple times (I'll address this in bug 912897) so InitIPDLActor is just asserting that the actor is not already created. I will replace the assertion in InitIPDLActor with:

  if (mActor && mActor->GetForwarder() == aForwarder) {
    return true;
  }
  MOZ_ASSERT(!mActor, "Cannot use a texture on several IPC channels.");

But in this patch TextureChild does not have the GetForwarder method yet so I'll do that in bug 912897.
Comment on attachment 824239 [details] [diff] [review]
Part 2 - Deallocation logic

Review of attachment 824239 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. Can you answer some question?

::: gfx/layers/client/TextureClient.cpp
@@ +224,5 @@
>  
>  TextureClient::~TextureClient()
> +{
> +  // All the destruction code that may lead to virtual method calls must be in
> +  // Finalize().

Is there a reason to comment out it?

::: gfx/layers/composite/TextureHost.cpp
@@ +581,5 @@
>    if (mShmem) {
>      MOZ_ASSERT(mDeallocator,
>                 "Shared memory would leak without a ISurfaceAllocator");
>      mDeallocator->DeallocShmem(*mShmem);
> +    delete mShmem;

Isn't it necessary to set nullptr to mShmem?

@@ +589,5 @@
> +void
> +ShmemTextureHost::ForgetSharedData()
> +{
> +  if (mShmem) {
> +    delete mShmem;

Isn't it necessary to set nullptr to mShmem?

::: gfx/layers/composite/TextureHost.h
@@ +337,5 @@
> +   * to it's shared data.
> +   *
> +   * This is important to ensure the correctness of the deallocation protocol.
> +   */
> +  virtual void ForgetSharedData() {}

Is it called somewhere? I did not see a code calling it.
Is there no possibility the Host is still used by someone?
Attachment #824239 - Flags: review?(sotaro.ikeda.g) → review+
(In reply to Sotaro Ikeda [:sotaro] from comment #30)
> Comment on attachment 824239 [details] [diff] [review]
> Part 2 - Deallocation logic
> 
> Review of attachment 824239 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks good. Can you answer some question?
> 
> ::: gfx/layers/client/TextureClient.cpp
> @@ +224,5 @@
> >  
> >  TextureClient::~TextureClient()
> > +{
> > +  // All the destruction code that may lead to virtual method calls must be in
> > +  // Finalize().
> 
> Is there a reason to comment out it?

Finalize is actually part of the comment that starts on the line above it. I'll reformat the comment so that it doesn't look like commented code (plus, calling something named "Finalize" in de destructor feels right, so I can see how it gets confusing!)

> 
> ::: gfx/layers/composite/TextureHost.cpp
> @@ +581,5 @@
> >    if (mShmem) {
> >      MOZ_ASSERT(mDeallocator,
> >                 "Shared memory would leak without a ISurfaceAllocator");
> >      mDeallocator->DeallocShmem(*mShmem);
> > +    delete mShmem;
> 
> Isn't it necessary to set nullptr to mShmem?
> 
> @@ +589,5 @@
> > +void
> > +ShmemTextureHost::ForgetSharedData()
> > +{
> > +  if (mShmem) {
> > +    delete mShmem;
> 
> Isn't it necessary to set nullptr to mShmem?

Right, I will fix that.


> 
> ::: gfx/layers/composite/TextureHost.h
> @@ +337,5 @@
> > +   * to it's shared data.
> > +   *
> > +   * This is important to ensure the correctness of the deallocation protocol.
> > +   */
> > +  virtual void ForgetSharedData() {}
> 
> Is it called somewhere? I did not see a code calling it.
> Is there no possibility the Host is still used by someone?

Nice catch! I need to call this somehwere in textureParent::ActorDestroy when we don't deallocate the data on the host side (when TEXTURE_DEALLOCATE_CLIENT is set). I'll fix that.
Comment on attachment 824239 [details] [diff] [review]
Part 2 - Deallocation logic

Review of attachment 824239 [details] [diff] [review]:
-----------------------------------------------------------------

I've reviewed parts of this patch, but don't feel like a very good connaisseur myself of many of the things here. I hope that Matt or Nick can take a closer look at these changes.

I also have a nontrivial design question:

::: gfx/layers/client/TextureClient.cpp
@@ +227,5 @@
> +  // All the destruction code that may lead to virtual method calls must be in
> +  // Finalize().
> +}
> +
> +void TextureClient::ForceRemove()

I would like to understand how the existence of a 'ForceRemove' method, that immediately sends a __delete__ message, is compatible with reference counting of TextureClients?

I mean, when one of the things that hold on to a TextureClient calls ForceRemove(), what happens to the other things that are holding on to this TextureClient?
Attachment #824239 - Flags: review?(bjacob) → review+
Comment on attachment 824244 [details] [diff] [review]
Part 4 - Make sure we release all GL resources asap when shutting down the communication channel

Review of attachment 824244 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/composite/ImageHost.cpp
@@ +158,5 @@
> +ImageHost::SetCompositor(Compositor* aCompositor)
> +{
> +  if (mFrontBuffer && mCompositor != aCompositor) {
> +    mFrontBuffer->SetCompositor(aCompositor);
> +  }

So if we SetCompositor before we have a FrontBuffer, we silently do nothing? Should we rather assert that we have a FrontBuffer?

::: gfx/layers/ipc/ImageBridgeParent.cpp
@@ +128,5 @@
>  
>  bool ImageBridgeParent::RecvStop()
>  {
> +  // If there is any texture still alive we have to force it to deallocate the
> +  // device data (GL textures, etc.) now because shortly after SenStop() returns

SenStop?

@@ +135,5 @@
> +  InfallibleTArray<PTextureParent*> textures;
> +  ManagedPTextureParent(textures);
> +  for (unsigned int i = 0; i < textures.Length(); ++i) {
> +    RefPtr<TextureHost> tex = TextureHost::AsTextureHost(textures[i]);
> +    tex->DeallocateDeviceData();

Can you double-confirm that DeviceData, not SharedData, is the only thing that we want to deallocate here?
Attachment #824244 - Flags: review?(bjacob) → review+
(In reply to Benoit Jacob [:bjacob] from comment #32)
> 
> I would like to understand how the existence of a 'ForceRemove' method, that
> immediately sends a __delete__ message, is compatible with reference
> counting of TextureClients?
> 
> I mean, when one of the things that hold on to a TextureClient calls
> ForceRemove(), what happens to the other things that are holding on to this
> TextureClient?

Ideally we should not expose ForceRemove, and it would only be internal to TextureClient. However, we can have cases where the producer (like the B2G camera shutdown) says "when this function returns I have full ownership over all the gralloc buffers and delete them" At this point it is really hard to ensure that all the existing references to the TextureClients have been cleared, so we force the deallocation sequence.

Once you call ForceRemove on a TextureClient, it becomes invalid in the sense that it does not hold shared data any more. So calling Lock() on it will fail and return false.

ForceRemove should not be used except in bad edge cases like the camera shutdown.
In this patch it is unfortuantely used a bit more than it should, to reflect the behavior before PTexture (currently we send the OpRemoveTexture message manually regardless of refcounting, because of limitations caused by TextureClient depending on CompositableClient. PTexture removes this limitation and I have a patch to removes all the uses of ForceRemove except in the camera shutdown, but this patch will go in bug 926745.
(In reply to Benoit Jacob [:bjacob] from comment #33)
> Comment on attachment 824244 [details] [diff] [review]
> Part 4 - Make sure we release all GL resources asap when shutting down the
> communication channel
> 
> Review of attachment 824244 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: gfx/layers/composite/ImageHost.cpp
> @@ +158,5 @@
> > +ImageHost::SetCompositor(Compositor* aCompositor)
> > +{
> > +  if (mFrontBuffer && mCompositor != aCompositor) {
> > +    mFrontBuffer->SetCompositor(aCompositor);
> > +  }
> 
> So if we SetCompositor before we have a FrontBuffer, we silently do nothing?
> Should we rather assert that we have a FrontBuffer?
> 

Currently we tend to call SetCompositor "too often just in case" to make sure we do have the right compositor when we need it. There are a lot of possible combinations that may lead to having a compositor set before or after having a texture so we can't easily assert that we have a front buffer. when you set the front buffer it automatically sets the compositor of the front buffer to be the one of the compositable so either way it works.

> @@ +135,5 @@
> > +  InfallibleTArray<PTextureParent*> textures;
> > +  ManagedPTextureParent(textures);
> > +  for (unsigned int i = 0; i < textures.Length(); ++i) {
> > +    RefPtr<TextureHost> tex = TextureHost::AsTextureHost(textures[i]);
> > +    tex->DeallocateDeviceData();
> 
> Can you double-confirm that DeviceData, not SharedData, is the only thing
> that we want to deallocate here?

Confirmed: here we want to deallocate GL resources basically, so DeviceData.
(In reply to Nicolas Silva [:nical] from comment #34)
> PTexture removes this
> limitation and I have a patch to removes all the uses of ForceRemove except
> in the camera shutdown, but this patch will go in bug 926745.

OK, could you land bug 926745 immediately with the present bug? What you described in comment 34 sounds scary even if I didn't understand all of the details; makes me wonder if I was perhaps even less competent to r+ your present patch, than I thought I was.

Also, consider making ForceRemove private and befriending the few classes that need to call it, with comments explaining how bad that is.
Alias: PTexture
Attachment #815419 - Attachment is obsolete: true
Attachment #824236 - Attachment is obsolete: true
Attachment #824239 - Attachment is obsolete: true
Attachment #824242 - Attachment is obsolete: true
I rebased the entire patch queue on top of today's m-c so that bjacob can pick it up. I only made it so that each patch applies and builds, I did not try to run it after the rebase so there may be some breakage.
Prelude patches are to be applied before the rest.
Attachment #824247 - Attachment is obsolete: true
Trying to fix the build on Mac, and enabling reftest:

https://tbpl.mozilla.org/?tree=Try&rev=26e79036cc3c
Attached patch Part 3.1 - Fix the build on Mac (obsolete) — Splinter Review
Attachment #8341234 - Flags: review?(nical.bugzilla)
Now we're getting crashes like this:

https://tbpl.mozilla.org/php/getParsedLog.php?id=31331431&tree=Try&full=1#error0

12:08:55     INFO -  Crash reason:  EXC_BAD_ACCESS / 0x0000000d
12:08:55     INFO -  Crash address: 0x0
12:08:55     INFO -  Thread 29 (crashed)
12:08:55     INFO -   0  XUL!mozilla::layers::TextureClient::Finalize() [TextureClient.cpp:26e79036cc3c : 250 + 0x0]
12:08:55     INFO -      rbx = 0x0000000149053bd0   r12 = 0x0000000120bade08
12:08:55     INFO -      r13 = 0x0000000000006701   r14 = 0x0000000000006723
12:08:55     INFO -      r15 = 0x000000011f7dbbd0   rip = 0x0000000102084344
12:08:55     INFO -      rsp = 0x000000011f7dbb60   rbp = 0x000000011f7dbb60
12:08:55     INFO -      Found by: given as instruction pointer in context
12:08:55     INFO -   1  XUL!mozilla::layers::ReleaseTextureClientNow [TextureClient.h:26e79036cc3c : 138 + 0x7]
12:08:55     INFO -      rbx = 0x0000000149053bd0   r12 = 0x0000000120bade08
12:08:55     INFO -      r13 = 0x0000000000006701   r14 = 0x0000000000006723
12:08:55     INFO -      r15 = 0x000000011f7dbbd0   rip = 0x00000001020b5613
12:08:55     INFO -      rsp = 0x000000011f7dbb70   rbp = 0x000000011f7dbb80
12:08:55     INFO -      Found by: call frame info
12:08:55     INFO -   2  XUL!MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask const&) [message_loop.cc:26e79036cc3c : 338 + 0x8]
12:08:55     INFO -      rbx = 0x000000011f7dbd00   r12 = 0x0000000120bade08
12:08:55     INFO -      r13 = 0x0000000000006701   r14 = 0x000000014734bbd0
12:08:55     INFO -      r15 = 0x000000011f7dbbd0   rip = 0x0000000101c6d429
12:08:55     INFO -      rsp = 0x000000011f7dbb90   rbp = 0x000000011f7dbbc0
12:08:55     INFO -      Found by: call frame info


What's happening here is likely that we're using mActor after it's been deleted. Not too surprising as TextureClient is refcounted and its actor isn't... the fix should be simple: make PTexture actors refcounted, and hold strong refs to them.
Ah, no, sorry, this crash seems to be about something else.

The line of code that's crashing is:

void
TextureClient::Finalize()
{
  if (mActor) {
    // this will call ForceRemove in the right thread, using a sync proxy if needed
    mActor->GetForwarder()->RemoveTexture(this); // *** CRASHING HERE ***
  }
}

And looking at the registers it seems that 'this' is not null. Looking at code, then, GetForwarder() is not virtual so it can't be directly responsible for this null-pointer-deref crash.

So what we likely have here is that mActor->mForwarder is null.

Also, it's only crashing on Mac.

Nical, my about fix-build-on-Mac patch is just a naive adaptation to remove the aID parameter to the constructor. Is there something else that I needed to do?
Attachment #8341234 - Flags: review?(nical.bugzilla) → review+
(In reply to Benoit Jacob [:bjacob] from comment #49)
> Nical, my about fix-build-on-Mac patch is just a naive adaptation to remove
> the aID parameter to the constructor. Is there something else that I needed
> to do?

Looks good, Part 3 was really just dead code removal, nothing fancier.
(In reply to Benoit Jacob [:bjacob] from comment #49)
> And looking at the registers it seems that 'this' is not null. Looking at
> code, then, GetForwarder() is not virtual so it can't be directly
> responsible for this null-pointer-deref crash.
> 
> So what we likely have here is that mActor->mForwarder is null.

I think the only way this can happen is if TextureClient::InitIPDLActor gets a null aForwarder parameter. We should assert that it doesn't happen (or maybe return false if it's legit) and find out why it is happening on Mac.
Blocks: 925444
OK, some news. The above assertion is not failing. So I made a try push with lots of logging,

  https://tbpl.mozilla.org/?tree=Try&rev=640af26574e9

and I get the same crash as before, with this log:

https://tbpl.mozilla.org/php/getParsedLog.php?id=31402692&tree=Try&full=1#error0

14:46:06     INFO -  XXX TextureClient ctor, this=0x11b596200
14:46:06     INFO -  XXX TextureClient InitIPDLActor, this=0x11b596200, aForwarder=0x1003752c8
14:46:06     INFO -  XXX TextureChild ctor, this=0x100393400
14:46:06     INFO -  XXX    mActor=0x100393400, mActor->mForwarder=0x1003752c8
14:46:06     INFO -  XXX TextureClient ctor, this=0x11af7cf80
14:46:06     INFO -  XXX TextureClient ctor, this=0x11b596500
14:46:06     INFO -  XXX TextureClient ctor, this=0x11b5965c0
14:46:06     INFO -  XXX TextureClient ctor, this=0x11b596620
14:46:06     INFO -  XXX TextureClient ctor, this=0x119b91040
14:46:06     INFO -  XXX TextureClient ctor, this=0x119b140e0
14:46:06     INFO -  XXX TextureClient ctor, this=0x11ace55e0
14:46:06     INFO -  XXX TextureClient ctor, this=0x11af075c0
14:46:06     INFO -  XXX TextureClient ctor, this=0x11b51fc40
14:46:06     INFO -  tests/content/media/test/id3tags.mp3
14:46:06     INFO -  155551 INFO TEST-PASS | /tests/content/media/test/test_media_sniffer.html | The media loads when served without a Content-Type.
14:46:06     INFO -  XXX TextureClient ForceRemove, this=0x11b596200, mActor=0x100393400
14:46:06     INFO -  XXX TextureChild dtor, this=0x100393400
14:46:06     INFO -  XXX TextureClient Finalize, this=0x11b596200, mActor=0x100393400
14:46:06     INFO -  XXX    mActor->GetForwarder()=0x0
14:46:47  WARNING -  PROCESS-CRASH | /tests/content/media/test/test_media_sniffer.html | application crashed [@ mozilla::layers::TextureClient::Finalize()]

This means that even though InitIPDLActor was called exactly once on this TextureClient 0x11b596200, with a non-null aForwarder 0x1003752c8, at the time of the crash mForwarder is null, which explains the crash (in particular, ruling out the possibility of a dead actor, or of a dead forwarder).

Need to do the logging equivalent of a watchpoint to understand why mForwarder switches to null...
Haha! It is right before our eyes in the log pasted in the above comment!

This part:

14:46:06     INFO -  XXX TextureClient ForceRemove, this=0x11b596200, mActor=0x100393400
14:46:06     INFO -  XXX TextureChild dtor, this=0x100393400
14:46:06     INFO -  XXX TextureClient Finalize, this=0x11b596200, mActor=0x100393400
14:46:06     INFO -  XXX    mActor->GetForwarder()=0x0

So the TextureChild here (0x100393400) is already dead, so we're in a use-after-free situation.
(In reply to Benoit Jacob [:bjacob] from comment #55)
> So the TextureChild here (0x100393400) is already dead, so we're in a
> use-after-free situation.

IIUC, it looks like the shutdown case where we should be carefully implementing TextureChild::ActorDestroy as we talked about on skype.
Yes we should implement ActorDestroy, but I want to make sure that I understand for sure if that is the reason of the present crash; the next try push should tell us.
Attachment #8341891 - Flags: review?(nical.bugzilla) → review+
Alright, the theory is confirmed:
 - The only reason why mActor->mForwarder has a bad value is that mActor is dead (I instrumented the mForwarder pointer to log all changes to its value).
 - The reason why mActor is dead is that is was destroyed by IPDL, ActorDestroy did get called, with reason 0x1 which is 'Deletion'.

06:21:27     INFO -  XXX TextureClient ctor, this=0x151d49d30
06:21:27     INFO -  XXX TextureClient InitIPDLActor, this=0x151d49d30, aForwarder=0x11024ffa8
06:21:27     INFO -  XXX at construction: TextureChild(0x10d9307e0)->mForwarder = 0x0
06:21:27     INFO -  XXX TextureChild ctor, this=0x10d9307e0
06:21:27     INFO -  XXX before assignment: TextureChild(0x10d9307e0)->mForwarder = 0x0
06:21:27     INFO -  XXX after assignment: TextureChild(0x10d9307e0)->mForwarder = 0x11024ffa8
06:21:27     INFO -  XXX    mActor=0x10d9307e0, mActor->mForwarder=0x11024ffa8
06:21:27     INFO -  XXX TextureClient ForceRemove, this=0x151d49d30, mActor=0x10d9307e0
06:21:27     INFO -  XXX    ForceRemove calling SendRemoveTexture
06:21:27     INFO -  XXX    ForceRemove finished
06:21:27     INFO -  XXX TextureChild::ActorDestroy, this=0x10d9307e0, ActorDestroyReason=0x1
06:21:27     INFO -  XXX TextureChild dtor, this=0x10d9307e0
06:21:27     INFO -  XXX at destruction: TextureChild(0x10d9307e0)->mForwarder = 0x11024ffa8
06:21:27     INFO -  XXX TextureClient ctor, this=0x11cbe2cf0
06:21:27     INFO -  XXX TextureClient InitIPDLActor, this=0x11cbe2cf0, aForwarder=0x11024ffa8
06:21:27     INFO -  XXX at construction: TextureChild(0x11e3f0860)->mForwarder = 0x0
06:21:27     INFO -  XXX TextureChild ctor, this=0x11e3f0860
06:21:27     INFO -  XXX before assignment: TextureChild(0x11e3f0860)->mForwarder = 0x0
06:21:27     INFO -  XXX after assignment: TextureChild(0x11e3f0860)->mForwarder = 0x11024ffa8
06:21:27     INFO -  XXX    mActor=0x11e3f0860, mActor->mForwarder=0x11024ffa8
06:21:27     INFO -  XXX TextureClient Finalize, this=0x151d49d30, mActor=0x10d9307e0
06:21:27     INFO -  XXX    mActor->GetForwarder()=0x20000
06:21:31  WARNING -  TEST-UNEXPECTED-FAIL | file:///builds/slave/talos-slave/test/build/tests/reftest/tests/layout/reftests/ogg-video/poster-11.html | Exited with code 1 during test run
06:21:31     INFO -  INFO | automation.py | Application ran for: 0:22:24.252249
06:21:31     INFO -  INFO | zombiecheck | Reading PID log: /var/folders/c1/dtzwlzq94zgd200q5swqh2b800000w/T/tmp5myNgrpidlog
06:21:44  WARNING -  PROCESS-CRASH | file:///builds/slave/talos-slave/test/build/tests/reftest/tests/layout/reftests/ogg-video/poster-11.html | application crashed [@ mozilla::layers::TextureClient::Finalize()]
NOOOoooo...  tryserver is not functional at the moment, "Frequent AWS disconnects."
Now conversely notifying the TextureChild when the TextureClient goes away.

https://tbpl.mozilla.org/?tree=Try&rev=7a3bea1634ba
Attachment #8342373 - Attachment is obsolete: true
Attachment #8342373 - Flags: review?(nical.bugzilla)
Attachment #8342485 - Flags: review?(nical.bugzilla)
Previous push was green.

Now this patch it about the Parent side. It seems that there were maybe a couple of issues in this function:

 - the condition about DEALLOCATE_CLIENT looked weird with the ! in front of the constant. Can you confirm that this is what was meant?

 - the if (mTextureHost) check was omitted at the end. It seemed better to do it once and for all at the beginning. Is that correct?

https://tbpl.mozilla.org/?tree=Try&rev=f833275b376d
Attachment #8342639 - Flags: review?(nical.bugzilla)
Comment on attachment 8342485 [details] [diff] [review]
Part 7: Implement TextureChild::ActorDestroy to notify its TextureClient

Review of attachment 8342485 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/client/TextureClient.h
@@ +254,5 @@
>     * Must only be called by Release().
>     */
>    void Finalize();
>  
> +  virtual void OnTextureChildDestroyed() {}

nit: on the host side the convention is to call this method OnActorDestroy (it's ActorDestroy for actors and OnActorDestroy for other objects that have to be notified about it)

Why do we need the method anyway? TextureClient is a friend of TextureChild, so TextureChild can manipulate its guts at will already and the method is empty right now (or do you plan to add stuff in there in another patch?)
Attachment #8342485 - Flags: review?(nical.bugzilla) → review+
Comment on attachment 8342639 [details] [diff] [review]
Part 8: fix TextureParent::ActorDestroy

Review of attachment 8342639 [details] [diff] [review]:
-----------------------------------------------------------------

wow indeed that is much better!
Attachment #8342639 - Flags: review?(nical.bugzilla) → review+
Previous try push crashed half the time: https://tbpl.mozilla.org/?tree=Try&rev=f833275b376d . My understanding is that the try push before, which was green, was leaking texturehosts because of the !TEXTURE_DEALLOCATE_CLIENT bug, and these TextureHosts are now correctly destroyed, now exposing a use-after-free bug.

This patch aims to solve this use-after-free.

New try push is consistently green:
https://tbpl.mozilla.org/?tree=Try&rev=4bb9128171bc
Attachment #8343025 - Flags: review?(nical.bugzilla)
(In reply to Nicolas Silva [:nical] from comment #66)
> Comment on attachment 8342485 [details] [diff] [review]
> Part 7: Implement TextureChild::ActorDestroy to notify its TextureClient
> 
> Review of attachment 8342485 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: gfx/layers/client/TextureClient.h
> @@ +254,5 @@
> >     * Must only be called by Release().
> >     */
> >    void Finalize();
> >  
> > +  virtual void OnTextureChildDestroyed() {}
> 
> nit: on the host side the convention is to call this method OnActorDestroy
> (it's ActorDestroy for actors and OnActorDestroy for other objects that have
> to be notified about it)

OK, will adapt to this convention.

> 
> Why do we need the method anyway? TextureClient is a friend of TextureChild,
> so TextureChild can manipulate its guts at will already and the method is
> empty right now (or do you plan to add stuff in there in another patch?)

I didn't know whether we would need to do special work there for special kinds of TextureClient, such as ShmemTextureClient? So the plan was for this empty virtual method to be overridden by the TextureClients that need to. But if you don't see a need to, then sure let's remove it.
Attachment #8343025 - Flags: review?(nical.bugzilla) → review+
Attachment #8343025 - Attachment description: Check for null buffer in BufferTextureHost::Upload → Part 9: Check for null buffer in BufferTextureHost::Upload
Set to 1.3?, because this bug blocks Bug 925444.
blocking-b2g: --- → 1.3?
Previous push accidentally had a local patch that I need to be able to build on GCC 4.6 since bug 945613 landed.

https://tbpl.mozilla.org/?tree=Try&rev=ef045db298aa
remove 1.3? flag, based on off-line discussion. PTexture will not be landed in b2g 1.3
blocking-b2g: 1.3? → ---
No longer blocks: 925444
It's all green!

I'll do some testing on real devices, but this should be ready to land. That said, this is a very intrusive change, so we don't want to land it in the very last days of the B2G 1.3 cycle. Instead, we'll wait for the 1.3 branching and land on mozilla-inbound right after, in the gecko 29 cycle.
Attachment #8341033 - Attachment is obsolete: true
Attachment #8341034 - Attachment is obsolete: true
Attachment #8341036 - Attachment is obsolete: true
Attachment #8341038 - Attachment is obsolete: true
Attachment #8341039 - Attachment is obsolete: true
Attachment #8341041 - Attachment is obsolete: true
Attachment #8341045 - Attachment is obsolete: true
Attachment #8341234 - Attachment is obsolete: true
Attachment #8341891 - Attachment is obsolete: true
Attachment #8342485 - Attachment is obsolete: true
Attachment #8342639 - Attachment is obsolete: true
Attachment #8343025 - Attachment is obsolete: true
Attachment #8343303 - Flags: review+
Re-uploaded all 11 patches here (0.1, 0.2, and 1 -- 9) with commit info ready for landing as soon as we're on the gecko 29 train.

The updated patch series is exactly what was pushed on this green tryserver push:

  https://tbpl.mozilla.org/?tree=Try&rev=ef045db298aa
The landing of bug 893301 destroyed this patch queue; spent the day adapting to it.

New try push:
https://tbpl.mozilla.org/?tree=Try&rev=34f50fc4326a

Will ask nical to review a big additional patch if that's green; and we'll have to collapse for landing because I couldn't restore the successful compilation of each of the steps in the queue.
Looking OK so far on tryserver... time for a VERY careful review, lots of basic changes there as we discussed on IRC, which I could have failed to get right.
Attachment #8345589 - Flags: review?(nical.bugzilla)
Attachment #8345589 - Flags: review?(ncameron)
Attachment #8345589 - Flags: review?(nical.bugzilla) → review+
Comment on attachment 8345589 [details] [diff] [review]
Part 10: fix everything post bug 893301

Review of attachment 8345589 [details] [diff] [review]:
-----------------------------------------------------------------

Seems like there are a few things missing (or I am missing something). If you want to fix these things in another patch, then r=me for this one.

::: gfx/layers/client/ContentClient.cpp
@@ +301,5 @@
>    if (mTextureClient) {
>      mTextureClient->OnActorDestroy();
>    }
>    if (mTextureClientOnWhite) {
>      mTextureClientOnWhite->OnActorDestroy();

DO we still need all this OnActorDestroy stuff? Shouldn't that be handled via PTexture like it is on the host side?

::: gfx/layers/composite/CompositableHost.h
@@ -300,5 @@
>    TextureInfo mTextureInfo;
>    Compositor* mCompositor;
>    Layer* mLayer;
>    RefPtr<CompositableBackendSpecificData> mBackendData;
> -  RefPtr<TextureHost> mFirstTexture;

Can you remove the next sibling thing from TextureHosts too - or is that removed in an earlier patch

::: gfx/layers/composite/ContentHost.cpp
@@ +48,5 @@
>    mTextureHost = nullptr;
>  }
>  
>  void
>  ContentHostBase::DestroyTextureHostOnWhite()

these methods can be removed. Not needed in the destructor and inline at the other call site

@@ +309,1 @@
>    CompositableHost::OnActorDestroy();

remove this method

@@ +335,5 @@
>    mDeprecatedTextureHostOnWhite = nullptr;
>  }
>  
>  void
>  DeprecatedContentHostBase::OnActorDestroy()

remove this method

@@ +798,5 @@
>    // don't touch mDeprecatedTextureHost, we might need it for compositing
>  }
>  
>  void
>  DeprecatedContentHostDoubleBuffered::OnActorDestroy()

and this one

::: gfx/layers/composite/ImageHost.h
@@ +126,5 @@
>    virtual LayerRenderState GetRenderState() MOZ_OVERRIDE;
>  
>    virtual void SetCompositor(Compositor* aCompositor) MOZ_OVERRIDE;
>  
> +  virtual void OnActorDestroy() MOZ_OVERRIDE {}

can we just remove this?

::: gfx/layers/composite/TextureHost.cpp
@@ +47,2 @@
>  
> +  void ActorDestroy(ActorDestroyReason why);

why remove the MOZ_OVERRIDE?

@@ +705,5 @@
>        mTextureHost->DeallocateSharedData();
>      }
>      break;
>  
>    case NormalShutdown:

I am not sure about this at all. Could you fire off versions that do and don't do OnShutdown for NormalShutdown with Windows OMTC to try to see if there is any difference?

::: gfx/layers/composite/TextureHost.h
@@ +262,2 @@
>  {
> +  Atomic<int> mRefCount;

Do we have to do this manually? Can't we use the mfbt class? Or fork it with a different Release method. Having a bunch of special case ref counting code scattered around the tree is scary.

::: gfx/layers/composite/TiledContentHost.h
@@ +118,5 @@
>    {
>      mCompositor = aCompositor;
>    }
>  
> +  void OnActorDestroy() {}

remove
Attachment #8345589 - Flags: review?(ncameron)
Stumbled upon this while addressing nrc's comments.
Attachment #8345864 - Flags: review?(nical.bugzilla)
The above new patches should address most of your review comments. There remains:

> @@ +705,5 @@
> >        mTextureHost->DeallocateSharedData();
> >      }
> >      break;
> >  
> >    case NormalShutdown:
> 
> I am not sure about this at all. Could you fire off versions that do and
> don't do OnShutdown for NormalShutdown with Windows OMTC to try to see if
> there is any difference?

OK, will make try pushes now.

> 
> ::: gfx/layers/composite/TextureHost.h
> @@ +262,2 @@
> >  {
> > +  Atomic<int> mRefCount;
> 
> Do we have to do this manually? Can't we use the mfbt class? Or fork it with
> a different Release method. Having a bunch of special case ref counting code
> scattered around the tree is scary.

We can't use the MFBT RefCounted class because we need a Finalize method to be called when the refcount hits zero. I do see what you mean with scary. We could factor this in a common RefCountedWithFinalize base class, living somewhere under gfx/ (until it finds users outside).
Attachment #8345880 - Attachment is patch: true
Attachment #8345880 - Flags: review?(nical.bugzilla)
Part 15 should address Nick's comments about refcounting.
Windows OMTC try pushes:

With current code, calling OnShutdown on all shutdowns:
https://tbpl.mozilla.org/?tree=Try&rev=6273cbdba48d

With patch to call OnShutdowm only on abnormal shutdowns:
https://tbpl.mozilla.org/?tree=Try&rev=529bde92329f
Attachment #8345864 - Flags: review?(nical.bugzilla) → review+
Attachment #8345880 - Flags: review?(nical.bugzilla) → review+
The last Windows OMTC push is just as orange, (so all 3 are equally orange), so it doesn't look like the present patches make any measurable difference in this respect. Landing.
The diffstat is a net removal of 140 lines of code!

 52 files changed, 794 insertions(+), 937 deletions(-)
(In reply to Benoit Jacob [:bjacob] from comment #100)
> The last Windows OMTC push is just as orange, (so all 3 are equally orange),
> so it doesn't look like the present patches make any measurable difference
> in this respect. Landing.

I hope all that orange is fixed by bug 948555. It would be great if you could do another win omtc try push now that that bug is on m-c.
(In reply to Benoit Jacob [:bjacob] from comment #101)

Backed out for mass bustage.
https://hg.mozilla.org/integration/mozilla-inbound/rev/ff1bf2049ecf
The bustage was a last-minute regression from the last patch, introducing a AtomicRefCountedWithFinalize base class. Was calling delete on Base* instead of Derived* pointer. Here's the fix.

--- a/gfx/layers/AtomicRefCountedWithFinalize.h
+++ b/gfx/layers/AtomicRefCountedWithFinalize.h
@@ -27,18 +27,19 @@ class AtomicRefCountedWithFinalize
     }
 
     void Release() {
       MOZ_ASSERT(mRefCount > 0);
       if (0 == --mRefCount) {
 #ifdef DEBUG
         mRefCount = detail::DEAD;
 #endif
-        static_cast<T*>(this)->Finalize();
-        delete this;
+        T* derived = static_cast<T*>(this);
+        derived->Finalize();
+        delete derived;
       }
     }
Big PTexture try push:
https://tbpl.mozilla.org/?tree=Try&rev=c50309de0ed8

Windows OMTC + PTexture try push:
https://tbpl.mozilla.org/?tree=Try&rev=9e2f9c31b686

Windows OMTC + PTexture + only-call-onshutdown-on-abnormal-shutdown try push:
https://tbpl.mozilla.org/?tree=Try&rev=a59fedf06212
Relanded, for the following two reasons.

Reason #1:

Above try pushes are all green, regardless of the shutdown tweak.

Reason #2:

An old shadok (http://en.wikipedia.org/wiki/Les_Shadoks) saying goes like this: "The more it already failed, the higher the chances of it finally working". (which is also a basic fact of probability theory).


http://hg.mozilla.org/integration/mozilla-inbound/rev/c1b544bfad40
http://hg.mozilla.org/integration/mozilla-inbound/rev/01df0b7dcd8f
http://hg.mozilla.org/integration/mozilla-inbound/rev/fd0594ff74ce
http://hg.mozilla.org/integration/mozilla-inbound/rev/be500431a9d6
http://hg.mozilla.org/integration/mozilla-inbound/rev/7bc6247d6699
http://hg.mozilla.org/integration/mozilla-inbound/rev/f7eb2a0b79f4
http://hg.mozilla.org/integration/mozilla-inbound/rev/c57f9c2a5459
http://hg.mozilla.org/integration/mozilla-inbound/rev/146bd659d177
http://hg.mozilla.org/integration/mozilla-inbound/rev/3e24eaf5c3ec
http://hg.mozilla.org/integration/mozilla-inbound/rev/a283c87bafd1
http://hg.mozilla.org/integration/mozilla-inbound/rev/0cca2738e11e
http://hg.mozilla.org/integration/mozilla-inbound/rev/44928ee41344
http://hg.mozilla.org/integration/mozilla-inbound/rev/dfc65a7c9680
http://hg.mozilla.org/integration/mozilla-inbound/rev/8c943866a1dc
http://hg.mozilla.org/integration/mozilla-inbound/rev/13259025e424
http://hg.mozilla.org/integration/mozilla-inbound/rev/4ee8ce2e18ca
http://hg.mozilla.org/integration/mozilla-inbound/rev/8a3091a7302e
http://hg.mozilla.org/integration/mozilla-inbound/rev/990c7417b97b
http://hg.mozilla.org/integration/mozilla-inbound/rev/969cfcbb1c9c
http://hg.mozilla.org/integration/mozilla-inbound/rev/30f02ad6864d
(Changesets listed newest first; ignore the 4 first ones, they're from a different bug)
Whiteboard: [leave open]
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Depends on: 949756
No longer depends on: 858914
Depends on: 957783
Depends on: 962978
Attachment #8343305 - Attachment is patch: true
Flags: in-testsuite?
You need to log in before you can comment on or make changes to this bug.