Move all of the information required to shut gfx IPDL actors down to the IPDL actors themeselves

RESOLVED WONTFIX

Status

()

defect
RESOLVED WONTFIX
3 years ago
2 years ago

People

(Reporter: nical, Assigned: nical)

Tracking

(Blocks 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [gfx-noted])

Attachments

(2 attachments)

We have a family of shutdown race condition where a thread-safe-refcounted object Foo holds on to a FooChild IPDL actor with FooChild's destruction is triggered by either Foo's destructor or a task scheduled by Foo's destructor.
Problems occur at shutdown if Foo is still alive while IPDL's shutdown forces FooChild to be destroyed, because Foo's destructor can also happen to be triggered concurrently, and FooChild's destruction typically involve reading states from Foo.

The ipdl shutdown in pseudo-code typically looks like:

> // IPDL top level protocol shutting down, need to remove all actors *NOW*
> for actor in AllPFooChildProtocolsAlive {
>    RefPtr<Foo> foo = FromIPDLActor(actor)
>    if (foo) {
>        foo->Destroy()
>    }
> }

And the above can race with Foo's refcount reaching zero on another thread just before being assigned to the local foo RefPtr.

It happens pretty rarely but we need to fix it eventually.

long story short, I have a plan to fix this.
Below is a copy-pasted comment from another bug where I intended to land this patch but went for a simpler temporary solution instead:

--


The fix is getting a bit more involved than I'd like to but there are several issues at play and fixing one mostly makes us fall on the others.

The root of the problem, from a high level perspective, is that we have written the gfx ipdl actors with the idea that each actors is fully owned by one object (for example CompositableClient fully owns CompositableChild) and that object's destruction drives the actor's destruction. This is mostly true, except when IPDL shuts down and forces the destruction of the actor. Ideally all actors would be already deallocated when we shut ipdl down but that's often not the case. Orchestrating this all becomes a bit complicated when threads get involved.


This patch does a bunch of things:
- rename DestroyIPDLActor -> DeallocIPDLActor to avoid confusion with the Destroy() methods which involve doing the destruction handshake and sending messages, while DeallocIPDLActor is the hook that gets called when IPDL wants to delete the actor (at this point we can't send messages).
- expose "ReleaseActor" rather than "Destroy" to make it clear that after this is called the TextureClient/CompositableClient should NOT keep their pointer to the actor (which may just have been deleted).
- track what happens and in which order for the following things:
  - ***Client is destroyed and calls ReleaseActor
  - IPDL shuts down (normal or abnormal shutdown)
  - the ***Child actor's destruction handshake is complete
And make sure that the actor is deleted at the right time, ideally when the destruction handshake finishes, if and only if the ***Client has already released its reference to the the ***Child actor (this is where most of the complexity is).
This patch will log to stdout when an ipc related gfx object is destroyed after the ipc shutdown, and tries to log the name of the top level protocol to get a better idea of where the objects are coming from.
That should be enough to give a hint about what could be wrong if a crash follows. At this point it's happening too often for us to be able have even a debug assertion.

For anyone willing to investigate objects outliving shutdown, there's also a few MOZ_CRASH added in some places which are only enabled if GFX_STRICT_SHUTDOWN is #defined.
These MOZ_CRASH fire when the objects that live too long are destroyed (rather than during shutdown when we realize there are objects that should already be dead), so they give a more helpful backtrace when trying to figure out what is holding the reference, and will not be confused with potential leaks.
Attachment #8740493 - Flags: review?(edwin)
Blocks: 1217128
Blocks: 1264293
Looks like this is caused by Destroy() being called twice if the TextureClient is a SharedSurfaceTextureClient, which I overlooked. My try pushes didn't include asan but I'm surprised that this use-after-free didn't make waves on the other builds. Definitely underestimated the usefulness of asan.
Flags: needinfo?(nical.bugzilla)
(In reply to Nicolas Silva [:nical] from comment #6)
> Looks like this is caused by Destroy() being called twice if the
> TextureClient is a SharedSurfaceTextureClient, which I overlooked.

Nevermind the SharedSurface thing, that's a video frame. That's odd.
Whiteboard: [gfx-noted]
This is a bit dated now.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.