Closed Bug 966284 Opened 6 years ago Closed 5 years ago
Fix gfx IPDL protocol shutdown
The shutdown the synchronous child-to-parent shutdown of some protocols (such as PLayerTransaction) can race with async parent-to-child messages of their managed protocols (such as PTexture). We need to redesign tear down logic making sure the shutdown of a manager protocol never race with its sub protocols.
I propose that we do the following: 1) At the protocol level: Almost all of the gfx protocol deletion is triggered by the child side. To make sure that deletion doesn't cross a message coming form the other side, we need a simple hand shake such as FooChild::SendShutdown -> FooParent::RecvShutdown -> FooParent::Send__delete__ Plus carefully keeping a state in FooChild so it knows that after it has sent the destroy message it can't send any other message (although it might receive some) This is how PTexture already works. 2) At the protocol hierarchy level: To avoid that a manager protocol races with it managees, we also need the manager to shutdown with a similar hand shake. The manager must also iterate over it's managees to trigger their destruction sequence as well as in the diagram below: FooManager::Destroy() | Go through all of PFooManager's managed | protocols, let them send their last | potential message, and mark them so | that they can't send any more message. PFooManager::SendShutDown --. \ \ ... can receive and send ... ... \ `--> PFooManager::RecvShutDown can receive | Go through all of PFooManager's managed but not send | protocols, let them send their last | potential message, and mark them so ... | that they can't send any more message. .-- PFooManager::Send__delete__ / / ... can't receive nor send ... / PFooManager::Recv__delete__ <--' 3) Shutting down the channel: When we shut down the channel we need to destroy the top level protocol synchronously with respect to the main thread and also need to accommodate with the managed protocols having an asynchronous shutdown. Right now we do that with two synchronous messages (WillStop followed by Stop). the first triggers all the deletion from the other side (after it returns the other side can't send message anymore), the second is just there to ensure that all of the messages generated by the destruction on the other side has reached the child side's event loop. In order to make it work with 1) and 2) we just need to iterate over our managees and trigger their shutdown before sending WillStop. So there is almost no change to be made there.
This patch fixes the shutdown problem by making PlayerTransaction do what I proposed earlier in this bug. This makes sure that destroying PLayerTransaction will not race with PTexture messages.
Assignee: nobody → nical.bugzilla
Attachment #8375544 - Flags: review?(sotaro.ikeda.g)
Comment on attachment 8375544 [details] [diff] [review] Fix the shut down sequence of PLayerTransaction Review of attachment 8375544 [details] [diff] [review]: ----------------------------------------------------------------- Looks good!
Attachment #8375544 - Flags: review?(sotaro.ikeda.g) → review+
(In reply to Nicolas Silva [:nical] from comment #4) > https://hg.mozilla.org/integration/mozilla-inbound/rev/9e7cf0b1d80c sorry that was for another bug
another try push https://tbpl.mozilla.org/?tree=Try&rev=fa569276a9ce
I haven't looked at this for a while but it looks like it's close to be landable. The problem is that it doesn't seem to be working well on the non-omtc platforms (surprisingly), as this try push shows https://tbpl.mozilla.org/?tree=Try&rev=78b0483db68c We get some MsgRouteError in PContentChild which this patch doesn't modify directly. Ben, does that ring any bell? What this patch does is mostly have PLayerTransaction send a Shutdown message from the client side and have the parent side send __delete__ rather than haing the child side send __delete__ directly.
Hrm. Usually this means that the child had a message in its queue for an actor that has been deleted by the time the message is processed. The only real way to figure out which message it is is to catch it in a debugger and look at the message type directly... It's possible that your patch just changed timing somehow and exposed a latent bug. Or maybe the parent layer actors hold strong refs to other PContent things, and this patch changes the order of releases?
This try push doesn't have the mysterious PContent issues, although bc1 is failing with some ipc related stuff. https://tbpl.mozilla.org/?tree=Try&rev=fa301b907093
quick fix for bc1, this push looks neat (not all tests are done as I write this): https://tbpl.mozilla.org/?tree=Try&rev=d195ff8212ed
patch with the bc1 fix.
Attachment #8375544 - Attachment is obsolete: true
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
5 years ago
See Also: → 1119395
You need to log in before you can comment on or make changes to this bug.