###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID

RESOLVED FIXED

Status

()

Core
General
RESOLVED FIXED
4 years ago
3 years ago

People

(Reporter: cpeterson, Assigned: nical)

Tracking

({regression})

27 Branch
regression
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(e10s+)

Details

(Whiteboard: [rr], URL)

Attachments

(1 attachment)

(Reporter)

Description

4 years ago
STR:
1. Run Firefox from the terminal
2. Load https://bug922063.bugzilla.mozilla.org/attachment.cgi?id=811992
3. The test takes about 2 seconds to run.
4. Reload the page then quickly quit while the page's loading spinner is still spinning

RESULT:
###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID

I see this with Nightly 29 but not Aurora 28 (on OS X 10.9).
(Reporter)

Comment 1

4 years ago
Nical: could this error message be a regression from bug 897452?

I bisected recent Nightly builds and I think this was a regression in Nightly build 2013-12-13, which includes your ActorDestroy changes for bug 897452:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=1ad9af3a2ab8&tochange=8b5875dc7e31
Flags: needinfo?(nical.bugzilla)
Keywords: regression
(Assignee)

Comment 2

4 years ago
(In reply to Chris Peterson (:cpeterson) from comment #1)
> Nical: could this error message be a regression from bug 897452?

This changeset is large enough that I would not be surprised it caused it, although I have no idea why yet. I'll investigate this.
Assignee: nobody → nical.bugzilla
Flags: needinfo?(nical.bugzilla)
(Assignee)

Comment 3

4 years ago
Looks like this happens sometimes during shutdown when receiving PTexture::__delete__() on the client side.
(Assignee)

Comment 4

4 years ago
The problem seems to be only happening with the new ContentClient (at least on Linux which is what I am using to reproduce at them moment). My guess is that we did something wrong with handling TEXTURE_DEALLOCATE_DEFERRED
(Assignee)

Comment 5

4 years ago
Getting closer: This is happening when the TextureChild is destroyed because of ancestor deletion. Our code expects the deletion to always be triggered by the Recv__delete__ message.

Comment 6

4 years ago
Use ActorDestroy to invalidate state!
(Assignee)

Comment 7

4 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #6)
> Use ActorDestroy to invalidate state!

I think it's a little bit tricker:

PLayerTransaction's __delete__ message is asynchronously sent from the child to the parent, while PTexture's __delete__ message is sent asynchronously from the parent to the child.
When this is happening, the Manager protocol is destroyed on the child side before PTexture's __delete__ message arrives, but PTextureParent::ActorDestroy(AncestorDeletion) was not called before __delete__ was sent.

Something like:

                 Child                       |           Parent
                                             |
     async PTexture::SendRemoveTexture -.
                                         `-.
async PLayerTransaction::Send__delete__     `--> async PTexture::RecvRemoveTexture
      PTexture::ActorDestroy           `.     .- async PTexture::Send__delete__
                                         `.  /
                                           `/-.
           <error! unknown actor ID>  <----'   `->      ...
(Assignee)

Comment 8

4 years ago
This is a general problem that we run into every time we want to shutdown a protocol from the child side and there is potential async messages coming from the parent side (be it __delete__ or whatever else). ImageBridge's shutdown had a similar situation and we ended up doing some very complicated and slow dance with several synchronous messages, which I don't think we can afford to do here :(

The only other solution that comes to my mind is that all the __delete__ messages get always sent from the parent to the child side, and that this deletion is triggered by the child sending a shut-down message. This way the child side could know that as soon as the shutdown message is sent it can't send anything.


FooManager::Destroy()
| Go through all of PFooManager's managed
| protocols, let them send their last
| potential message, and mark them so 
| that they can't send any more message.
PFooManager::SendShutDown   --.
                               \
                                \      ... can receive and send ...
        ...                      \
                                  `--> PFooManager::RecvShutDown
     can receive                        | Go through all of PFooManager's managed
     but not send                       | protocols, let them send their last
                                        | potential message, and mark them so
        ...                             | that they can't send any more message.
                                   .-- PFooManager::Send__delete__
                                  /     
                                 /     ... can't receive nor send ...
                                /
PFooManager::Recv__delete__ <--'


And we would have to enforce this for the entire protocol hierarchy, which is probably a lot of work :(
On the other hand we'd end up with a clean and coherent shutdown protocol.

Comment 9

4 years ago
Yeah, in this case you really need to redesign the shutdown sequence. My best suggestion is to make sure that all the deletes come from the same side. e.g. instead of PTexture being deleted by the parent, the parent should send a message to the child saying "I'm done with this texture, please delete it".

Note that this doesn't completely eliminate the race, though, since you'd still be sending a message to an actor that the other side already thinks is dead! What you really need is to have the sides agree who's going to delete an actor and ensure that the other side is already aware that it's going away and cannot be sending messages to it.

IPDL already has machinery to statically enforce state machines which have this safety property, but nobody is using them, which is a tragedy of the hectic B2G development pace. I'll actually write up a post to dev-b2g later this week about how our stateless IPDL protocols are a huge footgun.
(Assignee)

Comment 10

4 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #9)
> Yeah, in this case you really need to redesign the shutdown sequence. My
> best suggestion is to make sure that all the deletes come from the same
> side. e.g. instead of PTexture being deleted by the parent, the parent
> should send a message to the child saying "I'm done with this texture,
> please delete it".

For PTexture we are already doing something like that: it's always the child side who decides to destroy the protocol, and it sends a RemoveTextureMessage (which means "i'm done, delete the texture"), and the parent side after receiving this message __delete__ the actor. The problem here is that the Managing protocol shuts itself down in the middle of that which forces the PTexture protocol to shut down while the __delete__ message is on it's way.

My proposition (comment 8) is that we use the same pattern for the managing protocols as well to ensure that a managed protocol doesn't get brutally shut down in the middle of it's removal sequence by the death of it's manager.

> 
> Note that this doesn't completely eliminate the race, though, since you'd
> still be sending a message to an actor that the other side already thinks is
> dead! What you really need is to have the sides agree who's going to delete
> an actor and ensure that the other side is already aware that it's going
> away and cannot be sending messages to it.

Yes, the important part is that it should always be the child side who initiates the removal sequence, and always be the parent side who responds to that by calling Send__delete__. if all the protocols of a given "management chain" do that you don't have the race. We don't have to do that for a parent protocol if we know for sure that it will never be destroyed before it's managed protocol (for instance PImageBridge is created and destroyed along with the browser so it should be safe). 

> 
> IPDL already has machinery to statically enforce state machines which have
> this safety property, but nobody is using them, which is a tragedy of the
> hectic B2G development pace. I'll actually write up a post to dev-b2g later
> this week about how our stateless IPDL protocols are a huge footgun.

Thanks, it'd be nice if you post that on dev-platform actually, since IPDL affects almost all platforms.

If we do as I propose in comment 8, we can use the state machine to enforce the "can/can't receive, can/can't send" parts.
(Assignee)

Updated

4 years ago
Depends on: 966284
Whiteboard: [rr]
Duplicate of this bug: 961671
Can we at least turn these messages off in non-debug builds? They're really annoying when trying to work on test debugging.
Flags: needinfo?(nical.bugzilla)
(Reporter)

Updated

3 years ago
Blocks: 879538
tracking-e10s: --- → ?
tracking-e10s: ? → +

Comment 13

3 years ago
Would be great to have someone look at this again, it's pretty annoying when working on e10s.
(Assignee)

Comment 14

3 years ago
(In reply to Dave Townsend [:mossop] from comment #12)
> Can we at least turn these messages off in non-debug builds? They're really
> annoying when trying to work on test debugging.

Forwarding the needinfo to Ben since IPDL intrnals are way out of my jurisdiction and comfort zone (although disabling this message in release builds sounds like a very good idea to me).

(In reply to Josh Aas (Mozilla Corporation) from comment #13)
> Would be great to have someone look at this again, it's pretty annoying when
> working on e10s.

Unfortunately I don't have the time atm. Anyone, feel free to steal the bug from me.
Flags: needinfo?(nical.bugzilla) → needinfo?(bent.mozilla)
Created attachment 8497867 [details] [diff] [review]
patch

This turns off message channel error logging in non-debug builds or if the MOZ_QUIET environment variable is set.
Assignee: nical.bugzilla → dtownsend+bugmail
Status: NEW → ASSIGNED
Attachment #8497867 - Flags: review?(bent.mozilla)

Comment 16

3 years ago
Do we not want to actually fix the problems leading to these messages? I thought that's what this bug was about. Or, are you saying we'll mask them in release builds and leave this bug open for the real issues?
(In reply to Josh Aas (Mozilla Corporation) from comment #16)
> Do we not want to actually fix the problems leading to these messages? I
> thought that's what this bug was about. Or, are you saying we'll mask them
> in release builds and leave this bug open for the real issues?

Perhaps I should have created a separate bug but yes I'd like to just mask them for now as they make it harder to work on debugging tests. Given that no-one has really looked into fixing the root cause in nearly 9 months I don't hold out a lot of hope that it will be fixed anytime soon.

Comment 18

3 years ago
When I've written IPC code and caused errors like this to show up, I've always been very glad I fixed them. They typically don't point to trivial mistakes. I'd be very careful about setting any precedent that covering these up is OK, even in release builds. Maybe ask bent though?
Given that my patch is just a wallpaper I'm not really the assignee for the real problem here.
Assignee: dtownsend+bugmail → nobody
(Assignee)

Comment 20

3 years ago
I think that the patch that just landed in Bug 966284 should fix this issue.
(Assignee)

Updated

3 years ago
Flags: needinfo?(bent.mozilla)
I agree, this seems fixed here now as well.
Assignee: nobody → nical.bugzilla
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED

Comment 22

3 years ago
May be related: 

https://forums.mozilla.org/viewtopic.php?f=27&t=20755&p=43536#p43536

It's now possible to get this error from the new "sdk/ui" module in the add-on SDK.
You need to log in before you can comment on or make changes to this bug.