952404 - ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID

Reporter

Description

•

11 years ago

STR:
1. Run Firefox from the terminal
2. Load https://bug922063.bugzilla.mozilla.org/attachment.cgi?id=811992
3. The test takes about 2 seconds to run.
4. Reload the page then quickly quit while the page's loading spinner is still spinning

RESULT:
###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID

I see this with Nightly 29 but not Aurora 28 (on OS X 10.9).

Chris Peterson [:cpeterson]

Reporter

Comment 1

•

11 years ago

Nical: could this error message be a regression from bug 897452?

I bisected recent Nightly builds and I think this was a regression in Nightly build 2013-12-13, which includes your ActorDestroy changes for bug 897452:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=1ad9af3a2ab8&tochange=8b5875dc7e31

Flags: needinfo?(nical.bugzilla)

Keywords: regression

Nicolas Silva [:nical]

Assignee

Comment 2

•

11 years ago

(In reply to Chris Peterson (:cpeterson) from comment #1)
> Nical: could this error message be a regression from bug 897452?

This changeset is large enough that I would not be surprised it caused it, although I have no idea why yet. I'll investigate this.

Assignee: nobody → nical.bugzilla

Flags: needinfo?(nical.bugzilla)

Nicolas Silva [:nical]

Assignee

Comment 3

•

10 years ago

Looks like this happens sometimes during shutdown when receiving PTexture::__delete__() on the client side.

Nicolas Silva [:nical]

Assignee

Comment 4

•

10 years ago

The problem seems to be only happening with the new ContentClient (at least on Linux which is what I am using to reproduce at them moment). My guess is that we did something wrong with handling TEXTURE_DEALLOCATE_DEFERRED

Nicolas Silva [:nical]

Assignee

Comment 5

•

10 years ago

Getting closer: This is happening when the TextureChild is destroyed because of ancestor deletion. Our code expects the deletion to always be triggered by the Recv__delete__ message.

Benjamin Smedberg

Comment 6

•

10 years ago

Use ActorDestroy to invalidate state!

Nicolas Silva [:nical]

Assignee

Comment 7

•

10 years ago

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #6)
> Use ActorDestroy to invalidate state!

I think it's a little bit tricker:

PLayerTransaction's __delete__ message is asynchronously sent from the child to the parent, while PTexture's __delete__ message is sent asynchronously from the parent to the child.
When this is happening, the Manager protocol is destroyed on the child side before PTexture's __delete__ message arrives, but PTextureParent::ActorDestroy(AncestorDeletion) was not called before __delete__ was sent.

Something like:

                 Child                       |           Parent
                                             |
     async PTexture::SendRemoveTexture -.
                                         `-.
async PLayerTransaction::Send__delete__     `--> async PTexture::RecvRemoveTexture
      PTexture::ActorDestroy           `.     .- async PTexture::Send__delete__
                                         `.  /
                                           `/-.
           <error! unknown actor ID>  <----'   `->      ...

Nicolas Silva [:nical]

Assignee

Comment 8

•

10 years ago

This is a general problem that we run into every time we want to shutdown a protocol from the child side and there is potential async messages coming from the parent side (be it __delete__ or whatever else). ImageBridge's shutdown had a similar situation and we ended up doing some very complicated and slow dance with several synchronous messages, which I don't think we can afford to do here :(

The only other solution that comes to my mind is that all the __delete__ messages get always sent from the parent to the child side, and that this deletion is triggered by the child sending a shut-down message. This way the child side could know that as soon as the shutdown message is sent it can't send anything.


FooManager::Destroy()
| Go through all of PFooManager's managed
| protocols, let them send their last
| potential message, and mark them so 
| that they can't send any more message.
PFooManager::SendShutDown   --.
                               \
                                \      ... can receive and send ...
        ...                      \
                                  `--> PFooManager::RecvShutDown
     can receive                        | Go through all of PFooManager's managed
     but not send                       | protocols, let them send their last
                                        | potential message, and mark them so
        ...                             | that they can't send any more message.
                                   .-- PFooManager::Send__delete__
                                  /     
                                 /     ... can't receive nor send ...
                                /
PFooManager::Recv__delete__ <--'


And we would have to enforce this for the entire protocol hierarchy, which is probably a lot of work :(
On the other hand we'd end up with a clean and coherent shutdown protocol.

Benjamin Smedberg

Comment 9

•

10 years ago

Yeah, in this case you really need to redesign the shutdown sequence. My best suggestion is to make sure that all the deletes come from the same side. e.g. instead of PTexture being deleted by the parent, the parent should send a message to the child saying "I'm done with this texture, please delete it".

Note that this doesn't completely eliminate the race, though, since you'd still be sending a message to an actor that the other side already thinks is dead! What you really need is to have the sides agree who's going to delete an actor and ensure that the other side is already aware that it's going away and cannot be sending messages to it.

IPDL already has machinery to statically enforce state machines which have this safety property, but nobody is using them, which is a tragedy of the hectic B2G development pace. I'll actually write up a post to dev-b2g later this week about how our stateless IPDL protocols are a huge footgun.

Nicolas Silva [:nical]

Assignee

Comment 10

•

10 years ago

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #9)
> Yeah, in this case you really need to redesign the shutdown sequence. My
> best suggestion is to make sure that all the deletes come from the same
> side. e.g. instead of PTexture being deleted by the parent, the parent
> should send a message to the child saying "I'm done with this texture,
> please delete it".

For PTexture we are already doing something like that: it's always the child side who decides to destroy the protocol, and it sends a RemoveTextureMessage (which means "i'm done, delete the texture"), and the parent side after receiving this message __delete__ the actor. The problem here is that the Managing protocol shuts itself down in the middle of that which forces the PTexture protocol to shut down while the __delete__ message is on it's way.

My proposition (comment 8) is that we use the same pattern for the managing protocols as well to ensure that a managed protocol doesn't get brutally shut down in the middle of it's removal sequence by the death of it's manager.

> 
> Note that this doesn't completely eliminate the race, though, since you'd
> still be sending a message to an actor that the other side already thinks is
> dead! What you really need is to have the sides agree who's going to delete
> an actor and ensure that the other side is already aware that it's going
> away and cannot be sending messages to it.

Yes, the important part is that it should always be the child side who initiates the removal sequence, and always be the parent side who responds to that by calling Send__delete__. if all the protocols of a given "management chain" do that you don't have the race. We don't have to do that for a parent protocol if we know for sure that it will never be destroyed before it's managed protocol (for instance PImageBridge is created and destroyed along with the browser so it should be safe). 

> 
> IPDL already has machinery to statically enforce state machines which have
> this safety property, but nobody is using them, which is a tragedy of the
> hectic B2G development pace. I'll actually write up a post to dev-b2g later
> this week about how our stateless IPDL protocols are a huge footgun.

Thanks, it'd be nice if you post that on dev-platform actually, since IPDL affects almost all platforms.

If we do as I propose in comment 8, we can use the state machine to enforce the "can/can't receive, can/can't send" parts.

Nicolas Silva [:nical]

Assignee

Updated

•

10 years ago

Depends on: 966284

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Updated

•

10 years ago

Whiteboard: [rr]

Dave Townsend [:mossop]

Comment 12

•

10 years ago

Can we at least turn these messages off in non-debug builds? They're really annoying when trying to work on test debugging.

Flags: needinfo?(nical.bugzilla)

Chris Peterson [:cpeterson]

Reporter

Updated

•

10 years ago

Blocks: core-e10s

tracking-e10s: --- → ?

Brad Lassey [:blassey] (use needinfo?)

Updated

•

10 years ago

tracking-e10s: ? → +

Josh Aas

Comment 13

•

10 years ago

Would be great to have someone look at this again, it's pretty annoying when working on e10s.

Nicolas Silva [:nical]

Assignee

Comment 14

•

10 years ago

(In reply to Dave Townsend [:mossop] from comment #12)
> Can we at least turn these messages off in non-debug builds? They're really
> annoying when trying to work on test debugging.

Forwarding the needinfo to Ben since IPDL intrnals are way out of my jurisdiction and comfort zone (although disabling this message in release builds sounds like a very good idea to me).

(In reply to Josh Aas (Mozilla Corporation) from comment #13)
> Would be great to have someone look at this again, it's pretty annoying when
> working on e10s.

Unfortunately I don't have the time atm. Anyone, feel free to steal the bug from me.

Flags: needinfo?(nical.bugzilla) → needinfo?(bent.mozilla)

Dave Townsend [:mossop]

Comment 15

•

10 years ago

Attached patch patch — Details — Splinter Review

This turns off message channel error logging in non-debug builds or if the MOZ_QUIET environment variable is set.

Assignee: nical.bugzilla → dtownsend+bugmail

Status: NEW → ASSIGNED

Attachment #8497867 - Flags: review?(bent.mozilla)

Josh Aas

Comment 16

•

10 years ago

Do we not want to actually fix the problems leading to these messages? I thought that's what this bug was about. Or, are you saying we'll mask them in release builds and leave this bug open for the real issues?

Dave Townsend [:mossop]

Comment 17

•

10 years ago

(In reply to Josh Aas (Mozilla Corporation) from comment #16)
> Do we not want to actually fix the problems leading to these messages? I
> thought that's what this bug was about. Or, are you saying we'll mask them
> in release builds and leave this bug open for the real issues?

Perhaps I should have created a separate bug but yes I'd like to just mask them for now as they make it harder to work on debugging tests. Given that no-one has really looked into fixing the root cause in nearly 9 months I don't hold out a lot of hope that it will be fixed anytime soon.

Josh Aas

Comment 18

•

10 years ago

When I've written IPC code and caused errors like this to show up, I've always been very glad I fixed them. They typically don't point to trivial mistakes. I'd be very careful about setting any precedent that covering these up is OK, even in release builds. Maybe ask bent though?

Dave Townsend [:mossop]

Comment 19

•

10 years ago

Given that my patch is just a wallpaper I'm not really the assignee for the real problem here.

Assignee: dtownsend+bugmail → nobody

Nicolas Silva [:nical]

Assignee

Comment 20

•

10 years ago

I think that the patch that just landed in Bug 966284 should fix this issue.

Nicolas Silva [:nical]

Assignee

Updated

•

10 years ago

Flags: needinfo?(bent.mozilla)

J. Ryan Stinnett [:jryans] (Use needinfo, replies may be slow)

Comment 21

•

10 years ago

I agree, this seems fixed here now as well.

Assignee: nobody → nical.bugzilla

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Ben Turner (not reading bugmail, use the needinfo flag!)

Updated

•

10 years ago

Attachment #8497867 - Flags: review?(bent.mozilla)

John Nagle

Comment 22

•

9 years ago

May be related: 

https://forums.mozilla.org/viewtopic.php?f=27&t=20755&p=43536#p43536

It's now possible to get this error from the new "sdk/ui" module in the add-on SDK.