Closed Bug 1188056 Opened 9 years ago Closed 2 years ago

ContentParent and child actors not being destroyed upon child process crash

Categories

(Core :: DOM: Content Processes, defect, P3)

defect

Tracking

()

RESOLVED INCOMPLETE
Tracking Status
e10s + ---

People

(Reporter: bas.schouten, Unassigned)

References

Details

We're regularly creating ContentParents, seemingly for live thumbnail drawing so even with e10s switched off. It appears that when the child process for these dies we never end up destroying some of the ContentParents (and their associated child actors). Even when we are destroying them some seem to be destroyed through P*Parent::OnChannelError() messages so it seems even when we are shutting down it's not going in the way we wanted to.

Over the course of a session this seems to cause us to accumulate a collection of ContentParent objects and for example also some ImageBridgeParent objects which in turn are responsible for bug 1127270. This is bad.

Any suggestions as to what's going on or for who else might be helpful in looking at this will be appreciated!
Component: IPC → DOM: Content Processes
This is usually caused by a leak elsewhere.  A GC/CC log is the starting point to investigate that.
So it -appears- at the moment that this is occurring when my content process that's being created:

First-chance exception at 0x77140A31 (ntdll.dll) in plugin-container.exe: 0xC0000710: A threadpool worker thread is impersonating a client, after a callback to 0x%p(0x%p). This is unexpected, indicating that the callback is missing a call to revert the impersonation (parameters: 0x770D4200, 0x0124A158).
Unhandled exception at 0x77140A31 (ntdll.dll) in plugin-container.exe: 0xC0000710: A threadpool worker thread is impersonating a client, after a callback to 0x%p(0x%p). This is unexpected, indicating that the callback is missing a call to revert the impersonation (parameters: 0x770D4200, 0x0124A158).

The child process then dies somewhat spectacularly and on this crash it appears the protocols in the parent process never get cleaned up. This exception may very well somehow be debug build only, I'll try to verify this.
I've seen cases where we fail to handle FailedConstructor correctly before.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #3)
> I've seen cases where we fail to handle FailedConstructor correctly before.

Is there any information I can get for you here?
> First-chance exception at 0x77140A31 (ntdll.dll) in plugin-container.exe:
> 0xC0000710: A threadpool worker thread is impersonating a client, after a
> callback to 0x%p(0x%p). This is unexpected, indicating that the callback is
> missing a call to revert the impersonation (parameters: 0x770D4200,
> 0x0124A158).
> Unhandled exception at 0x77140A31 (ntdll.dll) in plugin-container.exe:


I ran into this on Win10 a couple weeks ago in a debug build, are you testing on 10? At the time I chalked it up to pre-release Windows bugs as I wasn't able to reproduce on win7. That said I think Arron added some thread pool code recently, he might be interested in this. Overall though I doubt it has anything to do with the compositor shutdown problems.
(In reply to Jim Mathies [:jimm] from comment #5)
> > First-chance exception at 0x77140A31 (ntdll.dll) in plugin-container.exe:
> > 0xC0000710: A threadpool worker thread is impersonating a client, after a
> > callback to 0x%p(0x%p). This is unexpected, indicating that the callback is
> > missing a call to revert the impersonation (parameters: 0x770D4200,
> > 0x0124A158).
> > Unhandled exception at 0x77140A31 (ntdll.dll) in plugin-container.exe:
> 
> 
> I ran into this on Win10 a couple weeks ago in a debug build, are you
> testing on 10? At the time I chalked it up to pre-release Windows bugs as I
> wasn't able to reproduce on win7. That said I think Arron added some thread
> pool code recently, he might be interested in this. Overall though I doubt
> it has anything to do with the compositor shutdown problems.

No, I'm running on Windows 8.1. I've seen this exception before as well, I feel I traced it back to something at some point. But I recall it not being very important for our users in the wild. I'm really more interested in us not cleaning up after a content process crash properly fairly consistently, since this would for one mean the shutdown will hang but it also means we'd be leaking resources.
I've also manually bypassed the exception, and I've added a crash in the content process manually during drawing to see what would happen if the crash occurred at another point in time, and I observed similar consequences, essentially the ContentParent and all associated objects did not get cleaned up.
Summary: ContentParent and child actors not being destroyed upon child process destruction → ContentParent and child actors not being destroyed upon child process crash
(In reply to Bas Schouten (:bas.schouten) from comment #4)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #3)
> > I've seen cases where we fail to handle FailedConstructor correctly before.
> 
> Is there any information I can get for you here?

Comment 1 ...
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #8)
> (In reply to Bas Schouten (:bas.schouten) from comment #4)
> > (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #3)
> > > I've seen cases where we fail to handle FailedConstructor correctly before.
> > 
> > Is there any information I can get for you here?
> 
> Comment 1 ...

Do you think this is actually caused by a leak rather than just the shutdown not running for some other reason (I don't know much about how IPDL shuts things down)? What's the best way of generating a log that has the information you think will be useful?
This probably isn't an IPDL issue. All IPDL handles is calling ContentParent::ActorDestroy when the content process crashes. Its job is to actually clean up. However, since ContentParent is a refcounted class, other outstanding refs can stop that from happening.

As far as I know, TabParent has the main ref to ContentParent, so the first thing to check is probably that all those are getting destroyed properly. They also get torn down by IPDL, through here:
http://mxr.mozilla.org/mozilla-central/source/dom/ipc/nsIContentParent.cpp#133

If there are other refs, then you'll probably need to do some kind of leak logging to figure it out.
(In reply to Bas Schouten (:bas.schouten) from comment #9)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #8)
> > (In reply to Bas Schouten (:bas.schouten) from comment #4)
> > > (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #3)
> > > > I've seen cases where we fail to handle FailedConstructor correctly before.
> > > 
> > > Is there any information I can get for you here?
> > 
> > Comment 1 ...
> 
> Do you think this is actually caused by a leak rather than just the shutdown
> not running for some other reason (I don't know much about how IPDL shuts
> things down)? What's the best way of generating a log that has the
> information you think will be useful?

My experience with b2g is that these issues are virtually always caused by JS not properly cleaning up after mozbrowsererror is fired.  Go to about:memory and get the verbose GC/CC logs.
(In reply to Bill McCloskey (:billm) from comment #10)
> This probably isn't an IPDL issue. All IPDL handles is calling
> ContentParent::ActorDestroy when the content process crashes. Its job is to
> actually clean up. However, since ContentParent is a refcounted class, other
> outstanding refs can stop that from happening.
> 
> As far as I know, TabParent has the main ref to ContentParent, so the first
> thing to check is probably that all those are getting destroyed properly.
> They also get torn down by IPDL, through here:
> http://mxr.mozilla.org/mozilla-central/source/dom/ipc/nsIContentParent.
> cpp#133
> 
> If there are other refs, then you'll probably need to do some kind of leak
> logging to figure it out.

Are we using TabParent even when we're just creating thumbnails in the background for the newtab screen?
(In reply to Bas Schouten (:bas.schouten) from comment #12)
> Are we using TabParent even when we're just creating thumbnails in the
> background for the newtab screen?

Yes. There's one TabParent for every remote <browser> element, which is what we use for background thumbnail generation.
Blocks: e10s-multi
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #11)
> My experience with b2g is that these issues are virtually always caused by
> JS not properly cleaning up after mozbrowsererror is fired.  Go to
> about:memory and get the verbose GC/CC logs.

khuey: Here's a verbose GC/CC log:

https://1drv.ms/u/s!AsfWgS0rNKaKyWpiqCXtBFb4U47n

I'd be most interested in getting this fixed, as I hit it every time I try to run e10s on my main Win10 work machine.
Flags: needinfo?(khuey)
I was hitting this error if I have Notepad++ installed and set as the Windows shell default file type handler for .js files in our call to IApplicationAssociationRegistration::QueryCurrentDefault here:
https://dxr.mozilla.org/mozilla-central/source/uriloader/exthandler/win/nsOSHelperAppService.cpp#565

This is for assocType=".js".

I set the default file type handler for ".js" to (Windows' built in) Notepad, and the exception no longer throws. However, once I set the default file type handler back to Notepad++, the exception *still* no longer throws.

So at least I can work again. But now I can't verify whatever fix we do here helps. :(
(In reply to Chris Pearce (:cpearce) from comment #14)
> I'd be most interested in getting this fixed, as I hit it every time I try
> to run e10s on my main Win10 work machine.

Is there a particular reason this is a big deal for you?  It's generally just a small memory leak, and not terribly concerning.
Flags: needinfo?(khuey) → needinfo?(cpearce)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #16)
> Is there a particular reason this is a big deal for you?  It's generally
> just a small memory leak, and not terribly concerning.

I was getting the exception thrown dialog with the same message as listed in comment #2 and then a crash in the content process, which made it impossible to run a (locally built) e10s build. Now that I've unset and reset my OS default *.js editor, the exception doesn't throw, and everything works as expected. Which is weird.
Flags: needinfo?(cpearce)
Priority: -- → P3

Not clear what was going on here. If somebody is still having some issues, we could look at a new bug.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.