Closed Bug 930838 Opened 11 years ago Closed 2 years ago

ContentParent should gracefully handle CompositorChild creation failure

Categories

(Core :: DOM: Content Processes, defect)

defect
Not set
normal

Tracking

()

RESOLVED INVALID
Tracking Status
e10s later ---

People

(Reporter: adw, Unassigned)

References

Details

Attachments

(2 files)

The ContentParent ctor MOZ_ASSERTs that PCompositor::Open successfully returns [1].  On a debug build, when that assertion fails, or the assertion a few lines below it fails, the browser hangs and then crashes.  On a non-debug build or when those assertions are commented out, the browser seems to go on about its business OK, but the presence of the assertions would indicate that the failures aren't gracefully handled.

On a Mac, you can reproduce the debug-build hang by applying the patch in bug 902755 and moving plugin-container somewhere where Firefox can't find it.  On other platforms, moving plugin-container might be sufficient, I don't know.  I found this bug while working on bug 902755, which fixes another hang but reveals this one.

Filing in Core::Widget because that's where bug 745148, which added the assertion, is filed.  Apologies if it's the wrong component.

[1] http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentParent.cpp#1454
Component: Widget → IPC
Blocks: 902755
IMO we should just remove the assertion if fixing it is hard and it's holding bug 902755 from landing.
Do we have a legitimate use case for handling this failure? Can this actually occur in a reasonable user scenario?

If we're running with e10s then we require this to work properly to run. If we're talking about an unlikely case I think a runtime assert would be more appropriate then a debug assert.

How do you suggest we handle this?

(In reply to :Gavin Sharp (email gavin@gavinsharp.com) from comment #1)
> IMO we should just remove the assertion if fixing it is hard and it's
> holding bug 902755 from landing.

I don't think trying to run with this code failing is meaningful so I would suggest against this change.
From the linked bug, there are clearly cases where this happens in practice (including on our test infra). It seems fine to fail catastrophically, but it seems like we shouldn't assert that we're not failing until we have a better understanding of why we are.
This does what Gavin suggests and just removes the assertions.
Attachment #827790 - Flags: review?(bgirard)
(In reply to :Gavin Sharp (email gavin@gavinsharp.com) from comment #1)
> IMO we should just remove the assertion if fixing it is hard and it's
> holding bug 902755 from landing.

Until we understand why we're failing (or understand the consequences of running without a compositor when we think we should have one) it seems more sane to actually turn these assertions into MOZ_CRASH or something.
(In reply to :Gavin Sharp (email gavin@gavinsharp.com) from comment #3)
> From the linked bug, there are clearly cases where this happens in practice
> (including on our test infra).

The initial description points to bugs and suggest that we need to make plugin-container inaccessible.

(Similar to :bent) For me to r+ this you have to convince me that without hitting this assertion we're going to fallback into running into a meaningful configuration otherwise crashing is the right thing. Moreover you have to convince me that the fallback configuration is one we want to maintain and support. From looking at the context it doesn't appear that is the case.
Attachment #827790 - Flags: review?(bgirard)
test_browserElement_oop_CloseApp.html seems to be the problematic test.  Sometimes the failure occurs in test_browserElement_oop_ContextmenuEvents.html instead, but test_browserElement_oop_ContextmenuEvents.html runs right after test_browserElement_oop_CloseApp.html.

This warning is hit during test_browserElement_oop_CloseApp.html:

> 23:06:47     INFO -  [Parent 1245] WARNING: Unable to use pre-allocated app process: file ../../../dom/ipc/ContentParent.cpp, line 746

Right after that, it appears that the child process exits, although there are at least two other children in the logs during this time.  Also after that, a new ContentParent is instantiated, and I think that's the instantiation that hits the failing assertion in the ctor.

Before the assertion is hit, there's this:

> 23:06:47     INFO -  [Parent 1245] WARNING: parent WaitForMessage() failed: 0x10004003 (ipc/rcv) timed out: file ../../../ipc/glue/GeckoChildProcessHost.cpp, line 682

It would be great if someone who knows this particular code or just knows more about Gecko IPC than I do could look at what happens during this test.  I've already sunk some time into it but it's really beyond my ken.
When somebody finds out how to fix this, the test in bug 924771 can probably be re-enabled, since it seems to have the same problem.
Very much a shot in the dark, but any chance this is related to bug 919878?
(In reply to :Gavin Sharp (email gavin@gavinsharp.com) from comment #9)
> Very much a shot in the dark, but any chance this is related to bug 919878?

My first guess is that it's probably unrelated specifically, but the same basic problem - that the tests might be killing a child before it's been fully initialized on both sides.  I'll see if I can repro and get any extra info.
file_browserElement_CloseApp.html calls window.close() outside of a load listener, so given Mark's comment 10, I thought that might be the problem.  But the test (test_browserElement_oop_CloseApp.html) never gets that far before hitting the assertion.  It's hit early on, here:

> 59   document.body.appendChild(appFrame);

http://mxr.mozilla.org/mozilla-central/source/dom/browser-element/mochitest/browserElement_CloseApp.js#59
(In reply to Mark Hammond [:markh] from comment #10)
> My first guess is that ...the tests might be killing a child before it's been
> fully initialized on both sides.

Clearly that's wrong - the problem here is when we fail to create the child in the first place.  While I can't repro the assertion on Windows, I can repro the fact we blindly carry on even when the child process can't be created.  I'm attaching a patch that handles this - the ::Init() method is now a boolean, and many places now check the result of this before using the returned process.  Not *all* places do this, specifically, the creation of the nuwa "template" process, nor the children based on this template, but that might be OK for now - the failure scenario for those processes should remain the same as it was before.

With this patch, there's no need to remove assertions re the compositor etc - we just bail before even attempting to do that.

Requesting feedback from adw to ensure it solves the initially reported problem, and benwa to ensure he has no objections to this approach.
Attachment #8335644 - Flags: feedback?(bgirard)
Attachment #8335644 - Flags: feedback?(adw)
Comment on attachment 8335644 [details] [diff] [review]
Gracefully handle failure to create a content process.

In order to apply this patch I had to pull and rebuild, and without the patch, the assertion isn't printed to the terminal anymore, but the browser still hangs.  The patch doesn't change that behavior unfortunately.  I'll take a closer look at what's going on.
Attachment #8335644 - Flags: feedback?(adw)
Comment on attachment 8335644 [details] [diff] [review]
Gracefully handle failure to create a content process.

Clearing feedback as adw says it doesn't help him.
Attachment #8335644 - Flags: feedback?(bgirard)
Whiteboard: p=0
Mass tracking-e10s flag change. Filter bugmail on "2be0fcce-e36a-4e2c-aa80-0e3d33eb5406".
tracking-e10s: --- → +
No longer blocks: fxdesktopbacklog
Whiteboard: p=0
Component: IPC → DOM: Content Processes

PCompositor::Open does not exist anymore nor do I see something similar in today's ContentParent ctor. What's more, the tests cited in comment 7 do not exist anymore, too.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: