Deadlock in mozilla::ipc::GeckoChildProcessHost::LaunchAndWaitForProcessHandle after turning on new tab page thumbnails

RESOLVED WORKSFORME

Status

()

defect
--
critical
RESOLVED WORKSFORME
6 years ago
4 years ago

People

(Reporter: MattN, Assigned: adw)

Tracking

(Depends on 1 bug, {hang})

Trunk
x86_64
macOS
Points:
---
Dependency tree / graph
Bug Flags:
firefox-backlog +

Firefox Tracking Flags

(e10s+)

Details

(Whiteboard: p=0)

Attachments

(3 attachments, 2 obsolete attachments)

Steps I took in my everyday UX Nightly profile:
1) Open the new tab page with thumbnails turned off. I have had them off for many weeks.
2) Click the grid to turn on thumbnails

Expected result:
Thumbnails start loading from the background thumbnail process

Actual result:
Browser hangs/deadlocks

I haven't yet tried to reproduce this.

Some notes which may be relevant:
* I don't see a plugin process started for thumbnailing
* I had 2 plugin processes already running (Shockware Flash and Google Talk Plugin)
* Built from http://hg.mozilla.org/projects/ux/rev/eba98a54c087
* OS X 10.7.5

The following was in the OS X console:
8/7/13 6:51:04.912 PM [0x0-0x591591].org.mozilla.ux: [Parent 54023] WARNING: parent WaitForMessage() failed: 0x10004003 (ipc/rcv) timed out: file ../../../../ipc/glue/GeckoChildProcessHost.cpp, line 670
Not technically a "crash", but still worth blocking bug 899758 for.
Blocks: 899758
I suspect this is the Mac variant of bug 902790 (although that doesn't explain why the child process creation failed, but possibly does explain the hang after it fails)
Bug 902790 has fixed the issue for Windows/Linux - this bug remains to resolve a hang on Mac in the same situation.

The fix is to arrange so the block at http://mxr.mozilla.org/mozilla-central/source/ipc/glue/GeckoChildProcessHost.cpp#662 is only entered if |process| is non-zero, indicating the child was successfully created.
Assignee: nobody → adw
Status: NEW → ASSIGNED
As Mark says, base::LaunchApp's success is checked once, right after the #if defined(OS_POSIX) section.  The false branch of that #if exits immediately after calling LaunchApp and therefore ends up checking for success immediately, but the true branch does a bunch of stuff before reaching the check.  So this patch checks for success immediately within both branches.

This fixes the hang, but in a debug build it reveals another, because this compositor-related assertion is hit and then Firefox hangs: http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentParent.cpp#1454  Without that assertion and the one a few lines down, no more hang.

Ben, Chris flagged you once for feedback in the similar bug 902790, so could you please review or pass this along?
Attachment #817413 - Flags: review?(bent.mozilla)
Comment on attachment 817413 [details] [diff] [review]
check LaunchApp's success in both branches

Review of attachment 817413 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for tackling this. I'm not sure about your assertion problem with the compositor, we'll need to nail that down before this gets landed I think.

::: ipc/glue/GeckoChildProcessHost.cpp
@@ +663,5 @@
> +  if (!process) {
> +    MonitorAutoLock lock(mMonitor);
> +    mProcessState = PROCESS_ERROR;
> +    lock.Notify();
> +    return false;

I don't think you should bail out before calling CloseClientFileDescriptor below, that will probably leak.

But more specifically why not just make the MOZ_WIDGET_COCOA block check for a non-zero |process| rather than duplicate this code here? Then you'd get down to the common |mProcessState = PROCESS_ERROR;| below.

I'm also concerned about the other |return false| points in this MOZ_WIDGET_COCOA block... Shouldn't they set |mProcessState| too?
Attachment #817413 - Flags: review?(bent.mozilla) → review-
Thanks, Ben.

(In reply to ben turner [:bent] (needinfo? encouraged) from comment #6)
> Thanks for tackling this. I'm not sure about your assertion problem with the
> compositor, we'll need to nail that down before this gets landed I think.

OK.  According to the comment there, it's messaging the child to create the CompostitorChild, which of course fails when the child doesn't exist.

> I'm also concerned about the other |return false| points in this
> MOZ_WIDGET_COCOA block... Shouldn't they set |mProcessState| too?

I guess so?  Handling this part made the patch a little more complex, as opposed to simply wrapping the Cocoa block in a conditional and falling through to the common mProcessState check.  But certainly there are multiple ways to structure this, so I'm open to better alternatives than what's in this patch.
Attachment #817413 - Attachment is obsolete: true
Attachment #820081 - Flags: review?(bent.mozilla)
I filed bug 930838 for the failing ContentParent/CompositorChild assertions.
Comment on attachment 820081 [details] [diff] [review]
add SetErrorAndReturnFalse and use it in the Cocoa block

Review of attachment 820081 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/GeckoChildProcessHost.h
@@ +166,5 @@
>  #endif
>  
>    void OpenPrivilegedHandle(base::ProcessId aPid);
>  
> +  bool SetErrorAndReturnFalse();

This is cute, but I think it should just be a void function called SetErrorState.

Then let's move it into a private block. And then I think you should be able to inline it (might have to use 'mozilla::MonitorAutoLock' but whatever).

Then just return false everywhere ;)
Attachment #820081 - Flags: review?(bent.mozilla) → review+
Thanks!  Thanks to Mark, too, for pointing out the fix.

https://tbpl.mozilla.org/?tree=Try&rev=13b28bcfd91d
Attachment #820081 - Attachment is obsolete: true
Keywords: checkin-needed
Sigh, that's the compositor assertion being revealed.

What would have happened on those runs without this patch?  Would they have hung or crashed?  There are bugs concerning the two tests where the assertion was triggered, all of them being timeouts or "exited with code -20" on OS X:

test_browserElement_oop_ContextmenuEvents.html: bug 835927, bug 930449
test_browserElement_oop_CloseApp.html: bug 820412, bug 925200

I suspect that if we had run the same runs without the patch, failures on the same machines would have happened, but they would have been chalked up to these bugs.
Looks like bug 930838 is going to block this from landing, so marking the dependency.
Depends on: 930838
Whiteboard: [triage]
Whiteboard: [triage]
Whiteboard: p=0
Mass tracking-e10s flag change. Filter bugmail on "2be0fcce-e36a-4e2c-aa80-0e3d33eb5406".
tracking-e10s: --- → +
(hangs are sev=critical)
Severity: major → critical
No longer blocks: fxdesktopbacklog
Flags: firefox-backlog+
Hi Matthew,

I cannot reproduce this on latest Aurora build (46.0a2-20160204004009) and latest Nightly build (47.0a1-20160204030229). I have set the "browser.newtabpage.enabled" to false, opened multiple pages on the browser and then re-enabled the preference. No hangs occurred while doing this and the previous visited pages were displayed in tiles.

I have test this on Mac OS X 10.7 (iMac) and 10.10 (MBP Retina 13").

Considering this, I am marking the issue as Resolved-Worksforme. If anyone can still reproduce this, feel free to reopen it providing more information.


Thanks,
Paul.
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.