Content processes never shutdown in some cases

RESOLVED FIXED in Firefox 28

Status

()

defect
RESOLVED FIXED
6 years ago
6 years ago

People

(Reporter: billm, Assigned: billm)

Tracking

unspecified
mozilla29
x86_64
Linux
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite ?

Firefox Tracking Flags

(firefox27 wontfix, firefox28 fixed, firefox29 fixed, b2g-v1.2 fixed, b2g-v1.3 fixed, b2g-v1.4 fixed)

Details

Attachments

(1 attachment)

Posted patch mac-fixSplinter Review
I've been running into the following problem. When we start a content process, we get the following call stack:

XRE_InitChildProcess
  http://mxr.mozilla.org/mozilla-central/source/toolkit/xre/nsEmbedFunctions.cpp#516
MessagePumpForChildProcess::Run
  http://mxr.mozilla.org/mozilla-central/source/ipc/glue/MessagePump.cpp#251
XRE_RunAppShell
  http://mxr.mozilla.org/mozilla-central/source/toolkit/xre/nsEmbedFunctions.cpp#635

When XRE_RunAppShell runs, it does some initial processing of messages. This initial processing was added in bug 625060. Afterwards it enters the normal event loop via appShell->Run().

The problem I'm seeing is that we're shutting down the content process (via XRE_ShutdownChildProcess) during the initial message processing, before nsAppShell::Run happens. XRE_ShutdownChildProcess prematurely exits the message loop and also calls nsAppShell::Exit. However, since we haven't called Run yet, the Exit has no effect. When we call Run, we run forever and the child process never exits. I'm seeing this as a shutdown timeout on Tinderbox because the main process waits for all child processes to exit on shutdown. I suspect this bug may be causing some of the Mac shutdown timeouts in bug 918759, but that's just a guess.

The simplest fix seems to be to exit early from nsAppShell::Run if nsAppShell::Exit has already been called. However, this code seems pretty tricky, so I'm asking for two reviews.
Attachment #8349737 - Flags: review?(smichaud)
Attachment #8349737 - Flags: review?(bent.mozilla)
Comment on attachment 8349737 [details] [diff] [review]
mac-fix

This looks fine to me.

But, as you say, appshell code is complicated -- not just in itself but because of accretions elsewhere (like the patch for bug 625060).  So we'll need to watch out for problems, and be prepared to back this patch out if it causes trouble.
Attachment #8349737 - Flags: review?(smichaud) → review+
Hrm, what's the stack when we call XRE_ShutdownChildProcess before XRE_RunAppShell? It sounds fishy to me that we're trying to shut down so quickly...
Flags: needinfo?(wmccloskey)
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #2)
> Hrm, what's the stack when we call XRE_ShutdownChildProcess before
> XRE_RunAppShell? It sounds fishy to me that we're trying to shut down so
> quickly...

I'm afraid I don't have a copy of the stack, and it's hard to reproduce. However, XRE_ShutdownChildProcess is not called before XRE_RunAppShell. It's call during the initial preprocessing stage of XRE_RunAppShell, before appShell->Run(). To put it another way, the stack trace for XRE_ShutdownChildProcess would contain this line:

http://mxr.mozilla.org/mozilla-central/source/toolkit/xre/nsEmbedFunctions.cpp#674
Flags: needinfo?(wmccloskey)
Comment on attachment 8349737 [details] [diff] [review]
mac-fix

Review of attachment 8349737 [details] [diff] [review]:
-----------------------------------------------------------------

Ok, let's give this a try.
Attachment #8349737 - Flags: review?(bent.mozilla) → review+
https://hg.mozilla.org/mozilla-central/rev/0d955b377e77
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
Comment on attachment 8349737 [details] [diff] [review]
mac-fix

It looks like this fixed bug 960605, so I think we should backport it to reduce intermittent failures on aurora.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): ?
User impact if declined: none
Testing completed (on m-c, etc.): On m-c.
Risk to taking this patch (and alternatives if risky): Very low, since the code deals with a feature (content processes) that is rarely used in practice.
String or IDL/UUID changes made by this patch: None
Attachment #8349737 - Flags: approval-mozilla-aurora?
Attachment #8349737 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Does this have or need tests?
Flags: in-testsuite?
Bill, would this be reasonably safe to backport to b2g26 too? These hangs bite us there too and I would love to get this landed there if the risk is low. Not sure what the risk to non-desktop builds is.
Flags: needinfo?(wmccloskey)
I think it would be fine to backport it. It's a Mac-specific fix, so it couldn't possibly affect anything on b2g itself. It would just be helping us to improve testing.
Flags: needinfo?(wmccloskey)
Comment on attachment 8349737 [details] [diff] [review]
mac-fix

NOTE: This flag is now for security issues only. Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): ?
User impact if declined: sheriff pain
Testing completed: on m-c and aurora without any known issues
Risk to taking this patch (and alternatives if risky): None, this is NPOTB for B2G as it only affects OSX (the patch is to a Cocoa file, not shared platform code). Only the OSX builds we run in automation on the branch will see any effect from this.
String or UUID changes made by this patch: None
Attachment #8349737 - Flags: approval-mozilla-b2g26?
Comment on attachment 8349737 [details] [diff] [review]
mac-fix

bug 951908 for b2g26
2:16 PM <RyanVM|sheriffduty> the reasoning for that is that it's OSX-only code
2:16 PM <RyanVM|sheriffduty> it's completely NPOTB for b2g
2:16 PM <RyanVM|sheriffduty> but fixes a really annoying orange on tbpl
2:16 PM <RyanVM|sheriffduty> but that code isn't even compiled on b2g, so there's zero risk to taking it there
Attachment #8349737 - Flags: approval-mozilla-b2g26? → approval-mozilla-b2g26+
You need to log in before you can comment on or make changes to this bug.