Closed Bug 1035454 Opened 5 years ago Closed 5 years ago

Child DOMApplicationRegistry never sends "Webapps:UnregisterForMessages" to parent, causes parent leak and crashes debug content processes

Categories

(Core Graveyard :: DOM: Apps, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED
mozilla36

People

(Reporter: bent.mozilla, Assigned: bent.mozilla)

References

(Depends on 1 open bug)

Details

Attachments

(5 files)

The DOMApplicationRegistry in AppsServiceChild.jsm tries to send "Webapps:UnregisterForMessages" in response to the "xpcom-shutdown" notification. This doesn't ever succeed for two reasons:

1. We never run XPCOM shutdown in non-DEBUG builds. See http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentChild.cpp#1367

2. In DEBUG builds the "xpcom-shutdown" notification happens after the connection to the parent process has already been severed. Any attempt to send a message to the parent will fail and trigger a child process crash.

We need to make sure that any cleanup is listening to the proper notification in the parent process rather than relying on messages from the child. The child must be expected to die at any time.
Ben, we listen to child-process-shutdown that triggers a cleanup of all the listeners associated to this child from http://mxr.mozilla.org/mozilla-central/source/dom/apps/src/Webapps.jsm#1140

So is that enough to just remove the useless code in AppsServiceChild.jsm ?
Like this?
Assignee: fabrice → bent.mozilla
Status: NEW → ASSIGNED
Attachment #8468616 - Flags: review?(fabrice)
Attachment #8468616 - Flags: review?(fabrice) → review+
Ben, I just happened to see this. I think the "leak" this got backed out for is pre-existing. Previously we always crashed before getting a chance to report leaks in the child. I think it probably makes sense to disable leak checking for child processes until we can get the leaks fixed.
Sorry, forgot to comment here about the backout yesterday.
https://hg.mozilla.org/integration/mozilla-inbound/rev/e13073de5179

I was seeing a lot of B2G debug crashes as well, so please give this a |try: -b d -p all -u all -t none| run before re-landing.
(In reply to Bill McCloskey (:billm) from comment #5)
> I think it probably makes sense to disable leak
> checking for child processes until we can get the leaks fixed.

Is there a bug/patch for this yet?
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #7)
> (In reply to Bill McCloskey (:billm) from comment #5)
> > I think it probably makes sense to disable leak
> > checking for child processes until we can get the leaks fixed.
> 
> Is there a bug/patch for this yet?

Filed bug 1052224.
Depends on: 1052224
I have a patch to disable leak checking for child processes, but even with that, there is some kind of shutdown assertion failure in GMP in M3:
  https://tbpl.mozilla.org/?tree=Try&rev=0ff144b4e0d8
It looks like it's getting an XPCOM thread shutdown notification without getting a profile-change-teardown notification first. Maybe that doesn't happen in the child?
(In reply to Bill McCloskey (:billm) from comment #10)
> It looks like it's getting an XPCOM thread shutdown notification without
> getting a profile-change-teardown notification first. Maybe that doesn't
> happen in the child?

It looks like the profile-change-teardown notification does not go out, because mProfileNotified is false.  That is set to "true" in nsXREDirProvider::DoStartup(), which seems to be only called once, maybe from  XREMain::XRE_mainRun().
Should the GeckoMediaPluginService even be running in the child process?
(In reply to Andrew McCreight [:mccr8] from comment #12)
> Should the GeckoMediaPluginService even be running in the child process?

Ultimately we display the video in the child, so I would expect so. I wonder how this will interact with sandboxing of the child process though. The child isn't supposed to be able to create new processes--it will have to ask the parent to do that for it.
Randell, would you mind taking a look at this starting at comment 9 (or ni? the right person)?
Flags: needinfo?(rjesup)
Georg, Chris - this is about GMP shutdown in e10s (and also will be about the interaction with sandboxing, I presume).

bsmedberg: should profile-change-teardown occur in e10s child processes?  I presume so...
Flags: needinfo?(rjesup)
Flags: needinfo?(georg.fritzsche)
Flags: needinfo?(cpearce)
Flags: needinfo?(benjamin)
No, content processes don't "have" a profile and so they don't receive any profile-shutdown notifications. But content processes shouldn't be launching GMP children anyway: they should be children of the parent process and subsequently connected to whatever content process(es) need them.
Flags: needinfo?(benjamin)
OK, we'll need to route the GMPService usage in child processes through the master process before we can enable e10s. The GMPService needs shutdown notifications.
Flags: needinfo?(cpearce)
I filed bug 1057908 to track making the GeckoMediaPluginService work in e10s.
Flags: needinfo?(georg.fritzsche)
Depends on: 1058903
Depends on: 1065139
Green try run, with the assertion work around patch in bug 1058903:
  https://tbpl.mozilla.org/?tree=Try&rev=929b76bb71f5
No longer depends on: 1065139
Ben, are you okay with me landing this patch once I get enough stuff fixed up so that it will be green?
Flags: needinfo?(bent.mozilla)
Blocks: 1015466
Yep!
Flags: needinfo?(bent.mozilla)
For the Android debug crash, the test it is reporting crashes ( dom/base/test/test_appname_override.html ) looks pretty innocuous and does not involve a child process as far as I can see.  However, looking at /dev/log/main below, it seems that there is a child process being started up by dom/audiochannel/tests/test_telephonyPolicy.html, which sets dom.ipc.browser_frames.oop_by_default to true.  I guess that's why it runs out of process.

Then there's a crash in the crash reporter in CrashReporter::OnChildProcessDumpRequested, as it hits an assert where it looks like it is trying and failing to get a minidump for the child process.  From asking in #developers, child content processes are not expected to work on Android, so I think the path forward is to disable this test on Android.  I'll file a bug for that.  I have no idea why this worked before.  (Then there's a crash in ElfLoader, but I'll just assume for now that is due to the crash reporter crashing.)
Depends on: 1066212
minidump_stackwalk fails on the B2G emulator crashes, so I have no idea what is going on there.  The next step there is probably to get a B2G emulator build going with some real stacks.  Presumably something else is failing in a horrible way at shutdown.  To get this working on desktop at least, we could add some kind of quick exit for B2G, though that would require figuring out precisely where to put it.
Depends on: 1066318
I have a patch to work around the mysterious B2G shutdown crash by calling exit(0) at the end of XPCOM shutdown on B2G.  That works for B2G, but the patch seems to have a near permaorange on XP M3 debug:
  https://tbpl.mozilla.org/?tree=Try&rev=27546760d4a3

(The Android M5 orange is something that looks like a frequent problem on trunk, so hopefully the patch for that will fix this too.)
INFO -  291 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | Unexpected callback for 'INTERNAL_ERROR' with message = 'Cannot start media channels cause = OK' at ["PCW_setLocalDescription@http://mochi.test:8888/tests/dom/media/tests/mochitest/pc.js:1841:1","PCT_setLocalDescription@http://mochi.test:8888/tests/dom/media/tests/mochitest/pc.js:746:3","commandsPeerConnection<@http://mochi.test:8888/tests/dom/media/tests/mochitest/templates.js:305:1","_executeNext@http://mochi.test:8888/tests/dom/media/tests/mochitest/pc.js:103:9",""] - expected PASS

Randell, do you know what this means?
Flags: needinfo?(rjesup)
No longer depends on: 1066318
Depends on: 1073310
Whiteboard: idempotent
ugh, meant to hit the bugzilla search field. :-/
Whiteboard: idempotent
Depends on: 1073597
Flags: needinfo?(rjesup)
The small amount we leak when a content process starts up a GMP process seems to be increasing a little bit, so we need the extra headroom.  I mostly only care about much larger leaks (100kb+) that involve windows and documents so I think this is fine for right now.
Attachment #8502804 - Flags: review?(khuey)
Comment on attachment 8502830 [details] [diff] [review]
part 2 - Paper over late shutdown crashes in debug B2G content processes by exiting XPCOM early.

I'm not sure what happened with this name.

Anyways, this is what happens without this patch on B2G:
  https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=78e727657e9b

The context here is that right now we're not getting content process shutdown logs, because we crash in shutdown, so I'm trying to get things at least to the point where we shut down cleanly on Linux, where we currently have e10s tests enabled, so we can start detecting leak regressions.  We've had one or two in the last few weeks.
Attachment #8502830 - Attachment description: This patch adds an exit(0) in XPCOMShutdown() on B2G and Windows, in content processes. This is the latest point where we avoid the WebRTC intermittent failure. For B2G, it can be pushed all the way to the end of XPCOMShutdown(), but I combined the two fo → Paper over late shutdown crashes in debug B2G content processes by exiting XPCOM early.
Attachment #8502830 - Flags: review?(benjamin)
Comment on attachment 8502830 [details] [diff] [review]
part 2 - Paper over late shutdown crashes in debug B2G content processes by exiting XPCOM early.

How does this help? IIRC, if you exit(0) here you *still* won't get leak logs because the leak logs are produced from the final call to NS_LogTerm.
Flags: needinfo?(continuation)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #33)
> How does this help? IIRC, if you exit(0) here you *still* won't get leak
> logs because the leak logs are produced from the final call to NS_LogTerm.

Yeah, this is pretty weird.  Maybe I should break it down a bit.

Currently on trunk, in tab processes, we exit(0) during the xpcom-shutdown notification.  bent's patch in this bug fixes that, which is enough for us to (sometimes) cleanly shut down child processes on Linux and Linux64, giving us leak logs.

However, bent's patch can't land by itself, because with it we hit different failures on B2G and Windows from things that happen later in shutdown.  This patch hacks around that by adding an exit(0) later than we currently run it, but before we hit whatever shutdown badness.  This is gross, but in combination with bent's patch we'll be running strictly more shutdown code.

So, with the patches here, we still will not get leak logs on Windows or B2G, but we will get them (sometimes) on Linux and Linux64.

(Also, I meant to mention that the exit(0) for B2G can actually go at the very end of ShutdownXPCOM, as the failure we are hitting there seems to be later, when we're shutting down the Chromium IPC stuff.  I just put it in at the place Windows needs for simplicity. But I can split it out if you think that is better.)
Flags: needinfo?(continuation)
Comment on attachment 8502830 [details] [diff] [review]
part 2 - Paper over late shutdown crashes in debug B2G content processes by exiting XPCOM early.

In the comment here, please indicate that this exit is intended to be temporary. That is the thing I think was missing when I read it the first time.
Attachment #8502830 - Flags: review?(benjamin) → review+
For now, I'm just going to disable this test on e10s.  I filed bug 1035454 for investigating it.
Attachment #8504341 - Flags: review?(jst)
Attachment #8504341 - Flags: review?(jst) → review+
The patch in bug 1081415 fixes a bunch of e10s M1 shutdown leaks.
Depends on: 1081415
Originally when I was testing, I had the exit(0) on B2G at the end of ShutdownXPCOM, but I decided to just move it up to where it needed to be for Windows for simplicity.

However, for some reason this leads to test failures on some B2G tests:
  https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=9a84615c976d
Specifically, we seem to hit an assert after or during the call to exit(0):
18:07:09 INFO - 10-14 00:57:12.125 I/Gecko ( 779): [Child 779] WARNING: Exiting child process early!: file ../../../gecko/xpcom/build/XPCOMInit.cpp, line 961
18:07:09 INFO - 10-14 00:57:12.135 F/MOZ_Assert( 779): Assertion failure: isEmpty(), at ../../dist/include/mozilla/LinkedList.h:305 

There's no stack for that, so I don't know what is happening, but moving the exit(0) back to the very end of ShutdownXPCOM seems to fix it, so that's what this patch does:
  https://tbpl.mozilla.org/?tree=Try&rev=2af1506dd53e

I'll land this folded into the other ShutdownXPCOM patch.
Attachment #8505511 - Flags: review?(benjamin)
Attachment #8505511 - Flags: review?(benjamin) → review+
all platform all tests try run: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=9a84615c976d

M-e10s failures are leaks fixed by bug 1081415:
  https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=1d69baf3ddd3

B2G emulator opt M3 failures are due to a frequent existing intermittent orange.

B2G emulator debug failures are fixed by part 2b:
  https://tbpl.mozilla.org/?tree=Try&rev=2af1506dd53e

I think that should be everything.  I'm planning to land this when the tree reopens.
Attachment #8468616 - Attachment description: Fix, v1 → part 4 - Don't send 'Webapps:UnregisterForMessages' at xpcom-shutdown
Attachment #8502804 - Attachment description: Increase the leak threshold for tab processes. → part 1 - Increase the leak threshold for tab processes.
Attachment #8502830 - Attachment description: Paper over late shutdown crashes in debug B2G content processes by exiting XPCOM early. → part 2 - Paper over late shutdown crashes in debug B2G content processes by exiting XPCOM early.
Attachment #8504341 - Attachment description: Disable test_bug345339.html on e10s for leaking. → part 3 - Disable test_bug345339.html on e10s for leaking.
Summary: Child DOMApplicationRegistry never sends "Webapps:UnregisterForMessages" to parent, causes parent leak → Child DOMApplicationRegistry never sends "Webapps:UnregisterForMessages" to parent, causes parent leak and crashes debug content processes
No longer depends on: 1073310
If this has to get backed out for weird intermittent failures, probably just backing out out the last part should be enough.
Depends on: 1083664
I'm pretty sure bug 1083664 was caused by this, which is a debug OSX 10.6-only intermittent assertion failure at shutdown.  In the short term, I can probably work around this by adding another exit(0) somewhere, for OSX...
A few pushes after I landed this, leak checking in M3 stopped working intermittently.  Some number of pushes after that, something started leaking.  So, for now, I'm bumping up the leak threshold to 2MB so we can open the tree again...

https://hg.mozilla.org/integration/mozilla-inbound/rev/a0c0f98c8482
Blocks: 1242119
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.