Allow ContentChild to perform tasks on shutdown

RESOLVED FIXED in mozilla37



5 years ago
4 years ago


(Reporter: jchen, Assigned: jchen)


Dependency tree / graph

Firefox Tracking Flags

(firefox37 wontfix, firefox38 wontfix)



(7 attachments, 8 obsolete attachments)

3.08 KB, patch
: review+
Details | Diff | Splinter Review
6.46 KB, patch
: review+
Details | Diff | Splinter Review
5.22 KB, patch
: review+
Details | Diff | Splinter Review
16.16 KB, patch
: review+
Details | Diff | Splinter Review
2.75 KB, patch
: review+
: feedback+
Details | Diff | Splinter Review
2.85 KB, patch
: review+
Details | Diff | Splinter Review
1.24 KB, patch
: review+
Details | Diff | Splinter Review
We want ContentChild to send over telemetry data to ContentParent on shutdown, because ContentChild can't save the telemetry data itself.

Right now each TabChild is notified of destruction through the Destroy call, but I don't see a way for ContentChild as a whole to be notified of process shutdown -- ContentProcess shuts down by closing the IPC channel, so ContentChild only knows about it after the channel is closed. At that point, it's too late to send telemetry data.
Posted patch WIP Patch (v1) (obsolete) — Splinter Review
Comment on attachment 8526926 [details] [diff] [review]
WIP Patch (v1)

Is this a good approach? This patch adds a Shutdown call from ContentParent to ContentChild, and now ContentChild is responsible for closing the channel on shutdown instead of ContentParent. This way, ContentChild can send over telemetry data before it closes the channel.
Attachment #8526926 - Flags: feedback?(wmccloskey)
Attachment #8526926 - Flags: feedback?(bent.mozilla)
Comment on attachment 8526926 [details] [diff] [review]
WIP Patch (v1)

Review of attachment 8526926 [details] [diff] [review]:

This looks good to me. There is one part that worries me. I was curious why the parent waits for the child to die. The shutdown sequence is pretty synchronous, so it seems like the parent should just blaze ahead, not waiting for the child. The following code explains why it waits:
This code is called from here:
It runs a nested event loop that waits for the compositor thread to shut down. The compositor thread shuts down when the CrossProcessCompositorParent dies, which happens when the child closes the IPC channel and the parent cleans up all remaining actors.

Mainly I'm just concerned that we're relying on this graphics code in order for the telemetry data to be sent back properly. As long as we're running a nested event loop in XPCOM shutdown, it seems like we might as well as a separate nested event loop to wait for the channel to be closed. At least that way we have an explicit place in the code where we know we're waiting for the child process to die.

::: dom/ipc/ContentChild.cpp
@@ +2431,5 @@
> +bool
> +ContentChild::RecvShutdown()
> +{
> +    // Close the IPC channel and let both the parent and child shut down.
> +    Close();

When we talked about this in person, I said that Close() won't work here because any data sent before it will be dropped. I was wrong; the data will be sent and received correctly. So I think this part of the patch is fine.
Attachment #8526926 - Flags: feedback?(wmccloskey) → feedback+
Posted patch jchen.patch (obsolete) — Splinter Review
Hi Jim. I hope you don't mind but I went ahead and made the changes I asked for. I think we need something like this for bug 1103279 and I wanted to test it out. Andrew is going to push this to try with some of his own patches to see if it fixes some issues with leak checking. If that works, we can proceed with review (probably from bent or bsmedberg).
try run with the patch from comment 4, plus a few test harness changes that enabled e10s content process leak checking:
Sorry, I was just about to post a new patch.

Turns out letting the child close the IPC channel doesn't really work. The problem is that after we send the shutdown message, the child can close the IPC channel at any time, even while the parent is still trying to use the IPC channel for things like the message manager. That resulted in some test failures. We can't shutdown the message manager early either, because the child depends on it to send over telemetry data.

I think the best way is to let the child call the parent back. The parent still first calls Shutdown() in the child. After the child finishes its tasks, it then calls FinishShutdown() in the parent. The parent can then close the channel like before. While the parent has an outstanding Shutdown() call, it waits for the FinishShutdown() call to arrive in a nested event loop. Also, we already have a timeout mechanism for shutdown [1], so I think we can just rely on that. What do you think?

Attachment #8526926 - Attachment is obsolete: true
Attachment #8526926 - Flags: feedback?(bent.mozilla)
There's a use-after-free crash in nsFrameMessageManager::DispatchAsyncMessageInternal in B2G ICS Emulator opt M5, during test_NuwaProcessCreation.html
This patch moves the nested event loop to the xpcom-shutdown observer. It also delays closing the channel in ContentParent::RecvFinishShutdown to see if it fixes the B2G crash. I'll push to try once it reopens.
Attachment #8534649 - Attachment is obsolete: true
It looks like the use after free is still happening.  There are a bunch of IsNuwaProcess() sorts of checks in ContentParent.  Maybe one is needed here, too?  A Nuwa person might have some ideas about what is going wrong.
Turns out Nuwa was calling ContentParent::Close directly, which I didn't account for. This patch passes the failing test on try,
Attachment #8535251 - Attachment is obsolete: true
The Linux debug bc3 failure in my try push looks like it could also be related to your patch.

The M1 failures shouldn't show up without my leak harness changes.  It looks like there are a few cases when running browser element parent tests out of process where we're still attempting to send messages to the parent too late.

I did a separate try push with my patch in bug 1091766 that makes us not exit(0) immediately when a fail to send a message to the parent, but then I'm seeing the same failure I saw before where there's a constructor for actor failed abort in some networking test, when we have some half-loaded process running after the parent shuts down. I think billm said he'd look at that failure a bit.
Comment on attachment 8534585 [details] [diff] [review]

I think billm's modified version of the patch isn't needed now.
Attachment #8534585 - Attachment is obsolete: true
(In reply to Andrew McCreight [:mccr8] (away-ish Dec 17-26) from comment #13)
> The Linux debug bc3 failure in my try push looks like it could also be
> related to your patch.
> <snip>

Did you try using my new patch? I think my try run with the new patch didn't show those failures,
Oh, great.  I just wanted to point those out in case you'd only done the B2G run.
Hey Bill, would you be able to review this patch? I tested that it works locally, and try run looked okay.
Attachment #8537279 - Attachment is obsolete: true
Attachment #8537489 - Flags: review?(wmccloskey)
I'm wondering why you do CLOSE_CHANNEL instead of SEND_SHUTDOWN_MESSAGE in the xpcom-shutdown observer.  I think this is what was causing some problems for leak checking.  We create some processes late in shutdown, and they end up doing the old bad way of shutting down, which causes problems.  Though that in turn causes another assertion at shutdown.  Let me see if I can reproduce the problem with the newest version of your patch.
This is what I was doing, though it seems to just hang now when I'm testing it with the new version of the patch. :(  I'm not sure when I'll have time to look into it next, but I can file a followup bug as needed to address my shutdown needs.

The weird tab child change is needed because DelayedDeleteRunnable was getting posted, then ContentChild went through the new message back and forth which caused the subtree destroy to be called on tabchild, then the runnable was trying to do a send delete on the tabchild.
I see. I used CLOSE_CHANNEL to preserve current behavior for xpcom-shutdown. Normally, we've already started shutting down by the time we get to xpcom-shutdown, but I can see how that might not be the case when we're running tests.
Sorry again this took me so long. I ended up playing around with this patch, so I'll just post my review comments as an additional patch.
Attachment #8537489 - Flags: review?(wmccloskey) → review+
Posted patch review-changes (obsolete) — Splinter Review
Here's my "review". The only substantial change is to the xpcom-shutdown behavior. I think we should always try to send the Shutdown message and then wait for the child to die. I pushed that change to try and it seems to work. This will help Andrew's leak checking work and it also makes things more consistent I think.
Attachment #8537543 - Attachment is obsolete: true
Attachment #8543419 - Flags: review+
This fixes an existing bug I noticed while messing around with the patch here. Namely, we create a runnable that sends the __delete__ message for TabChild. However, this runnable can happen to run after TabChild::ActorDestroy, in which case bad things will happen. (It won't happen after ~TabChild since it keeps a ref to the TabChild.)
Attachment #8543421 - Flags: review?(bent.mozilla)
Posted patch timer-changesSplinter Review
I also noticed a pretty disturbing issue while testing this patch. I wanted to make sure that a hung child process won't cause shutdown to hang forever. So I stuck a sleep(30) call inside TabChild::RecvShutdown. In theory, NotifyTabDestroying sets a timer that should kill the child after 5 seconds.

That doesn't work though. We get to the nested event loop that this patch adds to wait for the child to shutdown. However, for some reason the timer doesn't trigger while we're in that nested event loop. If I switch it to an XPCOM timer instead of a chromium timer, then it runs. I think there may be a bug in MessagePump that's causing this behavior since afaik know the chromium timer should work. Please correct me if that's wrong, Ben.

Anyway, I'd like to work around this for now by switching to an XPCOM timer. I'm going to investigate the MessagePump issue now though. It seems pretty serious.
Attachment #8543426 - Flags: review?(bent.mozilla)
I looked at the MessagePump issue in comment 24 and it seems to be sort of a corner case. We're creating a timer that's supposed to expire after 5 seconds, but then we cancel it here:
The reason is that we're exiting the main Gecko event loop.

Besides this particular timeout, I'm not sure how likely it is that we have chromium delayed tasks that should still run when we're processing events during shutdown. It doesn't seem very likely to me, so I think the fix in comment 24 may be the right way to go.
The original version of this had a really stupid error, so flagging Jim to make sure he looks this over too :-).
Attachment #8543419 - Attachment is obsolete: true
Attachment #8543448 - Flags: review?(nchen)
Comment on attachment 8543448 [details] [diff] [review]
review-changes v2

Review of attachment 8543448 [details] [diff] [review]:


::: dom/ipc/ContentChild.cpp
@@ +2498,5 @@
>      }
> +
> +    // Ignore errors here. If this fails, the parent will kill us after a
> +    // timeout.
> +    unused << SendFinishShutdown();

Is there a reason you removed the PostTask call? I don't think we actually need it, but I added it originally just in case some shutdown task posts something to the event loop.
Attachment #8543448 - Flags: review?(nchen) → review+
Comment on attachment 8543426 [details] [diff] [review]

Review of attachment 8543426 [details] [diff] [review]:

::: dom/ipc/ContentParent.cpp
@@ -1908,4 @@
>      int32_t timeoutSecs =
>          Preferences::GetInt("dom.ipc.tabs.shutdownTimeoutSecs", 5);
>      if (timeoutSecs > 0) {
> -        MessageLoop::current()->PostDelayedTask(

Ugh, we should make this method crash somehow if called from non-chromium code... We've had so many bugs because of it :(

@@ +1909,5 @@
>          Preferences::GetInt("dom.ipc.tabs.shutdownTimeoutSecs", 5);
>      if (timeoutSecs > 0) {
> +        mForceKillTimer = do_CreateInstance(";1");
> +        if (!mForceKillTimer)
> +            return;

Returning seems not very future-proof (someone could add code below and not notice it?). Can't we assert that mForceKillTimer always succeeds? I don;t think this can be called after XPCOM has shut down, right?

@@ +1973,5 @@
>  {
>      mSubprocess = nullptr;
>      mChildID = gContentChildID++;
>      mGeolocationWatchID = -1;
> +    mForceKillTimer = nullptr;

This isn't needed, nsCOMPtr default-initializes to null.

@@ +3146,5 @@
> +/* static */ void
> +ContentParent::ForceKillTimerCallback(nsITimer* aTimer, void* aClosure)
> +{
> +    ContentParent* self = static_cast<ContentParent*>(aClosure);

Nit: Use |auto*| here :)

::: dom/ipc/ContentParent.h
@@ +20,5 @@
>  #include "nsFrameMessageManager.h"
>  #include "nsHashKeys.h"
>  #include "nsIObserver.h"
>  #include "nsIThreadInternal.h"
> +#include "nsITimer.h"

Nit: I'd just forward-declare nsITimer, then #include it in the cpp.
Attachment #8543426 - Flags: review?(bent.mozilla) → review+
Comment on attachment 8543421 [details] [diff] [review]

Review of attachment 8543421 [details] [diff] [review]:

::: dom/ipc/TabChild.h
@@ +498,5 @@
>      already_AddRefed<nsIWidget> CreatePluginWidget(nsIWidget* aParent);
>      nsIntPoint GetChromeDisplacement() { return mChromeDisp; };
> +    bool IPCOpen() { return mIPCOpen; }

Hm, ContentParent does this too, but calls everything IsAlive()... Kind of an unfortunate name, and I'm not sure I like it, but maybe copy it? Dunno, up to you.
Attachment #8543421 - Flags: review?(bent.mozilla) → review+
(In reply to Jim Chen [:jchen] from comment #27)
> Is there a reason you removed the PostTask call? I don't think we actually
> need it, but I added it originally just in case some shutdown task posts
> something to the event loop.

There wasn't really any reason. You can keep it if it makes sense.
> > -        MessageLoop::current()->PostDelayedTask(
> Ugh, we should make this method crash somehow if called from non-chromium code... We've had
> so many bugs because of it :(

What other problems does it cause besides the issue in comment 25?

> > +    bool IPCOpen() { return mIPCOpen; }
> Hm, ContentParent does this too, but calls everything IsAlive()... Kind of an unfortunate
> name, and I'm not sure I like it, but maybe copy it? Dunno, up to you.

I'd rather keep it IPCOpen since we use that name a lot in the networking and graphics code for the same purpose. IsAlive is used in a few places in DOM code for different things.
Rebased version of my patch. Bill, can you land this along with your additional patches once they're ready? Thanks!
Attachment #8537489 - Attachment is obsolete: true
Attachment #8544839 - Flags: review+
I did a try push for the patches in this bug:
For some reason my previous try push hit an xpcshell issue on Mac. That push had some other patches, though, so hopefully this one will be green.
Backed out in for the b2g xpcshell timeout in (no idea whether the total failure to deal with a timeout after the timeout is just the normal way that tests on b2g suck, or also fallout from this).
That's too bad. I'd assumed it was some sort of infra bustage because of the download failure.

Jim suggested a try push without the timer changes.
I found out that waiting for shutdown in xpcom-shutdown is too late for receiving telemetry data because telemetry data collection happens during profile-before-change, before xpcom-shutdown.

This patch adds a profile-before-change observer to ContentParent, in addition to xpcom-shutdown. For either observer, we initiate shutdown if needed and wait for any pending shutdown to complete. I'm not sure if xpcom-shutdown is really needed, but I guess it's a last resort just in case we don't have a profile for some reason.

To simplify things, I made ShutDownProcess so that we can call it multiple times and only have it send the shutdown message once.
Attachment #8545507 - Flags: review?(wmccloskey)
Comment on attachment 8545507 [details] [diff] [review]
Wait for ContentChild shutdown during profile-before-change (v1)

Review of attachment 8545507 [details] [diff] [review]:

::: dom/ipc/ContentParent.cpp
@@ +2728,5 @@
>                         const char* aTopic,
>                         const char16_t* aData)
>  {
> +    if (mSubprocess && (!strcmp(aTopic, "xpcom-shutdown") ||
> +                        !strcmp(aTopic, "profile-before-change"))) {

I think it would be much better if we just do this based on profile-before-change (and not xpcom-shutdown). I asked Fabrice whether profile-before-change fires on b2g in the parent and he said he thinks it does. I'm asking him for feedback because he wanted to check to make sure. If that's true, let's remove the xpcom-shutdown portion.
Attachment #8545507 - Flags: review?(wmccloskey)
Attachment #8545507 - Flags: review+
Attachment #8545507 - Flags: feedback?(fabrice)
We used to just close the channel in the shutdown observer, but now that we always send the shudown message and wait for shutdown, we should start the kill timer in the shutdown observer too if needed.

I will push to try when the tree reopens. I hope this will fix the b2g failures. If it does, that means there's something going on in those b2g tests that causes the child to hang; these hangs didn't show up before because the parent didn't wait on the child. Relying on the kill timer simply hides the underlying issue, but actually fixing the issue should be in a separate bug.
Attachment #8546134 - Flags: review?(wmccloskey)
Comment on attachment 8546134 [details] [diff] [review]
Use kill timer when shutting down ContentParent in shutdown observer (v1)

Review of attachment 8546134 [details] [diff] [review]:

::: dom/ipc/ContentParent.cpp
@@ +1504,5 @@
>      if (aMethod == SEND_SHUTDOWN_MESSAGE) {
>          if (mIPCOpen && !mShutdownPending && SendShutdown()) {
>              mShutdownPending = true;
> +            // Start the force-kill timer if we haven't already.
> +            StartForceKillTimer();

It seems like the timer ought to be running anyway since we've probably closed a tab very recently during shutdown. But it's worth a try.
Attachment #8546134 - Flags: review?(wmccloskey) → review+
Almost there! Only b2g M11 failure left. BTW I had to keep the xpcom-shutdown observer, otherwise there were failures in Linux debug tests; go figure.
Attachment #8545507 - Flags: feedback?(fabrice) → feedback+
This fixes the B2G M11 failure. So in summary, to fix the test failures, we need:

* observers for both xpcom-shutdown and profile-before-change
** xpcom-shutdown observer prevents linux test failures (I suspect some tests are run without a profile / with the profile unloaded early somehow)
** profile-before-change is needed for telemetry

* start kill timer during xpcom-shutdown/profile-before-change
** needed to fix b2g xpcshell failure

* don't send shutdown message for nuwa processes in xpcom-shutdown/profile-before-change
** needed to fix b2g m11 failure
Attachment #8547082 - Flags: review?(wmccloskey)
Comment on attachment 8547082 [details] [diff] [review]
Don't send shutdown message to Nuwa processes (v1)

OK. I'm still curious about the profile-before-change issue, but that shouldn't stop you from landing.
Attachment #8547082 - Flags: review?(wmccloskey) → review+
You need to log in before you can comment on or make changes to this bug.