Closed Bug 1066812 Opened 5 years ago Closed 5 years ago

e10s: child process continues running if parent process crashes

Categories

(Core :: IPC, defect)

x86_64
All
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla39
Tracking Status
e10s m5+ ---
firefox39 --- fixed

People

(Reporter: luke, Assigned: mconley)

References

Details

Attachments

(4 files, 3 obsolete files)

I was experimenting with a content-caused hang (while(true){} in a script) when dom.max_script_run_time was accidentally set to 0.  This hangs the browser so I killed the browser.  However, after the parent process was dead, the child Web Content process continued at 100% cpu.  I'm not an expert on process control, but I would hope that we could set up the processes so that, if the parent process dies, all the children do as well.  I think maybe process groups allow you to do this?
tracking-e10s: --- → ?
So I expect that this is peculiar to the case of the content process ilooping:

In ContentChild::ActorDestroy we QuickExit when the parent goes away, but that method is only called on the main thread on an event. http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentChild.cpp#1557

So if the main thread is ilooping or deadlocked, we won't get there. Currently the MessageListener API only has callbacks on the main thread, but there is MessageChannel::mAbortOnError and SetAbortOnError which the plugin code uses to hard-abort. The content code could probably use something similar.
Assignee: nobody → ally
Status: NEW → ASSIGNED
Assignee: ally → nobody
Status: ASSIGNED → NEW
Assignee: nobody → mconley
Attached patch WIP Patch v1 (obsolete) — Splinter Review
So, it might actually be this simple. At least, from my understanding, this fixes the issue.

My test case was to run a while(true){} in some web content, and then kill the parent process, and observe the child process. Without my patch, plugin-container just kept running. With my patch, it self-destructed after a few seconds.

This is a [WIP] still because:

1) I'm suspicious about it actually being this simple, so I'm asking for feedback
2) It looks like SetAbortOnError takes a boolean argument, but doesn't do anything with it. If it's all the same to everybody here, I wouldn't mind fixing that while I'm here (probably in a separate patch though).
Attachment #8560012 - Flags: feedback?(wmccloskey)
/r/3523 - Bug 1066812 - Make it possible for a remote process to stop aborting on error messages. r=?
/r/3525 - Bug 1066812 - Hard abort ContentChild if communications with parent process have broken down. r=?

Pull down these commits:

hg pull review -r f6acb026c490a4df9084e14136f6bb9a75f52960
Attachment #8560818 - Flags: review?(wmccloskey)
Attachment #8560012 - Attachment is obsolete: true
Attachment #8560012 - Flags: feedback?(wmccloskey)
Comment on attachment 8560818 [details]
MozReview Request: bz://1066812/mconley

https://reviewboard.mozilla.org/r/3521/#review2807

Ship It!
Attachment #8560818 - Flags: review?(wmccloskey) → review+
Mike, is there any way to make the comment I wrote show up in bugzilla? I don't like the idea that review comments end up in a different system.
(In reply to Bill McCloskey (:billm) from comment #5)
> Mike, is there any way to make the comment I wrote show up in bugzilla? I
> don't like the idea that review comments end up in a different system.

Which comment? I don't see a comment in RB or Bugzilla, just the "Ship It"...

and I already landed. :/

remote:   https://hg.mozilla.org/integration/fx-team/rev/8ecfe45191b6
remote:   https://hg.mozilla.org/integration/fx-team/rev/1613eb430cec

Did you have a review comment that you didn't publish?
https://reviewboard.mozilla.org/r/3523/#review2805

::: ipc/glue/MessageChannel.h
(Diff revision 1)
> -        mAbortOnError = true;
> +        mAbortOnError = abort;

Just adding a comment to test our review board.
I didn't actually have anything to comment. I didn't realize you need to individually publish each comment.
Depends on: 1131221
Backed out for causing the B2G crashes shown in bug 1066812. Also, please push core Gecko patches to mozilla-inbound in the future to avoid potential merge conflicts.
https://hg.mozilla.org/integration/fx-team/rev/0207e9b9ae79
No longer depends on: 1131221
So I suspect in most (if not all cases) here, our test harness doesn't like the fact that the child process is taking itself down when the connection to the parent dies, because we leave crash stacks behind.

Any suggestions on what the right behaviour is here? Like, we definitely want to kill the child process when the parent goes down, but should we put a timer on it? Like, if we don't go down cleanly within 10 seconds of a connection error, kill self.

Does that sound reasonable, billm? Are there other people I should be clearing this idea with?
Flags: needinfo?(wmccloskey)
There's some way to indicate that you are intentionally crashing, which keeps the test harness from freaking out.
Probably mozilla::NoteIntentionalCrash().
In theory, all of these problems are bugs. It would be nice to understand why they happen. Since these appear to be hard to reproduce, it will be difficult.

The parent is supposed to wait for all its children to die before it shuts down. If the children take too long, the parent is supposed to kill them. Either way, the child should never find itself in a state where the parent is dead.

I'll try to take a look at the code a bit tomorrow. If I don't find anything, then maybe we can think about workarounds.
So I did a try push with some logging of process shutdown, and here's what I suspect.

The ContentParent receives the command to shutdown, so it sends a message to the child process to shut down, and starts its internal timer to self destruct (and take down children).

The ContentChild receives the message to shut down, and sends the message to the parent saying that it has finished shutting down (I believe the child is still actually alive at this point).

The parent receives that message from the child, and re-enters ShutDownProcess, this time closing the channel to the child. This calls the ActorDestroy in the ContentParent, and sends a message down the child to call ActorDestroy in the child.

In the parent process, ActorDestroy eventually loops through the list of processes created by the ContentParent, and queues Runnables to destroy them.

This is the critical part - it looks like at this point in the game (at least for the Windows XP Debug case that I reproduced on try), the childIDArray.Length() is 0, meaning that the ContentParent has no notion of child processes being opened that it should queue for killing. So we skip over that bit.

The parent finishes shutting down.

The child, meanwhile, is still very much alive, and I suppose it attempts to communicate with the parent, and this triggers the abort / crash that my code added.

So I think I need to figure out why the parent doesn't have the ID of the child process in childIDArray.
Talked to Mike on IRC. We think maybe turning AbortOnError off in ContentChild::ActorDestroy might fix the problem. Only try will tell.
Flags: needinfo?(wmccloskey)
Sadly, at least for Windows XP Debug, turning AbortOnError off in ContentChild::ActorDestroy did not help.

Try build: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ac9964c3be33
*sigh*, I swear I wrote a comment and needinfo'd in here. Well, second time's the charm.

I'm focusing entirely upon the failure for Windows XP Debug. What strikes me as odd is that this is not for an e10s-enabled mochitest run, but for a standard mochitest run. So it seems like there must be some tests running in non-e10s mochitest runs that open content processes.

The second thing I note is that despite printing with printf during the whole process take down procedure, I don't see any of that logging from the content process.

Here's the logging patch: https://hg.mozilla.org/try/rev/55ea3576e164

And here's the log: https://hg.mozilla.org/try/rev/55ea3576e164

Notably, the logging I put in before the ContentChild sends up the FinishShutdown message to the parent is not in the log, despite the fact that we clearly receive that FinishShutdown message and log that in the parent.

What's more, NS_RUNTIMEABORT is _supposed_ to log information when it brings down the process, and there's no such information in the log.

I wonder if perhaps our testing infrastructure is not prepared to capture the stdout from other processes when not doing an e10s mochitest run?
Hey Armen - remember that problem I pinged you about on IRC yesterday? I think it might still be a problem...

Is it possible that we're not capturing the stdout from content processes that get opened by the main browser process in non-e10s mochitest runs?
Flags: needinfo?(armenzg)
I'm working today only 1/2 day.
jlund could also help you look into this if it can't wait until tomorrow.
Are you talking about any of the jobs in here?
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ac9964c3be33

Do we run e10s for XP?
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=e10s

It seems that we run non-e10s and e10s jobs side by side for Win7 and Win8 but not WinXp:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=Windows

In any case, you could run mozharness locally to see if it would show up for you:
https://wiki.mozilla.org/ReleaseEngineering/Mozharness/How_to_run_tests_as_a_developer
I have added support for developers to run things locally so let me know if it does not work for you.
Try to fetch the try build and tests rather than trying to make it work with your local build.

I will be at the office tomorrow if you would like to look farther into together.
Flags: needinfo?(armenzg)
Depends on: 1134741
So I've figured out what's going wrong here - the content sandbox is not letting the content process print anything out via stdout. My printf's are just getting denied.

Disabling the sandbox via MOZ_DISABLE_CONTENT_SANDBOX=1 helps - I get my logging out there... but the logging is getting jumbled in with the output from the parent process, so it becomes a huge mess.

bobowen - jimm said that you're the go-to fella for sandbox stuff... what can I do to get reliable, orderly logging from the content process on Windows XP?
Flags: needinfo?(bobowen.code)
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #22)
> So I've figured out what's going wrong here - the content sandbox is not
> letting the content process print anything out via stdout. My printf's are
> just getting denied.
> 
> Disabling the sandbox via MOZ_DISABLE_CONTENT_SANDBOX=1 helps - I get my
> logging out there... but the logging is getting jumbled in with the output
> from the parent process, so it becomes a huge mess.
> 
> bobowen - jimm said that you're the go-to fella for sandbox stuff... what
> can I do to get reliable, orderly logging from the content process on
> Windows XP?

We don't inherit handles by default pre-Vista, because there is no way to limit which handles get inherited.
Vista and afterwards allow you to limit the handles inherited to a given list.

The e10s tests should normally get logging because we added an environment variable (MOZ_WIN_INHERIT_STD_HANDLES_PRE_VISTA=1) to allow handles to be inherited anyway pre-Vista for testing.

I think for standard e10s process launch the logging works for debug builds (as you saw) because the processes are marked as console apps and therefore inherit the parent's console by default.
However, the sandboxed process launch sets the DETACHED_PROCESS flag which stops this.

Over the order of logging, I'm not sure we can do much about that.

For the actual subject of this bug I might have better news (at least on Windows) ...
Once we can start using a separate job object for the sandboxed content process the first flag that gets set is JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, which should mean that when the last handle to the job is closed the job gets terminated.
So, assuming those handles are only held by the parent process we would be OK.
Not sure how it would handle cyclical handle holding among multiple child processes though (if that could happen).

I think that we will be able to do this (set a job level of JOB_UNPROTECTED) once bug 1057908 is fixed.
Flags: needinfo?(bobowen.code)
So I was finally able to get an IPC log for just before the child process aborts:

06:17:42 INFO - [time:1424960261985000][2456->260][PContentChild] Sending Msg_PNeckoConstructor([TODO])
06:17:42 INFO - [time:1424960262125000][2456->260][PContentChild] Sending Msg_GetXPCOMProcessAttributes([TODO])
06:17:42 INFO - [time:1424960262141000][2456<-260][PContentChild] Received reply Reply_GetXPCOMProcessAttributes([TODO])
06:17:42 INFO - [time:1424960262141000][2456->260][PContentChild] Sending Msg_PHalConstructor([TODO])
06:17:42 INFO - [time:1424960262141000][2456->260][PHalChild] Sending Msg_EnableSystemTimezoneChangeNotifications([TODO])
06:17:42 INFO - [time:1424960262141000][2456->260][PHalChild] Sending Msg_EnableWakeLockNotifications([TODO])
06:17:42 INFO - [time:1424960262141000][2456->676][PCompositorChild] Sending Msg_GetTileSize([TODO])
06:17:42 INFO - [time:1424960262141000][2456<-676][PCompositorChild] Received reply Reply_GetTileSize([TODO])
06:17:42 INFO - [time:1424960262156000][2456->260][PCrashReporterChild] Sending Msg_AnnotateCrashReport([TODO])
06:17:42 INFO - [time:1424960262156000][2456->260][PCrashReporterChild] Sending Msg_AnnotateCrashReport([TODO])
06:17:42 INFO - [time:1424960262156000][2456->260][PCrashReporterChild] Sending Msg_AnnotateCrashReport([TODO])
06:17:42 INFO - [time:1424960262156000][2456->260][PCrashReporterChild] Sending Msg_AnnotateCrashReport([TODO])
06:17:42 INFO - [time:1424960262156000][2456->260][PCrashReporterChild] Sending Msg_AppendAppNotes([TODO])
06:17:42 INFO - [time:1424960262156000][2456->260][PContentChild] Sending Msg_GetGraphicsFeatureStatus([TODO])
06:17:42 INFO - [time:1424960262172000][2456<-260][PContentChild] Received reply Reply_GetGraphicsFeatureStatus([TODO])
06:17:42 INFO - [time:1424960262172000][2456->260][PContentChild] Sending Msg_GetGraphicsFeatureStatus([TODO])
06:17:42 INFO - [time:1424960262203000][2456<-260][PContentChild] Received reply Reply_GetGraphicsFeatureStatus([TODO])
06:17:42 INFO - [time:1424960262203000][2456->260][PContentChild] Sending Msg_GetGraphicsFeatureStatus([TODO])
06:17:42 INFO - [time:1424960262219000][2456<-260][PContentChild] Received reply Reply_GetGraphicsFeatureStatus([TODO])
06:17:42 INFO - [time:1424960262219000][2456<-260][PContentChild] Received Msg_RegisterChrome([TODO])
06:17:42 INFO - [time:1424960262234000][2456<-260][PContentChild] Received Msg_AppInfo([TODO])
06:17:42 INFO - [time:1424960262234000][2456<-260][PContentChild] Received Msg_LoadAndRegisterSheet([TODO])
06:17:42 INFO - [time:1424960262234000][2456<-260][PContentChild] Received Msg_LoadAndRegisterSheet([TODO])
06:17:42 INFO - [time:1424960262234000][2456<-260][PContentChild] Received Msg_PBrowserConstructor([TODO])
06:17:42 INFO - [time:1424960262250000][2456->260][PContentChild] Sending Msg_AsyncMessage([TODO])
06:17:42 INFO - [time:1424960262250000][2456->260][PContentChild] Sending Msg_AsyncMessage([TODO])
06:17:42 INFO - [time:1424960262250000][2456->260][PContentChild] Sending Msg_SyncMessage([TODO])
06:17:42 INFO - [time:1424960262281000][2456<-260][PContentChild] Received reply Reply_SyncMessage([TODO])
06:17:42 INFO - [time:1424960262296000][2456<-260][PBrowserChild] Received Msg_PRenderFrameConstructor([TODO])
06:17:42 INFO - [time:1424960262296000][2456<-260][PBrowserChild] Received Msg_Show([TODO])
06:17:42 INFO - [time:1424960262296000][2456->260][PBrowserChild] Sending Msg_NotifyIMEFocus([TODO])
06:17:42 INFO - [time:1424960262296000][2456<-260][PBrowserChild] Received reply Reply_NotifyIMEFocus([TODO])
06:17:42 INFO - [time:1424960262296000][2456->676][PCompositorChild] Sending Msg_PLayerTransactionConstructor([TODO])
06:17:42 INFO - [time:1424960262296000][2456<-676][PCompositorChild] Received reply Reply_PLayerTransactionConstructor([TODO])
06:17:42 INFO - [time:1424960262312000][2456<-260][PBrowserChild] Received Msg_LoadRemoteScript([TODO])
06:17:42 INFO - [time:1424960262328000][2456<-260][PBrowserChild] Received Msg_LoadRemoteScript([TODO])
06:17:42 INFO - [time:1424960262328000][2456<-260][PBrowserChild] Received Msg_LoadRemoteScript([TODO])
06:17:42 INFO - [time:1424960262343000][2456<-260][PBrowserChild] Received Msg_LoadRemoteScript([TODO])
06:17:42 INFO - [time:1424960262359000][2456<-260][PBrowserChild] Received Msg_LoadRemoteScript([TODO])
06:17:42 INFO - [time:1424960262406000][2456<-260][PBrowserChild] Received Msg_LoadRemoteScript([TODO])
06:17:42 INFO - [time:1424960262406000][2456<-260][PBrowserChild] Received Msg_LoadURL([TODO])
06:17:42 INFO - [time:1424960262406000][2456->260][PContentChild] Sending Msg_ReadPermissions([TODO])
06:17:42 INFO - [time:1424960262483000][2456<-260][PContentChild] Received reply Reply_ReadPermissions([TODO])
06:17:42 INFO - [time:1424960262499000][2456->260][PContentChild] Sending Msg_PScreenManagerConstructor([TODO])
06:17:42 INFO - [time:1424960262561000][2456<-260][PContentChild] Received reply Reply_PScreenManagerConstructor([TODO])
06:17:42 INFO - [time:1424960262577000][2456->260][PContentChild] Sending Msg_PStorageConstructor([TODO])
06:17:42 INFO - [time:1424960262592000][2456->260][PBrowserChild] Sending Msg_SetStatus([TODO])
06:17:42 INFO - [time:1424960262592000][2456->260][PCrashReporterChild] Sending Msg_AnnotateCrashReport([TODO])
06:17:42 INFO - [time:1424960262592000][2456<-260][PBrowserChild] Received Msg_LoadRemoteScript([TODO])
06:17:42 INFO - [time:1424960262608000][2456<-260][PBrowserChild] Received Msg_AsyncMessage([TODO])
06:17:42 INFO - [time:1424960262608000][2456->260][PNeckoChild] Sending Msg_PHttpChannelConstructor([TODO])
06:17:42 INFO - [time:1424960262608000][2456<-260][PBrowserChild] Received Msg_AsyncMessage([TODO])
06:17:42 INFO - [time:1424960262608000][2456<-260][PBrowserChild] Received Msg_AsyncMessage([TODO])
06:17:42 INFO - [time:1424960262608000][2456<-260][PBrowserChild] Received Msg_AsyncMessage([TODO])
06:17:42 INFO - [time:1424960262608000][2456<-260][PBrowserChild] Received Msg_AsyncMessage([TODO])
06:17:42 INFO - [time:1424960262608000][2456<-260][PBrowserChild] Received Msg_AsyncMessage([TODO])
06:17:42 INFO - [time:1424960262608000][2456<-260][PBrowserChild] Received Msg_AsyncMessage([TODO])
06:17:42 INFO - [time:1424960262608000][2456<-260][PBrowserChild] Received Msg_AsyncMessage([TODO])
06:17:42 INFO - [time:1424960262608000][2456<-260][PBrowserChild] Received Msg_Destroy([TODO])
06:17:42 INFO - [time:1424960262608000][2456->260][PHttpChannelChild] Sending Msg_Cancel([TODO])
06:17:42 INFO - [time:1424960262608000][2456->676][PLayerTransactionChild] Sending Msg_Shutdown([TODO])
06:17:42 INFO - [time:1424960262608000][2456->260][PRenderFrameChild] Sending Msg___delete__([TODO])
06:17:42 INFO - [time:1424960262608000][2456<-260][PContentChild] Received Msg_AsyncMessage([TODO])
06:17:42 INFO - [time:1424960262608000][2456<-260][PContentChild] Received Msg_AsyncMessage([TODO])
06:17:42 INFO - [time:1424960262608000][2456<-260][PContentChild] Received Msg_AsyncMessage([TODO])
06:17:42 INFO - [time:1424960262608000][2456<-676][PLayerTransactionChild] Received Msg___delete__([TODO])
06:17:42 INFO - [time:1424960262608000][2456->260][PCrashReporterChild] Sending Msg_AnnotateCrashReport([TODO])
06:17:42 INFO - [time:1424960262608000][2456->260][PCrashReporterChild] Sending Msg_AppendAppNotes([TODO])
06:17:42 INFO - [time:1424960262608000][2456->260][PCrashReporterChild] Sending Msg_AppendAppNotes([TODO])
06:17:42 INFO - [time:1424960262608000][2456->260][PContentChild] Sending Msg_ScriptError([TODO])
06:17:42 INFO - [time:1424960262608000][2456->260][PContentChild] Sending Msg_ScriptError([TODO])
06:17:42 INFO - [time:1424960262608000][2456->260][PBrowserChild] Sending Msg___delete__([TODO])
06:17:42 INFO - [time:1424960262608000][2456->260][PContentChild] Sending Msg_FirstIdle([TODO])
06:17:42 INFO - [time:1424960262686000][2456<-260][PContentChild] Received Msg_AsyncMessage([TODO])
06:17:42 INFO - [time:1424960262686000][2456<-260][PStorageChild] Received Msg_ScopesHavingData([TODO])
06:17:42 INFO - [time:1424960262733000][2456<-260][PContentChild] Received Msg_Shutdown([TODO])
06:17:42 INFO - [time:1424960262733000][2456->260][PContentChild] Sending Msg_FinishShutdown([TODO])
06:17:42 INFO - [time:1424960262888000][2456<-260][PHttpChannelChild] Received Msg_OnStartRequest([TODO])
06:17:42 INFO - [Child 2456] ###!!! ABORT: Aborting on channel error.: file c:\builds\moz2_slave\try-w32-d-00000000000000000000\build\src\ipc\glue\MessageChannel.cpp, line 1585
06:17:42 INFO - essageChannel::SynchronouslyClose

Unfortunately, the logging doesn't seem to come out deterministically - the IPC logging seems to be emitted separately from the printf logging I put in, so I can't be 100% sure of how things are going down.

A few observations:

1) The IPC logging doesn't log things like the messages going across to destroy the ContentParent and ContentChild actors, so I can't really be sure where in this log of IPC messages that is occurring.

2) In successful runs of this test, the child receives OnStopRequest and then sends back a DocumentChannelCleanup message before shutting down:

06:17:42     INFO -  [time:1424960262686000][2096->260][PContentChild] Sending Msg_FinishShutdown([TODO])
06:17:42     INFO -  [time:1424960262701000][2096<-260][PHttpChannelChild] Received Msg_OnStartRequest([TODO])
06:17:42     INFO -  [time:1424960262701000][2096<-260][PHttpChannelChild] Received Msg_OnStopRequest([TODO])
06:17:42     INFO -  [time:1424960262701000][2096->260][PHttpChannelChild] Sending Msg_DocumentChannelCleanup([TODO])

These are missing in the failure case.

Hypothesis 1): parent is attempting to send down the OnStopRequest message, but at this point, the IPC channel is in a shutting-down state. This causes the IPC channel to flip out and send some kind of message to the child which tells it that a communications error occurred and that it should abort.

Hypothesis 2): parent sends OnStartRequest down to the child, and the child, knowing that it's being shut down already, attempts to send the DocumentChannelCleanup message. The channel to the parent, however, has closed, and so the child interprets this as a communications error and aborts. I think this is less likely since it appears as if DocumentChannelCleanup would only ever get sent if the child had received OnStopRequest. But I could be wrong about that.

Going to test hypothesis 1 and 2 with some instrumentation. billm, are there any other possibilities that might explain this orange that you can glean from the log?
Flags: needinfo?(wmccloskey)
Can you try putting SetAbortOnError(false) in ContentChild::RecvShutdown as well.

Also, how did you get this logging?
Flags: needinfo?(wmccloskey)
(In reply to Bill McCloskey (:billm) from comment #25)
> Can you try putting SetAbortOnError(false) in ContentChild::RecvShutdown as
> well.
> 

Sure - actually, that sounds like a fantastic idea. Going through my logs, it looks like every time, the content process receives the Shutdown message before it attempts to abort.

> Also, how did you get this logging?

In a most hack-y way. Attempting to log all IPC messages by setting MOZ_IPC_MESSAGE_LOG in the override mozconfig for the test run was causing the log files to get too massive, so my test runs were aborting. What I did instead was have the content process listen for pref changes from the parent - if it noticed a "mconley.ipc_logging" pref change, it'd immediately set the MOZ_IPC_MESSAGE_LOG=1 environment variable. The messaging thread would then pick up the environment variable.

I then had one of the mochitests where the failure occurs set that preference.

It's pretty rube-goldberg, but it got me what I wanted.
This will cause the content process to take itself down in the event that it
loses communication with the parent process. This case is particularly
important for the case where the parent process crashes while the content
process is blocked or busy on the main thread, as the content process will
no longer continue to exist as a zombie process, but will shut down after
a short delay.
Comment on attachment 8570756 [details] [diff] [review]
Set AbortOnError in the ContentProcess on initialization. r=?

That was a fantastic suggestion, billm. My try runs look good with that alteration:

(orange on ICS due to some bad casting on my part in my logging)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7a0ea19e1f1b

(proper ICS try push)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=703e701d7e2b
Attachment #8570756 - Flags: review?(wmccloskey)
OS: Linux → All
Comment on attachment 8570756 [details] [diff] [review]
Set AbortOnError in the ContentProcess on initialization. r=?

Review of attachment 8570756 [details] [diff] [review]:
-----------------------------------------------------------------

Good to hear. Still don't understand quite what's going on. Presumably the child is trying to send a message after sending FinishShutdown (and after the parent actually closes the link). It's just weird that that message doesn't show up in the log.

::: dom/ipc/ContentChild.cpp
@@ +613,5 @@
>  
>      // Make sure there's an nsAutoScriptBlocker on the stack when dispatching
>      // urgent messages.
>      GetIPCChannel()->BlockScripts();
> +    // If communications with the parent have broken down, take the process

Line break before please.
Attachment #8570756 - Flags: review?(wmccloskey) → review+
Thanks! Added the line break, and landed:

remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/3e9cc0ba5dfb
I just remembered that we don't send the Shutdown message for Nuwa processes on b2g.
http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentParent.cpp#1553
Maybe we should not set AbortOnError if IsNuwaProcess().
This will cause the content process to take itself down in the event that it
loses communication with the parent process. This case is particularly
important for the case where the parent process crashes while the content
process is blocked or busy on the main thread, as the content process will
no longer continue to exist as a zombie process, but will shut down after
a short delay.
Attachment #8570756 - Attachment is obsolete: true
Comment on attachment 8571566 [details] [diff] [review]
Set AbortOnError in the ContentProcess on initialization. r=?

Something like this?

Doing a full-spectrum try run on this, since it apparently affects everything.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=31262da84c84
Attachment #8571566 - Flags: review?(wmccloskey)
Comment on attachment 8571566 [details] [diff] [review]
Set AbortOnError in the ContentProcess on initialization. r=?

Review of attachment 8571566 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/ContentChild.cpp
@@ +620,5 @@
> +    GetIPCChannel()->SetAbortOnError(true
> +#ifdef MOZ_NUWA_PROCESS
> +                                     && !IsNuwaProcess()
> +#endif
> +                                     );

How about:
bool abortOnError = true;
#ifdef MOZ_NUWA_PROCESS
abortOnError &= !IsNuwaProcess();
#endif
SetAbortOnError(abortOnError);

I filed bug 1138727 to make this better.
Attachment #8571566 - Flags: review?(wmccloskey) → review+
Try[1] looks good. RyanVM - were there any retriggers you wanted to do to smoke out any intermittents before I land this on inbound again?

https://treeherder.mozilla.org/#/jobs?repo=try&revision=31262da84c84
Flags: needinfo?(ryanvm)
Try run looks good enough that I think you can go ahead and re-land. I mainly just wanted a full Try run for ease of retriggers should any issues arise post-landing. Thanks for checking!
Flags: needinfo?(ryanvm)
Third time lucky. Let's hope I don't roll snake eyes.

remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/09cc6c61c059
https://hg.mozilla.org/mozilla-central/rev/09cc6c61c059
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
Attachment #8560818 - Attachment is obsolete: true
Attachment #8618355 - Flags: review+
Attachment #8618356 - Flags: review+
Duplicate of this bug: 1003258
You need to log in before you can comment on or make changes to this bug.