Closed
Bug 811636
Opened 12 years ago
Closed 11 years ago
Parent process does not handle the situation correctly if its child process is failed at launching
Categories
(Firefox OS Graveyard :: General, defect)
Tracking
(blocking-b2g:leo+, firefox23 wontfix, firefox24 wontfix, firefox25 fixed, b2g18+ fixed, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 wontfix, b2g-v1.1hd fixed)
People
(Reporter: shelly, Assigned: shelly)
References
Details
(Whiteboard: [tech-bug][LeoVB+])
Attachments
(1 file, 7 obsolete files)
4.59 KB,
patch
|
shelly
:
review+
|
Details | Diff | Splinter Review |
Parent process is fail to get any notification if its child process does not launch or execute itself properly.
Comment 1•12 years ago
|
||
Launching a child process involves a fork/exec
There are at least 4 very distinct time periods that we need to concern ourselves with:
1 - Failures which occur in the parent before the fork
2 - Failures which occur in the child after the fork but before the exec
3 - Failures which occur in the child after the exec but before the IPC channel is setup
4 - Failures which occur in the child after the IPC channel is setup
Assignee | ||
Comment 2•12 years ago
|
||
This is happened when a child process is forked successfully, but failed after execv, and call _exit(127).
The current system misses the situation when a child process exit itself abnormally.
This draft of patch is trying to watch the SIGCHLD in an earlier state of creating a child process. Modify the fetched result from DidProcessCrash() (in process_util_posix.cc) in order to defer a child process is exit normally or not.
And after receives an abnormal exit of a child process, sends a "oop-frameloader-crashed' to window_manager.js. Or, maybe we should do something else?
Assignee | ||
Comment 3•12 years ago
|
||
Please reference comment 7 from bug 823474 for details of reproduce steps.
https://bugzilla.mozilla.org/show_bug.cgi?id=823474#c7
Assignee | ||
Updated•12 years ago
|
Attachment #684599 -
Flags: feedback?(justin.lebar+bug)
Comment 4•12 years ago
|
||
Comment on attachment 684599 [details] [diff] [review]
draft 1: Handle when a child process exit abnormally
I think cjones is probably a better person to look at this; I'm not familiar with any of this code.
Chris, if you're swamped, I can figure this stuff out.
Attachment #684599 -
Flags: feedback?(justin.lebar+bug) → feedback?(jones.chris.g)
Comment on attachment 684599 [details] [diff] [review]
draft 1: Handle when a child process exit abnormally
Hi Shelly, this is a very good UNIX-y solution to this problem. I
like it a lot :).
The problem is, this process mangement code is cross-platform, and we
need to keep all of them in mind when designing features like this.
We have an existing mechanism to notify us of "channel errors" (which
during startup, correspond to process crashes/exits). The first place
this code is used is when we launch the subprocess. The subprocess is
launched from ipc/glue/GeckoChildProcessHost.cpp. If an error occurs,
it will be notified here [1]. The code doesn't properly handle the
error notification at the moment; fixing that is bug 773925.
The second piece we need is to let ContentParent know when a channel
error has occurred during launch. We can set a particular state on
the GeckoChildProcessHost, for example.
The advantage of this alternate approach is that it will work across
all the platforms we support.
Let me know if this makes sense.
[1] http://mxr.mozilla.org/mozilla-central/source/ipc/glue/GeckoChildProcessHost.cpp#827
Attachment #684599 -
Flags: feedback?(jones.chris.g)
Assignee | ||
Comment 6•12 years ago
|
||
Attachment #684599 -
Attachment is obsolete: true
Assignee | ||
Comment 7•12 years ago
|
||
Comment on attachment 703172 [details] [diff] [review]
draft 2: Close the client fd earlier
Hi Chris,
Thanks for your feedback! I've worked on another solution to handle this kind of error, with less changes and closer to the outlook of our exist mechanism.
When a child process is created, its parent process waits for a hello message and close the client_pipe_. However, in this kind of situation, a system interrupt with signal SIGCHLD is sent, interrupts epoll_dispatch, and thus the hello message is never added to active event list, making the libevent fail to receive any notification. I figured maybe we can close the client_pipe_ eariler in time, in order to find out the connection has been reset already.
This change uses the exist mechanism to notify us channel error, but the rest is like you mentioned, where OnChannelError from GeckoChildProcessHost is not yet being called at this moment.
Attachment #703172 -
Flags: feedback?(jones.chris.g)
Assignee | ||
Comment 8•12 years ago
|
||
Btw, bug 823474 is no longer needed with this version of fix.
I'm very sorry for the long feedback lag here :(. I'll be clearing out my queue very soon.
We need this work for 100% rock-solid stability. When this kind of lmk death occurs, we permanently lose a substantial amount of device memory. That makes the possibility of this crash happening again higher, etc, creating a disastrous feedback loop.
blocking-b2g: --- → leo?
Updated•12 years ago
|
Blocks: b2g-v-next
Comment on attachment 703172 [details] [diff] [review]
draft 2: Close the client fd earlier
Hi Shelly, is it correct that we're not getting the error notification at [1] currently? If so, I /think/ I understand what this patch is doing, but there's still a race condition where the process can die without the notification.
We should try to discuss this on IRC this week. (I want to make sure I understand the details here ;).)
[1] http://mxr.mozilla.org/mozilla-central/source/ipc/glue/GeckoChildProcessHost.cpp#833
Attachment #703172 -
Flags: feedback?(jones.chris.g)
Comment 11•12 years ago
|
||
How often does this occur? How can this be triggered? Is it showing up in any of our or our partners' stability tests?
Needed for competitive stability.
Whiteboard: [tech-bug]
Assignee | ||
Comment 14•12 years ago
|
||
(In reply to Chris Jones [:cjones] [:warhammer] from comment #10)
> Comment on attachment 703172 [details] [diff] [review]
> draft 2: Close the client fd earlier
>
> Hi Shelly, is it correct that we're not getting the error notification at
> [1] currently? If so, I /think/ I understand what this patch is doing, but
> there's still a race condition where the process can die without the
> notification.
>
> We should try to discuss this on IRC this week. (I want to make sure I
> understand the details here ;).)
>
> [1]
> http://mxr.mozilla.org/mozilla-central/source/ipc/glue/GeckoChildProcessHost.
> cpp#833
Hi Chris,
Thanks for your feeback, and yes, we're not getting the error notification at [1], I'm not 100 percent sure if it is not called other where else, but at least it's not being called when the app is killed (shutdown abnormally).
Please feel free to ping me on IRC, my nickname is ShellyLin. :)
Updated•12 years ago
|
blocking-b2g: leo? → -
tracking-b2g18:
--- → +
Assignee | ||
Comment 15•11 years ago
|
||
Hi Justin,
Without this fix, we will end up having lots of zombie processes in the system. Since project Nuwa (bug 771765) might refactor most of the ipc implementation, especially the part of creating child processes, I'm not sure this is needed anymore, however, this fix is small and not risky, I still think it is worth trying.
I know you have re-directed this issue to cjones, or could you re-direct to someone else? Thanks.
Attachment #703172 -
Attachment is obsolete: true
Attachment #761912 -
Flags: review?
Attachment #761912 -
Flags: feedback?(justin.lebar+bug)
Assignee | ||
Updated•11 years ago
|
Attachment #761912 -
Flags: review?
Comment 16•11 years ago
|
||
We should definitely still fix this even though we're trying to do Nuwa.
Did you and cjones ever figure out the race condition he mentioned in comment 10?
Comment 17•11 years ago
|
||
We need to cover the child dying before it gets the IPC channel opened as well as after it gets the IPC channel opened.
Normally, the parent would launch a thread which does wait calls (a reaper thread). This reaps the zombies and allows the parent to be notified regardless when the child dies.
Comment 18•11 years ago
|
||
Even under Nuwa, we need the same thing, but it needs to be the template process that does the reaping (you can only reap children, not grand children).
Assignee | ||
Comment 19•11 years ago
|
||
(In reply to Justin Lebar [:jlebar] from comment #16)
> We should definitely still fix this even though we're trying to do Nuwa.
>
> Did you and cjones ever figure out the race condition he mentioned in
> comment 10?
I didn't get a chance to talk with cjones, but I've discussed this issue with Thinker. Race condition won't happen here, since we are closing the client's fd and the call to |base::LaunchApp()| is synchronized, from the part of fork(), execv() till return.
Assignee | ||
Comment 20•11 years ago
|
||
(In reply to Dave Hylands [:dhylands] from comment #17)
> We need to cover the child dying before it gets the IPC channel opened as
> well as after it gets the IPC channel opened.
>
> Normally, the parent would launch a thread which does wait calls (a reaper
> thread). This reaps the zombies and allows the parent to be notified
> regardless when the child dies.
Yes, normally it does. In this abnormal case, child process dies before our ipc mechanism receives the "hello message", that is, the parent won't be notified because this child process does not event "exist".
Assignee | ||
Comment 21•11 years ago
|
||
The hacky steps to produce a zombie process:
1. Launch b2g, and kill the Preallocated-app process.
2. Go to system/b2g, rename |plugin-container| to something else, e.g. |plugin-container.2|.
3. Launch any web-app from Homescreen.
4. Observe a Z process with name "Gecko_IOThread".
Comment 22•11 years ago
|
||
(In reply to Shelly Lin [:shelly] from comment #20)
> (In reply to Dave Hylands [:dhylands] from comment #17)
> > We need to cover the child dying before it gets the IPC channel opened as
> > well as after it gets the IPC channel opened.
> >
> > Normally, the parent would launch a thread which does wait calls (a reaper
> > thread). This reaps the zombies and allows the parent to be notified
> > regardless when the child dies.
>
> Yes, normally it does. In this abnormal case, child process dies before our
> ipc mechanism receives the "hello message", that is, the parent won't be
> notified because this child process does not event "exist".
Well the child does, in fact, exist, it just didn't get around to using the ipc mechanism.
My point is that we shouldn't be using the IPC mechanism to track the child processes.
If the fork succeeds, then we should start tracking the child and be issuing a wait for the child. Otherwise, any child that fails for any reason between the fork and the open of the ipc channel becomes a zombie (because the parent needs to reap it so that the zombie goes away).
Comment 23•11 years ago
|
||
It sounds like dhylands understands this better than I do, and it also looks like the patch here doesn't fix the problem dhylands identified in comment 22.
Shelly, are you willing to wait a few weeks for Dave to get back from vacation and figure this out with him?
Comment 24•11 years ago
|
||
Comment on attachment 761912 [details] [diff] [review]
Handle the error of execv() when launching child processes
Kicking this over to Dave. If you need this taken care of more quickly, let me know and we can figure that out.
Attachment #761912 -
Flags: feedback?(justin.lebar+bug) → feedback?(dhylands)
Comment 25•11 years ago
|
||
This problem sounds similar to this issue that was fixed upstream:
https://codereview.chromium.org/7870008
"The problem was that when the child process died before connecting to the parent's IPC channel, the parent wouldn't notice the crash because the child end of the IPC pipe was kept open for too long."
Assignee | ||
Comment 26•11 years ago
|
||
(In reply to Dave Hylands [:dhylands] (PTO back July 2) from comment #22)
> (In reply to Shelly Lin [:shelly] from comment #20)
> > (In reply to Dave Hylands [:dhylands] from comment #17)
> > > We need to cover the child dying before it gets the IPC channel opened as
> > > well as after it gets the IPC channel opened.
> > >
> > > Normally, the parent would launch a thread which does wait calls (a reaper
> > > thread). This reaps the zombies and allows the parent to be notified
> > > regardless when the child dies.
> >
> > Yes, normally it does. In this abnormal case, child process dies before our
> > ipc mechanism receives the "hello message", that is, the parent won't be
> > notified because this child process does not event "exist".
>
> Well the child does, in fact, exist, it just didn't get around to using the
> ipc mechanism.
>
> My point is that we shouldn't be using the IPC mechanism to track the child
> processes.
>
> If the fork succeeds, then we should start tracking the child and be issuing
> a wait for the child. Otherwise, any child that fails for any reason between
> the fork and the open of the ipc channel becomes a zombie (because the
> parent needs to reap it so that the zombie goes away).
I see. "Issuing a wait" happens to be my very first solution, but Chris (in comment #5) thinks it's better to use our IPC mechanism to report errors. Attaching that patch back for further discussion.
Assignee | ||
Comment 27•11 years ago
|
||
Issuing a wait on newly created child process, in order to monitor its life state.
Assignee | ||
Comment 28•11 years ago
|
||
Justin, sounds good to me, thanks.
Andrew, it does look exactly the same like this issue!
(https://codereview.chromium.org/7870008)
Comment 29•11 years ago
|
||
Comment on attachment 761912 [details] [diff] [review]
Handle the error of execv() when launching child processes
Review of attachment 761912 [details] [diff] [review]:
-----------------------------------------------------------------
OK - I spent some time looking at the child/parent IPC socket and I things are a bit clearer.
What actually happens is something along these lines:
1 - parent creates a socketpair
2 - parent fork/execs the child, which inherits the open file descriptors
So I now believe that this patch is doing the right thing (and this should cover all of the failure points).
Attachment #761912 -
Flags: feedback?(dhylands) → feedback+
Assignee | ||
Comment 30•11 years ago
|
||
Thanks for your time very much!
I've added and updated some comments, could you give it a reivew?
Attachment #761912 -
Attachment is obsolete: true
Attachment #765168 -
Attachment is obsolete: true
Attachment #771146 -
Flags: review?(dhylands)
Comment 31•11 years ago
|
||
Comment on attachment 771146 [details] [diff] [review]
Close the FD of client side after parent fork/execs the child
Review of attachment 771146 [details] [diff] [review]:
-----------------------------------------------------------------
r=me with comments addressed.
::: ipc/chromium/src/chrome/common/ipc_channel_posix.cc
@@ +781,5 @@
>
> +void Channel::ChannelImpl::CloseClientFileDescriptor() const {
> + if (client_pipe_ != -1) {
> + Singleton<PipeMap>()->Remove(pipe_name_);
> + HANDLE_EINTR(close(client_pipe_));
This function should be setting client_pipe_ to -1 after closing the file descriptor.
This then implies that this can't be a const method (so you'll need to remove the const or declare client_pipe_ mutable.
@@ +951,5 @@
> return channel_impl_->GetServerFileDescriptor();
> }
>
> +void Channel::CloseClientFileDescriptor() const {
> + return channel_impl_->CloseClientFileDescriptor();
nit: I'd remove the return since this is a void function.
::: ipc/glue/GeckoChildProcessHost.cpp
@@ +656,5 @@
> #endif
> false, &process, arch);
>
> + // Close the FD of client side as early as possible, letting our ipc mechanism
> + // to detect unexpected error at any stage.
nit: I'd reword this slightly.
// We're in the parent and the child was launched. Close the child FD
// in the parent as soon as possible, which will allow the parent
// to detect when the child closes its FD (either due to normal exit or
// due to crash).
Attachment #771146 -
Flags: review?(dhylands) → review+
Assignee | ||
Comment 32•11 years ago
|
||
Patch updated with nit fixes, and thanks for your review :)
Attachment #771146 -
Attachment is obsolete: true
Attachment #771910 -
Flags: review?(dhylands)
Assignee | ||
Comment 33•11 years ago
|
||
Fixed compiling error.
Attachment #771910 -
Attachment is obsolete: true
Attachment #771910 -
Flags: review?(dhylands)
Attachment #771949 -
Flags: review?(dhylands)
Comment 34•11 years ago
|
||
Comment on attachment 771949 [details] [diff] [review]
Close the FD of client side after parent fork/execs the child (w/ nit fix)
Review of attachment 771949 [details] [diff] [review]:
-----------------------------------------------------------------
::: ipc/glue/GeckoChildProcessHost.cpp
@@ +658,5 @@
>
> + // We're in the parent and the child was launched. Close the child FD in the
> + // parent as soon as possible, which will allow the parent to detect when the
> + // child closes its FD (either due to normal exit or due to crash).
> + const_cast<IPC::Channel&>(channel()).CloseClientFileDescriptor();
nit: You should be able to do it without the cast by doing:
GetChannel()->CloseClientFileDescriptor();
GetChannel is declared here:
https://mxr.mozilla.org/mozilla-central/source/ipc/glue/GeckoChildProcessHost.h#83
Attachment #771949 -
Flags: review?(dhylands) → review+
Assignee | ||
Comment 35•11 years ago
|
||
Patch to checkin, with nit fixed, and carry r+ from :dhylands.
Attachment #772478 -
Flags: review+
Assignee | ||
Comment 36•11 years ago
|
||
Try server result:
https://tbpl.mozilla.org/?tree=Try&rev=4bd1b967abfd
Assignee | ||
Updated•11 years ago
|
Keywords: checkin-needed
Updated•11 years ago
|
Attachment #771949 -
Attachment is obsolete: true
Comment 37•11 years ago
|
||
Keywords: checkin-needed
Comment 38•11 years ago
|
||
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 40•11 years ago
|
||
Blocks a blocker, and already has testing done by Leo.
blocking-b2g: leo? → leo+
Comment 41•11 years ago
|
||
As I said in bug 893012, comment 54, the accumulated messages in output queue is solved by this patch.
But I observed just now, that a preallocated process is leaked after this patch.
I am checking how it happened, and if it is really related to this patch.
Comment 42•11 years ago
|
||
status-b2g18:
--- → fixed
status-b2g18-v1.0.0:
--- → wontfix
status-b2g18-v1.0.1:
--- → wontfix
status-b2g-v1.1hd:
--- → affected
status-firefox23:
--- → wontfix
status-firefox24:
--- → wontfix
status-firefox25:
--- → fixed
Target Milestone: --- → 1.1 QE5
Comment 43•11 years ago
|
||
Comment 44•11 years ago
|
||
Ok, the extra preallocated process in comment 41 is a known issue (bug 867739) and not related to this patch.
Assignee | ||
Comment 45•11 years ago
|
||
(In reply to andre.graziani from comment #44)
> Ok, the extra preallocated process in comment 41 is a known issue (bug
> 867739) and not related to this patch.
Thanks for your confirm. :)
Updated•11 years ago
|
Whiteboard: [tech-bug] → [tech-bug][LeoVB+]
You need to log in
before you can comment on or make changes to this bug.
Description
•