Closed Bug 811636 Opened 12 years ago Closed 11 years ago

Parent process does not handle the situation correctly if its child process is failed at launching

Tracking

(blocking-b2g:leo+, firefox23 wontfix, firefox24 wontfix, firefox25 fixed, b2g18+ fixed, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 wontfix, b2g-v1.1hd fixed)

Status:

RESOLVED FIXED

Milestone:

1.1 QE5

Project Flags:

blocking-b2g

leo+

Tracking Flags:

Tracking

Status

firefox23

---

wontfix

firefox24

---

wontfix

firefox25

---

fixed

b2g18

fixed

b2g18-v1.0.0

---

wontfix

b2g18-v1.0.1

---

wontfix

b2g-v1.1hd

---

fixed

People

(Reporter: shelly, Assigned: shelly)

References

Details

(Whiteboard: [tech-bug][LeoVB+])

Attachments

(1 file, 7 obsolete files)

draft 1: Handle when a child process exit abnormally 12 years ago Shelly Lin [:shelly] 12.92 KB, patch		Details \| Diff \| Splinter Review
draft 2: Close the client fd earlier 12 years ago Shelly Lin [:shelly] 4.18 KB, patch		Details \| Diff \| Splinter Review
Handle the error of execv() when launching child processes 11 years ago Shelly Lin [:shelly] 4.33 KB, patch	dhylands : feedback+	Details \| Diff \| Splinter Review
Issuing a wait for child process 11 years ago Shelly Lin [:shelly] 12.86 KB, patch		Details \| Diff \| Splinter Review
Close the FD of client side after parent fork/execs the child 11 years ago Shelly Lin [:shelly] 4.47 KB, patch	dhylands : review+	Details \| Diff \| Splinter Review
Close the FD of client side after parent fork/execs the child (w/ nit fix) 11 years ago Shelly Lin [:shelly] 4.59 KB, patch		Details \| Diff \| Splinter Review
Close the FD of client side after parent fork/execs the child (w/ nit fix) 11 years ago Shelly Lin [:shelly] 4.62 KB, patch	dhylands : review+	Details \| Diff \| Splinter Review
[check-in] Fixed patch, w/ nit fixed, and carry r+ 11 years ago Shelly Lin [:shelly] 4.59 KB, patch	shelly : review+	Details \| Diff \| Splinter Review

Shelly Lin [:shelly]

Assignee

Description

•

12 years ago

Parent process is fail to get any notification if its child process does not launch or execute itself properly.

Dave Hylands [:dhylands]

Comment 1

•

12 years ago

Launching a child process involves a fork/exec There are at least 4 very distinct time periods that we need to concern ourselves with: 1 - Failures which occur in the parent before the fork 2 - Failures which occur in the child after the fork but before the exec 3 - Failures which occur in the child after the exec but before the IPC channel is setup 4 - Failures which occur in the child after the IPC channel is setup

Shelly Lin [:shelly]

Assignee

Comment 2

•

12 years ago

Attached patch draft 1: Handle when a child process exit abnormally (obsolete) — Details — Splinter Review

This is happened when a child process is forked successfully, but failed after execv, and call _exit(127). The current system misses the situation when a child process exit itself abnormally. This draft of patch is trying to watch the SIGCHLD in an earlier state of creating a child process. Modify the fetched result from DidProcessCrash() (in process_util_posix.cc) in order to defer a child process is exit normally or not. And after receives an abnormal exit of a child process, sends a "oop-frameloader-crashed' to window_manager.js. Or, maybe we should do something else?

Shelly Lin [:shelly]

Assignee

Updated

•

12 years ago

Depends on: 823474

Shelly Lin [:shelly]

Assignee

Comment 3

•

12 years ago

Please reference comment 7 from bug 823474 for details of reproduce steps. https://bugzilla.mozilla.org/show_bug.cgi?id=823474#c7

Shelly Lin [:shelly]

Assignee

Updated

•

12 years ago

Attachment #684599 - Flags: feedback?(justin.lebar+bug)

Justin Lebar (not reading bugmail)

Comment 4

•

12 years ago

Comment on attachment 684599 [details] [diff] [review] draft 1: Handle when a child process exit abnormally I think cjones is probably a better person to look at this; I'm not familiar with any of this code. Chris, if you're swamped, I can figure this stuff out.

Attachment #684599 - Flags: feedback?(justin.lebar+bug) → feedback?(jones.chris.g)

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 5

•

12 years ago

Comment on attachment 684599 [details] [diff] [review] draft 1: Handle when a child process exit abnormally Hi Shelly, this is a very good UNIX-y solution to this problem. I like it a lot :). The problem is, this process mangement code is cross-platform, and we need to keep all of them in mind when designing features like this. We have an existing mechanism to notify us of "channel errors" (which during startup, correspond to process crashes/exits). The first place this code is used is when we launch the subprocess. The subprocess is launched from ipc/glue/GeckoChildProcessHost.cpp. If an error occurs, it will be notified here [1]. The code doesn't properly handle the error notification at the moment; fixing that is bug 773925. The second piece we need is to let ContentParent know when a channel error has occurred during launch. We can set a particular state on the GeckoChildProcessHost, for example. The advantage of this alternate approach is that it will work across all the platforms we support. Let me know if this makes sense. [1] http://mxr.mozilla.org/mozilla-central/source/ipc/glue/GeckoChildProcessHost.cpp#827

Attachment #684599 - Flags: feedback?(jones.chris.g)

Shelly Lin [:shelly]

Assignee

Comment 6

•

12 years ago

Attached patch draft 2: Close the client fd earlier (obsolete) — Details — Splinter Review

Attachment #684599 - Attachment is obsolete: true

Shelly Lin [:shelly]

Assignee

Comment 7

•

12 years ago

Comment on attachment 703172 [details] [diff] [review] draft 2: Close the client fd earlier Hi Chris, Thanks for your feedback! I've worked on another solution to handle this kind of error, with less changes and closer to the outlook of our exist mechanism. When a child process is created, its parent process waits for a hello message and close the client_pipe_. However, in this kind of situation, a system interrupt with signal SIGCHLD is sent, interrupts epoll_dispatch, and thus the hello message is never added to active event list, making the libevent fail to receive any notification. I figured maybe we can close the client_pipe_ eariler in time, in order to find out the connection has been reset already. This change uses the exist mechanism to notify us channel error, but the rest is like you mentioned, where OnChannelError from GeckoChildProcessHost is not yet being called at this moment.

Attachment #703172 - Flags: feedback?(jones.chris.g)

Shelly Lin [:shelly]

Assignee

Comment 8

•

12 years ago

Btw, bug 823474 is no longer needed with this version of fix.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 9

•

12 years ago

I'm very sorry for the long feedback lag here :(. I'll be clearing out my queue very soon. We need this work for 100% rock-solid stability. When this kind of lmk death occurs, we permanently lose a substantial amount of device memory. That makes the possibility of this crash happening again higher, etc, creating a disastrous feedback loop.

blocking-b2g: --- → leo?

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Updated

•

12 years ago

Blocks: b2g-v-next

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 10

•

12 years ago

Comment on attachment 703172 [details] [diff] [review] draft 2: Close the client fd earlier Hi Shelly, is it correct that we're not getting the error notification at [1] currently? If so, I /think/ I understand what this patch is doing, but there's still a race condition where the process can die without the notification. We should try to discuss this on IRC this week. (I want to make sure I understand the details here ;).) [1] http://mxr.mozilla.org/mozilla-central/source/ipc/glue/GeckoChildProcessHost.cpp#833

Attachment #703172 - Flags: feedback?(jones.chris.g)

Dietrich Ayala (:dietrich)

Comment 11

•

12 years ago

How often does this occur? How can this be triggered? Is it showing up in any of our or our partners' stability tests?

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 13

•

12 years ago

Needed for competitive stability.

Whiteboard: [tech-bug]

Shelly Lin [:shelly]

Assignee

Comment 14

•

12 years ago

(In reply to Chris Jones [:cjones] [:warhammer] from comment #10) > Comment on attachment 703172 [details] [diff] [review] > draft 2: Close the client fd earlier > > Hi Shelly, is it correct that we're not getting the error notification at > [1] currently? If so, I /think/ I understand what this patch is doing, but > there's still a race condition where the process can die without the > notification. > > We should try to discuss this on IRC this week. (I want to make sure I > understand the details here ;).) > > [1] > http://mxr.mozilla.org/mozilla-central/source/ipc/glue/GeckoChildProcessHost. > cpp#833 Hi Chris, Thanks for your feeback, and yes, we're not getting the error notification at [1], I'm not 100 percent sure if it is not called other where else, but at least it's not being called when the app is killed (shutdown abnormally). Please feel free to ping me on IRC, my nickname is ShellyLin. :)

[:fabrice] Fabrice Desré

Updated

•

12 years ago

blocking-b2g: leo? → -

tracking-b2g18: --- → +

Shelly Lin [:shelly]

Assignee

Comment 15

•

11 years ago

Attached patch Handle the error of execv() when launching child processes (obsolete) — Details — Splinter Review

Hi Justin, Without this fix, we will end up having lots of zombie processes in the system. Since project Nuwa (bug 771765) might refactor most of the ipc implementation, especially the part of creating child processes, I'm not sure this is needed anymore, however, this fix is small and not risky, I still think it is worth trying. I know you have re-directed this issue to cjones, or could you re-direct to someone else? Thanks.

Attachment #703172 - Attachment is obsolete: true

Attachment #761912 - Flags: review?

Attachment #761912 - Flags: feedback?(justin.lebar+bug)

Shelly Lin [:shelly]

Assignee

Updated

•

11 years ago

Attachment #761912 - Flags: review?

Justin Lebar (not reading bugmail)

Comment 16

•

11 years ago

We should definitely still fix this even though we're trying to do Nuwa. Did you and cjones ever figure out the race condition he mentioned in comment 10?

Dave Hylands [:dhylands]

Comment 17

•

11 years ago

We need to cover the child dying before it gets the IPC channel opened as well as after it gets the IPC channel opened. Normally, the parent would launch a thread which does wait calls (a reaper thread). This reaps the zombies and allows the parent to be notified regardless when the child dies.

Dave Hylands [:dhylands]

Comment 18

•

11 years ago

Even under Nuwa, we need the same thing, but it needs to be the template process that does the reaping (you can only reap children, not grand children).

Shelly Lin [:shelly]

Assignee

Comment 19

•

11 years ago

(In reply to Justin Lebar [:jlebar] from comment #16) > We should definitely still fix this even though we're trying to do Nuwa. > > Did you and cjones ever figure out the race condition he mentioned in > comment 10? I didn't get a chance to talk with cjones, but I've discussed this issue with Thinker. Race condition won't happen here, since we are closing the client's fd and the call to |base::LaunchApp()| is synchronized, from the part of fork(), execv() till return.

Shelly Lin [:shelly]

Assignee

Comment 20

•

11 years ago

(In reply to Dave Hylands [:dhylands] from comment #17) > We need to cover the child dying before it gets the IPC channel opened as > well as after it gets the IPC channel opened. > > Normally, the parent would launch a thread which does wait calls (a reaper > thread). This reaps the zombies and allows the parent to be notified > regardless when the child dies. Yes, normally it does. In this abnormal case, child process dies before our ipc mechanism receives the "hello message", that is, the parent won't be notified because this child process does not event "exist".

Shelly Lin [:shelly]

Assignee

Comment 21

•

11 years ago

The hacky steps to produce a zombie process: 1. Launch b2g, and kill the Preallocated-app process. 2. Go to system/b2g, rename |plugin-container| to something else, e.g. |plugin-container.2|. 3. Launch any web-app from Homescreen. 4. Observe a Z process with name "Gecko_IOThread".

Dave Hylands [:dhylands]

Comment 22

•

11 years ago

(In reply to Shelly Lin [:shelly] from comment #20) > (In reply to Dave Hylands [:dhylands] from comment #17) > > We need to cover the child dying before it gets the IPC channel opened as > > well as after it gets the IPC channel opened. > > > > Normally, the parent would launch a thread which does wait calls (a reaper > > thread). This reaps the zombies and allows the parent to be notified > > regardless when the child dies. > > Yes, normally it does. In this abnormal case, child process dies before our > ipc mechanism receives the "hello message", that is, the parent won't be > notified because this child process does not event "exist". Well the child does, in fact, exist, it just didn't get around to using the ipc mechanism. My point is that we shouldn't be using the IPC mechanism to track the child processes. If the fork succeeds, then we should start tracking the child and be issuing a wait for the child. Otherwise, any child that fails for any reason between the fork and the open of the ipc channel becomes a zombie (because the parent needs to reap it so that the zombie goes away).

Justin Lebar (not reading bugmail)

Comment 23

•

11 years ago

It sounds like dhylands understands this better than I do, and it also looks like the patch here doesn't fix the problem dhylands identified in comment 22. Shelly, are you willing to wait a few weeks for Dave to get back from vacation and figure this out with him?

Justin Lebar (not reading bugmail)

Comment 24

•

11 years ago

Comment on attachment 761912 [details] [diff] [review] Handle the error of execv() when launching child processes Kicking this over to Dave. If you need this taken care of more quickly, let me know and we can figure that out.

Attachment #761912 - Flags: feedback?(justin.lebar+bug) → feedback?(dhylands)

Andrew McCreight [:mccr8]

Comment 25

•

11 years ago

This problem sounds similar to this issue that was fixed upstream: https://codereview.chromium.org/7870008 "The problem was that when the child process died before connecting to the parent's IPC channel, the parent wouldn't notice the crash because the child end of the IPC pipe was kept open for too long."

Shelly Lin [:shelly]

Assignee

Comment 26

•

11 years ago

(In reply to Dave Hylands [:dhylands] (PTO back July 2) from comment #22) > (In reply to Shelly Lin [:shelly] from comment #20) > > (In reply to Dave Hylands [:dhylands] from comment #17) > > > We need to cover the child dying before it gets the IPC channel opened as > > > well as after it gets the IPC channel opened. > > > > > > Normally, the parent would launch a thread which does wait calls (a reaper > > > thread). This reaps the zombies and allows the parent to be notified > > > regardless when the child dies. > > > > Yes, normally it does. In this abnormal case, child process dies before our > > ipc mechanism receives the "hello message", that is, the parent won't be > > notified because this child process does not event "exist". > > Well the child does, in fact, exist, it just didn't get around to using the > ipc mechanism. > > My point is that we shouldn't be using the IPC mechanism to track the child > processes. > > If the fork succeeds, then we should start tracking the child and be issuing > a wait for the child. Otherwise, any child that fails for any reason between > the fork and the open of the ipc channel becomes a zombie (because the > parent needs to reap it so that the zombie goes away). I see. "Issuing a wait" happens to be my very first solution, but Chris (in comment #5) thinks it's better to use our IPC mechanism to report errors. Attaching that patch back for further discussion.

Shelly Lin [:shelly]

Assignee

Comment 27

•

11 years ago

Attached patch Issuing a wait for child process (obsolete) — Details — Splinter Review

Issuing a wait on newly created child process, in order to monitor its life state.

Shelly Lin [:shelly]

Assignee

Comment 28

•

11 years ago

Justin, sounds good to me, thanks. Andrew, it does look exactly the same like this issue! (https://codereview.chromium.org/7870008)

Dave Hylands [:dhylands]

Comment 29

•

11 years ago

Comment on attachment 761912 [details] [diff] [review] Handle the error of execv() when launching child processes Review of attachment 761912 [details] [diff] [review]: ----------------------------------------------------------------- OK - I spent some time looking at the child/parent IPC socket and I things are a bit clearer. What actually happens is something along these lines: 1 - parent creates a socketpair 2 - parent fork/execs the child, which inherits the open file descriptors So I now believe that this patch is doing the right thing (and this should cover all of the failure points).

Attachment #761912 - Flags: feedback?(dhylands) → feedback+

Shelly Lin [:shelly]

Assignee

Comment 30

•

11 years ago

Attached patch Close the FD of client side after parent fork/execs the child (obsolete) — Details — Splinter Review

Thanks for your time very much! I've added and updated some comments, could you give it a reivew?

Attachment #761912 - Attachment is obsolete: true

Attachment #765168 - Attachment is obsolete: true

Attachment #771146 - Flags: review?(dhylands)

Dave Hylands [:dhylands]

Comment 31

•

11 years ago

Comment on attachment 771146 [details] [diff] [review] Close the FD of client side after parent fork/execs the child Review of attachment 771146 [details] [diff] [review]: ----------------------------------------------------------------- r=me with comments addressed. ::: ipc/chromium/src/chrome/common/ipc_channel_posix.cc @@ +781,5 @@ > > +void Channel::ChannelImpl::CloseClientFileDescriptor() const { > + if (client_pipe_ != -1) { > + Singleton<PipeMap>()->Remove(pipe_name_); > + HANDLE_EINTR(close(client_pipe_)); This function should be setting client_pipe_ to -1 after closing the file descriptor. This then implies that this can't be a const method (so you'll need to remove the const or declare client_pipe_ mutable. @@ +951,5 @@ > return channel_impl_->GetServerFileDescriptor(); > } > > +void Channel::CloseClientFileDescriptor() const { > + return channel_impl_->CloseClientFileDescriptor(); nit: I'd remove the return since this is a void function. ::: ipc/glue/GeckoChildProcessHost.cpp @@ +656,5 @@ > #endif > false, &process, arch); > > + // Close the FD of client side as early as possible, letting our ipc mechanism > + // to detect unexpected error at any stage. nit: I'd reword this slightly. // We're in the parent and the child was launched. Close the child FD // in the parent as soon as possible, which will allow the parent // to detect when the child closes its FD (either due to normal exit or // due to crash).

Attachment #771146 - Flags: review?(dhylands) → review+

Shelly Lin [:shelly]

Assignee

Comment 32

•

11 years ago

Attached patch Close the FD of client side after parent fork/execs the child (w/ nit fix) (obsolete) — Details — Splinter Review

Patch updated with nit fixes, and thanks for your review :)

Attachment #771146 - Attachment is obsolete: true

Attachment #771910 - Flags: review?(dhylands)

Shelly Lin [:shelly]

Assignee

Comment 33

•

11 years ago

Attached patch Close the FD of client side after parent fork/execs the child (w/ nit fix) (obsolete) — Details — Splinter Review

Fixed compiling error.

Attachment #771910 - Attachment is obsolete: true

Attachment #771910 - Flags: review?(dhylands)

Attachment #771949 - Flags: review?(dhylands)

Dave Hylands [:dhylands]

Comment 34

•

11 years ago

Comment on attachment 771949 [details] [diff] [review] Close the FD of client side after parent fork/execs the child (w/ nit fix) Review of attachment 771949 [details] [diff] [review]: ----------------------------------------------------------------- ::: ipc/glue/GeckoChildProcessHost.cpp @@ +658,5 @@ > > + // We're in the parent and the child was launched. Close the child FD in the > + // parent as soon as possible, which will allow the parent to detect when the > + // child closes its FD (either due to normal exit or due to crash). > + const_cast<IPC::Channel&>(channel()).CloseClientFileDescriptor(); nit: You should be able to do it without the cast by doing: GetChannel()->CloseClientFileDescriptor(); GetChannel is declared here: https://mxr.mozilla.org/mozilla-central/source/ipc/glue/GeckoChildProcessHost.h#83

Attachment #771949 - Flags: review?(dhylands) → review+

Shelly Lin [:shelly]

Assignee

Comment 35

•

11 years ago

Attached patch [check-in] Fixed patch, w/ nit fixed, and carry r+ — Details — Splinter Review

Patch to checkin, with nit fixed, and carry r+ from :dhylands.

Attachment #772478 - Flags: review+

Shelly Lin [:shelly]

Assignee

Comment 36

•

11 years ago

Try server result: https://tbpl.mozilla.org/?tree=Try&rev=4bd1b967abfd

Shelly Lin [:shelly]

Assignee

Updated

•

11 years ago

Keywords: checkin-needed

Ryan VanderMeulen [:RyanVM]

Updated

•

11 years ago

Attachment #771949 - Attachment is obsolete: true

Ryan VanderMeulen [:RyanVM]

Comment 37

•

11 years ago

https://hg.mozilla.org/projects/birch/rev/68fa5bec3312

Keywords: checkin-needed

Ryan VanderMeulen [:RyanVM]

Comment 38

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/68fa5bec3312

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

leo.bugzilla.gecko

Updated

•

11 years ago

Blocks: 893012

Andrew McCreight [:mccr8]

Comment 39

•

11 years ago

Marking leo? per bug 893012#c54.

blocking-b2g: - → leo?

Justin Lebar (not reading bugmail)

Comment 40

•

11 years ago

Blocks a blocker, and already has testing done by Leo.

blocking-b2g: leo? → leo+

Andre Graziani (:graziani)

Comment 41

•

11 years ago

As I said in bug 893012, comment 54, the accumulated messages in output queue is solved by this patch. But I observed just now, that a preallocated process is leaked after this patch. I am checking how it happened, and if it is really related to this patch.

Ryan VanderMeulen [:RyanVM]

Comment 42

•

11 years ago

https://hg.mozilla.org/releases/mozilla-b2g18/rev/a9f1d82103c6

status-b2g18: --- → fixed

status-b2g18-v1.0.0: --- → wontfix

status-b2g18-v1.0.1: --- → wontfix

status-b2g-v1.1hd: --- → affected

status-firefox23: --- → wontfix

status-firefox24: --- → wontfix

status-firefox25: --- → fixed

Target Milestone: --- → 1.1 QE5

Ryan VanderMeulen [:RyanVM]

Comment 43

•

11 years ago

https://hg.mozilla.org/releases/mozilla-b2g18_v1_1_0_hd/rev/a9f1d82103c6

status-b2g-v1.1hd: affected → fixed

Andre Graziani (:graziani)

Comment 44

•

11 years ago

Ok, the extra preallocated process in comment 41 is a known issue (bug 867739) and not related to this patch.

Shelly Lin [:shelly]

Assignee

Comment 45

•

11 years ago

(In reply to andre.graziani from comment #44) > Ok, the extra preallocated process in comment 41 is a known issue (bug > 867739) and not related to this patch. Thanks for your confirm. :)

jeremy.kim.leo [:jeremykimleo]

Updated

•

11 years ago

Whiteboard: [tech-bug] → [tech-bug][LeoVB+]

You need to log in before you can comment on or make changes to this bug.