Closed Bug 943174 (bad-waitpid) Opened 7 years ago Closed 7 years ago

ContentParent can try to call waitpid on a child that has already been reaped ("waitpid failed … errno:10")

Categories

(Core :: IPC, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla31
Tracking Status
firefox29 --- wontfix
firefox30 --- fixed
firefox31 --- fixed
b2g-v1.3 --- wontfix
b2g-v1.3T --- wontfix
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed

People

(Reporter: jld, Assigned: jld)

References

Details

Attachments

(2 files)

Copying bug 936169 comment 6, referring to some suspicious-looking log messages about failure of waitpid():


That's actually from DidProcessCrash in process_util_posix.cc, which we're almost certainly misusing in IsProcessDead in process_watcher_posix_sigchld.cc.  We call it after something has already waited on the child, and it indicates that the process hasn't died.  Then, 3000 ms later, the delayed call to ContentParent::ShutDownProcess happens, and calls IsProcessDead *again*, and sees that the process apparently isn't dead, and sends SIGKILL to the process that stopped existing 3 seconds ago.

The comment in DidProcessCrash suggests that, in the Nuwa case, we can wind up trying to waitpid on a grandchild process, which we shouldn't be trying to do in the first place, but in that case we get ECHILD on a process that *does* still exist (maybe).  So this is complicated.
Alias: bad-waitpid
Summary: DidProcessCrash is being called repeatedly on the same child and doesn't work like that → DidProcessCrash is being called repeatedly on the same child and doesn't work like that ("waitpid failed … errno:10")
Blocks: 933680
We can use waitid() with WNOWAIT to check a child process's status and/or wait for it to exit, without consuming the zombie.  But that still leaves the problem of finding the last place that cares about the child process to do a full wait and clean it up.

Something that might make this easier: if the parent process is exiting, it shouldn't matter, because the children will be reparented to init and cleaned up that way.  So that leaves the cases where the process exits normally or is killed (LMK or crashing).

Something that makes this harder: if the parent process ever tries to call waitpid or waitid on a content process managed by Nuwa, normally it will fail with ECHILD, but it could race with the child process exiting (and being auto-reaped) and the pid being reassigned to a non-IPC child.  So that needs to not happen, though it might be deferrable to a followup bug.  (If we want to remote the call in that case: Nuwa can send IPC messages (see AddNewProcess) but I don't know if it can receive them, so that adds even more complexity.)
In general I think that in our IPC code there is going to be a single place that is responsible for "officially" waiting for a process: I'd assert without being 100% sure that is probably GeckoChildProcessHost::~GeckoChildProcessHost.

Whatever that method is, nobody else should be calling waitpid without WNOWAIT.
Assignee: nobody → jld
Status: NEW → ASSIGNED
I think it actually is ~GeckoChildProcessHost, via ProcessWatcher::EnsureProcessTerminated

There's a DidProcessCrash in ChildProcessHost::OnWaitableEventSignaled, but it's explicitly Windows-only.

There's another ProcessWatcher::EnsureProcessTerminated in ~ChildProcessHost, which is also Windows-only but this is not particularly obvious and I'm not 100% sure I'm right about that.

There's also ChildReaper/ChildGrimReaper/ChildLaxReaper, which are launched if EnsureProcessTerminated finds the process still running[1], to ensure that it stops.  (Which ChildReaper subclass we use seems to ultimately depend on NS_BUILD_REFCNT_LOGGING for reasons that are not immediately obvious.)  There are some waitpids in there, but I think each of them will be the last (relative to the EnsureProcessTerminated call) if it succeeds.

Things that need to change:

1. ContentParent::KillHard calls EnsureProcessTerminated, and then (I think?) ~GeckoChildProcessHost calls it again.  Either KillHard should refrain from doing that, or it should set a flag so that ~GeckoChildProcessHost skips its own call.

2. ContentParent::SetPriorityAndCheckIsAlive uses DidProcessCrash, but it should either WNOWAIT or disarm the ~GeckoChildProcessHost if it finds the process dead.

3. GeckoChildProcessHost::Join → base::KillProcess can waitpid, and I assume a ~GeckoChildProcessHost is also in its future, so see above.  But also this seems to be used only when we're restarting B2G (PowerManagerService::Restart), currently.

4. Grandchildren should be marked somehow to treat them as always having exited, with one exception:

4a. ContentParent::SetPriorityAndCheckIsAlive should treat grandchildren as always alive (as it currently effectively does).

5. Longer-term, we should get the real status of grandchildren somehow.  Also, we should find out what is broken by not having it, so we know how urgent that is.  (Also, consider that we need to survive Nuwa being killed to reclaim memory, so no matter what we do this might not always work.)
(In reply to Jed Davis [:jld] from comment #3)
> There's another ProcessWatcher::EnsureProcessTerminated in
> ~ChildProcessHost, which is also Windows-only but this is not particularly
> obvious and I'm not 100% sure I'm right about that.

More precisely: it's conditional on handle() being truthy, and the only call to set_handle() is ifdef OS_WIN.

> 3. GeckoChildProcessHost::Join → base::KillProcess can waitpid, and I assume
> a ~GeckoChildProcessHost is also in its future, so see above.  But also this
> seems to be used only when we're restarting B2G
> (PowerManagerService::Restart), currently.

On further inspection, this is already fixed: Join() zeroes mChildProcessHandle.

So, all of the things that need to be fixed here are specific to ContentParent.

Which is wrong — and this means I'm missing something — because bug 933680 is about a plugin process, not a content process.
Clearly there are more problems, and they're not specific to plugins, because https://tbpl.mozilla.org/?tree=Try&rev=3ac2bb012d54 *should* have fixes to the known problems with ContentParent, but the "waitpid failed" case is still being reached (with errno == ECHILD).
Something I missed: a grandchild's ContentParent must skip the EnsureProcessTerminated in KillHard.  (This is why the linked try run's b2g reftests/crashtests are a sea of red.)
Summary: DidProcessCrash is being called repeatedly on the same child and doesn't work like that ("waitpid failed … errno:10") → ContentParent can try to call waitpid on a child that has already been reaped ("waitpid failed … errno:10")
The important part for bug 933680 is the change to DidProcessCrash, because that keeps ChildLaxReaper from retrying forever.  But that breaks b2g without the change to ContentParent::SetPriorityAndCheckIsAlive (note that that's waitid(), not waitpid()), because of Nuwa.

The other changes should be sufficient to fix ContentParent's waitpid bugs, and thus prevent DidProcessCrash's call to waitpid from failing with ECHILD as long as the NSPR process API is never used (see bug 227246; also bug 678369).

Trying: https://tbpl.mozilla.org/?tree=Try&rev=7f2cb8c7e669
Attachment #8392431 - Flags: review?(benjamin)
Comment on attachment 8392431 [details] [diff] [review]
bug943174-excessive-waitpid-hg0.diff

Do we need the "CheckIsAlive" in SetPriorityAndCheckIsAlive at all, or could we just cut the entire checkisalive bit out completely?
Flags: needinfo?(jld)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #8)
> Comment on attachment 8392431 [details] [diff] [review]
> bug943174-excessive-waitpid-hg0.diff
> 
> Do we need the "CheckIsAlive" in SetPriorityAndCheckIsAlive at all, or could
> we just cut the entire checkisalive bit out completely?

That's an excellent question, given that the child is always “alive“ even if it's actually dead when using Nuwa; I filed bug 983829 about that.  We certainly needed that check back when it was added — see bug 840277, bug 836638, and especially bug 836199.

So, to answer your question: I don't know, and I'm not entirely sure who would know at this point.
Flags: needinfo?(jld)
Comment on attachment 8392431 [details] [diff] [review]
bug943174-excessive-waitpid-hg0.diff

ok then. I'm a bit skeptical that CheckIsAlive was ever really what we wanted here, but I'll just go with it for now.
Attachment #8392431 - Flags: review?(benjamin) → review+
this appears to have fixed a crash in browser-chrome chunk 1 as we were moving over to running the linux debug tests to ec2.  I am happy to see this fix!  is there any chance this could be backported to aurora?  If not we can just disable 1 test and most likely be fine.
(In reply to Joel Maher (:jmaher) from comment #12)
> this appears to have fixed a crash in browser-chrome chunk 1 as we were
> moving over to running the linux debug tests to ec2.  I am happy to see this
> fix!  is there any chance this could be backported to aurora?

I'd be in favor of backporting it, once it's spent a little time on m-c so I can convince the uplift approvers that it's sufficiently low-risk.
https://hg.mozilla.org/mozilla-central/rev/569ab9f8ff19
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
can we get this backported to aurora?
Flags: needinfo?(jld)
Depends on: 989042
would love to get this on aurora this week, please let me know if there are any concerns with doing so?
(In reply to Joel Maher (:jmaher) from comment #16)
> would love to get this on aurora this week, please let me know if there are
> any concerns with doing so?

The patch breaks the build on NetBSD, OpenBSD, and older versions of FreeBSD; see bug 989042.  The fix for that currently awaits review.
Flags: needinfo?(jld)
would we be able to land this patch and the one from bug 989042 on Aurora?  If we don't think that can reasonably happen in the next week, I would like to just disable the tests which seem to be affected by this.

Jed, any thoughts on landing both patches on Aurora this week?
Flags: needinfo?(jld)
[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 933680
User impact if declined: No direct impact (probably), but releng needs this to move tests to EC2 without breaking them.
Testing completed (on m-c, etc.): It's been on m-c for two weeks now, and the only problem was a BSD build breakage (fixed, and the fix rolled into this patch).
Risk to taking this patch (and alternatives if risky): Medium-low, given that it's survived two weeks of testing on desktop and b2g.
String or IDL/UUID changes made by this patch: None.

(Carrying over r=bsmedberg.  I squashed the patch with bug 989042 so that there isn't a broken intermediate state; let me know if that's not the preferred approach.)
Attachment #8402063 - Flags: review+
Attachment #8402063 - Flags: approval-mozilla-aurora?
Flags: needinfo?(jld)
Attachment #8402063 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.