Closed Bug 943174 (bad-waitpid) Opened 12 years ago Closed 11 years ago

ContentParent can try to call waitpid on a child that has already been reaped ("waitpid failed … errno:10")

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla31

Tracking Flags:

Tracking

Status

firefox29

---

wontfix

firefox30

---

fixed

firefox31

---

fixed

b2g-v1.3

---

wontfix

b2g-v1.3T

---

wontfix

b2g-v1.4

---

fixed

b2g-v2.0

---

fixed

People

(Reporter: jld, Assigned: jld)

References

Details

Attachments

(2 files)

bug943174-excessive-waitpid-hg0.diff 11 years ago Jed Davis [:jld] ⟨⏰\|UTC-7⟩ ⟦he/him⟧ 6.48 KB, patch	benjamin : review+	Details \| Diff \| Splinter Review
bug943174-uplift-hg0.diff 11 years ago Jed Davis [:jld] ⟨⏰\|UTC-7⟩ ⟦he/him⟧ 6.59 KB, patch	jld : review+ lsblakk : approval-mozilla-aurora+	Details \| Diff \| Splinter Review

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Description

•

12 years ago

Copying bug 936169 comment 6, referring to some suspicious-looking log messages about failure of waitpid(): That's actually from DidProcessCrash in process_util_posix.cc, which we're almost certainly misusing in IsProcessDead in process_watcher_posix_sigchld.cc. We call it after something has already waited on the child, and it indicates that the process hasn't died. Then, 3000 ms later, the delayed call to ContentParent::ShutDownProcess happens, and calls IsProcessDead *again*, and sees that the process apparently isn't dead, and sends SIGKILL to the process that stopped existing 3 seconds ago. The comment in DidProcessCrash suggests that, in the Nuwa case, we can wind up trying to waitpid on a grandchild process, which we shouldn't be trying to do in the first place, but in that case we get ECHILD on a process that *does* still exist (maybe). So this is complicated.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Updated

•

11 years ago

Alias: bad-waitpid

Summary: DidProcessCrash is being called repeatedly on the same child and doesn't work like that → DidProcessCrash is being called repeatedly on the same child and doesn't work like that ("waitpid failed … errno:10")

Armen [:armenzg]

Updated

•

11 years ago

Blocks: 933680

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 1

•

11 years ago

We can use waitid() with WNOWAIT to check a child process's status and/or wait for it to exit, without consuming the zombie. But that still leaves the problem of finding the last place that cares about the child process to do a full wait and clean it up. Something that might make this easier: if the parent process is exiting, it shouldn't matter, because the children will be reparented to init and cleaned up that way. So that leaves the cases where the process exits normally or is killed (LMK or crashing). Something that makes this harder: if the parent process ever tries to call waitpid or waitid on a content process managed by Nuwa, normally it will fail with ECHILD, but it could race with the child process exiting (and being auto-reaped) and the pid being reassigned to a non-IPC child. So that needs to not happen, though it might be deferrable to a followup bug. (If we want to remote the call in that case: Nuwa can send IPC messages (see AddNewProcess) but I don't know if it can receive them, so that adds even more complexity.)

Benjamin Smedberg

Comment 2

•

11 years ago

In general I think that in our IPC code there is going to be a single place that is responsible for "officially" waiting for a process: I'd assert without being 100% sure that is probably GeckoChildProcessHost::~GeckoChildProcessHost. Whatever that method is, nobody else should be calling waitpid without WNOWAIT.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Updated

•

11 years ago

Assignee: nobody → jld

Status: NEW → ASSIGNED

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 3

•

11 years ago

I think it actually is ~GeckoChildProcessHost, via ProcessWatcher::EnsureProcessTerminated There's a DidProcessCrash in ChildProcessHost::OnWaitableEventSignaled, but it's explicitly Windows-only. There's another ProcessWatcher::EnsureProcessTerminated in ~ChildProcessHost, which is also Windows-only but this is not particularly obvious and I'm not 100% sure I'm right about that. There's also ChildReaper/ChildGrimReaper/ChildLaxReaper, which are launched if EnsureProcessTerminated finds the process still running[1], to ensure that it stops. (Which ChildReaper subclass we use seems to ultimately depend on NS_BUILD_REFCNT_LOGGING for reasons that are not immediately obvious.) There are some waitpids in there, but I think each of them will be the last (relative to the EnsureProcessTerminated call) if it succeeds. Things that need to change: 1. ContentParent::KillHard calls EnsureProcessTerminated, and then (I think?) ~GeckoChildProcessHost calls it again. Either KillHard should refrain from doing that, or it should set a flag so that ~GeckoChildProcessHost skips its own call. 2. ContentParent::SetPriorityAndCheckIsAlive uses DidProcessCrash, but it should either WNOWAIT or disarm the ~GeckoChildProcessHost if it finds the process dead. 3. GeckoChildProcessHost::Join → base::KillProcess can waitpid, and I assume a ~GeckoChildProcessHost is also in its future, so see above. But also this seems to be used only when we're restarting B2G (PowerManagerService::Restart), currently. 4. Grandchildren should be marked somehow to treat them as always having exited, with one exception: 4a. ContentParent::SetPriorityAndCheckIsAlive should treat grandchildren as always alive (as it currently effectively does). 5. Longer-term, we should get the real status of grandchildren somehow. Also, we should find out what is broken by not having it, so we know how urgent that is. (Also, consider that we need to survive Nuwa being killed to reclaim memory, so no matter what we do this might not always work.)

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 4

•

11 years ago

(In reply to Jed Davis [:jld] from comment #3) > There's another ProcessWatcher::EnsureProcessTerminated in > ~ChildProcessHost, which is also Windows-only but this is not particularly > obvious and I'm not 100% sure I'm right about that. More precisely: it's conditional on handle() being truthy, and the only call to set_handle() is ifdef OS_WIN. > 3. GeckoChildProcessHost::Join → base::KillProcess can waitpid, and I assume > a ~GeckoChildProcessHost is also in its future, so see above. But also this > seems to be used only when we're restarting B2G > (PowerManagerService::Restart), currently. On further inspection, this is already fixed: Join() zeroes mChildProcessHandle. So, all of the things that need to be fixed here are specific to ContentParent. Which is wrong — and this means I'm missing something — because bug 933680 is about a plugin process, not a content process.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 5

•

11 years ago

Clearly there are more problems, and they're not specific to plugins, because https://tbpl.mozilla.org/?tree=Try&rev=3ac2bb012d54 *should* have fixes to the known problems with ContentParent, but the "waitpid failed" case is still being reached (with errno == ECHILD).

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 6

•

11 years ago

Something I missed: a grandchild's ContentParent must skip the EnsureProcessTerminated in KillHard. (This is why the linked try run's b2g reftests/crashtests are a sea of red.)

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Updated

•

11 years ago

Summary: DidProcessCrash is being called repeatedly on the same child and doesn't work like that ("waitpid failed … errno:10") → ContentParent can try to call waitpid on a child that has already been reaped ("waitpid failed … errno:10")

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 7

•

11 years ago

Attached patch bug943174-excessive-waitpid-hg0.diff — Details — Splinter Review

The important part for bug 933680 is the change to DidProcessCrash, because that keeps ChildLaxReaper from retrying forever. But that breaks b2g without the change to ContentParent::SetPriorityAndCheckIsAlive (note that that's waitid(), not waitpid()), because of Nuwa. The other changes should be sufficient to fix ContentParent's waitpid bugs, and thus prevent DidProcessCrash's call to waitpid from failing with ECHILD as long as the NSPR process API is never used (see bug 227246; also bug 678369). Trying: https://tbpl.mozilla.org/?tree=Try&rev=7f2cb8c7e669

Attachment #8392431 - Flags: review?(benjamin)

Benjamin Smedberg

Comment 8

•

11 years ago

Comment on attachment 8392431 [details] [diff] [review] bug943174-excessive-waitpid-hg0.diff Do we need the "CheckIsAlive" in SetPriorityAndCheckIsAlive at all, or could we just cut the entire checkisalive bit out completely?

Flags: needinfo?(jld)

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 9

•

11 years ago

(In reply to Benjamin Smedberg [:bsmedberg] from comment #8) > Comment on attachment 8392431 [details] [diff] [review] > bug943174-excessive-waitpid-hg0.diff > > Do we need the "CheckIsAlive" in SetPriorityAndCheckIsAlive at all, or could > we just cut the entire checkisalive bit out completely? That's an excellent question, given that the child is always “alive“ even if it's actually dead when using Nuwa; I filed bug 983829 about that. We certainly needed that check back when it was added — see bug 840277, bug 836638, and especially bug 836199. So, to answer your question: I don't know, and I'm not entirely sure who would know at this point.

Flags: needinfo?(jld)

Benjamin Smedberg

Comment 10

•

11 years ago

Comment on attachment 8392431 [details] [diff] [review] bug943174-excessive-waitpid-hg0.diff ok then. I'm a bit skeptical that CheckIsAlive was ever really what we wanted here, but I'll just go with it for now.

Attachment #8392431 - Flags: review?(benjamin) → review+

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Updated

•

11 years ago

Keywords: checkin-needed

Ryan VanderMeulen [:RyanVM]

Comment 11

•

11 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/569ab9f8ff19

Keywords: checkin-needed

Joel Maher ( :jmaher ) (UTC -8)

Comment 12

•

11 years ago

this appears to have fixed a crash in browser-chrome chunk 1 as we were moving over to running the linux debug tests to ec2. I am happy to see this fix! is there any chance this could be backported to aurora? If not we can just disable 1 test and most likely be fine.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 13

•

11 years ago

(In reply to Joel Maher (:jmaher) from comment #12) > this appears to have fixed a crash in browser-chrome chunk 1 as we were > moving over to running the linux debug tests to ec2. I am happy to see this > fix! is there any chance this could be backported to aurora? I'd be in favor of backporting it, once it's spent a little time on m-c so I can convince the uplift approvers that it's sufficiently low-risk.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 14

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/569ab9f8ff19

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla31

Joel Maher ( :jmaher ) (UTC -8)

Comment 15

•

11 years ago

can we get this backported to aurora?

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

11 years ago

Flags: needinfo?(jld)

Jan Beich

Updated

•

11 years ago

Depends on: 989042

Joel Maher ( :jmaher ) (UTC -8)

Comment 16

•

11 years ago

would love to get this on aurora this week, please let me know if there are any concerns with doing so?

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 17

•

11 years ago

(In reply to Joel Maher (:jmaher) from comment #16) > would love to get this on aurora this week, please let me know if there are > any concerns with doing so? The patch breaks the build on NetBSD, OpenBSD, and older versions of FreeBSD; see bug 989042. The fix for that currently awaits review.

Flags: needinfo?(jld)

Joel Maher ( :jmaher ) (UTC -8)

Comment 18

•

11 years ago

would we be able to land this patch and the one from bug 989042 on Aurora? If we don't think that can reasonably happen in the next week, I would like to just disable the tests which seem to be affected by this. Jed, any thoughts on landing both patches on Aurora this week?

Flags: needinfo?(jld)

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 19

•

11 years ago

Attached patch bug943174-uplift-hg0.diff — Details — Splinter Review

[Approval Request Comment] Bug caused by (feature/regressing bug #): bug 933680 User impact if declined: No direct impact (probably), but releng needs this to move tests to EC2 without breaking them. Testing completed (on m-c, etc.): It's been on m-c for two weeks now, and the only problem was a BSD build breakage (fixed, and the fix rolled into this patch). Risk to taking this patch (and alternatives if risky): Medium-low, given that it's survived two weeks of testing on desktop and b2g. String or IDL/UUID changes made by this patch: None. (Carrying over r=bsmedberg. I squashed the patch with bug 989042 so that there isn't a broken intermediate state; let me know if that's not the preferred approach.)

Attachment #8402063 - Flags: review+

Attachment #8402063 - Flags: approval-mozilla-aurora?

Flags: needinfo?(jld)

Lukas Blakk [:lsblakk] use ?needinfo

Updated

•

11 years ago

Attachment #8402063 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 20

•

11 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/7906c1b48e0d

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Updated

•

11 years ago

status-b2g-v1.3: --- → wontfix

status-b2g-v1.3T: --- → wontfix

status-b2g-v1.4: --- → fixed

status-b2g-v2.0: --- → fixed

status-firefox29: --- → wontfix

status-firefox30: --- → fixed

status-firefox31: --- → fixed

Ryan VanderMeulen [:RyanVM]

Updated

•

11 years ago

Blocks: 847558

You need to log in before you can comment on or make changes to this bug.

bug943174-excessive-waitpid-hg0.diff 11 years ago Jed Davis [:jld] ⟨⏰\|UTC-7⟩ ⟦he/him⟧ 6.48 KB, patch	benjamin : review+	Details \| Diff \| Splinter Review
bug943174-uplift-hg0.diff 11 years ago Jed Davis [:jld] ⟨⏰\|UTC-7⟩ ⟦he/him⟧ 6.59 KB, patch	jld : review+ lsblakk : approval-mozilla-aurora+	Details \| Diff \| Splinter Review