Open Bug 227246 Opened 21 years ago Updated 1 year ago

PR_CreateProcess()/PR_WaitProcess() doesn't play nicely with fork() on Unix/Linux

Categories

(NSPR :: NSPR, enhancement)

Other
Other
enhancement

Tracking

(Not tracked)

People

(Reporter: chris.elving, Unassigned)

References

Details

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)

On Unix/Linux with _PR_NATIVE_THREADS, PR_CreateProcess() increments the
numProcs counter and adds a pr_PidRecord entry to the pidTable. PR_WaitProcess()
inspects pidTable and, if the process has not yet terminated, waits on reapedCV.

The WaitPidDaemonThread() thread monitors numProcs and, when it's non-zero,
calls waitpid(-1, ...). When waitpid(-1, ...) returns, WaitPidDaemonThread()
decrements numProcs regardless of whether the given pid was in pidTable. If the
given pid was in pidTable, WaitPidDaemonThread() notifies reapedCV, potentially
waking a PR_WaitProcess().

As a result of the above logic, if any code in the process fork()s a process
without using PR_CreateProcess() then 1. PR_WaitProcess() can hang forever and
2. the non-NSPR-based code can fail to properly reap its own children. This is
especially problematic in an environment such as a web server that uses
PR_CreateProcess() and also supports 3rd party plugins.

With _PR_NATIVE_THREADS, I think PR_WaitProcess() should call
waidpid(process->md.pid, ...) directly. This would solve both of the above problems.


Reproducible: Always

Steps to Reproduce:
1.
2.
3.
Status: UNCONFIRMED → NEW
Ever confirmed: true
QA Contact: wtchang → nspr
Issue 2 exists even without PR_NATIVE_THREADS.

http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/nsprpub/pr/src/md/unix/uxproces.c&rev=3.24&mark=580,601,670#612
Summary: PR_WaitProcess() doesn't play nicely with fork() on Unix/Linux → PR_CreateProcess()/PR_WaitProcess() doesn't play nicely with fork() on Unix/Linux
Blocks: 678369
Bug 745212 has a more detailed analysis of the bug.

I expect that Linux has supported _PR_NATIVE_THREADS since at least the advent of NPTL, which was something like a decade ago now (so, it wouldn't have helped when this bug was filed, but fortunately now is not then).  Given a waitpid() that actually works, the NSPR "proces" API maps more or less directly onto Unix… except for PR_DetachProcess/PR_CreateProcessDetached.

One alternative would be for those functions to create a detached pthread for each detached child, to wait on it.  I don't know if we'd consider that to be excessive overhead.

Another alernative is for the wait-daemon thread, when SIGCHLD is posted, to loop over all running children doing waitpid with WNOHANG on each one.  (If non-detached children are using waitpid directly, this could be limited to each running detached child.)  I also don't know if we'd consider that to be excessive overhead.

Or someone could go through Gecko switching all the callers of PR_CreateProcess* to something else.  I'm not sure what that would be.
I just tripped over another failure mode from this.  Gecko's IPC code uses libevent to temporarily catch SIGCHLD when waiting for a process to exit; this installs a signal handler which apparently doesn't invoke the previous handler, although it does restore it afterwards.  So, if this races with an NSPR child process exiting, NSPR's SIGCHLD handler won't get called and PR_WaitProcess will block forever.  (I can reproduce it fairly reliably on a debug build by opening the browser and immediately navigating to about:preferences, which runs shell commands from the mailcap file while enumerating MIME type handlers.)
I think I know how to fix the problem in comment #4, and probably also bug 1299581.

Kai, does NSPR still need to support Linux 2.4?  More specifically: does NSPR still need to support LinuxThreads, or can it assume NPTL?
Flags: needinfo?(kaie)
Hi Jed: Please wait for Kai's reply. But I guess (or at least hope)
NSPR can drop support for LinuxThreads and assume NPTL.
On further inspection, it's not quite that simple.  The _PR_NATIVE_THREADS variant of uxproces.c calls waitpid() if and only if pr_wp.numProcs is nonzero.  But, if waitpid returns a non-NSPR child process, then numProcs can be decremented back to zero when there are still NSPR processes outstanding… if I'm reading this correctly; I haven't observed this bug in practice yet.

I'd been wondering about the possibility of introducing a PR_ImportProcess for Gecko IPC to use in place of waitpid, but there are also system libraries that spawn processes, so that wouldn't eliminate the problem.

For my earlier idea of mapping PR_WaitProcess directly to waitpid (which effectively requires _PR_NATIVE_THREADS) and not starting the daemon thread until/unless detached processes are used, there's some awkwardness around what happens if a thread is blocked in PR_WaitProcess and a child process is detached, but I think it would be enough enough for PR_WaitProcess to check the pid table if the targeted waitpid() fails with ECHILD.


And then there's still the other option: bypassing this bug by changing Gecko to use native APIs on generic Unix, like what it already does in most cases on Windows and OS X.  Of course, that wouldn't help anything else that's using NSPR.
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #7)
>
> And then there's still the other option: bypassing this bug by changing
> Gecko to use native APIs on generic Unix, like what it already does in most
> cases on Windows and OS X.

This option is worth considering.
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #5)
> Kai, does NSPR still need to support Linux 2.4?  More specifically: does
> NSPR still need to support LinuxThreads, or can it assume NPTL?

It seems acceptable to declare that future NSPR/NSS versions can no longer work on Linux 2.4 and classic LinuxThreads.
Flags: needinfo?(kaie)
I've been experiencing an intermittent test failure in a very specific scenario in bug 1393800 where PR_WaitProcess() gets stuck waiting for a process that I'm sure has terminated correctly. It gets stuck here to be precise:

https://dxr.mozilla.org/mozilla-central/rev/19b32a138d08f73961df878a29de6f0aad441683/nsprpub/pr/src/md/unix/uxproces.c#824

I believe I'm hitting this bug since the test involves crashing and re-spawning a content process in quick succession, and that uses fork() directly IIRC. At the same time the minidump-analyzer tool is being launched via nsIProcess to analyze the minidump left by the crash. Since the latter uses PR_CreateProcess()/PR_WaitProcess() I have the feeling I'm hitting this. It happens only in Linux 64-bit code-coverage builds and only for this test, roughly 2-3 times out of 10.
I've just managed to confirm that this bug is causing the intermittent failures I'm seeing in bug 1393800.
Blocks: 1393800
No longer blocks: 1393800
Severity: normal → S3

The bug assignee is inactive on Bugzilla, so the assignee is being reset.

Assignee: wtc → nobody
You need to log in before you can comment on or make changes to this bug.