Closed Bug 385465 Opened 16 years ago Closed 16 years ago

safe_pclose() fails to wait for child causing a deadlock w/PATCH


(NSS :: Libraries, defect)

Not set


(Not tracked)



(Reporter: kurt, Unassigned)





(1 file, 1 obsolete file)

User-Agent:       Mozilla/5.0 (X11; U; OpenBSD i386; en-US; rv: Gecko/20070609 Firefox/
Build Identifier: 3.11.5

As it currently is written safe_pclose() is nether portable nor robust. It has three problems 1) assumes kill() on zombie processes will not return ESRCH, 2) doesn't handle EINTR errors from waitpid() 3) gives up waiting for the child after an arbitrary number of attempts.

When one of the above problems occurs safe_pclose() returns leaving the zombie process around. Later when a process is created using nspr interfaces the WaitPidDaemonThread() wakes up, calls waitpid() which returns the nss zombie status and then goes to sleep forever (since pr_wp.numProcs is now 0). Finally when PR_WaitProcess() is called on the UI thread it deadlocks waiting for the status of the new child, but WaitPidDaemonThread() will not wake up again.

Reproducible: Sometimes

Steps to Reproduce:
One way to reproduce is to use thunderbird w/engimail on an operating system that returns ESRCH when kill() is called on zombie processes. Then you need to be lucky and win a race where the first waitpid() call returns 0, the child terminates and then kill() is called on the zombie process returning ESRCH, safe_pclose() bails from the loop leaving the zombie remaining. The remaining zombie sets up the conditions for the deadlock between nspr's WaitPidDaemonThread() and PR_WaitProcess().

While POSIX recommends that kill() on a zombie process not return ESRCH, it does acknowledge that many systems implement this differently. To write portable code you can not make an assumption about what kill() will return on a zombie process.
Ever confirmed: true
You guys are looking at OLD code.  
This bug was fixed 4+ months ago.  See

Martynas, you should not "confirm" bugs that are filed against old 
versions of the code.  Before confirming a bug, please check that it is 
still present in the current sources.  
Severity: critical → normal
Closed: 16 years ago
Resolution: --- → DUPLICATE
Version: unspecified → 3.0
Thanks for the pointer to the new code. It has the exact same race I described in this bug report. first waitpid() returns 0, child terminates, kill on zombie returns -1 with errno ESRCH causing the blocking waitpid not to be called.

The code should look like this to be portable (see comments above about what POSIX says about kill() on terminated processes):

60     /* if the child hasn't exited, kill it -- we're done with its output */
861     while ((rv = waitpid(pid, &status, WNOHANG)) == -1 && errno == EINTR)
862         ;
863     if (rv == 0) {
            kill(pid, SIGKILL);
864         while ((rv = waitpid(pid, &status, 0)) == -1 && errno == EINTR)
865             ;
866     }

Should this bug report be reopened or the other one?
The other one.
I can't find a way to reopen the other bug report. There isn't an option to reopen available on the other report, so I'm reopening this report. Please see comment 4 for a proposed patch that fixes the remaining portability issue with safe_pclose(). This problem leads to deadlock of thunderbird w/engimail. 
Severity: normal → critical
Resolution: DUPLICATE → ---
from comment 4
Attachment #269376 - Attachment is obsolete: true
Nelson, i don't think the other one should be reopened, because this is a different issue, a new bug, that is.
Abolutely a duplicate of bug 51429.  It says that the supposed fix for
bug 51429 was insufficient.  I will reopen 51429.
Closed: 16 years ago16 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.