Closed Bug 51429 Opened 24 years ago Closed 17 years ago

RNG_SystemInfoForRNG possible "netstat" zombie process

Categories

(NSS :: Libraries, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED
3.11.8

People

(Reporter: nelson, Assigned: wtc)

References

()

Details

Attachments

(9 files, 7 obsolete files)

1.09 KB, patch
nelson
: review+
julien.pierre
: review+
Details | Diff | Splinter Review
553 bytes, text/plain
Details
388 bytes, patch
nelson
: review+
Details | Diff | Splinter Review
29 bytes, text/plain
Details
3.38 KB, text/plain
Details
1003 bytes, patch
Details | Diff | Splinter Review
1.35 KB, patch
Details | Diff | Splinter Review
737 bytes, patch
Details | Diff | Splinter Review
1.62 KB, patch
wtc
: review+
Details | Diff | Splinter Review
This bug was formerly bugsplat 300544.
It applies to all UNIX systems, to the unix implementation of 
RNG_SystemInfoForRNG().

jgmysers@netscape.com wrote:
In ns/security/lib/util/unix_rand.c:safe_pclose(), if waitpid() 
fails with EINTR, the child process could be left as a zombie.
moving to 3.2
Depends on: 48258
Target Milestone: --- → 3.2
QA Contact: wtc → sonmi
Target Milestone: 3.2 → 3.3
Target Milestone: 3.3 → 3.4
Changed the QA contact to Bishakha.
QA Contact: sonja.mirtitsch → bishakhabanerjee
Set target milestone to NSS 3.5.
Target Milestone: 3.4 → 3.5
I'm seeing [a] netstat zombie[s] on Red Hat Linux 7.3 using the 2002052409 Linux
build of Mozilla 1.0RC3.

Let me know if I can provide any useful information about this somehow.
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2a) Gecko/2002091318

I see something like this all the time:
22483 ?        Z      0:00 [netstat <defunct>]
Is this what this bug is about?
It does not seem to be a problem.

pi
Assigned the bug to myself.  Target NSS 3.7.
Assignee: ian.mcgreer → wtc
Priority: P3 → P2
Target Milestone: 3.5 → 3.7
*** Bug 183778 has been marked as a duplicate of this bug. ***
Isn't there some much, much, MUCH more sane way to get entropy?

This bug should be considered a bit higher priority that it seems, since a
browser screwing with netstat will look to many people like spyway or a trojan.
Moved to target milestone 3.8 because the original
NSS 3.7 release has been renamed 3.8.
Target Milestone: 3.7 → 3.8
It's still there in Mozilla 1.2.1.
15728 S    19:18   0:24      \_ /home/alex/programs/mozilla/mozilla-bin
15740 Z    19:18   0:00          \_ [netstat <defunct>]

Yup, it's always there after a while. Mozilla 1.2.1
I'm having the same problem also with this. Mozilla 1.3a

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3a) Gecko/20021212
Linux schloss 2.4.18-10 #1 Wed Aug 7 10:26:48 EDT 2002 i686 unknown


winzig   25910  1476  0 19:12 ?        00:00:00 /bin/sh /usr/local/mozilla/run-m
winzig   25916 25910  2 19:12 ?        00:02:08 /usr/local/mozilla/mozilla-bin
winzig   25918 25916  0 19:12 ?        00:00:00 /usr/local/mozilla/mozilla-bin
winzig   25919 25918  0 19:12 ?        00:00:01 /usr/local/mozilla/mozilla-bin
winzig   25920 25918  0 19:12 ?        00:00:00 /usr/local/mozilla/mozilla-bin
winzig   25922 25918  0 19:12 ?        00:00:00 /usr/local/mozilla/mozilla-bin
winzig   26157 25916  0 20:15 ?        00:00:00 [netstat <defunct>]
winzig   26243 25918  0 20:36 ?        00:00:00 /usr/local/mozilla/mozilla-bin
Looks like netstat hangs when it's stuck resolving an ip address into a hostname.
Netstat hanging would be a different bug.
Remove target milestone of 3.8, since these bugs didn't get into that release.
Target Milestone: 3.8 → ---
it's still there. Isn't it possible to use /dev/urandom?
Or is this not available on some systems?
zombified netstat still there in 1.6 and 1.7b. Wouldnt it be feasable to
incorporate something like "'EGD' http://egd.sourceforge.net/" into mozilla and
have mozilla read from a socket instead of using a system process that could be
exploitable to gain intelligence about running systems and maybe possible other
info or even gain a shell on a system. PS: When i see netstat being zombified
after a browser is using it, makes me think about the ole days of Internet
explorer reporting home "Phoning Home" like E.T. And knowing that you guys are
pretty respectable ppl i trust you to use netstat but there is allways that
slight possiblity of someone sneaking around doing malicious stuff. Please look
into how EGD handles getting more data for entropy.
Bug still there, with 1.7rc1 according to
http://qa.mandrakesoft.com/show_bug.cgi?id=9427

Neil, please have a look at this.  Take it if you want it.
QA Contact: bishakhabanerjee → jason.m.reid
Attached patch patch to avoid zombie netstat (obsolete) — Splinter Review
This patch has been suggested by others. It avoids a theoretically possible
bug, one that I have been unable to force on Solaris 10 x86. Briefly, it is
unlikely to happen because the parent reads all of the output of the netstat
command before issuing the waitpid() call. Since the child's stdout is closed
by its exit routine  it will always be defunct immediately thereafter.
Furthermore, even when the parent is forced to truncate the input file and
while the child is explicitly held in the stopped state, waitpid() still
returns 0 rather than EINTR.

I suspect this last behavior may be different under different OSes.

Would some of the respondents who reported seeing this problem be kind enough
to try the patch?
Assignee: wtchang → neil.williams
Status: NEW → ASSIGNED
Attachment #185311 - Flags: superreview?(wtchang)
Attachment #185311 - Flags: review?(nelson)
If waitpid gets EINTR, do you want to kill pid?  or just waitpid again?
Comment on attachment 185311 [details] [diff] [review]
patch to avoid zombie netstat

Nelson, I agree that it would be more correct to call waitpid again. I'll
submit a revised patch.
Attachment #185311 - Attachment is obsolete: true
Attached patch patch to avoid zombie netstat (obsolete) — Splinter Review
Incorporating Nelson's suggestion.

I note that most of the bug citations mention Linux. Also note that the
Mandrake bug mentioned in comment 18 has been fixed, apparently by a Mandrake
release.
Attachment #185508 - Flags: superreview?(wtchang)
Attachment #185508 - Flags: review?(nelson)
Comment on attachment 185508 [details] [diff] [review]
patch to avoid zombie netstat

That counter increment/check needs to happen in every path that repeats this
loop.
Attachment #185508 - Flags: superreview?(wtchang)
Attachment #185508 - Flags: review?(nelson)
Attachment #185508 - Flags: review-
Comment on attachment 185311 [details] [diff] [review]
patch to avoid zombie netstat

Forgot to mark this patch with my previous review comment.
Attachment #185311 - Flags: superreview?(wtchang)
Attachment #185311 - Flags: review?(nelson)
Attachment #185311 - Flags: review-
To be honest, I just applied a Debian patch for Mandriva/Mandrake package which
commented the netstat call, which isn't really needed on Linux system, since
/dev/random randomness is good enough.
Attachment #185508 - Attachment is obsolete: true
Attached patch third trySplinter Review
Reviewer's last comments incorporated.
Attachment #185630 - Flags: superreview?(wtchang)
Attachment #185630 - Flags: review?(nelson)
Comment on attachment 185630 [details] [diff] [review]
third try

Looks good to me.  Since we haven't found a reproducible test case with which
to test this patch, please get a second review on this.  Thanks.
Attachment #185630 - Flags: review?(nelson) → review+
Target Milestone: --- → 3.10.1
(In reply to comment #26)
> To be honest, I just applied a Debian patch for Mandriva/Mandrake package which
> commented the netstat call, which isn't really needed on Linux system, since
> /dev/random randomness is good enough.

You mean /dev/urandom, right ? On Linux, /dev/random might block. Not on BSD though.

While I'm not a reviewer, the latest patch looks good. EINTR will cause waipid
to be repeated (max 1000 times). All other errors will elave the loop
I'm running Gentoo 1.4.16, and I'm getting a zombie netstat process for both
Firefox (1.0.6 - Mozilla/5.0 - 20050722) and Thunderbird (1.0.6 (20050722)).
Originally, before upgrading to the latest Firefox and Thunderbird build, this
problem only used to occur with instances of Firefox.

 /bin/bash /usr/libexec/mozilla-launcher
  \_ /usr/lib/mozilla-firefox/firefox-bin
      \_ /usr/lib/mozilla-firefox/firefox-bin
      |   \_ /usr/lib/mozilla-firefox/firefox-bin
      |   \_ /usr/lib/mozilla-firefox/firefox-bin
      |   \_ /usr/lib/mozilla-firefox/firefox-bin
      |   \_ /usr/lib/mozilla-firefox/firefox-bin
      \_ [netstat] <defunct>


 /bin/bash /usr/libexec/mozilla-launcher
  \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      |   \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      |   \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      |   \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      |   \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      |   \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      \_ [netstat] <defunct>

Target Milestone: 3.10.1 → 3.11
Comment on attachment 185630 [details] [diff] [review]
third try

This bug needs a second review.  Julien, will you review it?
Attachment #185630 - Flags: review?(julien.pierre.bugs)
Comment on attachment 185630 [details] [diff] [review]
third try

r=wtc.	This patch may not fix this bug though.
I would rename the variable r as rv.
Attachment #185630 - Flags: superreview?(wtchang)
Attachment #185630 - Flags: superreview+
Attachment #185630 - Flags: review?(julien.pierre.bugs)
Comment on attachment 185630 [details] [diff] [review]
third try

Like Wan-Teh, I'm not sure if this bug will fix the problem, but I don't think
it will hurt.

Re: the kill() call, I think we should probably break out of the loop on any
error, and not just on ESRCH. In the Solaris man pages, the other possible
errors are EINVAL (invalid signal number) and EPERM (no permission to kill the
process), which should also be fatal, since subsequent calls to kill aren't
going to fix those problems, and the only point of continuing the waitpid loop
is to work gracefully if the process exits quickly within 1000 iterations. On
the other hand, these other errors should never happen, since we are passing a
hardcoded signal value, and we started the process, so we should have
permission - unless it changed user. I hope netstat doesn't do that ;).
Attachment #185630 - Flags: review+
Hmm. Netstat could be setuid root!  
I've appplied the patch, but unfortunately there are still sometimes hanging
zombie netstat processes:

&#9500;&#9472;firefox&#9472;&#9472;&#9472;run-mozilla.sh&#9472;&#9472;&#9472;firefox-bin&#9472;&#9472;&#9472;netstat
Hmm.  I wonder.  Perhaps the netstat processes are not zombies,
but are actually live hung processes that cannot be killed by
the parent process because they are running as root and the 
parent isn't.  A full long ps output might help diagnose that.
Please attach it rather than adding it inline as a comment.
(In reply to comment #36)
> Hmm.  I wonder.  Perhaps the netstat processes are not zombies,
> but are actually live hung processes that cannot be killed by
> the parent process because they are running as root and the 

No, the process is a real zombie and is not running under a different UID and
netstat is not SETUID.

> parent isn't.  A full long ps output might help diagnose that.
> Please attach it rather than adding it inline as a comment.

I'll attached a file with the relevant part of the ps output.
Christian: what is the Linux distribution you are
using?  What's the kernel version and do you know
the glibc version and whether you are using LinuxThreads
or NPTL?  This will help us find a similar system to
reproduce this zombie netstat process.

Do you know how to use strace on firefox-bin to find out if
waitpid was called at all?  If not, I can supply a patch
to add logging statements.  (Or you can try to do that.)
The problem is most likely that waitpid() is being called with WNOHANG.  The
function can loop the 1000 times before the child process is scheduled to
receive the signal.

I think we need a blocking waitpid() call, at least after the first attempt. 
Things to prevent a deadlock would include setting a timer to cause the blocking
waitpid() to return EINTR and making sure the pipe is drained to EOF and/or closed.
Attached patch Proposed fix (untested) (obsolete) — Splinter Review
Comment on attachment 197901 [details] [diff] [review]
Proposed fix (untested)

We should only need to make the kill(pid, SIGKILL) call
once.  If so, we can do that before calling waitpid.
It should be harmless to send SIGKILL to a zombie.

John, your patch never calls waitpid with WNOHANG. Is
this intentional?

In safe_popen, if fork succeeds but fdopen fails,
safe_popen will return NULL, which will cause safe_pclose
and therefore waitpid to not be called.  We should be
able to fix that by moving the fdopen call before the
fork call.  (fdopen is unlikely to fail though.)
I think the SIGKILL is unnecessary.  I put the patch together quickly, feel free
to tweak it.
(In reply to comment #41)
> Created an attachment (id=197901) [edit]
> Proposed fix (untested)
> 

(In reply to comment #42)
> (From update of attachment 197901 [details] [diff] [review] [edit])
> We should only need to make the kill(pid, SIGKILL) call
> once.  If so, we can do that before calling waitpid.
> It should be harmless to send SIGKILL to a zombie.
> 
> John, your patch never calls waitpid with WNOHANG. Is
> this intentional?
> 
> In safe_popen, if fork succeeds but fdopen fails,
> safe_popen will return NULL, which will cause safe_pclose
> and therefore waitpid to not be called.  We should be
> able to fix that by moving the fdopen call before the
> fork call.  (fdopen is unlikely to fail though.)
> 

I agree with Wan-Teh. Waitpid without WNOHANG could cause indefinite wait.
Should probably try WHNOHANG first, then if waitpid() returns 0, kill() and
waitpid(pid, &status, 0). I'll try to work up a patch.
Please add this patch as well, since the code is useless on Linux, where /dev/urandom is used.
The patch is from Wichert Akkerman (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=241200 )
Attachment #216373 - Flags: review?(nelson)
QA Contact: jason.m.reid → libraries
Neil, contact Wichert Akkerman and Eric Dorland (see the URL above for their
email addresses), and see if they can reproduce at will, and are willing to
test your patch.
Comment on attachment 216373 [details] [diff] [review]
workaround for Linux

Our first preference is to fix this bug "the right way" which uses netstat but doesn't leave it as a zombie.  If that can be achieved, then we want to use netstat even on Linux. 
But this needs resolution now.  So, if the "right fix" cannot be verified before 3.11.1-Beta, then I think we must deploy this workaround at that time.
Attachment #216373 - Attachment description: additional patch → workaround for Linux
Attachment #216373 - Flags: review?(nelson) → review+
Target Milestone: 3.11 → 3.11.1
Priority: P2 → P1
Target Milestone: 3.11.1 → 3.11.2
Per our meeting, retargetting to 3.12.
Target Milestone: 3.11.2 → 3.12
This seems to still be present in Firefox 2.0 derived from Debian.  The #ifdef LINUX patch seems to be applied (I'm not sure whether that's a Debian patch or if it was merged), but doesn't seem to be taking effect.  "linux" is what the preprocessor provides, not "LINUX", though perhaps that's supposed to be defined somewhere in the tree.
(In reply to comment #49)
> This seems to still be present in Firefox 2.0 derived from Debian.  The #ifdef
> LINUX patch seems to be applied (I'm not sure whether that's a Debian patch or
> if it was merged), but doesn't seem to be taking effect.  "linux" is what the
> preprocessor provides, not "LINUX", though perhaps that's supposed to be
> defined somewhere in the tree.

The #ifdef LINUX in Firefox 2.0 in Ubuntu is something from Ubuntu, since there have been changes to the unix_rand.c file between the latest firefox in debian experimental (2.0b2) and 2.0 final. Though this new Ubuntu patch should work, I can confirm that the one that was applied in the Debian version does what it is supposed to do.
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee: neil.williams → wtchang
Status: REOPENED → NEW
I reproduced this zombie process on Red Hat Enterprise Linux 4
Update 4 using two small test programs, and understood why the
code fails to reap the child process.

Here is the first test program, mystat.c.  Compile it with
    gcc mystat.c -o mystat

It is to be used in place of netstat.  You can also use netstat,
but for some reason it's easier to reproduce the bug with mystat.
The mystat program does nothing.
Attached file Test program popen.c
Here is the second test program, popen.c.  Compile it with
    gcc popen.c -o popen

Then, run the test with
    ./popen

On RHEL 4 U4, it is not hard to see something like this:
    kill(23585, SIGKILL)=0
    [repeated 1000 times]
    count equals 1000

If you see this, run this command immediately (within 10
seconds) to see the zombie process:

    $ ps -eaf | grep mystat
    wtchang  23585 23584  0 17:05 pts/4    00:00:00 [mystat] <defunct>
    wtchang  23589 16153  0 17:05 pts/5    00:00:00 grep mystat

I found that waitpid(..., WNOHANG) repeatedly returns 0 because
the OS is only running the popen process and is not giving the
mystat process a chance to run and terminate.  The evidence is
that if the popen program calls sched_yield() between the
waitpid(..., WNOHANG) calls (there is one such call in popen.c
that you can uncomment out), then the bug cannot be reproduced,
and you will only see at most one line like this:
    kill(23608, SIGKILL)=0
Attached patch Proposed fix v2 (obsolete) — Splinter Review
John's description in comment 40 of what causes the problem
is right on.

This patch is based on John's patch in comment 41 and implements
what Neil outlined in comment 44.
Attachment #197901 - Attachment is obsolete: true
Attachment #250408 - Flags: superreview?(rrelyea)
Attachment #250408 - Flags: review?(neil.williams)
This patch fixes the corner case in safe_popen that I described
at the end of comment 42.
Attachment #250410 - Flags: review?(neil.williams)
Comment on attachment 250410 [details] [diff] [review]
Patch for a minor problem in safe_popen

This looks correct. Is it not correct to close(p[0]) when fork() returns -1? Why?
Attachment #250410 - Flags: review?(neil.williams) → review+
Comment on attachment 250408 [details] [diff] [review]
Proposed fix v2

Thanks for tracking this down, Wan-Teh. It finally makes good sense what's happening. I was going to use sleep(1) but this is better.
Attachment #250408 - Flags: review?(neil.williams) → review+
I noticed non-printable characters in unix_rand.c. Turns out they're form feed characters--at lines 283,324 and others. As a matter of policy should these be removed when working up a patch?
Neil, Feel free to remove FF characters in your next patch.
(In reply to comment #55)
> (From update of attachment 250410 [details] [diff] [review])
> This looks correct. Is it not correct to close(p[0]) when fork() returns -1?
> Why?

p[0] has been associated with fp, so fclose(fp) will call close(p[0])
under the cover, according to the fclose man page in the Single Unix
Specification:
http://www.opengroup.org/onlinepubs/009695399/functions/fclose.html

  The fclose() function shall perform the equivalent of a close()
  on the file descriptor that is associated with the stream pointed
  to by stream.

I should add a comment to clarify this.  How about something like this:

  fclose(fp);  /* this closes p[0], the fd associated with fp */
Note that the fdopen function associates a stream with a file descriptor.
Our code has
    fp = fdopen(p[0], "r");
Comment on attachment 250408 [details] [diff] [review]
Proposed fix v2

r+ = relyea

Did you want to add the yield or sleep(0) in this version of the patch as well?
Attachment #250408 - Flags: superreview?(rrelyea) → superreview+
I checked in this patch on the NSS trunk (NSS 3.12) with
a comment to explain that fclose(fp) closes p[0].

Checking in unix_rand.c;
/cvsroot/mozilla/security/nss/lib/freebl/unix_rand.c,v  <--  unix_rand.c
new revision: 1.23; previous revision: 1.22
done
Attachment #250410 - Attachment is obsolete: true
I added a PR_Sleep(0) call instead, which calls sched_yield().

Calling sched_yield() directly would have porting issues because
not all Unix platforms have sched_yield() and it would require
linking libfreebl3.so with -lrt.  Calling PR_Sleep(0) allows us
to let NSPR handle these porting issues.

I considered calling sleep(0), which is available everywhere and
defined in libc, but I can't find anything in the sleep man page
in the Single Unix Specification that says sleep(0) yields the
processor:
    http://www.opengroup.org/onlinepubs/009695399/functions/sleep.html
The man page for the related function usleep says usleep(0) "has no
effect":
    http://www.opengroup.org/onlinepubs/009695399/functions/usleep.html
So I decided to not use sleep(0).

I checked in this patch on the NSS trunk (NSS 3.12).

Checking in unix_rand.c;
/cvsroot/mozilla/security/nss/lib/freebl/unix_rand.c,v  <--  unix_rand.c
new revision: 1.24; previous revision: 1.23
done
Attachment #250408 - Attachment is obsolete: true
Status: NEW → RESOLVED
Closed: 18 years ago18 years ago
Resolution: --- → FIXED
Wan-Teh, Can the kill() call fail with EINTR?
Should the kill call be retried if it fails with EINTR,
as the patch now does for the waitpid() calls?
Nelson, kill() cannot fail with EINTR:
http://www.opengroup.org/onlinepubs/009695399/functions/kill.html

Actually waitpid(..., WNOHANG) most likely can't fail with EINTR
either, but I still handle it as defensive programming.  In general
only blocking system calls can fail with EINTR.
I checked in the "proposed patch (as checked in)" on the
NSS_3_11_BRANCH (NSS 3.11.6).

Checking in unix_rand.c;
/cvsroot/mozilla/security/nss/lib/freebl/unix_rand.c,v  <--  unix_rand.c
new revision: 1.17.10.6; previous revision: 1.17.10.5
done
Target Milestone: 3.12 → 3.11.6
Comment on attachment 216373 [details] [diff] [review]
workaround for Linux

requesting sr
Attachment #216373 - Flags: superreview?(wtchang)
Comment on attachment 216373 [details] [diff] [review]
workaround for Linux

Let's come up with a better way to turn off the
safe_popen(netstat_ni_cmd) call.  We already use
two different methods for BSD/OS and Solaris.  The
sample code omits the irrelevant code.

Method 1 (used by BSD/OS): if we succeeded in getting
data from /dev/urandom, then we skip the
safe_popen(netstat_ni_cmd) call.

    bytes = RNG_FileUpdate("/dev/urandom", SYSTEM_RNG_SEED_COUNT);

#ifdef BSDI
    if (bytes)
        return;
#endif

    fp = safe_popen(netstat_ni_cmd);
    if (fp != NULL) {
        while ((bytes = fread(buf, 1, sizeof(buf), fp)) > 0)
            RNG_RandomUpdate(buf, bytes);
        safe_pclose(fp);
    }

Method 2 (used by Solaris): turn off the safe_popen(netstat_ni_cmd)
call by conditional compilation, equivalent to this patch.

#define DO_NETSTAT 1

#ifdef SOLARIS
#undef DO_NETSTAT
#endif

#ifdef DO_NETSTAT
    fp = safe_popen(netstat_ni_cmd);
    if (fp != NULL) {
        while ((bytes = fread(buf, 1, sizeof(buf), fp)) > 0)
            RNG_RandomUpdate(buf, bytes);
        safe_pclose(fp);
    }
#endif

So, this patch could be replaced by a patch that undefines DO_NETSTAT
for Linux.  But I think we should generalize the new patch to cover
all the platforms that have /dev/urandom.  It may be better to do this
in a new bug report.
Attachment #216373 - Flags: superreview?(wtchang) → superreview-
What dou think about this in nspr? 
static PRStatus OpenDevRandom( void )
 {
-    fdDevRandom = open( "/dev/random", O_RDONLY );
+    fdDevRandom = open( "/dev/urandom", O_RDONLY );
     return((-1 == fdDevRandom)? PR_FAILURE : PR_SUCCESS );
 } /* end OpenDevRandom() */

romaxa: good idea.  NSPR should also use /dev/urandom instead of
/dev/random.
(In reply to comment #69)
> So, this patch could be replaced by a patch that undefines DO_NETSTAT
> for Linux.

FWIW, we've been doing this in debian for a while to solve the netstat zombies problem.
I'm confused whether this bug is really fixed.

Wan-Teh, I do not understand the purpose of your comment 69 and why you discuss attachment 216373 [details] [diff] [review] proposes to undefine DO_NETSTAT for Linux.

What is the motivation for a new patch? Do you want to "optimize" the fix?

Is is still necessary to undefine DO_NETSTAT for Linux in order to fix this bug?

Or did Wan-Teh's checkin of attachment 250666 [details] [diff] [review], comment 66, to the 3.11.6 branch already fix the reported problems?

Did someone test/verify?

Thanks in advance for a summary of the current status.
Kai, I wouldn't try to fix this any more than it is already fixed unless 
someone complains that it is not actually fixed at present.  I see no such
complaint above.
Nelson, the motivation for my question:

I would like to understand the best approach to backport a fix to Firefox/SeaMonkey in RHEL 4.

Should we go with "#undef DO_NETSTAT"
or should we take the patch that was checked in.

Because Wan-Teh continued to discuss, I was concerned the checked in patch might not be complete.
FWIW, we do #undef DO_NETSTAT on Debian (and Ubuntu does too, AFAIK). Also see my comment #45. I still don't understand obstination as in comment #47. Why continue to use netstat for the sake of it ?
Kai: FYI "#undef DO_NETSTAT" has also been tried on RHEL4/firefox and appears to fix the netstat zombie issue there, see IssueTracker#119883.
Kai, for your backporting, take the patch that was checked in.

We should consider removing the netstat (safe_popen and safe_pclose)
code on the NSS trunk.  Let's do that in a new bug report.
Thanks for the clarifications and your recommendations.

I want to follow Wan-Teh's recommendation to use the patch as checked in.
That way we will use the same patch now and in future NSS updates.
Bug 385465 alleges that the fix for this bug was/is incomplete, and continues
to fail in certain reproducible cases.  A patch attached to that bug,
attachment 269706 [details] [diff] [review], contains a proposed fix for the trunk.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 3.11.6 → 3.11.8
Comment on attachment 250662 [details] [diff] [review]
Patch for a minor problem in safe_popen (as checked in)

This patch is now also on the branch:
unix_rand.c; new revision: 1.17.10.7; previous revision: 1.17.10.6

unix_rand.c on the branch now matches the trunk.
Is there a reason this bug is still opened? Looks like the other problems were fixed, but maybe I'm missing something.
Yes the problem described in Bug 385465 still remains. There is a proposed patch to correct the issue attachment 269706 [details] [diff] [review] that has not been committed. It would be great if this could be reviewed and committed.

In summary the issue is that the return value of kill() on a zombie process is not standardized by POSIX. It recommends that kill() on a zombie not return ESRCH but then acknowledges that many systems do. safe_pclose() on those systems has a race that leads to deadlock which I have hit quite a bit.

Regardless of how the operating system implements kill() on zombie processes, I don't see the logic in bailing out of safe_pclose() if kill() fails for any reason. waitpid() should still be called so that the conditions for the deadlock are avoided (i.e. don't exit safe_pclose() leaving the zombie around).
Kurt,
Attachment 269706 [details] [diff] is attached to a resolved bug, and has no review request,
so it's not getting any attention because it doesn't show up on any searches
for patches awaiting review on unresolved bugs.

I suggest attaching it to this bug, and requesting review from Wan-Teh.
This bug was reopened on 2007-07-01.  It's better to close this bug
and restore its 3.11.6 target milestone, and use bug 385465 for the
remaining issue.
I'm not sure that I agree that it's better.  
Bug 385465 proves that this bug was marked "fixed" prematurely.
A checkin was done, but it seems that this bug is still not actually fixed.  
I'd like to keep all the history of this bug in one bug report until it
is really fixed.
Patch to correct remaining issue (thanks Nelson for the pointer). Requesting review from Wan-Teh.

I'm not sure what if anything I should do with the flags option on this form.
Nelson or Wan-Teh perhaps you could fix them if I got it wrong.
Attachment #284642 - Flags: review?(wtc)
Comment on attachment 284642 [details] [diff] [review]
patch to correct issue described in comment #85

NSS 3.11.6 is already released and has a partial fix for
this bug.  The partial fix works on all the platforms that
the NSS team tested on, therefore it is still useful.
Since we didn't write release notes for NSS 3.11.6,
the target milestone of this bug is a good way to document
the fact that NSS 3.11.6 has a partial fix.  Reopening
this bug and changing the target milestone makes it
hard to discover that information.

Kurt, your patch is correct.  I considered changing the
code to call kill() only once, but decided not to make the
change because "it ain't broke".

Is OpenBSD your platform?  I don't have POSIX, so I use
Single Unix Specification to help me write portable code.
Its kill() man page doesn't mention the issue with zombies
and ESRCH:
http://opengroup.org/onlinepubs/007908799/xsh/kill.html

The best fix for this bug is for NSS to not exec
processes and mess with signal handlers and waitpid.
Attachment #284642 - Flags: review?(wtc) → review+
Thanks for the review. Yes OpenBSD is my platform. I guess I should have said SUSv3 instead of POSIX. Indeed the SUSv2/POSIX doesn't address how zombies are handled which is how I suppose different 1003.1-1988 compliant kill() implementations came about.

Here's the url to the SUSv3 version that I was referring to:

http://www.opengroup.org/onlinepubs/009695399/functions/kill.html

There's a paragraph in the Rational section beginning with "Existing implementations..."
(In reply to comment #90)
> The best fix for this bug is for NSS to not exec
> processes and mess with signal handlers and waitpid.

Especially on platforms that already have good random number generation.
Comment on attachment 284642 [details] [diff] [review]
patch to correct issue described in comment #85

Can we check in this patch?
FWIW, In Fedora's experimental sandbox (called Rawhide) we started to use NSS 3.12 alpha for testing purposes.

In our OS package, I removed all patches that worked around this bug.
Now, I got new complaints that we are still executing netstat, which is causing some unwanted side effects for SELinux. We'd like to avoid executing netstat.

I would like to do this on trunk:

#ifdef LINUX
#undef DO_NETSTAT
#endif
Attachment #290839 - Flags: superreview?(rrelyea)
Attachment #290839 - Flags: review?(wtc)
Comment on attachment 290839 [details] [diff] [review]
Patch to undef DO_NETSTAT on Linux

Kai, I suggest you do the following:

1. Verify that we don't define DO_PS on any platform:
http://lxr.mozilla.org/security/search?string=DO_PS

Then remove the dead code ifdef'd with DO_PS.

2. Add LINUX to the code here and update the comment:

978 /*
979  * Bug 100447: On BSD/OS 4.2 and 4.3, we have problem calling safe_popen
980  * in a pthreads environment.  Therefore, we call safe_popen last and on
981  * BSD/OS we do not call safe_popen when we succeeded in getting data
982  * from /dev/urandom.
983  */
984 
985 #ifdef BSDI
986     if (bytes)
987         return;
988 #endif
I checked in Kurt Miller's patch on the NSS trunk for NSS 3.12.
I restored the use of tab characters to make the patch smaller.

Checking in unix_rand.c;
/cvsroot/mozilla/security/nss/lib/freebl/unix_rand.c,v  <--  unix_rand.c
new revision: 1.26; previous revision: 1.25
done
Attachment #284642 - Attachment is obsolete: true
(In reply to comment #96)
> (From update of attachment 290839 [details] [diff] [review])

> 2. Add LINUX to the code here and update the comment:
> 
> 978 /*
> 979  * Bug 100447: On BSD/OS 4.2 and 4.3, we have problem calling safe_popen
> 980  * in a pthreads environment.  Therefore, we call safe_popen last and on
> 981  * BSD/OS we do not call safe_popen when we succeeded in getting data
> 982  * from /dev/urandom.
> 983  */
> 984 
> 985 #ifdef BSDI
> 986     if (bytes)
> 987         return;
> 988 #endif

Note that there's a patch attached to bug 174993 (attachment 272917 [details] [diff] [review]), which achieves that effect. It hasn't been checked in so far, but got the necessary reviews.
Kaspar, thanks for the pointer. I asked in bug 174993 to give people a chance to veto, but if we don't hear anything within the next day, we should check that in.
(In reply to comment #96)
> 1. Verify that we don't define DO_PS on any platform:
> http://lxr.mozilla.org/security/search?string=DO_PS
> 
> Then remove the dead code ifdef'd with DO_PS.

Wan-Teh, I remove you made this proposal for cleanup purposes, only.
Sure, I can do that. Will attach a patch.
Comment on attachment 291228 [details] [diff] [review]
remove unused DO_PS code [checked in]

r=wtc.
Attachment #291228 - Flags: review?(wtc) → review+
Attachment #291228 - Attachment description: remove unused DO_PS code → remove unused DO_PS code [checked in]
Comment on attachment 291228 [details] [diff] [review]
remove unused DO_PS code [checked in]

This patch checked in:

/cvsroot/mozilla/security/nss/lib/freebl/unix_rand.c,v  <--  unix_rand.c
new revision: 1.27; previous revision: 1.26
Comment on attachment 290839 [details] [diff] [review]
Patch to undef DO_NETSTAT on Linux

r+ holding my nose....

I totally agree with undefining DO_NETSTAT for Linux.

I don't like the #undef method of setting it, unfortunately that's because there is an explicit #define DO_NETSTAT in the file.

I would much rather that #define be ifdef out rather than undef'ing DO_NETSTAT explicitly.

I would happily enteratain a patch that does that.

bob
Attachment #290839 - Flags: superreview?(rrelyea) → superreview+
Comment on attachment 290839 [details] [diff] [review]
Patch to undef DO_NETSTAT on Linux

This patch is obsolete.  This has already been fixed by another 
patch that has already been checked in:

1.28 <kaie@kuix.de> 2007-12-03 13:07
Bug 174993, don't fork netstat if data has been gathered from /dev/urandom
r=nelson

That fix is better than the one proposed here in one important 
respect.  If the attempt to read /dev/urandom should fail for some
reason, that other fix (already committed) does not leave the PRNG
unseeded.
Attachment #290839 - Flags: review-
Attachment #290839 - Flags: review?(wtc)
Comment on attachment 290839 [details] [diff] [review]
Patch to undef DO_NETSTAT on Linux

Nelson said this patch is obsolete. Setting the flag.
Attachment #290839 - Attachment is obsolete: true
I believe this bug is now fixed by the checkin documentd in comment 105.
Status: REOPENED → RESOLVED
Closed: 18 years ago17 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: