RNG_SystemInfoForRNG possible "netstat" zombie process

RESOLVED FIXED in 3.11.8

Status

P1
normal
RESOLVED FIXED
19 years ago
11 years ago

People

(Reporter: nelson, Assigned: wtc)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(URL)

Attachments

(9 attachments, 7 obsolete attachments)

1.09 KB, patch
nelson
: review+
julien.pierre
: review+
Details | Diff | Splinter Review
553 bytes, text/plain
Details
388 bytes, patch
nelson
: review+
Details | Diff | Splinter Review
29 bytes, text/plain
Details
3.38 KB, text/plain
Details
1003 bytes, patch
Details | Diff | Splinter Review
1.35 KB, patch
Details | Diff | Splinter Review
737 bytes, patch
Details | Diff | Splinter Review
1.62 KB, patch
wtc
: review+
Details | Diff | Splinter Review
This bug was formerly bugsplat 300544.
It applies to all UNIX systems, to the unix implementation of 
RNG_SystemInfoForRNG().

jgmysers@netscape.com wrote:
In ns/security/lib/util/unix_rand.c:safe_pclose(), if waitpid() 
fails with EINTR, the child process could be left as a zombie.

Comment 1

19 years ago
moving to 3.2
Depends on: 48258
Target Milestone: --- → 3.2
(Assignee)

Updated

18 years ago
QA Contact: wtc → sonmi

Updated

18 years ago
Target Milestone: 3.2 → 3.3
(Assignee)

Updated

18 years ago
Target Milestone: 3.3 → 3.4
(Assignee)

Comment 2

17 years ago
Changed the QA contact to Bishakha.
QA Contact: sonja.mirtitsch → bishakhabanerjee
(Assignee)

Comment 3

17 years ago
Set target milestone to NSS 3.5.
Target Milestone: 3.4 → 3.5

Comment 4

17 years ago
I'm seeing [a] netstat zombie[s] on Red Hat Linux 7.3 using the 2002052409 Linux
build of Mozilla 1.0RC3.

Let me know if I can provide any useful information about this somehow.
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2a) Gecko/2002091318

I see something like this all the time:
22483 ?        Z      0:00 [netstat <defunct>]
Is this what this bug is about?
It does not seem to be a problem.

pi
(Assignee)

Comment 6

16 years ago
Assigned the bug to myself.  Target NSS 3.7.
Assignee: ian.mcgreer → wtc
Priority: P3 → P2
Target Milestone: 3.5 → 3.7
(Assignee)

Comment 7

16 years ago
*** Bug 183778 has been marked as a duplicate of this bug. ***

Comment 8

16 years ago
Isn't there some much, much, MUCH more sane way to get entropy?

This bug should be considered a bit higher priority that it seems, since a
browser screwing with netstat will look to many people like spyway or a trojan.
(Assignee)

Comment 9

16 years ago
Moved to target milestone 3.8 because the original
NSS 3.7 release has been renamed 3.8.
Target Milestone: 3.7 → 3.8

Comment 10

16 years ago
It's still there in Mozilla 1.2.1.

Comment 11

16 years ago
15728 S    19:18   0:24      \_ /home/alex/programs/mozilla/mozilla-bin
15740 Z    19:18   0:00          \_ [netstat <defunct>]

Yup, it's always there after a while. Mozilla 1.2.1

Comment 12

16 years ago
I'm having the same problem also with this. Mozilla 1.3a

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3a) Gecko/20021212
Linux schloss 2.4.18-10 #1 Wed Aug 7 10:26:48 EDT 2002 i686 unknown


winzig   25910  1476  0 19:12 ?        00:00:00 /bin/sh /usr/local/mozilla/run-m
winzig   25916 25910  2 19:12 ?        00:02:08 /usr/local/mozilla/mozilla-bin
winzig   25918 25916  0 19:12 ?        00:00:00 /usr/local/mozilla/mozilla-bin
winzig   25919 25918  0 19:12 ?        00:00:01 /usr/local/mozilla/mozilla-bin
winzig   25920 25918  0 19:12 ?        00:00:00 /usr/local/mozilla/mozilla-bin
winzig   25922 25918  0 19:12 ?        00:00:00 /usr/local/mozilla/mozilla-bin
winzig   26157 25916  0 20:15 ?        00:00:00 [netstat <defunct>]
winzig   26243 25918  0 20:36 ?        00:00:00 /usr/local/mozilla/mozilla-bin

Comment 13

16 years ago
Looks like netstat hangs when it's stuck resolving an ip address into a hostname.

Comment 14

16 years ago
Netstat hanging would be a different bug.
Remove target milestone of 3.8, since these bugs didn't get into that release.
Target Milestone: 3.8 → ---
it's still there. Isn't it possible to use /dev/urandom?
Or is this not available on some systems?

Comment 17

15 years ago
zombified netstat still there in 1.6 and 1.7b. Wouldnt it be feasable to
incorporate something like "'EGD' http://egd.sourceforge.net/" into mozilla and
have mozilla read from a socket instead of using a system process that could be
exploitable to gain intelligence about running systems and maybe possible other
info or even gain a shell on a system. PS: When i see netstat being zombified
after a browser is using it, makes me think about the ole days of Internet
explorer reporting home "Phoning Home" like E.T. And knowing that you guys are
pretty respectable ppl i trust you to use netstat but there is allways that
slight possiblity of someone sneaking around doing malicious stuff. Please look
into how EGD handles getting more data for entropy.

Comment 18

15 years ago
Bug still there, with 1.7rc1 according to
http://qa.mandrakesoft.com/show_bug.cgi?id=9427

Neil, please have a look at this.  Take it if you want it.
(Reporter)

Updated

14 years ago
QA Contact: bishakhabanerjee → jason.m.reid

Comment 20

14 years ago
Created attachment 185311 [details] [diff] [review]
patch to avoid zombie netstat

This patch has been suggested by others. It avoids a theoretically possible
bug, one that I have been unable to force on Solaris 10 x86. Briefly, it is
unlikely to happen because the parent reads all of the output of the netstat
command before issuing the waitpid() call. Since the child's stdout is closed
by its exit routine  it will always be defunct immediately thereafter.
Furthermore, even when the parent is forced to truncate the input file and
while the child is explicitly held in the stopped state, waitpid() still
returns 0 rather than EINTR.

I suspect this last behavior may be different under different OSes.

Would some of the respondents who reported seeing this problem be kind enough
to try the patch?
Assignee: wtchang → neil.williams
Status: NEW → ASSIGNED
Attachment #185311 - Flags: superreview?(wtchang)
Attachment #185311 - Flags: review?(nelson)
If waitpid gets EINTR, do you want to kill pid?  or just waitpid again?

Comment 22

14 years ago
Comment on attachment 185311 [details] [diff] [review]
patch to avoid zombie netstat

Nelson, I agree that it would be more correct to call waitpid again. I'll
submit a revised patch.
Attachment #185311 - Attachment is obsolete: true

Comment 23

14 years ago
Created attachment 185508 [details] [diff] [review]
patch to avoid zombie netstat

Incorporating Nelson's suggestion.

I note that most of the bug citations mention Linux. Also note that the
Mandrake bug mentioned in comment 18 has been fixed, apparently by a Mandrake
release.
Attachment #185508 - Flags: superreview?(wtchang)
Attachment #185508 - Flags: review?(nelson)
Comment on attachment 185508 [details] [diff] [review]
patch to avoid zombie netstat

That counter increment/check needs to happen in every path that repeats this
loop.
Attachment #185508 - Flags: superreview?(wtchang)
Attachment #185508 - Flags: review?(nelson)
Attachment #185508 - Flags: review-
Comment on attachment 185311 [details] [diff] [review]
patch to avoid zombie netstat

Forgot to mark this patch with my previous review comment.
Attachment #185311 - Flags: superreview?(wtchang)
Attachment #185311 - Flags: review?(nelson)
Attachment #185311 - Flags: review-

Comment 26

14 years ago
To be honest, I just applied a Debian patch for Mandriva/Mandrake package which
commented the netstat call, which isn't really needed on Linux system, since
/dev/random randomness is good enough.

Updated

14 years ago
Attachment #185508 - Attachment is obsolete: true

Comment 27

14 years ago
Created attachment 185630 [details] [diff] [review]
third try

Reviewer's last comments incorporated.
Attachment #185630 - Flags: superreview?(wtchang)
Attachment #185630 - Flags: review?(nelson)
Comment on attachment 185630 [details] [diff] [review]
third try

Looks good to me.  Since we haven't found a reproducible test case with which
to test this patch, please get a second review on this.  Thanks.
Attachment #185630 - Flags: review?(nelson) → review+

Updated

14 years ago
Target Milestone: --- → 3.10.1

Comment 29

14 years ago
(In reply to comment #26)
> To be honest, I just applied a Debian patch for Mandriva/Mandrake package which
> commented the netstat call, which isn't really needed on Linux system, since
> /dev/random randomness is good enough.

You mean /dev/urandom, right ? On Linux, /dev/random might block. Not on BSD though.

While I'm not a reviewer, the latest patch looks good. EINTR will cause waipid
to be repeated (max 1000 times). All other errors will elave the loop
I'm running Gentoo 1.4.16, and I'm getting a zombie netstat process for both
Firefox (1.0.6 - Mozilla/5.0 - 20050722) and Thunderbird (1.0.6 (20050722)).
Originally, before upgrading to the latest Firefox and Thunderbird build, this
problem only used to occur with instances of Firefox.

 /bin/bash /usr/libexec/mozilla-launcher
  \_ /usr/lib/mozilla-firefox/firefox-bin
      \_ /usr/lib/mozilla-firefox/firefox-bin
      |   \_ /usr/lib/mozilla-firefox/firefox-bin
      |   \_ /usr/lib/mozilla-firefox/firefox-bin
      |   \_ /usr/lib/mozilla-firefox/firefox-bin
      |   \_ /usr/lib/mozilla-firefox/firefox-bin
      \_ [netstat] <defunct>


 /bin/bash /usr/libexec/mozilla-launcher
  \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      |   \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      |   \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      |   \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      |   \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      |   \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      \_ [netstat] <defunct>

(Assignee)

Updated

13 years ago
Target Milestone: 3.10.1 → 3.11
Comment on attachment 185630 [details] [diff] [review]
third try

This bug needs a second review.  Julien, will you review it?
Attachment #185630 - Flags: review?(julien.pierre.bugs)
(Assignee)

Comment 32

13 years ago
Comment on attachment 185630 [details] [diff] [review]
third try

r=wtc.	This patch may not fix this bug though.
I would rename the variable r as rv.
Attachment #185630 - Flags: superreview?(wtchang)
Attachment #185630 - Flags: superreview+
Attachment #185630 - Flags: review?(julien.pierre.bugs)

Comment 33

13 years ago
Comment on attachment 185630 [details] [diff] [review]
third try

Like Wan-Teh, I'm not sure if this bug will fix the problem, but I don't think
it will hurt.

Re: the kill() call, I think we should probably break out of the loop on any
error, and not just on ESRCH. In the Solaris man pages, the other possible
errors are EINVAL (invalid signal number) and EPERM (no permission to kill the
process), which should also be fatal, since subsequent calls to kill aren't
going to fix those problems, and the only point of continuing the waitpid loop
is to work gracefully if the process exits quickly within 1000 iterations. On
the other hand, these other errors should never happen, since we are passing a
hardcoded signal value, and we started the process, so we should have
permission - unless it changed user. I hope netstat doesn't do that ;).
Attachment #185630 - Flags: review+
Hmm. Netstat could be setuid root!  

Comment 35

13 years ago
I've appplied the patch, but unfortunately there are still sometimes hanging
zombie netstat processes:

&#9500;&#9472;firefox&#9472;&#9472;&#9472;run-mozilla.sh&#9472;&#9472;&#9472;firefox-bin&#9472;&#9472;&#9472;netstat
Hmm.  I wonder.  Perhaps the netstat processes are not zombies,
but are actually live hung processes that cannot be killed by
the parent process because they are running as root and the 
parent isn't.  A full long ps output might help diagnose that.
Please attach it rather than adding it inline as a comment.

Comment 37

13 years ago
(In reply to comment #36)
> Hmm.  I wonder.  Perhaps the netstat processes are not zombies,
> but are actually live hung processes that cannot be killed by
> the parent process because they are running as root and the 

No, the process is a real zombie and is not running under a different UID and
netstat is not SETUID.

> parent isn't.  A full long ps output might help diagnose that.
> Please attach it rather than adding it inline as a comment.

I'll attached a file with the relevant part of the ps output.

Comment 38

13 years ago
Created attachment 197818 [details]
ps output of firefox and netstat zombie
(Assignee)

Comment 39

13 years ago
Christian: what is the Linux distribution you are
using?  What's the kernel version and do you know
the glibc version and whether you are using LinuxThreads
or NPTL?  This will help us find a similar system to
reproduce this zombie netstat process.

Do you know how to use strace on firefox-bin to find out if
waitpid was called at all?  If not, I can supply a patch
to add logging statements.  (Or you can try to do that.)

Comment 40

13 years ago
The problem is most likely that waitpid() is being called with WNOHANG.  The
function can loop the 1000 times before the child process is scheduled to
receive the signal.

I think we need a blocking waitpid() call, at least after the first attempt. 
Things to prevent a deadlock would include setting a timer to cause the blocking
waitpid() to return EINTR and making sure the pipe is drained to EOF and/or closed.

Comment 41

13 years ago
Created attachment 197901 [details] [diff] [review]
Proposed fix (untested)
(Assignee)

Comment 42

13 years ago
Comment on attachment 197901 [details] [diff] [review]
Proposed fix (untested)

We should only need to make the kill(pid, SIGKILL) call
once.  If so, we can do that before calling waitpid.
It should be harmless to send SIGKILL to a zombie.

John, your patch never calls waitpid with WNOHANG. Is
this intentional?

In safe_popen, if fork succeeds but fdopen fails,
safe_popen will return NULL, which will cause safe_pclose
and therefore waitpid to not be called.  We should be
able to fix that by moving the fdopen call before the
fork call.  (fdopen is unlikely to fail though.)

Comment 43

13 years ago
I think the SIGKILL is unnecessary.  I put the patch together quickly, feel free
to tweak it.

Comment 44

13 years ago
(In reply to comment #41)
> Created an attachment (id=197901) [edit]
> Proposed fix (untested)
> 

(In reply to comment #42)
> (From update of attachment 197901 [details] [diff] [review] [edit])
> We should only need to make the kill(pid, SIGKILL) call
> once.  If so, we can do that before calling waitpid.
> It should be harmless to send SIGKILL to a zombie.
> 
> John, your patch never calls waitpid with WNOHANG. Is
> this intentional?
> 
> In safe_popen, if fork succeeds but fdopen fails,
> safe_popen will return NULL, which will cause safe_pclose
> and therefore waitpid to not be called.  We should be
> able to fix that by moving the fdopen call before the
> fork call.  (fdopen is unlikely to fail though.)
> 

I agree with Wan-Teh. Waitpid without WNOHANG could cause indefinite wait.
Should probably try WHNOHANG first, then if waitpid() returns 0, kill() and
waitpid(pid, &status, 0). I'll try to work up a patch.
Created attachment 216373 [details] [diff] [review]
workaround for Linux

Please add this patch as well, since the code is useless on Linux, where /dev/urandom is used.
The patch is from Wichert Akkerman (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=241200 )
Attachment #216373 - Flags: review?(nelson)
(Reporter)

Updated

13 years ago
QA Contact: jason.m.reid → libraries
Neil, contact Wichert Akkerman and Eric Dorland (see the URL above for their
email addresses), and see if they can reproduce at will, and are willing to
test your patch.
Comment on attachment 216373 [details] [diff] [review]
workaround for Linux

Our first preference is to fix this bug "the right way" which uses netstat but doesn't leave it as a zombie.  If that can be achieved, then we want to use netstat even on Linux. 
But this needs resolution now.  So, if the "right fix" cannot be verified before 3.11.1-Beta, then I think we must deploy this workaround at that time.
Attachment #216373 - Attachment description: additional patch → workaround for Linux
Attachment #216373 - Flags: review?(nelson) → review+
(Reporter)

Updated

13 years ago
Target Milestone: 3.11 → 3.11.1
(Reporter)

Updated

13 years ago
Priority: P2 → P1
Target Milestone: 3.11.1 → 3.11.2

Comment 48

13 years ago
Per our meeting, retargetting to 3.12.
Target Milestone: 3.11.2 → 3.12

Comment 49

12 years ago
This seems to still be present in Firefox 2.0 derived from Debian.  The #ifdef LINUX patch seems to be applied (I'm not sure whether that's a Debian patch or if it was merged), but doesn't seem to be taking effect.  "linux" is what the preprocessor provides, not "LINUX", though perhaps that's supposed to be defined somewhere in the tree.
(In reply to comment #49)
> This seems to still be present in Firefox 2.0 derived from Debian.  The #ifdef
> LINUX patch seems to be applied (I'm not sure whether that's a Debian patch or
> if it was merged), but doesn't seem to be taking effect.  "linux" is what the
> preprocessor provides, not "LINUX", though perhaps that's supposed to be
> defined somewhere in the tree.

The #ifdef LINUX in Firefox 2.0 in Ubuntu is something from Ubuntu, since there have been changes to the unix_rand.c file between the latest firefox in debian experimental (2.0b2) and 2.0 final. Though this new Ubuntu patch should work, I can confirm that the one that was applied in the Debian version does what it is supposed to do.
Status: ASSIGNED → RESOLVED
Last Resolved: 12 years ago
Resolution: --- → FIXED
(Reporter)

Updated

12 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Updated

12 years ago
Assignee: neil.williams → wtchang
Status: REOPENED → NEW
(Assignee)

Comment 51

12 years ago
Created attachment 250404 [details]
Test program mystat.c (used in place of netstat)

I reproduced this zombie process on Red Hat Enterprise Linux 4
Update 4 using two small test programs, and understood why the
code fails to reap the child process.

Here is the first test program, mystat.c.  Compile it with
    gcc mystat.c -o mystat

It is to be used in place of netstat.  You can also use netstat,
but for some reason it's easier to reproduce the bug with mystat.
The mystat program does nothing.
(Assignee)

Comment 52

12 years ago
Created attachment 250405 [details]
Test program popen.c

Here is the second test program, popen.c.  Compile it with
    gcc popen.c -o popen

Then, run the test with
    ./popen

On RHEL 4 U4, it is not hard to see something like this:
    kill(23585, SIGKILL)=0
    [repeated 1000 times]
    count equals 1000

If you see this, run this command immediately (within 10
seconds) to see the zombie process:

    $ ps -eaf | grep mystat
    wtchang  23585 23584  0 17:05 pts/4    00:00:00 [mystat] <defunct>
    wtchang  23589 16153  0 17:05 pts/5    00:00:00 grep mystat

I found that waitpid(..., WNOHANG) repeatedly returns 0 because
the OS is only running the popen process and is not giving the
mystat process a chance to run and terminate.  The evidence is
that if the popen program calls sched_yield() between the
waitpid(..., WNOHANG) calls (there is one such call in popen.c
that you can uncomment out), then the bug cannot be reproduced,
and you will only see at most one line like this:
    kill(23608, SIGKILL)=0
(Assignee)

Comment 53

12 years ago
Created attachment 250408 [details] [diff] [review]
Proposed fix v2

John's description in comment 40 of what causes the problem
is right on.

This patch is based on John's patch in comment 41 and implements
what Neil outlined in comment 44.
Attachment #197901 - Attachment is obsolete: true
Attachment #250408 - Flags: superreview?(rrelyea)
Attachment #250408 - Flags: review?(neil.williams)
(Assignee)

Comment 54

12 years ago
Created attachment 250410 [details] [diff] [review]
Patch for a minor problem in safe_popen

This patch fixes the corner case in safe_popen that I described
at the end of comment 42.
Attachment #250410 - Flags: review?(neil.williams)

Comment 55

12 years ago
Comment on attachment 250410 [details] [diff] [review]
Patch for a minor problem in safe_popen

This looks correct. Is it not correct to close(p[0]) when fork() returns -1? Why?
Attachment #250410 - Flags: review?(neil.williams) → review+

Comment 56

12 years ago
Comment on attachment 250408 [details] [diff] [review]
Proposed fix v2

Thanks for tracking this down, Wan-Teh. It finally makes good sense what's happening. I was going to use sleep(1) but this is better.
Attachment #250408 - Flags: review?(neil.williams) → review+

Comment 57

12 years ago
I noticed non-printable characters in unix_rand.c. Turns out they're form feed characters--at lines 283,324 and others. As a matter of policy should these be removed when working up a patch?
Neil, Feel free to remove FF characters in your next patch.
(Assignee)

Comment 59

12 years ago
(In reply to comment #55)
> (From update of attachment 250410 [details] [diff] [review])
> This looks correct. Is it not correct to close(p[0]) when fork() returns -1?
> Why?

p[0] has been associated with fp, so fclose(fp) will call close(p[0])
under the cover, according to the fclose man page in the Single Unix
Specification:
http://www.opengroup.org/onlinepubs/009695399/functions/fclose.html

  The fclose() function shall perform the equivalent of a close()
  on the file descriptor that is associated with the stream pointed
  to by stream.

I should add a comment to clarify this.  How about something like this:

  fclose(fp);  /* this closes p[0], the fd associated with fp */
(Assignee)

Comment 60

12 years ago
Note that the fdopen function associates a stream with a file descriptor.
Our code has
    fp = fdopen(p[0], "r");

Comment 61

12 years ago
Comment on attachment 250408 [details] [diff] [review]
Proposed fix v2

r+ = relyea

Did you want to add the yield or sleep(0) in this version of the patch as well?
Attachment #250408 - Flags: superreview?(rrelyea) → superreview+
(Assignee)

Comment 62

12 years ago
Created attachment 250662 [details] [diff] [review]
Patch for a minor problem in safe_popen (as checked in)

I checked in this patch on the NSS trunk (NSS 3.12) with
a comment to explain that fclose(fp) closes p[0].

Checking in unix_rand.c;
/cvsroot/mozilla/security/nss/lib/freebl/unix_rand.c,v  <--  unix_rand.c
new revision: 1.23; previous revision: 1.22
done
Attachment #250410 - Attachment is obsolete: true
(Assignee)

Comment 63

12 years ago
Created attachment 250666 [details] [diff] [review]
Proposed fix (as checked in)

I added a PR_Sleep(0) call instead, which calls sched_yield().

Calling sched_yield() directly would have porting issues because
not all Unix platforms have sched_yield() and it would require
linking libfreebl3.so with -lrt.  Calling PR_Sleep(0) allows us
to let NSPR handle these porting issues.

I considered calling sleep(0), which is available everywhere and
defined in libc, but I can't find anything in the sleep man page
in the Single Unix Specification that says sleep(0) yields the
processor:
    http://www.opengroup.org/onlinepubs/009695399/functions/sleep.html
The man page for the related function usleep says usleep(0) "has no
effect":
    http://www.opengroup.org/onlinepubs/009695399/functions/usleep.html
So I decided to not use sleep(0).

I checked in this patch on the NSS trunk (NSS 3.12).

Checking in unix_rand.c;
/cvsroot/mozilla/security/nss/lib/freebl/unix_rand.c,v  <--  unix_rand.c
new revision: 1.24; previous revision: 1.23
done
Attachment #250408 - Attachment is obsolete: true
(Assignee)

Updated

12 years ago
Status: NEW → RESOLVED
Last Resolved: 12 years ago12 years ago
Resolution: --- → FIXED
Wan-Teh, Can the kill() call fail with EINTR?
Should the kill call be retried if it fails with EINTR,
as the patch now does for the waitpid() calls?
(Assignee)

Comment 65

12 years ago
Nelson, kill() cannot fail with EINTR:
http://www.opengroup.org/onlinepubs/009695399/functions/kill.html

Actually waitpid(..., WNOHANG) most likely can't fail with EINTR
either, but I still handle it as defensive programming.  In general
only blocking system calls can fail with EINTR.
(Assignee)

Comment 66

12 years ago
I checked in the "proposed patch (as checked in)" on the
NSS_3_11_BRANCH (NSS 3.11.6).

Checking in unix_rand.c;
/cvsroot/mozilla/security/nss/lib/freebl/unix_rand.c,v  <--  unix_rand.c
new revision: 1.17.10.6; previous revision: 1.17.10.5
done
Target Milestone: 3.12 → 3.11.6
does anybody supposed to commit https://bugzilla.mozilla.org/attachment.cgi?id=216373
?
Comment on attachment 216373 [details] [diff] [review]
workaround for Linux

requesting sr
Attachment #216373 - Flags: superreview?(wtchang)
(Assignee)

Comment 69

12 years ago
Comment on attachment 216373 [details] [diff] [review]
workaround for Linux

Let's come up with a better way to turn off the
safe_popen(netstat_ni_cmd) call.  We already use
two different methods for BSD/OS and Solaris.  The
sample code omits the irrelevant code.

Method 1 (used by BSD/OS): if we succeeded in getting
data from /dev/urandom, then we skip the
safe_popen(netstat_ni_cmd) call.

    bytes = RNG_FileUpdate("/dev/urandom", SYSTEM_RNG_SEED_COUNT);

#ifdef BSDI
    if (bytes)
        return;
#endif

    fp = safe_popen(netstat_ni_cmd);
    if (fp != NULL) {
        while ((bytes = fread(buf, 1, sizeof(buf), fp)) > 0)
            RNG_RandomUpdate(buf, bytes);
        safe_pclose(fp);
    }

Method 2 (used by Solaris): turn off the safe_popen(netstat_ni_cmd)
call by conditional compilation, equivalent to this patch.

#define DO_NETSTAT 1

#ifdef SOLARIS
#undef DO_NETSTAT
#endif

#ifdef DO_NETSTAT
    fp = safe_popen(netstat_ni_cmd);
    if (fp != NULL) {
        while ((bytes = fread(buf, 1, sizeof(buf), fp)) > 0)
            RNG_RandomUpdate(buf, bytes);
        safe_pclose(fp);
    }
#endif

So, this patch could be replaced by a patch that undefines DO_NETSTAT
for Linux.  But I think we should generalize the new patch to cover
all the platforms that have /dev/urandom.  It may be better to do this
in a new bug report.
Attachment #216373 - Flags: superreview?(wtchang) → superreview-
What dou think about this in nspr? 
static PRStatus OpenDevRandom( void )
 {
-    fdDevRandom = open( "/dev/random", O_RDONLY );
+    fdDevRandom = open( "/dev/urandom", O_RDONLY );
     return((-1 == fdDevRandom)? PR_FAILURE : PR_SUCCESS );
 } /* end OpenDevRandom() */

(Assignee)

Comment 71

12 years ago
romaxa: good idea.  NSPR should also use /dev/urandom instead of
/dev/random.
(In reply to comment #69)
> So, this patch could be replaced by a patch that undefines DO_NETSTAT
> for Linux.

FWIW, we've been doing this in debian for a while to solve the netstat zombies problem.
I'm confused whether this bug is really fixed.

Wan-Teh, I do not understand the purpose of your comment 69 and why you discuss attachment 216373 [details] [diff] [review] proposes to undefine DO_NETSTAT for Linux.

What is the motivation for a new patch? Do you want to "optimize" the fix?

Is is still necessary to undefine DO_NETSTAT for Linux in order to fix this bug?

Or did Wan-Teh's checkin of attachment 250666 [details] [diff] [review], comment 66, to the 3.11.6 branch already fix the reported problems?

Did someone test/verify?

Thanks in advance for a summary of the current status.
Kai, I wouldn't try to fix this any more than it is already fixed unless 
someone complains that it is not actually fixed at present.  I see no such
complaint above.
Nelson, the motivation for my question:

I would like to understand the best approach to backport a fix to Firefox/SeaMonkey in RHEL 4.

Should we go with "#undef DO_NETSTAT"
or should we take the patch that was checked in.

Because Wan-Teh continued to discuss, I was concerned the checked in patch might not be complete.
FWIW, we do #undef DO_NETSTAT on Debian (and Ubuntu does too, AFAIK). Also see my comment #45. I still don't understand obstination as in comment #47. Why continue to use netstat for the sake of it ?

Comment 77

12 years ago
Kai: FYI "#undef DO_NETSTAT" has also been tried on RHEL4/firefox and appears to fix the netstat zombie issue there, see IssueTracker#119883.
(Assignee)

Comment 78

12 years ago
Kai, for your backporting, take the patch that was checked in.

We should consider removing the netstat (safe_popen and safe_pclose)
code on the NSS trunk.  Let's do that in a new bug report.
Thanks for the clarifications and your recommendations.

I want to follow Wan-Teh's recommendation to use the patch as checked in.
That way we will use the same patch now and in future NSS updates.
(Reporter)

Updated

12 years ago
Duplicate of this bug: 385465
(Reporter)

Updated

12 years ago
Duplicate of this bug: 385465
Bug 385465 alleges that the fix for this bug was/is incomplete, and continues
to fail in certain reproducible cases.  A patch attached to that bug,
attachment 269706 [details] [diff] [review], contains a proposed fix for the trunk.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 3.11.6 → 3.11.8
Comment on attachment 250662 [details] [diff] [review]
Patch for a minor problem in safe_popen (as checked in)

This patch is now also on the branch:
unix_rand.c; new revision: 1.17.10.7; previous revision: 1.17.10.6

unix_rand.c on the branch now matches the trunk.
Is there a reason this bug is still opened? Looks like the other problems were fixed, but maybe I'm missing something.

Comment 85

11 years ago
Yes the problem described in Bug 385465 still remains. There is a proposed patch to correct the issue attachment 269706 [details] [diff] [review] that has not been committed. It would be great if this could be reviewed and committed.

In summary the issue is that the return value of kill() on a zombie process is not standardized by POSIX. It recommends that kill() on a zombie not return ESRCH but then acknowledges that many systems do. safe_pclose() on those systems has a race that leads to deadlock which I have hit quite a bit.

Regardless of how the operating system implements kill() on zombie processes, I don't see the logic in bailing out of safe_pclose() if kill() fails for any reason. waitpid() should still be called so that the conditions for the deadlock are avoided (i.e. don't exit safe_pclose() leaving the zombie around).
Kurt,
Attachment 269706 [details] [diff] is attached to a resolved bug, and has no review request,
so it's not getting any attention because it doesn't show up on any searches
for patches awaiting review on unresolved bugs.

I suggest attaching it to this bug, and requesting review from Wan-Teh.
(Assignee)

Comment 87

11 years ago
This bug was reopened on 2007-07-01.  It's better to close this bug
and restore its 3.11.6 target milestone, and use bug 385465 for the
remaining issue.
I'm not sure that I agree that it's better.  
Bug 385465 proves that this bug was marked "fixed" prematurely.
A checkin was done, but it seems that this bug is still not actually fixed.  
I'd like to keep all the history of this bug in one bug report until it
is really fixed.

Comment 89

11 years ago
Created attachment 284642 [details] [diff] [review]
patch to correct issue described in comment #85

Patch to correct remaining issue (thanks Nelson for the pointer). Requesting review from Wan-Teh.

I'm not sure what if anything I should do with the flags option on this form.
Nelson or Wan-Teh perhaps you could fix them if I got it wrong.
(Reporter)

Updated

11 years ago
Attachment #284642 - Flags: review?(wtc)
(Assignee)

Comment 90

11 years ago
Comment on attachment 284642 [details] [diff] [review]
patch to correct issue described in comment #85

NSS 3.11.6 is already released and has a partial fix for
this bug.  The partial fix works on all the platforms that
the NSS team tested on, therefore it is still useful.
Since we didn't write release notes for NSS 3.11.6,
the target milestone of this bug is a good way to document
the fact that NSS 3.11.6 has a partial fix.  Reopening
this bug and changing the target milestone makes it
hard to discover that information.

Kurt, your patch is correct.  I considered changing the
code to call kill() only once, but decided not to make the
change because "it ain't broke".

Is OpenBSD your platform?  I don't have POSIX, so I use
Single Unix Specification to help me write portable code.
Its kill() man page doesn't mention the issue with zombies
and ESRCH:
http://opengroup.org/onlinepubs/007908799/xsh/kill.html

The best fix for this bug is for NSS to not exec
processes and mess with signal handlers and waitpid.
Attachment #284642 - Flags: review?(wtc) → review+

Comment 91

11 years ago
Thanks for the review. Yes OpenBSD is my platform. I guess I should have said SUSv3 instead of POSIX. Indeed the SUSv2/POSIX doesn't address how zombies are handled which is how I suppose different 1003.1-1988 compliant kill() implementations came about.

Here's the url to the SUSv3 version that I was referring to:

http://www.opengroup.org/onlinepubs/009695399/functions/kill.html

There's a paragraph in the Rational section beginning with "Existing implementations..."
(In reply to comment #90)
> The best fix for this bug is for NSS to not exec
> processes and mess with signal handlers and waitpid.

Especially on platforms that already have good random number generation.
Comment on attachment 284642 [details] [diff] [review]
patch to correct issue described in comment #85

Can we check in this patch?
FWIW, In Fedora's experimental sandbox (called Rawhide) we started to use NSS 3.12 alpha for testing purposes.

In our OS package, I removed all patches that worked around this bug.
Now, I got new complaints that we are still executing netstat, which is causing some unwanted side effects for SELinux. We'd like to avoid executing netstat.

I would like to do this on trunk:

#ifdef LINUX
#undef DO_NETSTAT
#endif
Created attachment 290839 [details] [diff] [review]
Patch to undef DO_NETSTAT on Linux
Attachment #290839 - Flags: superreview?(rrelyea)
Attachment #290839 - Flags: review?(wtc)
(Assignee)

Comment 96

11 years ago
Comment on attachment 290839 [details] [diff] [review]
Patch to undef DO_NETSTAT on Linux

Kai, I suggest you do the following:

1. Verify that we don't define DO_PS on any platform:
http://lxr.mozilla.org/security/search?string=DO_PS

Then remove the dead code ifdef'd with DO_PS.

2. Add LINUX to the code here and update the comment:

978 /*
979  * Bug 100447: On BSD/OS 4.2 and 4.3, we have problem calling safe_popen
980  * in a pthreads environment.  Therefore, we call safe_popen last and on
981  * BSD/OS we do not call safe_popen when we succeeded in getting data
982  * from /dev/urandom.
983  */
984 
985 #ifdef BSDI
986     if (bytes)
987         return;
988 #endif
(Assignee)

Comment 97

11 years ago
Created attachment 290909 [details] [diff] [review]
patch to correct issue described in comment #85 (as checked in)

I checked in Kurt Miller's patch on the NSS trunk for NSS 3.12.
I restored the use of tab characters to make the patch smaller.

Checking in unix_rand.c;
/cvsroot/mozilla/security/nss/lib/freebl/unix_rand.c,v  <--  unix_rand.c
new revision: 1.26; previous revision: 1.25
done
Attachment #284642 - Attachment is obsolete: true

Comment 98

11 years ago
(In reply to comment #96)
> (From update of attachment 290839 [details] [diff] [review])

> 2. Add LINUX to the code here and update the comment:
> 
> 978 /*
> 979  * Bug 100447: On BSD/OS 4.2 and 4.3, we have problem calling safe_popen
> 980  * in a pthreads environment.  Therefore, we call safe_popen last and on
> 981  * BSD/OS we do not call safe_popen when we succeeded in getting data
> 982  * from /dev/urandom.
> 983  */
> 984 
> 985 #ifdef BSDI
> 986     if (bytes)
> 987         return;
> 988 #endif

Note that there's a patch attached to bug 174993 (attachment 272917 [details] [diff] [review]), which achieves that effect. It hasn't been checked in so far, but got the necessary reviews.
Kaspar, thanks for the pointer. I asked in bug 174993 to give people a chance to veto, but if we don't hear anything within the next day, we should check that in.
(In reply to comment #96)
> 1. Verify that we don't define DO_PS on any platform:
> http://lxr.mozilla.org/security/search?string=DO_PS
> 
> Then remove the dead code ifdef'd with DO_PS.

Wan-Teh, I remove you made this proposal for cleanup purposes, only.
Sure, I can do that. Will attach a patch.
Created attachment 291228 [details] [diff] [review]
remove unused DO_PS code [checked in]
Attachment #291228 - Flags: review?(wtc)
(Assignee)

Comment 102

11 years ago
Comment on attachment 291228 [details] [diff] [review]
remove unused DO_PS code [checked in]

r=wtc.
Attachment #291228 - Flags: review?(wtc) → review+

Updated

11 years ago
Attachment #291228 - Attachment description: remove unused DO_PS code → remove unused DO_PS code [checked in]
Comment on attachment 291228 [details] [diff] [review]
remove unused DO_PS code [checked in]

This patch checked in:

/cvsroot/mozilla/security/nss/lib/freebl/unix_rand.c,v  <--  unix_rand.c
new revision: 1.27; previous revision: 1.26

Comment 104

11 years ago
Comment on attachment 290839 [details] [diff] [review]
Patch to undef DO_NETSTAT on Linux

r+ holding my nose....

I totally agree with undefining DO_NETSTAT for Linux.

I don't like the #undef method of setting it, unfortunately that's because there is an explicit #define DO_NETSTAT in the file.

I would much rather that #define be ifdef out rather than undef'ing DO_NETSTAT explicitly.

I would happily enteratain a patch that does that.

bob
Attachment #290839 - Flags: superreview?(rrelyea) → superreview+
Comment on attachment 290839 [details] [diff] [review]
Patch to undef DO_NETSTAT on Linux

This patch is obsolete.  This has already been fixed by another 
patch that has already been checked in:

1.28 <kaie@kuix.de> 2007-12-03 13:07
Bug 174993, don't fork netstat if data has been gathered from /dev/urandom
r=nelson

That fix is better than the one proposed here in one important 
respect.  If the attempt to read /dev/urandom should fail for some
reason, that other fix (already committed) does not leave the PRNG
unseeded.
Attachment #290839 - Flags: review-
(Assignee)

Updated

11 years ago
Attachment #290839 - Flags: review?(wtc)
Comment on attachment 290839 [details] [diff] [review]
Patch to undef DO_NETSTAT on Linux

Nelson said this patch is obsolete. Setting the flag.
Attachment #290839 - Attachment is obsolete: true
I believe this bug is now fixed by the checkin documentd in comment 105.
Status: REOPENED → RESOLVED
Last Resolved: 12 years ago11 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.