Closed Bug 51429 Opened 24 years ago Closed 17 years ago

RNG_SystemInfoForRNG possible "netstat" zombie process

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

3.11.8

People

(Reporter: nelson, Assigned: wtc)

References

(
URL
)

Details

Attachments

(9 files, 7 obsolete files)

patch to avoid zombie netstat 19 years ago Neil Williams 901 bytes, patch	nelson : review-	Details \| Diff \| Splinter Review
patch to avoid zombie netstat 19 years ago Neil Williams 946 bytes, patch	nelson : review-	Details \| Diff \| Splinter Review
third try 19 years ago Neil Williams 1.09 KB, patch	nelson : review+ julien.pierre : review+ wtc : superreview+	Details \| Diff \| Splinter Review
ps output of firefox and netstat zombie 19 years ago Christian Krause 553 bytes, text/plain		Details
Proposed fix (untested) 19 years ago John G. Myers 1.14 KB, patch		Details \| Diff \| Splinter Review
workaround for Linux 18 years ago Mike Hommey [:glandium] 388 bytes, patch	nelson : review+ wtc : superreview-	Details \| Diff \| Splinter Review
Test program mystat.c (used in place of netstat) 18 years ago Wan-Teh Chang 29 bytes, text/plain		Details
Test program popen.c 18 years ago Wan-Teh Chang 3.38 KB, text/plain		Details
Proposed fix v2 18 years ago Wan-Teh Chang 1.08 KB, patch	neil.williams : review+ rrelyea : superreview+	Details \| Diff \| Splinter Review
Patch for a minor problem in safe_popen 18 years ago Wan-Teh Chang 955 bytes, patch	neil.williams : review+	Details \| Diff \| Splinter Review
Patch for a minor problem in safe_popen (as checked in) 18 years ago Wan-Teh Chang 1003 bytes, patch		Details \| Diff \| Splinter Review
Proposed fix (as checked in) 18 years ago Wan-Teh Chang 1.35 KB, patch		Details \| Diff \| Splinter Review
patch to correct issue described in comment #85 17 years ago Kurt Miller 601 bytes, patch	wtc : review+	Details \| Diff \| Splinter Review
Patch to undef DO_NETSTAT on Linux 17 years ago Kai Engert (:KaiE:) 868 bytes, patch	nelson : review- rrelyea : superreview+	Details \| Diff \| Splinter Review
patch to correct issue described in comment #85 (as checked in) 17 years ago Wan-Teh Chang 737 bytes, patch		Details \| Diff \| Splinter Review
remove unused DO_PS code [checked in] 17 years ago Kai Engert (:KaiE:) 1.62 KB, patch	wtc : review+	Details \| Diff \| Splinter Review

Nelson Bolyard (seldom reads bugmail)

Reporter

Description

•

24 years ago

This bug was formerly bugsplat 300544.
It applies to all UNIX systems, to the unix implementation of 
RNG_SystemInfoForRNG().

jgmysers@netscape.com wrote:
In ns/security/lib/util/unix_rand.c:safe_pclose(), if waitpid() 
fails with EINTR, the child process could be left as a zombie.

Ian McGreer

Comment 1

•

24 years ago

moving to 3.2

Depends on: 48258

Target Milestone: --- → 3.2

Wan-Teh Chang

Assignee

Updated

•

24 years ago

QA Contact: wtc → sonmi

Ian McGreer

Updated

•

24 years ago

Target Milestone: 3.2 → 3.3

Wan-Teh Chang

Assignee

Updated

•

23 years ago

Target Milestone: 3.3 → 3.4

Wan-Teh Chang

Assignee

Comment 2

•

22 years ago

Changed the QA contact to Bishakha.

QA Contact: sonja.mirtitsch → bishakhabanerjee

Wan-Teh Chang

Assignee

Comment 3

•

22 years ago

Set target milestone to NSS 3.5.

Target Milestone: 3.4 → 3.5

Jonathan Abbey

Comment 4

•

22 years ago

I'm seeing [a] netstat zombie[s] on Red Hat Linux 7.3 using the 2002052409 Linux
build of Mozilla 1.0RC3.

Let me know if I can provide any useful information about this somehow.

Boris 'pi' Piwinger

Comment 5

•

22 years ago

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2a) Gecko/2002091318

I see something like this all the time:
22483 ?        Z      0:00 [netstat <defunct>]
Is this what this bug is about?
It does not seem to be a problem.

pi

Wan-Teh Chang

Assignee

Comment 6

•

22 years ago

Assigned the bug to myself.  Target NSS 3.7.

Assignee: ian.mcgreer → wtc

Priority: P3 → P2

Target Milestone: 3.5 → 3.7

Wan-Teh Chang

Assignee

Comment 7

•

22 years ago

*** Bug 183778 has been marked as a duplicate of this bug. ***

Jeremy M. Dolan

Comment 8

•

22 years ago

Isn't there some much, much, MUCH more sane way to get entropy?

This bug should be considered a bit higher priority that it seems, since a
browser screwing with netstat will look to many people like spyway or a trojan.

Wan-Teh Chang

Assignee

Comment 9

•

22 years ago

Moved to target milestone 3.8 because the original
NSS 3.7 release has been renamed 3.8.

Target Milestone: 3.7 → 3.8

Ken Yap

Comment 10

•

22 years ago

It's still there in Mozilla 1.2.1.

Alex Papadopoulos

Comment 11

•

22 years ago

15728 S    19:18   0:24      \_ /home/alex/programs/mozilla/mozilla-bin
15740 Z    19:18   0:00          \_ [netstat <defunct>]

Yup, it's always there after a while. Mozilla 1.2.1

Brian Little

Comment 12

•

22 years ago

I'm having the same problem also with this. Mozilla 1.3a

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3a) Gecko/20021212
Linux schloss 2.4.18-10 #1 Wed Aug 7 10:26:48 EDT 2002 i686 unknown


winzig   25910  1476  0 19:12 ?        00:00:00 /bin/sh /usr/local/mozilla/run-m
winzig   25916 25910  2 19:12 ?        00:02:08 /usr/local/mozilla/mozilla-bin
winzig   25918 25916  0 19:12 ?        00:00:00 /usr/local/mozilla/mozilla-bin
winzig   25919 25918  0 19:12 ?        00:00:01 /usr/local/mozilla/mozilla-bin
winzig   25920 25918  0 19:12 ?        00:00:00 /usr/local/mozilla/mozilla-bin
winzig   25922 25918  0 19:12 ?        00:00:00 /usr/local/mozilla/mozilla-bin
winzig   26157 25916  0 20:15 ?        00:00:00 [netstat <defunct>]
winzig   26243 25918  0 20:36 ?        00:00:00 /usr/local/mozilla/mozilla-bin

Mike Lieman

Comment 13

•

21 years ago

Looks like netstat hangs when it's stuck resolving an ip address into a hostname.

John G. Myers

Comment 14

•

21 years ago

Netstat hanging would be a different bug.

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 15

•

21 years ago

Remove target milestone of 3.8, since these bugs didn't get into that release.

Target Milestone: 3.8 → ---

Wolfgang Rosenauer [:wolfiR]

Comment 16

•

21 years ago

it's still there. Isn't it possible to use /dev/urandom?
Or is this not available on some systems?

c0ldbyte

Comment 17

•

20 years ago

zombified netstat still there in 1.6 and 1.7b. Wouldnt it be feasable to
incorporate something like "'EGD' http://egd.sourceforge.net/" into mozilla and
have mozilla read from a socket instead of using a system process that could be
exploitable to gain intelligence about running systems and maybe possible other
info or even gain a shell on a system. PS: When i see netstat being zombified
after a browser is using it, makes me think about the ole days of Internet
explorer reporting home "Phoning Home" like E.T. And knowing that you guys are
pretty respectable ppl i trust you to use netstat but there is allways that
slight possiblity of someone sneaking around doing malicious stuff. Please look
into how EGD handles getting more data for entropy.

Frederic Crozat

Comment 18

•

20 years ago

Bug still there, with 1.7rc1 according to
http://qa.mandrakesoft.com/show_bug.cgi?id=9427

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 19

•

19 years ago

Neil, please have a look at this.  Take it if you want it.

Nelson Bolyard (seldom reads bugmail)

Reporter

Updated

•

19 years ago

QA Contact: bishakhabanerjee → jason.m.reid

Neil Williams

Comment 20

•

19 years ago

Attached patch patch to avoid zombie netstat (obsolete) — Details — Splinter Review

This patch has been suggested by others. It avoids a theoretically possible
bug, one that I have been unable to force on Solaris 10 x86. Briefly, it is
unlikely to happen because the parent reads all of the output of the netstat
command before issuing the waitpid() call. Since the child's stdout is closed
by its exit routine  it will always be defunct immediately thereafter.
Furthermore, even when the parent is forced to truncate the input file and
while the child is explicitly held in the stopped state, waitpid() still
returns 0 rather than EINTR.

I suspect this last behavior may be different under different OSes.

Would some of the respondents who reported seeing this problem be kind enough
to try the patch?

Assignee: wtchang → neil.williams

Status: NEW → ASSIGNED

Attachment #185311 - Flags: superreview?(wtchang)

Attachment #185311 - Flags: review?(nelson)

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 21

•

19 years ago

If waitpid gets EINTR, do you want to kill pid?  or just waitpid again?

Neil Williams

Comment 22

•

19 years ago

Comment on attachment 185311 [details] [diff] [review]
patch to avoid zombie netstat

Nelson, I agree that it would be more correct to call waitpid again. I'll
submit a revised patch.

Attachment #185311 - Attachment is obsolete: true

Neil Williams

Comment 23

•

19 years ago

Attached patch patch to avoid zombie netstat (obsolete) — Details — Splinter Review

Incorporating Nelson's suggestion.

I note that most of the bug citations mention Linux. Also note that the
Mandrake bug mentioned in comment 18 has been fixed, apparently by a Mandrake
release.

Attachment #185508 - Flags: superreview?(wtchang)

Attachment #185508 - Flags: review?(nelson)

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 24

•

19 years ago

Comment on attachment 185508 [details] [diff] [review]
patch to avoid zombie netstat

That counter increment/check needs to happen in every path that repeats this
loop.

Attachment #185508 - Flags: superreview?(wtchang)

Attachment #185508 - Flags: review?(nelson)

Attachment #185508 - Flags: review-

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 25

•

19 years ago

Comment on attachment 185311 [details] [diff] [review]
patch to avoid zombie netstat

Forgot to mark this patch with my previous review comment.

Attachment #185311 - Flags: superreview?(wtchang)

Attachment #185311 - Flags: review?(nelson)

Attachment #185311 - Flags: review-

Frederic Crozat

Comment 26

•

19 years ago

To be honest, I just applied a Debian patch for Mandriva/Mandrake package which
commented the netstat call, which isn't really needed on Linux system, since
/dev/random randomness is good enough.

Neil Williams

Updated

•

19 years ago

Attachment #185508 - Attachment is obsolete: true

Neil Williams

Comment 27

•

19 years ago

Attached patch third try — Details — Splinter Review

Reviewer's last comments incorporated.

Attachment #185630 - Flags: superreview?(wtchang)

Attachment #185630 - Flags: review?(nelson)

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 28

•

19 years ago

Comment on attachment 185630 [details] [diff] [review]
third try

Looks good to me.  Since we haven't found a reproducible test case with which
to test this patch, please get a second review on this.  Thanks.

Attachment #185630 - Flags: review?(nelson) → review+

Neil Williams

Updated

•

19 years ago

Target Milestone: --- → 3.10.1

Jo Hermans

Comment 29

•

19 years ago

(In reply to comment #26)
> To be honest, I just applied a Debian patch for Mandriva/Mandrake package which
> commented the netstat call, which isn't really needed on Linux system, since
> /dev/random randomness is good enough.

You mean /dev/urandom, right ? On Linux, /dev/random might block. Not on BSD though.

While I'm not a reviewer, the latest patch looks good. EINTR will cause waipid
to be repeated (max 1000 times). All other errors will elave the loop

Anthony O'Hara (Spam Killer: Use anthony.ohara at experien.com.au)

Comment 30

•

19 years ago

I'm running Gentoo 1.4.16, and I'm getting a zombie netstat process for both
Firefox (1.0.6 - Mozilla/5.0 - 20050722) and Thunderbird (1.0.6 (20050722)).
Originally, before upgrading to the latest Firefox and Thunderbird build, this
problem only used to occur with instances of Firefox.

 /bin/bash /usr/libexec/mozilla-launcher
  \_ /usr/lib/mozilla-firefox/firefox-bin
      \_ /usr/lib/mozilla-firefox/firefox-bin
      |   \_ /usr/lib/mozilla-firefox/firefox-bin
      |   \_ /usr/lib/mozilla-firefox/firefox-bin
      |   \_ /usr/lib/mozilla-firefox/firefox-bin
      |   \_ /usr/lib/mozilla-firefox/firefox-bin
      \_ [netstat] <defunct>


 /bin/bash /usr/libexec/mozilla-launcher
  \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      |   \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      |   \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      |   \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      |   \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      |   \_ /usr/lib/mozilla-thunderbird/thunderbird-bin -mail
      \_ [netstat] <defunct>

Wan-Teh Chang

Assignee

Updated

•

19 years ago

Target Milestone: 3.10.1 → 3.11

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 31

•

19 years ago

Comment on attachment 185630 [details] [diff] [review]
third try

This bug needs a second review.  Julien, will you review it?

Attachment #185630 - Flags: review?(julien.pierre.bugs)

Wan-Teh Chang

Assignee

Comment 32

•

19 years ago

Comment on attachment 185630 [details] [diff] [review]
third try

r=wtc.	This patch may not fix this bug though.
I would rename the variable r as rv.

Attachment #185630 - Flags: superreview?(wtchang)

Attachment #185630 - Flags: superreview+

Attachment #185630 - Flags: review?(julien.pierre.bugs)

Julien Pierre

Comment 33

•

19 years ago

Comment on attachment 185630 [details] [diff] [review]
third try

Like Wan-Teh, I'm not sure if this bug will fix the problem, but I don't think
it will hurt.

Re: the kill() call, I think we should probably break out of the loop on any
error, and not just on ESRCH. In the Solaris man pages, the other possible
errors are EINVAL (invalid signal number) and EPERM (no permission to kill the
process), which should also be fatal, since subsequent calls to kill aren't
going to fix those problems, and the only point of continuing the waitpid loop
is to work gracefully if the process exits quickly within 1000 iterations. On
the other hand, these other errors should never happen, since we are passing a
hardcoded signal value, and we started the process, so we should have
permission - unless it changed user. I hope netstat doesn't do that ;).

Attachment #185630 - Flags: review+

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 34

•

19 years ago

Hmm. Netstat could be setuid root!

Christian Krause

Comment 35

•

19 years ago

I've appplied the patch, but unfortunately there are still sometimes hanging
zombie netstat processes:

&#9500;&#9472;firefox&#9472;&#9472;&#9472;run-mozilla.sh&#9472;&#9472;&#9472;firefox-bin&#9472;&#9472;&#9472;netstat

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 36

•

19 years ago

Hmm.  I wonder.  Perhaps the netstat processes are not zombies,
but are actually live hung processes that cannot be killed by
the parent process because they are running as root and the 
parent isn't.  A full long ps output might help diagnose that.
Please attach it rather than adding it inline as a comment.

Christian Krause

Comment 37

•

19 years ago

(In reply to comment #36)
> Hmm.  I wonder.  Perhaps the netstat processes are not zombies,
> but are actually live hung processes that cannot be killed by
> the parent process because they are running as root and the 

No, the process is a real zombie and is not running under a different UID and
netstat is not SETUID.

> parent isn't.  A full long ps output might help diagnose that.
> Please attach it rather than adding it inline as a comment.

I'll attached a file with the relevant part of the ps output.

Christian Krause

Comment 38

•

19 years ago

Attached file ps output of firefox and netstat zombie — Details

Wan-Teh Chang

Assignee

Comment 39

•

19 years ago

Christian: what is the Linux distribution you are
using?  What's the kernel version and do you know
the glibc version and whether you are using LinuxThreads
or NPTL?  This will help us find a similar system to
reproduce this zombie netstat process.

Do you know how to use strace on firefox-bin to find out if
waitpid was called at all?  If not, I can supply a patch
to add logging statements.  (Or you can try to do that.)

John G. Myers

Comment 40

•

19 years ago

The problem is most likely that waitpid() is being called with WNOHANG.  The
function can loop the 1000 times before the child process is scheduled to
receive the signal.

I think we need a blocking waitpid() call, at least after the first attempt. 
Things to prevent a deadlock would include setting a timer to cause the blocking
waitpid() to return EINTR and making sure the pipe is drained to EOF and/or closed.

John G. Myers

Comment 41

•

19 years ago

Attached patch Proposed fix (untested) (obsolete) — Details — Splinter Review

Wan-Teh Chang

Assignee

Comment 42

•

19 years ago

Comment on attachment 197901 [details] [diff] [review]
Proposed fix (untested)

We should only need to make the kill(pid, SIGKILL) call
once.  If so, we can do that before calling waitpid.
It should be harmless to send SIGKILL to a zombie.

John, your patch never calls waitpid with WNOHANG. Is
this intentional?

In safe_popen, if fork succeeds but fdopen fails,
safe_popen will return NULL, which will cause safe_pclose
and therefore waitpid to not be called.  We should be
able to fix that by moving the fdopen call before the
fork call.  (fdopen is unlikely to fail though.)

John G. Myers

Comment 43

•

19 years ago

I think the SIGKILL is unnecessary.  I put the patch together quickly, feel free
to tweak it.

Neil Williams

Comment 44

•

19 years ago

(In reply to comment #41)
> Created an attachment (id=197901) [edit]
> Proposed fix (untested)
> 

(In reply to comment #42)
> (From update of attachment 197901 [details] [diff] [review] [edit])
> We should only need to make the kill(pid, SIGKILL) call
> once.  If so, we can do that before calling waitpid.
> It should be harmless to send SIGKILL to a zombie.
> 
> John, your patch never calls waitpid with WNOHANG. Is
> this intentional?
> 
> In safe_popen, if fork succeeds but fdopen fails,
> safe_popen will return NULL, which will cause safe_pclose
> and therefore waitpid to not be called.  We should be
> able to fix that by moving the fdopen call before the
> fork call.  (fdopen is unlikely to fail though.)
> 

I agree with Wan-Teh. Waitpid without WNOHANG could cause indefinite wait.
Should probably try WHNOHANG first, then if waitpid() returns 0, kill() and
waitpid(pid, &status, 0). I'll try to work up a patch.

Mike Hommey [:glandium]

Comment 45

•

18 years ago

Attached patch workaround for Linux — Details — Splinter Review

Please add this patch as well, since the code is useless on Linux, where /dev/urandom is used.
The patch is from Wichert Akkerman (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=241200 )

Mike Hommey [:glandium]

Updated

•

18 years ago

Attachment #216373 - Flags: review?(nelson)

Nelson Bolyard (seldom reads bugmail)

Reporter

Updated

•

18 years ago

QA Contact: jason.m.reid → libraries

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 46

•

18 years ago

Neil, contact Wichert Akkerman and Eric Dorland (see the URL above for their
email addresses), and see if they can reproduce at will, and are willing to
test your patch.

URL: http://bugs.debian.org/cgi-bin/bugrep...

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 47

•

18 years ago

Comment on attachment 216373 [details] [diff] [review]
workaround for Linux

Our first preference is to fix this bug "the right way" which uses netstat but doesn't leave it as a zombie.  If that can be achieved, then we want to use netstat even on Linux. 
But this needs resolution now.  So, if the "right fix" cannot be verified before 3.11.1-Beta, then I think we must deploy this workaround at that time.

Attachment #216373 - Attachment description: additional patch → workaround for Linux

Attachment #216373 - Flags: review?(nelson) → review+

Nelson Bolyard (seldom reads bugmail)

Reporter

Updated

•

18 years ago

Target Milestone: 3.11 → 3.11.1

Nelson Bolyard (seldom reads bugmail)

Reporter

Updated

•

18 years ago

Priority: P2 → P1

Target Milestone: 3.11.1 → 3.11.2

Julien Pierre

Comment 48

•

18 years ago

Per our meeting, retargetting to 3.12.

Target Milestone: 3.11.2 → 3.12

Matt Zimmerman

Comment 49

•

18 years ago

This seems to still be present in Firefox 2.0 derived from Debian.  The #ifdef LINUX patch seems to be applied (I'm not sure whether that's a Debian patch or if it was merged), but doesn't seem to be taking effect.  "linux" is what the preprocessor provides, not "LINUX", though perhaps that's supposed to be defined somewhere in the tree.

Mike Hommey [:glandium]

Comment 50

•

18 years ago

(In reply to comment #49)
> This seems to still be present in Firefox 2.0 derived from Debian.  The #ifdef
> LINUX patch seems to be applied (I'm not sure whether that's a Debian patch or
> if it was merged), but doesn't seem to be taking effect.  "linux" is what the
> preprocessor provides, not "LINUX", though perhaps that's supposed to be
> defined somewhere in the tree.

The #ifdef LINUX in Firefox 2.0 in Ubuntu is something from Ubuntu, since there have been changes to the unix_rand.c file between the latest firefox in debian experimental (2.0b2) and 2.0 final. Though this new Ubuntu patch should work, I can confirm that the one that was applied in the Debian version does what it is supposed to do.

Status: ASSIGNED → RESOLVED

Closed: 18 years ago

Resolution: --- → FIXED

Nelson Bolyard (seldom reads bugmail)

Reporter

Updated

•

18 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Wan-Teh Chang

Assignee

Updated

•

18 years ago

Assignee: neil.williams → wtchang

Status: REOPENED → NEW

Wan-Teh Chang

Assignee

Comment 51

•

18 years ago

Attached file Test program mystat.c (used in place of netstat) — Details

I reproduced this zombie process on Red Hat Enterprise Linux 4
Update 4 using two small test programs, and understood why the
code fails to reap the child process.

Here is the first test program, mystat.c.  Compile it with
    gcc mystat.c -o mystat

It is to be used in place of netstat.  You can also use netstat,
but for some reason it's easier to reproduce the bug with mystat.
The mystat program does nothing.

Wan-Teh Chang

Assignee

Comment 52

•

18 years ago

Attached file Test program popen.c — Details

Here is the second test program, popen.c.  Compile it with
    gcc popen.c -o popen

Then, run the test with
    ./popen

On RHEL 4 U4, it is not hard to see something like this:
    kill(23585, SIGKILL)=0
    [repeated 1000 times]
    count equals 1000

If you see this, run this command immediately (within 10
seconds) to see the zombie process:

    $ ps -eaf | grep mystat
    wtchang  23585 23584  0 17:05 pts/4    00:00:00 [mystat] <defunct>
    wtchang  23589 16153  0 17:05 pts/5    00:00:00 grep mystat

I found that waitpid(..., WNOHANG) repeatedly returns 0 because
the OS is only running the popen process and is not giving the
mystat process a chance to run and terminate.  The evidence is
that if the popen program calls sched_yield() between the
waitpid(..., WNOHANG) calls (there is one such call in popen.c
that you can uncomment out), then the bug cannot be reproduced,
and you will only see at most one line like this:
    kill(23608, SIGKILL)=0

Wan-Teh Chang

Assignee

Comment 53

•

18 years ago

Attached patch Proposed fix v2 (obsolete) — Details — Splinter Review

John's description in comment 40 of what causes the problem
is right on.

This patch is based on John's patch in comment 41 and implements
what Neil outlined in comment 44.

Attachment #197901 - Attachment is obsolete: true

Attachment #250408 - Flags: superreview?(rrelyea)

Attachment #250408 - Flags: review?(neil.williams)

Wan-Teh Chang

Assignee

Comment 54

•

18 years ago

Attached patch Patch for a minor problem in safe_popen (obsolete) — Details — Splinter Review

This patch fixes the corner case in safe_popen that I described
at the end of comment 42.

Attachment #250410 - Flags: review?(neil.williams)

Neil Williams

Comment 55

•

18 years ago

Comment on attachment 250410 [details] [diff] [review]
Patch for a minor problem in safe_popen

This looks correct. Is it not correct to close(p[0]) when fork() returns -1? Why?

Attachment #250410 - Flags: review?(neil.williams) → review+

Neil Williams

Comment 56

•

18 years ago

Comment on attachment 250408 [details] [diff] [review]
Proposed fix v2

Thanks for tracking this down, Wan-Teh. It finally makes good sense what's happening. I was going to use sleep(1) but this is better.

Attachment #250408 - Flags: review?(neil.williams) → review+

Neil Williams

Comment 57

•

18 years ago

I noticed non-printable characters in unix_rand.c. Turns out they're form feed characters--at lines 283,324 and others. As a matter of policy should these be removed when working up a patch?

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 58

•

18 years ago

Neil, Feel free to remove FF characters in your next patch.

Wan-Teh Chang

Assignee

Comment 59

•

18 years ago

(In reply to comment #55)
> (From update of attachment 250410 [details] [diff] [review])
> This looks correct. Is it not correct to close(p[0]) when fork() returns -1?
> Why?

p[0] has been associated with fp, so fclose(fp) will call close(p[0])
under the cover, according to the fclose man page in the Single Unix
Specification:
http://www.opengroup.org/onlinepubs/009695399/functions/fclose.html

  The fclose() function shall perform the equivalent of a close()
  on the file descriptor that is associated with the stream pointed
  to by stream.

I should add a comment to clarify this.  How about something like this:

  fclose(fp);  /* this closes p[0], the fd associated with fp */

Wan-Teh Chang

Assignee

Comment 60

•

18 years ago

Note that the fdopen function associates a stream with a file descriptor.
Our code has
    fp = fdopen(p[0], "r");

Robert Relyea

Comment 61

•

18 years ago

Comment on attachment 250408 [details] [diff] [review]
Proposed fix v2

r+ = relyea

Did you want to add the yield or sleep(0) in this version of the patch as well?

Attachment #250408 - Flags: superreview?(rrelyea) → superreview+

Wan-Teh Chang

Assignee

Comment 62

•

18 years ago

Attached patch Patch for a minor problem in safe_popen (as checked in) — Details — Splinter Review

I checked in this patch on the NSS trunk (NSS 3.12) with
a comment to explain that fclose(fp) closes p[0].

Checking in unix_rand.c;
/cvsroot/mozilla/security/nss/lib/freebl/unix_rand.c,v  <--  unix_rand.c
new revision: 1.23; previous revision: 1.22
done

Attachment #250410 - Attachment is obsolete: true

Wan-Teh Chang

Assignee

Comment 63

•

18 years ago

Attached patch Proposed fix (as checked in) — Details — Splinter Review

I added a PR_Sleep(0) call instead, which calls sched_yield().

Calling sched_yield() directly would have porting issues because
not all Unix platforms have sched_yield() and it would require
linking libfreebl3.so with -lrt.  Calling PR_Sleep(0) allows us
to let NSPR handle these porting issues.

I considered calling sleep(0), which is available everywhere and
defined in libc, but I can't find anything in the sleep man page
in the Single Unix Specification that says sleep(0) yields the
processor:
    http://www.opengroup.org/onlinepubs/009695399/functions/sleep.html
The man page for the related function usleep says usleep(0) "has no
effect":
    http://www.opengroup.org/onlinepubs/009695399/functions/usleep.html
So I decided to not use sleep(0).

I checked in this patch on the NSS trunk (NSS 3.12).

Checking in unix_rand.c;
/cvsroot/mozilla/security/nss/lib/freebl/unix_rand.c,v  <--  unix_rand.c
new revision: 1.24; previous revision: 1.23
done

Attachment #250408 - Attachment is obsolete: true

Wan-Teh Chang

Assignee

Updated

•

18 years ago

Status: NEW → RESOLVED

Closed: 18 years ago → 18 years ago

Resolution: --- → FIXED

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 64

•

18 years ago

Wan-Teh, Can the kill() call fail with EINTR?
Should the kill call be retried if it fails with EINTR,
as the patch now does for the waitpid() calls?

Wan-Teh Chang

Assignee

Comment 65

•

18 years ago

Nelson, kill() cannot fail with EINTR:
http://www.opengroup.org/onlinepubs/009695399/functions/kill.html

Actually waitpid(..., WNOHANG) most likely can't fail with EINTR
either, but I still handle it as defensive programming.  In general
only blocking system calls can fail with EINTR.

Wan-Teh Chang

Assignee

Comment 66

•

18 years ago

I checked in the "proposed patch (as checked in)" on the
NSS_3_11_BRANCH (NSS 3.11.6).

Checking in unix_rand.c;
/cvsroot/mozilla/security/nss/lib/freebl/unix_rand.c,v  <--  unix_rand.c
new revision: 1.17.10.6; previous revision: 1.17.10.5
done

Target Milestone: 3.12 → 3.11.6

Oleg Romashin (:romaxa)

Comment 67

•

18 years ago

does anybody supposed to commit https://bugzilla.mozilla.org/attachment.cgi?id=216373
?

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 68

•

18 years ago

Comment on attachment 216373 [details] [diff] [review]
workaround for Linux

requesting sr

Attachment #216373 - Flags: superreview?(wtchang)

Wan-Teh Chang

Assignee

Comment 69

•

18 years ago

Comment on attachment 216373 [details] [diff] [review]
workaround for Linux

Let's come up with a better way to turn off the
safe_popen(netstat_ni_cmd) call.  We already use
two different methods for BSD/OS and Solaris.  The
sample code omits the irrelevant code.

Method 1 (used by BSD/OS): if we succeeded in getting
data from /dev/urandom, then we skip the
safe_popen(netstat_ni_cmd) call.

    bytes = RNG_FileUpdate("/dev/urandom", SYSTEM_RNG_SEED_COUNT);

#ifdef BSDI
    if (bytes)
        return;
#endif

    fp = safe_popen(netstat_ni_cmd);
    if (fp != NULL) {
        while ((bytes = fread(buf, 1, sizeof(buf), fp)) > 0)
            RNG_RandomUpdate(buf, bytes);
        safe_pclose(fp);
    }

Method 2 (used by Solaris): turn off the safe_popen(netstat_ni_cmd)
call by conditional compilation, equivalent to this patch.

#define DO_NETSTAT 1

#ifdef SOLARIS
#undef DO_NETSTAT
#endif

#ifdef DO_NETSTAT
    fp = safe_popen(netstat_ni_cmd);
    if (fp != NULL) {
        while ((bytes = fread(buf, 1, sizeof(buf), fp)) > 0)
            RNG_RandomUpdate(buf, bytes);
        safe_pclose(fp);
    }
#endif

So, this patch could be replaced by a patch that undefines DO_NETSTAT
for Linux.  But I think we should generalize the new patch to cover
all the platforms that have /dev/urandom.  It may be better to do this
in a new bug report.

Attachment #216373 - Flags: superreview?(wtchang) → superreview-

Oleg Romashin (:romaxa)

Comment 70

•

18 years ago

What dou think about this in nspr? 
static PRStatus OpenDevRandom( void )
 {
-    fdDevRandom = open( "/dev/random", O_RDONLY );
+    fdDevRandom = open( "/dev/urandom", O_RDONLY );
     return((-1 == fdDevRandom)? PR_FAILURE : PR_SUCCESS );
 } /* end OpenDevRandom() */

Wan-Teh Chang

Assignee

Comment 71

•

18 years ago

romaxa: good idea.  NSPR should also use /dev/urandom instead of
/dev/random.

Mike Hommey [:glandium]

Comment 72

•

18 years ago

(In reply to comment #69)
> So, this patch could be replaced by a patch that undefines DO_NETSTAT
> for Linux.

FWIW, we've been doing this in debian for a while to solve the netstat zombies problem.

Kai Engert (:KaiE:)

Comment 73

•

17 years ago

I'm confused whether this bug is really fixed.

Wan-Teh, I do not understand the purpose of your comment 69 and why you discuss attachment 216373 [details] [diff] [review] proposes to undefine DO_NETSTAT for Linux.

What is the motivation for a new patch? Do you want to "optimize" the fix?

Is is still necessary to undefine DO_NETSTAT for Linux in order to fix this bug?

Or did Wan-Teh's checkin of attachment 250666 [details] [diff] [review], comment 66, to the 3.11.6 branch already fix the reported problems?

Did someone test/verify?

Thanks in advance for a summary of the current status.

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 74

•

17 years ago

Kai, I wouldn't try to fix this any more than it is already fixed unless 
someone complains that it is not actually fixed at present.  I see no such
complaint above.

Kai Engert (:KaiE:)

Comment 75

•

17 years ago

Nelson, the motivation for my question:

I would like to understand the best approach to backport a fix to Firefox/SeaMonkey in RHEL 4.

Should we go with "#undef DO_NETSTAT"
or should we take the patch that was checked in.

Because Wan-Teh continued to discuss, I was concerned the checked in patch might not be complete.

Mike Hommey [:glandium]

Comment 76

•

17 years ago

FWIW, we do #undef DO_NETSTAT on Debian (and Ubuntu does too, AFAIK). Also see my comment #45. I still don't understand obstination as in comment #47. Why continue to use netstat for the sake of it ?

Jan Iven

Comment 77

•

17 years ago

Kai: FYI "#undef DO_NETSTAT" has also been tried on RHEL4/firefox and appears to fix the netstat zombie issue there, see IssueTracker#119883.

Wan-Teh Chang

Assignee

Comment 78

•

17 years ago

Kai, for your backporting, take the patch that was checked in.

We should consider removing the netstat (safe_popen and safe_pclose)
code on the NSS trunk.  Let's do that in a new bug report.

Kai Engert (:KaiE:)

Comment 79

•

17 years ago

Thanks for the clarifications and your recommendations.

I want to follow Wan-Teh's recommendation to use the patch as checked in.
That way we will use the same patch now and in future NSS updates.

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 82

•

17 years ago

Bug 385465 alleges that the fix for this bug was/is incomplete, and continues
to fail in certain reproducible cases.  A patch attached to that bug,
attachment 269706 [details] [diff] [review], contains a proposed fix for the trunk.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Target Milestone: 3.11.6 → 3.11.8

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 83

•

17 years ago

Comment on attachment 250662 [details] [diff] [review]
Patch for a minor problem in safe_popen (as checked in)

This patch is now also on the branch:
unix_rand.c; new revision: 1.17.10.7; previous revision: 1.17.10.6

unix_rand.c on the branch now matches the trunk.

Reed Loden [:reed]

Comment 84

•

17 years ago

Is there a reason this bug is still opened? Looks like the other problems were fixed, but maybe I'm missing something.

Kurt Miller

Comment 85

•

17 years ago

Yes the problem described in Bug 385465 still remains. There is a proposed patch to correct the issue attachment 269706 [details] [diff] [review] that has not been committed. It would be great if this could be reviewed and committed.

In summary the issue is that the return value of kill() on a zombie process is not standardized by POSIX. It recommends that kill() on a zombie not return ESRCH but then acknowledges that many systems do. safe_pclose() on those systems has a race that leads to deadlock which I have hit quite a bit.

Regardless of how the operating system implements kill() on zombie processes, I don't see the logic in bailing out of safe_pclose() if kill() fails for any reason. waitpid() should still be called so that the conditions for the deadlock are avoided (i.e. don't exit safe_pclose() leaving the zombie around).

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 86

•

17 years ago

Kurt,
Attachment 269706 [details] [diff] is attached to a resolved bug, and has no review request,
so it's not getting any attention because it doesn't show up on any searches
for patches awaiting review on unresolved bugs.

I suggest attaching it to this bug, and requesting review from Wan-Teh.

Wan-Teh Chang

Assignee

Comment 87

•

17 years ago

This bug was reopened on 2007-07-01.  It's better to close this bug
and restore its 3.11.6 target milestone, and use bug 385465 for the
remaining issue.

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 88

•

17 years ago

I'm not sure that I agree that it's better.  
Bug 385465 proves that this bug was marked "fixed" prematurely.
A checkin was done, but it seems that this bug is still not actually fixed.  
I'd like to keep all the history of this bug in one bug report until it
is really fixed.

Kurt Miller

Comment 89

•

17 years ago

Attached patch patch to correct issue described in comment #85 (obsolete) — Details — Splinter Review

Patch to correct remaining issue (thanks Nelson for the pointer). Requesting review from Wan-Teh.

I'm not sure what if anything I should do with the flags option on this form.
Nelson or Wan-Teh perhaps you could fix them if I got it wrong.

Nelson Bolyard (seldom reads bugmail)

Reporter

Updated

•

17 years ago

Attachment #284642 - Flags: review?(wtc)

Wan-Teh Chang

Assignee

Comment 90

•

17 years ago

Comment on attachment 284642 [details] [diff] [review]
patch to correct issue described in comment #85

NSS 3.11.6 is already released and has a partial fix for
this bug.  The partial fix works on all the platforms that
the NSS team tested on, therefore it is still useful.
Since we didn't write release notes for NSS 3.11.6,
the target milestone of this bug is a good way to document
the fact that NSS 3.11.6 has a partial fix.  Reopening
this bug and changing the target milestone makes it
hard to discover that information.

Kurt, your patch is correct.  I considered changing the
code to call kill() only once, but decided not to make the
change because "it ain't broke".

Is OpenBSD your platform?  I don't have POSIX, so I use
Single Unix Specification to help me write portable code.
Its kill() man page doesn't mention the issue with zombies
and ESRCH:
http://opengroup.org/onlinepubs/007908799/xsh/kill.html

The best fix for this bug is for NSS to not exec
processes and mess with signal handlers and waitpid.

Attachment #284642 - Flags: review?(wtc) → review+

Kurt Miller

Comment 91

•

17 years ago

Thanks for the review. Yes OpenBSD is my platform. I guess I should have said SUSv3 instead of POSIX. Indeed the SUSv2/POSIX doesn't address how zombies are handled which is how I suppose different 1003.1-1988 compliant kill() implementations came about.

Here's the url to the SUSv3 version that I was referring to:

http://www.opengroup.org/onlinepubs/009695399/functions/kill.html

There's a paragraph in the Rational section beginning with "Existing implementations..."

Mike Hommey [:glandium]

Comment 92

•

17 years ago

(In reply to comment #90)
> The best fix for this bug is for NSS to not exec
> processes and mess with signal handlers and waitpid.

Especially on platforms that already have good random number generation.

Kai Engert (:KaiE:)

Comment 93

•

17 years ago

Comment on attachment 284642 [details] [diff] [review]
patch to correct issue described in comment #85

Can we check in this patch?

Kai Engert (:KaiE:)

Comment 94

•

17 years ago

FWIW, In Fedora's experimental sandbox (called Rawhide) we started to use NSS 3.12 alpha for testing purposes.

In our OS package, I removed all patches that worked around this bug.
Now, I got new complaints that we are still executing netstat, which is causing some unwanted side effects for SELinux. We'd like to avoid executing netstat.

I would like to do this on trunk:

#ifdef LINUX
#undef DO_NETSTAT
#endif

Kai Engert (:KaiE:)

Comment 95

•

17 years ago

Attached patch Patch to undef DO_NETSTAT on Linux (obsolete) — Details — Splinter Review

Attachment #290839 - Flags: superreview?(rrelyea)

Attachment #290839 - Flags: review?(wtc)

Wan-Teh Chang

Assignee

Comment 96

•

17 years ago

Comment on attachment 290839 [details] [diff] [review]
Patch to undef DO_NETSTAT on Linux

Kai, I suggest you do the following:

1. Verify that we don't define DO_PS on any platform:
http://lxr.mozilla.org/security/search?string=DO_PS

Then remove the dead code ifdef'd with DO_PS.

2. Add LINUX to the code here and update the comment:

978 /*
979  * Bug 100447: On BSD/OS 4.2 and 4.3, we have problem calling safe_popen
980  * in a pthreads environment.  Therefore, we call safe_popen last and on
981  * BSD/OS we do not call safe_popen when we succeeded in getting data
982  * from /dev/urandom.
983  */
984 
985 #ifdef BSDI
986     if (bytes)
987         return;
988 #endif

Wan-Teh Chang

Assignee

Comment 97

•

17 years ago

Attached patch patch to correct issue described in comment #85 (as checked in) — Details — Splinter Review

I checked in Kurt Miller's patch on the NSS trunk for NSS 3.12.
I restored the use of tab characters to make the patch smaller.

Checking in unix_rand.c;
/cvsroot/mozilla/security/nss/lib/freebl/unix_rand.c,v  <--  unix_rand.c
new revision: 1.26; previous revision: 1.25
done

Attachment #284642 - Attachment is obsolete: true

Kaspar

Comment 98

•

17 years ago

(In reply to comment #96)
> (From update of attachment 290839 [details] [diff] [review])

> 2. Add LINUX to the code here and update the comment:
> 
> 978 /*
> 979  * Bug 100447: On BSD/OS 4.2 and 4.3, we have problem calling safe_popen
> 980  * in a pthreads environment.  Therefore, we call safe_popen last and on
> 981  * BSD/OS we do not call safe_popen when we succeeded in getting data
> 982  * from /dev/urandom.
> 983  */
> 984 
> 985 #ifdef BSDI
> 986     if (bytes)
> 987         return;
> 988 #endif

Note that there's a patch attached to bug 174993 (attachment 272917 [details] [diff] [review]), which achieves that effect. It hasn't been checked in so far, but got the necessary reviews.

Kai Engert (:KaiE:)

Comment 99

•

17 years ago

Kaspar, thanks for the pointer. I asked in bug 174993 to give people a chance to veto, but if we don't hear anything within the next day, we should check that in.

Kai Engert (:KaiE:)

Comment 100

•

17 years ago

(In reply to comment #96)
> 1. Verify that we don't define DO_PS on any platform:
> http://lxr.mozilla.org/security/search?string=DO_PS
> 
> Then remove the dead code ifdef'd with DO_PS.

Wan-Teh, I remove you made this proposal for cleanup purposes, only.
Sure, I can do that. Will attach a patch.

Kai Engert (:KaiE:)

Comment 101

•

17 years ago

Attached patch remove unused DO_PS code [checked in] — Details — Splinter Review

Attachment #291228 - Flags: review?(wtc)

Wan-Teh Chang

Assignee

Comment 102

•

17 years ago

Comment on attachment 291228 [details] [diff] [review]
remove unused DO_PS code [checked in]

r=wtc.

Attachment #291228 - Flags: review?(wtc) → review+

Kai Engert (:KaiE:)

Updated

•

17 years ago

Attachment #291228 - Attachment description: remove unused DO_PS code → remove unused DO_PS code [checked in]

Kai Engert (:KaiE:)

Comment 103

•

17 years ago

Comment on attachment 291228 [details] [diff] [review]
remove unused DO_PS code [checked in]

This patch checked in:

/cvsroot/mozilla/security/nss/lib/freebl/unix_rand.c,v  <--  unix_rand.c
new revision: 1.27; previous revision: 1.26

Robert Relyea

Comment 104

•

17 years ago

Comment on attachment 290839 [details] [diff] [review]
Patch to undef DO_NETSTAT on Linux

r+ holding my nose....

I totally agree with undefining DO_NETSTAT for Linux.

I don't like the #undef method of setting it, unfortunately that's because there is an explicit #define DO_NETSTAT in the file.

I would much rather that #define be ifdef out rather than undef'ing DO_NETSTAT explicitly.

I would happily enteratain a patch that does that.

bob

Attachment #290839 - Flags: superreview?(rrelyea) → superreview+

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 105

•

17 years ago

Comment on attachment 290839 [details] [diff] [review]
Patch to undef DO_NETSTAT on Linux

This patch is obsolete.  This has already been fixed by another 
patch that has already been checked in:

1.28 <kaie@kuix.de> 2007-12-03 13:07
Bug 174993, don't fork netstat if data has been gathered from /dev/urandom
r=nelson

That fix is better than the one proposed here in one important 
respect.  If the attempt to read /dev/urandom should fail for some
reason, that other fix (already committed) does not leave the PRNG
unseeded.

Attachment #290839 - Flags: review-

Wan-Teh Chang

Assignee

Updated

•

17 years ago

Attachment #290839 - Flags: review?(wtc)

Kai Engert (:KaiE:)

Comment 106

•

17 years ago

Comment on attachment 290839 [details] [diff] [review]
Patch to undef DO_NETSTAT on Linux

Nelson said this patch is obsolete. Setting the flag.

Attachment #290839 - Attachment is obsolete: true

Nelson Bolyard (seldom reads bugmail)

Reporter

Comment 107

•

17 years ago

I believe this bug is now fixed by the checkin documentd in comment 105.

Status: REOPENED → RESOLVED

Closed: 18 years ago → 17 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.