Closed Bug 100447 Opened 23 years ago Closed 23 years ago

safe_popen fails under BSD/OS 4.2, BSD/OS 4.3

Categories

(NSS :: Libraries, defect, P1)

x86
BSDI
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lidl, Assigned: wtc)

Details

Attachments

(3 files, 3 obsolete files)

It was pointed out to me that the builds of mozilla that I have made for
BSD/OS4.2 core dump on exit.  This was rather curious, I thought, but after
some investigation, I was able to trace back the failure to the safe_popen()
function in mozilla/security/nss/lib/util/unix_rand.c

After looking at this problem a little closer, it finally became clear to me
that this was also the problem that prevented mozilla from running on BSD/OS 4.3
if I compiled with --enable-crypto!

The safe_popen() function takes a program to run, makes a pipe, forks and
then closes down the majority of the fds for the child process.  It execs
the command and sends back the results to the parent -- results which are
used to try to garner some randomness from the machine.

The command that is attemtped to run is "netstat -in", which produces a
more or less static output for a given machine under BSD/OS.  If you
wanted something that counts packets, so it changes often, it would
have to be "netstat -sin".  Here's what a "netstat -in" looks like on
BSD/OS 4.2:

lidl@sloar-40: netstat -in
Name Index   MTU Speed Mtrc Address                   Network
exp0     1  1500   100M   0 00:90:27:c1:bb:3c
exp0     1                0 192.111.45.25             192.111.45.0/26
exp0     1                0 fe80::290:27ff:fec1:bb3c%exp0 fe80::/64 
lo0      2  4352          0  
lo0      2                0 fe80::1%lo0               fe80::/64 
lo0      2                0 ::1                       ::1/128 
lo0      2                0 127.0.0.1                 127
ppp0*    3  1500          0  
tun0*    4  1500          0  
vlan0*   5     0          0 00:00:00:00:00:00
vlan1*   6     0          0 00:00:00:00:00:00
gif0*    7  1280          0  
gif1*    8  1280          0  
gif2*    9  1280          0  
gif3*   10  1280          0  

Not very random.

Under BSD/OS, the safe_popen() routine causes the child process to coredump.
Under 4.2, it coredumps after running the exec'd program.  This shows itself
as a completely "regular" run of mozilla, but after it exits normally,
a core file is written from the child process.

Under 4.3, the child exits before any data is written to the parent, so
the parent process hangs forever in the fread() of the expected data from
the (no longer running) child process.

Removing the attempt to call safe_popen() (heck, I removed the entire
function from my local copy) makes the program:

1) Not dump core under BSD/OS 4.2.
2) Initialize successfully under BSD/OS 4.3.

To compenstate for this lack of randomness, I added "/dev/urandom" to the
list of files that gets read, more or less applying the patch that
is attached to this bug: 

http://bugzilla.mozilla.org/show_bug.cgi?id=96626

I marked this bug as "critical" in that it core-dumps on 4.2 (although
only when exiting the application) and prevents it from running on 4.3
entirely.

I believe you should be able to replicate this problem by examing the
directory where the tests on host 'otaku' (from the tinderbox-ports page)
are run -- look for the mozilla-bin.core file.

I'll attach my diffs for this problem, which disable the safe_popen()
code entirely for BSD/OS.
Nelson, could you please take a look at this bug?  Thanks.
Assignee: wtc → nelsonb
Marking NEW adding keywords.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: patch, review
Instead of removing the call to safe_popen on BSD/OS,
we should find out why safe_popen causes the child
process to coredump.

Could you post the stack trace in the core file?
Priority: -- → P1
Target Milestone: --- → 3.4
Adding a traceback will be difficult for me -- I've upgraded most my
machines to BSD/OS 4.3, which as I explained in the original bug
report fails differently than under 4.2.  Because the browser won't
even start under 4.3 without this patch, I've already had to modify
all my local copies of the CVS tree to have this patch included.

However, since there is a BSD/OS 4.2 tinderbox running (otaku), and
the bug manifests itself as a core dump after the main browser exits,
I expect that you will find a core dump on otaku, in whatever the
working directory of the browser is when the browser tests are run.
If not, cd to a writable directory, run the browser, exit it (you don't
even have to visit any sites) and examine the core file.

If you don't have the core file on the tinderbox machine, I can, if I
must, reload one of my workstations with 4.2 and get that trace.  However,
as I pointed out, the netstat that is called doesn't give a particularly
random number under BSD/OS.  You would literally be better off calling
random() and using that the netstat that is run.
On most (some? many?) Unix platforms, netstat -in gives interface statistics
for all interfaces.  On BSD, it gives only static configuration info, and 
therefore is useless as a source of entropy.  

I suspect that there is some alternate set of flags that produces interface
statistics for all interfaces on BSD.  Can you tell us what those flags are 
and show some sample output of such a command?
Well, if you run 'netstat -in', you get the useless (for randomness purposes)
list of interfaces on the machine, which is pretty static across reboots.

To get the statistics, you would need to run 'netstat -sin', which looks
like the historical 'netstat -in' output:

lidl@nohost-77: uname -a
BSD/OS nohost.pix.net 4.3 BSDI BSD/OS 4.3 Kernel #2: Tue Oct 23 13:59:52 EDT
2001     root@nohost.pix.net:/usr/src/sys/compile/MEKETREX  i386

(mac address sanitized to protect the innocent)

lidl@nohost-78: netstat -sin
Name Index Address                       Ipkts Ierrs     Opkts Oerrs  Coll Drop
exp0     1 xx:xx:xx:xx:xx:xx            364945     0    340750     0     0    0
exp0     1 192.168.168.1                365571          336982            
exp0     1 fe80::2xx:xxff:fexx:xxxx%exp0         0               1            
exp1     2 xx:xx:xx:xx:xx:xx           6630526     0   5851870     1     0    0
exp1     2 192.168.169.1               6509562         5731715            
exp1     2 fe80::2xx:xxff:fexx:xxxx%exp1         0               1            
exp1     2 192.168.169.2                112487          141258            
wx0*     3 xx:xx:xx:xx:xx:xx                 0     0         0     0     0    0
lo0      4                              293072     0    293072     0     0    0
lo0      4 ::1                               0               0            
lo0      4 fe80::1%lo0                       0               0            
lo0      4 127.0.0.1                    267669          267669            
ppp0*    5                                   0     0         0     0     0    0
tun0*    6                                   0     0         0     0     0    0
vlan0*   7 00:00:00:00:00:00                 0     0         0     0     0    0
gif0*    8                                   0     0         0     0     0    0

The patch give above disables the use of the netstat command on ALL
unix platforms, and adds the use of /dev/urandom only on BSDI.  

I don't object to replacing netstat with /dev/urandom on platforms that
support /dev/urandom, but I do object to eliminating the use of netstat
without replacing it with something that provides at least as much 
entropy as is being lost.
In http://lists.insecure.org/linux-kernel/2000/Mar/1739.html the author 
explains that some systems may run out of swap space on one of the first 
Copy-On-Write (COW) events after a fork, causing a SIGSEGV to the child.  

This seems like a very likely cause for bug 100447, since mozilla is 
renowned for using lots of memory.  

So, submittor, please try substituting vfork in place of fork in 
safe_popen on your BSD box and see if that solves the problem.   
While you're at it, please make the command use the -s flag on BSD.
If this works, please attach the patch to this bug report.

While it is likely that mozilla will begin to use /dev/urandom on some
platforms in the future, that is not the solution to this problem.
Please follow bug 96626 on the introduction of /dev/urandom to mozilla.
Whiteboard: awaiting trial of vfork by submittor
I'll address two things in this email, then I'll go build the test
cases that I was asked to test.  (Even though I'm 99% sure that
it isn't going to help.)

1) I didn't mean to disable the netstat code on all machines, just
   BSD/OS.  There is a missing chunk of code in the attachment that
   addresses this problem.

#ifndef BSDI
#define DO_NETSTAT
#endif


2) I'm quite certain that I'm not running out of swap.  The 4.2 machine
   that I was testing on had in excess of 600MB of swap.  Mozilla may
   be a memory pig, but even it is not that big.

   The two different 4.3 machines that I've been testing on have,
   respectively, 512MB RAM + 2GB swap, and 1.25GB RAM + 2.25GB of swap.

I'm going to checkout a fresh tree and try the changes suggested.
I'll report back later on the results.

Well, as I suspected, changing the call from fork() to vfork()
didn't make it work.  So that's not the problem.  And, as a point
of reference, even when running the debugging build under
gdb, I still had in excess of 1.5GB of available swap space,
so the running out swap problem that was referenced isn't
the problem either.

I've cleaned up my earlier patch a little so that it doesn't
disable the netstat code on most machines, and more or less
predicates whether or not safe_popen() gets used on whether or
not /dev/urandom is in play.

While this doesn't address all of the commentary in bug 96626,
I don't feel it needs to either.  Some of that commentary is
quite flawed, as it assumes deep knowledge about the internals of the
/dev/urandom implementation.  Those assumptions may hold true
for Linux, but they don't for many flavors of the various BSD
systems available.  All one should assume about /dev/urandom
is that you can open it and read a random or pretty pseudo-random
stream of bytes for it.  One should NOT assume that is does
SHA-1 internally and therefore anything more than 160 bits pointless.

I'll attach my current patch next.
Thank you for testing vfork.

This bug reports a problem with safe_popen on BSDI.  
Another bug, bug 108708, reports a similar problem on Solaris,
(although it's not clear that these problems have the same cause).

Rather than fixing the root cause of the problem, the patches in this bug 
report would simply avoid using safe_popen and substitute the use of 
/dev/urandom instead.  

This report also points out that /dev/urandom is of varying quality 
on various platforms, which is precisely why I think the use of 
/dev/urandom should be in addition to, rather than instead of, 
other sources of entropy.  

There are two separate issues: 
1) fix safe_popen on BSDI and Solaris, (bugs 100447 and 108708)
2) add /dev/urandom as another source. (bug  96626)

I have a patch for 96626 that attempts to use /dev/urandom on ANY
unix/Linux that supports it (not a compile-time decision).  It addresses
the other concerns expressed in 96626 by trying /dev/urandom first,
and only trying 1KB of it.  I'll add that patch to 96626 when it's ready 
(probably today).  

In the meantime, let's get to the bottom of what's wrong with 
safe_popen on BSDI (and Solaris).  I have no BSDI system on which to 
investigate, so further progress on this issue depends on BSDI developers.
Keywords: patch, review
Whiteboard: awaiting trial of vfork by submittor
Getting to the bottom of what's wrong with the code in safe_popen() is
going to be a pretty messy problem, for a variety of reasons:

1) That code is written without any thought to the reality of running
   on systems that use pthreads as the threading model.  Signal handling
   under pthreads doesn't guarantee that any particular thread will
   get a particular signal delivered to it.  Remember that signals
   are still process-scope, not thread scope.  So, diddling the
   signals with sigaction() before and after forking is at best a
   potentially haphazard way of doing things.  At worst, doomed to
   failure.

2) The use of strtok() is a no-no in a treaded program, at least without
   a mutex to protect the state of it.  Given that safe_popen() makes
   no attempt to protect mutexes that might be held across the fork()
   (one of the few things that are carried over from parent->child
   when a pthread'd program forks), I wouldn't care to guess what odd
   failures might exhibit themselves due to some residual mutex being
   held in libc or another library that the child process encounters
   already locked by the parent.

I would guess that the safe_popen() routines should be using
pthread_atfork() to handle a reasonable cleanup of the child
space process context.  Without that rewrite, I think this code
is doomed to be buggy.

All that being said, I'm building a debug version of the suspect code
under 4.2 (where the child actually drops a core file on application
exit), and I'll post a backtrace when it finishes.
safe_popen is not a public function.  
It is intended for use only with the software crypto token.  
Prevention of simultaneous use of safe_popen is handled at higher
levels within the software crypto token (or are supposed to be).
Additional redundant locks at this level should not be added.

Having said that, contributions for the improvement of safe_popen are 
welcome.
Your comments about safe_open() being a private function -- sure, but
that doesn't address any of the points I just raised.

Given that this function runs in the context of a threaded application,
and on many of unix-like flavors, it uses pthreads, the points that
I raised are entirely valid and haven't been addressed.

Strtok is unsafe in a multi-threaded application.  Irregardless of whether
or not the safe_popen() function is only running once, any other
thread that is operating could smash the static strtok buffer, corrupting
the buffer before safe_popen is done with it.  Since you cannot force
other threads to not operate while this is running, you have an
insurmountable problem with the code as it exists currently.

The signal handling code is likewise bound to give problems without
additional protection.  While there may be protection with regard to
only one instance of safe_open() operating due to the higher levels
of the code, there is no protection (nor can there be) as the effects
of other threads that may be interleaved or running simultaenously
on another processor in the system.

As for contributions to the code, I've contributed twice.  In both
cases the problems are avoided by effectively removing this code
from the program.  Spending a huge amount of time in order to
rewrite this code to make it proof against pthreads effects is
not something that I am willing to do.

The root issue here is how to get randomness into the program, not on
how to get fork(), pipe() and a host of string functions
to operate perfectly in a pthreads environment.  I believe that
my patch (removal of safe_popen and all the dangerous things it
does) and the reading of a system source of randomness is an
effective solution to the problem.

As a final concern, since part of the goals of the mozilla project is
to make it run faster (and given that it is already a very large application),
using fork() is no way to make it run faster.  Much less running fork()
twice.
I noticed the sigaction(SIGCHLD) calls in safe_popen() and
safe_pclose() and I agree with Mr. Lidl that they have many
potential problems in a pthreads program.  I just opened
bug 109971 for this issue.  (I don't have a solution though.)

However, I don't think those two sigaction() calls caused
this bug.  Their purpose is to make waitpid() work.  To
verify that they are not the cause of this bug, we can do
this experiment:
- safe_popen: comment out the sigaction stuff.
- safe_pclose: comment out the sigaction call and the waitpid
  while loop.

On the issue of strtok(), yes, it's not thread-safe.  Looking
at the code, I think it can be rewritten to avoid the strtok()
call.  But there are other problems with our child process.
In the child process, we can only call async-signal-safe functions
between fork() and execvp().  We definitely violated this restriction.
I don't know if this is the cause of the problem though.  I will
attach a patch to address this issue.
NSS uses NSPR.  NSPR uses pthreads on all **ix platforms.  Ergo, 
I believe NSS threads are always pthreads.  When a pthread thread
calls fork, the only thread that is replicated in the newly forked
process is the thread that called fork.  That is, the new process
initially has only one thread. strtok is used in the only thread 
in the new process.  It is safe in that context.  One could replace 
strtok with something like strtok_r, but IMO that's pointless in 
that single-threaded process environment.  If BSDI's pthread
library is reproducing more than one thread in the newly forked 
process, I believe that is a flaw in that pthread library.

A greater concern is that processes that use NSS MUST be able to 
succesfully fork other processes.  If there is a flaw in NSS/NSPR
that makes NSS-based processes from being able to fork children,
such a flaw MUST be fixed.  To simply eliminate the calls to fork
rather than fixing the real flaw is to stick one's head in the sand.

I do not object, in principle, to replacing the popen with data
from /dev/urandom on those platforms where /dev/urandom is known
to be more than a trivial LFSR PRNG.  ONCE the flaw(s) that affect fork 
on BSDI and Solaris are resolved, then I have no problem with eliminating
the use of safe_popen on BSDI.
lidl@pix.net wrote:
> Under 4.3, the child exits before any data is written to the parent, so
> the parent process hangs forever in the fread() of the expected data from
> the (no longer running) child process.

Could you attach a debugger to the child process and post its backtrace?

I am wondering if its backtrace would look similar to the one in bug
108708, seen on Solaris.
Why does the parent hang, in this case, instead of getting zero from fread,
due to reading EOF?  

Is it possibly the case that if the FD is already at EOF before the 
parent calls fdopen, then fread doesn't detect EOF?
Here's the stack backtrace for the failure on BSD/OS 4.2.

lidl@sloar-39: gdb ./mozilla-bin mozilla-bin.core 
GNU gdb 
Copyright 1998 Free Software Foundation, Inc.
[...]
Reading symbols from
/data/obj/mozilla-096/dist/bin/components/libimpText.so...done.
Reading symbols from
/data/obj/mozilla-096/dist/bin/components/libabsyncsvc.so...done.
#0  0x486bcb6d in kill () from /shlib/libc.so.2
(gdb) where
#0  0x486bcb6d in kill () from /shlib/libc.so.2
#1  0x487436c3 in abort () from /shlib/libc.so.2
#2  0x48753624 in _thread_aio_entry_destroy () from /shlib/libc.so.2
#3  0x48753e8c in _thread_sys_close () from /shlib/libc.so.2
#4  0x4a833d8c in safe_popen (cmd=0x4a885a28 "netstat -nis") at unix_rand.c:666
#5  0x4a834038 in RNG_SystemInfoForRNG () at unix_rand.c:790
#6  0x4a7896c5 in nss_Init (configdir=0x8450f40
"/homes/elves/lidl/.mozilla/lidl/vipfc2ii.slt", 
    certPrefix=0x4a86b47b "", keyPrefix=0x4a86b47b "", secmodName=0x4a86b49d
"secmod.db", 
    readOnly=0, noCertDB=0, noModDB=0, forceOpen=0) at nssinit.c:250
#7  0x4a789867 in NSS_InitReadWrite (
    configdir=0x8450f40 "/homes/elves/lidl/.mozilla/lidl/vipfc2ii.slt") at
nssinit.c:307
#8  0x4a7155fd in nsNSSComponent::InitializeNSS (this=0x824e700)
    at
/scratch/proj/src/mozilla-096/mozilla/security/manager/ssl/src/nsNSSComponent.cpp:547
#9  0x4a715a35 in nsNSSComponent::Init (this=0x824e700)
    at
/scratch/proj/src/mozilla-096/mozilla/security/manager/ssl/src/nsNSSComponent.cpp:609
#10 0x4a71d859 in nsNSSComponentConstructor (aOuter=0x0, aIID=@0x8060c08,
aResult=0x8047510)
    at
/scratch/proj/src/mozilla-096/mozilla/security/manager/ssl/src/nsNSSModule.cpp:51
#11 0x48235064 in nsGenericFactory::CreateInstance (this=0x8255ce0, aOuter=0x0,
aIID=@0x8060c08, 
    aResult=0x8047510)
    at
/scratch/proj/src/mozilla-096/mozilla/xpcom/components/nsGenericFactory.cpp:74
#12 0x4822f686 in nsComponentManagerImpl::CreateInstanceByContractID
(this=0x806a200, 
    aContractID=0x8061072 "@mozilla.org/security/entropy;1", aDelegate=0x0,
aIID=@0x8060c08, 
    aResult=0x8047510)
    at
/scratch/proj/src/mozilla-096/mozilla/xpcom/components/nsComponentManager.cpp:1673
#13 0x48230620 in nsComponentManagerImpl::GetServiceByContractID
(this=0x806a200, 
    aContractID=0x8061072 "@mozilla.org/security/entropy;1", aIID=@0x8060c08,
result=0x80475b4)
    at
/scratch/proj/src/mozilla-096/mozilla/xpcom/components/nsComponentManager.cpp:2119
#14 0x4822ba07 in nsGetServiceByContractID::operator() (this=0x8047710,
aIID=@0x8060c08, 
    aInstancePtr=0x80475b4)
    at
/scratch/proj/src/mozilla-096/mozilla/xpcom/components/nsComponentManager.cpp:236
#15 0x8058ea3 in nsCOMPtr<nsIEntropyCollector>::assign_from_helper
(this=0x8047720, 
    helper=@0x8047710, aIID=@0x8060c08) at
../../dist/include/xpcom/nsCOMPtr.h:972
#16 0x805be1b in nsCOMPtr<nsIEntropyCollector>::nsCOMPtr (this=0x8047720,
helper=@0x8047710)
    at ../../dist/include/xpcom/nsCOMPtr.h:553
#17 0x8052f08 in VerifyPsmAbsentOrSane (argc=1, argv=0x804793c)
    at /scratch/proj/src/mozilla-096/mozilla/xpfe/bootstrap/nsAppRunner.cpp:1029
#18 0x8053ae7 in main1 (argc=1, argv=0x804793c, nativeApp=0x0)
    at /scratch/proj/src/mozilla-096/mozilla/xpfe/bootstrap/nsAppRunner.cpp:1254
#19 0x8054de3 in main (argc=1, argv=0x804793c)
    at /scratch/proj/src/mozilla-096/mozilla/xpfe/bootstrap/nsAppRunner.cpp:1630
#20 0x804f28e in __start ()
(gdb) up 3
#3  0x48753e8c in _thread_sys_close () from /shlib/libc.so.2
(gdb) up
#4  0x4a833d8c in safe_popen (cmd=0x4a885a28 "netstat -nis") at unix_rand.c:666
666             for (fd = getdtablesize(); --fd > 2; close(fd))
Current language:  auto; currently c
(gdb) list
661           case 0:
662             /* dup write-side of pipe to stderr and stdout */
663             if (p[1] != 1) dup2(p[1], 1);
664             if (p[1] != 2) dup2(p[1], 2);
665             close(0);
666             for (fd = getdtablesize(); --fd > 2; close(fd))
667                 ;
668     
669             /* clean up environment in the child process */
670             putenv("PATH=/bin:/usr/bin:/sbin:/usr/sbin:/etc:/usr/etc");
(gdb) p fd
$1 = 7

-Kurt

Mr. Lidl, thank you for the stack backtrace for the failure
on BSD/OS 4.2.

So the child process crashed while closing fd 7.  Some
observations.

1. It would be nice to find out what fd 7 was.  In particular,
   was it p[0] or p[1] (the pipe)?

2. All the calls made by the child process up to that point
   (dup2, close, and sysconf) were async-signal-safe.  So
   this crash in close() could be a BSD/OS bug.

3. One possible workaround is to comment out the for loop that
   closes all the unnecessary fd's in the child process:
       for (fd = getdtablesize(); --fd > 2; close(fd))
           ;
   This might introduce security vulnerability though.  On the
   other hand, the commands NSS executes, 'ps' and 'netstat',
   are not arbitrary user code, so perhaps this security
   precaution of preventing access to unnecessary fd's is not
   really necessary.
Well, I've spent quite a bit of time dinking around with various test setups
trying to see what works and what doesn't work.  The long and short of it
is that I cannot make the exec() stuff work under BSD/OS, either due to an
error in how fork() is implemented, or some other interaction that I don't
understand.

The traceback that I appended showed the child process, "hung" when attempting
to close fd 7.  "hung" is a bad description -- it was busy flipping between
the threads of the process, because it thought that the close call would block.
But the observant reader will note that this was in the child process, after
the fork.  Why are there multiple threads in the child process after a fork?
There shouldn't be, according the rules of pthreading.  But clearly there
are (see my previous debugger output).

1) I wasn't able to figure out what fd 7 was, I probably could do this
with a little more effort.  But not tonight.

2) As above, it's not a *crash* in close, but rather where the thread
goes into a loop trying to switch to the other threads. They don't have
any work to do, so they switch back.  Repeat, lather, rinse.

3) Commenting out the loop that closes the fds of the child process
causes the program to exit with a GDK error.  What is really happening
is the process is getting a SIGPIPE writing to the X11 socket.  The
application is acting as if FD_CLOEXEC is set on all the fds when
the command is run.  Weird.

Modifying the code so that it only closes down to fd 8 gives the same
error.  (SIGPIPE)

Finally, a small modification of changing the exec in the child
process to a call to _exit(0) shows that child process exits cleanly
and the parent process goes happily on its way.  (Except, of course,
that it doesn't get the random data it was looking for.)

So, I'm drawing a blank as to what to test next -- there is definately
something squirrely with the pthreads, fork() and attaching to the
child process under BSD/OS.  I don't know if it is a real problem, or just
something that looks like a problem.  I just don't know.

Anyhow, I'm going to attach the latest version of my patch, which
has my debugging stuff conditionalized, rolls together my previous
patch, integrates wtc's patch to avoid certain unsafe operations and
disables the safe_popen code for just BSD/OS.

This latest patch is apparently applied to an older version of the sources.
The code on the trunk and the code on the tip of the NSS_3_3_BRANCH
(which which PSM currently takes the sources) already use /dev/urandom.

Perhaps the PSM team has not yet moved the static NSS_CLIENT_TAG forward
to include these recent changes to unix_rand.c, so I'll add some members
of the PSM team to this cc list

I'll attach a patch that builds on the existing code that uses /dev/urandom
and that doesn't call fork on BSDI systems when data is obtained from 
/dev/urandom.
Assignee: nelsonb → wtc
Attached patch patch for NSS_3_3_BRANCH (obsolete) — Splinter Review
This patch applies to the current tip of the NSS_3_3_BRANCH.
It doesn't attempt to fork either PS or netstat on BSDI systems
when it succeeded in getting data from /dev/urandom.
Comment on attachment 58802 [details] [diff] [review]
patch for NSS_3_3_BRANCH

I think

+#if defined(BSDI)
+    if (bytes)
+	return;
+#endif

should simply be

+#if defined(BSDI)
+    return;
+#endif

because we can't call safe_popen() no matter what.  Calling safe_popen() will
not get
us more entropy; instead the child process will crash.
In the absence of both netstat and /dev/urandom input, this function does not 
collect enough entropy to securely seed the PRNG, IMO. So, perhaps NSS
initialization should fail in this case.
Mr. Lidl:

If you comment out the loop that closes the fds of the child process,
is it the parent process or the child process that gets a SIGPIPE
writing to the X11 socket?  It is worrisome that the child process
can't even call exec after we removed all the unsafe functions.  This
seems to mean that a process using pthreads and X11 can't spawn another
process.  Can you have someone at BSDI look at this problem?

Maybe the pthreads library on BSD/OS is not ready to be used by the
Mozilla client?
Status: NEW → ASSIGNED
This is Nelson's patch (attachment 58802 [details] [diff] [review]) plus a comment
describing this bug and the use of "netstat -nis" on
BSD/OS (when the bug that prevents us from calling
safe_popen in a pthreads environment is fixed).
Attachment #49865 - Attachment is obsolete: true
Attachment #57608 - Attachment is obsolete: true
Attachment #58802 - Attachment is obsolete: true
Fix checked into the tip of NSS.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Mr. Lidl,

Yesterday a customer sent me a simple test program that
can cause safe_popen to fail on Linux in a way similar
to the stack trace you provided in comment #22.

I looked into it and found that it was a shared library
linking order problem.  I invoked the 'ldd' command on
the executable and found that libc.so was before libpthread.so.
When I built the test program with the -pthread option of
g++/gcc so that libpthread.so is before libc.so in the
linking order, the test program completed successfully.

If BSD/OS has 'ldd' or a similar command that can report
the shared library linking order of an executable program,
could you invoke it on the mozilla executable and see the
relative order of libc.so and libpthread.so?  If the C++
compiler has a -pthread or -pthreads option, are we using
that option when linking the mozilla executable on BSD/OS?
If the C++ compiler doesn't have such an option, we will
need to link with -lpthread explicitly to ensure the correct
linking order of libc.so and libpthread.so.
There is not a seperate pthreads library on BSD/OS.
The pthreads functions have been integrated into libc, directly.

However, as you requested, here is the ldd output:

lidl-58: export LD_LIBRARY_PATH=/usr/local/mozilla
lidl-59: cd /usr/local/mozilla
lidl-60: ldd mozilla-bin
        libgkgfx.so => /usr/local/mozilla/libgkgfx.so (0x4808e000)
        libjsj.so => /usr/local/mozilla/libjsj.so (0x480b8000)
        libmozjs.so => /usr/local/mozilla/libmozjs.so (0x480d1000)
        libxpcom.so => /usr/local/mozilla/libxpcom.so (0x48143000)
        libplds4.so => /usr/local/mozilla/libplds4.so (0x4823f000)
        libplc4.so => /usr/local/mozilla/libplc4.so (0x48242000)
        libnspr4.so => /usr/local/mozilla/libnspr4.so (0x48247000)
        libdl.so => /shlib/libdl.so (0x48274000)
        libgtk-1.2.so.0 => /usr/contrib/lib/libgtk-1.2.so.0 (0x48277000)
        libgdk-1.2.so.0 => /usr/contrib/lib/libgdk-1.2.so.0 (0x48395000)
        libgmodule-1.2.so.0 => /usr/contrib/lib/libgmodule-1.2.so.0 (0x483c5000)
        libglib-1.2.so.0 => /usr/contrib/lib/libglib-1.2.so.0 (0x483c8000)
        libXext.so.6 => /usr/X11R6/lib/libXext.so.6 (0x483ea000)
        libX11.so.6 => /usr/X11R6/lib/libX11.so.6 (0x483f8000)
        libm.so => /shlib/libm.so.0.0 (0x484d6000)
        libstdc++.so.1 => /usr/lib/libstdc++.so.1 (0x484e7000)
        libgcc.so.1 => /shlib/libgcc.so.1 (0x48526000)
        libc.so.2 => /shlib/libc.so.2 (0x48532000)


I will add one other comment about a possibly related problem that I
have heard of that causes problems on BSD/OS when linking pthreads
problems.  Basically, the problem is that giving libc as an explicit
library on the link line (via a -lc) can cause problems that will
not occur if the program is linked such that the compiler will
implicitly add the library to the link line.

e.g.:   gcc -o foo foo.o -lc  (might cause pthreads problems)
        gcc -o foo foo.o (will not cause pthreads problems)

I don't think that is the issue here, as the final link line on my BSD/OS
4.3 host goes like this:

g++ -o mozilla-bin -I/usr/X11R6/include -fno-rtti -fno-exceptions -Wall -Wconver
sion -Wpointer-arith -Wbad-function-cast -Wcast-align -Woverloaded-virtual -Wsyn
th -Wno-ctor-dtor-privacy -pthread -pipe  -DNDEBUG -DTRIMMED -DWIDGET_DLL=\"libw
idget_gtk.so\" -DGFXWIN_DLL=\"libgfx_gtk.so\" -I/usr/contrib/lib/glib/include -I
/usr/contrib/include -I/usr/X11R6/include  nsAppRunner.o nsWindowCreator.o showO
SAlert.o nsSigHandlers.o nsNativeAppSupportGtk.o nsNativeAppSupportBase.o   -L..
/../dist/bin -L../../dist/lib  -lgkgfx -lmpfilelocprovider_s -ljsj  -L../../dist
/bin -lmozjs -L../../dist/bin -lxpcom -L/usr/obj/data/mozilla/dist/lib -lplds4 -
lplc4 -lnspr4 -ldl   -L/usr/contrib/lib -L/usr/X11R6/lib -lgtk -lgdk -lgmodule -
lglib -ldl -lXext -lX11 -lm -ldl -lm

I found that NSS (mozilla/security/nss) is not compiled with
the -pthread option.  That might be a problem.

To fix this, edit mozilla/security/coreconf/BSD_OS.mk and
define CC as 'gcc -pthread'.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: