Closed Bug 461502 Opened 11 years ago Closed 11 years ago

ForkAndExec is crashing on Solaris 8/9 due to environ being NULL

Categories

(NSPR :: NSPR, defect, P2)

4.7.2
Sun
Solaris
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: slavomir.katuscak+mozilla, Assigned: julien.pierre)

Details

Attachments

(5 files, 3 obsolete files)

These tests failed first in nightly builds since build 20081022.1 in all 4 combinations of 32/64 and DBG/OPT builds on 2 machines (Solaris 8 and Solaris 9).
I checked also results from today and they failed again.

BEGIN TEST: pipeping2 (2008-09-22 21:34:11)


END TEST: pipeping2 (2008-09-22 21:34:15)
TEST STATUS: pipeping2 = FAILED (errno 11)


BEGIN TEST: sockping (2008-09-22 21:42:46)


END TEST: sockping (2008-09-22 21:42:47)
TEST STATUS: sockping = FAILED (errno 11)

I compared code of pipeping2.c with pipeping.c (which passed), only significant difference is in calling PR_ProcessAttrSetInheritableFD function (pipeping.c calls PR_ProcessAttrSetStdioRedirect at the same place of code).

Errno 11 means Resource temporarily unavailable. There is not clear which part of code returns this error and which resource is unavailable.
NSPR 4.7.2 was released earlier this week. Setting the target milestone to 4.7.3.
Target Milestone: 4.7.1 → 4.7.3
This bug is seen on the latest NSPR 4.7.2.
Version: 4.7 → 4.7.2
Is it an NSPR bug? 
Or did the system run out of some resource?
Does the system perhaps need to have the amount of some resource increased?
The failure happened when we switched from NSPR 4.7.1 to 4.7.2. Unless there are some known system resource limit changes between these 2 micros releases, I don't think that is is the problem.
If the problem is due to increased system resource requirement as a side effect of a change in NSPR, this change is a problem by itself and needs to be tracked.
Before running the tests, we set the maximum fd number to 1024 (ulimit -n 1024).
I ran a truss with pipeping2 "truss -f ./pipeping2". The relevant part at the end is:

24936:	sigaction(SIGPIPE, 0xFFBEF188, 0x00000000)	= 0
24936:	pipe()						= 3 [4]
24936:	fcntl(3, F_GETFD, 0x00000000)			= 0
24936:	fcntl(3, F_GETFL, 0x00000000)			= 2
24936:	fstat64(3, 0xFFBEF270)				= 0
24936:	fstat64(3, 0xFFBEF270)				= 0
24936:	fcntl(3, F_SETFL, 0x00000082)			= 0
24936:	fcntl(4, F_GETFD, 0x00000000)			= 0
24936:	fcntl(4, F_GETFL, 0x00000000)			= 2
24936:	fstat64(4, 0xFFBEF270)				= 0
24936:	fstat64(4, 0xFFBEF270)				= 0
24936:	fcntl(4, F_SETFL, 0x00000082)			= 0
24936:	pipe()						= 5 [6]
24936:	fcntl(5, F_GETFD, 0x00000000)			= 0
24936:	fcntl(5, F_GETFL, 0x00000000)			= 2
24936:	fstat64(5, 0xFFBEF270)				= 0
24936:	fstat64(5, 0xFFBEF270)				= 0
24936:	fcntl(5, F_SETFL, 0x00000082)			= 0
24936:	fcntl(6, F_GETFD, 0x00000000)			= 0
24936:	fcntl(6, F_GETFL, 0x00000000)			= 2
24936:	fstat64(6, 0xFFBEF270)				= 0
24936:	fstat64(6, 0xFFBEF270)				= 0
24936:	fcntl(6, F_SETFL, 0x00000082)			= 0
24936:	fcntl(3, F_SETFD, 0x00000001)			= 0
24936:	fcntl(6, F_SETFD, 0x00000001)			= 0
24936:	lwp_cond_signal(0xFF1D34E8)			= 0
24936:	lwp_cond_wait(0xFF1D34E8, 0xFF1D34F8, 0xFF1CCD80) = 0
24936:	lwp_self()					= 3
24936:	    Incurred fault #6, FLTBOUNDS  %pc = 0xFF352A48
24936:	      siginfo: SIGSEGV SEGV_MAPERR addr=0x00000000
24936:	    Received signal #11, SIGSEGV [default]
24936:	      siginfo: SIGSEGV SEGV_MAPERR addr=0x00000000
24936:		*** process killed ***
I ran a trus with sockping (same conditions, same machine):

24939:	sigaction(SIGPIPE, 0xFFBEF190, 0x00000000)	= 0
24939:	so_socket(1, 2, 0, "", 1)			= 3
24939:	so_socket(1, 2, 0, "", 1)			= 4
24939:	so_socketpair(0xFFBEF4BC)			= 0
24939:	close(3)					= 0
24939:	fcntl(5, F_GETFD, 0x00000000)			= 0
24939:	fcntl(5, F_GETFL, 0x00000000)			= 2
24939:	fstat64(5, 0xFFBEF278)				= 0
24939:	getsockopt(5, 65535, 8192, 0xFFBEF378, 0xFFBEF370, 1) = 0
24939:	fstat64(5, 0xFFBEF278)				= 0
24939:	getsockopt(5, 65535, 8192, 0xFFBEF378, 0xFFBEF374, 1) = 0
24939:	setsockopt(5, 65535, 8192, 0xFFBEF378, 4, 1)	= 0
24939:	fcntl(5, F_SETFL, 0x00000082)			= 0
24939:	fcntl(4, F_GETFD, 0x00000000)			= 0
24939:	fcntl(4, F_GETFL, 0x00000000)			= 2
24939:	fstat64(4, 0xFFBEF278)				= 0
24939:	getsockopt(4, 65535, 8192, 0xFFBEF378, 0xFFBEF370, 0) = 0
24939:	fstat64(4, 0xFFBEF278)				= 0
24939:	getsockopt(4, 65535, 8192, 0xFFBEF378, 0xFFBEF374, 0) = 0
24939:	setsockopt(4, 65535, 8192, 0xFFBEF378, 4, 0)	= 0
24939:	fcntl(4, F_SETFL, 0x00000082)			= 0
24939:	fcntl(5, F_SETFD, 0x00000001)			= 0
24939:	lwp_cond_signal(0xFF1D34E8)			= 0
24939:	lwp_cond_wait(0xFF1D34E8, 0xFF1D34F8, 0xFF1CCD80) = 0
24939:	lwp_self()					= 3
24939:	    Incurred fault #6, FLTBOUNDS  %pc = 0xFF352A48
24939:	      siginfo: SIGSEGV SEGV_MAPERR addr=0x00000000
24939:	    Received signal #11, SIGSEGV [default]
24939:	      siginfo: SIGSEGV SEGV_MAPERR addr=0x00000000
24939:		*** process killed ***
Christophe: all the changes in NSPR 4.7.2 are listed in the release notes:
http://www.mozilla.org/projects/nspr/release-notes/nspr472.html

There are two incomplete porting changes that are not listed in the release
notes: the Symbian OS port and 64-bit Mac OS X port.
Note: the crash is only on Solaris 8 and 9 SPARC, not on Solaris 10 SPARC, not
on Solaris 9 and 10 x86.
System resources:
>	limit
cputime 	unlimited
filesize 	unlimited
datasize 	unlimited
stacksize 	32768 kbytes
coredumpsize 	unlimited
vmemoryuse 	unlimited
descriptors 	1024
Stack for pipeping2:

core 'core' of 24992:	./pipeping2
-----------------  lwp# 1 / thread# 1  --------------------
 ff352a48 ForkAndExec (119d0, 21c78, 0, 236d8, 0, 0) + c0
 ff3530e8 _MD_CreateUnixProcess (119d0, 21c78, 0, 236d8, 5, 4) + 80
 ff32d470 PR_CreateProcess (119d0, 21c78, 0, 236d8, 31960, 0) + 30
 00011394 main     (1, ffbef9b4, ffbef9bc, 21c00, 0, 0) + 364
 00010c00 _start   (0, 0, 0, 0, 0, 0) + 108
-----------------  lwp# 2 / thread# 2  --------------------
 ff29edc4 _signotifywait (ff1cc000, 0, 0, 0, 0, 0) + 8
 ff1b1c2c thr_yield (0, 0, 0, 0, 0, 0) + 8c
-----------------  lwp# 3  --------------------------------
 ff29c968 _door_return (3, ff1cd658, ff1cd670, 3, ff1cc000, 1) + 10
 ff1aa358 _lwp_start (ff145d98, 0, 6000, ffbef2e4, 0, 0) + 18
 ff1b1c2c thr_yield (0, 0, 0, 0, 0, 0) + 8c
--------------------------  thread# 3  --------------------
 ff1ad9b8 _reap_wait (ff1d0980, 1e924, 0, ff1cc000, 0, 0) + 38
 ff1ad710 _reaper  (ff1cce00, ff1d2708, ff1d0980, ff1ccdd8, 1, fe400000) + 38
 ff1bb01c _thread_start (0, 0, 0, 0, 0, 0) + 40
--------------------------  thread# 4  --------------------
 ff1bafd4 _restorefsr (249c0, 0, 0, 0, 0, 0) + 8
Stack for sockping:

core 'core' of 24998:	./sockping
-----------------  lwp# 1 / thread# 1  --------------------
 fdb52a48 ForkAndExec (11834, 21ad8, 0, 23538, 0, 0) + c0
 fdb530e8 _MD_CreateUnixProcess (11834, 21ad8, 0, 23538, 2, 4) + 80
 fdb2d470 PR_CreateProcess (11834, 21ad8, 0, 23538, 31960, 0) + 30
 0001126c main     (1, ffbef9b4, ffbef9bc, 21800, 0, 0) + 234
 00010c08 _start   (0, 0, 0, 0, 0, 0) + 108
-----------------  lwp# 2 / thread# 2  --------------------
 fda9edc4 _signotifywait (fd9cc000, 0, 0, 0, 0, 0) + 8
 fd9b1c2c thr_yield (0, 0, 0, 0, 0, 0) + 8c
-----------------  lwp# 3  --------------------------------
 fda9c968 _door_return (3, fd9cd658, fd9cd670, 3, fd9cc000, 1) + 10
 fd9aa358 _lwp_start (fd945d98, 0, 6000, ffbef2e4, 0, 0) + 18
 fd9b1c2c thr_yield (0, 0, 0, 0, 0, 0) + 8c
--------------------------  thread# 3  --------------------
 fd9ad9b8 _reap_wait (fd9d0980, 1e924, 0, fd9cc000, 0, 0) + 38
 fd9ad710 _reaper  (fd9cce00, fd9d2708, fd9d0980, fd9ccdd8, 1, fe400000) + 38
 fd9bb01c _thread_start (0, 0, 0, 0, 0, 0) + 40
--------------------------  thread# 4  --------------------
 fd9bafd4 _restorefsr (24820, 0, 0, 0, 0, 0) + 8
The bug appeared between NSPR_4_7_2_BETA2 AND NSPR_4_7_2_BETA3
Hmm. Out of resources, trying to fork.
I suspect the system ran out of process table entries.
That leaves us with the commits for the following bugs:
432430: [PATCH] NSPR port to Symbian OS, unit tests tested
313282: In strcstr.c there is an 'obvious improvement' waiting to be performed
451476: NSPR shared libraries should use direct bindings on Solaris
I omitted the changes that cannot affect Solaris.
I made a build with NSPR_4_7_2_RTM but without the changes for bug 451476 (removed -Bdirect).

Both tests pipeping2 and sockping don't crash with this build.
I made another build with NSPR_4_7_2_RTM but changes "-Bdirect" with "-z direct".
Both tests pipeping2 and sockping are still crashing.
Julien, this bug is about the test failures caused by the -Bdirect patch
of bug 451476.
Assignee: wtc → julien.pierre.boogz
Slavo,

I checked in the change to use -Bdirect on the NSPR trunk on august 29.
Why are we only seeing this regression 2 months later ?
Because we were still using the tag NSPR_4_7_1_RTM in preparation of NSS 3.11.10. I changed the tag to NSPR_4_7_2_RTM last week when Wan-Teh created it.

On the trunk build, we were also using NSPR_4_7_1_RTM since we built NSS 3.12.0 RTM and I forgot to remove it.

I don't think that we are running the NSPR tests in the Tinderbox.
Christophe, that's very unfortunate.

It means that 5 months worth of fixes made in NSPR since 4.7.1 have never had any QA. It's an egregious failure of QA. The process should be fixed so that this does not happen accidentally again.

Given this, I would vote in favor of not shipping NSPR 4.7.2rtm at Sun until we have had some more testing on the fixes on all platforms. 

None of the fixes in 4.7.2 appear to be critical to Sun except bug 453375 , and it's a trivial one to backport. We have the option of either doing an NSPR 4.7.1.1 with just that one fix, or rushing a 4.7.2.1 to fix the one regression we have found so far. Given how little testing the code changes from 4.7.2 have actually had, I would vote for doing 4.7.1.1 .
(In reply to comment #0)
>
> END TEST: pipeping2 (2008-09-22 21:34:15)
> TEST STATUS: pipeping2 = FAILED (errno 11)

"errno 11" is a typo of runtests.pl:
http://mxr.mozilla.org/nspr/source/nsprpub/pr/tests/runtests.pl#135

"errno" should be changed to "exit code" or "exit status".

Also, the Perl code that computes the exit status of a test
program doesn't handle abnormal termination due to a signal:
http://mxr.mozilla.org/nspr/source/nsprpub/pr/tests/runtests.pl#161

Here, the computed value of 11 is actually the value of SIGSEGV,
as the truss output in comment 5 and comment 6 showed.

In C code, we're supposed to use the WIFSIGNALED and WTERMSIG
macros to tell if the program was killed by a signal ands get the
signal number:
http://www.opengroup.org/onlinepubs/009695399/functions/waitpid.html

I don't know how to do the equivalent things in Perl.

In any case, the test programs did not fail with errno 11.  They
were killed by signal 11 (SIGSEGV).
I'll be happy to create NSPR 4.7.3 for you.  (I want to avoid using
a four-component version number for NSPR.)  Let me know if you want
to try to debug this crash, or I should just back out the -Ddirect
patch.

One thing common to pipeping2 and sockping is that they create a
child process (pipepong2 and sockpong).  The difference between
pipeping.c and pipeping2.c that Christophe pointed out in comment 0
is another clue.
Can one of you get a stack trace of the crash?  You can run
the pipeping2 or sockping test in dbx or invoke dbx on the
core file.
I provided the stack (using pstack) in comments 10 and 11. Is it not what you need ?
A process's exit value is called the "exit status" on Unix
and "exit code" on Windows.  I use "exit status" in this
patch.  If you prefer "exit code", please let me know.
Attachment #345113 - Flags: review?(christophe.ravel.bugs)
Could you run a debug build and use dbx to get a stack trace
with file names and line numbers?  Thanks.
Attachment #345113 - Flags: review?(christophe.ravel.bugs) → review+
Comment on attachment 345113 [details] [diff] [review]
Fix typos (errno => exit status) in runtests.pl

I have created Bug 462178 to fix the various issues I found with runtests.pl
Comment on attachment 345113 [details] [diff] [review]
Fix typos (errno => exit status) in runtests.pl

I moved this patch to bug 462178 as attachment 345348 [details] [diff] [review].
Attachment #345113 - Attachment is obsolete: true
If nobody posts the call stack of the crash (with file names and
line numbers) before Friday, I will back out the -Bdirect patch
on Friday morning.  I don't have access to Solaris 8 and 9 SPARC.

If you think I should back out the -Bdirect patch now, please let
me know.
I thought I had posted the dbx stack. Here it is:

(dbx) run
Running: pipeping2 
(process id 28288)
Reading librt.so.1
Reading libaio.so.1
Reading libsocket.so.1
t@1 (l@1) signal SEGV (no mapping at the fault address) in ForkAndExec at line 200 in file "uxproces.c"
  200           for (nEnv = 0; childEnvp[nEnv]; nEnv++) {
(dbx) where
current thread: t@1
=>[1] ForkAndExec(path = 0x11948 "pipepong2", argv = 0x21bf0, envp = (nil), attr = 0x23650), line 200 in "uxproces.c"
  [2] _MD_CreateUnixProcess(path = 0x11948 "pipepong2", argv = 0x21bf0, envp = (nil), attr = 0x23650), line 536 in "uxproces.c"
  [3] PR_CreateProcess(path = 0x11948 "pipepong2", argv = 0x21bf0, envp = (nil), attr = 0x23650), line 732 in "prinit.c"
  [4] main(), line 125 in "pipeping2.c"


-------

(dbx) run
Running: sockping 
(process id 28293)
Reading librt.so.1
Reading libaio.so.1
Reading libsocket.so.1
t@1 (l@1) signal SEGV (no mapping at the fault address) in ForkAndExec at line 200 in file "uxproces.c"
  200           for (nEnv = 0; childEnvp[nEnv]; nEnv++) {
(dbx) where
current thread: t@1
=>[1] ForkAndExec(path = 0x117a4 "sockpong", argv = 0x21a48, envp = (nil), attr = 0x234a8), line 200 in "uxproces.c"
  [2] _MD_CreateUnixProcess(path = 0x117a4 "sockpong", argv = 0x21a48, envp = (nil), attr = 0x234a8), line 536 in "uxproces.c"
  [3] PR_CreateProcess(path = 0x117a4 "sockpong", argv = 0x21a48, envp = (nil), attr = 0x234a8), line 732 in "prinit.c"
  [4] main(), line 106 in "sockping.c"

-----
Note: I posted earlier these stacks (using pstack) in comments #10 and #11.
Christophe: thanks.  The pstack output isn't as useful because it
doesn't have file names and line numbers.

So the problem has to do with the global variable 'environ'.
We pass envp=NULL to ForkAndExec.  So we set childEnvp to
'environ' on line 197 and then deference childEnvp on line 200:

189     childEnvp = envp;  <== Set to NULL.
190     if (attr && attr->fdInheritBuffer) {
191         PRBool found = PR_FALSE;
192 
193         if (NULL == childEnvp) {
194 #ifdef DARWIN
195             childEnvp = *(_NSGetEnviron());
196 #else
197             childEnvp = environ;  <== Set to environ.
198 #endif
199         }
200         for (nEnv = 0; childEnvp[nEnv]; nEnv++) {  <== CRASH
201         }

I don't know if there is a better way to get the array of
environment variables on Solaris.  Single Unix Specification
Version 3 has 'environ':
http://www.opengroup.org/onlinepubs/009695399/functions/environ.html

Could you try commenting out line 53 of
mozilla/nsprpub/pr/src/md/unix/uxproces.c:

50 #if defined(DARWIN)
51 #include <crt_externs.h>
52 #else
53 PR_IMPORT_DATA(char **) environ;  <== COMMENT THIS OUT
54 #endif

and see if that fixes the crash?  In the debugger, could you
print the values of 'environ' and 'childEnvp'?
(dbx) run
Running: pipeping2 
(process id 18161)
Reading librt.so.1
Reading libaio.so.1
Reading libsocket.so.1
t@1 (l@1) signal SEGV (no mapping at the fault address) in ForkAndExec at line 200 in file "uxproces.c"
  200           for (nEnv = 0; childEnvp[nEnv]; nEnv++) {
(dbx) print environ
environ = -4261084
(dbx) print childEnvp
childEnvp = (nil)
I traced the program from line 193:

(dbx) stop at uxproces.c:193
(2) stop at "uxproces.c":193
(dbx) run
Running: pipeping2 
(process id 18171)
Reading librt.so.1
Reading libaio.so.1
Reading libsocket.so.1
t@1 (l@1) stopped in ForkAndExec at line 193 in file "uxproces.c"
  193           if (NULL == childEnvp) {
(dbx) print childEnvp
childEnvp = (nil)
(dbx) step
t@1 (l@1) stopped in ForkAndExec at line 197 in file "uxproces.c"
  197               childEnvp = environ;
(dbx) print environ
environ = -4261084
(dbx) print childEnvp
childEnvp = (nil)
(dbx) step
t@1 (l@1) stopped in ForkAndExec at line 200 in file "uxproces.c"
  200           for (nEnv = 0; childEnvp[nEnv]; nEnv++) {
(dbx) print environ
environ = -4261084
(dbx) print childEnvp
childEnvp = (nil)
Wan-Teh,

Please don't back out the change to use -Bdirect until we understand the root cause and know if it should be taken out or not.

We have determined that we can ship with NSPR 4.7.1 at the moment at Sun, so there is no urgent need for a new NSPR release with this fix. But it should certainly remain a P1 issue for 4.7.3.
NSPR 4.7.2 was released last Friday, so this bug is a regression
in an announced patch release.  As long as we're making progress,
it's okay, but at some point I need to release a fix for this crash
to demonstrate that we take a regression in a patch release seriously.

Christophe told me that 'environ' is not declared in any header
under /usr/include (he searched under that directory recursively).
So when he commented out the declaration of 'environ' at line 53
of uxproces.c as I requested in comment 32, there was a compilation
error.
It would be nice to find out whether we have the same problems
with other global variables in libc, such as:
- stdin, stdout, and stderr
- errno, h_errno (I believe these are functions rather than
  global variables in multithreaded programs)
- the lookup tables used to implement the isdigit, isspace
  macros declared in <ctype.h>.

We manually declare 'environ' like this:
  extern char **environ;

Not sure if that's wrong.
Wan-Teh,

In response to comment 37, I built a very simple test case.
The problem seems to be only with environ, and only when it's in a shared lib built with -Bdirect.
I verified that stdin, stdout, stderr, and errno, were OK regardless.

I see the problem on Solaris 8 and 9, regardless of CPU type. There is no issue on solaris 10.

Looks like this is another issue I'm going to talk to the linker guys about .
Here is part of the answer that I got from the Solaris linker team :

quote.

environ is a symbol that is not supposed to be directly bound.
It needs to come from the copy that is part of the executable.
The way that this works is that libc offers the symbol (so that you
can link your object), but libc also marks it as being a symbol
that should not be directly bound. On my Nevada system:

    % elfdump -y libenvironDir.so | grep environ
           [8]  D            [0] libc.so.1                environ
    % elfdump -y /lib/libc.so | grep environ
         [207]  N                                         environ 

/end quote.

Unfortunately, this flag is only correctly set in the libc for Solaris 10 and Nevada. On Solaris 8 and 9 libc, the flag is not set on the environ symbol. Those operating systems do support the flag, but there are no tools that run on those platforms that are able to set the flag. Specifically, ld on s8/s9 does not support it. And the "elfedit" tool that can do this only runs on Nevada.

I will file a bug on libc on s8/s9 today about this, but it doesn't appear that our chances of getting the problem fixed in libc for s8/s9 are very high. So, we shouldn't make NSPR depend on the availability of that patch.

The linker team suggested we use the following hack to get things to work properly on s8/s9 :

  void* hdl;
  char*** e_addr;
  char** e;

  void* hdl = dlopen(0, RTLD_NOW|RTLD_GLOBAL);
  e_addr = dlsym(hdl, "environ"); 
  e = *e_addr;
Priority: -- → P1
Summary: Pipeping2 and sockping tests are failing on Solaris/Sparc. → Pipeping2 and sockping tests are crashing on Solaris 8/9
If we don't use this hack we could link the NSPR shared lib on s10. It will then be marked with environ as not being linked directly to libc. I verified that the resulting shared lib then runs fine on s8 . Of course that's not officially supported - bits made on s10 aren't officially supported on s8. And we still have a requirement to supported s8/s9/s10 with one set of shared libs.

Also, it would be a fairly nasty build process - since we build on s8, we would have to rlogin to an s10 box just to do the link ...
I would also add that IMO, when a module returns an error in C_Finalize, this is an indication to the application (here, NSS) that it's not able to free its resources. Why would it be able to free its resources at unload time in an handler and not in C_Finalize ?

NSS cannot really predict whether that resource leak is going to cause issues after unload/reload/reinitialization of the module. So far, the evidence with several modules says that it probably will cause at least some issues. On that basis I think it is desireable for NSS_Shutdown to at least have a mode of detecting these errors, whether there was a fork or not.
Oops! that last comment wasn't supposed to go in this bug. :)
Please ignore.
Attached patch Proposed patchSplinter Review
I implemented the workaround suggested by the Solaris linker team.
Please test this patch on Solaris.
Attachment #346151 - Flags: review?(julien.pierre.boogz)
Attachment #346151 - Flags: review?(julien.pierre.boogz) → review+
Comment on attachment 346151 [details] [diff] [review]
Proposed patch

Thanks for the patch, Wan-Teh. I tested this patch and found that it resolves the failure on s9.

I would like to make some changes to it to add a little bit more error checking and assertions in the new hack code.

I think we can also check if environ is non-NULL and skip the hack altogether for s10/nevada. There is no need to check for OS version fortunately.

I'm still looking into the possibility that there could be other symbols with the same problem. Please don't check the patch in just yet. I will provide an updated patch when I am done with that research.
It turns out there are many other symbols that should be NODIRECT, like malloc, free, etc, in libc alone. When linking on s10 and above, even with -Bdirect the linker picks up the NODIRECT flag from libc, and only links specific symbols directly that should be direct-linked, but excludes the NODIRECT. 

When linking on s8, the linker doesn't know how to do that, and the system libraries don't have the flags. It turns out there is a linker patch for s8 that allows the use of NODIRECT in shared lib map files. If we install that, we can resolve the problem. But the map files will be painful, since we will have to list all the system libs NODIRECT symbols that aren't in the s8 system shared libs.
The NODIRECT flag on malloc is just double protection.
Libraries that override malloc need to be built with the
-z interpose linker flag.

We can consider using -Bdirect on Solaris 10 and later
only.

The implications of -Bdirect seem very hard to fully
understand.  I'm not sure if it's worth the trouble.
There is a risk in using a feature that only a few
experts understand.  Well, this is your decision as
the Solaris platform owner.

In the interest of time (we need a new NSPR release for
the Firefox 3.0.x branch in bug 461082), I backed out
the Solaris -Bdirect linker flag on the NSPR trunk (NSPR 4.7.3).

Checking in configure.in;
/cvsroot/mozilla/nsprpub/configure.in,v  <--  configure.in
new revision: 1.243; previous revision: 1.242
done
Checking in configure;
/cvsroot/mozilla/nsprpub/configure,v  <--  configure
new revision: 1.239; previous revision: 1.238
done
Lowered the priority to P2 because I backed out the -Bdirect linker flag.
Priority: P1 → P2
We could conditionally put back the -Bdirect flag if we are linking on solaris 10. Ten years from now, we might stop linking on s8 or s9.
Priority: P2 → P1
I didn't mean to change the bug priority. Back to P2.
Priority: P1 → P2
It would be nice to verify in dbx that 'environ' is correct
on Solaris 10 with the -Bdirect linker flag.

You can run the pipeping2 test in dbx and set a breakpoint
in ForkAndExec.  After you step past line 197:

  197               childEnvp = environ;

print childEnvp[0], childEnvp[1], etc. in dbx to verify
they contain sane values.
Wan-Teh,

FYI, this is the test case I have been using to check the behavior of using -Bdirect vs not using it that I passed on to the Solaris guys. I thought it could be of interest.

I just added a few lines of code to verify that environ[0] and environ[1] were sane values when linking the shared lib with -Bdirect on s10.
Summary: Pipeping2 and sockping tests are crashing on Solaris 8/9 → ForkAndExec is crashing on Solaris 8/9 due to environ being NULL
Comment on attachment 346151 [details] [diff] [review]
Proposed patch

Withholding review on this patch, since additional problems were found with other symbols.
Attachment #346151 - Flags: review+ → review-
Attachment #346280 - Flags: review+
I tested this on s9 and s10 and it had the desired effect.
Attachment #346817 - Flags: review?(wtc)
Comment on attachment 346817 [details] [diff] [review]
Additional patch to enable -Bdirect on Solaris 10 and above

This patch is correct, but please don't check it in like this.

You need to run autoconf-2.13 to generate 'configure' before
you check it in.  If you don't have autoconf-2.13, I can check
this in for you.

Please look at the other case statements in 'configure.in' to
see how a case statement should be formatted.

You can combine the 5.8 and 5.9 cases with '|':
    5.8|5.9)

>+        *)         DSO_LDOPTS="$DSO_LDOPTS,-Bdirect";

Should we use two semicolons ;; at the end of this case?

Nit: we use
    case "${target_os}"
more often than
    case "${OS_RELEASE}"
to check the OS version.  $target_os contains the OS name, so
it's more than we need, but it'd be good to be consistent.
There are two such case statements for Solaris at line 2055
and line 2079.

Ideally we should not repeat the case statement in the "if"
and "else" blocks, but in this case it's probably not worth
the trouble.
Attachment #346817 - Flags: review?(wtc) → review-
Wan-Teh,

I don't have autoconf-2.13 . I always make changes to configure.* manually.

I looked at the formatting for other case statements, but did not find them to be very consistent. The formatting seems to differ quite a bit depending on the length of the statement for each case.

Re: the | character, sure we can use it.

Re: the two semicolons, we can put them in there for consistency, though it doesn't appear to make a difference.

I think the case "${target_os}" is OK. No need to check more than we have to.

As far as avoiding the two case statements in the GCC and Studio cases, those 2 compilers have very slightly different syntax (space and comma before -Bdirect), so that's why I put in two case statements.

I will attach an updated patch.
Attached patch Updated patch (obsolete) — Splinter Review
Attachment #346817 - Attachment is obsolete: true
Attachment #349090 - Flags: review?(wtc)
I edited Julien's patch and checked it in on the NSPR
trunk (NSPR 4.7.4).

Checking in configure;
/cvsroot/mozilla/nsprpub/configure,v  <--  configure
new revision: 1.241; previous revision: 1.240
done
Checking in configure.in;
/cvsroot/mozilla/nsprpub/configure.in,v  <--  configure.in
new revision: 1.245; previous revision: 1.244
done
Attachment #349090 - Attachment is obsolete: true
Attachment #349090 - Flags: review?(wtc)
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: 4.7.3 → 4.7.4
You need to log in before you can comment on or make changes to this bug.