Closed
Bug 348198
Opened 19 years ago
Closed 17 years ago
Selfserv PR_Bind failure - local network address in use.
Categories
(NSS :: Test, defect, P2)
NSS
Test
Tracking
(Not tracked)
RESOLVED
FIXED
3.12.1
People
(Reporter: slavomir.katuscak+mozilla, Assigned: julien.pierre)
Details
Attachments
(9 files, 2 obsolete files)
5.35 KB,
patch
|
julien.pierre
:
review+
alvolkov.bgs
:
superreview+
|
Details | Diff | Splinter Review |
1.48 KB,
patch
|
julien.pierre
:
review-
|
Details | Diff | Splinter Review |
3.88 KB,
patch
|
julien.pierre
:
review-
|
Details | Diff | Splinter Review |
925 bytes,
patch
|
Details | Diff | Splinter Review | |
5.19 KB,
patch
|
Details | Diff | Splinter Review | |
2.81 KB,
patch
|
Details | Diff | Splinter Review | |
778 bytes,
patch
|
nelson
:
review+
|
Details | Diff | Splinter Review |
2.47 KB,
patch
|
Details | Diff | Splinter Review | |
5.71 KB,
patch
|
julien.pierre
:
review+
nelson
:
superreview+
|
Details | Diff | Splinter Review |
ssl.sh: SSL3 Request don't require client auth on 2nd hs (EC) (client auth) ----
selfserv -D -p 8444 -d ../server -n mandela.red.iplanet.com \
-e mandela.red.iplanet.com-ec -w nss -r -r -r -i ../tests_pid.27746 &
selfserv started at Wed Aug 9 02:32:14 PDT 2006
tstclnt -p 8444 -h mandela.red.iplanet.com -q \
-d ../client < /share/builds/mccrel3/security/securitytip/builds/20060809.1/wozzeck_Solaris8/mozilla/security/nss/tests/ssl/sslreq.dat
selfserv -D -p 8444 -d ../server -n mandela.red.iplanet.com \
-e mandela.red.iplanet.com-ec -w nss -r -r -r -r -i ../tests_pid.27746 &
selfserv started at Wed Aug 9 02:32:14 PDT 2006
tstclnt -p 8444 -h mandela.red.iplanet.com -q \
-d ../client < /share/builds/mccrel3/security/securitytip/builds/20060809.1/wozzeck_Solaris8/mozilla/security/nss/tests/ssl/sslreq.dat
selfserv: normal termination
selfserv: PR_Bind returned error -5982:
Local Network address is in use.
tstclnt: Client timed out while waiting for connection to server: Connection refused by peer.
RETRY: tstclnt -p 8444 -h mandela.red.iplanet.com -q \
-d ../client < /share/builds/mccrel3/security/securitytip/builds/20060809.1/wozzeck_Solaris8/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt: Client timed out while waiting for connection to server: Connection refused by peer.
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
all.sh: kill: no such process
Happened in nightly testing 20060809 on mandela SunOS5.10_i86pc_DBG.OBJ in securitytip branch.
Assignee | ||
Comment 1•19 years ago
|
||
The PR_Bind failure may mean that the previous selfserv did not shut down properly.
Indeed, from the output listed, it seems that the message "Selfserv: normal termination" from the first selfserv appears after the "selfserv started at Wed Aug 9 02:32:14 PDT 2006" string of the second selfserv.
But I'm not sure we can rely on this order because the messages may be on different streams (stdout, stderr) and the shell buffers them.
I think some better synchronization may be needed.
The ssl.sh script currently sends a USR1 message to selfserv to notify it to shutdown. But that is followed by a "wait ${PID}"
See kill_selfserv below :########################### kill_selfserv ##############################
# local shell function to kill the selfserver after the tests are done
########################################################################
kill_selfserv()
{
if [ "${OS_ARCH}" = "WINNT" -a "$OS_NAME" = "CYGWIN_NT" ]; then
PID=${SHELL_SERVERPID}
else
PID=`cat ${SERVERPID}`
fi
if [ "${OS_ARCH}" = "WINNT" -o "${OS_ARCH}" = "WIN95" -o "${OS_ARCH}" = "OS2" ]; then
${KILL} ${PID}
else
${KILL} -USR1 ${PID}
fi
wait ${PID}
if [ ${fileout} -eq 1 ]; then
cat ${SERVEROUTFILE}
fi
# On Linux selfserv needs up to 30 seconds to fully die and free
# the port. Wait until the port is free. (Bug 129701)
if [ "${OS_ARCH}" = "Linux" ]; then
until selfserv -b -p ${PORT} 2>/dev/null; do
sleep 1
done
fi
rm ${SERVERPID}
}
The additional 30 second wait to free the port should not be necessary on Solaris. It would be useful if this problem could be reproduced at will in order to fix it.
Reporter | ||
Comment 2•19 years ago
|
||
Does selfserv use SO_REUSEADDR option ? This can prevent problems with opening blocked sockets just after they were closed.
Comment 3•19 years ago
|
||
(In reply to comment #2)
I think it is set that way.
A part of the selfserv.c is given below.
***********************************************
getBoundListenSocket(unsigned short port)
{
PRFileDesc * listen_sock;
int listenQueueDepth = 5 + (2 * maxThreads);
PRStatus prStatus;
PRNetAddr addr;
PRSocketOptionData opt;
addr.inet.family = PR_AF_INET;
addr.inet.ip = PR_INADDR_ANY;
addr.inet.port = PR_htons(port);
listen_sock = PR_NewTCPSocket();
if (listen_sock == NULL) {
errExit("PR_NewTCPSocket");
}
opt.option = PR_SockOpt_Nonblocking;
opt.value.non_blocking = PR_FALSE;
prStatus = PR_SetSocketOption(listen_sock, &opt);
if (prStatus < 0) {
errExit("PR_SetSocketOption(PR_SockOpt_Nonblocking)");
}
opt.option=PR_SockOpt_Reuseaddr;
opt.value.reuse_addr = PR_TRUE;
prStatus = PR_SetSocketOption(listen_sock, &opt);
....
*************************************************************************
Comment 4•19 years ago
|
||
(In reply to comment #0)
Slavomir,
Cud u describe a bit more of the situation in which u found this PR_Bind() error?
It occurred during running ssl.sh, right? Now, in ssl.sh,it runs several functions,like ssl_run(which runs ssl_cov,ssl_auth,...),ssl_crl etc. Did it already run some of those? Is it reproducible?
Thanks
Reporter | ||
Comment 5•19 years ago
|
||
Hi Biswatosh,
I have seen it only in nightly testing (I didn't try to reproduce manualy). Today it happened again, you can look here:
http://cindercone.red.iplanet.com/share/builds/mccrel3/security/securitytip/builds/20060818.1/wozzeck_Solaris8/mozilla/tests_results/security/biarritz.1/output.log
Assignee | ||
Comment 6•18 years ago
|
||
Slavo,
Did this happen only on Solaris x86, or on other platforms also ?
Please list the specific OS version. Thanks.
Updated•18 years ago
|
Assignee: nobody → slavomir.katuscak
Summary: Selfserver PR_Bind failure - local network address in use. → Selfserv PR_Bind failure - local network address in use.
Reporter | ||
Comment 7•18 years ago
|
||
Last 2 occurences were on Solaris x86 but it happened also before, but I didn't submit it as special bug then, only as comment in nightly testing details.
It happened also on AIX, HPUX, Linux 2.6 x86.
OS: Solaris → All
Hardware: PC → All
Assignee | ||
Comment 8•18 years ago
|
||
The SO_REUSEADDR should prevent this problem from occurring. I could believe that one platform might have an implementation issue with it, but not 4 platforms.
We really need to figure out the synchronization. I suggest we add echo statements in the script at the time of the kill, wait and startup of selfserv, and include the PID and a timestamp in each of those. This should help determine what's going on. Slavo, can you please make these changes in ssl.sh and attach a patch ? Thanks.
Reporter | ||
Comment 9•18 years ago
|
||
Attachment #235071 -
Flags: review?(julien.pierre.bugs)
Assignee | ||
Comment 10•18 years ago
|
||
Comment on attachment 235071 [details] [diff] [review]
ssl.sh verbosity patch
Slavo,
Everything looks good except the following change.
We don't want to use selfserv -b on platforms that don't need it . This was a hack to work around a Linux OS bug. It shouldn't be needed on all platforms. Let's leave it within the if block at this time. If you just put the echo statement after the if, then r+.
- #if [ "${OS_ARCH}" = "Linux" ]; then
- kill -0 $PID >/dev/null 2>/dev/null || Exit 10 "Fatal - selfserv process not detectable"
- #else
- #$PS -e | grep $PID >/dev/null || \
- #Exit 10 "Fatal - selfserv process not detectable"
- #fi
+
+ echo "kill -0 ${PID} >/dev/null 2>/dev/null"
+ kill -0 ${PID} >/dev/null 2>/dev/null || Exit 10 "Fatal - selfserv process not detectable"
+
+ echo "selfserv with PID ${PID} found at `date`"
}
Attachment #235071 -
Flags: review?(julien.pierre.bugs) → review-
Assignee | ||
Comment 11•18 years ago
|
||
Sorry, I pasted the wrong part of the patch. The part I object to is actually :
- # On Linux selfserv needs up to 30 seconds to fully die and free
- # the port. Wait until the port is free. (Bug 129701)
- if [ "${OS_ARCH}" = "Linux" ]; then
- until selfserv -b -p ${PORT} 2>/dev/null; do
- sleep 1
- done
- fi
+
+ echo "selfserv -b -p ${PORT} 2>/dev/null;"
+ until selfserv -b -p ${PORT} 2>/dev/null; do
+ echo "RETRY: selfserv -b -p ${PORT} 2>/dev/null;"
+ sleep 1
+ done
+
+ echo "selfserv with PID ${PID} killed at `date`"
+
Reporter | ||
Comment 12•18 years ago
|
||
Julien, I thought that it will not mind also on other system, when there will be runned selfserv with -b parameter. By -h description it should try to bind to selected port and then exit (some sanity test). This shouldn't cause any problem. But I have it removed in this version.
Attachment #235071 -
Attachment is obsolete: true
Attachment #235115 -
Flags: review?(julien.pierre.bugs)
Assignee | ||
Updated•18 years ago
|
Attachment #235115 -
Flags: review?(julien.pierre.bugs) → review+
Reporter | ||
Updated•18 years ago
|
Priority: -- → P2
Reporter | ||
Comment 13•18 years ago
|
||
Happened 4 times today (20060906.1) on LEIA, WINNT5.0, both DBg and OBJ, branch securityjes5.
ssl.sh: SSL3 Request don't require client auth on 2nd hs (client does not provide auth) ----
selfserv -D -p 8447 -d ../server -n LEIA.red.iplanet.com \
-e LEIA.red.iplanet.com-ec -w nss -r -r -r -i ../tests_pid.2912 &
selfserv started at Wed Sep 6 21:14:52 PDT 2006
tstclnt -p 8447 -h LEIA.red.iplanet.com -q \
-d ../client < U:/security/securityjes5/builds/20060906.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt -p 8447 -h LEIA.red.iplanet.com -f -d ../client \
-T -w nss -n none < U:/security/securityjes5/builds/20060906.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt: unable to connect (poll): Local Network address is in use.
ssl.sh: SSL3 Request don't require client auth on 2nd hs (client does not provide auth) (cert TestUser41 - not revoked) produced a returncode of 1, expected is 0 FAILED
ssl.sh: TLS Require client auth (EC) (bad password) ----
selfserv -D -p 8447 -d ../ext_server -n LEIA.red.iplanet.com -B -s \
-e LEIA.red.iplanet.com-ec -w nss -r -r -i ../tests_pid.2912 &
selfserv started at Wed Sep 6 22:16:50 PDT 2006
tstclnt -p 8447 -h LEIA.red.iplanet.com -q \
-d ../ext_client < U:/security/securityjes5/builds/20060906.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt -p 8447 -h LEIA.red.iplanet.com -f -d ../ext_client \
-w bogus -n ExtendedSSLUser-ec < U:/security/securityjes5/builds/20060906.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt: unable to connect (poll): Local Network address is in use.
ssl.sh: TLS Require client auth (EC) (bad password) produced a returncode of 1, expected is 254 FAILED
ssl.sh: SSL3 Request don't require client auth on 2nd hs (EC) (bad password) ----
selfserv -D -p 8447 -d ../server -n LEIA.red.iplanet.com \
-e LEIA.red.iplanet.com-ec -w nss -r -r -r -i ../tests_pid.2548 &
selfserv started at Wed Sep 6 22:57:59 PDT 2006
tstclnt -p 8447 -h LEIA.red.iplanet.com -q \
-d ../client < U:/security/securityjes5/builds/20060906.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt -p 8447 -h LEIA.red.iplanet.com -f -d ../client \
-T -n TestUser42-ec -w bogus < U:/security/securityjes5/builds/20060906.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt: unable to connect (poll): Local Network address is in use.
ssl.sh: SSL3 Request don't require client auth on 2nd hs (EC) (bad password) (cert TestUser42 - revoked) produced a returncode of 1, expected is 0 FAILED
ssl.sh: SSL3 Require client auth (EC) (client auth) ----
selfserv -D -p 8447 -d ../server -n LEIA.red.iplanet.com \
-e LEIA.red.iplanet.com-ec -w nss -r -r -i ../tests_pid.2548 &
selfserv started at Wed Sep 6 23:26:52 PDT 2006
tstclnt -p 8447 -h LEIA.red.iplanet.com -B -s -q \
-d ../client < U:/security/securityjes5/builds/20060906.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt -p 8447 -h LEIA.red.iplanet.com -f -d ../client -B -s \
-T -n TestUser-ec -w nss < U:/security/securityjes5/builds/20060906.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt: unable to connect (poll): Local Network address is in use.
ssl.sh: SSL3 Require client auth (EC) (client auth) produced a returncode of 1, expected is 0 FAILED
Assignee | ||
Comment 14•18 years ago
|
||
Slavo,
Do you have CVS access now ? The verbosity patch needs to be checked in.
Assignee | ||
Updated•18 years ago
|
Attachment #235115 -
Flags: superreview?(alexei.volkov.bugs)
Updated•18 years ago
|
Attachment #235115 -
Flags: superreview?(alexei.volkov.bugs) → superreview+
Assignee | ||
Comment 15•18 years ago
|
||
Thanks for the 2nd review Alexei. Unfortunately, there was a merge error for the branch, again because of the 332222 changes, but it was a very simple one to fix, so I did it manually.
tip :
Checking in ssl.sh;
/cvsroot/mozilla/security/nss/tests/ssl/ssl.sh,v <-- ssl.sh
new revision: 1.70; previous revision: 1.69
done
branch :
Checking in ssl.sh;
/cvsroot/mozilla/security/nss/tests/ssl/ssl.sh,v <-- ssl.sh
new revision: 1.61.2.7; previous revision: 1.61.2.6
done
Slavo, you need to get CVS access ASAP.
Reporter | ||
Comment 16•18 years ago
|
||
Hi Julien,
Thanks for check in of this bug.
I'm just looking to the document how to get CVS write access, there is needed to file it as a bug and then send printed and filled document by fax. I will do this at monday when I will be in the office.
There is also written that I will need a voucher, so who should I ask for super-review ?
Assignee | ||
Comment 17•18 years ago
|
||
Slavo,
You can ask me and Nelson for the voucher.
Assignee | ||
Comment 18•18 years ago
|
||
I have since this error too occasionally. Here is a log excerpt. Note that I modified selfserv to print the pid along with the "normal termination" message.
In the log below, we see
kill -USR1 28637
selfserv with PID 28637 killed at Fri Sep 22 17:47:08 PDT 2006
In between these 2 lines, there should be a message saying :
selfserv: normal termination . pid=28637
As is the case for the earlier server with PID 27775 .
This causes the third server's (with pid 28655) to fail binding .
Yet, somehow PID 28637 did not stay around, and there is no core file. So, I think it received the USR1 signal. Process 28637 went down, but apparently just not quickly enough for that to occur before the start of the third server.
My conclusion is that the "wait ${PID}" in ssl.sh returned before PID 28637 actually completed. This would explain why the port was not released in time, and maybe also why we don't see the server's termination message .
tstclnt: read from socket failed: SSL peer rejected your certificate as revoked.
kill -0 27775 >/dev/null 2>/dev/null
selfserv with PID 27775 found at Fri Sep 22 17:47:08 PDT 2006
ssl.sh: SSL3 Request don't require client auth on 2nd hs (EC) (client auth)(cert TestUser52 - revoked) produced a returncode of 1, expected is 1 PASSED
trying to kill selfserv with PID 27775 at Fri Sep 22 17:47:08 PDT 2006
kill -USR1 27775
selfserv: normal termination . pid=27775 .
selfserv with PID 27775 killed at Fri Sep 22 17:47:08 PDT 2006
ssl.sh: SSL3 Request don't require client auth on 2nd hs (EC) (client auth) ----
selfserv starting at Fri Sep 22 17:47:08 PDT 2006
selfserv -D -p 8443 -d ../server -n monstre.red.iplanet.com \
-e monstre.red.iplanet.com-ec -w nss -r -r -r -i ../tests_pid.10767 &
waiting for selfserv at Fri Sep 22 17:47:08 PDT 2006
tstclnt -p 8443 -h monstre.red.iplanet.com -q \
-d ../client < /export/home/nss/tip/mozilla/security/nss/tests/ssl/sslreq.dat
kill -0 28637 >/dev/null 2>/dev/null
selfserv with PID 28637 found at Fri Sep 22 17:47:08 PDT 2006
selfserv with PID 28637 started at Fri Sep 22 17:47:08 PDT 2006
trying to kill selfserv with PID 28637 at Fri Sep 22 17:47:08 PDT 2006
kill -USR1 28637
selfserv with PID 28637 killed at Fri Sep 22 17:47:08 PDT 2006
selfserv starting at Fri Sep 22 17:47:08 PDT 2006
selfserv -D -p 8443 -d ../server -n monstre.red.iplanet.com \
-e monstre.red.iplanet.com-ec -w nss -r -r -r -r -i ../tests_pid.10767 &
waiting for selfserv at Fri Sep 22 17:47:08 PDT 2006
tstclnt -p 8443 -h monstre.red.iplanet.com -q \
-d ../client < /export/home/nss/tip/mozilla/security/nss/tests/ssl/sslreq.dat
selfserv: PR_Bind returned error -5982:
Local Network address is in use.
tstclnt: Client timed out while waiting for connection to server: Connection refused by peer.
RETRY: tstclnt -p 8443 -h monstre.red.iplanet.com -q \
-d ../client < /export/home/nss/tip/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt: Client timed out while waiting for connection to server: Connection refused by peer.
kill -0 28655 >/dev/null 2>/dev/null
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
./all.sh: kill: no such process
Assignee | ||
Comment 19•18 years ago
|
||
Unfortunately, as you can see in the previous log, all the timestamps are : Fri Sep 22 17:47:08 PDT 2006 . I guess my machine is really fast. It would be nice to get some higher-resolution timestamps. But I think for now we should focus on why wait doesn't wait long enough . My problem was seen on Solaris x86. It is possible that the failure on other platforms has a different cause.
Reporter | ||
Comment 20•18 years ago
|
||
Julien, can you try to use workaround for bug 129071 on your machine ?
In comment #11 you wrote that it was only problem in Linux, but I'm not really sure about this, because "wait ${PID}" doens't look reliable to be sure that process is ended and port is free.
I think this workaround can be solution for this bug on all platforms.
Assignee | ||
Comment 21•18 years ago
|
||
Slavo,
That workaround will significantly slow down the QA, and shouldn't be needed. To resolve this bug, we first need to figure out why wait doesn't work actually wait long enough on many platforms . We can't come up with a fix or workaround before we know the root cause of the problem. Is it possible that we are using wait incorrectly ? Or is this a bug common to all the various shells and platforms we use ?
Reporter | ||
Comment 22•18 years ago
|
||
Julien:
I was looking to selfserv code and it seems that problem is there. You don't see message 'selfserv: normal termination' in your log, because selfserv probably didn't finished with return value 0.
if (NSS_Shutdown() != SECSuccess) {
SECU_PrintError(progName, "NSS_Shutdown");
PR_Cleanup();
exit(1);
}
PR_Cleanup();
printf("selfserv: normal termination\n");
return 0;
SECU_PrintError prints to STDERR, so we don't see this message in output log, but there is also possible that exit() is called from some other called function while cleanup. We should add there check for return value of selfserv.
Usage of wait in ssl.sh seems to be correct, if there is really bug in Linux (I don't think that there is, because then SO_REUSEADDR will not make sense to use), then it's possible that it will be also in other systems.
Assignee | ||
Comment 23•18 years ago
|
||
Slavo,
In many circumstances, if NSS_Shutdown() fails, it is because of leaked slot references. In debug builds, if the NSS_STRICT_SHUTDOWN variable is set, which it should be in the QA, selfserv will crash, and on Unix platforms, a core file is produced. We haven't seen those core files so far. It's possible that NSS_Shutdown() failed for another reason.
Even if selfserv failed with a non-zero error code, the shell should still wait for the process termination, and it should be able to bind again right away.
Regarding the possible missing output messages from selfserv, is it possible to make the QA script print the output of both stdout and stderr from selfserv, rather than just stdout ? If not, we should change selfserv to print everything to stdout.
Reporter | ||
Comment 24•18 years ago
|
||
Julien,
Printing output also from stderr is easy thing, you just need to add 2>&1 to selfserv command in start_selfserv() function in ssl.sh. Also return value checking is easy.
I did some tests and analysis of code and I found another problem:
In selfserv.c there is handler for USR1 signal set relatively late, it's set from do_accepts() function, after many initializations and after functions like PR_Bind(). In my tests which were running functions start_selfserv and kill_server from ssl.sh in cycle, I was able to hit with kill command to time before handler is set, so signal was not handled correctly, there wasn't done any cleanup and I got return value 138, also some strange error on output:
selfserv starting at Fri Oct 13 19:04:37 CEST 2006
selfserv -D -p 8443 -d ../server -n phantasm.czech.sun.com \
-e phantasm.czech.sun.com-ec -w nss -i ../tests_pid.3987 &
waiting for selfserv at Fri Oct 13 19:04:37 CEST 2006
tstclnt -p 8443 -h phantasm.czech.sun.com -q \
-d ../client < /home/sk159068/nss/mozilla/security/nss/tests/ssl/sslreq.dat
kill -0 4170 >/dev/null 2>/dev/null
selfserv with PID 4170 found at Fri Oct 13 19:04:37 CEST 2006
selfserv with PID 4170 started at Fri Oct 13 19:04:37 CEST 2006
trying to kill selfserv with PID 4170 at Fri Oct 13 19:04:37 CEST 2006
kill -USR1 4170
./ssl.sh: line 183: 4170 User defined signal 1 selfserv -D -p ${PORT} -d ${P_R_SERVERDIR} -n ${HOSTADDR} ${SERVER_OPTIONS} ${ECC_OPTIONS} -w nss ${sparam} -i ${R_SERVERPID} $verbose
selfserv with PID 4170 killed at Fri Oct 13 19:04:37 CEST 2006
On line 183 of ssl.sh I have: wait ${PID}.
This is probably not a main reason of our problem, because when you look to log in description of this bug, there was selfserv closed correctly with 'Local Network address is in use.' message.
Reporter | ||
Comment 25•18 years ago
|
||
There is one more simple answer what was the reason of this problem. There are running tests on all branches simultaneously. It's possible that between seflserv stop and start there was started selfserv from another branch.
Assignee | ||
Comment 26•18 years ago
|
||
Slavo,
That could certainly explain it, but I would expect more frequent conflicts if this was the case. These multiple QA runs from different branches are supposed to be running on different ports. You need to set the PORT environment variable for each one to customize the port. I believe I told Jason and Christophe about it at least a year ago. Are we running the QA for different branches on different ports ? If not, we should do that.
Unfortunately, I believe this is not the problem, or at least not the only one, as I have - very rarely - run into this problem on my own system, when I was not running any other server on the selfserv port .
Assignee | ||
Comment 27•18 years ago
|
||
I confirmed that securityjes5, securitytp, and securitypkix are running selfserv on different ports.
Reporter | ||
Comment 28•18 years ago
|
||
This I found today on frozen Tinderbox:
selfserv with PID 20828 found at Thu Feb 22 08:44:08 PST 2007
tstclnt -p 8443 -h situp.red.iplanet.com -c :C006 -B -s \
-f -d ../ext_client < /export/tinderbox/SunOS_5.10/mozilla/security/nss/tests/ssl/sslreq.dat
NFS server redbuild.red.iplanet.com not responding still trying
NFS server redbuild.red.iplanet.com ok
selfserv: PR_Bind returned error -5982:
Local Network address is in use.
Messages anout NFS problems are not logged, it's possible that these failures are caused by network problems.
In Tinderbox testing we should have all tools stored on local machines and there should be not needed any network communication except cvs checkout and e-mails to Tinderbox web server.
Reporter | ||
Comment 29•18 years ago
|
||
There is also problem with selfserv PID in ssl.sh. It's checked in file ${TMP}/tests_pid.$$, but this file is not created, so kill_selfserv() doesn't work correctly. I'm going to fix this.
Reporter | ||
Comment 30•18 years ago
|
||
Some analysis of log from comment 28 - selfserv already exists for longer time and already processed some other client data, but it failed after tstclnt message. It seems that failing PR_Bind() is not the one called when creating server, there must be some other PR_Bind() call while processing tstclnt data, probably called indirectly from some other called function.
Comment 31•18 years ago
|
||
Slavo, I suspect that this problem will go away when you fix the selfserv PID
file issue you described in comment 29.
The next time you find a hung selfserv, make a core file of it by killing it
with signal SEGV.
Assignee | ||
Comment 32•17 years ago
|
||
I continue to run into this bind problem intermittently on some platforms. I was testing a patch on many platforms, and ran into it on Solaris and AIX. This makes it very difficult to test patches for regressions.
Slavo, how is the issue in comment 29 preventing the server from binding ? Is it causing the next server to be started before the previous one has fully shutdown in some cases ? If so, it needs to be fixed ASAP.
Assignee | ||
Comment 33•17 years ago
|
||
Oh, and it also happened on HP-UX and Linux. Out of a total of 26 builds I ran the QA against overnight, I got this error 7 times. That's a 26.9% occurrence rate, which is horrible. I know nobody else was running anything on the same port on the machine - I chose custom ports.
I don't understand how the nightly QA can be green this often given the results I got. 2 of the failures were on a virgin source tree, 5 on a patched source tree, but I'm confident my patch isn't changing the bind behavior.
Reporter | ||
Comment 34•17 years ago
|
||
(In reply to comment #32)
> Slavo, how is the issue in comment 29 preventing the server from binding ? Is
> it causing the next server to be started before the previous one has fully
> shutdown in some cases ? If so, it needs to be fixed ASAP.
I think that's it. But it's really surprise for me that it happen in 26.9% in your tests. I see it only sometimes in nightly tests. Can you try to make a core as Nelson suggests in comment 31 ?
Assignee | ||
Comment 35•17 years ago
|
||
Slavo,
I can't make a core because I never had a hung selfserv in my runs. The selfserv process always goes away, but apparently not early enough before we start the next selfserv, so the port is held in use.
Assignee | ||
Comment 36•17 years ago
|
||
Re: comment 30, the only PR_Bind in the NSS source tree are in :
a) cmd/sslsample (dead code)
b) cmd/libpkix/pkix_pl/module/test_socket.c . But the error doesn't happen while running libpkix tests
c) cmd/selfserv/selfserv.c
d) cmd/ssltap/ssltap.c . And we don't run ssltap in our QA.
e) lib/libpkix/pkix_pl_nss/module/pkix_pl_socket.c, which is used only by b)
Also, the error string is always :
selfserv: PR_Bind returned error -5982
So, the only possibility is that the PR_Bind error is from selfserv.
Reporter | ||
Comment 37•17 years ago
|
||
This problem is there for a longer time, and I see it relatively often in nightly tests.
Some analysis what happens there:
kill_selfserv():
1. Send kill -USR1 to selfserv.
2. Wait for selfserv shutdown (using wait command from OS) - after this server should be in TIME_WAIT or CLOSED state.
start_selfserv():
3. Call selfserv command and try to bind to the same port - using SO_REUSEADDR flag, this should open socket in both cases - TIME_WAIT and CLOSED state - but sometimes it doesn't.
In fact, I don't think there is a scripting error in ssl.sh causing this bug.
There are 3 possible explanations:
1. Bug in OS - after successful wait command, there is still port in state other than TIME_WAIT or CLOSED.
2. Bug related to SO_REUSEADDR (can be in OS, NSPR implementation or selfserv itself) - bind fails also when connection is in TIME_WAIT state.
3. There is another selfserv open on the same port (not called from current all.sh run).
I think options 2 and 3 are more probable then 1, because this issue occured on more OS.
In comment 11 there is mentioned a Linux OS bug + workaround for it at the end of kill_seflserv(). This workaround tries to bind to selected port until succeeds (doesn't report failures). I looked to related bugs (129701, 119340) and I didn't find there any Linux bug related to connection closing, there was a bug in ps command on Linux where running selfserv was not detected by ps command. We don't use ps command to detect selfserv in our scripts now, so this Linux workaround doesn't make much sense as it is now - when it runs only on Linux. But it can be usefull for us in different way.
My suggestion - modify kill_selfserv() function and change workaround as this:
1. Kill + wait (as it is now).
2. Try to bind to selected port (selfserv -b) - report an error if fails.
3. In case of failure repeat step 2 more (for example 10) times with 1 second delay - and report every failure.
After those steps, we can be sure that port is free and reusable (or all 10 tries fails).
Frequency of failures in bind tests can help us find the reason of the problem:
1. 10 failures in row - another selfserv running at the same time (probably previous run not closed).
2. Only 1 failure - most probably OS_REUSEADDR problem or wait problem.
3. Random number of failures more times in one all.sh run - probably another all.sh instance running at the same time.
Reporter | ||
Comment 38•17 years ago
|
||
Workaround suggested in previous comment.
Attachment #296126 -
Flags: review?(julien.pierre.boogz)
Assignee | ||
Comment 39•17 years ago
|
||
Slavo,
Sorry, but I'm not ready to approve the patch. It is a workaround to the problem, but not an actual solution. Let's try to narrow down the problem some more.
In response to your comment 37, your analysis of what kill_selfserv does is correct, at least for the Unix platforms.
If the problem is not a scripting error, I think it may be a shell bug.
Let's examine your theories :
1. it's a possibility, but it seems unlikely that every implementation on every OS is broken, however.
2. same comment as for 1 regarding OS bug in SO_REUSEADDR.
There is no bug in selfserv WRT setting PR_SockOpt_Reuseaddr . It is always set unconditionally before PR_Bind / PR_Listen.
I don't think the NSPR implementations are broken either. They just translate to the OS option. I have traced the implementation on Solaris in ptio.c and it translated to the correct setsockopt call, with the right SOL_SOCKET level (0xffff) and SO_REUSEADDR option value (0x4)
3. I think this is the most likely explanation. There is probably another selfserv process with the port still open.
Has this problem occurred on any machines that only run a single instance of NSS QA ?
Is it possible there are some machines that run multiple instances on the same port (ie. the PORT environment variable is not set) ?
It definitely wasn't the case on the machines I ran into the problem with. So, I think it still leaves open the possibility that ssl.sh caused 2 selfserv processes to be started at the same time on the same port. This may be either a script bug or a shell bug. But either way, we need to prevent it.
I would suggest another patch to continue investigation of the bind failure : when it occurs, try to run ps -ef | grep selfserv . The command may be different on some platforms since sometimes the file paths are too long. You need to test this command to make sure it detects the process correctly (eg. don't send the -USR1 signal to test it). You might also want to print out the full output of ptree. Also, the netstat output would be useful to see the state of the socket on that port . But I would let this continue to fail intermittently until we know the root cause.
Assignee | ||
Comment 40•17 years ago
|
||
Comment on attachment 296126 [details] [diff] [review]
Workaround.
See comments in the bug.
Attachment #296126 -
Flags: review?(julien.pierre.boogz) → review-
Reporter | ||
Comment 41•17 years ago
|
||
(In reply to comment #39)
> I would suggest another patch to continue investigation of the bind failure :
> when it occurs, try to run ps -ef | grep selfserv . The command may be
> different on some platforms since sometimes the file paths are too long. You
> need to test this command to make sure it detects the process correctly (eg.
> don't send the -USR1 signal to test it). You might also want to print out the
> full output of ptree. Also, the netstat output would be useful to see the state
> of the socket on that port . But I would let this continue to fail
> intermittently until we know the root cause.
>
I think ps -ef and netstat -af inet should be enough for our purposes. We don't need to get all details, we only need to know if there is some other process using the same port.
Reporter | ||
Comment 42•17 years ago
|
||
Patch adding ps -ef and netstat -af inet. Works only on Solaris (one platform is enough for this, it's not regular test and supporting others is only overhead).
Attachment #305754 -
Flags: review?(julien.pierre.boogz)
Assignee | ||
Updated•17 years ago
|
Attachment #305754 -
Flags: review?(julien.pierre.boogz) → review+
Reporter | ||
Comment 43•17 years ago
|
||
Checking in ssl.sh;
/cvsroot/mozilla/security/nss/tests/ssl/ssl.sh,v <-- ssl.sh
new revision: 1.90; previous revision: 1.89
done
Comment 44•17 years ago
|
||
Windows' netstat now has options to tell you about the process that is holding each socket shown. The -b option shows the PID and executable (.exe) name.
Does Solaris have anything similar? Or Linux?
Comment 45•17 years ago
|
||
netstat -p I believe does the same thing on Linux
Reporter | ||
Comment 46•17 years ago
|
||
Previous patch contained construction:
selfserv &
RET=$?
This didn't work correctly, detected value was always 0.
Solution is to put return value check (+ actions related to it) into background together with selfserv. After doing this I have seen that sometimes checking for selfserv started before it was started, so I added also sleep 1 line.
Attachment #305754 -
Attachment is obsolete: true
Attachment #310449 -
Flags: review?(julien.pierre.boogz)
Assignee | ||
Comment 47•17 years ago
|
||
Comment on attachment 310449 [details] [diff] [review]
Fix for previous patch.
There is no guarantee that 1 second is enough. It may be too much also given how many tests we have. But this isn't going to be reliable or provide useful input on existing failures of selfserv.
Attachment #310449 -
Flags: review?(julien.pierre.boogz) → review-
Assignee | ||
Comment 48•17 years ago
|
||
Here is some additional feedback :
1) with this patch, there is a risk of escalating concurrent runs of selfserv / netstat . We need to make sure to limit it to 1 run of selfserv when we run in the background. I don't know how it's possible to do this sort of synchronization in a shell script. If you do, then it would need to be added if we want to use this approach of starting the whole function in the background.
2) a message should be added after the netstat so that we know that it has completed . Otherwise we won't know if we have incomplete data.
3) I am concerned that netstat will take a very long time . It really varies per machine and depending on load. Since we are dealing with a race here, it's very possible that this will take too long and not provide us with the data we need.
I think we need to try another approach. I'm going on the theory that we have multiple selfserv processes running concurrently and actually trying to bind. We should be able to definitely test that theory in the C code using an approach similar to the following :
1) just before binding, create a global named semaphore . If this fails, we will know that another selfserv is running . Selfserv should print out an error message to tell us that. Then we will know it is not a TCP/IP stack problem for sure.
2) after closing of the listen socket in selfserv, delete this global named semaphore
There are some additional implications for the multi-process case of selfserv, but we don't need to make this work right now since we are not testing it yet.
The semaphore name wouldn't be entirely global. It should be something hardcoded + the port. Eg. "selfserv_xxxx". We may or may not want to trigger this code by default. If not, we can add a command-line option.
Go and read http://www.mozilla.org/projects/nspr/reference/html/pripcsem.html . The functions PR_OpenSemaphore (with PR_SEM_EXCL) and PR_DeleteSemaphore should do the job.
Reporter | ||
Comment 49•17 years ago
|
||
Julien,
Running only one selfserv at one time seems to be a big restriction, we usually run more of them at one time but on different ports (branch and trunk). We can include port to semaphore name, but I would like to try something else now.
My approach now is to get return values from selfserv (current version doesn't work and we need this), this can help us to check if there is a connection between this bug and failing selfserv shutdown. If we see bind failure everytime after selfserv shutdown fails, I don't expect that the reason is other selfserv instance.
One second delay is there more for logging reason, I added it there after I received messages from client side and server side in opposite order (first successfull client connection, then message that selfserv started).
Assignee | ||
Comment 50•17 years ago
|
||
Slavo,
Re: comment 49,
Apparently we see the message "selfserv: normal termination" in the log. So it seems selfserv terminates correctly.
The question is *when* the first selfserv really shuts down, not if. The log file is not synchronous since selfserv is started in the background. We can't rely on the order of what we see in the log when it comes to selfserv.
I think the semaphore with port name in it will not introduce any regression, and should help us figure out if we are trying to start the 2nd selfserv before the 1st one has exited.
Reporter | ||
Comment 51•17 years ago
|
||
(In reply to comment #50)
> Apparently we see the message "selfserv: normal termination" in the log. So it
> seems selfserv terminates correctly.
Not really, I analyzed some logs with this failure and without. Since I'm checking also logs from previous tests when this bug occured, in all cases when I see bind failure there is missing message "selfserv: normal termination" in previous selfserv shutdown (all cases in last 1-2 months). When selfserv shuts down normally, this message is always there. I expect some failure inside selfserv which causes that it doesn't shut down normally and exit value wouldn't be 0 in this case.
Assignee | ||
Comment 52•17 years ago
|
||
This expedient patch should help in tracing this problem. If we don't have a normal termination, at least we should still see an "exiting" message from selfserv, with its PID, so we know that it has terminated.
This is still short of what we need. I think we need a separate log file for the server, like commercial servers have. Maybe both an errors log file (which is written synchronously) and a separate access log file (written asynchronously). Every log entry should be timestamped with the maximum available resolution from NSPR. We have been running the server in the background of the shell for long enough. It is clearly not adequate. We don't really see when which server process terminates.
The client programs also need more information. Timestampped entries would be good there too. They don't have to have separate log files since they run properly within the shell.
Attachment #314258 -
Flags: review?(nelson)
Assignee | ||
Comment 53•17 years ago
|
||
I have tested this patch both on Solaris and Windows. It works and prevents 2 selfserv from running on the same port, before the bind occurs.
On Windows, normally the OS lets any number of processes bind to the same port when SO_REUSEADDR is in use as we do. If my theory is correct that our script tries to run 2 selfserv concurrently on the same port, it might explain why we haven't seen this problem on Windows - because it can't happen there.
I just ran all.sh on Solaris with my patch. It started running all.sh but pretty soon I ran into an error. Here is the end of the log :
kill -0 25453 >/dev/null 2>/dev/null
selfserv with PID 25453 found at Mon Apr 7 22:48:48 PDT 2008
ssl.sh: #824: SSL3 Require client auth (EC) (client auth)(cert TestUser52 - revoked) produced a returncode of 254, expected is 254 - PASSED
trying to kill selfserv with PID 25453 at Mon Apr 7 22:48:48 PDT 2008
kill -USR1 25453
selfserv: closed global semaphore. PID=25453
selfserv: 0 cache hits; 41 cache misses, 0 cache not reusable
0 stateless resumes, 0 ticket parse failures
selfserv: normal termination . PID=25453
selfserv: exiting. PID=25453
selfserv with PID 25453 killed at Mon Apr 7 22:48:48 PDT 2008
ssl.sh: SSL3 Require client auth (EC) (client auth) ----
selfserv starting at Mon Apr 7 22:48:48 PDT 2008
selfserv -D -p 8443 -d ../server -n monstre.red.iplanet.com \
-e monstre.red.iplanet.com-ec -w nss -r -r -i ../tests_pid.14363 &
trying to connect to selfserv at Mon Apr 7 22:48:48 PDT 2008
tstclnt -p 8443 -h monstre.red.iplanet.com -q \
-d ../client < /net/monstre/export/home/julien/nss/tip/mozilla/security/nss/tests/ssl/sslreq.dat
selfserv: created global semaphore. PID=26400
kill -0 26400 >/dev/null 2>/dev/null
selfserv with PID 26400 found at Mon Apr 7 22:48:48 PDT 2008
selfserv with PID 26400 started at Mon Apr 7 22:48:48 PDT 2008
trying to kill selfserv with PID 26400 at Mon Apr 7 22:48:48 PDT 2008
kill -USR1 26400
selfserv with PID 26400 killed at Mon Apr 7 22:48:48 PDT 2008
selfserv starting at Mon Apr 7 22:48:48 PDT 2008
selfserv -D -p 8443 -d ../server -n monstre.red.iplanet.com \
-e monstre.red.iplanet.com-ec -w nss -r -r -r -i ../tests_pid.14363 &
trying to connect to selfserv at Mon Apr 7 22:48:48 PDT 2008
tstclnt -p 8443 -h monstre.red.iplanet.com -q \
-d ../client < /net/monstre/export/home/julien/nss/tip/mozilla/security/nss/tests/ssl/sslreq.dat
selfserv: unable to create global semaphore. PID=26421
selfserv: Unable to create global semaphore. returned error -5943:
Cannot create or rename a filename that already exists.
selfserv: exiting. PID=26421
tstclnt: Client timed out while waiting for connection to server: Connection refused by peer.
retrying to connect to selfserv at Mon Apr 7 22:49:59 PDT 2008
tstclnt -p 8443 -h monstre.red.iplanet.com -q \
-d ../client < /net/monstre/export/home/julien/nss/tip/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt: Client timed out while waiting for connection to server: Connection refused by peer.
ssl.sh: #825: Waiting for Server - FAILED
kill -0 26421 >/dev/null 2>/dev/null
ssl.sh: Exit: 10 Fatal - selfserv process not detectable - FAILED
ssl.sh: #826: 10 Fatal - selfserv process not detectable - FAILED
./ssl.sh: kill: no such process
107.05u 16.58s 5:12.06 39.6%
There are several troubling things about that log :
1) There is no "selfserv: normal termination . PID=26400" message. This means the process terminated abnormally.
2) Nowhere does the line "selfserv: exiting. PID=26400" appear.
This should have been displayed because it's in the atexit function.
Yet, this process is not running on my system anymore. And it wasn't running anymore either while the tstclnt was timing out.
3) since the atexit function was not invoked, the semaphore was not cleaned up, and is still present in /tmp/.SEMDtmp_selfserv8443 . It has to be deleted manually unfortunately, since Unix does not automatically delete semaphores without any users like windows does. This explains why the next server proces, with pid 26421, was not able to create the semaphore. It doesn't mean it actually ran concurrently with process 26400, since process 26400 terminated abnormally and didn't get a chance to free the sem.
I think a key to solving the problem is to figure out why process 26400 exited abnormally without calling the atexit callback function. A similar problem may also cause an issue with the binding of the port in the TCP/IP stack.
We tried to get this process to shut down by sending it a signal. Maybe something went wrong there. I didn't see any core dump, though.
I am going to turn on the signal handler logging and print the PID there too to see if the server got the signal. Hopefully I can reproduce the issue on my system. It seems it's easier to run into this problem with my semaphore patch than just with the TCP/IP stack. At least I got it to happen twice in 2 runs of ssl.sh so far.
I will also try to run ssl.sh with this patch on Windows to see if it can pass there.
Attachment #314275 -
Flags: review?(nelson)
Assignee | ||
Comment 54•17 years ago
|
||
I'm 3 for 3 in reproducing the failure with all.sh now thanks to this latest patch. Here is what I got :
trying to kill selfserv with PID 12090 at Mon Apr 7 23:08:51 PDT 2008
kill -USR1 12090
selfserv: sigusr1_handler: stop server. PID=12090
selfserv: closed global semaphore. PID=12090
selfserv: 0 cache hits; 41 cache misses, 0 cache not reusable
0 stateless resumes, 0 ticket parse failures
selfserv: normal termination . PID=12090
selfserv: exiting. PID=12090
selfserv with PID 12090 killed at Mon Apr 7 23:08:51 PDT 2008
ssl.sh: SSL3 Require client auth (EC) (client auth) ----
selfserv starting at Mon Apr 7 23:08:51 PDT 2008
selfserv -D -p 8443 -d ../server -n monstre.red.iplanet.com \
-e monstre.red.iplanet.com-ec -w nss -r -r -i ../tests_pid.973 &
trying to connect to selfserv at Mon Apr 7 23:08:51 PDT 2008
tstclnt -p 8443 -h monstre.red.iplanet.com -q \
-d ../client < /net/monstre/export/home/julien/nss/tip/mozilla/security/nss/tests/ssl/sslreq.dat
selfserv: created global semaphore. PID=13037
kill -0 13037 >/dev/null 2>/dev/null
selfserv with PID 13037 found at Mon Apr 7 23:08:51 PDT 2008
selfserv with PID 13037 started at Mon Apr 7 23:08:51 PDT 2008
trying to kill selfserv with PID 13037 at Mon Apr 7 23:08:51 PDT 2008
kill -USR1 13037
selfserv with PID 13037 killed at Mon Apr 7 23:08:51 PDT 2008
selfserv starting at Mon Apr 7 23:08:51 PDT 2008
selfserv -D -p 8443 -d ../server -n monstre.red.iplanet.com \
-e monstre.red.iplanet.com-ec -w nss -r -r -r -i ../tests_pid.973 &
trying to connect to selfserv at Mon Apr 7 23:08:51 PDT 2008
tstclnt -p 8443 -h monstre.red.iplanet.com -q \
-d ../client < /net/monstre/export/home/julien/nss/tip/mozilla/security/nss/tests/ssl/sslreq.dat
selfserv: unable to create global semaphore. PID=13058
selfserv: Unable to create global semaphore. returned error -5943:
Cannot create or rename a filename that already exists.
selfserv: exiting. PID=13058
tstclnt: Client timed out while waiting for connection to server: Connection refused by peer.
retrying to connect to selfserv at Mon Apr 7 23:10:02 PDT 2008
tstclnt -p 8443 -h monstre.red.iplanet.com -q \
-d ../client < /net/monstre/export/home/julien/nss/tip/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt: Client timed out while waiting for connection to server: Connection refused by peer.
ssl.sh: #825: Waiting for Server - FAILED
kill -0 13058 >/dev/null 2>/dev/null
ssl.sh: Exit: 10 Fatal - selfserv process not detectable - FAILED
ssl.sh: #826: 10 Fatal - selfserv process not detectable - FAILED
./ssl.sh: kill: no such process
103.20u 16.65s 5:09.39 38.7%
Conspicuously absent is the message :
selfserv: sigusr1_handler: stop server. PID=13037
So, apparently, the kill -USR1 13037 caused the process to go down, but without invoking the signal handler. Otherwise, we would have seen this message. There is also no core dump anywhere.
I am at a loss as to why selfserv wouldn't catch the signal properly. We check that the signal handler gets installed, and we exit if it did not. And the signal handler is installed early on, before the first PR_Accept call. And we know there was an accept because tstclnt worked against that server, and the shell then went on to try to stop it. So i can assume that the signal handler was installed, though I will add extra logging in the success case to make doubly certain. The next things to look into are perhaps a problem with the arguments to the sigaction call in selfserv, or a problem with the shell sending the signal.
Comment 55•17 years ago
|
||
> So, apparently, the kill -USR1 13037 caused the process to go down, but
> without invoking the signal handler.
That precisely describes the "default action" (SIG_DFL) for the SIGUSR1 signal,
as documented at http://opengroup.org/onlinepubs/007908775/xsh/signal.h.html
I wonder if the sigaction call was actually done, or if some other function
has subsequently reset the action for that signal.
There's a short list of Unix library and system calls that are allowed to be
called from a signal handler. See
http://opengroup.org/onlinepubs/007908775/xsh/sigaction.html#tag_000_008_581_003
I wonder if function stop_server calls any not in that list.
According to
http://opengroup.org/onlinepubs/007908775/xsh/sigaction.html#tag_000_008_585
signal handlers should not use heap manipulation functions (malloc, free),
nor any stdio functions, "both of which traditionally use data structures in
a non-reentrant manner."
It seems to me that it would be premature to review the patches above while
this work remains in an exploratory/investigative phase.
Updated•17 years ago
|
Attachment #314258 -
Flags: review?(nelson)
Comment 56•17 years ago
|
||
Comment on attachment 314275 [details] [diff] [review]
Also use global semaphore as described in comment 48
Based on Julien's comments that these patches didn't work as intended, I'm canceling the review request.
Attachment #314275 -
Flags: review?(nelson)
Assignee | ||
Comment 57•17 years ago
|
||
Nelson,
Actually, the intent of these patches was to try to find the root cause of the bug. So I think they worked. Once the signal handler issue is resolved, we might want to put them in.
FYI, I was able to run successfully both on Linux and Windows with the semaphore patch. Only Solaris consistently gives a problem right now. So perhaps the issue on Linux is a different one than signal handling. Or it is less frequent there.
Assignee | ||
Comment 58•17 years ago
|
||
Nelson,
Re: comment 55,
I put a printf at the very beginning of sigusr1_handler to log the fact that it was invoked and the PID . Hopefully printf doesn't invoke any functions that are illegal for a signal handler. I did not see the output. Even if stop_server does something that's not allowed, I would still expect to see the output from the printf.
So, most likely the default action for SIGUSR1 was invoked indeed. We install it fairly late, after we bind but before the first accept. That's because stop_server() isn't safe to call until after the server has been properly started.
This should be OK, because the script should never try to kill selfserv before it has done a connection on it first with one of the clients.
Maybe there is a bug in the script where it starts selfserv and tries to kill it right away without doing any connection ? I haven't seen any evidence of that yet, but it's possible since the client and server outputs are intermixed and not well timestampped. If that's the case, the script should be fixed.
I will experiment with trying to install the signal handler earlier.
Comment 59•17 years ago
|
||
printf is a stdio function. printf(fmt, vars...) is essentially equivalent
to fprintf(stdout, fmt, vars...)
Among the possibilities to consider are:
a) the sigaction wasn't done
b) it was done but was subsequently undone (action reset to SIG_DFL) by
another sigaction or signal call,
c) it was done, but the handler erred, causing a failure.
Assignee | ||
Comment 60•17 years ago
|
||
I installed a dummy SIGUSR1 handler at the very beginning of main that just does printf and PORT_Assert(0) .
It got caught in the middle of my execution of ssl.sh on Solaris. I got the following stack from it :
[1] __lwp_kill(0x1, 0x6), at 0xfea84a27
[2] _thr_kill(0x1, 0x6), at 0xfea821d4
[3] raise(0x6), at 0xfea311d3
[4] abort(0x807f198, 0xfebb2d09, 0x80454b8, 0x8058d61, 0x8068cf8, 0x8068cfc), at 0xfea115d9
=>[5] PR_Assert(s = 0x8068cf8 "0", file = 0x8068cfc "selfserv.c", ln = 1330), line 577 in "prlog.c"
[6] sigusr1_dummy_handler(sig = 16), line 1330 in "selfserv.c"
[7] __sighndlr(0x10, 0x0, 0x8045564, 0x8058d10), at 0xfea83def
[8] call_user_handler(0x10, 0x0, 0x8045564), at 0xfea7a292
[9] sigacthandler(0x10, 0x0, 0x8045564, 0xf, 0x0, 0x0), at 0xfea7a412
---- called from signal handler with signal 16 (SIGUSR1) ------
[10] s_mp_rshd(mp = 0x804586c, p = 4U), line 3051 in "mpi.c"
[11] s_mp_div_2d(mp = 0x804586c, d = 256U), line 3151 in "mpi.c"
[12] s_mp_redc(T = 0x804586c, mmm = 0x80458d8), line 95 in "mpmontg.c"
[13] mp_exptmod_safe_i(montBase = 0x8045900, exponent = 0x80459bc, modulus = 0x80459bc, result = 0x804595c, mmm = 0x80458d8, nLen = 8, bits_in_exponent = 256U, window_bits = 4U, num_powers = 16U), line 1049 in "mpmontg.c"
[14] mp_exptmod(inBase = 0x804596c, exponent = 0x80459bc, modulus = 0x80459bc, result = 0x804595c), line 1195 in "mpmontg.c"
[15] mpp_fermat(a = 0x80459bc, w = 2U), line 249 in "mpprime.c"
[16] mpp_make_prime(start = 0x8045a64, nBits = 256U, strong = 0, nTries = 0x8045a10), line 500 in "mpprime.c"
[17] generate_prime(prime = 0x8045a64, primeLen = 32), line 199 in "rsa.c"
[18] RSA_NewKey(keySizeInBits = 512, publicExponent = 0x8045bb0), line 273 in "rsa.c"
[19] RSA_NewKey(keySizeInBits = 512, publicExponent = 0x8045bb0), line 207 in "loader.c"
[20] NSC_GenerateKeyPair(hSession = 1U, pMechanism = 0x8045ccc, pPublicKeyTemplate = 0x8045e38, ulPublicKeyAttributeCount = 8U, pPrivateKeyTemplate = 0x8045ea4, ulPrivateKeyAttributeCount = 7U, phPublicKey = 0x8045ca4, phPrivateKey = 0x8045ca8), line 3642 in "pkcs11c.c"
[21] PK11_GenerateKeyPairWithOpFlags(slot = 0x8168728, type = 0, param = 0x8045fbc, pubKey = 0x8045fec, attrFlags = 70U, opFlags = 0, opFlagsMask = 0, wincx = (nil)), line 1095 in "pk11akey.c"
[22] PK11_GenerateKeyPairWithFlags(slot = 0x8168728, type = 0, param = 0x8045fbc, pubKey = 0x8045fec, attrFlags = 70U, wincx = (nil)), line 1178 in "pk11akey.c"
[23] PK11_GenerateKeyPair(slot = 0x8168728, type = 0, param = 0x8045fbc, pubKey = 0x8045fec, token = 0, sensitive = 1, wincx = (nil)), line 1202 in "pk11akey.c"
[24] SECKEY_CreateRSAPrivateKey(keySizeInBits = 512, pubk = 0x8045fec, cx = (nil)), line 214 in "seckey.c"
[25] ssl3_CreateRSAStepDownKeys(ss = 0x817fd10), line 8496 in "ssl3con.c"
[26] SSL_ConfigSecureServer(fd = 0x8085588, cert = 0x81730b8, key = 0x817a720, kea = ssl_kea_rsa), line 786 in "sslsecur.c"
[27] server_main(listen_sock = 0x8085588, requestCert = 2, privKey = 0x804613c, cert = 0x8046150), line 1701 in "selfserv.c"
[28] main(argc = 16, argv = 0x80461d0), line 2372 in "selfserv.c"
This is evidence that the script is sending a SIGUSR1 to the server before it has done a successful connection with it. The server is still in the middle of its initialization.
Comment 61•17 years ago
|
||
(In reply to comment #60)
> This is evidence that the script is sending a SIGUSR1 to the server before it
> has done a successful connection with it. The server is still in the middle of
> its initialization.
What does that tell us about this bug (PR_Bind failure)?
Assignee | ||
Comment 62•17 years ago
|
||
Nelson,
I'm theorizing that the PR_Bind failure and the failure I have with the semaphores have the same cause. So if we fix one it should fix the other.
I think the main problem is that ssl.sh starts the server in the background. It has no idea when the server is actually ready to accept connections. It just tries to run a client against it right away. The server takes a completely non-deterministic amount of time to be ready.
Some mechanism needs to be added to selfserv / ssl.sh to know when it is OK to start running the client.
In this case, I would have expected the client (tstclnt) to fail and report an error. I don't know why it didn't.
Assignee | ||
Comment 63•17 years ago
|
||
The script does try to connect to selfserv by using tstclnt -q.
I added some more logging in selfserv. I made it log the PID every time it does an accept.
Here is what I got :
selfserv: do_accept: Got connection. PID=29459
selfserv: sigusr1_handler: stop server. PID=29459
selfserv: closed global semaphore. PID=29459
selfserv: 0 cache hits; 41 cache misses, 0 cache not reusable
0 stateless resumes, 0 ticket parse failures
selfserv: normal termination . PID=29459
selfserv: exiting. PID=29459
selfserv with PID 29459 killed at Tue Apr 8 19:48:39 PDT 2008
ssl.sh: SSL3 Require client auth (EC) (client auth) ----
selfserv starting at Tue Apr 8 19:48:39 PDT 2008
selfserv -D -p 8443 -d ../server -n monstre.red.iplanet.com \
-e monstre.red.iplanet.com-ec -w nss -r -r -i ../tests_pid.18364 &
trying to connect to selfserv at Tue Apr 8 19:48:40 PDT 2008
tstclnt -p 8443 -h monstre.red.iplanet.com -q \
-d ../client < /net/monstre/export/home/julien/nss/tip/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt: successfully connected to server.
kill -0 438 >/dev/null 2>/dev/null
selfserv with PID 438 found at Tue Apr 8 19:48:40 PDT 2008
selfserv with PID 438 started at Tue Apr 8 19:48:40 PDT 2008
trying to kill selfserv with PID 438 at Tue Apr 8 19:48:40 PDT 2008
kill -USR1 438
selfserv: dummy SIGUSR1 signal handler successfully installed. PID=438
selfserv startup - PID=438
selfserv: created global semaphore. PID=438
selfserv: dummy sigusr1 handler invoked. PID=438
selfserv with PID 438 killed at Tue Apr 8 19:48:40 PDT 2008
ssl.sh: #825: kill_selfserv core detection step - Core file is detected.
selfserv starting at Tue Apr 8 19:48:40 PDT 2008
selfserv -D -p 8443 -d ../server -n monstre.red.iplanet.com \
-e monstre.red.iplanet.com-ec -w nss -r -r -r -i ../tests_pid.18364 &
trying to connect to selfserv at Tue Apr 8 19:48:40 PDT 2008
tstclnt -p 8443 -h monstre.red.iplanet.com -q \
-d ../client < /net/monstre/export/home/julien/nss/tip/mozilla/security/nss/tests/ssl/sslreq.dat
selfserv: dummy SIGUSR1 signal handler successfully installed. PID=468
selfserv startup - PID=468
selfserv: unable to create global semaphore. PID=468
selfserv: exiting. PID=468
tstclnt: successfully connected to server.
kill -0 468 >/dev/null 2>/dev/null
ssl.sh: Exit: 10 Fatal - selfserv process not detectable - FAILED
ssl.sh: #826: 10 Fatal - selfserv process not detectable - Core file is detected.
I am confused by the above result.
a) selfserv PID 29459 is the last one to have accepted a connection. But it apparently exited at 19:48:39 . At least, that's what the script says. It printed the message that it was killed . That message is only printed after wait ${PID} returned.
b) selfserv PID 438 was started at 19:48:39. It was alive at 19:48:40 (kill -0) . It had the dummy SIGUSR1 handler invoked. This means that this server hadn't bound the port and wasn't listening yet, since the dummy handler gets replaced by the real handler just before PR_Bind
c) tstclnt successfully connected after being started at 19:48:40
What I don't understand is, what server did that tstclnt actually connect to ?
selfserv PID 29459 apparently had already gone down before the connection was attempted, and selfserv PID 438 wasn't listening yet.
The only theory that makes sense right now is that PID 29459 did not in fact go down earlier. Perhaps wait ${PID} returned early with some error. Then, the script tried to start selfserv again. It made a connection to check that the server was up. But the connection was actually processed by the previous server process (29459). Meanwhile, the new server (438) was initializing. After tstclnt returned, the script tried to kill the new server, which was still being initialized.
Assignee | ||
Comment 64•17 years ago
|
||
I have been changing my semaphore logic a little bit. In several of my runs, I saw some evidence that the "wait ${pid}" in the script was not working. My latest log contained this :
selfserv with PID 20525 found at Wed Apr 9 18:35:21 PDT 2008
selfserv with PID 20525 started at Wed Apr 9 18:35:21 PDT 2008
trying to kill selfserv with PID 20525 at Wed Apr 9 18:35:21 PDT 2008
kill -USR1 20525
selfserv with PID 20525 killed at Wed Apr 9 18:35:21 PDT 2008
Yet after this point, process 20525 was actually still alive, and it had become an orphan ! It was still holding the port, preventing a bind.
This is as I had concluded 2 years ago in comment 18 .
"My conclusion is that the "wait ${PID}" in ssl.sh returned before PID 28637
actually completed. This would explain why the port was not released in time,
and maybe also why we don't see the server's termination message ."
Is there any way to check the status from this wait to see if it failed ?
And why would it ever fail ?
Assignee | ||
Comment 65•17 years ago
|
||
This patch checks that the selfserv process id really down. It does that by sending a kill -0 after the wait ${PID} .
With just this patch, and without any changes to the C code that's on the trunk, I was able to reproduce the problem. Here is a log excerpt :
selfserv: HDX PR_Read returned error -8180:
Peer's Certificate has been revoked.
tstclnt: write to SSL socket failed: SSL peer rejected your certificate as revoked.
kill -0 8345 >/dev/null 2>/dev/null
selfserv with PID 8345 found at Wed Apr 9 19:19:18 PDT 2008
ssl.sh: #701: TLS Require client auth (client auth)(cert TestUser52 - revoked) produced a returncode of 254, expected is 254 - PASSED
trying to kill selfserv with PID 8345 at Wed Apr 9 19:19:18 PDT 2008
kill -USR1 8345
selfserv: 0 cache hits; 41 cache misses, 0 cache not reusable
0 stateless resumes, 0 ticket parse failures
selfserv: normal termination
status from waitpid : 0
Now let's make sure it's really dead.
kill -0 8345 >/dev/null 2>/dev/null
selfserv with PID 8345 found at Wed Apr 9 19:19:18 PDT 2008 . It should be dead !
ssl.sh: Exit: 12 - FAILED
ssl.sh: #702: 12 - FAILED
cat: cannot open /net/monstre/export/home/julien/nss/tip/mozilla/tests_results/security/monstre.33/tests_pid.29924
trying to kill selfserv with PID at Wed Apr 9 19:19:19 PDT 2008
kill -USR1
./ssl.sh: usage: kill [ [ -sig ] id ... | -l ]
status from waitpid : 0
Now let's make sure it's really dead.
ssl.sh: Error - selfserv PID file /net/monstre/export/home/julien/nss/tip/mozilla/tests_results/security/monstre.33/tests_pid.29924 doesn't exist
ssl.sh: Exit: 9 Fatal - selfserv pid file /net/monstre/export/home/julien/nss/tip/mozilla/tests_results/security/monstre.33/tests_pid.29924 does not exist - FAILED
ssl.sh: #703: 9 Fatal - selfserv pid file /net/monstre/export/home/julien/nss/tip/mozilla/tests_results/security/monstre.33/tests_pid.29924 does not exist - FAILED
So, the root cause is a race where a second selfserv is started before the first one is really over. This is what causes the PR_Bind errors, at least on Solaris.
When I added the semaphores in the earlier tests, the window for the race was extended, and it was easier to reproduce.
I believe this race happens fairly often. But most of the time it doesn't result in bind errors, because the first selfserv process ends up closing the listen socket before the second one tries to bind.
The bad news is that this seems to be a shell bug. It may not be a cross-platform bug. It's possible that the root cause on other platforms is different.
I'm not really sure why the check happens twice and is broken the second time (I would expect my "Exit 12" call in is_selfserv_dead to exit the first time), but nevertheless, the first instance of the problem is consistent with my theory.
The good news is that there should be a simple work-around. On platforms that have the problem, we can delete the wait altogether and replace it with a kill -0 / kill -SIGUSR1 loop . If after sometime the process still doesn't go down, then we should do a kill -9. The kill -9 had been removed from the script at the time that the signal handling logic was put in. This is something that we needed to put back in. But it can't work unless we are notified that the process is still up.
I don't believe there is a bug in the script here other than the fact that wait in the shell fails.
Oh, and the return from waitpid that I loggd is always 0 ...
Assignee | ||
Comment 66•17 years ago
|
||
I tried the workaround from comment 65. It didn't work. I ended up with a zombie selfserv process. I ran ptree and got this :
7817 tcsh
13302 /bin/sh ./ssl.sh
21676 <defunct>
21683 /bin/sh ./ssl.sh
22748 sleep 1
This shows that we have two shells running ssl.sh apparently . One parent and one child. The parent is the one that started selfserv. But it appears the child is the one waiting for it. I think that's the cause of the problem. If we can bring it down to one shell process, I think the wait ${PID} should work.
Comment 67•17 years ago
|
||
Even kill -9 won't clean up TCPCBs (the in-kernel part of the sockets) that
are in "TIME WAIT" state. (See RFC 793, Figure 6). And those TCPCBs may
result in busy/in-use errors for subsequent bind attempts, IINM (although,
SO_REUSEADDR should mitigate that).
Assignee | ||
Comment 68•17 years ago
|
||
Nelson,
There isn't any socket left in TIME_WAIT state at any point that I have seen here. There isn't any connection failure prior to the time that we get a server bind error, or when more generally the lifespan of 2 selfserv processes overlaps as demonstrated in attachment 314756 [details] [diff] [review]. So, I don't think that's TIME_WAIT sockets are the problem.
IMO, we need to get the wait call to work properly in the shell - probably without a kill -9. Unfortunately, as should be evident from this bug, I am not a shell expert. I have no idea why it would fork itself here. I think I am going to try to get dtrace to get me some more info on what's really going on in this mess. I hope it will shed some light.
Assignee | ||
Comment 69•17 years ago
|
||
Using truss, I determined that the shell uses the system call "waitid" to wait for the child. When the problem occurs, the wait fails with the error ECHILD. This error is apparently not reported to the the shell program.
I also added the following at the beginning of selfserv :
printf("ppid : %d.\n", getppid());
I noted that every single process that was successfully waited for had the same parent process id, which makes sense.
But at one point, a selfserv gets a new parent :
ppid : 23038.
ppid : 23038.
ppid : 23038.
ppid : 14236.
That last selfserv with a parent of 14236 is the one that is silently not waited for properly.
pid 14236 is the very first shell that started ran ssl.sh .
pid 23038 is another shell forked by 14236 at 80s into my test, after the completion of cert.sh . Apparently, the CRL tests run in shell 23038 .
Then, when pid 14236 is done with it, it waits for it. And continues running the tests itself, without the help of that child.
The point where new selfserv changes parent, and where the problem occurs, is always exactly at the 122nd invocation selfserv. I am still not sure what causes the shell to switch over parent process, but hopefully this information should help track it down to a particular function in ssl.sh . This is the end of the log before the error :
0 cache hits; 1 cache misses, 0 cache not reusable
0 stateless resumes
HTTP/1.0 200 OK
Server: Generic Web Server
Date: Tue, 26 Aug 1997 22:10:05 GMT
Content-type: text/plain
GET / HTTP/1.0
EOF
kill -0 1069 >/dev/null 2>/dev/null
selfserv with PID 1069 found at Wed Apr 9 22:31:09 PDT 2008
ssl.sh: #700: TLS Require client auth (client auth)(cert TestUser51 - not revoked) produced a returncode of 0, expected is 0 - PASSED
Server Args: -r_-r
tstclnt -p 8443 -h monstre.red.iplanet.com -f -d ../client \
-w nss -n TestUser52 < /net/monstre/export/home/julien/nss/tip/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt: write to SSL socket failed: SSL peer rejected your certificate as revoked.
kill -0 1069 >/dev/null 2>/dev/null
selfserv with PID 1069 found at Wed Apr 9 22:31:10 PDT 2008
ssl.sh: #701: TLS Require client auth (client auth)(cert TestUser52 - revoked) produced a returncode of 254, expected is 254 - PASSED
trying to kill selfserv with PID 1069 at Wed Apr 9 22:31:10 PDT 2008
kill -USR1 1069
status from waitpid : 0
ppid : 14236.
selfserv: 0 cache hits; 41 cache misses, 0 cache not reusable
0 stateless resumes, 0 ticket parse failures
selfserv: normal termination . PID = 1069
Now let's make sure it's really dead.
kill -0 1069 >/dev/null 2>/dev/null
selfserv with PID 1069 found at Wed Apr 9 22:31:10 PDT 2008 . It should be dead !
ssl.sh: Exit: 12 - FAILED
ssl.sh: #702: 12 - FAILED
trying to kill selfserv with PID at Wed Apr 9 22:31:10 PDT 2008
kill -USR1
status from waitpid : 0
Now let's make sure it's really dead.
ssl.sh: Error - selfserv PID file /net/monstre/export/home/julien/nss/tip/mozilla/tests_results/security/monstre.7/tests_pid.14236 doesn't exist
ssl.sh: Exit: 9 Fatal - selfserv pid file /net/monstre/export/home/julien/nss/tip/mozilla/tests_results/security/monstre.7/tests_pid.14236 does not exist - FAILED
ssl.sh: #703: 9 Fatal - selfserv pid file /n
Reporter | ||
Comment 70•17 years ago
|
||
(In reply to comment #58)
> This should be OK, because the script should never try to kill selfserv before
> it has done a connection on it first with one of the clients.
Sometimes there is a situation when we kill selfserv without any connection to it (usually when we need to start it with different attributes or when we skip some tests) so you can rely on this. I can do improvement to ssl.sh to prevent it but it's not very clean solution. From my point of view, if selfserv doesn't shut down correctly when it's killed, it's a bug.
Assignee | ||
Comment 71•17 years ago
|
||
Slavo,
From my point of view, it's a bug in the script that we start selfserv and just try to kill it right away without making any connection. That needs to be fixed.
Based on my comment 69, can you tell me what's happening at the 122nd start of selfserv ? Are we getting into another function in ssl.sh ? I'm trying to determine why the parent shell forks a child early on to start selfserv, and then at the 122nd run switches over to starting selfserv itself instead. I suspect the shell is running a child to execute some subfunctions. I wonder if there is any way we can predict when that switch happens, so that we can make sure there is no selfserv process that survives when there is a switch over to another shell process.
Assignee | ||
Comment 72•17 years ago
|
||
Since there is no way for me to call getpid() in the shell, I created a program that just calls getppid() and displays the output. I invoked it from various places in ssl.sh - before every call to start_selfserv and kill_selfserv.
On solaris 10, this showed that the shell forks itself many times. I am still trying to isolate the specific conditions under which it does that. This is not always a problem, as long it's the same shell process that's calling start_selfserv and stop_selfserv. Unfortunately, this is not always the case.
If you look at :
http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/security/nss/tests/ssl/ssl.sh&rev=1.91
The specific instance of the first time the problem happens is at lines 657 and 736. The first one calls start_selfserv, the second calls stop_selfserv. But it turns out they are not done in the same shell process. The first time, selfserv is started in the outer while loop. Then, it is stopped in the inner while loop. Meanwhile, the shell has forked. So when we try to stop, the wait fails because it's trying to wait on a cousin process, rather than a child process.
I haven't looked at every condition that causes the Solaris shell to fork. This isn't the only one. But this is the first one that's causing a problem which is detected by attachment 314756 [details] [diff] [review].
I tried the same test script on Linux RHEL4, and the output of my program never changed throughout a successful execution of ssl.sh . Slavo, have we ever seen the PR_Bind problem on Linux RHEL4 ? I see that you said Linux 2.6 x86 earlier. Are you sure it was 2.6 and not 2.4 ?
I will rerun my test on AIX and HPUX to determine if the shell suffers the same forking problem on those platforms, as well as on all the versions of Linux we still have (RHEL2, RHEL3, RHEL4)
There are several ways to go about fixing this :
My preferred solution for this issue would be that we switch to using shells that don't fork themselves in the middle of the script on all Unix platforms. I don't know if it is realistic or possible that we find a non-forking /bin/sh clone for all platforms.
Alternately, is there any other shell or scripting language that includes in its specification the fact that a given shell script must be running within the same shell process ? I would think any scripting language that supports threads would have that requirement. But I'm not sure which shells support threads ;)
If we can't find all the proper compatible shells, another way to go about it would be to write some sort of process manager. We would have to write such a program, and it would be used in place of the built-in background start ("&") and "wait" functions of the shell. I can think of some ways that could be done, but it will need a client and a server working with IPC. ssl.sh would start that server at the very top level and terminate it before it exits. Then the rest would be clients to "start" and "wait". Hopefully, NSPR has sufficient APIs to do this for all platforms. It could work on Windows too even though the problem doesn't exist there, so that we don't have 2 different logics in the shell for startup/waiting.
Comment 73•17 years ago
|
||
Julien, The behavior of POSIX shells with respect to when they fork should
be well defined and absolutely the same in all POSIX shells on all platforms.
The behavior should be identical on all our supported platforms. However, we
know that Solaris's shell does not support all POSIX shell features, and may
not really be a POSIX shell. I'd suggest (as I have done before) that you try
running bash on Solaris. There is a good chance that bash will run the same
on Solaris as it does on other platforms.
As for "no way for me to call getpid() in the shell", IIRC, the value of
the shell expression "$$" is the pid.
echo "My PID is $$"
Reporter | ||
Comment 74•17 years ago
|
||
(In reply to comment #71)
> Based on my comment 69, can you tell me what's happening at the 122nd start of
> selfserv ? Are we getting into another function in ssl.sh ?
From log it looks like it's some of Cache CRL SSL tests. Maybe if I have access to full log I can tell you more, but I don't expect there anything special (maybe selfserv start/stop without any connection).
Reporter | ||
Comment 75•17 years ago
|
||
(In reply to comment #72)
> I tried the same test script on Linux RHEL4, and the output of my program never
> changed throughout a successful execution of ssl.sh . Slavo, have we ever seen
> the PR_Bind problem on Linux RHEL4 ? I see that you said Linux 2.6 x86 earlier.
> Are you sure it was 2.6 and not 2.4 ?
Hmm.. probably it really occured on RHEL4, but I remember that bind error occurred some times also from another reasons (broken reaper script, network problems, there was really some hanged selfserv on machine, ..). I checked results from last months and I found it at 20080127.1 build on nssamdrhel4/Linux2.6 with comment 'looks like there was another selfserv running on the same port' what means it was different case then normally.
All other cases in last 3 months were from Solaris.
Assignee | ||
Comment 76•17 years ago
|
||
Assignee: slavomir.katuscak → julien.pierre.boogz
Status: NEW → ASSIGNED
Attachment #317848 -
Flags: review?
Comment 77•17 years ago
|
||
(In reply to the first paragraph of comment #72)
Please refer to
http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/security/nss/tests/ssl/ssl.sh&rev=1.91
The inner while loop (lines 658-741) has its standard input redirected:
657 start_selfserv
658 while read ectype value sparam cparam testname
...
738 kill_selfserv
...
741 done < ${SSLAUTH_TMP}
Here is a quote from Bruce Blinn, Portable Shell Programming, p. 45:
in most shells, using file redirection with a shell statement
will cause that statement to be executed in a subshell.
In fact, on p. 48 it talks about exactly our problem and gives a
workaround. The simple form of the workaround is:
exec <${SSLAUTH_TMP}
while read ectype value sparam cparam testname
do
...
done
It changes the standard input of the current shell to the file
${SSLAUTH_TMP}. If we want to restore the standard input after
the while loop (so we can receive input from the user), we need
to save fd 0 in fd 3 temporarily, and change it back:
exec 3<&0 <${SSLAUTH_TMP}
while read ectype value sparam cparam testname
do
...
done
exec 0<&3 3<&-
Since our shell scripts don't need to receive input from the
user, I think we can just use the simple form.
Comment 78•17 years ago
|
||
Comment on attachment 317848 [details] [diff] [review]
run ssl.sh and all.sh under bash
This forces the use of one shell on all platforms, including those that do not have bash. I think we want a solution that allows the desired shell to be set individually on each platform.
Some alternatives:
a) set $SHELL to the desired shell and then change the first line of the shell scripts to begin with ":" or "# ", rather than with "#!/bin/sh".
b) change the lines in scripts that invoke other scripts to invoke them with $SHELL explicitly, e.g. instead of a line that reads
all.sh
have a line that reads something like
$SHELL all.sh
or
$SHELL -c all.sh
Assignee | ||
Comment 79•17 years ago
|
||
Nelson,
As far as I can tell, all platforms have bash except for Windows, which ignores the first line of the scripts anyway.
Comment 80•17 years ago
|
||
Why don't you give the exec <${SSLAUTH_TMP} workaround a try?
I found it in two books on shell programming.
Comment 81•17 years ago
|
||
I wonder if the suggested workaround will work on MSYS and MKS.
I would hope so, but ...
Comment 82•17 years ago
|
||
Since Windows has the SetStdHandle function for redirecting
standard input/output/error:
http://msdn.microsoft.com/en-us/library/ms686244(VS.85).aspx
exec < ${file} can be easily implemented.
My experiments showed that exec < ${file} works on Cygwin,
MSYS, and MKS. This is the script I tested:
exec < foo.txt
while read a b
do
echo $a $b
done
Comment 83•17 years ago
|
||
Wan-Teh, what about putting stdin back to the original place after the loop.
Does that work on the windows shells?
Comment 84•17 years ago
|
||
That also works on Cygwin, MKS, and MSYS. This is the
script I tested:
exec 3<&0 <foo.txt
while read a b
do
echo $a $b
done
exec 0<&3 3<&-
read a b
echo $a $b
Are you sure you want to use that? That requires us to
pick a hardcoded fd (fd 3 in my example) to save the
original stdin. The simpler workaround is good enough
for us because we don't read console input from the user.
Comment 85•17 years ago
|
||
Comment on attachment 317848 [details] [diff] [review]
run ssl.sh and all.sh under bash
I was initially skeptical about this patch, but I have been persuaded that it will not be a problem. r=nelson
Attachment #317848 -
Flags: review? → review+
Comment 86•17 years ago
|
||
Comment on attachment 317848 [details] [diff] [review]
run ssl.sh and all.sh under bash
The other related shell scripts under nss/tests,
such as nss/tests/common/init.sh and nss/tests/cert/cert.sh,
should also be changed to say #!/bin/bash for consistency.
Otherwise, future NSS maintainers will be wondering why we
are running only ssl.sh and all.sh under bash.
It is a shame that we give up our portable shell scripts
for a problem that has a well-known solution. (It's also
used in Mozilla's configure script:
http://lxr.mozilla.org/seamonkey/source/configure.in#1362)
I understand why you think using bash is a better solution
though.
Assignee | ||
Comment 87•17 years ago
|
||
I checked in attachment 317848 [details] [diff] [review] to use bash in both of those scripts. Hopefully the tinderbox and nightlies will be OK.
Checking in all.sh;
/cvsroot/mozilla/security/nss/tests/all.sh,v <-- all.sh
new revision: 1.48; previous revision: 1.47
done
Checking in ssl/ssl.sh;
/cvsroot/mozilla/security/nss/tests/ssl/ssl.sh,v <-- ssl.sh
new revision: 1.92; previous revision: 1.91
done
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Target Milestone: --- → 3.12.1
Comment 88•17 years ago
|
||
The Win64 tinderbox has been broken ever since this checkin.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 89•17 years ago
|
||
Relevant excerpt from the log file:
######## Running NSS test ########
######## Running 64 bit NSS test in DBG mode ########
NSS_ENABLE_ECC=1; export NSS_ENABLE_ECC;
NSS_ECC_MORE_THAN_SUITE_B=1; export NSS_ECC_MORE_THAN_SUITE_B;
IOPR_HOSTADDR_LIST=dochinups.red.iplanet.com; export IOPR_HOSTADDR_LIST;
NSPR_LOG_MODULES="pkix:1"; export NSPR_LOG_MODULES;
NSS_TEST_DISABLE_STANDARD=1; export NSS_TEST_DISABLE_STANDARD;
NSS_TEST_DISABLE_UPGRADE_DB=1; export NSS_TEST_DISABLE_UPGRADE_DB;
NSS_TEST_DISABLE_SHARED_DB=1; export NSS_TEST_DISABLE_SHARED_DB;
USE_64=1; export USE_64;
cd /export/tinderbox/MINGW32_NT-5.2_1.0.10/mozilla/security/nss/tests;
./all.sh
sh: ./all.sh: No such file or directory
######## Debug 32 bit NSS test BUSTED ########
Could not run all.sh in 32 bit DBG mode
########
Julien is out today, so I will back out this patch in an effort to restore
the trunk tinderbox to green.
Assignee | ||
Comment 90•17 years ago
|
||
This is an environment issue on the Win64 tinderbox machine.
My copy of MSYS contains a bash.exe which is identical to sh.exe . In other words, MSYS should already have been using bash, and there should be no problem with that platform. If the tinderbox doesn't have this file, then it's probably incomplete.
I don't use file associations on my Win64 system. Since CMD.EXE is my shell of choice, I have MSYS in my PATH, and I start all.sh by typing "bash all.sh". Previously I typed "sh all.sh". all.sh and ssl.sh run just fine with attachment 317848 [details] [diff] [review] .
If I start bash.exe manually, and then type ./all.sh it also starts correctly even though I have no associations setup.
I think the problem may be related to the tinderbox script and/or the shell the tinderbox script is running under.
Comment 91•17 years ago
|
||
Our nightly build machine doesn't have bash installed with MSYS either. Your machine must have a special configuration. You need to clearly document this extra requirement somewhere.
Comment 92•17 years ago
|
||
I suggest we explore Wan-Teh's alternative approach.
Assignee | ||
Comment 93•17 years ago
|
||
Christophe,
I don't recall where I obtained my MSYS distribution originally, I will try to find that out. But it definitely had a bash.exe . I would say the distribution you have is incomplete. This can very easily be worked around by copying sh.exe to bash.exe . Please do that on the tinderbox machine and see if it works.
Nelson,
I think there were other reasons why we wanted a POSIX shell. It would be better to get this resolved, especially as there is only one platform that is a problem and bash is available for it.
Comment 94•17 years ago
|
||
This method prevents the creation of a subshell for the
while loop.
Assignee | ||
Comment 95•17 years ago
|
||
Comment on attachment 320405 [details] [diff] [review]
Use exec to redirect standard input before entering a while loop (minimal patch)
I am not sure about this patch. it with the Solaris shell. I added some logging in selfserv to log the parent pid.
My output log showed that the parent pid was still not always the same for all selfserv processes.
There was a total of 14 distinct selfserv parent pids for my run of ssl.sh, out of 514 total selfserv processes.
I ran ssl.sh again without the patch. There were 23 different parents in that case.
So it seems that the patch eliminates one case of forking with the Solaris shell, but not all. The patch might be OK if it's always the same shell process that starts the server and sends it a signal / wait for it. That would require more analysis. I would much prefer if we eliminated the forking altogether. And we already have one way to do that in attachment 317848 [details] [diff] [review] .
Assignee | ||
Comment 96•17 years ago
|
||
Christophe, Slavo,
FYI, this is the location where I obtained my MSYS distribution http://developer.mozilla.org/en/docs/Windows_Build_Prerequisites#MozillaBuild . Previously, I had downloaded mozillabuildsetup-1.1.exe which installs bash.exe under MSYS. I verified that the latest mozillabuildsetup-1.2.exe also has bash.exe . So if our systems don't have bash.exe, they are simply running incomplete MSYS. The machines should be fixed. This problem should not prevent us from checking in attachment 317848 [details] [diff] [review] .
Looking at our internal wiki, it looks like Christophe obtained MSYS separately and not from mozilla-build. I would suggest switching to mozilla-build, since that one package also includes zip, unzip and nsinstall, which are also needed by the NSS build.
Comment 97•17 years ago
|
||
Comment on attachment 320405 [details] [diff] [review]
Use exec to redirect standard input before entering a while loop (minimal patch)
Julien, sorry that I wasn't clear. This patch is the absolute miminum changes
required to fix this bug. It leaves unchanged the while loops that don't need
changing. The unchanged while loops look like this:
while read x y z
do
...
start_selfserv
...
kill_selfserv
done < ${XXX}
These while loops start and wait for selfserv processes from the same process.
Whether that process is the original shell or a subshell doesn't matter.
I now realize that it is a lot of work to verify that this patch is correct, so
I'm going to attach another patch that changes all the while loops.
Attachment #320405 -
Attachment description: Use exec to redirect standard input before entering a while loop → Use exec to redirect standard input before entering a while loop (minimal patch)
Comment 98•17 years ago
|
||
I attach these patches for reference only. You can ignore them.
Assignee | ||
Comment 99•17 years ago
|
||
Comment on attachment 320818 [details] [diff] [review]
Use exec to redirect standard input before entering a while loop (change all while loops) (checked in)
Thanks, Wan-Teh. I verified that this patch causes all selfserv processes to have the same parent.
I don't think this patch is exclusive with switching to bash, which we have wanted to do to get a POSIX shell for other reasons.
Attachment #320818 -
Flags: review+
Comment 100•17 years ago
|
||
Julien, In comment 95 you wrote:
> I am not sure about this patch. it with the Solaris shell.
What's the missing word here -------^ ?
Assignee | ||
Comment 101•17 years ago
|
||
There were 2 missing words. "I tried"
Assignee | ||
Comment 102•17 years ago
|
||
Christophe informed me that our Win64 machines now all have bash.exe . I checked in both attachment 317848 [details] [diff] [review] and attachment 320818 [details] [diff] [review] to the trunk .
Checking in all.sh;
/cvsroot/mozilla/security/nss/tests/all.sh,v <-- all.sh
new revision: 1.50; previous revision: 1.49
done
Checking in ssl/ssl.sh;
/cvsroot/mozilla/security/nss/tests/ssl/ssl.sh,v <-- ssl.sh
new revision: 1.94; previous revision: 1.93
done
Status: REOPENED → RESOLVED
Closed: 17 years ago → 17 years ago
Resolution: --- → FIXED
Comment 103•17 years ago
|
||
Julien, thank you for testing and checking in my patch.
I agree that my patch and your patch are not mutually
exclusive.
I second your recommendation of MozillaBuild. It is
one-stop shopping for setting up the Windows build
environment after installing the compiler.
Updated•16 years ago
|
Attachment #320818 -
Attachment description: Use exec to redirect standard input before entering a while loop (change all while loops) → Use exec to redirect standard input before entering a while loop (change all while loops) (checked in)
Attachment #320818 -
Flags: superreview+
Comment 104•16 years ago
|
||
Comment on attachment 320818 [details] [diff] [review]
Use exec to redirect standard input before entering a while loop (change all while loops) (checked in)
On branch:
ssl.sh; new revision: 1.61.2.16; previous revision: 1.61.2.15
You need to log in
before you can comment on or make changes to this bug.
Description
•