Closed Bug 119340 Opened 23 years ago Closed 23 years ago

increased "selfserv process not detectable" errors on linux

Categories

(NSS :: Tools, defect, P2)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sonja.mirtitsch, Assigned: wtc)

Details

Attachments

(5 files, 1 obsolete file)

huey (daily QA) and louie (tinderbox) have shown several times that selfserv did not come up in time. On 11/27 I reduced the time to wait after a selfserv has been killed to 3 seconds; this could mean that the socket is not yet free, so that the next selfserv has problems starting up. I have increased the time to 5 seconds again, but we had been running trouble-free for a month with 3 seconds before. I will watch this for a few days; it might not be a bug in the tests (or network speed), but in the tools.
Since it also shows up when the first selfserv is being started, I assume it has nothing to do with the 3 seconds being too little of a delay.
What the QA script does:
  start selfserv
  start tstclnt -q   # exits with returncode 0, meaning it has made a connection
  reads the server process id file
  greps through a ps to find out if the server is running
  gives error message "ssl.sh: Exit: 10 Fatal - selfserv process not detectable"
  during Exit function, shell message: "/share/builds/mccrel/nss/nsstip/builds/20020110.1/booboo_Solaris8/mozilla/security/nss/tests/all.sh: line -57: 27172 Terminated selfserv -D -p ${PORT} -d ${R_SERVERDIR} -n ${HOSTADDR} -w nss ${sparam} -i ${R_SERVERPID} $verbose (wd: /share/builds/mccrel/nss/nsstip/builds/20020110.1/booboo_Solaris8/mozilla/tests_results/security/louie.7/client)"
ssl.sh: SSL tests ===============================
ssl.sh: SSL Cipher Coverage ===============================
selfserv -D -p 8443 -d ../server -n louie.red.iplanet.com \
         -w nss -c ABCDEFabcdefghijklmnvy -i ../tests_pid.25206 &
selfserv started at Thu Jan 10 15:28:22 PST 2002
tstclnt -p 8443 -h louie -q -d . < /share/builds/mccrel/nss/nsstip/builds/20020110.1/booboo_Solaris8/mozilla/security/nss/tests/ssl/sslreq.txt
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
/share/builds/mccrel/nss/nsstip/builds/20020110.1/booboo_Solaris8/mozilla/security/nss/tests/all.sh: line -57: 27172 Terminated selfserv -D -p ${PORT} -d ${R_SERVERDIR} -n ${HOSTADDR} -w nss ${sparam} -i ${R_SERVERPID} $verbose (wd: /share/builds/mccrel/nss/nsstip/builds/20020110.1/booboo_Solaris8/mozilla/tests_results/security/louie.7/client)
Assignee: sonja.mirtitsch → wtc
Component: Test → Tools
Priority: -- → P2
Target Milestone: --- → 3.4
Blocks: 119403
rerunning QA on all available linux machines
On the louie tinderbox this problem looks slightly different:
selfserv -D -p 8444 -d ../ext_server -n louie.red.iplanet.com \
         -w nss -r -i ../tests_pid.13877 &
selfserv started at Fri Jan 11 01:45:01 PST 2002
tstclnt -p 8444 -h louie -q -d . < /export/nss_tbx_linux2.4_7.2/builds/tinderbox/Linux-2.4_7.2/mozilla/security/nss/tests/ssl/sslreq.txt
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
ssl.sh: SSL Stress Test Extended test ===============================
ssl.sh: skipping Stress SSL2 RC4 128 with MD5 for Extended test
ssl.sh: Stress SSL3 RC4 128 with MD5 ----
selfserv -D -p 8444 -d ../ext_server -n louie.red.iplanet.com \
         -w nss -i ../tests_pid.13877 &
selfserv started at Fri Jan 11 01:45:02 PST 2002
tstclnt -p 8444 -h louie -q -d . < /export/nss_tbx_linux2.4_7.2/builds/tinderbox/Linux-2.4_7.2/mozilla/security/nss/tests/ssl/sslreq.txt
selfserv: PR_Bind returned error -5982: Local Network address is in use.
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
/export/nss_tbx_linux2.4_7.2/builds/tinderbox/Linux-2.4_7.2/mozilla/security/nss/tests/all.sh: kill: (19528) - No such pid
All of the machines that show the failure at the moment run tinderboxes and regular QA. I will halt the tinderboxes, keep running regular QA, and see if that stops the problem.
I ran the modified tinderboxes (no QA), which increased the number of failures again.
before 13:20: 12 failures on huey and dewey in 14 runs each, no failure on louie on nightly QA, 4 failures on tinderbox QA on huey, dewey and louie
13:20: turned off tinderboxes
14:05: QA failure
15:35: turned on modified tinderboxes (no QA)
15:35-17:45: 4 failures in 6 runs on dewey, 1 failure in 4 runs on huey, no failures in 4 runs on louie
Forgot one more thing: this did not show up a single time in the backward compatibility tests. I will upload the results once more to ftp://ftp.mozilla.org/pub/security/nss/daily_qa/20020111.1/result.html
I have actually seen this once now on my local machine (RH 7.1 dual processor). I've probably run the SSL QA ~20-30 times in the last few days, but I did see the failure once. I looked at changes to selfserv since 3.3.2 and nothing stuck out to me. That the backward compatibility tests never show the problem would indicate that the problem is in selfserv. Are we using a newer version of NSPR?
I did see it today on box in the backward compatibility tests as well. The easiest way I found to reproduce it is to compile at the same time.
Aha! Quite true. The one time I saw it, I was also compiling (QA was running debug, I was building optimized).
No longer blocks: 119403
This comment is from Bishakha: I was able to reproduce this error at the Netscape end by running two tests in very rapid succession. I have a script that iterates through the tests 30 times, and the only times I see the error are when two tests are running near simultaneously, with a second or so of time difference. I did this on "turgur", which is the fastest of all the linux machines I ran the tests on. If you view the times on these tests, you will see that sometimes the two QA runs coincide at some point (i.e. a particular test suite does not coincide in time exactly with the same suite of the next QA run - there should be a second or more of difference) , and it seems to me that coming up with this error is obvious, as one test waits on the other preceding it to free up the port. The tests were run on the 20020114.1 build. -Bishakha
I looked at this problem last night. As I noted in another bug, the particular selfserv failure below, which prevents it from starting up, has nothing to do with crypto or SSL:
selfserv: PR_Bind returned error -5982: Local Network address is in use.
We should do two things.
1. Find out if all the "selfserv process not detectable" errors are preceded by the above PR_Bind failure.
2. I will see if there is a more reliable way to kill a multithreaded process on Linux.
There are two longer-term solutions.
1. Instead of killing selfserv, we should attempt to send it a "stop" command to gracefully shut it down. I tried to do this last night but could not make it work.
2. Have selfserv bind to any unused TCP port (as opposed to a fixed TCP port) and write the port number it binds to to a file. This solution has the additional benefit that we will be able to run several QA sessions on the same machine at the same time.
In the meantime, we should increase the sleep interval after killing selfserv to work around this problem on Linux. If we suspect performance degradation issues, we should rely on Kirk's performance tests.
Status: NEW → ASSIGNED
Wan-Teh, FYI, this has been a constant problem on Linux for web server in the automated tests. The tests would not keep running in tinderbox because some server "threads" (really some left-over processes) would eventually stay even though the other threads were gone. I always attributed it to Linux bugs. FYI, here is what the web server stop script that ships with web server 6 does. See the nice comments about Linux I put in there.
#!/bin/sh
SERVER_ROOT=/export/home/jpierre/60opt
PRODUCT_NAME=https
INSTANCE_NAME=https-admserv
PID_FILE=/export/home/jpierre/60opt/https-admserv/logs/pid
LD_LIBRARY_PATH=${SERVER_ROOT}/bin/${PRODUCT_NAME}/lib:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH
LIBPATH=${LIBPATH}:${LD_LIBRARY_PATH}:/usr/lib/threads:/usr/ibmcxx/lib:/usr/lib:/lib; export LIBPATH
SHLIB_PATH=${SHLIB_PATH}:${LD_LIBRARY_PATH}; export SHLIB_PATH
NS_SERVER_HOME=${SERVER_ROOT}; export NS_SERVER_HOME
NSES_SERVER_HOME=${SERVER_ROOT}; export NSES_SERVER_HOME
NS_HTTPS_HOME=${NS_SERVER_HOME}/bin/https; export NS_HTTPS_HOME
PATH=${SERVER_ROOT}/bin/${PRODUCT_NAME}/bin:${PATH}; export PATH
if test -x "$SERVER_ROOT/bin/$PRODUCT_NAME/httpadmin/bin/shutdown" ; then
    # Send watchdog a message instructing it to terminate
    "$SERVER_ROOT/bin/$PRODUCT_NAME/httpadmin/bin/shutdown" $INSTANCE_NAME
else
    # Kill the watchdog with SIGTERM (unreliable with some JVMs)
    if test -f $PID_FILE ; then
        if [ `uname` != "Linux" ]
        then
            kill -TERM `cat $PID_FILE`
            if test $? -ne 0 ; then
                exit 1
            fi
        else
            # on Linux we send to the group to work around a glibc bug
            kill -TERM -`cat $PID_FILE`
            if test $? -ne 0 ; then
                exit 1
            fi
        fi
    else
        echo server not running
        exit 1
    fi
fi
loop_counter=1
max_count=30
while test $loop_counter -le $max_count; do
    loop_counter=`expr $loop_counter + 1`
    if test -f $PID_FILE ; then
        sleep 2
    else
        exit 0
    fi
done
echo server not responding to exit command
echo killing process group
kill -9 -`cat $PID_FILE`
rm $PID_FILE
exit 1
I do not think that this is related to killing a multithreaded process on Linux, because on 1/10 on louie this problem was seen at the first selfserv start: ftp://ftp.mozilla.org/pub/security/nss/daily_qa/20020110.1/louie.7/results.html
On Linux, each thread is actually a process, with its own pid. This is evident in how 'selfserv' is displayed by the 'ps -eaf' command.
wtc  1765  1757  1 13:56 pts/2  00:00:00 selfserv -D -p 8443 -d ../server
wtc  1770  1765  0 13:56 pts/2  00:00:00 selfserv -D -p 8443 -d ../server
wtc  1771  1770  0 13:56 pts/2  00:00:00 selfserv -D -p 8443 -d ../server
wtc  1772  1770  0 13:56 pts/2  00:00:00 selfserv -D -p 8443 -d ../server
wtc  1773  1770  0 13:56 pts/2  00:00:00 selfserv -D -p 8443 -d ../server
wtc  1774  1770  0 13:56 pts/2  00:00:00 selfserv -D -p 8443 -d ../server
wtc  1775  1770  0 13:56 pts/2  00:00:00 selfserv -D -p 8443 -d ../server
wtc  1776  1770  0 13:56 pts/2  00:00:00 selfserv -D -p 8443 -d ../server
wtc  1777  1770  0 13:56 pts/2  00:00:00 selfserv -D -p 8443 -d ../server
wtc  1778  1770  0 13:56 pts/2  00:00:00 selfserv -D -p 8443 -d ../server
Pid 1757 is the test script (ssl.sh). Pid 1765 is the primary thread in selfserv. Pid 1770 is the thread manager, created by the pthread library. Pids 1771-1778 are the threads created by selfserv.
If we kill the primary thread (pid 1765) and wait for its termination, my theory is that the other threads (pids 1770-1778) may still be running when the primary thread has terminated. This is because the thread manager needs to first notice that the primary thread is gone and then terminate all the other threads. All of this can only happen after the primary thread is dead.
Knowing how threads are implemented on Linux, I came up with an inelegant but more reliable way to kill selfserv and wait for the termination of all its threads. I kill an arbitrary thread created by selfserv, say the first one, with pid 1771. My theory is that the following actions will be taken by the pthread library.
1. The thread manager, the parent of pid 1771, receives a SIGCHLD signal, indicating the death of pid 1771.
2. The thread manager detects that something is wrong and proceeds to kill all the other threads (pids 1772-1778).
3. After reaping all of its dead children, the thread manager itself terminates.
4. The primary thread, the parent of the thread manager, receives a SIGCHLD, indicating the death of the thread manager.
5. The primary thread detects something is wrong and terminates too.
My theory is that with this method, the primary thread is the last one to go. Therefore, if we wait for its termination, we are sure that all the other threads have already terminated.
My patch implements this method for Linux. On non-Linux platforms, there is no change. On Linux, we kill the first thread created by selfserv and wait for the termination of the primary thread. I modified selfserv.c so that the first thread created by selfserv writes its pid to a file. With this method, we can do without the 'sleep 5' command, which we used on Linux to wait for the full termination of selfserv.
I've tested my patch a few times on Linux. A code review and more testing are needed.
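The general idea of replacing a fixed sleep with an explicit wait can be sketched in shell as follows. This is only a minimal illustration, not the attached patch: the function name stop_and_wait is hypothetical, a sleep process stands in for selfserv, and the sketch assumes the server was started as a background job of the same shell so that the wait builtin applies. It does not reproduce the Linux-specific step of killing a secondary thread.

```shell
#!/bin/sh
# Hypothetical helper: terminate a background job and block until the shell
# has reaped it, instead of sleeping for a fixed number of seconds.
stop_and_wait() {
    pid=$1
    kill -TERM "$pid" 2>/dev/null || return 0   # already gone, nothing to do
    # wait reports the job's exit status (non-zero when killed by a signal);
    # only the fact that it has returned matters here.
    wait "$pid" 2>/dev/null || :
    return 0
}

# Stand-in for "selfserv ... &" in the test script.
sleep 30 &
SERVER_PID=$!

stop_and_wait "$SERVER_PID"

if kill -0 "$SERVER_PID" 2>/dev/null; then
    echo "server still running"
else
    echo "server terminated"
fi
```

Note that wait only covers the one process the shell started. Under the threading model described above, the other selfserv threads are separate processes that the shell cannot wait for directly, which is why the patch changes which pid gets killed.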
To avoid confusion in the discussion of this bug, I'd like to define two terms for the two types of errors that have been described in this bug report.
Type I: a "selfserv process not detectable" error that is preceded by a PR_Bind failure:
selfserv: PR_Bind returned error -5982: Local Network address is in use.
Type II: a "selfserv process not detectable" error that is NOT preceded by a PR_Bind failure.
All the "selfserv process not detectable" errors that Bishakha and I have reproduced are Type I. (In fact, one can easily cause Type I errors to occur by deleting the SLEEP command in mozilla/security/nss/tests/ssl/ssl.sh.) In comment #13, I presented a plausible theory that explains Type I errors and proposed a solution (attachment 65104).
In comment #12, Sonja gave an example of Type II errors. I examined the output.log file and found that there was no error message from the selfserv process. Bishakha and I haven't been able to reproduce Type II errors. Since I can't reproduce Type II errors and don't have any error message from selfserv, I don't know how to debug them.
I examined selfserv.c and tstclnt.c. As far as I can tell, selfserv and tstclnt always print an error message if they exit because of an error. Since there was no error message from selfserv or tstclnt in the Type II error described in comment #1, I only have one explanation: selfserv did a normal termination prematurely. To verify this theory, I added a printf statement at the end of the main() function in selfserv.c:
      NSS_Shutdown();
      PR_Cleanup();
+     printf("selfserv: normal termination\n");
      return 0;
  }
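The two definitions above lend themselves to a quick scan of an output.log. The following is a rough sketch, not part of the QA scripts: the function name classify_errors is hypothetical, and it assumes that a PR_Bind message, when there is one, appears in the log between the "selfserv started at" line and the "not detectable" line of the same attempt.

```shell
#!/bin/sh
# Hypothetical helper: print "Type I" or "Type II" for every
# "selfserv process not detectable" error found in the given log file.
classify_errors() {
    awk '
        /selfserv started at/    { bind_failed = 0 }   # a new start attempt
        /PR_Bind returned error/ { bind_failed = 1 }   # bind failure seen
        /selfserv process not detectable/ {
            if (bind_failed) print "Type I"; else print "Type II"
            bind_failed = 0
        }
    ' "$1"
}

# Demo on a shortened log in the style of the excerpts in this report.
DEMO_LOG=${TMPDIR:-/tmp}/classify_demo.$$
cat > "$DEMO_LOG" <<'EOF'
selfserv started at Wed Jan 30 04:05:58 PST 2002
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
selfserv started at Wed Jan 30 04:05:59 PST 2002
selfserv: PR_Bind returned error -5982: Local Network address is in use.
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
EOF
classify_errors "$DEMO_LOG"
rm -f "$DEMO_LOG"
```

If the server and the script write to different streams, the lines may not land in the log in the order the events happened, so such a scan can only give a first approximation.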
Alternatively, could the process exit without error because it was killed? Perhaps some other user, or another script running simultaneously kills it?
Julien, I tried your solution (killing the process group as opposed to the process) on selfserv. It doesn't work. The command I issued, after variable and command substitution, was:
kill -TERM -31459
The error message was:
./ssl.sh: kill: (-31459) - No such pid
I take it that it is the web server, not the Linux pthread library, that creates a process group?
Wan-Teh, That's correct, the web server - or the watchdog, in case the server is started by the watchdog - creates a process group, and writes its PID to the file. Then we can kill it with -group and it's supposed to kill the process, all the threads, as well as all the children that it spawned - unless they defined their own process groups . This is supposed to kill for example looping CGIs that are part of the web server test suite.
Thanks, Julien. Anyone care to review or test my patch? :-)
I reviewed the script changes and they look fine to me.
Comment on attachment 65104
An inelegant but more reliable way to wait for the termination of selfserv on Linux
I checked in this patch. This patch should eliminate what I call the Type I errors and speed up our QA on Linux. Note that this patch does not address the Type II errors, which I still don't have a handle for.
Comment on attachment 65104
An inelegant but more reliable way to wait for the termination of selfserv on Linux
I backed out this patch. Ever since I checked in this patch, we are seeing the "selfserv process not detectable" errors on Linux in our daily QA reports. A common failure sequence is reproduced here (running Linux2.2_x86_glibc_PTH_OPT.OBJ on Red Hat Linux 6.2):
===================================
ssl.sh: SSL3 Request don't require client auth on 2nd hs (client auth) produced a returncode of 0, expected is 0 PASSED
ssl.sh: SSL3 Require client auth on 2nd hs (client does not provide auth) ----
selfserv -D -p 8443 -d ../ext_server -n dewey.red.iplanet.com \
         -w nss -r -r -r -r -i ../tests_pid.8947 &
selfserv started at Wed Jan 30 04:05:58 PST 2002
tstclnt -p 8443 -h dewey -q -d . < /share/builds/mccrel/nss/nsstip/builds/20020130.1/booboo_Solaris8/mozilla/security/nss/tests/ssl/sslreq.txt
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
ssl.sh: SSL Stress Test Extended test ===============================
ssl.sh: skipping Stress SSL2 RC4 128 with MD5 for Extended test
ssl.sh: Stress SSL3 RC4 128 with MD5 ----
selfserv -D -p 8443 -d ../ext_server -n dewey.red.iplanet.com \
         -w nss -i ../tests_pid.8947 &
selfserv started at Wed Jan 30 04:05:59 PST 2002
tstclnt -p 8443 -h dewey -q -d . < /share/builds/mccrel/nss/nsstip/builds/20020130.1/booboo_Solaris8/mozilla/security/nss/tests/ssl/sslreq.txt
selfserv: PR_Bind returned error -5982: Local Network address is in use.
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
kill: (22329) - No such pid
===================================
We would first have a Type II error, followed by one or two Type I errors. Those Type I errors suggest that there were still some selfserv threads (which are really processes) hanging around from the initial Type II error. So it is safe to assume that the subsequent Type I errors were caused by the initial Type II errors.
I still have no clue why we had the initial Type II errors in the first place, and this is difficult for me to debug. selfserv just disappears without leaving a trace -- no error or normal termination messages to stderr or stdout, and no core files.
At this point, with the amount of time we've put into this, wouldn't it just be easier to start the selfserv's on different ports and let the OS kill the processes when the shell exits? Instead of writing the PID out to a file and then trying to kill the process, selfserv could find an available port, write the port number out to a file, and then tstclnt would pick up the port on which to connect. This way, we wouldn't have to worry about trying to kill the process within the script. Additionally, we wouldn't have any trouble with running the QA concurrently on the same machine. Yes/No? Am I missing something?
Ian, Good point. Sometimes a bystander can see things more clearly. :-) I first proposed this in bug 58176 and then again in comment #10. I totally forgot about this solution in the heat of debugging. So I suggest that we do that. That will make the Type I errors (with a PR_Bind error message) impossible. Then we'll see if we are still seeing the mysterious Type II errors.
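On the script side, the port-file idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: selfserv as discussed here has no option to pick a free port or write a port file, the helper name read_port_file and the file name are made up, and a subshell that writes a fixed number stands in for the server.

```shell
#!/bin/sh
# Hypothetical helper: wait up to $2 seconds for a server to publish its
# port number in file $1, then print that port on stdout.
read_port_file() {
    port_file=$1
    tries=$2
    while [ "$tries" -gt 0 ]; do
        if [ -s "$port_file" ]; then       # file exists and is non-empty
            cat "$port_file"
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1                               # the server never reported a port
}

# Demo: a stand-in "server" publishes a made-up port after a short delay.
PORT_FILE=${TMPDIR:-/tmp}/tests_port.$$
( sleep 1; echo 40123 > "$PORT_FILE" ) &

if PORT=$(read_port_file "$PORT_FILE" 10); then
    echo "would run: tstclnt -p $PORT ..."
else
    echo "no port file appeared"
fi
rm -f "$PORT_FILE"
```

Because each QA run would read its own port file, two runs on the same machine would no longer compete for one fixed port.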
At this point in time I use 3 different ports in standard QA: one for regular QA, one for the 32 bit tinderbox and one for the 64 bit tinderbox. It is not impossible to change, but it still requires a little bit of work, since the scripts will run into conflicts if we just keep incrementing.
Here is a case where multiple test clients connected to the same selfserv during the cipher coverage test. After 6 successful tstclnt connections to the ***same*** selfserv, the selfserv process goes away without any error message. I would ask for a signal handler in selfserv to see if some old script of mine accidentally gets activated and kills the wrong process.
ssl.sh: SSL tests ===============================
ssl.sh: SSL Cipher Coverage ===============================
selfserv -D -p 8444 -d ../server -n dewey.red.iplanet.com \
         -w nss -c ABCDEFabcdefghijklmnvy -i ../tests_pid.27150 &
selfserv started at Thu Jan 31 04:32:24 PST 2002
tstclnt -p 8444 -h dewey -q -d . < /export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
ssl.sh: running SSL2 RC4 128 WITH MD5 ----------------------------
tstclnt -p 8444 -h dewey -c A -T \
        -f -d . < /export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
subject DN: CN=dewey.red.iplanet.com, O=BOGUS Netscape, L=Mountain View, ST=California, C=US
issuer DN: CN=NSS Test CA, O=BOGUS NSS, L=Mountain View, ST=California, C=US
0 cache hits; 0 cache misses, 0 cache not reusable
HTTP/1.0 200 OK
Server: Generic Web Server
Date: Tue, 26 Aug 1997 22:10:05 GMT
Content-type: text/plain
Discarded 1 characters.
GET / HTTP/1.0
EOF
ssl.sh: running SSL2 RC4 128 EXPORT40 WITH MD5 ----------------------------
tstclnt -p 8444 -h dewey -c B \
        -f -d . < /export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
subject DN: CN=dewey.red.iplanet.com, O=BOGUS Netscape, L=Mountain View, ST=California, C=US
issuer DN: CN=NSS Test CA, O=BOGUS NSS, L=Mountain View, ST=California, C=US
0 cache hits; 0 cache misses, 0 cache not reusable
HTTP/1.0 200 OK
Server: Generic Web Server
Date: Tue, 26 Aug 1997 22:10:05 GMT
Content-type: text/plain
Discarded 1 characters.
GET / HTTP/1.0
EOF
ssl.sh: running SSL2 RC2 128 CBC WITH MD5 ----------------------------
tstclnt -p 8444 -h dewey -c C \
        -f -d . < /export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
subject DN: CN=dewey.red.iplanet.com, O=BOGUS Netscape, L=Mountain View, ST=California, C=US
issuer DN: CN=NSS Test CA, O=BOGUS NSS, L=Mountain View, ST=California, C=US
0 cache hits; 0 cache misses, 0 cache not reusable
HTTP/1.0 200 OK
Server: Generic Web Server
Date: Tue, 26 Aug 1997 22:10:05 GMT
Content-type: text/plain
Discarded 1 characters.
GET / HTTP/1.0
EOF
ssl.sh: running SSL2 RC2 128 CBC EXPORT40 WITH MD5 ----------------------------
tstclnt -p 8444 -h dewey -c D -T \
        -f -d . < /export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
subject DN: CN=dewey.red.iplanet.com, O=BOGUS Netscape, L=Mountain View, ST=California, C=US
issuer DN: CN=NSS Test CA, O=BOGUS NSS, L=Mountain View, ST=California, C=US
0 cache hits; 0 cache misses, 0 cache not reusable
HTTP/1.0 200 OK
Server: Generic Web Server
Date: Tue, 26 Aug 1997 22:10:05 GMT
Content-type: text/plain
Discarded 1 characters.
GET / HTTP/1.0
EOF
ssl.sh: running SSL2 DES 64 CBC WITH MD5 ----------------------------
tstclnt -p 8444 -h dewey -c E \
        -f -d . < /export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
subject DN: CN=dewey.red.iplanet.com, O=BOGUS Netscape, L=Mountain View, ST=California, C=US
issuer DN: CN=NSS Test CA, O=BOGUS NSS, L=Mountain View, ST=California, C=US
0 cache hits; 0 cache misses, 0 cache not reusable
HTTP/1.0 200 OK
Server: Generic Web Server
Date: Tue, 26 Aug 1997 22:10:05 GMT
Content-type: text/plain
Discarded 1 characters.
GET / HTTP/1.0
EOF
ssl.sh: running SSL2 DES 192 EDE3 CBC WITH MD5 ----------------------------
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
cat: /share/builds/mccrel/nss/nsstip/tinderbox/tests_results/security/dewey-20020131-04.02/dewey.2/tests_pid.27150: No such file or directory
cat: /share/builds/mccrel/nss/nsstip/tinderbox/tests_results/security/dewey-20020131-04.02/dewey.2/tests_pid.27150: No such file or directory
rm: cannot remove `/share/builds/mccrel/nss/nsstip/tinderbox/tests_results/security/dewey-20020131-04.02/dewey.2/tests_pid.27150': No such file or directory
ssl.sh: SSL Client Authentication ===============================
Attached file full output.log
Sonja, Does your script always use SIGTERM (the default) to kill selfserv? I assume I just need to add a signal handler for SIGTERM?
It first tries a SIGTERM, waits a second and then does a SIGKILL (-9). I tried to disable it in standard and tinderbox QA a long time ago, and I'll read my script again to see if I have forgotten something. The scripts that do it are nssqa (function kill_by_name in the header) and mainly qaclean, which is started as qaclean [hostname], does an rsh to the host and kills selfserv there.
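For readers unfamiliar with the pattern, the SIGTERM, wait, SIGKILL sequence described above might look like the following sketch. It is not the kill_by_name function from nssqa, whose source is not shown in this report; the name kill_with_escalation and the stand-in process are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper mirroring the described sequence: SIGTERM first,
# a one-second grace period, then SIGKILL if the process is still there.
kill_with_escalation() {
    pid=$1
    kill -TERM "$pid" 2>/dev/null || return 0    # no such process, done
    sleep 1
    if kill -0 "$pid" 2>/dev/null; then          # still present after 1s
        kill -KILL "$pid" 2>/dev/null || :
    fi
    return 0
}

# Stand-in process that ignores SIGTERM, so the SIGKILL path is exercised.
sh -c 'trap "" TERM; exec sleep 30' &
VICTIM=$!
sleep 1                     # give the stand-in time to set up its trap

kill_with_escalation "$VICTIM"

status=0
wait "$VICTIM" 2>/dev/null || status=$?
echo "stand-in exited with status $status"
```

A process killed by SIGKILL cannot run a handler, so with this sequence a server only gets the chance to log anything during the one-second window after SIGTERM.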
I checked in this patch.
I changed the summary, since that bug has been seen on any type of Linux by now. I noticed that it has decreased over the last couple of days; this is the first time that I notice it since the signal handler has been added:
tinderbox on dewey:
dewey.1 10 Fatal - selfserv process not detectable Failed
dewey.1 10 Fatal - selfserv process not detectable Failed
dewey.2 10 Fatal - selfserv process not detectable Failed
dewey.1/output.log:tstclnt -p 8444 -h dewey -q -d . < /export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
dewey.1/output.log:ssl.sh: Exit: 10 Fatal - selfserv process not detectable
dewey.1/output.log:selfserv: received SIGTERM
dewey.1/output.log:ssl.sh: SSL Stress Test Extended test ===============================
dewey.1/output.log:ssl.sh: skipping Stress SSL2 RC4 128 with MD5 for Extended test
dewey.1/output.log:ssl.sh: Stress SSL3 RC4 128 with MD5 ----
dewey.1/output.log:selfserv -D -p 8444 -d ../ext_server -n dewey.red.iplanet.com \
dewey.1/output.log:         -w nss -i ../tests_pid.26331 &
dewey.1/output.log:selfserv started at Tue Feb 5 11:26:33 PST 2002
dewey.1/output.log:tstclnt -p 8444 -h dewey -q -d . < /export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
dewey.1/output.log:selfserv: PR_Bind returned error -5982:
dewey.1/output.log:Local Network address is in use.
dewey.1/output.log:ssl.sh: Exit: 10 Fatal - selfserv process not detectable
dewey.1/output.log:kill: (669) - No such pid
dewey.1/output.log-issuer DN: CN=NSS Chain2 Server Test CA, O=BOGUS NSS, L=Santa Clara, ST=California, C=US
dewey.1/output.log-0 cache hits; 1 cache misses, 0 cache not reusable
dewey.1/output.log-tstclnt: read from socket failed: TCP connection reset by peer.
dewey.1/output.log-ssl.sh: SSL3 Require client auth on 2nd hs (bad password) produced a returncode of 1, expected is 1 PASSED
dewey.1/output.log-selfserv: received SIGTERM
dewey.1/output.log-ssl.sh: SSL3 Require client auth on 2nd hs (client auth) ----
dewey.1/output.log-selfserv -D -p 8444 -d ../ext_server -n dewey.red.iplanet.com \
dewey.1/output.log-         -w nss -r -r -r -r -i ../tests_pid.26331 &
dewey.1/output.log-selfserv started at Tue Feb 5 11:26:32 PST 2002
dewey.1/output.log-tstclnt -p 8444 -h dewey -q -d . < /export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
dewey.1/output.log:ssl.sh: Exit: 10 Fatal - selfserv process not detectable
dewey.1/output.log-selfserv: received SIGTERM
dewey.1/output.log-ssl.sh: SSL Stress Test Extended test ===============================
dewey.1/output.log-ssl.sh: skipping Stress SSL2 RC4 128 with MD5 for Extended test
dewey.1/output.log-ssl.sh: Stress SSL3 RC4 128 with MD5 ----
dewey.1/output.log-selfserv -D -p 8444 -d ../ext_server -n dewey.red.iplanet.com \
dewey.1/output.log-         -w nss -i ../tests_pid.26331 &
dewey.1/output.log-selfserv started at Tue Feb 5 11:26:33 PST 2002
dewey.1/output.log-tstclnt -p 8444 -h dewey -q -d . < /export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
dewey.1/output.log-selfserv: PR_Bind returned error -5982:
dewey.1/output.log-Local Network address is in use.
dewey.1/output.log:ssl.sh: Exit: 10 Fatal - selfserv process not detectable
dewey.1/output.log-kill: (669) - No such pid
dewey.1/output.log-sdr.sh: SDR Tests ===============================
dewey.1/output.log-sdr.sh: Creating an SDR key/Encrypt
--
dewey.2/output.log-Discarded 1 characters.
dewey.2/output.log-GET / HTTP/1.0
dewey.2/output.log-
dewey.2/output.log-EOF
dewey.2/output.log-
dewey.2/output.log-
dewey.2/output.log-
dewey.2/output.log-
dewey.2/output.log-
dewey.2/output.log-ssl.sh: running SSL3 RSA EXPORT WITH RC2 CBC 40 MD5 ----------------------------
dewey.2/output.log:ssl.sh: Exit: 10 Fatal - selfserv process not detectable
dewey.2/output.log-selfserv: received SIGTERM
dewey.2/output.log-cat: /share/builds/mccrel/nss/nsstip/tinderbox/tests_results/security/dewey-20020205-10.59/dewey.2/tests_pid.1325: No such file or directory
dewey.2/output.log-cat: /share/builds/mccrel/nss/nsstip/tinderbox/tests_results/security/dewey-20020205-10.59/dewey.2/tests_pid.1325: No such file or directory
Summary: increased "selfserv process not detectable" errors on linux 2.4 → increased "selfserv process not detectable" errors on linux
Attached file one more output.log
Comment on attachment 67963
output.log including signalhandler messages
There are two things strange about this output.log.
1. There is no "selfserv: SIGTERM received" message at all.
2. In the first "selfserv process not detectable" error, the pid file did not exist. The 'cat' and 'rm' commands on the pid file both failed because the pid file did not exist.
that is because I attached the wrong file, sorry about that... let me see if I find the right one
Attachment #67963 - Attachment is obsolete: true
Comment on attachment 67964
one more output.log
This output.log looks more normal than the previous one. There are many "selfserv: received SIGTERM" messages, which are expected. There is only one "selfserv process not detectable" error, which is reproduced below:
ssl.sh: running SSL3 RSA EXPORT WITH RC2 CBC 40 MD5 ----------------------------
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
selfserv: received SIGTERM
cat: /share/builds/mccrel/nss/nsstip/tinderbox/tests_results/security/dewey-20020205-10.59/dewey.2/tests_pid.1325: No such file or directory
cat: /share/builds/mccrel/nss/nsstip/tinderbox/tests_results/security/dewey-20020205-10.59/dewey.2/tests_pid.1325: No such file or directory
rm: cannot remove `/share/builds/mccrel/nss/nsstip/tinderbox/tests_results/security/dewey-20020205-10.59/dewey.2/tests_pid.1325': No such file or directory
It appears that the selfserv process was killed by the SIGTERM signal. It is not clear whether that occurred before or after the "selfserv process not detectable" error because the "selfserv: received SIGTERM" message was written to standard error (fd 2) whereas the "selfserv process not detectable" message was written to standard output (fd 1). I just modified selfserv.c to write the "selfserv: received SIGTERM" message to standard output (fd 1) as well so that we have a better chance of seeing these messages in the correct order.
One strange thing about this output.log is that, here too, the pid file did not exist after the error.
Comment on attachment 68055
replaces attachment 67963
This output.log has two "selfserv process not detectable" errors. Only the first one is interesting. The second one is a Type I error, caused by the PR_Bind failure. The first error is reproduced here:
ssl.sh: SSL3 Require client auth on 2nd hs (client auth) ----
selfserv -D -p 8444 -d ../ext_server -n dewey.red.iplanet.com \
         -w nss -r -r -r -r -i ../tests_pid.26331 &
selfserv started at Tue Feb 5 11:26:32 PST 2002
tstclnt -p 8444 -h dewey -q -d . < /export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
selfserv: received SIGTERM
This again seems to suggest that the selfserv process was killed. Since the two messages were written to different streams (standard out and standard error), I am not sure whether these two events actually occurred in the same order as shown in the output.log.
Sonja, do you know if ssl.sh was supposed to kill selfserv when it could not detect selfserv?
Is it possible that the 'ps' command on Linux is not a reliable way to detect whether a process is still running? You can use "kill -0 $PID" to detect whether the process is still alive.
> Sonja, do you know if ssl.sh was supposed to kill selfserv > when it could not detect selfserv? it tries to kill it anyway. The Exit() function in ini.sh checks the file $SERVERPID, which is not removed when is_selfserv_alive() can not find the PID in the output of ps. I can change this behavior easily if that is desired. I was not aware of a kill -0 $PID - but so far the ps has given correct output. I'll read up on the kill -0
I could not find any info about kill -0 in the Linux or Solaris man pages, so I am reluctant to use a feature like this. Please review the patch:

Index: ssl.sh
===================================================================
RCS file: /cvsroot/mozilla/security/nss/tests/ssl/ssl.sh,v
retrieving revision 1.44
diff -u -r1.44 ssl.sh
--- ssl.sh	2 Nov 2001 23:47:47 -0000	1.44
+++ ssl.sh	6 Feb 2002 02:50:16 -0000
@@ -111,8 +111,12 @@
       fi
   fi
   PID=`cat ${SERVERPID}`
-  $PS -e | grep $PID >/dev/null || \
-      Exit 10 "Fatal - selfserv process not detectable"
+  if [ "${OS_ARCH}" = "Linux" ]; then
+      kill -0 $PID >/dev/null 2>/dev/null || Exit 10 "Fatal - selfserv process not detectable"
+  else
+      $PS -e | grep $PID >/dev/null || \
+          Exit 10 "Fatal - selfserv process not detectable"
+  fi
 }
 
 ########################### wait_for_selfserv ##########################
kill -0 $PID is an idiom to check whether the process is alive. See http://www.opengroup.org/onlinepubs/007908799/xcu/kill.html:

    -signal_number
        Specify a non-negative decimal integer, signal_number, representing the signal to be used instead of SIGTERM, as the sig argument in the effective call to kill().

and http://www.opengroup.org/onlinepubs/007908799/xsh/kill.html:

    If sig is 0 (the null signal), error checking is performed but no signal is actually sent. The null signal can be used to check the validity of pid.
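A minimal sketch of the idiom, assuming a POSIX shell. The helper name is_process_alive is hypothetical and is not taken from ssl.sh.

```shell
#!/bin/sh
# Hypothetical helper, not the actual ssl.sh code.
# Returns 0 if a process with the given PID exists and we are
# allowed to signal it, non-zero otherwise.
is_process_alive() {
    # Signal 0 performs error checking only; nothing is delivered.
    # Note: this also fails if the process exists but belongs to
    # another user (permission denied).
    kill -0 "$1" 2>/dev/null
}

# Usage: start a background process, probe it, stop it, probe again.
sleep 30 &
pid=$!
is_process_alive "$pid" && echo "process $pid is alive"
kill "$pid"
# Reap the child so its PID is really gone before probing again.
wait "$pid" 2>/dev/null || true
is_process_alive "$pid" || echo "process $pid is gone"
```

Unlike grepping the output of ps, this asks the kernel about exactly one PID and does not depend on how ps formats or enumerates its process list.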
OK - I guess that means I should check it in?
I suggest we use the kill -0 idiom on all platforms, not just Linux. We just need to verify that kill -0 works correctly under MKS Korn shell.
ps works fine on all platforms except maybe for Linux, where it stopped working very recently. I do not have the time right now to test it on all platforms. Can I just check it in as it is?
Checked it in the way Wan-Teh wanted it: kill -0 on all platforms.
Sonja, Have you seen this error again after we switched to "kill -0"?
No, I have not seen it since. Thanks, Wan-Teh, for finding that problem in the QA scripts. I think we can close the bug.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
If kill -0 works but ps doesn't, this shows the ps command on Linux has a bug. We should submit a bug report to Red Hat. Sonja, on which Linux machines have we ever seen the "Type II" errors (as defined in comment #14)? What kind of machines are they (Red Hat Linux version and number of CPUs)?
huey RH 7.1 P III (Coppermine) 996 MHz
louie RH 7.2 P III (Katmai) 596 MHz (showed fewer failures than faster machines)
dewey RH 6.2 P III (Coppermine) 996 MHz

were some of the machines I saw it on. I also saw it on other machines, whose CPU types I do not know:

box RH 7.2 2 processor
washer RH 6.2 4 processor

turgur.mcom showed the problem, as did Ian's machine. From the data I have today it is no longer evident whether these were Type II or Type I errors, because only the QA reports are stored, not the individual results and output logs.
In today's backward compatibility tests on dewey, which run old scripts that still use ps, I saw an error that looks exactly the same. I mention it just as an explanation for this QA failure.
Filed Red Hat Linux 6.2 bug 60921 http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=60921 for the 'ps -e' problem. I filed the bug against Red Hat Linux 6.2 because we have seen Type II errors on dewey, which is: dewey RH 6.2 P III (Coppermine) 996 MHz If we see Type II errors on other Red Hat Linux versions, I'll file new bugs against them.
Comment on attachment 67387 [details] [diff] [review] Patch to install a SIGTERM handler on Linux for debugging I removed the SIGTERM signal handler from the tip.
In http://bugzilla.mozilla.org/show_bug.cgi?id=129701#c16 we saw the 'ps -e' problem on 'huey' running Linux 2.4. Sonja, what kind of machine is 'huey' and what is the Red Hat Linux version it's running?
> In http://bugzilla.mozilla.org/show_bug.cgi?id=129701#c16
> we saw the 'ps -e' problem on 'huey' running Linux 2.4.
> Sonja, what kind of machine is 'huey' and what is the

1 CPU Dell server, 1000 MHz P III Coppermine

> Red Hat Linux version it's running?

Red Hat 7.1 out of the box, kernel Linux 2.4.2-2