Closed
Bug 119340
Opened 23 years ago
Closed 23 years ago
increased "selfserv process not detectable" errors on linux
Categories
(NSS :: Tools, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
3.4
People
(Reporter: sonja.mirtitsch, Assigned: wtc)
Details
Attachments
(5 files, 1 obsolete file)
4.24 KB, patch
108.07 KB, text/plain
1.21 KB, patch
123.18 KB, text/plain
134.91 KB, text/plain
huey in daily QA and louie on the tinderbox showed several times that
selfserv did not come up in time. On 11/27 I reduced the wait after a
selfserv has been killed to 3 seconds; this could mean that the socket is not
yet free, so the next selfserv has problems starting up.
I increased the wait to 5 seconds again, but we had been running trouble-free for a
month with 3 seconds. I will watch this for a few days; it might not be a
bug in the tests (or network speed), but in the tools.
Comment 1 (Reporter) • 23 years ago
Since it also shows up when the first selfserv is being started, I assume it has
nothing to do with the 3 seconds being too short a delay.
What the QA script does:
start selfserv
start tstclnt -q # exits with return code 0, meaning it has made a connection
read the server process ID file
grep through the ps output to find out whether the server is running
give the error message "ssl.sh: Exit: 10 Fatal - selfserv process not detectable"
Shell message during the Exit function:
"/share/builds/mccrel/nss/nsstip/builds/20020110.1/booboo_Solaris8/mozilla/security/nss/tests/all.sh:
line -57: 27172 Terminated selfserv -D -p ${PORT} -d ${R_SERVERDIR}
-n ${HOSTADDR} -w nss ${sparam} -i ${R_SERVERPID} $verbose (wd:
/share/builds/mccrel/nss/nsstip/builds/20020110.1/booboo_Solaris8/mozilla/tests_results/security/louie.7/client)"
ssl.sh: SSL tests ===============================
ssl.sh: SSL Cipher Coverage ===============================
selfserv -D -p 8443 -d ../server -n louie.red.iplanet.com \
-w nss -c ABCDEFabcdefghijklmnvy -i ../tests_pid.25206 &
selfserv started at Thu Jan 10 15:28:22 PST 2002
tstclnt -p 8443 -h louie -q -d . <
/share/builds/mccrel/nss/nsstip/builds/20020110.1/booboo_Solaris8/mozilla/security/nss/tests/ssl/sslreq.txt
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
/share/builds/mccrel/nss/nsstip/builds/20020110.1/booboo_Solaris8/mozilla/security/nss/tests/all.sh:
line -57: 27172 Terminated selfserv -D -p ${PORT} -d ${R_SERVERDIR}
-n ${HOSTADDR} -w nss ${sparam} -i ${R_SERVERPID} $verbose (wd:
/share/builds/mccrel/nss/nsstip/builds/20020110.1/booboo_Solaris8/mozilla/tests_results/security/louie.7/client)
Assignee: sonja.mirtitsch → wtc
Component: Test → Tools
Priority: -- → P2
Target Milestone: --- → 3.4
Comment 2 (Reporter) • 23 years ago
rerunning QA on all available linux machines
Comment 3 (Reporter) • 23 years ago
on the louie tinderbox this problem looks slightly different
selfserv -D -p 8444 -d ../ext_server -n louie.red.iplanet.com \
-w nss -r -i ../tests_pid.13877 &
selfserv started at Fri Jan 11 01:45:01 PST 2002
tstclnt -p 8444 -h louie -q -d . <
/export/nss_tbx_linux2.4_7.2/builds/tinderbox/Linux-2.4_7.2/mozilla/security/nss/tests/ssl/sslreq.txt
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
ssl.sh: SSL Stress Test Extended test ===============================
ssl.sh: skipping Stress SSL2 RC4 128 with MD5 for Extended test
ssl.sh: Stress SSL3 RC4 128 with MD5 ----
selfserv -D -p 8444 -d ../ext_server -n louie.red.iplanet.com \
-w nss -i ../tests_pid.13877 &
selfserv started at Fri Jan 11 01:45:02 PST 2002
tstclnt -p 8444 -h louie -q -d . <
/export/nss_tbx_linux2.4_7.2/builds/tinderbox/Linux-2.4_7.2/mozilla/security/nss/tests/ssl/sslreq.txt
selfserv: PR_Bind returned error -5982:
Local Network address is in use.
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
/export/nss_tbx_linux2.4_7.2/builds/tinderbox/Linux-2.4_7.2/mozilla/security/nss/tests/all.sh:
kill: (19528) - No such pid
All of the machines that show the failure at the moment run tinderboxes and
regular QA. I will halt the tinderboxes, keep running regular QA, and see whether
that stops the problem.
Comment 4 (Reporter) • 23 years ago
I ran the modified tinderboxes (no QA) which increased the number of failures
again.
before 13:20: 12 failures on huey and dewey in 14 runs each, no failure on louie
on nightly QA, 4 failures on tinderbox QA on huey, dewey and louie
13:20 turned off tinderboxes
14:05 QA failure
15:35: turned on modified tinderboxes (no QA)
15:35-17:45
4 failures in 6 runs on dewey, 1 failure in 4 runs on huey, no failures in 4 runs
on louie
Comment 5 (Reporter) • 23 years ago
forgot one more thing: this did not show up a single time in the backward
compatibility tests
I will upload the results once more to
ftp://ftp.mozilla.org/pub/security/nss/daily_qa/20020111.1/result.html
Comment 6 • 23 years ago
I have actually seen this once now on my local machine (RH 7.1 dual processor).
I've probably run the SSL QA ~20-30 times in the last few days, but I did see
the failure once.
I looked at changes to selfserv since 3.3.2 and nothing stuck out to me. That
the backward compatibility tests never show the problem would indicate that the
problem is in selfserv.
Are we using a newer version of NSPR?
Comment 7 (Reporter) • 23 years ago
I did see it today on a box in the backward compatibility tests as well. I found
that the easiest way to reproduce it is to compile at the same time.
Comment 8 • 23 years ago
Aha! Quite true. The one time I saw it, I was also compiling (QA was running
debug, I was building optimized).
Comment 9 (Assignee) • 23 years ago
This comment is from Bishakha:
I was able to reproduce this error at the Netscape end by running two tests in
very rapid succession. I have a script that iterates through the tests 30 times,
and the only times I see the error are when two tests are running near
simultaneously, with a second or so of time difference. I did this on
"turgur", which is the fastest of all the linux machines I ran the tests on.
If you view the times on these tests, you will see that sometimes the two QA
runs coincide at some point (i.e. a particular test suite does not coincide in
time exactly with the same suite of the next QA run - there should be a second
or more of difference) , and it seems to me that coming up with this error is
obvious, as one test waits on the other preceding it to free up the port.
The tests were run on the 20020114.1 build.
-Bishakha
Comment 10 (Assignee) • 23 years ago
I looked at this problem last night. As I noted in another bug,
the particular selfserv failure below, which prevents it from
starting up, has nothing to do with crypto or SSL:
selfserv: PR_Bind returned error -5982:
Local Network address is in use.
We should do two things.
1. Find out if all the "selfserv process not detectable" errors
are preceded by the above PR_Bind failure.
2. I will see if there is a more reliable way to kill a multithreaded
process on Linux.
There are two longer-term solutions.
1. Instead of killing selfserv, we should attempt to send it a
"stop" command to gracefully shut it down. I tried to do this
last night but could not make it work.
2. Have selfserv bind to any unused TCP port (as opposed to a
fixed TCP port) and write the port number it binds to to a file.
This solution has the additional benefit that we will be able
to run several QA sessions on the same machine at the same time.
In the meantime, we should increase the sleep interval after
killing selfserv to work around this problem on Linux. If we
suspect performance degradation issues, we should rely on Kirk's
performance tests.
Status: NEW → ASSIGNED
Comment 11 • 23 years ago
Wan-Teh,
FYI, this has been a constant problem on Linux for web server in the automated
tests. The tests would not keep running in tinderbox because some server
"threads" (really some left-over processes) would eventually stay even though
the other threads were gone. I always attributed it to Linux bugs.
FYI, here is what the web server stop script that ships with web server 6 does.
See the nice comments about Linux I put in there.
#!/bin/sh
SERVER_ROOT=/export/home/jpierre/60opt
PRODUCT_NAME=https
INSTANCE_NAME=https-admserv
PID_FILE=/export/home/jpierre/60opt/https-admserv/logs/pid
LD_LIBRARY_PATH=${SERVER_ROOT}/bin/${PRODUCT_NAME}/lib:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH
LIBPATH=${LIBPATH}:${LD_LIBRARY_PATH}:/usr/lib/threads:/usr/ibmcxx/lib:/usr/lib:/lib;
export LIBPATH
SHLIB_PATH=${SHLIB_PATH}:${LD_LIBRARY_PATH}; export SHLIB_PATH
NS_SERVER_HOME=${SERVER_ROOT}; export NS_SERVER_HOME
NSES_SERVER_HOME=${SERVER_ROOT}; export NSES_SERVER_HOME
NS_HTTPS_HOME=${NS_SERVER_HOME}/bin/https; export NS_HTTPS_HOME
PATH=${SERVER_ROOT}/bin/${PRODUCT_NAME}/bin:${PATH}; export PATH
if test -x "$SERVER_ROOT/bin/$PRODUCT_NAME/httpadmin/bin/shutdown" ; then
    # Send watchdog a message instructing it to terminate
    "$SERVER_ROOT/bin/$PRODUCT_NAME/httpadmin/bin/shutdown" $INSTANCE_NAME
else
    # Kill the watchdog with SIGTERM (unreliable with some JVMs)
    if test -f $PID_FILE ; then
        if [ `uname` != "Linux" ]
        then
            kill -TERM `cat $PID_FILE`
            if test $? -ne 0 ; then
                exit 1
            fi
        else
            # on Linux we send to the group to work around a glibc bug
            kill -TERM -`cat $PID_FILE`
            if test $? -ne 0 ; then
                exit 1
            fi
        fi
    else
        echo server not running
        exit 1
    fi
fi

# Wait up to max_count * 2 seconds for the pid file to disappear
loop_counter=1
max_count=30
while test $loop_counter -le $max_count; do
    loop_counter=`expr $loop_counter + 1`
    if test -f $PID_FILE ; then
        sleep 2
    else
        exit 0
    fi
done

echo server not responding to exit command
echo killing process group
kill -9 -`cat $PID_FILE`
rm $PID_FILE
exit 1
Comment 12 (Reporter) • 23 years ago
I do not think that is related to killing a multithreaded process on Linux,
because on 1/10 on louie this problem was seen at the first selfserv start
ftp://ftp.mozilla.org/pub/security/nss/daily_qa/20020110.1/louie.7/results.html
Comment 13 (Assignee) • 23 years ago
On Linux, each thread is actually a process, with its own pid.
This is evident in how 'selfserv' is displayed by the 'ps -eaf'
command.
wtc 1765 1757 1 13:56 pts/2 00:00:00 selfserv -D -p 8443 -d ../server
wtc 1770 1765 0 13:56 pts/2 00:00:00 selfserv -D -p 8443 -d ../server
wtc 1771 1770 0 13:56 pts/2 00:00:00 selfserv -D -p 8443 -d ../server
wtc 1772 1770 0 13:56 pts/2 00:00:00 selfserv -D -p 8443 -d ../server
wtc 1773 1770 0 13:56 pts/2 00:00:00 selfserv -D -p 8443 -d ../server
wtc 1774 1770 0 13:56 pts/2 00:00:00 selfserv -D -p 8443 -d ../server
wtc 1775 1770 0 13:56 pts/2 00:00:00 selfserv -D -p 8443 -d ../server
wtc 1776 1770 0 13:56 pts/2 00:00:00 selfserv -D -p 8443 -d ../server
wtc 1777 1770 0 13:56 pts/2 00:00:00 selfserv -D -p 8443 -d ../server
wtc 1778 1770 0 13:56 pts/2 00:00:00 selfserv -D -p 8443 -d ../server
Pid 1757 is the test script (ssl.sh).
Pid 1765 is the primary thread in selfserv.
Pid 1770 is the thread manager, created by the pthread library.
Pids 1771-1778 are the threads created by selfserv.
If we kill the primary thread (pid 1765) and wait for its
termination, my theory is that the other threads (pids
1770-1778) may still be running when the primary thread
has terminated. This is because the thread manager needs
to first notice that the primary thread is gone and then
terminate all the other threads. All of this can only
happen after the primary thread is dead.
Knowing how threads are implemented on Linux, I came up
with an inelegant but more reliable way to kill selfserv
and wait for the termination of all its threads.
I kill an arbitrary thread created by selfserv, say the
first one, with pid 1771. My theory is that the following
actions will be taken by the pthread library.
1. The thread manager, the parent of pid 1771, receives a
SIGCHLD signal, indicating the death of pid 1771.
2. The thread manager detects that something is wrong and
proceeds to kill all the other threads (pids 1772-1778).
3. After reaping all of its dead children, the thread manager
itself terminates.
4. The primary thread, the parent of the thread manager,
receives a SIGCHLD, indicating the death of the thread manager.
5. The primary thread detects something is wrong and terminates
too.
My theory is that with this method, the primary thread is the
last one to go. Therefore, if we wait for its termination, we
are sure that all the other threads have already terminated.
My patch implements this method for Linux. On non-Linux platforms,
there is no change. On Linux, we kill the first thread created
by selfserv and wait for the termination of the primary thread.
I modified selfserv.c so that the first thread created by selfserv
writes its pid to a file.
With this method, we can do without the 'sleep 5' command, which we
used on Linux to wait for the full termination of selfserv.
I've tested my patch a few times on Linux. A code review and more
testing are needed.
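For illustration only, the kill-then-wait sequence described above could be sketched in shell roughly as follows. This is not the attached patch: stop_via_worker, its arguments, and the one-second polling interval are hypothetical, and the sketch assumes the pids of one worker thread and of the primary thread are already known (for example, read from pid files).

```shell
# Sketch only: stop_via_worker is a hypothetical helper, not the attached patch.
# Signal one worker pid, then poll until the primary pid has gone away.
stop_via_worker() {
    worker_pid=$1
    primary_pid=$2
    tries=${3:-10}              # polling attempts, one second apart (assumed)

    # The worker may already be gone; ignore that error.
    kill -TERM "$worker_pid" 2>/dev/null || true

    while [ "$tries" -gt 0 ]; do
        # kill -0 delivers no signal; it only reports whether the pid can be signalled.
        if ! kill -0 "$primary_pid" 2>/dev/null; then
            return 0            # primary is gone
        fi
        sleep 1
        tries=$((tries - 1))
    done
    return 1                    # primary still present after the timeout
}
```

A caller would replace the fixed 'sleep 5' with something like stop_via_worker "$WORKER_PID" "$PRIMARY_PID" 10 and treat a non-zero return as "selfserv did not go away in time".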
Comment 14 (Assignee) • 23 years ago
To avoid confusion in the discussion of this bug, I'd like
to define two terms for the two types of errors that have
been described in this bug report.
Type I: a "selfserv process not detectable" error that is
preceded by a PR_Bind failure:
selfserv: PR_Bind returned error -5982:
Local Network address is in use.
Type II: a "selfserv process not detectable" error that is
NOT preceded by a PR_Bind failure.
All the "selfserv process not detectable" errors that
Bishakha and I have reproduced are Type I. (In fact, one
can easily cause Type I errors to occur by deleting the
SLEEP command in mozilla/security/nss/tests/ssl/ssl.sh.)
In comment #13, I presented a plausible theory that
explains Type I errors and proposed a solution
(attachment 65104).
In comment #12, Sonja gave an example of Type II errors.
I examined the output.log file and found that there was no
error message from the selfserv process. Bishakha and I
haven't been able to reproduce Type II errors. Since I
can't reproduce Type II errors and don't have any error
message from selfserv, I don't know how to debug them.
Comment 15 (Assignee) • 23 years ago
I examined selfserv.c and tstclnt.c. As far as I can tell,
selfserv and tstclnt always print an error message if they
exit because of an error.
Since there was no error message from selfserv or tstclnt
in the Type II error described in comment #1, I only have one
explanation: selfserv did a normal termination prematurely.
To verify this theory, I added a printf statement at the
end of the main() function in selfserv.c:
NSS_Shutdown();
PR_Cleanup();
+ printf("selfserv: normal termination\n");
return 0;
}
Comment 16 • 23 years ago
Alternatively, could the process exit without error because it was killed?
Perhaps some other user, or another script running simultaneously kills it?
Comment 17 (Assignee) • 23 years ago
Julien, I tried your solution (killing the process group as
opposed to the process) on selfserv. It doesn't work.
The command I issued, after variable and command substitution,
was:
kill -TERM -31459
The error message was:
./ssl.sh: kill: (-31459) - No such pid
I take it that it is the web server, not the Linux pthread
library, that creates a process group?
Comment 18 • 23 years ago
Wan-Teh,
That's correct, the web server - or the watchdog, in case the server is started
by the watchdog - creates a process group, and writes its PID to the file. Then
we can kill it with -group and it's supposed to kill the process, all the
threads, as well as all the children that it spawned, unless they defined their
own process groups. This is supposed to kill, for example, looping CGIs that are
part of the web server test suite.
Comment 19 (Assignee) • 23 years ago
Thanks, Julien.
Anyone care to review or test my patch? :-)
Comment 20 (Reporter) • 23 years ago
I reviewed the script changes and they look fine to me.
Comment 21 (Assignee) • 23 years ago
Comment on attachment 65104
An inelegant but more reliable way to wait for the termination of selfserv on Linux
I checked in this patch. This patch should eliminate what I
call the Type I errors and speed up our QA on Linux. Note
that this patch does not address the Type II errors, which I
still don't have a handle on.
Comment 22 (Assignee) • 23 years ago
Comment on attachment 65104
An inelegant but more reliable way to wait for the termination of selfserv on Linux
I backed out this patch.
Ever since I checked in this patch, we are seeing the
"selfserv process not detectable" errors on Linux in
our daily QA reports. A common failure sequence is
reproduced here (running Linux2.2_x86_glibc_PTH_OPT.OBJ
on Red Hat Linux 6.2):
===================================
ssl.sh: SSL3 Request don't require client auth on 2nd hs (client auth) produced
a returncode of 0, expected is 0 PASSED
ssl.sh: SSL3 Require client auth on 2nd hs (client does not provide auth) ----
selfserv -D -p 8443 -d ../ext_server -n dewey.red.iplanet.com \
-w nss -r -r -r -r -i ../tests_pid.8947 &
selfserv started at Wed Jan 30 04:05:58 PST 2002
tstclnt -p 8443 -h dewey -q -d . <
/share/builds/mccrel/nss/nsstip/builds/20020130.1/booboo_Solaris8/mozilla/security/nss/tests/ssl/sslreq.txt
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
ssl.sh: SSL Stress Test Extended test ===============================
ssl.sh: skipping Stress SSL2 RC4 128 with MD5 for Extended test
ssl.sh: Stress SSL3 RC4 128 with MD5 ----
selfserv -D -p 8443 -d ../ext_server -n dewey.red.iplanet.com \
-w nss -i ../tests_pid.8947 &
selfserv started at Wed Jan 30 04:05:59 PST 2002
tstclnt -p 8443 -h dewey -q -d . <
/share/builds/mccrel/nss/nsstip/builds/20020130.1/booboo_Solaris8/mozilla/security/nss/tests/ssl/sslreq.txt
selfserv: PR_Bind returned error -5982:
Local Network address is in use.
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
kill: (22329) - No such pid
===================================
We would first have a Type II error, followed by one or two
Type I errors. Those Type I errors suggest that there were
still some selfserv threads (which are really processes)
hanging around from the initial Type II error. So it is
safe to assume that the subsequent Type I errors were caused
by the initial Type II errors.
I still have no clue why we had the initial Type II errors
in the first place, and this is difficult for me to debug.
selfserv just disappears without leaving a trace -- no error
or normal termination messages to stderr or stdout, and no
core files.
Comment 23 • 23 years ago
At this point, with the amount of time we've put into this, wouldn't it just be
easier to start the selfserv's on different ports and let the OS kill the
processes when the shell exits? Instead of writing the PID out to a file and
then trying to kill the process, selfserv could find an available port, write
the port number out to a file, and then tstclnt would pick up the port on which
to connect. This way, we wouldn't have to worry about trying to kill the
process within the script. Additionally, we wouldn't have any trouble with
running the QA concurrently on the same machine.
Yes/No? Am I missing something?
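As a rough sketch of the script side of this proposal: the test script would wait for the port file to appear and then hand the number to tstclnt. Everything here is an assumption for illustration: selfserv has no port-file option in this bug, and read_port_file, the file name, and the "one decimal number per file" format are hypothetical.

```shell
# Sketch only: the port-file name and its format (a single decimal number)
# are assumptions, not an existing selfserv feature.
read_port_file() {
    port_file=$1
    tries=${2:-10}              # polling attempts, one second apart (assumed)

    while [ "$tries" -gt 0 ]; do
        if [ -s "$port_file" ]; then
            port=$(cat "$port_file")
            case $port in
                ''|*[!0-9]*) return 2 ;;   # contents are not a plain number
            esac
            echo "$port"
            return 0
        fi
        sleep 1
        tries=$((tries - 1))
    done
    return 1                    # server never wrote the file
}
```

The script could then run PORT=$(read_port_file ../tests_port.$$) and pass "$PORT" to tstclnt -p. A return code of 1 would replace today's "selfserv process not detectable" check for the startup case.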
Comment 24 (Assignee) • 23 years ago
Ian,
Good point. Sometimes a bystander can see things more clearly. :-)
I first proposed this in bug 58176 and then again in comment #10.
I totally forgot about this solution in the heat of debugging.
So I suggest that we do that. That will make the Type I errors
(with a PR_Bind error message) impossible. Then we'll see if
we are still seeing the mysterious Type II errors.
Comment 25 (Reporter) • 23 years ago
At this point in time I use 3 different ports in standard QA: one for regular
QA, one for the 32-bit tinderbox, and one for the 64-bit tinderbox. It is not impossible to
change, but it still requires a little bit of work, since the scripts will run into
conflicts if we just keep incrementing.
Comment 26 (Reporter) • 23 years ago
Here is a case where multiple test clients connected to the same selfserv during
the cipher coverage test. After 6 successful tstclnt connections to the
***same*** selfserv, the selfserv process goes away without any error message. I
would ask for a signal handler in selfserv to see whether some old script of mine
accidentally gets activated and kills the wrong process.
ssl.sh: SSL tests ===============================
ssl.sh: SSL Cipher Coverage ===============================
selfserv -D -p 8444 -d ../server -n dewey.red.iplanet.com \
-w nss -c ABCDEFabcdefghijklmnvy -i ../tests_pid.27150 &
selfserv started at Thu Jan 31 04:32:24 PST 2002
tstclnt -p 8444 -h dewey -q -d . <
/export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
ssl.sh: running SSL2 RC4 128 WITH MD5 ----------------------------
tstclnt -p 8444 -h dewey -c A -T \
-f -d . <
/export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
subject DN: CN=dewey.red.iplanet.com, O=BOGUS Netscape, L=Mountain View,
ST=California, C=US
issuer DN: CN=NSS Test CA, O=BOGUS NSS, L=Mountain View, ST=California, C=US
0 cache hits; 0 cache misses, 0 cache not reusable
HTTP/1.0 200 OK
Server: Generic Web Server
Date: Tue, 26 Aug 1997 22:10:05 GMT
Content-type: text/plain
Discarded 1 characters.
GET / HTTP/1.0
EOF
ssl.sh: running SSL2 RC4 128 EXPORT40 WITH MD5 ----------------------------
tstclnt -p 8444 -h dewey -c B \
-f -d . <
/export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
subject DN: CN=dewey.red.iplanet.com, O=BOGUS Netscape, L=Mountain View,
ST=California, C=US
issuer DN: CN=NSS Test CA, O=BOGUS NSS, L=Mountain View, ST=California, C=US
0 cache hits; 0 cache misses, 0 cache not reusable
HTTP/1.0 200 OK
Server: Generic Web Server
Date: Tue, 26 Aug 1997 22:10:05 GMT
Content-type: text/plain
Discarded 1 characters.
GET / HTTP/1.0
EOF
ssl.sh: running SSL2 RC2 128 CBC WITH MD5 ----------------------------
tstclnt -p 8444 -h dewey -c C \
-f -d . <
/export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
subject DN: CN=dewey.red.iplanet.com, O=BOGUS Netscape, L=Mountain View,
ST=California, C=US
issuer DN: CN=NSS Test CA, O=BOGUS NSS, L=Mountain View, ST=California, C=US
0 cache hits; 0 cache misses, 0 cache not reusable
HTTP/1.0 200 OK
Server: Generic Web Server
Date: Tue, 26 Aug 1997 22:10:05 GMT
Content-type: text/plain
Discarded 1 characters.
GET / HTTP/1.0
EOF
ssl.sh: running SSL2 RC2 128 CBC EXPORT40 WITH MD5 ----------------------------
tstclnt -p 8444 -h dewey -c D -T \
-f -d . <
/export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
subject DN: CN=dewey.red.iplanet.com, O=BOGUS Netscape, L=Mountain View,
ST=California, C=US
issuer DN: CN=NSS Test CA, O=BOGUS NSS, L=Mountain View, ST=California, C=US
0 cache hits; 0 cache misses, 0 cache not reusable
HTTP/1.0 200 OK
Server: Generic Web Server
Date: Tue, 26 Aug 1997 22:10:05 GMT
Content-type: text/plain
Discarded 1 characters.
GET / HTTP/1.0
EOF
ssl.sh: running SSL2 DES 64 CBC WITH MD5 ----------------------------
tstclnt -p 8444 -h dewey -c E \
-f -d . <
/export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
subject DN: CN=dewey.red.iplanet.com, O=BOGUS Netscape, L=Mountain View,
ST=California, C=US
issuer DN: CN=NSS Test CA, O=BOGUS NSS, L=Mountain View, ST=California, C=US
0 cache hits; 0 cache misses, 0 cache not reusable
HTTP/1.0 200 OK
Server: Generic Web Server
Date: Tue, 26 Aug 1997 22:10:05 GMT
Content-type: text/plain
Discarded 1 characters.
GET / HTTP/1.0
EOF
ssl.sh: running SSL2 DES 192 EDE3 CBC WITH MD5 ----------------------------
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
cat:
/share/builds/mccrel/nss/nsstip/tinderbox/tests_results/security/dewey-20020131-04.02/dewey.2/tests_pid.27150:
No such file or directory
cat:
/share/builds/mccrel/nss/nsstip/tinderbox/tests_results/security/dewey-20020131-04.02/dewey.2/tests_pid.27150:
No such file or directory
rm: cannot remove
`/share/builds/mccrel/nss/nsstip/tinderbox/tests_results/security/dewey-20020131-04.02/dewey.2/tests_pid.27150':
No such file or directory
ssl.sh: SSL Client Authentication ===============================
Comment 27 (Reporter) • 23 years ago
Comment 28 (Assignee) • 23 years ago
Sonja,
Does your script always use SIGTERM (the default) to kill
selfserv? I assume I just need to add a signal handler
for SIGTERM?
Comment 29 (Reporter) • 23 years ago
It first tries a SIGTERM, waits a second, and then sends a SIGKILL (-9).
I tried to disable it in standard and tinderbox QA a long time ago, and I'll
read my script again to see if I have forgotten something.
The scripts that do it are nssqa (function kill_by_name in the header) and
mainly qaclean, which is started as qaclean [hostname], uses rsh to reach the
host, and kills selfserv there.
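The SIGTERM, wait, SIGKILL sequence described here can be sketched as below. This is a minimal illustration, not the code in nssqa or qaclean: terminate_pid is a hypothetical name, and the only detail taken from the description is the one-second grace period.

```shell
# Sketch only: terminate_pid is a hypothetical name; the one-second default
# grace period mirrors the description above.
terminate_pid() {
    pid=$1
    grace=${2:-1}

    # If the pid cannot be signalled (already gone, or not ours), stop here.
    kill -TERM "$pid" 2>/dev/null || return 0
    sleep "$grace"
    # Escalate only if the process is still present after the grace period.
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid" 2>/dev/null || true
    fi
    return 0
}
```

Because SIGKILL cannot be caught, a selfserv that is killed in the second step would not print any message from a SIGTERM handler, which is worth keeping in mind when reading the logs.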
Comment 30 (Assignee) • 23 years ago
I checked in this patch.
Comment 31 (Reporter) • 23 years ago
I changed the summary, since this bug has been seen on any type of Linux by now.
I noticed that it has decreased over the last couple of days; this is the first
time that I have noticed it since the signal handler was added:
tinderbox on dewey:
dewey.1 10 Fatal - selfserv process not detectable Failed
dewey.1 10 Fatal - selfserv process not detectable Failed
dewey.2 10 Fatal - selfserv process not detectable Failed
dewey.1/output.log:tstclnt -p 8444 -h dewey -q -d . <
/export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
dewey.1/output.log:ssl.sh: Exit: 10 Fatal - selfserv process not detectable
dewey.1/output.log:selfserv: received SIGTERM
dewey.1/output.log:ssl.sh: SSL Stress Test Extended test
===============================
dewey.1/output.log:ssl.sh: skipping Stress SSL2 RC4 128 with MD5 for Extended test
dewey.1/output.log:ssl.sh: Stress SSL3 RC4 128 with MD5 ----
dewey.1/output.log:selfserv -D -p 8444 -d ../ext_server -n dewey.red.iplanet.com \
dewey.1/output.log: -w nss -i ../tests_pid.26331 &
dewey.1/output.log:selfserv started at Tue Feb 5 11:26:33 PST 2002
dewey.1/output.log:tstclnt -p 8444 -h dewey -q -d . <
/export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
dewey.1/output.log:selfserv: PR_Bind returned error -5982:
dewey.1/output.log:Local Network address is in use.
dewey.1/output.log:ssl.sh: Exit: 10 Fatal - selfserv process not detectable
dewey.1/output.log:kill: (669) - No such pid
dewey.1/output.log-issuer DN: CN=NSS Chain2 Server Test CA, O=BOGUS NSS, L=Santa
Clara, ST=California, C=US
dewey.1/output.log-0 cache hits; 1 cache misses, 0 cache not reusable
dewey.1/output.log-tstclnt: read from socket failed: TCP connection reset by peer.
dewey.1/output.log-ssl.sh: SSL3 Require client auth on 2nd hs (bad password)
produced a returncode of 1, expected is 1 PASSED
dewey.1/output.log-selfserv: received SIGTERM
dewey.1/output.log-ssl.sh: SSL3 Require client auth on 2nd hs (client auth) ----
dewey.1/output.log-selfserv -D -p 8444 -d ../ext_server -n dewey.red.iplanet.com \
dewey.1/output.log- -w nss -r -r -r -r -i ../tests_pid.26331 &
dewey.1/output.log-selfserv started at Tue Feb 5 11:26:32 PST 2002
dewey.1/output.log-tstclnt -p 8444 -h dewey -q -d . <
/export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
dewey.1/output.log:ssl.sh: Exit: 10 Fatal - selfserv process not detectable
dewey.1/output.log-selfserv: received SIGTERM
dewey.1/output.log-ssl.sh: SSL Stress Test Extended test
===============================
dewey.1/output.log-ssl.sh: skipping Stress SSL2 RC4 128 with MD5 for Extended test
dewey.1/output.log-ssl.sh: Stress SSL3 RC4 128 with MD5 ----
dewey.1/output.log-selfserv -D -p 8444 -d ../ext_server -n dewey.red.iplanet.com \
dewey.1/output.log- -w nss -i ../tests_pid.26331 &
dewey.1/output.log-selfserv started at Tue Feb 5 11:26:33 PST 2002
dewey.1/output.log-tstclnt -p 8444 -h dewey -q -d . <
/export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
dewey.1/output.log-selfserv: PR_Bind returned error -5982:
dewey.1/output.log-Local Network address is in use.
dewey.1/output.log:ssl.sh: Exit: 10 Fatal - selfserv process not detectable
dewey.1/output.log-kill: (669) - No such pid
dewey.1/output.log-sdr.sh: SDR Tests ===============================
dewey.1/output.log-sdr.sh: Creating an SDR key/Encrypt
--
dewey.2/output.log-Discarded 1 characters.
dewey.2/output.log-GET / HTTP/1.0
dewey.2/output.log-
dewey.2/output.log-EOF
dewey.2/output.log-
dewey.2/output.log-
dewey.2/output.log-
dewey.2/output.log-
dewey.2/output.log-
dewey.2/output.log-ssl.sh: running SSL3 RSA EXPORT WITH RC2 CBC 40 MD5
----------------------------
dewey.2/output.log:ssl.sh: Exit: 10 Fatal - selfserv process not detectable
dewey.2/output.log-selfserv: received SIGTERM
dewey.2/output.log-cat:
/share/builds/mccrel/nss/nsstip/tinderbox/tests_results/security/dewey-20020205-10.59/dewey.2/tests_pid.1325:
No such file or directory
dewey.2/output.log-cat:
/share/builds/mccrel/nss/nsstip/tinderbox/tests_results/security/dewey-20020205-10.59/dewey.2/tests_pid.1325:
No such file or directory
Summary: increased "selfserv process not detectable" errors on linux 2.4 → increased "selfserv process not detectable" errors on linux
Comment 32 (Reporter) • 23 years ago
Comment 33 (Reporter) • 23 years ago
Comment 34 (Assignee) • 23 years ago
Comment on attachment 67963
output.log including signalhandler messages
There are two things strange about this output.log.
1. There is no "selfserv: received SIGTERM" message
at all.
2. In the first "selfserv process not detectable"
error, the pid file did not exist. The 'cat' and
'rm' commands on the pid file both failed because
the pid file did not exist.
Comment 35 (Reporter) • 23 years ago
that is because I attached the wrong file, sorry about that... let me see if I
find the right one
Updated (Reporter) • 23 years ago
Attachment #67963 - Attachment is obsolete: true
Comment 36 (Reporter) • 23 years ago
Comment 37 (Assignee) • 23 years ago
Comment on attachment 67964
one more output.log
This output.log looks more normal than the previous one.
There are many "selfserv: received SIGTERM" messages,
which are expected.
There is only one "selfserv process not detectable" error,
which is reproduced below:
ssl.sh: running SSL3 RSA EXPORT WITH RC2 CBC 40 MD5
----------------------------
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
selfserv: received SIGTERM
cat:
/share/builds/mccrel/nss/nsstip/tinderbox/tests_results/security/dewey-20020205-10.59/dewey.2/tests_pid.1325:
No such file or directory
cat:
/share/builds/mccrel/nss/nsstip/tinderbox/tests_results/security/dewey-20020205-10.59/dewey.2/tests_pid.1325:
No such file or directory
rm: cannot remove
`/share/builds/mccrel/nss/nsstip/tinderbox/tests_results/security/dewey-20020205-10.59/dewey.2/tests_pid.1325':
No such file or directory
It appears that the selfserv process was killed by the
SIGTERM signal. It is not clear whether that occurred
before or after the "selfserv process not detectable"
error because the "selfserv: received SIGTERM" message
was written to standard error (fd 2) whereas the
"selfserv process not detectable" message was written
to standard output (fd 1).
I just modified selfserv.c to write the "selfserv: received
SIGTERM" message to standard output (fd 1) as well so that
we have a better chance of seeing these messages in the
correct order.
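On the shell side, the same ordering concern can be illustrated with a minimal sketch. It assumes a hypothetical function, emit_events, that writes to both streams; it is not part of ssl.sh, and it does not address buffering inside a C program such as selfserv.

```shell
#!/bin/sh
# emit_events is a hypothetical stand-in for a test step that reports
# on both standard output (fd 1) and standard error (fd 2).
emit_events() {
    echo "step 1 (stdout)"
    echo "step 2 (stderr)" >&2
    echo "step 3 (stdout)"
}

# 2>&1 points fd 2 at the same destination as fd 1, so a single log
# receives the lines in the order in which they were written.
emit_events 2>&1
```

When the two streams are captured separately, their relative order cannot be recovered from the log afterwards.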
One strange thing about this output.log is that the pid file
no longer existed after the error either.
| Assignee | ||
Comment 38•23 years ago
|
||
Comment on attachment 68055 [details]
replaces attachment 67963 [details]
This output.log has two "selfserv process not detectable"
errors. Only the first one is interesting. The second
one is a Type I error, caused by the PR_Bind failure.
The first error is reproduced here:
ssl.sh: SSL3 Require client auth on 2nd hs (client auth) ----
selfserv -D -p 8444 -d ../ext_server -n dewey.red.iplanet.com \
-w nss -r -r -r -r -i ../tests_pid.26331 &
selfserv started at Tue Feb 5 11:26:32 PST 2002
tstclnt -p 8444 -h dewey -q -d . < /export/nss_tbx_linux2.2/builds/tinderbox/Linux-2.2/mozilla/security/nss/tests/ssl/sslreq.txt
ssl.sh: Exit: 10 Fatal - selfserv process not detectable
selfserv: received SIGTERM
This again seems to suggest that the selfserv process was
killed. Since the two messages were written to different
streams (standard out and standard error), I am not sure
whether these two events actually occurred in the same
order as shown in the output.log.
Sonja, do you know if ssl.sh was supposed to kill selfserv
when it could not detect selfserv?
Is it possible that the 'ps' command on Linux is not a
reliable way to detect whether a process is still running?
You can use "kill -0 $PID" to detect whether the process
is still alive.
| Reporter | ||
Comment 39•23 years ago
|
||
> Sonja, do you know if ssl.sh was supposed to kill selfserv
> when it could not detect selfserv?
it tries to kill it anyway. The Exit() function in ini.sh checks the file
$SERVERPID, which is not removed when is_selfserv_alive() can not find the PID
in the output of ps.
I can change this behavior easily if that is desired.
I was not aware of a kill -0 $PID - but so far the ps has given correct output.
I'll read up on the kill -0
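The 'cat' and 'rm' errors in the logs appear when cleanup runs while the pid file is missing. A minimal sketch of a cleanup step that tolerates this case follows. The helper name stop_server_from_pidfile and the example path are hypothetical; this is not the actual Exit() function in the NSS test scripts.

```shell
#!/bin/sh
# stop_server_from_pidfile is a hypothetical helper, not the Exit()
# function used by the NSS test scripts.
stop_server_from_pidfile() {
    pidfile=$1
    # Nothing to do when the pid file is absent; this avoids the
    # "No such file or directory" errors from cat and rm.
    [ -f "$pidfile" ] || return 0
    pid=`cat "$pidfile"`
    # Send SIGTERM only if the process still exists.
    if kill -0 "$pid" >/dev/null 2>&1; then
        kill "$pid"
    fi
    rm -f "$pidfile"
}

# Calling the helper with a missing pid file is a silent no-op.
# The path below is an illustrative placeholder.
stop_server_from_pidfile "${TMPDIR:-/tmp}/no_such_pidfile.$$"
```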
| Reporter | ||
Comment 40•23 years ago
|
||
could not find any info about kill -0 in Linux or Solaris manpages
I am reluctant to use a feature like this - please review the patch
Index: ssl.sh
===================================================================
RCS file: /cvsroot/mozilla/security/nss/tests/ssl/ssl.sh,v
retrieving revision 1.44
diff -u -r1.44 ssl.sh
--- ssl.sh 2 Nov 2001 23:47:47 -0000 1.44
+++ ssl.sh 6 Feb 2002 02:50:16 -0000
@@ -111,8 +111,12 @@
fi
fi
PID=`cat ${SERVERPID}`
- $PS -e | grep $PID >/dev/null || \
- Exit 10 "Fatal - selfserv process not detectable"
+ if [ "${OS_ARCH}" = "Linux" ]; then
+ kill -0 $PID >/dev/null 2>/dev/null || Exit 10 "Fatal - selfserv process not detectable"
+ else
+ $PS -e | grep $PID >/dev/null || \
+ Exit 10 "Fatal - selfserv process not detectable"
+ fi
}
########################### wait_for_selfserv ##########################
| Assignee | ||
Comment 41•23 years ago
|
||
kill -0 $PID is an idiom to check whether the process is alive.
See http://www.opengroup.org/onlinepubs/007908799/xcu/kill.html:
-signal_number
Specify a non-negative decimal integer, signal_number,
representing the signal to be used instead of SIGTERM,
as the sig argument in the effective call to kill().
and http://www.opengroup.org/onlinepubs/007908799/xsh/kill.html:
If sig is 0 (the null signal), error checking is performed
but no signal is actually sent. The null signal can be used
to check the validity of pid.
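The idiom quoted above can be sketched as a small shell function. The name is_process_alive is a hypothetical helper for illustration, not a function from ssl.sh.

```shell
#!/bin/sh
# is_process_alive is a hypothetical helper name, not a function from ssl.sh.
# It returns 0 if a signal could be delivered to the given PID and
# non-zero otherwise.
is_process_alive() {
    # Signal 0 performs only the error checking; no signal is delivered.
    kill -0 "$1" >/dev/null 2>&1
}

# The current shell is certainly running, so this check succeeds.
if is_process_alive $$; then
    echo "process $$ is alive"
fi
```

Note that kill -0 also fails when the process exists but the caller is not permitted to signal it. That should not matter here as long as the test script and selfserv run as the same user.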
| Reporter | ||
Comment 42•23 years ago
|
||
OK - I guess that means I should check it in?
| Assignee | ||
Comment 43•23 years ago
|
||
I suggest we use the kill -0 idiom on all platforms, not just
Linux. We just need to verify that kill -0 works correctly
under MKS Korn shell.
| Reporter | ||
Comment 44•23 years ago
|
||
ps works fine on all platforms except maybe for linux, where it stopped working
very recently.
I do not have the time right now to test it on all platforms - can I just check
it in as it is?
| Reporter | ||
Comment 45•23 years ago
|
||
Checked it in as Wan-Teh wanted it: kill -0 on all platforms.
| Assignee | ||
Comment 46•23 years ago
|
||
Sonja,
Have you seen this error again after we switched to "kill -0"?
| Reporter | ||
Comment 47•23 years ago
|
||
no I have not seen it since.
Thanks Wan-Teh for finding that problem in the QA scripts, I think we can close
the bug.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
| Assignee | ||
Comment 48•23 years ago
|
||
If kill -0 works but ps doesn't, this shows the ps command
on Linux has a bug. We should submit a bug report to Red
Hat.
Sonja, on which Linux machines have we ever seen the "Type
II" errors (as defined in comment #14)? What kind of
machines are they (Red Hat Linux version and number of CPUs)?
| Reporter | ||
Comment 49•23 years ago
|
||
huey RH 7.1 P III (Coppermine) 996 MHz
louie RH 7.2 P III (Katmai) 596 MHz (showed fewer failures than faster machines)
dewey RH 6.2 P III (Coppermine) 996 MHz
were some of the machines I saw it on. I also saw it on other machines, whose
CPU types I do not know
box RH 7.2 2 processor
washer RH 6.2 4 processor
turgur.mcom showed the problem as well as Ian's machine
From the data I have today it is no longer evident whether they were Type II
errors or Type I errors, because only the QA reports are stored, not the
individual results and output logs.
| Reporter | ||
Comment 50•23 years ago
|
||
I saw an error that looks exactly the same in today's backward compatibility
tests on dewey, which run old scripts that still use ps. I mention it only as
an explanation for this QA failure.
| Assignee | ||
Comment 51•23 years ago
|
||
Filed Red Hat Linux 6.2 bug 60921
http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=60921
for the 'ps -e' problem.
I filed the bug against Red Hat Linux 6.2 because we have
seen Type II errors on dewey, which is:
dewey RH 6.2 P III (Coppermine) 996 MHz
If we see Type II errors on other Red Hat Linux versions,
I'll file new bugs against them.
| Assignee | ||
Comment 52•23 years ago
|
||
Comment on attachment 67387 [details] [diff] [review]
Patch to install a SIGTERM handler on Linux for debugging
I removed the SIGTERM signal handler from the tip.
| Assignee | ||
Comment 53•23 years ago
|
||
In http://bugzilla.mozilla.org/show_bug.cgi?id=129701#c16
we saw the 'ps -e' problem on 'huey' running Linux 2.4.
Sonja, what kind of machine is 'huey' and what is the
Red Hat Linux version it's running?
| Reporter | ||
Comment 54•23 years ago
|
||
> In http://bugzilla.mozilla.org/show_bug.cgi?id=129701#c16
> we saw the 'ps -e' problem on 'huey' running Linux 2.4.
> Sonja, what kind of machine is 'huey' and what is the
1 CPU Dell server, 1000 MHz P III Coppermine
> Red Hat Linux version it's running?
Redhat 7.1 out of the box, kernel Linux 2.4.2-2