Closed Bug 129701 Opened 22 years ago Closed 22 years ago

Remove the sleep commands for Linux in kill_selfserv()

Categories

(NSS :: Test, enhancement, P2)

x86
Linux
enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wtc, Assigned: wtc)

Details

Attachments

(2 files)

In tests/ssl/ssl.sh, kill_selfserv(), we sleep for 30 seconds on
Linux after killing the selfserv process, otherwise the next selfserv
process may fail to bind to the TCP port with an "Local Network
address is in use" error.

These sleep 30 commands significantly slow down the QA runs on
Linux.

In http://bugzilla.mozilla.org/show_bug.cgi?id=119340#c13, I
proposed a hypothesis and an inelegant patch.  That patch was
backed out because of interaction with another problem (the 'ps'
command doesn't seem to be reliable on Linux).  Now that we are
not using 'ps' to determine if a process is alive, I'd like to
try my inelegant solution again.
Status: NEW → ASSIGNED
Priority: -- → P1
Target Milestone: --- → 3.4.1
I really hope we can remove the sleep 30. I was planing to do this after RTM,
and I think even without the patch we can reduce it to 3-5 seconds - of course
complete removal would be even better
Changed the QA contact to Bishakha.
QA Contact: sonja.mirtitsch → bishakhabanerjee
Set target milestone to NSS 3.5.
Target Milestone: 3.4.1 → 3.5
Moved to 3.6.
Target Milestone: 3.5 → 3.6
Attached patch Proposed patchSplinter Review
This is an updated version of the patch in bug 119340.
I found that my patch doesn't work on Red Hat Linux
7.3.  I tried patching it further and different
solutions, but none of them work reliably.  So I'm
going to mark this WONTFIX.

I am afraid that we will need to solve this problem
by redefining the problem.  For example, we can
modify selfserv so that it will sleep one second
and retry PR_Bind if PR_Bind fails with the address
in use error.  Or we can modify our test scripts
so that selfserv uses an arbitrary unused port.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → WONTFIX
> Or we can modify our test scripts
> so that selfserv uses an arbitrary unused port.

I vote against this, because it will result in even more selfservs hanging on QA
machines after failed QA

Using a fixed port for selfserv has prevented us
from doing simultaneous NSS QA runs on the same
machine.

Another possible solution is to shut down selfserv
cleanly by sending a 'stop' command to it.
OK, I came up with a different solution.  Instead of
trying to figure out how to kill and wait for a
multithreaded process on Linux reliably, I simply
replace the 'sleep 30' command by a loop that tries
to bind to the port.  When we successfully bind to
the port, we know the previous selfserv process has
fully died and freed the port.

I added a -b (bindOnly) option to selfserv so that I
can use selfserv to do the "try bind".

A possible improvement is to try to bind to the port
for no more than a fixed number of times.  Another
possible improvement is to print a message before
each "try bind" so that output.log shows what's going
on behind the scene.

Please review and test this patch.  Thanks.
I'd like to suggest Yet Another Solution.  We could pass an argument to selfserv
that determines the number of connections it will handle.  After that number,
selfserv shuts down cleanly.  This has the advantage of a clean shutdown,
without the delay of sending a STOP request.
I am sure there is a reason why we are killing the
selfserv processes instead of cleanly shutting them
down.  I guess it is to avoid leftover selfserv
processes or hanging QA when there is a bug in
selfserv.

I think a clean shutdown solution (either by sending
a STOP request or by handling a specified number of
connections) is better.  But we would still need to
kill selfserv when it fails to shut down cleanly
because of a bug.  Between the two proposed clean
shutdown solutions, it seems that handling a specified
number of connections is more reliable and less
vulnerable to a bug in our code.

My proposed patch #2 is quite simple.  How about if
we use it at least as an interim solution?
Status: RESOLVED → REOPENED
Priority: P1 → P2
Resolution: WONTFIX → ---
I agree we still have to shutdown the server in the even the number of specified
connections was not completed due to a bug.

I think your patch is good.
> Using a fixed port for selfserv has prevented us
> from doing simultaneous NSS QA runs on the same
> machine.

I run 2 different tinderboxes plus nightly QA on an HP machine and a Solaris 8
machine without problems, just set the ENV PORT to a portnumber before starting
all.sh (ssl.sh) or use nssqa with -p <portnumber>. nssqa -t, as we start the
tinderbox qa will automatically pick a different portnumber too.

I think the loop is a good idea for now, but we should really implement Ian's
solution as soon as someone has the time.


I will check in my patch on Saturday to get some testing
done on it during the weekend.
Status: REOPENED → ASSIGNED
Comment on attachment 98094 [details] [diff] [review]
Proposed patch #2: replace 'sleep 30' by looping until the port is freed

I just checked in this patch to have it tested on Tinderbox over the weekend.
problems on todays backward compatibility tests, see

ftp://ftp.mozilla.org/pub/security/nss/daily_qa/20020911.1/bct/huey.2/results.html
ftp://ftp.mozilla.org/pub/security/nss/daily_qa/20020911.1/bct/huey.2/output.log

backward compatibility tests on Linux 2.4 run 331 scripts (do a sleep 30) and
binaries and todays libraries

init.sh init: Testing PATH
/share/builds/mccrel3/nss/nsstip/../nss331/builds/20010928.2.331-RTM/booboo_Solaris8/mozilla/dist/../security/nss/tests:/share/builds/mccrel3/nss/nsstip/../nss331/builds/20010928.2.331-RTM/booboo_Solaris8/mozilla/dist/Linux2.4_x86_glibc_PTH_OPT.OBJ/bin:/lib:/usr/lib:/bin:/sbin:/usr/bin:/usr/sbin:.:/u/svbld/bin:/tools/ns/bin:/usr/ccs/bin:/usr/dist/local/exe:/usr/bin/X11:/usr/audio/bin:/u/svbld/ns/securityqa
against LIB
/share/builds/mccrel3/nss/nsstip/builds/20020911.1/booboo_Solaris8/mozilla/dist/Linux2.4_x86_glibc_PTH_OPT.OBJ/lib


ssl.sh: TLS Require client auth (bad password) ----
selfserv -D -p 8443 -d ../server -n huey.red.iplanet.com \
         -w nss -r -r -i ../tests_pid.30698  &
selfserv started at Wed Sep 11 07:19:55 PDT 2002
tstclnt -p 8443 -h huey -q -d . <
/share/builds/mccrel3/nss/nss331/builds/20010928.2.331-RTM/booboo_Solaris8/mozilla/security/nss/tests/ssl/sslreq.txt

ssl.sh: Exit: 10 Fatal - selfserv process not detectable
/share/builds/mccrel3/nss/nsstip/../nss331/builds/20010928.2.331-RTM/booboo_Solaris8/mozilla/dist/../security/nss/tests/all.sh:
line 1:  7349 Terminated              selfserv -D -p ${PORT} -d ${R_SERVERDIR}
-n ${HOSTADDR} -w nss ${sparam} -i ${R_SERVERPID} $verbose  (wd:
/share/builds/mccrel3/nss/nsstip/builds/20020911.1/booboo_Solaris8/mozilla/tests_results/security/bct/huey.2/client)

Sonja, thanks a lot for reporting these failures.

The failures in backward compatibility tests cannot have
been caused by my checkin because I only modified init.sh,
ssl.sh, and selfserv.exe.

I think these were caused by the Linux 'ps' problem we
identified in http://bugzilla.mozilla.org/show_bug.cgi?id=119340#c38.
Maybe we should check in the "kill -0 $PID" fix into the
backward compatibility test scripts.
Marked the bug fixed.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago22 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: