Closed Bug 129701 Opened 22 years ago Closed 22 years ago

Remove the sleep commands for Linux in kill_selfserv()

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

3.6

People

(Reporter: wtc, Assigned: wtc)

Details

Attachments

(2 files)

Proposed patch 22 years ago Wan-Teh Chang 4.24 KB, patch		Details \| Diff \| Splinter Review
Proposed patch #2: replace 'sleep 30' by looping until the port is freed 22 years ago Wan-Teh Chang 4.20 KB, patch		Details \| Diff \| Splinter Review

Wan-Teh Chang

Assignee

Description

•

22 years ago

In tests/ssl/ssl.sh, kill_selfserv(), we sleep for 30 seconds on
Linux after killing the selfserv process, otherwise the next selfserv
process may fail to bind to the TCP port with an "Local Network
address is in use" error.

These sleep 30 commands significantly slow down the QA runs on
Linux.

In http://bugzilla.mozilla.org/show_bug.cgi?id=119340#c13, I
proposed a hypothesis and an inelegant patch.  That patch was
backed out because of interaction with another problem (the 'ps'
command doesn't seem to be reliable on Linux).  Now that we are
not using 'ps' to determine if a process is alive, I'd like to
try my inelegant solution again.

Wan-Teh Chang

Assignee

Updated

•

22 years ago

Status: NEW → ASSIGNED

Priority: -- → P1

Target Milestone: --- → 3.4.1

Sonja Mirtitsch

Comment 1

•

22 years ago

I really hope we can remove the sleep 30. I was planing to do this after RTM,
and I think even without the patch we can reduce it to 3-5 seconds - of course
complete removal would be even better

Wan-Teh Chang

Assignee

Comment 2

•

22 years ago

Changed the QA contact to Bishakha.

QA Contact: sonja.mirtitsch → bishakhabanerjee

Wan-Teh Chang

Assignee

Comment 3

•

22 years ago

Set target milestone to NSS 3.5.

Target Milestone: 3.4.1 → 3.5

Wan-Teh Chang

Assignee

Comment 4

•

22 years ago

Moved to 3.6.

Target Milestone: 3.5 → 3.6

Wan-Teh Chang

Assignee

Comment 5

•

22 years ago

Attached patch Proposed patch — Details — Splinter Review

This is an updated version of the patch in bug 119340.

Wan-Teh Chang

Assignee

Comment 6

•

22 years ago

I found that my patch doesn't work on Red Hat Linux
7.3.  I tried patching it further and different
solutions, but none of them work reliably.  So I'm
going to mark this WONTFIX.

I am afraid that we will need to solve this problem
by redefining the problem.  For example, we can
modify selfserv so that it will sleep one second
and retry PR_Bind if PR_Bind fails with the address
in use error.  Or we can modify our test scripts
so that selfserv uses an arbitrary unused port.

Status: ASSIGNED → RESOLVED

Closed: 22 years ago

Resolution: --- → WONTFIX

Sonja Mirtitsch

Comment 7

•

22 years ago

> Or we can modify our test scripts
> so that selfserv uses an arbitrary unused port.

I vote against this, because it will result in even more selfservs hanging on QA
machines after failed QA

Wan-Teh Chang

Assignee

Comment 8

•

22 years ago

Using a fixed port for selfserv has prevented us
from doing simultaneous NSS QA runs on the same
machine.

Another possible solution is to shut down selfserv
cleanly by sending a 'stop' command to it.

Wan-Teh Chang

Assignee

Comment 9

•

22 years ago

Attached patch Proposed patch #2: replace 'sleep 30' by looping until the port is freed — Details — Splinter Review

OK, I came up with a different solution.  Instead of
trying to figure out how to kill and wait for a
multithreaded process on Linux reliably, I simply
replace the 'sleep 30' command by a loop that tries
to bind to the port.  When we successfully bind to
the port, we know the previous selfserv process has
fully died and freed the port.

I added a -b (bindOnly) option to selfserv so that I
can use selfserv to do the "try bind".

A possible improvement is to try to bind to the port
for no more than a fixed number of times.  Another
possible improvement is to print a message before
each "try bind" so that output.log shows what's going
on behind the scene.

Please review and test this patch.  Thanks.

Ian McGreer

Comment 10

•

22 years ago

I'd like to suggest Yet Another Solution.  We could pass an argument to selfserv
that determines the number of connections it will handle.  After that number,
selfserv shuts down cleanly.  This has the advantage of a clean shutdown,
without the delay of sending a STOP request.

Wan-Teh Chang

Assignee

Comment 11

•

22 years ago

I am sure there is a reason why we are killing the
selfserv processes instead of cleanly shutting them
down.  I guess it is to avoid leftover selfserv
processes or hanging QA when there is a bug in
selfserv.

I think a clean shutdown solution (either by sending
a STOP request or by handling a specified number of
connections) is better.  But we would still need to
kill selfserv when it fails to shut down cleanly
because of a bug.  Between the two proposed clean
shutdown solutions, it seems that handling a specified
number of connections is more reliable and less
vulnerable to a bug in our code.

My proposed patch #2 is quite simple.  How about if
we use it at least as an interim solution?

Status: RESOLVED → REOPENED

Priority: P1 → P2

Resolution: WONTFIX → ---

Ian McGreer

Comment 12

•

22 years ago

I agree we still have to shutdown the server in the even the number of specified
connections was not completed due to a bug.

I think your patch is good.

Sonja Mirtitsch

Comment 13

•

22 years ago

> Using a fixed port for selfserv has prevented us
> from doing simultaneous NSS QA runs on the same
> machine.

I run 2 different tinderboxes plus nightly QA on an HP machine and a Solaris 8
machine without problems, just set the ENV PORT to a portnumber before starting
all.sh (ssl.sh) or use nssqa with -p <portnumber>. nssqa -t, as we start the
tinderbox qa will automatically pick a different portnumber too.

I think the loop is a good idea for now, but we should really implement Ian's
solution as soon as someone has the time.

Wan-Teh Chang

Assignee

Comment 14

•

22 years ago

I will check in my patch on Saturday to get some testing
done on it during the weekend.

Status: REOPENED → ASSIGNED

Wan-Teh Chang

Assignee

Comment 15

•

22 years ago

Comment on attachment 98094 [details] [diff] [review]
Proposed patch #2: replace 'sleep 30' by looping until the port is freed

I just checked in this patch to have it tested on Tinderbox over the weekend.

Sonja Mirtitsch

Comment 16

•

22 years ago

problems on todays backward compatibility tests, see

ftp://ftp.mozilla.org/pub/security/nss/daily_qa/20020911.1/bct/huey.2/results.html
ftp://ftp.mozilla.org/pub/security/nss/daily_qa/20020911.1/bct/huey.2/output.log

backward compatibility tests on Linux 2.4 run 331 scripts (do a sleep 30) and
binaries and todays libraries

init.sh init: Testing PATH
/share/builds/mccrel3/nss/nsstip/../nss331/builds/20010928.2.331-RTM/booboo_Solaris8/mozilla/dist/../security/nss/tests:/share/builds/mccrel3/nss/nsstip/../nss331/builds/20010928.2.331-RTM/booboo_Solaris8/mozilla/dist/Linux2.4_x86_glibc_PTH_OPT.OBJ/bin:/lib:/usr/lib:/bin:/sbin:/usr/bin:/usr/sbin:.:/u/svbld/bin:/tools/ns/bin:/usr/ccs/bin:/usr/dist/local/exe:/usr/bin/X11:/usr/audio/bin:/u/svbld/ns/securityqa
against LIB
/share/builds/mccrel3/nss/nsstip/builds/20020911.1/booboo_Solaris8/mozilla/dist/Linux2.4_x86_glibc_PTH_OPT.OBJ/lib


ssl.sh: TLS Require client auth (bad password) ----
selfserv -D -p 8443 -d ../server -n huey.red.iplanet.com \
         -w nss -r -r -i ../tests_pid.30698  &
selfserv started at Wed Sep 11 07:19:55 PDT 2002
tstclnt -p 8443 -h huey -q -d . <
/share/builds/mccrel3/nss/nss331/builds/20010928.2.331-RTM/booboo_Solaris8/mozilla/security/nss/tests/ssl/sslreq.txt

ssl.sh: Exit: 10 Fatal - selfserv process not detectable
/share/builds/mccrel3/nss/nsstip/../nss331/builds/20010928.2.331-RTM/booboo_Solaris8/mozilla/dist/../security/nss/tests/all.sh:
line 1:  7349 Terminated              selfserv -D -p ${PORT} -d ${R_SERVERDIR}
-n ${HOSTADDR} -w nss ${sparam} -i ${R_SERVERPID} $verbose  (wd:
/share/builds/mccrel3/nss/nsstip/builds/20020911.1/booboo_Solaris8/mozilla/tests_results/security/bct/huey.2/client)

Wan-Teh Chang

Assignee

Comment 17

•

22 years ago

Sonja, thanks a lot for reporting these failures.

The failures in backward compatibility tests cannot have
been caused by my checkin because I only modified init.sh,
ssl.sh, and selfserv.exe.

I think these were caused by the Linux 'ps' problem we
identified in http://bugzilla.mozilla.org/show_bug.cgi?id=119340#c38.
Maybe we should check in the "kill -0 $PID" fix into the
backward compatibility test scripts.

Wan-Teh Chang

Assignee

Comment 18

•

22 years ago

Marked the bug fixed.

Status: ASSIGNED → RESOLVED

Closed: 22 years ago → 22 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.