Closed Bug 64695 Opened 24 years ago Closed 24 years ago

Win2000 selfserver dies intermittantly (NT build only)

Categories

(NSS :: Tools, defect, P1)

x86
Windows 2000

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sonja.mirtitsch, Assigned: wtc)

Details

Attachments

(10 files)

selfserver is being started for various cipher coverage or client auth tests,
testclient -q does not give an error message, there are no messages from the
selfserver that either indicate that it is still alive or an error, no messages
from the shell about a terminated background process.
yesterday and today QA ran multiple times on clio - win2000 and sonjant - nt
4.0. the Win95 build tests fine on both platforms, the NT build tests fine on NT
4.0, but fails on Win 2000.

I attach one result and one output.log, there are more in the following locations:

http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010107.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.1/results.html
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010107.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.1/output.log
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010107.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.2/results.html
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010107.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.2/output.log


http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.1/results.html
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.1/output.log
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.5/results.html
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.5/output.log
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.6/results.html
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.6/output.log

http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.9/results.html
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.9/output.log
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.10/results.html
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.10/output.log
Attached file results.html, test 9
Attached file output.log
Severity: normal → major
Priority: -- → P1
This must be fixed before we ship 3.2.
Status: NEW → ASSIGNED
Target Milestone: --- → 3.2
I can't reproduce the test failures when I run all.sh manually
on the 20010108.1 build, both WINNT5.0_DBG.OBJ and WINNT5.0_OPT.OBJ.
Today we had failures on clio again, all of them were in context with a bad
password, but it still look as if the server is dying.

http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010109.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.1/results.html
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010109.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.1/output.log
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010109.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.2/results.html
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010109.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.2/output.log

The QA suites are highly sensitive to changed environment variables and machine
reconfigurations, and I still think that we need dedicated NSS QA machines, and
QA can not be just a task on somebody's workstation or server, especially on PCs.
If I am supposed to debugg the QA suites to see if I can find the problem in the
QA wrapper I will need at least a few hours on clio.
Tested again, I have seen this problem when running all.sh manually as well, but 
it is highly intermittant, today's results in

http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010110.1/
blowfish_NT4.0_Win95/mozilla/tests_results/security/

clio15, 16 and 17 are pure all.sh runs
I put in more info into the ssl script. looks like the server is still there,
see the ps output before the start of the testclient, and also after ending of
the testclient.

********************* SSL Client Auth ****************************
***** TLS Request don't require client auth (client does not provide auth) ****
selfserv -v -p 8443 -d
w:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.5/server
-n clio.red.iplanet.com -w nss -r -i C:/DOCUME~1/wtc/LOCALS~1/Temp/tests_pid.808
& tstclnt -p 8443 -h clio -q -d
w:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.5/client
redir from
W:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
W:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.5/client
Testing if server is alive...
Serverpidfile exists: C:/DOCUME~1/wtc/LOCALS~1/Temp/tests_pid.808 1072
 1072  0:00 selfserv -v -p 8443 -d
w:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.5/server
-n clio.red.iplanet.com -w nss -r -i C:/DOCUME~1/wtc/LOCALS~1/Temp/tests_pid.808
  972  0:00 grep 1072
selfserver process alive
tstclnt -p 8443 -h clio -f -d
w:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.5/client
-w nss redir from
W:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
tstclnt: write to SSL socket failed: Encountered end of file.
[2] + Done(134) ?
  848	Abort	tstclnt
FAILURE: test TLS Request don't require client auth (client does not provide
auth) produced a returncode of 134, expected is 0 O_CRON =
about to kill:
 1072  0:00 selfserv -v -p 8443 -d
w:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.5/server
-n clio.red.iplanet.com -w nss -r -i C:/DOCUME~1/wtc/LOCALS~1/Temp/tests_pid.808
Attached file patch to selfserv
It appears that if the machine is too slow, the first launched thread doesn't go
thru the cleanup in time. The cleanup would set the variable "stopping" to 1.
The main thread inquires this variable, if 0 and a thread exists (which is never
the case on fast machines like sonjaNT) it waits for a condition variable which
should be threadQ, but is stopQ. Probably this is a typo.
Sonja investigated and tracked down the bug, so I'm
assigning to bug to her.  This bug was introduced by
the fix for Bug #62396.
Assignee: wtc → sonmi
Status: ASSIGNED → NEW
I think the line of code that reads:
        while ( !stopping && threadCount > 0 ) {
should read:
        while ( !stopping || threadCount > 0 ) {
That is, this loop should continue until 
stopping is true and threadCount <= 0.  
Actually, it is probably sufficient to just wait 
        while (threadCount > 0) {
Please try that.
Comments on my patch (id=22742):
After reviewing the code, I agree with Nelson that
we should just wait while (threadCount > 0).  The
variable 'stopping' is set and tested without holding
a lock, so it should be declared volatile.
With the newest patch we had QA failures on GREASYBEAR - clio, sonjaNT and the
solaris machines (kentuckyderby, y2sun2, galileo, columbus) were fine.
The failure looks unrelated to the change to me, I will reboot GREASYBEAR and
then try again, and attempt to debug tomorrow. for now the fix is not checked
in. The Solaris failure that I saw at first had other reasons

***** TLS Request don't require client auth (bad password) ****
selfserv -v -p 8443 -d
w:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/server
-n GREASYBEAR.red.iplanet.com -w nss -r -i
C:/DOCUME~1/sonmi/LOCALS~1/Temp/tests_pid.908 & tstclnt -p 8443 -h GREASYBEAR -q
-d
w:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/client
redir from
W:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
W:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/client



selfserv: About to call accept.
selfserv: PR_Accept returned error -5961:
TCP connection reset by peer.
selfserv: Closing listen socket.
tstclnt -p 8443 -h GREASYBEAR -f -d
w:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/client
-w bogus -n TestUser redir from
W:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
tstclnt: unable to connect (poll): Connection refused by peer.
[3] + Done(134) ?
  860	Abort	tstclnt
FAILURE: test TLS Request don't require client auth (bad password) produced a
returncode of 134, expected is 0 O_CRON =
kill: ./ssl.sh 351: 812: no such process


***** SSL3 Require client auth (client auth) ****
selfserv -v -p 8443 -d
w:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/server
-n GREASYBEAR.red.iplanet.com -w nss -r -r -i
C:/DOCUME~1/sonmi/LOCALS~1/Temp/tests_pid.908 & tstclnt -p 8443 -h GREASYBEAR -q
-d
w:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/client
redir from
W:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
W:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/client



selfserv: About to call accept.
selfserv: PR_Accept returned error -5961:
TCP connection reset by peer.
selfserv: Closing listen socket.
tstclnt -p 8443 -h GREASYBEAR -f -d
w:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/client
-T -n TestUser -w nss redir from
W:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
tstclnt: unable to connect (poll): Connection refused by peer.
[13] + Done(134) ?
  608	Abort	tstclnt
FAILURE: test SSL3 Require client auth (client auth) produced a returncode of
134, expected is 0 O_CRON =
kill: ./ssl.sh 351: 712: no such process
This log shows the server getting ECONNRESET errors from PR_ACCEPT.
I'm not sure what's causing that, but I'm pretty sure it doesn't 
have anything to do with the patch to the while loop.
I think my "additional patch" has a logical error.
It is based on the assumption that the while loop
should read:
    while ( !stopping || threadCount > 0 ) {
I think that assumption is false.

I found that the do_accepts thread is usually
blocked in PR_Accept.  So the act of simply
setting 'stopping' to 1 won't kick the do_accepts
thread out of PR_Accept.  A PR_Interrupt call
would be needed.

Therefore, if we wait while (threadCount > 0),
threadCount will only drop down to 1 (the
do_accepts thread is still blocked in PR_Accept).
On the other hand, if we wait
while (!stopping && threadCount > 0), we will
exit the process as soon as 'stopping' is 1
even if there are still threads running.  In
fact, we might as well just wait while (!stopping).

So I suggest that we only check in the change
that replaces 1 by the enum constant 'rs_running'.
Unfortunately the fix didn't seem to be enough, the problem showed up on clio
today again.
I also noted that in the launch_thread function, it already waits for the
threadQ condition variable under certain circumstances. launch_thread is the
function called immidiately before the while loop that waits until the threads
are gone.
Wan-teh,

The method for shutting down selfserv cleanly has always been
two steps:
1) send it a request for /stop
2) send it another request

The first step sets "stopping".  The second one causes the 
listen thread to exit, which allows the threadCount to go
to zero.

Allowing the while loop to exit when stopping is true 
can cause a crash, because it zeros out an array of values,
incuding pointers, that may be in use by other threads.

I continue to believe that the proper test is to loop
until threadCount is zero.
Today Sonja compared the output logs of a successful
run and a failed run of selfserv and pinpointed the
problem area.  After examining the source code of
selfserv and tstclnt, I think I can explain the
problem now.   (Note that we already fixed the
selfserv hanging problem.  Now we have a different problem:
selfserv sometimes exits prematurely due to the action
of tstclnt -q.)

In tstclnt, with the -q option, if the PR_Connect
call returns successfully, tstclnt returns SECSuccess
immediately without closing the socket.  Therefore
the newly established connection is aborted nongracefully.
On the other side of the connection, if the timing is
right (or rather, wrong), selfserv's PR_Accept call
may fail with PR_CONNECT_RESET_ERROR.  The do_accepts
thread in selfserv exits the while loop if PR_Accept
fails.

I think this should be fixed on both sides.

1. In tstclnt, we should close the socket after
   PR_Connect completes successfully.  This may
   or may not prevent the PR_CONNECT_RESET_ERROR
   on the selfserv side.  I believe this depends
   on the implementation of the socket library.
   I need to consult my network programming book,
   which I left at work.
2. In selfserv, the do_accepts thread should not
   exit the while loop when it gets the
   PR_CONNECT_RESET_ERROR.

I would implement both 1 & 2, but perhaps we can
try just 1 first.

Wan-Teh
-------------------------------------------------

I implemented Wan-Teh's fix #2 in selfserver, now if the PR_Accept runs into a
PR_CONNECT_RESET_ERROR it will ignore it, and continue the while loop and try a
new accept.
I am trying to implement the testclient -q solution now.

Paste in part of the output.log

***** TLS Request don't require client auth (bad password) ****
selfserv -v -p 8443 -d
w:/nss/nsstip/builds/20010122.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.13/server
-n GREASYBEAR.red.iplanet.com -w nss -r -i
C:/DOCUME~1/sonmi/LOCALS~1/Temp/tests_pid.972 & tstclnt -p 8443 -h GREASYBEAR -q
-d
w:/nss/nsstip/builds/20010122.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.13/client
redir from
W:/nss/nsstip/builds/20010122.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
W:/nss/nsstip/builds/20010122.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.13/client
tstclnt -p 8443 -h GREASYBEAR -f -d
w:/nss/nsstip/builds/20010122.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.13/client
-w bogus -n TestUser redir from
W:/nss/nsstip/builds/20010122.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt



selfserv: About to call accept.
selfserv: PR_Accept returned error -5961:
TCP connection reset by peer.
Ignoring PR_CONNECT_RESET_ERROR error - continue



selfserv: About to call accept.
selfserv: 0 cache hits; 1 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
selfserv: subject DN: no certificate
selfserv: issuer  DN: no certificate
Attached patch patchSplinter Review
Why is the PR_Shutdown desirable?
Nelson: some NSPR users (including Javi or Sangjin of the PSM team)
reported to me before that on some version of Windows, if one end of
the connection merely closes its socket without calling shutdown first,
the other end of the connection may get a ECONNRESET error.  The
ECONNRESET error goes away if they always call shutdown before close.

Since selfserv is getting PR_CONNECT_RESET_ERROR from PR_Accept, I
thought that this might be the same Winsock peculiarity that other
NSPR users reported before, hence the suggestion of calling PR_Shutdown.
I reviewed the patch for selfserv (id=23233)
and the patch for tstclnt (id=23220).  You
can check them in.

After that, I'd like you to try (again) my
patch (id=22742), which changes selfserv to
wait while (threadCount > 0).
While testing selfserv tonight, I was able to reproduce a crash
by sending a get request for /stop to the server twice.
When I changed that while statement to 
        while (threadCount > 0) {
I was no longer able to reproduce the crash.

I'm working on adding a multi-process server capability to 
selfserv, and I can fix this in that checkin if that's OK.
Nelson: yes, it's okay to change that while
statement to
    while (threadCount > 0) {

This is in my "Additional patch"
(http://bugzilla.mozilla.org/showattachment.cgi?attach_id=22742).
Do you also think that 'stopping' should
be declared volatile?
*** Bug 63609 has been marked as a duplicate of this bug. ***
*** Bug 63609 has been marked as a duplicate of this bug. ***
From my point of view it is fixed no more Win2K problems seen - but I think
wtc's additional patch has not been checked in. Also, I am not sure about the
impact of Nelson's new changes. Reassigning to Wan-Teh.
Assignee: sonmi → wtc
My "additional patch" was checked in in revision 1.16 of selfserv.c
when Nelson added multi-process server capability to selfserv.

Marked the bug as fixed.
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: