Closed Bug 64695 Opened 24 years ago Closed 24 years ago

Win2000 selfserver dies intermittantly (NT build only)

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

3.2

People

(Reporter: sonja.mirtitsch, Assigned: wtc)

Details

Attachments

(10 files)

results.html, test 9 24 years ago Sonja Mirtitsch 6.53 KB, text/html		Details
output.log 24 years ago Sonja Mirtitsch 72.45 KB, text/plain		Details
patch to selfserv 24 years ago Sonja Mirtitsch 2.10 KB, text/plain		Details
real patch (I modified an old version of selfserver before) 24 years ago Sonja Mirtitsch 1.33 KB, patch		Details \| Diff \| Splinter Review
Additional patch, based on Nelson's comments. 24 years ago Wan-Teh Chang 1.03 KB, patch		Details \| Diff \| Splinter Review
patch 24 years ago Sonja Mirtitsch 671 bytes, patch		Details \| Diff \| Splinter Review
proposed testclient patch (tested 3 times) 24 years ago Sonja Mirtitsch 668 bytes, patch		Details \| Diff \| Splinter Review
new testclient patch, added PR_Shutdown 24 years ago Sonja Mirtitsch 709 bytes, patch		Details \| Diff \| Splinter Review
selfserver patch: reversed order for readability 24 years ago Sonja Mirtitsch 678 bytes, patch		Details \| Diff \| Splinter Review
selfserver patch changes requested by Nelson 24 years ago Sonja Mirtitsch 873 bytes, patch		Details \| Diff \| Splinter Review

Sonja Mirtitsch

Reporter

Description

•

24 years ago

selfserver is being started for various cipher coverage or client auth tests,
testclient -q does not give an error message, there are no messages from the
selfserver that either indicate that it is still alive or an error, no messages
from the shell about a terminated background process.
yesterday and today QA ran multiple times on clio - win2000 and sonjant - nt
4.0. the Win95 build tests fine on both platforms, the NT build tests fine on NT
4.0, but fails on Win 2000.

I attach one result and one output.log, there are more in the following locations:

http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010107.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.1/results.html
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010107.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.1/output.log
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010107.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.2/results.html
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010107.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.2/output.log


http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.1/results.html
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.1/output.log
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.5/results.html
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.5/output.log
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.6/results.html
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.6/output.log

http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.9/results.html
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.9/output.log
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.10/results.html
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.10/output.log

Sonja Mirtitsch

Reporter

Comment 1

•

24 years ago

Attached file results.html, test 9 — Details

Sonja Mirtitsch

Reporter

Comment 2

•

24 years ago

Attached file output.log — Details

Sonja Mirtitsch

Reporter

Updated

•

24 years ago

Severity: normal → major

Priority: -- → P1

Wan-Teh Chang

Assignee

Comment 3

•

24 years ago

This must be fixed before we ship 3.2.

Status: NEW → ASSIGNED

Target Milestone: --- → 3.2

Wan-Teh Chang

Assignee

Comment 4

•

24 years ago

I can't reproduce the test failures when I run all.sh manually
on the 20010108.1 build, both WINNT5.0_DBG.OBJ and WINNT5.0_OPT.OBJ.

Sonja Mirtitsch

Reporter

Comment 5

•

24 years ago

Today we had failures on clio again, all of them were in context with a bad
password, but it still look as if the server is dying.

http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010109.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.1/results.html
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010109.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.1/output.log
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010109.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.2/results.html
http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010109.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.2/output.log

The QA suites are highly sensitive to changed environment variables and machine
reconfigurations, and I still think that we need dedicated NSS QA machines, and
QA can not be just a task on somebody's workstation or server, especially on PCs.
If I am supposed to debugg the QA suites to see if I can find the problem in the
QA wrapper I will need at least a few hours on clio.

Sonja Mirtitsch

Reporter

Comment 6

•

24 years ago

Tested again, I have seen this problem when running all.sh manually as well, but 
it is highly intermittant, today's results in

http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010110.1/
blowfish_NT4.0_Win95/mozilla/tests_results/security/

clio15, 16 and 17 are pure all.sh runs

Sonja Mirtitsch

Reporter

Comment 7

•

24 years ago

I put in more info into the ssl script. looks like the server is still there,
see the ps output before the start of the testclient, and also after ending of
the testclient.

********************* SSL Client Auth ****************************
***** TLS Request don't require client auth (client does not provide auth) ****
selfserv -v -p 8443 -d
w:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.5/server
-n clio.red.iplanet.com -w nss -r -i C:/DOCUME~1/wtc/LOCALS~1/Temp/tests_pid.808
& tstclnt -p 8443 -h clio -q -d
w:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.5/client
redir from
W:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
W:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.5/client
Testing if server is alive...
Serverpidfile exists: C:/DOCUME~1/wtc/LOCALS~1/Temp/tests_pid.808 1072
 1072  0:00 selfserv -v -p 8443 -d
w:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.5/server
-n clio.red.iplanet.com -w nss -r -i C:/DOCUME~1/wtc/LOCALS~1/Temp/tests_pid.808
  972  0:00 grep 1072
selfserver process alive
tstclnt -p 8443 -h clio -f -d
w:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.5/client
-w nss redir from
W:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
tstclnt: write to SSL socket failed: Encountered end of file.
[2] + Done(134) ?
  848	Abort	tstclnt
FAILURE: test TLS Request don't require client auth (client does not provide
auth) produced a returncode of 134, expected is 0 O_CRON =
about to kill:
 1072  0:00 selfserv -v -p 8443 -d
w:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.5/server
-n clio.red.iplanet.com -w nss -r -i C:/DOCUME~1/wtc/LOCALS~1/Temp/tests_pid.808

Sonja Mirtitsch

Reporter

Comment 8

•

24 years ago

Attached file patch to selfserv — Details

Sonja Mirtitsch

Reporter

Comment 9

•

24 years ago

It appears that if the machine is too slow, the first launched thread doesn't go
thru the cleanup in time. The cleanup would set the variable "stopping" to 1.
The main thread inquires this variable, if 0 and a thread exists (which is never
the case on fast machines like sonjaNT) it waits for a condition variable which
should be threadQ, but is stopQ. Probably this is a typo.

Wan-Teh Chang

Assignee

Comment 10

•

24 years ago

Sonja investigated and tracked down the bug, so I'm
assigning to bug to her.  This bug was introduced by
the fix for Bug #62396.

Assignee: wtc → sonmi

Status: ASSIGNED → NEW

Sonja Mirtitsch

Reporter

Comment 11

•

24 years ago

Attached patch real patch (I modified an old version of selfserver before) — Details — Splinter Review

Nelson Bolyard (seldom reads bugmail)

Comment 12

•

24 years ago

I think the line of code that reads:
        while ( !stopping && threadCount > 0 ) {
should read:
        while ( !stopping || threadCount > 0 ) {
That is, this loop should continue until 
stopping is true and threadCount <= 0.  
Actually, it is probably sufficient to just wait 
        while (threadCount > 0) {
Please try that.

Wan-Teh Chang

Assignee

Comment 13

•

24 years ago

Attached patch Additional patch, based on Nelson's comments. — Details — Splinter Review

Wan-Teh Chang

Assignee

Comment 14

•

24 years ago

Comments on my patch (id=22742):
After reviewing the code, I agree with Nelson that
we should just wait while (threadCount > 0).  The
variable 'stopping' is set and tested without holding
a lock, so it should be declared volatile.

Sonja Mirtitsch

Reporter

Comment 15

•

24 years ago

With the newest patch we had QA failures on GREASYBEAR - clio, sonjaNT and the
solaris machines (kentuckyderby, y2sun2, galileo, columbus) were fine.
The failure looks unrelated to the change to me, I will reboot GREASYBEAR and
then try again, and attempt to debug tomorrow. for now the fix is not checked
in. The Solaris failure that I saw at first had other reasons

***** TLS Request don't require client auth (bad password) ****
selfserv -v -p 8443 -d
w:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/server
-n GREASYBEAR.red.iplanet.com -w nss -r -i
C:/DOCUME~1/sonmi/LOCALS~1/Temp/tests_pid.908 & tstclnt -p 8443 -h GREASYBEAR -q
-d
w:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/client
redir from
W:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
W:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/client



selfserv: About to call accept.
selfserv: PR_Accept returned error -5961:
TCP connection reset by peer.
selfserv: Closing listen socket.
tstclnt -p 8443 -h GREASYBEAR -f -d
w:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/client
-w bogus -n TestUser redir from
W:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
tstclnt: unable to connect (poll): Connection refused by peer.
[3] + Done(134) ?
  860	Abort	tstclnt
FAILURE: test TLS Request don't require client auth (bad password) produced a
returncode of 134, expected is 0 O_CRON =
kill: ./ssl.sh 351: 812: no such process


***** SSL3 Require client auth (client auth) ****
selfserv -v -p 8443 -d
w:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/server
-n GREASYBEAR.red.iplanet.com -w nss -r -r -i
C:/DOCUME~1/sonmi/LOCALS~1/Temp/tests_pid.908 & tstclnt -p 8443 -h GREASYBEAR -q
-d
w:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/client
redir from
W:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
W:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/client



selfserv: About to call accept.
selfserv: PR_Accept returned error -5961:
TCP connection reset by peer.
selfserv: Closing listen socket.
tstclnt -p 8443 -h GREASYBEAR -f -d
w:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/client
-T -n TestUser -w nss redir from
W:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
tstclnt: unable to connect (poll): Connection refused by peer.
[13] + Done(134) ?
  608	Abort	tstclnt
FAILURE: test SSL3 Require client auth (client auth) produced a returncode of
134, expected is 0 O_CRON =
kill: ./ssl.sh 351: 712: no such process

Nelson Bolyard (seldom reads bugmail)

Comment 16

•

24 years ago

This log shows the server getting ECONNRESET errors from PR_ACCEPT.
I'm not sure what's causing that, but I'm pretty sure it doesn't 
have anything to do with the patch to the while loop.

Wan-Teh Chang

Assignee

Comment 17

•

24 years ago

I think my "additional patch" has a logical error.
It is based on the assumption that the while loop
should read:
    while ( !stopping || threadCount > 0 ) {
I think that assumption is false.

I found that the do_accepts thread is usually
blocked in PR_Accept.  So the act of simply
setting 'stopping' to 1 won't kick the do_accepts
thread out of PR_Accept.  A PR_Interrupt call
would be needed.

Therefore, if we wait while (threadCount > 0),
threadCount will only drop down to 1 (the
do_accepts thread is still blocked in PR_Accept).
On the other hand, if we wait
while (!stopping && threadCount > 0), we will
exit the process as soon as 'stopping' is 1
even if there are still threads running.  In
fact, we might as well just wait while (!stopping).

So I suggest that we only check in the change
that replaces 1 by the enum constant 'rs_running'.

Sonja Mirtitsch

Reporter

Comment 18

•

24 years ago

Unfortunately the fix didn't seem to be enough, the problem showed up on clio
today again.
I also noted that in the launch_thread function, it already waits for the
threadQ condition variable under certain circumstances. launch_thread is the
function called immidiately before the while loop that waits until the threads
are gone.

Nelson Bolyard (seldom reads bugmail)

Comment 19

•

24 years ago

Wan-teh,

The method for shutting down selfserv cleanly has always been
two steps:
1) send it a request for /stop
2) send it another request

The first step sets "stopping".  The second one causes the 
listen thread to exit, which allows the threadCount to go
to zero.

Allowing the while loop to exit when stopping is true 
can cause a crash, because it zeros out an array of values,
incuding pointers, that may be in use by other threads.

I continue to believe that the proper test is to loop
until threadCount is zero.

Sonja Mirtitsch

Reporter

Comment 20

•

24 years ago

Today Sonja compared the output logs of a successful
run and a failed run of selfserv and pinpointed the
problem area. After examining the source code of
selfserv and tstclnt, I think I can explain the
problem now. (Note that we already fixed the
selfserv hanging problem. Now we have a different problem:
selfserv sometimes exits prematurely due to the action
of tstclnt -q.)

In tstclnt, with the -q option, if the PR_Connect
call returns successfully, tstclnt returns SECSuccess
immediately without closing the socket. Therefore
the newly established connection is aborted nongracefully.
On the other side of the connection, if the timing is
right (or rather, wrong), selfserv's PR_Accept call
may fail with PR_CONNECT_RESET_ERROR. The do_accepts
thread in selfserv exits the while loop if PR_Accept
fails.

I think this should be fixed on both sides.

1. In tstclnt, we should close the socket after
PR_Connect completes successfully. This may
or may not prevent the PR_CONNECT_RESET_ERROR
on the selfserv side. I believe this depends
on the implementation of the socket library.
I need to consult my network programming book,
which I left at work.
2. In selfserv, the do_accepts thread should not
exit the while loop when it gets the
PR_CONNECT_RESET_ERROR.

I would implement both 1 & 2, but perhaps we can
try just 1 first.

Wan-Teh
-------------------------------------------------

I implemented Wan-Teh's fix #2 in selfserver, now if the PR_Accept runs into a
PR_CONNECT_RESET_ERROR it will ignore it, and continue the while loop and try a
new accept.
I am trying to implement the testclient -q solution now.

Paste in part of the output.log

***** TLS Request don't require client auth (bad password) ****
selfserv -v -p 8443 -d
w:/nss/nsstip/builds/20010122.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.13/server
-n GREASYBEAR.red.iplanet.com -w nss -r -i
C:/DOCUME~1/sonmi/LOCALS~1/Temp/tests_pid.972 & tstclnt -p 8443 -h GREASYBEAR -q
-d
w:/nss/nsstip/builds/20010122.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.13/client
redir from
W:/nss/nsstip/builds/20010122.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
W:/nss/nsstip/builds/20010122.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.13/client
tstclnt -p 8443 -h GREASYBEAR -f -d
w:/nss/nsstip/builds/20010122.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.13/client
-w bogus -n TestUser redir from
W:/nss/nsstip/builds/20010122.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt

selfserv: About to call accept.
selfserv: PR_Accept returned error -5961:
TCP connection reset by peer.
Ignoring PR_CONNECT_RESET_ERROR error - continue

selfserv: About to call accept.
selfserv: 0 cache hits; 1 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
selfserv: subject DN: no certificate
selfserv: issuer DN: no certificate

Sonja Mirtitsch

Reporter

Comment 21

•

24 years ago

Attached patch patch — Details — Splinter Review

Sonja Mirtitsch

Reporter

Comment 22

•

24 years ago

Attached patch proposed testclient patch (tested 3 times) — Details — Splinter Review

Sonja Mirtitsch

Reporter

Comment 23

•

24 years ago

Attached patch new testclient patch, added PR_Shutdown — Details — Splinter Review

Nelson Bolyard (seldom reads bugmail)

Comment 24

•

24 years ago

Why is the PR_Shutdown desirable?

Sonja Mirtitsch

Reporter

Comment 25

•

24 years ago

Attached patch selfserver patch: reversed order for readability — Details — Splinter Review

Wan-Teh Chang

Assignee

Comment 26

•

24 years ago

Nelson: some NSPR users (including Javi or Sangjin of the PSM team)
reported to me before that on some version of Windows, if one end of
the connection merely closes its socket without calling shutdown first,
the other end of the connection may get a ECONNRESET error.  The
ECONNRESET error goes away if they always call shutdown before close.

Since selfserv is getting PR_CONNECT_RESET_ERROR from PR_Accept, I
thought that this might be the same Winsock peculiarity that other
NSPR users reported before, hence the suggestion of calling PR_Shutdown.

Sonja Mirtitsch

Reporter

Comment 27

•

24 years ago

Attached patch selfserver patch changes requested by Nelson — Details — Splinter Review

Wan-Teh Chang

Assignee

Comment 28

•

24 years ago

I reviewed the patch for selfserv (id=23233)
and the patch for tstclnt (id=23220).  You
can check them in.

After that, I'd like you to try (again) my
patch (id=22742), which changes selfserv to
wait while (threadCount > 0).

Nelson Bolyard (seldom reads bugmail)

Comment 29

•

24 years ago

While testing selfserv tonight, I was able to reproduce a crash
by sending a get request for /stop to the server twice.
When I changed that while statement to 
        while (threadCount > 0) {
I was no longer able to reproduce the crash.

I'm working on adding a multi-process server capability to 
selfserv, and I can fix this in that checkin if that's OK.

Wan-Teh Chang

Assignee

Comment 30

•

24 years ago

Nelson: yes, it's okay to change that while
statement to
    while (threadCount > 0) {

This is in my "Additional patch"
(http://bugzilla.mozilla.org/showattachment.cgi?attach_id=22742).
Do you also think that 'stopping' should
be declared volatile?

Sonja Mirtitsch

Reporter

Comment 31

•

24 years ago

*** Bug 63609 has been marked as a duplicate of this bug. ***

Sonja Mirtitsch

Reporter

Comment 32

•

24 years ago

*** Bug 63609 has been marked as a duplicate of this bug. ***

Sonja Mirtitsch

Reporter

Comment 33

•

24 years ago

From my point of view it is fixed no more Win2K problems seen - but I think
wtc's additional patch has not been checked in. Also, I am not sure about the
impact of Nelson's new changes. Reassigning to Wan-Teh.

Assignee: sonmi → wtc

Wan-Teh Chang

Assignee

Comment 34

•

24 years ago

My "additional patch" was checked in in revision 1.16 of selfserv.c
when Nelson added multi-process server capability to selfserv.

Marked the bug as fixed.

Status: NEW → RESOLVED

Closed: 24 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.