Closed
Bug 64695
Opened 24 years ago
Closed 24 years ago
Win2000 selfserver dies intermittantly (NT build only)
Categories
(NSS :: Tools, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
3.2
People
(Reporter: sonja.mirtitsch, Assigned: wtc)
Details
Attachments
(10 files)
6.53 KB,
text/html
|
Details | |
72.45 KB,
text/plain
|
Details | |
2.10 KB,
text/plain
|
Details | |
1.33 KB,
patch
|
Details | Diff | Splinter Review | |
1.03 KB,
patch
|
Details | Diff | Splinter Review | |
671 bytes,
patch
|
Details | Diff | Splinter Review | |
668 bytes,
patch
|
Details | Diff | Splinter Review | |
709 bytes,
patch
|
Details | Diff | Splinter Review | |
678 bytes,
patch
|
Details | Diff | Splinter Review | |
873 bytes,
patch
|
Details | Diff | Splinter Review |
selfserver is being started for various cipher coverage or client auth tests, testclient -q does not give an error message, there are no messages from the selfserver that either indicate that it is still alive or an error, no messages from the shell about a terminated background process. yesterday and today QA ran multiple times on clio - win2000 and sonjant - nt 4.0. the Win95 build tests fine on both platforms, the NT build tests fine on NT 4.0, but fails on Win 2000. I attach one result and one output.log, there are more in the following locations: http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010107.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.1/results.html http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010107.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.1/output.log http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010107.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.2/results.html http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010107.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.2/output.log http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.1/results.html http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.1/output.log http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.5/results.html http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.5/output.log http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.6/results.html http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.6/output.log http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.9/results.html http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.9/output.log http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.10/results.html http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010108.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.10/output.log
Reporter | ||
Comment 1•24 years ago
|
||
Reporter | ||
Comment 2•24 years ago
|
||
Reporter | ||
Updated•24 years ago
|
Severity: normal → major
Priority: -- → P1
Assignee | ||
Comment 3•24 years ago
|
||
This must be fixed before we ship 3.2.
Status: NEW → ASSIGNED
Target Milestone: --- → 3.2
Assignee | ||
Comment 4•24 years ago
|
||
I can't reproduce the test failures when I run all.sh manually on the 20010108.1 build, both WINNT5.0_DBG.OBJ and WINNT5.0_OPT.OBJ.
Reporter | ||
Comment 5•24 years ago
|
||
Today we had failures on clio again, all of them were in context with a bad password, but it still look as if the server is dying. http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010109.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.1/results.html http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010109.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.1/output.log http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010109.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.2/results.html http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010109.1/y2sun2_Solaris8/mozilla/tests_results/security/clio.2/output.log The QA suites are highly sensitive to changed environment variables and machine reconfigurations, and I still think that we need dedicated NSS QA machines, and QA can not be just a task on somebody's workstation or server, especially on PCs. If I am supposed to debugg the QA suites to see if I can find the problem in the QA wrapper I will need at least a few hours on clio.
Reporter | ||
Comment 6•24 years ago
|
||
Tested again, I have seen this problem when running all.sh manually as well, but it is highly intermittant, today's results in http://cindercone/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010110.1/ blowfish_NT4.0_Win95/mozilla/tests_results/security/ clio15, 16 and 17 are pure all.sh runs
Reporter | ||
Comment 7•24 years ago
|
||
I put in more info into the ssl script. looks like the server is still there, see the ps output before the start of the testclient, and also after ending of the testclient. ********************* SSL Client Auth **************************** ***** TLS Request don't require client auth (client does not provide auth) **** selfserv -v -p 8443 -d w:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.5/server -n clio.red.iplanet.com -w nss -r -i C:/DOCUME~1/wtc/LOCALS~1/Temp/tests_pid.808 & tstclnt -p 8443 -h clio -q -d w:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.5/client redir from W:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt W:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.5/client Testing if server is alive... Serverpidfile exists: C:/DOCUME~1/wtc/LOCALS~1/Temp/tests_pid.808 1072 1072 0:00 selfserv -v -p 8443 -d w:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.5/server -n clio.red.iplanet.com -w nss -r -i C:/DOCUME~1/wtc/LOCALS~1/Temp/tests_pid.808 972 0:00 grep 1072 selfserver process alive tstclnt -p 8443 -h clio -f -d w:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.5/client -w nss redir from W:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt tstclnt: write to SSL socket failed: Encountered end of file. [2] + Done(134) ? 848 Abort tstclnt FAILURE: test TLS Request don't require client auth (client does not provide auth) produced a returncode of 134, expected is 0 O_CRON = about to kill: 1072 0:00 selfserv -v -p 8443 -d w:/nss/nsstip/builds/20010111.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/clio.5/server -n clio.red.iplanet.com -w nss -r -i C:/DOCUME~1/wtc/LOCALS~1/Temp/tests_pid.808
Reporter | ||
Comment 8•24 years ago
|
||
Reporter | ||
Comment 9•24 years ago
|
||
It appears that if the machine is too slow, the first launched thread doesn't go thru the cleanup in time. The cleanup would set the variable "stopping" to 1. The main thread inquires this variable, if 0 and a thread exists (which is never the case on fast machines like sonjaNT) it waits for a condition variable which should be threadQ, but is stopQ. Probably this is a typo.
Assignee | ||
Comment 10•24 years ago
|
||
Sonja investigated and tracked down the bug, so I'm assigning to bug to her. This bug was introduced by the fix for Bug #62396.
Assignee: wtc → sonmi
Status: ASSIGNED → NEW
Reporter | ||
Comment 11•24 years ago
|
||
Comment 12•24 years ago
|
||
I think the line of code that reads: while ( !stopping && threadCount > 0 ) { should read: while ( !stopping || threadCount > 0 ) { That is, this loop should continue until stopping is true and threadCount <= 0. Actually, it is probably sufficient to just wait while (threadCount > 0) { Please try that.
Assignee | ||
Comment 13•24 years ago
|
||
Assignee | ||
Comment 14•24 years ago
|
||
Comments on my patch (id=22742): After reviewing the code, I agree with Nelson that we should just wait while (threadCount > 0). The variable 'stopping' is set and tested without holding a lock, so it should be declared volatile.
Reporter | ||
Comment 15•24 years ago
|
||
With the newest patch we had QA failures on GREASYBEAR - clio, sonjaNT and the solaris machines (kentuckyderby, y2sun2, galileo, columbus) were fine. The failure looks unrelated to the change to me, I will reboot GREASYBEAR and then try again, and attempt to debug tomorrow. for now the fix is not checked in. The Solaris failure that I saw at first had other reasons ***** TLS Request don't require client auth (bad password) **** selfserv -v -p 8443 -d w:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/server -n GREASYBEAR.red.iplanet.com -w nss -r -i C:/DOCUME~1/sonmi/LOCALS~1/Temp/tests_pid.908 & tstclnt -p 8443 -h GREASYBEAR -q -d w:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/client redir from W:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt W:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/client selfserv: About to call accept. selfserv: PR_Accept returned error -5961: TCP connection reset by peer. selfserv: Closing listen socket. tstclnt -p 8443 -h GREASYBEAR -f -d w:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/client -w bogus -n TestUser redir from W:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt tstclnt: unable to connect (poll): Connection refused by peer. [3] + Done(134) ? 860 Abort tstclnt FAILURE: test TLS Request don't require client auth (bad password) produced a returncode of 134, expected is 0 O_CRON = kill: ./ssl.sh 351: 812: no such process ***** SSL3 Require client auth (client auth) **** selfserv -v -p 8443 -d w:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/server -n GREASYBEAR.red.iplanet.com -w nss -r -r -i C:/DOCUME~1/sonmi/LOCALS~1/Temp/tests_pid.908 & tstclnt -p 8443 -h GREASYBEAR -q -d w:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/client redir from W:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt W:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/client selfserv: About to call accept. selfserv: PR_Accept returned error -5961: TCP connection reset by peer. selfserv: Closing listen socket. tstclnt -p 8443 -h GREASYBEAR -f -d w:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.5/client -T -n TestUser -w nss redir from W:/nss/nsstip/builds/20010117.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt tstclnt: unable to connect (poll): Connection refused by peer. [13] + Done(134) ? 608 Abort tstclnt FAILURE: test SSL3 Require client auth (client auth) produced a returncode of 134, expected is 0 O_CRON = kill: ./ssl.sh 351: 712: no such process
Comment 16•24 years ago
|
||
This log shows the server getting ECONNRESET errors from PR_ACCEPT. I'm not sure what's causing that, but I'm pretty sure it doesn't have anything to do with the patch to the while loop.
Assignee | ||
Comment 17•24 years ago
|
||
I think my "additional patch" has a logical error. It is based on the assumption that the while loop should read: while ( !stopping || threadCount > 0 ) { I think that assumption is false. I found that the do_accepts thread is usually blocked in PR_Accept. So the act of simply setting 'stopping' to 1 won't kick the do_accepts thread out of PR_Accept. A PR_Interrupt call would be needed. Therefore, if we wait while (threadCount > 0), threadCount will only drop down to 1 (the do_accepts thread is still blocked in PR_Accept). On the other hand, if we wait while (!stopping && threadCount > 0), we will exit the process as soon as 'stopping' is 1 even if there are still threads running. In fact, we might as well just wait while (!stopping). So I suggest that we only check in the change that replaces 1 by the enum constant 'rs_running'.
Reporter | ||
Comment 18•24 years ago
|
||
Unfortunately the fix didn't seem to be enough, the problem showed up on clio today again. I also noted that in the launch_thread function, it already waits for the threadQ condition variable under certain circumstances. launch_thread is the function called immidiately before the while loop that waits until the threads are gone.
Comment 19•24 years ago
|
||
Wan-teh, The method for shutting down selfserv cleanly has always been two steps: 1) send it a request for /stop 2) send it another request The first step sets "stopping". The second one causes the listen thread to exit, which allows the threadCount to go to zero. Allowing the while loop to exit when stopping is true can cause a crash, because it zeros out an array of values, incuding pointers, that may be in use by other threads. I continue to believe that the proper test is to loop until threadCount is zero.
Reporter | ||
Comment 20•24 years ago
|
||
Today Sonja compared the output logs of a successful run and a failed run of selfserv and pinpointed the problem area. After examining the source code of selfserv and tstclnt, I think I can explain the problem now. (Note that we already fixed the selfserv hanging problem. Now we have a different problem: selfserv sometimes exits prematurely due to the action of tstclnt -q.) In tstclnt, with the -q option, if the PR_Connect call returns successfully, tstclnt returns SECSuccess immediately without closing the socket. Therefore the newly established connection is aborted nongracefully. On the other side of the connection, if the timing is right (or rather, wrong), selfserv's PR_Accept call may fail with PR_CONNECT_RESET_ERROR. The do_accepts thread in selfserv exits the while loop if PR_Accept fails. I think this should be fixed on both sides. 1. In tstclnt, we should close the socket after PR_Connect completes successfully. This may or may not prevent the PR_CONNECT_RESET_ERROR on the selfserv side. I believe this depends on the implementation of the socket library. I need to consult my network programming book, which I left at work. 2. In selfserv, the do_accepts thread should not exit the while loop when it gets the PR_CONNECT_RESET_ERROR. I would implement both 1 & 2, but perhaps we can try just 1 first. Wan-Teh ------------------------------------------------- I implemented Wan-Teh's fix #2 in selfserver, now if the PR_Accept runs into a PR_CONNECT_RESET_ERROR it will ignore it, and continue the while loop and try a new accept. I am trying to implement the testclient -q solution now. Paste in part of the output.log ***** TLS Request don't require client auth (bad password) **** selfserv -v -p 8443 -d w:/nss/nsstip/builds/20010122.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.13/server -n GREASYBEAR.red.iplanet.com -w nss -r -i C:/DOCUME~1/sonmi/LOCALS~1/Temp/tests_pid.972 & tstclnt -p 8443 -h GREASYBEAR -q -d w:/nss/nsstip/builds/20010122.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.13/client redir from W:/nss/nsstip/builds/20010122.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt W:/nss/nsstip/builds/20010122.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.13/client tstclnt -p 8443 -h GREASYBEAR -f -d w:/nss/nsstip/builds/20010122.1/blowfish_NT4.0_Win95/mozilla/tests_results/security/GREASYBEAR.13/client -w bogus -n TestUser redir from W:/nss/nsstip/builds/20010122.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt selfserv: About to call accept. selfserv: PR_Accept returned error -5961: TCP connection reset by peer. Ignoring PR_CONNECT_RESET_ERROR error - continue selfserv: About to call accept. selfserv: 0 cache hits; 1 cache misses, 0 cache not reusable selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1 selfserv: subject DN: no certificate selfserv: issuer DN: no certificate
Reporter | ||
Comment 21•24 years ago
|
||
Reporter | ||
Comment 22•24 years ago
|
||
Reporter | ||
Comment 23•24 years ago
|
||
Comment 24•24 years ago
|
||
Why is the PR_Shutdown desirable?
Reporter | ||
Comment 25•24 years ago
|
||
Assignee | ||
Comment 26•24 years ago
|
||
Nelson: some NSPR users (including Javi or Sangjin of the PSM team) reported to me before that on some version of Windows, if one end of the connection merely closes its socket without calling shutdown first, the other end of the connection may get a ECONNRESET error. The ECONNRESET error goes away if they always call shutdown before close. Since selfserv is getting PR_CONNECT_RESET_ERROR from PR_Accept, I thought that this might be the same Winsock peculiarity that other NSPR users reported before, hence the suggestion of calling PR_Shutdown.
Reporter | ||
Comment 27•24 years ago
|
||
Assignee | ||
Comment 28•24 years ago
|
||
I reviewed the patch for selfserv (id=23233) and the patch for tstclnt (id=23220). You can check them in. After that, I'd like you to try (again) my patch (id=22742), which changes selfserv to wait while (threadCount > 0).
Comment 29•24 years ago
|
||
While testing selfserv tonight, I was able to reproduce a crash by sending a get request for /stop to the server twice. When I changed that while statement to while (threadCount > 0) { I was no longer able to reproduce the crash. I'm working on adding a multi-process server capability to selfserv, and I can fix this in that checkin if that's OK.
Assignee | ||
Comment 30•24 years ago
|
||
Nelson: yes, it's okay to change that while statement to while (threadCount > 0) { This is in my "Additional patch" (http://bugzilla.mozilla.org/showattachment.cgi?attach_id=22742). Do you also think that 'stopping' should be declared volatile?
Reporter | ||
Comment 31•24 years ago
|
||
*** Bug 63609 has been marked as a duplicate of this bug. ***
Reporter | ||
Comment 32•24 years ago
|
||
*** Bug 63609 has been marked as a duplicate of this bug. ***
Reporter | ||
Comment 33•24 years ago
|
||
From my point of view it is fixed no more Win2K problems seen - but I think wtc's additional patch has not been checked in. Also, I am not sure about the impact of Nelson's new changes. Reassigning to Wan-Teh.
Assignee: sonmi → wtc
Assignee | ||
Comment 34•24 years ago
|
||
My "additional patch" was checked in in revision 1.16 of selfserv.c when Nelson added multi-process server capability to selfserv. Marked the bug as fixed.
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•