Closed Bug 124447 Opened 23 years ago Closed 23 years ago

SSL server stress test ends up in deadlock on Win2K

Categories

(NSS :: Libraries, defect, P1)

x86
Windows 2000
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: julien.pierre, Assigned: wtc)

Details

With client auth required, NES6 ended up at 0% CPU and not handling new connections after only 6 minutes of stress.
Priority: -- → P1
Target Milestone: --- → 3.4
I just reproduced it even without client auth, after just 10 minutes. The NT server deadlocked again. So it is not specific to client auth.
OS: other → Windows 2000
Hardware: Other → PC
*** Bug 124337 has been marked as a duplicate of this bug. ***
The exact test case on the client side that deadlocked the server was:

D:\60\WINNT_5.0_depend\ns\server\internal\B1\WINNT4.0_DBG.OBJ\tests\httptests>httptest -h oops:1890 -s -H 1 -g COMMON -g SSL -L 5 -p 128 -x "url51" -e 3600 -P

where oops:1890 was an SSL listen socket without client auth on my Win2k system.
I attached to the daemon process when the deadlock was reproduced for the third time. I don't have anything like pstack on NT, so I looked with the Debug/Threads menu, setting focus on all of the threads that were shown. None of the stacks were in NSS code. Most of them were in NSPR functions like PR_Accept, PR_WaitCondVar, which is expected for the web server. I suspect that the debugger is not showing me all the actual threads since fibers are used on NT. I'm going to disable fibers through the NSPR environment variable and try to reproduce it again.
I set NSPR_NATIVE_THREADS_ONLY=1 and restarted the daemon independently (not as a service). The test has been running for 1h30 and the deadlock has not been reproduced. It might be that the deadlock is specific to running with fibers.
After a whole weekend of running without fibers, the server did not deadlock. I still saw plenty of SSL errors on PR_Recv and PR_Send on the client side, but the server was still accessible. There were no errors reported in the server log.
FYI, you can test fibers with selfserv on NT/Win2K with the -l option. Make sure you are using the NT version of NSPR.
Just for a data point: I ran with WinNT, using NSPR 4.1.1 built from source for the NT target. I didn't realize I was building with 4.1.1; I will try again with 4.1.2. I used selfserv with model sockets and local threads:

selfserv -n localhost.red.iplanet.com -p 8888 -w nss -t 100 -d ext_server -l -m -D

The client was doing full handshakes:

strsclnt -n ExtendedSSLUser -p 8888 -d ext_client -c 50000 -d -w nss -t 40 -N localhost

I got through 50000 connections without any error, and the server was still responsive.
Ian, I just thought of one important difference between the web server and selfserv on NT: the web server calls PR_SetConcurrency(). That function is especially important when using fibers. I suggest that you modify your selfserv.c as follows. I am running selfserv (with this modification) on a 4-CPU Windows 2000 box and the web server GAT client on a Solaris box. I will report back what I find out.

Index: selfserv.c
===================================================================
RCS file: /cvsroot/mozilla/security/nss/cmd/selfserv/selfserv.c,v
retrieving revision 1.42
diff -u -r1.42 selfserv.c
--- selfserv.c 6 Feb 2002 01:38:06 -0000 1.42
+++ selfserv.c 12 Feb 2002 21:55:14 -0000
@@ -1472,7 +1472,7 @@
 case 'i': pidFile = optstate->value; break;
- case 'l': useLocalThreads = PR_TRUE; break;
+ case 'l': useLocalThreads = PR_TRUE; PR_SetConcurrency(4); break;
 case 'm': useModelSocket = PR_TRUE; break;
Status: NEW → ASSIGNED
I don't think the web server is calling PR_SetConcurrency. I did a search in ns/netsite/lib in the source tree and did not find it. This may be a bug in the web server. In any case, I had originally reproduced the bug on my Win2k system which is a single-CPU PII 450. I just built another NSS tree and will see if I can reproduce the problem shortly.
I haven't been able to reproduce it today in about 100,000 sessions (roughly an hour).
I was able to cause selfserv to hang. Unfortunately there is no debugger installed on the machine I used, so I couldn't attach the debugger to the hung selfserv process.

Here is my configuration. I checked out the tip of NSS yesterday morning and did a debug build on Windows 2000. I was running selfserv (modified to add a PR_SetConcurrency(4) call) on a 4-CPU Windows 2000 box with this command line:

selfserv -n Server-Cert -p 8880 -w enterprise -t 100 -d . -l -m -D

I was using the cert and key databases from web server 6.0. I was running the web server 6.0 httptest client (optimized build, with NSS 3.3.1) on a Solaris box with this command line:

httptest -h area51:8880 -s -H 1 -g COMMON -g SSL -L 5 -p 128 -x "url51" -e 3600 -P

(area51 is the host name of the Windows 2000 box running selfserv.)
Wan-Teh, I have not been able to get the server to hang on NT in my own stress tests over the past week. Can you still reproduce this problem with selfserv on the SMP system you were using?
Julien, I haven't run any new NSS tests on that SMP system since then because it doesn't have a debugger. However, for verifying that this bug cannot be reproduced, I can use that machine. I will do that tomorrow.
I just started the same test as I described in comment #12. I am using the 20020228.1 WINNT4.0_DBG.OBJD build for selfserv. I will let it run overnight.
After 56 hours of CPU time on the 4-CPU box (which probably means about 14 hours of wall-clock time), selfserv is still running and the memory usage is stable. The GAT httptest client completed 6021019 operations with no failure. With this and with Julien's comment #13, I am marking this bug fixed.
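As a rough check on the figures in this comment, here is a minimal sketch of the arithmetic. It assumes all four CPUs were kept fully busy for the whole run, which the comment itself only calls probable, so the results are estimates. The function names wall_hours and ops_per_second are illustrative only and are not part of NSS, NSPR, or httptest.

```c
#include <assert.h>

/* Estimate elapsed (wall-clock) hours from accumulated CPU hours.
 * Assumption: every CPU was fully busy for the entire run. */
static double wall_hours(double cpu_hours, int ncpus)
{
    assert(ncpus > 0);
    return cpu_hours / ncpus;
}

/* Average throughput over the run, in operations per second. */
static double ops_per_second(long ops, double hours)
{
    assert(hours > 0.0);
    return ops / (hours * 3600.0);
}
```

With the numbers reported here, wall_hours(56.0, 4) gives 14, and ops_per_second(6021019, 14.0) comes out at roughly 119 operations per second. If the CPUs were not saturated, the real elapsed time would be longer and the average rate correspondingly lower.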
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED