Closed
Bug 124447
Opened 23 years ago
Closed 23 years ago
SSL server stress test ends up in deadlock on Win2K
Categories
(NSS :: Libraries, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
3.4
People
(Reporter: julien.pierre, Assigned: wtc)
Details
With client auth required, NES6 ended up at 0% CPU and not handling new
connections after only 6 minutes of stress.
Reporter
Updated•23 years ago
Priority: -- → P1
Target Milestone: --- → 3.4
Reporter
Comment 1•23 years ago
I just reproduced it even without client auth, after just 10 minutes. The NT
server deadlocked again. So it is not specific to client auth.
OS: other → Windows 2000
Hardware: Other → PC
Reporter
Comment 2•23 years ago
*** Bug 124337 has been marked as a duplicate of this bug. ***
Reporter
Comment 3•23 years ago
The exact client-side test case that deadlocked the server was:
D:\60\WINNT_5.0_depend\ns\server\internal\B1\WINNT4.0_DBG.OBJ\tests\httptests>httptest -h oops:1890 -s -H 1 -g COMMON -g SSL -L 5 -p 128 -x "url51" -e 3600 -P
Where oops:1890 was an SSL listen socket without client auth on my Win2k system.
Reporter
Comment 4•23 years ago
I attached to the daemon process when the deadlock was reproduced for the third
time. I don't have anything like pstack on NT, so I looked with the
Debug/Threads menu, setting focus on all of the threads that were shown. None of
the stacks were in NSS code. Most of them were in NSPR functions like PR_Accept,
PR_WaitCondVar, which is expected for the web server. I suspect that the
debugger is not showing me all the actual threads since fibers are used on NT.
I'm going to disable fibers through the NSPR environment variable and try to
reproduce it again.
Reporter
Comment 5•23 years ago
I set NSPR_NATIVE_THREADS_ONLY=1 and restarted the daemon independently (not as
a service). The test has been running for 1h30 and the deadlock has not been
reproduced. It might be that the deadlock is specific to running with fibers.
Reporter
Comment 6•23 years ago
After a whole weekend running without fibers, the server did not deadlock. I
still saw plenty of SSL errors on PR_Recv and PR_Send on the client side, but
the server was still accessible. There were no errors reported in the server log.
Reporter
Comment 7•23 years ago
FYI, you can test fibers with selfserv on NT/Win2K with the -l option. Make sure
you are using the NT version of NSPR.
Comment 8•23 years ago
Just for a data point:
I ran with WinNT, using NSPR 4.1.1 built from source for NT target. I didn't
realize I was building with 4.1.1, will try again with 4.1.2. I used selfserv
with model sockets and local threads,
selfserv -n localhost.red.iplanet.com -p 8888 -w nss -t 100 -d ext_server -l -m -D
The client was doing full handshakes:
strsclnt -n ExtendedSSLUser -p 8888 -d ext_client -c 50000 -d -w nss -t 40 -N localhost
I got through 50000 connections without any error, and the server was still
responsive.
Assignee
Comment 9•23 years ago
Ian,
I just thought of one important difference between the web server
and selfserv on NT: the web server calls PR_SetConcurrency().
That function is especially important when using fibers.
I suggest that you modify your selfserv.c as follows. I am running
selfserv (with this modification) on a 4-CPU Windows 2000 box and
the web server GAT client on a Solaris box. I will report back
what I find out.
Index: selfserv.c
===================================================================
RCS file: /cvsroot/mozilla/security/nss/cmd/selfserv/selfserv.c,v
retrieving revision 1.42
diff -u -r1.42 selfserv.c
--- selfserv.c 6 Feb 2002 01:38:06 -0000 1.42
+++ selfserv.c 12 Feb 2002 21:55:14 -0000
@@ -1472,7 +1472,7 @@
         case 'i': pidFile = optstate->value; break;
-        case 'l': useLocalThreads = PR_TRUE; break;
+        case 'l': useLocalThreads = PR_TRUE; PR_SetConcurrency(4); break;
         case 'm': useModelSocket = PR_TRUE; break;
Status: NEW → ASSIGNED
Reporter
Comment 10•23 years ago
I don't think the web server is calling PR_SetConcurrency.
I did a search in ns/netsite/lib in the source tree and did not find it. This
may be a bug in the web server.
In any case, I had originally reproduced the bug on my Win2k system which is a
single-CPU PII 450. I just built another NSS tree and will see if I can
reproduce the problem shortly.
Reporter
Comment 11•23 years ago
I haven't been able to reproduce it today in about 100,000 sessions (roughly an
hour).
Assignee
Comment 12•23 years ago
I was able to cause selfserv to hang. Unfortunately there is no
debugger installed on the machine I used, so I couldn't attach the
debugger to the hung selfserv process. Here is my configuration.
I checked out the tip of NSS yesterday morning and did a debug build
on Windows 2000.
I was running selfserv (modified to add a PR_SetConcurrency(4) call)
on a 4-CPU Windows 2000 box with this command line:
selfserv -n Server-Cert -p 8880 -w enterprise -t 100 -d . -l -m -D
I was using the cert and key databases from web server 6.0.
I was running the web server 6.0 httptest client (optimized build,
with NSS 3.3.1) on a Solaris box with this command line:
httptest -h area51:8880 -s -H 1 -g COMMON -g SSL -L 5 -p 128 -x "url51" -e 3600 -P
(area51 is the host name of the Windows 2000 box running selfserv.)
Reporter
Comment 13•23 years ago
Wan-Teh,
I have not been able to get the server to hang on NT in my own stress tests over
the past week. Can you still reproduce this problem with selfserv on the SMP
system you were using?
Assignee
Comment 14•23 years ago
Julien,
I haven't run any new NSS tests on that SMP system since then
because it doesn't have a debugger.
However, for verifying that this bug cannot be reproduced, I
can use that machine. I will do that tomorrow.
Assignee
Comment 15•23 years ago
I just started the same test as I described in comment #12.
I am using the 20020228.1 WINNT4.0_DBG.OBJD build for selfserv.
I will let it run overnight.
Assignee
Comment 16•23 years ago
After 56 hours of CPU time on the 4-CPU box (which probably means about
14 hours of wall-clock time), selfserv is still running and its memory
usage is stable. The GAT httptest client completed 6021019 operations
with no failures.
With this and with Julien's comment #13, I am marking this bug fixed.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED