Closed Bug 90518 Opened 24 years ago Closed 23 years ago

distributed stress test broken

Categories

(NSS :: Test, defect, P1)

3.2.1
x86
Windows NT
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sonja.mirtitsch, Assigned: bugz)

Attachments

(7 files)

The test needs to be run before NSS 3.4 is released. This is a manually run test: the scripts have to be adjusted, and the output needs to be examined. It does not need to run before 3.3. Documentation on how to run and adjust this test needs to be checked in to tests/doc. Problems: still no progress on the static IP address issue, so the test cannot run on any machine but clio. Bugs: the 200 stress certs don't get generated; testresults/security/HOSTDIR does not get created. This is most likely a result of the cleanup in the ssl script or the init script, which get sourced.
*** Bug 90519 has been marked as a duplicate of this bug. ***
Severity: normal → blocker
Priority: -- → P2
Target Milestone: --- → 3.4
3 problems found so far: 1) the PC needs to be in the .rhosts files, 2) 2 Linux machines (washer, box) both have disabled rsh access, and 3) the client side has problems on Solaris because it requires perl, which it tries to get from a network drive (NFS errors).
To connect to a Linux system from Win2K, rsh needs to be the Win2K rsh (Solaris can be contacted with the WinNT rsh...). Fixed sonmi's environ.ksh to set the variable RSH to d:/winnt/system32/rsh.exe, and the set_environ script to not change this variable when it is already set. Cannot use the rsh found in PATH, because the MKS and cygwin rsh cause even more problems. Fixed a second problem: clio was hardcoded into the tstclnt invocation that was supposed to bring down the selfserv gracefully. There is no dump of any data now, just the following line, and I am not sure if that is the expected result: > Recv failed:Connection reset by peer Both programs exit. I checked in my changes and will try again once I receive a comment from one of the developers about the expected behavior.
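The set_environ change described above (leave RSH alone when the caller has already set it) can be sketched as follows. This is a minimal illustration, not the checked-in script: the function name set_rsh_default and the fallback value "rsh" are assumptions; only the variable name RSH and the path d:/winnt/system32/rsh.exe come from this bug.

```shell
#!/bin/sh
# Hypothetical helper, not the real set_environ code.
# Usage: set_rsh_default FALLBACK
set_rsh_default() {
    # Only assign when RSH is unset or empty, so a value exported by a
    # per-user file such as environ.ksh is preserved.
    if [ -z "${RSH:-}" ]; then
        RSH="$1"
    fi
    export RSH
}

# Example: the per-user environment sets the Win2K rsh first,
# so the later default is ignored.
RSH="d:/winnt/system32/rsh.exe"
set_rsh_default rsh
echo "RSH=$RSH"
```

With this ordering the per-user setting wins, and machines that do not set RSH still get a usable default.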
Severity: blocker → normal
Status: NEW → ASSIGNED
Attached file log file
Can't continue, need comment from Relyea
I'm confused about what kind of comment you need. What is it that's failing, and under what conditions? bob
Hi Bob, I tried to shut down the selfserv that I had previously put under stress, using tstclnt -h nssqa-win2k -p 8443 -d client -n TestUser0 and the input GET /stop HTTP/1.0. After they connected I got the message > Recv failed:Connection reset by peer More details are in the attached log. Thanks, Sonja
I am unable to resolve this bug without help, and I don't seem to be able to get it. Please let me know how high the priority is, and who will help me in looking at the results.
Assignee: sonja.mirtitsch → wtc
Status: ASSIGNED → NEW
Sonja, I agree with your statement that this test needs to be run before NSS 3.4 is released. Therefore, the priority should be P1. I understand that you need someone to investigate this error message emitted by tstclnt: > Recv failed:Connection reset by peer Is there anything else you need help with? I think Kirk can help you look into tstclnt's "connection reset by peer" error. Should I assign this bug to you or Kirk?
No, I think Nelson, who designed the test, or Bob would be the right person to handle this; he was the one who originally requested that the test be run against 3.4. Kirk doesn't have the knowledge about this test, and it would probably take him a lot longer to even understand the details of what is being done in it. Bob and Nelson know a lot more about the test than I do, and there is no documentation about it. It does not seem to be a large problem: the test runs to the end, but when I try to shut down the selfserv the connection dies without giving messages. It is possible that this is the expected behavior if there is no memory leak; if there is one, the server dumps the contents of some cert DB at this point.
Priority: P2 → P1
Status: NEW → ASSIGNED
Was the error message > Recv failed:Connection reset by peer emitted by selfserv, tstclnt, or one of the test scripts? I just did a search in all the files under mozilla/security/nss (the tip of the source tree) for "Recv" and didn't find any strings containing "Recv". So I don't know where the message > Recv failed:Connection reset by peer came from. I also need to know which platforms the program that emitted the error message and its peer were running on. Is this the only remaining issue for this bug?
I think one of the problems with bug aging is that error messages are changed very frequently, and lots of work has to be done all over again. The message did not come from the script. The last part of the test is to connect a tstclnt to the selfserv that has been stressed and give it a "GET /stop HTTP/1.0\n\n" to make it shut down gracefully. At this point, before shutting down, the server usually gave messages. I never ran it with a regular selfserv, only with a special version from Nelson; I do not have access to this version of selfserv. I do not know if his selfserv changes are checked in, or whether it would even make sense to check them in. The test is designed for the selfserv and the tstclnt that shuts it down to run on NT, and the stress clients to run on any Unix platform, mostly Linux. I think we had about 400 on box and washer, and 10-20 on a few Solaris, AIX and HP machines in this run. If you are going to work on the bug now, I will rerun all tests once I am done trying to get the QA results reformatted for ftp.mozilla.org; it is going to take me about 4 hours to set up machines and rerun the tests.
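The shutdown step described above can be sketched in shell. This is an illustration only: it assumes tstclnt reads the request from standard input, as the earlier comment implies, and the function names stop_request and send_stop are hypothetical. The tstclnt options are the ones quoted in this bug.

```shell
#!/bin/sh
# Print the shutdown request quoted in this bug:
# the request line followed by an empty line.
stop_request() {
    printf 'GET /stop HTTP/1.0\n\n'
}

# Hypothetical wrapper: feed the request to tstclnt on standard input.
# $1 is the server host and $2 the port; both are supplied by the caller.
send_stop() {
    stop_request | tstclnt -h "$1" -p "$2" -d client -n TestUser0
}
```

A call such as "send_stop nssqa-win2k 8443" would then correspond to the manual step; taking the host as an argument avoids hardcoding a machine name into the script.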
We should run the distributed stress test with a regular selfserv. We need to find out where the error message came from. Could it come from rsh?
There are no more rsh's running at this point in the test
I reran the tests a few times, but I was never able to produce measurable "stress" on the NT server machine. After a while selfserv seemed saturated, and new clients could not connect for 30-40 seconds. The CPU load never reached more than 50%. If I continued starting strsclnts, the client machines seemed to run out of resources, for example:
uname: error in loading shared libraries: libc.so.6: cannot open shared object file: Error 23
ssl_dist_stress.sh: /share/builds/mccrel/nss/nsstip/builds/20011115.1/booboo_Solaris8/mozilla/dist/Linux2.2_x86_glibc_P H_OPT.OBJ/bin/strsclnt: Too many open files in system
ssl_dist_stress.sh: [: -ge: unary operator expected
When the test terminated (regular termination, not caused by an error) after that type of error, I saw hanging stress clients on some or all of the client machines, and sometimes the following message showed up once or more a while after the test (script and client) had finished:
$ > Recv failed:Connection reset by peer
The maximum number of clients I ran that did not show resource errors was 400; the CPU in this case never went above 40%. After (suspected) saturation was reached, the server would not accept new connections for 30-40 seconds, and after that communication resumed. With 400 clients this happened with only about 60 clients left. The standard termination messages seem to be:
Date: Tue, 26 Aug 1997 22:10:05 GMT
Content-type: text/plain
Discarded 2 characters.
GET /stop HTTP/1.0
EOF
I attach all the logs.
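The "[: -ge: unary operator expected" message in the log above is what a shell prints when the left operand of a numeric test expands to nothing, for example because a command substitution failed. The actual expression in ssl_dist_stress.sh is not shown in this bug, so the following is only a sketch of a defensive pattern; the function name at_least is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed when $1 >= $2.
# Quoting plus a default of 0 keeps the test well formed
# even if one of the values is empty.
at_least() {
    [ "${1:-0}" -ge "${2:-0}" ]
}

count=""    # e.g. the result of a command substitution that failed
if at_least "$count" 1; then
    echo "threshold reached"
else
    echo "threshold not reached"
fi
```

An unquoted form such as [ $count -ge 1 ] would collapse to [ -ge 1 ] when count is empty, which produces the error seen in the log.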
Attached file log of the next test
Attached file log of the last test
So we are still seeing the error message: > Recv failed:Connection reset by peer I don't know where that error message comes from. I just did a search for "Recv failed" in the whole mozilla source tree and did not find any match of "> Recv failed". So it does not come from NSPR, DBM, or NSS.
Then I would assume the message comes from either rsh or some other leftover. I think we should not chase this error message, because it only occurs after the system is failing anyway (resource problems on the client side, and hanging stress clients that I had to kill manually to get rid of them). Also, at that point in time the selfserv is already terminated, and as you stated the message does not seem to come from the mozilla source. If it is very important to know where it comes from, I will run strings to try to find out. In my opinion we need to determine if the pause that the server has in accepting new connections is a sign of "stress": at this point it has been presented with 200 different certs and over 300 connections. If we can accept this as sufficient stress, this test has no resource problems on the client side, and no weird error messages either. We could then just examine the exit messages to find out if they are sufficient. If we absolutely need to create more severe stress on the server, then we need to spend time figuring out which Linux / Unix system can handle how many clients and see if we can raise the load this way.
Bob, Nelson, could you answer Sonja's question (the second and third paragraphs of her previous comment)? My understanding is that the goal of this stress test is to cause our in-memory database to overflow. So one way to determine whether we have created enough stress on the server is to find out if DBM created those temporary files, which would show that the in-memory database overflowed. Agreed?
Wan-Teh wrote: > My understanding is that the goal of this stress test > is to cause our in-memory database to overflow. No, I don't think that was the goal of this stress test. If this is the test I recall, the purpose of this test was to have a lot of clients, each with its own individual unique client authentication certificate, performing SSL connections with client authentication. The purpose was to test realistic use of client authentication. It was to try to detect problems with servers doing heavy client auth in our own testing, before customers catch them, and to exercise a portion of our SSL server cache code that is only used when client auth is being done. Previous tests had all clients reusing the same client certificate, which was not realistic, because the server would have only one cert in memory (since that one cert matched all clients). By having each client use a unique cert, we forced the server to have all those different certs in the server SSL cache and in the various cert DBs at once. Even if this test is a different one than the one of which I am thinking, it cannot be measured by seeing when the temp cert DB overflows to disk, because beginning in NSS 3.4, the temp cert DB is no longer implemented using DBM, and it no longer overflows to disk.
Nelson, this is the stress test you recall. The question still stands -- how do we know we have created sufficient stress on the server?
Bob Relyea said this is a requirement for NSS 3.4. I don't think it is a Beta-Blocker, but we need to run it before RTM.
Severity: normal → blocker
Ian will make the appropriate modifications to selfserv according to Nelson's specification (via email).
Assignee: wtc → ian.mcgreer
Status: ASSIGNED → NEW
Nelson, With this patch, a call to nss_DumpCertificateCacheInfo() will list all certs in (1) the cache; and (2) the temporary store. Here is what sample output looks like:
Certificates in the cache:
Certificate: "OU=Secure Server Certification Authority, O="RSA Data Security, Inc.", C=US" refs=2
Certificate: "CN=foobar, E=foo@bar.com" refs=2
Certificates in the temporary store:
Certificate: "CN=localhost" refs=2
Does this look like what you need?
That looks like the right info. I think that will do nicely. I suggest that the ref count come at the beginning of the line, and drop the "Certificate: " on the front of each line. Maybe selfserv should have YA command line option to call this function before exiting.
Seems to me that it doesn't hurt for selfserv to always call this function before exiting. Agreed?
I'd prefer a command-line option. Or is it that the cache and temp store should be completely empty, so that if there is no problem there would be no output at all?
I checked in the patch, with Nelson's suggestions. selfserv uses shared libraries. So we must export this function to take advantage of it? Is there no other way?
Could exporting these functions introduce a possible security problem? Could we put #ifdef DEBUG_NSS around the functions and the selfserv exit, so that I'd do a special build if I need it?
As we discussed in the conference call this afternoon, Ian will export this new function from libnss3.so and add a new command-line option to selfserv.
Sonja- This function doesn't really pose any kind of security risk. All that is displayed is the certificate's subject and number of references. There is no way that I know of to conditionally define a function in our .def format used by the shared libraries. I thought of using #ifdef DEBUG for the header definition, but we would probably like to use this in optimized builds as well. I have defined the function in nss.h, implemented it as per the last patch, and exported it in nss.def. Also added -y option to selfserv to call this function before exit.
I ran the distributed stress test on nssqa_win2k with 800, 1200 and 1400 clients and it seemed fine. Once, we got the (alleged) rsh error message "Recv failed: Connection reset by peer". The selfserv gave the lines: Certificates in cache: Certificates in temporary store: but listed no certificates, which I assume is expected behavior (Ian, Nelson, please verify). The last line was "selfserv: normal termination". I will attach the full log of 1400 clients, and a shortened version.
Whiteboard: test ran, please review results
From the cache/temp store perspective, it looks good. That is the expected output if there are no leaked references.
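The review described above (no entries under either heading means no leaked references) could be automated with a small filter over the selfserv log. This is a sketch under assumptions: the function name count_leaked_certs is hypothetical, the headings are matched in both spellings that appear in this bug, and any non-empty line under a heading is counted as one entry because the final per-certificate line format is not shown here.

```shell
#!/bin/sh
# Hypothetical helper: read selfserv output on stdin and print the number
# of certificate lines listed under the cache / temporary store headings.
count_leaked_certs() {
    awk '
        /^Certificates in (the )?cache:/           { in_list = 1; next }
        /^Certificates in (the )?temporary store:/ { in_list = 1; next }
        /^selfserv:/                               { in_list = 0; next }
        in_list && NF > 0                          { n++ }
        END                                        { print n + 0 }
    '
}

# Example using the headings and termination line reported in this bug.
printf '%s\n' \
    'Certificates in cache:' \
    'Certificates in temporary store:' \
    'selfserv: normal termination' | count_leaked_certs
```

A result of 0 corresponds to the clean run reported above; any other value would point at certificates still referenced at exit.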
test completed, closing fixed
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Whiteboard: test ran, please review results