Closed Bug 90518 Opened 24 years ago Closed 23 years ago

distributed stress test broken

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

3.4

People

(Reporter: sonja.mirtitsch, Assigned: bugz)

References

Details

Attachments

(7 files)

log file 24 years ago Sonja Mirtitsch 804 bytes, text/plain		Details
log of the 1st stresstest (incomplete) 24 years ago Sonja Mirtitsch 196.87 KB, text/plain		Details
log of the next test 24 years ago Sonja Mirtitsch 185.53 KB, text/plain		Details
log of the last test 24 years ago Sonja Mirtitsch 650.34 KB, text/plain		Details
log state of temp store and cache 24 years ago Ian McGreer 3.49 KB, patch		Details | Diff | Splinter Review
dist. stress test 02/05/02 log 23 years ago Sonja Mirtitsch 386.18 KB, text/plain		Details
shortened version of the 02/05/02 log 23 years ago Sonja Mirtitsch 4.22 KB, text/plain		Details

Sonja Mirtitsch

Reporter

Description

•

24 years ago

The test needs to be run before NSS 3.4 is released. This is a manually run test: the scripts have to be adjusted, and the output needs to be examined. It does not need to run before 3.3. Documentation on how to run and adjust this test needs to be checked in to tests/doc.

Problems:
- Still no progress on the static IP address issue; the test cannot run on any machine but clio.

Bugs:
- The 200 stress certs don't get generated.
- testresults/security/HOSTDIR does not get created.

This is most likely a result of the cleanup in the ssl script, or the init script, which get sourced.

Sonja Mirtitsch

Reporter

Comment 1

•

24 years ago

*** Bug 90519 has been marked as a duplicate of this bug. ***

Sonja Mirtitsch

Reporter

Updated

•

24 years ago

Severity: normal → blocker

Priority: -- → P2

Target Milestone: --- → 3.4

Sonja Mirtitsch

Reporter

Comment 2

•

24 years ago

3 problems found so far: 1) the PC needs to be in the .rhosts files, 2) two Linux machines (washer, box) both have rsh access disabled, and 3) the client side has problems on Solaris because it requires perl, which it tries to get from a network drive (NFS errors).

Sonja Mirtitsch

Reporter

Comment 3

•

24 years ago

To connect to a Linux system from Win2K, rsh needs to be a Win2K rsh (Solaris can be contacted with the WinNT rsh...). I fixed sonmi's environ.ksh to set the variable RSH to d:/winnt/system32/rsh.exe, and the set_environ script to not change this variable when it is already set. We cannot use the rsh found in PATH, because the MKS and cygwin rsh cause even more problems.

I fixed a second problem: clio was hardcoded into the tstclnt call that was supposed to bring down the selfserv gracefully. There is no dump of any data now, just the following line, and I am not sure if that is the expected result:

> Recv failed:Connection reset by peer

Both programs exit. I checked in my changes, and will try again once I receive a comment from one of the developers about the expected behavior.
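The "do not change RSH when it is already set" behavior can be sketched as below. This is a minimal illustration, not the checked-in set_environ script: the function name set_rsh_default and the fallback value rsh are assumptions made for the example.

```shell
# Keep RSH if the caller (for example a per-user environ.ksh) already set it;
# otherwise fall back to a default. The default "rsh" is an assumption.
set_rsh_default() {
    : "${RSH:=rsh}"   # assigns only when RSH is unset or empty
    export RSH
}

# Hypothetical usage: a Win2K environment presets the native rsh, and the
# later call leaves that value untouched.
#   RSH=d:/winnt/system32/rsh.exe
#   set_rsh_default
```

Using the `:=` expansion keeps the override in one place, so a machine-specific environment file only has to set RSH before the common script runs.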

Severity: blocker → normal

Status: NEW → ASSIGNED

Sonja Mirtitsch

Reporter

Comment 4

•

24 years ago

Attached file log file — Details

Sonja Mirtitsch

Reporter

Comment 5

•

24 years ago

Can't continue, need comment from Relyea

Robert Relyea

Comment 6

•

24 years ago

I'm confused about what kind of comment you need. What is it that's failing and under what conditions. bob

Sonja Mirtitsch

Reporter

Comment 7

•

24 years ago

Hi Bob,

I tried to shut down the selfserv that I had put under stress before with

    tstclnt -h nssqa-win2k -p 8443 -d client -n TestUser0

and the input

    GET /stop HTTP/1.0

After they connected I got the message

> Recv failed:Connection reset by peer

More details are in the attached log.

Thanks,
Sonja
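The shutdown step above could be scripted roughly as follows. This is only a sketch: the helper name stop_request is hypothetical, the trailing blank line follows the "GET /stop HTTP/1.0\n\n" form quoted elsewhere in this bug, and piping the request into tstclnt on standard input is an assumption about how the input was supplied.

```shell
# Emit the shutdown request used in this bug. The empty line after the
# request line ends the HTTP/1.0 request.
stop_request() {
    printf 'GET /stop HTTP/1.0\n\n'
}

# Hypothetical usage, reusing the host, port, and nickname from the comment:
#   stop_request | tstclnt -h nssqa-win2k -p 8443 -d client -n TestUser0
```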

Sonja Mirtitsch

Reporter

Comment 8

•

24 years ago

I am unable to resolve this bug without help and I don't seem able to get it. Please let me know how high the priority is, and who will help me in looking at the results.

Assignee: sonja.mirtitsch → wtc

Status: ASSIGNED → NEW

Wan-Teh Chang

Comment 9

•

24 years ago

Sonja, I agree with your statement that this test needs to be run before NSS 3.4 is released. Therefore, the priority should be P1. I understand that you need someone to investigate this error message emitted by tstclnt:

> Recv failed:Connection reset by peer

Is there anything else you need help with? I think Kirk can help you look into tstclnt's "connection reset by peer" error. Should I assign this bug to you or Kirk?

Sonja Mirtitsch

Reporter

Comment 10

•

24 years ago

No, I think Nelson, who designed the test, or Bob would be the right person to handle this; he was the one who originally requested that the test be run against 3.4. Kirk doesn't have the knowledge about this test, and it would probably take him a lot longer to even understand the details of what is being done in it. Bob and Nelson know a lot more about the test than I do, and there is no documentation about it.

It does not seem to be a large problem: the test runs to the end, and only when I try to shut down the selfserv does the connection die without giving messages. It is possible that this is the expected behavior if there is no memory leak; if there is one, the server dumps the contents of some cert DB at this point.

Priority: P2 → P1

Wan-Teh Chang

Updated

•

24 years ago

Status: NEW → ASSIGNED

Wan-Teh Chang

Comment 11

•

24 years ago

Was the error message > Recv failed:Connection reset by peer emitted by selfserv, tstclnt, or one of the test scripts? I just did a search in all the files under mozilla/security/nss (the tip of the source tree) for "Recv" and didn't find any strings containing "Recv". So I don't know where the message > Recv failed:Connection reset by peer came from. I also need to know which platforms the program that emitted the error message and its peer were running on. Is this the only remaining issue for this bug?

Sonja Mirtitsch

Reporter

Comment 12

•

24 years ago

I think one of the problems with this bug aging is that error messages are changed very frequently, and lots of work has to be done all over again. The message did not come from the script. The last part of the test is to connect a tstclnt to the selfserv that has been stressed and give it a "GET /stop HTTP/1.0\n\n" to make it shut down gracefully. At this point, before shutting down, the server usually gave messages. I never ran it with a regular selfserv, only with a special version from Nelson; I do not have access to this version of selfserv. I do not know if his selfserv changes are checked in, or whether it would even make sense to check them in.

The test is designed for the selfserv and the tstclnt that shuts it down to run on NT, and the stress clients to run on any Unix platform, mostly Linux. I think we had about 400 on box and washer, and 10-20 on a few Solaris, AIX and HP machines in this run.

If you are going to work on the bug now, I will rerun all tests once I am done with trying to get the QA results reformatted for ftp.mozilla.org. It is going to take me about 4 hours to set up machines and rerun the tests.

Wan-Teh Chang

Comment 13

•

24 years ago

We should run the distributed stress test with a regular selfserv. We need to find out where the error message came from. Could it come from rsh?

Sonja Mirtitsch

Reporter

Comment 14

•

24 years ago

There are no more rsh's running at this point in the test

Sonja Mirtitsch

Reporter

Comment 15

•

24 years ago

I reran the tests a few times, but I was never able to produce measurable "stress" on the NT server machine. After a while selfserv seemed saturated, and new clients could not connect for 30-40 seconds. The CPU load never reached more than 50%. If I continued starting strsclnts, the client machines seemed to run out of resources, for example:

    uname: error in loading shared libraries: libc.so.6: cannot open shared object file: Error 23
    ssl_dist_stress.sh: /share/builds/mccrel/nss/nsstip/builds/20011115.1/booboo_Solaris8/mozilla/dist/Linux2.2_x86_glibc_P H_OPT.OBJ/bin/strsclnt: Too many open files in system
    ssl_dist_stress.sh: [: -ge: unary operator expected

When the test terminated (regular termination, not error-caused) after that type of error, I saw hanging stress clients on some or all of the client machines, and sometimes the message

    $ > Recv failed:Connection reset by peer

showed up once or more, a while after the test (script and client) had finished.

The maximum number of clients I ran that did not show resource errors was 400; the CPU in this case never went above 40%. After (suspected) saturation was reached, the server would not accept new connections for 30-40 seconds, and after that communication resumed. With 400 clients this happened with only about 60 clients left.

The standard termination messages seem to be:

    Date: Tue, 26 Aug 1997 22:10:05 GMT
    Content-type: text/plain
    Discarded 2 characters.
    GET /stop HTTP/1.0
    EOF

I attach all the logs.
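The "[: -ge: unary operator expected" line is the message a shell typically prints when the left operand of a numeric test expands to nothing, for example because an earlier command (such as the failing uname) produced no output. Which variable is involved in ssl_dist_stress.sh is not shown in this bug, so the sketch below uses hypothetical names; it only illustrates one way to make such a comparison tolerate an empty value.

```shell
# Return success when the count in $1 is >= the limit in $2.
# An empty or non-numeric count is treated as 0, so the test never
# degenerates into "[ -ge N ]".
count_at_least() {
    count=$1
    limit=$2
    case $count in
        ''|*[!0-9]*) count=0 ;;   # empty or not all digits
    esac
    [ "$count" -ge "$limit" ]
}

# Hypothetical usage:
#   if count_at_least "$running_clients" "$max_clients"; then ...; fi
```

Quoting the operands alone would only turn the error into "integer expression expected"; normalizing the value first gives the script a defined result when a probe command fails.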

Sonja Mirtitsch

Reporter

Comment 16

•

24 years ago

Attached file log of the 1st stresstest (incomplete) — Details

Sonja Mirtitsch

Reporter

Comment 17

•

24 years ago

Attached file log of the next test — Details

Sonja Mirtitsch

Reporter

Comment 18

•

24 years ago

Attached file log of the last test — Details

Wan-Teh Chang

Comment 19

•

24 years ago

So we are still seeing the error message: > Recv failed:Connection reset by peer I don't know where that error message comes from. I just did a search for "Recv failed" in the whole mozilla source tree and did not find any match of "> Recv failed". So it does not come from NSPR, DBM, or NSS.

Sonja Mirtitsch

Reporter

Comment 20

•

24 years ago

Then I would assume the message comes from either rsh or some other leftover. I think we should not chase this error message, because it only occurs after the system is failing anyway (resource problems on the client sides, hanging stress clients that I had to kill manually to get rid of them). Also, at that point in time the selfserv is already terminated, and as you stated the message does not seem to come from the mozilla source. If it is very important to know where it comes from, I will do a strings to try to find out.

In my opinion we need to determine if the pause that the server has in accepting new connections is a sign of "stress"; at this point it has been presented with 200 different certs and over 300 connections. If we can accept this as sufficient stress, this test has no resource problems on the client sides, and no weird error messages either. We could then just examine the exit messages to find out if they are sufficient.

If we absolutely need to create more severe stress on the server, then we need to spend time figuring out which Linux / Unix system can handle how many clients and see if we can raise the load this way.

Wan-Teh Chang

Comment 21

•

24 years ago

Bob, Nelson, could you answer Sonja's question (the second and third paragraphs of her previous comment)? My understanding is that the goal of this stress test is to cause our in-memory database to overflow. So one way to determine whether we have created enough stress on the server is to find out if DBM created those temporary files, which would show that the in-memory database overflowed. Agreed?

Nelson Bolyard (seldom reads bugmail)

Comment 22

•

24 years ago

Wan-Teh wrote:
> My understanding is that the goal of this stress test
> is to cause our in-memory database to overflow.

No, I don't think that was the goal of this stress test. If this is the test I recall, the purpose of this test was to have a lot of clients, each with its own unique client authentication certificate, performing SSL connections with client authentication. The purpose was to test realistic use of client authentication. It was to try to detect problems with servers doing heavy client auth in our own testing, before customers catch them, and to exercise a portion of our SSL server cache code that is only used when client auth is being done.

Previous tests had all clients reusing the same client certificate, which was not realistic, because the server would have only one cert in memory (since that one cert matched all clients). By having each client use a unique cert, we forced the server to have all those different certs in the server SSL cache and in the various cert DBs at once.

Even if this test is a different one than the one of which I am thinking, it cannot be measured by seeing when the temp cert DB overflows to disk, because beginning in NSS 3.4, the temp cert DB is no longer implemented using DBM, and it no longer overflows to disk.

Wan-Teh Chang

Comment 23

•

24 years ago

Nelson, this is the stress test you recall. The question still stands -- how do we know we have created sufficient stress on the server?

Sonja Mirtitsch

Reporter

Comment 24

•

24 years ago

Bob Relyea said this is a requirement for NSS 3.4. I don't think it is a Beta-Blocker but we need to run it before RTM

Severity: normal → blocker

Wan-Teh Chang

Comment 25

•

24 years ago

Ian will make the appropriate modifications to selfserv according to Nelson's specification (via email).

Assignee: wtc → ian.mcgreer

Status: ASSIGNED → NEW

Ian McGreer

Assignee

Comment 26

•

24 years ago

Attached patch log state of temp store and cache — Details — Splinter Review

Nelson,

With this patch, a call to nss_DumpCertificateCacheInfo() will list all certs in (1) the cache and (2) the temporary store. Here is what sample output looks like:

    Certificates in the cache:
    Certificate: "OU=Secure Server Certification Authority, O="RSA Data Security, Inc.", C=US" refs=2
    Certificate: "CN=foobar, E=foo@bar.com" refs=2
    Certificates in the temporary store:
    Certificate: "CN=localhost" refs=2

Does this look like what you need?

Nelson Bolyard (seldom reads bugmail)

Comment 27

•

24 years ago

That looks like the right info. I think that will do nicely. I suggest that the ref count come at the beginning of the line, and drop the "Certificate: " on the front of each line. Maybe selfserv should have YA command line option to call this function before exiting.

Wan-Teh Chang

Comment 28

•

24 years ago

Seems to me that it doesn't hurt for selfserv to always call this function before exiting. Agreed?

Sonja Mirtitsch

Reporter

Comment 29

•

24 years ago

I'd prefer a commandline option, or is it that the cache and temp. store should be completely empty, and if there is no problem there would be no output at all?

Ian McGreer

Assignee

Comment 30

•

24 years ago

I checked in the patch, with Nelson's suggestions. selfserv uses shared libraries. So we must export this function to take advantage of it? Is there no other way?

Sonja Mirtitsch

Reporter

Comment 31

•

24 years ago

Could exporting these functions introduce a possible security problem? Could we put #ifdef DEBUG_NSS around the functions and the selfserv exit, and I'd do a special build if I need it?

Wan-Teh Chang

Comment 32

•

24 years ago

As we discussed in the conference call this afternoon, Ian will export this new function from libnss3.so and add a new command-line option to selfserv.

Ian McGreer

Assignee

Comment 33

•

24 years ago

Sonja,

This function doesn't really pose any kind of security risk. All that is displayed is the certificate's subject and number of references. There is no way that I know of to conditionally define a function in our .def format used by the shared libraries. I thought of using #ifdef DEBUG for the header definition, but we would probably like to use this in optimized builds as well.

I have defined the function in nss.h, implemented it as per the last patch, and exported it in nss.def. Also added a -y option to selfserv to call this function before exit.

Sonja Mirtitsch

Reporter

Comment 34

•

23 years ago

I ran the distributed stress test on nssqa_win2k with 800, 1200 and 1400 clients and it seemed fine. Once we got the (alleged) rsh error message "Recv failed: Connection reset by peer".

The selfserv gave the lines:

    Certificates in cache:
    Certificates in temporary store:

but listed no certificates, which I assume is expected behavior (Ian, Nelson, please verify). The last line was "selfserv: normal termination".

I will attach the full log of 1400 clients, and a shortened version.

Sonja Mirtitsch

Reporter

Comment 35

•

23 years ago

Attached file dist. stress test 02/05/02 log — Details

Sonja Mirtitsch

Reporter

Comment 36

•

23 years ago

Attached file shortened version of the 02/05/02 log — Details

Sonja Mirtitsch

Reporter

Updated

•

23 years ago

Whiteboard: test ran, please review results

Ian McGreer

Assignee

Comment 37

•

23 years ago

From the cache/temp store perspective, it looks good. That is the expected output if there are no leaked references.
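A check like the one done by eye on the log could be automated along these lines. This is a sketch under stated assumptions, not part of the test suite: the function name count_listed_certs is hypothetical, and it assumes each leaked certificate appears on its own line after a "Certificates in ..." header and that a line starting with "selfserv:" ends the dump.

```shell
# Read a selfserv log on stdin and print how many certificate lines are
# listed under the "Certificates in ..." headers. 0 means no leaked
# references were reported.
count_listed_certs() {
    awk '
        /^Certificates in /  { in_dump = 1; next }   # section header
        /^selfserv:/         { in_dump = 0; next }   # assumed end of dump
        in_dump && NF > 0    { n++ }                 # any other non-blank line
        END                  { print n + 0 }
    '
}

# Hypothetical usage:
#   leaked=$(count_listed_certs < selfserv.log)
#   [ "$leaked" -eq 0 ] || echo "leaked cert references: $leaked"
```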

Sonja Mirtitsch

Reporter

Comment 38

•

23 years ago

test completed, closing fixed

Status: NEW → RESOLVED

Closed: 23 years ago

Resolution: --- → FIXED

Whiteboard: test ran, please review results
