Closed
Bug 90518
Opened 24 years ago
Closed 23 years ago
distributed stress test broken
Categories
(NSS :: Test, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
3.4
People
(Reporter: sonja.mirtitsch, Assigned: bugz)
References
Details
Attachments
(7 files)
The test needs to be run before NSS 3.4 is released. This is a manually run test:
the scripts have to be adjusted, and the output needs to be examined. It does not
need to run before 3.3.
Documentation on how to run and adjust this test needs to be checked in to tests/doc.
Problems: still no progress on the static IP address issue; the test cannot run on
any machine but clio.
Bugs: the 200 stress certs don't get generated;
testresults/security/HOSTDIR does not get created.
This is most likely a result of the cleanup in the ssl script or the init
script, which get sourced.
Reporter | ||
Updated•24 years ago
|
Severity: normal → blocker
Priority: -- → P2
Target Milestone: --- → 3.4
Reporter | ||
Comment 2•24 years ago
|
||
3 problems found so far:
1) the PC needs to be in the .rhosts files,
2) 2 Linux machines (washer, box) both have rsh access disabled, and
3) the client side has problems on Solaris because it requires perl, which it tries
to get from a network drive (NFS errors)
Reporter | ||
Comment 3•24 years ago
|
||
Connecting by rsh from Win2K to a Linux system requires the Win2K rsh (Solaris can
be contacted with the WinNT rsh...).
Fixed sonmi's environ.ksh to set the variable RSH to d:/winnt/system32/rsh.exe,
and the set_environ script to not change this variable when it is already set.
Cannot use the rsh found in PATH, because the MKS and cygwin rsh cause even more
problems.
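The "do not change RSH when it is already set" behaviour can be sketched with the shell's default-value expansion. This is only an illustration, not the actual set_environ code; the function name and the fallback value `rsh` are assumptions.

```shell
#!/bin/sh
# Sketch only: keep a caller-supplied RSH, otherwise fall back to a default.
# The function name and the fallback "rsh" are assumed, not taken from set_environ.
set_rsh_default() {
    # ${RSH:-rsh} expands to $RSH when it is set and non-empty, else to "rsh".
    RSH=${RSH:-rsh}
    export RSH
}

# Example: a value set earlier (as environ.ksh does) survives the call.
RSH=d:/winnt/system32/rsh.exe
set_rsh_default
echo "RSH=$RSH"
```

With this pattern the per-user environment file can pin a specific rsh binary, and the shared script only supplies a value when nothing was chosen.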
Fixed a second problem: clio was hardcoded into the tstclnt call that was supposed
to bring down the selfserv gracefully.
There is no dump of any data now, just the following line; I am not sure if that
is the expected result:
> Recv failed:Connection reset by peer
Both programs exit.
I checked in my changes, and will try again once I receive a comment from one of
the developers about the expected behavior.
Severity: blocker → normal
Status: NEW → ASSIGNED
Reporter | ||
Comment 4•24 years ago
|
||
Reporter | ||
Comment 5•24 years ago
|
||
Can't continue, need comment from Relyea
Comment 6•24 years ago
|
||
I'm confused about what kind of comment you need. What is it that's failing, and
under what conditions?
bob
Reporter | ||
Comment 7•24 years ago
|
||
Hi Bob,
I tried to shut down the selfserv that I had previously put under stress, using
tstclnt -h nssqa-win2k -p 8443 -d client -n TestUser0
with the input
GET /stop HTTP/1.0
After they connected I got the message
> Recv failed:Connection reset by peer
More details are in the attached log.
Thanks
Sonja
Reporter | ||
Comment 8•24 years ago
|
||
I am unable to resolve this bug without help, and I don't seem to be able to get any.
Please let me know how high the priority is, and who will help me in looking at
the results.
Assignee: sonja.mirtitsch → wtc
Status: ASSIGNED → NEW
Comment 9•24 years ago
|
||
Sonja, I agree with your statement that this test needs
to be run before NSS 3.4 is released. Therefore, the
priority should be P1.
I understand that you need someone to investigate this
error message emitted by tstclnt:
> Recv failed:Connection reset by peer
Is there anything else you need help with?
I think Kirk can help you look into tstclnt's "connection
reset by peer" error. Should I assign this bug to you
or Kirk?
Reporter | ||
Comment 10•24 years ago
|
||
No, I think Nelson, who designed the test, or Bob would be the right person to
handle this; Bob was the one who originally requested that the test be run
against 3.4.
Kirk doesn't have the knowledge about this test, and it would probably take him a
lot longer to even understand the details of what is being done in this test.
Bob and Nelson know a lot more about the test than I do, and there is no
documentation about it.
It does not seem to be a large problem: the test runs to the end, but when I
try to shut down the selfserv the connection dies without giving messages. It is
possible that this is the expected behavior if there is no memory leak; if there
is one, the server dumps the contents of some cert DB at this point.
Priority: P2 → P1
Updated•24 years ago
|
Status: NEW → ASSIGNED
Comment 11•24 years ago
|
||
Was the error message
> Recv failed:Connection reset by peer
emitted by selfserv, tstclnt, or one of the test scripts?
I just did a search in all the files under mozilla/security/nss
(the tip of the source tree) for "Recv" and didn't find any
strings containing "Recv". So I don't know where the message
> Recv failed:Connection reset by peer
came from.
I also need to know which platforms the program that emitted
the error message and its peer were running on.
Is this the only remaining issue for this bug?
Reporter | ||
Comment 12•24 years ago
|
||
I think one of the problems with this bug aging is that error messages are
changed very frequently, and lots of work has to be redone.
The message did not come from the script. The last part of the test is to
connect a tstclnt to the selfserv that has been stressed and give it a "GET
/stop HTTP/1.0\n\n" to make it shut down gracefully. At this point, before
shutting down, the server usually gave messages. I never ran it with a regular
selfserv, only with a special version from Nelson; I do not have access to this
version of selfserv. I do not know if his selfserv changes are checked in, or
whether it would even make sense to check them in.
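The shutdown step described above can be sketched as follows. The tstclnt arguments are the ones quoted earlier in this bug; the function names are hypothetical, and the sketch only prints the request rather than connecting.

```shell
#!/bin/sh
# Sketch only: the shutdown request is a request line plus an empty line.
stop_request() {
    printf 'GET /stop HTTP/1.0\n\n'
}

# Hypothetical wrapper: pipe the request into tstclnt, using the host, port,
# database directory and nickname quoted earlier in this bug.
send_stop() {
    stop_request | tstclnt -h nssqa-win2k -p 8443 -d client -n TestUser0
}

# Show the exact bytes that would be sent (no connection is made here).
stop_request | od -c
```

Keeping the request in its own function makes it easy to check that the trailing empty line is really sent before blaming the server for a dropped connection.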
The test is designed for the selfserv and the tstclnt that shuts it down to run
on NT, and the stress clients to run on any Unix platform, mostly Linux. I think
we had about 400 on box and washer, and 10-20 on a few Solaris, AIX and HP
machines in this run.
If you are going to work on the bug now, I will rerun all tests once I am done
trying to get the QA results reformatted for ftp.mozilla.org; it is going to
take me about 4 hours to set up machines and rerun the tests.
Comment 13•24 years ago
|
||
We should run the distributed stress test with
a regular selfserv.
We need to find out where the error message
came from. Could it come from rsh?
Reporter | ||
Comment 14•24 years ago
|
||
There are no more rsh's running at this point in the test
Reporter | ||
Comment 15•24 years ago
|
||
I reran the tests a few times, but I was never able to produce measurable
"stress" on the NT server machine. After a while selfserv seemed saturated, and
new clients could not connect for 30-40 seconds. The CPU load never reached more
than 50%. If I continued starting strsclnts, the client machines seemed to run
out of resources, for example:
uname: error in loading shared libraries: libc.so.6: cannot open shared object file: Error 23
ssl_dist_stress.sh: /share/builds/mccrel/nss/nsstip/builds/20011115.1/booboo_Solaris8/mozilla/dist/Linux2.2_x86_glibc_PH_OPT.OBJ/bin/strsclnt: Too many open files in system
ssl_dist_stress.sh: [: -ge: unary operator expected
When the test terminated (regular termination, not error-caused) after that type
of error, I saw hanging stress clients on some or all of the client machines, and
sometimes the message
$ > Recv failed:Connection reset by peer
showed up once or more, a while after the test (script and client) had finished.
The maximum number of clients I ran that did not show resource errors was 400;
the CPU in this case never went above 40%.
After (suspected) saturation was reached, the server would not accept new
connections for 30-40 seconds, and after that communication resumed. With
400 clients this happened with only about 60 clients left.
The standard termination messages seem to be:
Date: Tue, 26 Aug 1997 22:10:05 GMT
Content-type: text/plain
Discarded 2 characters.
GET /stop HTTP/1.0
EOF
I attach all the logs
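The "[: -ge: unary operator expected" line in the output above is what a shell typically prints when the left operand of a numeric test expands to nothing. A minimal sketch of a comparison that tolerates an empty value follows; `count`, `limit` and the function name are hypothetical and not taken from ssl_dist_stress.sh.

```shell
#!/bin/sh
# Sketch only: a numeric comparison that tolerates an empty value.
# "count" and "limit" are hypothetical names, not variables from ssl_dist_stress.sh.
reached_limit() {
    count=$1
    limit=$2
    # Quoting plus a :-0 default keeps "[" from seeing a missing operand,
    # which is what produces "[: -ge: unary operator expected".
    [ "${count:-0}" -ge "$limit" ]
}

# An empty count (for example from a command that failed) is treated as 0.
if reached_limit "" 400; then
    echo "limit reached"
else
    echo "below limit"
fi
```

Treating an empty value as 0 is a design choice for the sketch; a script could equally abort at that point, since an empty count usually means an earlier command failed.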
Reporter | ||
Comment 16•24 years ago
|
||
Reporter | ||
Comment 17•24 years ago
|
||
Reporter | ||
Comment 18•24 years ago
|
||
Comment 19•24 years ago
|
||
So we are still seeing the error message:
> Recv failed:Connection reset by peer
I don't know where that error message comes from.
I just did a search for "Recv failed" in the whole
mozilla source tree and did not find any match of
"> Recv failed". So it does not come from NSPR,
DBM, or NSS.
Reporter | ||
Comment 20•24 years ago
|
||
Then I would assume the message comes from either rsh or some other leftover. I
think we should not chase this error message, because it only occurs after the
system is failing anyway (resource problems on the client side, hanging
stress clients that I had to kill manually to get rid of them). Also, at that
point in time the selfserv has already terminated, and as you stated the message
does not seem to come from the mozilla source. If it is very important to know
where it comes from, I will use strings to try to find out.
In my opinion we need to determine whether the pause that the server has in
accepting new connections is a sign of "stress"; at this point it has been
presented with 200 different certs and over 300 connections. If we can accept
this as sufficient stress, this test has no resource problems on the client side
and no weird error messages either. We could then just examine the exit messages
to find out if they are sufficient.
If we absolutely need to create more severe stress on the server, then we need to
spend time figuring out which Linux / Unix system can handle how many
clients and see if we can raise the load this way.
Comment 21•24 years ago
|
||
Bob, Nelson, could you answer Sonja's question (the
second and third paragraphs of her previous comment)?
My understanding is that the goal of this stress test
is to cause our in-memory database to overflow. So
one way to determine whether we have created enough
stress on the server is to find out if DBM created
those temporary files, which would show that the
in-memory database overflowed. Agreed?
Comment 22•24 years ago
|
||
Wan-Teh wrote:
> My understanding is that the goal of this stress test
> is to cause our in-memory database to overflow.
No, I don't think that was the goal of this stress test.
If this is the test I recall, the purpose of this test was to have a lot
of clients, each with its own individual unique client authentication
certificate, performing SSL connections with client authentication.
The purpose was to test realistic use of client authentication. It was
to try to detect problems with servers doing heavy client auth in our
own testing, before customers catch them, and to exercise a portion of
our SSL server cache code that is only used when client auth is being done.
Previous tests had all clients reusing the same client certificate,
which was not realistic, because the server would have only one cert in
memory (since that one cert matched all clients). By having each client
use a unique cert, we forced the server to have all those different certs
in the server SSL cache and in the various cert DBs at once.
Even if this test is a different one than the one of which I am thinking,
it cannot be measured by seeing when the temp cert DB overflows to disk,
because beginning in NSS 3.4, the temp cert DB is no longer implemented
using DBM, and it no longer overflows to disk.
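The one-certificate-per-client idea described above can be sketched as a mapping from client index to certificate nickname. The TestUser<N> pattern follows the TestUser0 nickname quoted earlier in this bug, and 200 is the number of stress certs mentioned; the function name and the wrap-around once clients outnumber certs are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch only: give every stress client its own certificate nickname.
# The wrap-around rule below is an assumption, not taken from the test scripts.
CERT_POOL=200

client_nickname() {
    # $1 is a zero-based client index.
    echo "TestUser$(( $1 % CERT_POOL ))"
}

i=0
while [ "$i" -lt 3 ]; do
    echo "client $i uses $(client_nickname "$i")"
    i=$(( i + 1 ))
done
```

As long as the number of clients stays at or below the pool size, no two clients present the same certificate, which is what forces the server to hold all of them at once.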
Comment 23•24 years ago
|
||
Nelson, this is the stress test you recall.
The question still stands -- how do we know we
have created sufficient stress on the server?
Reporter | ||
Comment 24•24 years ago
|
||
Bob Relyea said this is a requirement for NSS 3.4. I don't think it is a
Beta-Blocker but we need to run it before RTM
Severity: normal → blocker
Comment 25•24 years ago
|
||
Ian will make the appropriate modifications to selfserv
according to Nelson's specification (via email).
Assignee: wtc → ian.mcgreer
Status: ASSIGNED → NEW
Assignee | ||
Comment 26•24 years ago
|
||
Nelson,
With this patch, a call to nss_DumpCertificateCacheInfo() will list all certs
in (1) the cache; and (2) the temporary store. Here is what sample output
looks like:
Certificates in the cache:
Certificate: "OU=Secure Server Certification Authority, O="RSA Data Security, Inc.", C=US" refs=2
Certificate: "CN=foobar, E=foo@bar.com" refs=2
Certificates in the temporary store:
Certificate: "CN=localhost" refs=2
Does this look like what you need?
Comment 27•24 years ago
|
||
That looks like the right info. I think that will do nicely.
I suggest that the ref count come at the beginning of the line,
and drop the "Certificate: " on the front of each line.
Maybe selfserv should have YA command line option to call this
function before exiting.
Comment 28•24 years ago
|
||
Seems to me that it doesn't hurt for selfserv to always
call this function before exiting. Agreed?
Reporter | ||
Comment 29•24 years ago
|
||
I'd prefer a commandline option, or is it that the cache and temp. store should
be completely empty, and if there is no problem there would be no output at all?
Assignee | ||
Comment 30•24 years ago
|
||
I checked in the patch, with Nelson's suggestions.
selfserv uses shared libraries. So we must export this function to take
advantage of it? Is there no other way?
Reporter | ||
Comment 31•24 years ago
|
||
Could exporting these functions introduce a possible security problem? Could we
put #ifdef DEBUG_NSS around the functions and the selfserv exit call, so that I
could do a special build if I need it?
Comment 32•24 years ago
|
||
As we discussed in the conference call this afternoon, Ian
will export this new function from libnss3.so and add a
new command-line option to selfserv.
Assignee | ||
Comment 33•24 years ago
|
||
Sonja-
This function doesn't really pose any kind of security risk. All that is
displayed is the certificate's subject and number of references. There is no
way that I know of to conditionally define a function in our .def format used by
the shared libraries. I thought of using #ifdef DEBUG for the header
definition, but we would probably like to use this in optimized builds as well.
I have defined the function in nss.h, implemented it as per the last patch, and
exported it in nss.def. Also added -y option to selfserv to call this function
before exit.
Reporter | ||
Comment 34•23 years ago
|
||
I ran the distributed stress test on nssqa_win2k with 800, 1200 and 1400 clients
and it seemed fine.
Once, we got the (alleged) rsh error message "Recv failed: Connection reset by peer".
The selfserv gave the lines:
Certificates in cache:
Certificates in temporary store:
but listed no certificates, which I assume is the expected behavior (Ian, Nelson,
please verify).
The last line was "selfserv: normal termination".
I will attach the full log of the 1400-client run, and a shortened version.
Reporter | ||
Comment 35•23 years ago
|
||
Reporter | ||
Comment 36•23 years ago
|
||
Reporter | ||
Updated•23 years ago
|
Whiteboard: test ran, please review results
Assignee | ||
Comment 37•23 years ago
|
||
From the cache/temp store perspective, it looks good. That is the expected
output if there are no leaked references.
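A check for "no leaked references" along these lines could be automated by counting the lines that follow the two dump headers in the selfserv output. This is a minimal sketch that assumes the log layout quoted in this bug (a "Certificates in ..." header, zero or more certificate lines, then a "selfserv:" status line); the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: count certificate lines that follow the two dump headers.
# Assumes the layout shown in this bug; reads files given as arguments or stdin.
count_leaked_certs() {
    awk '
        /^Certificates in / { in_dump = 1; next }   # header line starts a dump section
        /^selfserv:/        { in_dump = 0 }         # selfserv status line ends it
        in_dump && NF       { n++ }                 # any non-blank line in between
        END                 { print n + 0 }
    ' "$@"
}

# Example with the clean shutdown output reported for the stress run.
printf '%s\n' \
    'Certificates in cache:' \
    'Certificates in temporary store:' \
    'selfserv: normal termination' | count_leaked_certs
```

A count of 0 corresponds to the empty listing reported above; any other value would point at certificates still referenced at exit.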
Reporter | ||
Comment 38•23 years ago
|
||
test completed, closing fixed
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED