Bug 83593 - stress tests fail on orville

Status: RESOLVED FIXED
Product: NSS
Classification: Components
Component: Libraries
Version: 3.2.1
Hardware: HP HP-UX
Importance: P1 normal
Target Milestone: 3.3.1
Assigned To: Wan-Teh Chang
QA Contact: Sonja Mirtitsch
Duplicates: 99493

Reported: 2001-05-31 18:05 PDT by Sonja Mirtitsch
Modified: 2001-10-31 22:48 PST


Attachments

stressclient output (509.22 KB, text/plain)
2001-06-15 17:19 PDT, Sonja Mirtitsch

selfserver output, corresponding to the attachment above (589.95 KB, text/plain)
2001-06-15 17:20 PDT, Sonja Mirtitsch

script to reproduce the problem (1.34 KB, text/plain)
2001-06-15 17:21 PDT, Sonja Mirtitsch

A possible workaround. Write 0 bytes to the socket and retry getpeername. (1.03 KB, patch)
2001-09-22 07:51 PDT, Wan-Teh Chang

The bug report I submitted to HP about the getpeername() problem. (1.77 KB, text/plain)
2001-09-22 08:21 PDT, Wan-Teh Chang

Test server server.c (bug report attachment 1) (1.86 KB, text/plain)
2001-09-22 08:22 PDT, Wan-Teh Chang

Test client nbclient.c (bug report attachment 2) (4.97 KB, text/plain)
2001-09-22 08:23 PDT, Wan-Teh Chang

Test server server.c (bug report attachment 1) (2.08 KB, text/plain)
2001-09-22 08:25 PDT, Wan-Teh Chang

Second workaround: a patch for NSPR. Select the socket for writing with zero timeout after successful completion of a non-blocking connect. (1.07 KB, patch)
2001-09-22 23:44 PDT, Wan-Teh Chang

Description Sonja Mirtitsch 2001-05-31 18:05:50 PDT
The iWs team would let us use orville again, which can finish our tests in 20
minutes (dump takes 3-5 hours).
The stress tests on orville fail, but other HP systems pass.
I filed bug #81707 on the fact that the stress tests did not return an error,
but the QA stat did. When this was fixed I rebooted orville, but this did not
help the condition.
Information about the failure is in every day's QA log.
Would you have a few minutes to explain to me what the failure means exactly?
Comment 1 Sonja Mirtitsch 2001-06-05 10:44:42 PDT
Nelson, could you please have a look at it? I realize it seems low priority, but
it costs me time constantly, since I have to double-check the QA reports every
single day, and even if it is just 5 or 10 minutes every day it adds up.
It also makes the QA report harder to read for the rest of you.
Comment 2 Nelson Bolyard (seldom reads bugmail) 2001-06-06 18:04:15 PDT
It is indeed a low priority until I get the SSL server session cache rewrite
done.  

strsclnt is reporting error -5978 PR_NOT_CONNECTED_ERROR, which is
equivalent to Unix's ENOTCONN error, in response to a PR_Write call
by the client.  It appears that this error code is used only when an
OS call returns ENOTCONN.
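
For reference, this is roughly how the failure surfaces through the
public NSPR API (an illustrative sketch with a made-up helper name,
not strsclnt's actual code):

  #include "prio.h"
  #include "prerror.h"

  /* NSPR maps the OS-level ENOTCONN to PR_NOT_CONNECTED_ERROR (-5978),
   * and the code is retrievable via PR_GetError() after a failed call. */
  static PRBool write_failed_not_connected(PRFileDesc *fd,
                                           const void *buf, PRInt32 len)
  {
      PRInt32 sent = PR_Write(fd, buf, len);
      return (PRBool)(sent < 0 && PR_GetError() == PR_NOT_CONNECTED_ERROR);
  }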

The number of errors varies from run to run, typically between 3 and 12.
The number of errors reported is coincidentally equal to
   1000 - (cache_hits + cache_misses)
as reported by strsclnt when it is done.  This raises a question:

Are these the last N connections that fail?  Or do these failures
occur in the middle of the test, with more successful connections
occurring afterwards?

The failures occur with both SSL2 and SSL3 ciphersuites, specifically
suites A and c, which are SSL2 RC4 128 WITH MD5 and SSL3 RSA WITH RC4 128 MD5
respectively.  I suspect these are the first SSL2 and SSL3 suites that
are tried by the QA test, respectively.

Question:  When one of these tests fails, do we stop trying any 
further suites from that protocol version?  That is, when the first SSL2
suite fails, does the QA script go on and try the rest of the SSL2 suites?
or does it stop there and go on to the SSL3 suites?

Unless someone else works on this, these questions will go unanswered
at least until next week, possibly later.
Comment 3 Sonja Mirtitsch 2001-06-07 16:02:25 PDT
CCing Wan-Teh, since we should have a look at this bug before the early release
Comment 4 Sonja Mirtitsch 2001-06-11 13:53:35 PDT
Nelson, could you please attach the old logs to verify that the failures on
orville were occurring before the recent SSL server cache changes, and are no
different now than before?  Thanks

Orville was used as our main HP-UX 32-bit QA machine until about 5 months ago,
and I am not aware that it showed these failures then.  We could not use it for a
while because of a tinderbox running there.  When I started using it again, about
a month or so ago, I noticed this failure.
Comment 5 Sonja Mirtitsch 2001-06-15 16:19:59 PDT
I have created a small shell script that will reproduce the problem.  Running
both the server and the client with -v makes the problem disappear.  I will work
on this bug until someone else has time to take over.
Comment 6 Sonja Mirtitsch 2001-06-15 17:17:13 PDT
-v just decreases the failure rate - my previous observation was wrong.

Client on charm with server on orville did not show the problem; neither does
server on charm with client on orville.
-----------
to answer a few questions that were asked on this bug previously:

> Are these the last N connections that fail?  Or do these failures
> occur in the middle of the test, with more successful connections
> occurring afterwards?
These are not the last connections; I will attach the output from a run with -v.


> The failures occur with both SSL2 and SSL3 ciphersuites, specifically
> suites A and c, which are SSL2 RC4 128 WITH MD5 and SSL3 RSA WITH RC4 128 MD5
> respectively.  I suspect these are the first SSL2 and SSL3 suites that
> are tried by the QA test, respectively.
Correct.  They are the first and only suites tried.

> Question:  When one of these tests fails, do we stop trying any 
> further suites from that protocol version?  That is, when the first SSL2
> suite fails, does the QA script go on and try the rest of the SSL2 suites?
> or does it stop there and go on to the SSL3 suites?
The stress tests are the last of the SSL tests; there are only 2 tests enabled
in tests/ssl/sslstress.txt.
Feel free to add new ones; you just need to add lines with server and client
parameters.


Other observations, both running with -v, both on orville
-----------------------------------------------------
#selfserver output should look like this:

selfserv: About to call accept.
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1



selfserv: About to call accept.

#sometimes the output is interleaved, 2 threads writing at the same time
#(stderr redirected to stdout)

selfserv: Aboutselfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
 to call accept.
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1

---------------

I will attach the script that I have been using for testing.  Please do not run
it in my directories; the commented-out "cert.sh" needs to be run first, or the
directories from the nightly QA mozilla/testresults/security/orville.1 have to be
copied to the working directory first.
Comment 7 Sonja Mirtitsch 2001-06-15 17:19:00 PDT
Created attachment 38697
stressclient output
Comment 8 Sonja Mirtitsch 2001-06-15 17:20:16 PDT
Created attachment 38698
selfserver output, corresponding to the attachment above
Comment 9 Sonja Mirtitsch 2001-06-15 17:21:12 PDT
Created attachment 38699
script to reproduce the problem
Comment 10 Wan-Teh Chang 2001-06-15 21:10:10 PDT
Assigned the bug to Sonja.
Comment 11 Sonja Mirtitsch 2001-06-15 21:17:45 PDT
Wan-Teh, I will work on this bug and I might find some more evidence, but I do
not think I have enough experience, either in NSS or in multithreaded
programming, to solve it.
If you have the time to look at the output of the selfserver and the first
2 attachments, I'd appreciate it.
Comment 12 Wan-Teh Chang 2001-06-15 22:29:15 PDT
Assigned the bug to Nelson.
Comment 13 Sonja Mirtitsch 2001-06-18 15:24:10 PDT
I do not think it should be a P1, because I just reran 3.2.1 QA and it shows
the same failure.
It reacts very differently from all the other stress test failures we have been seeing.
Comment 14 Wan-Teh Chang 2001-06-21 23:53:01 PDT
I propose we move the target to 3.4 as 3.2.1 fails the
same way on orville.
Comment 15 Nelson Bolyard (seldom reads bugmail) 2001-06-22 14:09:43 PDT
Moved to 3.4 per Wan-Teh's suggestion; however, I will still try to
look at it this coming week.
Comment 16 Wan-Teh Chang 2001-08-30 17:11:37 PDT
Assigned the bug to wtc.
Comment 17 Wan-Teh Chang 2001-09-22 07:51:15 PDT
Created attachment 50379
A possible workaround.  Write 0 bytes to the socket and retry getpeername.
Comment 18 Wan-Teh Chang 2001-09-22 08:16:02 PDT
The strsclnt failure on orville, sjsu, and hpgamma
appears to be a bug in HP-UX B.11.00 or its kernel
patches.  I found that getpeername() occasionally
fails with ENOTCONN after a successful completion
of non-blocking connect.  The connection is in fact
successfully established because read() and write()
on the socket work.  Moreover, getpeername() works
after a write() call is made on the socket, which
inspires my workaround (attachment 50379).

I tested the workaround with the 32-bit HP-UX B.11.00
debug build of NSS_3_3_BRANCH on orville, sjsu, and
hp64.  It worked.

I checked in the workaround on the NSS_3_3_BRANCH so
that it will get tested by the daily QA.  Nelson,
I'd appreciate it if you could review the workaround.

I will attach the bug report I submitted to HP, and
a test server and a test client (without using NSPR
or NSS) that reproduce the getpeername() bug.
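
To make the idea concrete, here is the workaround in plain BSD-sockets
terms (a sketch of the idea only, with an illustrative helper name; the
real patch is attachment 50379):

  #include <sys/socket.h>
  #include <unistd.h>
  #include <errno.h>

  /* Call this after a non-blocking connect() has completed successfully. */
  static int getpeername_with_workaround(int fd, struct sockaddr *sa,
                                         socklen_t *salen)
  {
      if (getpeername(fd, sa, salen) == 0)
          return 0;
      if (errno != ENOTCONN)
          return -1;
      /* HP-UX B.11.00 quirk: getpeername() can fail with ENOTCONN even
       * though the connection is established.  A zero-byte write() makes
       * the next getpeername() succeed. */
      (void) write(fd, "", 0);
      return getpeername(fd, sa, salen);
  }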
Comment 19 Wan-Teh Chang 2001-09-22 08:21:08 PDT
Created attachment 50380
The bug report I submitted to HP about the getpeername() problem.
Comment 20 Wan-Teh Chang 2001-09-22 08:22:33 PDT
Created attachment 50381
Test server server.c (bug report attachment 1)
Comment 21 Wan-Teh Chang 2001-09-22 08:23:22 PDT
Created attachment 50382
Test client nbclient.c (bug report attachment 2)
Comment 22 Wan-Teh Chang 2001-09-22 08:24:52 PDT
Comment on attachment 50381
Test server server.c (bug report attachment 1)

Wrong file.
Comment 23 Wan-Teh Chang 2001-09-22 08:25:46 PDT
Created attachment 50383
Test server server.c (bug report attachment 1)
Comment 24 Wan-Teh Chang 2001-09-22 23:05:40 PDT
I did more experiments and found out more about this
HP-UX bug.

When a non-blocking connect completes successfully
but getpeername() fails with ENOTCONN, a second
connect() call on the socket will fail with EALREADY,
which is consistent with the ENOTCONN error of
getpeername().  If I then select the socket for
writing again with a zero timeout, the next
getpeername() call will succeed, and the next
connect() call will fail with the expected EISCONN
error.  I never have to select a socket for writing
a third time.

This suggests another workaround, which differs
from the first workaround in that we select
the socket for writing (with a zero timeout)
instead of writing zero bytes to it.  This can
be done in PR_Connect() for a blocking NSPR socket
or in PR_ConnectContinue() for a non-blocking NSPR
socket.  Note that we can't use this workaround
in NSS.  NSS can't call select() and must call
PR_Poll().  PR_Poll() only works when the bottom
layer is NSPR, but NSS may be on top of a non-NSPR
layer.  This workaround must be done in the bottom
layer, which would call select() directly on the
Unix file descriptor.

I will attach the second workaround, which will be
a patch for NSPR.
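
In plain select()/getpeername() terms, the second workaround looks
roughly like this (a sketch of the system-call sequence, with an
illustrative helper name; the actual NSPR patch is attachment 50444):

  #include <sys/time.h>
  #include <sys/socket.h>
  #include <errno.h>

  /* Call this after a non-blocking connect() has completed successfully. */
  static int confirm_connected(int fd)
  {
      struct sockaddr sa;
      socklen_t salen = sizeof(sa);
      fd_set wfds;
      struct timeval zero;

      if (getpeername(fd, &sa, &salen) == 0)
          return 0;
      if (errno != ENOTCONN)
          return -1;

      /* Select the socket for writing with a zero timeout; in my
       * experiments one pass was always enough. */
      FD_ZERO(&wfds);
      FD_SET(fd, &wfds);
      zero.tv_sec = 0;
      zero.tv_usec = 0;
      (void) select(fd + 1, NULL, &wfds, NULL, &zero);

      salen = sizeof(sa);
      return getpeername(fd, &sa, &salen);
  }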
Comment 25 Wan-Teh Chang 2001-09-22 23:44:24 PDT
Created attachment 50444
Second workaround: a patch for NSPR.  Select the socket for writing with zero timeout after successful completion of a non-blocking connect.
Comment 26 Nelson Bolyard (seldom reads bugmail) 2001-09-24 16:33:11 PDT
Perhaps it's best to patch both SSL and NSPR.
That way, the problem may also be solved on HP boxes that use a different
implementation of NSPR's I/O layer.
But perhaps it would be best to put the modification inside the 
function ssl_GetPeerInfo rather than after the calls to that function.
Comment 27 Wan-Teh Chang 2001-09-24 16:49:12 PDT
Thanks for the code review, Nelson.  I am inclined to also patch
NSPR, like you suggested.  What I haven't decided is when.

> But perhaps it would be best to put the modification inside the 
> function ssl_GetPeerInfo rather than after the calls to that function.

I didn't do that because the HP-UX bug only affects the party that
calls connect (that is, the client), so it is only necessary to use
the workaround when ssl_GetPeerInfo is called on the client side.

Comment 28 Wan-Teh Chang 2001-09-24 16:57:19 PDT
*** Bug 99493 has been marked as a duplicate of this bug. ***
Comment 29 Wan-Teh Chang 2001-09-24 18:31:37 PDT
The workaround made the NSS 3.3 branch pass the daily QA.
I checked in the workaround on the tip of NSS.
Comment 30 Wan-Teh Chang 2001-10-31 19:35:53 PST
Marked the bug fixed.

I submitted a bug report to HP but did not get
a reply.
Comment 31 Wan-Teh Chang 2001-10-31 22:48:02 PST
Really marked the bug fixed.
