
stress tests fail on orville

RESOLVED FIXED in 3.3.1

Status

NSS
Libraries
P1
normal
RESOLVED FIXED
16 years ago
16 years ago

People

(Reporter: Sonja Mirtitsch, Assigned: Wan-Teh Chang)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(8 attachments, 1 obsolete attachment)

(Reporter)

Description

16 years ago
The iWs team would let us use orville again, which can finish our tests in 20
minutes (dump takes 3-5 hours).
The stress tests on orville fail, but other HP systems pass.
I filed bug #81707 on the fact that the stress tests did not return an error,
but the QA stat did. When this was fixed I rebooted orville, but this did not
help the condition.
Information about the failure is in every day's QA log.
Would you have a few minutes to explain to me what the failure means exactly?
(Reporter)

Comment 1

16 years ago
Nelson, could you please have a look at it? I realize it seems low priority, but
it constantly costs me time, since I have to double-check the QA reports every
single day, and even if it is just 5 or 10 minutes every day it adds up.
It also makes the QA report harder to read for the rest of you.
It is indeed a low priority until I get the SSL server session cache rewrite
done.  

strsclnt is reporting error -5978 PR_NOT_CONNECTED_ERROR, which is 
equivalent to unix's ENOTCONN error, in response to a PR_Write call 
by the client.  It appears that this error code is used only when an
OS call returns ENOTCONN.  

The number of errors varies from run to run, typically between 3 and 12.
The number of errors reported is coincidentally equal to 
   1000 - (cache_hits + cache_missed) 
as reported by strsclnt when it is done.  This raises a question:

Are these the last N connections that fail? Or do these failures 
occur in the middle of the test, with more successful connections 
occurring afterwards?
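
As a small worked example of the arithmetic above, here is a sketch in C. The helper name and the sample counter values are hypothetical and are not taken from any QA log; the total of 1000 connections comes from the formula itself.

```c
/* Hypothetical helper: connections that were counted neither as a
 * cache hit nor as a cache miss in the strsclnt summary. */
int unaccounted_connections(int total, int cache_hits, int cache_misses)
{
    return total - (cache_hits + cache_misses);
}
```

With made-up counters of 990 hits and 3 misses out of 1000 connections, the helper returns 7, which would fall inside the 3 to 12 failures seen per run.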

The failures occur with both SSL2 and SSL3 ciphersuites, specifically
suites A and c, which are SSL2 RC4 128 WITH MD5 and SSL3 RSA WITH RC4 128 MD5
respectively.  I suspect these are the first SSL2 and SSL3 suites that
are tried by the QA test, respectively.

Question:  When one of these tests fails, do we stop trying any 
further suites from that protocol version?  That is, when the first SSL2
suite fails, does the QA script go on and try the rest of the SSL2 suites?
or does it stop there and go on to the SSL3 suites?

Unless someone else works on this, these questions will go unanswered
at least until next week, possibly later.
(Reporter)

Comment 3

16 years ago
CCing Wan-Teh, since we should have a look at this bug before the early release
(Reporter)

Updated

16 years ago
Summary: machine / configuration problems on orville → stress tests fail on orville
(Reporter)

Comment 4

16 years ago
Nelson, could you please attach the old logs to verify that the failures on
orville were occurring before the recent SSL server cache changes, and are no
different now than before? Thanks.

Orville was used as our main HP-UX 32-bit QA machine until about 5 months ago,
and I am not aware that it showed these failures then. We could not use it for a
while because of a tinderbox running there. When I started using it again, about
a month ago, I noticed this failure.
(Assignee)

Updated

16 years ago
Priority: -- → P1
Target Milestone: --- → 3.3
(Reporter)

Comment 5

16 years ago
I have created a small shell script that will reproduce the problem. Running
both the server and the client with -v makes the problem disappear. I will work
on this bug until someone else has time to take over.
(Reporter)

Comment 6

16 years ago
-v only reduces the number of failures; my previous observation was wrong.

With the client on charm and the server on orville there were no problems, nor
with the server on charm and the client on orville.
-----------
to answer a few questions that were asked on this bug previously:

> Are these the last N connections that fail? Or do these failures 
> occur in the middle of the test, with more successful connections 
> occurring afterwards?
These are not the last connections, I will attach the output when run with -v


> The failures occur with both SSL2 and SSL3 ciphersuites, specifically
> suites A and c, which are SSL2 RC4 128 WITH MD5 and SSL3 RSA WITH RC4 128 MD5
> respectively.  I suspect these are the first SSL2 and SSL3 suites that
> are tried by the QA test, respectively.
correct. The first and only 

> Question:  When one of these tests fails, do we stop trying any 
> further suites from that protocol version?  That is, when the first SSL2
> suite fails, does the QA script go on and try the rest of the SSL2 suites?
> or does it stop there and go on to the SSL3 suites?
The stress tests are the last of the ssl tests. There are only 2 tests enabled
in tests/ssl/sslstress.txt.
Feel free to add new ones; you just need to add lines with server and client
parameters.


Other observations, both running -v, both on orville
-----------------------------------------------------
#selfserver output should look like this:

selfserv: About to call accept.
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1



selfserv: About to call accept.

#sometimes output changes, 2 threads writing at the same time 
#(redirected stderr to stdout)

selfserv: Aboutselfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
 to call accept.
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1

---------------

I will attach the script that I have been using for testing. Please do not run
it in my directories. Either the commented-out "cert.sh" needs to be run first,
or the directories from nightly QA mozilla/testresults/security/orville.1 have
to be copied to the working directory first.





(Reporter)

Comment 7

16 years ago
Created attachment 38697 [details]
stressclient output
(Reporter)

Comment 8

16 years ago
Created attachment 38698 [details]
selfserver output, corresponding to the attachment above
(Reporter)

Comment 9

16 years ago
Created attachment 38699 [details]
script to reproduce the problem
(Assignee)

Comment 10

16 years ago
Assigned the bug to Sonja.
Assignee: nelsonb → sonmi
(Reporter)

Comment 11

16 years ago
Wan-Teh, I will work on this bug and I might find some more evidence, but I do
not think I have enough experience, either in NSS or in multithreaded
programming, to solve it.
If you have the time to look at the output of the selfserver and the first
2 attachments, I'd appreciate it.
(Assignee)

Comment 12

16 years ago
Assigned the bug to Nelson.
Assignee: sonmi → nelsonb
(Reporter)

Comment 13

16 years ago
I do not think it should be a P1, because I just reran 3.2.1 QA, and it shows
the same failure. 
It behaves very differently from all the other stress test failures we have been seeing.
(Assignee)

Updated

16 years ago
Whiteboard: NSS 3.3 Early Release
(Assignee)

Comment 14

16 years ago
I propose we move the target to 3.4 as 3.2.1 fails the
same way on orville.
Moved to 3.4 per Wan-Teh's suggestion; however, I will still try to 
look at it this coming week.
Target Milestone: 3.3 → 3.4
(Assignee)

Comment 16

16 years ago
Assigned the bug to wtc.
Assignee: nelsonb → wtc
(Assignee)

Comment 17

16 years ago
Created attachment 50379 [details] [diff] [review]
A possible workaround.  Write 0 bytes to the socket and retry getpeername.
(Assignee)

Updated

16 years ago
Status: NEW → ASSIGNED
Target Milestone: 3.4 → 3.3.1
(Assignee)

Comment 18

16 years ago
The strsclnt failure on orville, sjsu, and hpgamma
appears to be a bug in HP-UX B.11.00 or its kernel
patches.  I found that getpeername() occasionally
fails with ENOTCONN after a successful completion
of non-blocking connect.  The connection is in fact
successfully established because read() and write()
on the socket work.  Moreover, getpeername() works
after a write() call is made on the socket, which
inspires my workaround (attachment 50379 [details] [diff] [review]).

I tested the workaround with the 32-bit HP-UX B.11.00
debug build of NSS_3_3_BRANCH on orville, sjsu, and
hp64.  It worked.

I checked in the workaround on the NSS_3_3_BRANCH so
that it will get tested by the daily QA.  Nelson,
I'd appreciate it if you could review the workaround.

I will attach the bug report I submitted to HP, and
a test server and a test client (without using NSPR
or NSS) that reproduce the getpeername() bug.
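
The idea behind the first workaround can be sketched in plain POSIX sockets roughly as follows. This is only an illustration under stated assumptions, not the checked-in patch (attachment 50379): the helper name getpeername_retry_after_write is hypothetical, and it works on a raw Unix file descriptor rather than going through NSPR or NSS.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not the NSS patch): call getpeername(), and if it
 * fails with ENOTCONN, write zero bytes to the socket and try once more.
 * Returns 0 on success, -1 on failure with errno set. */
int getpeername_retry_after_write(int fd, struct sockaddr *addr,
                                  socklen_t *len)
{
    socklen_t saved_len = *len;   /* getpeername() may modify *len */

    if (getpeername(fd, addr, len) == 0)
        return 0;
    if (errno != ENOTCONN)
        return -1;                /* unrelated error: do not retry */

    /* Zero-byte write; on the affected HP-UX systems a write() on the
     * socket was observed to make the next getpeername() succeed. */
    if (write(fd, "", 0) < 0)
        return -1;

    *len = saved_len;
    return getpeername(fd, addr, len);
}
```

On a system without the HP-UX bug the first getpeername() call succeeds and the retry path is never taken, so the helper should behave like a plain getpeername() call there.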
(Assignee)

Comment 19

16 years ago
Created attachment 50380 [details]
The bug report I submitted to HP about the getpeername() problem.
(Assignee)

Comment 20

16 years ago
Created attachment 50381 [details]
Test server server.c (bug report attachment 1 [details] [diff] [review])
(Assignee)

Comment 21

16 years ago
Created attachment 50382 [details]
Test client nbclient.c (bug report attachment 2 [details] [diff] [review])
(Assignee)

Comment 22

16 years ago
Comment on attachment 50381 [details]
Test server server.c (bug report attachment 1 [details] [diff] [review])

Wrong file.
Attachment #50381 - Attachment is obsolete: true
(Assignee)

Comment 23

16 years ago
Created attachment 50383 [details]
Test server server.c (bug report attachment 1 [details] [diff] [review])
(Assignee)

Comment 24

16 years ago
I did more experiments and found out more about this
HP-UX bug.

When a non-blocking connect completes successfully
but getpeername() fails with ENOTCONN, a second
connect() call on the socket will fail with EALREADY,
which is consistent with the ENOTCONN error of
getpeername().  If I then select the socket for
writing again with a zero timeout, the next
getpeername() call will succeed, and the next
connect() call will fail with the expected EISCONN
error.  I never have to select a socket for writing
a third time.

This suggests another workaround, which differs
from the first workaround in that we select
the socket for writing (with a zero timeout)
instead of writing zero bytes to it.  This can
be done in PR_Connect() for a blocking NSPR socket
or in PR_ConnectContinue() for a non-blocking NSPR
socket.  Note that we can't use this workaround
in NSS.  NSS can't call select() and must call
PR_Poll().  PR_Poll() only works when the bottom
layer is NSPR, but NSS may be on top of a non-NSPR
layer.  This workaround must be done in the bottom
layer, which would call select() directly on the
Unix file descriptor.

I will attach the second workaround, which will be
a patch for NSPR.
(Assignee)

Comment 25

16 years ago
Created attachment 50444 [details] [diff] [review]
Second workaround: a patch for NSPR.  Select the socket for writing with zero timeout after successful completion of a non-blocking connect.
Perhaps it's best to patch both SSL and NSPR.
That way, the problem may be solved on HP boxes that use a different 
implementation of NSPR's I/O layer, also.
But perhaps it would be best to put the modification inside the 
function ssl_GetPeerInfo rather than after the calls to that function.
(Assignee)

Comment 27

16 years ago
Thanks for the code review, Nelson.  I am inclined to also patch
NSPR, like you suggested.  What I haven't decided is when.

> But perhaps it would be best to put the modification inside the 
> function ssl_GetPeerInfo rather than after the calls to that function.

I didn't do that because the HP-UX bug only affects the party that
calls connect (that is, the client) so it is only necessary to use
the workaround when ssl_GetPeerInfo is called on the client side.

(Assignee)

Comment 28

16 years ago
*** Bug 99493 has been marked as a duplicate of this bug. ***
(Assignee)

Comment 29

16 years ago
The workaround made the NSS 3.3 branch pass the daily QA.
I checked in the workaround on the tip of NSS.
(Assignee)

Updated

16 years ago
Component: Tools → Libraries
(Assignee)

Comment 30

16 years ago
Marked the bug fixed.

I submitted a bug report to HP but did not get
a reply.
(Assignee)

Comment 31

16 years ago
Really marked the bug fixed.
Status: ASSIGNED → RESOLVED
Last Resolved: 16 years ago
Resolution: --- → FIXED