Closed Bug 83593 Opened 23 years ago Closed 23 years ago

stress tests fail on orville

Categories

(NSS :: Libraries, defect, P1)

3.2.1
HP
HP-UX
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sonja.mirtitsch, Assigned: wtc)

References

Details

Attachments

(8 files, 1 obsolete file)

The iWs team would let us use orville again, which can finish our tests in 20 minutes (dump takes 3-5 hours). The stress tests on orville fail, but other HP systems pass. I filed bug #81707 on the fact that the stress tests did not return an error, but the QA stat did. When this was fixed I rebooted orville, but this did not help the condition. Information about the failure is in every day's QA log. Would you have a few minutes to explain to me what the failure means exactly?
Nelson, could you please have a look at it? I realize it seems low priority, but it costs me time constantly, since I have to double-check the QA reports every single day, and even if it is just 5 or 10 minutes every day it adds up. It also makes the QA report harder to read for the rest of you.
It is indeed a low priority until I get the SSL server session cache rewrite done.

strsclnt is reporting error -5978 PR_NOT_CONNECTED_ERROR, which is equivalent to unix's ENOTCONN error, in response to a PR_Write call by the client. It appears that this error code is used only when an OS call returns ENOTCONN. The number of errors varies from run to run, typically between 3 and 12. The number of errors reported is coincidentally equal to 1000 - (cache_hits + cache_missed) as reported by strsclnt when it is done.

This raises a question: Are these the last N connections that fail? Or do these failures occur in the middle of the test with more successful connections occurring afterwards?

The failures occur with both SSL2 and SSL3 ciphersuites, specifically suites A and c, which are SSL2 RC4 128 WITH MD5 and SSL3 RSA WITH RC4 128 MD5 respectively. I suspect these are the first SSL2 and SSL3 suites that are tried by the QA test, respectively.

Question: When one of these tests fails, do we stop trying any further suites from that protocol version? That is, when the first SSL2 suite fails, does the QA script go on and try the rest of the SSL2 suites? Or does it stop there and go on to the SSL3 suites?

Unless someone else works on this, these questions will go unanswered at least until next week, possibly later.
CCing Wan-Teh, since we should have a look at this bug before the early release.
Summary: machine / configuration problems on orville → stress tests fail on orville
Nelson, could you please attach the old logs to verify that the failures on orville were occurring before the recent SSL server cache changes, and are no different now than before? Thanks.

Orville was used as our main HP-UX 32-bit QA machine until about 5 months ago, and I am not aware that it showed these failures then. We could not use it for a while because of a tinderbox running there. When I started using it again, about 1 month or so ago, I noticed this failure.
Priority: -- → P1
Target Milestone: --- → 3.3
I have created a small shell script that will reproduce the problem. Running both the server and the client with -v makes the problem disappear. I will work on this bug until someone else has time to take over.
-v just decreases the failure rate; my previous observation was wrong. Client on charm with server on orville did not show problems, and neither did server on charm with client on orville.

-----------
To answer a few questions that were asked on this bug previously:

> Are these the last N connections that fail? or do these failures
> occur in the middle of the test with more successful connections
> occurring afterwards?

These are not the last connections. I will attach the output when run with -v.

> The failures occur with both SSL2 and SSL3 ciphersuites, specifically
> suites A and c, which are SSL2 RC4 128 WITH MD5 and SSL3 RSA WITH RC4 128 MD5
> respectively. I suspect these are the first SSL2 and SSL3 suites that
> are tried by the QA test, respectively.

Correct. The first and only.

> Question: When one of these tests fails, do we stop trying any
> further suites from that protocol version? That is, when the first SSL2
> suite fails, does the QA script go on and try the rest of the SSL2 suites?
> or does it stop there and go on to the SSL3 suites?

The stress tests are the last of the ssl tests. There are only 2 tests enabled in tests/ssl/sslstress.txt. Feel free to add new ones; you just need to add lines with server and client parameters.

Other observations, both running -v, both on orville
-----------------------------------------------------
#selfserver output should look like this:
selfserv: About to call accept.
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
selfserv: About to call accept.

#sometimes output changes, 2 threads writing at the same time
#(redirected stderr to stdout)
selfserv: Aboutselfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1
 to call accept.
selfserv: 0 cache hits; 0 cache misses, 0 cache not reusable
selfserv: bulk cipher RC4, 128 secret key bits, 128 key bits, status: 1

---------------
I will attach the script that I have been using for testing. Please do not run it in my directories. The commented-out "cert.sh" needs to be run first, or the directories from nightly QA mozilla/testresults/security/orville.1 have to be copied to the working directory first.
Attached file stressclient output
Assigned the bug to Sonja.
Assignee: nelsonb → sonmi
Wan-Teh, I will work on this bug and I might find some more evidence, but I do not think I have enough experience, either in NSS or in multithreaded programming, to solve it. If you have the time to look at the output of the selfserver and the first 2 attachments, I'd appreciate it.
Assigned the bug to Nelson.
Assignee: sonmi → nelsonb
I do not think it should be a P1, because I just reran 3.2.1 QA, and it shows the same failure. It behaves very differently from all the other stress test failures we have been seeing.
Whiteboard: NSS 3.3 Early Release
I propose we move the target to 3.4 as 3.2.1 fails the same way on orville.
Moved to 3.4 per Wan-Teh's suggestion, however, I will still try to look at it this coming week.
Target Milestone: 3.3 → 3.4
Assigned the bug to wtc.
Assignee: nelsonb → wtc
Status: NEW → ASSIGNED
Target Milestone: 3.4 → 3.3.1
The strsclnt failure on orville, sjsu, and hpgamma appears to be a bug in HP-UX B.11.00 or its kernel patches. I found that getpeername() occasionally fails with ENOTCONN after a successful completion of non-blocking connect. The connection is in fact successfully established because read() and write() on the socket work. Moreover, getpeername() works after a write() call is made on the socket, which inspires my workaround (attachment 50379).

I tested the workaround with the 32-bit HP-UX B.11.00 debug build of NSS_3_3_BRANCH on orville, sjsu, and hp64. It worked. I checked in the workaround on the NSS_3_3_BRANCH so that it will get tested by the daily QA. Nelson, I'd appreciate it if you could review the workaround.

I will attach the bug report I submitted to HP, and a test server and a test client (without using NSPR or NSS) that reproduce the getpeername() bug.
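For readers who want to see the sequence described above in code, here is a minimal sketch using plain POSIX sockets. It is not the checked-in workaround (attachment 50379) and not the test client attached to this bug. All function names (open_listener, connect_nonblocking, peer_name_with_retry, demo_peer_port_matches) are hypothetical, and the loopback listener exists only to make the sketch self-contained. On a system without the HP-UX bug the first getpeername() call succeeds and the retry path is never taken.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: listen on an ephemeral loopback port. */
static int open_listener(struct sockaddr_in *addr)
{
    socklen_t len = sizeof(*addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(fd, (struct sockaddr *)addr, sizeof(*addr)) < 0 ||
        listen(fd, 1) < 0 ||
        getsockname(fd, (struct sockaddr *)addr, &len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: non-blocking connect, then wait until it completes. */
static int connect_nonblocking(const struct sockaddr_in *addr)
{
    struct timeval tv = { 5, 0 };   /* generous timeout for the demo */
    socklen_t errlen = sizeof(int);
    fd_set wfds;
    int err = 0, flags;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        goto fail;
    if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) < 0) {
        if (errno != EINPROGRESS)
            goto fail;
        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        if (select(fd + 1, NULL, &wfds, NULL, &tv) != 1)
            goto fail;
    }
    /* SO_ERROR reports whether the connect actually succeeded. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0 || err != 0)
        goto fail;
    return fd;
fail:
    close(fd);
    return -1;
}

/* getpeername(), retried once after a zero-byte write on ENOTCONN. */
static int peer_name_with_retry(int fd, struct sockaddr_in *peer)
{
    socklen_t len = sizeof(*peer);
    ssize_t n;

    if (getpeername(fd, (struct sockaddr *)peer, &len) == 0)
        return 0;
    if (errno != ENOTCONN)
        return -1;
    n = write(fd, "", 0);         /* the idea from this bug: write zero bytes */
    (void)n;
    len = sizeof(*peer);
    return getpeername(fd, (struct sockaddr *)peer, &len);
}

/* Returns 1 if the peer port seen by the client matches the listener. */
int demo_peer_port_matches(void)
{
    struct sockaddr_in srv, peer;
    int ok, cfd, lfd = open_listener(&srv);

    if (lfd < 0)
        return 0;
    cfd = connect_nonblocking(&srv);
    if (cfd < 0) {
        close(lfd);
        return 0;
    }
    ok = peer_name_with_retry(cfd, &peer) == 0 &&
         peer.sin_port == srv.sin_port;
    close(cfd);
    close(lfd);
    return ok;
}
```

Calling demo_peer_port_matches() from a small main() is enough to exercise the flow. The retry is deliberately limited to one attempt and to the ENOTCONN case so that genuine errors are still reported.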
Comment on attachment 50381: Test server server.c (bug report attachment 1). Wrong file.
Attachment #50381 - Attachment is obsolete: true
I did more experiments and found out more about this HP-UX bug. When a non-blocking connect completes successfully but getpeername() fails with ENOTCONN, a second connect() call on the socket will fail with EALREADY, which is consistent with the ENOTCONN error of getpeername(). If I then select the socket for writing again with a zero timeout, the next getpeername() call will succeed, and the next connect() call will fail with the expected EISCONN error. I never have to select a socket for writing a third time. This suggests another workaround, which differs from the first workaround in that we select the socket for writing (with a zero timeout) instead of writing zero bytes to it. This can be done in PR_Connect() for a blocking NSPR socket or in PR_ConnectContinue() for a non-blocking NSPR socket. Note that we can't use this workaround in NSS. NSS can't call select() and must call PR_Poll(). PR_Poll() only works when the bottom layer is NSPR, but NSS may be on top of a non-NSPR layer. This workaround must be done in the bottom layer, which would call select() directly on the Unix file descriptor. I will attach the second workaround, which will be a patch for NSPR.
Perhaps it's best to patch both SSL and NSPR. That way, the problem may also be solved on HP boxes that use a different implementation of NSPR's I/O layer. But perhaps it would be best to put the modification inside the function ssl_GetPeerInfo rather than after the calls to that function.
Thanks for the code review, Nelson. I am inclined to also patch NSPR, like you suggested. What I haven't decided is when.

> But perhaps it would be best to put the modification inside the
> function ssl_GetPeerInfo rather than after the calls to that function.

I didn't do that because the HP-UX bug only affects the party that calls connect (that is, the client), so it is only necessary to use the workaround when ssl_GetPeerInfo is called on the client side.
*** Bug 99493 has been marked as a duplicate of this bug. ***
The workaround made the NSS 3.3 branch pass the daily QA. I checked in the workaround on the tip of NSS.
Component: Tools → Libraries
Marked the bug fixed. I submitted a bug report to HP but did not get a reply.
Really marked the bug fixed.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED