Closed Bug 81707 Opened 24 years ago Closed 24 years ago

strsclnt problems on orville, exitcode still 0

Categories

(NSS :: Tools, defect, P2)

3.2.1
HP
HP-UX
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sonja.mirtitsch, Assigned: larryh)

Details

Attachments

(1 file)

5978 is PR_NOT_CONNECTED_ERROR

ssl.sh: SSL Stress Test ===============================
ssl.sh: Stress SSL2 RC4 128 with MD5 ----
selfserv -D -p 8443 -d ../server -n orville.red.iplanet.com \
         -w nss -i ../tests_pid.6758 &
selfserv started at Fri May 18 06:05:57 PDT 2001
tstclnt -p 8443 -h orville -q -d . < /h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010518.1/y2sun2_Solaris8/mozilla/security/nss/tests/ssl/sslreq.txt
strsclnt -q -p 8443 -d . -w nss -c 1000 -C A \
         orville.red.iplanet.com
strsclnt started at Fri May 18 06:05:57 PDT 2001
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: 1 server certificates tested.
strsclnt completed at Fri May 18 06:06:02 PDT 2001
ssl.sh: Stress SSL3 RC4 128 with MD5 ----
selfserv -D -p 8443 -d ../server -n orville.red.iplanet.com \
         -w nss -i ../tests_pid.6758 &
selfserv started at Fri May 18 06:06:02 PDT 2001
tstclnt -p 8443 -h orville -q -d . < /h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010518.1/y2sun2_Solaris8/mozilla/security/nss/tests/ssl/sslreq.txt
strsclnt -q -p 8443 -d . -w nss -c 1000 -C c \
         orville.red.iplanet.com
strsclnt started at Fri May 18 06:06:03 PDT 2001
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
strsclnt: 987 cache hits; 1 cache misses, 0 cache not reusable
strsclnt completed at Fri May 18 06:06:10 PDT 2001

-------------- working stress test on dump, same OS

ssl.sh: SSL Stress Test ===============================
ssl.sh: Stress SSL2 RC4 128 with MD5 ----
selfserv -D -p 8443 -d ../server -n dump.red.iplanet.com \
         -w nss -i ../tests_pid.16154 &
selfserv started at Fri May 18 04:21:09 PDT 2001
tstclnt -p 8443 -h dump -q -d . < /h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010518.1/y2sun2_Solaris8/mozilla/security/nss/tests/ssl/sslreq.txt
strsclnt -q -p 8443 -d . -w nss -c 1000 -C A \
         dump.red.iplanet.com
strsclnt started at Fri May 18 04:21:11 PDT 2001
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 1 server certificates tested.
strsclnt completed at Fri May 18 04:21:27 PDT 2001
/h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010518.1/y2sun2_Solaris8/mozilla/security/nss/tests/all.sh[99]: 16778 Terminated
ssl.sh: Stress SSL3 RC4 128 with MD5 ----
selfserv -D -p 8443 -d ../server -n dump.red.iplanet.com \
         -w nss -i ../tests_pid.16154 &
selfserv started at Fri May 18 04:21:27 PDT 2001
tstclnt -p 8443 -h dump -q -d . < /h/hs-sca15c/export/builds/mccrel/nss/nsstip/builds/20010518.1/y2sun2_Solaris8/mozilla/security/nss/tests/ssl/sslreq.txt
strsclnt -q -p 8443 -d . -w nss -c 1000 -C c \
         dump.red.iplanet.com
strsclnt started at Fri May 18 04:21:28 PDT 2001
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 999 cache hits; 1 cache misses, 0 cache not reusable
strsclnt completed at Fri May 18 04:21:52 PDT 2001
Larry, could you find out why strsclnt exits with status 0 after printing the error message below?

strsclnt: PR_Write returned error -5978: Network file descriptor is not connected.
Assignee: wtc → larryh
Priority: -- → P2
Target Milestone: --- → 3.3
Status: NEW → ASSIGNED
Examining the code ... The error message is emitted by function errWarn(), as called by handle_connection(). After emitting the message, handle_connection() returns SECFailure to its caller, do_connects(). do_connects() quietly ignores the return value. That is the proximate cause.

As for the intent of strsclnt, there is some difference of opinion on its purpose. The original author (nelsonb) declared the program to be a "unit test, designed to help during development of SSL. ... that its result value is zero when a PR_Write() failed, is not important". The submitter of this bugzilla is using strsclnt in automated QA tests, where the result value of the program is an indication that the test failed. That there is no real "specification" for the correct behavior of strsclnt leaves me in a quandary about "fixing" this at all.

How-some-ever, were I going to make it return a failing result, I'd probably set a global flag, such as "failed_already", when the error message is emitted, and keep going. Then, at program exit, if failed_already is true, exit with a result indicating failure. I'll let management decide on whether strsclnt is a "QA test program" or an "engineer's unit test".
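For illustration, the "set a flag and keep going" idea described above might look roughly like the following sketch. It is not the attached patch and not the strsclnt source: warn_and_record(), run_connections(), and exit_status() are hypothetical stand-ins, and the PR_Write() failure is simulated rather than produced by NSPR. Only the flag name failed_already comes from the comment above.

```c
#include <assert.h>
#include <stdio.h>

/* Set once any non-fatal error has been reported (name taken from the comment). */
static int failed_already = 0;

/* Hypothetical stand-in for a warning path: print the message, remember the failure. */
static void warn_and_record(const char *msg)
{
    fprintf(stderr, "strsclnt: %s\n", msg);
    failed_already = 1;
}

/*
 * Hypothetical stand-in for the connection loop. Every fail_every-th
 * connection is treated as a simulated write failure; fail_every == 0
 * means no failures. Returns the number of connections that succeeded.
 */
static int run_connections(int count, int fail_every)
{
    int completed = 0;
    int i;

    assert(count >= 0);
    for (i = 1; i <= count; i++) {
        if (fail_every > 0 && i % fail_every == 0) {
            warn_and_record("simulated PR_Write failure");
            continue;               /* keep going, as a unit tester would want */
        }
        completed++;
    }
    return completed;
}

/* Exit status for the whole run: nonzero if anything failed along the way. */
static int exit_status(void)
{
    return failed_already ? 1 : 0;
}
```

A main() built on this would run all connections and finish with `return exit_status();`, so the run is not cut short but an automated QA script still sees a nonzero status.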
Attached patch: proposed patch
proposed patch, as described in my previous diatribe. :-) Should do the trick. Leaves the program running for folks wanting it to keep running: unit testers. Returns non-zero on errors in that PR_Write() for QA purposes. Comments, please.
solution and patch look good to me.
Larry: The NSS tip QA failed on orville again today with the same error. It is odd that PR_Write() failed with PR_NOT_CONNECTED_ERROR. I am worried that PR_Connect() might have returned prematurely before the connection was established. Could you look into that? It would be a good idea to run the NSPR 4.1.1 test suite on orville first. (PR_NOT_CONNECTED_ERROR is the equivalent of the Unix errno ENOTCONN from send().) orville is maintained by the HP engineers working with the web server team. Therefore, it tends to have all the recommended kernel patches installed. It is likely that orville behaves differently from the other HP boxes we run our tests on (hp64 and dump). We also need to understand the purpose of the strsclnt test. If it is solely intended for generating client traffic for a server, it makes sense that it exits with a success status when non-fatal errors occur. The fact that the particular PR_Write() failure is only handled with a function named "errWarn", and the fact that many other failures are handled with "errExit" or exit() seem to indicate that the PR_Write() failure in question is not a fatal error and only warrants a warning message. If, after checking with Nelson, you decide that this is a fatal error, I would suggest that you use errExit() or exit() to terminate the strsclnt test rather than adding a new "failed_already" global variable.
About the orville failures: there are tons of old processes running on orville, and I think a reboot would fix the problem there. The reason I have not requested a reboot yet is that I would like to see the fix work as long as we have the error. If Nelson owns the QA stress test, then make the decision between you and Nelson; otherwise you might want to talk to me as well, since I also have some input and suggestions to make.
I do not think it is a "fatal" condition, so I prefer Larry's solution of having the test run to its end if possible, as opposed to quitting on the first failure. However, I am certain that the PR_Write failure should produce a nonzero exit code, because it indicates that something is not working as expected, even if it is not "fatal". I too would like to hear Nelson's input; maybe Larry, Nelson and I can get together in the early afternoon. If some unit tests need it to exit with 0 after the failures, I would like to discuss this too. What I am afraid of is the "not-my-job" approach: for example, if the stress tests find a non-stress-related problem and fail to report it because we make the focus so narrow, and have the tests only report "expected" stress-related problems.
I re-ran NSPR's test suite on orville this morning, debug and optimized builds. There were no errors indicating that PR_Connect() is misbehaving. Sonja's observation that there are many strsclnt (or selfserv) processes running on orville is consistent with my experience that multiple idling selfservs can lead to unpredictable results when talking to strsclnt. I believe her suggestion of killing off the hung processes or rebooting the box is sound. ... I have to ask: How did orville end up with dangling selfservs in the first place?
orville did not have old strsclnt and selfserv processes, but a lot of webserver-related processes (at least that's what I think they are). Also, there seems to be a constant tinderbox build going on, which at times fills up /tmp. It might not be the best machine for us to test on. I just tried it again after not using it for 3 months or so, because Christian thought it should work for us again.
Larry, I would change the last line in your patch

    exitVal = ( exitVal || failed_already )? 1 : 0;

to something like

    if (!exitVal) {
        exitVal = failed_already;
    }

This way we don't lose the original value of exitVal if it is nonzero. You can go ahead and check in your patch.
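To make the difference between the two forms concrete, here is a small sketch that wraps each one in a function so they can be compared side by side. The function names merge_collapsing() and merge_preserving() are hypothetical and do not appear in strsclnt; the sketch also assumes failed_already only ever holds 0 or 1.

```c
#include <assert.h>

/* Form from the proposed patch: any failure collapses to exactly 1. */
static int merge_collapsing(int exitVal, int failed_already)
{
    return (exitVal || failed_already) ? 1 : 0;
}

/* Suggested form: a nonzero exitVal is kept; the flag is used only when exitVal is 0. */
static int merge_preserving(int exitVal, int failed_already)
{
    /* Assumption of this sketch: the flag is a plain 0/1 value. */
    assert(failed_already == 0 || failed_already == 1);
    if (!exitVal) {
        exitVal = failed_already;
    }
    return exitVal;
}
```

With an earlier exit value of 3, the first form reports 1 and the second reports 3. Both report 0 when nothing failed and 1 when only the flag is set.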
Checked in parts.
Marking fixed.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED