Closed Bug 207313 Opened 22 years ago Closed 22 years ago

Windows QA failures. tstclnt reports connections reset

Categories

(NSS :: Tools, defect, P1)

x86
Windows 2000
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nelson, Assigned: nelson)

Details

Attachments

(1 file)

For NSS 3.9, tstclnt was fixed/enhanced to be able to work interactively on Windows as it always has on Unix. The checkin was rev 1.26 of tstclnt.c. Beginning at that time, Some of the SSL QA tests have been failing repeatedly on some of our Windows QA test machines that run Win2K or WinXP. The failure has so-far been seen only on QA machines, and not in developers' individual runs of the ssl.sh QA test script. A typical failed QA log entry looks like this: tstclnt -p 8443 -h (host) -f -d ../client -T -n TestUser -w bogus \ < (...)/mozilla/security/nss/tests/ssl/sslreq.txt tstclnt: read from socket failed: TCP Connection aborted. ssl.sh: SSL3 Require client auth (bad password) produced a returncode of 1, expected is 254 FAILED Note that it is always this particular test that fails. Also, for the QA test machines, all the file systems are network mounted. Preliminary analysis suggests that this problem was not directly caused by rev 1.26, but rather was a problem that was previously masked, and was uncovered by the changes in rev 1.26. The problem seems related to a different set of actions on the SSL socket that occur when stdin is not ready to read after the first call to PR_Poll in tstclnt's main loop. Wan-Teh discovered that delaying the main loop until stdin is readable works around this problem! That is the basis of rev 1.28. This workaround should not be needed, and why it works is also unknown. Rev 1.29 will cause all diagnostic output to go to stderr, not stdout.
all.sh, which calls ssl.sh also passes on the local machine, where the file-systems are not network mounted. The QA machines and the Tinderbox machine (which also showed these failures) run another script, "nssqa". "nssqa" basically calls all.sh after setting test and result directories to network drives, i.e. they are now no longer running locally on the machine. "nssqa" also automates the running of "all.sh" by setting build type to DBG or OPT and OS_TARGET to WINNT or WIN95, and calling "all.sh" in turn for these respective settings. While these particular tests were failing (before wtc's workaround), at Netscape, we saw on the platform running Win2000: failures of the WINNT DBG build QA (failed on 5/22 and 5/25, passed on 5/24) and failures of the WIN95 DBG and OPT builds QA all days (failed on 5/22 thru 5/25), and on the platform running WinXP: all builds' (i.e. WINNT and WIN95 DBG and OPT) QA fail while running on WinXP, except on 5/24 when it passed the WINNT5.1 DBG build. At Sun, Sonja reported seeing these failures on only the WIN954.0_DBG.OBJ build on her Win2000 machine. All tests on her NT machine passed for both NT DBG and optimized and WIN95 builds. This was during one run.
Observations about this bug: 1. Only two tests fail in any failure cases, They are described as: TLS Require client auth (bad password) SSL3 Require client auth (bad password) These tests require client auth on the first handshake. 2. This table lists the test results seen on 5/22 - 5/25. Key: N_D_T tested with NSPR for WinNT, DeBuG build, TLS test 9_O_S tested with NSPR for Win9x, OPTimized build, SSL3 test Date system OS N_D_T N_D_S N_O_T N_O_S 9_D_T 9_D_S 9_O_T 9_o_S 5/22 elgamal 2K pass FAIL pass pass FAIL FAIL FAIL FAIL koblitz XP pass FAIL pass FAIL FAIL FAIL pass FAIL 5/23 elgamal 2K pass FAIL pass pass FAIL FAIL FAIL FAIL koblitz XP pass FAIL pass FAIL pass FAIL pass FAIL 5/24 elgamal 2k pass FAIL pass pass FAIL FAIL FAIL FAIL koblitz XP pass pass pass FAIL pass FAIL pass FAIL 5/25 elgamal 2k pass pass pass pass FAIL FAIL FAIL FAIL koblitz XP pass FAIL pass FAIL pass FAIL pass FAIL The tests with NSPR-NT and TLS always pass (both debug and optimized). The tests with NSPR-9x and SSL3 always FAIL (both debug and optimized) The tests with NSPR-NT, optimized, SSL3 always pass on elgamal, always fail on koblitz. The tests with NSPR-9x, optimized, TLS always fail on elgamal alawys pass on koblitz. The tests with NSPR-NT, debug, SSL3 are intermittent on both systems. The tests with NSPR-9x, debug, TLS always fails on elgamal, is intermittent on koblitz. 3. SSL3 test failures are DIFFERENT from the TLS test failues. TLS test failures look like this in the logs: tstclnt: read from socket failed: SSL peer cannot verify your certificate. ssl.sh: TLS Require client auth (bad password) produced a returncode of 1, expected is 254 FAILED SSL3 test failures look like this in the log: tstclnt: read from socket failed: TCP Connection aborted. ssl.sh: SSL3 Require client auth (bad password) produced a returncode of 1, expected is 254 FAILED One very odd thing about this is that the error code reported for the TLS test, "SSL peer cannot verify your certificate", is exactly the expected error for this test. In other words, it appears that this test actually behaved as expected, except that the program returned 1 instead of 254. By contrast, the SSL3 tests report "TCP Connection aborted." 4. There is a bug that may be related, which reports different test results on WinNT4.0 vs Win2k. bug 118668. That bug reported a problem that occured on a similar but different test, SSL require client auth on 2nd handshake (bad passwd). That bug reports the same difference of error code between the SSL and TLS tests, TLS says "peer cannot verify" vs. See http://bugzilla.mozilla.org/show_bug.cgi?id=118668#c5 and http://bugzilla.mozilla.org/show_bug.cgi?id=118668#c10 and http://bugzilla.mozilla.org/show_bug.cgi?id=118668#c14 (point 2) Both this bug and bug 118668 fail only during nightly QA, and not in individual all.sh tests on developers' systems.
Although the TLS test has the expected "SSL peer cannot verify your certificate" error, the error does not come from the expected operation. We are looking for a write to socket to fail with that error, but it is a read from socket that fails with that error. Therefore the test fails.
Good catch! Glad you spotted that. This explains the relevance of the socket being readable after the connect finishes. stdin isn't ready, so we don't attempt to write to the socket. But PR_Poll says the socket is readable, so we try to read it. The handshake begins in the first PR_Recv call. I surmise that No PR_Send ever completes without error; that is, the data to be written is never sent, and the handshake completes (as far as it does) during PR_Recv calls, not PR_Send calls. If that is correct, then this is not an error from a security point of view. We haven't sent any application data over an unauthenticated SSL connection. So, I conclude that the test should pass if that error is returned to either PR_Send or PR_Recv in tstclnt. Maybe this is just a QA test script adjustment. The remaining questions (that occur to me) are: 1. Why is the socket initially readable? and 2. Why do the SSL tests get a connection reset? and 3. Why is only this specific test affected? I can imagine using printfs or PRLogging in NSS's poll method to determine if that is responsible for the socket being readable. I'd like to see a packet trace of the packets going between the client and server. Of course, that may be tricky, given that they're on the same machine.
I think the following code in ssl_Poll() (sslsock.c) explains why the socket is initially readable. if (ss->TCPconnected) { if (!ss->handshakeBegun) { /* If the handshake has not begun, poll on read or write ** based on the local application's role in the handshake, ** not based on what the application requested. */ new_flags &= ~PR_POLL_RW; if (ss->handshaking == sslHandshakingAsClient) { new_flags |= PR_POLL_WRITE; } else { /* handshaking as server */ new_flags |= PR_POLL_READ; } ... Since tstclnt is an SSL client, we will be polling the underlying OS socket for writing regardless of the poll flags specified for the SSL layer. The underlying OS socket is writable, so PR_Poll reports that the SSL socket is readable (because we poll it for read).
I think this behavior (socket being readable, handshaking on PR_Recv) is probably AOK. For the TLS tests, the only problem seems to be that the QA script expects the failure to be on a write call, rather than on a read call. (The two apparently generate different return values for the program). I think the program's return value should probably indicate the error experienced rather than the function call that experienced it. I've been wondering why this affected only the tests where the server requires client auth and the client doesn't auth due to a bad password. I wondered why it didn't also occur when the client simply refused to authenticate, since the two should be identical in their protocol behaviors. I now believe the explanation is bug 207398, which I have just fixed. The tests in which the client should have been refusing to authenticate were actually authenticating because the command line options were improper to prevent authentication. Now that I've fixed the tests so that the client does not authenticate when it should not, I believe we will see the same failures on those tests that we see on the "bad password" tests. To test this hypothesis, I will back out rev 1.28 of tstclnt.c. This will cause QA failures again, but it is deliberate. I also have a hypothesis about why the SSL3 tests see resets but the TLS tests do not. I will devise a test for that tomorrow.
Here is the explanation for the TCP reset errors that occur for SSL3 handshakes that do not authentication to a server that requires client authentication. I will also explain why this affects only SSL3 and not also TLS. I have checked in a fix, which I will explain below. When an SSL3/TLS server requests client authentication, and the client chooses not to send a certificate, the client sends a negative response to the server saying "I have no certificate for you". For TLS, this "no_certificate" response is simply an empty certificate message in the message layer of the protocol. For SSL3, this "no_certificate" response is an Alert record. Alert records are not buffered. They are sent immediately, rather than being coalesced with other data that follows. This means that TLS sends empty certificate message client key exchange message change cipher specs record finished message --> written to server All that data is sent in a single write. But for SSL3, the sequence is no_certificate alert --> written to server client key exchange message change cipher specs record finished message --> written to server Which is two writes. When the server receives the empty certificate message or no_certificate alert, it will send an alert and close the connection (because it requires client auth). If the data in the client's second write arrives at the server after the server has closed the connection, the server will send back a connection reset. If the data in the second write arrives before the close, it will not trigger a reset. TLS only does one write, so there is no second write to potentially cause a reset. The solution is to buffer the SSL3 no_certificate alert, so that it is sent with the subsequent messages and records in a single write. I have tested this and it seems to work. Here is the patch: Index: ssl3con.c =================================================================== RCS file: /cvsroot/mozilla/security/nss/lib/ssl/ssl3con.c,v retrieving revision 1.54 diff -u -r1.54 ssl3con.c --- ssl3con.c 17 Apr 2003 02:03:38 -0000 1.54 +++ ssl3con.c 30 May 2003 10:38:06 -0000 @@ -2016,7 +2016,9 @@ rv = ssl3_FlushHandshake(ss, ssl_SEND_FLAG_FORCE_INTO_BUFFER); if (rv == SECSuccess) { PRInt32 sent; - sent = ssl3_SendRecord(ss, content_alert, bytes, 2, 0); + sent = ssl3_SendRecord(ss, content_alert, bytes, 2, + desc == no_certificate + ? ssl_SEND_FLAG_FORCE_INTO_BUFFER : 0); rv = (sent >= 0) ? SECSuccess : (SECStatus)sent; } ssl_ReleaseXmitBufLock(ss); I have checked this in and am waiting to see how it affects the QA results. I expect it will eliminate the TCP reset errors for SSL3. Once the TCP reset errors are eliminated, I will check in a fix for the other problem, where tstclnt returns 1 when the expected value is 254.
Status: NEW → ASSIGNED
Priority: -- → P1
Target Milestone: --- → 3.9
This patch does two things, one large and one small. 1, It uses two new symbols SSOCK_FD and STDIN_FD instead of 0 and 1 to reference members of the poll set array. I just could never remember whether pollset[0] was the socket or was stdin. Now I don't have to remember. This changes a lot of lines, but is very simple. 2. It implementa a new flag clientSpeaksFirst, set with the -f option. The only effect of this flag is whether the value of pollset[SSOCK_FD].in_flags is initially zero (if true) or PR_POLL_READ (if false). Note that pollset[SSOCK_FD].in_flags is set to PR_POLL_READ after each succesful write to the socket. I'm waiting for the tree to go green (except on Windows) before checking this in.
Checking in tstclnt.c; new revision: 1.31; previous revision: 1.30 Tomorrow's nightly QA tests will reveal if this patch eliminated these QA failures.
Assigned the bug to Nelson, who has been working on this bug.
Assignee: wtc → nelsonb
Status: ASSIGNED → NEW
Comment on attachment 124606 [details] [diff] [review] (re)implement -f (Client Speaks First) flag r=wtc. Good patch, Nelson. One question: >- pollset[0].fd = s; >- pollset[0].in_flags = PR_POLL_READ; >- pollset[1].fd = PR_GetSpecialFD(PR_StandardInput); >- pollset[1].in_flags = PR_POLL_READ; >+ pollset[SSOCK_FD].fd = s; >+ pollset[SSOCK_FD].in_flags = clientSpeaksFirst ? 0 : PR_POLL_READ; >+ pollset[STDIN_FD].fd = PR_GetSpecialFD(PR_StandardInput); >+ pollset[STDIN_FD].in_flags = PR_POLL_READ; I am wondering if we should start with polling exactly one fd. Should the initial in_flags for STDIN_FD be the following instead? pollset[STDIN_FD].in_flags = clientSpeaksFirst ? PR_POLL_READ : 0;
Attachment #124606 - Flags: review+
I backed out rev 1.27, which turned verbose mode on by default. I consider this bug now fixed. The change suggested in comment 11 would require more changes, because there is no code after the affected line that would ever set pollset[STDIN_FD].in_flags = PR_POLL_READ; The logic to do that would have to remember whether EOF had previously been seen for stdin, and not set in_flags to PR_POLL_READ for stdin if EOF had already been seen there. It's feasible, but since we have no QA tests that test server-speaks-first protocols (AFAIK), that enahncement seems like a lower priority.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: