Closed
Bug 207313
Opened 22 years ago
Closed 22 years ago
Windows QA failures. tstclnt reports connections reset
Categories
(NSS :: Tools, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
3.9
People
(Reporter: nelson, Assigned: nelson)
Details
Attachments
(1 file)
9.07 KB,
patch
|
wtc
:
review+
|
Details | Diff | Splinter Review |
For NSS 3.9, tstclnt was fixed/enhanced to be able to work interactively on
Windows as it always has on Unix. The checkin was rev 1.26 of tstclnt.c.
Beginning at that time, Some of the SSL QA tests have been failing
repeatedly on some of our Windows QA test machines that run Win2K or WinXP.
The failure has so-far been seen only on QA machines, and not in developers'
individual runs of the ssl.sh QA test script.
A typical failed QA log entry looks like this:
tstclnt -p 8443 -h (host) -f -d ../client -T -n TestUser -w bogus \
< (...)/mozilla/security/nss/tests/ssl/sslreq.txt
tstclnt: read from socket failed: TCP Connection aborted.
ssl.sh: SSL3 Require client auth (bad password) produced a returncode of 1,
expected is 254 FAILED
Note that it is always this particular test that fails. Also, for the QA
test machines, all the file systems are network mounted.
Preliminary analysis suggests that this problem was not directly caused
by rev 1.26, but rather was a problem that was previously masked, and was
uncovered by the changes in rev 1.26. The problem seems related to a
different set of actions on the SSL socket that occur when stdin is not
ready to read after the first call to PR_Poll in tstclnt's main loop.
Wan-Teh discovered that delaying the main loop until stdin is readable
works around this problem! That is the basis of rev 1.28.
This workaround should not be needed, and why it works is also unknown.
Rev 1.29 will cause all diagnostic output to go to stderr, not stdout.
Comment 1•22 years ago
|
||
all.sh, which calls ssl.sh also passes on the local machine, where the
file-systems are not network mounted.
The QA machines and the Tinderbox machine (which also showed these failures) run
another script, "nssqa". "nssqa" basically calls all.sh after setting test and
result directories to network drives, i.e. they are now no longer running
locally on the machine. "nssqa" also automates the running of "all.sh" by
setting build type to DBG or OPT and OS_TARGET to WINNT or WIN95, and calling
"all.sh" in turn for these respective settings.
While these particular tests were failing (before wtc's workaround), at
Netscape, we saw on the platform running Win2000:
failures of the WINNT DBG build QA (failed on 5/22 and 5/25, passed on 5/24) and
failures of the WIN95 DBG and OPT builds QA all days (failed on 5/22 thru 5/25),
and on the platform running WinXP:
all builds' (i.e. WINNT and WIN95 DBG and OPT) QA fail while running on WinXP,
except on 5/24 when it passed the WINNT5.1 DBG build.
At Sun, Sonja reported seeing these failures on only the WIN954.0_DBG.OBJ build
on her Win2000 machine. All tests on her NT machine passed for both NT DBG and
optimized and WIN95 builds. This was during one run.
Assignee | ||
Comment 2•22 years ago
|
||
Observations about this bug:
1. Only two tests fail in any failure cases, They are described as:
TLS Require client auth (bad password)
SSL3 Require client auth (bad password)
These tests require client auth on the first handshake.
2. This table lists the test results seen on 5/22 - 5/25.
Key: N_D_T tested with NSPR for WinNT, DeBuG build, TLS test
9_O_S tested with NSPR for Win9x, OPTimized build, SSL3 test
Date system OS N_D_T N_D_S N_O_T N_O_S 9_D_T 9_D_S 9_O_T 9_o_S
5/22 elgamal 2K pass FAIL pass pass FAIL FAIL FAIL FAIL
koblitz XP pass FAIL pass FAIL FAIL FAIL pass FAIL
5/23 elgamal 2K pass FAIL pass pass FAIL FAIL FAIL FAIL
koblitz XP pass FAIL pass FAIL pass FAIL pass FAIL
5/24 elgamal 2k pass FAIL pass pass FAIL FAIL FAIL FAIL
koblitz XP pass pass pass FAIL pass FAIL pass FAIL
5/25 elgamal 2k pass pass pass pass FAIL FAIL FAIL FAIL
koblitz XP pass FAIL pass FAIL pass FAIL pass FAIL
The tests with NSPR-NT and TLS always pass (both debug and optimized).
The tests with NSPR-9x and SSL3 always FAIL (both debug and optimized)
The tests with NSPR-NT, optimized, SSL3 always pass on elgamal,
always fail on koblitz.
The tests with NSPR-9x, optimized, TLS always fail on elgamal
alawys pass on koblitz.
The tests with NSPR-NT, debug, SSL3 are intermittent on both systems.
The tests with NSPR-9x, debug, TLS always fails on elgamal,
is intermittent on koblitz.
3. SSL3 test failures are DIFFERENT from the TLS test failues.
TLS test failures look like this in the logs:
tstclnt: read from socket failed: SSL peer cannot verify your certificate.
ssl.sh: TLS Require client auth (bad password) produced a returncode of 1,
expected is 254 FAILED
SSL3 test failures look like this in the log:
tstclnt: read from socket failed: TCP Connection aborted.
ssl.sh: SSL3 Require client auth (bad password) produced a returncode of 1,
expected is 254 FAILED
One very odd thing about this is that the error code reported for the TLS
test, "SSL peer cannot verify your certificate", is exactly the expected
error for this test. In other words, it appears that this test actually
behaved as expected, except that the program returned 1 instead of 254.
By contrast, the SSL3 tests report "TCP Connection aborted."
4. There is a bug that may be related, which reports different test
results on WinNT4.0 vs Win2k. bug 118668.
That bug reported a problem that occured on a similar but different
test, SSL require client auth on 2nd handshake (bad passwd).
That bug reports the same difference of error code between the SSL and
TLS tests, TLS says "peer cannot verify" vs.
See http://bugzilla.mozilla.org/show_bug.cgi?id=118668#c5
and http://bugzilla.mozilla.org/show_bug.cgi?id=118668#c10
and http://bugzilla.mozilla.org/show_bug.cgi?id=118668#c14 (point 2)
Both this bug and bug 118668 fail only during nightly QA, and not in
individual all.sh tests on developers' systems.
Comment 3•22 years ago
|
||
Although the TLS test has the expected
"SSL peer cannot verify your certificate" error,
the error does not come from the expected operation.
We are looking for a write to socket to fail with
that error, but it is a read from socket that fails
with that error. Therefore the test fails.
Assignee | ||
Comment 4•22 years ago
|
||
Good catch! Glad you spotted that.
This explains the relevance of the socket being readable after the connect
finishes. stdin isn't ready, so we don't attempt to write to the socket.
But PR_Poll says the socket is readable, so we try to read it.
The handshake begins in the first PR_Recv call.
I surmise that No PR_Send ever completes without error; that is, the data to
be written is never sent, and the handshake completes (as far as it does)
during PR_Recv calls, not PR_Send calls. If that is correct, then this
is not an error from a security point of view. We haven't sent any
application data over an unauthenticated SSL connection.
So, I conclude that the test should pass if that error is returned to
either PR_Send or PR_Recv in tstclnt. Maybe this is just a QA test
script adjustment.
The remaining questions (that occur to me) are:
1. Why is the socket initially readable? and
2. Why do the SSL tests get a connection reset? and
3. Why is only this specific test affected?
I can imagine using printfs or PRLogging in NSS's poll method to determine
if that is responsible for the socket being readable.
I'd like to see a packet trace of the packets going between the client and
server. Of course, that may be tricky, given that they're on the same machine.
Comment 5•22 years ago
|
||
I think the following code in ssl_Poll() (sslsock.c)
explains why the socket is initially readable.
if (ss->TCPconnected) {
if (!ss->handshakeBegun) {
/* If the handshake has not begun, poll on read or write
** based on the local application's role in the handshake,
** not based on what the application requested.
*/
new_flags &= ~PR_POLL_RW;
if (ss->handshaking == sslHandshakingAsClient) {
new_flags |= PR_POLL_WRITE;
} else { /* handshaking as server */
new_flags |= PR_POLL_READ;
}
...
Since tstclnt is an SSL client, we will be polling the
underlying OS socket for writing regardless of the
poll flags specified for the SSL layer. The underlying
OS socket is writable, so PR_Poll reports that the SSL
socket is readable (because we poll it for read).
Assignee | ||
Comment 6•22 years ago
|
||
I think this behavior (socket being readable, handshaking on PR_Recv) is
probably AOK. For the TLS tests, the only problem seems to be that the QA
script expects the failure to be on a write call, rather than on a read call.
(The two apparently generate different return values for the program).
I think the program's return value should probably indicate the error
experienced rather than the function call that experienced it.
I've been wondering why this affected only the tests where the server requires
client auth and the client doesn't auth due to a bad password. I wondered why
it didn't also occur when the client simply refused to authenticate, since the
two should be identical in their protocol behaviors. I now believe the
explanation is bug 207398, which I have just fixed. The tests in which the
client should have been refusing to authenticate were actually authenticating
because the command line options were improper to prevent authentication.
Now that I've fixed the tests so that the client does not authenticate when
it should not, I believe we will see the same failures on those tests that we
see on the "bad password" tests. To test this hypothesis, I will back out
rev 1.28 of tstclnt.c. This will cause QA failures again, but it is deliberate.
I also have a hypothesis about why the SSL3 tests see resets but the TLS tests
do not. I will devise a test for that tomorrow.
Assignee | ||
Comment 7•22 years ago
|
||
Here is the explanation for the TCP reset errors that occur for SSL3 handshakes
that do not authentication to a server that requires client authentication.
I will also explain why this affects only SSL3 and not also TLS.
I have checked in a fix, which I will explain below.
When an SSL3/TLS server requests client authentication, and the client chooses
not to send a certificate, the client sends a negative response to the server
saying "I have no certificate for you". For TLS, this "no_certificate" response
is simply an empty certificate message in the message layer of the protocol.
For SSL3, this "no_certificate" response is an Alert record. Alert records are
not buffered. They are sent immediately, rather than being coalesced with
other data that follows. This means that TLS sends
empty certificate message
client key exchange message
change cipher specs record
finished message --> written to server
All that data is sent in a single write. But for SSL3, the sequence is
no_certificate alert --> written to server
client key exchange message
change cipher specs record
finished message --> written to server
Which is two writes. When the server receives the empty certificate
message or no_certificate alert, it will send an alert and close the
connection (because it requires client auth). If the data in the
client's second write arrives at the server after the server has closed
the connection, the server will send back a connection reset. If the
data in the second write arrives before the close, it will not trigger
a reset. TLS only does one write, so there is no second write to
potentially cause a reset.
The solution is to buffer the SSL3 no_certificate alert, so that it is
sent with the subsequent messages and records in a single write.
I have tested this and it seems to work. Here is the patch:
Index: ssl3con.c
===================================================================
RCS file: /cvsroot/mozilla/security/nss/lib/ssl/ssl3con.c,v
retrieving revision 1.54
diff -u -r1.54 ssl3con.c
--- ssl3con.c 17 Apr 2003 02:03:38 -0000 1.54
+++ ssl3con.c 30 May 2003 10:38:06 -0000
@@ -2016,7 +2016,9 @@
rv = ssl3_FlushHandshake(ss, ssl_SEND_FLAG_FORCE_INTO_BUFFER);
if (rv == SECSuccess) {
PRInt32 sent;
- sent = ssl3_SendRecord(ss, content_alert, bytes, 2, 0);
+ sent = ssl3_SendRecord(ss, content_alert, bytes, 2,
+ desc == no_certificate
+ ? ssl_SEND_FLAG_FORCE_INTO_BUFFER : 0);
rv = (sent >= 0) ? SECSuccess : (SECStatus)sent;
}
ssl_ReleaseXmitBufLock(ss);
I have checked this in and am waiting to see how it affects the QA results.
I expect it will eliminate the TCP reset errors for SSL3.
Once the TCP reset errors are eliminated, I will check in a fix for the
other problem, where tstclnt returns 1 when the expected value is 254.
Status: NEW → ASSIGNED
Priority: -- → P1
Target Milestone: --- → 3.9
Assignee | ||
Comment 8•22 years ago
|
||
This patch does two things, one large and one small.
1, It uses two new symbols SSOCK_FD and STDIN_FD instead of 0 and 1 to
reference members of the poll set array. I just could never remember whether
pollset[0] was the socket or was stdin. Now I don't have to remember.
This changes a lot of lines, but is very simple.
2. It implementa a new flag clientSpeaksFirst, set with the -f option.
The only effect of this flag is whether the value of pollset[SSOCK_FD].in_flags
is initially zero (if true) or PR_POLL_READ (if false). Note that
pollset[SSOCK_FD].in_flags is set to PR_POLL_READ after each succesful write
to the socket.
I'm waiting for the tree to go green (except on Windows) before checking this
in.
Assignee | ||
Comment 9•22 years ago
|
||
Checking in tstclnt.c; new revision: 1.31; previous revision: 1.30
Tomorrow's nightly QA tests will reveal if this patch eliminated these QA
failures.
Comment 10•22 years ago
|
||
Assigned the bug to Nelson, who has been working on
this bug.
Assignee: wtc → nelsonb
Status: ASSIGNED → NEW
Comment 11•22 years ago
|
||
Comment on attachment 124606 [details] [diff] [review]
(re)implement -f (Client Speaks First) flag
r=wtc. Good patch, Nelson.
One question:
>- pollset[0].fd = s;
>- pollset[0].in_flags = PR_POLL_READ;
>- pollset[1].fd = PR_GetSpecialFD(PR_StandardInput);
>- pollset[1].in_flags = PR_POLL_READ;
>+ pollset[SSOCK_FD].fd = s;
>+ pollset[SSOCK_FD].in_flags = clientSpeaksFirst ? 0 : PR_POLL_READ;
>+ pollset[STDIN_FD].fd = PR_GetSpecialFD(PR_StandardInput);
>+ pollset[STDIN_FD].in_flags = PR_POLL_READ;
I am wondering if we should start with polling
exactly one fd. Should the initial in_flags for
STDIN_FD be the following instead?
pollset[STDIN_FD].in_flags = clientSpeaksFirst ? PR_POLL_READ : 0;
Attachment #124606 -
Flags: review+
Assignee | ||
Comment 12•22 years ago
|
||
I backed out rev 1.27, which turned verbose mode on by default.
I consider this bug now fixed.
The change suggested in comment 11 would require more changes, because
there is no code after the affected line that would ever set
pollset[STDIN_FD].in_flags = PR_POLL_READ;
The logic to do that would have to remember whether EOF had previously been
seen for stdin, and not set in_flags to PR_POLL_READ for stdin if EOF had
already been seen there. It's feasible, but since we have no QA tests that
test server-speaks-first protocols (AFAIK), that enahncement seems like a
lower priority.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•