Closed Bug 124337 Opened 23 years ago Closed 23 years ago

SSL client with client auth gets PR_Recv and PR_Send errors

Categories

(NSS :: Libraries, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 127960

People

(Reporter: julien.pierre, Assigned: wtc)

References

Details

Using a multithreaded SSL client with NSS 3.4, with a load test program, I am getting intermittent errors such as this : error: PR_Send failure - NSPR rc = -5961 , OS rc = 131 Connection reset by peer error: PR_Recv failure - NSPR rc = -5938 , OS rc = 0 End of file info: 83 operations and 0 failure in the last 5.17 seconds - 16.07 ops/second error: PR_Send failure - NSPR rc = -5961 , OS rc = 131 Connection reset by peer error: PR_Recv failure - NSPR rc = -5938 , OS rc = 0 End of file error: PR_Send failure - NSPR rc = -5961 , OS rc = 131 Connection reset by peer error: PR_Recv failure - NSPR rc = -5938 , OS rc = 0 End of file The errors don't occur when NSS 3.3 is used on the client. The errors are more frequent if client auth is engaged. I have reproduced this on both NT and Solaris.
Priority: -- → P1
Target Milestone: --- → 3.4
NSPR error -5938 is PR_END_OF_FILE_ERROR. NSPR itself does not use this error code. This error is only emitted by the SSL library when a read during SSL handshake returns a 0 byte count, meaning the peer closed the TCP connection. The "connection reset by peer" error from PR_Send() may indicate the same condition: the peer closed the TCP connection prematurely during SSL handshake. I guess the best way to debug this is to find out what's going on on the server side or use ssltap. It would be easier if we could reproduce these errors with our strsclnt and selfserv.
You are right, it's coming from the server side. I also finally reproduced the errors even with a 3.3 client - there were just fewer with that version of NSS. The server was running 3.4.
I'll mark this as a dupe of the server problems in 124446/124447. *** This bug has been marked as a duplicate of 124447 ***
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → DUPLICATE
I'm reopening this bug. Using 2/15 tip version of NSS on both sides, I get errors on the client side very quickly (within about 300 iterations). This occurs in the "double handshake client auth test" where the first full handshake on each connection is done without client auth, and a second handshake is done immediately after with client auth. info: 31 operations and 0 failure in the last 5.14 seconds - 6.04 ops/second error: PR_Recv failure - NSPR rc = -5961 , OS rc = 131 Connection reset by peer error: PR_Recv failure - NSPR rc = -5961 , OS rc = 131 Connection reset by peer error: PR_Recv failure - NSPR rc = -5961 , OS rc = 131 Connection reset by peer error: PR_Recv failure - NSPR rc = -5961 , OS rc = 131 Connection reset by peer error: PR_Recv failure - NSPR rc = -5961 , OS rc = 131 Connection reset by peer The server is also getting these errors : [15/Feb/2002:21:09:37] failure (26248): for host 10.169.25.24 trying to GET /, Client-Auth reports: Timeout while waiting for client certificate. [15/Feb/2002:21:09:39] failure (26248): for host 10.169.25.24 trying to GET /, Client-Auth reports: Timeout while waiting for client certificate. [15/Feb/2002:21:09:39] failure (26248): for host 10.169.25.24 trying to GET /, Client-Auth reports: Timeout while waiting for client certificate. [15/Feb/2002:21:09:40] failure (26248): for host 10.169.25.24 trying to GET /, Client-Auth reports: Timeout while waiting for client certificate. [15/Feb/2002:21:09:40] failure (26248): for host 10.169.25.24 trying to GET /, Client-Auth reports: Timeout while waiting for client certificate. [15/Feb/2002:21:09:44] failure (26248): for host 10.169.25.24 trying to GET /, Client-Auth reports: Timeout while waiting for client certificate. [15/Feb/2002:21:09:45] failure (26248): for host 10.169.25.24 trying to GET /, Client-Auth reports: Timeout while waiting for client certificate. [15/Feb/2002:21:09:45] failure (26248): for host 10.169.25.24 trying to GET /, Client-Auth reports: Timeout while waiting for client certificate. If I switch the client to NSS 3.3 - but leave the server running 3.4, the problem does not occur.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
The same problem occurs on the client side with NSS 3.4 using single-handshake client auth.
While I was teaching Bishakha how to run NES tests today, I saw this problem repeatedly doing client auth, either single or double-handshake, within a minute. The PR_Recv and PR_Send errors occurred on the client occurred with both NSS 3.4 and NSS 3.3.2 . I was running NES 6 as the web server. The log messages in the server (see above in the defect) only happen in the double-handshake client-auth case because the cert is required by a web server SAF. SSLClientAuthTimeout in magnus.conf was set to 3600. That setting is supposed to increase the timeout in the double-handshake client auth. In the single-handshake client auth case, no error got logged by the server since the SAF isn't engaged - the server socket requires client auth from the first handshake. But the client still got errors. I also tried running NSS 3.3.2 on the server side, and it still failed. It seems the rate of errors was lower with 3.3.2 on both sides instead of 3.4, but it was still high. One interesting experiment I did was to reduce the number of client threads from 128 to 16 and rerun the test. Then, all the errors went away and I could run the test for several minutes. This indicates to me that we might have a long-standing thread-safety issue in our codepath for client auth, that we can observe only with high concurrency.
Summary: SSL client errors → SSL client with client auth gets PR_Recv and PR_Send errors
It is possible that this is a thread safety issue, but I think it is more likely that it is just another manifestation of the common phenomena that occur when clients persistently send more connections requests to the server than the server can serve. Eventually, TCP connection queues overflow (queues of TCP connections that have been acked, and are waiting for the application to call accept() and receive them), resulting in connection resets. This behavior is commonly seen on many unix platforms. The fact that reducing the number of concurrent client threads eliminated the problem seems like strong evidence to support that alternative hypothesis.
I came up with a have a theory on this bug : There may be timeouts occuring internally in NSS with clent auth. SSL has an internal handshake timeout, which is the last timeout used by PR_Send or PR_Recv on the second. As it turns out, it is set to a low value in the web server in the initial PR_Recv, 30s. The full handshake with client auth takes longer than that on my little Ultra10 when there is a high concurrency of client threads - 128 - and running the client on the same system as the server. There is a setting in NES to increase the PR_Send / PR_Recv / PR_Accept timeouts, called "AcceptTimeout". I increased it to 3600s. The test has been running without any errors on either side so far for a few minutes. Nelson, as far as the connection queue overflowing, this isn't the case with NES. We have dedicated threads that do nothing but accept, and put the sockets in a server's queue, always keeping the socket's accept queue from being full. Of course the server can only process so many requests, based on limited CPU, but it accepts all connections. We aren't getting any PR_Connect or PR_Accept failures, so this isn't the issue here.
Please see bug 127960 for a related feature request.
The problem was not seen after bumping up the initial PR_Send / PR_Recv timeout in the web server to 3600s . I ran a test for 441559 sessions with double-handshake client auth and got no error on either side, or leaks.
I'm marking this a dupe of 127960, which is the feature enhancement to provide the correct fix for this problem. As there is a workaround and this isn't a new problem in 3.4, this isn't a P1 for NSS 3.4. *** This bug has been marked as a duplicate of 127960 ***
Status: REOPENED → RESOLVED
Closed: 23 years ago23 years ago
Resolution: --- → DUPLICATE
*** Bug 127130 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.