SSL client with client auth gets PR_Recv and PR_Send errors

RESOLVED DUPLICATE of bug 127960

Status

NSS
Libraries
P1
normal
RESOLVED DUPLICATE of bug 127960
16 years ago
16 years ago

People

(Reporter: Julien Pierre, Assigned: Wan-Teh Chang)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

16 years ago
Using a multithreaded SSL client with NSS 3.4, with a load test program, I am
getting intermittent errors such as this :

error: PR_Send failure - NSPR rc = -5961 , OS rc = 131
       Connection reset by peer
error: PR_Recv failure - NSPR rc = -5938 , OS rc = 0
       End of file
info: 83 operations and 0 failure in the last 5.17 seconds - 16.07 ops/second
error: PR_Send failure - NSPR rc = -5961 , OS rc = 131
       Connection reset by peer
error: PR_Recv failure - NSPR rc = -5938 , OS rc = 0
       End of file
error: PR_Send failure - NSPR rc = -5961 , OS rc = 131
       Connection reset by peer
error: PR_Recv failure - NSPR rc = -5938 , OS rc = 0
       End of file

The errors don't occur when NSS 3.3 is used on the client.
The errors are more frequent if client auth is engaged.

I have reproduced this on both NT and Solaris.
(Reporter)

Updated

16 years ago
Priority: -- → P1
Target Milestone: --- → 3.4
(Assignee)

Comment 1

16 years ago
NSPR error -5938 is PR_END_OF_FILE_ERROR.  NSPR itself
does not use this error code.  This error is only emitted
by the SSL library when a read during SSL handshake
returns a 0 byte count, meaning the peer closed the TCP
connection.

The "connection reset by peer" error from PR_Send() may
indicate the same condition: the peer closed the TCP
connection prematurely during SSL handshake.

I guess the best way to debug this is to find out what's
going on on the server side or use ssltap.  It would be
easier if we could reproduce these errors with our
strsclnt and selfserv.
(Reporter)

Comment 2

16 years ago
You are right, it's coming from the server side. I also finally reproduced the
errors even with a 3.3 client - there were just fewer with that version of NSS.
The server was running 3.4.
(Reporter)

Comment 3

16 years ago
I'll mark this as a dupe of the server problems in 124446/124447.

*** This bug has been marked as a duplicate of 124447 ***
Status: NEW → RESOLVED
Last Resolved: 16 years ago
Resolution: --- → DUPLICATE
(Reporter)

Comment 4

16 years ago
I'm reopening this bug. Using 2/15 tip version of NSS on both sides, I get
errors on the client side very quickly (within about 300 iterations). This
occurs in the "double handshake client auth test" where the first full handshake
on each connection is done without client auth, and a second handshake is done
immediately after with client auth.
info: 31 operations and 0 failure in the last 5.14 seconds - 6.04 ops/second
error: PR_Recv failure - NSPR rc = -5961 , OS rc = 131
       Connection reset by peer
error: PR_Recv failure - NSPR rc = -5961 , OS rc = 131
       Connection reset by peer
error: PR_Recv failure - NSPR rc = -5961 , OS rc = 131
       Connection reset by peer
error: PR_Recv failure - NSPR rc = -5961 , OS rc = 131
       Connection reset by peer
error: PR_Recv failure - NSPR rc = -5961 , OS rc = 131
       Connection reset by peer

The server is also getting these errors :

[15/Feb/2002:21:09:37] failure (26248): for host 10.169.25.24 trying to GET /,
Client-Auth reports: Timeout while waiting for client certificate.
[15/Feb/2002:21:09:39] failure (26248): for host 10.169.25.24 trying to GET /,
Client-Auth reports: Timeout while waiting for client certificate.
[15/Feb/2002:21:09:39] failure (26248): for host 10.169.25.24 trying to GET /,
Client-Auth reports: Timeout while waiting for client certificate.
[15/Feb/2002:21:09:40] failure (26248): for host 10.169.25.24 trying to GET /,
Client-Auth reports: Timeout while waiting for client certificate.
[15/Feb/2002:21:09:40] failure (26248): for host 10.169.25.24 trying to GET /,
Client-Auth reports: Timeout while waiting for client certificate.
[15/Feb/2002:21:09:44] failure (26248): for host 10.169.25.24 trying to GET /,
Client-Auth reports: Timeout while waiting for client certificate.
[15/Feb/2002:21:09:45] failure (26248): for host 10.169.25.24 trying to GET /,
Client-Auth reports: Timeout while waiting for client certificate.
[15/Feb/2002:21:09:45] failure (26248): for host 10.169.25.24 trying to GET /,
Client-Auth reports: Timeout while waiting for client certificate.

If I switch the client to NSS 3.3 - but leave the server running 3.4, the
problem does not occur.

Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
(Reporter)

Comment 5

16 years ago
The same problem occurs on the client side with NSS 3.4 using single-handshake
client auth.
(Reporter)

Comment 6

16 years ago
While I was teaching Bishakha how to run NES tests today, I saw this problem
repeatedly doing client auth, either single or double-handshake, within a minute.

The PR_Recv and PR_Send errors occurred on the client occurred with both NSS 3.4
and NSS 3.3.2 .

I was running NES 6 as the web server. The log messages in the server (see above
in the defect) only happen in the double-handshake client-auth case because the
cert is required by a web server SAF. SSLClientAuthTimeout in magnus.conf was
set to 3600. That setting is supposed to increase the timeout in the
double-handshake client auth.

In the single-handshake client auth case, no error got logged by the server
since the SAF isn't engaged - the server socket requires client auth from the
first handshake. But the client still got errors.
I also tried running NSS 3.3.2 on the server side, and it still failed. It seems
the rate of errors was lower with 3.3.2 on both sides instead of 3.4, but it was
still high.
One interesting experiment I did was to reduce the number of client threads from
128 to 16 and rerun the test. Then, all the errors went away and I could run the
test for several minutes. This indicates to me that we might have a
long-standing thread-safety issue in our codepath for client auth, that we can
observe only with high concurrency.
(Reporter)

Updated

16 years ago
Summary: SSL client errors → SSL client with client auth gets PR_Recv and PR_Send errors
It is possible that this is a thread safety issue, but I think it is more
likely that it is just another manifestation of the common phenomena that
occur when clients persistently send more connections requests to the 
server than the server can serve.  Eventually, TCP connection queues
overflow (queues of TCP connections that have been acked, and are 
waiting for the application to call accept() and receive them), resulting
in connection resets.  This behavior is commonly seen on many unix 
platforms.

The fact that reducing the number of concurrent client threads eliminated
the problem seems like strong evidence to support that alternative 
hypothesis.
(Reporter)

Comment 8

16 years ago
I came up with a have a theory on this bug :
There may be timeouts occuring internally in NSS with clent auth. SSL has an
internal handshake timeout, which is the last timeout used by PR_Send or PR_Recv
on the second. As it turns out, it is set to a low value in the web server in
the initial PR_Recv, 30s. The full handshake with client auth takes longer than
that on my little Ultra10 when there is a high concurrency of client threads -
128 - and running the client on the same system as the server.

There is a setting in NES to increase the PR_Send / PR_Recv / PR_Accept
timeouts, called "AcceptTimeout". I increased it to 3600s. The test has been
running without any errors on either side so far for a few minutes.


Nelson, as far as the connection queue overflowing, this isn't the case with
NES. We have dedicated threads that do nothing but accept, and put the sockets
in a server's queue, always keeping the socket's accept queue from being full.

Of course the server can only process so many requests, based on limited CPU,
but it accepts all connections. We aren't getting any PR_Connect or PR_Accept
failures, so this isn't the issue here.
(Reporter)

Comment 9

16 years ago
Please see bug 127960 for a related feature request.
(Reporter)

Comment 10

16 years ago
The problem was not seen after bumping up the initial PR_Send / PR_Recv timeout
in the web server to 3600s . I ran a test for 441559 sessions with
double-handshake client auth and got no error on either side, or leaks.
(Reporter)

Comment 11

16 years ago
I'm marking this a dupe of 127960, which is the feature enhancement to provide
the correct fix for this problem.
As there is a workaround and this isn't a new problem in 3.4, this isn't a P1
for NSS 3.4.


*** This bug has been marked as a duplicate of 127960 ***
Status: REOPENED → RESOLVED
Last Resolved: 16 years ago16 years ago
Resolution: --- → DUPLICATE
(Reporter)

Comment 12

16 years ago
*** Bug 127130 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.