Closed Bug 381886 Opened 17 years ago Closed 15 years ago

tstclnt gets ECONNREF

Categories

(NSS :: Tools, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: slavomir.katuscak+mozilla, Assigned: slavomir.katuscak+mozilla)

Details

trying to kill selfserv with PID 712712 at Thu May 24 05:44:48 PDT 2007
kill -USR1 712712
selfserv: normal termination
selfserv with PID 712712 killed at Thu May 24 05:44:53 PDT 2007
ssl.sh: SSL3 Request don't require client auth on 2nd hs (EC) (client auth) ----
selfserv starting at Thu May 24 05:44:53 PDT 2007
selfserv -D -p 8447 -d ../server -n morgan1.red.iplanet.com  \
         -e morgan1.red.iplanet.com-ec -w nss -r -r -r -i ../tests_pid.475152  &
trying to connect to selfserv at Thu May 24 05:44:54 PDT 2007
tstclnt -p 8447 -h morgan1.red.iplanet.com  -q \
        -d ../client < /share/builds/mccrel3/security/securityjes5/builds/20070523.1/zebra_AIX5.1/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt: Client timed out while waiting for connection to server: Connection refused by peer.
retrying to connect to selfserv at Thu May 24 05:46:11 PDT 2007
tstclnt -p 8447 -h morgan1.red.iplanet.com  -q \
        -d ../client < /share/builds/mccrel3/security/securityjes5/builds/20070523.1/zebra_AIX5.1/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt: Client timed out while waiting for connection to server: Connection refused by peer.
kill -0 712734 >/dev/null 2>/dev/null
selfserv with PID 712734 found at Thu May 24 05:48:00 PDT 2007
selfserv with PID 712734 started at Thu May 24 05:48:00 PDT 2007
tstclnt -p 8447 -h morgan1.red.iplanet.com -f -d ../client \
        -T -n TestUser40-ec -w nss  < /share/builds/mccrel3/security/securityjes5/builds/20070523.1/zebra_AIX5.1/mozilla/security/nss/tests/ssl/sslreq.dat
selfserv: SSL_ForceHandshake returned error -8180:
Peer's Certificate has been revoked.
subject DN: CN=morgan1.red.iplanet.com,E=morgan1.red.iplanet.com@bogus.com,O=BOGUS NSS,L=Mountain View,ST=California,C=US
issuer  DN: CN=NSS Test CA,O=BOGUS NSS,L=Mountain View,ST=California,C=US
0 cache hits; 1 cache misses, 0 cache not reusable
tstclnt: read from socket failed: SSL peer rejected your certificate as revoked.
ssl.sh: SSL3 Request don't require client auth on 2nd hs (EC) (client auth) (cert TestUser40 - revoked) produced a returncode of 1, expected is 1 PASSED

Occured in nightly testing on morgan1 AIX5.1_64_OPT.OBJ securityjes5 20070523.1. (But have seen more time this week.)

What happened:
1. Killed old selfserv.
2. Started new selfserv.
3. First try to connect to selfserv -> timed out.
4. Second try to connect to selfserv (after more than minute) -> timed out.
5. Check if selfserv is running (kill -0) -> selfserv running.
6. Trying to connect with tstclnt -> connect failed with SSL_ForceHandshake error. 

I'm not sure if SSL_ForceHandshake is really SSL error, or selfserv did not accept the connection. 

Now there are 2 problems:
1. Selfserv doesn't accept connection (at least for more than 1 minute).
2. If following tstclnt tests expect error (revoked certificate), then test pass, without real cert checking.
trying to kill selfserv with PID 679950 at Tue May 29 04:01:15 PDT 2007
kill -USR1 679950
selfserv: normal termination
selfserv with PID 679950 killed at Tue May 29 04:01:15 PDT 2007
ssl.sh: Stress SSL3 ECDHE-RSA   AES 128 CBC with SHA ----
selfserv starting at Tue May 29 04:01:16 PDT 2007
selfserv -D -p 8447 -d ../ext_server -n morgan1.red.iplanet.com  \
         -e morgan1.red.iplanet.com-ec -w nss -c :C013 -i ../tests_pid.843858  &
trying to connect to selfserv at Tue May 29 04:01:16 PDT 2007
tstclnt -p 8447 -h morgan1.red.iplanet.com -B -s -q \
        -d ../ext_client < /share/builds/mccrel3/security/securityjes5/builds/20070528.1/zebra_AIX5.1/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt: Client timed out while waiting for connection to server: Connection refused by peer.
retrying to connect to selfserv at Tue May 29 04:02:37 PDT 2007
tstclnt -p 8447 -h morgan1.red.iplanet.com -B -s -q \
        -d ../ext_client < /share/builds/mccrel3/security/securityjes5/builds/20070528.1/zebra_AIX5.1/mozilla/security/nss/tests/ssl/sslreq.dat
tstclnt: Client timed out while waiting for connection to server: Connection refused by peer.
kill -0 679966 >/dev/null 2>/dev/null
selfserv with PID 679966 found at Tue May 29 04:04:15 PDT 2007
selfserv with PID 679966 started at Tue May 29 04:04:15 PDT 2007
strsclnt -q -p 8447 -d ../ext_client -B -s -w nss -c 1000 -C :C013 -T \
          morgan1.red.iplanet.com
strsclnt started at Tue May 29 04:04:15 PDT 2007
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 1 cache misses, 0 cache not reusable
strsclnt: 999 cache hits; 1 cache misses, 0 cache not reusable
strsclnt completed at Tue May 29 04:04:21 PDT 2007
ssl.sh: Stress SSL3 ECDHE-RSA   AES 128 CBC with SHA produced a returncode of 0, expected is 0.  PASSED

Seems that strsclnt is able to connect to selfserv also after 2 tstclnt failures.
As far as this error :

tstclnt: Client timed out while waiting for connection to server: Connection
refused by peer.

If you look at the times, you will see tstclnt was invoked just 1s after selfserv was start. On the slow AIX machine, with multiple simultaneous QA runs for branches, this may not be enough time for selfserv to be really listening .

I think we should either serialize the nightly QA runs for AIX, or get more AIX hardware for QA.
That's why there is second try. There is some sleep time between first and second try (5 seconds should be enough for this).
When tstclnt is invoked with the -q option, it tries to connect to the server.
If that fails, it waits 0.1 seconds and tries again.  It will repeat this 
until it succeeds or until 600 attempts have failed (60 seconds in all).

ssl.sh has a function, wait_for_selfserv, that tries this, and if it fails,
sleeps 5 seconds and tries again.

Maybe this is all a bit excessive.
I noticed some situations in logs when first tstclnt run was not succesfull and second passed. 
What kind of machine has this problem? 
Does it have IPv6 enabled?
Looking over this bug, I see that
- the failing client is always tstclnt, and 
- the failure is ECONREF, Connection refused by Peer
This means, quite simply, that no server is listening on the port
to which the client is trying to connect.  

Selfserv is not accepting the client's connections, because it is not 
receiving the client's connection attempts.  

On systems where TCP ports for IPv4 are separate from TCP ports for IPv6,
it is conceivable that the problem is that selfserv is listening on an 
IPv4 TCP port, but tstclnt is trying to connect to an IPv6 TCP port.  
That's why I asked the question in comment 6.
That is not the only possible explanation, but I'd start with that question.
Summary: Selfserv doesn't accept network connections. → tstclnt gets ECONREF
Summary: tstclnt gets ECONREF → tstclnt gets ECONNREF
Nelson,

I don't think this is IPv6 problem. I have seen it on these machines:

nssamdrhel3 (RH3) (IPv4 only)
nssamdrhel4 (RH4) (IPv4 + IPv6)
morgan1     (AIX) (IPv4 + IPv6 on loopback device)
mintchip    (S8)  (IPv4 only)

I checked output.log files for string "selfserv starting" - there are 493 lines with this string, that means selfserv is started 493 times while testing. When this failure happens, it happens usually only once per all test suite, so it looks more like race condition problem in selfserv. (Both tstclnt attempts failed - so it doesn't look like tstclnt problem.)

Assignee: nobody → slavomir.katuscak
Priority: -- → P2
Target Milestone: --- → 3.12
Slavo,

selfserv needs to be enhanced to have better logging. Right now, it doesn't even output anything when it starts up.

At the minimum, it should print a startup message, and another message after it has bound and is listening and is ready to accept connections.

These messages should contain timestamps, PID, perhaps even thread ID. Those should be added to the termination as well.

Certainly it seems excessively long for selfserv to take more than 1 minute to bind and be ready to accept a connection. I don't see anything in the code that could be expected to cause that sort of delay. Maybe there is some serialization happening in some system calls. We need to narrow down where exactly the delay is occurring by adding those messages to selfserv.
In reply to comment 9, I wouldn't want to see this proposed additional trace 
output all the time in all uses of selfserv, but I'd be happy to see it 
enabled optionally somehow.  

selfserv has a -v (verbose) option that causes selfserv to output all the 
http requests that it receives.  Maybe that's too verbose.  :) It was used 
many years ago to help diagnose a browser problem that occurred only on 
https requests.  I doubt it's been very useful since then.  

Here are some alternative suggestions for ways to optionally enable this 
additional startup tracing.  I'm sure more ways can be thought up.

a) use NSPR logging to do the tracing.  The NSPR logging would be enabled
with the usual method (an environment variable).

b) make -v be a counted option, where -vv is more verbose than a single -v.
Make the current level of verbosity occur only if -vv is given, and let a 
single -v enable that tracing.  (That may not be possible with the present option parser.)

c) add a "long" option to selfserv, e.g. --trace, to enable tracing.
I don't see any specific reason, but since last lab outage when we have more network errors I see this failure much more often then before. It's specific for nightly tests and I don't see it on Tinderbox (nightly tests uses network mounted drive, Tinderboxes local drive), so it's possible that it's caused by network problems.
Unsetting target milestone in unresolved bugs whose targets have passed.
Target Milestone: 3.12 → ---
This bug used to be seen relatively often (2-3 times per week) in nightly QA. However it wasn't seen for more than 4 months, most probably it was fixed within fix for other bug so I'm closing it as WORKSFORME now. If it occurs again, feel free to reopen it.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.