Open Bug 334961 Opened 18 years ago Updated 1 year ago

strsclnt: PR_Connect: Local Network address is in use

Categories

(NSS :: Test, defect, P3)

3.11
x86
Windows XP

Tracking

(Not tracked)

People

(Reporter: Sandeep.Konchady, Unassigned)

Details

Attachments

(2 files, 1 obsolete file)

When running all.sh on WinXP with OS_TARGET=WIN95 the following SSL test fails. This does not seem to be caught by the script and is not reported as a failure.

ssl.sh: Stress SSL3 ECDHE-RSA   AES 128 CBC with SHA ----
selfserv -D -p 8443 -d ../server -n WINXP.red.iplanet.com  \
         -e WINXP.red.iplanet.com-ec -w nss -c :C013 -i ../tests_pid.168  &
selfserv started at Thu Apr 20 14:48:06 PST 2006
tstclnt -p 8443 -h WINXP.red.iplanet.com -B -s -q \
        -d ../client < C:/export/tinderbox/WINNT_5.1_Depend/mozilla/security/nss/tests/ssl/sslreq.dat
strsclnt -q -p 8443 -d ../client -B -s -w nss -c 1000 -C :C013 -T \
          WINXP.red.iplanet.com
strsclnt started at Thu Apr 20 14:48:06 PST 2006
strsclnt: 958 cache hits; 1 cache misses, 0 cache not reusable
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 1 cache misses, 0 cache not reusable
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
NEXT ERROR strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt completed at Thu Apr 20 14:48:12 PST 2006
ssl.sh: Stress SSL3 ECDHE-RSA   AES 128 CBC with SHA produced a returncode of 0, expected is 0.  PASSED
This not only happens for OS_TARGET=WIN95, but for normal builds too. System details are:

OS      : Microsoft Windows XP Professional 2002 SP2
CPU     : Intel Pentium III 728 MHz
RAM     : 512 MB
Compiler: Microsoft Visual Studio .NET 2003

ssl.sh: Stress SSL3 ECDHE-RSA   AES 128 CBC with SHA ----
selfserv -D -p 8443 -d ../server -n WINXP.red.iplanet.com  \
         -e WINXP.red.iplanet.com-ec -w nss -c :C013 -i ../tests_pid.2580  &
selfserv started at Fri Apr 21 10:43:06 PST 2006
tstclnt -p 8443 -h WINXP.red.iplanet.com -B -s -q \
        -d ../client < C:/export/tinderbox/WINNT_5.1_Depend/mozilla/security/nss/tests/ssl/sslreq.dat
strsclnt -q -p 8443 -d ../client -B -s -w nss -c 1000 -C :C013 -T \
          WINXP.red.iplanet.com
strsclnt started at Fri Apr 21 10:43:07 PST 2006
strsclnt: 958 cache hits; 1 cache misses, 0 cache not reusable
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 1 cache misses, 0 cache not reusable
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt: PR_Connect returned error -5982:
Local Network address is in use.
strsclnt completed at Fri Apr 21 10:43:12 PST 2006
ssl.sh: Stress SSL3 ECDHE-RSA   AES 128 CBC with SHA produced a returncode of 0, expected is 0.  PASSED
Summary: strsclnt on WinXP with OS_TARGET=WIN95 throw error → strsclnt on WinXP Professional throw error running ssl tests
We've seen this error in selfserv, and understand why that occurs.
But this error is coming from strsclnt.  

The single most important issue here is that strsclnt reports this
as a succesful run, when it clearly is not.
Summary: strsclnt on WinXP Professional throw error running ssl tests → strsclnt: PR_Connect: Local Network address is in use
Target Milestone: --- → 3.11.1
Version: 3.12 → 3.11
Priority: -- → P3
Target Milestone: 3.11.1 → ---
I noticed this error when I tested TLS session tickets on
Windows.  At first I thought this error only occurs when
session tickets are enabled.  Then I discovered that it
also occurs when session tickets are disabled, it's just
not reported as a failure (see Nelson's comment 2).

It is the Winsock connect() function that fails with the
WSAEADDRINUSE error, so TLS/SSL is not involved.

I did a search for this error and it seems that other people
have seen it, too:

http://www.cygwin.com/ml/cygwin-developers/2001-11/msg00080.html
http://www.fastcgi.com/archives/fastcgi-developers/2001-February/000933.html
http://tech.groups.yahoo.com/group/new-httpd/message/47305
http://ureader.com/msg/165626.aspx
Assignee: nobody → wtc
Target Milestone: --- → 3.12
Wan-Teh,

Shouldn't we fix this problem in NSPR instead ?

Wan-Teh, how does ignoring this error fix it? 
Doesn't this merely mask the problem?
Is the problem transient? Or does it persist once it occurs?
Is an immediately retry likely to experience the same error? or not??
If so, isn't this patch likely to create a tight loop, retrying failed 
connections as fast as possible?
There is probably no good solution to this problem.

http://support.microsoft.com/kb/319502
http://msdn2.microsoft.com/en-us/library/aa560610.aspx
http://www.velocityreviews.com/forums/t368277-clientside-tcp-socket-receiving-quotaddress-already-in-usequot-upon-connect.html

Let me call this patch a proposed approach because we can change the
details but the basic idea is the same: retry PR_Connect after the PR_ADDRESS_IN_USE_ERROR.
Note that I also changed strsclnt to require the specified number of SSL
connections to me made.

The second approach is to keep track of how many times we get the
PR_ADDRESS_IN_USE_ERROR, and subtract that from the number
of SSL connections required for strsclnt to return successfully.  This
approach still need to retry PR_Connect in tstclnt though.  (Yes,
tstclnt is also affected by this error.)
Attachment #301414 - Attachment is obsolete: true
Attachment #301414 - Flags: review?(nelson)
Unsetting target milestone in unresolved bugs whose targets have passed.
Target Milestone: 3.12 → ---
Brian: I did a search for "PR_Connect PR_ADDRESS_IN_USE_ERROR" in Bugzilla
and found this old bug. So the strsclnt failures you saw are not caused by
recent changes in NSS 3.15.4.

Apparently I researched this issue thoroughly and even wrote a patch, but
I didn't remember a thing about this yesterday. It's amazing how my brain
can page out unneeded info :-)

Please feel free to review my patch. Note Julien and Nelson's comments in
comment 5 and comment 6 if you review the patch. Fixing this bug in NSPR
seems hard if we can't simply retrying connect on the OS socket.
Flags: needinfo?
Flags: needinfo?
Comment on attachment 301692 [details] [diff] [review]
Proposed approach 1: ignore PR_ADDR_IN_USE_ERROR from PR_Connect

r=kaie

but it needs merging
Attachment #301692 - Flags: review?(brian) → review+
Attached patch p2-334961Splinter Review
merged to trunk.
The merged patch worked for me locally (linux).

Checked in:
https://hg.mozilla.org/projects/nss/rev/014577d91a1a

Let's see if it works.
Status with the patch checked in:

- win 2008 (7) is good

- win 2003 (previously unaffected) fails a few times with:
        strsclnt: PR_Connect returned error -5990, OS error 10060: I/O operation timed out

- win xp no longer shows the original failure (-5982), instead,
  it fails a few times with the same failure that was introduced on win 2000
-5990 means PR_IO_TIMEOUT_ERROR
Kai: thank you for the update. I also saw these errors. The WSAETIMEDOUT
(10060) is worse than WSAEADDRINUSE because WSAETIMEDOUT takes longer to
fail. Since we can't eliminate PR_Connect failures completely, I think
we should undo the retry on WSAEADDRINUSE, and then implement the second
approach in comment 7. Do you have time to work on that?

strsclnt.c doesn't seem to have a counter for successful connections.
So we'll need to add one. ('numConnected' is a counter for current
connections, so it is decremented when connections are closed.)
Comment on attachment 301692 [details] [diff] [review]
Proposed approach 1: ignore PR_ADDR_IN_USE_ERROR from PR_Connect

marking as r- because it doesn't work
Attachment #301692 - Flags: review+ → review-
Attachment #8403955 - Flags: review-
Is it possible that we could just reduce the number of iterations on Windows and maybe only for CI?  The tests won't be as good, but we won't use as many ports.

I see that we could increase the number of ephemeral ports and the length of TIME_WAIT on the build machines:
 HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
  Value Name: MaxUserPort
  Data Type: REG_DWORD
  Value: 65534 (for example)
  Valid Range: 5000-65534 (decimal)
  Default: 0x1388 (5000 decimal)
 HKEY_LOCAL_MACHINE\System\CurrectControlSet\services\Tcpip\Parameters
  Value Name: TcpTimedWaitDelay
  Value Type: REG_DWORD-time in seconds
  Valid Range: 30-300 (decimal)
  Default: 0xF0 (240 decimal) 

That might help a little, but even then 65k over 30s seems like it might be possible.

I also think that we should try to re-land Wan-Teh's retry patch.  I suspect that the timeouts are only due to the increase in the number of sockets that it creates.
Severity: normal → S3

The bug assignee is inactive on Bugzilla, so the assignee is being reset.

Assignee: wtc → nobody
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: