Open Bug 332348 Opened 18 years ago Updated 2 years ago

PR_Close fails on Windows when PR_SockOpt_Linger is set

Categories

(NSPR :: NSPR, defect, P5)

4.6.1

Tracking

(Not tracked)

People

(Reporter: nelson, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(5 files, 5 obsolete files)

This bug is similar to bug 110515 but I believe it is not a duplicate.

ssl.sh comes to a complete halt, will hang for hours, during a stress test.

I am now seeing this bug every time I try to run ssl.sh on Windows XP.
I cannot get ssl.sh to run to completion any more on Windows.
Consequently, this bug meets the definition of a blocker.

breaking into the hung processes with the debugger shows that 

- selfserv thinks it has finished all connections. One thread is listening,
the rest are waiting to be given connections to work on.  No connections
are thought to be open.  All requests are thought to be finished.

- client threads are hung, trying to read on the socket, with infinite
timeouts.  It appears that they have already read the entire server
response into their receive buffers.  IOW, it appears that the https
transaction is effectively complete.  The client threads should be
getting EOF or something like that.  But they hang in read.
The hung threads cannot be altered.  I cannot force the read call to
finish.  But the process is killable.

I'm proposing a workaround, which is to stop using PR_Read and PR_Write
in strsclnt, and switch to PR_Send and PR_Recv with 10 second timeouts.
I will shortly attach a patch to do just that.
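For illustration only, the substitution amounts to something like the sketch below; the function and buffer names are placeholders, not the actual strsclnt code:

    #include "nspr.h"

    /* Hypothetical sketch of the workaround: replace blocking PR_Read/PR_Write
     * with PR_Recv/PR_Send and a finite timeout, so a wedged connection fails
     * with PR_IO_TIMEOUT_ERROR instead of hanging forever. */
    static PRInt32
    read_with_timeout(PRFileDesc *fd, void *buf, PRInt32 len)
    {
        PRIntervalTime timeout = PR_SecondsToInterval(10);
        PRInt32 nb;

        /* was: nb = PR_Read(fd, buf, len); */
        nb = PR_Recv(fd, buf, len, 0, timeout);
        if (nb < 0 && PR_GetError() == PR_IO_TIMEOUT_ERROR) {
            PR_fprintf(PR_STDERR, "read timed out after 10 seconds\n");
        }
        return nb;
    }

Writes are handled the same way, with PR_Send in place of PR_Write.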
Priority: -- → P1
Target Milestone: --- → 3.11.1
Attached patch patch v1 (obsolete) — Splinter Review
patch for both strsclnt and tstclnt
The good news about this patch is that, with it, ssl.sh now runs to 
completion, albeit with numerous timeouts, on WinXP.

The bad news is that the results.html file shows GREEN SUCCESS when the
timeouts occur.  So, I'm not going to ask for review of this change
until I figure out why the results are still GREEN.

This makes me wonder what other test failures are being ignored.
Summary: strsclnt hangs on windows → strsclnt and tstclnt hangs on windows
observation:  when strsclnt hangs, netstat shows thousands of connections 
in TIME_WAIT state.  I suspect a system resource limit.  
Nelson,

You may be on the right track here about resource limits. I don't see any problem in ssl.sh/all.sh on my Win2K machine, but it is a Win2K server box, which has higher limits on number of sockets.

Do we have any idea why this problem doesn't show up in the nightly QA? Is the box for the nightly QA also running a "server" edition of a Microsoft OS?
Comment on attachment 216827 [details] [diff] [review]
patch v1

I know you didn't ask for review, but I still have some comments on this one.

I would much prefer that we get to the bottom of the Windows issue without having to resort to a timeout.

But if we can't, the timeout should be implemented as an option controllable through the command line. The default should still be infinite timeout through PR_Read / PR_Write. The infinite timeout is very useful for performance testing.
Attachment #216827 - Flags: review-
Attached patch updated patch (obsolete) — Splinter Review
While I (too) very much want to get to the "root cause" of
these problems and fix it "the right way", in the meantime
I must get the ECC work done, and this timeout problem is
now the primary impediment to progress on that project.

This patch does more than the previous one.  It also:

- uses atomic increments and decrements in some places where
that seems necessary (see the sketch after this list), and

- shuts down strsclnt when the first "fatal" error occurs,
so that strsclnt doesn't go on repeating the same mistakes
over and over 1000 times (typically).  This speeds up ssl.sh
considerably on systems that have a lot of failures.
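For readers less familiar with the NSPR primitive in question, here is a minimal sketch of the kind of counter update meant above; the counter name is illustrative, not necessarily how strsclnt ended up using it:

    #include "pratom.h"

    /* Illustrative only: decrement a shared counter from several worker
     * threads.  PR_AtomicDecrement returns the new value, so the update and
     * the test happen as one atomic step, with no lock needed. */
    static PRInt32 remaining_connections;   /* assumed shared counter */

    static PRBool
    take_one_connection(void)
    {
        return (PR_AtomicDecrement(&remaining_connections) >= 0)
                   ? PR_TRUE : PR_FALSE;
    }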

I have not yet addressed Julien's request for configurable
timeout and so am not yet asking for review.  But unsolicited
reviews are always welcome.
Attachment #216827 - Attachment is obsolete: true
With the most recent of the above patches in place, ssl.sh is now faster
on my XPhome box where so many failures occur.  But ssl.sh is not
completing without manual intervention.  There are still occasional
hangs in tstclnt tests that require manual intervention (killing tstclnt).

At one such occasion, I observed with Netstat that there was one connection
between tstclnt and selfserv, and it was in the ESTABLISHED state.
I broke into selfserv with the debugger and found that the server was in 
the usual quiescent state, one thread blocked in accept, and all the rest 
blocked on a condition variable waiting for something to do.  IOW, there 
was no thread that seemed to have a socket in an established state. 

I broke into tstclnt, but the debugger could not bring the process out of
the "running" state, so I was not able to see the list of threads, nor see
the stack trace of any thread.  :-/  

Note: I am using the trunk, not the 3.11 branch, for my work.
Nelson,

FYI, on Friday, ssl.sh took 27 minutes on my Win2K server box at the office, and completed fine. This was for the trunk.

You also said that you were seeing this behavior only in the last 7 days on your system. Would it be too much to ask to check out new source trees at different dates and narrow down when the regression happened? In particular, check without the linger change I put in selfserv Tuesday night.

Also, I wonder if Microsoft shipped any networking patch for XP in the last 7 days that caused this breakage. I don't have any XP machine so I wouldn't know about XP patches, but you should be able to see the patch install history.
Attached patch patch v3 (obsolete) — Splinter Review
This third patch enables the timeout with an environment variable.
It also stops strsclnt by default on the first error, but adds a
new -i (ignore errors) option to continue to run until the full
count is done, even if errors occur early.  It also includes the
fixes for the atomic update that I previously attached to another
already-resolved bug.
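Roughly, the environment-variable hook amounts to the following sketch (it assumes NSS_DEBUG_TIMEOUT holds a timeout in seconds and defaults to an infinite timeout when unset; this is not the literal patch text):

    #include <stdlib.h>
    #include "nspr.h"

    /* Sketch: read the timeout from the environment, falling back to the
     * old infinite-timeout behavior when the variable is not set. */
    static PRIntervalTime
    get_io_timeout(void)
    {
        const char *val = PR_GetEnv("NSS_DEBUG_TIMEOUT");
        int secs = val ? atoi(val) : 0;
        return (secs > 0) ? PR_SecondsToInterval(secs)
                          : PR_INTERVAL_NO_TIMEOUT;
    }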

Julien, please review.
Attachment #216932 - Attachment is obsolete: true
Attachment #217101 - Flags: review?(julien.pierre.bugs)
Comment on attachment 217101 [details] [diff] [review]
patch v3

Nelson,

The timeout logic is fine, though I would have preferred command-line switches.

The only reason for r- is that I took a closer look at the numUsed variable, and it is actually protected by threadLock too. The comments at the top of the file that list all the variables protected by threadLock say so, and they are correct. The lock is acquired just before numUsed is inspected/incremented in launch_thread.
Attachment #217101 - Flags: review?(julien.pierre.bugs) → review-
OK, back to ++ for numUsed.  If you like this one, 
please ask Bob or Wan-Teh for a second review.
Attachment #217101 - Attachment is obsolete: true
Attachment #217115 - Flags: review?(julien.pierre.bugs)
Attachment #217115 - Flags: superreview?(wtchang)
Attachment #217115 - Flags: review?(julien.pierre.bugs)
Attachment #217115 - Flags: review+
Comment on attachment 217115 [details] [diff] [review]
patch v4 (Checked in on trunk)

Patch checked in on trunk.  Still waiting second review for branch.
strsclnt/strsclnt.c; new revision: 1.49; previous revision: 1.48
tstclnt/tstclnt.c;   new revision: 1.44; previous revision: 1.43
Attachment #217115 - Attachment description: patch v4 → patch v4 (Checked in on trunk)
Comment on attachment 217115 [details] [diff] [review]
patch v4 (Checked in on trunk)

r=wtc.

The change of the types of remaining_connections
and numUsed from int to PRInt32 is not necessary
and results in inconsistency with the type of
active_threads.

>-            if (--remaining_connections >= 0) {
>+            if (--remaining_connections >= 0) { /* protected by threadLock */

This new comment is not necessary.

>-    i = numUsed;
>-    slot = &threads[numUsed++];
>+    i = numUsed++;
>+    slot = &threads[i];

I like this change.  I wanted to make this change, too.

I agree with Julien that we should use a command line
option rather than an environment variable
(NSS_DEBUG_TIMEOUT) to set the timeout.  My reason is
that a command line option would force us to document
it at least in the program's usage message.
Attachment #217115 - Flags: superreview?(wtchang) → superreview+
Nelson,

Are you running into this hang problem when targeting WINNT, WIN95, or both?

The nightly QA targets WINNT and runs successfully on Win2K server.
My local WINNT build runs successfully on Win2K server.

However, I just did a WIN95 build and ran it on my Win2k server box. I ran into the same problem that you did - tstclnt hung.

It would not return, even after I killed selfserv. It is stuck in PR_Poll on line 839.
I interrupted the program and looked at the pollset.  pollset[SSOCK_FD].out_flags is 1, which means it got modified by PR_Poll, but the call did not return yet.
I'm thinking we may need to add a PR_POLL_EXCEPT to the in_flags.
This patch will allow tstclnt to return if selfserv is killed manually.
It does not however fix the main issue of the hang in ssl.sh with the WIN95 build.
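The change amounts to widening the event mask on the socket's poll entry. A minimal sketch of the idea follows; the pollset index name comes from the description above, and the rest is illustrative rather than the tstclnt source:

    #include "nspr.h"

    /* Illustrative sketch: ask PR_Poll to wake up on exceptional conditions
     * (e.g. the peer going away) as well as readability, so the client can
     * notice a dead server instead of blocking forever. */
    enum { SSOCK_FD = 0, NUM_FDS = 1 };   /* indices assumed for illustration */

    static PRInt32
    wait_for_server(PRFileDesc *ssock)
    {
        PRPollDesc pollset[NUM_FDS];

        pollset[SSOCK_FD].fd        = ssock;
        pollset[SSOCK_FD].in_flags  = PR_POLL_READ | PR_POLL_EXCEPT;
        pollset[SSOCK_FD].out_flags = 0;

        return PR_Poll(pollset, NUM_FDS, PR_INTERVAL_NO_TIMEOUT);
    }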
Attachment #217240 - Flags: review?(wtchang)
I checked the state of selfserv when the hang occurred:
It had 8 threads waiting for a job, and 1 thread in accept.
tstclnt was waiting in PR_Poll on the socket.
netstat showed the socket in the ESTABLISHED state.
This tells me that the bug is probably selfserv not closing the socket.
I did 3.9 and 3.10 builds targeted to WIN95 and did not see this problem.
On my tip build, I reverted selfserv.c to version 1.69, and the problem was also gone. Version 1.70 is the one that introduced the linger change and signal handler. But only the linger change is compiled for Windows. Removing the PR_SetSocketOption alone was enough to fix the problem, as I suggested in comment 8. You told me in our meeting today that this didn't have any effect for your build. Can you please double-check this on your build?

Certainly it doesn't make sense that setting the linger option would keep the connection open forever on the server-side. It could be an OS bug, or an NSPR bug that manifests itself only in the WIN95 build.
I would also suggest we add some socket tracing to selfserv to make sure we aren't leaking any connection socket, which could cause this hang. It would be good to print the remote (client) port before calling PR_Close. This information could be used to correlate with netstat output and find out if the server actually tried to close the connection.
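One way such tracing could look (a sketch, not necessarily how selfserv was actually instrumented; for brevity it only prints the IPv4 port field):

    #include "nspr.h"

    /* Sketch: log the peer's address and port just before closing, so the
     * entry can be correlated with netstat output. */
    static PRStatus
    close_with_trace(PRFileDesc *fd)
    {
        PRNetAddr peer;
        char addr[64];

        if (PR_GetPeerName(fd, &peer) == PR_SUCCESS &&
            PR_NetAddrToString(&peer, addr, sizeof addr) == PR_SUCCESS) {
            PR_fprintf(PR_STDERR, "closing connection from %s:%u\n",
                       addr, (unsigned)PR_ntohs(peer.inet.port));
        }
        return PR_Close(fd);
    }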
(In reply to comment #14)
> Are you running into this hang problem when targetting WINNT, WIN95, or both ?

Win95. That's what I build 99% of the time.
Comment on attachment 217240 [details] [diff] [review]
add PR_POLL_EXCEPT to PR_Poll call . Checked in to the tip.


r=wtc.  You should do an exhaustive search for
PR_POLL_ in this file to see if you need to make
other related changes.
Attachment #217240 - Flags: review?(wtchang) → review+
Comment on attachment 217240 [details] [diff] [review]
add PR_POLL_EXCEPT to PR_Poll call . Checked in to the tip.


Please fold this new too-long line in some reasonable way.
Changing OS to Win95, since this is the target where the problem occurred.
I would like to lower this bug to a P2, since the problem only happens on that target platform, and we don't build or QA on it. Does anyone at Red Hat do builds or QA for Win95?

I talked to Nelson earlier, and he agreed to implement my suggestion in comment 18 in selfserv, which should allow narrowing down the problem to either a selfserv socket leak bug, or an NSPR or system linger-related bug.
OS: Windows XP → Windows 95
Comment on attachment 217240 [details] [diff] [review]
add PR_POLL_EXCEPT to PR_Poll call . Checked in to the tip.


I folded the line and checked this in to the tip.

Checking in tstclnt.c;
/cvsroot/mozilla/security/nss/cmd/tstclnt/tstclnt.c,v  <--  tstclnt.c
new revision: 1.45; previous revision: 1.44

I already inspected the other 2 PR_Poll calls, and they had PR_POLL_EXCEPT already for the socket fd.
Attachment #217240 - Attachment description: add PR_POLL_EXCEPT to PR_Poll call → add PR_POLL_EXCEPT to PR_Poll call . Checked in to the tip.
Summary: strsclnt and tstclnt hangs on windows → strsclnt and tstclnt hangs on Win95 target
The OS_TARGET=WIN95 configuration is one of the most
important build configurations because all the Mozilla
clients use this configuration on Windows.  The name "WIN95"
here is a misnomer; this build configuration really
means "WIN32".

The linger socket option change (in bug 331413) should
be backed out on the NSS_3_11_BRANCH because it
introduced this regression (comment 17).
OS: Windows 95 → Windows XP
Then we need builds and QA for the WIN95 build. Christophe/Jason, this is done the same way as the WINNT build, except you set OS_TARGET=WIN95 in the build and environment.
This implements logging of the remote port and IP address of every connection close in selfserv, as I suggested in comment 18.
I ran ssl.sh with linger enabled and this logging, and it logged that it was closing the connection on port 3032. Yet netstat -a showed that connection still in the ESTABLISHED state on both sides. This proves that the ensuing PR_Close call did not work.
This is not an NSS bug. PR_Close did not close. Transferring to NSPR.
Component: Test → NSPR
Product: NSS → NSPR
Target Milestone: 3.11.1 → ---
Version: 3.11 → 4.6.1
Please also patch the _MD_CloseSocket function
in mozilla/nsprpub/pr/src/md/windows/w95sock.c to
log a message after the statement
    rv = closesocket((SOCKET) osfd );

I bet that you will find every PR_Close log message
has a corresponding successful (rv=0) closesocket
log message.

If that's the case, the only thing I can think
of is to call shutdown((SOCKET) osfd, SD_BOTH)
before closesocket(), or call PR_Shutdown()
before PR_Close().
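For the second suggestion, the NSS-level variant would amount to something like this sketch (illustrative only, not a checked-in change):

    #include "nspr.h"

    /* Sketch: shut down both directions first, so the later close has
     * nothing left to linger on. */
    static PRStatus
    shutdown_then_close(PRFileDesc *fd)
    {
        (void) PR_Shutdown(fd, PR_SHUTDOWN_BOTH);
        return PR_Close(fd);
    }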
I added some logging code to check the return value of PR_Close. It is PR_FAILURE. The NSPR error code is -5998. The OS error code is 10035, i.e. WSAEWOULDBLOCK.
Thank you, Julien.

closesocket() may fail with the WSAEWOULDBLOCK error
if you set the SO_LINGER socket option with a small
linger time.  You can try increasing the linger time
to 10 seconds.
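For reference, a sketch of setting the linger option through NSPR with a 10-second linger time (selfserv's actual option setup may differ):

    #include "nspr.h"

    /* Sketch: enable SO_LINGER via NSPR with a 10-second linger time. */
    static PRStatus
    set_linger(PRFileDesc *fd)
    {
        PRSocketOptionData opt;

        opt.option                = PR_SockOpt_Linger;
        opt.value.linger.polarity = PR_TRUE;                  /* linger on close */
        opt.value.linger.linger   = PR_SecondsToInterval(10);
        return PR_SetSocketOption(fd, &opt);
    }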
Wan-Teh,

I increased linger to 10 seconds. That made no difference; I still got the same failure in PR_Close. I have never heard of a requirement to check for PR_WOULD_BLOCK_ERROR in PR_Close before. I would consider this an NSPR bug regardless of the linger value.
Summary: strsclnt and tstclnt hangs on Win95 target → PR_Close fails on Win95 target when PR_SockOpt_Linger is set
Wan-Teh,

You can just apply this patch to the NSS tip with OS_TARGET=WIN95 and run all.sh. You will hit the assertion on PR_Close failure on the second PR_Close.
Attachment #218117 - Attachment is obsolete: true
Assignee: nelson → wtchang
I have to agree this is an NSPR error.  The socket in question is a blocking
NSPR socket.  Yes, we all know that, under the hood, all the system sockets
that underlie NSPR sockets are really non-blocking system sockets.  But when
an NSPR socket is offering blocking socket semantics to its user, it ought
never to return an error that is equivalent to EWOULDBLOCK.
Julien, thanks for your investigation of this bug.

I did some Web search for this topic and found some useful
information.

These two Microsoft documentation pages describe how Winsock
handles SO_LINGER and closesocket():
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winsock/winsock/closesocket_2.asp
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winsock/winsock/graceful_shutdown_linger_options_and_socket_closure_2.asp

These two pages mention the SO_LINGER socket option and
close() failing with EWOULDBLOCK in general.  I am not sure
of the quality of the second page:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4631988
http://www.developerweb.net/forum/archive/index.php/t-2982.html

UNIX Network Programming: The Sockets Networking API, Volume 1,
Third Edition says in Sec. 7.5 on page 203:

  3. If l_onoff is nonzero and l_linger is nonzero, then the
     kernel will linger when the socket is closed (p. 472 of
     TCPv2).  That is, if there is any data still remaining
     in the socket send buffer, the process is put to sleep
     until either: (i) all the data is sent and acknowledged
     by the peer TCP, or (ii) the linger time expires.  If
     the socket has been set to nonblocking (Chapter 16), it
     will not wait for the close to complete, even if the
     linger time is nonzero.  When using this feature of the
     SO_LINGER option, it is important for the application to
     check the return value from close, because if the linger
     time expires before the remaining data is sent and
     acknowledged, close returns EWOULDBLOCK and any
     remaining data in the send buffer is discarded.

This paragraph leaves out many details about close() and
EWOULDBLOCK on blocking and nonblocking sockets.  Unfortunately
this means it's not easy to find out if there is any portable
behavior of close() and EWOULDBLOCK on blocking and nonblocking
sockets that NSPR should implement.

I don't have time to study this issue further.  But it
seems that for an NSPR blocking socket, NSPR should set
the underlying system socket back to blocking mode before
calling close()/closesocket().  This experimental patch
implements this, only for the WIN95 configuration for now.
Note that system call counters won't like this patch
because it adds a system call to PR_Close.
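At the Winsock level, the idea of the experimental patch boils down to roughly the following sketch (the real change lives in _MD_CloseSocket in w95sock.c; this is only the idea, not the patch):

    #include <winsock2.h>

    /* Sketch: put the OS socket back into blocking mode before closesocket(),
     * so a lingering close blocks instead of failing with WSAEWOULDBLOCK. */
    static int
    close_blocking(SOCKET osfd)
    {
        u_long zero = 0;                  /* 0 = blocking mode */

        (void) ioctlsocket(osfd, FIONBIO, &zero);
        return closesocket(osfd);
    }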
Wan-Teh,

I think this will vary on a per-platform basis.
FYI, the man page for close on Solaris says :

     If fildes refers to a socket, close() causes the  socket  to
     be  destroyed.   If  the  socket is connection-mode, and the
     SO_LINGER option is set for the socket with non-zero  linger
     time,  and  the  socket has untransmitted data, then close()
     will block for up to the current linger interval  until  all
     data is transmitted.

EWOULDBLOCK is not listed among the possible error codes for close().
Only EINTR, ENOLINK, ENOSPC and EIO are listed in the man page.

So we don't need to add it for Solaris. I think as far as Sun is concerned, we don't mind adding a system call to all the other platforms if there is no time to do the research for each one. Right now it certainly seems like only Win95 needs it. It is odd that WinNT doesn't appear to need it.
(In reply to comment #34)

I built NSPR for Win95 from the NSPR 4_6_BRANCH with your experimental
patch applied.  
Before doing that, I could not get ssl.sh to run to completion without
hangs in tstclnt.
After rebuilding NSPR with the patch and relinking everything to use 
that new NSPR, I have been able to run all.sh without any apparent hangs.
Perhaps I should repeat this all.sh test a few times to further confirm 
the result, but it looks like a winner to me!  
Comment on attachment 218183 [details] [diff] [review]
Experimental patch: set the system socket back to blocking mode so that close/closesocket blocks

The patch also fixes the problem for me.

What I don't understand is why it is not needed when targeting WINNT instead of WIN95, and running on the same machine where the WIN95 build fails without the fix. It is running on the same underlying OS.

If we are concerned about saving system calls and avoiding the ioctlsocket: perhaps the _PR_MD_CLOSE_SOCKET function could simply try close first without changing the socket to blocking. It could change it only if it gets an EWOULDBLOCK/WSAEWOULDBLOCK, and then try closing again once. This fix could work for all platforms, and would only ever cost something if the platform has the problem of close sometimes returning this error.
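Roughly, that alternative would look like this sketch (an untested illustration of the idea, not a proposed patch):

    #include <winsock2.h>

    /* Sketch: only pay for the extra ioctlsocket() call when closesocket()
     * actually reports WSAEWOULDBLOCK. */
    static int
    close_with_retry(SOCKET osfd)
    {
        int rv = closesocket(osfd);

        if (rv != 0 && WSAGetLastError() == WSAEWOULDBLOCK) {
            u_long zero = 0;              /* switch back to blocking mode */
            (void) ioctlsocket(osfd, FIONBIO, &zero);
            rv = closesocket(osfd);
        }
        return rv;
    }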
Blocks: 331413
Attached patch revised patch (obsolete) — Splinter Review
Wan-Teh, this slight revision to your last patch differs from yours
only in that, for non-DEBUG builds, it ignores failure of the 
ioctlsocket call, and proceeds to attempt to close the socket.

Please review.  I plan to check this in if you approve.
Assignee: wtchang → nelson
Status: NEW → ASSIGNED
Attachment #231546 - Flags: review?(wtchang)
Comment on attachment 231546 [details] [diff] [review]
revised patch

We should avoid such a behavioral difference between debug and optimized
builds.  You can add a PR_LOG statement to log ioctlsocket failures.
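For what it's worth, such a PR_LOG call might look roughly like the sketch below; the log module name is an assumption, and PR_LOG only produces output in builds where NSPR logging is compiled in:

    #include "prlog.h"

    /* Sketch only: log an ioctlsocket() failure instead of asserting, so
     * optimized and debug builds take the same code path. */
    static PRLogModuleInfo *io_log;       /* assumed module */

    static void
    log_ioctl_failure(int err)
    {
        if (!io_log)
            io_log = PR_NewLogModule("io");
        PR_LOG(io_log, PR_LOG_WARNING, ("ioctlsocket failed, error %d", err));
    }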
Attached patch patch, rev "j"Splinter Review
This patch uses PR_LOG.  I am unaware of any conventions or limitations on the
strings used with PR_LOG, so please advise if this string breaks any rules.
Attachment #231546 - Attachment is obsolete: true
Attachment #231667 - Flags: review?(wtchang)
Attachment #231546 - Flags: review?(wtchang)
Comment on attachment 231667 [details] [diff] [review]
patch, rev "j"

I now remember why my "experimental patch" put #ifdef WIN95 in a different
place.  I want to set the underlying OS socket back to blocking mode on
all platforms, but I only wrote the code to do that for WIN95.  This
comment shows that:

>+			 * This should become a _PR_MD_MAKE_BLOCK macro
>+			 * defined for all platforms.

So please either write the code for all platforms, or move #ifdef WIN95
back to where it is in my experimental patch.  (The WINNT configuration
doesn't need this code.  For XP_UNIX you can call fcntl.)
update:  
testing on WinXP with the "Win95" target and debug RTL, ssl.sh proceeds along
to a certain point, at which it starts to hang.  From that point forward,
all tests fail, reporting timeout errors.  (Thank goodness for the timeout 
errors; otherwise, the tests would just hang completely.)

I guess lots more debugging is needed.  I will continue to work on that.  

Wan-Teh, I'm not aware that closing non-blocking sockets is a problem on any
platform except Windows.  If I make the suggested changes for all platforms,
then they need to be thoroughly tested on all platforms, which will delay this
fix (and others).  

Question: if it isn't broken (on other platforms), why fix it?
Nelson,

A goal of NSPR is that an NSPR function should have
cross-platform behavior.  This is usually accomplished
by calling the same (or equivalent) system library
functions on all platforms.  But sometimes we need
to add platform-specific code for some platforms to
adjust the behavior so that it is the same as the other
platforms. The question is whether the WIN95 code in my
experimental patch is the (incomplete) former case or the
latter case.

Looking at my patch, I believe the cross-platform behavior
we want is (feel free to correct it):

   The close operation of a blocking NSPR socket should
   block (until either all the data are sent or the linger
   time expires) if the linger socket option is set.

and my research reported in comment 34 showed that we should
set the underlying OS socket to blocking mode on all platforms.
A bug fix should not introduce platform inconsistency.
Attachment #231667 - Flags: review?(wtchang) → review-
Severity: blocker → normal
Status: ASSIGNED → NEW
OS: Windows XP → All
Priority: P1 → P3
Hardware: PC → All
Blocks: 355183
This problem is not specific to Win95. PR_Close also fails on Windows NT (OS_TARGET=WINNT) with linger set in selfserv when running in multi-process mode.
Summary: PR_Close fails on Win95 target when PR_SockOpt_Linger is set → PR_Close fails on Windows when PR_SockOpt_Linger is set
Bumping priority to P2.
Priority: P3 → P2
QA Contact: test → nspr
Assignee: nelson → wtc

The bug assignee didn't log in to Bugzilla in the last 7 months and this bug has priority 'P2'.
:KaiE, could you have a look please?
For more information, please visit the auto_nag documentation.

Assignee: wtc → nobody
Flags: needinfo?(kaie)

Server code issues aren't a high priority these days.

Flags: needinfo?(kaie)
Priority: P2 → P5
Severity: normal → S3