Closed Bug 94811 Opened 23 years ago Closed 23 years ago

leaking sockets; nsISocketTransport::IsAlive not implemented reliably

Categories

(Core :: Networking: HTTP, defect, P1)

Hardware: x86
OS: All

Tracking

Status: VERIFIED FIXED
Target Milestone: mozilla0.9.4

People

(Reporter: darin.moz, Assigned: darin.moz)


Details

(Keywords: memory-leak, Whiteboard: r=wtc, sr=dougt)

Attachments

(3 files)

browse to www.cnn.com, click on a link and then go back.  fire up netstat -tcpd
and wait a bit... notice the many sockets in the CLOSE_WAIT state.  now
go to some other site, like mozilla.org.  notice that the CLOSE_WAIT sockets have
not disappeared.  this is bad!
Severity: normal → major
Status: NEW → ASSIGNED
Keywords: mlk
Priority: -- → P1
Target Milestone: --- → mozilla0.9.4
Is this really a leak? We now do keep-alives to lots of hosts we didn't before,
including cnn.com. Our current default timeout is 300 seconds, so you will have
to wait 5 minutes for these sockets to be reused. Or are you seeing this persist
for more than 5 minutes?

See bug 91204 for the timeout issue.
our IsAlive test both checks that the connection has not idled for more than
the allotted time and calls PR_Read(fd, buf, 0) to check that the socket has
not been closed.  i've always found this test to be sufficient for detecting
the CLOSE_WAIT state on linux and windows.

so, i believe that for some reason, the IsAlive test is not being run.
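
(for illustration: a rough sketch, in NSPR terms, of the check described above --
not the actual nsSocketTransport::IsAlive code; the LooksAlive name and its
parameters are hypothetical.)

#include "prtypes.h"
#include "prio.h"
#include "prerror.h"
#include "prinrval.h"

/* hypothetical sketch of the two-part liveness check described above:
 * 1) the connection must not have idled past the allotted timeout, and
 * 2) a zero-byte PR_Read probes whether the peer has closed the socket.
 * as the rest of this bug shows, step 2 does not catch CLOSE_WAIT,
 * because the zero-byte read just fails with PR_WOULD_BLOCK_ERROR. */
static PRBool
LooksAlive(PRFileDesc *fd, PRIntervalTime lastActive, PRIntervalTime idleTimeout)
{
    char c;
    if ((PRIntervalTime)(PR_IntervalNow() - lastActive) > idleTimeout)
        return PR_FALSE;                       /* idled too long */
    if (PR_Read(fd, &c, 0) < 0 &&
        PR_GetError() != PR_WOULD_BLOCK_ERROR)
        return PR_FALSE;                       /* hard error: treat as dead */
    return PR_TRUE;                            /* CLOSE_WAIT slips through here */
}
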
But with HTTP/1.1 connections, the server can keep the connection open for as
long as it wants.

On cnn.com, that appears to be 30 seconds. Does this still happen if you wait a
minute or so?
if the socket is in the CLOSE_WAIT state, then that implies that the connection
has already been closed on the server end.  

still, i'll wait longer and see if it ever goes away, just to be sure.
I see this on an 08-12 nightly. Not all sockets get stuck but, for those
that do, the last thing that happens is the server sends a FIN packet,
it is ACK'ed but nothing else follows. Mozilla should send a FIN but
doesn't. I waited 20 minutes and the stuck sockets were still there.
It's not CNN either. I see it on any number of sites.
OK, i've verified that this is a problem with our IsAlive test, and it seems to
be showing up on winnt as well (changing OS to All).  That is, until i revisit a
server with a connection in the CLOSE_WAIT state, the CLOSE_WAIT socket doesn't
disappear.  So, it seems that it is not until we try to actually write out a
request that we detect the closed socket.
OS: Linux → All
after further investigation, PR_Read(fd, &c, 0) on one of these CLOSE_WAIT
sockets fails with PR_WOULD_BLOCK_ERROR.  from nsSocketTransport.cpp:1716, notice
that we do not treat this error as an indication that the socket is closed.
after further investigation, i'm not convinced that there is any portable,
reliable way to detect a socket in the CLOSE_WAIT state.  moreover, it isn't
really necessary that we be able to detect CLOSE_WAIT.  so, i think the right
solution to this "problem" is to reduce our default idle timeout.

however, i wonder if we shouldn't be using a different idle timeout when talking
to a proxy.  bbaetz: what do you think?
This came up in bug 91204. You want longer with a proxy, because you're much
more likely to talk to it again than a normal server. We should wait on that
until my bug 89500 cleanup has gone in - it will make it a bit easier.

Does the NSPR landing yesterday give us any way to detect a closed socket? wtc,
do you have any suggestions?

I think that that timeout is too high for a server - we should take that up in
bug 91204.
Keywords: patch
agreed.. bug 91204 is probably a better place for my patch to all.js.
Keywords: patch
improving summary
Summary: leaking sockets → leaking sockets; nsISocketTransport::IsAlive not implemented reliably
shaver suggested this algorithm (supposed to work on unix and win32):

PRBool IsSocketDead(int fd) {
  struct msghdr msg;
  memset(&msg, 0, sizeof(msg));  /* no iovecs needed; we only probe the socket */
  return recvmsg(fd, &msg, MSG_PEEK) == -1;
}

i have yet to verify it though.
The only reliable way to test for a closed connection
is to read from it.  If read returns 0, the peer has
closed the connection (at least for writing).

You can probably try getpeername(), but I am not sure
if it works reliably.
wtc: so, peeking at the data does not equate to reading?

b/c now that we have NSPR 4.2, i was thinking about trying PR_Recv with the
PR_MSG_PEEK flag set.
Suppose your peer sends you 10 bytes of data and closes the connection.

If you call recv(MSG_PEEK) for 5 bytes, you will get 5 bytes.
If you do that again, you will still get 5 bytes.
etc. etc.

You will never discover that the connection is closed by the peer this
way.

On the other hand, if you read 5 bytes each time, the third read will
return 0, and you know the peer closed the connection (at least for
writing).
i'm only concerned with sockets that should not have any unread data, so
PR_Recv(MSG_PEEK) should be the way to go.
Keywords: patch
looks like LXR is not showing NSPR 4.2 yet; in 4.2 the flags parameter is
valid, and PR_MSG_PEEK is defined.
The flags parameter is valid for PR_Recv.  We haven't updated
the documentation.  Darin is right, LXR is still showing the
old NSPR.  This will be fixed soon.

The only value you can use for the flags parameter is 0 or
PR_MSG_PEEK.
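
(for illustration: a minimal sketch of an isAlive check built on PR_Recv with
PR_MSG_PEEK, as discussed above.  this is not the exact patch attached to this
bug; the IsAlivePeek name is hypothetical.)

#include "prtypes.h"
#include "prio.h"
#include "prerror.h"
#include "prinrval.h"

/* peek at one byte without blocking: 0 means the peer has closed the
 * connection, a positive value means there is unread data pending, and
 * -1 with PR_WOULD_BLOCK_ERROR means the connection is idle but open. */
static PRBool
IsAlivePeek(PRFileDesc *fd)
{
    char c;
    PRInt32 rv = PR_Recv(fd, &c, 1, PR_MSG_PEEK, PR_INTERVAL_NO_WAIT);
    if (rv > 0)
        return PR_TRUE;                        /* data pending -- still alive */
    if (rv < 0)
        return PR_GetError() == PR_WOULD_BLOCK_ERROR;
    return PR_FALSE;                           /* rv == 0: peer closed */
}

unlike a one-byte PR_Read, the peek does not consume data, so calling it on a
socket that happens to have unread data is harmless.
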
does "is alive" mean the same as "is there data on the socket"?
the "is alive" test is meant to check the socket connection status.  for
example, if you run "netstat -tcpd" you'll see a list of all TCP/IP sockets
along with each socket's current state.  we would like to be able to detect
sockets in the CLOSE_WAIT state (meaning sockets that have already been closed
[for writing] on the server side).  this helps us throw out these connections
before we actually try using them.  the benefits are twofold: 1) it helps remove
deadwood -- servers can't clean up a socket connection until we also close the
connection, and 2) it helps avoid having to restart HTTP requests in the event that
the connection is already closed.

the side effect of not being able to detect a socket in the CLOSE_WAIT state is
not sooo bad... since we also limit the idle time per socket, but IMO fixing
this is good as it will help make us a more friendly player on the internet.
wtc, dougt: is this patch ok?  if so, can i get an r/sr= from you guys?  thx!
I am not qualified to review this patch.
I can only say the original code is no
good because it depends on the undocumented
behavior of reading zero bytes.

You wrote:
> i'm only concerned with sockets that
> should not have any unread data

If this is true, you should be able to say:
    char c;
    PRInt32 rval = PR_Read (mSocketFD, &c, 1);
    PR_ASSERT(rval <= 0);
    if (rval > 0) {
        /* read unexpected data */
        /* display error message */
    }
wtc: right, but even though the socket "should not" have any unread data when
isAlive is called, there is unfortunately no guarantee that client code won't
call isAlive when it really would be meaningless to do so.  thus, the
PR_Recv(MSG_PEEK) solution seems very nice.
r=wtc.
sr=dougt
Whiteboard: r=wtc, sr=dougt
I have debugged this IsAlive thing too.  I noticed that when proxy connections
timed out, netstat shows that there is one byte in the receive queue, so
PR_Read always thinks that the connection is ok.

I hacked IsAlive to use PR_Poll to see that the socket is ok, i.e. there shouldn't
be any bytes available on the socket and no errors, so it could be reused again.

I'll attach my poll hack.
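
(for illustration: roughly what a PR_Poll-based check like the one described
above might look like.  tomi's actual hack is in the attachment; the IsAlivePoll
name is hypothetical.)

#include "prtypes.h"
#include "prio.h"
#include "prinrval.h"

/* consider the socket reusable only if polling reports neither readable
 * data nor an exception condition.  note this rejects a socket that has
 * legitimate unread data waiting, which is the false-negative case
 * mentioned in the next comment. */
static PRBool
IsAlivePoll(PRFileDesc *fd)
{
    PRPollDesc pd;
    pd.fd = fd;
    pd.in_flags = PR_POLL_READ | PR_POLL_EXCEPT;
    pd.out_flags = 0;
    return PR_Poll(&pd, 1, PR_INTERVAL_NO_WAIT) == 0;  /* 0 == nothing pending */
}
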
tomi, the one byte in the Recv-Q is expected when the socket moves to the
CLOSE_WAIT state.  can you try out my v1.1 patch to verify that it works for
you?  in trying to fix this, i also messed around with PR_Poll, but with your
patch you'll get false negatives when there really is data to be read.  not that
someone should be calling isAlive in that case, but they might.
fixed-on-trunk
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Whiteboard: r=wtc, sr=dougt → r=wtc, sr=dougt, fixed-on-trunk
Yes, v1.1 seems to work ok for me. (linux)

There is still another problem: socket cleanup is only done when
opening new sockets.  Maybe a timer-based cleanup routine?
yeah, but that's another bug ;-)
verified:  sockets in the CLOSE_WAIT state are now being removed appropriately
when going to other sites.   Fixed before branch.

Linux rh6 2001-09-24-04-0.9.4
Win NT4  2001-09-24-05-0.9.4
Status: RESOLVED → VERIFIED
Whiteboard: r=wtc, sr=dougt, fixed-on-trunk → r=wtc, sr=dougt