leaking sockets; nsISocketTransport::IsAlive not implemented reliably

VERIFIED FIXED in mozilla0.9.4

Status

()

P1
major
VERIFIED FIXED
17 years ago
17 years ago

People

(Reporter: darin.moz, Assigned: darin.moz)

Tracking

({memory-leak})

Trunk
mozilla0.9.4
x86
All
memory-leak
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: r=wtc, sr=dougt, URL)

Attachments

(3 attachments)

(Assignee)

Description

17 years ago
browse to www.cnn.com, click on a link and then go back.  fire up netstat -tcpd
and wait a bit... notice all the many sockets in the CLOSE_WAIT state.  now
goto some other site, like mozilla.org.  notice that the CLOSE_WAIT sockets have
not disappeared.  this is bad!
(Assignee)

Updated

17 years ago
Severity: normal → major
Status: NEW → ASSIGNED
Keywords: mlk
Priority: -- → P1
Target Milestone: --- → mozilla0.9.4
Is this really a leak? We now do keep alives to lots of hosts we didn't,
including cnn.com. Our current default timeout is 300 seconds, so you will have
to wait 5 minutes for these sockets to be reused. Or are you seeing this persist
for more than 5 minutes?

See bug 91204 for the timeout issue.
(Assignee)

Comment 2

17 years ago
our IsAlive test includes both testing that the connection has not idled for
more than the alotted time, and it also calls PR_Read(fd, buf, 0) to check that
the socket has not been closed.  i've always found this test to be sufficient at
detecting the CLOSE_WAIT state on linux and windows.

so, i believe that for some reason, the IsAlive test is not being run.
But with HTTP/1.1 connections, the server can keep the connection open for as
long as it wants.

On cnn.com, that appears to be 30 seconds. Does this still happen if you wait a
minute or so?
(Assignee)

Comment 4

17 years ago
if the socket is in the CLOSE_WAIT state, then that implies that the connection
has already been closed on the server end.  

still, i'll wait longer and see if it ever goes away, just to be sure.

Comment 5

17 years ago
I see this on an 08-12 nightly. Not all sockets get stuck but, for those
that do, the last thing that happens is the server sends a FIN packet,
it is ACK'ed but nothing else follows. Mozilla should send a FIN but
doesn't. I waited 20 minutes and the stuck sockets were still there.
It's not CNN either. I see it on any number of sites.
(Assignee)

Comment 6

17 years ago
OK, i've verified that this is a problem with our IsAlive test, and it seems to
be showing up on winnt as well (changing OS to ALL).  That is, until i revisit a
server with a connection in the CLOSE_WAIT state, the CLOSE_WAIT socket doesn't
dissappear.  So, it seems that it is not until we try to actually write out a
request that we detect the closed socket.
OS: Linux → All
(Assignee)

Comment 7

17 years ago
after further investigation, PR_Read(fd, &c, 0) on one of these CLOSE_WAIT
sockets returns PR_WOULD_BLOCK_ERROR.  from nsSocketTransport.cpp:1716, notice
that we do not treat this error as an indication that the socket is closed.
(Assignee)

Comment 8

17 years ago
after further investigation, i'm not convinced that there is any portable
reliable way to detect a socket in the CLOSE_WAIT state.  moreover, it really
isn't necessary that we can detect CLOSE_WAIT.  so, i think the right solution
to this "problem" is to reduce our default idle timeout.

however, i wonder if we shouldn't be using a different idle timeout when talking
to a proxy.  bbaetz: what do you think?
(Assignee)

Comment 9

17 years ago
Created attachment 45950 [details] [diff] [review]
v1.0 reduces the default keep-alive timeout pref
This came up in bug 91204. You want longer with a proxy, because you're much
more likely to talk to it again than a normal server. We should wait on taht til
my bug 89500 cleanup has gone in - it will make it a bit easier.

Does the NSPR landing yesterday give us any way to detect a closed socket? wtc,
do you have any suggestions?

I think that that timeout is too high for a server - we should take that up in
bug 91204.
(Assignee)

Updated

17 years ago
Keywords: patch
(Assignee)

Comment 11

17 years ago
agreed.. bug 91204 is probably a better place for my patch to all.js.
(Assignee)

Updated

17 years ago
Keywords: patch
(Assignee)

Comment 12

17 years ago
improving summary
Summary: leaking sockets → leaking sockets; nsISocketTransport::IsAlive not implemented reliably
(Assignee)

Comment 13

17 years ago
shaver suggested this algorithm (supposed to work on unix and win32):

PR_Bool IsSocketDead(int fd) {
  struct msghdr msg;
  return recvmgr(fd, &msg, MSG_PEEK) == -1;
}

i have yet to verify it though.
(Assignee)

Comment 14

17 years ago
s/recvmgr/recvmsg/

Comment 15

17 years ago
The only reliable way to test for a closed connection
is to read from it.  If read returns 0, the peer has
closed the connection (at least for writing).

You can probably try getpeername(), but I am not sure
if it works reliably.
(Assignee)

Comment 16

17 years ago
wtc: so, peeking at the data does not equate to reading?

b/c now that we have NSPR 4.2, i was thinking about trying PR_Recv with the
PR_MSG_PEEK flag set.

Comment 17

17 years ago
Suppose your peer sends you 10 bytes of data and closes the connection.

If you call recv(MSG_PEEK) for 5 bytes, you will get 5 bytes.
If you do that again, you will still get 5 bytes.
etc. etc.

You will never discover that the connection is closed by the peer this
way.

On the other hand, if you read 5 bytes each time, the third read will
return 0, and you know the peer closed the connection (at least for
writing).
(Assignee)

Comment 18

17 years ago
i'm only concerned with sockets that should not have any unread data, so
PR_Recv(MSG_PEEK) should be the way to go.
(Assignee)

Comment 19

17 years ago
Created attachment 46044 [details] [diff] [review]
v1.1 implement IsAlive using PR_Recv(MSG_PEEK)
(Assignee)

Updated

17 years ago
Keywords: patch
(Assignee)

Comment 21

17 years ago
looks like LXR is not showing NSPR 4.2 b/c in 4.2 it is valid, and PR_MSG_PEEK
is defined.

Comment 22

17 years ago
The flags parameter is valid for PR_Recv.  We haven't updated
the documentation.  Darin is right, LXR is still showing the
old NSPR.  This will be fixed soon.

The only value you can use for the flags parameter is 0 or
PR_MSG_PEEK.

Comment 23

17 years ago
does "is alive" mean the same as "is there data on the socket"?
(Assignee)

Comment 24

17 years ago
the "is alive" test is meant to check the socket connection status.  for
example, if you run "netstat -tcpd" you'll see a list of all TCP/IP sockets
along with each socket's current state.  we would like to be able to detect
sockets in the CLOSE_WAIT state (meaning sockets that have already been closed
[for writing] on the server side).  this helps us throw out these connections
before we actually try using them.  the benefits are twofold: 1) helps remove
deadwood -- servers can't cleanup a socket connection until we also close the
connection, and 2) helps avoid having to restart HTTP requests in the event that
the connection is already closed.

the side effect of not being able to detect a socket in the CLOSE_WAIT state is
not sooo bad... since we also limit the idle time per socket, but IMO fixing
this is good as it will help make us a more friendly player on the internet.
(Assignee)

Comment 25

17 years ago
wtc, dougt: is this patch ok?  if so, can i get an r/sr= from you guys?  thx!

Comment 26

17 years ago
I am not qualified to review this patch.
I can only say the original code is no
good because it depends on the undocumented
behavior of reading zero bytes.

You wrote:
> i'm only concerned with sockets that
> should not have any unread data

If this is true, you should be able to say:
    char c;
    PRInt32 rval = PR_Read (mSocketFD, &c, 1);
    PR_ASSERT(rval <= 0);
    if (rval > 0) {
        /* read unexpected data */
        /* display error message */
    }
(Assignee)

Comment 27

17 years ago
wtc: right, but even though the socket "should not" have any unread data when
isAlive is called, there is unfortunately no guarantee that client code won't
call isAlive when it really would be meaningless to do so.  thus, the
PR_Recv(MSG_PEEK) solution seems very nice.

Comment 28

17 years ago
r=wtc.

Comment 29

17 years ago
sr=dougt
(Assignee)

Updated

17 years ago
Whiteboard: r=wtc, sr=dougt

Comment 30

17 years ago
I have debugged this IsAlive thing too, i notised that when proxy connections
timed out netstat shows that there is one byte in received queue, so
PR_Read allways thinks that connection is ok.

I hacked IsAlive to use PR_Poll to see that socket is ok, eg there shouldn
be any bytes availeable in socket and no errors so it could be reused again.

I'll attach my poll hack.

Comment 31

17 years ago
Created attachment 46651 [details] [diff] [review]
hack that uses PR_Poll to see is socket in ok state
(Assignee)

Comment 32

17 years ago
tomi, the one byte in the Recv-Q is expected when the socket moves to the
CLOSE_WAIT state.  can you try out my v1.1 patch to verify that it works for
you?  in trying to fix this, i also messed around with PR_Poll, but with your
patch you'll get false negatives when there really is data to be read.  not that
someone should be calling isAlive in that case, but they might.
(Assignee)

Comment 33

17 years ago
fixed-on-trunk
Status: ASSIGNED → RESOLVED
Last Resolved: 17 years ago
Resolution: --- → FIXED
Whiteboard: r=wtc, sr=dougt → r=wtc, sr=dougt, fixed-on-trunk

Comment 34

17 years ago
Yes v1.1 seems to work ok for me. (linux)

There is still other problen than socket cleanup is only done when
opening new sockets, maybe timer based cleanup routine?
(Assignee)

Comment 35

17 years ago
yeah, but that's another bug ;-)

Comment 36

17 years ago
verified:  sockets in the CLOSE_WAIT state are now being removed appropriately
when going to other sites.   Fixed before branch.

Linux rh6 2001-09-24-04-0.9.4
Win NT4  2001-09-24-05-0.9.4
Status: RESOLVED → VERIFIED
Whiteboard: r=wtc, sr=dougt, fixed-on-trunk → r=wtc, sr=dougt
You need to log in before you can comment on or make changes to this bug.