Closed Bug 94811 Opened 23 years ago Closed 23 years ago

leaking sockets; nsISocketTransport::IsAlive not implemented reliably

Categories

(Core :: Networking: HTTP, defect, P1)

Hardware: x86
OS: All

Tracking

Status: VERIFIED FIXED
Target Milestone: mozilla0.9.4

People

(Reporter: darin.moz, Assigned: darin.moz)


Details

(Keywords: memory-leak, Whiteboard: r=wtc, sr=dougt)

Attachments

(3 files)

browse to www.cnn.com, click on a link and then go back.  fire up netstat -tcpd
and wait a bit... notice the many sockets in the CLOSE_WAIT state.  now
go to some other site, like mozilla.org.  notice that the CLOSE_WAIT sockets have
not disappeared.  this is bad!
Severity: normal → major
Status: NEW → ASSIGNED
Keywords: mlk
Priority: -- → P1
Target Milestone: --- → mozilla0.9.4
Is this really a leak? We now do keep-alives to lots of hosts we didn't before,
including cnn.com. Our current default timeout is 300 seconds, so you will have
to wait 5 minutes for these sockets to be reused. Or are you seeing this persist
for more than 5 minutes?

See bug 91204 for the timeout issue.
our IsAlive test both checks that the connection has not idled for more than
the allotted time and calls PR_Read(fd, buf, 0) to check that the socket has
not been closed.  i've always found this test to be sufficient for detecting
the CLOSE_WAIT state on linux and windows.

so, i believe that for some reason, the IsAlive test is not being run.
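
(for illustration: a rough sketch, in NSPR terms, of the check described above --
not the actual nsSocketTransport::IsAlive code; the LooksAlive name and its
parameters are hypothetical.)

#include "prtypes.h"
#include "prio.h"
#include "prerror.h"
#include "prinrval.h"

/* hypothetical sketch of the two-part liveness check described above:
 * 1) the connection must not have idled past the allotted timeout, and
 * 2) a zero-byte PR_Read probes whether the peer has closed the socket.
 * as the rest of this bug shows, step 2 does not catch CLOSE_WAIT,
 * because the zero-byte read just fails with PR_WOULD_BLOCK_ERROR. */
static PRBool
LooksAlive(PRFileDesc *fd, PRIntervalTime lastActive, PRIntervalTime idleTimeout)
{
    char c;
    if ((PRIntervalTime)(PR_IntervalNow() - lastActive) > idleTimeout)
        return PR_FALSE;                       /* idled too long */
    if (PR_Read(fd, &c, 0) < 0 &&
        PR_GetError() != PR_WOULD_BLOCK_ERROR)
        return PR_FALSE;                       /* hard error: treat as dead */
    return PR_TRUE;                            /* CLOSE_WAIT slips through here */
}
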
But with HTTP/1.1 connections, the server can keep the connection open for as
long as it wants.

On cnn.com, that appears to be 30 seconds. Does this still happen if you wait a
minute or so?
if the socket is in the CLOSE_WAIT state, then that implies that the connection
has already been closed on the server end.  

still, i'll wait longer and see if it ever goes away, just to be sure.
I see this on an 08-12 nightly. Not all sockets get stuck but, for those
that do, the last thing that happens is the server sends a FIN packet,
it is ACK'ed but nothing else follows. Mozilla should send a FIN but
doesn't. I waited 20 minutes and the stuck sockets were still there.
It's not CNN either. I see it on any number of sites.
OK, i've verified that this is a problem with our IsAlive test, and it seems to
be showing up on winnt as well (changing OS to All).  That is, until i revisit a
server with a connection in the CLOSE_WAIT state, the CLOSE_WAIT socket doesn't
disappear.  So, it seems that it is not until we try to actually write out a
request that we detect the closed socket.
OS: Linux → All
after further investigation, PR_Read(fd, &c, 0) on one of these CLOSE_WAIT
sockets fails with PR_WOULD_BLOCK_ERROR.  from nsSocketTransport.cpp:1716, notice
that we do not treat this error as an indication that the socket is closed.
after further investigation, i'm not convinced that there is any portable,
reliable way to detect a socket in the CLOSE_WAIT state.  moreover, it isn't
really necessary that we be able to detect CLOSE_WAIT.  so, i think the right
solution to this "problem" is to reduce our default idle timeout.

however, i wonder if we shouldn't be using a different idle timeout when talking
to a proxy.  bbaetz: what do you think?
This came up in bug 91204. You want longer with a proxy, because you're much
more likely to talk to it again than a normal server. We should wait on that
until my bug 89500 cleanup has gone in - it will make it a bit easier.

Does the NSPR landing yesterday give us any way to detect a closed socket? wtc,
do you have any suggestions?

I think that that timeout is too high for a server - we should take that up in
bug 91204.
Keywords: patch
agreed.. bug 91204 is probably a better place for my patch to all.js.
Keywords: patch
improving summary
Summary: leaking sockets → leaking sockets; nsISocketTransport::IsAlive not implemented reliably
shaver suggested this algorithm (supposed to work on unix and win32):

PRBool IsSocketDead(int fd) {
  struct msghdr msg;
  memset(&msg, 0, sizeof(msg));  /* no iovecs needed; we only probe the socket */
  return recvmsg(fd, &msg, MSG_PEEK) == -1;
}

i have yet to verify it though.
The only reliable way to test for a closed connection
is to read from it.  If read returns 0, the peer has
closed the connection (at least for writing).

You can probably try getpeername(), but I am not sure
if it works reliably.
wtc: so, peeking at the data does not equate to reading?

b/c now that we have NSPR 4.2, i was thinking about trying PR_Recv with the
PR_MSG_PEEK flag set.
Suppose your peer sends you 10 bytes of data and closes the connection.

If you call recv(MSG_PEEK) for 5 bytes, you will get 5 bytes.
If you do that again, you will still get 5 bytes.
etc. etc.

You will never discover that the connection is closed by the peer this
way.

On the other hand, if you read 5 bytes each time, the third read will
return 0, and you know the peer closed the connection (at least for
writing).
i'm only concerned with sockets that should not have any unread data, so
PR_Recv(MSG_PEEK) should be the way to go.
Keywords: patch
looks like LXR is not showing NSPR 4.2 yet; in 4.2 the flags parameter is
valid, and PR_MSG_PEEK is defined.
The flags parameter is valid for PR_Recv.  We haven't updated
the documentation.  Darin is right, LXR is still showing the
old NSPR.  This will be fixed soon.

The only value you can use for the flags parameter is 0 or
PR_MSG_PEEK.
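
(for illustration: a minimal sketch of an isAlive check built on PR_Recv with
PR_MSG_PEEK, as discussed above.  this is not the exact patch attached to this
bug; the IsAlivePeek name is hypothetical.)

#include "prtypes.h"
#include "prio.h"
#include "prerror.h"
#include "prinrval.h"

/* peek at one byte without blocking: 0 means the peer has closed the
 * connection, a positive value means there is unread data pending, and
 * -1 with PR_WOULD_BLOCK_ERROR means the connection is idle but open. */
static PRBool
IsAlivePeek(PRFileDesc *fd)
{
    char c;
    PRInt32 rv = PR_Recv(fd, &c, 1, PR_MSG_PEEK, PR_INTERVAL_NO_WAIT);
    if (rv > 0)
        return PR_TRUE;                        /* data pending -- still alive */
    if (rv < 0)
        return PR_GetError() == PR_WOULD_BLOCK_ERROR;
    return PR_FALSE;                           /* rv == 0: peer closed */
}

unlike a one-byte PR_Read, the peek does not consume data, so calling it on a
socket that happens to have unread data is harmless.
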
does "is alive" mean the same as "is there data on the socket"?
the "is alive" test is meant to check the socket connection status.  for
example, if you run "netstat -tcpd" you'll see a list of all TCP/IP sockets
along with each socket's current state.  we would like to be able to detect
sockets in the CLOSE_WAIT state (meaning sockets that have already been closed
[for writing] on the server side).  this helps us throw out these connections
before we actually try using them.  the benefits are twofold: 1) it helps remove
deadwood -- servers can't clean up a socket connection until we also close the
connection, and 2) it helps avoid having to restart HTTP requests in the event that
the connection is already closed.

the side effect of not being able to detect a socket in the CLOSE_WAIT state is
not sooo bad... since we also limit the idle time per socket, but IMO fixing
this is good as it will help make us a more friendly player on the internet.
wtc, dougt: is this patch ok?  if so, can i get an r/sr= from you guys?  thx!
I am not qualified to review this patch.
I can only say the original code is no
good because it depends on the undocumented
behavior of reading zero bytes.

You wrote:
> i'm only concerned with sockets that
> should not have any unread data

If this is true, you should be able to say:
    char c;
    PRInt32 rval = PR_Read (mSocketFD, &c, 1);
    PR_ASSERT(rval <= 0);
    if (rval > 0) {
        /* read unexpected data */
        /* display error message */
    }
wtc: right, but even though the socket "should not" have any unread data when
isAlive is called, there is unfortunately no guarantee that client code won't
call isAlive when it really would be meaningless to do so.  thus, the
PR_Recv(MSG_PEEK) solution seems very nice.
r=wtc.
sr=dougt
Whiteboard: r=wtc, sr=dougt
I have debugged this IsAlive thing too.  I noticed that when proxy connections
timed out, netstat shows that there is one byte in the receive queue, so
PR_Read always thinks that the connection is ok.

I hacked IsAlive to use PR_Poll to see that the socket is ok, i.e. there shouldn't
be any bytes available on the socket and no errors, so it could be reused again.

I'll attach my poll hack.
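
(for illustration: roughly what a PR_Poll-based check like the one described
above might look like.  tomi's actual hack is in the attachment; the IsAlivePoll
name is hypothetical.)

#include "prtypes.h"
#include "prio.h"
#include "prinrval.h"

/* consider the socket reusable only if polling reports neither readable
 * data nor an exception condition.  note this rejects a socket that has
 * legitimate unread data waiting, which is the false-negative case
 * mentioned in the next comment. */
static PRBool
IsAlivePoll(PRFileDesc *fd)
{
    PRPollDesc pd;
    pd.fd = fd;
    pd.in_flags = PR_POLL_READ | PR_POLL_EXCEPT;
    pd.out_flags = 0;
    return PR_Poll(&pd, 1, PR_INTERVAL_NO_WAIT) == 0;  /* 0 == nothing pending */
}
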
tomi, the one byte in the Recv-Q is expected when the socket moves to the
CLOSE_WAIT state.  can you try out my v1.1 patch to verify that it works for
you?  in trying to fix this, i also messed around with PR_Poll, but with your
patch you'll get false negatives when there really is data to be read.  not that
someone should be calling isAlive in that case, but they might.
fixed-on-trunk
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Whiteboard: r=wtc, sr=dougt → r=wtc, sr=dougt, fixed-on-trunk
Yes, v1.1 seems to work ok for me. (linux)

There is still another problem: socket cleanup is only done when
opening new sockets.  Maybe a timer-based cleanup routine?
yeah, but that's another bug ;-)
verified:  sockets in the CLOSE_WAIT state are now being removed appropriately
when going to other sites.   Fixed before branch.

Linux rh6 2001-09-24-04-0.9.4
Win NT4  2001-09-24-05-0.9.4
Status: RESOLVED → VERIFIED
Whiteboard: r=wtc, sr=dougt, fixed-on-trunk → r=wtc, sr=dougt