Bug 94811 (Closed)
Opened 23 years ago, closed 23 years ago
leaking sockets; nsISocketTransport::IsAlive not implemented reliably
Categories
(Core :: Networking: HTTP, defect, P1)
Tracking
(VERIFIED FIXED, target milestone: mozilla0.9.4)
People
(Reporter: darin.moz, Assigned: darin.moz)
Details
(Keywords: memory-leak, Whiteboard: r=wtc, sr=dougt)
Attachments
(3 files)
1. patch, 600 bytes
2. patch, 1.72 KB
3. patch, 1.01 KB
Description
browse to www.cnn.com, click on a link and then go back. fire up netstat -tcpd and wait a bit... notice all the many sockets in the CLOSE_WAIT state. now go to some other site, like mozilla.org. notice that the CLOSE_WAIT sockets have not disappeared. this is bad!
Updated•23 years ago (Assignee)
Severity: normal → major
Status: NEW → ASSIGNED
Keywords: mlk
Priority: -- → P1
Target Milestone: --- → mozilla0.9.4
Comment 1•23 years ago
Is this really a leak? We now do keep-alives to lots of hosts we didn't before, including cnn.com. Our current default timeout is 300 seconds, so you will have to wait 5 minutes for these sockets to be reused. Or are you seeing this persist for more than 5 minutes? See bug 91204 for the timeout issue.
Comment 2•23 years ago (Assignee)
our IsAlive test includes both testing that the connection has not idled for more than the allotted time, and it also calls PR_Read(fd, buf, 0) to check that the socket has not been closed. i've always found this test to be sufficient at detecting the CLOSE_WAIT state on linux and windows. so, i believe that for some reason, the IsAlive test is not being run.
Comment 3•23 years ago
But with HTTP/1.1 connections, the server can keep the connection open for as long as it wants. On cnn.com, that appears to be 30 seconds. Does this still happen if you wait a minute or so?
Comment 4•23 years ago (Assignee)
if the socket is in the CLOSE_WAIT state, then that implies that the connection has already been closed on the server end. still, i'll wait longer and see if it ever goes away, just to be sure.
Comment 5•23 years ago
I see this on an 08-12 nightly. Not all sockets get stuck but, for those that do, the last thing that happens is the server sends a FIN packet, it is ACK'ed but nothing else follows. Mozilla should send a FIN but doesn't. I waited 20 minutes and the stuck sockets were still there. It's not CNN either. I see it on any number of sites.
Comment 6•23 years ago (Assignee)
OK, i've verified that this is a problem with our IsAlive test, and it seems to be showing up on winnt as well (changing OS to ALL). That is, until i revisit a server with a connection in the CLOSE_WAIT state, the CLOSE_WAIT socket doesn't disappear. So, it seems that it is not until we try to actually write out a request that we detect the closed socket.
OS: Linux → All
Comment 7•23 years ago (Assignee)
after further investigation, PR_Read(fd, &c, 0) on one of these CLOSE_WAIT sockets returns PR_WOULD_BLOCK_ERROR. from nsSocketTransport.cpp:1716, notice that we do not treat this error as an indication that the socket is closed.
Comment 8•23 years ago (Assignee)
after further investigation, i'm not convinced that there is any portable reliable way to detect a socket in the CLOSE_WAIT state. moreover, it really isn't necessary that we can detect CLOSE_WAIT. so, i think the right solution to this "problem" is to reduce our default idle timeout. however, i wonder if we shouldn't be using a different idle timeout when talking to a proxy. bbaetz: what do you think?
Comment 9•23 years ago (Assignee)
Comment 10•23 years ago
This came up in bug 91204. You want longer with a proxy, because you're much more likely to talk to it again than a normal server. We should wait on that until my bug 89500 cleanup has gone in - it will make it a bit easier. Does the NSPR landing yesterday give us any way to detect a closed socket? wtc, do you have any suggestions? I think that that timeout is too high for a server - we should take that up in bug 91204.
Comment 11•23 years ago (Assignee)
agreed.. bug 91204 is probably a better place for my patch to all.js.
Comment 12•23 years ago (Assignee)
improving summary
Summary: leaking sockets → leaking sockets; nsISocketTransport::IsAlive not implemented reliably
Comment 13•23 years ago (Assignee)
shaver suggested this algorithm (supposed to work on unix and win32):

    PR_Bool IsSocketDead(int fd)
    {
        struct msghdr msg;
        return recvmgr(fd, &msg, MSG_PEEK) == -1;
    }

i have yet to verify it though.
Comment 14•23 years ago (Assignee)
s/recvmgr/recvmsg/
Comment 15•23 years ago
The only reliable way to test for a closed connection is to read from it. If read returns 0, the peer has closed the connection (at least for writing). You can probably try getpeername(), but I am not sure if it works reliably.
Comment 16•23 years ago (Assignee)
wtc: so, peeking at the data does not equate to reading? b/c now that we have NSPR 4.2, i was thinking about trying PR_Recv with the PR_MSG_PEEK flag set.
Comment 17•23 years ago
Suppose your peer sends you 10 bytes of data and closes the connection. If you call recv(MSG_PEEK) for 5 bytes, you will get 5 bytes. If you do that again, you will still get 5 bytes. etc. etc. You will never discover that the connection is closed by the peer this way. On the other hand, if you read 5 bytes each time, the third read will return 0, and you know the peer closed the connection (at least for writing).
Comment 18•23 years ago (Assignee)
i'm only concerned with sockets that should not have any unread data, so PR_Recv(MSG_PEEK) should be the way to go.
Comment 19•23 years ago (Assignee)
Comment 20•23 years ago
wtc: is the flag parameter valid now? http://lxr.mozilla.org/seamonkey/source/nsprpub/pr/include/prio.h#1360
Comment 21•23 years ago (Assignee)
looks like LXR is not showing NSPR 4.2 b/c in 4.2 it is valid, and PR_MSG_PEEK is defined.
Comment 22•23 years ago
The flags parameter is valid for PR_Recv. We haven't updated the documentation. Darin is right, LXR is still showing the old NSPR. This will be fixed soon. The only value you can use for the flags parameter is 0 or PR_MSG_PEEK.
Comment 23•23 years ago
does "is alive" mean the same as "is there data on the socket"?
Comment 24•23 years ago (Assignee)
the "is alive" test is meant to check the socket connection status. for example, if you run "netstat -tcpd" you'll see a list of all TCP/IP sockets along with each socket's current state. we would like to be able to detect sockets in the CLOSE_WAIT state (meaning sockets that have already been closed [for writing] on the server side). this helps us throw out these connections before we actually try using them. the benefits are twofold: 1) helps remove deadwood -- servers can't cleanup a socket connection until we also close the connection, and 2) helps avoid having to restart HTTP requests in the event that the connection is already closed. the side effect of not being able to detect a socket in the CLOSE_WAIT state is not sooo bad... since we also limit the idle time per socket, but IMO fixing this is good as it will help make us a more friendly player on the internet.
Comment 25•23 years ago (Assignee)
wtc, dougt: is this patch ok? if so, can i get an r/sr= from you guys? thx!
Comment 26•23 years ago
I am not qualified to review this patch. I can only say the original code is no good because it depends on the undocumented behavior of reading zero bytes. You wrote:

> i'm only concerned with sockets that
> should not have any unread data

If this is true, you should be able to say:

    char c;
    PRInt32 rval = PR_Read(mSocketFD, &c, 1);
    PR_ASSERT(rval <= 0);
    if (rval > 0) {
        /* read unexpected data */
        /* display error message */
    }
Comment 27•23 years ago (Assignee)
wtc: right, but even though the socket "should not" have any unread data when isAlive is called, there is unfortunately no guarantee that client code won't call isAlive when it really would be meaningless to do so. thus, the PR_Recv(MSG_PEEK) solution seems very nice.
Comment 28•23 years ago
r=wtc.
Comment 29•23 years ago
sr=dougt
Updated•23 years ago (Assignee)
Whiteboard: r=wtc, sr=dougt
Comment 30•23 years ago
I have debugged this IsAlive thing too. I noticed that when proxy connections timed out, netstat shows that there is one byte in the receive queue, so PR_Read always thinks that the connection is ok. I hacked IsAlive to use PR_Poll to see that the socket is ok, e.g. there shouldn't be any bytes available on the socket and no errors, so it could be reused again. I'll attach my poll hack.
Comment 31•23 years ago
Comment 32•23 years ago (Assignee)
tomi, the one byte in the Recv-Q is expected when the socket moves to the CLOSE_WAIT state. can you try out my v1.1 patch to verify that it works for you? in trying to fix this, i also messed around with PR_Poll, but with your patch you'll get false negatives when there really is data to be read. not that someone should be calling isAlive in that case, but they might.
Comment 33•23 years ago (Assignee)
fixed-on-trunk
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Whiteboard: r=wtc, sr=dougt → r=wtc, sr=dougt, fixed-on-trunk
Comment 34•23 years ago
Yes, v1.1 seems to work ok for me (linux). There is still another problem: socket cleanup is only done when opening new sockets. Maybe a timer-based cleanup routine?
Comment 35•23 years ago (Assignee)
yeah, but that's another bug ;-)
Comment 36•23 years ago
verified: sockets in the CLOSE_WAIT state are now being removed appropriately when going to other sites. Fixed before branch. Linux rh6 2001-09-24-04-0.9.4, Win NT4 2001-09-24-05-0.9.4
Status: RESOLVED → VERIFIED
Whiteboard: r=wtc, sr=dougt, fixed-on-trunk → r=wtc, sr=dougt