Closed Bug 43567 Opened 25 years ago Closed 23 years ago

FTP - keeping connections alive forever

Categories

(Core Graveyard :: Networking: FTP, defect, P3)

defect

Tracking

(Not tracked)

VERIFIED FIXED
mozilla0.9.7

People

(Reporter: waterson, Assigned: bbaetz)

References

Details

(Keywords: testcase)

Attachments

(4 files, 4 obsolete files)

I may have prematurely closed one of the other socket transport bugs. Oh well, here's another one. Turns out we still leak - a socket transport (three refs, two from pipes, below) - the socket transport service (one ref) - two nsPipes (two refs each) I think that there's a circularity here. I'll report on what I find...
Keywords: mlk
From the refcount balancer tree of the first nsPipe object, it's pretty clear that the two references are being dropped by nsSocketTransport::OpenOutputStream(), presumably because the socket transport is taking ownership of the pipe.
...so where's that last reference being lost from? The nsSocketTransportService maybe?
Aha. It looks like maybe the nsFtpConnectionThread is bound up in a cyclic reference with the nsSocketTransportService? Do we need some kind of out-of-band notification to break the cycle here?
I added a second FTP URL to the "bloaturls.txt" file, and lo, a second socket transport was leaked, along with two more pipes.
Jud: This fix seems to eliminate one more addref for the nsSocketTransport object that's leaking. We believe that it holds on to the 2 pipes and nsSocketTransportService. Canceling the "pipe" channels is necessary because they return to the work queue in the socket transport service because they just return WOULD_BLOCK. However, we don't think that the nsFTPReleaseEvents are actually releasing what they think they're releasing, particularly mConnCache. You should check into this. We added some MOZ_COUNT_CTOR/DTOR macros for the ftp events and nsConnectionCacheObj, and determined that mConn is also leaking, but that may be due to the mConnCache problem. Index: nsFtpConnectionThread.cpp =================================================================== RCS file: /cvsroot/mozilla/netwerk/protocol/ftp/src/nsFtpConnectionThread.cpp,v retrieving revision 1.121 diff -c -r1.121 nsFtpConnectionThread.cpp *** nsFtpConnectionThread.cpp 2000/06/17 01:38:36 1.121 --- nsFtpConnectionThread.cpp 2000/06/23 06:32:58 *************** *** 1809,1816 **** } // Release the transports ! mCPipe = 0; ! mDPipe = 0; mIPv6Checked = PR_FALSE; if (mIPv6ServerAddress) { nsMemory::Free(mIPv6ServerAddress); --- 1809,1822 ---- } // Release the transports ! if (mCPipe) { ! mCPipe->Cancel(NS_BINDING_ABORTED); ! mCPipe = 0; ! } ! if (mDPipe) { ! mDPipe->Cancel(NS_BINDING_ABORTED); ! mDPipe = 0; ! } mIPv6Checked = PR_FALSE; if (mIPv6ServerAddress) { nsMemory::Free(mIPv6ServerAddress);
potts and I went back and fourth on the pipe cancellation stuff (note the pipe cancels just above that which are commented out). I'm not sure why returning to the workQ is helping? Sounds like the socket transport has some cleanup problems. I'm worried about cancelling because a cancel will cause an OnStop() to fire w/ an abort code. This *might* cancel the data channel before all the data has been pumped. Yea, this is a bug in the socket transport.
We were seeing 19 addref's but only 18 release's through nsSocketTransport::ProcessWorkQ(). http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsSocketTransportServic e.cpp#206 This led us to believe that the last trip through was getting stuck here: http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsSocketTransportServic e.cpp#250 In other words, the transport was returning EWOULDBLOCK rather than closing.
Oops... we had another diff in the socket transport service for cleaning out the work queue on shutdown. Chris, can you post that one too so Rick can review it?
rpotts: below is the diff. it's a bit sloppy (because we'll still leak the PR_CLIST stuff), but you get the idea.... Index: nsSocketTransportService.cpp =================================================================== RCS file: /cvsroot/mozilla/netwerk/base/src/nsSocketTransportService.cpp,v retrieving revision 1.42 diff -u -r1.42 nsSocketTransportService.cpp --- nsSocketTransportService.cpp 2000/06/23 02:02:03 1.42 +++ nsSocketTransportService.cpp 2000/06/23 21:29:09 @@ -619,7 +619,10 @@ } else { rv = NS_ERROR_FAILURE; } - + + for (int i = 0; i < mSelectFDSetCount; i++) { + NS_IF_RELEASE(mActiveTransportList[i]); + } return rv; }
hey chris, the patch looks fine to me... Do you think that before we release each transport, we should cancel it? -- rick
Well, I don't know...one thing that puzzled warren and I last night was "how come HTTP doesn't leak"? Is there some shutdown sequence that FTP is just not doing right? Does FTP keep a connection open to the server forever? When is that connection supposed to go away?
Yes, maybe the destructors should always cancel, just to be safe (like closing a file object in its destructor).
ftp keeps a transport cached, per server, in the ftpprotocolhandler, indefinately.
So the nsFTPProtocolHandler *is* going away: maybe it's not properly shutting down the socket transports?
hey jud, if the ftpprotocolhandler keeps a transport cached per server, won't we eventually run out of socket transports if we visit ~8 different FTP sites? -- rick
Good question. I just visited a dozen different sites and everything was fine. There are no cache cleanup smarts in FTP, so I suspect we just kept building socket transports and caching them (maybe the limit is higher, or we grow it). How does HTTP do cleanup? ftp server timeouts are typically longer than HTTP (for the command channel).
We need to fix the thread-pool bug 36750 before we can up the limit on transport threads. Cc'ing dougt.
The leaked nsSocketTransportService leaks an nsStringBundle too.
Keywords: nsbeta3
approving for nsbeta3
Assignee: gagan → ruslan
approving this time (didn't make it last time!) sorry.
Whiteboard: [tind-mlk] → [tind-mlk][nsbeta3+]
nsSocketTransport/nsResChannel-related leak is gone. As to FTP - it does seem to be keeping connections alive forever. May be we should set up a timeout like http does? I'm also not sure that recovery logic works correctly in case the remote server drops the connection.
Status: NEW → ASSIGNED
robert-- welcome to necko!
Assignee: ruslan → rjc
Status: ASSIGNED → NEW
Changing summary. As the leaks are detailed in bug # 51937, this bug can be strictly about FTP keeping connections alive. In terms of recovery logic, I used Mozilla to FTP into one of my Linux machines, then telnet'ed to the same machine (out-of-band) and killed off the FTP process. Finally, back in Mozilla I reloaded the FTP URL in question and Mozilla recovered fine.
Status: NEW → ASSIGNED
Summary: FTP causes nsSocketTransport and two nsPipe objects to leak → FTP - keeping connections alive forever
Per PDT rules P3 bugs are now nsbeta3-
Whiteboard: [tind-mlk][nsbeta3+] → [tind-mlk][nsbeta3-]
Target Milestone: --- → Future
Hi dougt, welcome to necko :)
Assignee: rjc → dougt
Status: ASSIGNED → NEW
Blocks: 62356
mass move, v2. qa to me.
QA Contact: tever → benc
Keywords: mozilla1.0
We need to fix this soon.
Target Milestone: Future → mozilla0.9.6
Blocks: 92580
to bradley
Assignee: dougt → bbaetz
Component: Networking → Networking: FTP
No longer blocks: 92580
This is no longer an mlk bug, according to comments. I'll look at this.
Status: NEW → ASSIGNED
Keywords: mlk
Whiteboard: [tind-mlk][nsbeta3-] → [nsbeta3-]
OK, I'm attaching a patch. We use one timer per stored host, and the default timeout is 5 minutes, and the pref is network.ftp.idleConnectionTimeout. I shouldn't be leaking by being a prefobserver, because I don't hold onto a pref branch - there's no need, since its given to me in the Observe callback. OTOH, by that same logic I'm not sure why http needs to be an nsIWeakReference, since it doesn't appear to hold onto a pref branch either. darin? BTW, why isn't the nsITimer callback function declared as PR_CALLBACK in the idl? darin, dougt: r/sr, please?
Attached patch patch (obsolete) — Splinter Review
Oh, and while I'm at it, why is the hashtable being created as threadsafe? Is this just because its being created on a different thread than we're using it on, and if so can we just delay creation until the first time we need it? My callback code isn't threadsafe, and I don't think it needs to be.
Comment on attachment 55014 [details] [diff] [review] patch This leaks from the hashtable on shutdown. I'll do a new patch tomorrow.
Attachment #55014 - Flags: needs-work+
Attachment #55014 - Attachment is obsolete: true
Attachment #55394 - Attachment is obsolete: true
Attached patch new patch (obsolete) — Splinter Review
OK, I've attached a new patch, which uses a lock to get access to the connection array. We need this because the ftp connection thread can call ::insertconnection after the protocol handler shuts down, because they're on different threads. This wasn't a problem with the threadsafe supportshashtable, but now I have cleanup to do.
do we really want necko to depend on widget/timer... perhaps we should consider a timer generated from the socket transport poll loop??
darin: I don't follow. I can't really use the strategy http currently uses because we don't make enough ftp connections for it to be worthwhile. I really don't think that its a problem for us to depend on timers. We do currently, although thats only for tests. timer is a separate module, despite where it is in the tree.
i understand why using the timer module makes this code simpler, but it does add a new dependency for necko. because of this, i think you should ensure that FTP still works (albeit with a reduced feature set) if the timer service is not present... it should not be a fatal error. my point about the socket transport is that it would be possible to generate a low frequency timer using the socket transport poll loop. we could easily make the poll call timeout every second and then process timer events off of that. this would be sufficient for necko's internal needs. but, since what i'm talking about doesn't exist, i don't want it to hold up this patch. ... it's just a possibility that we might consider for the future.
Currently it fails if there is no timer. I guess I could change that, though, and keep the entries forever in that case. A 1 sec timer loop would be cpu intensive though, and we'd just be simulating nsitimer anyway. However, embedding/browser depends on timer, as does libpr0n and the widget code (network/test also). In other words, its pretty much a requirement to do anything with the browser. alecf: Do you think that having necko depend on timer is a problem?
Oh, and libpref already depends on timers, and necko already depends on libpref, so I really don't think its an issue. It only uses it for the pref autoconfig stuff, though.
Our tree structure may obfuscate the dependencies, but it seems clear that timers are, or could be (pav's thread-based timer reimplementation) more primitive than networking. Partial orders: timer < necko, timer < widget, necko < widget -- total order: timer < necko < widget. bbaetz: if ~nsFtpProtocolHandler can run on a thread while another thread is calling InsertConnection or RemoveConnection, isn't there a bogus weak reference to the handler on the latter thread? The null tests of mLock and the lock-free clearing of it, followed by the 'if (tmp && mRootConnectionList) {...}' block in the dtor, do not appear thread-safe to me. /be
brendan: right. Simulating times via a select twith a timeout on the socket transport thread just seems yuck. insert/removeConnection are static functions, as is mLock, and the connection list, so there is no bogus reference to the protocol handler instance from the other thread. I spent quite a bit of time peering at the new code, and couldn't come up with a case where things broke. None of the functions need to be reentrant, so thats not an issue. Actually, looking again, you're right, if the second thread tests mLock before its set to null in the destructor, then we could crash. Hmm. We don't have an atomic test and set op, do we? I suppose I need to use a semaphone, then. Any better suggestions? (BTW, I noticed that we start up, then destroy the ftp protcol handler before starting it up again to do any work. That may just be a byproduct of the test program, though - I'll look into it)
I know the members are static, but who calls the dtor? It would be best if you could ensure that the protocol handler isn't destroyed until the other threads are joined somehow. That would relieve the dtor from having to do any locking. /be
Its destroyed from the ui thread, probably in the ioservice's destructor (the ioservice caches protocol handlers) However, darin's just told me that despite having a file called nsFtpConnectionThread.cpp, and the object proxying we do in nsFtpChannel, and the comment arround line 150, dougt removed the threading stuff a while back. Grr. I wondered why I couldn't find where we initied this thread... In that case, my second patch (http://bugzilla.mozilla.org/attachment.cgi?id=55394&action=view, the one w/o the locks) should be OK. I think I was thrown by seeing calls to ::insertConnection in the debug log after seeing the ftp destructor being called. I'm going to file a separate bug to clean this all up, I think. Most of these locks are unneeded.
Attachment #55403 - Attachment is obsolete: true
Attachment #55394 - Attachment is obsolete: false
currently, PR_Poll times out every 35 seconds. why is it bad to imagine pruning dead (or old) connections every time PR_Poll times out? this could very well be a feature of the socket transport service. and owners of a socket transport could be notified when the socket closes due to inactivity. seems like all network protocol handlers would benefit from this. i don't think this general solution needs to go in before this bug can be fixed, but i do think that it might be the right solution down the road. we could of course adjust how often PR_Poll times out... 35 seconds is a rather arbitray time out value.
Hmm. Well, we could check for dead connections every 35 seconds. But the definition of dead depends on the protocol (We may want to wait more than 35 seconds for teh server to respond), and I can imagine a protocol wanting to do cleanup before we quit, eg by sending a QUIT command, or something.
I personally don't see a problem with necko depending on timer, and I'm actually surprised we don't already.. timer is a pretty low-level library - it has few dependencies of its own. It also seems better than trying to cobble together a psuedo timer out of PR_Poll, not to mention you'd loose some of the useful aspects of the timer such as low-priority events (this seems like such an event)
hmm... ok, i'm beginning to see the light. anyways, we're only talking about adding a timer to FTP (and eventually HTTP) since those are the only protocols that currently cache connected sockets. the work required to make the socket transport service support such a feature is probably not worth the effort... besides, it currently only knows about sockets with active read or write requests (which doesn't include idle sockets).
Yeah, it would be nice to have a generic service, but if you ignore the changes to add a pref for the timeout, and to move from an array to a hashtable, there isn't much code, really. So can I get a review on http://bugzilla.mozilla.org/attachment.cgi?id=55394, then, please?
Comment on attachment 55394 [details] [diff] [review] new patch, using an array instead of a hashtable >Index: protocol/ftp/src/nsFtpProtocolHandler.cpp >+// Stuff for the timer callback function >+struct timerStruct { >+ nsCOMPtr<nsITimer> timer; >+ nsCOMPtr<nsISupports> conn; >+ char* key; >+ >+ timerStruct() : key(nsnull) {}; >+ >+ ~timerStruct() { >+ if (timer) >+ timer->Cancel(); >+ if (key) >+ nsMemory::Free(key); >+ } >+}; down below key is allocated using nsCRT::strdup, but here you free it using nsMemory::Free. it's probably not a good idea to mix these. looks like there is an extra semicolon at the end of timerStruct() how about making timerStruct an inner class of nsFtpProtocolHandler to reduce namespace clutter? >-nsSupportsHashtable* nsFtpProtocolHandler::mRootConnectionList = nsnull; >+nsVoidArray* nsFtpProtocolHandler::mRootConnectionList = nsnull; >+PRInt32 nsFtpProtocolHandler::mIdleTimeout = -1; why aren't these plain old member variables? if they are to be static/global, then why not prefix them with 'g' or 's'? > //////////////////////////////////////////////////////////////////////////////// > nsFtpProtocolHandler::nsFtpProtocolHandler() { >- NS_INIT_REFCNT(); >+ NS_INIT_REFCNT(); > } NS_INIT_ISUPPORTS supercedes NS_INIT_REFCNT, right? >+ >+ if (mIdleTimeout == -1) { >+ nsCOMPtr<nsIPrefService> prefSrv = do_GetService(NS_PREFSERVICE_CONTRACTID, &rv); >+ if (NS_FAILED(rv)) return rv; >+ nsCOMPtr<nsIPrefBranch> branch; >+ rv = prefSrv->GetBranch("network.ftp.", getter_AddRefs(branch)); >+ if (NS_FAILED(rv)) return rv; >+ rv = branch->GetIntPref(IDLE_TIMEOUT_PREF, &mIdleTimeout); >+ if (NS_FAILED(rv)) >+ mIdleTimeout = 5*60; // 5 minute default >+ >+ nsCOMPtr<nsIPrefBranchInternal> pbi = do_QueryInterface(branch); >+ pbi->AddObserver(IDLE_TIMEOUT_PREF, this, PR_TRUE); does this work correctly? i thought that AddObsever required the full path to the pref. >+void >+nsFtpProtocolHandler::Timeout(nsITimer *aTimer, void *aClosure) >+{ >+ PR_LOG(gFTPLog, PR_LOG_DEBUG, ("Timeout reached for %0x\n", aClosure)); >+ could mRootConnectionList be null at this point? could the timeout fire after shutting down the FTP protocol handler? >+ PRBool found = mRootConnectionList->RemoveElement(aClosure); > nsresult >-nsFtpProtocolHandler::RemoveConnection(nsIURI *aKey, nsISupports* *_retval) { >+nsFtpProtocolHandler::RemoveConnection(nsIURI *aKey, nsISupports* *_retval) >+{ NIT ALERT: are you changing the brace style uniformly in this file?
> down below key is allocated using nsCRT::strdup, but here you free it using > nsMemory::Free. it's probably not a good idea to mix these. It all calls libc's free, which is one of my nits, but that was an accident. Will fix. > how about making timerStruct an inner class of nsFtpProtocolHandler to reduce > namespace clutter? OK. > >-nsSupportsHashtable* nsFtpProtocolHandler::mRootConnectionList = nsnull; > >+nsVoidArray* nsFtpProtocolHandler::mRootConnectionList = nsnull; > >+PRInt32 nsFtpProtocolHandler::mIdleTimeout = -1; > > why aren't these plain old member variables? if they are to be static/global,> then why not prefix them with 'g' or 's'? Because ::Insert/RemoveConnection are static functions. I plan on cleaning this up as part of my "remove useless locks and comification" patch. They're static members of the class to avoid namespace leakage, I guess. > NS_INIT_ISUPPORTS supercedes NS_INIT_REFCNT, right? Yeah - I'd added some code to the constructor, then removed it, and fixed up the indenting at the same time. > does this work correctly? i thought that AddObsever required the full path to the pref. Well, if I set it in my profile it loads the pref correctly, and thats usually a sign of it working. The docs imply that it can - maybe the necko2 dll, unlike the necko dll, is being loaded late enough that this isn't mattering. I'm adding an observer to the particular pref branch, though. Looking at the code, you may be right. I'll put some printfs in, and ask arround. > could mRootConnectionList be null at this point? could the timeout fire > after shutting down the FTP protocol handler? No. The timers will fire on the UI thread, and I cancel the timers as part of destructing the timer struct from the protocol handler destructor. (This is part of why the uneeded locking was a pain) > NIT ALERT: are you changing the brace style uniformly in this file? > Not really. That was a sideeffect of initially having the nsISupports be an nsFtpControlConnection, but then I changed it back. I'll probably change it back again later, though, when I clean up the locking.
you should be fine adding the relative pref name to the pref branch.. all methods off of the pref branch should be acting on relative pref names..
bbaetz: how about just adding a NS_ASSERTION(mRootConnectionList, "wtf") or something like that to your timeout handler?
I'll attach a new patch tomorrow. darin: will do alecf: Nope, as discussed on irc. See bug 107617. Any observer set on a not-root pref branch doesn't appear to work. I'll have to rewrite my code, I think.
I didn't get to this. ->0.9.7
Target Milestone: mozilla0.9.6 → mozilla0.9.7
Attached patch updated patch (obsolete) — Splinter Review
The reason noone reviewed this would probably because I never attached the updated patch. Oops
Attachment #55394 - Attachment is obsolete: true
Comment on attachment 58616 [details] [diff] [review] updated patch sr=darin
Attachment #58616 - Flags: superreview+
Attached patch worksSplinter Review
That one crashed, because of the scoping of ts in RemoveConnection. Not sure why it didn't crash until today, then did so reliably. Anyway, fixed.
Attachment #58616 - Attachment is obsolete: true
Comment on attachment 60431 [details] [diff] [review] works >+ struct timerStruct { >+ ~timerStruct() { >+ if (timer) >+ timer->Cancel(); >+ if (key) >+ nsCRT::free(key); CRTFREEIF(key) ? otherwise sr=darin
Attachment #60431 - Flags: superreview+
Comment on attachment 60431 [details] [diff] [review] works nit- I would change the for loops to use the post decrement operator. Nit: In your Observer(), could you do a strcmp on the aTopic. R= with or without my nits fixed. Looks really good.
Attachment #60431 - Flags: review+
Preincrement is faster, becauset the compiler doesn't need a temporary. (Not that it makes a differnce on POD types, I suspect) I'll add the strcmp
perhaps the strcmp on aTopic could be inside a debug assertion?
Checked in (with the extra assertion)
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Keywords: testcase
Whiteboard: [nsbeta3-] → checklinux
VERIFIED: 2002-05-13 commercial 1.0.1, Linux
Status: RESOLVED → VERIFIED
Whiteboard: checklinux
*** Bug 123129 has been marked as a duplicate of this bug. ***
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: