Closed Bug 43567 Opened 24 years ago Closed 23 years ago

FTP - keeping connections alive forever

Categories

(Core Graveyard :: Networking: FTP, defect, P3)

defect

Tracking

(Not tracked)

VERIFIED FIXED
mozilla0.9.7

People

(Reporter: waterson, Assigned: bbaetz)

References

Details

(Keywords: testcase)

Attachments

(4 files, 4 obsolete files)

I may have prematurely closed one of the other socket transport bugs. Oh well,
here's another one.

Turns out we still leak

- a socket transport (three refs, two from pipes, below)
- the socket transport service (one ref)
- two nsPipes (two refs each)

I think that there's a circularity here. I'll report on what I find...
Keywords: mlk
From the refcount balancer tree of the first nsPipe object, it's pretty clear
that the two references are being dropped by
nsSocketTransport::OpenOutputStream(), presumably because the socket transport
is taking ownership of the pipe.
...so where's that last reference being lost from? The nsSocketTransportService
maybe?
Aha. It looks like maybe the nsFtpConnectionThread is bound up in a cyclic
reference with the nsSocketTransportService? Do we need some kind of out-of-band
notification to break the cycle here?
I added a second FTP URL to the "bloaturls.txt" file, and lo, a second socket
transport was leaked, along with two more pipes.
Jud: This fix seems to eliminate one more addref for the nsSocketTransport
object that's leaking. We believe that it holds on to the 2 pipes and
nsSocketTransportService. Canceling the "pipe" channels is necessary because
they return to the work queue in the socket transport service because they just
return WOULD_BLOCK. 

However, we don't think that the nsFTPReleaseEvents are actually releasing what
they think they're releasing, particularly mConnCache. You should check into
this.

We added some MOZ_COUNT_CTOR/DTOR macros for the ftp events and
nsConnectionCacheObj, and determined that mConn is also leaking, but that may be
due to the mConnCache problem.

Index: nsFtpConnectionThread.cpp
===================================================================
RCS file: /cvsroot/mozilla/netwerk/protocol/ftp/src/nsFtpConnectionThread.cpp,v
retrieving revision 1.121
diff -c -r1.121 nsFtpConnectionThread.cpp
*** nsFtpConnectionThread.cpp	2000/06/17 01:38:36	1.121
--- nsFtpConnectionThread.cpp	2000/06/23 06:32:58
***************
*** 1809,1816 ****
      }
  
      // Release the transports
!     mCPipe = 0;
!     mDPipe = 0;
      mIPv6Checked = PR_FALSE;
      if (mIPv6ServerAddress) {
          nsMemory::Free(mIPv6ServerAddress);
--- 1809,1822 ----
      }
  
      // Release the transports
!     if (mCPipe) {
!         mCPipe->Cancel(NS_BINDING_ABORTED);
!         mCPipe = 0;
!     }
!     if (mDPipe) {
!         mDPipe->Cancel(NS_BINDING_ABORTED);
!         mDPipe = 0;
!     }
      mIPv6Checked = PR_FALSE;
      if (mIPv6ServerAddress) {
          nsMemory::Free(mIPv6ServerAddress);
potts and I went back and fourth on the pipe cancellation stuff (note the pipe 
cancels just above that which are commented out). I'm not sure why returning to 
the workQ is helping? Sounds like the socket transport has some cleanup 
problems. I'm worried about cancelling because a cancel will cause an OnStop() 
to fire w/ an abort code. This *might* cancel the data channel before all the 
data has been pumped. Yea, this is a bug in the socket transport.
We were seeing 19 addref's but only 18 release's through 
nsSocketTransport::ProcessWorkQ().

http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsSocketTransportServic
e.cpp#206

This led us to believe that the last trip through was getting stuck here:

http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsSocketTransportServic
e.cpp#250

In other words, the transport was returning EWOULDBLOCK rather than closing. 
Oops... we had another diff in the socket transport service for cleaning out the 
work queue on shutdown. Chris, can you post that one too so Rick can review it?
rpotts: below is the diff. it's a bit sloppy (because we'll still leak the 
PR_CLIST stuff), but you get the idea....

Index: nsSocketTransportService.cpp
===================================================================
RCS file: /cvsroot/mozilla/netwerk/base/src/nsSocketTransportService.cpp,v
retrieving revision 1.42
diff -u -r1.42 nsSocketTransportService.cpp
--- nsSocketTransportService.cpp        2000/06/23 02:02:03     1.42
+++ nsSocketTransportService.cpp        2000/06/23 21:29:09
@@ -619,7 +619,10 @@
   } else {
     rv = NS_ERROR_FAILURE;
   }
-
+  
+  for (int i = 0; i < mSelectFDSetCount; i++) {
+    NS_IF_RELEASE(mActiveTransportList[i]);
+  }
   return rv;
 }
hey chris,

the patch looks fine to me...

Do you think that before we release each transport, we should cancel it?

-- rick
Well, I don't know...one thing that puzzled warren and I last night was "how 
come HTTP doesn't leak"? Is there some shutdown sequence that FTP is just not 
doing right? Does FTP keep a connection open to the server forever? When is that 
connection supposed to go away?
Yes, maybe the destructors should always cancel, just to be safe (like closing a 
file object in its destructor).
ftp keeps a transport cached, per server, in the ftpprotocolhandler, 
indefinately.
So the nsFTPProtocolHandler *is* going away: maybe it's not properly shutting 
down the socket transports?
hey jud,

if the ftpprotocolhandler keeps a transport cached per server, won't we 
eventually run out of socket transports if we visit ~8 different FTP sites?

-- rick
Good question. I just visited a dozen different sites and everything was fine. 
There are no cache cleanup smarts in FTP, so I suspect we just kept building 
socket transports and caching them (maybe the limit is higher, or we grow it). 
How does HTTP do cleanup?

ftp server timeouts are typically longer than HTTP (for the command channel).
We need to fix the thread-pool bug 36750 before we can up the limit on transport 
threads. Cc'ing dougt.
The leaked nsSocketTransportService leaks an nsStringBundle too.
Keywords: nsbeta3
approving for nsbeta3
Assignee: gagan → ruslan
approving this time (didn't make it last time!) sorry.
Whiteboard: [tind-mlk] → [tind-mlk][nsbeta3+]
nsSocketTransport/nsResChannel-related leak is gone. As to FTP - it does seem to 
be keeping connections alive forever. May be we should set up a timeout like 
http does? I'm also not sure that recovery logic works correctly in case the 
remote server drops the connection.
Status: NEW → ASSIGNED
robert-- welcome to necko!
Assignee: ruslan → rjc
Status: ASSIGNED → NEW
Changing summary.  As the leaks are detailed in bug # 51937, this bug can be
strictly about FTP keeping connections alive.

In terms of recovery logic, I used Mozilla to FTP into one of my Linux machines,
then telnet'ed to the same machine (out-of-band) and killed off the FTP process.
Finally, back in Mozilla I reloaded the FTP URL in question and Mozilla
recovered fine.
Status: NEW → ASSIGNED
Summary: FTP causes nsSocketTransport and two nsPipe objects to leak → FTP - keeping connections alive forever
Per PDT rules P3 bugs are now nsbeta3-
Whiteboard: [tind-mlk][nsbeta3+] → [tind-mlk][nsbeta3-]
Target Milestone: --- → Future
Hi dougt, welcome to necko :)
Assignee: rjc → dougt
Status: ASSIGNED → NEW
Blocks: 62356
mass move, v2.
qa to me.
QA Contact: tever → benc
Keywords: mozilla1.0
We need to fix this soon.  
Target Milestone: Future → mozilla0.9.6
Blocks: 92580
to bradley
Assignee: dougt → bbaetz
Component: Networking → Networking: FTP
No longer blocks: 92580
This is no longer an mlk bug, according to comments.

I'll look at this.
Status: NEW → ASSIGNED
Keywords: mlk
Whiteboard: [tind-mlk][nsbeta3-] → [nsbeta3-]
OK, I'm attaching a patch. We use one timer per stored host, and the default
timeout is 5 minutes, and the pref is network.ftp.idleConnectionTimeout.

I shouldn't be leaking by being a prefobserver, because I don't hold onto a pref
branch - there's no need, since its given to me in the Observe callback. OTOH,
by that same logic I'm not sure why http needs to be an nsIWeakReference, since
it doesn't appear to hold onto a pref branch either. darin?

BTW, why isn't the nsITimer callback function declared as PR_CALLBACK in the idl?

darin, dougt: r/sr, please?
Attached patch patch (obsolete) — Splinter Review
Oh, and while I'm at it, why is the hashtable being created as threadsafe? Is
this just because its being created on a different thread than we're using it
on, and if so can we just delay creation until the first time we need it?

My callback code isn't threadsafe, and I don't think it needs to be.
Comment on attachment 55014 [details] [diff] [review]
patch

This leaks from the hashtable on shutdown. I'll do a new patch tomorrow.
Attachment #55014 - Flags: needs-work+
Attachment #55014 - Attachment is obsolete: true
Attachment #55394 - Attachment is obsolete: true
Attached patch new patch (obsolete) — Splinter Review
OK, I've attached a new patch, which uses a lock to get access to the connection
array. We need this because the ftp connection thread can call
::insertconnection after the protocol handler shuts down, because they're on
different threads. This wasn't a problem with the threadsafe supportshashtable,
but now I have cleanup to do.
do we really want necko to depend on widget/timer... perhaps we should consider
a timer generated from the socket transport poll loop??
darin: I don't follow. I can't really use the strategy http currently uses
because we don't make enough ftp connections for it to be worthwhile. I really
don't think that its a problem for us to depend on timers. We do currently,
although thats only for tests.

timer is a separate module, despite where it is in the tree.
i understand why using the timer module makes this code simpler, but it does add
a new dependency for necko.  because of this, i think you should ensure that FTP
still works (albeit with a reduced feature set) if the timer service is not
present... it should not be a fatal error.

my point about the socket transport is that it would be possible to generate a
low frequency timer using the socket transport poll loop.  we could easily make
the poll call timeout every second and then process timer events off of that. 
this would be sufficient for necko's internal needs.  but, since what i'm
talking about doesn't exist, i don't want it to hold up this patch.  ... it's
just a possibility that we might consider for the future.
Currently it fails if there is no timer. I guess I could change that, though,
and keep the entries forever in that case.

A 1 sec timer loop would be cpu intensive though, and we'd just be simulating
nsitimer anyway.

However, embedding/browser depends on timer, as does libpr0n and the widget code
(network/test also). In other words, its pretty much a requirement to do
anything with the browser.

alecf: Do you think that having necko depend on timer is a problem?
Oh, and libpref already depends on timers, and necko already depends on libpref,
so I really don't think its an issue. It only uses it for the pref autoconfig
stuff, though.
Our tree structure may obfuscate the dependencies, but it seems clear that
timers are, or could be (pav's thread-based timer reimplementation) more
primitive than networking.  Partial orders: timer < necko, timer < widget, necko
< widget -- total order: timer < necko < widget.  

bbaetz: if ~nsFtpProtocolHandler can run on a thread while another thread is
calling InsertConnection or RemoveConnection, isn't there a bogus weak reference
to the handler on the latter thread?  The null tests of mLock and the lock-free
clearing of it, followed by the 'if (tmp && mRootConnectionList) {...}' block in
the dtor, do not appear thread-safe to me.

/be
brendan: right. Simulating times via a select twith a timeout on the socket
transport thread just seems yuck.

insert/removeConnection are static functions, as is mLock, and the connection
list, so there is no bogus reference to the protocol handler instance from the
other thread.

I spent quite a bit of time peering at the new code, and couldn't come up with a
case where things broke. None of the functions need to be reentrant, so thats
not an issue.

Actually, looking again, you're right, if the second thread tests mLock before
its set to null in the destructor, then we could crash. Hmm. We don't have an
atomic test and set op, do we? I suppose I need to use a semaphone, then. Any
better suggestions?

(BTW, I noticed that we start up, then destroy the ftp protcol handler before
starting it up again to do any work. That may just be a byproduct of the test
program, though - I'll look into it)
I know the members are static, but who calls the dtor?  It would be best if you
could ensure that the protocol handler isn't destroyed until the other threads
are joined somehow.  That would relieve the dtor from having to do any locking.

/be
Its destroyed from the ui thread, probably in the ioservice's destructor (the
ioservice caches protocol handlers)

However, darin's just told me that despite having a file called
nsFtpConnectionThread.cpp, and the object proxying we do in nsFtpChannel, and
the comment arround line 150, dougt removed the threading stuff a while back.
Grr. I wondered why I couldn't find where we initied this thread...

In that case, my second patch
(http://bugzilla.mozilla.org/attachment.cgi?id=55394&action=view, the one w/o
the locks) should be OK. I think I was thrown by seeing calls to
::insertConnection in the debug log after seeing the ftp destructor being called.

I'm going to file a separate bug to clean this all up, I think. Most of these
locks are unneeded.
Attachment #55403 - Attachment is obsolete: true
Attachment #55394 - Attachment is obsolete: false
currently, PR_Poll times out every 35 seconds.  why is it bad to imagine pruning
dead (or old) connections every time PR_Poll times out?  this could very well be
a feature of the socket transport service.  and owners of a socket transport
could be notified when the socket closes due to inactivity.  seems like all
network protocol handlers would benefit from this.  i don't think this general
solution needs to go in before this bug can be fixed, but i do think that it
might be the right solution down the road.  we could of course adjust how often
PR_Poll times out... 35 seconds is a rather arbitray time out value.
Hmm. Well, we could check for dead connections every 35 seconds. But the
definition of dead depends on the protocol (We may want to wait more than 35
seconds for teh server to respond), and I can imagine a protocol wanting to do
cleanup before we quit, eg by sending a QUIT command, or something.
I personally don't see a problem with necko depending on timer, and I'm actually
surprised we don't already.. timer is a pretty low-level library - it has few
dependencies of its own. 

It also seems better than trying to cobble together a psuedo timer out of
PR_Poll, not to mention you'd loose some of the useful aspects of the timer such
as low-priority events (this seems like such an event)

hmm... ok, i'm beginning to see the light.  anyways, we're only talking about
adding a timer to FTP (and eventually HTTP) since those are the only protocols
that currently cache connected sockets.

the work required to make the socket transport service support such a feature is
probably not worth the effort... besides, it currently only knows about sockets
with active read or write requests (which doesn't include idle sockets).
Yeah, it would be nice to have a generic service, but if you ignore the changes
to add a pref for the timeout, and to move from an array to a hashtable, there
isn't much code, really.

So can I get a review on http://bugzilla.mozilla.org/attachment.cgi?id=55394,
then, please?
Comment on attachment 55394 [details] [diff] [review]
new patch, using an array instead of a hashtable


>Index: protocol/ftp/src/nsFtpProtocolHandler.cpp

>+// Stuff for the timer callback function
>+struct timerStruct {
>+    nsCOMPtr<nsITimer> timer;
>+    nsCOMPtr<nsISupports> conn;
>+    char* key;
>+    
>+    timerStruct() : key(nsnull) {};
>+    
>+    ~timerStruct() {
>+        if (timer)
>+            timer->Cancel();
>+        if (key)
>+            nsMemory::Free(key);
>+    }
>+};

down below key is allocated using nsCRT::strdup, but here you free it using 
nsMemory::Free.  it's probably not a good idea to mix these.

looks like there is an extra semicolon at the end of timerStruct()

how about making timerStruct an inner class of nsFtpProtocolHandler to reduce
namespace clutter?


>-nsSupportsHashtable* nsFtpProtocolHandler::mRootConnectionList = nsnull;
>+nsVoidArray* nsFtpProtocolHandler::mRootConnectionList = nsnull;
>+PRInt32 nsFtpProtocolHandler::mIdleTimeout = -1;

why aren't these plain old member variables?  if they are to be static/global,
then why not prefix them with 'g' or 's'?


> ////////////////////////////////////////////////////////////////////////////////
> nsFtpProtocolHandler::nsFtpProtocolHandler() {
>-        NS_INIT_REFCNT();
>+    NS_INIT_REFCNT();
> }

NS_INIT_ISUPPORTS supercedes NS_INIT_REFCNT, right?


 
>+
>+    if (mIdleTimeout == -1) {
>+        nsCOMPtr<nsIPrefService> prefSrv = do_GetService(NS_PREFSERVICE_CONTRACTID, &rv);
>+        if (NS_FAILED(rv)) return rv;
>+        nsCOMPtr<nsIPrefBranch> branch;
>+        rv = prefSrv->GetBranch("network.ftp.", getter_AddRefs(branch));
>+        if (NS_FAILED(rv)) return rv;
>+        rv = branch->GetIntPref(IDLE_TIMEOUT_PREF, &mIdleTimeout);
>+        if (NS_FAILED(rv))
>+            mIdleTimeout = 5*60; // 5 minute default
>+        
>+        nsCOMPtr<nsIPrefBranchInternal> pbi = do_QueryInterface(branch);
>+        pbi->AddObserver(IDLE_TIMEOUT_PREF, this, PR_TRUE);

does this work correctly?  i thought that AddObsever required the full path to the pref.


>+void
>+nsFtpProtocolHandler::Timeout(nsITimer *aTimer, void *aClosure)
>+{
>+    PR_LOG(gFTPLog, PR_LOG_DEBUG, ("Timeout reached for %0x\n", aClosure));
>+    

could mRootConnectionList be null at this point?  could the timeout fire
after shutting down the FTP protocol handler?

>+    PRBool found = mRootConnectionList->RemoveElement(aClosure);


> nsresult
>-nsFtpProtocolHandler::RemoveConnection(nsIURI *aKey, nsISupports* *_retval) {
>+nsFtpProtocolHandler::RemoveConnection(nsIURI *aKey, nsISupports* *_retval)
>+{

NIT ALERT: are you changing the brace style uniformly in this file?
> down below key is allocated using nsCRT::strdup, but here you free it using
> nsMemory::Free.  it's probably not a good idea to mix these.

It all calls libc's free, which is one of my nits, but that was an accident.
Will fix.

> how about making timerStruct an inner class of nsFtpProtocolHandler to reduce
> namespace clutter?

OK.

> >-nsSupportsHashtable* nsFtpProtocolHandler::mRootConnectionList = nsnull;
> >+nsVoidArray* nsFtpProtocolHandler::mRootConnectionList = nsnull;
> >+PRInt32 nsFtpProtocolHandler::mIdleTimeout = -1;
> 
> why aren't these plain old member variables?  if they are to be
static/global,> then why not prefix them with 'g' or 's'?
Because ::Insert/RemoveConnection are static functions. I plan on cleaning this
up as part of my "remove useless locks and comification" patch. They're static
members of the class to avoid namespace leakage, I guess.

> NS_INIT_ISUPPORTS supercedes NS_INIT_REFCNT, right?

Yeah - I'd added some code to the constructor, then removed it, and fixed up the
indenting at the same time.

> does this work correctly?  i thought that AddObsever required the full path to
the pref.
Well, if I set it in my profile it loads the pref correctly, and thats usually a
sign of it working. The docs imply that it can - maybe the necko2 dll, unlike
the necko dll, is being loaded late enough that this isn't mattering. I'm adding
an observer to the particular pref branch, though. Looking at the code, you may
be right. I'll put some printfs in, and ask arround.

> could mRootConnectionList be null at this point?  could the timeout fire
> after shutting down the FTP protocol handler?

No. The timers will fire on the UI thread, and I cancel the timers as part of
destructing the timer struct from the protocol handler destructor. (This is part
of why the uneeded locking was a pain)

> NIT ALERT: are you changing the brace style uniformly in this file?
> 
Not really. That was a sideeffect of initially having the nsISupports be an
nsFtpControlConnection, but then I changed it back. I'll probably change it back
again later, though, when I clean up the locking.
you should be fine adding the relative pref name to the pref branch.. all
methods off of the pref branch should be acting on relative pref names..
bbaetz: how about just adding a

  NS_ASSERTION(mRootConnectionList, "wtf")

or something like that to your timeout handler?
I'll attach a new patch tomorrow.

darin: will do
alecf: Nope, as discussed on irc. See bug 107617. Any observer set on a not-root
pref branch doesn't appear to work. I'll have to rewrite my code, I think.
I didn't get to this.

->0.9.7
Target Milestone: mozilla0.9.6 → mozilla0.9.7
Attached patch updated patch (obsolete) — Splinter Review
The reason noone reviewed this would probably because I never attached the
updated patch. Oops
Attachment #55394 - Attachment is obsolete: true
Comment on attachment 58616 [details] [diff] [review]
updated patch

sr=darin
Attachment #58616 - Flags: superreview+
Attached patch worksSplinter Review
That one crashed, because of the scoping of ts in RemoveConnection. Not sure
why it didn't crash until today, then did so reliably. Anyway, fixed.
Attachment #58616 - Attachment is obsolete: true
Comment on attachment 60431 [details] [diff] [review]
works

>+    struct timerStruct {

>+        ~timerStruct() {
>+            if (timer)
>+                timer->Cancel();
>+            if (key)
>+                nsCRT::free(key);

CRTFREEIF(key) ?

otherwise sr=darin
Attachment #60431 - Flags: superreview+
Comment on attachment 60431 [details] [diff] [review]
works

nit- I would change the for loops to use the post decrement operator.

Nit: In your Observer(), could you do a strcmp on the aTopic.

R= with or without my nits fixed.  Looks really good.
Attachment #60431 - Flags: review+
Preincrement is faster, becauset the compiler doesn't need a temporary. (Not
that it makes a differnce on POD types, I suspect)

I'll add the strcmp
perhaps the strcmp on aTopic could be inside a debug assertion?
Checked in (with the extra assertion)
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Keywords: testcase
Whiteboard: [nsbeta3-] → checklinux
VERIFIED: 2002-05-13 commercial 1.0.1, Linux
Status: RESOLVED → VERIFIED
Whiteboard: checklinux
*** Bug 123129 has been marked as a duplicate of this bug. ***
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: