FTP - keeping connections alive forever

VERIFIED FIXED in mozilla0.9.7



19 years ago
17 years ago


(Reporter: waterson, Assigned: bbaetz)



Dependency tree / graph

Firefox Tracking Flags

(Not tracked)



(4 attachments, 4 obsolete attachments)



19 years ago
I may have prematurely closed one of the other socket transport bugs. Oh well,
here's another one.

Turns out we still leak

- a socket transport (three refs, two from pipes, below)
- the socket transport service (one ref)
- two nsPipes (two refs each)

I think that there's a circularity here. I'll report on what I find...


19 years ago
Keywords: mlk

Comment 2

19 years ago
From the refcount balancer tree of the first nsPipe object, it's pretty clear
that the two references are being dropped by
nsSocketTransport::OpenOutputStream(), presumably because the socket transport
is taking ownership of the pipe.

Comment 4

19 years ago
...so where's that last reference being lost from? The nsSocketTransportService

Comment 6

19 years ago
Aha. It looks like maybe the nsFtpConnectionThread is bound up in a cyclic
reference with the nsSocketTransportService? Do we need some kind of out-of-band
notification to break the cycle here?

Comment 7

19 years ago
I added a second FTP URL to the "bloaturls.txt" file, and lo, a second socket
transport was leaked, along with two more pipes.

Comment 8

19 years ago
Jud: This fix seems to eliminate one more addref for the nsSocketTransport
object that's leaking. We believe that it holds on to the 2 pipes and
nsSocketTransportService. Canceling the "pipe" channels is necessary because
they return to the work queue in the socket transport service because they just
return WOULD_BLOCK. 

However, we don't think that the nsFTPReleaseEvents are actually releasing what
they think they're releasing, particularly mConnCache. You should check into

We added some MOZ_COUNT_CTOR/DTOR macros for the ftp events and
nsConnectionCacheObj, and determined that mConn is also leaking, but that may be
due to the mConnCache problem.

Index: nsFtpConnectionThread.cpp
RCS file: /cvsroot/mozilla/netwerk/protocol/ftp/src/nsFtpConnectionThread.cpp,v
retrieving revision 1.121
diff -c -r1.121 nsFtpConnectionThread.cpp
*** nsFtpConnectionThread.cpp	2000/06/17 01:38:36	1.121
--- nsFtpConnectionThread.cpp	2000/06/23 06:32:58
*** 1809,1816 ****
      // Release the transports
!     mCPipe = 0;
!     mDPipe = 0;
      mIPv6Checked = PR_FALSE;
      if (mIPv6ServerAddress) {
--- 1809,1822 ----
      // Release the transports
!     if (mCPipe) {
!         mCPipe->Cancel(NS_BINDING_ABORTED);
!         mCPipe = 0;
!     }
!     if (mDPipe) {
!         mDPipe->Cancel(NS_BINDING_ABORTED);
!         mDPipe = 0;
!     }
      mIPv6Checked = PR_FALSE;
      if (mIPv6ServerAddress) {

Comment 9

19 years ago
potts and I went back and fourth on the pipe cancellation stuff (note the pipe 
cancels just above that which are commented out). I'm not sure why returning to 
the workQ is helping? Sounds like the socket transport has some cleanup 
problems. I'm worried about cancelling because a cancel will cause an OnStop() 
to fire w/ an abort code. This *might* cancel the data channel before all the 
data has been pumped. Yea, this is a bug in the socket transport.

Comment 10

19 years ago
We were seeing 19 addref's but only 18 release's through 


This led us to believe that the last trip through was getting stuck here:


In other words, the transport was returning EWOULDBLOCK rather than closing. 

Comment 11

19 years ago
Oops... we had another diff in the socket transport service for cleaning out the 
work queue on shutdown. Chris, can you post that one too so Rick can review it?

Comment 12

19 years ago
rpotts: below is the diff. it's a bit sloppy (because we'll still leak the 
PR_CLIST stuff), but you get the idea....

Index: nsSocketTransportService.cpp
RCS file: /cvsroot/mozilla/netwerk/base/src/nsSocketTransportService.cpp,v
retrieving revision 1.42
diff -u -r1.42 nsSocketTransportService.cpp
--- nsSocketTransportService.cpp        2000/06/23 02:02:03     1.42
+++ nsSocketTransportService.cpp        2000/06/23 21:29:09
@@ -619,7 +619,10 @@
   } else {
+  for (int i = 0; i < mSelectFDSetCount; i++) {
+    NS_IF_RELEASE(mActiveTransportList[i]);
+  }
   return rv;

Comment 13

19 years ago
hey chris,

the patch looks fine to me...

Do you think that before we release each transport, we should cancel it?

-- rick

Comment 14

19 years ago
Well, I don't know...one thing that puzzled warren and I last night was "how 
come HTTP doesn't leak"? Is there some shutdown sequence that FTP is just not 
doing right? Does FTP keep a connection open to the server forever? When is that 
connection supposed to go away?

Comment 15

19 years ago
Yes, maybe the destructors should always cancel, just to be safe (like closing a 
file object in its destructor).

Comment 16

19 years ago
ftp keeps a transport cached, per server, in the ftpprotocolhandler, 

Comment 17

19 years ago
So the nsFTPProtocolHandler *is* going away: maybe it's not properly shutting 
down the socket transports?

Comment 18

19 years ago
hey jud,

if the ftpprotocolhandler keeps a transport cached per server, won't we 
eventually run out of socket transports if we visit ~8 different FTP sites?

-- rick

Comment 19

19 years ago
Good question. I just visited a dozen different sites and everything was fine. 
There are no cache cleanup smarts in FTP, so I suspect we just kept building 
socket transports and caching them (maybe the limit is higher, or we grow it). 
How does HTTP do cleanup?

ftp server timeouts are typically longer than HTTP (for the command channel).

Comment 20

19 years ago
We need to fix the thread-pool bug 36750 before we can up the limit on transport 
threads. Cc'ing dougt.
The leaked nsSocketTransportService leaks an nsStringBundle too.


19 years ago
Keywords: nsbeta3

Comment 22

19 years ago
approving for nsbeta3
Assignee: gagan → ruslan

Comment 23

19 years ago
approving this time (didn't make it last time!) sorry.
Whiteboard: [tind-mlk] → [tind-mlk][nsbeta3+]

Comment 24

19 years ago
nsSocketTransport/nsResChannel-related leak is gone. As to FTP - it does seem to 
be keeping connections alive forever. May be we should set up a timeout like 
http does? I'm also not sure that recovery logic works correctly in case the 
remote server drops the connection.

Comment 25

19 years ago
robert-- welcome to necko!
Assignee: ruslan → rjc
Changing summary.  As the leaks are detailed in bug # 51937, this bug can be
strictly about FTP keeping connections alive.

In terms of recovery logic, I used Mozilla to FTP into one of my Linux machines,
then telnet'ed to the same machine (out-of-band) and killed off the FTP process.
Finally, back in Mozilla I reloaded the FTP URL in question and Mozilla
recovered fine.
Summary: FTP causes nsSocketTransport and two nsPipe objects to leak → FTP - keeping connections alive forever
Per PDT rules P3 bugs are now nsbeta3-
Whiteboard: [tind-mlk][nsbeta3+] → [tind-mlk][nsbeta3-]
Target Milestone: --- → Future

Comment 28

19 years ago
Hi dougt, welcome to necko :)
Assignee: rjc → dougt


19 years ago
Blocks: 62356

Comment 29

18 years ago
mass move, v2.
qa to me.
QA Contact: tever → benc


18 years ago
Keywords: mozilla1.0
We need to fix this soon.  
Target Milestone: Future → mozilla0.9.6


18 years ago
Blocks: 92580
to bradley
Assignee: dougt → bbaetz
Component: Networking → Networking: FTP


18 years ago
No longer blocks: 92580

Comment 32

18 years ago
This is no longer an mlk bug, according to comments.

I'll look at this.
Keywords: mlk
Whiteboard: [tind-mlk][nsbeta3-] → [nsbeta3-]

Comment 33

18 years ago
OK, I'm attaching a patch. We use one timer per stored host, and the default
timeout is 5 minutes, and the pref is network.ftp.idleConnectionTimeout.

I shouldn't be leaking by being a prefobserver, because I don't hold onto a pref
branch - there's no need, since its given to me in the Observe callback. OTOH,
by that same logic I'm not sure why http needs to be an nsIWeakReference, since
it doesn't appear to hold onto a pref branch either. darin?

BTW, why isn't the nsITimer callback function declared as PR_CALLBACK in the idl?

darin, dougt: r/sr, please?

Comment 34

18 years ago
Posted patch patch (obsolete) — Splinter Review

Comment 35

18 years ago
Oh, and while I'm at it, why is the hashtable being created as threadsafe? Is
this just because its being created on a different thread than we're using it
on, and if so can we just delay creation until the first time we need it?

My callback code isn't threadsafe, and I don't think it needs to be.

Comment 36

18 years ago
Comment on attachment 55014 [details] [diff] [review]

This leaks from the hashtable on shutdown. I'll do a new patch tomorrow.
Attachment #55014 - Flags: needs-work+


18 years ago
Attachment #55014 - Attachment is obsolete: true


18 years ago
Attachment #55394 - Attachment is obsolete: true

Comment 38

18 years ago
Posted patch new patch (obsolete) — Splinter Review

Comment 39

18 years ago
OK, I've attached a new patch, which uses a lock to get access to the connection
array. We need this because the ftp connection thread can call
::insertconnection after the protocol handler shuts down, because they're on
different threads. This wasn't a problem with the threadsafe supportshashtable,
but now I have cleanup to do.

Comment 40

18 years ago
do we really want necko to depend on widget/timer... perhaps we should consider
a timer generated from the socket transport poll loop??

Comment 41

18 years ago
darin: I don't follow. I can't really use the strategy http currently uses
because we don't make enough ftp connections for it to be worthwhile. I really
don't think that its a problem for us to depend on timers. We do currently,
although thats only for tests.

timer is a separate module, despite where it is in the tree.

Comment 42

18 years ago
i understand why using the timer module makes this code simpler, but it does add
a new dependency for necko.  because of this, i think you should ensure that FTP
still works (albeit with a reduced feature set) if the timer service is not
present... it should not be a fatal error.

my point about the socket transport is that it would be possible to generate a
low frequency timer using the socket transport poll loop.  we could easily make
the poll call timeout every second and then process timer events off of that. 
this would be sufficient for necko's internal needs.  but, since what i'm
talking about doesn't exist, i don't want it to hold up this patch.  ... it's
just a possibility that we might consider for the future.

Comment 43

18 years ago
Currently it fails if there is no timer. I guess I could change that, though,
and keep the entries forever in that case.

A 1 sec timer loop would be cpu intensive though, and we'd just be simulating
nsitimer anyway.

However, embedding/browser depends on timer, as does libpr0n and the widget code
(network/test also). In other words, its pretty much a requirement to do
anything with the browser.

alecf: Do you think that having necko depend on timer is a problem?

Comment 44

18 years ago
Oh, and libpref already depends on timers, and necko already depends on libpref,
so I really don't think its an issue. It only uses it for the pref autoconfig
stuff, though.
Our tree structure may obfuscate the dependencies, but it seems clear that
timers are, or could be (pav's thread-based timer reimplementation) more
primitive than networking.  Partial orders: timer < necko, timer < widget, necko
< widget -- total order: timer < necko < widget.  

bbaetz: if ~nsFtpProtocolHandler can run on a thread while another thread is
calling InsertConnection or RemoveConnection, isn't there a bogus weak reference
to the handler on the latter thread?  The null tests of mLock and the lock-free
clearing of it, followed by the 'if (tmp && mRootConnectionList) {...}' block in
the dtor, do not appear thread-safe to me.


Comment 46

18 years ago
brendan: right. Simulating times via a select twith a timeout on the socket
transport thread just seems yuck.

insert/removeConnection are static functions, as is mLock, and the connection
list, so there is no bogus reference to the protocol handler instance from the
other thread.

I spent quite a bit of time peering at the new code, and couldn't come up with a
case where things broke. None of the functions need to be reentrant, so thats
not an issue.

Actually, looking again, you're right, if the second thread tests mLock before
its set to null in the destructor, then we could crash. Hmm. We don't have an
atomic test and set op, do we? I suppose I need to use a semaphone, then. Any
better suggestions?

(BTW, I noticed that we start up, then destroy the ftp protcol handler before
starting it up again to do any work. That may just be a byproduct of the test
program, though - I'll look into it)
I know the members are static, but who calls the dtor?  It would be best if you
could ensure that the protocol handler isn't destroyed until the other threads
are joined somehow.  That would relieve the dtor from having to do any locking.


Comment 48

18 years ago
Its destroyed from the ui thread, probably in the ioservice's destructor (the
ioservice caches protocol handlers)

However, darin's just told me that despite having a file called
nsFtpConnectionThread.cpp, and the object proxying we do in nsFtpChannel, and
the comment arround line 150, dougt removed the threading stuff a while back.
Grr. I wondered why I couldn't find where we initied this thread...

In that case, my second patch
(http://bugzilla.mozilla.org/attachment.cgi?id=55394&action=view, the one w/o
the locks) should be OK. I think I was thrown by seeing calls to
::insertConnection in the debug log after seeing the ftp destructor being called.

I'm going to file a separate bug to clean this all up, I think. Most of these
locks are unneeded.


18 years ago
Attachment #55403 - Attachment is obsolete: true


18 years ago
Attachment #55394 - Attachment is obsolete: false

Comment 49

18 years ago
currently, PR_Poll times out every 35 seconds.  why is it bad to imagine pruning
dead (or old) connections every time PR_Poll times out?  this could very well be
a feature of the socket transport service.  and owners of a socket transport
could be notified when the socket closes due to inactivity.  seems like all
network protocol handlers would benefit from this.  i don't think this general
solution needs to go in before this bug can be fixed, but i do think that it
might be the right solution down the road.  we could of course adjust how often
PR_Poll times out... 35 seconds is a rather arbitray time out value.

Comment 50

18 years ago
Hmm. Well, we could check for dead connections every 35 seconds. But the
definition of dead depends on the protocol (We may want to wait more than 35
seconds for teh server to respond), and I can imagine a protocol wanting to do
cleanup before we quit, eg by sending a QUIT command, or something.

Comment 51

18 years ago
I personally don't see a problem with necko depending on timer, and I'm actually
surprised we don't already.. timer is a pretty low-level library - it has few
dependencies of its own. 

It also seems better than trying to cobble together a psuedo timer out of
PR_Poll, not to mention you'd loose some of the useful aspects of the timer such
as low-priority events (this seems like such an event)

Comment 52

18 years ago
hmm... ok, i'm beginning to see the light.  anyways, we're only talking about
adding a timer to FTP (and eventually HTTP) since those are the only protocols
that currently cache connected sockets.

the work required to make the socket transport service support such a feature is
probably not worth the effort... besides, it currently only knows about sockets
with active read or write requests (which doesn't include idle sockets).

Comment 53

18 years ago
Yeah, it would be nice to have a generic service, but if you ignore the changes
to add a pref for the timeout, and to move from an array to a hashtable, there
isn't much code, really.

So can I get a review on http://bugzilla.mozilla.org/attachment.cgi?id=55394,
then, please?

Comment 54

18 years ago
Comment on attachment 55394 [details] [diff] [review]
new patch, using an array instead of a hashtable

>Index: protocol/ftp/src/nsFtpProtocolHandler.cpp

>+// Stuff for the timer callback function
>+struct timerStruct {
>+    nsCOMPtr<nsITimer> timer;
>+    nsCOMPtr<nsISupports> conn;
>+    char* key;
>+    timerStruct() : key(nsnull) {};
>+    ~timerStruct() {
>+        if (timer)
>+            timer->Cancel();
>+        if (key)
>+            nsMemory::Free(key);
>+    }

down below key is allocated using nsCRT::strdup, but here you free it using 
nsMemory::Free.  it's probably not a good idea to mix these.

looks like there is an extra semicolon at the end of timerStruct()

how about making timerStruct an inner class of nsFtpProtocolHandler to reduce
namespace clutter?

>-nsSupportsHashtable* nsFtpProtocolHandler::mRootConnectionList = nsnull;
>+nsVoidArray* nsFtpProtocolHandler::mRootConnectionList = nsnull;
>+PRInt32 nsFtpProtocolHandler::mIdleTimeout = -1;

why aren't these plain old member variables?  if they are to be static/global,
then why not prefix them with 'g' or 's'?

> ////////////////////////////////////////////////////////////////////////////////
> nsFtpProtocolHandler::nsFtpProtocolHandler() {
>-        NS_INIT_REFCNT();
> }


>+    if (mIdleTimeout == -1) {
>+        nsCOMPtr<nsIPrefService> prefSrv = do_GetService(NS_PREFSERVICE_CONTRACTID, &rv);
>+        if (NS_FAILED(rv)) return rv;
>+        nsCOMPtr<nsIPrefBranch> branch;
>+        rv = prefSrv->GetBranch("network.ftp.", getter_AddRefs(branch));
>+        if (NS_FAILED(rv)) return rv;
>+        rv = branch->GetIntPref(IDLE_TIMEOUT_PREF, &mIdleTimeout);
>+        if (NS_FAILED(rv))
>+            mIdleTimeout = 5*60; // 5 minute default
>+        nsCOMPtr<nsIPrefBranchInternal> pbi = do_QueryInterface(branch);
>+        pbi->AddObserver(IDLE_TIMEOUT_PREF, this, PR_TRUE);

does this work correctly?  i thought that AddObsever required the full path to the pref.

>+nsFtpProtocolHandler::Timeout(nsITimer *aTimer, void *aClosure)
>+    PR_LOG(gFTPLog, PR_LOG_DEBUG, ("Timeout reached for %0x\n", aClosure));

could mRootConnectionList be null at this point?  could the timeout fire
after shutting down the FTP protocol handler?

>+    PRBool found = mRootConnectionList->RemoveElement(aClosure);

> nsresult
>-nsFtpProtocolHandler::RemoveConnection(nsIURI *aKey, nsISupports* *_retval) {
>+nsFtpProtocolHandler::RemoveConnection(nsIURI *aKey, nsISupports* *_retval)

NIT ALERT: are you changing the brace style uniformly in this file?

Comment 55

18 years ago
> down below key is allocated using nsCRT::strdup, but here you free it using
> nsMemory::Free.  it's probably not a good idea to mix these.

It all calls libc's free, which is one of my nits, but that was an accident.
Will fix.

> how about making timerStruct an inner class of nsFtpProtocolHandler to reduce
> namespace clutter?


> >-nsSupportsHashtable* nsFtpProtocolHandler::mRootConnectionList = nsnull;
> >+nsVoidArray* nsFtpProtocolHandler::mRootConnectionList = nsnull;
> >+PRInt32 nsFtpProtocolHandler::mIdleTimeout = -1;
> why aren't these plain old member variables?  if they are to be
static/global,> then why not prefix them with 'g' or 's'?
Because ::Insert/RemoveConnection are static functions. I plan on cleaning this
up as part of my "remove useless locks and comification" patch. They're static
members of the class to avoid namespace leakage, I guess.


Yeah - I'd added some code to the constructor, then removed it, and fixed up the
indenting at the same time.

> does this work correctly?  i thought that AddObsever required the full path to
the pref.
Well, if I set it in my profile it loads the pref correctly, and thats usually a
sign of it working. The docs imply that it can - maybe the necko2 dll, unlike
the necko dll, is being loaded late enough that this isn't mattering. I'm adding
an observer to the particular pref branch, though. Looking at the code, you may
be right. I'll put some printfs in, and ask arround.

> could mRootConnectionList be null at this point?  could the timeout fire
> after shutting down the FTP protocol handler?

No. The timers will fire on the UI thread, and I cancel the timers as part of
destructing the timer struct from the protocol handler destructor. (This is part
of why the uneeded locking was a pain)

> NIT ALERT: are you changing the brace style uniformly in this file?
Not really. That was a sideeffect of initially having the nsISupports be an
nsFtpControlConnection, but then I changed it back. I'll probably change it back
again later, though, when I clean up the locking.

Comment 56

18 years ago
you should be fine adding the relative pref name to the pref branch.. all
methods off of the pref branch should be acting on relative pref names..

Comment 57

18 years ago
bbaetz: how about just adding a

  NS_ASSERTION(mRootConnectionList, "wtf")

or something like that to your timeout handler?

Comment 58

18 years ago
I'll attach a new patch tomorrow.

darin: will do
alecf: Nope, as discussed on irc. See bug 107617. Any observer set on a not-root
pref branch doesn't appear to work. I'll have to rewrite my code, I think.

Comment 59

18 years ago
I didn't get to this.

Target Milestone: mozilla0.9.6 → mozilla0.9.7

Comment 60

18 years ago
Posted patch updated patch (obsolete) — Splinter Review
The reason noone reviewed this would probably because I never attached the
updated patch. Oops
Attachment #55394 - Attachment is obsolete: true

Comment 61

18 years ago
Comment on attachment 58616 [details] [diff] [review]
updated patch

Attachment #58616 - Flags: superreview+

Comment 62

18 years ago
Posted patch worksSplinter Review
That one crashed, because of the scoping of ts in RemoveConnection. Not sure
why it didn't crash until today, then did so reliably. Anyway, fixed.
Attachment #58616 - Attachment is obsolete: true

Comment 63

18 years ago
Comment on attachment 60431 [details] [diff] [review]

>+    struct timerStruct {

>+        ~timerStruct() {
>+            if (timer)
>+                timer->Cancel();
>+            if (key)
>+                nsCRT::free(key);


otherwise sr=darin
Attachment #60431 - Flags: superreview+
Comment on attachment 60431 [details] [diff] [review]

nit- I would change the for loops to use the post decrement operator.

Nit: In your Observer(), could you do a strcmp on the aTopic.

R= with or without my nits fixed.  Looks really good.
Attachment #60431 - Flags: review+

Comment 65

18 years ago
Preincrement is faster, becauset the compiler doesn't need a temporary. (Not
that it makes a differnce on POD types, I suspect)

I'll add the strcmp

Comment 66

18 years ago
perhaps the strcmp on aTopic could be inside a debug assertion?

Comment 67

18 years ago
Checked in (with the extra assertion)
Last Resolved: 18 years ago
Resolution: --- → FIXED


17 years ago
Keywords: testcase


17 years ago
Whiteboard: [nsbeta3-] → checklinux

Comment 68

17 years ago
VERIFIED: 2002-05-13 commercial 1.0.1, Linux
Whiteboard: checklinux

Comment 69

17 years ago
*** Bug 123129 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.