Closed Bug 30648 Opened 25 years ago Closed 25 years ago

nsSocketTransports (from nsMsgProtocol.cpp) not getting removed from mActiveTransportList (reaching limit of 50)

Categories

(MailNews Core :: Backend, defect, P1)

x86
Windows 98
defect

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: trudelle, Assigned: sspitzer)

Details

(Whiteboard: [PDT+]3/15 verified/seems OK win32 b1, but need more soaking)

Attachments

(4 files)

I don't like to open bugs without a finite set of steps to reproduce, but I've
been seeing a very short MTBF while browsing newsgroups using recent
(commercial) verification builds. It probably hasn't changed suddenly, but until
recently no part of the browser had an MTBF that was any longer, so it didn't
stand out so much. I only check a few newsgroups, but they have lots (thousands)
of messages.  Here's a typical scenario:

Launch into mail, or into browser and then open a mail window.
Login to IMAP server, read a few messages, respond to a few.
load n.p.m.xpfe & read a few messages
Load n.p.m.ui, read several messages, respond to a few.
After 20-30 minutes, all loads/reads will fail.

Up until a few weeks ago, I couldn't read more than 15 articles or so before the
operation would fail. Message loading would start out slow, then gradually get
much slower until it failed. Now I can read a lot more, but memory use continues
to go up quickly with each newsgroup loaded.  For instance, loading n.p.m.ui
typically eats about 12 MB or RAM.  Using MemTurbo, I've noticed that the
failure to load newsgroups or read messages corresponds roughly with running out
of free RAM.
May be related to bug 29808. Peter, when loads/reloads fail, is it a network
problem or a repaint problem? In otehr words, if you alt-tab away from Mozilla
and back, does the message display?  Anyway, this is Seth's.
Assignee: phil → sspitzer
zach, I thought #29808 was linux only.

a while back I had a bug where I wasn't closing the nntp connection.

after the client opened enough (without closing any) the server start refusing,
and this sounds similar.

but I'll also investigate the memory leaking / running out of memory issue that
peter points out.
Status: NEW → ASSIGNED
It is not a refresh problem, the status and throbber show operations in progress, 
and forced repaints have no effect. The key problem for me is that the message 
load times seem to start out slow and get progressively longer until they fail 
altogether.  I mention the memory use only because it might be relevant. I'm not 
claiming it is a leak or even a problem, for all I know it is being put to good 
use as a cache.
I'm wondering if it could be similar to the bug I filed a while 
back:  http://bugzilla.mozilla.org/show_bug.cgi?id=20653
Yes, it is similar, except this is happening in News, on Win98, I'm not using
Biff for anything, it is easy to reproduce (100%).  It sounds just like what
asadotzler@netscape.net was describing in 20653.  I think it is a dogfood issue,
adding keyword. I'm very motivated to use this product, but for News it is
impractical.  One more note: one of the times it happened today was with 40 MB
free RAM, so it may have nothing to do with low memory.  
Keywords: dogfood
Putting on PDT+ for beta1. 
Whiteboard: [PDT+]
investigating this now.  I'll start by making sure we are closing all nntp
connections.
Target Milestone: M14
good news, is that on linux, I'm not seeing the nntp connections being left
open.  so most likely, that isn't the problem on windows either (but I'm going
to check.)

I was able to post messages and read messages for a while on linux, switching
between newsgroups.

I'll continue to try to reproduce.
I haven't seen this problem on Linux either, it could be limited to Win98.
updating summary and keywords.
Keywords: beta1
Whiteboard: [PDT+] → [PDT+](trying to reproduce)
I displayed about 50 mail messages in about 15 minutes with today's debug build 
on WinNT.  No problems, other than the expected memory leaks (approx 10-100K 
per message.)  Maybe the reported problem really is just a low memory problem, 
even though that doesn't jive with Peter's comment. A debug stack from the 
crash would sure help.
What crash? This is a failure to load messages while reading news on Windows 98;
has anyone tried to reproduce it there?
It happens to me all the time, so if anyone is interested I can demonstrate it.
I don't think it is low RAM, but why on earth
would we expect leaks of 100K per message?
Whiteboard: [PDT+](trying to reproduce) → [PDT+](trying to reproduce on other platforms)
I see this also.  Using today's build.  When this happens, I can't load any 
newsgroup messages (even in a different newsgroup).  I can go back to my mail 
folder and load a mail (IMAP) message.  Then, I try to go back to news and still 
can't load.  I'll see if I can narrow down further, but I'm out most of the day 
Thurs.  This is on Win32 system.
Seth - if you want, you can use my notebook while I'm out tomorrow.  Just be 
sure to let me know so that I can turn off my screen saver password.
I'm getting somewhere.  god bless the MSVC++ debugger.

deep down in nsSocketTransportService.cpp, the list of active transport sockets 
keeps growing, until we start failing.  (the get filled up with transports to 
the news server.)

something is not cleaning up properly and I'm still looking for it.
Whiteboard: [PDT+](trying to reproduce on other platforms) → [PDT+](reproducable, debugging and working on a fix.)
yep, we start failing because we get 50 connections in the mActiveTransportList

MAX_OPEN_CONNECTIONS is set to 50

still trying to figure out why connections aren't getting properly removed.
it may be something bigger than just news.

it may be something with nsMsgProtocol.cpp

if I do "get new messages" on a pop account, nsSocketTransports get left in the
mActiveTransport list.

after 50 get new messages, I'll start failing on that too, because the limit has 
been reached.
Priority: P3 → P1
adding necko team members to see if they have any ideas.

I'm still debugging and trying to come up with a fix.
changing summary to reflect the real problem.
Summary: Very short MTBF reading newsgroups → nsSocketTransports (from nsMsgProtocol.cpp) not getting removed from mActiveTransportList (reaching limit of 50)
I have a fix, but is not the final fix.

the problem, I think, is that transport never gets removed.

we call nsSocketTransport::doWrite() and eventuall, we write out the entire 
buffer, then we get to line 1070

  else if (NS_BASE_STREAM_WOULD_BLOCK == rv) {
    //
    // If the buffer is empty, then notify the reader and stop polling
    // for write until there is data in the buffer.  See the OnWrite()
    // notification...
    //
    NS_ASSERTION((0 == totalBytesWritten), "returned NS_BASE_STREAM_WOULD_BLOCK 
and a writeCount");
    if (GetFlag(eSocketWrite_Sync)) {
      // We can only wait if we created the stream (a pipe -- created in the 
synchronous 
      // OpenOutputStream case). If we didn't create it, we couldn't have been 
the buffer
      // observer, so we won't get any notification when more data becomes 
available. 
      SetFlag(eSocketWrite_Wait);
      mSelectFlags &= (~PR_POLL_WRITE);
    }

my solution is to expose a CloseTransport() method and call it 
from nsMsgProtocol::CloseSocket()

you get the idea.  I'll attach the patch.

I'd like to discuss the correct solution with the necko guys.
the fix I have in hand solves the problem.
closed connections get removed from the mActiveTransports array so we never hit 
50.
Whiteboard: [PDT+](reproducable, debugging and working on a fix.) → [PDT+](possible fix in hand)
Adjusted whiteboard based on reading of comment (i.e., the fix "works").  I
still assume you need review and testing.  IF other folks have provided that...
please feel free to add that to the bug report, to help move this along.
Thanks,
Jim
Whiteboard: [PDT+](possible fix in hand) → [PDT+] fix in hand
Seth: Why do you add a new CloseTransport method? Why can't you just call 
Cancel?
Cancel is part of the nsIRequest interface?  yes, I could just QI to nsIRequest 
and call cancel.

The reason I didn't was I wasn't sure cancel going to be the right thing to do.

Cancel() has the desired side effect (of removing the nsSocketTransport).

Is there a better, more proper way to do this?
You don't even need to QI. Just call Cancel on the transport directly.
in nsMsgProtocol, all I have is a nsIChannel
duh.  never mind.  I can do what warren suggests, just calling cancel on the 
nsIChannel.
ok, I've got a cleaned up version of the patch.

I'll attach in a second for warren's review.
Shouldn't that read:

     if (m_channel) {
        rv = m_channel->Cancel();

? 
yes, thanks for catching that.

that cleaner than calling nsMsgProtocol::Cancel()

ok then... r=warren
seeking PDT approval now.
I've got the final version of the patch.  seeking review from mscott (for the 
additional nsMsgProtocol.cpp change) now.
cvs diff -c is your friend
         ^^
I can't read your diff.
Looks good.
seeking approval now.  mscott and warren have reviewed.
will check in fix this weekend.
Whiteboard: [PDT+] fix in hand → [PDT+]3/12 landing (fix in hand)
fixed.  I still need to land this on the beta1 branch, and I'll do that soon.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
the fix for this has also been checked into the beta1 branch.
QA Contact: lchiang → laurel
I believe this is OK using 2000-03-15-06 nb1b commercial build NT 4.0.  I've
tried a few times and haven't seen a complete failure, although switching
newsgroups or back to mail folders then again to newsgroup is dog slow (seems
slower than it was last week).  Seems OK, but I'd like to give it some more soak
time tomorrow.  Won't mark verified until I test more tomorrow.
Whiteboard: [PDT+]3/12 landing (fix in hand) → [PDT+]3/15 verified/seems OK win32 b1, but need more soaking
OK, looks good to me.
Marking verified with:
2000-03-20-06 nb1b commercial build NT 4.0
2000-03-20-06 nb1b commercial build linux rh6.0
2000-03-20-06 nb1b commercial build mac OS 9.0
Status: RESOLVED → VERIFIED
Product: MailNews → Core
Product: Core → MailNews Core
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: