Closed Bug 71391 Opened 24 years ago Closed 24 years ago

Complete networking failure while running the page loader on win98

Categories

(Core :: Networking, defect)

x86
Windows 98
defect
Not set
critical

Tracking

()

VERIFIED FIXED

People

(Reporter: jrgmorrison, Assigned: darin.moz)

Attachments

(1 file)

Overview Description: 
 
  Beginning with today's builds (03/08) on win98, the page loading
  test, after visiting a fair number of the pages, would throw up 
  a dialog 'Connection refused when trying to contact jrgm.mcom.com'.

  After that dialog is dismissed, mozilla cannot visit *any* URL (it 
  just throws up the same alert). On top of that, neither Nav4.7
  nor IE5.5 on the same machine could connect.

  I mentioned this to gagan, and he said that there was an existing 
  DNS related bug, and this might be the same. However, I searched 
  for that bug, but couldn't find it, so here is a new bug (Sorry). 

  However, I'm wondering if this might be related to the leaks that  
  were happening earlier today. The reason why is that the error message
  for Nav4.7, says: 

  "A network error occurred : unable to connect to server
  (TCP Error: Not enough memory) 

  The server may be down or unreachable. Try connecting later."

Steps to Reproduce: 
1) http://jrgm.mcom.com/page-loader/loader.pl and hit submit
2) go have a coffee; return to PC in ~20 minutes

Build Date & Platform Bug Found: win98 2001030813 build

Additional Builds and Platforms Tested On: not seen on Mac or Linux

Sidenote to twalker: the workaround is to copy the current URL in mozilla 
 to the clipboard, quit mozilla, restart, paste in the URL and continue the
 test from there. [Actually, that's not a great workaround, since I don't
 know that there aren't OS side effects that have happened, and maybe a reboot
 would be the cleanest thing to do. But for now, we can just go with the simple
 workaround].
This may be resolved with the mem leak fix, but if not, this is a pretty 
serious problem (user cannot connect to anything on the network). 
Severity: normal → critical
Keywords: nsbeta1, nsdogfood
Also see bug 71332, which seems to be the same issue.
this is now also reported in:

bug 71375 "mail/news eats "buffer space", causes other apps to fail connecting
to server"
bug 71392 "Mozilla doesn't close TCP connections"
bug 71395 "After several minutes of browsing, tcpip locks up"
I see the unclosed sockets accumulate on Linux too, all are left in CLOSE_WAIT
state, and never vanish.
Bug 67957, bug 70417 and bug 70605 are about the same phenomenon in mailnews.
Okay, so the first set of other bugs are likely dups, and this is probably 
the leak, for which there is a fix in hand, bug 71317. 

The second set of bugs noted above, related to mailnews, are not this bug. 
Or, let's keep it clear that this bug and other 713** started today, and 
those other bugs have been around for some time. 
Yes, this is a new bug, being reported once an hour. A likely "mostfreq" before
the day is over. Tempting to suggest blocker-status.
*** Bug 71331 has been marked as a duplicate of this bug. ***
Seems like a dupe of bug 71317 to me.  Please reopen if the problem persists.

*** This bug has been marked as a duplicate of 71317 ***
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → DUPLICATE
Fair enough. Tracy, if the win98 test runs tomorrow without conking out, then 
please slap a verify on this bug.
*** Bug 71332 has been marked as a duplicate of this bug. ***
*** Bug 71375 has been marked as a duplicate of this bug. ***
*** Bug 71395 has been marked as a duplicate of this bug. ***
i applied the patch from bug 71317 to a fresh CVS build, but after having
browsed 3 sites i had 181 sockets hanging in CLOSE_WAIT. (linux)
sounds like this bug needs to be reopened.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
reassigning to myself
Assignee: neeti → darin
Status: REOPENED → NEW
Or is R.K. pointing out a pre-existing problem (perhaps bug 67957, bug 70417 
or bug 70605) that is independent of the leak and of this (assumed) consequence
of the leak? Before today's build, this had never occurred on win98 running
the test.
true, this bug is reported on win98.  can someone test this on win98?  thx!
Keywords: qawanted, verifyme
an example:
load http://www.digi.no
Strangely enough one last small picture won't load till a minute has passed.
During this, there are 53 sockets hanging in CLOSE_WAIT.
Then, the last image seems to be "flushed" - renders - and checking open sockets
to the site there is now ONE socket less open - but the first 52 remain
hanging. Seems only the last socket used gets closed normally, but for each item
on the page before that, one new socket is opened and never closes.
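For anyone wanting to reproduce the socket counts quoted in this bug, here is a minimal sketch that tallies TCP states from `netstat -tn` style output. It assumes the usual Linux column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the function name `count_tcp_states` and the sample text are made up for illustration and are not output captured from any build.

```python
from collections import Counter
from typing import Optional


def count_tcp_states(netstat_output: str, remote_host: Optional[str] = None) -> Counter:
    """Count TCP connection states in `netstat -tn` style text.

    If remote_host is given, only connections whose foreign address
    starts with "<remote_host>:" are counted.
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip banners, the column header, and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        foreign, state = fields[4], fields[5]
        if remote_host is not None and not foreign.startswith(remote_host + ":"):
            continue
        states[state] += 1
    return states


# Hypothetical sample, using documentation-only IP addresses.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        1      0 10.0.0.5:40312          192.0.2.10:80           CLOSE_WAIT
tcp        1      0 10.0.0.5:40313          192.0.2.10:80           CLOSE_WAIT
tcp        0      0 10.0.0.5:40314          192.0.2.10:80           ESTABLISHED
tcp        0      0 10.0.0.5:51000          198.51.100.7:110        CLOSE_WAIT
"""

if __name__ == "__main__":
    print(count_tcp_states(SAMPLE))
    print(count_tcp_states(SAMPLE, remote_host="192.0.2.10"))
```

A CLOSE_WAIT count that keeps growing for one host while the page has finished loading is the symptom described above: the server has closed its end, but the client has not yet closed its own socket.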
Keywords: qawanted, verifyme
jrgm: On Linux I have never seen the *browser* leave sockets hanging in
CLOSE_WAIT "forever" until now, on the 8th.
This is in reality a Windows AND Linux bug - and it's about sockets not closing.
That becomes "fatal" on Windows quicker, since MS Windows allows so few
simultaneously open sockets.
I am definitely seeing the same thing as R.K.Aa on linux.
Status: NEW → ASSIGNED
It looks like HTTP is leaking socket transports.  It creates sockets with 
keep-alive status and then loses them along the way.  There has been a bug
open for some time about leaking HTTP channels and hence leaking their
respective transports.  R.K.Aa, maybe you're seeing that bug instead?  It's
difficult to tell the difference between bug 31317 and 62388 from the
perspective of netstat -tcpd.
Also, I noticed that the CLOSE_WAIT problem does not show up under gtkEmbed.
Under mozilla, I see them pile up right away.
HTTP channels are being held open as well, which are probably holding onto
the socket transports.

Also, I disabled my fix to bug 66516, and this problem still persists!
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Reopening... I did not mean to close this.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I understand the problem!!  The server (www.digi.no) is sending an HTTP/1.1 
response with "Connection: close".  HTTP does not convey this info to the 
socket transport b/c it assumes that the default mode of the socket transport
is to automatically close when done, unless otherwise instructed.  For
keep-alive connections, HTTP tells the socket transport that it might be
reused.  So, I think the problem is in the socket transport.  My recent changes
on dougt's branch probably broke the old/assumed behavior of "close when done by
default unless otherwise instructed."
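To make the keep-alive vs. close distinction concrete, here is a rough sketch of how a client can decide whether a connection may be reused after a response, based on the HTTP version and the Connection header. This follows the general HTTP rules (HTTP/1.1 is persistent unless "close" is sent; HTTP/1.0 closes unless "keep-alive" is sent). The function name `connection_may_persist` is hypothetical and this is not the actual Mozilla code.

```python
from typing import Optional


def connection_may_persist(http_version: str, connection_header: Optional[str]) -> bool:
    """Return True if the connection may be kept open for reuse.

    http_version is the version string from the status line, e.g. "HTTP/1.1".
    connection_header is the raw Connection header value, or None if absent.
    """
    # The Connection header is a comma-separated list of case-insensitive tokens.
    tokens = {
        tok.strip().lower()
        for tok in (connection_header or "").split(",")
        if tok.strip()
    }

    # An explicit "close" always wins, whatever the version.
    if "close" in tokens:
        return False

    # Parse "HTTP/<major>.<minor>" into a comparable tuple.
    major, _, minor = http_version.partition("/")[2].partition(".")
    version = (int(major), int(minor))

    # HTTP/1.1 and later: persistent by default.
    if version >= (1, 1):
        return True

    # HTTP/1.0: only persistent when the server opted in.
    return "keep-alive" in tokens


if __name__ == "__main__":
    # The www.digi.no case from this bug: HTTP/1.1 with "Connection: close".
    print(connection_may_persist("HTTP/1.1", "close"))
```

In the case described here the answer is "do not reuse", so the layer below has to close the socket once the transfer is done; if neither layer takes responsibility for that, the socket sits in CLOSE_WAIT.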
ahh that would explain why our mail connections are being kept open too, for pop
and news.

It turns out we had two problems in the socket transport:

1) We were not closing the socket transport on PR_POLL_HUP
2) We were not closing the socket transport when (mReuseCount == 0)
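The two conditions above can be sketched as a single "should this transport be closed?" check. This is only a simplified model of the idea, not the patch itself: the flag constants are illustrative values (not copied from the NSPR headers), `reuse_count` stands in for mReuseCount under the assumption that it counts how many more times the socket may be reused, and `transfer_done` is a hypothetical parameter.

```python
# Illustrative poll flags for this sketch only; not the real NSPR values.
POLL_READ = 0x1
POLL_HUP = 0x2


def should_close_transport(poll_flags: int, reuse_count: int, transfer_done: bool) -> bool:
    """Decide whether a socket transport should be closed.

    poll_flags:    bitmask reported by the poll loop for this socket.
    reuse_count:   assumed meaning: remaining permitted reuses (0 = none).
    transfer_done: True once the current request/response has completed.
    """
    # Case 1: the peer hung up. Nothing more can arrive, so close now.
    if poll_flags & POLL_HUP:
        return True

    # Case 2: the transfer finished and nobody asked for the socket to be
    # kept around, so fall back to "close when done" as the default.
    if transfer_done and reuse_count == 0:
        return True

    # Otherwise the socket is still in use, or is being kept alive for reuse.
    return False


if __name__ == "__main__":
    # Finished transfer with no reuse requested: should close.
    print(should_close_transport(POLL_READ, reuse_count=0, transfer_done=True))
```

Missing either branch leaves sockets open after the server has gone away, which matches the CLOSE_WAIT pile-up reported earlier in this bug.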
oh baby, check that puppy in. sr=mscott

cc'ing naving so he's in the loop on this as he was investigating many of the
mail problems with the sockets being left open. Nice catch Darin.
applied the patch - now things look Good again :)
*** Bug 71423 has been marked as a duplicate of this bug. ***
The current builds on win98 are not showing the problem originally reported.
This is fixed for that point.
Status: REOPENED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Fix checked in
as far as i can tell, this fixed all the mailnews bugs about open sockets as
well. Nice work.
verified
marking verified per comments
Status: RESOLVED → VERIFIED