Closed Bug 337550 Opened 18 years ago Closed 18 years ago

Network connection dies after browser has been idle

Categories: Camino Graveyard :: General
Type: defect
Hardware: PowerPC
OS: macOS
Priority: Not set
Severity: blocker
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: phiw2
Assignee: Unassigned
Keywords: hang, regression
Attachments: 1 file

User-Agent:       Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en; rv:1.9a1) Gecko/20060510 Camino/1.2+
Build Identifier: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en; rv:1.9a1) Gecko/20060510 Camino/1.2+

With the most recent Trunk builds - using 2006051020 (1.2+) - the browser fails to connect to websites after it has been idle for a moment. It shows 'loading' in the statusbar, but the site is never loaded.

Other symptoms observed: clicking on a link does nothing anymore.

This has happened in two ways:
load page, then go reading your email, back to browser, and it is dead
load page, let it sit there while I was looking at another browser on another computer.

First noticed with my own build (checkout start: Thu May 11 08:33:57 JST 2006), and now with the 2006051020 (1.2+) 'Maya' build.

Suspicion is on bug 326273.
Camino atm, can't check with Firefox - the equivalent FX tinderbox builds simply crash at start-up (bug 337481).

My network connection is alright, as I can connect with other browsers.

Reproducible: Always
This WFM with that same Maya build.  Is it possible that this is just DNS stuff?
It happens with any site, including those loaded from my own dev. server.
Loading the same site, at the same moment nearly, in any other browser works perfectly.
Surprisingly, I haven't been able to reproduce this on OS X 10.3.9.
But it reproduces on two machines running 10.4.6.
It reproduces.

Mac OS X 10.3.9
Camino NB
I see this in today's (200605011-01) trunk nightly on 10.3.9.  Even stuff that doesn't have to hit the network (Bookmarks) just goes "Loading..." forever.

One time the console.log printed this message:
libxpt: bad magic header in input file; found '', expected 'XPCOM\nTypeLib\r\n\032'

It doesn't show up every time, though.

We need to verify the regression range (was yesterday's nightly before?), but since basically the only thing that landed on the trunk yesterday was bug 326273....
Severity: major → critical
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: regression
The last build that works for me was this 'Maya' build: 2006051008 (1.2+)
Ah, I got it to happen in the build I was using yesterday, too; apparently I had never paused long enough when I was doing stuff in the trunk builds yesterday--it seems the length of the pause required to cause this can sometimes be quite short and other times quite long :/

But the build philippe notes as the last working one is the last one before ThreadManager landed.

Darin, mento, any idea what might be causing Camino to "lose" network connectivity?
Just to add, seeing the same thing on Intel.
I saw a weird hang yesterday in my trunk build, is this a hang?
Keywords: hang
It sounds like something is preventing the processing of 'gecko' events.  I'd start by investigating the changes made in widget/src/cocoa/.
Now seeing the same thing on 10.4.6, Camino Version 2006051122 (1.2+)
In debugging bug 337841, I'm seeing cases where we get stuck in the native run loop when we really should be getting called away from it to process Gecko events.  In that case, Camino's UI would continue to run but Cocoafox's would not.  That's consistent with the behavior I'm seeing.
> In debugging bug 337841

Make that bug 337481.  Thanks for looking into this Mark!
I want to say I just ran into that on my Firefox trunk build too. I had to restart my computer before I could connect to any websites again. Maybe this isn't Mac-only?

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20060511 Firefox/3.0a1
similar symptoms:

Bugzilla Bug 272787
After a random amount of time, unable to connect to anything
I think that bug is actually the same as this.

After a while of inactivity, I can't go to any page, even local files won't load.

This is really a smoketest blocker...
Version: unspecified → Trunk
272787 is likely unrelated.  Re comments 12, 14, 15, this most likely shares the same cause as [part of] bug 337481, although I haven't tried debugging this one in isolation yet.  The problem does not appear to be in the platform-specific appshell proper, but is an unfortunate result of the way a performance/stability/cleanliness-of-code enhancement I included in the Mac appshells interacts with an odd condition that seems to be preventing Gecko events from signaling the native loop.  The root cause of the bug is actually most likely cross-platform, but the appshells for other platforms are more tolerant because they don't have the same optimization (or don't need it), and handling any single system event like a mouse-move will unhang the app and let the main thread's Gecko event queue drain.

A VERY rough diagram of the xp runloop, without all of the anti-starvation measures, is:

  while TRUE
    process a gecko event
    if there are no more gecko events
      process a native event and block if none are available
    else
      process a native event and don't block
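
The rough diagram above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not the actual appshell code: the name run_loop, the use of deques as event queues, and the "blocked" trace entry (standing in for a real blocking wait) are all hypothetical.

```python
from collections import deque


def run_loop(gecko_events, native_events, max_iterations=10):
    """Toy model of the cross-platform run loop (hypothetical names).

    Returns a trace of what was processed. A real loop would sleep when it
    blocks; this model records ("blocked", None) and stops instead.
    """
    trace = []
    for _ in range(max_iterations):
        # process a gecko event
        if gecko_events:
            trace.append(("gecko", gecko_events.popleft()))
        # blocking is only allowed when no gecko events remain
        may_block = not gecko_events
        if native_events:
            trace.append(("native", native_events.popleft()))
        elif may_block:
            trace.append(("blocked", None))
            break
    return trace


if __name__ == "__main__":
    for step in run_loop(deque(["g1", "g2"]), deque(["n1"])):
        print(step)
```

In this model the loop only reaches the blocking wait once the Gecko queue is empty, which is why a Gecko event arriving later has to interrupt that wait.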

When allowed to block, the Mac "process a native event" code for Carbon and Cocoa calls right into system routines that run an event loop, so the Mac implementations plus the system internals look like this:

  if can block
    do while running
      get os event from queue blocking until event is available
      dispatch
  else
    get os event from queue, don't block if none is available
    if got an event
      dispatch

This differs from other platforms, which don't have the |do while running| clause and instead block waiting for a single event, dispatch it, and return.
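
A minimal sketch of that difference, again as a toy Python model rather than the real Carbon/Cocoa code. The function names, the WAKE marker (standing in for whatever notification stops the |do while running| loop), and the "still blocked" status are assumptions made for illustration.

```python
from collections import deque

WAKE = "wake"  # hypothetical marker: the signal that Gecko has work pending


def process_native_mac(queue, can_block):
    """Mac-style: when allowed to block, keep dispatching until told to stop."""
    dispatched = []
    if can_block:
        running = True
        while running:
            if not queue:
                # A real loop would sleep here; the toy reports it is stuck.
                return dispatched, "still blocked"
            event = queue.popleft()
            if event == WAKE:
                running = False  # the interruption ends the inner loop
            else:
                dispatched.append(event)
    elif queue:
        dispatched.append(queue.popleft())
    return dispatched, "returned"


def process_native_other(queue, can_block):
    """Other platforms: handle at most one event, then return."""
    if queue:
        return [queue.popleft()], "returned"
    return [], ("still blocked" if can_block else "returned")


if __name__ == "__main__":
    print(process_native_mac(deque(["mouse-move", "key"]), True))
    print(process_native_other(deque(["mouse-move", "key"]), True))
```

With the same two queued events and no wake-up, the Mac-style function consumes both and stays in its loop, while the single-event version returns after the first one and gives the outer loop a chance to look at Gecko events again.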

The design of the new system is such that when a Gecko event occurs, if the system is blocked waiting on an event, it should be interrupted and return control back to Gecko.  On the Mac, that means stopping the |do while running| loop.  This ordinarily works, but apparently, it's sometimes failing.  The failure doesn't seem to be in the platform appshells - it seems that the platform appshells just aren't being notified.  This may be as simple as making certain ops atomic, which is something that Darin and I covered earlier in development, but the affected code has changed slightly now.
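
As an illustration of what "making certain ops atomic" could mean in general, here is a hedged Python sketch of a wake-up flag whose test-and-set is done under a lock, so that exactly one caller is responsible for posting the wake-up and a pending request cannot be lost between the check and the update. The WakeFlag class and its method names are hypothetical; which operations in the real code need to be atomic is not stated in this bug.

```python
import threading


class WakeFlag:
    """Hypothetical wake-up flag with an atomic test-and-set."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = False

    def request_wake(self):
        """Mark a wake-up as pending.

        Returns True only for the caller that flipped the flag, i.e. the one
        that must actually notify the native loop.
        """
        with self._lock:
            was_pending = self._pending
            self._pending = True
            return not was_pending

    def consume(self):
        """Clear the flag; returns True if a wake-up was pending."""
        with self._lock:
            was_pending = self._pending
            self._pending = False
            return was_pending


if __name__ == "__main__":
    flag = WakeFlag()
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(flag.request_wake()))
        for _ in range(8)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(results.count(True), flag.consume(), flag.consume())
```

Without the lock, two threads could both read the flag as clear, or a consumer could clear it between another thread's read and write, which is one way a notification could go missing.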

On non-Mac platforms, there's no |do while running| loop, so a failure to interrupt the call blocked waiting for a native event, while still wrong, doesn't hang all Gecko events on the main thread.  It just takes a single native event to get things flowing again.  This is almost definitely causing a perf regression too (bug 337689?)

Because Camino has a native Cocoa FE and the native event loop is still spinning, Camino appears to be running, but tasks that Gecko handles on Gecko events (like network chatter) won't work.  In the Fox and other apps, the XUL UI depends much more heavily on Gecko, so when you get stuck in the native loop and can't break free to process Gecko events, the app's UI will be more solidly wedged.
Severity: critical → blocker
mProcessingNextNativeEvent may be a problem.  Using RunWasCalled the way we do in the Mac appshells may be a problem.
The patches in bug 337824 fix this bug.
Today's Camino nightly exits with no crash log, after it is idle for a minute or so. Very consistent.

There are many log entries like this in console.log:

2006-05-17 16:00:26.570 Camino[3864] *** _NSAutoreleaseNoPool(): Object 0x6632b00 of class BrowserWindowController autoreleased with no pool in place - just leaking
2006-05-17 16:00:26.570 Camino[3864] *** _NSAutoreleaseNoPool(): Object 0x665cc60 of class TopLevelWindowData autoreleased with no pool in place - just leaking
2006-05-17 16:00:26.570 Camino[3864] *** _NSAutoreleaseNoPool(): Object 0x6656160 of class NSCFString autoreleased with no pool in place - just leaking
Attached is the complete list of messages in console.log that appear when Camino exits.
One bug per bug report, please!

This bug (comment 0) is fixed by the checkin of bug 326273.

Comment 21 sounds like bug 338249.
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Sorry - fixed by checkin of bug 337824.
Depends on: 337824
I just experienced this again in my trunk build, after having used it without any problems for several hours.
Maybe there are still occasional problems getting the browser to leave its blocking wait?  Håkan, since you were running for several hours, I assume you were using my test "stop" patch from bug 338249?
(In reply to comment #25)
> Maybe there are still occasional problems getting the browser to leave its
> blocking wait?  Håkan, since you were running for several hours, I assume you
> were using my test "stop" patch from bug 338249?
> 

Yeah, I think so.