337550 - Network connection dies after browser has been idle

Reporter

Description

•

18 years ago

User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en; rv:1.9a1) Gecko/20060510 Camino/1.2+
Build Identifier: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en; rv:1.9a1) Gecko/20060510 Camino/1.2+

With the most recent trunk builds (using 2006051020 (1.2+)), the browser fails to connect to websites after it has been idle for a moment. It shows 'loading' in the status bar, but the site never loads. Another symptom: clicking on a link no longer does anything.

This has happened in two ways:
- load a page, go read your email, come back to the browser, and it is dead
- load a page and let it sit there while I was looking at another browser on another computer

First noticed with my own build (checkout start: Thu May 11 08:33:57 JST 2006), and now with the 2006051020 (1.2+) 'Maya' build. Suspicion is on bug 326273.

This is Camino at the moment; I can't check with Firefox, because the equivalent FX tinderbox builds simply crash at start-up (bug 337481). My network connection is fine, as I can connect with other browsers.

Reproducible: Always

froodian (Ian Leue)

Comment 1

•

18 years ago

This WFM with that same Maya build. Is it possible that this is just DNS stuff?

philippe (part-time)

Reporter

Comment 2

•

18 years ago

It happens with any site, including those loaded from my own dev. server. Loading the same site, at the same moment nearly, in any other browser works perfectly.

philippe (part-time)

Reporter

Comment 3

•

18 years ago

Surprisingly, I haven't been able to reproduce this on OS X 10.3.9. But it reproduces on two machines running 10.4.6.

Hiro

Comment 4

•

18 years ago

It reproduces. Mac OS X 10.3.9 Camino NB

Smokey Ardisson (offline for a while; not following bugs - do not email)

Comment 5

•

18 years ago

I see this in today's (200605011-01) trunk nightly on 10.3.9. Even stuff that doesn't have to hit the network (Bookmarks) just goes "Loading..." forever. One time the console.log printed this message:

libxpt: bad magic header in input file; found '', expected 'XPCOM\nTypeLib\r\n\032'

It doesn't show up every time, though. We need to verify the regression ranges (was yesterday's nightly before , but since basically the only thing that landed on the trunk yesterday was bug 326273....

Severity: major → critical

Status: UNCONFIRMED → NEW

Ever confirmed: true

Keywords: regression

philippe (part-time)

Reporter

Comment 6

•

18 years ago

The last build that works for me was this 'Maya' build: 2006051008 (1.2+)

Smokey Ardisson (offline for a while; not following bugs - do not email)

Comment 7

•

18 years ago

Ah, I got it to happen in the build I was using yesterday, too; apparently I had never paused long enough when I was doing stuff in the trunk builds yesterday--it seems the length of the pause required to cause this can sometimes be quite short and other times quite long :/ But the build philippe notes as the last working one is the last one before ThreadManager landed. Darin, mento, any idea what might be causing Camino to "lose" network connectivity?

Blocks: nsIThreadManager

Joel Craig

Comment 8

•

18 years ago

Just to add, seeing the same thing on Intel.

Håkan Waara

Comment 9

•

18 years ago

I saw a weird hang yesterday in my trunk build, is this a hang?

Keywords: hang

Darin Fisher

Comment 10

•

18 years ago

It sounds like something is preventing the processing of 'gecko' events. I'd start by investigating the changes made in widget/src/cocoa/.

Warren TenBrook

Comment 11

•

18 years ago

Now seeing the same thing on 10.4.6, Camino Version 2006051122 (1.2+)

Mark Mentovai

Comment 12

•

18 years ago

In debugging bug 337841, I'm seeing cases where we get stuck in the native run loop when we really should be getting called away from it to process Gecko events. In that case, Camino's UI would continue to run but Cocoafox's would not. That's consistent with the behavior I'm seeing.

Darin Fisher

Comment 13

•

18 years ago

> In debugging bug 337841

Make that bug 337481. Thanks for looking into this, Mark!

Ryan VanderMeulen [:RyanVM]

Comment 14

•

18 years ago

I want to say I just ran into that on my Firefox trunk build too. I had to restart my computer before I could connect to any websites again. Maybe this isn't Mac-only?

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20060511 Firefox/3.0a1

Worcester12345

Comment 15

•

18 years ago

Similar symptoms: bug 272787, "After a random amount of time, unable to connect to anything"

Håkan Waara

Comment 16

•

18 years ago

I think that bug is actually the same as this. After a while of inactivity, I can't go to any page, even local files won't load. This is really a smoketest blocker...

OstGote!

Updated

•

18 years ago

Version: unspecified → Trunk

Mark Mentovai

Comment 17

•

18 years ago

272787 is likely unrelated.

Re comments 12, 14, 15: this most likely shares the same cause as [part of] bug 337481, although I haven't tried debugging this one in isolation yet. The problem does not appear to be in the platform-specific appshell proper, but an unfortunate result of the way a performance/stability/cleanliness-of-code enhancement I included in the Mac appshells interacts with an odd condition that seems to be preventing Gecko events from signaling the native loop. The root cause of the bug is actually most likely cross-platform, but the appshells for other platforms are more tolerant because they don't have the same optimization/don't need it, and handling any single system event like a mouse-move will unhang the app and let the main thread's gecko event queue drain.

A VERY rough diagram of the xp runloop, without all of the anti-starvation measures, is:

  while TRUE
    process a gecko event
    if there are no more gecko events
      process a native event and block if none are available
    else
      process a native event and don't block

When allowed to block, the Mac "process a native event" code for Carbon and Cocoa calls right into system routines that run an event loop, so the Mac implementations plus the system internals look like this:

  if can block
    do while running
      get os event from queue, blocking until event is available
      dispatch
  else
    get os event from queue, don't block if none is available
    if got an event
      dispatch

This differs from other platforms, which don't have the |do while running| clause and instead block waiting for a single event, dispatch it, and return.

The design of the new system is such that when a Gecko event occurs, if the system is blocked waiting on an event, it should be interrupted and return control back to Gecko. On the Mac, that means stopping the |do while running| loop. This ordinarily works, but apparently, it's sometimes failing.

The failure doesn't seem to be in the platform appshells - it seems that the platform appshells just aren't being notified. This may be as simple as making certain ops atomic, which is something that Darin and I covered earlier in development, but the affected code has changed slightly now.

On non-Mac platforms, there's no |do while running| loop, so a failure to interrupt the call blocked waiting for a native event, while still wrong, doesn't hang all Gecko events on the main thread. It just takes a single native event to get things flowing again. This is almost definitely causing a perf regression too (bug 337689?).

Because Camino has a native Cocoa FE and the native event loop is still spinning, Camino appears to be running, but tasks that Gecko handles on Gecko events (like network chatter) won't work. In the Fox and other apps, the XUL UI depends much more heavily on Gecko, so when you get stuck in the native loop and can't break free to process Gecko events, the app's UI will be more solidly wedged.
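For readers following along, here is a toy Python model of the two blocking styles described in comment 17. It is only a sketch of the reasoning, not the appshell code: every name in it (wait_single_event, wait_nested_loop, stop_requested, simulate) is made up for illustration, and the "lost" interrupt is modeled simply as a stop request that never arrives.

```python
from collections import deque


def wait_single_event(native_events):
    """Non-Mac style: block for one native event, dispatch it, return.

    Returns True when control goes back to the Gecko loop.
    """
    if not native_events:
        return False          # still blocked, nothing has arrived yet
    native_events.popleft()   # "dispatch" the event
    return True


def wait_nested_loop(native_events, stop_requested):
    """Mac style: dispatch native events until the loop is asked to stop.

    stop_requested is a hypothetical callable standing in for whatever
    signal interrupts the |do while running| loop.
    """
    while native_events:
        native_events.popleft()   # "dispatch" the event
        if stop_requested():
            return True           # interrupted: back to the Gecko loop
    return False                  # simulated events ran out; still inside


def drain_gecko_if_returned(returned, gecko_events):
    """Process every pending Gecko event once the native wait returns."""
    if returned:
        gecko_events.clear()
    return len(gecko_events) == 0


def simulate(style, wakeup_delivered):
    """Return (gecko queue drained?, native events left unprocessed)."""
    native = deque(["mouse-move", "mouse-move", "mouse-move"])
    gecko = deque(["network-data"])   # posted while the loop was blocked
    if style == "single":
        returned = wait_single_event(native)
    else:
        returned = wait_nested_loop(native, lambda: wakeup_delivered)
    return drain_gecko_if_returned(returned, gecko), len(native)
```

With the wakeup lost, simulate("single", False) drains the Gecko queue after one mouse-move, while simulate("nested", False) consumes every native event and never gets back to Gecko. That matches the description above: the native UI keeps spinning but network activity stalls.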

Håkan Waara

Updated

•

18 years ago

Severity: critical → blocker

Mark Mentovai

Comment 18

•

18 years ago

mProcessingNextNativeEvent may be a problem. Using RunWasCalled the way we do in the Mac appshells may be a problem.
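To illustrate why a non-atomic "am I blocked?" flag can be a problem in general, here is a small Python replay of a classic lost-wakeup interleaving. This is an assumption-laden sketch, not the actual mProcessingNextNativeEvent or RunWasCalled logic: the step names, the blocked flag, and the run helper are all hypothetical.

```python
from collections import deque


def run(order):
    """Replay one interleaving and report whether the loop ends up stuck."""
    gecko_events = deque()
    saw_empty = False
    blocked = False      # plain flag with no lock around it (hypothetical)
    wakeups = 0
    for step in order:
        if step == "main:check":
            saw_empty = not gecko_events
        elif step == "main:block":
            blocked = saw_empty        # acts on a possibly stale check
        elif step == "poster:enqueue":
            gecko_events.append("network-data")
        elif step == "poster:signal":
            if blocked:                # only wakes a loop it believes is blocked
                wakeups += 1
    # Stuck: blocked in the native wait, an event is pending, nobody woke us.
    return blocked and bool(gecko_events) and wakeups == 0


# The poster runs between the main thread's check and its blocking wait.
LOST_WAKEUP = ["main:check", "poster:enqueue", "poster:signal", "main:block"]
# Check and block happen back to back, as if they were one atomic step.
SAFE = ["main:check", "main:block", "poster:enqueue", "poster:signal"]
```

Here run(LOST_WAKEUP) reports a stuck loop and run(SAFE) does not. That is the kind of gap that "making certain ops atomic" would close, if the real cause turns out to be of this shape.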

Mark Mentovai

Comment 19

•

18 years ago

The patches in bug 337824 fix this bug.

Mark Knopper

Comment 20

•

18 years ago

Today's Camino nightly exits with no crash log, after it is idle for a minute or so. Very consistent. There are many log messages like this in console.log:

2006-05-17 16:00:26.570 Camino[3864] *** _NSAutoreleaseNoPool(): Object 0x6632b00 of class BrowserWindowController autoreleased with no pool in place - just leaking
2006-05-17 16:00:26.570 Camino[3864] *** _NSAutoreleaseNoPool(): Object 0x665cc60 of class TopLevelWindowData autoreleased with no pool in place - just leaking
2006-05-17 16:00:26.570 Camino[3864] *** _NSAutoreleaseNoPool(): Object 0x6656160 of class NSCFString autoreleased with no pool in place - just leaking

Mark Knopper

Comment 21

•

18 years ago

Attached file: Messages in console log when Camino exits

Attached is the complete list of messages in console.log that appear when Camino exits.

Mark Mentovai

Comment 22

•

18 years ago

One bug per bug report, please! This bug (comment 0) is fixed by the checkin of bug 326273. Comment 21 sounds like bug 338249.

Status: NEW → RESOLVED

Closed: 18 years ago

Resolution: --- → FIXED

Mark Mentovai

Comment 23

•

18 years ago

Sorry - fixed by checkin of bug 337824.

Depends on: 337824

Håkan Waara

Comment 24

•

18 years ago

I just experienced this again in my trunk build, after having used it without any problems for several hours.

Mark Mentovai

Comment 25

•

18 years ago

Maybe there are still occasional problems getting the browser to leave its blocking wait? Håkan, since you were running for several hours, I assume you were using my test "stop" patch from bug 338249?

Håkan Waara

Comment 26

•

18 years ago

(In reply to comment #25)
> Maybe there are still occasional problems getting the browser to leave its
> blocking wait? Håkan, since you were running for several hours, I assume you
> were using my test "stop" patch from bug 338249?

Yeah, I think so.