Bug 627084 - UI becomes sluggish/does not redraw/repaint unless mouse is moved after long browsing sessions
Status: RESOLVED FIXED
Product: Core
Classification: Components
Component: Widget: Win32
Version: Trunk
Platform: x86 Windows 7
Importance: -- normal with 3 votes
Target Milestone: mozilla8
Assigned To: Jim Mathies [:jimm]
Duplicates: 602019 628354 634163 678124
Reported: 2011-01-19 09:56 PST by The 8472
Modified: 2011-10-04 15:06 PDT
CC List: 35 users
.x+


Attachments
2 threaddumps taken after several hours of uptime (122.80 KB, text/plain)
2011-02-15 18:15 PST, The 8472
threaddump taken after upload applet froze (68.53 KB, text/plain)
2011-02-16 02:05 PST, The 8472
FF Nightly 7.0a1 with Processor Pinning (49.19 KB, image/png)
2011-06-30 20:40 PDT, Rob
native callback timeout patch v.1 (2.82 KB, patch)
2011-08-01 05:03 PDT, Jim Mathies [:jimm]
native callback timeout patch v.2 (3.32 KB, patch)
2011-08-08 10:02 PDT, Jim Mathies [:jimm]
roc: review+

Description The 8472 2011-01-19 09:56:12 PST
User-Agent:       Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b10pre) Gecko/20110118 Firefox/4.0b10pre
Build Identifier: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b10pre) Gecko/20110118 Firefox/4.0b10pre

My standard browsing session has 300+ tabs and firefox generally stays open 24 hours. After several hours of browsing the UI starts to become somewhat unresponsive. The symptoms are the following:
- autoscrolling with the middle mouse button becomes stuttery
- animated gifs are not animated properly (that includes the spinning tab load icons)
- loading pages seems stuck since the page is not re-drawn

All these issues only appear when i don't generate any input. As long as i'm moving my mouse or pressing keys it redraws the pages in realtime. As soon as i stop interacting with the UI it stops redrawing and only refreshes it every 2-3 seconds.

I experienced this issue with and without flash and with hardware acceleration on and off.

Reproducible: Always
Comment 1 The 8472 2011-02-15 18:15:57 PST
Created attachment 512679
2 threaddumps taken after several hours of uptime

It is extremely difficult to capture this behavior since moving the mouse or creating other UI events can dissipate it temporarily and since it appears in bursts.
Note that i'm seeing an IPC thread there despite no plugin-container.exe process existing at that time.
Comment 2 The 8472 2011-02-16 01:36:11 PST
Another indicator that this is IPC-related is that i had a youtube upload running overnight with their java uploader applet. Obviously the PC was unattended and the upload seemed to freeze up at some point during the night. When i maximized the FF window it started again.
Although I wasn't able to capture a stacktrace of that event.
Comment 3 The 8472 2011-02-16 02:05:45 PST
Created attachment 512751
threaddump taken after upload applet froze

I was able to reproduce the issue with the same browser session and the same applet, here's the stacktrace.
Comment 4 The 8472 2011-02-21 19:49:12 PST
I was able to reproduce the issue on a relatively clean profile with no addons and all plugins disabled. The only thing i did on that profile was browsing and submitting posts/images on 4chan and after a dozen posts or so the UI started to become sluggish again.

So could this be
a) something related to POSTing
b) the file selection dialog
c) the bfcache on image-heavy sites
d) ????

Anyway, i can reproduce the issue reasonably well now, but i've no idea how to isolate it any further.
Comment 5 Vladimir Vukicevic [:vlad] [:vladv] 2011-02-22 16:50:58 PST
I can confirm this -- just happened to me, session wasn't particularly long.

It is similar to previous bugs where canvas animations or flash wouldn't update until mouse movement happened.

Marking this b? since we'll want to keep an eye on it -- it should probably be fixed in .x.  The8472 says he can reproduce it reliably, but I don't have any good ideas on how to actually figure out what's going on.
Comment 6 Vladimir Vukicevic [:vlad] [:vladv] 2011-02-22 17:37:47 PST
pulled this into the debugger to poke around.  One weird thing is that in nsAppShell::ProcessNextEvent, I seem to loop going through PeekMessageW returning false, then calling WaitMessage -- which is immediately returning, indicating that a message should be waiting.  PeekMessageW returns false. And this spin continues for a while, until it eventually somehow breaks out.  During this time, the appshell has a native event pending bit set.

So, we seem to have the state where WaitMessage thinks there's a message waiting, but PeekMessage disagrees.  That would certainly cause the problem that I'm seeing -- in particular, PeekMessage should be returning a message (because there is one -- the native event callback, at the very least!).
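The disagreement described in comment 6 could be detected with a small counter. The following is only a portable sketch: `SpinDetector`, its threshold, and the two boolean inputs are hypothetical and are not part of Gecko or the Win32 API. The caller is assumed to pass in what its own PeekMessageW and WaitMessage calls reported on each loop iteration.

```cpp
#include <cstdint>

// Hypothetical helper (not Gecko code): counts consecutive event-loop
// iterations in which the wait step returned immediately, implying a
// message was ready, while the peek step found no message.
class SpinDetector {
public:
  explicit SpinDetector(uint32_t threshold) : mThreshold(threshold) {}

  // Call once per loop iteration.
  //   peekFoundMessage:   result of a PeekMessageW-style call.
  //   waitReturnedAtOnce: true if a WaitMessage-style call did not block.
  // Returns true once the disagreement has lasted `threshold` iterations.
  bool Record(bool peekFoundMessage, bool waitReturnedAtOnce) {
    if (!peekFoundMessage && waitReturnedAtOnce) {
      ++mStreak;    // the two calls disagree again
    } else {
      mStreak = 0;  // any normal iteration resets the streak
    }
    return mStreak >= mThreshold;
  }

  uint32_t Streak() const { return mStreak; }

private:
  uint32_t mThreshold;
  uint32_t mStreak = 0;
};
```

A streak that keeps growing would indicate the busy loop seen in the debugger; a single mismatch is probably harmless, which is why the sketch uses a threshold rather than flagging the first occurrence.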
Comment 7 Mike Beltzner [:beltzner, not reading bugmail] 2011-02-22 21:18:17 PST
Moving to .x for now; Vlad, please renominate if this becomes something you think we should fix before release.
Comment 8 Vladimir Vukicevic [:vlad] [:vladv] 2011-02-22 22:15:33 PST
.x is fine -- I don't think we have nearly enough info about what's happening to block the release on it.  Moving it to Core Widget, though.

There are lots of variants of this bug -- plugins not repainting, video not repainting, canvas animations not repainting, animated gifs not repainting are all symptoms.  I think whatever's going on here is the underlying cause.
Comment 9 The 8472 2011-02-23 04:47:37 PST
Just to note, this does not just affect rendering. Flash completely locks up when you try to interact with it while the browser is in that state. E.g. trying to move the volume slider on youtube leads to the entire UI freezing for several minutes.
Keyboard input also gets dropped when the UI is in a frozen state.

But yeah, all those things most likely are just symptoms.
Comment 10 :Ehsan Akhgari (busy, don't ask for review please) 2011-02-23 14:51:24 PST
If any of you guys can reproduce this reliably, can you try to pinpoint a regression range?  Also, can you post a set of steps to reproduce that you're using?
Comment 11 :Ehsan Akhgari (busy, don't ask for review please) 2011-02-23 15:07:06 PST
(In reply to comment #6)
> pulled this into the debugger to poke around.  One weird thing is that in
> nsAppShell::ProcessNextEvent, I seem to loop going through PeekMessageW
> returning false, then calling WaitMessage -- which is immediately returning,
> indicating that a message should be waiting.  PeekMessageW returns false. And
> this spin continues for a while, until it eventually somehow breaks out. 
> During this time, the appshell has a native event pending bit set.

Which PeekMessageW returns false?  The ones in PeekUIMessage, or the one in nsAppShell::ProcessNextNativeEvent?

Also, what is the return value for WaitMessage?  If it's false, what does GetLastError say?

I think one case where this could be happening is when PeekMessageW dispatches a sent message.  I'm not exactly sure if it would return false in that case (if there is no window or thread message posted) or not.  Note that sent messages do not go through the message queue (as they are processed synchronously).
Comment 12 The 8472 2011-02-23 15:51:42 PST
(In reply to comment #10)
I can reproduce it reliably, but my procedure is too crude and cumbersome to do it for regression testing. It basically consists of browsing a while on 4chan and posting about 20-40 posts with image uploads.
I have no clue which part of my standard-browsing-behavior does cause it. So we need to narrow that part down before we can do regression testing.

That is on win7, hw acceleration enabled and all addons/plugins disabled on a relatively clean profile.
Comment 13 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-02-23 22:10:35 PST
FWIW, I wasn't able to reproduce this after doing a variety of canvas-y/plugin-y/video-y/webgl-y things for a while.  In case it matters, my laptop runs

  Graphics

        Adapter Description
        Intel(R) HD Graphics

        Vendor ID
        8086

        Device ID
        0046

        Adapter RAM
        Unknown

        Adapter Drivers
        igdumdx32 igd10umd32

        Driver Version
        8.15.10.2279

        Driver Date
        1-7-2011

        Direct2D Enabled
        true

        DirectWrite Enabled
        true (6.1.7600.16699, font cache n/a)

        WebGL Renderer
        Google Inc. -- ANGLE -- OpenGL ES 2.0 (ANGLE 0.0.0.541)

        GPU Accelerated Windows
        1/1 Direct3D 10
Comment 14 The 8472 2011-03-04 06:08:15 PST
(In reply to comment #13)
> I wasn't able to reproduce this after doing a variety of
> canvas-y/plugin-y/video-y/webgl-y things for a while.

Those things being affected presumably are the symptoms of this bug, not the cause, as my way of reproducing it involves none of these.
Comment 15 kolubinowicki 2011-03-16 09:58:59 PDT
This bug could be related with:
Bug 634163 - browser will not load pages or page content unless mouse cursor constantly moving.
Comment 16 The 8472 2011-03-16 11:38:13 PDT
Yes, the symptoms look quite similar.
Comment 17 Timothy Nikkel (:tnikkel) 2011-03-17 01:15:14 PDT
I wonder if bug 575515 could have caused this. Checking if this bug happens in 2010-11-29 and 2010-11-30 nightlies would determine if that was the case.
Comment 18 nicholas.bruno 2011-03-17 13:32:37 PDT
*** Bug 628354 has been marked as a duplicate of this bug. ***
Comment 19 kolubinowicki 2011-03-21 08:25:51 PDT
Duplicate ?:
Bug 612087 - A strange bug. I can't describe this issue in English clearly, please watch my demo video
Comment 20 :Felipe Gomes (needinfo me!) 2011-03-21 13:26:13 PDT
(In reply to comment #17)
> I wonder if bug 575515 could have caused this. Checking if this bug happens in
> 2010-11-29 and 2010-11-30 nightlies would determine if that was the case.

With the screencast on bug 612087, and other dup bugs, this also happens on 3.6, so shouldn't be related to bug 575515 (unless there is more than one issue). Most of the reports are from windows 7 64 bits (with a few exceptions using vista 32 bits).
There also appears to be a correlation with the file dialog being invoked.
Comment 21 Timothy Nikkel (:tnikkel) 2011-03-21 13:57:34 PDT
Bug 612087 is slightly different, it seems to only require the mouse cursor over the window for animations to happen, the other bugs require it to be moving.
Comment 22 The 8472 2011-03-21 20:30:54 PDT
(In reply to comment #17)
> I wonder if bug 575515 could have caused this. Checking if this bug happens in
> 2010-11-29 and 2010-11-30 nightlies would determine if that was the case.
It took me several hours, but i was able to reproduce the issue on
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b8pre) Gecko/20101128 Firefox/4.0b8pre

So i guess that wasn't our culprit.

(In reply to comment #20)
> There also appears to be a correlation with the file dialog being invoked.
Yeah, my "testing procedure", if you can call it that, includes POSTing files.

I've also seen a bunch of windows thumbnail-caches loaded into the process even after the dialogs were closed. I found that a while back while looking at the virtual memory usage, but it was only a fixed amount (about 10), so i didn't think much about it. Not sure if that's of any relevance.
Comment 23 The 8472 2011-03-30 04:05:16 PDT
I might have found a piece to the puzzle. After trying to find a more reliable way to reproduce this issue i found that most cases in which it occurred had over 2GB virtual memory footprint. I suspect that those cases that had less than 2GB had at least addresses > 2GB at some point.

This would also explain why we're mostly seeing this on 64bit systems and only a few 32bit ones. To go beyond 2G worth of addresses on 32bit windows one has to run windows in /3G mode.

So, it's possible that something somewhere does pointer-magic by abusing the most significant bit and this somehow breaks event handling.

There are a lot of buts in this since I'm not sure that reaching 2GB alone does the deed, some other factors probably play into this too.
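To make the kind of "pointer-magic" speculated about in comment 23 concrete, here is a purely illustrative sketch. Nothing in this bug shows that Gecko or Windows does this; the names `kTagBit`, `IsTagged` and `Untag` are hypothetical, and addresses are modelled as plain 32-bit integers so the example stays portable.

```cpp
#include <cstdint>

// Hypothetical scheme: code assumes every real address is below 2 GB and
// reuses the most significant bit of a 32-bit address as a flag.
constexpr uint32_t kTagBit = 0x80000000u;

// Marks an address as "tagged" by setting the top bit.
uint32_t Tag(uint32_t address) { return address | kTagBit; }

// Reports whether the top bit is set.
bool IsTagged(uint32_t value) { return (value & kTagBit) != 0; }

// Strips the flag to recover the address. This silently corrupts any
// genuine address that already had the top bit set.
uint32_t Untag(uint32_t value) { return value & ~kTagBit; }
```

Under this assumption a low address such as 0x10000000 round-trips correctly, while an untagged address such as 0x90000000 is misread as tagged and comes back from `Untag` as 0x10000000. That is the general failure mode that would only appear once a process touches addresses above 2 GB.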
Comment 24 Wesley Crossman 2011-03-30 17:48:08 PDT
I'm running Firefox on a 32-bit netbook, and this happens to me constantly. The stats are 1.66 GHz Atom, 2 GB RAM, 2.7 GB commit charge (at the moment), Windows 7. I do believe the problem could be related to my using the file upload dialog.


Adapter Description: Intel(R) Graphics Media Accelerator 3150
Vendor ID: 8086
Device ID: a011
Adapter Drivers: igdumdx32
Driver Version: 8.14.10.2117
Driver Date: 4-19-2010
Direct2D Enabled: false
DirectWrite Enabled: true (6.1.7600.20905, font cache n/a)
WebGL Renderer: Google Inc. -- ANGLE -- OpenGL ES 2.0 (ANGLE 0.0.0.541)
GPU Accelerated Windows: 0/2
Comment 25 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-03-30 20:33:04 PDT
I think what we need right now is just a complete stack trace using WinDbg or Visual Studio. Preferably with symbols loaded for the Windows system libraries.
Comment 26 Vladimir Vukicevic [:vlad] [:vladv] 2011-05-06 15:00:55 PDT
(In reply to comment #25)
> I think what we need right now is just a complete stack trace using WinDbg
> or Visual Studio. Preferably with symbols loaded for the Windows system
> libraries.

Stack trace of what/where?  Not sure what stack would be useful if the situation is what I found in comment #6.  Would be good to verify this somehow though, perhaps even with some code that detects the PeekMessage/WaitMessage fail.
Comment 27 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-05-09 03:27:24 PDT
I had been working with 8472 on IRC and it looked like we were in a deeply nested event loop, as if the entire browser was running inside a Windows modal dialog loop or something. But we couldn't tell for sure because his stack traces were bogus. So in fact any complete stack of the main thread while the browser is in the bad state would do.
Comment 28 The 8472 2011-05-09 03:46:42 PDT
I gathered that the stack traces i attached are not complete. Could you tell me which command would get you the data you need?
http://windbg.info/doc/1-common-cmds.html#15_call_stack
Comment 29 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-05-09 05:19:02 PDT
Hmm, those do look complete. But I remember the dumps we were looking at on IRC were less complete.

I suspect Vlad is right and we need to add instrumentation code to check for the case where PeekMessageW returns no message but WaitMessage returns immediately.
Comment 30 Jim Mathies [:jimm] 2011-05-24 13:25:25 PDT
*** Bug 634163 has been marked as a duplicate of this bug. ***
Comment 31 Jim Mathies [:jimm] 2011-05-24 13:29:04 PDT
There's a good chance bug 641705 will fix this in Fx 5. We were dropping gecko event messages in the plugin event processing code.
Comment 32 The 8472 2011-05-27 11:46:21 PDT
Yes, after several days of uptime following my usual browsing patterns that used to cause the issue the browser remained responsive/refreshed the UI properly (apart from somewhat inflated GC/CC times).

Marking as fixed.
Comment 33 The 8472 2011-06-03 17:40:30 PDT
Aaannd... it is back:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0a1) Gecko/20110602 Firefox/7.0a1

After two days of uptime for this session I'm seeing exactly the same behavior as before. Flash completely locks up, the UI only refreshes when the mouse is moved etc. etc.
Comment 34 bensteferzz 2011-06-08 10:37:27 PDT
I am also dealing with this problem. It is *incredibly* annoying and should be fixed immediately.

My computer:
windows 7 x64
4gb ram
intel core i3
intel hd graphics


Here are 2 youtube videos showing what the problem looks like:
http://www.youtube.com/watch?v=zLlNv7ZwANU
http://www.youtube.com/watch?v=yBT5qc1P_hU
Comment 35 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-06-08 20:40:29 PDT
(In reply to comment #33)
> Aaannd... it is back:
> Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0a1) Gecko/20110602 Firefox/7.0a1
> 
> After two days of uptime for this session I'm seeing exactly the same
> behavior as before. Flash completely locks up, the UI only refreshes when
> the mouse is moved etc. etc.

Can you narrow down the regression to a particular build when it started happening?
Comment 36 The 8472 2011-06-09 14:57:22 PDT
(In reply to comment #35)
> Can you narrow down the regression to a particular build when it started
> happening?

No, sorry. I'm not even 100% certain that it was gone for sure. I mean I intentionally did not restart FF for several days after the patch landed and also used it regularly and it seemed fine. But the occurrence of this issue is very erratic and can take between an hour and a day of firefox uptime to happen. Although it's most likely not related to uptime but to active use. So it's not really deterministic. I can only reproduce it by waiting until it happens.


Later this month or next month I'll get faster internet. Then I'll get a windows 7 virtual machine image from microsoft and see if i can reproduce it inside the VM and freeze it. That's a big if, considering how spurious that bug seems to be.
Comment 37 Ronan Burke 2011-06-22 03:45:05 PDT
I've reliably managed to reproduce this bug, I'm using 4.0.1 and it was happening back in 3.5 as well. 

It happens every time I try to upload pictures on my google blogger account. Maybe just using the google blogger account.

It happens at other times as well apparently randomly but I'm guaranteed it happens with blogger. I'm guessing some sort of script they use.

Really really really annoying, to the point of being serious.

I have fireftp 1.99.4 installed as well as java console 6.0.22 and 6.0.26, with AVG safesearch disabled at 9.0.0.872
Comment 38 Jim Mathies [:jimm] 2011-06-22 05:28:10 PDT
(In reply to comment #37)
> I've reliably managed to reproduce this bug, I'm using 4.0.1 and it was
> happening back in 3.5 as well. 
> 
> It happens every time I try to upload pictures on my google blogger account.
> Maybe just using the google blogger account.
> 
> It happens at other times as well apparently randomly but I'm guaranteed it
> happens with blogger. I'm guessing some sort of script they use.
> 
> Really really really annoying, to the point of being serious.
> 
> I have fireftp 1.99.4 installed as well as java console 6.0.22 and 6.0.26,
> with AVG safesearch disabled at 9.0.0.872

Ronan, can you confirm the problem is still present in 5.0?

http://www.mozilla.com/en-US/firefox/new/
Comment 39 Ronan Burke 2011-06-27 19:13:18 PDT
Hello yes the problem is still present. Happens on image upload only as far as I can see.
Comment 40 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-06-28 04:31:05 PDT
Matthew Gregan had this happen on his machine and did some debugging with my help.

At one point the main thread event queue had 249 pending XPCOM events. The appshell's mNativeEventPending flag was 1, but these events were not being processed. mEventloopNestingLevel was 1. Setting mNativeEventPending to 0 made his session unwedge permanently.

It seems very likely that the problem is simply that nsBaseAppShell::OnDispatchedEvent runs, changes mNativeEventPending from 0 to 1, and calls nsAppShell::ScheduleNativeEventCallback which does the PostMessage of the special appshell message, but then somehow that one message is lost, never received by nsAppShell::EventWindowProc. Once that message is lost, we're permanently stuck in a state where mNativeEventPending is 1 and ScheduleNativeEventCallback is never called again. The fact that setting mNativeEventPending to 0 fixed the problem suggests that nothing else bad is happening.
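The wedged state described in comment 40 can be reproduced in a small model. This is a simplified sketch, not the real nsBaseAppShell code: `AppShellModel`, `dropNextPost`, `DispatchMany` and `PumpNative` are hypothetical names, and the Windows message queue is reduced to a `std::deque`.

```cpp
#include <cstddef>
#include <deque>

// Simplified model (not Gecko code) of the pending-flag handshake.
struct AppShellModel {
  bool nativeEventPending = false;  // stands in for mNativeEventPending
  std::size_t xpcomEvents = 0;      // events queued but not yet run
  std::deque<int> nativeQueue;      // stands in for the native message queue
  bool dropNextPost = false;        // test knob: lose the next posted message

  // An event was dispatched to the main thread.
  void OnDispatchedEvent() {
    ++xpcomEvents;
    if (nativeEventPending) return;  // a callback is assumed to be in flight
    nativeEventPending = true;
    if (dropNextPost) {              // the posted message never arrives
      dropNextPost = false;
      return;
    }
    nativeQueue.push_back(1);        // normal case: message is queued
  }

  // Convenience wrapper for dispatching several events.
  void DispatchMany(int count) {
    for (int i = 0; i < count; ++i) OnDispatchedEvent();
  }

  // The native loop delivers queued messages; each one clears the flag
  // and runs every pending event.
  void PumpNative() {
    while (!nativeQueue.empty()) {
      nativeQueue.pop_front();
      nativeEventPending = false;
      xpcomEvents = 0;
    }
  }
};
```

In this model, losing a single message leaves the flag set while the queue is empty, so later dispatches pile up without scheduling anything. Clearing the flag by hand, as was done in the debugger, lets the next dispatch post a fresh message and drain the backlog.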
Comment 41 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-06-28 04:32:52 PDT
Matthew also reported that not long before his browser got wedged, he killed a runaway java.exe process, which supports a correlation between this bug and plugins or perhaps java specifically.
Comment 42 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-06-28 04:56:43 PDT
jimm, bent: I'm a bit scared by the logic of ProcessOrDeferMessage in WindowsMessageLoop. It seems to me that the deferred messages are processed next time WH_GETMESSAGE or WH_CALLWNDPROC fire. But these only fire when the event loop actually gets an event:

GetMsgProc:
> The system calls this function whenever the GetMessage or PeekMessage function
> has retrieved a message from an application message queue. Before returning the
> retrieved message to the caller, the system passes the message to the hook
> procedure.

CallWndProc:
> The system calls this function before calling the window procedure to process a
> message sent to the thread.

So while no message is sent to a main thread window, these hooks won't fire and deferred messages won't be delivered. Am I right?

This doesn't really explain the problem though. If that was the whole problem then moving the mouse over the window or whatever would cause PeekMessage to succeed in nsAppShell::ProcessNextNativeEvent, and then the GetMsgProc hook would fire, we'd run the deferred message and nsBaseAppShell::NativeEventCallback would get called, setting mNativeEventPending to 0 and we'd be unwedged.
Comment 43 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-06-28 05:34:06 PDT
I suppose it's possible that some plugin or some other software installs a message hook that causes our message to be dropped on the floor. It's also possible that a badly-written modal event loop could drop messages.

Given that, maybe it's worth having some kind of fallback mechanism so that if the appshell message is lost, we can still recover? Maybe grab a timestamp when posting the appshell message, and then have ProcessNextNativeEvent dispatch another one if more than a second has elapsed since we first posted the message? Then moving the mouse over the browser window would automatically unwedge it.
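A minimal sketch of the timestamp fallback proposed in comment 43, assuming a one second timeout and a caller-supplied millisecond clock. `NativeCallbackScheduler` and its method names are hypothetical; this is not the patch that was later attached to this bug.

```cpp
#include <cstdint>

// Hypothetical model of the proposed failsafe (not the actual patch).
class NativeCallbackScheduler {
public:
  static constexpr uint64_t kTimeoutMs = 1000;  // "more than a second"

  // An event was dispatched. Returns true if the caller should post the
  // native callback message now.
  bool OnDispatchedEvent(uint64_t nowMs) {
    if (mPending) return false;  // a message is assumed to be in flight
    mPending = true;
    mPostedAtMs = nowMs;
    return true;
  }

  // Called whenever the native loop processes any message. Returns true if
  // the in-flight message looks lost and should be posted again.
  bool ShouldRepost(uint64_t nowMs) {
    if (!mPending || nowMs - mPostedAtMs <= kTimeoutMs) return false;
    mPostedAtMs = nowMs;  // restart the timer for the re-posted message
    return true;
  }

  // The native callback message finally arrived.
  void OnNativeCallback() { mPending = false; }

  bool Pending() const { return mPending; }

private:
  bool mPending = false;
  uint64_t mPostedAtMs = 0;
};
```

Because `ShouldRepost` only runs when some other native message is processed, recovery in this sketch still depends on input such as mouse movement, which matches the behaviour the comment expects.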
Comment 44 Jim Mathies [:jimm] 2011-06-28 09:59:20 PDT
(In reply to comment #43)
> I suppose it's possible that some plugin or some other software installs a
> message hook that causes our message to be dropped on the floor. It's also
> possible that a badly-written modal event loop could drop messages.
> 
> Given that, maybe it's worth having some kind of fallback mechanism so that
> if the appshell message is lost, we can still recover? Maybe grab a
> timestamp when posting the appshell message, and then have
> ProcessNextNativeEvent dispatch another one if more than a second has
> elapsed since we first posted the message? Then moving the mouse over the
> browser window would automatically unwedge it.

Seems reasonable, although I would love to know what is leaking these events. There was some discussion in bug 389931 between you and Mats about the possibility of this, I wonder if OOPP somehow made the loss of the event a more common issue.
Comment 45 Rob 2011-06-28 13:37:16 PDT
A closely related BR where moving the Mouse affects the Browser is demonstrated in Bug 661717; it is accompanied by low Memory. It is not close enough to be a dupe.
Comment 46 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-06-28 16:38:01 PDT
I don't think the issue I was talking about in bug 389931 would lead to mNativeEventPending being permanently true like it is here.

One thing that really worries me about WindowsMessageLoop is that it appears we can get into the following stack:
-- Inside GetMessage/PeekMessage, the WH_GETMESSAGE hook is triggered
-- DeferredMessageHook runs
-- DeferredSendMessage::Run calls the windowproc for some window
-- That triggers something modal that spins up a nested event loop!
How confident are we that Windows can fully handle a WH_GETMESSAGE hook spinning up a nested event loop inside GetMessage/PeekMessage? I'm not!
Comment 47 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-06-28 16:43:37 PDT
For one thing it means that GetMessage has to be reentrant. Even our hook has to be reentrant! I don't know how that would work since CallNextHookEx must refer to global state to figure out what the next hook is, unless there's some super tricky code to figure out which event loop it's being called from.
Comment 48 Mark Dekker 2011-06-29 02:37:06 PDT
This looks like almost the same bug as what i posted.
https://bugzilla.mozilla.org/show_bug.cgi?id=647174

here is a short movie that shows the problem live.
http://www.dwmusicstore.com/mark/firefox.wmv
Comment 49 Rob 2011-06-30 20:40:53 PDT
Created attachment 543334
FF Nightly 7.0a1 with Processor Pinning

Here is what WinXP Task Manager looks like after a couple of hours of running Firefox Nightly 7.0a1 .

The Processor (dual Core) pins (@100% on one Core) and then 'un-pins' a moment later, this means that you lose and regain control approx. every 2 seconds.


This was much worse when I used to run FF4 as it would remain pinned for several seconds, often followed by a crash. Last week in FF7 we had the same problem (pinning for VERY long periods of time, sometimes up to a minute), which was causing a BSOD for "watchdog.sys". The Error Reporter Popup sent a few reports and that has not reoccurred THIS week.


Note that after using [File][Restart] the operation returns to normal even though the same 128 Tabs are reloaded.
Comment 50 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-06-30 22:05:54 PDT
I think comment #49 is not this bug.
Comment 51 daverodgerrs 2011-07-13 17:44:07 PDT
Wow, why does this bug still exist after half a year?

It's a serious bug and it's affecting many people and people have even posted nice videos showing you exactly what it looks like. After half a year it seems like firefox developers are no closer to fixing. In fact the videos show the bug happening in firefox 3.x so the bug has existed for much longer than half a year.

Why don't the developers provide specific instructions to users in order to get the information they need, or provide a special build of firefox that will log all necessary information to track this problem down.

It feels like there's no action being taken to actually fix this problem and instead developers are hoping it accidentally gets fixed with every new firefox release...
Comment 52 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-07-13 21:21:37 PDT
In fact, a lot of work has been done on this bug, including multiple sessions of developers working with users who are experiencing the bug, talking them through using a debugger on it. The fact that it typically takes days of browser usage to trigger, and we don't know what triggers it, makes it very hard for us to fix.

jmathies, bent, can you comment on comments #46 and #47?
Comment 53 Ronan Burke 2011-07-14 02:15:56 PDT
Robert have you tried just opening a blogger account and uploading a few images, large-ish ones? Does it every time for me.
Comment 54 Jim Mathies [:jimm] 2011-07-14 10:56:52 PDT
(In reply to comment #53)
> Robert have you tried just opening a blogger account and uploading a few
> images, large-ish ones? Does it every time for me.

I didn't have any luck reproducing with this. I tried compose post > click on add image > browse for local image > upload > select into post.

(In reply to comment #52)
> In fact, a lot of work has been done on this bug, including multiple
> sessions of developers working with users who are experiencing the bug,
> talking them through using a debugger on it. The fact that it typically
> takes days of browser usage to trigger, and we don't know what triggers it,
> makes it very hard for us to fix.
> 
> jmathies, bent, can you comment on comments #46 and #47?

I'll get there, in the middle of something but I should be looking at this next week.
Comment 55 maps511 2011-07-26 16:04:31 PDT
This "page not loading without mouse activity over Firefox focus window" problem is widespread and very frustrating. It seems to me that it can occur even on a system which is recently booted or with a fresh Firefox session. I'm running Windows 7 on a Dell E6400 laptop (4 GB memory, mostly unused).  The characteristic trigger is that I'm following a link (typically from Google News summary screen to the full news report). If I do something else while I wait for the page to load, the "loading icon" just spins. As soon as I move the mouse over the Firefox screen the load actually makes progress.
Comment 56 Rob 2011-07-26 18:54:10 PDT
(In reply to comment #50)
> I think comment #49 is not this bug.

Agreed. Thought to have been caused by Spyware since after removing it the 'pinning' no longer occurs.

Removing Attachment.
Comment 57 Rob 2011-07-26 18:59:50 PDT
(In reply to comment #56)
> (In reply to comment #50)
> > I think comment #49 is not this bug.
> 
> Agreed. Thought to have been caused by Spyware since after removing it the
> 'pinning' no longer occurs.
> 
> Removing Attachment.

@Robert O'Callahan (:roc)
There is no Button to delete my attachment; could you (or someone else) fix this, please?

Thank you,
Rob
Comment 58 bearfergeson 2011-07-28 14:28:45 PDT
Any progress on this issue?

I'm a firefox fan but having to restart multiple times a day is becoming a serious issue.

If there is a test build I could run to help figure out the problem, please post a link to it.
Comment 59 Jim Mathies [:jimm] 2011-07-28 14:44:06 PDT
(In reply to comment #58)
> Any progress on this issue?
> 
> I'm a firefox fan but having to restart multiple times a day is becoming a
> serious issue.
> 
> If there is a test build I could run to help figure out the problem, please
> post a link to it.

Yes, it's being worked on.
Comment 60 Jim Mathies [:jimm] 2011-08-01 03:32:41 PDT
(In reply to comment #46)
> I don't think the issue I was talking about in bug 389931 would lead to
> mNativeEventPending being permanently true like it is here.
> 
> One thing that really worries me about WindowsMessageLoop is that it appears
> we can get into the following stack:
> -- Inside GetMessage/PeekMessage, the WH_GETMESSAGE hook is triggered
> -- DeferredMessageHook runs
> -- DeferredSendMessage::Run calls the windowproc for some window
> -- That triggers something modal that spins up a nested event loop!
> How confident are we that Windows can fully handle a WH_GETMESSAGE hook
> spinning up a nested event loop inside GetMessage/PeekMessage? I'm not!

This doesn't appear to be an issue, apparently the get msg proc can block. Granted I haven't found any documentation stating this explicitly, but a local test app that throws up a dialog from within the procedure didn't have any side effects.

From within Fx, this would hook nsappshell's DispatchMessage or ipc's inner spin event loop dispatch. In either case having the thread wrapped up in the get message proc shouldn't be an issue.
Comment 61 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-08-01 04:27:48 PDT
Does the hook run reentrantly in that case?

But OK, whether or not we can understand what's happening here, I think we should create some kind of failsafe that resets mNativeEventPending periodically or if it's been set for too long. Possibly with telemetry to help us track down when it's needed.
Comment 62 Jim Mathies [:jimm] 2011-08-01 05:03:58 PDT
Created attachment 549761
native callback timeout patch v.1

(In reply to comment #61)
> Does the hook run reentrantly in that case?

Well, we clear the hook on the first call into the callback:

http://mxr.mozilla.org/mozilla-central/source/ipc/glue/WindowsMessageLoop.cpp#150

The deferred events we deliver are also copied to a local variable, so if we somehow wrapped all the way around and back into this (which would involve completing another rpc call on the delivery of a deferred message) the deferred message data structures in our global scope would handle it.
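
As a rough illustration of the two safeguards described above (clear the hook before doing any work, and drain a local copy of the deferred list), here is a simplified sketch. It does not reproduce WindowsMessageLoop.cpp: gDeferred, gHookInstalled, ScheduleDeferred and RunDeferred are hypothetical names, and a plain bool stands in for installing and removing the real hook.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical global queue of deferred work; not the real data structure.
static std::vector<std::function<void()>> gDeferred;
static bool gHookInstalled = false;

void ScheduleDeferred(std::function<void()> task) {
  gDeferred.push_back(std::move(task));
  gHookInstalled = true;  // stands in for installing the hook
}

void RunDeferred() {
  // Stands in for removing the hook: cleared before any task runs.
  gHookInstalled = false;

  // Move the pending tasks into a local so that tasks which schedule more
  // work append to a fresh global queue, not the one being iterated.
  std::vector<std::function<void()>> local;
  local.swap(gDeferred);
  assert(gDeferred.empty());

  for (auto& task : local) {
    task();
  }
}
```

With this shape, a task that schedules further work during delivery leaves that work queued for a later RunDeferred() call, and the loop over the local copy is never invalidated by the global queue changing underneath it.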

> But OK, whether or not we can understand what's happening here, I think we
> should create some kind of failsafe that resets mNativeEventPending
> periodically or if it's been set for too long. Possibly with telemetry to
> help us track down when it's needed.

Posted, will push to try for a test run. One question here - we addref nsAppShell when we post a native event. Even so, it looks like a leak of nsAppShell would be OK, since it's an instance singleton. (In fact, with some hackish code added to drop every other native callback message with this patch applied, I didn't see any leaks reported on shutdown despite nsAppShell's ref count being way out of whack ??)
Comment 63 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-08-01 05:12:41 PDT
(In reply to comment #62)
> (In reply to comment #61)
> > Does the hook run reentrantly in that case?
> 
> Well, we clear the hook on the first call into the callback:
> 
> http://mxr.mozilla.org/mozilla-central/source/ipc/glue/WindowsMessageLoop.cpp#150
> 
> The deferred events we deliver are also copied to a local variable, so if we
> somehow wrapped all the way around and back into this (which would involve
> completing another rpc call on the delivery of a deferred message) the
> deferred message data structures in our global scope would handle it.

Yeah. My main worry was that if the hook is set up again, then we could be running a hook invocation inside another hook invocation and CallNextHookEx might get confused.
Comment 64 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-08-01 05:14:41 PDT
That patch looks good.
Comment 65 Jim Mathies [:jimm] 2011-08-01 05:17:05 PDT
(In reply to comment #63)
> (In reply to comment #62)
> > (In reply to comment #61)
> > > Does the hook run reentrantly in that case?
> > 
> > Well, we clear the hook on the first call into the callback:
> > 
> > http://mxr.mozilla.org/mozilla-central/source/ipc/glue/WindowsMessageLoop.cpp#150
> > 
> > The deferred events we deliver are also copied to a local variable, so if we
> > somehow wrapped all the way around and back into this (which would involve
> > completing another rpc call on the delivery of a deferred message) the
> > deferred message data structures in our global scope would handle it.
> 
> Yeah. My main worry was that if the hook is set up again, then we could be
> running a hook invocation inside another hook invocation and CallNextHookEx
> might get confused.

I'll put that to the test in my test app and see what happens. Hopefully Windows does freak out.

FYI I was wrong on the leak reporting, didn't have that enabled in my console - it does leak. I don't think this is an issue though, do you? Might be interesting if we see this on a try run..

                                             Per-Inst   Leaked    Total      Rem      Mean       StdDev     Total      Rem      Mean       StdDev
  0 TOTAL                                          17       84  1659907        3 ( 1438.51 +/-  2066.83)   973336        6 ( 2686.61 +/-  4073.04)
207 nsBaseAppShell                                 68       68        1        1 (    1.00 +/-     0.00)     1906        5 (    8.14 +/-     1.62)
539 nsRunnable                                     12       12      894        1 (   32.08 +/-    40.46)     3202        1 (   47.33 +/-    64.06)
665 nsVoidArray                                     4        4    14361        1 ( 1241.26 +/-   602.97)        0        0 (    0.00 +/-     0.00)
Comment 66 Jim Mathies [:jimm] 2011-08-01 05:17:38 PDT
ehm, "*doesn't* freak out"
Comment 67 Jim Mathies [:jimm] 2011-08-01 05:53:06 PDT
Try builds for testing should be ready in about six hours:

http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jmathies@mozilla.com-6194aed56ba9
Comment 68 Jim Mathies [:jimm] 2011-08-02 12:09:22 PDT
Comment on attachment 549761 [details] [diff] [review]
native callback timeout patch v.1

Seems to be working well in a normal build locally. Haven't seen any issues.
Comment 69 Ben Turner (not reading bugmail, use the needinfo flag!) 2011-08-02 12:38:23 PDT
Should we be using mozilla::TimeStamp to handle PR_IntervalNow overflow?
Comment 70 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-08-02 15:26:05 PDT
Yes!
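
For anyone following along, the overflow concern can be sketched with a plain 32-bit tick counter. This is a generic illustration, not NSPR or patch code: Ticks, ExpiredNaive and ExpiredWrapSafe are hypothetical names, and the example assumes only that the interval value is a 32-bit unsigned counter that wraps around.

```cpp
#include <cassert>
#include <cstdint>

// 32-bit tick counter modelled on an interval timer that wraps around.
using Ticks = std::uint32_t;

// Naive check: compares absolute tick values, so it gives the wrong
// answer once `now` or `start + limit` has wrapped past 2^32.
bool ExpiredNaive(Ticks start, Ticks now, Ticks limit) {
  return now > static_cast<Ticks>(start + limit);
}

// Wrap-tolerant check: unsigned subtraction is modulo 2^32, so the
// difference is correct as long as the true elapsed time is shorter
// than one full wrap of the counter.
bool ExpiredWrapSafe(Ticks start, Ticks now, Ticks limit) {
  assert(limit < 0x80000000u);  // keep the limit well below a full wrap
  return static_cast<Ticks>(now - start) > limit;
}
```

Even the wrap-tolerant form silently breaks if more than one full wrap elapses between the two readings, which is one reason to prefer a timestamp type that hides the counter width from callers.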
Comment 71 Steve the Pocket 2011-08-05 19:38:21 PDT
*** Bug 602019 has been marked as a duplicate of this bug. ***
Comment 72 Jim Mathies [:jimm] 2011-08-08 10:02:20 PDT
Created attachment 551487 [details] [diff] [review]
native callback timeout patch v.2
Comment 73 Jim Mathies [:jimm] 2011-08-08 10:03:20 PDT
Comment on attachment 551487 [details] [diff] [review]
native callback timeout patch v.2

Updated to use mozilla::TimeStamp and friends.
Comment 74 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-08-08 15:20:11 PDT
Comment on attachment 551487 [details] [diff] [review]
native callback timeout patch v.2

Review of attachment 551487 [details] [diff] [review]:
-----------------------------------------------------------------

Add "typedef mozilla::TimeStamp TimeStamp;" to nsAppShell so you don't need mozilla:: prefixes elsewhere.
Comment 75 Jim Mathies [:jimm] 2011-08-09 07:49:12 PDT
Pushed to inbound:
http://hg.mozilla.org/integration/mozilla-inbound/rev/3015d5cb3a9c
Comment 76 :Ehsan Akhgari (busy, don't ask for review please) 2011-08-10 08:26:25 PDT
http://hg.mozilla.org/mozilla-central/rev/3015d5cb3a9c
Comment 77 Jim Mathies [:jimm] 2011-08-12 07:19:36 PDT
*** Bug 678124 has been marked as a duplicate of this bug. ***
Comment 78 jeffstelas 2011-08-13 10:46:43 PDT
If someone would answer the following 2 questions:

1) Has this bug been confirmed as fixed?
The status has been changed to "RESOLVED FIXED" right after the patch was posted but this is a bug that's been difficult to reproduce on demand and takes time to show. So have any of the people who were experiencing the problem reported that the problem is definitely gone?

2) When will the patch show up? Will it be in next week's firefox 6.0 release?
Comment 79 The 8472 2011-08-13 12:40:10 PDT
(In reply to jeffstelas from comment #78)
> 1) Has this bug been confirmed as fixed?
The status always gets changed once the patch has landed on trunk. If someone discovers that it's not fixed, they can reopen the bug or request that it be reopened.


> 2) When will the patch show up? Will it be in next week's firefox 6.0
> release?
No, it just landed on trunk, which means it's in nightly now. During the next branch it'll bubble up to aurora, then beta, then release (every 6 weeks), unless it gets backported to one of the branches.
Comment 80 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-08-13 14:28:25 PDT
It would be great if people who see this bug regularly could use a nightly build and let us know the results. Thanks!
Comment 81 jeffstelas 2011-08-13 20:55:51 PDT
(In reply to The 8472 from comment #79)

> > 2) When will the patch show up? Will it be in next week's firefox 6.0
> > release?
> No, it just landed on trunk, which means it's in nightly now. during the
> next branch it'll bubble up to aurora, then beta, then release (every 6
> weeks) unless it gets backported to one of the branches.

So the fix won't reach stable until version 9 ?
Comment 82 Jim Mathies [:jimm] 2011-08-13 21:32:38 PDT
(In reply to jeffstelas from comment #81)
> (In reply to The 8472 from comment #79)
> 
> > > 2) When will the patch show up? Will it be in next week's firefox 6.0
> > > release?
> > No, it just landed on trunk, which means it's in nightly now. during the
> > next branch it'll bubble up to aurora, then beta, then release (every 6
> > weeks) unless it gets backported to one of the branches.
> 
> So the fix won't reach stable until version 9 ?

If by 'stable' you mean release, yes. Considering the experimental nature of the patch, that's probably a good thing.

https://wiki.mozilla.org/RapidRelease/Calendar
Comment 83 Ed Morley [:emorley] 2011-08-14 01:38:17 PDT
(In reply to jeffstelas from comment #81)
> So the fix won't reach stable until version 9 ?

Version 8, see the "Target Milestone" field at the top of this page.
Comment 84 The 8472 2011-08-16 12:39:17 PDT
(In reply to roc from comment #80)
Well, I've seen the UI becoming a bit more sluggish, sometimes freezing the "tab loading" animation for fractions of a second and then everything becoming unstuck again. But I'm not sure if that's just GC pauses or the fix (workaround?) kicking in.

Considering the history of comments 32 and 33, I want to test some more before making a definite statement, leaving the browser session open for several days of normal usage.
Comment 85 jasonsrek 2011-10-03 22:21:05 PDT
This bug is not fixed.

According to the release notes URL, this patch was merged into version 7.0 and should be fixed in that release:
http://www.mozilla.org/en-US/firefox/7.0/releasenotes/buglist.html

I am on Firefox 7.0.1 and this bug still happens. For me it's happened twice while a file was downloading. Please put out a proper fix for this longstanding bug.
Comment 86 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2011-10-04 15:06:31 PDT
The release notes are incorrect, sorry. The fix is in Firefox 8. Please download a Firefox 8 beta build and see if that fixes it for you.
