UI becomes sluggish/does not redraw/repaint unless mouse is moved after long browsing sessions

RESOLVED FIXED in mozilla8

Status: RESOLVED FIXED
Product: Core
Component: Widget: Win32
Reported: 6 years ago
Modified: 6 years ago
Reporter: The 8472
Assignee: jimm
Version: Trunk
Target Milestone: mozilla8
Platform: x86, Windows 7
Firefox Tracking Flags: blocking2.0 .x+
Attachments: 3 attachments, 2 obsolete attachments

(Reporter)

Description

6 years ago
User-Agent:       Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b10pre) Gecko/20110118 Firefox/4.0b10pre
Build Identifier: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b10pre) Gecko/20110118 Firefox/4.0b10pre

My standard browsing session has 300+ tabs and firefox generally stays open 24 hours. After several hours of browsing the UI starts to become somewhat unresponsive. The symptoms are the following:
- autoscrolling with the middle mouse button becomes stuttery
- animated gifs are not animated properly (that includes the spinning tab load icons)
- loading pages seems stuck since the page is not re-drawn

All these issues only appear when I don't generate any input. As long as I'm moving my mouse or pressing keys, it redraws the pages in real time. As soon as I stop interacting with the UI, it stops redrawing and only refreshes every 2-3 seconds.

I experienced this issue with and without flash and with hardware acceleration on and off.

Reproducible: Always
(Reporter)

Updated

6 years ago
Version: unspecified → Trunk
(Reporter)

Comment 1

6 years ago
Created attachment 512679 [details]
2 threaddumps taken after several hours of uptime

It is extremely difficult to capture this behavior since moving the mouse or creating other UI events can dissipate it temporarily and since it appears in bursts.
Note that i'm seeing an IPC thread there despite no plugin-container.exe process existing at that time.
(Reporter)

Comment 2

6 years ago
Another indicator that this is IPC-related is that I had a YouTube upload running overnight with their Java uploader applet. The PC was unattended, and the upload seemed to freeze up at some point during the night. When I maximized the FF window it started again.
I wasn't able to capture a stack trace of that event, though.
(Reporter)

Comment 3

6 years ago
Created attachment 512751 [details]
threaddump taken after upload applet froze

I was able to reproduce the issue with the same browser session and the same applet, here's the stacktrace.
(Reporter)

Comment 4

6 years ago
I was able to reproduce the issue on a relatively clean profile with no addons and all plugins disabled. The only thing I did on that profile was browsing and submitting posts/images on 4chan, and after a dozen posts or so the UI started to become sluggish again.

So could this be
a) something related to POSTing
b) the file selection dialog
c) the bfcache on image-heavy sites
d) ????

Anyway, i can reproduce the issue reasonably well now, but i've no idea how to isolate it any further.
I can confirm this -- just happened to me, session wasn't particularly long.

It is similar to previous bugs where canvas animations or flash wouldn't update until mouse movement happened.

Marking this b? since we'll want to keep an eye on it -- it should probably be fixed in .x.  The8472 says he can reproduce it reliably, but I don't have any good ideas on how to actually figure out what's going on.
Status: UNCONFIRMED → NEW
blocking2.0: --- → ?
Ever confirmed: true
pulled this into the debugger to poke around.  One weird thing is that in nsAppShell::ProcessNextEvent, I seem to loop going through PeekMessageW returning false, then calling WaitMessage -- which is immediately returning, indicating that a message should be waiting.  PeekMessageW returns false. And this spin continues for a while, until it eventually somehow breaks out.  During this time, the appshell has a native event pending bit set.

So, we seem to have the state where WaitMessage thinks there's a message waiting, but PeekMessage disagrees.  That would certainly cause the problem that I'm seeing -- in particular, PeekMessage should be returning a message (because there is one -- the native event callback, at the very least!).
Moving to .x for now; Vlad, please renominate if this becomes something you think we should fix before release.
blocking2.0: ? → .x+
.x is fine -- I don't think we have nearly enough info about what's happening to block the release on it.  Moving it to Core Widget, though.

There are lots of variants of this bug -- plugins not repainting, video not repainting, canvas animations not repainting, animated gifs not repainting are all symptoms.  I think whatever's going on here is the underlying cause.
Component: General → Widget: Win32
Product: Firefox → Core
QA Contact: general → win32
Summary: UI becomes sluggish/does not redraw unless mouse is moved after long browsing sessions → UI becomes sluggish/does not redraw/repaint unless mouse is moved after long browsing sessions
(Reporter)

Comment 9

6 years ago
Just to note, this does not just affect rendering. Flash completely locks up when you try to interact with it while the browser is in that state. E.g. trying to move the volume slider on youtube leads to the entire UI freezing for several minutes.
Keyboard input also gets dropped when the UI is in a frozen state.

But yeah, all those things most likely are just symptoms.
If any of you guys can reproduce this reliably, can you try to pinpoint a regression range?  Also, can you post a set of steps to reproduce that you're using?
(In reply to comment #6)
> pulled this into the debugger to poke around.  One weird thing is that in
> nsAppShell::ProcessNextEvent, I seem to loop going through PeekMessageW
> returning false, then calling WaitMessage -- which is immediately returning,
> indicating that a message should be waiting.  PeekMessageW returns false. And
> this spin continues for a while, until it eventually somehow breaks out. 
> During this time, the appshell has a native event pending bit set.

Which PeekMessageW returns false?  The ones in PeekUIMessage, or the one in nsAppShell::ProcessNextNativeEvent?

Also, what is the return value for WaitMessage?  If it's false, what does GetLastError say?

I think one case where this could be happening is when PeekMessageW dispatches a sent message.  I'm not exactly sure if it would return false in that case (if there is no window or thread message posted) or not.  Note that sent messages do not go through the message queue (as they are processed synchronously).
(Reporter)

Comment 12

6 years ago
(In reply to comment #10)
I can reproduce it reliably, but my procedure is too crude and cumbersome to use for regression testing. It basically consists of browsing 4chan for a while and making about 20-40 posts with image uploads.
I have no clue which part of my standard browsing behavior causes it, so we need to narrow that down before we can do regression testing.

That is on win7, hw acceleration enabled and all addons/plugins disabled on a relatively clean profile.
FWIW, I wasn't able to reproduce this after doing a variety of canvas-y/plugin-y/video-y/webgl-y things for a while.  In case it matters, my laptop runs

  Graphics

        Adapter Description
        Intel(R) HD Graphics

        Vendor ID
        8086

        Device ID
        0046

        Adapter RAM
        Unknown

        Adapter Drivers
        igdumdx32 igd10umd32

        Driver Version
        8.15.10.2279

        Driver Date
        1-7-2011

        Direct2D Enabled
        true

        DirectWrite Enabled
        true (6.1.7600.16699, font cache n/a)

        WebGL Renderer
        Google Inc. -- ANGLE -- OpenGL ES 2.0 (ANGLE 0.0.0.541)

        GPU Accelerated Windows
        1/1 Direct3D 10
(Reporter)

Comment 14

6 years ago
(In reply to comment #13)
> I wasn't able to reproduce this after doing a variety of
> canvas-y/plugin-y/video-y/webgl-y things for a while.

Those things being affected are presumably the symptoms of this bug, not the cause, as my way of reproducing it involves none of them.

Comment 15

6 years ago
This bug could be related to:
Bug 634163 - browser will not load pages or page content unless mouse cursor constantly moving.
(Reporter)

Comment 16

6 years ago
Yes, the symptoms look quite similar.
I wonder if bug 575515 could have caused this. Checking if this bug happens in 2010-11-29 and 2010-11-30 nightlies would determine if that was the case.

Updated

6 years ago
Duplicate of this bug: 628354

Comment 19

6 years ago
Duplicate ?:
Bug 612087 - A strange bug. I can't describe this issue in English clearly, please watch my demo video
(In reply to comment #17)
> I wonder if bug 575515 could have caused this. Checking if this bug happens in
> 2010-11-29 and 2010-11-30 nightlies would determine if that was the case.

With the screencast on bug 612087, and other dup bugs, this also happens on 3.6, so shouldn't be related to bug 575515 (unless there is more than one issue). Most of the reports are from windows 7 64 bits (with a few exceptions using vista 32 bits).
There also appears to be a correlation with the file dialog being invoked.
Bug 612087 is slightly different, it seems to only require the mouse cursor over the window for animations to happen, the other bugs require it to be moving.
(Reporter)

Comment 22

6 years ago
(In reply to comment #17)
> I wonder if bug 575515 could have caused this. Checking if this bug happens in
> 2010-11-29 and 2010-11-30 nightlies would determine if that was the case.
It took me several hours, but i was able to reproduce the issue on
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b8pre) Gecko/20101128 Firefox/4.0b8pre

So i guess that wasn't our culprit.

(In reply to comment #20)
> There also appears to have a correlation with the file dialog being invoked.
Yeah, my "testing procedure", if you can call it that, includes POSTing files.

I've also seen a bunch of Windows thumbnail caches loaded into the process even after the dialogs were closed. I found that a while back when looking at the virtual memory usage, but it was only a fixed number (about 10), so I didn't think much of it. Not sure if that's of any relevance.
(Reporter)

Comment 23

6 years ago
I might have found a piece to the puzzle. After trying to find a more reliable way to reproduce this issue i found that most cases in which it occurred had over 2GB virtual memory footprint. I suspect that those cases that had less than 2GB had at least addresses > 2GB at some point.

This would also explain why we're mostly seeing this on 64-bit systems and only on a few 32-bit ones. To go beyond 2GB worth of addresses on 32-bit Windows, one has to run Windows in /3GB mode.

So, it's possible that something somewhere does pointer-magic by abusing the most significant bit and this somehow breaks event handling.

There are a lot of buts in this, since I'm not sure that reaching 2GB alone does the deed; some other factors probably play into this too.
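To make the most-significant-bit hypothesis above concrete, here is a minimal sketch of the kind of mistake it describes. Nothing in this bug shows that Gecko or any plugin contains such code; the function names are hypothetical, and addresses are modeled as plain 32-bit integers.

```cpp
#include <cstdint>

// Hypothetical buggy check: reinterprets a 32-bit address as signed, so every
// address at or above 0x80000000 (2 GB) looks negative and is rejected.
// (The unsigned-to-signed conversion wraps on two's-complement targets; the
// standard only guarantees that behavior from C++20 onward.)
inline bool LooksValidSigned(std::uint32_t address) {
  return static_cast<std::int32_t>(address) > 0;
}

// Correct check: only the null address is treated as invalid.
inline bool LooksValidUnsigned(std::uint32_t address) {
  return address != 0;
}
```

Both checks agree for every address below 2 GB, which is why a bug of this shape would stay hidden until the process grows large.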
Version: Trunk → Other Branch
(Reporter)

Updated

6 years ago
Version: Other Branch → Trunk

Comment 24

6 years ago
I'm running Firefox on a 32-bit netbook, and this happens to me constantly. The stats are 1.66 GHz Atom, 2 GB RAM, 2.7 GB commit charge (at the moment), Windows 7. I do believe the problem could be related to my using the file upload dialog.


Adapter Description: Intel(R) Graphics Media Accelerator 3150
Vendor ID: 8086
Device ID: a011
Adapter Driver: sigdumdx32
Driver Version: 8.14.10.2117
Driver Date: 4-19-2010
Direct2D Enabled: false
DirectWrite Enabled: true (6.1.7600.20905, font cache n/a)
WebGL Renderer: Google Inc. -- ANGLE -- OpenGL ES 2.0 (ANGLE 0.0.0.541)
GPU Accelerated Windows: 0/2
I think what we need right now is just a complete stack trace using WinDbg or Visual Studio. Preferably with symbols loaded for the Windows system libraries.
(In reply to comment #25)
> I think what we need right now is just a complete stack trace using WinDbg
> or Visual Studio. Preferably with symbols loaded for the Windows system
> libraries.

Stack trace of what/where?  Not sure what stack would be useful if the situation is what I found in comment #6.  Would be good to verify this somehow though, perhaps even with some code that detects the PeekMessage/WaitMessage fail.
I had been working with 8472 on IRC and it looked like we were in a deeply nested event loop, as if the entire browser was running inside a Windows modal dialog loop or something. But we couldn't tell for sure because his stack traces were bogus. So in fact any complete stack of the main thread while the browser is in the bad state would do.
(Reporter)

Comment 28

6 years ago
I gathered that the stack traces i attached are not complete. Could you tell me which command would get you the data you need?
http://windbg.info/doc/1-common-cmds.html#15_call_stack
Hmm, those do look complete. But I remember the dumps we were looking at on IRC were less complete.

I suspect Vlad is right and we need to add instrumentation code to check for the case where PeekMessageW returns no message but WaitMessage returns immediately.
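A minimal sketch of the instrumentation suggested above, assuming the loop can time its WaitMessage call and report whether the following PeekMessageW found anything. `SpinDetector`, its thresholds, and the way it would be wired into nsAppShell are all hypothetical; the Win32 calls are deliberately left out so the logic stays portable.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper (not an existing Gecko class). It counts consecutive
// loop iterations in which the wait step returned almost immediately but the
// peek step that followed found no message.
class SpinDetector {
public:
  using Duration = std::chrono::nanoseconds;

  SpinDetector(std::size_t spinLimit, Duration immediateThreshold)
    : mSpinLimit(spinLimit), mImmediateThreshold(immediateThreshold) {
    assert(spinLimit > 0);
  }

  // Call once per loop iteration.
  //   peekFoundMessage: result of the PeekMessageW-style call.
  //   waitTime: how long the preceding WaitMessage-style call blocked.
  // Returns true once the suspicious pattern has repeated spinLimit times
  // in a row.
  bool Record(bool peekFoundMessage, Duration waitTime) {
    if (!peekFoundMessage && waitTime < mImmediateThreshold) {
      ++mConsecutiveSpins;
    } else {
      mConsecutiveSpins = 0;  // a normal iteration breaks the streak
    }
    return mConsecutiveSpins >= mSpinLimit;
  }

  std::size_t ConsecutiveSpins() const { return mConsecutiveSpins; }

private:
  std::size_t mSpinLimit;
  Duration mImmediateThreshold;
  std::size_t mConsecutiveSpins = 0;
};
```

In a real loop one would call `Record` once per iteration and log, or break into the debugger, when it returns true.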
(Assignee)

Updated

6 years ago
Duplicate of this bug: 634163
(Assignee)

Comment 31

6 years ago
There's a good chance bug 641705 will fix this in Fx 5. We were dropping gecko event messages in the plugin event processing code.
(Reporter)

Comment 32

6 years ago
Yes, after several days of uptime following my usual browsing patterns that used to cause the issue the browser remained responsive/refreshed the UI properly (apart from somewhat inflated GC/CC times).

Marking as fixed.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
(Reporter)

Comment 33

6 years ago
Aaannd... it is back:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0a1) Gecko/20110602 Firefox/7.0a1

After two days of uptime for this session I'm seeing exactly the same behavior as before. Flash completely locks up, the UI only refreshes when the mouse is moved etc. etc.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Comment 34

6 years ago
I am also dealing with this problem. It is *incredibly* annoying and should be fixed immediately.

My computer:
windows 7 x64
4gb ram
intel core i3
intel hd graphics


Here are 2 YouTube videos showing what the problem looks like:
http://www.youtube.com/watch?v=zLlNv7ZwANU
http://www.youtube.com/watch?v=yBT5qc1P_hU
(In reply to comment #33)
> Aaannd... it is back:
> Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0a1) Gecko/20110602 Firefox/7.0a1
> 
> After two days of uptime for this session I'm seeing exactly the same
> behavior as before. Flash completely locks up, the UI only refreshes when
> the mouse is moved etc. etc.

Can you narrow down the regression to a particular build when it started happening?
(Reporter)

Comment 36

6 years ago
(In reply to comment #35)
> Can you narrow down the regression to a particular build when it started
> happening?

No, sorry. I'm not even 100% certain that it was gone for sure. I mean I intentionally did not restart FF for several days after the patch landed and also used it regularly and it seemed fine. But the occurrence of this issue is very erratic and can take between an hour and a day of firefox uptime to happen. Although it's most likely not related to uptime but to active use. So it's not really deterministic. I can only reproduce it by waiting until it happens.


Later this month or next month I'll get faster internet. Then I'll get a windows 7 virtual machine image from microsoft and see if i can reproduce it inside the VM and freeze it. That's a big if, considering how spurious that bug seems to be.

Comment 37

6 years ago
I've reliably managed to reproduce this bug, I'm using 4.0.1 and it was happening back in 3.5 as well. 

It happens every time I try to upload pictures on my google blogger account. Maybe just using the google blogger account.

It happens at other times as well apparently randomly but I'm guaranteed it happens with blogger. I'm guessing some sort of script they use.

Really really really annoying, to the point of being serious.

I have fireftp 1.99.4 installed as well as java console 6.0.22 and 6.0.26, with AVG safesearch disabled at 9.0.0.872
(Assignee)

Comment 38

6 years ago
(In reply to comment #37)
> I've reliably managed to reproduce this bug, I'm using 4.0.1 and it was
> happening back in 3.5 as well. 
> 
> It happens every time I try to upload pictures on my google blogger account.
> Maybe just using the google blogger account.
> 
> It happens at other times as well apparently randomly but I'm guaranteed it
> happens with blogger. I'm guessing some sort of script they use.
> 
> Really really really annoying, to the point of being serious.
> 
> I have fireftp 1.99.4 installed as well as java console 6.0.22 and 6.0.26,
> with AVG safesearch disabled at 9.0.0.872

Ronan, can you confirm the problem is still present in 5.0?

http://www.mozilla.com/en-US/firefox/new/

Comment 39

6 years ago
Hello, yes, the problem is still present. It happens on image upload only, as far as I can see.
Matthew Gregan had this happen on his machine and did some debugging with my help.

At one point the main thread event queue had 249 pending XPCOM events. The appshell's mNativeEventPending flag was 1, but these events were not being processed. mEventloopNestingLevel was 1. Setting mNativeEventPending to 0 made his session unwedge permanently.

It seems very likely that the problem is simply that nsBaseAppShell::OnDispatchedEvent runs, changes mNativeEventPending from 0 to 1, and calls nsAppShell::ScheduleNativeEventCallback which does the PostMessage of the special appshell message, but then somehow that one message is lost, never received by nsAppShell::EventWindowProc. Once that message is lost, we're permanently stuck in a state where mNativeEventPending is 1 and ScheduleNativeEventCallback is never called again. The fact that setting mNativeEventPending to 0 fixed the problem suggests that nothing else bad is happening.
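The failure mode described above can be modeled in a few lines. This is a simplified sketch, not the real nsBaseAppShell: the struct, its fields, and the `loseNextMessage` knob are hypothetical stand-ins for mNativeEventPending, the posted appshell message, and the XPCOM event queue.

```cpp
#include <cstddef>

// Simplified, hypothetical model of the appshell's pending-flag protocol.
struct ModelAppShell {
  bool nativeEventPending = false;  // models mNativeEventPending
  bool messageInQueue = false;      // models the posted appshell message
  std::size_t pendingEvents = 0;    // models queued XPCOM events
  bool loseNextMessage = false;     // knob: simulate one dropped message

  // Models OnDispatchedEvent: only the 0 -> 1 transition of the flag
  // schedules a native callback.
  void OnDispatchedEvent() {
    ++pendingEvents;
    if (nativeEventPending) {
      return;  // a callback is believed to be on its way already
    }
    nativeEventPending = true;
    ScheduleNativeEventCallback();
  }

  // Models the PostMessage of the special appshell message.
  void ScheduleNativeEventCallback() {
    if (loseNextMessage) {
      loseNextMessage = false;  // the message vanishes; nobody notices
      return;
    }
    messageInQueue = true;
  }

  // Models the window procedure receiving the message and running the
  // native event callback, which clears the flag and drains the queue.
  void PumpNativeQueue() {
    if (!messageInQueue) {
      return;
    }
    messageInQueue = false;
    nativeEventPending = false;
    pendingEvents = 0;
  }
};
```

Once one message is dropped, the flag stays set, no further message is ever scheduled, and events pile up. Clearing the flag by hand, as was done in the debugger, lets the next dispatch schedule a fresh callback.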
Matthew also reported that not long before his browser got wedged, he killed a runaway java.exe process, which supports a correlation between this bug and plugins or perhaps java specifically.
jimm, bent: I'm a bit scared by the logic of ProcessOrDeferMessage in WindowsMessageLoop. It seems to me that the deferred messages are processed next time WH_GETMESSAGE or WH_CALLWNDPROC fire. But these only fire when the event loop actually gets an event:

GetMsgProc:
> The system calls this function whenever the GetMessage or PeekMessage function
> has retrieved a message from an application message queue. Before returning the
> retrieved message to the caller, the system passes the message to the hook
> procedure.

CallWndProc:
> The system calls this function before calling the window procedure to process a
> message sent to the thread.

So while no message is sent to a main thread window, these hooks won't fire and deferred messages won't be delivered. Am I right?

This doesn't really explain the problem though. If that was the whole problem then moving the mouse over the window or whatever would cause PeekMessage to succeed in nsAppShell::ProcessNextNativeEvent, and then the GetMsgProc hook would fire, we'd run the deferred message and nsBaseAppShell::NativeEventCallback would get called, setting mNativeEventPending to 0 and we'd be unwedged.
I suppose it's possible that some plugin or some other software installs a message hook that causes our message to be dropped on the floor. It's also possible that a badly-written modal event loop could drop messages.

Given that, maybe it's worth having some kind of fallback mechanism so that if the appshell message is lost, we can still recover? Maybe grab a timestamp when posting the appshell message, and then have ProcessNextNativeEvent dispatch another one if more than a second has elapsed since we first posted the message? Then moving the mouse over the browser window would automatically unwedge it.
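The fallback proposed above could look roughly like this. It is a sketch under assumptions: `CallbackWatchdog` is a hypothetical helper, the one-second timeout comes from the comment, and the caller is assumed to pass in the current time and to re-post the appshell message itself.

```cpp
#include <chrono>

// Hypothetical helper: remembers when the appshell message was posted and
// reports when it has been outstanding for longer than the timeout.
class CallbackWatchdog {
public:
  using Clock = std::chrono::steady_clock;

  explicit CallbackWatchdog(Clock::duration timeout) : mTimeout(timeout) {}

  // Call when the appshell message is posted (or re-posted).
  void NotePosted(Clock::time_point now) {
    mPending = true;
    mPostedAt = now;
  }

  // Call when the native event callback actually runs.
  void NoteDelivered() { mPending = false; }

  // Call from the native event loop. True means the message has been
  // outstanding for too long and should be posted again.
  bool ShouldRepost(Clock::time_point now) const {
    return mPending && (now - mPostedAt) > mTimeout;
  }

private:
  Clock::duration mTimeout;
  Clock::time_point mPostedAt{};
  bool mPending = false;
};
```

Calling `NotePosted` again on a re-post restarts the timer, so a wedged loop re-posts at most once per timeout rather than on every iteration. That is a design choice of this sketch, not something specified in the comment.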
(Assignee)

Comment 44

6 years ago
(In reply to comment #43)
> I suppose it's possible that some plugin or some other software installs a
> message hook that causes our message to be dropped on the floor. It's also
> possible that a badly-written modal event loop could drop messages.
> 
> Given that, maybe it's worth having some kind of fallback mechanism so that
> if the appshell message is lost, we can still recover? Maybe grab a
> timestamp when posting the appshell message, and then have
> ProcessNextNativeEvent dispatch another one if more than a second has
> elapsed since we first posted the message? Then moving the mouse over the
> browser window would automatically unwedge it.

Seems reasonable, although I would love to know what is leaking these events. There was some discussion in bug 389931 between you and Mats about the possibility of this, I wonder if OOPP somehow made the loss of the event a more common issue.

Comment 45

6 years ago
A closely related bug report, where moving the mouse affects the browser, is demonstrated in Bug 661717; it is accompanied by low memory. It is not close enough to be a dupe.
I don't think the issue I was talking about in bug 389931 would lead to mNativeEventPending being permanently true like it is here.

One thing that really worries me about WindowsMessageLoop is that it appears we can get into the following stack:
-- Inside GetMessage/PeekMessage, the WH_GETMESSAGE hook is triggered
-- DeferredMessageHook runs
-- DeferredSendMessage::Run calls the windowproc for some window
-- That triggers something modal that spins up a nested event loop!
How confident are we that Windows can fully handle a WH_GETMESSAGE hook spinning up a nested event loop inside GetMessage/PeekMessage? I'm not!
For one thing it means that GetMessage has to be reentrant. Even our hook has to be reentrant! I don't know how that would work since CallNextHookEx must refer to global state to figure out what the next hook is, unless there's some super tricky code to figure out which event loop it's being called from.

Comment 48

6 years ago
This looks like almost the same bug as what i posted.
https://bugzilla.mozilla.org/show_bug.cgi?id=647174

here is a short movie that shows the problem live.
http://www.dwmusicstore.com/mark/firefox.wmv

Comment 49

6 years ago
Created attachment 543334 [details]
FF Nightly 7.0a1 with Processor Pinning

Here is what WinXP Task Manager looks like after a couple of hours of running Firefox Nightly 7.0a1.

The processor (dual core) pins (at 100% on one core) and then 'un-pins' a moment later, which means that you lose and regain control approximately every 2 seconds.


This was much worse when I used to run FF4, as it would remain pinned for several seconds, often followed by a crash. Last week in FF7 we had the same problem (pinning for VERY long periods of time, sometimes up to a minute), which was causing a BSOD for "watchdog.sys". The Error Reporter popup sent a few reports, and that has not reoccurred this week.


Note that after using [File][Restart] the operation returns to normal even though the same 128 Tabs are reloaded.
I think comment #49 is not this bug.

Comment 51

6 years ago
Wow, why does this bug still exist after half a year?

It's a serious bug, it's affecting many people, and people have even posted nice videos showing you exactly what it looks like. After half a year it seems like Firefox developers are no closer to fixing it. In fact, the videos show the bug happening in Firefox 3.x, so the bug has existed for much longer than half a year.

Why don't the developers provide specific instructions to users in order to get the information they need, or provide a special build of firefox that will log all necessary information to track this problem down.

It feels like there's no action being taken to actually fix this problem and instead developers are hoping it accidentally gets fixed with every new firefox release...
In fact, a lot of work has been done on this bug, including multiple sessions of developers working with users who are experiencing the bug, talking them through using a debugger on it. The fact that it typically takes days of browser usage to trigger, and we don't know what triggers it, makes it very hard for us to fix.

jmathies, bent, can you comment on comments #46 and #47?

Comment 53

6 years ago
Robert have you tried just opening a blogger account and uploading a few images, large-ish ones? Does it every time for me.
(Assignee)

Updated

6 years ago
Assignee: nobody → jmathies
(Assignee)

Comment 54

6 years ago
(In reply to comment #53)
> Robert have you tried just opening a blogger account and uploading a few
> images, large-ish ones? Does it every time for me.

I didn't have any luck reproducing with this. I tried compose post > click on add image > browse for local image > upload > select into post.

(In reply to comment #52)
> In fact, a lot of work has been done on this bug, including multiple
> sessions of developers working with users who are experiencing the bug,
> talking them through using a debugger on it. The fact that it typically
> takes days of browser usage to trigger, and we don't know what triggers it,
> makes it very hard for us to fix.
> 
> jmathies, bent, can you comment on comments #46 and #47?

I'll get there, in the middle of something but I should be looking at this next week.

Comment 55

6 years ago
This "page not loading without mouse activity over Firefox focus window" problem is widespread and very frustrating. It seems to me that it can occur even on a system which is recently booted or with a fresh Firefox session. I'm running Windows 7 on a Dell E6400 laptop (4 GB memory, mostly unused).  The characteristic trigger is that I'm following a link (typically from Google News summary screen to the full news report). If I do something else while I wait for the page to load, the "loading icon" just spins. As soon as I move the mouse over the Firefox screen the load actually makes progress.

Comment 56

6 years ago
(In reply to comment #50)
> I think comment #49 is not this bug.

Agreed. It is thought to have been caused by spyware, since the 'pinning' no longer occurs after removing it.

Removing Attachment.

Comment 57

6 years ago
(In reply to comment #56)
> (In reply to comment #50)
> > I think comment #49 is not this bug.
> 
> Agreed. Thought to have been caused by Spyware since after removing the
> 'pinning' no longer occurs.
> 
> Removing Attachment.

@Robert O'Callahan (:roc)
There is no button to delete my attachment; could you (or someone else) please fix this?

Thank you,
Rob

Updated

6 years ago
Attachment #543334 - Attachment is obsolete: true

Comment 58

6 years ago
Any progress on this issue?

I'm a firefox fan but having to restart multiple times a day is becoming a serious issue.

If there is a test build I could run to help figure out the problem, please post a link to it.
(Assignee)

Comment 59

6 years ago
(In reply to comment #58)
> Any progress on this issue?
> 
> I'm a firefox fan but having to restart multiple times a day is becoming a
> serious issue.
> 
> If there is a test build I could run to help figure out the problem, please
> post a link to it.

Yes, it's being worked on.
(Assignee)

Comment 60

6 years ago
(In reply to comment #46)
> I don't think the issue I was talking about in bug 389931 would lead to
> mNativeEventPending being permanently true like it is here.
> 
> One thing that really worries me about WindowsMessageLoop is that it appears
> we can get into the following stack:
> -- Inside GetMessage/PeekMessage, the WH_GETMESSAGE hook is triggered
> -- DeferredMessageHook runs
> -- DeferredSendMessage::Run calls the windowproc for some window
> -- That triggers something modal that spins up a nested event loop!
> How confident are we that Windows can fully handle a WH_GETMESSAGE hook
> spinning up a nested event loop inside GetMessage/PeekMessage? I'm not!

This doesn't appear to be an issue; apparently the GetMsgProc can block. Granted, I haven't found any documentation stating this explicitly, but a local test app that throws up a dialog from within the procedure didn't have any side effects.

From within Fx, this would hook nsAppShell's DispatchMessage or IPC's inner spin event loop dispatch. In either case, having the thread wrapped up in the GetMsgProc shouldn't be an issue.
Does the hook run reentrantly in that case?

But OK, whether or not we can understand what's happening here, I think we should create some kind of failsafe that resets mNativeEventPending periodically or if it's been set for too long. Possibly with telemetry to help us track down when it's needed.
(Assignee)

Comment 62

6 years ago
Created attachment 549761 [details] [diff] [review]
native callback timeout patch v.1

(In reply to comment #61)
> Does the hook run reentrantly in that case?

Well, we clear the hook on the first call into the callback:

http://mxr.mozilla.org/mozilla-central/source/ipc/glue/WindowsMessageLoop.cpp#150

The deferred events we deliver are also copied to a local variable, so if we somehow wrapped all the way around and back into this (which would involve completing another rpc call on the delivery of a deferred message) the deferred message data structures in our global scope would handle it.

> But OK, whether or not we can understand what's happening here, I think we
> should create some kind of failsafe that resets mNativeEventPending
> periodically or if it's been set for too long. Possibly with telemetry to
> help us track down when it's needed.

Posted, will push to try for a test run. One question here - we addref nsAppShell when we post a native event. Even so, it looks like a leak of nsAppShell would be OK, since it's an instance singleton. (In fact, with some hackish code added to drop every other native callback message with this patch applied, I didn't see any leaks reported on shutdown despite nsAppShell's ref count being way out of whack ??)
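The "copied to a local variable" point above is the usual way to make a queue drain safe against reentrancy. A generic sketch of that pattern follows, with a hypothetical `DeferredQueue` class that is not the actual WindowsMessageLoop code:

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical queue of deferred work items.
class DeferredQueue {
public:
  void Defer(std::function<void()> task) {
    mTasks.push_back(std::move(task));
  }

  // Moves the pending tasks into a local vector before running them. A task
  // that defers more work, or that calls RunAll() again from a nested loop,
  // only ever touches the now-empty member vector, so no task runs twice and
  // the vector being iterated is never modified.
  void RunAll() {
    std::vector<std::function<void()>> local;
    local.swap(mTasks);
    for (auto& task : local) {
      task();
    }
  }

  std::size_t Size() const { return mTasks.size(); }

private:
  std::vector<std::function<void()>> mTasks;
};
```

Iterating the member vector directly would be unsafe here, because a `push_back` from inside a running task can reallocate the vector mid-loop.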
(In reply to comment #62)
> (In reply to comment #61)
> > Does the hook run reentrantly in that case?
> 
> Well, we clear the hook on the first call into the callback:
> 
> http://mxr.mozilla.org/mozilla-central/source/ipc/glue/WindowsMessageLoop.
> cpp#150
> 
> The deferred events we deliver are also copied to a local variable, so if we
> somehow wrapped all the way around and back into this (which would involve
> completing another rpc call on the delivery of a deferred message) the
> deferred message data structures in our global scope would handle it.

Yeah. My main worry was that if the hook is set up again, then we could be running a hook invocation inside another hook invocation and CallNextHookEx might get confused.
That patch looks good.
(Assignee)

Comment 65

6 years ago
(In reply to comment #63)
> (In reply to comment #62)
> > (In reply to comment #61)
> > > Does the hook run reentrantly in that case?
> > 
> > Well, we clear the hook on the first call into the callback:
> > 
> > http://mxr.mozilla.org/mozilla-central/source/ipc/glue/WindowsMessageLoop.
> > cpp#150
> > 
> > The deferred events we deliver are also copied to a local variable, so if we
> > somehow wrapped all the way around and back into this (which would involve
> > completing another rpc call on the delivery of a deferred message) the
> > deferred message data structures in our global scope would handle it.
> 
> Yeah. My main worry was that if the hook is set up again, then we could be
> running a hook invocation inside another hook invocation and CallNextHookEx
> might get confused.

I'll put that to the test in my test app and see what happens. Hopefully Windows does freak out.

FYI I was wrong on the leak reporting, didn't have that enabled in my console - it does leak. I don't think this is an issue though, do you? Might be interesting if we see this on a try run..

                                             Per-Inst   Leaked    Total      Rem      Mean       StdDev     Total      Rem      Mean       StdDev
  0 TOTAL                                          17       84  1659907        3 ( 1438.51 +/-  2066.83)   973336        6 ( 2686.61 +/-  4073.04)
207 nsBaseAppShell                                 68       68        1        1 (    1.00 +/-     0.00)     1906        5 (    8.14 +/-     1.62)
539 nsRunnable                                     12       12      894        1 (   32.08 +/-    40.46)     3202        1 (   47.33 +/-    64.06)
665 nsVoidArray                                     4        4    14361        1 ( 1241.26 +/-   602.97)        0        0 (    0.00 +/-     0.00)
(Assignee)

Comment 66

6 years ago
ehm, "*doesn't* freak out"
(Assignee)

Comment 67

6 years ago
Try builds for testing should be ready in about six hours:

http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jmathies@mozilla.com-6194aed56ba9
(Assignee)

Comment 68

6 years ago
Comment on attachment 549761 [details] [diff] [review]
native callback timeout patch v.1

Seems to be working well in a normal build locally. Haven't seen any issues.
Attachment #549761 - Flags: review?(roc)
Should we be using mozilla::TimeStamp to handle PR_IntervalNow overflow?
Yes!
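
As background for the overflow question, here is a minimal sketch of how a timeout check can go wrong with a wrapping tick counter, assuming a 32-bit unsigned interval type. `Ticks` and the three function names are hypothetical; `std::chrono::steady_clock` is used only as a standard-library analogue of a monotonic timestamp class, not as a description of what the patch does.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical 32-bit tick counter that wraps around to 0 after 0xFFFFFFFF.
using Ticks = std::uint32_t;

// Naive check: computes an absolute deadline. If start + timeout wraps,
// the deadline becomes a small number and the check fires too early.
bool ExpiredNaive(Ticks start, Ticks timeout, Ticks now) {
  Ticks deadline = start + timeout;  // may wrap around
  return now >= deadline;
}

// Wrap-safe check: unsigned subtraction yields the elapsed ticks modulo 2^32,
// which is correct as long as the real elapsed time is below one full wrap.
bool ExpiredWrapSafe(Ticks start, Ticks timeout, Ticks now) {
  return static_cast<Ticks>(now - start) >= timeout;
}

// Alternative: a monotonic clock with a wide representation avoids the
// wraparound bookkeeping altogether.
bool ExpiredSteady(std::chrono::steady_clock::time_point start,
                   std::chrono::milliseconds timeout) {
  return std::chrono::steady_clock::now() - start >= timeout;
}
```

With a start of 0xFFFFFFF0 and a timeout of 0x20, the naive version reports expiry after only 8 ticks, while the subtraction-based version waits for the full interval.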

Updated

6 years ago
Duplicate of this bug: 602019
(Assignee)

Updated

6 years ago
Attachment #549761 - Flags: review?(roc)
(Assignee)

Comment 72

6 years ago
Created attachment 551487 [details] [diff] [review]
native callback timeout patch v.2
Attachment #549761 - Attachment is obsolete: true
(Assignee)

Comment 73

6 years ago
Comment on attachment 551487 [details] [diff] [review]
native callback timeout patch v.2

Updated to use mozilla::TimeStamp and friends.
Attachment #551487 - Flags: review?(roc)
Comment on attachment 551487 [details] [diff] [review]
native callback timeout patch v.2

Review of attachment 551487 [details] [diff] [review]:
-----------------------------------------------------------------

Add "typedef mozilla::TimeStamp TimeStamp;" to nsAppShell so you don't need mozilla:: prefixes elsewhere.
Attachment #551487 - Flags: review?(roc) → review+
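
The review suggestion can be illustrated with a small sketch of a class-scoped typedef. The namespace `demo`, the class `AppShellSketch`, and the member `mLastCallbackTime` are hypothetical stand-ins for `mozilla`, `nsAppShell`, and whatever the patch actually stores; only the typedef idea comes from the review.

```cpp
#include <chrono>

namespace demo {  // hypothetical stand-in for the mozilla namespace
struct TimeStamp {
  std::chrono::steady_clock::time_point value;
  static TimeStamp Now() { return TimeStamp{std::chrono::steady_clock::now()}; }
};
}  // namespace demo

class AppShellSketch {  // hypothetical stand-in for nsAppShell
  // Class-scoped alias: members below can say TimeStamp without a prefix.
  typedef demo::TimeStamp TimeStamp;

 public:
  void MarkCallbackPending() { mLastCallbackTime = TimeStamp::Now(); }

  // True once MarkCallbackPending() has recorded a time.
  bool HasPendingMark() const {
    return mLastCallbackTime.value != std::chrono::steady_clock::time_point();
  }

 private:
  TimeStamp mLastCallbackTime;  // assumed member name, for illustration only
};
```

Keeping the alias inside the class avoids a file-wide using-declaration while still removing the repeated namespace prefix.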
(Assignee)

Comment 75

6 years ago
Pushed to inbound:
http://hg.mozilla.org/integration/mozilla-inbound/rev/3015d5cb3a9c
http://hg.mozilla.org/mozilla-central/rev/3015d5cb3a9c
Status: REOPENED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla8
(Assignee)

Updated

6 years ago
Duplicate of this bug: 678124

Comment 78

6 years ago
Could someone answer the following two questions?

1) Has this bug been confirmed as fixed?
The status was changed to "RESOLVED FIXED" right after the patch was posted, but this bug has been difficult to reproduce on demand and takes time to show. Have any of the people who were experiencing the problem reported that it is definitely gone?

2) When will the patch show up? Will it be in next week's firefox 6.0 release?
(Reporter)

Comment 79

6 years ago
(In reply to jeffstelas from comment #78)
> 1) Has this bug been confirmed as fixed?
The status is always changed once the patch has landed on trunk. If someone discovers that it's not fixed, they can reopen the bug or request that it be reopened.


> 2) When will the patch show up? Will it be in next week's firefox 6.0
> release?
No, it just landed on trunk, which means it's in nightly now. At the next branch it'll bubble up to aurora, then beta, then release (every 6 weeks), unless it gets backported to one of the branches.
It would be great if people who see this bug regularly could use a nightly build and let us know the results. Thanks!

Comment 81

6 years ago
(In reply to The 8472 from comment #79)

> > 2) When will the patch show up? Will it be in next week's firefox 6.0
> > release?
> No, it just landed on trunk, which means it's in nightly now. during the
> next branch it'll bubble up to aurora, then beta, then release (every 6
> weeks) unless it gets backported to one of the branches.

So the fix won't reach stable until version 9 ?
(Assignee)

Comment 82

6 years ago
(In reply to jeffstelas from comment #81)
> (In reply to The 8472 from comment #79)
> 
> > > 2) When will the patch show up? Will it be in next week's firefox 6.0
> > > release?
> > No, it just landed on trunk, which means it's in nightly now. during the
> > next branch it'll bubble up to aurora, then beta, then release (every 6
> > weeks) unless it gets backported to one of the branches.
> 
> So the fix won't reach stable until version 9 ?

If by 'stable' you mean release, yes. Considering the experimental nature of the patch, that's probably a good thing.

https://wiki.mozilla.org/RapidRelease/Calendar
(In reply to jeffstelas from comment #81)
> So the fix won't reach stable until version 9 ?

Version 8, see the "Target Milestone" field at the top of this page.
(Reporter)

Comment 84

6 years ago
(In reply to roc from comment #80)
Well, I've seen the UI becoming a bit more sluggish, sometimes freezing the "tab loading" animation for fractions of a second and then everything becoming unstuck again. But I'm not sure if that's just GC pauses or the fix (workaround?) kicking in.

Considering the history of comments 32 and 33, I want to test some more before making a definite statement, leaving the browser session open for several days of normal usage.

Comment 85

6 years ago
This bug is not fixed.

According to the release notes, this patch was merged into version 7.0, so the bug should be fixed in that release:
http://www.mozilla.org/en-US/firefox/7.0/releasenotes/buglist.html

I am on Firefox 7.0.1 and this bug still happens. For me it has happened twice while a file was downloading. Please put out a proper fix for this longstanding bug.
The release notes are incorrect, sorry. The fix is in Firefox 8. Please download a Firefox 8 beta build and see if that fixes it for you.