Closed Bug 496788 Opened 11 years ago Closed 10 years ago

100% CPU spike when moving window

Categories

(Core :: Widget: Win32, defect, P2)

x86
Windows XP
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: mats, Assigned: mats)

References

()

Details

(Keywords: perf, regression, Whiteboard: [fixed by bug 499447])

Attachments

(4 obsolete files)

STEPS TO REPRODUCE
1. load http://dn.se/
2. CTRL+O

ACTUAL RESULT
Unable to use the file open dialog. Firefox process runs at 100% CPU.
After closing the dialog (press ESC) - Firefox returns to normal.

PLATFORMS AND BUILDS TESTED
On Windows XP:
  Bug occurs in current trunk build.
  Bug does not occur in Firefox 3.0.10 or the latest 1.9.1 nightly build.

On Linux and MacOSX: bug does not occur in trunk builds

SPECULATION
Did we change something in the way we process events in nested event loops
on Windows?
Hmmm... even without NoScript loaded, this does not fail for me (no unusual CPU consumption).

Running SM 2 trunk 20090606002218 on Vista SP1 x64.
is this only happening on that site or others as well?
what is browser.download.dir & browser.download.lastDir set to when the issue occures?
does it also happen in safe mode and/or with a new profile?

without *clear* *reliable working* steps i cannot find a regrange :-/.
I tried a few news sites but I can only reproduce it at http://dn.se/

Adblock plus is a requirement - sorry, I thought I was using a
clean profile, but apparently not.  STR:

0. create a new profile
1. install the Adblock plus (1.0.2 currently) from
   https://addons.mozilla.org/en-US/firefox/addon/1865
2. restart Firefox
3. pick a filter provider (I chose EasyList Germany, but I don't
   think it matters which one you choose), click Subscribe.
4. restart Firefox
5. If you have Flash installed: disable the Flash plugin and restart
   (dn.se is littered with CPU heavy Flash ads for Ikea at the moment
    which we want to avoid)
6. load http://www.dn.se/
7. CTRL+O  (or File->Open File...)
8. drag the file dialog window around for a bit, watch CPU% for the
   firefox.exe process in Windows Task Manager while doing this.

It runs at 50% (i.e. 100% of one of my CPU's cores) nearly the whole
time when moving the window.  Firefox 3.0.11 and 3.5rc2 are significantly
less CPU-heavy with the same profile.

The bug still occurs with the latest nightly:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090622 Minefield/3.6a1pre

Reproducible: always
You don't need to open a dialog to see the problem - just grab the main
window title bar and move the window around...
Regression window:  2009-05-05-04 -- 2009-05-06-04
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=2a82ba52ed30&tochange=7518078cb842

Maybe bug 67752?
Summary: 100% CPU hang when opening file dialog → 100% CPU spike when moving window
Could be!  Do we end up in the "in resize" nested event loop when moving the window?  And if so, could we somehow tell that case apart from the resizing case?

Also, why exactly does that page keep wanting to reflow?  That's the other thing that bites us here, I would think....

I'd be really interested in what happens if you just disable the resize check in the Windows "user event pending" impl and see what that does to your observed issue.

And I'm really not sure what adblock has to do with any of this.
Blocks: ireflow
Status: UNCONFIRMED → NEW
Ever confirmed: true
(In reply to comment #6)
> Could be!  Do we end up in the "in resize" nested event loop when moving the
> window?  And if so, could we somehow tell that case apart from the resizing
> case?
Maybe we could log WM_SYSCOMMAND and look to see whether the last one was a size or a move. (We still need the existing code to know when to stop.)
It seems our reflow interruption code is too good! ;-)

I get about 1000-1400 nsWindow::HasPendingInputEvent() calls *per second*
while moving the window, all of which returns true because
UserIsMovingWindow() returns true because we're in a
WM_ENTERSIZEMOVE/WM_EXITSIZEMOVE block.

I think it would be possible to separate resize from move if we really
wanted to, but perhaps this isn't necessary...

> And I'm really not sure what adblock has to do with any of this.

Neither am I.
Attached patch wip (obsolete) — Splinter Review
This seems to fix it in my debug build...
Windows appears to send WM_SIZE/WM_WINDOWPOSCHANGED within the
ENTER/EXITSIZEMOVE block at a reasonable rate...
Yup... e.g., WM_WINDOWPOSCHANGING can be used to implement things like "snapping" to a display border.  The bracketing with enter/exit sizemove make it easy to pay attention to these (or not).
Hmm.  Would that patch let us back out the patch for bug 491700 without regressing that bug?
I backed out bug 491700 and applied the "wip" patch here -- it made the hang
in bug 491700 less likely to occur, but it's still reproducible.
Attached patch wip2 (obsolete) — Splinter Review
The "wip" patch helps with bug 499447 comment 11, it's still a bit
sluggish so I added a 20msec latency in CheckForInterrupt before an
event can cause an interrupt again and this patch fixes bug 499447 too.

It also appears to fix bug 491700 but I haven't analyzed that bug in detail
so maybe I was just lucky...
Comment on attachment 384562 [details] [diff] [review]
wip2

>+      } else if (sysMsg->message == WM_SIZE) {
>+        if (gIsInsideResizeMoveEventloop)
>+          gDidMoveOrResizeWindow = PR_TRUE;
>+      } else if (sysMsg->message == WM_WINDOWPOSCHANGED) {
>+        if (gIsInsideResizeMoveEventloop)
>+          gDidMoveOrResizeWindow = PR_TRUE;
DefWindowProc converts WM_WINDOWPOSCHANGED into WM_SIZE as necessary, so you shouldn't have to check both. Also you could probably just write this
gDidMoveOrResizeWindow = gIsInsideResizeMoveEventloop;

I wonder whether it's worth setting the flag conditionally on the resized window being a top-level window rather than on there being a window resize loop.
So... the reason I wasn't doing a time check in the reflow code is because I seemed to recall it being expensive on Windows.  Can you do a try server talos run with that patch?

The 20ms latency would certainly help with bug 491700, I'd think.
Attached patch wip3 (obsolete) — Splinter Review
(In reply to comment #14)
> DefWindowProc converts WM_WINDOWPOSCHANGED into WM_SIZE as necessary...

Ok, and I renamed the method UserDidResizeWindow() since I take it we're
not interested in interrupting for WM_MOVE.

====

This patch backs out bug 491700 since I think it's unnecessary now.

It also fixes bug 496500 and bug 499447, without regressing bug 491700.

There are try server builds named "496788_wip3" at:
http://tinderbox.mozilla.org/showbuilds.cgi?tree=MozillaTry&maxdate=1245951247
Assignee: nobody → matspal
Attachment #384530 - Attachment is obsolete: true
Attachment #384562 - Attachment is obsolete: true
Attachment #385239 - Flags: review?(bzbarsky)
(In reply to comment #16)
> This patch backs out bug 491700 since I think it's unnecessary now.
> 
> It also fixes bug 496500 and bug 499447, without regressing bug 491700.
> 
> There are try server builds named "496788_wip3" at:
> http://tinderbox.mozilla.org/showbuilds.cgi?tree=MozillaTry&maxdate=1245951247

I downloaded the try server build and I am able to reproduce the hang of bug 491700 on http://en.wikipedia.org/wiki/United_states
I can reproduce that too, thanks for testing.
Attachment #385239 - Attachment is obsolete: true
Attachment #385239 - Flags: review?(bzbarsky)
Attached patch wip4 (obsolete) — Splinter Review
I'm now regulating the return value of nsToolkit::UserDidResizeWindow()
based on timers.  
Within a WM_ENTER/EXITSIZEMOVE block, there are three states:
1. a "linger" period where we return PR_FALSE, it's started by a WM_SIZE,
   the (default) length of this period is 50ms, the timer is renewed on each
   WM_SIZE but there's a max total length, 100ms (default).
2. after the period above, there's a "post linger" period of 150ms
   where we return PR_TRUE. (WM_SIZE events are ignored)
3. after the period above, return PR_FALSE until the next WM_SIZE.

When not within a WM_ENTER/EXITSIZEMOVE block: return PR_FALSE.

Try server builds with this patch:
https://build.mozilla.org/tryserver-builds/mpalmgren@mozilla.com-496788_t4_4/
Attachment #385605 - Flags: review?(bzbarsky)
>+++ b/layout/base/nsPresContext.cpp
>-static PRUint32 sInterruptChecksToSkip = 200;
>+static PRUint32 sInterruptChecksToSkip = 30;

Why?  Is this better than 200 (and on all three tier1 platforms)?

>+++ b/widget/src/windows/nsToolkit.cpp
>+static PRUint32 sResizeEventLingerMillisec = 50;
>+static PRUint32 sResizeEventMaxLingerMillisec = 100;
>+static PRUint32 sResizeEventPostLingerMillisec = 150;
>+static PRBool sResizeEventInterrupt = PR_FALSE;

These could use documentation.  It's hard to review the rest of the patch without that.

Note that HasPendingInputEvent is used for more than just interruptible reflow.  Is the new behavior appropriate for the other uses (at least the parser-interruption behavior)?  It looks like we now don't treat window moves as user events, with your patch?

Someone familiar with Windows widgetry should probably be reviewing most of this code anyway, not me...
This really needs a different reviewer....  roc, do you know who'd be competent to review this?
Flags: blocking1.9.2?
Attachment #385605 - Flags: review?(bzbarsky) → review?(tellrob)
This patch is severely underdocumented. I want to remove the window hook (see bug 503892) so I am not in favor of this fix. Does the patch posted in bug 503892 fix the problem?
I tried a similar set of steps:
Fresh profile, installed no script (ick flash), went to http://dn.se, Ctrl+O and dragged the dialog around around.

On XP without the patch from bug 503892 I can reproduce this with the above steps. On Windows 7 with that patch (and it's prerequisites) applied, I cannot hit 100% cpu usage. There is still a non-trivial amount of cpu usage but it's not nearly as bad.
It sounds like it might be worth taking bug 503892 (which should in fact fix this issue, as well as fixing bug 491700 in a better way.  And then we should look into putting some of Mats' timer stuff in the interrupt code guts...
Depends on: 503892
I don't see how bug 503892 will fix bug 491700. Bug 503892 just seems like a different way to determine if we are in the move/size loop. If I'm understanding it correctly, we'll still always return true when in the move/size loop. Indeed, I tried out the method from bug 503892 and turned off posting reflow off a timer from bug 491700 and the hang from bug 491700 returns. (I didn't apply the patch from bug 503892 directly to avoid having to recompile with changed headers, just stopped setting up the hook, and changed HasPendingInputEvent.)
Ah, ok.  Makes sense.  Sounds like the right approach is to do whatever in bug 503892, add a timer to the guts of ireflow, and then see about bug 491700...
I think the behaviour of Mats' patch where we only return true in HasPendingInputEvent if we actually received some input, instead of always returning true when in the move/size loop is desirable. I think that alone should fix bug 496500 and bug 499447. Is it possible to do something similar with the GetGUIThreadInfo way from bug 503892?
Rob, are you able to review Mats' patch here?  It'd be very nice to get this addressed...

Or is what's left of it now the timer part?
Flags: blocking1.9.2? → blocking1.9.2+
Priority: -- → P2
I didn't manage to reproduce the issue (I notice a CPU spike when moving the Open File or Firefox window, but it goes down to 0 when the move ends).

I tried an alternative approach in bug 499447 that is simply to delay the ireflow timer (I wasn't aware of this bug previously). Maybe it could be interesting to see if it fixes the issue here?
I tested both a trunk debug build and 20090819 Namoroka/3.6a2pre nightly
and I can't reproduce the original problem on http://dn.se.  I tested
also with the patch in bug 499447 which raises the timer to 10ms,
and it still works (in the debug build).

I did notice that the CPU usage is a bit high until you actually
release the mouse button, that is:
(0. open the Windows Task Manager to watch CPU usage)
1. load http://news.bbc.co.uk/sport/low or
   http://en.wikipedia.org/wiki/United_states
2. resize the window width back and forth
3. stop moving the mouse but do not release the mouse button
   -- notice that the CPU usage does not go down much, still
   about 40% for me.
4. release the mouse button -- now the CPU usage drops down
   to normal

This seems like less of a problem, so I'm lowering severity
to Normal.  We should probably make it not blocking1.9.2
as well.
Severity: critical → normal
Flags: blocking1.9.2+ → blocking1.9.2?
Comment on attachment 385605 [details] [diff] [review]
wip4

I think this patch is obsolete now.
Maybe the essence of the WM_SIZE part can help with
the remaining problem though.
Attachment #385605 - Attachment is obsolete: true
Attachment #385605 - Flags: review?(tellrob)
As per comment 30
Flags: blocking1.9.2? → blocking1.9.2-
The patch for bug 499447 should have fixed this.  Someone want to test in tomorrow's nightlies?
WFM, both the original STR and the one in comment 30.
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.3a1pre) Gecko/20091007 Minefield/3.7a1pre

Resolving as fixed by bug 499447 per comment 34.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Whiteboard: [fixed by bug 499447]
Depends on: 499447
You need to log in before you can comment on or make changes to this bug.