Closed Bug 157144 Opened 18 years ago Closed 17 years ago

Painting gets suppressed more than it probably should

Categories

(Core :: DOM: UI Events & Focus Handling, defect, P1)

defect

Tracking

()

VERIFIED FIXED
mozilla1.2alpha

People

(Reporter: jesup, Assigned: kmcclusk)

References

()

Details

(Keywords: perf, regression, Whiteboard: [adt2] [ETA 09/10])

Attachments

(1 file, 12 obsolete files)

14.54 KB, patch
dougt
: review+
kinmoz
: superreview+
jesup
: approval+
Details | Diff | Splinter Review
Example: go to that url, select the Drag example, play around.  Note that
updates can get very delayed, even on fast machines (probably much worse on slow
machines).

This is a well-known issue with the event queues on various platforms (including
Win32 and Linux, maybe not on Mac).  Paint events get lower priority than the
mouse/invalidate events, and if the mouse keeps moving enough, the paint never
happens.

Perhaps we should limit the amount of time that a paint can be suppressed?  or
flush the event queue after processing a mouse move?  or coalesce mouse moves? 
(this might have some side effects if someone tries to do (for example) build a
drawing program, but I'm not sure that's a big issue.)

See also bug 142635 and bug 153083 (and probably others).

Original reporter is pep.pepe@razorfish.de in a posting in n.p.m.performance
Keywords: perf
Blocks: 154568
This bug depends on #21762 and should be fixed with progression in #140789
Keywords: regression
Hardware: PC → All
Taking this bug
Assignee: joki → kmcclusk
This patch demonstrates that the paint starvation issue can be improved by
forcing periodic synchronous paints. This patch is not adequate however since
it will force a synchronous paint only if a reflow has been processed. It is
still possible that the paints could be starved by computations that don't
result in reflows.

In addition, I have measured a 3.5% slowdown in jrgm's pageload tests with this
patch installed. Probably the result of forcing each page to paint before
firing the onload handler.
Attachment #92892 - Flags: needs-work+
Priority: -- → P1
Target Milestone: --- → mozilla1.2alpha
Blocks: 129640
Here is a snippet for msdn documentation for WIN32 which describes the issue:

"With the exception of the WM_PAINT message, the system always posts messages at
the end of a message queue. This ensures that a window receives its input
messages in the proper first-in, first-out (FIFO) sequence. The WM_PAINT
message, however, is kept in the queue and is forwarded to the window procedure
only when the queue contains no other messages. Multiple WM_PAINT messages for
the same window are combined into a single WM_PAINT message, consolidating all
invalid parts of the client area into a single area. Combining WM_PAINT messages
reduces the number of times a window must redraw the contents of its client area. "
"The WM_PAINT message, however, is kept in the queue and is forwarded to the
window procedure only when the queue contains no other messages"

Since we rely on the WM_PAINT events to be dispatched from the message pump we
will never paint unless the msg queue is empty. 
Keywords: nsbeta1
Whiteboard: [adt2 RTM] [DIGBug]
QA Contact: rakeshmishra → desale
Blocks: 117436
Blocks: 124178
Blocks: 155160
Work in progress:

- Schedules a PL_Event to perform a synchronous paint whenever a widget has
been invalidated. This prevents paint starvation caused by continuous
scheduling of PL_Events which happens when data arrives quickly. (Examples
include: Loading a local large file from disk or loading a large file from the
cache).

- Synchronizes the display before processing user events (Mouse and keyboard).
This prevents paint starvation caused by continuous processing of a series of
user-events.
Attachment #92892 - Attachment is obsolete: true
Hmm.. how will this keyboard event thing affect editor performance?
The low priority of paint events is somewhat Windows-specific, isn't it? 
Wouldn't it also be fixed by backing out the changes in revision 3.41 of
widget/src/windows/nsAppShell.cpp, since we used to force different
prioritization of paint events, but then changed to the standard Windows
prioritization to improve performance (IIRC, a significant page load performance
gain, perhaps close to 10%, which made Mozilla's performance roughly equivalent
to that of winEmbed).  (Does GTK do something similar?  Could we fix it there in
a similar way?)

Also (mainly in reply to comment 0), paint suppression and the prioritization of
paint events are two very different issues.
(w.r.t. rev 1.41 of nsAppShell.cpp, I tried reverting that change (see 
http://bugzilla.mozilla.org/show_bug.cgi?id=132759#c99) and in current builds
I'm not seeing it as a significant difference anymore. Either I'm on crack,
or something has changed to make this moot).
"The low priority of paint events is somewhat Windows-specific, isn't it?"

I believe both GTK and WIN32 handle paint events the same way. They do not
dispatch pending paint events until the event queue is empty. They gather up the
invalidated areas and dispatch a single paint to limit the number of paint
dispatches. I'm not sure about the Mac.

Since Gecko is embedded in other applications which have their own message pump
we can not count on any special handling of the platform specific messages.
I was told that one of the reasons Hyatt removed the special event handling in
nsAppShell.cpp was so that Mozilla behaved like embedded applications in regard
to how event messages were processed. The embedded application has control over
the event loop and we can not require them to process paint messages ahead of
other events.
Attachment #94073 - Flags: needs-work+
As part of the embedding API's I see we do have a stub
NS_HandleEmbeddingEvent(msg, wasHandled) which currently does nothing except
return NS_OK.  Perhaps we could implement this and process all of the events as
a FIFO list. But we would need to restrict it to process only the events for
embedded window. WIN32 does have support for processing only the events
associated with a particular window and it's children. I think the fundamental
problem with our event processing is that we favor particular messages over
others when processing the event queue. This leads to starvation of the lower
priority messages. I think we should experiment with processing the messages in
the order in which they are posted.
Keywords: nsbeta1nsbeta1+
There are three major issues that I am seeing with event starvation on WIN32.

1. UI Lockup/Hang when attempting to moving a top-level window, resizing the
top-level window, or click on native menu within an embedding app while a page
is loading from the net, cache, or local disk.  Mozilla completely locks up
until the entire page is loaded.  
2. Paint events are starved when loading a page from the cache or local disk.
3. Paint events are starved when mouse pointer is moved. This makes it difficult
to implement drag and drop using JavaScript, CSS, DOM because we will never
update until the pointer stops moving. The mouse move events are processed with
a higher priority then the paint events so the element is not displayed until
the user stops dragging.
 
case 1. Can *not* be solved by reverted back to revision 3.41, which inverts the
event priority placing the WM_APP below particular events, because once we
initiate a window move,or  resize Windows spins a separate event processing
loop. We have no control over the event processing at this point.

case 2. Can be solved by reverting to revision 3.41  as dbaron suggests. But we
would also have to force embedding apps to do special processing of their event
queues.

case 3. Can *not* be solved by reverted to revision 3.41 because the paints are
still at a lower priority than the mouse events.

To solve the first two cases, I think we have to abandon the use of WM_APP
events as the mechanism for PL_Event notification in plevent.c. Instead we could
use a WM_TIMER event with a duration of 0. The TIMER events have the lowest
priority so all system events will be processed at a higher priority which fixes
both case 1 and case 2.  

case 3 may be solved by forcing a synchronous paint of the invalidated areas
when  specific user events such as mouse move, mouse down etc. are processed in
nsWindow.cpp. I think case3. should be handled as a separate issue. I will spin
off a new bug for it.


Test case:

1. Save a large file such as
(http://lxr.mozilla.org/seamonkey/source/layout/html/style/src/nsCSSFrameConstructor.cpp)
 locally.
2. Load the local file.

3. Try moving the window by grabbing the title bar. Mozilla freezes until the
entire page is loaded. No user interaction will happen once it freezes. Try
resizing the window. It freezes. Try clicking a menu item. It does not display.

4. Use mfcembed to load the local file. The page will not display until the
entire page has loaded. Try clicking on a menu or moving the top-level mfcembed
window while the local file is loading and the app will freeze until the entire
page is loaded.
The patch i've attached changes plevent.c to use a WM_TIMER message instead of
WM_APP message as the PL_Event notification mechanism.

benefits:

1. Mozilla remains interactive during page loads from the net, disk, or cache.
2. Embedded applications, mfcembed, winembed remain interactive during page
loads without modifying their event processing loops.
3. Mozilla and embedded applications incrementally display the page without
modification of their event processing loops.
4. CPU utilization for www.netscape.com dropped from 51% to 25%.

I have run jrgm's page loader tests on 750Mhz  WINXP machine over a cable modem.
I can *not* detect a performance degradation but my precision is only within 6%.
 (http://cowtools.mcom.com/page-loader/loader.pl)

jrgm: can you apply the patch and run a comparison to see if page load is affected?

If you apply the patch you will use WM_TIMER events which makes Mozilla
responsive during page loads.
if you comment out  #define USE_TIMER_EVENT and recompile, it will revert back
to the current behavior which uses WM_APP events.
Attachment #94073 - Attachment is obsolete: true
Comment on attachment 94661 [details] [diff] [review]
Use a WM_TIMER event instead of WM_APP event for PLEvent notification

These changes look good and the inital data sounds great.  Lets get this
checked in soon so we have lots of time this milestone to detect any problems.

I haven't seen a comment reference win16 in some time!

r=dougt
Attachment #94661 - Flags: review+
I'll give this a run through later today.
(note: using the updated patch, attachment 94693 [details] [diff] [review], for the tests below).

I did a series of runs, first with a Buffy (current trunk) build, with
and without USE_TIMER_EVENT defined. Then I repeated them for mfcEmbed
standalone builds, with and without the define. On win2k, this was
about 4% slower for the Buffy build mfcEmbed build. However, the
"slower" builds were more responsive in e.g., I can resize and move
the window while loading nsCSSFrameConstructor.cpp in the patched
build, but I'm locked out of the game with the unpatched build (on
both 98 and 2k).

mfcEmbed, win2k,   500MHz, 12MB 
-------------------------------
  Without patch:    Avg. Median : 781 msec	Average     : 791 msec
  With patch   :    Avg. Median : 813 msec	Average     : 821 msec
                                 -----
                                   +4%

Buffy,    win2k,   500MHz, 12MB 
-------------------------------
  Without patch:    Avg. Median : 870 msec	Average     : 874 msec		
  With patch   :    Avg. Median : 905 msec	Average     : 909 msec		
                                 -----
                                   +4%
                    
[Un]fortunately, I ran the same tests on a win98 system. I'm really sorry 
to say that (apparently) something seriously goes wrong on that OS.

mfcEmbed, win98SE, 400MHz, 128MB
---------------------------------
  Without patch:    Avg. Median : 1037 msec	Average     : 1062 msec	
  With patch   :    Avg. Median : 1226 msec	Average     : 1246 msec	
                                 -----
                                   +18%                       

Buffy,    win98SE, 400MHz, 128MB
---------------------------------
  Without patch:    Avg. Median : 1132 msec	Average     : 1152 msec		
  With patch   :    Avg. Median : 1312 msec	Average     : 1324 msec		
                                 -----
                                   +16%

The performance degradation occurs on all the test pages.
FWIW the problem trying to be fixed here isn't a problem on 
os/2 because it doesn't merge paint requests and it 
automagicly bumps execution priority of threads having 
window messages waiting. I'd wish y'all check out the 
result of these changes on linux, mac, os/2, etc, before 
doing anything drastic... and the change proposed here does 
make drastic change in message delivery patterns, I think.  
Sam: The proposed change would only affect WIN32. Linux and Mac use completely
different mechanisms for the native PL_Event notification which are unaffected
by the proposed change. I'm glad to hear that OS2 does not suffer from this issue. 

"...the change proposed here does make drastic change in message delivery patterns"

Yes, it does on WIN32. 
> I can resize and move the window while loading nsCSSFrameConstructor.cpp in the 
> patched build

Cool!  Can you also scroll using the scrollbar and using the keyboard?  (For bug
129640.)
This is the same problem as we see in bug 132759 when Flash uses WM_USER+1
messages.

To reduce the timer performance hit, would it work to use a timer for every 
10th or 20th time (or something like that)? 
"would it work to use a timer for every 10th or 20th time"

Unfortunately no. The problem is the timer callback is not called soon enough,
not that it is called too much. There seems to be a large lag between  timer
callbacks even though the duration is set to 0. If I set up the timer to
generate WM_TIMER events instead of using the callback and I also modify the
nsAppShell event processing to pull WM_TIMER's off the event queue first it is
*still slow*.    However, If I modify nsAppShell to process any pending
PL_Events before processing any WM_ events then performance is back to normal.  
Brian: Oops.. I misunderstood your last comment. Yes it would probably help
performance to only use timers every n times.  
How about using WM_TIMER, but with the following changes:
-- in nsEventQueue::ProcessPendingEvents, after calling PL_ProcessPendingEvents
the first time, check to see if there are now more pending events, and if there
are, go back to the beginning and process them. Repeat until the queue is empty.
-- BUT before each call to PL_ProcessPendingEvents, call ::GetQueueStatus() to
see if there are any pending paint or input messages, and if there are, return
immediately. For maximum responsiveness one could even call ::GetQueueStatus()
inside the event processing loop in PL_ProcessPendingEvents.
roc wrote:
>How about using WM_TIMER, but with the following changes:
>-- in nsEventQueue::ProcessPendingEvents, after calling PL_ProcessPendingEvents
>the first time, check to see if there are now more pending events, and if there
>are, go back to the beginning and process them. Repeat until the queue is empty.

I recall this is how plevent.c used to work, several years ago, and combined
with lengthy compute-bound control flow into gecko from plevents, it led to
horribly high latency and poor UI responsiveness.  I'm not positive that
WM_TIMER was the message type used, but I am sure the old-old plevent.c code
used to loop till the queue was empty.  The change troy and others made was to
count its population at the top of PL_ProcessPendingEvents, and loop over only
that many events.

/be
> it led to horribly high latency and poor UI responsiveness

Yeah, that's why I suggested using ::GetQueueStatus to leave the loop as soon as
incoming input or painting is detected.
roc: I saw that, but it still seems like other uses (are there any?) of PL_PPE
could suffer the change in semantics.  Maybe no one else calls PL_PPE, and this
is all a matter of (re-)definition.

/be
There don't appear to be any other uses of PL_ProcessPendingEvents (except
possibly in the BeOS nsToolkit? Whatever...)

One uninterrupted run through the queue per call to PL_ProcessPendingEvents is
probably necessary to make our "deferred painting" work. (Basically, when a
bunch of reflows and repaints occur at once, we need to make sure that the
painting happens AFTER all reflows have been processed, so we delay the
invalidates by issuing custom PLEvents which then issue native invalidates. We
want to finish procesing the originally-pending PLEvents before those native
invalidates are serviced.) So don't try to stick a call to ::GetQueueStatus
inside PL_ProcessPendingEvents itself.
"... before each call to PL_ProcessPendingEvents, call ::GetQueueStatus() to
see if there are any pending paint or input messages."

There is almost always a pending paint message as the result of invalidates from
the first PL_ProcessPendingEvents call so we end up doing almost the same number
of SetTimer's and pay the price of their dispatch latencies.

We could toggle between using Timers and WM_APP events using ::GetQueueStatus to
determine if there are any input messages. This would allow Mozilla to remain
interactive during long page loads, but this would not solve the paint
starvation issue if their isn't any user input during the page load. However,
the paint starvation issue could be solved separately, by scheduling a PL_Event
to do synchronous paint of any invalidated area when a nsWindow is invalidated. 
I'm not positive I understand all the subtleties of PL_ProcessPendingEvents 
(brendan and others: slap me if I'm not making sense) but I'm okay as long as we 
realize that ::DispatchMessage(&msg) actually loops sucking events out of the 3
queues (unprioritized) when the user is moving/resizing the window. This of 
course means that posted messages are serviced by windows code (out of our 
control) before user events like mouse drag and keyboard when the user is 
moving/resizing the window.
I'm also sure I don't understand this patch well... that 
said, why SetTimer with elapsed=0 ? There's a maximum 
useful rate to process screen changes. Screen vertical 
refresh rate ranges 50-100Hz, that is 10-20ms, so pumping 
screen updates faster than this is visually ineffective. 
Sam
Using set timer is not about delaying the activity to some later time but
about making the *new* request be serviced *after* the user input that is 
*already* waiting in the hardware message queue. If Microsoft had done the 
obivious thing of making user input on a UI based system have top priority 
instead of bottom priority then we would not have to be fiddling like 
this. 

overview of what has been discussed in this bug that hopefully will clear up
some questions:

When a PL_PostEvent is called a WM_APP msg is Posted to the WIN32 native message
queue. (If one has not been posted already). The PLEventQueue is separate from
the native event queue. The PLEventQueue will often contain numerous events
while there will be only one WM_APP msg posted to the native message queue. The
WM_APP event is used as notification to indicate there is at at least one event
and possible more events in the PLEventQueue that need to be processed.  (Other
platforms use different mechanism for this notification. Pipes are used on XP_UNIX)

When control returns to the message processing loop which performs 
TranslateMessage, DispatchMessage it will process the WM_APP msg that was posted
and call a routine which in turn makes a call to PL_ProcessPendingEvents which 
processes pending PL_Events.

When each PL_Event in the  PLEventQueue is processed it may call PL_PL_PostEvent
which puts additional events on the PL_EventQ.  The posting of the PL_Event will
result in a new WM_APP event being posted to the native message queue. To
prevent continuously looping over the PL_Events in the PLEventQ as the result of
this additional message posts we only process the number of events that are
sitting in the PL_EventQ at the time PL_ProcessPendingEvents is called.

In theory this should work fine. Except for the fact that messages that are
"posted" to the native WIN32 message queue  using ::PostMessage are processed
*before* all other messages.  We overcome this somewhat in nsAppShell where the
main message loop for Mozilla resides. It processes the WM_MOUSE and WM_KEY
events first by peeking into the native message queue and pulling them out
before the WM_APP events.  

But this technique completely falls apart when the user grabs the title bar of
the window or tries to resize the window, because Windows spins its own event
loop which will process the "posted" messages ahead of all other messages. When
the posted WM_APP events are processed they often result in a new WM_APP event
being inserted into the native message queue because additional PL_Events are
generated. This causes a complete lockup of the UI because the message loop
spends all of its time working on the posted messages, while all other events
mouse move, paints etc. remain in the native message queue until the "posted"
native messages stop coming in.

Interestingly the 4.x code base does not have this problem even though it also
uses "posted" WM_APP events. But it uses PL_Events sparingly. They are only used
to indicate data has arrived from the network.  In Gecko the PL_events are used
for network data notification,  Reflow notification, timer notification,  paint
invalidation. etc.,etc.

My solution was to use a timer's instead of "posted" WM_APP messages.. The
reason timers fix this problem is because they generate "non posted" messages. 
Since the native timer messages are not posted, but instead they are generated
by windows internal timer code they do not starve out other "non posted" events
which occur when the user moves or resizes the window.

The problem with SetTimer is there seems to be a latency before the actual timer
dispatch which slows down page load.

Roc's proposed solution was intended to eliminate the number of times SetTimer
needed to called. It involves processing more PL_Events without returning to the
native message pump. He was noting that many of the calls to process the
PLEvents result in additional posts to the PLEventQueue. He proposed that we
continue processing the additional posted PL_Events unless there was a pending
input or paint event in the native message Q. 

In practice the native message queue will often contain a paint event which is
the result of invalidation performed while processing the PL_Events so we return
immediately, and end up performing the same number SetTimer calls.  

If I use Roc's proposal and  return only when there is an input event,
performance improves but is still not as good as the when the WM_APP messages
where used. It appears to be about 2% slower than use WM_APP messages on WinXP.

I also tried switching dynamically between using WM_APP events and WM_TIMER
events based on the user input. When user input is pending I used timers. This
brings performance which matches using WM_APP messages, but the app is sluggish
when dragging or resizing the window. (You are not completely "locked out" but
we are not as nearly as responsive as I.E or 4.x during the long page load using
this technique.)  

A good solution would be if we could find a way to generate a "non posted" event
that we can use for notification other than using timers.  Could pipes be used
on WIN32 just as they are on UNIX?. Is there a way of connecting a WIN32 pipe to
the GUI event processing so we can register a callback when data arrives on the
pipe.  If the pipe implementation on WIN32 used posted messages this would not
work because we would be back to the original problem.
How about if we add a lower priority queue and test for starvation?

Perhaps we could move the WM_USER, WM_APP, and PL_Event* messages to a non-OS 
queue that we prioritize lower than the system level keyboard/mouse queue.

The basic code would look like this:

  set starvation timer

  while (1) {
    did_something = false
    if (have event other than wm_paint or wm_timer) {
      get next message (in natural order) (except wm_paint or wm_timer)
      if (wm_user or wm_app message)
        move to lower priority queue
      else 
        service the message
      did_something = true
      if (not starving)
        go to start of loop
    }

    if (have lower priority message) {
      service lower priority messages for a while
      did_something = true
      if (not starving)
        go to start of loop
    }

    if (have wm_paint) {
      service wm_paint messages for a while
      did_something = true
      if (not starving)
        go to start of loop
    }

    if (have wm_timer) {
      service wm_timer messages for a while
      did_something = true
      if (not starving)
        go to start of loop
    }

    if (did_something == false) // nothing to do
      wait for a message/timer to arrive/fire

    reset starvation timer
  }


I want to _not_ remove wm_paint messages from the OS message queue early. The OS  
naturally combines update regions and this minimizes repainting.

We still have to handle the window move/resize case. This is the case where 
inside the DispatchMessage system call it internally reads the queue (ie: we 
cannot manage the queue). One way to deal with this is by having code in the 
window proc that detects wm_user and wm_app messages that have not been moved to 
the lower priority queue. In the window proc we can then move them to the lower 
priority (non OS) queue.
Brian: The patch I attached does something similar to what your proposing. It
keeps metrics on event starvation before decided whether to post a WM_APP
message or set a timer.  The concept is that the posting allows us to send a
high priority notification and setting a timer allows us to send low priority
notification.  If we are not staving input or paint events then we keep posting
WM_APP events. If we detect starvation we use a timer event which allows any
pending input and paint events to be processed.

 I need to do some more tuning and testing, but the patch fixes the UI lockup
problem/paint starvation issue and seems to do it without affecting page load
performance. 

 I've been trying to avoid doing more special processing in the nsAppShell::Run
since embedded applications have their own event loops and it may not be
possible for them to modify message processing priorities.
Note: The patch I attached uses counters to indicate event starvation.(i.e
number of consecutive times a WM_APP msg was posted while there was a pending
input or paint) It would probably be better to compare elapsed time instead. I'm
going to experiment with using elapsed times instead.
Kevin: any thoughts on what to do when a plugin like Flash is posting events and is
able to use 100% of CPU?
I find I can fix the "Flash plugin uses 100% CPU" and starves out user input bug, 
bug 132759.

  define a window proc
  in the window proc test if input is pending: 
  if (HIWORD(GetQueueStatus(QS_INPUT))) {
    defer the message by putting in on a timer with delay 0
  }
Brian: Great! This approach is compatible with attachment 95604 [details] [diff] [review].  So our general
solution appears to be to demote the priority of WM_APP, WM_USER events by
converting them to timer events when input is starved.  

Should you also include paint starvation? Do the drop-down menus in Mozilla
appear quickly and correctly on pages which use a plugin which generates WM_APP
and WM_USER events?
> Should you also include paint starvation? 

Yes, I had not looked at paint yet as I've been worried about the "moz freezes
the *entire* UI and requires Ctrl-Alt-Delete to break the freeze" problem.

Assuming paint and timer messages come thru the window proc perhaps we can
add something like this to the window proc (assuming it does not have a
significant performance issue).

  DWORD currentTickTime = GetTickCount();

  // tend the starvation timers
  if (msg.message == WM_PAINT)
    lastPaintTime = currentTickTime
  else if (msg.message == WM_TIMER)
    lastTimerTime = currentTickTime

  // check for starvation
  else if (msg.message >= WM_USER) {
    PRBool shouldDelay = PR_FALSE;
    // check for input starvation (right now!)
    if (HIWORD(GetQueueStatus(QS_INPUT)))
      shouldDelay = PR_TRUE;

    // check for paint/timer starvation (every now and then)
    else if (currentTickTime != lastTickCount) {
      lastTickCount = currentTickTime;
      if ((currentTickTime-lastPaintTime) > MAX_PAINT_STARVATION_TIME) {
        if (GetQueueStatus(WM_PAINT))
          shouldDelay = PR_TRUE;
        else
          lastPaintTime = currentTickTime
      }
      if ((currentTickTime-lastTimerTime) > MAX_TIMER_STARVATION_TIME) {
        if (GetQueueStatus(WM_TIMER))
          shouldDelay = PR_TRUE;
        else
          lastTimerTime = currentTickTime
      }
    }

    // I'm not sure if we need to test for WM_HOTKEY or if that is
    // handled as a posted message below WM_USER

    if (shouldDelay)
      defer the message by putting in on a timer with delay 0
  }
Shanjian: if this works I believe we may no longer need the patch you just 
added for IME event loop problems, bug 57332.
I realized that the use of a single lastTickCount timer to limit the number of
times I check probably won't work exactly like I expect if there are multiple
sources calling PostMessage. 

  else if (currentTickTime != lastTickCount) {
    lastTickCount = currentTickTime;
    if ((currentTickTime-lastPaintTime) > MAX_PAINT_STARVATION_TIME) {

It probably needs to look at each message time individually. Just servicing one 
message and then resetting lastTickCount ignores other messages. I suspect I 
should just drop lastTickCount and do something like this:

  else if ((msg.time-lastPaintTime) > MAX_PAINT_STARVATION_TIME) {

Will this same kind of issue apply to your patch?

+static PRBool _md_EventIsStarved(PRBool isPending, PRUint32 
...
+  if (isPending) {
+    /* was_pending must be false so record the current time
+     * so the elapsed time can be computed the next time this
+     * function is called 
+     */
+    *lastTime = PR_IntervalToMilliseconds(PR_IntervalNow());
+    *wasPending = PR_TRUE; 
+    return(PR_FALSE); 
+  }
The drag example in http://www.domapi.com/ seems to gone. :-(
Another good url when you dont see what you are dragging is:
http://www.dhtmlcentral.com

Just drag around the different Windows. Cannot see them in Windows 2000 while
been dragged.
I filed bug 163526 for the drag starvation issue. This bug (157144) covers only
the starvation issue when loading from the cache or local file. 
Whiteboard: [adt2 RTM] [DIGBug] → [adt2 RTM]
"Will this same kind of issue apply to your patch?"

Brian: I don't think this is a significant issue for attachment 95717 [details] [diff] [review].  It would
be more accurate to use the actual message time of the msg that was starved but
the approximation of using the last time that a msg was posted with a pending
paint or input msg seems to be good enough. I separate the paint and input times
so their elapsed times are calculated separately. 

For paint events we would need to capture the time of the first invalidate.
Since paints are accumulated the time that is returned in the msg structure is
not the first time. AFAIK it always gives back the current tick count making it
useless for calculating the elapsed time.  
using attachment 96068 [details] [diff] [review] I recorded 6774 calls to PostMessage vs. 39 calls to
SetTimer when running jrgm's page load tests. SetTimer was called during page
load due to paint starvation. Since SetTimer is used infrequently during page
load unless the user is interacting with the UI, page load times should not be
affected.  
> Since SetTimer is used infrequently during page load unless the user is 
> interacting with the UI, page load times should not be
> affected.  

For bug 132759 my results feel very similar to this: very few calls to delay
the messages except when the user is dragging/resizing the window (and with
my current test patch the UI response is quite acceptable).
jrgm: Could you run your perf test again with attachment 96197 [details] [diff] [review]? Thanks! I can
not detect any performance impact using the patch. In addition SetTimer is only
called for approx 0.5% of the notifications when running your page load tests.
Attachment #96197 - Attachment is obsolete: true
I have completed the fix (attachment 96247 [details] [diff] [review]). It is ready to be reviewed.
I've opened bug 164084 to discuss whether we should remove the PeekMessage code.
Please cc: yourself if interested.
I do not have any knowledge of PL_Events but I have been looking at MSWin event 
loops quite a bit recently and the timer code design and logic looks good to me.

+#define PAINT_STARVATION_LIMIT   750

Its my gut level feeling that when navigating menus with a mouse a delay of 
over 250 milliseconds make it hard for the user.

I note that this does not look for timer starvation. Do / should we care about 
timer starvation here?
>I note that this does not look for timer starvation. Do / should we 
>care about timer starvation here?

Basicly, that's my problem with this fix, to the extent that 'I have a 
problem'. Win32 GPI is what it is, does what it does, it works the way 
it works. Once you take on task of 'fixing it' to behave differently in 
one bit of code.... then where do you stop?

I raise this as a question. Don't say I know the answer.
    
"Its my gut level feeling that when navigating menus with a mouse a delay of 
over 250 milliseconds make it hard for the user."

Whe navigating menus the INPUT_STARVATION_LIMIT of 50 ms is what will be
triggered, So the App will remain responsive during the long page load. I have
confirmed this by loading a local copy of a long file. I am able to mouse over
menu items and the user response is acceptable. When the input starvation limit
has been reached, a timer is used which allows all pending input AND paints will
be processed before the PL_Event notification, so the effective update rate when
there is a user interaction is around 50 ms not 750ms. The
PAINT_STARVATION_LIMIT is triggered when the user has not interacted with the
page at all during a long page load of cached or local file.

Mozilla timers are implemented using the PL_Event mechanism.  We will not starve
out the Mozilla timers because they are already being processed as  PL_Events.  

"Basicly, that's my problem with this fix, to the extent that 'I have a
problem'. Win32 GPI is what it is, does what it does, it works the way
it works"

Actually this fix does *not* change the way the WIN32 GDI works. It does not
attempt to fix it. It doesn't invert the order of any messages or similar
nastiness, which is what is currently going on in nsAppShell by the way. In
fact, we will probably be able to remove the code in nsAppShell which pulls out
the mouse and key events ahead of other events once this patch goes in. It also
allows embedded apps to run using unmodified msg processing loops.  The WIN32
message processing does not work exactly the same way as other platforms. Thats
to be expected. The Widget and Gfx modules are littered with code where we  have
identified native platform behavior discrepancies and make them compatible with
the XP code's expectations. This is just happens to be another instance of that.

This patch recognizes how the WIN32 msg queue processing works (Posted events
are always processed first) and uses a mechanism for dispatching events
PL_Events which is compatible with it. In Mozilla the basic model for
dispatching most everything is a PL_Event. Currently we use a WIN32 GDI
Mechanism which makes the notification the highest priority because it is a
posted event. WIN32 GDI also provides a mechanism for providing a notification
with the lowest priority which is the timer.  This patch just tries to make an
intelligent choice when to use one or the other. When there is user interaction
we want to use the lowest priority  because we want to favor interactively over
page load time which means using a timer. When there isn't any user interaction
we want to use the fastest mechanism which is a posted event.

This is not a case where we are in the dark about what is really going on here.
We know exactly why the starvation occurs. We have the microsoft documentation
of how it processes posted events, paint Events and timers.  With that
knowledge, a solution has been crafted which takes into account microsoft's
implementation and the Mozilla cross platform code expectations.

Will there be other cases which we will have to address?  Maybe.  But the
current patch solves the major UI lockup and paint starvation issue when loading
a long document. I think we can build upon this solution for other starvation
issues. The concept is pretty simple. We can either post high priority or low
priority notification and we need to determine when to use one over the other.
What's the landing plan? If you want this in 1.2 we should get it reviewed and
landed ASAP.
> When navigating menus the INPUT_STARVATION_LIMIT of 50 ms is what will be
> triggered, ... so the effective update rate when there is a user interaction 
> is around 50 ms not 750ms. 

That's plenty fast enough, Since this should also be the case for my plugin
patch I will go experiment with extending the paint starvation time there.

> We will not starve out the Mozilla timers because they are already being 
> processed as PL_Events.  

I thought I saw the timer starvation test firing occasionally when testing my
plugin patch. Are there any non-mozilla timers we need to care about?
Comment on attachment 96247 [details] [diff] [review]
Improved patch: Use WM_ENTERSIZEMOVE and WM_EXITSIZEMOVE to detect when the user is moving the window

1.  Maybe move or copy the comment above _md_EventIsStarved about starvation
limits to where the #defines are.  I was wondering what where were for and had
to search for other references.

2. Remove the win16 defines. lets not kid ourselves.

3. Please be consistent with the return statement:

+	 return(PR_TRUE); /* pending and over the limit */
+      }
+
+      return (PR_FALSE); /* pending but within the limit */
+  }

either a space or no space , not both.	I would rather have you just drop the
parenthesis.

4. You are not removing the property set by SetProp.  You probably can do the
same for the WindowsHookEx, but it is not as critical.
Note: I removed all of the WIN16 references in plevent.c
Attachment #96247 - Attachment is obsolete: true
Comment on attachment 96735 [details] [diff] [review]
patch incorporating Doug's suggestions

thanks. r=dougt

Just a thought, maybe we should have a #define which disables this
optimization.
Attachment #96735 - Flags: review+
Comment on attachment 96735 [details] [diff] [review]
patch incorporating Doug's suggestions

sr=kin@netscape.com

I like doug's idea of using ifdefs, initially, so that we can turn on/off this
code to see if it's contributing to any unforseen regressions.
Attachment #96735 - Flags: superreview+
Attachment #96735 - Attachment is obsolete: true
> What's the landing plan? If you want this in 1.2 we should get it reviewed 
> and landed ASAP.

I assume roc actually meant 1.1. But I don't think we really need to force 
this into 1.1 at the last moment.

Sorry, I didn't get some more tests in (something about some other release and 
all :-). I'm updating my trunk build now and should have results later tonight.
(I'll also be checking win98 as well, since we previously saw a much different 
response to changes in event behaviour on that platform than on a NT 
derivative.)
I meant 1.2. There's no one anyone is going to force anything into 1.1 at this
point :-). I was just wondering if you're planning to land this in 1.2.

This kind of change has been very regression-prone in the past and the trunk is
already quite unstable. We need to land this with plenty of time before 1.2 is
due to ship to give us time to deal with any possible regressions.
*** Bug 164931 has been marked as a duplicate of this bug. ***
Blocks: 164931
(roc: sorry for misunderstanding your comment).

I built the trunk optimized from this afternoon (but reverted the pref
change that regressed pageload by a couple of percent [bug 30088] in
my build). I applied attachment 96761 [details] [diff] [review] and tested the A and B builds on
win2k and win98

win2k/500MHz/128MB
  Avg. Median :  919 msec                Average : 929 msec
  Avg. Median :  923 msec                Average : 936 msec
                --------                          --------
                +0.4%                             +0.7%

win98/400MHz/128MB
  Avg. Median : 1181 msec               Average : 1206 msec
  Avg. Median : 1211 msec               Average : 1232 msec
              --------                            --------
                +2.5%                             +2.1%

So, the win2k result is effectively none (I don't consider that change
to be statistically significant). The win98 change does appear to be
real, although it may be acceptable given that the browser should be
more responsive now. It's also curious that almost all of the change
on win98 occurs because of >10% slowdowns on four particular dummy
pages (zdnet.com, gamespot.com, DOM Core, voodooextreme.com). I should
probably test this again on win98 to be sure. Also, perhaps there is
some "characteristic" of those pages that leads to this slowdown
(i.e., perhaps we can identify something that can be independently
fixed for those pages specifically, or win98 generally). Note: I
didn't do any testing of the UI responsiveness (although I'll look at
that tomorrow on win2k and win98).
jrgm: The 2.5% to 2.1% perf hit on Win98 is probably the result of those pages
triggering the paint starvation limit more frequently. So they are painting more
than they used to. Could you try changing following define in plevent.c?

from:
PAINT_STARVATION_LIMIT   750

to:
PAINT_STARVATION_LIMIT  2000

And re-run your test on Win98. Thanks!
Blocks: 165039
I tried Kevin's suggestion to change the value to 2000 for 
PAINT_STARVATION_LIMIT and reran the test on win98. This did 
improve things although it did still come in about a percent 
slower than without the patch at all, with most of the "loss" coming
in the two slowest/largest pages in the test sequence. Should I try 
a different setting for that define?

I did try playing with the browser while loading nsCSSFrameConstructor.cpp
and the browser is more responsive, both resizing and moving while the document
loads. (Subjectively it seems more responsive on win2k/500MHz than on 
win98/400MHz, although I don't know if that is due to the speed or the OS 
difference).
"Should I try a different setting for that define?"

3000 or 4000 should bring it below 1%. The relationship is linear. 750 results
in a 2.5% slowdown and a setting of 2000 results in 1% slowdown.  

3000 should bring it to approx .6%
4000 should bring it to approx .5%

I think .6% on WIN98 is a reasonable price to pay. The .6% slowdown is result of
legitimate painting of the page. Part of our current page load performance is
the result of starving painting for long periods of time.  Setting a limit of 3
seconds on the starvation seems reasonable even if it slows the page load tests
on WIN98 down by .6%. 
I agree that it is reasonable to take a bit of absolute load slow down to give
the user feedback that something is happening. I myself would opt for the 
shorter time with better feedback and slightly slower loading. It is very 
disconcerting to see moz "freeze up" even for 4 seconds.

John: could you tell us the times for the pages that seem to be most affected?
John: and the URLs so I can get a feel of loading the pages?
There's nothing to stop us using different values for different Windows
versions. We could do a run-time test and use 750 on NT systems and 3000 on
Win9x systems.
Comment on attachment 97188 [details] [diff] [review]
Same as the last patch + roc's suggestion to determine the paint starvation limit based on OStype. WIN9x=3000 all others=750

Can you do something like this instead:

os.dwOSVersionInfoSize = sizeof(OSVERSIONINFO);

This way it is more clear that you are talking about the size of the type.
Attachment #97188 - Attachment is obsolete: true
Comment on attachment 97212 [details] [diff] [review]
Same as last patch + Doug's suggestion for sizeof(OSVERSIONINFO)

sr=kin@netscape.com

Not a big deal, but I was just wondering, since the starvation limit will never
change after it is calculated by _md_GetPaintStarvationLimit(), should we be
caching the value in _pl_NativeNotify to avoid the function call overhead to
_md_GetPaintStarvationLimit() each time we call _md_EventIsStarved()?

+    if (_md_EventIsStarved( (qstatus & QS_PAINT),
_md_GetPaintStarvationLimit(), 

or is the overhead so small it's not worth it?
Attachment #97212 - Flags: superreview+
Running jrgm's page loader tests results in just over 6000 calls to
_md_GetPaintStarvationLimit(). Compared with all of the calls and processing
performed to actually load the pages, reflow, render, etc. etc. This has to be
insignificant. 
Comment on attachment 97212 [details] [diff] [review]
Same as last patch + Doug's suggestion for sizeof(OSVERSIONINFO)

r=dougt
Attachment #97212 - Flags: review+
"could you tell us the times for the pages that seem to be most affected?"

                         b      b/a      c      c/a      a
-------------------------------------------------------------
www.w3.org_DOML2Core	3924	105%	3827	103%	3730
www.voodooextreme.com	4022	107%	3917	104%	3762

where a is without the patch, b is the patched build with '750',
and c is the patched build with '2000' as PAINT_STARVATION_LIMIT,
and these numbers generated on win98/400MHz/128MB. The URLs to "static"
copies of these test pages are:
http://cowtools.mcom.com/perf/content/base/www.voodooextreme.com/index.html
http://cowtools.mcom.com/perf/content/base/www.w3.org_DOML2Core/index.html

I agree that taking a small hit on win98 in exchange for not locking up the UI 
is the right choice (of course).
I checked in attachment 97212 [details] [diff] [review] 

If you suspect this checkin is causing a problem you can prevent it from using
the timer by commenting out the following line in mozilla/xpcom/threads/plevent.c

#define USE_TIMER
So is this fixed now, or are we leaving this open for more patches and/or tuning
of the values?
Keywords: mozilla1.1mozilla1.2
Its fixed. 
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Nominating for 1.0 branch as improving DHTML perf would be good for embedding
customers. Marking as edt1.0.2.
Whiteboard: [adt2 RTM] → [adt2] [ETA 09/09]
desale: can you pls verify this as fixed on the trunk? thanks!
Whiteboard: [adt2] [ETA 09/09] → [adt2] [ETA 09/10]
Kevin,
This one is marked "FIXED" on 08-30-2002.
Which testcase do you think I should use to verify this Fix ? 

I tried the testcase provided at URL of this bug (Played around with Drag
example) with 2002-09-09-08-trunk & 2002-08-15-04-trunk & I did not find any
difference in painting updates (No difference in Results).

If this one is Fixed on 08-30-2002, then two different trunks before & after
08-30-2002 should produce different test results.

Or is it marked Fixed on basis of any other test (jrgm's performance test or
something else). Please let me know which test I should use to verify this fix.
Prashant:

Use the following test case:

1. Save a large file such as
(http://lxr.mozilla.org/seamonkey/source/layout/html/style/src/nsCSSFrameConstructor.cpp)
 locally.
2. Load the local file.

3. Try moving the window by grabbing the title bar. Before this patch went in
Mozilla would freeze until the entire page is loaded. No user interaction would
happen once it froze. Try resizing the window. It freezes. Try clicking a menu
item. It does not display.  With the current trunk builds It should remain
interactive.

4. Use mfcembed to load the local file. The page will not display until the
entire page has loaded. Try clicking on a menu or moving the top-level mfcembed
window while the local file is loading and the app will freeze until the entire
page is loaded. With a current trunk build it should display after a few seconds
well before the entire document has completed its load.
Yes,

With testcase provided by Kevin, I can see that problem is fixed in todays 
trunk.
I tested two trunks one 2002-09-09-08-trunk & other 2002-08-15-04-trunk from 
last month before fix checkin. I can see the bug is fixed.
I tested on WinXP.

Marking verified
Status: RESOLVED → VERIFIED
Keywords: edt1.0.2edt1.0.2+
Randell/Chofmann: Looks like the EDT wants this on the 1.0 branch. Can you pls
provide 1.0 branch approval for this patch?
Keywords: approval
Comment on attachment 97212 [details] [diff] [review]
Same as last patch + Doug's suggestion for sizeof(OSVERSIONINFO)

a=rjesup@wgate.com for 1.0 branch.  CHange mozilla1.0.2+ to fixed1.0.2 when
checked in.

Please check in ASAP for 1.0.2 tagging.
Attachment #97212 - Flags: approval+
Fix checked into the 1.0 branch.
desale: pls verify this as fixed for the 1.0 branch, then replace "fixed1.0.2",
with "verified1.0.2". thanks!
Verified on 1.0 branch
This bug was listed as OS all, but from what I gather from reading the comments,
it is fixed on Windows only.  If that's correct, have separate bugs been filed
for Linux and Mac on this issue, as Kevin indicated he would do for bug #165039?

The fix for this and bug #165039 sound like something that'll really make a huge
difference in the interactive feel of Mozilla, and even if we give up one or two
percent in page load time it'll definitely be worth it.  Looking forward to
seeing these fixes make it to Linux :-)
Filed bug 179995 to remove nsAppshell code that munges order of mouse/key events
as per comment 60 here.

Regarding other OS's, from a comment here:


Sam: The proposed change would only affect WIN32. Linux and Mac use completely
different mechanisms for the native PL_Event notification which are unaffected
by the proposed change. I'm glad to hear that OS2 does not suffer from this issue. 

"...the change proposed here does make drastic change in message delivery patterns"

Yes, it does on WIN32. 
*** Bug 125260 has been marked as a duplicate of this bug. ***
Component: Event Handling → User events and focus handling
You need to log in before you can comment on or make changes to this bug.