Closed Bug 157144 Opened 23 years ago Closed 22 years ago

Painting gets suppressed more than it probably should

Categories

(Core :: DOM: UI Events & Focus Handling, defect, P1)

defect

Tracking

()

VERIFIED FIXED
mozilla1.2alpha

People

(Reporter: jesup, Assigned: kmcclusk)

References

()

Details

(Keywords: perf, regression, Whiteboard: [adt2] [ETA 09/10])

Attachments

(1 file, 12 obsolete files)

14.54 KB, patch
dougt
: review+
kinmoz
: superreview+
jesup
: approval+
Details | Diff | Splinter Review
Example: go to that url, select the Drag example, play around. Note that updates can get very delayed, even on fast machines (probably much worse on slow machines). This is a well-known issue with the event queues on various platforms (including Win32 and Linux, maybe not on Mac). Paint events get lower priority than the mouse/invalidate events, and if the mouse keeps moving enough, the paint never happens. Perhaps we should limit the amount of time that a paint can be suppressed? or flush the event queue after processing a mouse move? or coalesce mouse moves? (this might have some side effects if someone tries to do (for example) build a drawing program, but I'm not sure that's a big issue.) See also bug 142635 and bug 153083 (and probably others). Original reporter is pep.pepe@razorfish.de in a posting in n.p.m.performance
Keywords: perf
Blocks: 154568
This bug depends on #21762 and should be fixed with progression in #140789
Keywords: regression
Hardware: PC → All
Taking this bug
Assignee: joki → kmcclusk
This patch demonstrates that the paint starvation issue can be improved by forcing periodic synchronous paints. This patch is not adequate however since it will force a synchronous paint only if a reflow has been processed. It is still possible that the paints could be starved by computations that don't result in reflows. In addition, I have measured a 3.5% slowdown in jrgm's pageload tests with this patch installed. Probably the result of forcing each page to paint before firing the onload handler.
Attachment #92892 - Flags: needs-work+
Priority: -- → P1
Target Milestone: --- → mozilla1.2alpha
Blocks: 129640
Here is a snippet for msdn documentation for WIN32 which describes the issue: "With the exception of the WM_PAINT message, the system always posts messages at the end of a message queue. This ensures that a window receives its input messages in the proper first-in, first-out (FIFO) sequence. The WM_PAINT message, however, is kept in the queue and is forwarded to the window procedure only when the queue contains no other messages. Multiple WM_PAINT messages for the same window are combined into a single WM_PAINT message, consolidating all invalid parts of the client area into a single area. Combining WM_PAINT messages reduces the number of times a window must redraw the contents of its client area. "
"The WM_PAINT message, however, is kept in the queue and is forwarded to the window procedure only when the queue contains no other messages" Since we rely on the WM_PAINT events to be dispatched from the message pump we will never paint unless the msg queue is empty.
Keywords: nsbeta1
Whiteboard: [adt2 RTM] [DIGBug]
QA Contact: rakeshmishra → desale
Blocks: 117436
Blocks: 124178
Blocks: 155160
Work in progress: - Schedules a PL_Event to perform a synchronous paint whenever a widget has been invalidated. This prevents paint starvation caused by continuous scheduling of PL_Events which happens when data arrives quickly. (Examples include: Loading a local large file from disk or loading a large file from the cache). - Synchronizes the display before processing user events (Mouse and keyboard). This prevents paint starvation caused by continuous processing of a series of user-events.
Attachment #92892 - Attachment is obsolete: true
Hmm.. how will this keyboard event thing affect editor performance?
The low priority of paint events is somewhat Windows-specific, isn't it? Wouldn't it also be fixed by backing out the changes in revision 3.41 of widget/src/windows/nsAppShell.cpp, since we used to force different prioritization of paint events, but then changed to the standard Windows prioritization to improve performance (IIRC, a significant page load performance gain, perhaps close to 10%, which made Mozilla's performance roughly equivalent to that of winEmbed). (Does GTK do something similar? Could we fix it there in a similar way?) Also (mainly in reply to comment 0), paint suppression and the prioritization of paint events are two very different issues.
(w.r.t. rev 1.41 of nsAppShell.cpp, I tried reverting that change (see http://bugzilla.mozilla.org/show_bug.cgi?id=132759#c99) and in current builds I'm not seeing it as a significant difference anymore. Either I'm on crack, or something has changed to make this moot).
"The low priority of paint events is somewhat Windows-specific, isn't it?" I believe both GTK and WIN32 handle paint events the same way. They do not dispatch pending paint events until the event queue is empty. They gather up the invalidated areas and dispatch a single paint to limit the number of paint dispatches. I'm not sure about the Mac. Since Gecko is embedded in other applications which have their own message pump we can not count on any special handling of the platform specific messages. I was told that one of the reasons Hyatt removed the special event handling in nsAppShell.cpp was so that Mozilla behaved like embedded applications in regard to how event messages were processed. The embedded application has control over the event loop and we can not require them to process paint messages ahead of other events.
Attachment #94073 - Flags: needs-work+
As part of the embedding API's I see we do have a stub NS_HandleEmbeddingEvent(msg, wasHandled) which currently does nothing except return NS_OK. Perhaps we could implement this and process all of the events as a FIFO list. But we would need to restrict it to process only the events for embedded window. WIN32 does have support for processing only the events associated with a particular window and it's children. I think the fundamental problem with our event processing is that we favor particular messages over others when processing the event queue. This leads to starvation of the lower priority messages. I think we should experiment with processing the messages in the order in which they are posted.
Keywords: nsbeta1nsbeta1+
There are three major issues that I am seeing with event starvation on WIN32. 1. UI Lockup/Hang when attempting to moving a top-level window, resizing the top-level window, or click on native menu within an embedding app while a page is loading from the net, cache, or local disk. Mozilla completely locks up until the entire page is loaded. 2. Paint events are starved when loading a page from the cache or local disk. 3. Paint events are starved when mouse pointer is moved. This makes it difficult to implement drag and drop using JavaScript, CSS, DOM because we will never update until the pointer stops moving. The mouse move events are processed with a higher priority then the paint events so the element is not displayed until the user stops dragging. case 1. Can *not* be solved by reverted back to revision 3.41, which inverts the event priority placing the WM_APP below particular events, because once we initiate a window move,or resize Windows spins a separate event processing loop. We have no control over the event processing at this point. case 2. Can be solved by reverting to revision 3.41 as dbaron suggests. But we would also have to force embedding apps to do special processing of their event queues. case 3. Can *not* be solved by reverted to revision 3.41 because the paints are still at a lower priority than the mouse events. To solve the first two cases, I think we have to abandon the use of WM_APP events as the mechanism for PL_Event notification in plevent.c. Instead we could use a WM_TIMER event with a duration of 0. The TIMER events have the lowest priority so all system events will be processed at a higher priority which fixes both case 1 and case 2. case 3 may be solved by forcing a synchronous paint of the invalidated areas when specific user events such as mouse move, mouse down etc. are processed in nsWindow.cpp. I think case3. should be handled as a separate issue. I will spin off a new bug for it. Test case: 1. Save a large file such as (http://lxr.mozilla.org/seamonkey/source/layout/html/style/src/nsCSSFrameConstructor.cpp) locally. 2. Load the local file. 3. Try moving the window by grabbing the title bar. Mozilla freezes until the entire page is loaded. No user interaction will happen once it freezes. Try resizing the window. It freezes. Try clicking a menu item. It does not display. 4. Use mfcembed to load the local file. The page will not display until the entire page has loaded. Try clicking on a menu or moving the top-level mfcembed window while the local file is loading and the app will freeze until the entire page is loaded.
The patch i've attached changes plevent.c to use a WM_TIMER message instead of WM_APP message as the PL_Event notification mechanism. benefits: 1. Mozilla remains interactive during page loads from the net, disk, or cache. 2. Embedded applications, mfcembed, winembed remain interactive during page loads without modifying their event processing loops. 3. Mozilla and embedded applications incrementally display the page without modification of their event processing loops. 4. CPU utilization for www.netscape.com dropped from 51% to 25%. I have run jrgm's page loader tests on 750Mhz WINXP machine over a cable modem. I can *not* detect a performance degradation but my precision is only within 6%. (http://cowtools.mcom.com/page-loader/loader.pl) jrgm: can you apply the patch and run a comparison to see if page load is affected? If you apply the patch you will use WM_TIMER events which makes Mozilla responsive during page loads. if you comment out #define USE_TIMER_EVENT and recompile, it will revert back to the current behavior which uses WM_APP events.
Attachment #94073 - Attachment is obsolete: true
Comment on attachment 94661 [details] [diff] [review] Use a WM_TIMER event instead of WM_APP event for PLEvent notification These changes look good and the inital data sounds great. Lets get this checked in soon so we have lots of time this milestone to detect any problems. I haven't seen a comment reference win16 in some time! r=dougt
Attachment #94661 - Flags: review+
I'll give this a run through later today.
(note: using the updated patch, attachment 94693 [details] [diff] [review], for the tests below). I did a series of runs, first with a Buffy (current trunk) build, with and without USE_TIMER_EVENT defined. Then I repeated them for mfcEmbed standalone builds, with and without the define. On win2k, this was about 4% slower for the Buffy build mfcEmbed build. However, the "slower" builds were more responsive in e.g., I can resize and move the window while loading nsCSSFrameConstructor.cpp in the patched build, but I'm locked out of the game with the unpatched build (on both 98 and 2k). mfcEmbed, win2k, 500MHz, 12MB ------------------------------- Without patch: Avg. Median : 781 msec Average : 791 msec With patch : Avg. Median : 813 msec Average : 821 msec ----- +4% Buffy, win2k, 500MHz, 12MB ------------------------------- Without patch: Avg. Median : 870 msec Average : 874 msec With patch : Avg. Median : 905 msec Average : 909 msec ----- +4% [Un]fortunately, I ran the same tests on a win98 system. I'm really sorry to say that (apparently) something seriously goes wrong on that OS. mfcEmbed, win98SE, 400MHz, 128MB --------------------------------- Without patch: Avg. Median : 1037 msec Average : 1062 msec With patch : Avg. Median : 1226 msec Average : 1246 msec ----- +18% Buffy, win98SE, 400MHz, 128MB --------------------------------- Without patch: Avg. Median : 1132 msec Average : 1152 msec With patch : Avg. Median : 1312 msec Average : 1324 msec ----- +16% The performance degradation occurs on all the test pages.
FWIW the problem trying to be fixed here isn't a problem on os/2 because it doesn't merge paint requests and it automagicly bumps execution priority of threads having window messages waiting. I'd wish y'all check out the result of these changes on linux, mac, os/2, etc, before doing anything drastic... and the change proposed here does make drastic change in message delivery patterns, I think.
Sam: The proposed change would only affect WIN32. Linux and Mac use completely different mechanisms for the native PL_Event notification which are unaffected by the proposed change. I'm glad to hear that OS2 does not suffer from this issue. "...the change proposed here does make drastic change in message delivery patterns" Yes, it does on WIN32.
> I can resize and move the window while loading nsCSSFrameConstructor.cpp in the > patched build Cool! Can you also scroll using the scrollbar and using the keyboard? (For bug 129640.)
This is the same problem as we see in bug 132759 when Flash uses WM_USER+1 messages. To reduce the timer performance hit, would it work to use a timer for every 10th or 20th time (or something like that)?
"would it work to use a timer for every 10th or 20th time" Unfortunately no. The problem is the timer callback is not called soon enough, not that it is called too much. There seems to be a large lag between timer callbacks even though the duration is set to 0. If I set up the timer to generate WM_TIMER events instead of using the callback and I also modify the nsAppShell event processing to pull WM_TIMER's off the event queue first it is *still slow*. However, If I modify nsAppShell to process any pending PL_Events before processing any WM_ events then performance is back to normal.
Brian: Oops.. I misunderstood your last comment. Yes it would probably help performance to only use timers every n times.
How about using WM_TIMER, but with the following changes: -- in nsEventQueue::ProcessPendingEvents, after calling PL_ProcessPendingEvents the first time, check to see if there are now more pending events, and if there are, go back to the beginning and process them. Repeat until the queue is empty. -- BUT before each call to PL_ProcessPendingEvents, call ::GetQueueStatus() to see if there are any pending paint or input messages, and if there are, return immediately. For maximum responsiveness one could even call ::GetQueueStatus() inside the event processing loop in PL_ProcessPendingEvents.
roc wrote: >How about using WM_TIMER, but with the following changes: >-- in nsEventQueue::ProcessPendingEvents, after calling PL_ProcessPendingEvents >the first time, check to see if there are now more pending events, and if there >are, go back to the beginning and process them. Repeat until the queue is empty. I recall this is how plevent.c used to work, several years ago, and combined with lengthy compute-bound control flow into gecko from plevents, it led to horribly high latency and poor UI responsiveness. I'm not positive that WM_TIMER was the message type used, but I am sure the old-old plevent.c code used to loop till the queue was empty. The change troy and others made was to count its population at the top of PL_ProcessPendingEvents, and loop over only that many events. /be
> it led to horribly high latency and poor UI responsiveness Yeah, that's why I suggested using ::GetQueueStatus to leave the loop as soon as incoming input or painting is detected.
roc: I saw that, but it still seems like other uses (are there any?) of PL_PPE could suffer the change in semantics. Maybe no one else calls PL_PPE, and this is all a matter of (re-)definition. /be
There don't appear to be any other uses of PL_ProcessPendingEvents (except possibly in the BeOS nsToolkit? Whatever...) One uninterrupted run through the queue per call to PL_ProcessPendingEvents is probably necessary to make our "deferred painting" work. (Basically, when a bunch of reflows and repaints occur at once, we need to make sure that the painting happens AFTER all reflows have been processed, so we delay the invalidates by issuing custom PLEvents which then issue native invalidates. We want to finish procesing the originally-pending PLEvents before those native invalidates are serviced.) So don't try to stick a call to ::GetQueueStatus inside PL_ProcessPendingEvents itself.
"... before each call to PL_ProcessPendingEvents, call ::GetQueueStatus() to see if there are any pending paint or input messages." There is almost always a pending paint message as the result of invalidates from the first PL_ProcessPendingEvents call so we end up doing almost the same number of SetTimer's and pay the price of their dispatch latencies. We could toggle between using Timers and WM_APP events using ::GetQueueStatus to determine if there are any input messages. This would allow Mozilla to remain interactive during long page loads, but this would not solve the paint starvation issue if their isn't any user input during the page load. However, the paint starvation issue could be solved separately, by scheduling a PL_Event to do synchronous paint of any invalidated area when a nsWindow is invalidated.
I'm not positive I understand all the subtleties of PL_ProcessPendingEvents (brendan and others: slap me if I'm not making sense) but I'm okay as long as we realize that ::DispatchMessage(&msg) actually loops sucking events out of the 3 queues (unprioritized) when the user is moving/resizing the window. This of course means that posted messages are serviced by windows code (out of our control) before user events like mouse drag and keyboard when the user is moving/resizing the window.
I'm also sure I don't understand this patch well... that said, why SetTimer with elapsed=0 ? There's a maximum useful rate to process screen changes. Screen vertical refresh rate ranges 50-100Hz, that is 10-20ms, so pumping screen updates faster than this is visually ineffective. Sam
Using set timer is not about delaying the activity to some later time but about making the *new* request be serviced *after* the user input that is *already* waiting in the hardware message queue. If Microsoft had done the obivious thing of making user input on a UI based system have top priority instead of bottom priority then we would not have to be fiddling like this.
overview of what has been discussed in this bug that hopefully will clear up some questions: When a PL_PostEvent is called a WM_APP msg is Posted to the WIN32 native message queue. (If one has not been posted already). The PLEventQueue is separate from the native event queue. The PLEventQueue will often contain numerous events while there will be only one WM_APP msg posted to the native message queue. The WM_APP event is used as notification to indicate there is at at least one event and possible more events in the PLEventQueue that need to be processed. (Other platforms use different mechanism for this notification. Pipes are used on XP_UNIX) When control returns to the message processing loop which performs TranslateMessage, DispatchMessage it will process the WM_APP msg that was posted and call a routine which in turn makes a call to PL_ProcessPendingEvents which processes pending PL_Events. When each PL_Event in the PLEventQueue is processed it may call PL_PL_PostEvent which puts additional events on the PL_EventQ. The posting of the PL_Event will result in a new WM_APP event being posted to the native message queue. To prevent continuously looping over the PL_Events in the PLEventQ as the result of this additional message posts we only process the number of events that are sitting in the PL_EventQ at the time PL_ProcessPendingEvents is called. In theory this should work fine. Except for the fact that messages that are "posted" to the native WIN32 message queue using ::PostMessage are processed *before* all other messages. We overcome this somewhat in nsAppShell where the main message loop for Mozilla resides. It processes the WM_MOUSE and WM_KEY events first by peeking into the native message queue and pulling them out before the WM_APP events. But this technique completely falls apart when the user grabs the title bar of the window or tries to resize the window, because Windows spins its own event loop which will process the "posted" messages ahead of all other messages. When the posted WM_APP events are processed they often result in a new WM_APP event being inserted into the native message queue because additional PL_Events are generated. This causes a complete lockup of the UI because the message loop spends all of its time working on the posted messages, while all other events mouse move, paints etc. remain in the native message queue until the "posted" native messages stop coming in. Interestingly the 4.x code base does not have this problem even though it also uses "posted" WM_APP events. But it uses PL_Events sparingly. They are only used to indicate data has arrived from the network. In Gecko the PL_events are used for network data notification, Reflow notification, timer notification, paint invalidation. etc.,etc. My solution was to use a timer's instead of "posted" WM_APP messages.. The reason timers fix this problem is because they generate "non posted" messages. Since the native timer messages are not posted, but instead they are generated by windows internal timer code they do not starve out other "non posted" events which occur when the user moves or resizes the window. The problem with SetTimer is there seems to be a latency before the actual timer dispatch which slows down page load. Roc's proposed solution was intended to eliminate the number of times SetTimer needed to called. It involves processing more PL_Events without returning to the native message pump. He was noting that many of the calls to process the PLEvents result in additional posts to the PLEventQueue. He proposed that we continue processing the additional posted PL_Events unless there was a pending input or paint event in the native message Q. In practice the native message queue will often contain a paint event which is the result of invalidation performed while processing the PL_Events so we return immediately, and end up performing the same number SetTimer calls. If I use Roc's proposal and return only when there is an input event, performance improves but is still not as good as the when the WM_APP messages where used. It appears to be about 2% slower than use WM_APP messages on WinXP. I also tried switching dynamically between using WM_APP events and WM_TIMER events based on the user input. When user input is pending I used timers. This brings performance which matches using WM_APP messages, but the app is sluggish when dragging or resizing the window. (You are not completely "locked out" but we are not as nearly as responsive as I.E or 4.x during the long page load using this technique.) A good solution would be if we could find a way to generate a "non posted" event that we can use for notification other than using timers. Could pipes be used on WIN32 just as they are on UNIX?. Is there a way of connecting a WIN32 pipe to the GUI event processing so we can register a callback when data arrives on the pipe. If the pipe implementation on WIN32 used posted messages this would not work because we would be back to the original problem.
How about if we add a lower priority queue and test for starvation? Perhaps we could move the WM_USER, WM_APP, and PL_Event* messages to a non-OS queue that we prioritize lower than the system level keyboard/mouse queue. The basic code would look like this: set starvation timer while (1) { did_something = false if (have event other than wm_paint or wm_timer) { get next message (in natural order) (except wm_paint or wm_timer) if (wm_user or wm_app message) move to lower priority queue else service the message did_something = true if (not starving) go to start of loop } if (have lower priority message) { service lower priority messages for a while did_something = true if (not starving) go to start of loop } if (have wm_paint) { service wm_paint messages for a while did_something = true if (not starving) go to start of loop } if (have wm_timer) { service wm_timer messages for a while did_something = true if (not starving) go to start of loop } if (did_something == false) // nothing to do wait for a message/timer to arrive/fire reset starvation timer } I want to _not_ remove wm_paint messages from the OS message queue early. The OS naturally combines update regions and this minimizes repainting. We still have to handle the window move/resize case. This is the case where inside the DispatchMessage system call it internally reads the queue (ie: we cannot manage the queue). One way to deal with this is by having code in the window proc that detects wm_user and wm_app messages that have not been moved to the lower priority queue. In the window proc we can then move them to the lower priority (non OS) queue.
Brian: The patch I attached does something similar to what your proposing. It keeps metrics on event starvation before decided whether to post a WM_APP message or set a timer. The concept is that the posting allows us to send a high priority notification and setting a timer allows us to send low priority notification. If we are not staving input or paint events then we keep posting WM_APP events. If we detect starvation we use a timer event which allows any pending input and paint events to be processed. I need to do some more tuning and testing, but the patch fixes the UI lockup problem/paint starvation issue and seems to do it without affecting page load performance. I've been trying to avoid doing more special processing in the nsAppShell::Run since embedded applications have their own event loops and it may not be possible for them to modify message processing priorities.
Note: The patch I attached uses counters to indicate event starvation.(i.e number of consecutive times a WM_APP msg was posted while there was a pending input or paint) It would probably be better to compare elapsed time instead. I'm going to experiment with using elapsed times instead.
Kevin: any thoughts on what to do when a plugin like Flash is posting events and is able to use 100% of CPU?
I find I can fix the "Flash plugin uses 100% CPU" and starves out user input bug, bug 132759. define a window proc in the window proc test if input is pending: if (HIWORD(GetQueueStatus(QS_INPUT))) { defer the message by putting in on a timer with delay 0 }
Brian: Great! This approach is compatible with attachment 95604 [details] [diff] [review]. So our general solution appears to be to demote the priority of WM_APP, WM_USER events by converting them to timer events when input is starved. Should you also include paint starvation? Do the drop-down menus in Mozilla appear quickly and correctly on pages which use a plugin which generates WM_APP and WM_USER events?
> Should you also include paint starvation? Yes, I had not looked at paint yet as I've been worried about the "moz freezes the *entire* UI and requires Ctrl-Alt-Delete to break the freeze" problem. Assuming paint and timer messages come thru the window proc perhaps we can add something like this to the window proc (assuming it does not have a significant performance issue). DWORD currentTickTime = GetTickCount(); // tend the starvation timers if (msg.message == WM_PAINT) lastPaintTime = currentTickTime else if (msg.message == WM_TIMER) lastTimerTime = currentTickTime // check for starvation else if (msg.message >= WM_USER) { PRBool shouldDelay = PR_FALSE; // check for input starvation (right now!) if (HIWORD(GetQueueStatus(QS_INPUT))) shouldDelay = PR_TRUE; // check for paint/timer starvation (every now and then) else if (currentTickTime != lastTickCount) { lastTickCount = currentTickTime; if ((currentTickTime-lastPaintTime) > MAX_PAINT_STARVATION_TIME) { if (GetQueueStatus(WM_PAINT)) shouldDelay = PR_TRUE; else lastPaintTime = currentTickTime } if ((currentTickTime-lastTimerTime) > MAX_TIMER_STARVATION_TIME) { if (GetQueueStatus(WM_TIMER)) shouldDelay = PR_TRUE; else lastTimerTime = currentTickTime } } // I'm not sure if we need to test for WM_HOTKEY or if that is // handled as a posted message below WM_USER if (shouldDelay) defer the message by putting in on a timer with delay 0 }
Shanjian: if this works I believe we may no longer need the patch you just added for IME event loop problems, bug 57332.
I realized that the use of a single lastTickCount timer to limit the number of times I check probably won't work exactly like I expect if there are multiple sources calling PostMessage. else if (currentTickTime != lastTickCount) { lastTickCount = currentTickTime; if ((currentTickTime-lastPaintTime) > MAX_PAINT_STARVATION_TIME) { It probably needs to look at each message time individually. Just servicing one message and then resetting lastTickCount ignores other messages. I suspect I should just drop lastTickCount and do something like this: else if ((msg.time-lastPaintTime) > MAX_PAINT_STARVATION_TIME) { Will this same kind of issue apply to your patch? +static PRBool _md_EventIsStarved(PRBool isPending, PRUint32 ... + if (isPending) { + /* was_pending must be false so record the current time + * so the elapsed time can be computed the next time this + * function is called + */ + *lastTime = PR_IntervalToMilliseconds(PR_IntervalNow()); + *wasPending = PR_TRUE; + return(PR_FALSE); + }
The drag example in http://www.domapi.com/ seems to gone. :-( Another good url when you dont see what you are dragging is: http://www.dhtmlcentral.com Just drag around the different Windows. Cannot see them in Windows 2000 while been dragged.
I filed bug 163526 for the drag starvation issue. This bug (157144) covers only the starvation issue when loading from the cache or local file.
Whiteboard: [adt2 RTM] [DIGBug] → [adt2 RTM]
"Will this same kind of issue apply to your patch?" Brian: I don't think this is a significant issue for attachment 95717 [details] [diff] [review]. It would be more accurate to use the actual message time of the msg that was starved but the approximation of using the last time that a msg was posted with a pending paint or input msg seems to be good enough. I separate the paint and input times so their elapsed times are calculated separately. For paint events we would need to capture the time of the first invalidate. Since paints are accumulated the time that is returned in the msg structure is not the first time. AFAIK it always gives back the current tick count making it useless for calculating the elapsed time.
using attachment 96068 [details] [diff] [review] I recorded 6774 calls to PostMessage vs. 39 calls to SetTimer when running jrgm's page load tests. SetTimer was called during page load due to paint starvation. Since SetTimer is used infrequently during page load unless the user is interacting with the UI, page load times should not be affected.
> Since SetTimer is used infrequently during page load unless the user is > interacting with the UI, page load times should not be > affected. For bug 132759 my results feel very similar to this: very few calls to delay the messages except when the user is dragging/resizing the window (and with my current test patch the UI response is quite acceptable).
jrgm: Could you run your perf test again with attachment 96197 [details] [diff] [review]? Thanks! I can not detect any performance impact using the patch. In addition SetTimer is only called for approx 0.5% of the notifications when running your page load tests.
Attachment #96197 - Attachment is obsolete: true
I have completed the fix (attachment 96247 [details] [diff] [review]). It is ready to be reviewed.
I've opened bug 164084 to discuss whether we should remove the PeekMessage code. Please cc: yourself if interested.
I do not have any knowledge of PL_Events but I have been looking at MSWin event loops quite a bit recently and the timer code design and logic looks good to me. +#define PAINT_STARVATION_LIMIT 750 Its my gut level feeling that when navigating menus with a mouse a delay of over 250 milliseconds make it hard for the user. I note that this does not look for timer starvation. Do / should we care about timer starvation here?
>I note that this does not look for timer starvation. Do / should we >care about timer starvation here? Basicly, that's my problem with this fix, to the extent that 'I have a problem'. Win32 GPI is what it is, does what it does, it works the way it works. Once you take on task of 'fixing it' to behave differently in one bit of code.... then where do you stop? I raise this as a question. Don't say I know the answer.
"Its my gut level feeling that when navigating menus with a mouse a delay of over 250 milliseconds make it hard for the user." Whe navigating menus the INPUT_STARVATION_LIMIT of 50 ms is what will be triggered, So the App will remain responsive during the long page load. I have confirmed this by loading a local copy of a long file. I am able to mouse over menu items and the user response is acceptable. When the input starvation limit has been reached, a timer is used which allows all pending input AND paints will be processed before the PL_Event notification, so the effective update rate when there is a user interaction is around 50 ms not 750ms. The PAINT_STARVATION_LIMIT is triggered when the user has not interacted with the page at all during a long page load of cached or local file. Mozilla timers are implemented using the PL_Event mechanism. We will not starve out the Mozilla timers because they are already being processed as PL_Events. "Basicly, that's my problem with this fix, to the extent that 'I have a problem'. Win32 GPI is what it is, does what it does, it works the way it works" Actually this fix does *not* change the way the WIN32 GDI works. It does not attempt to fix it. It doesn't invert the order of any messages or similar nastiness, which is what is currently going on in nsAppShell by the way. In fact, we will probably be able to remove the code in nsAppShell which pulls out the mouse and key events ahead of other events once this patch goes in. It also allows embedded apps to run using unmodified msg processing loops. The WIN32 message processing does not work exactly the same way as other platforms. Thats to be expected. The Widget and Gfx modules are littered with code where we have identified native platform behavior discrepancies and make them compatible with the XP code's expectations. This is just happens to be another instance of that. This patch recognizes how the WIN32 msg queue processing works (Posted events are always processed first) and uses a mechanism for dispatching events PL_Events which is compatible with it. In Mozilla the basic model for dispatching most everything is a PL_Event. Currently we use a WIN32 GDI Mechanism which makes the notification the highest priority because it is a posted event. WIN32 GDI also provides a mechanism for providing a notification with the lowest priority which is the timer. This patch just tries to make an intelligent choice when to use one or the other. When there is user interaction we want to use the lowest priority because we want to favor interactively over page load time which means using a timer. When there isn't any user interaction we want to use the fastest mechanism which is a posted event. This is not a case where we are in the dark about what is really going on here. We know exactly why the starvation occurs. We have the microsoft documentation of how it processes posted events, paint Events and timers. With that knowledge, a solution has been crafted which takes into account microsoft's implementation and the Mozilla cross platform code expectations. Will there be other cases which we will have to address? Maybe. But the current patch solves the major UI lockup and paint starvation issue when loading a long document. I think we can build upon this solution for other starvation issues. The concept is pretty simple. We can either post high priority or low priority notification and we need to determine when to use one over the other.
What's the landing plan? If you want this in 1.2 we should get it reviewed and landed ASAP.
> When navigating menus the INPUT_STARVATION_LIMIT of 50 ms is what will be > triggered, ... so the effective update rate when there is a user interaction > is around 50 ms not 750ms. That's plenty fast enough, Since this should also be the case for my plugin patch I will go experiment with extending the paint starvation time there. > We will not starve out the Mozilla timers because they are already being > processed as PL_Events. I thought I saw the timer starvation test firing occasionally when testing my plugin patch. Are there any non-mozilla timers we need to care about?
Comment on attachment 96247 [details] [diff] [review] Improved patch: Use WM_ENTERSIZEMOVE and WM_EXITSIZEMOVE to detect when the user is moving the window 1. Maybe move or copy the comment above _md_EventIsStarved about starvation limits to where the #defines are. I was wondering what where were for and had to search for other references. 2. Remove the win16 defines. lets not kid ourselves. 3. Please be consistent with the return statement: + return(PR_TRUE); /* pending and over the limit */ + } + + return (PR_FALSE); /* pending but within the limit */ + } either a space or no space , not both. I would rather have you just drop the parenthesis. 4. You are not removing the property set by SetProp. You probably can do the same for the WindowsHookEx, but it is not as critical.
Note: I removed all of the WIN16 references in plevent.c
Attachment #96247 - Attachment is obsolete: true
Comment on attachment 96735 [details] [diff] [review] patch incorporating Doug's suggestions thanks. r=dougt Just a thought, maybe we should have a #define which disables this optimization.
Attachment #96735 - Flags: review+
Comment on attachment 96735 [details] [diff] [review] patch incorporating Doug's suggestions sr=kin@netscape.com I like doug's idea of using ifdefs, initially, so that we can turn on/off this code to see if it's contributing to any unforseen regressions.
Attachment #96735 - Flags: superreview+
Attachment #96735 - Attachment is obsolete: true
> What's the landing plan? If you want this in 1.2 we should get it reviewed > and landed ASAP. I assume roc actually meant 1.1. But I don't think we really need to force this into 1.1 at the last moment. Sorry, I didn't get some more tests in (something about some other release and all :-). I'm updating my trunk build now and should have results later tonight. (I'll also be checking win98 as well, since we previously saw a much different response to changes in event behaviour on that platform than on a NT derivative.)
I meant 1.2. There's no one anyone is going to force anything into 1.1 at this point :-). I was just wondering if you're planning to land this in 1.2. This kind of change has been very regression-prone in the past and the trunk is already quite unstable. We need to land this with plenty of time before 1.2 is due to ship to give us time to deal with any possible regressions.
*** Bug 164931 has been marked as a duplicate of this bug. ***
Blocks: 164931
(roc: sorry for misunderstanding your comment). I built the trunk optimized from this afternoon (but reverted the pref change that regressed pageload by a couple of percent [bug 30088] in my build). I applied attachment 96761 [details] [diff] [review] and tested the A and B builds on win2k and win98 win2k/500MHz/128MB Avg. Median : 919 msec Average : 929 msec Avg. Median : 923 msec Average : 936 msec -------- -------- +0.4% +0.7% win98/400MHz/128MB Avg. Median : 1181 msec Average : 1206 msec Avg. Median : 1211 msec Average : 1232 msec -------- -------- +2.5% +2.1% So, the win2k result is effectively none (I don't consider that change to be statistically significant). The win98 change does appear to be real, although it may be acceptable given that the browser should be more responsive now. It's also curious that almost all of the change on win98 occurs because of >10% slowdowns on four particular dummy pages (zdnet.com, gamespot.com, DOM Core, voodooextreme.com). I should probably test this again on win98 to be sure. Also, perhaps there is some "characteristic" of those pages that leads to this slowdown (i.e., perhaps we can identify something that can be independently fixed for those pages specifically, or win98 generally). Note: I didn't do any testing of the UI responsiveness (although I'll look at that tomorrow on win2k and win98).
jrgm: The 2.5% to 2.1% perf hit on Win98 is probably the result of those pages triggering the paint starvation limit more frequently. So they are painting more than they used to. Could you try changing following define in plevent.c? from: PAINT_STARVATION_LIMIT 750 to: PAINT_STARVATION_LIMIT 2000 And re-run your test on Win98. Thanks!
Blocks: 165039
I tried Kevin's suggestion to change the value to 2000 for PAINT_STARVATION_LIMIT and reran the test on win98. This did improve things although it did still come in about a percent slower than without the patch at all, with most of the "loss" coming in the two slowest/largest pages in the test sequence. Should I try a different setting for that define? I did try playing with the browser while loading nsCSSFrameConstructor.cpp and the browser is more responsive, both resizing and moving while the document loads. (Subjectively it seems more responsive on win2k/500MHz than on win98/400MHz, although I don't know if that is due to the speed or the OS difference).
"Should I try a different setting for that define?" 3000 or 4000 should bring it below 1%. The relationship is linear. 750 results in a 2.5% slowdown and a setting of 2000 results in 1% slowdown. 3000 should bring it to approx .6% 4000 should bring it to approx .5% I think .6% on WIN98 is a reasonable price to pay. The .6% slowdown is result of legitimate painting of the page. Part of our current page load performance is the result of starving painting for long periods of time. Setting a limit of 3 seconds on the starvation seems reasonable even if it slows the page load tests on WIN98 down by .6%.
I agree that it is reasonable to take a bit of absolute load slow down to give the user feedback that something is happening. I myself would opt for the shorter time with better feedback and slightly slower loading. It is very disconcerting to see moz "freeze up" even for 4 seconds. John: could you tell us the times for the pages that seem to be most affected?
John: and the URLs so I can get a feel of loading the pages?
There's nothing to stop us using different values for different Windows versions. We could do a run-time test and use 750 on NT systems and 3000 on Win9x systems.
Comment on attachment 97188 [details] [diff] [review] Same as the last patch + roc's suggestion to determine the paint starvation limit based on OStype. WIN9x=3000 all others=750 Can you do something like this instead: os.dwOSVersionInfoSize = sizeof(OSVERSIONINFO); This way it is more clear that you are talking about the size of the type.
Attachment #97188 - Attachment is obsolete: true
Comment on attachment 97212 [details] [diff] [review] Same as last patch + Doug's suggestion for sizeof(OSVERSIONINFO) sr=kin@netscape.com Not a big deal, but I was just wondering, since the starvation limit will never change after it is calculated by _md_GetPaintStarvationLimit(), should we be caching the value in _pl_NativeNotify to avoid the function call overhead to _md_GetPaintStarvationLimit() each time we call _md_EventIsStarved()? + if (_md_EventIsStarved( (qstatus & QS_PAINT), _md_GetPaintStarvationLimit(), or is the overhead so small it's not worth it?
Attachment #97212 - Flags: superreview+
Running jrgm's page loader tests results in just over 6000 calls to _md_GetPaintStarvationLimit(). Compared with all of the calls and processing performed to actually load the pages, reflow, render, etc. etc. This has to be insignificant.
Comment on attachment 97212 [details] [diff] [review] Same as last patch + Doug's suggestion for sizeof(OSVERSIONINFO) r=dougt
Attachment #97212 - Flags: review+
"could you tell us the times for the pages that seem to be most affected?" b b/a c c/a a ------------------------------------------------------------- www.w3.org_DOML2Core 3924 105% 3827 103% 3730 www.voodooextreme.com 4022 107% 3917 104% 3762 where a is without the patch, b is the patched build with '750', and c is the patched build with '2000' as PAINT_STARVATION_LIMIT, and these numbers generated on win98/400MHz/128MB. The URLs to "static" copies of these test pages are: http://cowtools.mcom.com/perf/content/base/www.voodooextreme.com/index.html http://cowtools.mcom.com/perf/content/base/www.w3.org_DOML2Core/index.html I agree that taking a small hit on win98 in exchange for not locking up the UI is the right choice (of course).
I checked in attachment 97212 [details] [diff] [review] If you suspect this checkin is causing a problem you can prevent it from using the timer by commenting out the following line in mozilla/xpcom/threads/plevent.c #define USE_TIMER
So is this fixed now, or are we leaving this open for more patches and/or tuning of the values?
Keywords: mozilla1.1mozilla1.2
Its fixed.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Nominating for 1.0 branch as improving DHTML perf would be good for embedding customers. Marking as edt1.0.2.
Whiteboard: [adt2 RTM] → [adt2] [ETA 09/09]
desale: can you pls verify this as fixed on the trunk? thanks!
Whiteboard: [adt2] [ETA 09/09] → [adt2] [ETA 09/10]
Kevin, This one is marked "FIXED" on 08-30-2002. Which testcase do you think I should use to verify this Fix ? I tried the testcase provided at URL of this bug (Played around with Drag example) with 2002-09-09-08-trunk & 2002-08-15-04-trunk & I did not find any difference in painting updates (No difference in Results). If this one is Fixed on 08-30-2002, then two different trunks before & after 08-30-2002 should produce different test results. Or is it marked Fixed on basis of any other test (jrgm's performance test or something else). Please let me know which test I should use to verify this fix.
Prashant: Use the following test case: 1. Save a large file such as (http://lxr.mozilla.org/seamonkey/source/layout/html/style/src/nsCSSFrameConstructor.cpp) locally. 2. Load the local file. 3. Try moving the window by grabbing the title bar. Before this patch went in Mozilla would freeze until the entire page is loaded. No user interaction would happen once it froze. Try resizing the window. It freezes. Try clicking a menu item. It does not display. With the current trunk builds It should remain interactive. 4. Use mfcembed to load the local file. The page will not display until the entire page has loaded. Try clicking on a menu or moving the top-level mfcembed window while the local file is loading and the app will freeze until the entire page is loaded. With a current trunk build it should display after a few seconds well before the entire document has completed its load.
Yes, With testcase provided by Kevin, I can see that problem is fixed in todays trunk. I tested two trunks one 2002-09-09-08-trunk & other 2002-08-15-04-trunk from last month before fix checkin. I can see the bug is fixed. I tested on WinXP. Marking verified
Status: RESOLVED → VERIFIED
Keywords: edt1.0.2edt1.0.2+
Randell/Chofmann: Looks like the EDT wants this on the 1.0 branch. Can you pls provide 1.0 branch approval for this patch?
Keywords: approval
Comment on attachment 97212 [details] [diff] [review] Same as last patch + Doug's suggestion for sizeof(OSVERSIONINFO) a=rjesup@wgate.com for 1.0 branch. CHange mozilla1.0.2+ to fixed1.0.2 when checked in. Please check in ASAP for 1.0.2 tagging.
Attachment #97212 - Flags: approval+
Fix checked into the 1.0 branch.
desale: pls verify this as fixed for the 1.0 branch, then replace "fixed1.0.2", with "verified1.0.2". thanks!
Verified on 1.0 branch
This bug was listed as OS all, but from what I gather from reading the comments, it is fixed on Windows only. If that's correct, have separate bugs been filed for Linux and Mac on this issue, as Kevin indicated he would do for bug #165039? The fix for this and bug #165039 sound like something that'll really make a huge difference in the interactive feel of Mozilla, and even if we give up one or two percent in page load time it'll definitely be worth it. Looking forward to seeing these fixes make it to Linux :-)
Filed bug 179995 to remove nsAppshell code that munges order of mouse/key events as per comment 60 here. Regarding other OS's, from a comment here: Sam: The proposed change would only affect WIN32. Linux and Mac use completely different mechanisms for the native PL_Event notification which are unaffected by the proposed change. I'm glad to hear that OS2 does not suffer from this issue. "...the change proposed here does make drastic change in message delivery patterns" Yes, it does on WIN32.
*** Bug 125260 has been marked as a duplicate of this bug. ***
Component: Event Handling → User events and focus handling
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: