Last Comment Bug 710935 - Measure lag in handling user input
: Measure lag in handling user input
Status: RESOLVED FIXED
[Snappy:P1]
:
Product: Core
Classification: Components
Component: General (show other bugs)
: unspecified
: x86 Windows 7
: -- normal with 3 votes (vote)
: mozilla15
Assigned To: Brian R. Bondy [:bbondy]
:
Mentors:
Depends on: 759449
Blocks:
  Show dependency treegraph
 
Reported: 2011-12-14 16:57 PST by (dormant account)
Modified: 2012-05-29 12:09 PDT (History)
18 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Patch v1. (4.46 KB, patch)
2012-02-29 16:52 PST, Brian R. Bondy [:bbondy]
taras.mozilla: review-
Details | Diff | Review
Patch v2. (5.38 KB, patch)
2012-03-01 07:46 PST, Brian R. Bondy [:bbondy]
taras.mozilla: review-
Details | Diff | Review
Patch v3. (5.63 KB, patch)
2012-03-01 10:57 PST, Brian R. Bondy [:bbondy]
taras.mozilla: review+
Details | Diff | Review
Patch v4. (6.20 KB, patch)
2012-03-01 13:20 PST, Brian R. Bondy [:bbondy]
no flags Details | Diff | Review
Patch v5. (8.06 KB, patch)
2012-05-12 08:16 PDT, Brian R. Bondy [:bbondy]
benjamin: review+
Details | Diff | Review

Description (dormant account) 2011-12-14 16:57:37 PST
In order to measure responsiveness improvements we need to be able to measure event-handling lag. The most obvious theoretical way to do this would be to check if there was an event handler that took a long time preceding a user input event(ie mouseclick or keyboard input).

I would like to stick the time taken into a telemetry histogram.
Comment 1 Ted Mielczarek [:ted.mielczarek] 2011-12-20 08:14:45 PST
What do you mean by "event-handling lag"? The responsiveness instrumentation I wrote is sampling the time taken for the main thread to service events sent from a background thread, which is the same way that keyboard/mouse events get into our event loop. Granted this under-measures any time taken by event handling code after the native event is serviced, but it seems like a good estimate.
Comment 2 (dormant account) 2011-12-20 12:10:33 PST
(In reply to Ted Mielczarek [:ted, :luser] from comment #1)
> What do you mean by "event-handling lag"? The responsiveness instrumentation
> I wrote is sampling the time taken for the main thread to service events
> sent from a background thread, which is the same way that keyboard/mouse
> events get into our event loop. Granted this under-measures any time taken
> by event handling code after the native event is serviced, but it seems like
> a good estimate.

I would like to be able to land heuristics for mitigating lag to actual user input. For that we need to be measure lag that occurs only when responding to user events. A general purpose lag-meter does not help here.

For example, we can suspend event processing on background tabs, cc, gc, etc while moving mouse and then they will take longer once they actually run, but at that point the user wont be attempting to use firefox, so it's ok.
Comment 3 (dormant account) 2012-02-14 17:04:41 PST
Brian, can you look into this? This should be a good way to track snappy progress.
Comment 4 Brian R. Bondy [:bbondy] 2012-02-29 16:52:59 PST
Created attachment 601824 [details] [diff] [review]
Patch v1.

Basically I do the same as you suggested in Comment 0, I measure the amount of time each message takes to process.
If the next message is a UI message and if the last message took more than 50ms, then I report it.

I didn't include things <50ms because it will be almost all of the data for sure (>99%).
I.e. what is important that we measure here is the distribution of how bad the problems that do exist are.  

I believe that measuring via QueryPerformanceCounter gives you the most accurate results so that's how I did it instead of doing it via TimeStamp::Now().

For bucket sizes, the minimum is 50ms, the maximum is 30s, and there are 200 buckets.
So that means each bucket contains a range of 149ms.
Comment 5 (dormant account) 2012-02-29 21:25:02 PST
Comment on attachment 601824 [details] [diff] [review]
Patch v1.

This is missing an important piece. We need to check that the event that took some time happened right before the interaction..ie there were <1millisecond between the preceeding event and user activity.
We should also sum overhead for events that happen immediately after each other. Ie you have 10 events back-to-back preceding user activity -> some event loop flooding is occurring.
Comment 6 Brian R. Bondy [:bbondy] 2012-02-29 21:30:17 PST
> This is missing an important piece. We need to check that the event that took 
> some time happened right before the interaction..ie there were <1millisecond 
> between the preceeding event and user activity.

I had thought of this but thought those cases would be insignificant.  But I agree it's better having that extra check in there. 

> We should also sum overhead for events that happen immediately after each other.
> Ie you have 10 events back-to-back preceding user activity -> some event 
> loop flooding is occurring.

I hadn't thought of this and agree this will give us better stats.

I'll do a new patch tomorrow morning for both items, thanks for the quick review.
Comment 7 Brian R. Bondy [:bbondy] 2012-03-01 07:46:19 PST
Created attachment 601970 [details] [diff] [review]
Patch v2.

Review comments implemented.
Comment 8 (dormant account) 2012-03-01 10:42:30 PST
Comment on attachment 601970 [details] [diff] [review]
Patch v2.

Seems like you could roll QueryPerformanceCounter(&tickEnd); into GetTimeSpanMS(tickStart, tickEnd) to save a line of code.


 
+        // Stop measuring the amount the elapsed time for this message
+        QueryPerformanceCounter(&tickEnd);
+        lastLagTimeMS += GetTimeSpanMS(tickStart, tickEnd);

If I'm reading the code right, you must reuse "GetTimeSpanMS(tickEnd, tickStart) < 10" guard for resetting lastLagTimeMS  when events are not adjacent.
Comment 9 Brian R. Bondy [:bbondy] 2012-03-01 10:53:22 PST
> Seems like you could roll QueryPerformanceCounter(&tickEnd); into 
> GetTimeSpanMS(tickStart, tickEnd) to save a line of code.

On the first check we need to use the end time from the last message, so we need to keep that value anyway.  So I don't think this will buy us anything in code readability.


Implementing the other comment though.
Comment 10 Brian R. Bondy [:bbondy] 2012-03-01 10:57:13 PST
Created attachment 602031 [details] [diff] [review]
Patch v3.

Now reset the lag time anytime more than 10ms passes even if there was no ui input.
Comment 11 (dormant account) 2012-03-01 11:44:49 PST
Comment on attachment 602031 [details] [diff] [review]
Patch v3.

r+ + weak suggestions:

+static double
+GetTimeSpanMS(const LARGE_INTEGER &tickStart, const LARGE_INTEGER &tickEnd) 
+{

I'm not convinced of the benefit of using a double there(over a long).

s/FXWIDGET/EVENTLOOP//

As discussed on vidyo, I feel we should lower 10ms threshold to something that is 
an absolute minimum that still catches stuff.

bsmedberg should review this before landing.
Comment 12 Brian R. Bondy [:bbondy] 2012-03-01 13:20:20 PST
Created attachment 602082 [details] [diff] [review]
Patch v4.

- Carrying forward r+.  
- Changed to 2ms time. 
- I left it at double since the value is exact enough for those decimal places to be useful if we want to gate one of the checks on it.
Comment 13 Benjamin Smedberg [:bsmedberg] 2012-03-08 10:07:44 PST
Comment on attachment 602082 [details] [diff] [review]
Patch v4.

I don't understand why this code is sufficient or useful. Don't we expect that most of our significant pauses (GCs/CCs/networking-induced pauses) will occur during XPCOM events (nsIRunnables) and not native events? If so, I'm pretty sure that this code doesn't account or measure the most of the lagginess that we want to measure.
Comment 14 Brian R. Bondy [:bbondy] 2012-03-08 11:40:25 PST
It was my understanding that we wanted to measure the amount of time between us actually handling a message for user input.  I'll removing myself from this task in case someone else can take it, I've been moved primarily on Win8 work for now.
Comment 15 (dormant account) 2012-03-12 15:13:20 PDT
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #13)
> Comment on attachment 602082 [details] [diff] [review]
> Patch v4.
> 
> I don't understand why this code is sufficient or useful. Don't we expect
> that most of our significant pauses (GCs/CCs/networking-induced pauses) will
> occur during XPCOM events (nsIRunnables) and not native events? If so, I'm
> pretty sure that this code doesn't account or measure the most of the
> lagginess that we want to measure.

< bsmedberg> it's that we exit out of this loop to process XPCOM events, so it goes a little haywire
Thus best place to measure cumulative lag is inbetween HangMonitor::NotifyActivity and HangMonitor::Suspend. HangMonitor should expose a property to access cumulative lag.
Comment 16 Brian R. Bondy [:bbondy] 2012-03-12 15:18:30 PDT
Thanks for clarifying, I'll take another spin.
Comment 17 (dormant account) 2012-04-03 15:31:25 PDT
(In reply to Brian R. Bondy [:bbondy] from comment #16)
> Thanks for clarifying, I'll take another spin.

This seems to have slipped through. Do you have an eta on wrapping this up?
Comment 18 Brian R. Bondy [:bbondy] 2012-04-04 05:11:29 PDT
Since I don't have a lot of time outside of win8 work, I'll be working on the prefetch task this week and this task next week.
Comment 19 Brian R. Bondy [:bbondy] 2012-04-19 07:56:15 PDT
I have not had a chance to start on this yet.  I noticed that the etherpad lists this as "is this still relevant?".  

Is it? If so, is it a priority (currently listed as P1)?
Comment 20 (dormant account) 2012-04-19 13:45:51 PDT
(In reply to Brian R. Bondy [:bbondy] from comment #19)
> I have not had a chance to start on this yet.  I noticed that the etherpad
> lists this as "is this still relevant?".  

That's referring to the fact that there is nothing to discuss in your absence.
> 
> Is it? If so, is it a priority (currently listed as P1)?

Yes.
Comment 21 (dormant account) 2012-04-19 13:46:19 PDT
By yes, I meant it's still a P1.
Comment 22 Brian R. Bondy [:bbondy] 2012-05-12 06:13:19 PDT
I'm working on this over the weekend by the way, I'll have a patch before the end of the weekend.
Comment 23 Brian R. Bondy [:bbondy] 2012-05-12 08:16:01 PDT
Created attachment 623431 [details] [diff] [review]
Patch v5.

note: I used \r\n line endings in HangMonitor.cpp/.h because that's the file's convention already. I can change the whole file to \n after the review if you want as part of the nits.  I didn't change it yet because otherwise it'll be hard to see where the changes are. 

The task works by counting cumulative amounts of lag time when time is spent away from processing a UI event when a UI event is pending.  It does not count time after Suspend is called, it continues counting towards the cumulative value after that when NotifyActivity is called.
Comment 24 Benjamin Smedberg [:bsmedberg] 2012-05-15 12:18:03 PDT
Comment on attachment 623431 [details] [diff] [review]
Patch v5.

How expensive is PeekMessage? Because NotifyActivity is called very frequently, we need to be careful that this doesn't make us slower in general. Have you run this through try-talos?
Comment 25 Brian R. Bondy [:bbondy] 2012-05-15 12:32:58 PDT
We call it in the same way in the main message loop so I don't think it's very expensive, but if it is, we may want to change that too.  I'll try running it through the talos tests on try to see.
Comment 26 :Ehsan Akhgari (out sick) 2012-05-15 12:36:03 PDT
PeekMessage should be fairly inexpensive, it just looks at the queue while potentially grabbing a lock, and returns immediately.  And the lock would not be held by anyone else unless another thread is SendMessage'ing us or something, which should be fairly rare.
Comment 27 Brian R. Bondy [:bbondy] 2012-05-15 13:58:11 PDT
Results will be here once done:
https://tbpl.mozilla.org/?tree=Try&rev=b0aef0a92ad3
Comment 28 Brian R. Bondy [:bbondy] 2012-05-15 21:26:58 PDT
I think the talos results look good, but you can see them in the link on Comment 27.
Comment 29 Brian R. Bondy [:bbondy] 2012-05-25 06:24:47 PDT
http://hg.mozilla.org/mozilla-central/rev/831ff2eb9a28
Comment 30 Bill Gianopoulos [:WG9s] 2012-05-25 07:51:36 PDT
Seems to be a lot more red in builds since this landed.  Just sayin'! ;-)
Comment 31 Brian R. Bondy [:bbondy] 2012-05-25 07:56:18 PDT
Should it be backed out? I tried this on a couple try builds and also elm builds. 
See here:
https://tbpl.mozilla.org/?tree=Try&rev=a54d19c0610e
Comment 32 Bill Gianopoulos [:WG9s] 2012-05-25 08:00:53 PDT
Not sure.  After I added that comment I found things saying there might be infrastructure issues causing the reds.  Just should be looking at both things I guess, but not necessarily jumping to conclusions at this point.
Comment 33 Brian R. Bondy [:bbondy] 2012-05-25 08:03:07 PDT
Looks like the push before started getting them for Linux builds, then my push had them happen for all other platforms too. I don't think it is related to this push.
Comment 34 Brian R. Bondy [:bbondy] 2012-05-25 10:47:14 PDT
Everything's back to normal, I just won the prize of having the first changeset to see the problem wider spread problem :)

Note You need to log in before you can comment on or make changes to this bug.