Closed Bug 710935 Opened 12 years ago Closed 12 years ago

Measure lag in handling user input

Categories

(Core :: General, defect)

x86
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla15

People

(Reporter: taras.mozilla, Assigned: bbondy)

References

Details

(Whiteboard: [Snappy:P1])

Attachments

(1 file, 4 obsolete files)

In order to measure responsiveness improvements we need to be able to measure event-handling lag. The most obvious theoretical way to do this would be to check if there was an event handler that took a long time preceding a user input event(ie mouseclick or keyboard input).

I would like to stick the time taken into a telemetry histogram.
Component: MathML → General
QA Contact: mathml → general
What do you mean by "event-handling lag"? The responsiveness instrumentation I wrote is sampling the time taken for the main thread to service events sent from a background thread, which is the same way that keyboard/mouse events get into our event loop. Granted this under-measures any time taken by event handling code after the native event is serviced, but it seems like a good estimate.
(In reply to Ted Mielczarek [:ted, :luser] from comment #1)
> What do you mean by "event-handling lag"? The responsiveness instrumentation
> I wrote is sampling the time taken for the main thread to service events
> sent from a background thread, which is the same way that keyboard/mouse
> events get into our event loop. Granted this under-measures any time taken
> by event handling code after the native event is serviced, but it seems like
> a good estimate.

I would like to be able to land heuristics for mitigating lag to actual user input. For that we need to be measure lag that occurs only when responding to user events. A general purpose lag-meter does not help here.

For example, we can suspend event processing on background tabs, cc, gc, etc while moving mouse and then they will take longer once they actually run, but at that point the user wont be attempting to use firefox, so it's ok.
Brian, can you look into this? This should be a good way to track snappy progress.
Whiteboard: [Snappy] → [Snappy:P1]
Assignee: nobody → netzen
Attached patch Patch v1. (obsolete) — Splinter Review
Basically I do the same as you suggested in Comment 0, I measure the amount of time each message takes to process.
If the next message is a UI message and if the last message took more than 50ms, then I report it.

I didn't include things <50ms because it will be almost all of the data for sure (>99%).
I.e. what is important that we measure here is the distribution of how bad the problems that do exist are.  

I believe that measuring via QueryPerformanceCounter gives you the most accurate results so that's how I did it instead of doing it via TimeStamp::Now().

For bucket sizes, the minimum is 50ms, the maximum is 30s, and there are 200 buckets.
So that means each bucket contains a range of 149ms.
Attachment #601824 - Flags: review?(taras.mozilla)
Comment on attachment 601824 [details] [diff] [review]
Patch v1.

This is missing an important piece. We need to check that the event that took some time happened right before the interaction..ie there were <1millisecond between the preceeding event and user activity.
We should also sum overhead for events that happen immediately after each other. Ie you have 10 events back-to-back preceding user activity -> some event loop flooding is occurring.
Attachment #601824 - Flags: review?(taras.mozilla) → review-
> This is missing an important piece. We need to check that the event that took 
> some time happened right before the interaction..ie there were <1millisecond 
> between the preceeding event and user activity.

I had thought of this but thought those cases would be insignificant.  But I agree it's better having that extra check in there. 

> We should also sum overhead for events that happen immediately after each other.
> Ie you have 10 events back-to-back preceding user activity -> some event 
> loop flooding is occurring.

I hadn't thought of this and agree this will give us better stats.

I'll do a new patch tomorrow morning for both items, thanks for the quick review.
Attached patch Patch v2. (obsolete) — Splinter Review
Review comments implemented.
Attachment #601824 - Attachment is obsolete: true
Attachment #601970 - Flags: review?(taras.mozilla)
Comment on attachment 601970 [details] [diff] [review]
Patch v2.

Seems like you could roll QueryPerformanceCounter(&tickEnd); into GetTimeSpanMS(tickStart, tickEnd) to save a line of code.


 
+        // Stop measuring the amount the elapsed time for this message
+        QueryPerformanceCounter(&tickEnd);
+        lastLagTimeMS += GetTimeSpanMS(tickStart, tickEnd);

If I'm reading the code right, you must reuse "GetTimeSpanMS(tickEnd, tickStart) < 10" guard for resetting lastLagTimeMS  when events are not adjacent.
Attachment #601970 - Flags: review?(taras.mozilla) → review-
> Seems like you could roll QueryPerformanceCounter(&tickEnd); into 
> GetTimeSpanMS(tickStart, tickEnd) to save a line of code.

On the first check we need to use the end time from the last message, so we need to keep that value anyway.  So I don't think this will buy us anything in code readability.


Implementing the other comment though.
Attached patch Patch v3. (obsolete) — Splinter Review
Now reset the lag time anytime more than 10ms passes even if there was no ui input.
Attachment #601970 - Attachment is obsolete: true
Attachment #602031 - Flags: review?(taras.mozilla)
Comment on attachment 602031 [details] [diff] [review]
Patch v3.

r+ + weak suggestions:

+static double
+GetTimeSpanMS(const LARGE_INTEGER &tickStart, const LARGE_INTEGER &tickEnd) 
+{

I'm not convinced of the benefit of using a double there(over a long).

s/FXWIDGET/EVENTLOOP//

As discussed on vidyo, I feel we should lower 10ms threshold to something that is 
an absolute minimum that still catches stuff.

bsmedberg should review this before landing.
Attachment #602031 - Flags: review?(taras.mozilla) → review+
Attached patch Patch v4. (obsolete) — Splinter Review
- Carrying forward r+.  
- Changed to 2ms time. 
- I left it at double since the value is exact enough for those decimal places to be useful if we want to gate one of the checks on it.
Attachment #602031 - Attachment is obsolete: true
Attachment #602082 - Flags: superreview?(benjamin)
Attachment #602082 - Flags: review+
Comment on attachment 602082 [details] [diff] [review]
Patch v4.

I don't understand why this code is sufficient or useful. Don't we expect that most of our significant pauses (GCs/CCs/networking-induced pauses) will occur during XPCOM events (nsIRunnables) and not native events? If so, I'm pretty sure that this code doesn't account or measure the most of the lagginess that we want to measure.
It was my understanding that we wanted to measure the amount of time between us actually handling a message for user input.  I'll removing myself from this task in case someone else can take it, I've been moved primarily on Win8 work for now.
Assignee: netzen → nobody
Attachment #602082 - Flags: superreview?(benjamin)
Attachment #602082 - Flags: review+
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #13)
> Comment on attachment 602082 [details] [diff] [review]
> Patch v4.
> 
> I don't understand why this code is sufficient or useful. Don't we expect
> that most of our significant pauses (GCs/CCs/networking-induced pauses) will
> occur during XPCOM events (nsIRunnables) and not native events? If so, I'm
> pretty sure that this code doesn't account or measure the most of the
> lagginess that we want to measure.

< bsmedberg> it's that we exit out of this loop to process XPCOM events, so it goes a little haywire
Thus best place to measure cumulative lag is inbetween HangMonitor::NotifyActivity and HangMonitor::Suspend. HangMonitor should expose a property to access cumulative lag.
Thanks for clarifying, I'll take another spin.
Assignee: nobody → netzen
(In reply to Brian R. Bondy [:bbondy] from comment #16)
> Thanks for clarifying, I'll take another spin.

This seems to have slipped through. Do you have an eta on wrapping this up?
Since I don't have a lot of time outside of win8 work, I'll be working on the prefetch task this week and this task next week.
I have not had a chance to start on this yet.  I noticed that the etherpad lists this as "is this still relevant?".  

Is it? If so, is it a priority (currently listed as P1)?
(In reply to Brian R. Bondy [:bbondy] from comment #19)
> I have not had a chance to start on this yet.  I noticed that the etherpad
> lists this as "is this still relevant?".  

That's referring to the fact that there is nothing to discuss in your absence.
> 
> Is it? If so, is it a priority (currently listed as P1)?

Yes.
By yes, I meant it's still a P1.
I'm working on this over the weekend by the way, I'll have a patch before the end of the weekend.
Attached patch Patch v5.Splinter Review
note: I used \r\n line endings in HangMonitor.cpp/.h because that's the file's convention already. I can change the whole file to \n after the review if you want as part of the nits.  I didn't change it yet because otherwise it'll be hard to see where the changes are. 

The task works by counting cumulative amounts of lag time when time is spent away from processing a UI event when a UI event is pending.  It does not count time after Suspend is called, it continues counting towards the cumulative value after that when NotifyActivity is called.
Attachment #602082 - Attachment is obsolete: true
Attachment #623431 - Flags: review?(benjamin)
Comment on attachment 623431 [details] [diff] [review]
Patch v5.

How expensive is PeekMessage? Because NotifyActivity is called very frequently, we need to be careful that this doesn't make us slower in general. Have you run this through try-talos?
We call it in the same way in the main message loop so I don't think it's very expensive, but if it is, we may want to change that too.  I'll try running it through the talos tests on try to see.
PeekMessage should be fairly inexpensive, it just looks at the queue while potentially grabbing a lock, and returns immediately.  And the lock would not be held by anyone else unless another thread is SendMessage'ing us or something, which should be fairly rare.
Results will be here once done:
https://tbpl.mozilla.org/?tree=Try&rev=b0aef0a92ad3
I think the talos results look good, but you can see them in the link on Comment 27.
Attachment #623431 - Flags: review?(benjamin) → review+
http://hg.mozilla.org/mozilla-central/rev/831ff2eb9a28
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla15
Seems to be a lot more red in builds since this landed.  Just sayin'! ;-)
Should it be backed out? I tried this on a couple try builds and also elm builds. 
See here:
https://tbpl.mozilla.org/?tree=Try&rev=a54d19c0610e
Not sure.  After I added that comment I found things saying there might be infrastructure issues causing the reds.  Just should be looking at both things I guess, but not necessarily jumping to conclusions at this point.
Looks like the push before started getting them for Linux builds, then my push had them happen for all other platforms too. I don't think it is related to this push.
Everything's back to normal, I just won the prize of having the first changeset to see the problem wider spread problem :)
Depends on: 759449
You need to log in before you can comment on or make changes to this bug.