If you think a bug might affect users in the 57 release, please set the correct tracking and status flags for Release Management.

Parent INPUT_EVENT_RESPONSE_MS p95 regressed from 16.3 to 18ms on 2017-04-05 build

RESOLVED INCOMPLETE

Status

()

Core
General
RESOLVED INCOMPLETE
5 months ago
3 months ago

People

(Reporter: Harald, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [qf:p1])

(Reporter)

Description

5 months ago
Telemetry: https://mzl.la/2qw9MiR
(Reporter)

Updated

5 months ago
Summary: Parent Input Latency regressed on p95, 16.3 to 18ms, on 2017-04-05 build → Parent INPUT_EVENT_RESPONSE_MS p95 regressed from 16.3 to 18ms on 2017-04-05 build
JP, was anyone investigating this already?
Flags: needinfo?(jpr)

Comment 2

5 months ago
Not that I'm aware of.
Flags: needinfo?(jpr)

Comment 3

5 months ago
To be clear: this shows up *only* in the chrome process. The equivalent metric on the content process shows no variation at the same time: https://mzl.la/2pDnf6V
Regression range: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=38894655c89e68bcd8f45d31a0d3005f2c2b53db&tochange=3c68d659c2b715f811708f043a1e7169d77be2ba
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #4)
> Regression range:
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=38894655c89e68bcd8f45d31a0d3005f2c2b53db&tochange=3c68
> d659c2b715f811708f043a1e7169d77be2ba

I looked at everything in this range trying to find things that looked suspicious, in no particular order:

 * https://hg.mozilla.org/mozilla-central/rev/0f320509eddb (bug 1351071)
 * https://hg.mozilla.org/mozilla-central/rev/4d8192b5ac7e (bug 1305950)
 * https://hg.mozilla.org/mozilla-central/rev/ec044ae4bb52 (bug 1353533)

I'll needinfo the assigneed of these bugs on asking about them, requesting for a backout to see if that helps things.
Depends on: 1353533
Depends on: 1351071
Depends on: 1305950

Comment 6

5 months ago
Interesting, if it is bug 1351071, that may be another reason to uncompress omni.ja.  See bug 1362377.

Comment 7

5 months ago
(In reply to JP Rosevear [:jpr] from comment #6)
> Interesting, if it is bug 1351071, that may be another reason to uncompress
> omni.ja.  See bug 1362377.

Bug 1351071 isn't related to omni.ja compression; it's about not shipping pre-compiled JS bytecode in Firefox updates.
Tonight's Nightly should have the backout of bug 1351071.

Harald, do you know how long will we need to wait before we know whether the regression did in fact come from that bug?
Flags: needinfo?(hkirschner)
No longer depends on: 1305950
(Reporter)

Comment 9

5 months ago
1 day between code reaching users pings starting to come in and you could start seeing a trend. 2 days wait time is best to have most submissions.

Now, end of May 10th, t.m.o has already captured some submissions for May 9th; but shows no improvement.
Flags: needinfo?(hkirschner)
Depends on: 1364920
Due to bug 1364920, I don't trust any of the data we have on telemetry right now.
Depends on: 1364556
No longer depends on: 1364920
Whiteboard: [qf:p1]
See Also: → bug 1365334
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #10)
> Due to bug 1364920, I don't trust any of the data we have on telemetry right
> now.

The data is now up-to-date. The root cause was a non backwards-compatible change in Histograms.json; the job is now back to normal.
The backout of bug 1351071 doesn't seem to have changed anything on the telemetry data, I just asked for the patch there to be relanded.

Bug 1305950 based on Will's description there sounds like an unlikely culprit, and Tim said in bug 1353533 that he doesn't think that one may be a likely cause either.

It's possible that I may have missed something in the regression range...  Kan-Ru, any chance you can have a look please?
Flags: needinfo?(kchen)
I found :dbaron landed changes to PLDHash both on Apr 5 and May 5.

David helped backing out changes from bug 1352888 and 1352889 to see if that changes anything.
Flags: needinfo?(kchen)
The backout of bug 1352888 didn't make any difference.  I'm going to reland it, but that also means you need to look for a different culprit here.
Flags: needinfo?(kchen)
I think I might have found the cause of the input response regression.

First I noticed that the telemetry histogram for Windows looks very different to other platforms. For Windows the input response time are spread out but OSX and Linux are concentrated on the zero to one interval. I can reproduce this locally on my laptops.

Second I tried to profile wheel events because my telemetry data locally shows there are sometimes 29ms input response time -- I should be able to find something in the profile. From the profile I can see my mouse can produce 1 wheel scroll message per 8ms (0.2ms resolution) and the main thread was quiet, nothing in between the message handler.

The profile does not show any sign of input delay. Could it be that event time was wrong? All the INPUT_*_MS probe use formula like (TimeStamp::Now() - Event->mTimeStamp).ToMilliseconds() to calculate the latency so Event->mTimeStamp must use the same base as TimeStamp::Now(). Indeed the WidgetEventTime class initialize mTimeStamp with TimeStamp::Now()¹. But the widgets always try to update mTimeStamp to the native event time.

For Windows the native event time comes from ::GetMessageTime(). According to MSDN² it "retrieves the message time for the last message retrieved by the GetMessage function. The time is a long integer that specifies the elapsed time, in milliseconds, from the time the system was started to the time the message was created (that is, placed in the thread's message queue)."

The timestamp includes the time in the message queue and may have the resolution from 10ms to 16ms (it is not specified but many source say GetMessageTime is implemented by using GetTickCount)³. TimeStamp::Now() uses the high resolution timer.

So the possible explanation might be:

1. The time in the Windows message queue is not controlled by gecko, some Windows update or third-party intercepting the messages may have increased the latency.
2. The mTimeStamp has low resolution so the probe is flawed.
3. Both

I tried to use TimeStamp::Now() as the mTimeStamp for events and the telemetry does look better with latency in the zero to one interval. Next week I can try to measure the latency in the Windews message queue and the GetMessageTime resolution when I can use the Windows machine in the office.

Brian, it seems you implemented the current native time to TimeStamp conversion, what do you think?

[1]: http://searchfox.org/mozilla-central/rev/972b7a5149bbb2eaab48aafa87678c33d5f2f045/widget/BasicEvents.h#268
[2]: https://msdn.microsoft.com/en-us/library/windows/desktop/ms644939(v=vs.85).aspx
[3]: https://msdn.microsoft.com/en-us/library/windows/desktop/ms725496(v=vs.85).aspx
Flags: needinfo?(kchen) → needinfo?(bbirtles)
(Reporter)

Comment 16

4 months ago
Related, I filed https://bugzilla.mozilla.org/show_bug.cgi?id=1318524 after a discussion with Chrome Input engineers but didn't have good arguments for why this is needed. Looks like we have some evidence now so it might make sense to normalize event time stamps.

Comment 17

4 months ago
> For Windows the native event time comes from ::GetMessageTime(). According
> to MSDN² it "retrieves the message time for the last message retrieved by
> the GetMessage function. The time is a long integer that specifies the
> elapsed time, in milliseconds, from the time the system was started to the
> time the message was created (that is, placed in the thread's message
> queue)."

Yeah, we want to include this time explicitly in the measurement because it's a sign of our event loop being backed up. It's an unfortunate bug that we don't seem to be able to get similar measurements out of the Mac/GTK event systems, so we primarily are looking at input latency on Windows right now.

> The timestamp includes the time in the message queue and may have the
> resolution from 10ms to 16ms (it is not specified but many source say
> GetMessageTime is implemented by using GetTickCount)³. TimeStamp::Now() uses
> the high resolution timer.
> 
> So the possible explanation might be:
> 
> 1. The time in the Windows message queue is not controlled by gecko, some
> Windows update or third-party intercepting the messages may have increased
> the latency.

Still important and a regression, but would be good to know.

> I tried to use TimeStamp::Now() as the mTimeStamp for events and the
> telemetry does look better with latency in the zero to one interval. Next
> week I can try to measure the latency in the Windews message queue and the
> GetMessageTime resolution when I can use the Windows machine in the office.
> 
> Brian, it seems you implemented the current native time to TimeStamp
> conversion, what do you think?

It's an interesting theory. I'd almost say it's more important to have lower resolution and include the pending dispatch time, so we should use GetTickCount(64) as the basis for both measures.

Comment 18

4 months ago
Luckily a lot of derived metrics from this probe are worried about the incidence of events in the >100ms (or even higher thresholds) range which will be affected somewhat less by a multi-ms-resolution timer like GetTickCount.

So we have a low-resolution time for platform delay, and high-resolution time for gecko delay. Given that we want to include platform delay we're kinda stuck with a low resolution result. My only concern is whether the low-res and high-res timer origins drift.

IOW, if I get a low-resolution "42" from the OS and a high-resolution "54.382" when we're done with it, is it safe to say that the "real" duration is 12.382 plus or minus the low-resolution timer's uncertainty?
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #15)
> I tried to use TimeStamp::Now() as the mTimeStamp for events and the
> telemetry does look better with latency in the zero to one interval.

Sorry I'm not quite sure I follow. If this just means that using TimeStamp::Now() produces less latency, as measured by this metric, then that's not surprising. I'm pretty sure you know this, but, in general, we expect the result of TimeStamp::Now(), called when creating the Gecko event, to be later than the actual OS event time (as described in [1]). As a result, calculating (TimeStamp::Now() - Event->mTimeStamp).ToMilliseconds() is generally going to produce a smaller number if Event->mTimeStamp is set from TimeStamp::Now() rather than calculated from the actual time the message was queued.

Obviously that will be even more pronounced when we are flooded with events as the delta between the time when each event is added to the thread's message queue and the time when we actually process the event, create the equivalent Gecko event, and assign its mTimeStamp member will grow. And obviously the intention here is that this latency *should* be reflected in the event's timestamp member -- that is, we should not be trying to smooth the times out by using TimeStamp::Now().

But I'm pretty sure you know all that so I guess you're saying something else. Perhaps that you've determined the increase in latency is only for this case where the event loop is backed up?

I don't know of any external factors that might have caused this regression. As for internal factors, if this is only in the chrome process could it be something like increased use of Promises in chrome code starving the event loop? Or changes in chrome refresh driver scheduling? Or something of that nature?

[1] http://searchfox.org/mozilla-central/rev/507743376d1ba753d14ab6b9305b7c6358570be8/widget/SystemTimeConverter.h#53

(In reply to Chris H-C :chutten from comment #18)
> So we have a low-resolution time for platform delay, and high-resolution
> time for gecko delay. Given that we want to include platform delay we're
> kinda stuck with a low resolution result. My only concern is whether the
> low-res and high-res timer origins drift.
> 
> IOW, if I get a low-resolution "42" from the OS and a high-resolution
> "54.382" when we're done with it, is it safe to say that the "real" duration
> is 12.382 plus or minus the low-resolution timer's uncertainty?

The event time to TimeStamp conversion is pretty complicated and is supposed to handle the different kinds of clock skew we might encounter. Although I wrote it, I suspect Karl actually understands it better than I (and I believe he had some concerns about the different time sources on Windows at one point) so adding him to CC.
Flags: needinfo?(bbirtles)
Thanks, Benjamin and Brian. I understand that we want to include the time in the OS message queue. OSX and GTK+ backends are also taking account of the time from the native event so it's strange that the histogram look so different on Windows.

If I understand correctly, the effect of the low-res time to the SystemTimeConverter is like this:

(with the timeline below, all units in milliseconds, assuming Time and TimeStamp has same origin)

 L0   L0'              L1
  |---+------|----------|----+--+--|---+--->
      H0                     E  H1     T
  \.../                 \..../
    D1                    D2

L0 is the mReferenceTime, the time of the first native event.
H0 is the mReferenceTimeStamp calculated in UpdateReferenceTime()
   UpdateReferenceTime() tries match H0 and L0 by calculate D1 from L0' - L0 but
   L0' can be identical to L0 due to low-res timer. So H0 has a non-zero offset D1.
H1 is roughlyNow.
E  is the timestamp of a native input event. (unknown)
L1 is the time from the event E, due to the low-res timer, L1 and the real timestamp
   has a offset D2

SystemTimeConverter calculates the timestamp of event E from L1 using the formula

  S = H1 - ((H1 - H0) - (L1 - L0))
    = H1 - H1 + H0 + L1 - L0
    = H0 + (L1 - L0)

If we replace L1 and L0 with offsets:

  S = H0 + ((E - D2) - (H0 - D1))
    = H0 - (H0 - D1) + (E - D2)
    = E + D1 - D2

The calculated timestamp S and the real timestamp E has a offset (D1 - D2)

When we want to calculate the latency probe INPUT_EVENT_RESPONSE_MS when T

  INPUT_EVENT_RESPONSE_MS = T - S
                          = T - E - D1 + D2

D1 is usually small and D2 can vary between 0 to 16ms (the resolution of GetTickCount(64)) so the INPUT_EVENT_RESPONSE_MS value can vary between 0ms to 18ms if (T-E) is 0ms to 2ms.

I think this can explain why the histogram on Windows has value between 0ms and 24ms and many fall between 8ms and 17ms.

I haven't verified the value of D1 and D2 but I've confirmed GetTickCount/GetTickCount64/GetMessageTime all have low resolution.

This doesn't explain the jump on 4/5 and on 5/6 nightly build but maybe we can investigate if D1 or D2 changed on that days.
Any updates, Kan-Ru?
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #21)
> Any updates, Kan-Ru?

Ah, I was going to write a summary.

So, I tried various way to measure and bisect. My conclusion is still the same as comment 15 and comment 20. We don't have enough resolution so I won't trust the telemetry when the measurement is close to 1x milliseconds.

IMO it's not worth spending more time finding the cause unless we see more real cases. First, I measured the ProcessNextNativeEvent throughput, we can easily process 100 events/ms while the telemetry still shows 17ms latency. Second, for normal inputs, the lag is acceptable up to around 100ms; we should continue to monitor INPUT_EVENT_RESPONSE_MS but I'm not too worried about it for now.

If we want to get a more accurate input latency telemetry number, we should use something like the toolkit/xre/EventTracer.cpp: dispatch synthesized event to the native event queue and measure the time when it's processed using hi-res timestamp. We should not constantly dispatch such event though. I think we can randomly dispatch the trace event when we know the user is interacting with the page, this way the result will be more representative.

What do you think?
Flags: needinfo?(ehsan)
Flags: needinfo?(benjamin)

Comment 23

3 months ago
I believe this is almost exactly what EVENTLOOP_UI_ACTIVITY_MS does in the HangMonitor: http://searchfox.org/mozilla-central/source/xpcom/threads/HangMonitor.cpp
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #22)
> If we want to get a more accurate input latency telemetry number, we should
> use something like the toolkit/xre/EventTracer.cpp: dispatch synthesized
> event to the native event queue and measure the time when it's processed
> using hi-res timestamp. We should not constantly dispatch such event though.
> I think we can randomly dispatch the trace event when we know the user is
> interacting with the page, this way the result will be more representative.

I think this may be a good idea.  And it's interesting that the probe that chutten mentions in comment 23 doesn't show a similar regression: https://mzl.la/2roT5Ty
Flags: needinfo?(ehsan)

Comment 25

3 months ago
I in general agree with comment 22. Latency below a threshold doesn't matter to users and so the difference between 16 and 18 is both invisible to users and well within the noise threshold of this measurement. That's why for quantum and mission control we've picked thresholds like 250ms and are focusing on those because they are clearly terrible and measurement variability of 16ms doesn't matter.

One nit to pick, however: we bandy about the "acceptable up to 100ms" rule and that's not strictly true. When the user is taking a new action, latency up to 100ms is generally imperceptible. However during active usage such as typing, dragging, etc, the user perception of jank narrows considerably and we'd like our eventual target to be "second 60fps frame after the action" which is 16-32ms.

I don't know what this means for this bug: closed as INCOMPLETE or INVALID seems fine to me. Let's chase things that actually matter.
Flags: needinfo?(benjamin)

Comment 26

3 months ago
Regarding synthesized latency metric: I wouldn't want to *replace* the coalesced user input metric, because we have pretty high confidence that it reflects actual user behavior. However, it's fine to supplement that with more precise metrics if that would be helpful for diagnosis or regression detection.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #25)
> One nit to pick, however: we bandy about the "acceptable up to 100ms" rule
> and that's not strictly true. When the user is taking a new action, latency
> up to 100ms is generally imperceptible. However during active usage such as
> typing, dragging, etc, the user perception of jank narrows considerably and
> we'd like our eventual target to be "second 60fps frame after the action"
> which is 16-32ms.

Agree. But in order to measure that we need more precise probes.

(In reply to Benjamin Smedberg [:bsmedberg] from comment #26)
> Regarding synthesized latency metric: I wouldn't want to *replace* the
> coalesced user input metric, because we have pretty high confidence that it
> reflects actual user behavior. However, it's fine to supplement that with
> more precise metrics if that would be helpful for diagnosis or regression
> detection.

Agree again. Just keep in mind the resolution limit of the metrics when review the results.

(In reply to Chris H-C :chutten from comment #23)
> I believe this is almost exactly what EVENTLOOP_UI_ACTIVITY_MS does in the
> HangMonitor:
> http://searchfox.org/mozilla-central/source/xpcom/threads/HangMonitor.cpp

I skimmed the code and it indeed looks like they are measuring almost the same thing but only Windows data is collected.
See comment 25. We decided this bug is INCOMPLETE due to the limitation of the probes.
Status: NEW → RESOLVED
Last Resolved: 3 months ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.