Closed Bug 1221674 Opened 4 years ago Closed 4 years ago

Add telemetry probe in the content process to measure the time between refresh driver ticks

Categories

(Core :: Graphics, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla45
Tracking Status
firefox44 --- fixed
firefox45 --- fixed

People

(Reporter: mchang, Assigned: mchang)

References

Details

Attachments

(1 file, 5 obsolete files)

No description provided.
This is a lot more involved than I expected since we needed to find out the vsync rate to determine if the tick is actually delayed, and we can only get that from the parent process. This patch adds a new method to VsyncSource::GetVsyncRate(), which queries the hardware for the vsync rate in the parent process. When the content refresh driver starts, it queries through IPC, the vsync rate on the parent side. On NotifyVsyncs(), it starts recording telemetry probes once the vsync rate is available. I checked with Avih and b2g isn't a huge priority, so there isn't telemetry support to get the vsync rate there yet.
Attachment #8683383 - Flags: review?(bugmail.mozilla)
Comment on attachment 8683383 [details] [diff] [review]
Add telemetry probe to detect late refresh driver ticks

Review of attachment 8683383 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/base/nsRefreshDriver.cpp
@@ +374,5 @@
> +        Telemetry::Accumulate(Telemetry::FX_REFRESH_DRIVER_CHROME_FRAME_DELAY_MS,
> +                              vsyncLatency.ToMilliseconds());
> +      } else if (mVsyncRate != TimeDuration::Forever()) {
> +        TimeDuration contentDelay = (TimeStamp::Now() - mLastChildTick) - mVsyncRate;
> +        if (contentDelay.ToMilliseconds() < 0 ){

nit: move space from before ')' to after it

@@ +377,5 @@
> +        TimeDuration contentDelay = (TimeStamp::Now() - mLastChildTick) - mVsyncRate;
> +        if (contentDelay.ToMilliseconds() < 0 ){
> +          // Vsyncs are noisy and some can come at a rate quicker than
> +          // the reported hardware rate. In those cases, consider that we have 0 delay.
> +          contentDelay = TimeDuration::FromMilliseconds(0);

I would actually not bother doing this, assuming telemetry can handle negative numbers in their histograms. It's easy enough to normalize this to >= 0 in the dashboard if we need to, and it might be useful to also have data on how noisy the vsyncs are. WDYT?
Attachment #8683383 - Flags: review?(bugmail.mozilla) → review+
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #2)
> Comment on attachment 8683383 [details] [diff] [review]
> Add telemetry probe to detect late refresh driver ticks
> 
> Review of attachment 8683383 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: layout/base/nsRefreshDriver.cpp
> @@ +374,5 @@
> > +        Telemetry::Accumulate(Telemetry::FX_REFRESH_DRIVER_CHROME_FRAME_DELAY_MS,
> > +                              vsyncLatency.ToMilliseconds());
> > +      } else if (mVsyncRate != TimeDuration::Forever()) {
> > +        TimeDuration contentDelay = (TimeStamp::Now() - mLastChildTick) - mVsyncRate;
> > +        if (contentDelay.ToMilliseconds() < 0 ){
> 
> nit: move space from before ')' to after it
> 
> @@ +377,5 @@
> > +        TimeDuration contentDelay = (TimeStamp::Now() - mLastChildTick) - mVsyncRate;
> > +        if (contentDelay.ToMilliseconds() < 0 ){
> > +          // Vsyncs are noisy and some can come at a rate quicker than
> > +          // the reported hardware rate. In those cases, consider that we have 0 delay.
> > +          contentDelay = TimeDuration::FromMilliseconds(0);
> 
> I would actually not bother doing this, assuming telemetry can handle
> negative numbers in their histograms. It's easy enough to normalize this to
> >= 0 in the dashboard if we need to, and it might be useful to also have
> data on how noisy the vsyncs are. WDYT?

That sounds fine to me, I just thought it would make the data cleaner. I agree though, it would be nice to see how noisy vsync is in the real world.
Mason: 
- so if want to compare e10s vs non-e10s frame delays, is it correct to compare the aggregate histogram of FX_REFRESH_DRIVER_CHROME_FRAME_DELAY_MS & FX_REFRESH_DRIVER_CONTENT_FRAME_DELAY_MS from e10s against the FX_REFRESH_DRIVER_CHROME_FRAME_DELAY_MS histogram of non-e10s?
- Don't forget to uplift this patch as well
Flags: needinfo?(mchang)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #6)
> Mason: 
> - so if want to compare e10s vs non-e10s frame delays, is it correct to
> compare the aggregate histogram of FX_REFRESH_DRIVER_CHROME_FRAME_DELAY_MS &
> FX_REFRESH_DRIVER_CONTENT_FRAME_DELAY_MS from e10s against the
> FX_REFRESH_DRIVER_CHROME_FRAME_DELAY_MS histogram of non-e10s?

Correct, but we expect that FX_REFRESH_DRIVER_CONTENT_FRAME_DELAY_MS, with e10s and with APZ, will increase significantly, but to the user, this would appear as checker boarding rather than as jank. For main thread scrolling, this should be roughly the same, but maybe 1-2 ms higher as there is delay with sending an IPC message from parent -> content.

> - Don't forget to uplift this patch as well

I'll uplift this to aurora. It doesn't need to go to beta as there is no e10s there.
Sorry, I want to make sure I understand your comments completely:

> we expect that FX_REFRESH_DRIVER_CONTENT_FRAME_DELAY_MS, with e10s and with APZ, will increase significantly

1) You're saying you expect this content histogram to report increased latencies only in profiles where both e10s & APZ are enabled, right?
2) And in reality, the user will see checkerboarding instead of jank, so this content histogram will be over-reporting jank then, right?
3) Why exactly does e10s+APZ cause reported jank to increase significantly (vs just e10s)?

> For main thread scrolling, this should be roughly the same, but maybe 1-2 ms higher as there is delay with sending an IPC message from parent -> content.

4) Where is the extra latency coming from? From parent process forwarding the event to the content process?
5) If so, then the reported additional jank is real, right?
> I'll uplift this to aurora. It doesn't need to go to beta as there is no e10s there.

We're running e10s A/B experiments on the current Beta, so let's uplift to Beta too
Blocks: 1222894
https://hg.mozilla.org/mozilla-central/rev/43305927701d
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla45
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #8)
> Sorry, I want to make sure I understand your comments completely:
> 
> > we expect that FX_REFRESH_DRIVER_CONTENT_FRAME_DELAY_MS, with e10s and with APZ, will increase significantly
> 
> 1) You're saying you expect this content histogram to report increased
> latencies only in profiles where both e10s & APZ are enabled, right?

Correct.

> 2) And in reality, the user will see checkerboarding instead of jank, so
> this content histogram will be over-reporting jank then, right?

Yes, but in the cases where APZ + e10s is enabled, the meaning of this probe should stop being jank and should become somewhat more synonymous with paint / layout in scrolling cases. In cases with games / content animations, then it will be jank, but I think in reality, it will be difficult to distinguish the cases just from the histogram.

> 3) Why exactly does e10s+APZ cause reported jank to increase significantly
> (vs just e10s)?

With APZ, scrolling is handled by APZ and APZ requests content to paint a display port. The displayport is significantly bigger than the viewport (up to 5x I hope after some more tweaking on desktop and high mem systems!). APZ by design paints the content above/below the viewport such that at the start of a scroll, there will be no checkerboarding. While scrolling fast, APZ predicts where the scroll will be and attempts to paint that content as well. Therefore, per frame, we will paint more content than we would without APZ. Since we're painting more, we expect the time to paint per refresh driver tick to increase, hence this telemetry probe should increase. Does that make sense? In reality, this probe should not be thought of as jank in the e10s / APZ scrolling case. Since the actual scroll event is handled by APZ in the parent process, the scroll would be smooth to the user.

> 
> > For main thread scrolling, this should be roughly the same, but maybe 1-2 ms higher as there is delay with sending an IPC message from parent -> content.
> 
> 4) Where is the extra latency coming from? From parent process forwarding
> the event to the content process?

Yes, the vsync happens and there can be 1-2 ms delay from the parent process to the content process, depending on how busy the main thread is in the content process.

> 5) If so, then the reported additional jank is real, right?

Not in the e10s/APZ case. See the response to question 3.

Does this clarify everything for you?
Flags: needinfo?(mchang) → needinfo?(vladan.bugzilla)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #9)
> > I'll uplift this to aurora. It doesn't need to go to beta as there is no e10s there.
> 
> We're running e10s A/B experiments on the current Beta, so let's uplift to
> Beta too

This patch is kind of risky and I wouldn't want to lift all the way to beta. Will aurora be enough? That means we'll have one release that doesn't have this data.
> Does this clarify everything for you?

It does, thanks!

(In reply to Mason Chang [:mchang] from comment #12)
> This patch is kind of risky and I wouldn't want to lift all the way to beta.
> Will aurora be enough? That means we'll have one release that doesn't have
> this data.

We really need this probe for the experiment we're running on Beta (in a week), but if it's risky (wrt crashing), then we have no choice but to wait to uplift to Beta. Is that the case?

Either way, let's keep the bug open until we uplift to beta
Flags: needinfo?(vladan.bugzilla)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #13)
> We really need this probe for the experiment we're running on Beta (in a
> week), but if it's risky (wrt crashing), then we have no choice but to wait
> to uplift to Beta. Is that the case?
> 
> Either way, let's keep the bug open until we uplift to beta

Yes, it uses some extra hardware features to detect the vsync rate. I guess we can watch crash-stats for a bit and see if anything comes up. I'm also relatively uplift averse in general, but if you insist, we'll uplift.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 1198199
Approval Request Comment
[Feature/regressing bug #]: None. This is requested by Vladan so that we can measure performance on beta of e10s vs non-e10s.
[User impact if declined]: None
[Describe test coverage new/current, TreeHerder]: Manual testing, mochitest
[Risks and why]: Low - Med - Adds a new ping to request the vsync rate from hardware during startup. Might risk a start up crash, but I don't see anything on crash-stats for 45 right now.
[String/UUID change made/needed]: None
Attachment #8685167 - Flags: approval-mozilla-aurora?
Note.. the new code misuses the term "Rate": a rate is the frequency that something happens, and an interval is the duration between adjacent occurrences.  So, a vsync rate would be "60 Hz", as in 60 frames per second, whereas a vsync interval would be "1000 / 60 ms".  The code that was added here returns a TimeDuration from GetVsyncRate(), where that TimeDuration represents the vsync interval.  I've seen this confusion in other bits of code too.

I'm going to rename the functions/variables that were added here (and elsewhere) in the vsync rework to use Interval instead of Rate to avoid this.
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #16)
> Note.. the new code misuses the term "Rate": a rate is the frequency that
> something happens, and an interval is the duration between adjacent
> occurrences.  So, a vsync rate would be "60 Hz", as in 60 frames per second,
> whereas a vsync interval would be "1000 / 60 ms".  The code that was added
> here returns a TimeDuration from GetVsyncRate(), where that TimeDuration
> represents the vsync interval.  I've seen this confusion in other bits of
> code too.
> 
> I'm going to rename the functions/variables that were added here (and
> elsewhere) in the vsync rework to use Interval instead of Rate to avoid this.

Thanks for finding that! Please feel free to rename it :).
I looked at this patch uplift request but will wait for a day or two before approving uplift to Aurora because Mason has expressed a possible risk of startup crash. I'd like to be certain (or gather more data) that we are not regressing stability-wise before uplifting to Aurora44.
Blocks: 1198650
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #2)
> > +        if (contentDelay.ToMilliseconds() < 0 ){
> > +          // Vsyncs are noisy and some can come at a rate quicker than
> > +          // the reported hardware rate. In those cases, consider that we have 0 delay.
> > +          contentDelay = TimeDuration::FromMilliseconds(0);
> 
> I would actually not bother doing this, assuming telemetry can handle
> negative numbers in their histograms. It's easy enough to normalize this to
> >= 0 in the dashboard if we need to, and it might be useful to also have
> data on how noisy the vsyncs are. WDYT?

Telemetry can't handle negative values :/ Can you put this back?
Flags: needinfo?(mchang)
Flags: needinfo?(mchang)
Attachment #8686795 - Flags: review?(vladan.bugzilla)
(In reply to Ritu Kothari (:ritu) from comment #18)
> I looked at this patch uplift request but will wait for a day or two before
> approving uplift to Aurora because Mason has expressed a possible risk of
> startup crash. I'd like to be certain (or gather more data) that we are not
> regressing stability-wise before uplifting to Aurora44.

I haven't seen any crashes in crash-stats, so this looks stable so far.
No longer blocks: 1222894
Comment on attachment 8686795 [details] [diff] [review]
Part 2: Correct for negative content delay values

Review of attachment 8686795 [details] [diff] [review]:
-----------------------------------------------------------------

Avi: What do you think of reporting frames that are too early(?) as being on time (0 delay). It seems like we should be reporting the absolute value of the delay instead, no?

Mason, can you fix up the 2 frame delay histogram definition:
- add an alert_emails field, perf-telemetry-alerts@mozilla.com is fine
- add a bug_numbers field with this bug #
- change expires_in_version to "never"
Attachment #8686795 - Flags: review?(vladan.bugzilla) → review?(avihpit)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #22)
> Comment on attachment 8686795 [details] [diff] [review]
> Part 2: Correct for negative content delay values
> 
> Review of attachment 8686795 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Avi: What do you think of reporting frames that are too early(?) as being on
> time (0 delay). It seems like we should be reporting the absolute value of
> the delay instead, no?

I think this would be bad to do the absolute value. For example, if a vsync comes in early, it's not actually a bad thing, but it is bad if we're very late. If a vsync came in very early (can happen on weird hardware), the user wouldn't notice anything. If it came in late, the user could see jank. Reporting the absolute value could skew the results that bad things are happening when they might not actually be. Does that make sense?


> 
> Mason, can you fix up the 2 frame delay histogram definition:
> - add an alert_emails field, perf-telemetry-alerts@mozilla.com is fine
> - add a bug_numbers field with this bug #
> - change expires_in_version to "never"

Done!
Attachment #8686795 - Attachment is obsolete: true
Attachment #8686795 - Flags: review?(avihpit)
Attachment #8686843 - Flags: review?(avihpit)
Looks good, thanks
Comment on attachment 8686843 [details] [diff] [review]
Part 2: Correct for negative content delay values

Review of attachment 8686843 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/base/nsRefreshDriver.cpp
@@ +377,5 @@
>          TimeDuration contentDelay = (TimeStamp::Now() - mLastChildTick) - mVsyncRate;
> +        if (contentDelay.ToMilliseconds() < 0 ){
> +          // Vsyncs are noisy and some can come at a rate quicker than
> +          // the reported hardware rate. In those cases, consider that we have 0 delay.
> +          contentDelay = TimeDuration::FromMilliseconds(0);

I like this. Early should not really happen, and if it does, recording "0" sounds quite good to me.

If we end up with many "0" at the collected telemetry values, we can reconsider (maybe even a different probe).
Attachment #8686843 - Flags: review?(avihpit) → review+
Mason, I am glad that this looks good so far stability wise. Do you need to also update the aurora patch to handle negative values? I noticed an r+ on patch2 and wondering if the nom'd aurora patch is good or needs changes.
Flags: needinfo?(mchang)
(In reply to Ritu Kothari (:ritu) from comment #26)
> Mason, I am glad that this looks good so far stability wise. Do you need to
> also update the aurora patch to handle negative values? I noticed an r+ on
> patch2 and wondering if the nom'd aurora patch is good or needs changes.

Part 2 just landed in inbound today. I'd have to update the nom'd aurora patch to include part 2. Part 2 should be fairly stable and isn't very risky. We can land both part 1 and 2 together in aurora. Will that work? Thanks!
Flags: needinfo?(mchang) → needinfo?(rkothari)
https://hg.mozilla.org/mozilla-central/rev/207584b0b49d
Status: REOPENED → RESOLVED
Closed: 4 years ago4 years ago
Resolution: --- → FIXED
(In reply to Mason Chang [:mchang] from comment #28)
 
> Part 2 just landed in inbound today. I'd have to update the nom'd aurora
> patch to include part 2. Part 2 should be fairly stable and isn't very
> risky. We can land both part 1 and 2 together in aurora. Will that work?
> Thanks!

If part2 is optional and was not originally planned for aurora uplift, we could just uplift part1. To me it seemed, part1 alone might not be sufficient. But you know the code better and I will go with your recommendation. So should we just uplift part1? OR part1 + part2? If you want both 1+2, then re-nominate the correct patches for aurora uplift when ready and I will approve. Hope that helps!
Flags: needinfo?(rkothari) → needinfo?(mchang)
Comment on attachment 8686843 [details] [diff] [review]
Part 2: Correct for negative content delay values

See comment 15.
Flags: needinfo?(mchang)
Attachment #8686843 - Flags: approval-mozilla-aurora?
(In reply to Ritu Kothari (:ritu) from comment #30)
> (In reply to Mason Chang [:mchang] from comment #28)
>  
> > Part 2 just landed in inbound today. I'd have to update the nom'd aurora
> > patch to include part 2. Part 2 should be fairly stable and isn't very
> > risky. We can land both part 1 and 2 together in aurora. Will that work?
> > Thanks!
> 
> If part2 is optional and was not originally planned for aurora uplift, we
> could just uplift part1. To me it seemed, part1 alone might not be
> sufficient. But you know the code better and I will go with your
> recommendation. So should we just uplift part1? OR part1 + part2? If you
> want both 1+2, then re-nominate the correct patches for aurora uplift when
> ready and I will approve. Hope that helps!

Thanks! We should uplift both part 1 and part 2.
Comment on attachment 8685167 [details] [diff] [review]
Add telemetry probe to detect late refresh driver ticks aurora

This patch is needed in 44 (and 43 too perhaps), it has been in Nightly for over a week, seems safe to uplift to Aurora44.
Attachment #8685167 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8686843 [details] [diff] [review]
Part 2: Correct for negative content delay values

This patch is needed on Aurora too as telemetry cannot handle negative values. Aurora44+.
Attachment #8686843 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attached patch Combined aurora patch (obsolete) — Splinter Review
Combined part1 and part 2 patch rebased for aurora. Can you please uplift this patch? Thanks!
Flags: needinfo?(mchang) → needinfo?(wkocher)
Attachment #8689716 - Flags: review+
It looks like having the bug_numbers field in Histograms.js is unsupported on aurora. Was this a new feature added to Telemetry in Gecko 45? Thanks!
Flags: needinfo?(vladan.bugzilla)
Flags: needinfo?(wkocher)
Sorry, uploaded wrong patch! This is a combined patch that should work for aurora. It deletes the bug_number field from Histograms.json. Can you please uplift this patch? Thanks!
Attachment #8689716 - Attachment is obsolete: true
Flags: needinfo?(wkocher)
Someone will, possibly won't be me, though.
Flags: needinfo?(wkocher)
Chutten from irc was nice enough to verify that bug_numbers in Histograms.json was an added feature in Gecko 45, bug 1219733.
Flags: needinfo?(vladan.bugzilla)
Comment on attachment 8689718 [details] [diff] [review]
Combined aurora patch

Combined patch that is rebased for Aurora since the original patches didn't apply. Can we please get approval for this combined patch? Thanks!
Flags: needinfo?(rkothari)
Attachment #8689718 - Flags: approval-mozilla-aurora?
Comment on attachment 8683383 [details] [diff] [review]
Add telemetry probe to detect late refresh driver ticks

Obsolete for aurora.
Attachment #8683383 - Attachment is obsolete: true
Comment on attachment 8685167 [details] [diff] [review]
Add telemetry probe to detect late refresh driver ticks aurora

Obsolete for aurora.
Attachment #8685167 - Attachment is obsolete: true
Comment on attachment 8686843 [details] [diff] [review]
Part 2: Correct for negative content delay values

Obsolete for aurora.
Attachment #8686843 - Attachment is obsolete: true
Flags: needinfo?(rkothari)
Attachment #8689718 - Flags: approval-mozilla-aurora?
You need to log in before you can comment on or make changes to this bug.