Closed
Bug 1221674
Opened 9 years ago
Closed 9 years ago
Add telemetry probe in the content process to measure the time between refresh driver ticks
Categories
(Core :: Graphics, defect)
Core
Graphics
Tracking
()
RESOLVED
FIXED
mozilla45
People
(Reporter: mchang, Assigned: mchang)
References
Details
Attachments
(1 file, 5 obsolete files)
19.60 KB,
patch
|
Details | Diff | Splinter Review |
No description provided.
Assignee | ||
Comment 1•9 years ago
|
||
This is a lot more involved than I expected since we needed to find out the vsync rate to determine if the tick is actually delayed, and we can only get that from the parent process. This patch adds a new method to VsyncSource::GetVsyncRate(), which queries the hardware for the vsync rate in the parent process. When the content refresh driver starts, it queries through IPC, the vsync rate on the parent side. On NotifyVsyncs(), it starts recording telemetry probes once the vsync rate is available. I checked with Avih and b2g isn't a huge priority, so there isn't telemetry support to get the vsync rate there yet.
Attachment #8683383 -
Flags: review?(bugmail.mozilla)
Comment 2•9 years ago
|
||
Comment on attachment 8683383 [details] [diff] [review]
Add telemetry probe to detect late refresh driver ticks
Review of attachment 8683383 [details] [diff] [review]:
-----------------------------------------------------------------
::: layout/base/nsRefreshDriver.cpp
@@ +374,5 @@
> + Telemetry::Accumulate(Telemetry::FX_REFRESH_DRIVER_CHROME_FRAME_DELAY_MS,
> + vsyncLatency.ToMilliseconds());
> + } else if (mVsyncRate != TimeDuration::Forever()) {
> + TimeDuration contentDelay = (TimeStamp::Now() - mLastChildTick) - mVsyncRate;
> + if (contentDelay.ToMilliseconds() < 0 ){
nit: move space from before ')' to after it
@@ +377,5 @@
> + TimeDuration contentDelay = (TimeStamp::Now() - mLastChildTick) - mVsyncRate;
> + if (contentDelay.ToMilliseconds() < 0 ){
> + // Vsyncs are noisy and some can come at a rate quicker than
> + // the reported hardware rate. In those cases, consider that we have 0 delay.
> + contentDelay = TimeDuration::FromMilliseconds(0);
I would actually not bother doing this, assuming telemetry can handle negative numbers in their histograms. It's easy enough to normalize this to >= 0 in the dashboard if we need to, and it might be useful to also have data on how noisy the vsyncs are. WDYT?
Attachment #8683383 -
Flags: review?(bugmail.mozilla) → review+
Assignee | ||
Comment 3•9 years ago
|
||
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #2)
> Comment on attachment 8683383 [details] [diff] [review]
> Add telemetry probe to detect late refresh driver ticks
>
> Review of attachment 8683383 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> ::: layout/base/nsRefreshDriver.cpp
> @@ +374,5 @@
> > + Telemetry::Accumulate(Telemetry::FX_REFRESH_DRIVER_CHROME_FRAME_DELAY_MS,
> > + vsyncLatency.ToMilliseconds());
> > + } else if (mVsyncRate != TimeDuration::Forever()) {
> > + TimeDuration contentDelay = (TimeStamp::Now() - mLastChildTick) - mVsyncRate;
> > + if (contentDelay.ToMilliseconds() < 0 ){
>
> nit: move space from before ')' to after it
>
> @@ +377,5 @@
> > + TimeDuration contentDelay = (TimeStamp::Now() - mLastChildTick) - mVsyncRate;
> > + if (contentDelay.ToMilliseconds() < 0 ){
> > + // Vsyncs are noisy and some can come at a rate quicker than
> > + // the reported hardware rate. In those cases, consider that we have 0 delay.
> > + contentDelay = TimeDuration::FromMilliseconds(0);
>
> I would actually not bother doing this, assuming telemetry can handle
> negative numbers in their histograms. It's easy enough to normalize this to
> >= 0 in the dashboard if we need to, and it might be useful to also have
> data on how noisy the vsyncs are. WDYT?
That sounds fine to me, I just thought it would make the data cleaner. I agree though, it would be nice to see how noisy vsync is in the real world.
Assignee | ||
Comment 4•9 years ago
|
||
Successful try - https://treeherder.mozilla.org/#/jobs?repo=try&revision=817ef12325cf
Comment 6•9 years ago
|
||
Mason:
- so if want to compare e10s vs non-e10s frame delays, is it correct to compare the aggregate histogram of FX_REFRESH_DRIVER_CHROME_FRAME_DELAY_MS & FX_REFRESH_DRIVER_CONTENT_FRAME_DELAY_MS from e10s against the FX_REFRESH_DRIVER_CHROME_FRAME_DELAY_MS histogram of non-e10s?
- Don't forget to uplift this patch as well
Flags: needinfo?(mchang)
Assignee | ||
Comment 7•9 years ago
|
||
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #6)
> Mason:
> - so if want to compare e10s vs non-e10s frame delays, is it correct to
> compare the aggregate histogram of FX_REFRESH_DRIVER_CHROME_FRAME_DELAY_MS &
> FX_REFRESH_DRIVER_CONTENT_FRAME_DELAY_MS from e10s against the
> FX_REFRESH_DRIVER_CHROME_FRAME_DELAY_MS histogram of non-e10s?
Correct, but we expect that FX_REFRESH_DRIVER_CONTENT_FRAME_DELAY_MS, with e10s and with APZ, will increase significantly, but to the user, this would appear as checker boarding rather than as jank. For main thread scrolling, this should be roughly the same, but maybe 1-2 ms higher as there is delay with sending an IPC message from parent -> content.
> - Don't forget to uplift this patch as well
I'll uplift this to aurora. It doesn't need to go to beta as there is no e10s there.
Comment 8•9 years ago
|
||
Sorry, I want to make sure I understand your comments completely:
> we expect that FX_REFRESH_DRIVER_CONTENT_FRAME_DELAY_MS, with e10s and with APZ, will increase significantly
1) You're saying you expect this content histogram to report increased latencies only in profiles where both e10s & APZ are enabled, right?
2) And in reality, the user will see checkerboarding instead of jank, so this content histogram will be over-reporting jank then, right?
3) Why exactly does e10s+APZ cause reported jank to increase significantly (vs just e10s)?
> For main thread scrolling, this should be roughly the same, but maybe 1-2 ms higher as there is delay with sending an IPC message from parent -> content.
4) Where is the extra latency coming from? From parent process forwarding the event to the content process?
5) If so, then the reported additional jank is real, right?
Comment 9•9 years ago
|
||
> I'll uplift this to aurora. It doesn't need to go to beta as there is no e10s there.
We're running e10s A/B experiments on the current Beta, so let's uplift to Beta too
Updated•9 years ago
|
Blocks: e10s-measurement
Comment 10•9 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 9 years ago
status-firefox45:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla45
Assignee | ||
Comment 11•9 years ago
|
||
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #8)
> Sorry, I want to make sure I understand your comments completely:
>
> > we expect that FX_REFRESH_DRIVER_CONTENT_FRAME_DELAY_MS, with e10s and with APZ, will increase significantly
>
> 1) You're saying you expect this content histogram to report increased
> latencies only in profiles where both e10s & APZ are enabled, right?
Correct.
> 2) And in reality, the user will see checkerboarding instead of jank, so
> this content histogram will be over-reporting jank then, right?
Yes, but in the cases where APZ + e10s is enabled, the meaning of this probe should stop being jank and should become somewhat more synonymous with paint / layout in scrolling cases. In cases with games / content animations, then it will be jank, but I think in reality, it will be difficult to distinguish the cases just from the histogram.
> 3) Why exactly does e10s+APZ cause reported jank to increase significantly
> (vs just e10s)?
With APZ, scrolling is handled by APZ and APZ requests content to paint a display port. The displayport is significantly bigger than the viewport (up to 5x I hope after some more tweaking on desktop and high mem systems!). APZ by design paints the content above/below the viewport such that at the start of a scroll, there will be no checkerboarding. While scrolling fast, APZ predicts where the scroll will be and attempts to paint that content as well. Therefore, per frame, we will paint more content than we would without APZ. Since we're painting more, we expect the time to paint per refresh driver tick to increase, hence this telemetry probe should increase. Does that make sense? In reality, this probe should not be thought of as jank in the e10s / APZ scrolling case. Since the actual scroll event is handled by APZ in the parent process, the scroll would be smooth to the user.
>
> > For main thread scrolling, this should be roughly the same, but maybe 1-2 ms higher as there is delay with sending an IPC message from parent -> content.
>
> 4) Where is the extra latency coming from? From parent process forwarding
> the event to the content process?
Yes, the vsync happens and there can be 1-2 ms delay from the parent process to the content process, depending on how busy the main thread is in the content process.
> 5) If so, then the reported additional jank is real, right?
Not in the e10s/APZ case. See the response to question 3.
Does this clarify everything for you?
Flags: needinfo?(mchang) → needinfo?(vladan.bugzilla)
Assignee | ||
Comment 12•9 years ago
|
||
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #9)
> > I'll uplift this to aurora. It doesn't need to go to beta as there is no e10s there.
>
> We're running e10s A/B experiments on the current Beta, so let's uplift to
> Beta too
This patch is kind of risky and I wouldn't want to lift all the way to beta. Will aurora be enough? That means we'll have one release that doesn't have this data.
Comment 13•9 years ago
|
||
> Does this clarify everything for you?
It does, thanks!
(In reply to Mason Chang [:mchang] from comment #12)
> This patch is kind of risky and I wouldn't want to lift all the way to beta.
> Will aurora be enough? That means we'll have one release that doesn't have
> this data.
We really need this probe for the experiment we're running on Beta (in a week), but if it's risky (wrt crashing), then we have no choice but to wait to uplift to Beta. Is that the case?
Either way, let's keep the bug open until we uplift to beta
Flags: needinfo?(vladan.bugzilla)
Assignee | ||
Comment 14•9 years ago
|
||
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #13)
> We really need this probe for the experiment we're running on Beta (in a
> week), but if it's risky (wrt crashing), then we have no choice but to wait
> to uplift to Beta. Is that the case?
>
> Either way, let's keep the bug open until we uplift to beta
Yes, it uses some extra hardware features to detect the vsync rate. I guess we can watch crash-stats for a bit and see if anything comes up. I'm also relatively uplift averse in general, but if you insist, we'll uplift.
Updated•9 years ago
|
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 15•9 years ago
|
||
Approval Request Comment
[Feature/regressing bug #]: None. This is requested by Vladan so that we can measure performance on beta of e10s vs non-e10s.
[User impact if declined]: None
[Describe test coverage new/current, TreeHerder]: Manual testing, mochitest
[Risks and why]: Low - Med - Adds a new ping to request the vsync rate from hardware during startup. Might risk a start up crash, but I don't see anything on crash-stats for 45 right now.
[String/UUID change made/needed]: None
Attachment #8685167 -
Flags: approval-mozilla-aurora?
Note.. the new code misuses the term "Rate": a rate is the frequency that something happens, and an interval is the duration between adjacent occurrences. So, a vsync rate would be "60 Hz", as in 60 frames per second, whereas a vsync interval would be "1000 / 60 ms". The code that was added here returns a TimeDuration from GetVsyncRate(), where that TimeDuration represents the vsync interval. I've seen this confusion in other bits of code too.
I'm going to rename the functions/variables that were added here (and elsewhere) in the vsync rework to use Interval instead of Rate to avoid this.
Assignee | ||
Comment 17•9 years ago
|
||
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #16)
> Note.. the new code misuses the term "Rate": a rate is the frequency that
> something happens, and an interval is the duration between adjacent
> occurrences. So, a vsync rate would be "60 Hz", as in 60 frames per second,
> whereas a vsync interval would be "1000 / 60 ms". The code that was added
> here returns a TimeDuration from GetVsyncRate(), where that TimeDuration
> represents the vsync interval. I've seen this confusion in other bits of
> code too.
>
> I'm going to rename the functions/variables that were added here (and
> elsewhere) in the vsync rework to use Interval instead of Rate to avoid this.
Thanks for finding that! Please feel free to rename it :).
I looked at this patch uplift request but will wait for a day or two before approving uplift to Aurora because Mason has expressed a possible risk of startup crash. I'd like to be certain (or gather more data) that we are not regressing stability-wise before uplifting to Aurora44.
Comment 19•9 years ago
|
||
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #2)
> > + if (contentDelay.ToMilliseconds() < 0 ){
> > + // Vsyncs are noisy and some can come at a rate quicker than
> > + // the reported hardware rate. In those cases, consider that we have 0 delay.
> > + contentDelay = TimeDuration::FromMilliseconds(0);
>
> I would actually not bother doing this, assuming telemetry can handle
> negative numbers in their histograms. It's easy enough to normalize this to
> >= 0 in the dashboard if we need to, and it might be useful to also have
> data on how noisy the vsyncs are. WDYT?
Telemetry can't handle negative values :/ Can you put this back?
Updated•9 years ago
|
Flags: needinfo?(mchang)
Assignee | ||
Comment 20•9 years ago
|
||
Flags: needinfo?(mchang)
Attachment #8686795 -
Flags: review?(vladan.bugzilla)
Assignee | ||
Comment 21•9 years ago
|
||
(In reply to Ritu Kothari (:ritu) from comment #18)
> I looked at this patch uplift request but will wait for a day or two before
> approving uplift to Aurora because Mason has expressed a possible risk of
> startup crash. I'd like to be certain (or gather more data) that we are not
> regressing stability-wise before uplifting to Aurora44.
I haven't seen any crashes in crash-stats, so this looks stable so far.
Comment 22•9 years ago
|
||
Comment on attachment 8686795 [details] [diff] [review]
Part 2: Correct for negative content delay values
Review of attachment 8686795 [details] [diff] [review]:
-----------------------------------------------------------------
Avi: What do you think of reporting frames that are too early(?) as being on time (0 delay). It seems like we should be reporting the absolute value of the delay instead, no?
Mason, can you fix up the 2 frame delay histogram definition:
- add an alert_emails field, perf-telemetry-alerts@mozilla.com is fine
- add a bug_numbers field with this bug #
- change expires_in_version to "never"
Attachment #8686795 -
Flags: review?(vladan.bugzilla) → review?(avihpit)
Assignee | ||
Comment 23•9 years ago
|
||
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #22)
> Comment on attachment 8686795 [details] [diff] [review]
> Part 2: Correct for negative content delay values
>
> Review of attachment 8686795 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> Avi: What do you think of reporting frames that are too early(?) as being on
> time (0 delay). It seems like we should be reporting the absolute value of
> the delay instead, no?
I think this would be bad to do the absolute value. For example, if a vsync comes in early, it's not actually a bad thing, but it is bad if we're very late. If a vsync came in very early (can happen on weird hardware), the user wouldn't notice anything. If it came in late, the user could see jank. Reporting the absolute value could skew the results that bad things are happening when they might not actually be. Does that make sense?
>
> Mason, can you fix up the 2 frame delay histogram definition:
> - add an alert_emails field, perf-telemetry-alerts@mozilla.com is fine
> - add a bug_numbers field with this bug #
> - change expires_in_version to "never"
Done!
Attachment #8686795 -
Attachment is obsolete: true
Attachment #8686795 -
Flags: review?(avihpit)
Attachment #8686843 -
Flags: review?(avihpit)
Comment 24•9 years ago
|
||
Looks good, thanks
Comment 25•9 years ago
|
||
Comment on attachment 8686843 [details] [diff] [review]
Part 2: Correct for negative content delay values
Review of attachment 8686843 [details] [diff] [review]:
-----------------------------------------------------------------
::: layout/base/nsRefreshDriver.cpp
@@ +377,5 @@
> TimeDuration contentDelay = (TimeStamp::Now() - mLastChildTick) - mVsyncRate;
> + if (contentDelay.ToMilliseconds() < 0 ){
> + // Vsyncs are noisy and some can come at a rate quicker than
> + // the reported hardware rate. In those cases, consider that we have 0 delay.
> + contentDelay = TimeDuration::FromMilliseconds(0);
I like this. Early should not really happen, and if it does, recording "0" sounds quite good to me.
If we end up with many "0" at the collected telemetry values, we can reconsider (maybe even a different probe).
Attachment #8686843 -
Flags: review?(avihpit) → review+
Mason, I am glad that this looks good so far stability wise. Do you need to also update the aurora patch to handle negative values? I noticed an r+ on patch2 and wondering if the nom'd aurora patch is good or needs changes.
Flags: needinfo?(mchang)
Comment 27•9 years ago
|
||
Assignee | ||
Comment 28•9 years ago
|
||
(In reply to Ritu Kothari (:ritu) from comment #26)
> Mason, I am glad that this looks good so far stability wise. Do you need to
> also update the aurora patch to handle negative values? I noticed an r+ on
> patch2 and wondering if the nom'd aurora patch is good or needs changes.
Part 2 just landed in inbound today. I'd have to update the nom'd aurora patch to include part 2. Part 2 should be fairly stable and isn't very risky. We can land both part 1 and 2 together in aurora. Will that work? Thanks!
Flags: needinfo?(mchang) → needinfo?(rkothari)
Comment 29•9 years ago
|
||
bugherder |
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED
(In reply to Mason Chang [:mchang] from comment #28)
> Part 2 just landed in inbound today. I'd have to update the nom'd aurora
> patch to include part 2. Part 2 should be fairly stable and isn't very
> risky. We can land both part 1 and 2 together in aurora. Will that work?
> Thanks!
If part2 is optional and was not originally planned for aurora uplift, we could just uplift part1. To me it seemed, part1 alone might not be sufficient. But you know the code better and I will go with your recommendation. So should we just uplift part1? OR part1 + part2? If you want both 1+2, then re-nominate the correct patches for aurora uplift when ready and I will approve. Hope that helps!
Flags: needinfo?(rkothari) → needinfo?(mchang)
Assignee | ||
Comment 31•9 years ago
|
||
Comment on attachment 8686843 [details] [diff] [review]
Part 2: Correct for negative content delay values
See comment 15.
Flags: needinfo?(mchang)
Attachment #8686843 -
Flags: approval-mozilla-aurora?
Assignee | ||
Comment 32•9 years ago
|
||
(In reply to Ritu Kothari (:ritu) from comment #30)
> (In reply to Mason Chang [:mchang] from comment #28)
>
> > Part 2 just landed in inbound today. I'd have to update the nom'd aurora
> > patch to include part 2. Part 2 should be fairly stable and isn't very
> > risky. We can land both part 1 and 2 together in aurora. Will that work?
> > Thanks!
>
> If part2 is optional and was not originally planned for aurora uplift, we
> could just uplift part1. To me it seemed, part1 alone might not be
> sufficient. But you know the code better and I will go with your
> recommendation. So should we just uplift part1? OR part1 + part2? If you
> want both 1+2, then re-nominate the correct patches for aurora uplift when
> ready and I will approve. Hope that helps!
Thanks! We should uplift both part 1 and part 2.
Comment on attachment 8685167 [details] [diff] [review]
Add telemetry probe to detect late refresh driver ticks aurora
This patch is needed in 44 (and 43 too perhaps), it has been in Nightly for over a week, seems safe to uplift to Aurora44.
Attachment #8685167 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
status-firefox44:
--- → affected
Comment on attachment 8686843 [details] [diff] [review]
Part 2: Correct for negative content delay values
This patch is needed on Aurora too as telemetry cannot handle negative values. Aurora44+.
Attachment #8686843 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 35•9 years ago
|
||
bugherder uplift |
I had to back out the uplift for build bustage:
https://treeherder.mozilla.org/logviewer.html#?job_id=1503166&repo=mozilla-aurora
https://hg.mozilla.org/releases/mozilla-aurora/rev/0c4c44a9fe9a
Flags: needinfo?(mchang)
Assignee | ||
Comment 37•9 years ago
|
||
Combined part1 and part 2 patch rebased for aurora. Can you please uplift this patch? Thanks!
Flags: needinfo?(mchang) → needinfo?(wkocher)
Attachment #8689716 -
Flags: review+
Assignee | ||
Comment 38•9 years ago
|
||
It looks like having the bug_numbers field in Histograms.js is unsupported on aurora. Was this a new feature added to Telemetry in Gecko 45? Thanks!
Flags: needinfo?(vladan.bugzilla)
Assignee | ||
Updated•9 years ago
|
Flags: needinfo?(wkocher)
Assignee | ||
Comment 39•9 years ago
|
||
Sorry, uploaded wrong patch! This is a combined patch that should work for aurora. It deletes the bug_number field from Histograms.json. Can you please uplift this patch? Thanks!
Attachment #8689716 -
Attachment is obsolete: true
Flags: needinfo?(wkocher)
Someone will, possibly won't be me, though.
Flags: needinfo?(wkocher)
Assignee | ||
Comment 41•9 years ago
|
||
Chutten from irc was nice enough to verify that bug_numbers in Histograms.json was an added feature in Gecko 45, bug 1219733.
Flags: needinfo?(vladan.bugzilla)
Assignee | ||
Comment 42•9 years ago
|
||
Comment on attachment 8689718 [details] [diff] [review]
Combined aurora patch
Combined patch that is rebased for Aurora since the original patches didn't apply. Can we please get approval for this combined patch? Thanks!
Flags: needinfo?(rkothari)
Attachment #8689718 -
Flags: approval-mozilla-aurora?
Assignee | ||
Comment 43•9 years ago
|
||
Comment on attachment 8683383 [details] [diff] [review]
Add telemetry probe to detect late refresh driver ticks
Obsolete for aurora.
Attachment #8683383 -
Attachment is obsolete: true
Assignee | ||
Comment 44•9 years ago
|
||
Comment on attachment 8685167 [details] [diff] [review]
Add telemetry probe to detect late refresh driver ticks aurora
Obsolete for aurora.
Attachment #8685167 -
Attachment is obsolete: true
Assignee | ||
Comment 45•9 years ago
|
||
Comment on attachment 8686843 [details] [diff] [review]
Part 2: Correct for negative content delay values
Obsolete for aurora.
Attachment #8686843 -
Attachment is obsolete: true
Comment 46•9 years ago
|
||
Assignee | ||
Updated•9 years ago
|
Flags: needinfo?(rkothari)
Assignee | ||
Updated•9 years ago
|
Attachment #8689718 -
Flags: approval-mozilla-aurora?
You need to log in
before you can comment on or make changes to this bug.
Description
•