Closed Bug 1693636 Opened 4 years ago Closed 8 months ago

CHECKERBOARDING_SEVERITY regression on 2021-02-02

Categories

(Core :: Graphics: WebRender, defect)

defect

Tracking

()

RESOLVED WONTFIX
Tracking Status
firefox-esr78 --- unaffected
firefox85 --- unaffected
firefox86 --- wontfix
firefox87 --- wontfix

People

(Reporter: cpeterson, Unassigned, NeedInfo)

References

(Regression)

Details

(Keywords: regression)

CHECKERBOARDING_SEVERITY regressed on 2021-02-02. Windows regressed the most, but macOS and Linux also regressed some. Windows also regressed on 2021-01-24 (86 Nightly), but neither macOS nor Linux regressed then.

Looking at older telemetry, this recent Windows regression might be a reversal of an earlier, discontinuous improvements in CHECKERBOARDING_SEVERITY in 2020:

Is this a regression from enabling the SW-WR fallback in bug 1689203?

Summary: CHECKERBOARDING_SEVERITY regression starting 2021-02-02 → CHECKERBOARDING_SEVERITY regression on 2021-02-02

(In reply to Chris Peterson [:cpeterson] from comment #1)

This could be https://hg.mozilla.org/mozilla-central/rev/d27a5a6bbd45fe79e656c4b5f47c373a1898e23b bug 1687927 which stopped us from setting a full sized display port on many element (in favour of a much smaller one). That regression was then fixed by bug 1691160 (perhaps creating full display ports in more cases than before bug 1687927 even). So we should look around 2021-02-09 when bug 1691160 landed to see if we see a movement in this metric.

Blocks: gfx-triage
Severity: -- → S3

And a better approximation with interpolation within buckets: https://sql.telemetry.mozilla.org/queries/78168/source#194377

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Looking a percentiles shows a less clear picture:
https://sql.telemetry.mozilla.org/queries/78127/source#194284

So I narrowed down the ranges a bit and came up with suggested bugs:

We had a big improvement of this here:
20201028214727
4,019.50
20201029040710
562.28
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=9b78e4ae32ae&tochange=d54d820d6a8d24a86d6e03d976280182cbd7758d
Botond Ballo — Bug 1669861 - Use the visual scroll offset consistently for DisplayPortMargins computations. r=kats
Botond Ballo — Bug 1669861 - Rename DisplayPortMargins::WithAdjustment() to FromAPZ(). r=kats

and then a regression here:
20210123215311
756.87
20210124091450
3,400.99
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3b776181c922&tochange=e65982f456435928e5ed52db9a96d12836663e7a
Bug 1682919 - Avoid creating a new DisplayPortMargins object as a side effect of querying the displayport. r=tnikkel

and perhaps another one here:
20210202095107
3,164.34
20210202214809
4,254.18
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=57bcdf857d44&tochange=5ff587e0026f2929319cd35e4bed9c76ff2db986
Bug 1687927 - Don't request a repaint for a scroll update of type NewScrollFrame that doesn't change the scroll offset. r=botond

Bug 1691160 doesn't seem to have moved this.

Timothy, do you have a story that might explain what's going on here?

Flags: needinfo?(jmuizelaar) → needinfo?(tnikkel)

Hmm, for the first improvement then regression (in 2020): bug 1682919 is fixing a regression caused by bug 1669861, so that makes some sense, however I don't think we understand fully why bug 1669861 caused bug 1682919 (and there are still similar, but rare, bugs like bug 1682919 that remain, ie bug 1693224).

As for the 2021 regression, bug 1687927 causing a regression makes sense because it prevents us from (un-intentionally) setting a full display port on every scroll frame that apz knows about. However, bug 1687927 was fixing a regression that happened earlier in 2020 (around May?), so after bug 1687927 I would have expected us to have just returned to the old normal of early 2020. Bug 1691160 not moving things means that the scrollframe that are causing the checkerboard regression were not the "first scroll frame encountered in the document".

(leaving needinfo because there is obviously more to do here)
(asking botond if he has any thoughts too)

Flags: needinfo?(botond)

(In reply to Timothy Nikkel (:tnikkel) from comment #8)

Hmm, for the first improvement then regression (in 2020): bug 1682919 is fixing a regression caused by bug 1669861, so that makes some sense

Yeah, bug 1682919 cancelling out an effect introduced by bug 1669861 makes sense, although I don't really understand why bug 1669861 would have such a large effect on this metric to begin with.

however I don't think we understand fully why bug 1669861 caused bug 1682919

I think bug 1682919 comment 8 describes a pretty direct link. Is there something else that remains unclear?

Flags: needinfo?(botond)

(In reply to Botond Ballo [:botond] from comment #9)

however I don't think we understand fully why bug 1669861 caused bug 1682919

I think bug 1682919 comment 8 describes a pretty direct link. Is there something else that remains unclear?

Oh, I thought you said somewhere that you didn't fully understand what was going on there? Maybe that was an earlier comment.

(In reply to Timothy Nikkel (:tnikkel) from comment #10)

(In reply to Botond Ballo [:botond] from comment #9)

however I don't think we understand fully why bug 1669861 caused bug 1682919

I think bug 1682919 comment 8 describes a pretty direct link. Is there something else that remains unclear?

Oh, I thought you said somewhere that you didn't fully understand what was going on there? Maybe that was an earlier comment.

Yeah, that was bug 1682919 comment 5.

(In reply to Timothy Nikkel (:tnikkel) from comment #12)

Okay, so understanding those questions seems like the most promising lead we have for understanding that checkboard regression.

To summarize, the cause of bug 1682919 was that we did things in the following sequence in ScrollToImpl():

  1. Update the layout scroll offset
  2. Query the composed displayport for the purpose of the paint-skip check.
  3. Update the visual scroll offset.

In case where the displayport was suppressed (and only in that case), the query in step (2) made a call to DisplayPortMargins::Empty() which observed and recorded the difference between the visual and layout offsets (for the purpose of applying it to the displayport later at painting time). As it happened to observe the offsets between (1) and (3), it recorded a bogus difference that doesn't actually reflect e.g. the visual viewport being scrolled away from the layout viewport origin.

The reason it's really strange is that this bug caused a decrease in checkerboarding severity is that:

  1. It only affected displayport-suppressed scenarios (which I would expect play a relatively small role in the overall metric)
  2. In those scenarios, it caused checkerboarding because the displayport being suppressed meant there were no margins to absorb the incorrectly offset displayport. (If it had affected non-suppressed scenarios, I could maybe envision that the bogus adjustment happened to result in better-predicted displayports and thus less checkerboarding, but even then an effect size this large is surprising.)

(In reply to Jim Mathies [:jimm] from comment #5)

Wider graph for the checker boarding -
https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=Median&cumulative=0&end_date=2021-02-17&include_spill=0&keys=!__none__!__none__&max_channel_version=nightly%252F88&measure=CHECKERBOARD_SEVERITY&min_channel_version=nightly%252F80&os=Windows_NT&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2021-01-26&trim=1&use_submission_date=0

(In reply to Jeff Muizelaar [:jrmuizel] from comment #6)

Looking a percentiles shows a less clear picture:
https://sql.telemetry.mozilla.org/queries/78127/source#194284

Comparing these two graphs they seem to conflict on the 25th percentile line. (You have to change the telemetry.mozilla.org graph to show the 25th percentile line). On the telemetry.mozilla.org graph for most of Nov, Dec and Jan the 25th percentile line is at it's lowest value and pretty stable. The sql link has the 25th percentile line increasing at the start of November, maybe going a bit lower in December, and then increasing again in January.

Are these supposed to align? I haven't look at the sql query, so I'm not sure what it's measuring exactly.

Flags: needinfo?(jmuizelaar)

I'm still confused by Jeff's graphs but if I just look at the tmo links and extend back further

https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=Median&cumulative=0&end_date=2021-02-17&include_spill=0&keys=!__none__!__none__&max_channel_version=nightly%252F88&measure=CHECKERBOARD_SEVERITY&min_channel_version=nightly%252F76&os=Windows_NT&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2021-01-26&trim=1&use_submission_date=0

I get a picture that mostly makes sense. Looking at that graph we start May 2020 at our highest position of about 5k. Then bug 1627012 (which caused us to set a full display port on any scroll frame that apz knows about) lands on 2020-05-28 and we see a sharp drop to about 3.7k. Then on 2020-10-29 bug 1669861 (should just be a cleanup, we don't know why) lands and we get a sharp drop to 800. Then on 2021-01-24 bug 1682919 lands, which fixes a regression from bug 1669861 and we get a sharp rise to 1.8k, partially but not fully reverting the corresponding rise from bug 1669861. Then on 2021-02-02 bug 1687927 lands which fixes the full display port setting regression from bug 1627012 and we set a sharp rise back to 5k.

So we are just back where we started after two regressions and fixes. Although there is some uncertainty why one of them moved the numbers.

Jeff's graph is still a bit worrying because it shows us still getting worse here that the tmo data does not. Other than understanding that I think the only other potential action to be taken here is to consider whether we want to set more full display ports in order to improve this metric or we are happy with the current trade off.

Flags: needinfo?(tnikkel)
No longer blocks: gfx-triage
Flags: needinfo?(jmuizelaar)

Restoring needinfo I suspect was accidentally cleared while triaging.

Flags: needinfo?(jmuizelaar)
No longer blocks: sw-wr-perf

In investigating bug 1713360, a different CHECKERBOARDING_SEVERITY regression we've learned some important things about the CHECKERBOARDING_SEVERITY metrics itself. From talking with botond and hiro their understanding of CHECKERBOARDING_SEVERITY is the following. The way the metric is computed is that everytime we experience checkerboarding we calculate the number of ms and pixels affected, multiply those and then sqrt root it. When the checkerboard event is over we submit a CHECKERBOARDING_SEVERITY event to telemetry.

So just looking at the median (or mean, etc) is misleading, you need to at the very least look at the number of events submitted (since we could have an increase in low severity events bringing the median/mean down but actually we are doing worse because we have more events). (Ideally looking at the distribution too.)

So with this knowledge we re-look at the graph in comment 15.

We start May 2020 at our highest position of about 5k. Then bug 1627012 (which caused us to set a full display port on any scroll frame that apz knows about) lands on 2020-05-28 and we see a sharp drop in the median to about 3.7k, the number of CS events stays stable at around 250k. This makes sense, setting a full displayport should decrease checkerboard severity.

Then on 2020-10-29 bug 1669861 (should just be a cleanup, we don't know why) lands and we get a sharp drop to 800 in the median value. The number of CS events jumps to around 800k.

Then on 2021-01-24 bug 1682919 lands, which fixes a regression from bug 1669861 and we get a sharp rise to 1.8k in the median value, partially but not fully reverting the corresponding rise from bug 1669861. The number of CS events goes down to about 500k.

Then on 2021-02-02 bug 1687927 lands which fixes the full display port setting regression from bug 1627012 and we set a sharp rise back to 5k in the median value. The number of CS events stays stable. This makes sense. No longer setting full display ports should increase checkerboard severity.

So bug 1669861 seems to have introduced a large number of checkerboarding events of low severity. Bug 1682919 removed some of these extra CS events, but not all of them.

See Also: → 1715961
Has Regression Range: --- → yes
Status: REOPENED → RESOLVED
Closed: 4 years ago8 months ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.