Closed Bug 1617179 Opened 2 years ago Closed 2 years ago

CHECKERBOARD_DURATION reports times longer than 10 minutes

Categories

(Core :: Panning and Zooming, defect, P3)

All
Android
defect

Tracking

()

RESOLVED FIXED
mozilla76
Tracking Status
firefox76 --- fixed

People

(Reporter: Dexter, Assigned: kats)

References

Details

Attachments

(6 files)

While looking at our Glean SDK error stream, we found that the CHECKERBOARD_DURATION histogram, being reported through GeckoView->Glean in Fenix as gfx.checkerboard.duration, is reporting loading times longer than 10 minutes.

This could be either a bug in the GeckoView code measuring that on Android, or we're really measuring > 10 minutes time.

See Also: → 1591153

I'm not sure how to operate that SQL query, but questions I have:

  • What percentage of values are over the 10 minutes threshold?
  • What percentage of values are over the 1 minute threshold? Is it comparable to the numbers we get on desktop which shows a tail of ~1.63% data points where it's greater than ~1 minute?
  • Can you provide more details on the distribution of these big values? I'm not sure if they are getting grouped into buckets like they do with desktop telemetry or not, but either way getting a bit more detail on the distribution might give clues as to what's going on.
Flags: needinfo?(alessio.placitelli)

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #1)

I'm not sure how to operate that SQL query, but questions I have:

  • What percentage of values are over the 10 minutes threshold?

This was reported 546 times and affects 6% of the clients.

  • What percentage of values are over the 1 minute threshold? Is it comparable to the numbers we get on desktop which shows a tail of ~1.63% data points where it's greater than ~1 minute?
  • Can you provide more details on the distribution of these big values? I'm not sure if they are getting grouped into buckets like they do with desktop telemetry or not, but either way getting a bit more detail on the distribution might give clues as to what's going on.

I'm afraid I don't have that data at hand. I put together a sample query with CDF/PDF to help you with your investigation here.

Please let us know if anything is needed of us!

Flags: needinfo?(alessio.placitelli)

I spent some time looking at this and couldn't make much sense of it. about:telemetry also doesn't work on Fenix, so I can't try and catch the problem in action. For now this is going into the P3 bucket as it is unactionable.

OS: Unspecified → Android
Priority: -- → P3
Hardware: Unspecified → All

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #3)

I spent some time looking at this and couldn't make much sense of it. about:telemetry also doesn't work on Fenix, so I can't try and catch the problem in action. For now this is going into the P3 bucket as it is unactionable.

Fenix supports the Glean SDK facilities for debugging telemetry, including dumping to logcat and the debug view.

botond: Should this be retriaged in light of the latest comment, that it is actually visible and actionable, using Glean's debug facilities?

Flags: needinfo?(botond)

Kats, does comment 4 unblock further investigation of this?

Flags: needinfo?(botond) → needinfo?(kats)

Sure, I can look into it further. Not convinced that it needs upgrading from P3 yet though, unless we can demonstrate that it's actually a user-visible product bug.

Assignee: nobody → kats
Flags: needinfo?(kats)

Sounds good, thanks!

So I discovered while investigating bug 1617427 that a scenario similar to the one in that bug gets erroneously detected as a checkerboard. This will happen on pretty much any page where there's a really big nested scrollframe, such as github code listings. The check for whether or not it's checkerboarding doesn't account for ancestor scroll containers clipping the "checkerboarded" portion out of view. The test pages I've constructed so far all produce "infinite" checkerboarding events - they never end, so they never submit to telemetry. But I could probably construct a case where the checkerboard event can terminate based on your scroll position, and in that case I should be able to create "checkerboard" events of arbitrary length. I'll try to do that first, then see what's the best way to fix it.

Note that this may not actually be the thing that's generating the data we're seeing in glean, but it's certainly a possibility. After fixing this one we'll have to see if the data reported in glean changes or not, and if not, look for additional defects in the code that might be causing it.

Attached file Example page

I was able to construct a page that reproduces the problem (when loaded in GVE in the android emulator). The specific values in the page may vary per platform and depend on screen size and so on, but it at least confirms my theory that such a case is possible and probably not too hard to hit in the code. It would certainly explain abnormally long checkerboard times reported in telemetry.

I should be able to fix it relatively easily using the patches from bug 1617427, but there's a potential impact on performance since it would be doing a relatively expensive check per composite. I'll take a closer look to see if this can be optimized further.

I had written some patches but I ran afoul of the lock ordering in my design. I'll have to rethink it a bit.

https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=f6da9140be04a5e2d8a5330b20126eb9fa9af6c1

I'm a curious one, so sorry about me jumping in :-) I'm trying to understand comment 9 and comment 10 a bit better: do they suggest that there was a bug in some gfx feature, a bug in the way things were measured or both? (again, for my own curiosity)

Flags: needinfo?(kats)

The case I found was a bug in the gfx measurement code. So it's not a user-visible bug but we are measuring checkerboarding incorrectly in this case.

Flags: needinfo?(kats)

The operations of recording checkerboard events and advancing the animations
across all APZs are conceptually linked. This patch extracts a helper that
does these operations in a single function, that is then called from both
the WR and non-WR codepaths. This will allow us to expand this code a bit
without having to duplicate it in multiple places.

The goal of this patch is to enhance GetCheckerboardMagnitude so that it
correctly accounts for clipping by ancestor APZC instances, the same way
IsCurrentlyCheckerboarding does. However, IsCurrentlyCheckerboarding uses
the mParent parent pointer when computing the clipped composition bounds,
which requires the tree lock. GetCheckerboardMagnitude is not allowed to
acquire the tree lock as it runs on the sampler thread.

This patch therefore takes another approach: the APZCTreeManager code
precomputes the clipped composition bounds for each APZC (which it can do
relatively efficiently and holding just the map lock), and then provides
that rect to the individual APZCs when they invoke GetCheckerboardMagnitude.

The additional changes to GetCheckerboardMagnitude are copied from the
IsCurrentlyCheckerboarding codepath, and the next patch will remove the latter
function entirely since GetCheckerboardMagnitude will do everything that is
needed for IsCurrentlyCheckerboarding.

Depends on D67375

IsCurrentlyCheckerboarding can be removed now, as it does the same underlying
computation as "GetCheckerboardMagnitude(...) > 0". The only difference is that
the caller now needs to compute the clipped composition bounds, which is
easy enough to do at the one call site.

Depends on D67376

Attachment #9134221 - Attachment description: Bug 1617179 - Add a map to access an APZC's parent on the sampler thread. r?botond → Bug 1617179 - Update mApzcMap to allow access to an APZC's parent on the sampler thread. r?botond
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c41e90684dc9
Extract a helper method across WR and non-WR codepaths. r=botond
https://hg.mozilla.org/integration/autoland/rev/10df53f9927f
Update mApzcMap to allow access to an APZC's parent on the sampler thread. r=botond
https://hg.mozilla.org/integration/autoland/rev/014e0b537c11
Precompute clipped composition bounds and fix GetCheckerboardMagnitude. r=botond
https://hg.mozilla.org/integration/autoland/rev/1f0cdf3fd531
Remove IsCurrentlyCheckerboarding. r=botond
https://hg.mozilla.org/integration/autoland/rev/d1990005fc98
Add a MOZ_LOG for checkerboard events. r=botond
No longer blocks: 1617175
Duplicate of this bug: 1617175
Regressions: CVE-2020-15675
You need to log in before you can comment on or make changes to this bug.