Closed Bug 1609205 Opened 3 months ago Closed 2 months ago

gfx.status.compositor_last_seen is NULL in an unexpectedly frequent number of pings

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: chutten, Unassigned)

References

Details

(Whiteboard: [telemetry:glean-rs:m?])

Once again the compositor data doesn't look right: https://sql.telemetry.mozilla.org/queries/67229/source#170242

A couple of things jumped out:

  • we should filter to only clients with an updated build
  • some might be pings from users who haven't ever started a compositor since the metric was introduced
  • might be interesting to see if there are clients who have a value and then "lose" it sometime along their seq pings. If clients ever go from having a value to not having it, that'd be a clear and obvious sign of a problem.

A preliminary quick count of all clients seen since 2019-12-20 (when the patch adding the metric was merged): https://sql.telemetry.mozilla.org/queries/67590/source

About a third of all possible clients that could have told us about a compositor have never done so. That's weirdly high to me. In addition, most of them had told us they'd painted content sometime along the way... so they should've had a compositor to tell us about.

Quick and dirty dimensions checks on device manufacturer, android sdk, app build, geo.country come up with nothing: the proportion of affected clients (ones which haven't ever sent us a compositor_last_seen but have at least once sent us a gfx_content_paint_time) remains between about 20-30% for each dimension value.

Priority: -- → P1
Whiteboard: [telemetry:glean-rs:m?]

I dug in a little more and discovered that we've been looking at gfx_content_paint_time a little bit wrong. A better condition for checking "has content been painted" is gfx_content_paint_time IS NOT NULL AND gfx_content_paint_time > 0. Narrowing the focus to nightlies since 2020-01-08 which have the fixes in bug 1604862), I see that over 20% have at least one ping where Fenix gfx painted, but didn't see a compositor: https://sql.telemetry.mozilla.org/queries/67608/source

I'm calling this condition of reporting paint times but not reporting a gfx_status_compositor_last_seen as "weird".

So this is over 20% of weird clients from a little over 6% "weird" pings. Maybe this is a temporary condition that clients pass through?


Assuming I've not failed to write this analysis properly, it looks as though only 5 times in pings since 2020-01-08 (and in builds built since 2020-01-08) have ever stopped telling us what compositor they saw, if they had previously told us about one: https://sql.telemetry.mozilla.org/queries/67616/source

This suggests that the clients and pings telling us they don't know of any compositors are really sticking by their stories. This also suggests that "user"-lifetime metrics are working just fine as far as their "never cleared" behaviour is concerned.

:mdboom and I had a conversation about where to go from here. Some notes:

  • Still not clear at this juncture if this is "weird Fenix/Glean/Gecko behaviour that just needs to be understood" or "there is actually something wrong that needs to be fixed"
  • Really happy that "user"-lifetime metrics appear to be cleared of wrongdoing (only 5 pings out of 24k)
  • Less happy that over 1% of pings are showing problems with "application" lifetimes (350 pings out of 24k)
  • Theory: That the Glean SDK's "pre-init buffer" of 500 accumulation events is being overrun, meaning that gecko datapoints are being dropped on the floor before Fenix inits the Glean SDK.
    • Explains weirdness with both compositor metrics and also why it's not a universal fault
    • Not likely unless Fenix has delayed Glean SDK init to be more than 5s after Gecko starts up
      • Could be slower/faster hardware related.
  • Actions:
    • Instrument the Glean SDK's "pre-init" buffer to see how big it is and how many metric events try to fit into it
    • Instrument Glean SDK init and engine-gecko init to see which order they come in and how far apart
    • File a bug to audit which log messages in the Glean SDK should be turned into data collections
    • Check if there's a device model correlation

(In reply to Chris H-C :chutten from comment #4)

  • Theory: That the Glean SDK's "pre-init buffer" of 500 accumulation events is being overrun, meaning that gecko datapoints are being dropped on the floor before Fenix inits the Glean SDK.

Note: the queue has a maximum size of 100, not 500.

While talking to Sebastian for the Fennec-Fenix data migration, I independently came around to this very same theory.
This is further corroborated by the fact that the perf team saw this, which is why I was investigating :)

Depends on: 1609482

As of Fenix nightly build 2020-01-26 18:01, the issue with the compositor_last_seen metric is resolved, due to a fix in the startup Glean initialization in Fenix.

Ultimately orthogonal, but related to this investigation: there is some work do to around documentation of the application lifetime (bug 1612280), experimenting with the timeout value for EXTRACT (bug 1612282) and experimenting with timers and lifecycle events as a way to flushing the EXTRACT buffer (bug 1612283).

Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED

For reference, this is the fix referred to in the previous comment.

You need to log in before you can comment on or make changes to this bug.