Closed Bug 1601091 Opened 2 years ago Closed 2 years ago

gfx.status.compositor is NULL in an unexpectedly frequent number of pings

Categories

(Core :: Graphics, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla74
Tracking Status
firefox72 --- wontfix
firefox73 --- fixed
firefox74 --- fixed

People

(Reporter: chutten, Assigned: ktaeleman)

References

Details

(Whiteboard: [telemetry:glean-rs:m11])

Attachments

(3 files)

:jrmuizel writes in on Slack to let us know that gfx.status.compositor is NULL far too often. The metric, a string passed up via Geckoview Streaming Telemetry, should always be non-null in any application session that has ever loaded a geckoview.

Here's the Glean SDK metric (showing it has an application lifetime). Here's [the Telemetry Scalar]https://searchfox.org/mozilla-central/rev/8bc24752246aeac8a9aed566cf1caccf88d97d11/toolkit/components/telemetry/Scalars.yaml#3046() and the collection code (showing that it only has non-null values when set).

We haven't categorically ruled out that the collection code runs as frequently as we think it does, but given how bulletproof this code is on Firefox Desktop (gfx.compositor is pretty darn solid) I'm inclined to believe that the error's not there.

Remembering some talks in Whistler about how geckoviews sometimes aren't even launched if the user doesn't load web content, I looked at how many pings do and do not report a compositor based on how many searches they recorded: https://sql.telemetry.mozilla.org/queries/66656/source#168991

Looks like that theory's a bust too.

Here's a query that has shows pings that have a null compositor but have metrics.timing_distribution.gfx_content_paint_time.values https://sql.telemetry.mozilla.org/queries/66662/source

Priority: -- → P3
Whiteboard: [telemetry:glean-rs:m?]

I've slightly modified Jeff's query to filter for start_time, to make sure that this is not about old data being sent with new build ids (bug 1601080). Looks like this is not the case. A much simpler query is available here

Assignee: nobody → alessio.placitelli
Whiteboard: [telemetry:glean-rs:m?] → [telemetry:glean-rs:m11]
Priority: P3 → P1

After investigating this a bit more, the NULL we're seeing might come due to the following causes:

  • the Gecko engine might get initialized after a ping is generated, due to the way Fenix performs init. This means that, if we generate an overdue metrics ping early at startup, we might miss these value, hence the NULLs;
  • there seems to be a widespread adoption of force-closing apps in background. We recently added telemetry for that. This might result in producing spurious metrics pings, which might contain partial data (no lifetime: application) due to the way the WorkManager is designed.

We're currently working to address the second issue and will provide more updates as soon as possible.

Depends on: 1602828

Is there something we can do in the near term to get more reliable data for this metric? Maybe misusing a histogram instead of a scalar?

Flags: needinfo?(alessio.placitelli)

We could consider metrics with different lifetimes. Metrics with "lifetime: user" never reset their value, ever. For something like gfx.compositor it does mean that it'll have a value even in situations where no compositor was ever active (and there might be other corner cases as well): it changes the "value is in too few pings" problem for a "value is in too many pings" problem.

If you'd prefer that problem over the current one, maybe adding a user-lifetime metric named "gfx.last_compositor_created" (or what-have-you) would help advertise the difference. I think bug 1602828 will end up resolving the original issue with application lifetime metrics in a way that will enable the sort of analyses you want, so it might be worth leaving the existing metrics alone pending its resolution and adding a new one to cover your analysis needs in the meantime (with a nice, short expiry date).

What do you think?

I think having a gfx.last_compositor_created is a fine short term solution. We'll pursue that for now.

I'll take a look at adding this.

Flags: needinfo?(ktaeleman)

(In reply to Chris H-C :chutten from comment #5)

If you'd prefer that problem over the current one, maybe adding a user-lifetime metric named "gfx.last_compositor_created" (or what-have-you) would help advertise the difference. I think bug 1602828 will end up resolving the original issue with application lifetime metrics in a way that will enable the sort of analyses you want, so it might be worth leaving the existing metrics alone pending its resolution and adding a new one to cover your analysis needs in the meantime (with a nice, short expiry date).

Please note that the NULLs are not exclusively due to bug 1602828: they are also due to 'force-closing' (which is a common thing on mobile) and due to when GV gets initialized by Fenix.

While gfx.last_compositor_created is a good short term solution for bug 1602828, it will hide the other two problems in addition to changing the semantics a bit, as Chris outlined: this is not a blocker for your solution, just something you need to account for when looking at the data.

Flags: needinfo?(alessio.placitelli)
Assignee: alessio.placitelli → nobody

Untaking: the short term solution to this will be the addition of a new metric by Kris. Part of the long-term solution to this will come with bug 1602828 and bug 1602824.

  • Duplicate gfx.compositor and gfx.adapter.device_id with last_seen
    versions that have a 'user' lifetime to temporarily work around
    telemetry null string issues.
Assignee: nobody → ktaeleman
Status: NEW → ASSIGNED
Pushed by ktaeleman@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/559f8162e4a4
gfx.status.compositor is NULL in an unexpectedly frequent number of pings. r=chutten
Flags: needinfo?(ktaeleman)
Blocks: 1556534
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED

:jcristau Can this change be uplifted to 72 beta so it makes it into the Fenix release with GV 72? Thanks!

Flags: needinfo?(jcristau)

Please fill in the uplift request form (click on "details" next to the attachment in bugzilla, flip "approval-mozilla-beta" flag to "?", which will bring up the form)

Flags: needinfo?(jcristau)

(In reply to Julien Cristau [:jcristau] from comment #14)

Please fill in the uplift request form (click on "details" next to the attachment in bugzilla, flip "approval-mozilla-beta" flag to "?", which will bring up the form)

The presence of that flag seems to depend on the bug's component, and this particular component (Data Platform and Tools :: Glean SDK) doesn't have it.

(Perhaps the right thing to do here would be to move the bug into a component that reflects the code the patch touched, i.e. something in Core :: Graphics).

Indeed, thanks for pointing that out.

Component: Glean: SDK → Graphics
Product: Data Platform and Tools → Core
Target Milestone: --- → mozilla73

Comment on attachment 9115827 [details]
Bug 1601091 - gfx.status.compositor is NULL in an unexpectedly frequent number of pings.

Beta/Release Uplift Approval Request

  • User impact if declined: We will lack telemetry to measure the impact of webrender on devices where it is enabled
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): It duplicates already exposed telemetry with a different lifetime
  • String changes made/needed:
Attachment #9115827 - Flags: approval-mozilla-beta?

Comment on attachment 9115827 [details]
Bug 1601091 - gfx.status.compositor is NULL in an unexpectedly frequent number of pings.

working around a telemetry issue in gv, approved for 72.0b9

Attachment #9115827 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
See Also: → 1609205
Attached file Data collection review
Attachment #9122802 - Flags: data-review?(chutten)

Re-opening to extend temporary telemetry

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment on attachment 9122802 [details]
Data collection review

    Is the provided Data Collection Review complete, correct, and data-review+ by a Data Steward?

Yes.

    Is the data collection covered by the existing Firefox Privacy Notice?

Yes.

---
Result: datareview+
Attachment #9122802 - Flags: data-review?(chutten) → data-review+
Pushed by ktaeleman@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d88ce16b2079
Renewing temporary last_seen telemetry since we are still uncertain about the fix r=chutten

Comment on attachment 9122725 [details]
Bug 1601091 - Renewing temporary last_seen telemetry since we are still uncertain about the fix

Beta/Release Uplift Approval Request

  • User impact if declined: No user impact. Not uplifting this could cause missing graphics telemetry for Fenix resulting in the graphics team being unable to monitor performance and stability.
  • Is this code covered by automated tests?: Unknown
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): It only telemetry extends expiry dates in telemetry configuration files.
  • String changes made/needed:
Attachment #9122725 - Flags: approval-mozilla-beta?
Status: REOPENED → NEW
Target Milestone: mozilla73 → ---

Comment on attachment 9122725 [details]
Bug 1601091 - Renewing temporary last_seen telemetry since we are still uncertain about the fix

Telemetry probe expiration extension needed for GV. Approved for 73.0b10.

Attachment #9122725 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Status: NEW → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla74
You need to log in before you can comment on or make changes to this bug.