gfx.status.compositor is NULL in an unexpectedly frequent number of pings
Categories
(Core :: Graphics, defect, P1)
Tracking
()
People
(Reporter: chutten, Assigned: ktaeleman)
References
Details
(Whiteboard: [telemetry:glean-rs:m11])
Attachments
(3 files)
47 bytes,
text/x-phabricator-request
|
jcristau
:
approval-mozilla-beta+
|
Details | Review |
47 bytes,
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-beta+
|
Details | Review |
550 bytes,
text/plain
|
chutten
:
data-review+
|
Details |
:jrmuizel writes in on Slack to let us know that gfx.status.compositor
is NULL
far too often. The metric, a string passed up via Geckoview Streaming Telemetry, should always be non-null in any application session that has ever loaded a geckoview.
Here's the Glean SDK metric (showing it has an application lifetime). Here's [the Telemetry Scalar]https://searchfox.org/mozilla-central/rev/8bc24752246aeac8a9aed566cf1caccf88d97d11/toolkit/components/telemetry/Scalars.yaml#3046() and the collection code (showing that it only has non-null values when set).
We haven't categorically ruled out that the collection code runs as frequently as we think it does, but given how bulletproof this code is on Firefox Desktop (gfx.compositor
is pretty darn solid) I'm inclined to believe that the error's not there.
Remembering some talks in Whistler about how geckoviews sometimes aren't even launched if the user doesn't load web content, I looked at how many pings do and do not report a compositor based on how many searches they recorded: https://sql.telemetry.mozilla.org/queries/66656/source#168991
Looks like that theory's a bust too.
Comment 1•1 year ago
|
||
Here's a query that has shows pings that have a null compositor but have metrics.timing_distribution.gfx_content_paint_time.values
https://sql.telemetry.mozilla.org/queries/66662/source
Updated•1 year ago
|
Comment 2•1 year ago
•
|
||
I've slightly modified Jeff's query to filter for start_time
, to make sure that this is not about old data being sent with new build ids (bug 1601080). Looks like this is not the case. A much simpler query is available here
Updated•1 year ago
|
Updated•1 year ago
|
Comment 3•1 year ago
|
||
After investigating this a bit more, the NULL
we're seeing might come due to the following causes:
- the Gecko engine might get initialized after a ping is generated, due to the way Fenix performs init. This means that, if we generate an
overdue
metrics ping early at startup, we might miss these value, hence the NULLs; - there seems to be a widespread adoption of force-closing apps in background. We recently added telemetry for that. This might result in producing spurious
metrics
pings, which might contain partial data (nolifetime: application
) due to the way theWorkManager
is designed.
We're currently working to address the second issue and will provide more updates as soon as possible.
Comment 4•1 year ago
|
||
Is there something we can do in the near term to get more reliable data for this metric? Maybe misusing a histogram instead of a scalar?
Reporter | ||
Comment 5•1 year ago
|
||
We could consider metrics with different lifetimes. Metrics with "lifetime: user" never reset their value, ever. For something like gfx.compositor
it does mean that it'll have a value even in situations where no compositor was ever active (and there might be other corner cases as well): it changes the "value is in too few pings" problem for a "value is in too many pings" problem.
If you'd prefer that problem over the current one, maybe adding a user-lifetime metric named "gfx.last_compositor_created" (or what-have-you) would help advertise the difference. I think bug 1602828 will end up resolving the original issue with application lifetime metrics in a way that will enable the sort of analyses you want, so it might be worth leaving the existing metrics alone pending its resolution and adding a new one to cover your analysis needs in the meantime (with a nice, short expiry date).
What do you think?
Comment 6•1 year ago
|
||
I think having a gfx.last_compositor_created is a fine short term solution. We'll pursue that for now.
Comment 8•1 year ago
|
||
(In reply to Chris H-C :chutten from comment #5)
If you'd prefer that problem over the current one, maybe adding a user-lifetime metric named "gfx.last_compositor_created" (or what-have-you) would help advertise the difference. I think bug 1602828 will end up resolving the original issue with application lifetime metrics in a way that will enable the sort of analyses you want, so it might be worth leaving the existing metrics alone pending its resolution and adding a new one to cover your analysis needs in the meantime (with a nice, short expiry date).
Please note that the NULLs are not exclusively due to bug 1602828: they are also due to 'force-closing' (which is a common thing on mobile) and due to when GV gets initialized by Fenix.
While gfx.last_compositor_created
is a good short term solution for bug 1602828, it will hide the other two problems in addition to changing the semantics a bit, as Chris outlined: this is not a blocker for your solution, just something you need to account for when looking at the data.
Updated•1 year ago
|
Comment 9•1 year ago
|
||
Untaking: the short term solution to this will be the addition of a new metric by Kris. Part of the long-term solution to this will come with bug 1602828 and bug 1602824.
Assignee | ||
Comment 10•1 year ago
|
||
- Duplicate gfx.compositor and gfx.adapter.device_id with last_seen
versions that have a 'user' lifetime to temporarily work around
telemetry null string issues.
Updated•1 year ago
|
Comment 11•1 year ago
|
||
Pushed by ktaeleman@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/559f8162e4a4 gfx.status.compositor is NULL in an unexpectedly frequent number of pings. r=chutten
Assignee | ||
Updated•1 year ago
|
Comment 12•1 year ago
|
||
bugherder |
Assignee | ||
Comment 13•1 year ago
|
||
:jcristau Can this change be uplifted to 72 beta so it makes it into the Fenix release with GV 72? Thanks!
Comment 14•1 year ago
|
||
Please fill in the uplift request form (click on "details" next to the attachment in bugzilla, flip "approval-mozilla-beta" flag to "?", which will bring up the form)
Comment 15•1 year ago
|
||
(In reply to Julien Cristau [:jcristau] from comment #14)
Please fill in the uplift request form (click on "details" next to the attachment in bugzilla, flip "approval-mozilla-beta" flag to "?", which will bring up the form)
The presence of that flag seems to depend on the bug's component, and this particular component (Data Platform and Tools :: Glean SDK) doesn't have it.
(Perhaps the right thing to do here would be to move the bug into a component that reflects the code the patch touched, i.e. something in Core :: Graphics).
Comment 16•1 year ago
|
||
Indeed, thanks for pointing that out.
Updated•1 year ago
|
Assignee | ||
Comment 17•1 year ago
|
||
Comment on attachment 9115827 [details]
Bug 1601091 - gfx.status.compositor is NULL in an unexpectedly frequent number of pings.
Beta/Release Uplift Approval Request
- User impact if declined: We will lack telemetry to measure the impact of webrender on devices where it is enabled
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): It duplicates already exposed telemetry with a different lifetime
- String changes made/needed:
Comment 18•1 year ago
|
||
Comment on attachment 9115827 [details]
Bug 1601091 - gfx.status.compositor is NULL in an unexpectedly frequent number of pings.
working around a telemetry issue in gv, approved for 72.0b9
Comment 19•1 year ago
|
||
bugherderuplift |
Assignee | ||
Comment 20•1 year ago
|
||
Assignee | ||
Comment 21•1 year ago
|
||
Assignee | ||
Comment 22•1 year ago
|
||
Re-opening to extend temporary telemetry
Reporter | ||
Comment 23•1 year ago
|
||
Comment on attachment 9122802 [details]
Data collection review
Is the provided Data Collection Review complete, correct, and data-review+ by a Data Steward?
Yes.
Is the data collection covered by the existing Firefox Privacy Notice?
Yes.
---
Result: datareview+
Comment 24•1 year ago
|
||
Pushed by ktaeleman@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d88ce16b2079 Renewing temporary last_seen telemetry since we are still uncertain about the fix r=chutten
Assignee | ||
Comment 25•1 year ago
|
||
Comment on attachment 9122725 [details]
Bug 1601091 - Renewing temporary last_seen telemetry since we are still uncertain about the fix
Beta/Release Uplift Approval Request
- User impact if declined: No user impact. Not uplifting this could cause missing graphics telemetry for Fenix resulting in the graphics team being unable to monitor performance and stability.
- Is this code covered by automated tests?: Unknown
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): It only telemetry extends expiry dates in telemetry configuration files.
- String changes made/needed:
Updated•1 year ago
|
Comment 26•1 year ago
|
||
Comment on attachment 9122725 [details]
Bug 1601091 - Renewing temporary last_seen telemetry since we are still uncertain about the fix
Telemetry probe expiration extension needed for GV. Approved for 73.0b10.
Comment 27•1 year ago
|
||
bugherderuplift |
Comment 28•1 year ago
|
||
bugherder |
Description
•