gfx.status.compositor is NULL in an unexpectedly frequent number of pings
Categories
(Core :: Graphics, defect, P1)
Tracking
()
People
(Reporter: chutten, Assigned: ktaeleman)
References
Details
(Whiteboard: [telemetry:glean-rs:m11])
Attachments
(3 files)
47 bytes,
text/x-phabricator-request
|
jcristau
:
approval-mozilla-beta+
|
Details | Review |
47 bytes,
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-beta+
|
Details | Review |
550 bytes,
text/plain
|
chutten
:
data-review+
|
Details |
:jrmuizel writes in on Slack to let us know that gfx.status.compositor
is NULL
far too often. The metric, a string passed up via Geckoview Streaming Telemetry, should always be non-null in any application session that has ever loaded a geckoview.
Here's the Glean SDK metric (showing it has an application lifetime). Here's [the Telemetry Scalar]https://searchfox.org/mozilla-central/rev/8bc24752246aeac8a9aed566cf1caccf88d97d11/toolkit/components/telemetry/Scalars.yaml#3046() and the collection code (showing that it only has non-null values when set).
We haven't categorically ruled out that the collection code runs as frequently as we think it does, but given how bulletproof this code is on Firefox Desktop (gfx.compositor
is pretty darn solid) I'm inclined to believe that the error's not there.
Remembering some talks in Whistler about how geckoviews sometimes aren't even launched if the user doesn't load web content, I looked at how many pings do and do not report a compositor based on how many searches they recorded: https://sql.telemetry.mozilla.org/queries/66656/source#168991
Looks like that theory's a bust too.
Comment 1•6 years ago
|
||
Here's a query that has shows pings that have a null compositor but have metrics.timing_distribution.gfx_content_paint_time.values
https://sql.telemetry.mozilla.org/queries/66662/source
Updated•6 years ago
|
Comment 2•6 years ago
•
|
||
I've slightly modified Jeff's query to filter for start_time
, to make sure that this is not about old data being sent with new build ids (bug 1601080). Looks like this is not the case. A much simpler query is available here
Updated•6 years ago
|
Updated•6 years ago
|
Comment 3•6 years ago
|
||
After investigating this a bit more, the NULL
we're seeing might come due to the following causes:
- the Gecko engine might get initialized after a ping is generated, due to the way Fenix performs init. This means that, if we generate an
overdue
metrics ping early at startup, we might miss these value, hence the NULLs; - there seems to be a widespread adoption of force-closing apps in background. We recently added telemetry for that. This might result in producing spurious
metrics
pings, which might contain partial data (nolifetime: application
) due to the way theWorkManager
is designed.
We're currently working to address the second issue and will provide more updates as soon as possible.
Comment 4•6 years ago
|
||
Is there something we can do in the near term to get more reliable data for this metric? Maybe misusing a histogram instead of a scalar?
Reporter | ||
Comment 5•6 years ago
|
||
We could consider metrics with different lifetimes. Metrics with "lifetime: user" never reset their value, ever. For something like gfx.compositor
it does mean that it'll have a value even in situations where no compositor was ever active (and there might be other corner cases as well): it changes the "value is in too few pings" problem for a "value is in too many pings" problem.
If you'd prefer that problem over the current one, maybe adding a user-lifetime metric named "gfx.last_compositor_created" (or what-have-you) would help advertise the difference. I think bug 1602828 will end up resolving the original issue with application lifetime metrics in a way that will enable the sort of analyses you want, so it might be worth leaving the existing metrics alone pending its resolution and adding a new one to cover your analysis needs in the meantime (with a nice, short expiry date).
What do you think?
Comment 6•6 years ago
|
||
I think having a gfx.last_compositor_created is a fine short term solution. We'll pursue that for now.
Comment 8•6 years ago
|
||
(In reply to Chris H-C :chutten from comment #5)
If you'd prefer that problem over the current one, maybe adding a user-lifetime metric named "gfx.last_compositor_created" (or what-have-you) would help advertise the difference. I think bug 1602828 will end up resolving the original issue with application lifetime metrics in a way that will enable the sort of analyses you want, so it might be worth leaving the existing metrics alone pending its resolution and adding a new one to cover your analysis needs in the meantime (with a nice, short expiry date).
Please note that the NULLs are not exclusively due to bug 1602828: they are also due to 'force-closing' (which is a common thing on mobile) and due to when GV gets initialized by Fenix.
While gfx.last_compositor_created
is a good short term solution for bug 1602828, it will hide the other two problems in addition to changing the semantics a bit, as Chris outlined: this is not a blocker for your solution, just something you need to account for when looking at the data.
Updated•6 years ago
|
Comment 9•6 years ago
|
||
Untaking: the short term solution to this will be the addition of a new metric by Kris. Part of the long-term solution to this will come with bug 1602828 and bug 1602824.
Assignee | ||
Comment 10•6 years ago
|
||
- Duplicate gfx.compositor and gfx.adapter.device_id with last_seen
versions that have a 'user' lifetime to temporarily work around
telemetry null string issues.
Updated•6 years ago
|
Comment 11•6 years ago
|
||
Assignee | ||
Updated•6 years ago
|
Comment 12•6 years ago
|
||
bugherder |
Assignee | ||
Comment 13•6 years ago
|
||
:jcristau Can this change be uplifted to 72 beta so it makes it into the Fenix release with GV 72? Thanks!
Comment 14•6 years ago
|
||
Please fill in the uplift request form (click on "details" next to the attachment in bugzilla, flip "approval-mozilla-beta" flag to "?", which will bring up the form)
Comment 15•6 years ago
|
||
(In reply to Julien Cristau [:jcristau] from comment #14)
Please fill in the uplift request form (click on "details" next to the attachment in bugzilla, flip "approval-mozilla-beta" flag to "?", which will bring up the form)
The presence of that flag seems to depend on the bug's component, and this particular component (Data Platform and Tools :: Glean SDK) doesn't have it.
(Perhaps the right thing to do here would be to move the bug into a component that reflects the code the patch touched, i.e. something in Core :: Graphics).
Comment 16•6 years ago
|
||
Indeed, thanks for pointing that out.
Updated•6 years ago
|
Assignee | ||
Comment 17•6 years ago
|
||
Comment on attachment 9115827 [details]
Bug 1601091 - gfx.status.compositor is NULL in an unexpectedly frequent number of pings.
Beta/Release Uplift Approval Request
- User impact if declined: We will lack telemetry to measure the impact of webrender on devices where it is enabled
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): It duplicates already exposed telemetry with a different lifetime
- String changes made/needed:
Comment 18•6 years ago
|
||
Comment on attachment 9115827 [details]
Bug 1601091 - gfx.status.compositor is NULL in an unexpectedly frequent number of pings.
working around a telemetry issue in gv, approved for 72.0b9
Comment 19•6 years ago
|
||
bugherder uplift |
Assignee | ||
Comment 20•6 years ago
|
||
Assignee | ||
Comment 21•6 years ago
|
||
Assignee | ||
Comment 22•6 years ago
|
||
Re-opening to extend temporary telemetry
Reporter | ||
Comment 23•6 years ago
|
||
Comment 24•6 years ago
|
||
Assignee | ||
Comment 25•6 years ago
|
||
Comment on attachment 9122725 [details]
Bug 1601091 - Renewing temporary last_seen telemetry since we are still uncertain about the fix
Beta/Release Uplift Approval Request
- User impact if declined: No user impact. Not uplifting this could cause missing graphics telemetry for Fenix resulting in the graphics team being unable to monitor performance and stability.
- Is this code covered by automated tests?: Unknown
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): It only telemetry extends expiry dates in telemetry configuration files.
- String changes made/needed:
Updated•6 years ago
|
Comment 26•6 years ago
|
||
Comment on attachment 9122725 [details]
Bug 1601091 - Renewing temporary last_seen telemetry since we are still uncertain about the fix
Telemetry probe expiration extension needed for GV. Approved for 73.0b10.
Comment 27•6 years ago
|
||
bugherder uplift |
Comment 28•6 years ago
|
||
bugherder |
Description
•