Closed Bug 1704842 Opened 4 years ago Closed 3 years ago

Add a way to determine whether gfx compositor telemetry for geckoview is valid

Categories

(Core :: Graphics: WebRender, task, P3)

task

Tracking

()

RESOLVED FIXED
90 Branch
Tracking Status
firefox89 --- fixed
firefox90 --- fixed

People

(Reporter: jnicol, Assigned: jnicol)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

We've had an ongoing issue of seeing a lot of unexpected values for the gfx_status_compositor and gfx_feature_webrender telemetry probes. Either empty values, or possibly stale (eg reporting "opengl" when we'd expect the device to now be running webrender).

We think this might be because Fenix has been opened and geckoview has been initialized, but no tab has been opened therefore no compositor has been initialized so these metrics have not been updated.

Perhaps we can initialize these metrics to a special uninitialized value earlier during geckoview startup, so that we can determine whether this is indeed the cause. Or maybe adding a metric to indicate whether graphics has been initialized yet on this app run.

Jim and I discussed this further on element, and realized that we probably don't want to overwrite the existing metrics on startup, as that will lose useful information. And an extra flag to check for initialization is redundant, as gfx_feature_webrender can already be used to establish whether a compositor has been initialized on the most recent run, due to its application lifetime.

What we really want to know is whether the gfx_status_compositor metric is outdated or not. So we should add a new metric with a user lifetime (to match gfx_status_compositor). We will just report the geckoview version with this metric, the same as the existing geckoview_version metric, except it only gets set when a compositor is initialized, whereas geckoview_version is set when gecko initializes.

Chutten, do I need to go through data review for this, if I'm effectively duplicating an existing metric?

Flags: needinfo?(chutten)

(In reply to Jamie Nicol [:jnicol] from comment #1)

Chutten, do I need to go through data review for this, if I'm effectively duplicating an existing metric?

It'd probably be a good idea anyway since it'd have its own purpose, owner, and expiry. And if someone comes around looking for why we collect it this way, it's a good artefact to help with that (data collection review being the only place one documents why we're collecting things).

Speaking of which, what are the questions you wish to answer with this collection? This seems like a roundabout way to answer anything besides "what was the version of the previous compositor selection algorithm?".

Flags: needinfo?(chutten) → needinfo?(jnicol)

The issue we have is that lots of the values we see for gfx_status_compositor report "opengl" when we'd expect them to say "webrender". The theory is that these users haven't opened a tab (and therefore initialized a compositor) since Fenix has been updated to a version on which their device should get webrender. But we'd like to verify that, as it's a surprisingly high number of users and we just want to be certain there isn't an actual issue preventing them getting webrender. Do you have any ideas for a better way of doing that?

Flags: needinfo?(jnicol) → needinfo?(chutten)

Glean has the "application" lifetime which is "values will be sent in pings from now until the application exits" which sounds like what you actually want. There were a lot of questions about how exactly this worked for GFX's compositor instrumentation back when it was being hammered out last year, though, which is why I didn't bring it up at first.

But that's kinda exactly what you want, isn't it? Pings sent before a compositor is init will have no value for compositor. Pings sent after a compositor is init will have the current value. So instead of

  1. User uses the browser, gets a value of opengl
  2. User upgrades the browser, doesn't use it yet (but it is running after the update because Android application lifecycle management is weird), the value is still opengl (even though the compositor choosing logic would choose webrender if we asked)
  3. User finally loads a page, gets a value of webrender

We'd have

  1. User uses the browser, gets a value of opengl
  2. User upgrades the browser, doesn't use it yet (but ...), the value is absent/null
  3. User finally loads a page, gets a value of webrender

(( We have this graphically demonstrated in the docs ))

Alternatively, perhaps there'd be a way to ask the compositor-choosing logic what value it would choose if it would run right now? Then you could ask it around app startup and have webrender reported from step 2 onwards.

Does this help? I might be making this confusing.

In sum: I recommend adding an "application" lifetime version of the compositor metric (which wouldn't require data review, but it's a good thing to do anyway for Future You's benefit), since that seems to be the actual instrumentation you're interested in. A "user" lifetime copy of geckoview engine version would also work (and also not require data review, but it'd be a Good Idea), but it gets a little fuzzier about what the intent of it is.

Flags: needinfo?(chutten)

(In reply to Chris H-C :chutten from comment #4)

Glean has the "application" lifetime which is "values will be sent in pings from now until the application exits" which sounds like what you actually want. There were a lot of questions about how exactly this worked for GFX's compositor instrumentation back when it was being hammered out last year, though, which is why I didn't bring it up at first.

But that's kinda exactly what you want, isn't it? Pings sent before a compositor is init will have no value for compositor. Pings sent after a compositor is init will have the current value. So instead of

  1. User uses the browser, gets a value of opengl
  2. User upgrades the browser, doesn't use it yet (but it is running after the update because Android application lifecycle management is weird), the value is still opengl (even though the compositor choosing logic would choose webrender if we asked)
  3. User finally loads a page, gets a value of webrender

We'd have

  1. User uses the browser, gets a value of opengl
  2. User upgrades the browser, doesn't use it yet (but ...), the value is absent/null
  3. User finally loads a page, gets a value of webrender

Hey Chris, here's the reasoning for leveraging user lifetimes. In your first example case with user lifetime, and after landing this change, we would have -

  1. User uses the browser -
    compositor = 'opengl'
    gfx_geckoview_version = 87.0
    gockerview_version = 87.0

  2. User upgrades the browser, doesn't use it yet -
    compositor = 'opengl'
    gfx_geckoview_version = 87.0
    gockerview_version = 88.0

  3. User loads a page -
    compositor = 'webrender'
    gfx_geckoview_version = 88.0
    gockerview_version = 88.0

^ This tells us what we want to know.

If we switch to application lifetimes, we start to lose data -

  1. User uses the browser -
    compositor = 'opengl'
    gfx_geckoview_version = 87.0
    gockerview_version = 87.0

  2. User restarts the browser, doesn't use it yet -
    compositor = null
    gfx_geckoview_version = null
    gockerview_version = 87.0

  3. User upgrades the browser, doesn't use it yet -
    compositor = null
    gfx_geckoview_version = null
    gockerview_version = 88.0

  4. User loads a page -
    compositor = 'webrender'
    gfx_geckoview_version = 88.0
    gockerview_version = 88.0

  5. User restarts the browser, doesn't use it yet -
    compositor = null
    gfx_geckoview_version = null
    gockerview_version = 88.0

In scenarios #2 and #3 & #5 above, our telemetry not longer has any value. Hence why we want to stick with user lifetimes that persist across sessions.

Flags: needinfo?(chutten)

I think I understand that much. Where I lose you is why we'd want there to be values in 2, 3 and 5. The user hasn't run the compositor selection algorithm yet, so any information you could report would be stale. NULL makes the most sense to me.

But! It's not important that I understand. I'm not the one running the analysis. : ) So don't feel you must trouble yourself explaining it to me if you don't want to.

The important stuff is: the proposed changes (recording the previous geckoview version in a "user" lifetime metric) will work as you expect, and a Data Review, though not strictly necessary, is beneficial for archaeology when someone in the future goes looking for artefacts about why the metrics are collected in the way they are.

(( If you're interested in what I'd do differently, I'd keep the "user"-lifetime metric because you probably have dashboards and stuff where you've gotten used to how it looks. I'd add an "application"-lifetime metric that echoes it with a different lifetime, using "it's null" as a signal that there's been no compositor yet this app session. If I wanted to do away with the "user"-lifetime compositor metric later, I could reconstitute its appearance in SQL using LAST_VALUE(metrics.strings.gfx_compositor_app IGNORE NULLS) PARTITION BY client_info.client_id ORDER BY ping_info.seq ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW AS compositor_user to smear the most recent non-NULL value across pings that otherwise would report NULL ))

Flags: needinfo?(chutten)
Blocks: wr-android
Severity: -- → S3
Priority: -- → P3

Okay, thanks Chris that's very helpful. I think you are probably right that we want to move things over to an application lifetime in the longer term. But right now we just want to verify the reason for the opengl compositors, and adding the version probe seems like the best option for that.

Attached file data_review.md
Attachment #9216362 - Flags: data-review?(chutten)

On Fenix we see many users with a value of opengl for the
gfx_status_compositor telemetry probe, despite the fact they are
running versions of gecko for which webrender should be enabled on
their device.

The theory is that although gecko is being initialized the user has
not opened a tab, and therefore initialized a compositor, since they
have been updated to a version which should be running webrender by
default. To verify that this is the case, this patch adds a new probe
gfx_status_last_compositor_gecko_version, which reports the gecko
version at the time a compositor is initialized. This will be used to
confirm that the users whose compositor unexpectedly reports opengl
have indeed not initialized a compositor in a while.

Assignee: nobody → jnicol
Status: NEW → ASSIGNED

Comment on attachment 9216362 [details]
data_review.md

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Jamie Nicol is responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

No. This collection is permanent.


Result: datareview+

Attachment #9216362 - Flags: data-review?(chutten) → data-review+
Pushed by jnicol@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/3d0ed017553b Expose gecko version at time compositor is initialized to geckoview_streaming telemetry. r=janerik
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 90 Branch

Comment on attachment 9216372 [details]
Bug 1704842 - Expose gecko version at time compositor is initialized to geckoview_streaming telemetry. r?jmathies

Beta/Release Uplift Approval Request

  • User impact if declined: None
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): New Telemetry probe for tracking webrender rollout on Android. Doesn't impact user experience.
  • String changes made/needed: None
Attachment #9216372 - Flags: approval-mozilla-beta?

Comment on attachment 9216372 [details]
Bug 1704842 - Expose gecko version at time compositor is initialized to geckoview_streaming telemetry. r?jmathies

Approved for 89 beta 11, thanks.

Attachment #9216372 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: