Closed Bug 1546149 Opened 8 months ago Closed 11 days ago

Figure out the telemetry story for GeckoView

Categories

(Core :: Graphics: WebRender, task, P3)

Other Branch
Unspecified
Android
task

Tracking

()

RESOLVED FIXED

People

(Reporter: kats, Unassigned)

References

(Blocks 3 open bugs)

Details

(Whiteboard: [wr-amvp][wr-q2][geckoview:fenix:p3])

In order to make sure we're not regressing Android performance when we turn on WebRender on GeckoView, we should figure out how to measure performance.

I talked to chutten on IRC and learned the following things:

The way telemetry works on GeckoView is that GV collects the data and stores it, and the embedding application is responsible for submitting the data. Fenix, for example, uses the Glean SDK to submit the data.

On the backend it is possible to distinguish data submitted from Fennec vs different GV embeddings (Fenix, reference browser, etc.) which is basically all we need, as long as we have the right probes in place, and can actually get at the data.

In terms of getting at the data, it's not published to telemetry.m.o and whether or not we can get to it from databricks is "still up in the air". sql.telemetry.m.o is apparently where Fenix and ReferenceBrowser data ends up for now.

In terms of probes, :snorp pointed me to GV_STARTUP_RUNTIME_MS, which is the GV equivalent of FENNEC_STARTUP_TIME_GECKOREADY. We should probably track it in case WR regresses Gecko startup time.

So there are some more details to be figured out here, with the end goal being to construct a dashboard where we can monitor the probe data to ensure we're not regressing things.

Whiteboard: [wr-amvp][wr-q2]

Chris, are there any other bugs/info that might be relevant here?

Flags: needinfo?(cpeterson)

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #1)

> Chris, are there any other bugs/info that might be relevant here?

If you have technical questions, :Dexter and :Frank worked on GeckoView's telemetry system. :esawin implemented some GeckoView probes for page load performance that might be relevant probes or examples: bug 1499418.

Here's an example dashboard we created for comparing GeckoView vs WebView in Focus. (You will need to click each graph's "Gv_exp" field and "Select All" to display the GeckoView and WebView data.)

https://sql.telemetry.mozilla.org/dashboard/focus-8-0-release-dashboard

Flags: needinfo?(cpeterson)
OS: Unspecified → Android
See Also: → 1499418
Whiteboard: [wr-amvp][wr-q2] → [wr-amvp][wr-q2][geckoview:fenix:p3]
Assignee: nobody → kats

After some more discussion with :chutten and :mdroettboom on Slack (#gv) it sounds like there's a bunch of work to be done before Fenix can submit Gecko telemetry probes. Much of that work is at the GeckoView layer, involving adding a metrics.yaml file for Glean and hooking it up (somehow) such that GV sends along the data to Glean. I found bug 1497997, which seems kinda related but is not specifically about Fenix and is more generally about tracking GV performance. So I think there might not be a bug tracking this work as of yet. I'll ask around more and file something if there isn't.

Ah, :mdroettboom and :esawin pointed me to bug 1497812, which is tracking the GV -> Glean integration.

So right now the only GV product that is submitting Gecko telemetry probes is Focus for Android, because it doesn't use Glean but instead uses its predecessor component.

Depends on: 1497812

https://sql.telemetry.mozilla.org/queries/63009/source#161595 is a sample query pulling the MEMORY_TOTAL probe from Android Focus Nightly instances. I couldn't figure out how to truncate the x-axis so that the data isn't all squished up on the left. But anyway this table (mobile_metrics_aggregates) doesn't seem to have data that I can use to cross-reference WR on vs off. So I'll have to dig a bit more to see where I can find that.

For that particular query I'd try adding HAVING SUM(h.v) > 0 to remove from the result set any buckets that don't have counts in them. Another option would be some sort of clamp like:

CASE
  WHEN CAST(h.k as INTEGER) > <some upper limit> THEN <some upper limit>
  ELSE CAST(h.k as INTEGER)
END AS mem_kb
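
As a rough, self-contained illustration of how the two suggestions combine, here is a sketch in Python using an in-memory SQLite table. The table contents, the 8192 KB cap, and the column alias `total` are all made up for the example; the real query runs against mobile_metrics_aggregates on sql.telemetry.m.o, whose SQL dialect may differ (for instance, it may not accept an alias in GROUP BY).

```python
import sqlite3

# Hypothetical stand-in for the unnested histogram rows: bucket key k (as
# text) and count v. These values are invented for illustration only.
rows = [("1024", 5), ("2048", 9), ("4096", 0), ("900000", 2), ("1000000", 1)]

UPPER_LIMIT = 8192  # assumed cap in KB; choose whatever suits the chart

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE h (k TEXT, v INTEGER)")
conn.executemany("INSERT INTO h VALUES (?, ?)", rows)

# Clamp large bucket keys to the cap so the outliers collapse into a single
# bucket, then drop buckets whose summed count is zero.
query = """
SELECT
  CASE
    WHEN CAST(h.k AS INTEGER) > :cap THEN :cap
    ELSE CAST(h.k AS INTEGER)
  END AS mem_kb,
  SUM(h.v) AS total
FROM h
GROUP BY mem_kb
HAVING SUM(h.v) > 0
ORDER BY mem_kb
"""
result = conn.execute(query, {"cap": UPPER_LIMIT}).fetchall()
print(result)
```

With the sample rows above, the empty 4096 bucket is dropped and the two outlier buckets are merged into the bucket at the cap, which keeps the x-axis from being stretched by a few extreme values.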

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #3)

> After some more discussion with :chutten and :mdroettboom on Slack (#gv) it sounds like there's a bunch of work to be done before Fenix can submit Gecko telemetry probes. Much of that work is at the GeckoView layer, involving adding a metrics.yaml file for Glean and hooking it up (somehow) such that GV sends along the data to Glean. I found bug 1497997, which seems kinda related but is not specifically about Fenix and is more generally about tracking GV performance. So I think there might not be a bug tracking this work as of yet. I'll ask around more and file something if there isn't.

For the medium to long term, that "GV Telemetry into Glean SDK" needs an integration story and a scheduled project.

For the short term, if there are priority metrics that you need to collect, you can:

  • Add a new Glean metric to the specific application code or to the android-components code.
  • Collect the metric value from GeckoView (pull from app code or push from GV).
    With that, you can analyze the data just like other Glean metrics.

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #5)

> But anyway this table (mobile_metrics_aggregates) doesn't seem to have data that I can use to cross-reference WR on vs off. So I'll have to dig a bit more to see where I can find that.

I was hoping this would just be in some other table, but according to :frank, this is the schema used to submit the mobile metrics, and it doesn't really contain any environment data. There's an experiments field which we might be able to populate with WR on vs off, if we're doing it in the context of an experiment, but it would be good to have the compositor and other environment bits in that data as well. I don't know how much effort it will be to add this and plumb it through our data pipeline so that it comes out in a usable format.

I could be wrong, but I think metric lifetimes will satisfy the "is WR on or off?" case. You can have a boolean metric gfx.webrender_enabled (or a string metric gfx.compositor, or whatever is most useful to your analyses) with the application lifetime, and set it at app startup and whenever it changes.

Would that work?
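
To make the lifetime idea concrete, here is a toy model in Python. It is not the Glean SDK API: the class, its methods, and the memory_total_kb values are hypothetical, and it only mimics the assumed behavior that application-lifetime values persist across pings within one app run while ping-lifetime values are cleared after each submission.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMetricStore:
    """Toy model of metric lifetimes; NOT the Glean SDK API."""

    application: dict = field(default_factory=dict)  # kept until app restart
    ping: dict = field(default_factory=dict)         # cleared after each ping

    def set_application(self, name, value):
        self.application[name] = value

    def set_ping(self, name, value):
        self.ping[name] = value

    def submit_ping(self):
        # Each ping carries the ping-lifetime values plus whatever the
        # application-lifetime values are at submission time.
        payload = {**self.application, **self.ping}
        self.ping.clear()
        return payload


store = ToyMetricStore()
store.set_application("gfx.webrender_enabled", True)   # set at app startup
store.set_ping("memory_total_kb", 150_000)             # hypothetical value
first = store.submit_ping()

store.set_ping("memory_total_kb", 162_000)
second = store.submit_ping()                           # flag still attached

store.set_application("gfx.webrender_enabled", False)  # changed at runtime
store.set_ping("memory_total_kb", 140_000)
third = store.submit_ping()
```

In this model every submitted payload carries the current gfx.webrender_enabled value, so measurements could be split into WR-on and WR-off groups at analysis time without any other environment data.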

(In reply to Chris H-C :chutten from comment #9)

> Would that work?

That does sound like it would work in the "happy case" where everything works out the way we want. With the desktop analyses for WR on we ran into a number of "unhappy cases" where we had to dig into additional environment details to try and understand why we were seeing the results we were seeing. So if possible I'd prefer to get a fuller environment in telemetry, but if that doesn't work out then this is better than nothing.

Unassigning since I'm not actively doing anything here at the moment; this is waiting on dependencies and other teams.

Just making note of some things we should look at before calling this done:

There are a bunch of scalars that need to be added:
gfx_display_count
gfx_display_h
gfx_display_v
gfx_adapter
gfx_compositor
gfx_status_webrender

From this spreadsheet https://docs.google.com/spreadsheets/d/1EZGVGmybvF1sH-XrsXMlplK7fWQfDniIeqXrkfDQrlc/edit#gid=0

Here's documentation on how to expose scalars through Gecko->Glean: https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/start/report-gecko-telemetry-in-glean.html#reporting-a-scalar

See Also: → 1568755
Depends on: 1594145
Status: NEW → RESOLVED
Closed: 11 days ago
Resolution: --- → FIXED