Add automated Glean datasets, across all channels, to per-app datasets
Categories
(Data Platform and Tools :: Glean Platform, task)
Tracking
(Not tracked)
People
(Reporter: frank, Assigned: ascholtz)
References
(Blocks 1 open bug)
Details
(Whiteboard: [dataplatform])
Attachments
(1 file)
Specifically, we'd start with baseline_clients_daily and baseline_clients_last_seen. This bug covers creating a view that includes those in the app dataset. It should union them, with the appropriate channel name as another column, e.g.:
SELECT "release" AS channel, *
FROM org_mozilla_firefox.baseline_clients_daily
UNION ALL
SELECT "nightly", *
FROM org_mozilla_fenix.baseline_clients_daily
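A minimal sketch of how such a union could be rendered for any number of channels. The helper name render_union_view and the channel-to-dataset mapping are hypothetical and only mirror the two datasets in the example above; this is not the bigquery-etl implementation.

```python
from typing import Mapping


def render_union_view(table: str, channel_datasets: Mapping[str, str]) -> str:
    """Build a UNION ALL query over per-appId datasets, tagging rows with a channel.

    `channel_datasets` maps a channel name to the dataset holding `table`.
    Hypothetical helper, for illustration only.
    """
    if not channel_datasets:
        raise ValueError("at least one channel dataset is required")
    selects = [
        f'SELECT "{channel}" AS channel, *\nFROM {dataset}.{table}'
        for channel, dataset in channel_datasets.items()
    ]
    return "\nUNION ALL\n".join(selects)


# Mapping taken from the example query in this bug.
FENIX_CHANNELS = {
    "release": "org_mozilla_firefox",
    "nightly": "org_mozilla_fenix",
}

print(render_union_view("baseline_clients_daily", FENIX_CHANNELS))
```

Note that UNION ALL matches columns by position, so SELECT * only works if every per-channel table has the same columns in the same order.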
Comment 1•3 years ago
The existing code for glean_usage in bigquery-etl is now overly complicated, as it was originally written to handle generating queries as well as running them. We now deploy the generated views as part of the normal view publishing process, and this will let us significantly simplify the code needed.
I think we should at this point define an interface for this in the bqetl CLI and simplify the code.
We'd have a command that would be responsible for generating all the per-appId queries and views along with the per-app dataset_metadata.yaml and union views:
bqetl glean_usage generate --project_id <id> --sql-dir <defaults to sql/> --app_name <defaults to all apps>
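To make the proposed interface concrete, here is a rough sketch of the argument parsing only. It uses the standard-library argparse so that it is self-contained; the actual bqetl CLI may be wired up differently, and the flag names and defaults are simply copied from the proposal above. The project id in the usage line is a placeholder.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of the proposed `bqetl glean_usage generate` interface (not real bqetl code)."""
    parser = argparse.ArgumentParser(prog="bqetl")
    commands = parser.add_subparsers(dest="command", required=True)

    glean_usage = commands.add_parser("glean_usage")
    actions = glean_usage.add_subparsers(dest="action", required=True)

    generate = actions.add_parser("generate")
    generate.add_argument("--project_id", required=True)
    # Defaults follow the proposal: sql/ for the directory, all apps when unset.
    generate.add_argument("--sql-dir", default="sql/")
    generate.add_argument("--app_name", default=None)
    return parser


# Placeholder project id; omitting --app_name means "all apps".
args = build_parser().parse_args(
    ["glean_usage", "generate", "--project_id", "example-project"]
)
print(args.project_id, args.sql_dir, args.app_name)
```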
Then I think we can change the Airflow jobs to use an existing query interface, running on top of content generated from the above. Something like:
bqetl query backfill <some flags> "*.baseline_clients_daily_v1"
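As an illustration of what the quoted pattern would select, the sketch below applies shell-style wildcard matching to a list of qualified table names. The table names are hypothetical examples, and how bqetl actually resolves such patterns is not specified in this bug.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def select_tables(pattern: str, names: Iterable[str]) -> List[str]:
    """Return the names matching a shell-style pattern, preserving input order."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical "<dataset>.<table>" names, for illustration only.
TABLES = [
    "org_mozilla_firefox_derived.baseline_clients_daily_v1",
    "org_mozilla_fenix_derived.baseline_clients_daily_v1",
    "org_mozilla_fenix_derived.baseline_clients_last_seen_v1",
]

matched = select_tables("*.baseline_clients_daily_v1", TABLES)
print(matched)
```

With a pattern like this, one backfill invocation would cover every per-appId copy of the table without listing the datasets by hand.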
Comment 2•3 years ago
Some additional details here, since Anna has picked this up.
We could consider having two separate commands, one for generating the appId-specific stuff, the other for the app_name-specific dataset and union views. It's probably fine to do both together, as it would ensure we're consistent between the two.
The https://mozilla.github.io/probe-scraper/#operation/getAppListings API is now stable enough to use for this. Currently, it's necessary to apply some logic to get a per-app-level view, as seen in https://github.com/mozilla/lookml-generator/blob/main/generator/namespaces.py and https://github.com/mozilla/glean-dictionary/blob/e1aa55153939a7ca21554285e34a0829b1fc0ffe/scripts/build-glean-metadata.py
We'll eventually have an "applications" endpoint that provides the nested view, but that doesn't exist yet.
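Until that endpoint exists, the grouping logic could look roughly like the sketch below. It works on an inline sample instead of calling the API, and the field names (app_name, app_channel, bq_dataset_family), the sample values, and the fallback to "release" when no channel is given are all assumptions made for illustration; check them against the real app listings before relying on this.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping

# Illustrative sample only; not actual output of the app listings API.
SAMPLE_LISTINGS = [
    {"app_name": "fenix", "app_channel": "release", "bq_dataset_family": "org_mozilla_firefox"},
    {"app_name": "fenix", "app_channel": "nightly", "bq_dataset_family": "org_mozilla_fenix"},
    {"app_name": "firefox_ios", "bq_dataset_family": "org_mozilla_ios_firefox"},
]


def group_by_app(listings: Iterable[Mapping[str, str]]) -> Dict[str, Dict[str, str]]:
    """Nest flat per-appId listings as {app_name: {channel: dataset}}."""
    apps: Dict[str, Dict[str, str]] = defaultdict(dict)
    for listing in listings:
        # Assumption: a listing without a channel is treated as "release".
        channel = listing.get("app_channel", "release")
        apps[listing["app_name"]][channel] = listing["bq_dataset_family"]
    return dict(apps)


for app_name, channels in group_by_app(SAMPLE_LISTINGS).items():
    print(app_name, channels)
```

The resulting per-app mapping is the shape a union view generator needs: one dataset per channel, keyed by app name.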
We'll also need to verify what's actually been deployed via the same mechanism as in stable view generation: https://github.com/mozilla/bigquery-etl/blob/a081bf22b53c9f103528526d9c4270270c9f226f/bigquery_etl/view/generate_stable_views.py#L226
Comment 3•3 years ago
Comment 4•3 years ago
The machinery for generating the per-app datasets is now in place. Datasets for fenix and firefox_ios are available in BigQuery.