Closed Bug 1708169 Opened 4 years ago Closed 3 years ago

Add automated Glean datasets, across all channels, to per-app datasets

Categories

(Data Platform and Tools :: Glean Platform, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: frank, Assigned: ascholtz)

References

(Blocks 1 open bug)

Details

(Whiteboard: [dataplatform])

Attachments

(1 file)

Specifically, we'd start with baseline_clients_daily and baseline_clients_last_seen. This bug covers creating views that include those tables in the per-app dataset. Each view should union the per-channel tables, with the appropriate channel name added as another column, e.g.:

SELECT "release" AS channel, *
FROM org_mozilla_firefox.baseline_clients_daily
UNION ALL
SELECT "nightly" AS channel, *
FROM org_mozilla_fenix.baseline_clients_daily
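
A union like the one above could be generated rather than written by hand. The following is a minimal sketch in Python, assuming a hypothetical helper `build_union_view_sql` and a hand-written channel-to-dataset mapping; neither is taken from bigquery-etl.

```python
def build_union_view_sql(table, channel_datasets):
    """Return a query that unions `table` across datasets, one SELECT per channel.

    `channel_datasets` maps a channel name to the BigQuery dataset that holds
    that channel's data. Both this helper and the mapping are hypothetical.
    """
    if not channel_datasets:
        raise ValueError("at least one channel is required")
    selects = [
        f'SELECT "{channel}" AS channel, *\nFROM {dataset}.{table}'
        for channel, dataset in channel_datasets.items()
    ]
    return "\nUNION ALL\n".join(selects)


sql = build_union_view_sql(
    "baseline_clients_daily",
    {"release": "org_mozilla_firefox", "nightly": "org_mozilla_fenix"},
)
print(sql)
```

Aliasing the literal as `channel` in every branch is not required by UNION ALL, which takes column names from the first SELECT, but it keeps each branch readable on its own.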
Depends on: 1708166
Component: General → Glean Platform
No longer depends on: 1708166
Whiteboard: [data-platform-infra-wg]
Depends on: 1708166

The existing code for glean_usage in bigquery-etl is now overly complicated, as it was originally written to handle generating queries as well as running them. We now deploy the generated views as part of the normal view publishing process, which lets us significantly simplify the code needed.

At this point, I think we should define an interface for this in the bqetl CLI and simplify the code.

We'd have a command that would be responsible for generating all the per-appId queries and views along with the per-app dataset_metadata.yaml and union views:

bqetl glean_usage generate --project_id <id> --sql-dir <defaults to sql/> --app_name <defaults to all apps>

Then I think we can change the Airflow jobs to use an existing query interface, running on top of content generated from the above. Something like:

bqetl query backfill <some flags> "*.baseline_clients_daily_v1"
Assignee: nobody → ascholtz
Status: NEW → ASSIGNED

Some additional details here, since Anna has picked this up.

We could consider having two separate commands, one for generating the appId-specific stuff, the other for the app_name-specific dataset and union views. It's probably fine to do both together, as it would ensure we're consistent between the two.

The https://mozilla.github.io/probe-scraper/#operation/getAppListings API is now stable enough to use for this. Currently, it's necessary to apply some logic to get a per-app-level view, as seen in https://github.com/mozilla/lookml-generator/blob/main/generator/namespaces.py and https://github.com/mozilla/glean-dictionary/blob/e1aa55153939a7ca21554285e34a0829b1fc0ffe/scripts/build-glean-metadata.py
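
To illustrate the kind of logic meant here, below is a rough sketch of collapsing a flat list of app listings into a per-app view. It runs on hand-written sample records, not real API output. The field names `app_name`, `app_channel`, and `bq_dataset_family` are assumptions about the listing shape, and treating a missing channel as "release" is also an assumption; the linked lookml-generator and glean-dictionary code may do this differently.

```python
from collections import defaultdict

# Hand-written sample shaped like a flat app listing (illustrative only).
SAMPLE_LISTINGS = [
    {"app_name": "fenix", "app_channel": "release",
     "bq_dataset_family": "org_mozilla_firefox"},
    {"app_name": "fenix", "app_channel": "nightly",
     "bq_dataset_family": "org_mozilla_fenix"},
    {"app_name": "firefox_ios", "bq_dataset_family": "org_mozilla_ios_firefox"},
]


def group_by_app_name(listings):
    """Return {app_name: {channel: dataset}} from a flat list of listings."""
    apps = defaultdict(dict)
    for listing in listings:
        # Assumption: a listing without an explicit channel is a release build.
        channel = listing.get("app_channel", "release")
        apps[listing["app_name"]][channel] = listing["bq_dataset_family"]
    return dict(apps)


apps = group_by_app_name(SAMPLE_LISTINGS)
print(apps["fenix"])
```

Each inner mapping has the channel-to-dataset shape that a union view over the per-channel tables would need.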

We'll eventually have an "applications" endpoint that provides the nested view, but that doesn't exist yet.

We'll also need to verify what's actually been deployed via the same mechanism as in stable view generation: https://github.com/mozilla/bigquery-etl/blob/a081bf22b53c9f103528526d9c4270270c9f226f/bigquery_etl/view/generate_stable_views.py#L226
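
The effect of that check can be sketched as a simple filter. This does not reproduce the mechanism in the linked generate_stable_views.py; it assumes the set of deployed `dataset.table` names has already been obtained somehow, and the helper name `filter_deployed` is hypothetical.

```python
def filter_deployed(channel_datasets, table, deployed_tables):
    """Keep only the channels whose `dataset.table` is known to be deployed.

    `deployed_tables` is a set of "dataset.table" strings; how it is
    populated is outside the scope of this sketch.
    """
    return {
        channel: dataset
        for channel, dataset in channel_datasets.items()
        if f"{dataset}.{table}" in deployed_tables
    }


channels = {"release": "org_mozilla_firefox", "nightly": "org_mozilla_fenix"}
deployed = {"org_mozilla_firefox.baseline_clients_daily_v1"}
usable = filter_deployed(channels, "baseline_clients_daily_v1", deployed)
print(usable)
```

Filtering before generating the union view avoids referencing a table that does not exist yet for some channel.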

The machinery for generating the per-app datasets is now in place. Datasets for fenix and firefox_ios are available in BigQuery.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Whiteboard: [data-platform-infra-wg] → [dataplatform]
