Closed Bug 1708264 Opened 4 years ago Closed 4 years ago

Define user-facing and derived BigQuery datasets in bigquery-etl

Categories

(Data Platform and Tools :: Glean Platform, enhancement, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: klukas, Assigned: klukas)

References

Details

(Whiteboard: [dataplatform])

Attachments

(1 file)

Currently, we rely on logic in cloudops-infra to define BigQuery datasets for derived tables and for user-facing views. Any configuration tweaks or additional datasets have to be codified in that repo as well. See:

https://github.com/mozilla-services/cloudops-infra/blob/master/projects/data-shared/tf/prod/envs/prod/bigquery/namespaces.tfvars.json

Since the actual tables and views are defined in bigquery-etl, we'd like to move dataset configuration to bigquery-etl as well. This will make it easier for data engineers to define new task-specific datasets, and it will better enable automation like per-Glean application datasets (see https://bugzilla.mozilla.org/show_bug.cgi?id=1708169).

In bigquery-etl, we will add support for dataset_metadata.yaml files that largely follow the existing format of namespaces.tfvars.json, including dataset_base_acl and workgroup_access fields.
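
For illustration, a dataset_metadata.yaml for a hypothetical derived dataset might look roughly like the following; field names other than dataset_base_acl and workgroup_access, and all of the values, are assumptions for this sketch rather than a final schema:

    # Hypothetical example; field names and values are illustrative only.
    friendly_name: Example Derived
    description: Derived tables maintained in bigquery-etl.
    dataset_base_acl: derived   # mirrors the field of the same name in namespaces.tfvars.json
    user_facing: false          # assumed flag distinguishing view datasets from derived ones
    workgroup_access:
      - role: roles/bigquery.dataViewer
        members:
          - workgroup:mozilla-confidential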

This bug encompasses adding the relevant machinery to bigquery-etl and porting over the content of namespaces.tfvars.json to appropriate dataset_metadata.yaml files.

Some notes:

  1. This work includes user-facing and non-user-facing datasets (and possibly some other special cases, e.g. tmp, udf).
    The logic that currently determines which datasets to propagate to mozdata lives here. For now we're probably most concerned with codifying view datasets. It's worth noting that dataset type and dataset_base_acl don't currently have a 1:1 relationship, which is why both need to exist. We may be able to refactor this, but it's not planned for the first pass.
  2. Ops logic currently generates a _derived dataset for every ingestion namespace.
    What exists in namespaces.tfvars.json are only entries that either need ACLs (which are annotated with "ingestion_dataset": true) or don't correspond to ingestion datasets. We're potentially no longer planning to automatically create the _derived dataset per ingestion namespace, which will increase overhead when one is needed but decrease clutter when one is not. In any case, all existing namespace _derived datasets should probably exist in bqetl metadata.
  3. This work is likely to precede the final rollout of terraform modernization.
    As such I'm going to be shimming the new format into the old for the 0.11 stack. This should be fairly simple: collect the yaml files and merge them into the existing namespaces.tfvars.json file. For now I plan to keep namespaces.tfvars.json in cloudops-infra and merge it with bigquery-etl metadata, preferring namespaces.tfvars.json (see the sketch after this list). This will make it somewhat safer to apply from a permissions perspective, but logic will eventually need to be put in place to avoid accidentally unrestricting datasets via bqetl metadata. For the first pass, the datasets we're trying to add will probably all be mozilla-confidential, and that can even be enforced by an ops logic override.
  4. As long as the bigquery-etl branch is never deployed in such a way that it relies on ingestion (live/stable) datasets that haven't been added to generated-schemas, this should be safe to deploy without dependency issues.
    The schemas build job will poll both branches and, when either is updated, pull in the latest version of both. Schemas deploys can take a long time (30+ min), so it may make sense to do something smart here when both branches are being updated in close proximity.
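
For point 3, here is a minimal sketch of the shim, assuming the yaml files live alongside the datasets in a bigquery-etl checkout and that the merged output keeps the shape of namespaces.tfvars.json; the paths, the top-level JSON key, and the PyYAML dependency are assumptions for illustration, not the actual ops logic:

    import json
    from pathlib import Path

    import yaml  # PyYAML; assumed to be available

    # Paths are illustrative; the real repo layouts may differ.
    BQETL_SQL_ROOT = Path("bigquery-etl/sql/moz-fx-data-shared-prod")
    TFVARS_PATH = Path(
        "cloudops-infra/projects/data-shared/tf/prod/envs/prod/bigquery/namespaces.tfvars.json"
    )

    def collect_bqetl_metadata(sql_root: Path) -> dict:
        """Collect dataset_metadata.yaml files, keyed by dataset name."""
        metadata = {}
        for path in sorted(sql_root.glob("*/dataset_metadata.yaml")):
            metadata[path.parent.name] = yaml.safe_load(path.read_text())
        return metadata

    def merge_with_tfvars(bqetl_metadata: dict, tfvars_path: Path) -> dict:
        """Merge bqetl metadata into the tfvars config, preferring tfvars entries."""
        tfvars = json.loads(tfvars_path.read_text())
        namespaces = tfvars.get("namespaces", {})  # top-level key is an assumption
        merged = {**bqetl_metadata, **namespaces}  # namespaces.tfvars.json wins on conflicts
        return {**tfvars, "namespaces": merged}

    if __name__ == "__main__":
        merged = merge_with_tfvars(collect_bqetl_metadata(BQETL_SQL_ROOT), TFVARS_PATH)
        print(json.dumps(merged, indent=2, sort_keys=True))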
Assignee: nobody → jklukas
Priority: -- → P1

As of https://github.com/mozilla/bigquery-etl/pull/1988 we now have dataset_metadata.yaml files for each derived and user-facing dataset present in the generated-sql branch. :whd is going to integrate this in a phased approach, letting existing metadata defined in ops logic take precedence, and removing some of that configuration in stages as we verify that no unexpected changes will be made.

I have a PoC branch at https://github.com/mozilla-services/cloudops-infra/compare/dataset_metadata?expand=1 that integrates bqetl metadata into the deployment pipeline and applies cleanly (i.e. as a no-op) in all stage projects (shared, mozdata, and rally). I plan to clean this up and get it reviewed tomorrow, after which we can begin to define new datasets using this mechanism.

Summary: Define user-facing BigQuery datasets in bigquery-etl → Define user-facing and derived BigQuery datasets in bigquery-etl

https://github.com/mozilla-services/cloudops-infra/pull/3063

r? :robotblake since :jason is out. At this point it's a no-op deploy and it might be nice to have some datasets defined exclusively in bqetl (e.g. bug #1708166) for testing purposes.

Flags: needinfo?(bimsland)

The PR was reviewed today and I will land it tomorrow, after which I think this bug can be closed.

Flags: needinfo?(bimsland)

The PR has been deployed successfully.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Component: General → Glean Platform
Whiteboard: [data-platform-infra-wg] → [dataplatform]
