Closed Bug 1708264 Opened 4 years ago Closed 4 years ago

Define user-facing and derived BigQuery datasets in bigquery-etl

Categories

(Data Platform and Tools :: Glean Platform, enhancement, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: klukas, Assigned: klukas)

References

Details

(Whiteboard: [dataplatform])

Attachments

(1 file)

Currently, we rely on logic in cloudops-infra to define BigQuery datasets for derived tables and for user-facing views. Any configuration tweaks or additional datasets have to be codified in that repo as well. See:

https://github.com/mozilla-services/cloudops-infra/blob/master/projects/data-shared/tf/prod/envs/prod/bigquery/namespaces.tfvars.json

Since the actual tables and views are defined in bigquery-etl, we'd like to move dataset configuration to bigquery-etl as well. This will make it easier for data engineers to define new task-specific datasets, and it will better enable automation like per-Glean application datasets (see https://bugzilla.mozilla.org/show_bug.cgi?id=1708169).

In bigquery-etl, we will add support for dataset_metadata.yaml files that largely follow the existing format of namespaces.tfvars.json, including dataset_base_acl and workgroup_access fields.
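
For illustration, a dataset_metadata.yaml for a hypothetical derived dataset might look roughly like the following; field names other than dataset_base_acl and workgroup_access, and all of the values, are assumptions for this sketch rather than a final schema:

    # Hypothetical example; field names and values are illustrative only.
    friendly_name: Example Derived
    description: Derived tables maintained in bigquery-etl.
    dataset_base_acl: derived   # mirrors the field of the same name in namespaces.tfvars.json
    user_facing: false          # assumed flag distinguishing view datasets from derived ones
    workgroup_access:
      - role: roles/bigquery.dataViewer
        members:
          - workgroup:mozilla-confidential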

This bug encompasses adding the relevant machinery to bigquery-etl and porting over the content of namespaces.tfvars.json to appropriate dataset_metadata.yaml files.

Some notes:

  1. This work includes user-facing and non-user-facing datasets (and possibly some other special cases, e.g. tmp, udf).
    The logic that currently determines which datasets to propagate to mozdata lives here. For now we're probably most concerned with codifying view datasets. It's worth noting that dataset type and dataset_base_acl don't currently have a 1:1 relationship, which is why both need to exist. We may be able to refactor this, but it's not planned for the first pass.
  2. Ops logic currently generates a _derived dataset for every ingestion namespace.
    What exists in namespaces.tfvars.json are only entries that either need ACLs (which are annotated with "ingestion_dataset": true) or don't correspond to ingestion datasets. We're potentially no longer planning to automatically create the _derived dataset per ingestion namespace, which will increase overhead when one is needed but decrease clutter when one is not. In any case, all existing namespace _derived datasets should probably exist in bqetl metadata.
  3. This work is likely to precede the final rollout of terraform modernization.
    As such I'm going to be shimming the new format into the old for the 0.11 stack. This should be fairly simple: collect the yaml files and merge them into the existing namespaces.tfvars.json file. For now I plan to keep namespaces.tfvars.json in cloudops-infra and merge it with bigquery-etl metadata, preferring namespaces.tfvars.json (see the sketch after this list). This will make it somewhat safer to apply from a permissions perspective, but logic will eventually need to be put in place to avoid accidentally unrestricting datasets via bqetl metadata. For the first pass, the datasets we're trying to add will probably all be mozilla-confidential, and that can even be enforced by an ops logic override.
  4. As long as the bigquery-etl branch is never deployed in such a way that it relies on ingestion (live/stable) datasets that haven't been added to generated-schemas, this should be safe to deploy without dependency issues.
    The schemas build job will poll both branches and, when either is updated, pull in the latest version of both. Schemas deploys can take a long time (30+ min), so it may make sense to do something smart here when both branches are being updated in close proximity.
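
For point 3, here is a minimal sketch of the shim, assuming the yaml files live alongside the datasets in a bigquery-etl checkout and that the merged output keeps the shape of namespaces.tfvars.json; the paths, the top-level JSON key, and the PyYAML dependency are assumptions for illustration, not the actual ops logic:

    import json
    from pathlib import Path

    import yaml  # PyYAML; assumed to be available

    # Paths are illustrative; the real repo layouts may differ.
    BQETL_SQL_ROOT = Path("bigquery-etl/sql/moz-fx-data-shared-prod")
    TFVARS_PATH = Path(
        "cloudops-infra/projects/data-shared/tf/prod/envs/prod/bigquery/namespaces.tfvars.json"
    )

    def collect_bqetl_metadata(sql_root: Path) -> dict:
        """Collect dataset_metadata.yaml files, keyed by dataset name."""
        metadata = {}
        for path in sorted(sql_root.glob("*/dataset_metadata.yaml")):
            metadata[path.parent.name] = yaml.safe_load(path.read_text())
        return metadata

    def merge_with_tfvars(bqetl_metadata: dict, tfvars_path: Path) -> dict:
        """Merge bqetl metadata into the tfvars config, preferring tfvars entries."""
        tfvars = json.loads(tfvars_path.read_text())
        namespaces = tfvars.get("namespaces", {})  # top-level key is an assumption
        merged = {**bqetl_metadata, **namespaces}  # namespaces.tfvars.json wins on conflicts
        return {**tfvars, "namespaces": merged}

    if __name__ == "__main__":
        merged = merge_with_tfvars(collect_bqetl_metadata(BQETL_SQL_ROOT), TFVARS_PATH)
        print(json.dumps(merged, indent=2, sort_keys=True))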
Assignee: nobody → jklukas
Priority: -- → P1

As of https://github.com/mozilla/bigquery-etl/pull/1988 we now have dataset_metadata.yaml files for each derived and user-facing dataset present in the generated-sql branch. :whd is going to integrate this in a phased approach, letting existing metadata defined in ops logic take precedence, and removing some of that configuration in stages as we verify that no unexpected changes will be made.

I have a PoC branch at https://github.com/mozilla-services/cloudops-infra/compare/dataset_metadata?expand=1 that integrates bqetl metadata into the deployment pipeline and applies cleanly (i.e. as a no-op) in all stage projects (shared, mozdata, and rally). I plan to clean this up and get it reviewed tomorrow, after which we can begin to define new datasets using this mechanism.

Summary: Define user-facing BigQuery datasets in bigquery-etl → Define user-facing and derived BigQuery datasets in bigquery-etl

https://github.com/mozilla-services/cloudops-infra/pull/3063

r? :robotblake since :jason is out. At this point it's a no-op deploy and it might be nice to have some datasets defined exclusively in bqetl (e.g. bug #1708166) for testing purposes.

Flags: needinfo?(bimsland)

The PR was reviewed today and I will land it tomorrow, after which I think this bug can be closed.

Flags: needinfo?(bimsland)

The PR has been deployed successfully.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Component: General → Glean Platform
Whiteboard: [data-platform-infra-wg] → [dataplatform]
