Define user-facing and derived BigQuery datasets in bigquery-etl
Categories
(Data Platform and Tools :: Glean Platform, enhancement, P1)
Tracking
(Not tracked)
People
(Reporter: klukas, Assigned: klukas)
References
Details
(Whiteboard: [dataplatform])
Attachments
(1 file)
Currently, we rely on logic in cloudops-infra to define BigQuery datasets for derived tables and for user-facing views. Any configuration tweaks or additional datasets have to be codified in that repo as well. See:
Since the actual tables and views are defined in bigquery-etl, we'd like to move dataset configuration to bigquery-etl as well. This will make it easier for data engineers to define new task-specific datasets, and it will better enable automation like per-Glean-application datasets (see https://bugzilla.mozilla.org/show_bug.cgi?id=1708169).
In bigquery-etl, we will add support for dataset_metadata.yaml files that largely follow the existing format of namespaces.tfvars.json, including dataset_base_acl and workgroup_access fields.
This bug encompasses adding the relevant machinery to bigquery-etl and porting over the content of namespaces.tfvars.json to appropriate dataset_metadata.yaml files.
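As a rough sketch, a dataset_metadata.yaml following the namespaces.tfvars.json conventions might look like the following. Only dataset_base_acl and workgroup_access are named in this bug; every other field, and all the values, are illustrative assumptions.

```yaml
# Hypothetical dataset_metadata.yaml for a derived dataset.
# dataset_base_acl and workgroup_access come from the plan above;
# the other fields and all values are illustrative.
friendly_name: Telemetry Derived
description: Derived tables for the telemetry namespace.
dataset_base_acl: derived
workgroup_access:
  - role: roles/bigquery.dataViewer
    members:
      - workgroup:mozilla-confidential
```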
Comment 1•4 years ago
Some notes:
- This work includes user-facing and non-user-facing datasets (and possibly some other special cases, e.g. tmp, udf). The logic that currently determines which datasets to propagate to mozdata lives here. For now we're probably most concerned with codifying view datasets. It's worth noting that dataset type and dataset base ACL don't currently have a 1:1 relationship, which is why both need to exist. We may be able to refactor this, but it's not planned for the first pass.
- Ops logic currently generates a _derived dataset for every ingestion namespace. The only entries in namespaces.tfvars.json are those that either need ACLs (annotated with "ingestion_dataset": true) or don't correspond to ingestion datasets. We're potentially no longer planning to automatically create the _derived dataset per ingestion namespace, which will increase overhead when one is needed but decrease clutter when one is not. In any case, all existing namespace _derived datasets should probably exist in bqetl metadata.
- This work is likely to precede the final rollout of terraform modernization. As such, I'm going to shim the new format into the old for the 0.11 stack. This should be fairly simple: collect the yaml files and merge them into the existing namespaces.tfvars.json file. For now I plan to keep namespaces.tfvars.json in cloudops-infra and merge it with bigquery-etl metadata, preferring namespaces.tfvars.json. This makes it somewhat safer to apply from a permissions perspective, but logic will eventually need to be put in place to avoid accidentally unrestricting datasets via bqetl metadata. For the first pass, the datasets we're trying to add will probably all be mozilla-confidential, and that can even be enforced by an ops logic override.
- As long as the bigquery-etl branch is never deployed in such a way that it relies on ingestion (live/stable) datasets that haven't been added to generated-schemas, this should be safe to deploy without dependency issues. The schemas build job will poll both branches and, when either is updated, pull in the latest version of both. Schemas deploys can take a long time (30+ min), so it may make sense to do something smart here when both branches are updated in close proximity.
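The merge step described in the notes above can be sketched as follows. This is a minimal illustration, assuming each dataset_metadata.yaml has already been parsed into a dict keyed by dataset name; the function name and dict shapes are hypothetical, and the real shim lives in cloudops-infra.

```python
# Minimal sketch of merging bigquery-etl dataset metadata into the
# existing namespaces.tfvars.json content, preferring the latter so
# that applying the merged result cannot loosen an ACL that ops logic
# already defines. Dict shapes here are illustrative, not the real schema.

def merge_dataset_metadata(tfvars_datasets, bqetl_datasets):
    """Return a combined mapping of dataset name -> config,
    with namespaces.tfvars.json entries taking precedence."""
    merged = dict(bqetl_datasets)   # start from bqetl metadata
    merged.update(tfvars_datasets)  # tfvars entries win on conflict
    return merged


tfvars = {
    # Entry that needs ACLs for an ingestion dataset.
    "telemetry": {"dataset_base_acl": "view", "ingestion_dataset": True},
}
bqetl = {
    # Conflicting entry: must NOT override the tfvars ACL.
    "telemetry": {"dataset_base_acl": "restricted"},
    # New dataset defined only in bigquery-etl metadata.
    "telemetry_derived": {"dataset_base_acl": "derived"},
}

combined = merge_dataset_metadata(tfvars, bqetl)
```

The precedence order is the safety property called out above: until validation logic exists, a bqetl-side edit can add datasets but cannot change the ACL of anything already pinned in namespaces.tfvars.json.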
Comment 2•4 years ago
Updated•4 years ago
Comment 3•4 years ago
As of https://github.com/mozilla/bigquery-etl/pull/1988 we now have dataset_metadata.yaml files for each derived and user-facing dataset in the generated-sql branch. :whd is going to integrate this in a phased approach, letting existing metadata defined in ops logic take precedence, and removing some of that configuration in stages as we verify that no unexpected changes will be made.
Comment 4•4 years ago
I have a PoC branch at https://github.com/mozilla-services/cloudops-infra/compare/dataset_metadata?expand=1 that applies cleanly (i.e. as a no-op) in all stage projects (shared, mozdata, and rally) and integrates bqetl metadata into the deployment pipeline. I plan to clean this up and get it reviewed tomorrow, after which we can begin to define new datasets using this mechanism.
Comment 5•4 years ago
https://github.com/mozilla-services/cloudops-infra/pull/3063
r? :robotblake since :jason is out. At this point it's a no-op deploy and it might be nice to have some datasets defined exclusively in bqetl (e.g. bug #1708166) for testing purposes.
Comment 6•4 years ago
The PR was reviewed today and I will land it tomorrow, after which I think this bug can be closed.
Comment 7•4 years ago
The PR has been deployed successfully.
Updated•3 years ago
Updated•2 years ago