Provision mozdata project and BigQuery datasets
Categories
(Data Platform and Tools Graveyard :: Operations, task, P1)
Tracking
(Not tracked)
People
(Reporter: klukas, Assigned: whd)
References
Details
The first step for implementing the Project Layout for BigQuery Users (mozdata) proposal is to provision the new project named mozdata
and to update the ops logic to create datasets with appropriate permissions.
The set of datasets in mozdata
should be nearly identical to the datasets present in derived-datasets
, including a user-writable analysis
dataset and user-readable per-namespace datasets for publishing user-facing views.
Once this is in place, we can start testing any logic modifications needed to be able to publish views into this new dataset.
Assignee | ||
Comment 1•4 years ago
|
||
I've started some preliminary cleanup work to consolidate dataset management in https://github.com/mozilla-services/cloudops-infra/pull/2654 and https://github.com/mozilla-services/cloudops-infra/pull/2658.
One benefit of this cleanup is that it forces categorization of datasets into those that are user-facing, and those that are not, which is important for deciding which datasets to provision in mozdata
. This should mostly correspond to bq ls --max_results 1000 --project_id moz-fx-data-derived-datasets
, however there are a couple of datasets defined in derived-datasets
that aren't defined in shared-prod
, so those need to be reconciled (here listed with my best guess at reconciliation):
backfill
(delete)org_mozilla_fogotype
(delete)org_mozilla_samples_glean
(delete)search_staging
(delete?)spark_bigquery_staging_us
(delete)taskclusteretl
(copy?)validation
(copy?)
As for datasets defined in shared-prod, the general goal is to create a view (no suffix) dataset in mozdata for any namespace configuration that contains configuration for a "view" dataset.
Here are some of the ambiguous ones, or ones that I would like additional verification from somebody else about:
-
activity_stream_bi
: https://bugzilla.mozilla.org/show_bug.cgi?id=1616895
My assumption is that this is an internal dataset meant to grant access to data for hosted services external to Mozilla, and should therefore not be replicated tomozdata
. It's also not inderived-datasets
which indicates the same. As a side note: this was configured before table ACLs were in use and we could probably revisit how this is structured to leverage table ACLs. -
amo_dev
andamo_prod
: https://bugzilla.mozilla.org/show_bug.cgi?id=1633746
After re-reading the bug (particularly https://bugzilla.mozilla.org/show_bug.cgi?id=1633746#c18) I'm not sure if this dataset is meant to be user-facing or not. I originally thought the dataset was provisioned for usage by AMO (as specific AMO service accounts are granted access to it). Since this dataset contains both views and tables, depending on whether this is meant to be user-facing or now, we may want to split this into user-facing datasets and non-user-facing datasets. The current configuration is assuming non-user-facing, but a dataset exists inderived-datasets
. -
External tables: https://bugzilla.mozilla.org/show_bug.cgi?id=1588654
I'm assuming these are internal and shouldn't be replicated to mozdata (they don't exist inderived-datasets
). -
internet_outages
: https://bugzilla.mozilla.org/show_bug.cgi?id=1645147
I think this is meant to be user-facing but should probably be split into a user-facing views dataset and an internal dataset. This could technically be syndicated frommoz-fx-data-shared-prod
intomozdata
but we probably want to think a bit on the structure of things here. Notably because it contains data it doesn't exist inderived-datasets
but is probably still meant to be user-facing. -
monitoring
Contains both tables and views, generally for operational purposes but should perhaps be user-facing as well. Once again could probably be syndicated, but we should pin down a desired structure and stick to it. -
revenue_derived
: https://bugzilla.mozilla.org/show_bug.cgi?id=1644474
I'm not sure what to make of this but it has default (mozilla-all) permissions right now. I'm assuming it doesn't need to be user-facing given it's got derived in the name. -
static
Similarly we should decide whether this should be syndicated or else how we want to handle this dataset. -
udf
udf_js
udf_legacy
I believe these will eventually be removed from shared entirely in favor of themozdata
versions.
Basically, we need to decide on how we want to handle user-facing datasets that are not exclusively views, and whether we want mozdata
to be clean (no data, only views) or also include data in some cases. I've been using view
to mean user-facing" in ops configuration but we may want to separate the notion of "user-facing" from view
if mozdata
isn't clean.
Reporter | ||
Comment 2•4 years ago
|
||
activity_stream_bi
Indeed, this does not need to be replicated to mozdata. The Mode user account accessing the tables in shared-prod directly. I also concur that given table-level ACLs, we could move these derived tables back into activity_stream_derived
and grant the Mode user access to just the tables it needs, removing activity_stream_bi
.
?ni :kirill for confirmation that Mode still needs to look at these tables.
amo_dev and amo_prod
These are intended primarily for use by the AMO stats application rather than analytics users. This is another case where table-level ACLs would allow us to remove a special case.
Ideally, we would provision amo_stats_derived
for tables and amo_stats
for user-facing views, and we would user table-level ACLs to grant the prod and dev AMO apps to just the appropriate tables and views within these datasets.
Absent such a refactor, it would be nice to still publish the views in these datasets to mozdata
.
External tables
Agreed. These do not need to be replicated to mozdata.
internet_outages
This looks similar to amo_prod and amo_dev in that it contains both tables and views because we wanted to grant external access at the dataset level. Using table-level ACLs, this can be refactored to be a normal derived and user-facing pair.
monitoring
I don't see any particular reason why we can't factor out the tables here to a monitoring_derived
dataset.
revenue_derived
?ni :frank. Does this dataset need to exist? Are we writing derived data to a separate project instead?
static
Static tables are small and published by hand when they are created or changed. We should update docs to make it clear that they should be published to all relevant projects. My assumption is that we will simply duplicate them everywhere they're needed.
In additional, there are some tables that we have "archived" to shared-prod.static on a one-off basis. I think that can continue to be the case, as these are not expected to be useful for analysis so don't need to be duplicated everywhere.
udf udf_js udf_legacy
I would like to continue to publish these to all projects, at least for the time being. We can make a decision further down the road about whether to remove from shared-prod.
Basically, we need to decide on how we want to handle user-facing datasets that are not exclusively views, and whether we want mozdata to be clean (no data, only views) or also include data in some cases. I've been using view to mean user-facing" in ops configuration but we may want to separate the notion of "user-facing" from view if mozdata isn't clean.
I anticipate that the only cases where we have concrete tables in mozdata, they are managed by hand rather than by ops logic / Airflow. mozdata will have user-writeable analysis dataset(s) and it will have the static
dataset where data platform engineers can publish tables as needed. I don't expect the ops logic to handle any republishing of tables, only views.
Comment 3•4 years ago
|
||
Hi Jeff! Mode does indeed still need access to this data. It's our only way of combining Pocket data with Mozilla data in a Python notebook (without actually moving Mozilla data into a Pocket system).
Assignee | ||
Comment 4•4 years ago
|
||
The Mode user account accessing the tables in shared-prod directly. I also concur that given table-level ACLs, we could move these derived tables back into activity_stream_derived and grant the Mode user access to just the tables it needs, removing activity_stream_bi.
One thing to note here is that we haven't sorted out the general case WRT how table ACLs look with _derived
datasets (since data there is largely managed by airflow, which doesn't have access to write table ACLs). For the specific cases such as this where we are obviating a separate dataset we can do that without requiring full ops management via terraform and some metadata/configuration similar to how namespace datasets are currently managed.
As an aside, depending on how external vendors' code works, table ACLs may be insufficient (as we've seen with e.g. GCS). For instance, if they perform dataset-level list operations for some reason, this will fail without metadataviewer on the dataset. Additionally I'm not sure if we want to be setting up table ACLs for e.g. static.normal_distribution
granting access to external service accounts.
Regardless of any of the above activity_stream_bi
shouldn't be replicated to mozdata
and refactoring ACLs there can be handled separately.
Comment 5•4 years ago
|
||
revenue_derived
?ni :frank. Does this dataset need to exist? Are we writing derived data to a separate project instead?
Indeed, we are writing client_ltv
there. It probably makes sense to keep revenue_derived
as all revenue datasets, Sunah may be working on some more in the coming months.
Assignee | ||
Comment 6•4 years ago
|
||
The majority of the logic for provisioning mozdata
datasets is implemented in https://github.com/mozilla-services/cloudops-infra/pull/2787. It includes better classification metadata to determine which types of datasets to propagate. I plan to add to that PR the relatively simpler project scaffolding and associated jenkins automation to build out the mozdata project+views in early January.
Assignee | ||
Comment 7•4 years ago
|
||
Stage (project id mozdata-nonprod
) is configured in https://github.com/mozilla-services/cloudops-infra/compare/mozdata...mozdata_project?expand=1, currently only with datasets and views provisioning. There is some new logic for the prod deployment to only deploy views datasets that I expect to evolve as mozdata
(prod) is provisioned and bigquery-etl
/telemetry-airflow
are updated to publish to it.
I still need to define the resources module for e.g. project-level permissions and replacement re:dash data source service accounts. The project proposal includes a GCS bucket but left the name unspecified. I'm going to go with gs://mozdata-analysis
unless another name is preferred (since this is meant to be user-facing I'm not going with the more traditional moz-fx-data-prod
prefix). After provisioning the resources
module I plan to provision prod, get review, and consider mozdata
ready for data eng to populate. I expect there to be additional followup ops work e.g. around metadata as we consolidate user-facing datasets that can use table-ACLs instead of separate datasets.
moz-fx-data-derived-datasets
currently explicitly includes telemetry_derived
(possibly because it is needed for the views publishing script to succeed), but I'm not going to propagate that. I'm additionally ignoring all datasets from comment #1 with the expectation that redefine them the standard way as we migrate.
Reporter | ||
Comment 8•4 years ago
|
||
gs://mozdata-analysis
sounds great.
I will plan to prioritize getting views publishing working as soon as I hear that the mozdata project is ready to populate.
Assignee | ||
Comment 9•4 years ago
|
||
mozdata
has been provisioned and should mostly be in a state that engineering has access to begin populating it. However, infrastructure logic is still being finalized as mozdata
ended up requiring two relatively large/interface-breaking changes:
- New mechanism for automating
terraform apply
on modules to safely pick up ACL updates
This can be seen as a generalization of the existing logic for propagating ACLs to BQ, and will be discussed in the ops staff meeting tomorrow. We'll likely extend usage of this to other ops-managed projects and automatically apply ACL updates to sandbox projects as well. - Syndication support for multiple syndication projects
Previously we assumed a 1:1 syndicated:syndicate relationship, and that will likely be de facto true in the future, but during theshared-prod
:mozdata
transition we'll likely want the same syndication in both projects. Currently all syndicated datasets (justglam_etl
) are user-facing.
Here's a list of authorized views that will need to be republished (it may be worth migrating some of these to google_bigquery_syndicated_dataset
, but they currently all exist directly in user-facing datasets):
Skipping authorized view definition /tmp/sql/moz-fx-data-shared-prod/firefox_accounts/fxa_amplitude_email_clicks/view.sql
Skipping authorized view definition /tmp/sql/moz-fx-data-shared-prod/pocket/pocket_reach_mau/view.sql
Skipping authorized view definition /tmp/sql/moz-fx-data-shared-prod/telemetry/buildhub2/view.sql
Skipping authorized view definition /tmp/sql/moz-fx-data-shared-prod/activity_stream/tile_id_types/view.sql
Re:dash accounts have been provisioned but not put through the access-groups
loop described in https://github.com/mozilla-services/cloudops-infra/pull/2749#issuecomment-741277440 (this will happen post review+merge of both mozdata
PRs). One of the applications of the new ACL update mechanism above is to automate these steps.
Assignee | ||
Comment 10•4 years ago
|
||
https://github.com/mozilla-services/cloudops-infra/pull/2808/ has been filed. Some propagation will be stalled while branches are being reviewed and merged. There are likely small additional permissions that will need to be updated as we go, and possibly other resources that are currently shared-prod only such as udf_js_lib
.
Assignee | ||
Comment 11•4 years ago
|
||
The above PRs has been merged and mozdata should be fully integrated into the ops deployment pipeline.
The work of provisioning _derived
datasets mentioned as cleanup in comment #1 e.g. https://github.com/mozilla/bigquery-etl/pull/1677 can be tracked as part of the views population efforts.
Updated•2 years ago
|
Description
•