Closed Bug 1671934 Opened 4 years ago Closed 4 years ago

Provision mozdata project and BigQuery datasets

Categories

(Data Platform and Tools Graveyard :: Operations, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: klukas, Assigned: whd)

References

Details

The first step for implementing the Project Layout for BigQuery Users (mozdata) proposal is to provision the new project named mozdata and to update the ops logic to create datasets with appropriate permissions.

The set of datasets in mozdata should be nearly identical to the datasets present in derived-datasets, including a user-writable analysis dataset and user-readable per-namespace datasets for publishing user-facing views.

Once this is in place, we can start testing any logic modifications needed to be able to publish views into this new dataset.

I've started some preliminary cleanup work to consolidate dataset management in https://github.com/mozilla-services/cloudops-infra/pull/2654 and https://github.com/mozilla-services/cloudops-infra/pull/2658.

One benefit of this cleanup is that it forces categorization of datasets into those that are user-facing, and those that are not, which is important for deciding which datasets to provision in mozdata. This should mostly correspond to bq ls --max_results 1000 --project_id moz-fx-data-derived-datasets, however there are a couple of datasets defined in derived-datasets that aren't defined in shared-prod, so those need to be reconciled (here listed with my best guess at reconciliation):

  • backfill (delete)
  • org_mozilla_fogotype (delete)
  • org_mozilla_samples_glean(delete)
  • search_staging (delete?)
  • spark_bigquery_staging_us (delete)
  • taskclusteretl (copy?)
  • validation (copy?)

As for datasets defined in shared-prod, the general goal is to create a view (no suffix) dataset in mozdata for any namespace configuration that contains configuration for a "view" dataset.

Here are some of the ambiguous ones, or ones that I would like additional verification from somebody else about:

  • activity_stream_bi: https://bugzilla.mozilla.org/show_bug.cgi?id=1616895
    My assumption is that this is an internal dataset meant to grant access to data for hosted services external to Mozilla, and should therefore not be replicated to mozdata. It's also not in derived-datasets which indicates the same. As a side note: this was configured before table ACLs were in use and we could probably revisit how this is structured to leverage table ACLs.

  • amo_dev and amo_prod: https://bugzilla.mozilla.org/show_bug.cgi?id=1633746
    After re-reading the bug (particularly https://bugzilla.mozilla.org/show_bug.cgi?id=1633746#c18) I'm not sure if this dataset is meant to be user-facing or not. I originally thought the dataset was provisioned for usage by AMO (as specific AMO service accounts are granted access to it). Since this dataset contains both views and tables, depending on whether this is meant to be user-facing or now, we may want to split this into user-facing datasets and non-user-facing datasets. The current configuration is assuming non-user-facing, but a dataset exists in derived-datasets.

  • External tables: https://bugzilla.mozilla.org/show_bug.cgi?id=1588654
    I'm assuming these are internal and shouldn't be replicated to mozdata (they don't exist in derived-datasets).

  • internet_outages: https://bugzilla.mozilla.org/show_bug.cgi?id=1645147
    I think this is meant to be user-facing but should probably be split into a user-facing views dataset and an internal dataset. This could technically be syndicated from moz-fx-data-shared-prod into mozdata but we probably want to think a bit on the structure of things here. Notably because it contains data it doesn't exist in derived-datasets but is probably still meant to be user-facing.

  • monitoring
    Contains both tables and views, generally for operational purposes but should perhaps be user-facing as well. Once again could probably be syndicated, but we should pin down a desired structure and stick to it.

  • revenue_derived: https://bugzilla.mozilla.org/show_bug.cgi?id=1644474
    I'm not sure what to make of this but it has default (mozilla-all) permissions right now. I'm assuming it doesn't need to be user-facing given it's got derived in the name.

  • static
    Similarly we should decide whether this should be syndicated or else how we want to handle this dataset.

  • udf udf_js udf_legacy
    I believe these will eventually be removed from shared entirely in favor of the mozdata versions.

Basically, we need to decide on how we want to handle user-facing datasets that are not exclusively views, and whether we want mozdata to be clean (no data, only views) or also include data in some cases. I've been using view to mean user-facing" in ops configuration but we may want to separate the notion of "user-facing" from view if mozdata isn't clean.

activity_stream_bi

Indeed, this does not need to be replicated to mozdata. The Mode user account accessing the tables in shared-prod directly. I also concur that given table-level ACLs, we could move these derived tables back into activity_stream_derived and grant the Mode user access to just the tables it needs, removing activity_stream_bi.

?ni :kirill for confirmation that Mode still needs to look at these tables.

amo_dev and amo_prod

These are intended primarily for use by the AMO stats application rather than analytics users. This is another case where table-level ACLs would allow us to remove a special case.

Ideally, we would provision amo_stats_derived for tables and amo_stats for user-facing views, and we would user table-level ACLs to grant the prod and dev AMO apps to just the appropriate tables and views within these datasets.

Absent such a refactor, it would be nice to still publish the views in these datasets to mozdata.

External tables

Agreed. These do not need to be replicated to mozdata.

internet_outages

This looks similar to amo_prod and amo_dev in that it contains both tables and views because we wanted to grant external access at the dataset level. Using table-level ACLs, this can be refactored to be a normal derived and user-facing pair.

monitoring

I don't see any particular reason why we can't factor out the tables here to a monitoring_derived dataset.

revenue_derived

?ni :frank. Does this dataset need to exist? Are we writing derived data to a separate project instead?

static

Static tables are small and published by hand when they are created or changed. We should update docs to make it clear that they should be published to all relevant projects. My assumption is that we will simply duplicate them everywhere they're needed.

In additional, there are some tables that we have "archived" to shared-prod.static on a one-off basis. I think that can continue to be the case, as these are not expected to be useful for analysis so don't need to be duplicated everywhere.

udf udf_js udf_legacy

I would like to continue to publish these to all projects, at least for the time being. We can make a decision further down the road about whether to remove from shared-prod.

Basically, we need to decide on how we want to handle user-facing datasets that are not exclusively views, and whether we want mozdata to be clean (no data, only views) or also include data in some cases. I've been using view to mean user-facing" in ops configuration but we may want to separate the notion of "user-facing" from view if mozdata isn't clean.

I anticipate that the only cases where we have concrete tables in mozdata, they are managed by hand rather than by ops logic / Airflow. mozdata will have user-writeable analysis dataset(s) and it will have the static dataset where data platform engineers can publish tables as needed. I don't expect the ops logic to handle any republishing of tables, only views.

Flags: needinfo?(kirill)
Flags: needinfo?(fbertsch)

Hi Jeff! Mode does indeed still need access to this data. It's our only way of combining Pocket data with Mozilla data in a Python notebook (without actually moving Mozilla data into a Pocket system).

Flags: needinfo?(kirill)

The Mode user account accessing the tables in shared-prod directly. I also concur that given table-level ACLs, we could move these derived tables back into activity_stream_derived and grant the Mode user access to just the tables it needs, removing activity_stream_bi.

One thing to note here is that we haven't sorted out the general case WRT how table ACLs look with _derived datasets (since data there is largely managed by airflow, which doesn't have access to write table ACLs). For the specific cases such as this where we are obviating a separate dataset we can do that without requiring full ops management via terraform and some metadata/configuration similar to how namespace datasets are currently managed.

As an aside, depending on how external vendors' code works, table ACLs may be insufficient (as we've seen with e.g. GCS). For instance, if they perform dataset-level list operations for some reason, this will fail without metadataviewer on the dataset. Additionally I'm not sure if we want to be setting up table ACLs for e.g. static.normal_distribution granting access to external service accounts.

Regardless of any of the above activity_stream_bi shouldn't be replicated to mozdata and refactoring ACLs there can be handled separately.

revenue_derived

?ni :frank. Does this dataset need to exist? Are we writing derived data to a separate project instead?

Indeed, we are writing client_ltv there. It probably makes sense to keep revenue_derived as all revenue datasets, Sunah may be working on some more in the coming months.

Flags: needinfo?(fbertsch)

The majority of the logic for provisioning mozdata datasets is implemented in https://github.com/mozilla-services/cloudops-infra/pull/2787. It includes better classification metadata to determine which types of datasets to propagate. I plan to add to that PR the relatively simpler project scaffolding and associated jenkins automation to build out the mozdata project+views in early January.

Status: NEW → ASSIGNED
Points: --- → 2
Priority: -- → P1
QA Contact: dparfitt → jthomas

Stage (project id mozdata-nonprod) is configured in https://github.com/mozilla-services/cloudops-infra/compare/mozdata...mozdata_project?expand=1, currently only with datasets and views provisioning. There is some new logic for the prod deployment to only deploy views datasets that I expect to evolve as mozdata (prod) is provisioned and bigquery-etl/telemetry-airflow are updated to publish to it.

I still need to define the resources module for e.g. project-level permissions and replacement re:dash data source service accounts. The project proposal includes a GCS bucket but left the name unspecified. I'm going to go with gs://mozdata-analysis unless another name is preferred (since this is meant to be user-facing I'm not going with the more traditional moz-fx-data-prod prefix). After provisioning the resources module I plan to provision prod, get review, and consider mozdata ready for data eng to populate. I expect there to be additional followup ops work e.g. around metadata as we consolidate user-facing datasets that can use table-ACLs instead of separate datasets.

moz-fx-data-derived-datasets currently explicitly includes telemetry_derived (possibly because it is needed for the views publishing script to succeed), but I'm not going to propagate that. I'm additionally ignoring all datasets from comment #1 with the expectation that redefine them the standard way as we migrate.

gs://mozdata-analysis sounds great.

I will plan to prioritize getting views publishing working as soon as I hear that the mozdata project is ready to populate.

mozdata has been provisioned and should mostly be in a state that engineering has access to begin populating it. However, infrastructure logic is still being finalized as mozdata ended up requiring two relatively large/interface-breaking changes:

  1. New mechanism for automating terraform apply on modules to safely pick up ACL updates
    This can be seen as a generalization of the existing logic for propagating ACLs to BQ, and will be discussed in the ops staff meeting tomorrow. We'll likely extend usage of this to other ops-managed projects and automatically apply ACL updates to sandbox projects as well.
  2. Syndication support for multiple syndication projects
    Previously we assumed a 1:1 syndicated:syndicate relationship, and that will likely be de facto true in the future, but during the shared-prod:mozdata transition we'll likely want the same syndication in both projects. Currently all syndicated datasets (just glam_etl) are user-facing.

Here's a list of authorized views that will need to be republished (it may be worth migrating some of these to google_bigquery_syndicated_dataset, but they currently all exist directly in user-facing datasets):

Skipping authorized view definition /tmp/sql/moz-fx-data-shared-prod/firefox_accounts/fxa_amplitude_email_clicks/view.sql
Skipping authorized view definition /tmp/sql/moz-fx-data-shared-prod/pocket/pocket_reach_mau/view.sql
Skipping authorized view definition /tmp/sql/moz-fx-data-shared-prod/telemetry/buildhub2/view.sql
Skipping authorized view definition /tmp/sql/moz-fx-data-shared-prod/activity_stream/tile_id_types/view.sql

Re:dash accounts have been provisioned but not put through the access-groups loop described in https://github.com/mozilla-services/cloudops-infra/pull/2749#issuecomment-741277440 (this will happen post review+merge of both mozdata PRs). One of the applications of the new ACL update mechanism above is to automate these steps.

https://github.com/mozilla-services/cloudops-infra/pull/2808/ has been filed. Some propagation will be stalled while branches are being reviewed and merged. There are likely small additional permissions that will need to be updated as we go, and possibly other resources that are currently shared-prod only such as udf_js_lib.

The above PRs has been merged and mozdata should be fully integrated into the ops deployment pipeline.

The work of provisioning _derived datasets mentioned as cleanup in comment #1 e.g. https://github.com/mozilla/bigquery-etl/pull/1677 can be tracked as part of the views population efforts.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Product: Data Platform and Tools → Data Platform and Tools Graveyard
You need to log in before you can comment on or make changes to this bug.