Closed Bug 1664294 Opened 5 years ago Closed 5 years ago

Provision datasets and keys for Account Ecosystem Telemetry ETL

Categories

(Data Platform and Tools Graveyard :: Operations, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: klukas, Assigned: whd)

References

Details

As discussed in https://bugzilla.mozilla.org/show_bug.cgi?id=1659988 we are responding to the "Decryption oracle attack" vector by locking down access to the decoded ping tables for AET doctypes.

But we also have some additional needs for organizing and anonymizing derived data that we do present for analysis. A full description of the currently proposed ETL structure is in the AET Data Pipeline Security writeup.

From the Data Ops side, we will need to provision a new HMAC key that is available to pods launched by Airflow. We will use that key in some bigquery-etl queries to encrypt ecosystem_client_id and potentially other identifier fields. I would also like to have access to that HMAC key from my GCP credentials as I develop the relevant queries.
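
For context, the intended use of the key in bigquery-etl queries would look roughly like the following sketch. The `udf.hmac_sha256` helper appears later in this bug; the source table and column names here are illustrative, not the actual ETL:

```sql
-- Sketch: derive a stable pseudonymous ID from ecosystem_client_id
-- using the provisioned HMAC key. Table/column names are illustrative.
SELECT
  TO_HEX(
    udf.hmac_sha256(hmac_key, CAST(ecosystem_client_id AS BYTES))
  ) AS ecosystem_user_id
FROM
  account_ecosystem_restricted.some_intermediate_table
```

Because HMAC is deterministic per key, the same `ecosystem_client_id` always maps to the same pseudonymous ID, which preserves joinability in the derived tables without exposing the raw identifier.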

I also think it makes sense to provision some BigQuery datasets specific to AET:

  • account_ecosystem for user-facing views
  • account_ecosystem_derived for the underlying derived tables, and
  • account_ecosystem_restricted for intermediate stages of ETL that may not be sufficiently anonymized for general Mozilla access

The account_ecosystem_restricted dataset should only have read/write access for Airflow, Data Platform, and Data Ops.

:whd - Does that all make sense and seem straightforward from your perspective? I'm open to suggestions for improving the above plan.

Blocks: aet-pipeline

WRT HMAC, we can use the method developed in https://bugzilla.mozilla.org/show_bug.cgi?id=1634179, WIP here. That PR uses the Terraform beta provider for GCP, which supports table ACLs and seems like a good fit for this temporary access.

WRT BigQuery datasets for AET: I want to move to a metadata-driven approach to dataset provisioning in the future, which I'll be writing a proposal for, but in the short term we'll create these datasets with the expected permissions using the current provisioning logic (manual Terraform definitions) early next week.
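
As a rough illustration, a manual Terraform definition for one of these datasets might look like the sketch below. The resource names, group, and service-account reference are hypothetical; the actual cloudops-infra configuration differs:

```
# Sketch: restricted AET dataset with explicit access grants.
# Members and resource names are hypothetical.
resource "google_bigquery_dataset" "account_ecosystem_restricted" {
  project    = "moz-fx-data-shared-prod"
  dataset_id = "account_ecosystem_restricted"

  access {
    role          = "WRITER"
    user_by_email = google_service_account.airflow.email  # hypothetical reference
  }
  access {
    role           = "READER"
    group_by_email = "dataops@mozilla.com"  # hypothetical group
  }
}
```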

We briefly discussed whether this ETL work warrants project-level isolation (leveraging syndication) but decided that it doesn't, given that the _restricted dataset may not even be necessary (it will still be provisioned regardless).

https://github.com/mozilla-services/cloudops-infra/pull/2482/ has been reviewed, but will not be tested or deployed to stage/prod until https://github.com/mozilla/probe-scraper/pull/228 is merged (likely today or tomorrow), because stage is currently in a bad state as a result of the aforementioned issue.

One particular note that I've tabled regarding "Data Platform" access is whether it includes access via sandbox service accounts. For now it does not; if we later decide to add it, that access would likely be read-only.

The datasets have been deployed and the keys have been generated via the following:

-- $KEY is substituted by the invoking shell before these statements run.
CREATE TABLE airflow_query_keys.aet_prod AS
SELECT
  KEYS.NEW_KEYSET('AEAD_AES_GCM_256') AS keyset;

CREATE TABLE `moz-fx-data-shared-prod.account_ecosystem_restricted.encrypted_keys_v1` AS
SELECT
  'aet_hmac_prod' AS key_id,
  AEAD.ENCRYPT((SELECT keyset FROM airflow_query_keys.aet_prod), '$KEY', 'aet_hmac_prod') AS ciphertext;

:klukas confirmed his ability to query the keyset and use it:

WITH hmac_key AS (
  SELECT
    AEAD.DECRYPT_BYTES(
      (SELECT keyset FROM `moz-fx-dataops-secrets.airflow_query_keys.aet_prod`),
      ciphertext,
      CAST(key_id AS BYTES)
    ) AS value
  FROM
    `moz-fx-data-shared-prod.account_ecosystem_restricted.encrypted_keys_v1`
  WHERE
    key_id = 'aet_hmac_prod'
)
SELECT
  TO_HEX(
    udf.hmac_sha256((SELECT * FROM hmac_key), CAST('hi' AS BYTES))
  ) AS user_id
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Product: Data Platform and Tools → Data Platform and Tools Graveyard