Provision datasets and keys for Account Ecosystem Telemetry ETL
Categories
(Data Platform and Tools Graveyard :: Operations, task)
Tracking
(Not tracked)
People
(Reporter: klukas, Assigned: whd)
References
Details
As discussed in https://bugzilla.mozilla.org/show_bug.cgi?id=1659988 we are responding to the "Decryption oracle attack" vector by locking down access to the decoded ping tables for AET doctypes.
But we also have some additional needs for organizing and anonymizing derived data that we do present for analysis. A full description of the currently proposed ETL structure is in the AET Data Pipeline Security writeup.
From the Data Ops side, we will need to provision a new HMAC key that is available to pods launched by Airflow. We will use that key in some bigquery-etl queries to encrypt ecosystem_client_id and potentially other identifier fields. I would also like to have access to that HMAC key from my GCP credentials as I develop the relevant queries.
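As context for how such a key gets used, here is a minimal Python sketch of keyed pseudonymization. The key, identifier, and function name are illustrative, not taken from bigquery-etl; the point is the property we rely on: the mapping is deterministic for a fixed key but cannot be recomputed or reversed without it, so derived identifiers can be exposed while the key stays restricted.

```python
import hashlib
import hmac

def pseudonymize(identifier: str, key: bytes) -> str:
    """Return a stable, keyed pseudonym for an identifier.

    Without the key, the mapping cannot be recomputed or reversed,
    which is the property the restricted HMAC key provides for
    ecosystem_client_id.
    """
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# Illustrative values only; the real key is provisioned to Airflow pods.
key = b"example-hmac-key"
print(pseudonymize("ecosystem-client-id-123", key))
```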
I also think it makes sense to provision some BigQuery datasets specific to AET:
- account_ecosystem for user-facing views
- account_ecosystem_derived for the underlying derived tables, and
- account_ecosystem_restricted for intermediate stages of ETL that may not be sufficiently anonymized for general Mozilla access
The account_ecosystem_restricted dataset should only have read/write access for Airflow, Data Platform, and Data Ops.
:whd - Does that all make sense and seem straightforward from your perspective? I am open to suggestions on improvements to the above plan.
Reporter
Updated•5 years ago

Assignee
Comment 1•5 years ago
WRT HMAC, we can use the method developed in https://bugzilla.mozilla.org/show_bug.cgi?id=1634179 (WIP here). That PR includes switching to the beta Terraform provider for GCP, which supports table ACLs; those seem like a good fit for this temporary access.
WRT BigQuery datasets for AET: I want to move to a metadata-driven approach to dataset provisioning in the future, which I'll be writing a proposal for, but in the short term we'll create these datasets with the expected permissions using the current provisioning logic (manual terraform definition) early next week.
We discussed briefly whether this ETL work warrants project-level isolation (leveraging syndication) but decided that it doesn't, given that the _restricted dataset may not even be necessary (it will still be provisioned).
Assignee
Comment 2•5 years ago
https://github.com/mozilla-services/cloudops-infra/pull/2482/ has been reviewed, but will not be tested or deployed to stage/prod until https://github.com/mozilla/probe-scraper/pull/228 is merged (likely today or tomorrow), because stage is currently in a bad state as a result of the aforementioned issue.
One particular note that I've tabled regarding "Data Platform" access is whether that includes access via sandbox service accounts. For now it does not, and if we did decide to add it, that access would likely be read-only.
Assignee
Comment 3•5 years ago
The datasets have been deployed and the keys have been generated via the following:
CREATE TABLE airflow_query_keys.aet_prod AS
SELECT
  KEYS.NEW_KEYSET('AEAD_AES_GCM_256') AS keyset;

CREATE TABLE `moz-fx-data-shared-prod.account_ecosystem_restricted.encrypted_keys_v1` AS
SELECT
  'aet_hmac_prod' AS key_id,
  AEAD.ENCRYPT(
    (SELECT keyset FROM airflow_query_keys.aet_prod),
    '$KEY',
    'aet_hmac_prod'
  ) AS ciphertext;
:klukas confirmed his ability to query the keyset and use it:
WITH hmac_key AS (
SELECT
AEAD.DECRYPT_BYTES(
(SELECT keyset FROM `moz-fx-dataops-secrets.airflow_query_keys.aet_prod`),
ciphertext,
CAST(key_id AS BYTES)
) AS value
FROM
`moz-fx-data-shared-prod.account_ecosystem_restricted.encrypted_keys_v1`
WHERE
key_id = 'aet_hmac_prod'
)
SELECT
TO_HEX(
udf.hmac_sha256((SELECT * FROM hmac_key), CAST('hi' AS BYTES))
) AS user_id
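The udf.hmac_sha256 call above is what derived-table queries would apply to ecosystem_client_id. A property worth noting is that a keyed hash is deterministic for a fixed key, so the pseudonymous user_id still joins and aggregates correctly across tables while the raw identifier stays hidden. A minimal Python sketch of that property (the key and identifiers below are illustrative, not real values):

```python
import hashlib
import hmac

# Stand-in for the decrypted AET HMAC key; the real key lives encrypted in
# account_ecosystem_restricted.encrypted_keys_v1 and is never hard-coded.
KEY = b"illustrative-key"

def hmac_user_id(ecosystem_client_id: str) -> str:
    """Keyed pseudonym, analogous to TO_HEX(udf.hmac_sha256(key, id)) above."""
    return hmac.new(KEY, ecosystem_client_id.encode("utf-8"), hashlib.sha256).hexdigest()

# Repeated ids map to the same pseudonym, so joins and counts are preserved.
ids = ["client-a", "client-b", "client-a"]
pseudonyms = [hmac_user_id(i) for i in ids]
```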
Updated•3 years ago