Remove data collected during hotfix rollout from GCP
Categories
(Data Platform and Tools :: General, task)
Tracking
(Not tracked)
People
(Reporter: kparlante, Assigned: klukas)
Please remove desktop telemetry data and activity stream data for the affected period: between 2019-05-04T11:00:00Z and 2019-05-11T11:00:00Z
Comment 1 (Assignee) • 6 years ago
Putting together scripts in https://github.com/mozilla/bigquery-etl/pull/131
Comment 2 (Assignee) • 6 years ago
Take a backup of affected data
Created archive tables in BQ:
moz-fx-data-derived-datasets:static.archive_20190516_clients_last_seen_v1
moz-fx-data-derived-datasets:static.archive_20190516_firefox_desktop_exact_mau28_by_dimensions_v1
These are the only two derived datasets generated within BQ. Other derived data is copied from AWS and we can recover from the backups we're making in AWS.
We are also not going to back up data in GCS or "live" BQ tables, since AWS is still the source of truth for these and we have no production dependencies so far on completeness of the GCP pipeline data.
Comment 3 (Assignee) • 6 years ago
Test restore
I have verified that the size of each archive table matches the size of the corresponding real table and that we can run queries on the archive tables. This is sufficient to restore the BQ data we are going to delete.
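The size comparison described above could be scripted along the following lines. This is a minimal sketch, not the check that was actually run: it assumes a client object exposing `get_table(table_id)` that returns `num_rows` and `num_bytes`, as `google.cloud.bigquery.Client` does. The `FakeClient` stub, the sizes, and the location of the live table in the `telemetry` dataset are illustrative assumptions.

```python
from types import SimpleNamespace


def tables_match(client, live_table_id, archive_table_id):
    """Return True when both tables report the same row and byte counts.

    `client` is assumed to expose get_table(table_id) returning an object
    with `num_rows` and `num_bytes`, as google.cloud.bigquery.Client does.
    """
    live = client.get_table(live_table_id)
    archive = client.get_table(archive_table_id)
    return (live.num_rows, live.num_bytes) == (archive.num_rows, archive.num_bytes)


class FakeClient:
    """Stand-in for a BigQuery client so the sketch runs without credentials."""

    def __init__(self, tables):
        self._tables = tables

    def get_table(self, table_id):
        return self._tables[table_id]


# Assumed location of the live table; only the archive name appears in this bug.
LIVE = "moz-fx-data-derived-datasets.telemetry.clients_last_seen_v1"
ARCHIVE = "moz-fx-data-derived-datasets.static.archive_20190516_clients_last_seen_v1"

# Hypothetical sizes, for illustration only.
fake = FakeClient({
    LIVE: SimpleNamespace(num_rows=1000, num_bytes=52000),
    ARCHIVE: SimpleNamespace(num_rows=1000, num_bytes=52000),
})
print(tables_match(fake, LIVE, ARCHIVE))
```

Against real tables, a `google.cloud.bigquery.Client()` would be passed in place of `fake`. Matching sizes do not prove the contents are identical, which is why the comment above also mentions running queries against the archives.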
Comment 4 (Assignee) • 6 years ago
Delete the Data: BigQuery
Affected data has been deleted from BQ as codified in delete-from-bq.sh in https://github.com/mozilla/bigquery-etl/pull/131
A script is currently running to repopulate clients_last_seen_v1 from 2019-05-04 up to the present. Each day of that dataset depends on the previous day, so we had to delete everything from the start of the affected period up to the present, and we are now repopulating it from clients_daily with the affected period removed.
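Because each day depends on the previous one, the repopulation has to run strictly in date order. The sketch below illustrates that ordering only. It is not the script referenced above, and `build_day` is a hypothetical callback standing in for whatever job rebuilds a single day; the end date is likewise an arbitrary example.

```python
from datetime import date, timedelta


def days_to_repopulate(start, end):
    """Yield each date from start to end inclusive, in ascending order."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def repopulate(start, end, build_day):
    """Rebuild days one at a time, passing each build the previous day's date."""
    for day in days_to_repopulate(start, end):
        build_day(day, day - timedelta(days=1))


# Record the order of builds instead of running real queries.
order = []
repopulate(
    date(2019, 5, 4),
    date(2019, 5, 7),  # hypothetical end date, for illustration
    lambda day, prev: order.append((day.isoformat(), prev.isoformat())),
)
print(order)
```

Running the days sequentially rather than in parallel is the point of the design: a day cannot be built correctly until the day before it exists.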
Comment 5 (Assignee) • 6 years ago
Validating Data Deletion: BigQuery
The affected tables all show 0 bytes for the deleted region, based on the query preview results for queries like:
SELECT
  *
FROM
  `moz-fx-data-derived-datasets.telemetry.clients_daily_v6`
WHERE
  submission_date BETWEEN "2019-05-04"
  AND "2019-05-19"
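The same check could in principle be automated with a dry-run query, on the assumption that the "query preview" figure corresponds to BigQuery's dry-run byte estimate. The sketch below builds the query shown above and, given a `google.cloud.bigquery.Client`, asks for the estimated bytes without running the query. The helper names `validation_query` and `bytes_scanned` are hypothetical.

```python
def validation_query(table, start_date, end_date):
    """Build the SELECT * query used to inspect a date range of one table."""
    return (
        "SELECT *\n"
        f"FROM `{table}`\n"
        f'WHERE submission_date BETWEEN "{start_date}" AND "{end_date}"'
    )


def bytes_scanned(client, sql):
    """Dry-run `sql` and return the bytes BigQuery estimates it would read."""
    # Imported here so the query builder works without the package installed.
    from google.cloud import bigquery  # requires google-cloud-bigquery

    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    return client.query(sql, job_config=config).total_bytes_processed


sql = validation_query(
    "moz-fx-data-derived-datasets.telemetry.clients_daily_v6",
    "2019-05-04",
    "2019-05-19",
)
print(sql)
```

With credentials available, `bytes_scanned(bigquery.Client(), sql) == 0` would express the expectation stated above for a date range that has been deleted.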
Comment 6 (Assignee) • 6 years ago
Delete the Data: GCS
Data has been deleted for the affected times in the following paths:
delete_hours gs://moz-fx-data-stage-data/structured-decoded
delete_hours gs://moz-fx-data-stage-data/structured-bq-sink-error
Listings verify that the affected dates are gone.
Six more paths are in progress right now.
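For reference, the affected period can be enumerated hour by hour as below. This is a sketch under stated assumptions, not the `delete_hours` helper itself. It treats the period as the half-open interval from 2019-05-04T11:00:00Z to 2019-05-11T11:00:00Z, and the `<date>/<hour>` suffix format is hypothetical, since the object layout under each bucket path is not given in this bug.

```python
from datetime import datetime, timedelta, timezone


def affected_hours(start, end):
    """Return every whole hour in the half-open interval [start, end)."""
    hours = []
    current = start
    while current < end:
        hours.append(current)
        current += timedelta(hours=1)
    return hours


START = datetime(2019, 5, 4, 11, tzinfo=timezone.utc)
END = datetime(2019, 5, 11, 11, tzinfo=timezone.utc)

hours = affected_hours(START, END)

# Hypothetical "<date>/<hour>" layout, for illustration only.
suffixes = [h.strftime("%Y-%m-%d/%H") for h in hours]
print(len(suffixes), suffixes[0], suffixes[-1])
# prints: 168 2019-05-04/11 2019-05-11/10
```

Seven days of hourly data gives 168 hours per path, which is also a convenient number to compare against when checking listings after the deletion.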
Comment 7 (Assignee) • 6 years ago
Delete the Data: GCS
Data has been deleted for all GCS paths in scope for the deletion, and I've verified in listings that the affected dates are gone.
Comment 8 (Assignee) • 6 years ago
Delete Backups
The only deleted data for which we were retaining backups on GCP was the clients_last_seen table. The backup has now been deleted.