Closed Bug 1550814 Opened 6 years ago Closed 6 years ago

Remove data collected during hotfix rollout from GCP

Categories

(Data Platform and Tools :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kparlante, Assigned: klukas)

Please remove desktop telemetry data and activity stream data for the affected period, from 2019-05-04T11:00:00Z to 2019-05-11T11:00:00Z.
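For reference, a minimal Python sketch of which hours and UTC dates that window touches. The function names are hypothetical, and treating the end boundary as exclusive is an assumption; the request does not say either way.

```python
from datetime import datetime, timedelta, timezone

# Window boundaries copied from the request above.
WINDOW_START = datetime(2019, 5, 4, 11, 0, tzinfo=timezone.utc)
WINDOW_END = datetime(2019, 5, 11, 11, 0, tzinfo=timezone.utc)


def affected_hours(start=WINDOW_START, end=WINDOW_END):
    """Return the start of each hour in [start, end).

    Assumption: the end boundary is exclusive; the bug does not specify.
    """
    hours = []
    current = start
    while current < end:
        hours.append(current)
        current += timedelta(hours=1)
    return hours


def affected_dates(start=WINDOW_START, end=WINDOW_END):
    """Return the sorted UTC calendar dates containing at least one affected hour."""
    return sorted({hour.date() for hour in affected_hours(start, end)})


if __name__ == "__main__":
    print(len(affected_hours()), "hours")
    print([d.isoformat() for d in affected_dates()])
```

Because the window starts and ends mid-day, it spans parts of eight UTC dates, so any deletion done at whole-day granularity would cover more than the hours strictly requested.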

Assignee: nobody → jklukas
Blocks: 1550787

Take a backup of affected data

Created archive tables in BQ:

moz-fx-data-derived-datasets:static.archive_20190516_clients_last_seen_v1
moz-fx-data-derived-datasets:static.archive_20190516_firefox_desktop_exact_mau28_by_dimensions_v1

These are the only two derived datasets generated within BQ. Other derived data is copied from AWS and we can recover from the backups we're making in AWS.

We are also not going to back up data in GCS or "live" BQ tables, since AWS is still the source of truth for these and we have no production dependencies so far on completeness of the GCP pipeline data.

Test restore

I have verified that the size of each archive table matches the size of the corresponding real table and that we can run queries on the archive tables. This is sufficient to restore the BQ data we are going to delete.

Delete the Data: BigQuery

Affected data has been deleted from BQ as codified in delete-from-bq.sh in https://github.com/mozilla/bigquery-etl/pull/131

Currently, a script is running to repopulate clients_last_seen_v1 from 2019-05-04 up to present. Each day of that dataset depends on the previous day, so we had to delete all the way up to the present and now repopulate based on clients_daily with the affected period removed.
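To illustrate why the repopulation has to run in date order, here is a toy Python sketch of a table whose rows for each day are derived from the previous day's rows plus that day's activity. This is not the actual clients_last_seen_v1 query; the "days since last seen" state, the function names, and the sample data are all assumptions made for illustration.

```python
from datetime import date, timedelta


def date_range(start, end):
    """Yield each date from start to end inclusive, in ascending order."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def next_last_seen(previous, active_today):
    """Compute one day's toy state from the previous day's state.

    previous: dict of client_id -> days since the client was last seen.
    active_today: set of client_ids present in that day's daily table.
    """
    # Every known client ages by one day...
    state = {client: days + 1 for client, days in previous.items()}
    # ...and clients active today reset to zero.
    for client in active_today:
        state[client] = 0
    return state


def rebuild(initial_state, daily_activity, start, end):
    """Rebuild every day in order; days with no rows (deleted data) count as empty."""
    state = initial_state
    rebuilt = {}
    for day in date_range(start, end):
        state = next_last_seen(state, daily_activity.get(day, set()))
        rebuilt[day] = state
    return rebuilt


if __name__ == "__main__":
    # Hypothetical data: the affected days have no rows, activity resumes on 2019-05-12.
    activity = {date(2019, 5, 12): {"a", "b"}}
    result = rebuild({"a": 0}, activity, date(2019, 5, 10), date(2019, 5, 13))
    for day, state in result.items():
        print(day.isoformat(), state)
```

In this model, changing the input for any one day changes the output for every later day, which is why deleting the affected period forces a rebuild through to the present.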

Validating Data Deletion: BigQuery

The affected tables all show 0 bytes for the deleted region, based on the query preview results for queries like:

SELECT
  *
FROM
  `moz-fx-data-derived-datasets.telemetry.clients_daily_v6`
WHERE
  submission_date BETWEEN "2019-05-04"
  AND "2019-05-19"

Delete the Data: GCS

Data has been deleted for the affected times in the following paths:

delete_hours gs://moz-fx-data-stage-data/structured-decoded
delete_hours gs://moz-fx-data-stage-data/structured-bq-sink-error

Listings verify that the affected dates are gone.

Six more paths are in progress right now.

Delete the Data: GCS

Data has been deleted for all GCS paths in scope for the deletion, and I've verified in listings that the affected dates are gone.

Delete Backups

The only deleted data for which we were retaining backups on GCP was the clients_last_seen table. The backup has now been deleted.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED