Closed Bug 1587519 Opened 5 years ago Closed 4 years ago

Delete telemetry data upon request

Categories

(Data Platform and Tools :: General, enhancement, P1)

Points: 5

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Assigned: relud)

References

(Blocks 1 open bug)

Details

When we receive a deletion request as in Bug 1585410 or Bug 1587095, delete the corresponding data from long term storage.

Assignee: nobody → dthorn
Blocks: 1598720
Points: --- → 5
Priority: -- → P1

Implementation now supports:

  • recording state in a BigQuery table to prevent repeated work when resumed
  • handling main_v4 using either DELETE statements or SELECT statements
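A minimal sketch of the resumable-state idea above, not the actual script. It uses an in-memory SQLite table as a stand-in for the BigQuery state table so that it runs anywhere. The table layout, the task identifiers, and the helper names init_state and run_pending are all hypothetical.

```python
import sqlite3


def init_state(conn):
    # Hypothetical layout: one row per completed task.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS shredder_state (task_id TEXT PRIMARY KEY)"
    )
    conn.commit()


def run_pending(conn, tasks, run_task):
    """Run only the tasks not yet recorded as done; return the ids that ran."""
    done = {row[0] for row in conn.execute("SELECT task_id FROM shredder_state")}
    ran = []
    for task_id in tasks:
        if task_id in done:
            continue  # finished in an earlier run, skip on resume
        run_task(task_id)
        # Record completion only after the task succeeded, and commit at once
        # so an interruption loses at most the task that was in flight.
        conn.execute("INSERT INTO shredder_state (task_id) VALUES (?)", (task_id,))
        conn.commit()
        ran.append(task_id)
    return ran
```

Calling run_pending a second time with the same task list does no work, which is the property that makes it safe to restart an interrupted run.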

I am running the script manually to process the initial deletes, and I will schedule the next round via Airflow.

Deletes are currently running in the relud-17123 project for --start-date=2019-11-13 through --end-date=2020-01-22. I have copied telemetry_derived.main_summary_v4 and telemetry_stable.deletion_request_v4 there from moz-fx-data-shared-prod. The state table in use is relud-17123.test.shredder_state.
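For illustration only, here is one way a date range like the one above could be split into per-day units of work. It assumes that the end date is exclusive and that work is split by daily partition; neither is confirmed here. The helper name daily_partitions is hypothetical.

```python
from datetime import date, timedelta


def daily_partitions(table, start_date, end_date):
    """Yield one partition id per day in [start_date, end_date)."""
    day = start_date
    while day < end_date:  # assumption: end_date is exclusive
        # "$YYYYMMDD" is the BigQuery partition decorator format.
        yield f"{table}${day:%Y%m%d}"
        day += timedelta(days=1)


partitions = list(
    daily_partitions("main_summary_v4", date(2019, 11, 13), date(2020, 1, 22))
)
```

Each id could then serve as a key in the state table, so that a resumed run skips partitions that were already handled.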

When that completes I will do some verification, then copy main_summary_v4 to prod and run deletes on everything else except telemetry_stable.main_v4.

Deleting from main_v4 is blocked on confirming with operations that on-demand pricing is an acceptable solution short-term, until we find a way to do deletes more cost-efficiently with reserved slots.

Deleting from everything else will require increasing our slot reservation from 500 to 1000 in order to complete within 2 weeks.

Current Glean tables are now configured for deletion requests.

Rows in main_summary_v4 between 2019-10-25 and 2020-01-21 with corresponding deletion requests sent before 2020-01-22 have now been deleted.
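As a rough sketch of the kind of statement this describes, the function below builds a BigQuery DELETE for rows in a date range whose client has a deletion request sent before the cutoff. The column names client_id, submission_date and submission_timestamp are assumptions, as is the function name build_delete_query. The real queries may differ.

```python
from datetime import date


def build_delete_query(target, requests, start_date, end_date, id_column="client_id"):
    """Build a DELETE for [start_date, end_date) using requests sent before end_date."""
    start, end = start_date.isoformat(), end_date.isoformat()
    return (
        f"DELETE FROM `{target}`\n"
        f"WHERE {id_column} IN (\n"
        f"  SELECT DISTINCT {id_column}\n"
        f"  FROM `{requests}`\n"
        # Assumed column: timestamp at which the deletion request was received.
        f"  WHERE DATE(submission_timestamp) < DATE '{end}'\n"
        f")\n"
        # Assumed column: the date partition column of the target table.
        f"AND submission_date >= DATE '{start}'\n"
        f"AND submission_date < DATE '{end}'"
    )


query = build_delete_query(
    "moz-fx-data-shared-prod.telemetry_derived.main_summary_v4",
    "moz-fx-data-shared-prod.telemetry_stable.deletion_request_v4",
    date(2019, 10, 25),
    date(2020, 1, 22),
)
```

With an exclusive end date of 2020-01-22, this covers rows through 2020-01-21 and requests sent before 2020-01-22, matching the range stated above.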

I found a bug in the query handling the first round of deletes from main_summary_v4 for dates before 2019-10-25. If the bug does not impact performance then it will take about 9 days to re-run this.

Other tables will be handled after main_summary_v4 is complete, with the exception of main_v4, which I will run manually in a separate project with on-demand pricing while I investigate more cost-effective solutions.

This is now run every 28 days from Airflow.
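To make the 28-day cadence concrete, this small sketch computes consecutive 28-day windows using only the standard library. The first run date used here is hypothetical, and the actual Airflow schedule definition is not shown in this bug.

```python
from datetime import date, timedelta

INTERVAL = timedelta(days=28)


def run_windows(first_start, count):
    """Return `count` consecutive (start, end) windows, each 28 days long."""
    return [
        (first_start + i * INTERVAL, first_start + (i + 1) * INTERVAL)
        for i in range(count)
    ]


# Hypothetical first window start, chosen for illustration only.
windows = run_windows(date(2020, 1, 22), 3)
```

Each window ends where the next begins, so every deletion request falls into exactly one run.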

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED