Open Bug 1633928 Opened 5 years ago Updated 3 months ago

Formulate and document data pipeline source data deletion policies

Categories

(Data Platform and Tools :: General, task, P3)

Points: 2

Tracking

(Not tracked)

People

(Reporter: whd, Assigned: mreid)

References

(See Also: bug 1753489)

Details

(Whiteboard: [dataplatform])

(Note: this differs from data retention practices, and is specifically about defining policies around deleting production source datasets.)

This policy was very simple in AWS: we didn't delete source datasets. However, it's become clear that in some cases this is desirable in our re-architected platform on GCP. Less clear is what deleting source datasets specifically entails. See bug #1626490, bug #1633525, and bug #1631849 for motivating examples.

Some things that should be considered:

  • defining eligibility of data for deletion, including manager approval or a similar "data deletion review" process
  • client behavior
  • data lineage
  • support for multiple deletion schemes: dropping at the edge, deleting schemas/BQ tables, using UnwantedDataException
  • support for historical data/dataset archival policies
  • documentation/communication around when datasets are deleted or deprecated
  • operational procedures surrounding deletion

My personal take is that we should generally not delete source datasets, and should instead create UnwantedDataExceptions to stop ingesting data we no longer care about, leaving historical data as-is. Source dataset deletion can be viewed as a kind of incompatible schema change as described in bug #1627722, and the prescription in comment #1 therein would also apply (albeit to a lesser degree). "Removing pings and associated data" can also be viewed under the existing framework for flagging and rejecting unwanted data, and deletion could be implemented using the existing mechanisms.
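
To make the proposal concrete, here's a minimal self-contained sketch of the kind of deny-list check I mean; the class, method, and list contents are hypothetical illustrations, not the actual gcp-ingestion code:

    import java.util.Set;

    public class UnwantedDataFilter {
      /** Thrown when a message matches a namespace/doctype flagged as unwanted. */
      public static class UnwantedDataException extends RuntimeException {
        public UnwantedDataException(String key) {
          super("unwanted data: " + key);
        }
      }

      // Hypothetical static deny-list; a real decoder would load this from config/MPS.
      private static final Set<String> UNWANTED = Set.of("telemetry/obsolete-ping");

      /** Reject messages whose namespace/doctype pair is flagged as unwanted. */
      public static void checkWanted(String namespace, String docType) {
        String key = namespace + "/" + docType;
        if (UNWANTED.contains(key)) {
          throw new UnwantedDataException(key);
        }
      }
    }

The key property is that the check happens in the decoder, before any BigQuery write, so tables pending deletion stop receiving data.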

In the event that we do develop a general policy for deleting datasets, what is paramount to me from a procedural perspective is that some kind of manager sign-off (likely from :mreid) is involved.

From an operations perspective, it is significantly easier to reason about data pipeline processing and deployment around data deletion when we know that BigQuery source datasets pending deletion are guaranteed not to receive new data, a guarantee that UnwantedDataException could supply. Source dataset deletion could thus be a two-step process (flag the data as unwanted in the decoder, then remove the tables via MPS) whereby historical data could be deleted some time after it is flagged as unwanted. The general issue is that normal schema deploys require BigQuery to be updated first, while in the dataset-removal case the decoder needs to be updated first (and the deploy must contain no schema updates other than deletions).
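
Whatever tooling ends up driving step two (likely MPS-based), the underlying operation amounts to something like the following sketch using the standard BigQuery Java client; the project, dataset, and table identifiers below are placeholders:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.TableId;

    public class DeleteUnwantedTable {
      public static void main(String[] args) {
        // Uses application default credentials; identifiers are placeholders.
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId table = TableId.of("my-project", "telemetry_stable", "obsolete_ping_v1");
        // delete() returns false if the table did not exist.
        boolean deleted = bigquery.delete(table);
        System.out.println(deleted ? "deleted " + table : table + " not found");
      }
    }

Running this only after the decoder-side flag has been deployed is what makes the two-step ordering safe.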

Until we have a better policy in place, I'm going to NI :mreid whenever I see a bug involving deletion, or if a delete propagates to stage (which halts the pipeline), and perform any schema updates involving deletes manually with full drains.

Operations Perspective
I'm happy with the suggestion of adding these pings as UnwantedDataExceptions. It does seem ideal to do that first and then delete the schema afterwards. It might actually be good to centralize the list of deleted pings in one location (probably MPS) so they can be easily surfaced later.
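
For example, assuming a simple one-entry-per-line file checked into MPS (the file format and class name here are hypothetical), consumers could load the central list like this:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.stream.Collectors;

    public class DeletedPingRegistry {
      /** Load a hypothetical central list of deleted namespace/doctype pairs. */
      public static List<String> load(Path registryFile) throws IOException {
        // One "namespace/doctype" entry per line; blanks and #-comments ignored.
        return Files.readAllLines(registryFile).stream()
            .map(String::trim)
            .filter(line -> !line.isEmpty() && !line.startsWith("#"))
            .collect(Collectors.toList());
      }
    }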

Policy Perspective
We don't really have owners for pings right now. I do agree that this makes things murky w.r.t. whom we consult, and when, to determine whether a ping is eligible for removal.

We do have the existing dataset deletion process, which could work here as well (we are deleting a dataset, just not a derived one).

If all parties agree to delete the dataset, my opinion is that we should definitely do it. Old, stale data has little usefulness and creates both operational and cognitive burden for the data platform.

Points: --- → 2
Priority: -- → P3
Assignee: nobody → mreid
See Also: → 1753489
Whiteboard: [dataplatform]