Open Bug 1633928 Opened 5 years ago Updated 3 months ago

Formulate and document data pipeline source data deletion policies

Tracking

(Not tracked)

Status:

NEW

People

(Reporter: whd, Assigned: mreid)

References

Details

(Whiteboard: [dataplatform])

Wesley Dawson [:whd]

Reporter

Description

•

5 years ago

(Note this differs from data retention practices, and is specifically about defining policies around deleting production source datasets)

This policy was very simple in AWS: we didn't delete source datasets. However, it's become clear that in some cases this is desirable in our re-architected platform on GCP. Less clear is what deleting source datasets specifically entails. See bug #1626490, bug #1633525, and bug #1631849 for motivating examples.

Some things that should be considered:

defining eligibility of data to be deleted, including manager approval or similar "data deletion review" process
client behavior
data lineage
support for multiple deletion schemes: dropping at edge, delete schemas/BQ tables, using UnwantedDataException
support for historical data/dataset archival policies
documentation/communication around when datasets are deleted or deprecated
operational procedures surrounding deletion

My personal take is that we should generally not delete source datasets and instead create UnwantedDataExceptions to stop ingesting data we no longer care about, leaving historical data as-is. Source dataset deletion can be viewed as a kind of incompatible schema change as described in bug #1627722, and the prescription in comment #1 therein would also apply (albeit to a lesser degree). "Removing pings and associated data" can also be viewed under the existing framework for flagging and rejecting unwanted data and deletion could be implemented using the existing mechanisms.

In the event we do develop a general policy for deleting datasets, paramount to me from a procedural perspective is that some kind of manager (likely :mreid) sign-off is involved.

From an operations perspective, it is significantly easier to reason about data pipeline processing and deployment around data deletion when we know that BigQuery source datasets pending deletion are guaranteed not to receive new data, something that UnwantedDataException could supply. Source dataset deletion could thus be a two-step process (flag as unwanted in decoder, then remove tables via MPS) whereby historical data could be deleted some time after it is flagged as unwanted. The general issue is that normal schemas deploys require bigquery to be updated first, while in the dataset removal case the decoder needs to be updated first (and must contain no schemas updates except deletions).

Until we have a better policy in place, I'm going to NI :mreid whenever I see a bug involving deletion, or if a delete propagates to stage (which halts the pipeline), and perform any schema updates involving deletes manually with full drains.

Frank Bertsch [:frank]

Comment 1

•

5 years ago

Operations Perspective
I'm happy with the suggestion of adding these pings as UnwantedDataExceptions. It does seem ideal to do that first and then delete the schema afterwards. It might actually be good to centralize the pings we deleted in one location (probably MPS) so they can be easily surfaced later.

Policy Perspective
We don't really have owners of pings right now. I do agree that this makes things murky w.r.t. when and who we refer to to see if they are eligible for removal.

We do have the existing dataset deletion process, which could work here as well (we are deleting a dataset, just not a derived one).

If it is agreed upon by all parties to delete the dataset, my opinion is we should definitely do it. Old, stale data has little usefulness, and creates both operational and cognitive burden on the data platform.

Mark Reid [:mreid]

Assignee

Updated

•

5 years ago

Points: --- → 2

Priority: -- → P3

Mark Reid [:mreid]

Assignee

Updated

•

5 years ago

Assignee: nobody → mreid

Wesley Dawson [:whd]

Reporter

Updated

•

3 years ago

Updated

•

3 months ago

Whiteboard: [dataplatform]

Jira Integration Bot

Updated

•

3 months ago

See Also: → https://mozilla-hub.atlassian.net/browse/DENG-9244

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Formulate and document data pipeline source data deletion policies

Categories

(Data Platform and Tools :: General, task, P3)

Tracking

(Not tracked)

People

(Reporter: whd, Assigned: mreid)

References

Details

(Whiteboard: [dataplatform])

Crash Data

Security

(public)

User Story

Description

Comment 1

Updated

Updated

Updated

Updated

Updated