Bug 1761102 (Open) — Opened 4 years ago, Updated 2 years ago

Programmatically delay copy-deduplicate when pipeline ingestion latency is detected

Categories

(Data Platform and Tools :: Monitoring & Alerting, enhancement, P3)

Tracking

(Not tracked)

People

(Reporter: whd, Unassigned)

References

Details

(Whiteboard: [dataplatform])

Motivation in https://mozilla-hub.atlassian.net/browse/DSRE-557

I think we could implement a sensor in Airflow that delays copy-deduplicate until the oldest unacked message age across the decoder and sink Pub/Sub subscriptions is sufficiently low. Airflow has, or should have, access to the Pub/Sub metrics needed to compute this latency. This isn't going to catch all cases, but it should catch a large swath of the major ones that would otherwise require us to re-run copy-deduplicate and downstream jobs. One follow-up from bug #1759032 is that re-running copy-deduplicate and all downstream jobs requires specific knowledge beyond the simple Airflow DAG, so having SRE re-run ETL after an ingestion latency event would require having that specific knowledge.
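A minimal sketch of what the sensor's gating logic might look like. In Airflow this callable would typically be wrapped in a `PythonSensor` that pokes until it returns true. The metric name below is the real Cloud Monitoring metric for Pub/Sub subscriptions, but the `fetch_age_s` hook, the subscription names, and the 10-minute threshold are illustrative assumptions, not the actual pipeline configuration.

```python
# Cloud Monitoring metric exposing per-subscription backlog age.
OLDEST_UNACKED_METRIC = (
    "pubsub.googleapis.com/subscription/oldest_unacked_message_age"
)


def ingestion_caught_up(oldest_unacked_age_s, threshold_s=600.0):
    """True once the oldest unacked message is younger than the threshold."""
    return oldest_unacked_age_s <= threshold_s


def poke(fetch_age_s, subscriptions, threshold_s=600.0):
    """Sensor poke: succeed only when every monitored subscription has
    caught up. fetch_age_s(sub) would query Cloud Monitoring for the
    metric above; here it is an injected callable for illustration."""
    return all(
        ingestion_caught_up(fetch_age_s(sub), threshold_s)
        for sub in subscriptions
    )
```

Usage: `poke(fetch_age_s, ["decoder", "sink"])` as the sensor's callable, so copy-deduplicate only runs once both hypothetical subscriptions report a backlog under ten minutes.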

There's some previous discussion on this topic from 2019 in https://github.com/mozilla/bigquery-etl/issues/493

Whiteboard: [data-platform-infra-wg] → [dataplatform]
Priority: -- → P3