Programmatically delay copy-deduplicate when pipeline ingestion latency is detected
Categories
(Data Platform and Tools :: Monitoring & Alerting, enhancement, P3)
Tracking
(Not tracked)
People
(Reporter: whd, Unassigned)
References
Details
(Whiteboard: [dataplatform])
Motivation in https://mozilla-hub.atlassian.net/browse/DSRE-557
I think we could implement a sensor in Airflow that delays copy-deduplicate until the oldest unacked message age on the decoder and sink Pub/Sub subscriptions is sufficiently low. Airflow has, or should have, access to Pub/Sub metrics to compute this latency. This isn't going to catch all cases, but it should catch a large swath of the major ones that would otherwise require us to re-run copy-deduplicate and downstream jobs. One followup from bug #1759032 is that re-running copy-deduplicate and all downstream jobs requires specific knowledge beyond the Airflow DAG itself, so having SRE re-run ETL after an ingestion latency event would require transferring that knowledge.
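A rough sketch of the gating logic such a sensor could use, kept library-free for illustration. The function and parameter names here are hypothetical, the 10-minute threshold is a placeholder, and `fetch_oldest_unacked_age` stands in for a real Cloud Monitoring query against the `pubsub.googleapis.com/subscription/oldest_unacked_message_age` metric; in Airflow this loop would instead be the poke method of a custom sensor or a `PythonSensor` callable.

```python
import time

# Hypothetical threshold: treat ingestion as "caught up" when the oldest
# unacked Pub/Sub message across the decoder + sink subscriptions is
# under 10 minutes old. Not a tuned production value.
MAX_UNACKED_AGE_SECONDS = 600


def ingestion_caught_up(oldest_unacked_age_seconds,
                        threshold=MAX_UNACKED_AGE_SECONDS):
    """Return True when the backlog is low enough to run copy-deduplicate."""
    return oldest_unacked_age_seconds <= threshold


def wait_for_ingestion(fetch_oldest_unacked_age, poll_interval=60,
                       timeout=3600, sleep=time.sleep, clock=time.monotonic):
    """Poll the latency metric until ingestion catches up or timeout elapses.

    `fetch_oldest_unacked_age` is a stand-in for querying Cloud Monitoring
    for pubsub.googleapis.com/subscription/oldest_unacked_message_age.
    `sleep` and `clock` are injectable so the loop can be tested without
    real waiting.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if ingestion_caught_up(fetch_oldest_unacked_age()):
            return True
        sleep(poll_interval)
    # Timed out: copy-deduplicate should not start; the sensor would fail
    # or reschedule here, rather than letting downstream jobs run on
    # incomplete data.
    return False
```

An Airflow sensor in `reschedule` mode would avoid pinning a worker slot for the duration of the wait, which matters if latency events can last hours.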
Comment 1•4 years ago
There's some previous discussion on this topic from 2019 in https://github.com/mozilla/bigquery-etl/issues/493