[fxci-etl] Refactor `fxci-etl metric export` to process one day at a time idempotently
Categories
(Release Engineering :: General, task)
Tracking
(Not tracked)
People
(Reporter: ahal, Unassigned)
Details
Currently, this command stores the end timestamp of the last interval it processed. The next time the job runs, it uses that timestamp as the start of the next interval, so metric records never overlap.
However, in a PR to docker-etl, :akomar suggested that I instead support a `--date YYYY-MM-DD` argument that telemetry-airflow can pass in. This way, the export always processes a single day's worth of metrics at a time. It would also mean we can stop storing state, and we'd gain the ability to re-run past jobs (e.g. if we discovered a bug and needed to re-populate old data).
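As a rough illustration of the CLI change, here is a minimal sketch of parsing a `--date YYYY-MM-DD` argument with stdlib `argparse`. This is hypothetical: fxci-etl's actual CLI framework, command structure, and default behavior may differ.

```python
import argparse
from datetime import date, datetime, timedelta


def parse_args(argv=None):
    # Hypothetical sketch; fxci-etl's real CLI may be built differently.
    parser = argparse.ArgumentParser(prog="fxci-etl metric export")
    parser.add_argument(
        "--date",
        type=lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
        # Assumed default: process yesterday's (complete) day of metrics.
        default=date.today() - timedelta(days=1),
        help="Process metrics for this day (YYYY-MM-DD).",
    )
    return parser.parse_args(argv)


args = parse_args(["--date", "2024-01-15"])
print(args.date)  # 2024-01-15
```

Because the day to process comes entirely from the argument, the command no longer needs to persist a "last processed" timestamp between runs.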
One caveat is that we'd need to make sure the insertion is idempotent; that is, we can't end up with duplicate records. The approach :akomar recommends is to first DELETE all records in the partition we are inserting into, then insert the new ones. Essentially, overwrite records instead of appending them (as fxci-etl currently does).
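The delete-then-insert pattern can be sketched as follows. This is illustrative only: fxci-etl targets BigQuery date partitions, and the table name, record shape, and transaction handling here are assumptions; `sqlite3` is used purely to demonstrate why re-running the same day is safe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (day TEXT, name TEXT, value REAL)")


def export_day(conn, day, records):
    # Overwrite the day's partition instead of appending, so re-running
    # the same day never produces duplicate rows.
    with conn:  # one transaction: the delete and insert commit together
        conn.execute("DELETE FROM metrics WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO metrics VALUES (?, ?, ?)",
            [(day, r["name"], r["value"]) for r in records],
        )


records = [{"name": "runtime", "value": 1.5}, {"name": "cost", "value": 2.0}]
export_day(conn, "2024-01-15", records)
export_day(conn, "2024-01-15", records)  # re-run: still exactly 2 rows
count = conn.execute(
    "SELECT COUNT(*) FROM metrics WHERE day = '2024-01-15'"
).fetchone()[0]
print(count)  # 2
```

Running the delete and insert in a single transaction also avoids a window where the partition is empty if the job dies mid-run.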
This should be well tested on "dev" tables in moz-fx-releng-dev before landing in production.
Comment 1 • Reporter (ahal) • 2 months ago
Here's an example airflow DAG that passes in `--date`:
https://github.com/mozilla/telemetry-airflow/blob/main/dags/dap_collector.py#L64
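For context, a sketch of how the telemetry-airflow side might invoke the export. This is a hypothetical config fragment, not the real DAG: the operator class, task name, and image path are all assumptions. The only concrete piece is Airflow's built-in `{{ ds }}` template macro, which expands to the run's logical date as `YYYY-MM-DD`.

```python
# Hypothetical DAG fragment; operator, task_id, and image are assumptions.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

export_metrics = KubernetesPodOperator(
    task_id="fxci_metric_export",
    name="fxci-metric-export",
    image="fxci-etl:latest",  # placeholder; the real image path would differ
    # {{ ds }} is rendered by Airflow to the logical run date, YYYY-MM-DD.
    arguments=["fxci-etl", "metric", "export", "--date", "{{ ds }}"],
    dag=dag,
)
```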
Updated by reporter • 18 days ago (Description edited)