Closed Bug 1912067 Opened 2 months ago Closed 18 days ago

[fxci-etl] Refactor `fxci-etl metric export` to process one day at a time idempotently

Categories

(Release Engineering :: General, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ahal, Unassigned)

Details

Currently this command stores the end timestamp of the last interval processed. The next time the job runs, it uses that timestamp as the start time for the next interval, so there is never any overlap of metric records.

However, in a PR to docker-etl, :akomar suggested that I instead support a --date YYYY-MM-DD argument that telemetry-airflow can pass in. That way, the export always processes a single day's worth of metrics at a time. It would also mean we can stop storing state, and we'd gain the ability to re-run past jobs (e.g., if we discovered a bug and needed to re-populate old data).
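
For illustration, a minimal sketch of what accepting the argument could look like; the parser and option wiring here are hypothetical and not fxci-etl's actual CLI:

    import argparse
    from datetime import datetime

    def parse_args():
        # Hypothetical parser; fxci-etl's real command definition may differ.
        parser = argparse.ArgumentParser(prog="fxci-etl metric export")
        parser.add_argument(
            "--date",
            type=lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
            required=True,
            help="Day to export metrics for (YYYY-MM-DD).",
        )
        return parser.parse_args()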

One caveat is that we'd need to make sure the insertion is idempotent; that is, re-running the export for the same day must not create duplicate records. The approach :akomar recommends is to first DELETE all records in the partition we are inserting into, then insert the new ones. Essentially, overwrite records instead of appending them (as fxci-etl currently does).
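
Roughly what that overwrite could look like with the BigQuery Python client; the table name, partition column, and function are placeholders, not the actual fxci-etl implementation:

    from google.cloud import bigquery

    def export_day(client: bigquery.Client, table: str, day: str, rows: list[dict]) -> None:
        # Delete any records already in the day's partition so re-runs don't duplicate.
        delete_job = client.query(
            f"DELETE FROM `{table}` WHERE submission_date = @day",
            job_config=bigquery.QueryJobConfig(
                query_parameters=[bigquery.ScalarQueryParameter("day", "DATE", day)]
            ),
        )
        delete_job.result()  # wait for the delete to finish before inserting

        # Insert the freshly computed records for that day.
        errors = client.insert_rows_json(table, rows)
        if errors:
            raise RuntimeError(f"insert failed: {errors}")

An alternative with the same overwrite semantics would be a load job with WRITE_TRUNCATE targeting the day's partition decorator, which avoids the separate DELETE step; either way the result is idempotent per day.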

This should be well tested on "dev" tables in moz-fx-releng-dev before landing in production.

Here's an example airflow DAG that passes in --date:
https://github.com/mozilla/telemetry-airflow/blob/main/dags/dap_collector.py#L64
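
As in that DAG, the date could come from Airflow's templated logical date; a minimal sketch with placeholder DAG/task names and a BashOperator standing in for whatever operator the real DAG uses:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        "fxci_metric_export",  # hypothetical DAG id
        schedule_interval="@daily",
        start_date=datetime(2024, 8, 1),
        catchup=True,  # back-filling past days is safe once runs are idempotent
    ) as dag:
        export = BashOperator(
            task_id="export",
            # {{ ds }} expands to the run's logical date as YYYY-MM-DD
            bash_command="fxci-etl metric export --date {{ ds }}",
        )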

Status: NEW → RESOLVED
Closed: 18 days ago
Resolution: --- → FIXED