Eliminate conditions for data-duplication in the MozDatabricksRunSubmitOperator
Categories
(Data Platform and Tools :: General, enhancement)
Tracking
(Not tracked)
People
(Reporter: amiyaguchi, Unassigned)
Details
The MozDatabricksRunSubmitOperator occasionally causes data duplication in MainSummary. This has happened twice within a single month.
The typical cause is an orphaned cluster left behind by an API error that is not handled correctly; when the job is retried, two instances of the batch-view run and both write output.
We can verify that data was written twice either by scanning the file system for output under multiple task-ids, or by checking the percentage of duplicated document-ids.
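As a rough illustration of the second check, the duplicated-document-id percentage can be computed like this. This is a pure-Python sketch; `duplicate_percentage` and the sample ids are hypothetical stand-ins for a scan over the actual MainSummary dataset:

```python
from collections import Counter

def duplicate_percentage(doc_ids):
    """Percentage of rows whose document-id occurs more than once."""
    counts = Counter(doc_ids)
    duplicated = sum(n for n in counts.values() if n > 1)
    return 100.0 * duplicated / len(doc_ids) if doc_ids else 0.0

# A retried batch-view rewrites every row, so duplicated ids stand out:
sample = ["a", "b", "c", "a"]
print(duplicate_percentage(sample))  # → 50.0
```

A healthy run should report a percentage near zero; a doubled run reports close to 100.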
The solution to this issue is one of the following:
- Upgrade to Airflow 1.10.1
- Backport the Databricks hook, operator, and tests into telemetry-airflow/plugins
In particular, the patch for AIRFLOW-2709 needs to be pulled in: https://issues.apache.org/jira/browse/AIRFLOW-2709
See also https://github.com/mozilla/telemetry-airflow/pull/416 for issues related to the retry mechanism. We disabled retries in https://github.com/mozilla/telemetry-airflow/pull/417.
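Independent of the upgrade, the duplication condition itself can be guarded against by making the submit step idempotent: skip submission when output for the same date already exists on the file system. The sketch below uses hypothetical names (`should_submit`, the partition layout) and is not the actual telemetry-airflow code:

```python
def should_submit(existing_paths, submission_date):
    """Return False when a batch-view output partition for this date already
    exists, so an Airflow retry does not launch a second Databricks run."""
    # Hypothetical partition layout for the MainSummary output.
    marker = "main_summary/submission_date=%s/" % submission_date
    return not any(marker in path for path in existing_paths)

paths = ["main_summary/submission_date=20181001/part-0.parquet"]
print(should_submit(paths, "20181001"))  # → False
print(should_submit(paths, "20181002"))  # → True
```

A guard like this lets retries stay enabled without risking a second batch-view writing the same partition.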