Closed Bug 1522270 Opened 7 years ago Closed 5 years ago

Eliminate conditions for data-duplication in the MozDatabricksRunSubmitOperator

Categories

(Data Platform and Tools :: General, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: amiyaguchi, Unassigned)

Details

The MozDatabricksRunSubmitOperator occasionally causes data duplication in MainSummary. This has happened twice within a single month.

The usual cause is an API error that is not handled correctly: the Airflow task fails, but the cluster it started is orphaned and keeps running. When the job is then retried, two instances of the batch view run and both write their output.
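One way to picture the failure mode is a polling loop that gives up on the first API error even though the run is still alive. The sketch below is a hypothetical illustration, not the operator's actual code and not the AIRFLOW-2709 patch: `TransientApiError`, `wait_for_run`, `scripted`, and the state names are all invented for the example. It tolerates a bounded number of consecutive poll failures before raising, so a single hiccup does not fail the task and trigger a retry while the original run continues.

```python
import time


class TransientApiError(Exception):
    """Hypothetical stand-in for a recoverable API failure."""


def wait_for_run(get_state, max_errors=3, delay_seconds=0.0):
    """Poll get_state() until a terminal state is reported.

    Up to max_errors consecutive TransientApiError failures are tolerated;
    one more than that re-raises the error. The terminal state names used
    here are assumptions made for this sketch.
    """
    errors = 0
    while True:
        try:
            state = get_state()
        except TransientApiError:
            errors += 1
            if errors > max_errors:
                raise
            time.sleep(delay_seconds)
            continue
        errors = 0  # a successful poll resets the consecutive-error count
        if state in ("SUCCESS", "FAILED"):
            return state
        time.sleep(delay_seconds)


def scripted(events):
    """Build a fake get_state() that replays a fixed list of events."""
    it = iter(events)

    def get_state():
        event = next(it)
        if isinstance(event, Exception):
            raise event
        return event

    return get_state


# Two isolated API errors no longer abort the wait.
poll = scripted([TransientApiError(), "RUNNING", TransientApiError(), "SUCCESS"])
print(wait_for_run(poll))
```

If the error budget is exhausted the exception still propagates, so a real fix would also need to cancel the outstanding run before any retry submits a new one.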

We can verify that data was written twice either by scanning the file system for multiple task IDs or by checking the percentage of duplicated document IDs.
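The document-ID check can be sketched in a few lines. This is a minimal example that assumes the document IDs for one day's partition have already been read into a Python iterable; `duplicate_percentage` is a hypothetical helper name, and a real check would more likely run as a query against the dataset itself.

```python
from collections import Counter


def duplicate_percentage(document_ids):
    """Return the percentage of rows that repeat an already-seen document ID."""
    ids = list(document_ids)
    if not ids:
        return 0.0
    counts = Counter(ids)
    # Every occurrence beyond the first of a given ID is an extra copy.
    extra_rows = sum(n - 1 for n in counts.values())
    return 100.0 * extra_rows / len(ids)


# A batch written twice in full: half of all rows are extra copies.
sample = ["a", "b", "c", "a", "b", "c"]
print(duplicate_percentage(sample))  # 50.0
```

A value close to 50% would be consistent with the whole batch having been written twice, while a value near zero points to ordinary client-side duplicates.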

Either of the following would resolve this issue:

  • Upgrade to Airflow 1.10.1
  • Backport the Databricks hook, operator, and tests into telemetry-airflow/plugins

In particular, the patch for this JIRA issue needs to be pulled in.

https://issues.apache.org/jira/browse/AIRFLOW-2709


See also https://github.com/mozilla/telemetry-airflow/pull/416 for issues related to the retry mechanism. We disabled retries in https://github.com/mozilla/telemetry-airflow/pull/417.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID
Component: Scheduling → General