Closed Bug 1321424 Opened 8 years ago Closed 7 years ago

Airflow should not run duplicate jobs

Categories

(Data Platform and Tools :: General, defect, P1)

Points: 3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bugzilla, Assigned: bugzilla)

Attachments

(2 files)

It looks like Airflow kicked off the Longitudinal job, lost its connection to the cluster, and retried 30 minutes later: https://workflow.telemetry.mozilla.org/admin/airflow/log?execution_date=2016-11-20T00%3A00%3A00&dag_id=longitudinal&task_id=longitudinal

longitudinal v20161126 and v20161119 both have twice as many rows as expected.

A couple things arising from this:

This issue doesn't appear to be isolated to longitudinal:
https://workflow.telemetry.mozilla.org/admin/taskinstance/?sort=14&flt0_execution_date_greater_than=2016-11-01+18%3A35%3A00&desc=1

Seems like the next steps are:
- See if we can fix the retry logic in Airflow so it doesn't launch a duplicate job when the original is actually still running fine
- Consider adding a second check just before the write to verify that the key we're writing output to is still empty
- Consider adding a verification task to the longitudinal DAG that runs a few quick tests (e.g. check the HLL count on client_id vs. the row count)
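The last two items could look roughly like the sketch below. It is not the job's actual code: `write_if_empty`, `rows_match_distinct_clients`, the `list_keys`/`write` callables, and the 5% tolerance are all hypothetical, and the row-count check assumes longitudinal is meant to hold one row per client_id.

```python
from typing import Callable, Iterable


class OutputAlreadyExists(Exception):
    """Raised when the destination prefix already holds data."""


def write_if_empty(
    prefix: str,
    list_keys: Callable[[str], Iterable[str]],
    write: Callable[[str], None],
) -> None:
    """Re-check the destination immediately before writing.

    list_keys and write are hypothetical stand-ins for whatever storage
    client the job really uses.
    """
    existing = list(list_keys(prefix))
    if existing:
        raise OutputAlreadyExists(
            f"{prefix} already contains {len(existing)} object(s)"
        )
    write(prefix)


def rows_match_distinct_clients(
    row_count: int,
    approx_distinct_clients: int,
    tolerance: float = 0.05,  # assumed slack for the HLL estimate
) -> bool:
    """Return True if the row count is within tolerance of the client count.

    Assumes one row per client_id, so a duplicated run (about 2x the rows)
    fails the check.
    """
    if approx_distinct_clients <= 0:
        return row_count == 0
    relative_error = abs(row_count - approx_distinct_clients) / approx_distinct_clients
    return relative_error <= tolerance
```

The check-then-write is not atomic, so two jobs that both pass the check at the same moment can still both write. It only narrows the window, which is why the post-hoc row-count check is worth having as well.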
Blocks: 1269754
I forgot the first remediation step: try to identify and remove all of the duplicate data from the affected datasets
I think we should delete output data (longitudinal itself as well as any downstream tasks) and re-run the Longitudinal job for the two periods in question.
The crash_aggregates dataset was affected too, though this one looks like it'll be easy to fix (there are two files per partition for the affected days instead of one). Any downstream reports will need updating though.
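For finding affected partitions like these, a minimal sketch follows. It assumes a flat list of object keys is already available from some listing call; `partitions_with_extra_files` is a hypothetical helper, and the file names in the usage example are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def partitions_with_extra_files(
    keys: Iterable[str], expected_files: int = 1
) -> Dict[str, List[str]]:
    """Group keys by their parent prefix and keep over-full partitions."""
    by_partition = defaultdict(list)
    for key in keys:
        partition, _, filename = key.rpartition("/")
        if filename:  # skip "directory" placeholder keys that end in "/"
            by_partition[partition].append(key)
    return {
        partition: sorted(files)
        for partition, files in by_partition.items()
        if len(files) > expected_files
    }


# Hypothetical file names, for illustration only.
example_keys = [
    "crash_aggregates/v1/submission_date=2016-11-18/part-a.parquet",
    "crash_aggregates/v1/submission_date=2016-11-19/part-a.parquet",
    "crash_aggregates/v1/submission_date=2016-11-19/part-b.parquet",
]
suspect = partitions_with_extra_files(example_keys)
```

Here `suspect` holds only the `submission_date=2016-11-19` partition with its two keys. Which of the two files to delete is a separate decision that this sketch does not make.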
Assignee: nobody → ssuh
I've removed the extra files from the two affected crash_aggregates partitions:
s3://telemetry-parquet/crash_aggregates/v1/submission_date=2016-11-19/
and 
s3://telemetry-parquet/crash_aggregates/v1/submission_date=2016-11-26/
Depends on: 1321647
The longitudinal and cross-sectional datasets for the last two weeks have been re-run, and Hive is now pointing to the latest versions again. The Update Orphaning downstream job has been re-run as well, and Bug 1321647 has been filed for re-running the game hardware survey.
this still valid?
Flags: needinfo?(ssuh)
Yes -- we haven't completed any of the remediation items yet to prevent this from happening again.
Flags: needinfo?(ssuh)
Summary: Airflow ran duplicate longitudinal jobs on 11/27, and probably on 11/20 → Airflow should not run duplicate jobs
See Also: → 1324805
Points: --- → 3
Priority: -- → P1
Depends on: 1326068
Component: Metrics: Pipeline → Scheduling
Product: Cloud Services → Data Platform and Tools
whd fixed this as part of the Dockerflow migration \o/
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Scheduling → General