Closed
Bug 1321424
Opened 8 years ago
Closed 7 years ago
Airflow should not run duplicate jobs
Categories
(Data Platform and Tools :: General, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bugzilla, Assigned: bugzilla)
Attachments
(2 files)
It looks like Airflow kicked off the Longitudinal job, lost its connection to the cluster, and retried 30 minutes later:
https://workflow.telemetry.mozilla.org/admin/airflow/log?execution_date=2016-11-20T00%3A00%3A00&dag_id=longitudinal&task_id=longitudinal

longitudinal v20161126 and v20161119 both have twice the number of rows as expected.

A couple things arising from this:

This issue doesn't appear to be isolated to longitudinal:
https://workflow.telemetry.mozilla.org/admin/taskinstance/?sort=14&flt0_execution_date_greater_than=2016-11-01+18%3A35%3A00&desc=1

Seems like next steps are:
- See if we can fix the retry logic on airflow to prevent duplicate jobs when they're actually running fine
- Consider adding a second check just before write to verify the key we're writing output to is still empty
- Consider adding a verification task to the longitudinal dag that runs a few quick tests (e.g. check HLL count on client_id vs row count)
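The "second check just before write" idea could look roughly like the sketch below. It is only an illustration: `write_if_empty`, `OutputAlreadyExists`, and the `list_keys` / `write` callables are hypothetical names, not anything in the actual job, and the storage access is injected so the sketch makes no claims about a particular S3 client.

```python
from typing import Callable, Iterable


class OutputAlreadyExists(RuntimeError):
    """Raised when the destination prefix already holds data."""


def write_if_empty(
    prefix: str,
    list_keys: Callable[[str], Iterable[str]],
    write: Callable[[str], None],
) -> None:
    """Write output under `prefix` only if nothing is stored there yet.

    `list_keys` and `write` are hypothetical callables supplied by the
    caller (for example thin wrappers around whatever storage client the
    job already uses).
    """
    # Re-check the destination immediately before writing, so a retried
    # task that starts while the first attempt has already produced
    # output fails instead of appending a second copy.
    existing = list(list_keys(prefix))
    if existing:
        raise OutputAlreadyExists(
            f"{prefix} already contains {len(existing)} object(s); refusing to write"
        )
    write(prefix)
```

Note that a check followed by a write is not atomic: two attempts that both pass the check before either writes would still duplicate data, so this narrows the window rather than closing it.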
I forgot the first remediation step: try to identify and remove all of the duplicate data from the affected datasets.
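One way to spot an affected dataset, and the kind of quick test a verification task might run, is to compare the row count against the number of distinct client_ids. The sketch below assumes the dataset is meant to hold one row per client_id, and it uses an exact set rather than the HLL estimate mentioned above, purely to keep the example self-contained. The function names and the tolerance value are hypothetical.

```python
from typing import Iterable


def duplication_ratio(client_ids: Iterable[str]) -> float:
    """Return rows / distinct client_ids for one dataset version."""
    rows = 0
    distinct = set()
    for client_id in client_ids:
        rows += 1
        distinct.add(client_id)
    if not distinct:
        raise ValueError("dataset has no rows")
    return rows / len(distinct)


def looks_duplicated(client_ids: Iterable[str], tolerance: float = 0.05) -> bool:
    """Flag a dataset whose row count clearly exceeds its distinct clients.

    `tolerance` is an assumed slack value; an approximate counter such as
    HLL would need some slack to avoid false alarms.
    """
    return duplication_ratio(client_ids) > 1.0 + tolerance
```

A dataset written twice, as described in this bug, would give a ratio close to 2.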
Comment 2•8 years ago
I think we should delete output data (longitudinal itself as well as any downstream tasks) and re-run the Longitudinal job for the two periods in question.
Comment 3•8 years ago
The crash_aggregates dataset was affected too, though this one looks like it'll be easy to fix (there are two files per partition for the affected days instead of one). Any downstream reports will need updating though.
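A check for "two files per partition instead of one" can be sketched as a simple count of objects per partition directory. This assumes each partition is expected to contain exactly one file; the helper name and the file names in the usage example are made up for illustration.

```python
import posixpath
from collections import Counter
from typing import Iterable, List


def partitions_with_extra_files(
    keys: Iterable[str], expected_per_partition: int = 1
) -> List[str]:
    """Return partition directories holding more files than expected."""
    # Group object keys by their parent "directory", i.e. the partition.
    counts = Counter(posixpath.dirname(key) for key in keys)
    return sorted(
        partition
        for partition, n in counts.items()
        if n > expected_per_partition
    )


# Hypothetical listing: file names are invented for the example.
keys = [
    "crash_aggregates/v1/submission_date=2016-11-18/part-a.parquet",
    "crash_aggregates/v1/submission_date=2016-11-19/part-a.parquet",
    "crash_aggregates/v1/submission_date=2016-11-19/part-b.parquet",
]
flagged = partitions_with_extra_files(keys)
```

Here `flagged` contains only the `submission_date=2016-11-19` partition, since it is the only one with more than one file.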
Comment 4•8 years ago
I've removed the extra files from the two affected crash_aggregates partitions: s3://telemetry-parquet/crash_aggregates/v1/submission_date=2016-11-19/ and s3://telemetry-parquet/crash_aggregates/v1/submission_date=2016-11-26/
The longitudinal and cross-sectional datasets for the last two weeks have been re-run, and Hive is now pointing to the latest versions again. The Update Orphaning downstream job has been re-run as well, and Bug 1321647 has been filed for re-running the game hardware survey.
Yes -- we haven't completed any of the remediation items yet to prevent this from happening again.
Flags: needinfo?(ssuh)
Updated•8 years ago
Summary: Airflow ran duplicate longitudinal jobs on 11/27, and probably on 11/20 → Airflow should not run duplicate jobs
Comment 8•8 years ago
Updated•7 years ago
Points: --- → 3
Priority: -- → P1
Comment 9•7 years ago
Updated•7 years ago
Component: Metrics: Pipeline → Scheduling
Product: Cloud Services → Data Platform and Tools
Comment 10•7 years ago
whd fixed this as part of the Dockerflow migration \o/
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated•2 years ago
Component: Scheduling → General