Closed Bug 1294702 Opened 8 years ago Closed 6 years ago

Overwrite partial datasets after a job has been retriggered

Categories

(Data Platform and Tools :: General, defect, P2)

Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: amiyaguchi)

References

Details

User Story

A job scheduled with Airflow might fail after it has started copying data on S3. When Airflow retriggers it, the job should delete the partial data before it actually starts. We need to make sure that this is the case for all jobs currently being run on Airflow.
      No description provided.
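The cleanup the user story calls for might be sketched as follows. This is a hypothetical illustration, not code from any Mozilla repository: the in-memory dict stands in for an S3 bucket listing, and the function name and prefix layout are assumptions. With boto3, the same idea would be implemented by paging through `list_objects_v2` and issuing `delete_objects` calls before the job starts writing.

```python
def delete_partial_output(objects, prefix):
    """Remove any objects left under the job's output prefix.

    `objects` is a dict of key -> data standing in for an S3 bucket.
    A retriggered Airflow job would call this before writing, so a
    previous partially-completed run cannot leave stale files behind.
    Returns the list of deleted keys.
    """
    stale = [key for key in objects if key.startswith(prefix)]
    for key in stale:
        del objects[key]
    return stale


# Example: a failed run left two partial files under the dataset prefix.
bucket = {
    "longitudinal/v20160829/part-00000.parquet": b"...",
    "longitudinal/v20160829/part-00001.parquet": b"...",
    "other_dataset/part-00000.parquet": b"...",
}
removed = delete_partial_output(bucket, "longitudinal/v20160829/")
# Only the unrelated dataset's object remains in the bucket.
```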
Points: --- → 2
Priority: -- → P2
Component: Metrics: Pipeline → Scheduling
Product: Cloud Services → Data Platform and Tools
Assignee: nobody → amiyaguchi
This needs double-checking to make sure Longitudinal handles this case. Otherwise, I don't think our jobs care about partial data.
I added this recently in Longitudinal: https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/Longitudinal.scala#L172-L173

But, yes, otherwise we should be using the Spark Parquet writer in atomic mode.
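As a sketch of why overwrite semantics make a retried job safe, assuming "atomic mode" refers to Spark's overwrite save mode: a write that first clears the destination cannot mix its output with leftovers from an earlier partial run. The simulation below is hypothetical (a dict stands in for the object store); with Spark the equivalent is `df.write.mode("overwrite").parquet(path)`.

```python
def overwrite_write(store, prefix, parts):
    """Write `parts` (name -> bytes) under `prefix`, clearing anything
    already there first -- the behaviour overwrite semantics give us,
    so a retry cannot leave stale files from an earlier partial run."""
    for key in [k for k in store if k.startswith(prefix)]:
        del store[key]
    for name, data in parts.items():
        store[prefix + name] = data


# A failed first run wrote three part files; the retry produces two.
store = {"out/part-00000": b"a", "out/part-00001": b"b", "out/part-00002": b"c"}
overwrite_write(store, "out/", {"part-00000": b"x", "part-00001": b"y"})
# The stale part-00002 from the failed run is gone.
```

With append semantics instead, `out/part-00002` from the failed run would survive the retry and corrupt the dataset, which is exactly the failure mode this bug is about.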
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Scheduling → General