Closed Bug 1269725 Opened 8 years ago Closed 8 years ago

The CrashAggregateView watchdog should check that a _SUCCESS file exists

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: mdoglio)

On Friday we transferred ownership of the moz-crash-rate-aggregates job from azhang to mdoglio. We noticed that the job has never run to completion: the scheduled EMR clusters terminate with the error "Shut down as step failed", and no logs are available. The data on S3 looks complete, though: each partition has a _SUCCESS file, which Spark should write only after all the files belonging to that partition have been written. The job therefore seems to be failing at the very end for some other reason. Furthermore, the watchdog job, which should send an alert when the job fails, seems to check only that the partition exists on S3, not that the _SUCCESS file is present. We should fix that.
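For illustration, the stricter check described above could look roughly like the sketch below. It is not the actual watchdog script: the function names, the three status labels, and the example key layout are assumptions made for this example. It works on a plain list of object keys so that it stays independent of any S3 client; in practice the keys could come from an S3 listing call such as boto3's list_objects_v2.

```python
from typing import Iterable

# Marker file written once the output for a partition has been committed.
SUCCESS_MARKER = "_SUCCESS"


def partition_status(keys: Iterable[str], partition_prefix: str) -> str:
    """Classify a partition as 'missing', 'incomplete', or 'complete'.

    keys: object keys found under the bucket (hypothetical input, e.g. the
          result of an S3 listing).
    partition_prefix: key prefix of the partition to check.
    """
    prefix = partition_prefix.rstrip("/") + "/"
    in_partition = [k for k in keys if k.startswith(prefix)]
    if not in_partition:
        # The partition does not exist at all.
        return "missing"
    if prefix + SUCCESS_MARKER in in_partition:
        # The marker sits directly under the partition prefix.
        return "complete"
    # Files exist, but the marker is absent: the write may not have finished.
    return "incomplete"


def should_alert(keys: Iterable[str], partition_prefix: str) -> bool:
    """Alert unless the partition exists AND carries a _SUCCESS marker."""
    return partition_status(keys, partition_prefix) != "complete"


if __name__ == "__main__":
    # Hypothetical key layout, for demonstration only.
    listing = [
        "crash_aggregates/submission_date=2016-05-01/part-00000.parquet",
        "crash_aggregates/submission_date=2016-05-01/_SUCCESS",
        "crash_aggregates/submission_date=2016-05-02/part-00000.parquet",
    ]
    for day in ("2016-05-01", "2016-05-02", "2016-05-03"):
        prefix = "crash_aggregates/submission_date=" + day
        print(day, partition_status(listing, prefix))
```

A check that only tests for the partition prefix would treat the second partition in the example as healthy, whereas this sketch reports it as incomplete and would trigger an alert.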
Assignee: nobody → mdoglio
Points: --- → 3
Priority: -- → P1
This has been fixed in https://github.com/mozilla/telemetry-batch-view/pull/71 (under review).
Depends on: 1275346
Status: NEW → ASSIGNED
The first part of this bug is a WONTFIX. We replaced the old Python job with a new version written in Scala, which is not affected by the bug described. The second part is still valid, as we are using the same watchdog script as before.
Summary: moz-crash-rate-aggregates job is failing → The CrashAggregateView watchdog should check that a _SUCCESS file exists
Depends on: 1275346
No longer depends on: 1275346
Deployed to production.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard