Closed Bug 1515145 Opened 6 years ago Closed 6 years ago

[Meta] Improve visibility into delays of main_summary

Categories

(Data Platform and Tools Graveyard :: Operations, enhancement)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hwoo, Assigned: hwoo)

Details

Attachments

(1 file)

Currently, main_summary runs for 6 hours daily. If it fails, it emails the DAG owners and then retries the job (configured for 2 retries). If the retries also fail, it updates the status page and emails the DAG owners again.
Short-term sub-tasks:
1. Adjust main_summary's timeout from 14 hours to ?.
2. Add a PagerDuty service + email integration to page ?. Then configure main_summary to also email this PagerDuty endpoint on retries/failures.
3. Add an Airflow job, scheduled for when we think main_summary should finish, that verifies the result and updates Statuspage. This job would check either the status of the main_summary task or the data in S3.
In the long term, we need something that updates Statuspage (bonus points for Slack and PagerDuty) when the data is delayed or missing, and that doesn't rely on the task status of an EmrSparkOperator Airflow job. That could be either another Airflow job that verifies the data with Athena/Redash/boto/etc., or some external service running in GCP (which would need AWS keys to access S3).
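The data-based variant of sub-task 3 could look roughly like the sketch below. It is only an illustration under stated assumptions: the key layout in `partition_prefix`, the function names, and the deadline are hypothetical, and the list of keys would in practice come from an S3 listing (for example via boto) rather than being passed in by hand. The point is that the decision depends only on whether the partition exists by the deadline, not on the Airflow task status.

```python
from datetime import date, datetime
from enum import Enum
from typing import Iterable


class Status(Enum):
    """Status values the check can report (names mirror Statuspage component states)."""
    OPERATIONAL = "operational"
    PARTIAL_OUTAGE = "partial_outage"


def partition_prefix(submission_date: date) -> str:
    # Hypothetical key layout; the real main_summary layout may differ.
    return f"main_summary/submission_date={submission_date:%Y%m%d}/"


def check_freshness(
    keys: Iterable[str],
    submission_date: date,
    now: datetime,
    deadline: datetime,
) -> Status:
    """Report an outage only if the deadline has passed and no data is present."""
    prefix = partition_prefix(submission_date)
    has_data = any(key.startswith(prefix) for key in keys)
    if has_data or now < deadline:
        return Status.OPERATIONAL
    return Status.PARTIAL_OUTAGE
```

A scheduled job would call `check_freshness` once at the expected finish time and push the returned status to Statuspage; the same result could also drive a Slack or PagerDuty notification.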
Closing this out.
1. The timeout was set to 4 hours; Databricks now runs the first main_summary task in ~2 hours.
2. Done; it pages Frank for now.
3. This may not be needed if we go with a better long-term solution. With the current configuration, the Statuspage operator should set partial_outage ~7-13 hours after a main_summary run starts and fails 3 times. That is well before 9AM PST the following day. I've created https://bugzilla.mozilla.org/show_bug.cgi?id=1517820 to track this if it is still needed.
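For reference, the time until the final failure can be estimated with a small calculation like the one below. This is a sketch, not the actual DAG configuration: the 30-minute retry delay is an assumed value that does not appear in this bug, and `time_to_final_failure` is a hypothetical helper. Only the 4-hour timeout, the ~2-hour typical runtime, and the 3 attempts (1 run plus 2 retries) come from the comments above.

```python
from datetime import timedelta
from typing import Sequence


def time_to_final_failure(
    attempt_durations: Sequence[timedelta],
    retry_delay: timedelta,
) -> timedelta:
    """Total elapsed time from the first attempt's start to the last attempt's failure.

    One retry delay is waited between each pair of consecutive attempts.
    """
    if not attempt_durations:
        raise ValueError("at least one attempt is required")
    waits = retry_delay * (len(attempt_durations) - 1)
    return sum(attempt_durations, timedelta()) + waits


# Assumed retry delay (not stated in this bug).
RETRY_DELAY = timedelta(minutes=30)

# Every attempt runs into the 4-hour timeout.
slowest = time_to_final_failure([timedelta(hours=4)] * 3, RETRY_DELAY)

# Every attempt fails after roughly the usual 2-hour runtime.
faster = time_to_final_failure([timedelta(hours=2)] * 3, RETRY_DELAY)
```

Under these assumptions `slowest` comes to 13 hours and `faster` to 7 hours; a different retry delay or earlier failures would shift both ends.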
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Assignee: nobody → hwoo
Product: Data Platform and Tools → Data Platform and Tools Graveyard