Closed
Bug 1515145
Opened 6 years ago
Closed 6 years ago
[Meta] Improve visibility into delays of main_summary
Categories
(Data Platform and Tools Graveyard :: Operations, enhancement)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: hwoo, Assigned: hwoo)
Details
Attachments
(1 file)
Currently, main_summary runs for 6 hours daily. If it fails, it emails the DAG owners and then retries the job (configured for 2 retries). If the retries also fail, it updates the status page and emails the DAG owners.
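For reference, the retry and email behaviour described above could be expressed with Airflow's standard operator arguments roughly as follows. This is only a sketch: the owner address and the retry delay are placeholders that are not taken from the real DAG, and the dictionary is shown on its own so that it can be read without an Airflow install.

```python
from datetime import timedelta

# Hypothetical address; the real DAG owners are not listed in this bug.
DAG_OWNERS = ["dag-owner@example.com"]

# These keys are standard Airflow BaseOperator arguments, typically passed
# to a DAG as default_args. The values mirror the behaviour described above.
default_args = {
    "email": DAG_OWNERS,
    "email_on_retry": True,    # email the DAG owners when an attempt fails and is retried
    "email_on_failure": True,  # email again once all retries are exhausted
    "retries": 2,              # 2 retries, so 3 attempts in total
    "retry_delay": timedelta(minutes=30),  # assumed value, not stated in this bug
}

# Total number of attempts before the task is marked as failed.
max_attempts = default_args["retries"] + 1
```

The status page update on final failure is handled separately and is not shown here.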
Short-term subtasks:
1. Adjust main_summary's timeout from 14 hours to ?.
2. Add a PagerDuty service + email integration to page ?. Then configure main_summary to also email this PagerDuty endpoint on retries/failures.
3. Add an Airflow job, scheduled for when we think main_summary should finish, that verifies the result and updates Statuspage. This job would check either the status of the main_summary task or the data in S3.
In the long term, we need something that updates Statuspage (bonus points for Slack and PagerDuty) when the data is delayed or missing, and that doesn't rely on the task status of an EmrSparkOperator Airflow job. This could be either another Airflow job that verifies the data with Athena/Redash/boto/etc., or an external service running in GCP (which would need AWS keys to access S3).
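A check that looks at the data rather than the task status could be sketched along these lines. Everything named here is an assumption for illustration: the bucket name, the prefix layout, and the helper names are hypothetical, and the S3 listing is passed in as a function so the logic can be exercised without AWS credentials. In a real job the lister might wrap boto3's `list_objects_v2`.

```python
from typing import Callable, Iterable

# A lister takes (bucket, prefix) and yields the object keys under that prefix.
KeyLister = Callable[[str, str], Iterable[str]]


def partition_has_data(list_keys: KeyLister, bucket: str, prefix: str) -> bool:
    """Return True if at least one data object exists under the prefix."""
    for key in list_keys(bucket, prefix):
        name = key.rsplit("/", 1)[-1]
        # Ignore "directory" placeholder keys and marker files such as _SUCCESS.
        if name and not name.startswith("_"):
            return True
    return False


def status_for(has_data: bool) -> str:
    """Map the check result to a Statuspage component status."""
    return "operational" if has_data else "partial_outage"


# Stand-in lister backed by a fixed list of keys (hypothetical layout).
FAKE_KEYS = [
    "main_summary/date=20181218/_SUCCESS",
    "main_summary/date=20181218/part-00000.parquet",
    "main_summary/date=20181219/_SUCCESS",
]


def fake_lister(bucket: str, prefix: str) -> Iterable[str]:
    return [k for k in FAKE_KEYS if k.startswith(prefix)]


ok = partition_has_data(fake_lister, "example-bucket", "main_summary/date=20181218/")
missing = partition_has_data(fake_lister, "example-bucket", "main_summary/date=20181219/")
```

Here the second partition contains only a marker file, so it is treated as missing and would map to partial_outage.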
Comment 1•6 years ago
Comment 2•6 years ago
Closing this out.
1. The timeout was set to 4 hours; Databricks now runs the first main_summary task in ~2 hours.
2. Done; it pages Frank for now.
3. This may not be needed, in favor of a better long-term solution. With the current configuration, the Statuspage operator should set partial_outage for a main_summary that has failed 3 times, ~7-13 hours later. This will be well before 9AM PST the following day. I've created https://bugzilla.mozilla.org/show_bug.cgi?id=1517820 to track this if it is still needed.
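One way to arrive at a range like ~7-13 hours is sketched below. It is a reconstruction under assumptions, not the actual configuration: it assumes 3 attempts in total, a 30-minute delay between attempts (not stated in this bug), a best case in which each attempt fails after about the normal ~2 hour runtime, and a worst case in which each attempt runs into the 4 hour timeout.

```python
def hours_until_final_failure(attempt_hours: float,
                              attempts: int = 3,
                              retry_delay_hours: float = 0.5) -> float:
    """Hours from the start of the first attempt to the end of the last one.

    retry_delay_hours is an assumed value; there is one delay between each
    pair of consecutive attempts.
    """
    return attempts * attempt_hours + (attempts - 1) * retry_delay_hours


# Each attempt fails after roughly the normal ~2 hour runtime.
earliest = hours_until_final_failure(2.0)

# Each attempt runs until the 4 hour timeout.
latest = hours_until_final_failure(4.0)
```

With these assumed inputs the two bounds come out to 7 and 13 hours after the first attempt starts.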
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Updated•6 years ago
Assignee: nobody → hwoo
Updated•2 years ago
Product: Data Platform and Tools → Data Platform and Tools Graveyard