Closed
Bug 1306225
Opened 8 years ago
Closed 8 years ago
Airflow jobs are failing
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect)
Cloud Services Graveyard
Metrics: Pipeline
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rvitillo, Unassigned)
References
Details
User Story
The add-on view backfill (running 10 jobs at a time) caused an outage of our scheduled EMR jobs due to the following error:

[2016-09-29 00:15:53,716] {models.py:1286} ERROR - An error occurred (ThrottlingException) when calling the DescribeCluster operation: Rate exceeded

Blake, I would like to understand what the limits are, why we are getting throttled, and how we can increase those limits to make sure this doesn't happen again.

Mark, when we backfilled the main_summary view we set max_active_runs=5 and didn't run into any issues; we should enforce this as a global limit [1]. I will suspend execution of the addons DAG until this problem has been fixed.

[1] https://github.com/mozilla/telemetry-airflow/blob/master/ansible/files/airflow/airflow.cfg#L42
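A minimal sketch of the global limit proposed in the user story, expressed via Airflow's `max_active_runs_per_dag` setting in `airflow.cfg` (the section and value here are assumptions; the linked config line is the authoritative place):

```ini
[core]
# Cap concurrent runs per DAG so a backfill cannot launch enough EMR
# clusters at once to exceed the DescribeCluster rate limit; 5 worked
# for the main_summary backfill per the user story.
max_active_runs_per_dag = 5
```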
Attachments
(3 files)
No description provided.
Reporter
Updated•8 years ago
User Story: (updated)
Reporter
Updated•8 years ago
Flags: needinfo?(bimsland)
Reporter
Updated•8 years ago
Flags: needinfo?(mreid)
Reporter
Updated•8 years ago
Severity: normal → blocker
Reporter
Updated•8 years ago
User Story: (updated)
Comment 1•8 years ago
As far as I know (from my previous dealings with AWS support), those limits are a) not exposed to us and b) not able to be increased. There's a page on the AWS site [1] that suggests setting up retries with exponential backoff, and I've read before that the recommendation is to not query for status more than once every 10 seconds.

[1] https://aws.amazon.com/premiumsupport/knowledge-center/emr-cluster-status-throttling-error/
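The AWS recommendation above (retries with exponential backoff) can be sketched as a generic retry helper; all names here are illustrative, not the operator's actual code:

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0, is_throttle=lambda e: True):
    """Retry `call` with exponential backoff plus jitter.

    `is_throttle` decides whether an exception is retryable (e.g. an
    AWS ThrottlingException); anything else is re-raised immediately.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            if not is_throttle(exc) or attempt == max_retries - 1:
                raise
            # Sleep base * 2^attempt seconds plus jitter, as AWS recommends.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

Usage would wrap the status call, e.g. `with_backoff(lambda: emr.describe_cluster(ClusterId=cluster_id))` with a boto3 EMR client, with `is_throttle` checking the error code for `ThrottlingException`.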
Updated•8 years ago
Flags: needinfo?(bimsland)
Comment 2•8 years ago
Comment 3•8 years ago
Flags: needinfo?(mreid)
Comment 4•8 years ago
We could increase the monitoring timeout at [1] to something like 5 minutes. That would reduce the rate at which we're hitting DescribeCluster by 5x. What do you think?

[1] https://github.com/mozilla/telemetry-airflow/blob/master/dags/operators/emr_spark_operator.py#L155
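The proposal above amounts to widening the sleep in the operator's monitoring loop. A minimal sketch, assuming a callable that returns the cluster state (all names are hypothetical, not the actual EMRSparkOperator code):

```python
import time

def wait_for_completion(describe, poll_seconds=300):
    """Poll a cluster-status callable until it reports a terminal state.

    `describe` stands in for the DescribeCluster call. Raising
    poll_seconds to 300 (5 minutes) cuts the DescribeCluster request
    rate 5x relative to polling every minute.
    """
    terminal = {"TERMINATED", "TERMINATED_WITH_ERRORS"}
    while True:
        state = describe()
        if state in terminal:
            return state
        time.sleep(poll_seconds)
```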
Updated•8 years ago
Flags: needinfo?(rvitillo)
Flags: needinfo?(bimsland)
Reporter
Comment 5•8 years ago
Attachment #8796163 - Flags: review?(mdoglio)
Reporter
Updated•8 years ago
Flags: needinfo?(rvitillo)
Updated•8 years ago
Attachment #8796163 - Flags: review?(mdoglio) → review+
Comment 6•8 years ago
This problem seems to be resolved for now.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(bimsland)
Resolution: --- → FIXED
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard