Closed Bug 1306225 Opened 8 years ago Closed 8 years ago

Airflow jobs are failing

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect)

defect
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Unassigned)

References

Details

User Story

The add-on view backfill (running 10 jobs at the time) caused an outage of our scheduled EMR jobs, which failed with the following error:

[2016-09-29 00:15:53,716] {models.py:1286} ERROR - An error occurred (ThrottlingException) when calling the DescribeCluster operation: Rate exceeded

Blake,
I would like to understand what the limits are, why we are getting throttled and how we can increase those limits to make sure this doesn't happen again.

Mark,
When we backfilled the main_summary view we set max_active_runs=5 and we didn't run into any issues; we should enforce this as a global limit [1]. I will suspend the execution of the addons DAG until this problem has been fixed.

[1] https://github.com/mozilla/telemetry-airflow/blob/master/ansible/files/airflow/airflow.cfg#L42

Attachments

(3 files)

      No description provided.
User Story: (updated)
Flags: needinfo?(bimsland)
Flags: needinfo?(mreid)
Blocks: 1269754
Severity: normal → blocker
User Story: (updated)
As far as I know (from my previous dealings with AWS support), those limits are a) not exposed to us and b) not able to be increased. There's a page on the AWS site [1] that suggests setting up retries with exponential backoff, and I've read before that the recommendation is to query for status no more than once every 10 seconds.

[1] https://aws.amazon.com/premiumsupport/knowledge-center/emr-cluster-status-throttling-error/
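For illustration, the retry-with-exponential-backoff suggestion could look roughly like the sketch below. This is not the code in telemetry-airflow; the helper name call_with_backoff, its parameters, and the default delays are assumptions made for the example. It wraps any callable, so the throttling check is passed in rather than tied to a specific AWS client.

```python
import random
import time


def call_with_backoff(fn, is_throttled, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff while it is throttled.

    Hypothetical helper: the name, parameters and defaults are assumptions.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that is not throttling, and the last failure.
            if not is_throttled(exc) or attempt == max_attempts - 1:
                raise
            # Capped exponential delay with full jitter, so that concurrent
            # jobs do not all retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))


# Possible use with boto3 (assumes boto3 is installed; the cluster id is a
# placeholder):
#
#   import boto3
#   from botocore.exceptions import ClientError
#
#   emr = boto3.client("emr")
#   result = call_with_backoff(
#       lambda: emr.describe_cluster(ClusterId="j-EXAMPLE"),
#       lambda e: isinstance(e, ClientError)
#       and e.response["Error"]["Code"] == "ThrottlingException",
#   )
```

The jitter matters here because several backfill jobs poll at once; without it, throttled jobs would tend to retry in lockstep and hit the limit again.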
Flags: needinfo?(bimsland)
Flags: needinfo?(mreid)
Blocks: 1300224
We could increase the monitoring timeout at [1] to something like 5 minutes. That would reduce the rate at which we hit DescribeCluster by 5x. What do you think?

[1] https://github.com/mozilla/telemetry-airflow/blob/master/dags/operators/emr_spark_operator.py#L155
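As a rough back-of-the-envelope check of the suggestion above, the aggregate DescribeCluster rate scales with the number of concurrent runs divided by the polling interval. The sketch below assumes each run polls exactly one cluster and that the current interval is 60 seconds (inferred from the "5x" figure, not confirmed in this bug); the function name is hypothetical.

```python
def describe_calls_per_minute(concurrent_runs, poll_interval_seconds):
    """Approximate aggregate DescribeCluster calls per minute.

    Assumes every run polls a single cluster at a fixed interval.
    """
    return concurrent_runs * 60.0 / poll_interval_seconds


# 10 backfill jobs, assumed 60 s interval
print(describe_calls_per_minute(10, 60))
# Same 10 jobs with a 5 minute interval: one fifth of the rate
print(describe_calls_per_minute(10, 300))
# Capping concurrency at 5 (max_active_runs=5) with the assumed 60 s interval
print(describe_calls_per_minute(5, 60))
```

Under these assumptions, lengthening the interval and capping concurrent runs both lower the rate, and the two could be combined.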
Flags: needinfo?(rvitillo)
Flags: needinfo?(bimsland)
Attached file patch
Attachment #8796163 - Flags: review?(mdoglio)
Flags: needinfo?(rvitillo)
Attachment #8796163 - Flags: review?(mdoglio) → review+
This problem seems to be resolved for now.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(bimsland)
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard