Closed
Bug 1306225
Opened 8 years ago
Closed 8 years ago
Airflow jobs are failing
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect)
Cloud Services Graveyard
Metrics: Pipeline
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rvitillo, Unassigned)
References
Details
User Story
The add-on view backfill (running 10 jobs at a time) caused an outage of our scheduled EMR jobs due to the following error:

[2016-09-29 00:15:53,716] {models.py:1286} ERROR - An error occurred (ThrottlingException) when calling the DescribeCluster operation: Rate exceeded

Blake, I would like to understand what the limits are, why we are getting throttled, and how we can increase those limits to make sure this doesn't happen again.

Mark, when we backfilled the main_summary view we set max_active_runs=5 and didn't run into any issues; we should enforce this as a global limit [1]. I will suspend execution of the addons DAG until this problem has been fixed.

[1] https://github.com/mozilla/telemetry-airflow/blob/master/ansible/files/airflow/airflow.cfg#L42
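A minimal sketch of the global limit proposed in the user story, expressed via Airflow's `max_active_runs_per_dag` setting in `airflow.cfg` (the section and value here are assumptions; the linked config line is the authoritative place):

```ini
[core]
# Cap concurrent runs per DAG so a backfill cannot launch enough EMR
# clusters at once to exceed the DescribeCluster rate limit; 5 worked
# for the main_summary backfill per the user story.
max_active_runs_per_dag = 5
```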
Attachments
(3 files)
No description provided.
Reporter
Updated•8 years ago
User Story: (updated)
Reporter
Updated•8 years ago
Flags: needinfo?(bimsland)
Reporter
Updated•8 years ago
Flags: needinfo?(mreid)
Reporter
Updated•8 years ago
Severity: normal → blocker
Reporter
Updated•8 years ago
User Story: (updated)
Comment 1•8 years ago
As far as I know (from my previous dealings with AWS support), those limits are a) not exposed to us and b) not able to be increased. There's a page on the AWS site [1] that suggests setting up retries with exponential backoff, and I've read before that the recommendation is to not query for status more than once every 10 seconds.

[1] https://aws.amazon.com/premiumsupport/knowledge-center/emr-cluster-status-throttling-error/
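The AWS recommendation above (retries with exponential backoff) can be sketched as a generic retry helper; all names here are illustrative, not the operator's actual code:

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0, is_throttle=lambda e: True):
    """Retry `call` with exponential backoff plus jitter.

    `is_throttle` decides whether an exception is retryable (e.g. an
    AWS ThrottlingException); anything else is re-raised immediately.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            if not is_throttle(exc) or attempt == max_retries - 1:
                raise
            # Sleep base * 2^attempt seconds plus jitter, as AWS recommends.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

Usage would wrap the status call, e.g. `with_backoff(lambda: emr.describe_cluster(ClusterId=cluster_id))` with a boto3 EMR client, with `is_throttle` checking the error code for `ThrottlingException`.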
Updated•8 years ago
Flags: needinfo?(bimsland)
Comment 2•8 years ago
Comment 3•8 years ago
Flags: needinfo?(mreid)
Comment 4•8 years ago
We could increase the monitoring timeout at [1] to something like 5 minutes. That would reduce the rate at which we're hitting DescribeCluster by 5x. What do you think?

[1] https://github.com/mozilla/telemetry-airflow/blob/master/dags/operators/emr_spark_operator.py#L155
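The proposal above amounts to widening the sleep in the operator's monitoring loop. A minimal sketch, assuming a callable that returns the cluster state (all names are hypothetical, not the actual EMRSparkOperator code):

```python
import time

def wait_for_completion(describe, poll_seconds=300):
    """Poll a cluster-status callable until it reports a terminal state.

    `describe` stands in for the DescribeCluster call. Raising
    poll_seconds to 300 (5 minutes) cuts the DescribeCluster request
    rate 5x relative to polling every minute.
    """
    terminal = {"TERMINATED", "TERMINATED_WITH_ERRORS"}
    while True:
        state = describe()
        if state in terminal:
            return state
        time.sleep(poll_seconds)
```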
Updated•8 years ago
Flags: needinfo?(rvitillo)
Flags: needinfo?(bimsland)
Reporter
Comment 5•8 years ago
Attachment #8796163 - Flags: review?(mdoglio)
Reporter
Updated•8 years ago
Flags: needinfo?(rvitillo)
Updated•8 years ago
Attachment #8796163 - Flags: review?(mdoglio) → review+
Comment 6•8 years ago
This problem seems to be resolved for now.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(bimsland)
Resolution: --- → FIXED
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard