Bug 1326223 (Open), opened 7 years ago, updated 6 years ago

Use a retrying HTTPAdapter with requests to retry timeouts within the same task

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P3)


Tracking

(Not tracked)

People

(Reporter: emorley, Unassigned)

Details

Currently we automatically retry many of the celery tasks, so long as the exception that occurred isn't on the NON_RETRYABLE_EXCEPTIONS blacklist:
https://github.com/mozilla/treeherder/blob/71c0e50657d9662162a8c8a5944ddaf9b49c15f1/treeherder/workers/task.py#L12
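
For context, the task-level mechanism looks roughly like the sketch below. This is a hedged approximation for illustration only (the actual decorator is at the link above), and the NON_RETRYABLE_EXCEPTIONS tuple here is a placeholder:

import functools

from celery import shared_task

# Placeholder blacklist for illustration; the real list lives in the linked file.
NON_RETRYABLE_EXCEPTIONS = (ValueError, TypeError)

def retryable_task(**task_kwargs):
    """Run the task, retrying the whole task on any exception not blacklisted."""
    def decorator(func):
        @shared_task(bind=True, **task_kwargs)
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            try:
                return func(*args, **kwargs)
            except NON_RETRYABLE_EXCEPTIONS:
                raise  # known-permanent failures aren't worth retrying
            except Exception as exc:
                # Re-queues the entire task, so every attempt is visible in New Relic.
                raise self.retry(exc=exc)
        return wrapper
    return decorator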

However, many of the transient failures that occur during tasks are caused by problems making requests to external services.

As such, we could instead (or more likely, in addition) have requests retry the HTTP call itself, rather than having to retry the whole celery task.

Pros:
* Reduced overhead of having to restart the celery task
* Reduced noise in New Relic from what are purely external timeouts

Cons:
* Reduced visibility in New Relic (though perhaps mitigated by limiting the HTTPAdapter retries to 2-3, and leaving most of the retry attempts at the retryable_task() level)

Examples:
https://github.com/kennethreitz/requests/issues/2682#issuecomment-123273406
http://docs.python-requests.org/en/latest/user/advanced/#transport-adapters
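
A minimal sketch of the transport-adapter approach from the links above, assuming requests with urllib3's Retry; the function name, URL, and retry/backoff values are illustrative, not existing Treeherder code:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry  # requests.packages.urllib3.util.retry on older requests

def make_retrying_session(retries=3, backoff_factor=0.5):
    """Return a Session whose adapters retry connection/read failures in-process."""
    retry = Retry(
        total=retries,
        connect=retries,
        read=retries,
        backoff_factor=backoff_factor,     # exponential backoff between attempts
        status_forcelist=(502, 503, 504),  # also retry common gateway errors
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount('https://', adapter)
    session.mount('http://', adapter)
    return session

# Usage: swap bare requests.get() calls for a shared session.
# Note that by default urllib3 only retries idempotent methods (GET, HEAD, etc.).
session = make_retrying_session()
response = session.get('https://example.com/api', timeout=30)

This keeps the 2-3 low-level retries mentioned in the cons above inside the task, while retryable_task() still handles anything that ultimately fails.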

We'll likely need to lower the retryable_task() retry counts if we do this, to avoid an excessive number of total attempts when the two retry mechanisms are combined, since the adapter-level and task-level retry counts multiply rather than add for a persistently failing endpoint.
Component: Treeherder → Treeherder: Data Ingestion
Priority: -- → P3
Assignee: nobody → ghickman
Assignee: ghickman → nobody