Closed Bug 1284289 Opened 8 years ago Closed 8 years ago

Improvements to celery task retrying & New Relic reporting

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

Attachments

(1 file)

There are currently a few issues:
* Bug 1281809 added the `@retryable_task` decorator to the `parse_log`, `store_failure_lines` and `crossreference_error_lines` tasks, however all three had their own retry logic already, so there are now two layers of retries, which is what contributed to the 100K message backlog in bug 1283413.
* New Relic lumps all of the `celery.exceptions.Retry()` exceptions together which isn't great for visibility. We should use the Python agent's `.record_exception()` feature to report the original exception (and include the retry count in the custom attributes list), then we can add `celery.exceptions.Retry()` to the ignore list on the New Relic dashboard config.
* There are a few other tasks that can be switched to the `@retryable_task` decorator to save duplication.
* The Celery docs now suggest prefixing the `.retry()` call with `raise `, to make the lack of continuing code execution more obvious. (Even though that particular `raise` will never actually occur.)
See Also: → 1284505
See Also: 1284505
Attachment #8772834 - Flags: review?(wlachance)
Comment on attachment 8772834 [details] [review]
[treeherder] mozilla:task-retry-tweaks > mozilla:master

Looks like some nice cleanup!
Attachment #8772834 - Flags: review?(wlachance) → review+
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/46b2a41f0b553c63787f94af79d0c93e7ea6f5c9
Bug 1284289 - Remove duplicate .retry() of crossreference_error_lines

Since the task already has the `@retryable_task` decorator applied.

https://github.com/mozilla/treeherder/commit/2f317a0be2ec7b6c993feb44b5d33e18791d77b0
Bug 1284289 - Remove duplicate .retry() of store_failure_lines

Since the task already has the `@retryable_task` decorator applied.

https://github.com/mozilla/treeherder/commit/77959aa2c7a6077f0a0c1ced5018619f76153d73
Bug 1284289 - Remove duplicate .retry() of parse_log

Since the task already has the `@retryable_task` decorator applied.

Note that the decorator retries in more cases than the .retry()s in
`utils.py` did, since previously only `urllib2.URLError` exceptions
during `extract_text_log_artifacts()` would retry, whereas now any will.
However (a) we can add more exemptions to `@retryable_task` as we
encounter them later, (b) this isn't a regression on the current
duplicate `.retry()` behaviour.

https://github.com/mozilla/treeherder/commit/ffd5a5c8921ec7f892f20462089686efe2231972
Bug 1284289 - Use @retryable_task with the submit_elasticsearch_doc task

Rather than duplicating more retry-handling logic.

https://github.com/mozilla/treeherder/commit/e518b52b6e359d742c2e85cf4400e219bf9f694f
Bug 1284289 - Clean up log parser logging statements

https://github.com/mozilla/treeherder/commit/9faf9d97ae0b050c1bdc3fda3c4bb7a4e6a15c34
Bug 1284289 - Clarify celery retry call behaviour with 'raise' prefix

As recommended by:
http://docs.celeryproject.org/en/stable/reference/celery.app.task.html#celery.app.task.Task.retry

https://github.com/mozilla/treeherder/commit/e699729b40f4359248aa54803de7f94b0a7abefc
Bug 1284289 - Simplify @retryable_task's exception blacklist

The previous variable name of `raise_exceptions` was easy to misread as
"exceptions that will cause this task to fail and thus retry". Also,
nothing is customising this list at present, so keep as a fixed
blacklist for now.

https://github.com/mozilla/treeherder/commit/77a809cd0c7adbc273a88a859d0350ab99ba1730
Bug 1284289 - Don't retry tasks for 'zlib.error` exceptions

This prevents continually retrying when we hit bug 1284360, given that
the task will never succeed anyway.

https://github.com/mozilla/treeherder/commit/f5817334d86e08836874b00ecac38fd6e4479f3c
Bug 1284289 - Manually report exceptions that caused tasks to retry

See in-code comment for more details.

Once this is deployed, I'll use the New Relic web configuration page to
add `celery.exceptions:Retry` to the ignore list. (Contrary to the
linked New Relic docs, this cannot be done via newrelic.ini, since the
server-side config takes preference once server-side mode is enabled.)
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
I've now added 'celery.exceptions:Retry' to the ignore list for all New Relic app configurations via the web settings UI, since we're reporting the exceptions individually.

As such:
* Exceptions of type "celery.exceptions:Retry" will no longer be ingested into New Relic
* The counts of existing exceptions will appear to increase, but it's really just the retries morphing signature (from the PR here)
* When looking at an individual exception, you can tell if it was from a retry, since it will have the custom attribute "number_of_prior_retries"
* The new "Error Analysis" section on New Relic should be much more usable, since exceptions won't just be all piled into the one bucket.

If anyone has any questions, just ask :-)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: