Closed Bug 1339288 Opened 7 years ago Closed 6 years ago

Consider using Celery acks_late to prevent loss of tasks if worker crashes

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P2)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: emorley, Unassigned)

Details

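It turns out that if a worker crashes mid-task we can lose that task: by default Celery acknowledges the message before the task runs, so the broker never redelivers it.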

Whilst that should be rare, I wonder whether we can also hit this case when Heroku terminates dynos for deploys or for the 24-hourly dyno restart.
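
To make the failure mode concrete, here is a rough, stdlib-only sketch of early versus late acknowledgement. It does not use Celery or a real broker: `ToyQueue`, `WorkerCrash`, `run_once` and `crashing_task` are hypothetical names invented for this illustration, and the requeue-on-disconnect behaviour is a simplified assumption about how an AMQP broker treats unacknowledged messages.

```python
from collections import deque


class WorkerCrash(Exception):
    """Stands in for a worker process dying mid-task (hypothetical)."""


class ToyQueue:
    """Tiny in-memory stand-in for a broker queue with acknowledgements."""

    def __init__(self, messages):
        self.ready = deque(messages)   # waiting to be delivered
        self.unacked = []              # delivered but not yet acknowledged

    def deliver(self):
        message = self.ready.popleft()
        self.unacked.append(message)
        return message

    def ack(self, message):
        # Once acknowledged, the queue forgets the message for good.
        self.unacked.remove(message)

    def connection_lost(self):
        # Assumption: anything the dead consumer never acked is requeued.
        self.ready.extend(self.unacked)
        self.unacked.clear()


def run_once(queue, task, acks_late):
    message = queue.deliver()
    if not acks_late:
        queue.ack(message)             # early ack: before the task runs
    try:
        task(message)
    except WorkerCrash:
        queue.connection_lost()
        return
    if acks_late:
        queue.ack(message)             # late ack: only after the task finished


def crashing_task(message):
    raise WorkerCrash(message)


early = ToyQueue(["job-1"])
run_once(early, crashing_task, acks_late=False)

late = ToyQueue(["job-1"])
run_once(late, crashing_task, acks_late=True)

print(list(early.ready))  # []        -> the task is gone
print(list(late.ready))   # ['job-1'] -> the task is back on the queue
```

With early acknowledgement the message has already been dropped by the time the crash happens. With late acknowledgement it is redelivered, which is also why the task has to be safe to run twice.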

"""
Should I use retry or acks_late?

Answer: Depends. It’s not necessarily one or the other, you may want to use both.

Task.retry is used to retry tasks, notably for expected errors that is catchable with the try: block. The AMQP transaction is not used for these errors: if the task raises an exception it is still acknowledged!

The acks_late setting would be used when you need the task to be executed again if the worker (for some reason) crashes mid-execution. It’s important to note that the worker is not known to crash, and if it does it is usually an unrecoverable error that requires human intervention (bug in the worker, or task code).

...

So use retry for Python errors, and if your task is idempotent combine that with acks_late if that level of reliability is required.
"""

See:
http://docs.celeryproject.org/en/3.1/faq.html#faq-acks-late-vs-retry
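
A minimal configuration sketch of what combining the two might look like, following the FAQ's advice. This is not Treeherder code: the app name, broker URL, `TransientError`, `do_work` and `process_job` are all hypothetical placeholders, and the retry values are arbitrary.

```python
from celery import Celery

# Hypothetical app name and broker URL, for illustration only.
app = Celery("example", broker="amqp://localhost//")


class TransientError(Exception):
    """Hypothetical expected failure, e.g. an upstream service timing out."""


def do_work(job_id):
    """Placeholder for the real task body, which must be idempotent."""


# acks_late=True: the message is acknowledged after the task returns,
# so a worker crash mid-execution leaves it unacknowledged for redelivery.
@app.task(bind=True, acks_late=True, max_retries=3, default_retry_delay=30)
def process_job(self, job_id):
    try:
        do_work(job_id)
    except TransientError as exc:
        # Expected Python-level error: ask Celery to retry explicitly.
        raise self.retry(exc=exc)
```

In Celery 3.1 the same behaviour can instead be enabled for every task with the `CELERY_ACKS_LATE` setting. Either way, a task may run more than once, so this only makes sense for idempotent tasks.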
Component: Treeherder → Treeherder: Infrastructure
We don't ever see worker crashes.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX