[sync] The queue is getting stuck
Categories
(Webtools Graveyard :: Pontoon, enhancement, P2)
Tracking
(Not tracked)
People
(Reporter: mathjazz, Assigned: jotes)
References
Details
Reporter
Comment 1•7 years ago
Reporter
Comment 2•7 years ago
Reporter
Comment 3•7 years ago
Reporter
Comment 4•7 years ago
Reporter
Comment 5•7 years ago
Comment 6•7 years ago
Assignee
Comment 7•7 years ago
:mathjazz Could you check if the transport URL is the same before and after a restart of Celery?
Reporter
Comment 8•7 years ago
As far as I can tell from the logs, it's always the same; it looks like this:
-------------- celery@6c0e4709-ccf0-41bd-8276-7fa0f9567507 v3.1.18 (Cipater)
---- **** -----
--- * *** * -- Linux-4.4.0-1031-aws-x86_64-with-debian-buster-sid
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app: pontoon:0x7f63ba8e5410
- ** ---------- .> transport: amqp://lzlaccmk:**@moose.rmq.cloudamqp.com:5672/lzlaccmk
- ** ---------- .> results: disabled
- *** --- * --- .> concurrency: 8 (prefork)
-- ******* ----
--- ***** ----- [queues]
-------------- .> celery exchange=celery(direct) key=celery
Assignee
Comment 9•7 years ago
:mathjazz
(Feel free to ignore this comment if you've already managed to get alerts on prod.)
One of the possible problems to fix in this bug is the lack of alerts coming from Papertrail.
As stated in the docs, Papertrail doesn't support inactivity alerts:
https://help.papertrailapp.com/kb/how-it-works/alerts/#inactivity-and-metrics-integrations
A small improvement would be to add a cron task that produces an error message if the last sync happened more than X minutes ago.
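The core of the suggested cron check would just be a staleness test on the timestamp of the last completed sync. A minimal sketch, assuming a 60-minute threshold; the function name and timestamps are illustrative, not Pontoon code:

```python
from datetime import datetime, timedelta

# Sketch of the suggested cron check: decide whether to emit an error message
# (which a log-based alert could match) because the last completed sync is too
# old. Names and the 60-minute threshold are assumptions for illustration.
def sync_is_stale(last_sync, now, max_age=timedelta(minutes=60)):
    """Return True if no sync completed within max_age (or none ever ran)."""
    if last_sync is None:
        return True
    return now - last_sync > max_age

# Example: a sync that finished 90 minutes ago should trigger the alert.
now = datetime(2019, 1, 1, 12, 0)
print(sync_is_stale(datetime(2019, 1, 1, 10, 30), now))  # True
print(sync_is_stale(datetime(2019, 1, 1, 11, 50), now))  # False
```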
Reporter
Comment 10•7 years ago
Actually, the docs say you can set up inactivity alerts:
https://help.papertrailapp.com/kb/how-it-works/alerts/#inactivity-alerts
I've set one up for when the "Sync complete." message doesn't show up for over 1 hour. Thanks for the idea! :)
I can add more people to the list of recipients if the alert proves to be working as expected, and if we don't fix the bug soon, of course.
Assignee
Comment 11•7 years ago
:mathjazz
I misunderstood the docs, sorry for the confusion.
Assignee
Comment 12•6 years ago
:mathjazz
Is it okay to take this one?
Reporter
Comment 14•6 years ago
Just came across this, which might be useful:
"We assume that a system administrator deliberately killing the task does not want it to automatically restart."
https://docs.celeryproject.org/en/latest/userguide/tasks.html
"If you really want a task to be redelivered in these scenarios you should consider enabling the task_reject_on_worker_lost setting."
https://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-task_reject_on_worker_lost
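For reference, enabling the setting quoted above would look roughly like this on Celery 4+ (the app name here is illustrative; as noted later in this bug, the setting doesn't exist on the Celery 3.1 series Pontoon was running at the time):

```python
from celery import Celery

app = Celery("pontoon")  # app name is illustrative

# task_reject_on_worker_lost only matters together with acks_late: the ack is
# deferred until the task finishes, so if the worker process dies first the
# message is requeued instead of being lost (Celery 4+ only).
app.conf.task_acks_late = True
app.conf.task_reject_on_worker_lost = True
```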
Assignee
Comment 15•6 years ago
Hey,
Sadly, I don't have a solution yet. However, I've decided to dump my current thoughts:
- Unfortunately, task_reject_on_worker_lost is not available in Celery 3.1.18 (see https://docs.celeryproject.org/en/3.1/configuration.html).
- I found a discussion on GitHub that may be related to the problem: https://github.com/celery/celery/issues/2839
  Additionally, the Heroku docs also mention this option: https://devcenter.heroku.com/articles/celery-heroku#using-remap_sigterm
- I don't know why Celery leaves messages in the queue. I couldn't reproduce the bug locally; Celery seems to drop messages from the RabbitMQ queue without any problems.
- I'll try to deploy Pontoon on Heroku and enable the CloudAMQP add-on to reproduce the issue.
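For completeness, the Heroku option linked above is set on the worker process in the Procfile. A sketch based on the linked Heroku article (the exact worker command is an assumption); remapping SIGTERM to SIGQUIT makes Heroku's dyno-shutdown signal trigger Celery's cold shutdown instead of a hanging warm one:

```
worker: REMAP_SIGTERM=SIGQUIT celery -A pontoon worker --loglevel=info
```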
Comment 16•6 years ago
To give you an idea, we sync every 20 minutes, and this problem happened twice in the last week (Dec 26, Dec 29). But it can just as well keep working for several weeks without problems.
I think we need a way to reproduce it systematically (make the sync artificially longer + kill something?).
Reporter
Comment 17•6 years ago
:jotes, thanks for the investigation!
According to that GH discussion, the issue seems specific to Heroku. AFAICT we never reproduced the problem locally, but you should be able to do so on Heroku by following Comment 5.
Looks like both potential solutions (REMAP_SIGTERM, task_reject_on_worker_lost) point towards upgrading to Celery 4, which we might also need for the Python 3 upgrade.
Comment 18•6 years ago
(In reply to Jarek Śmiejczak [:jotes] from comment #15)
> Unfortunately, task_reject_on_worker_lost is not available in Celery 3.1.18 (see https://docs.celeryproject.org/en/3.1/configuration.html).
Note that we are very likely to upgrade Celery to its latest version as part of the Python 3 move. More on this in early January.
Assignee
Comment 19•6 years ago
After an evening spent testing Celery on Heroku with CloudAMQP, I couldn't force the queue to get stuck. I probably don't understand what actually happens in the prod environment and I'll need a few more details.
This is how I tried to reproduce the problem (without success):
- I don't have many projects on my test instance of Pontoon. To make things slower, I've added the following line: https://github.com/jotes/pontoon/blob/heroku-test-sync-ampq/pontoon/sync/tasks.py#L79 (I have to make the sync task slow enough to give me time to trigger a SIGTERM signal).
- I execute ./manage.py sync_projects -> it starts and waits.
- I disable the Celery worker in the Heroku console.
- The Celery worker goes down and I see WorkerLostError (as expected).
- I execute ./manage.py sync_projects a couple of times while the worker is down.
- New messages are added to the Celery queue; I'm able to see them in the RabbitMQ admin panel (and they're marked as "ready").
- After some time, I run the worker again.
- All messages on the Celery queue change their state from "ready" to "unacknowledged". But nothing happens, and the worker is doing nothing (or is doing something that is not visible in the logs).
- I run the sync task again.
- The worker then picks up all tasks and executes them :(
Did I reproduce it correctly?
BTW, I have a few questions/concerns:
- What's the current value of https://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-task_acks_late on prod? By default, Celery acknowledges a message right after a worker picks it up.
- Can you look at your RabbitMQ admin console and show the number of messages from around the time the queue gets stuck? A screenshot could help. Here's how it looks on my instance: https://imgur.com/a/rHzv6jX
- Every execution of ./manage.py sync_projects should publish a new message to the Celery queue -> does that happen after every cycling event? How many messages are produced during the downtime of the Celery workers? I don't know (yet) why the workers don't pick up new messages and process them.
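The "make the sync slow enough to kill the worker mid-task" step above boils down to an artificial delay at the start of the task body. A hypothetical sketch (the function name, signature, and delay are assumptions for illustration, not the actual Pontoon task):

```python
import time

# Sketch of the reproduction trick described above: an artificial sleep at the
# start of the sync task leaves a wide window during which the worker can be
# disabled or SIGTERMed mid-task. Name and delay are illustrative only.
def sync_project(project_pk, artificial_delay=300):
    time.sleep(artificial_delay)  # window in which to kill the worker
    # ... the real sync work would follow here ...
    return project_pk
```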
Reporter
Comment 20•6 years ago
(In reply to Jarek Śmiejczak [:jotes] from comment #19)
> Did I reproduce it correctly?
No. :) On prod, sync jobs following the worker restart just add messages to the queue, but none of them gets executed.
Have you set CELERY_ALWAYS_EAGER to True in your environment?
> What's the current value of https://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-task_acks_late on prod?
We use the default value. These are the only Celery settings we override:
https://github.com/mozilla/pontoon/blob/ad502fe9475a1c223cccb22baac9be0a47545476/pontoon/settings/base.py#L718-L733
Just a guess: looking at CELERYD_MAX_TASKS_PER_CHILD, I wonder if the number of projects needs to be higher than 20 (e.g. 30) to reproduce the error reliably.
> Can you look at your RabbitMQ admin console and show the number of messages from around the time the queue gets stuck?
I can check that the next time it occurs.
> Every execution of ./manage.py sync_projects should publish a new message to the Celery queue -> does that happen after every cycling event?
Yes.
> How many messages are produced during the downtime of the Celery workers?
I think the number is usually 0. We have around 35 projects to sync, so when the worker restarts, 0-35 jobs get stuck. Since sync_projects is executed every 20 minutes and the worker restarts in a matter of seconds, it's not very likely that the next sync_projects task hits the window when the worker is down.
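A quick back-of-the-envelope check of that claim, assuming the worker restart window lasts about 30 seconds (the window length is an assumption, not a measured value):

```python
# If the restart window is ~30 s and sync_projects fires every 20 minutes,
# the chance that a given run lands inside the window is small.
restart_window_s = 30      # assumed restart duration
sync_interval_s = 20 * 60  # sync runs every 20 minutes
probability = restart_window_s / sync_interval_s
print(f"{probability:.1%}")  # 2.5%
```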
Assignee
Comment 21•6 years ago
It looks like the problem doesn't occur on prod anymore (probably due to the recent upgrade of Celery). Is it okay to close this bug?
I'll create a follow-up bug about the Docker configuration that may help people debug issues with Celery/RabbitMQ environments locally.
Reporter
Comment 22•6 years ago
Thank you very much, jotes!
This bug has been biting us for a very long time. The fix not only saves us from the more or less occasional trouble of releasing the stuck queue, but also allows us to increase the sync frequency by a factor of 2 (and opens the door for more in the future).
Updated•4 years ago