Closed Bug 1477570 Opened 7 years ago Closed 6 years ago

[sync] The queue is getting stuck

Categories

(Webtools Graveyard :: Pontoon, enhancement, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mathjazz, Assigned: jotes)

References

Details

Sync queue is getting stuck on RabbitMQ every few days and needs to be released manually: http://mozilla-pontoon.readthedocs.io/en/latest/admin/maintenance.html#releasing-the-queue We need to investigate what's going on.
For starters, it would be nice to set up some alerts to inform us when the queue gets stuck, so that we know we need to release it. That might not be entirely trivial in e.g. Papertrail, because there are no error messages in the log (no sync activity means no sync activity in the log as well).
We should also look into our CloudAMQP (Rabbitmq) plan and consider upgrading.
We should also upgrade to the latest Heroku stack (probably unrelated, but who knows): https://devcenter.heroku.com/articles/upgrading-to-the-latest-stack
Note: Pontoon has been on the latest stack (heroku-16) since Monday.
The queue got stuck again and I noticed it happened right after Cycling (restart) of the worker: https://devcenter.heroku.com/articles/dynos#automatic-dyno-restarts

Looking into the log I noticed the pattern: every time the queue gets stuck, it happens when Cycling interrupts the sync process. Messages (sync tasks) that are left in the queue when the worker dies never leave the queue. Messages added after each subsequent sync period just pile up. Note that the probability of this scenario increases with the sync frequency, so it's not surprising we've been encountering the problem more often when we were syncing every 10 minutes.

For reference, this is how Cycling looks in the log:

heroku/worker.1: Cycling
heroku/worker.1: State changed from up to starting
heroku/worker.1: Stopping all processes with SIGTERM
app/worker.1: worker: Warm shutdown (MainProcess)
app/worker.1: [2018-07-26 10:10:39,464: ERROR/MainProcess] Process 'Worker-146' pid:28302 exited with 'signal 15 (SIGTERM)'
app/worker.1: [2018-07-26 10:10:39,986: ERROR/MainProcess] Process 'Worker-145' pid:28120 exited with 'signal 15 (SIGTERM)'
app/worker.1: [2018-07-26 10:10:39,997: ERROR/MainProcess] Task pontoon.sync.tasks.sync_project[b02a4c6e-6837-4598-b22d-c291e320be30] raised unexpected: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)
app/worker.1: Traceback (most recent call last):
app/worker.1:   File "/app/.heroku/python/lib/python2.7/site-packages/billiard/pool.py", line 1171, in mark_as_worker_lost
app/worker.1:     human_status(exitcode)),
app/worker.1: WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
heroku/worker.1: Process exited with status 0
app/worker.1: [2018-07-26 10:10:55,927: WARNING/MainProcess] /app/.heroku/python/lib/python2.7/site-packages/celery/apps/worker.py:161: CDeprecationWarning:
app/worker.1: Starting from version 3.2 Celery will refuse to accept pickle by default.
app/worker.1: The pickle serializer is a security concern as it may give attackers
app/worker.1: the ability to execute any command. It's important to secure
app/worker.1: your broker from unauthorized access when using pickle, so we think
app/worker.1: that enabling pickle should require a deliberate action and not be
app/worker.1: the default choice.
app/worker.1: If you depend on pickle then you should set a setting to disable this
app/worker.1: warning and to be sure that everything will continue working
app/worker.1: when you upgrade to Celery 3.2::
app/worker.1:     CELERY_ACCEPT_CONTENT = ['pickle', 'json', 'msgpack', 'yaml']
app/worker.1: You must only enable the serializers that you will actually use.
app/worker.1:   warnings.warn(CDeprecationWarning(W_PICKLE_DEPRECATED))
app/worker.1:  -------------- celery@7c76ca9a-1d45-4c14-9579-42153cdabbce v3.1.18 (Cipater)
app/worker.1: ---- **** -----
app/worker.1: --- * ***  * -- Linux-4.4.0-1019-aws-x86_64-with-debian-stretch-sid
app/worker.1: -- * - **** ---
app/worker.1: - ** ---------- [config]
app/worker.1: - ** ---------- .> app:         pontoon:0x7fc486c600d0
app/worker.1: - ** ---------- .> transport:   amqp://lzlaccmk:**@moose.rmq.cloudamqp.com:5672/lzlaccmk
app/worker.1: - ** ---------- .> results:     disabled
app/worker.1: - *** --- * --- .> concurrency: 2 (prefork)
app/worker.1: -- ******* ----
app/worker.1: --- ***** ----- [queues]
app/worker.1:  -------------- .> celery           exchange=celery(direct) key=celery
app/worker.1: [tasks]
app/worker.1:   . pontoon.checks.tasks.bulk_run_checks
app/worker.1:   . pontoon.sync.tasks.sync_project
app/worker.1:   . pontoon.sync.tasks.sync_translations
app/worker.1: 2018-07-26 10:10:55,944 (11/MainThread) newrelic.core.agent INFO - New Relic Python Agent (2.106.1.88)
app/worker.1: 2018-07-26 10:10:55,942 (10/MainThread) newrelic.core.agent INFO - New Relic Python Agent (2.106.1.88)
app/worker.1: [2018-07-26 10:10:55,944: INFO/Worker-2] New Relic Python Agent (2.106.1.88)
app/worker.1: [2018-07-26 10:10:55,942: INFO/Worker-1] New Relic Python Agent (2.106.1.88)
app/worker.1: [2018-07-26 10:10:55,973: INFO/MainProcess] Connected to amqp://lzlaccmk:**@moose.rmq.cloudamqp.com:5672/lzlaccmk
app/worker.1: [2018-07-26 10:10:56,031: WARNING/MainProcess] celery@7c76ca9a-1d45-4c14-9579-42153cdabbce ready.
See Also: → 1398338
I am definitely not an expert on message brokers, but it seems you are using RabbitMQ. From my experience with RMQ, this sounds like a misconfiguration of the broker. Do you have access to the configuration of connection keep-alives, heartbeats, or something similar to https://www.rabbitmq.com/confirms.html#automatic-requeueing ? I never did that, but IIRC it's possible to make the client "ignore" redelivered (at least once failed) messages. Isn't that your case by any chance? It might also be interesting to check how dead letter exchanges and message TTL work, so that in case of a "broken" message, the others won't get blocked. If you can point me to the code producing and consuming the messages, I can have a look and see if I can come up with a way to improve the configuration.

:mathjazz Could you check if the transport url is the same before and after a restart of Celery?

Flags: needinfo?(m)

As far as I can tell from the logs, it's always the same, it looks like this:

 -------------- celery@6c0e4709-ccf0-41bd-8276-7fa0f9567507 v3.1.18 (Cipater) 
---- **** -----  
--- * ***  * -- Linux-4.4.0-1031-aws-x86_64-with-debian-buster-sid 
-- * - **** ---  
- ** ---------- [config] 
- ** ---------- .> app:         pontoon:0x7f63ba8e5410 
- ** ---------- .> transport:   amqp://lzlaccmk:**@moose.rmq.cloudamqp.com:5672/lzlaccmk 
- ** ---------- .> results:     disabled 
- *** --- * --- .> concurrency: 8 (prefork) 
-- ******* ----  
--- ***** ----- [queues] 
 -------------- .> celery           exchange=celery(direct) key=celery 
Flags: needinfo?(m)

:mathjazz

(feel free to ignore this comment if you managed to get alerts on the prod).
One of the possible problems to fix in this bug is the lack of alarms coming from Papertrail.
As stated in the docs, Papertrail doesn't support inactivity alerts:
https://help.papertrailapp.com/kb/how-it-works/alerts/#inactivity-and-metrics-integrations

A small thing would be to add a Cron task which would produce an error message if the last sync happened >X minutes ago.
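The idea could be sketched like this (a hypothetical stand-in, not actual Pontoon code; the function name, the threshold, and the message format are all assumptions — the point is to emit a log line that a Papertrail alert can match on):

```python
import datetime

# Assumed threshold; would be tuned to the sync frequency on prod.
MAX_SYNC_AGE = datetime.timedelta(minutes=60)

def check_sync_health(last_sync_time, now=None):
    """Return an alert message if the last sync is too old, else None.

    A cron/periodic task would call this and log the returned message
    as an error, so a log-based alert can pick it up.
    """
    now = now or datetime.datetime.utcnow()
    age = now - last_sync_time
    if age > MAX_SYNC_AGE:
        return "ERROR: no sync for %d minutes" % (age.total_seconds() // 60)
    return None
```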

Flags: needinfo?(m)

Actually the docs say you can set up inactivity alerts:
https://help.papertrailapp.com/kb/how-it-works/alerts/#inactivity-alerts

I've set one up for when the "Sync complete." message doesn't show up for over 1 hour. Thanks for the idea! :)

I can add more people to the list of recipients if the alert proves to be working as expected and if we don't fix the bug soon, of course.

Flags: needinfo?(m)

:mathjazz

I misunderstood the docs, sorry for the confusion.

:mathjazz
Is it okay to take this one?

Absolutely yes. :)

Assignee: nobody → poke
Status: NEW → ASSIGNED

Just came across this, which might be useful:

"We assume that a system administrator deliberately killing the task does not want it to automatically restart."
https://docs.celeryproject.org/en/latest/userguide/tasks.html

"If you really want a task to be redelivered in these scenarios you should consider enabling the task_reject_on_worker_lost setting."
https://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-task_reject_on_worker_lost
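For reference, the two settings would be combined roughly like this in a Celery 4+ configuration (a sketch only; CELERY_OVERRIDES is a hypothetical name, not Pontoon's actual settings module):

```python
# Sketch of the Celery 4+ settings discussed above (not Pontoon's
# actual configuration). With task_acks_late, a message is acknowledged
# only after the task completes instead of when it is picked up;
# task_reject_on_worker_lost additionally makes the broker re-queue a
# task whose worker died mid-execution, e.g. from a SIGTERM during a
# Heroku dyno cycle.
CELERY_OVERRIDES = {
    "task_acks_late": True,
    "task_reject_on_worker_lost": True,
}
```

Note the caveat from the Celery docs: enabling redelivery this way can cause a task that crashes its worker to loop forever, which is why it is off by default.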

Hey,
Sadly, I don't have a solution yet. However, I've decided to dump my current thoughts:

To give you an idea, we sync every 20 minutes, and this problem happened twice in the last week (Dec 26, Dec 29). But it can as well keep working for several weeks without problems.

I think we need a way to reproduce it systematically (make the sync artificially longer + kill something?)

:jotes, thanks for the investigation!

According to that GH discussion, the issue seems specific to Heroku. AFAICT we never reproduced the problem locally, but you should be able to do so on Heroku by following Comment 5.

Looks like both potential solutions (REMAP_SIGTERM, task_reject_on_worker_lost) point towards upgrading to celery 4, which we might also need for the python3 upgrade.

(In reply to Jarek Śmiejczak [:jotes] from comment #15)

Note that we are very likely to upgrade celery to its latest version as part of the Python 3 move. More on this in early January.

After an evening spent on testing Celery on Heroku with CloudAMQP, I couldn't force the queue to get stuck. I probably don't understand what actually happens in the prod environment and I'll need a few more details.

This is how I tried to reproduce the problem (but without success):

  • I don't have many projects on my testing instance of Pontoon. To make things slower, I've added the following line: https://github.com/jotes/pontoon/blob/heroku-test-sync-ampq/pontoon/sync/tasks.py#L79 - I had to make the sync task slow enough to give me some time to trigger a SIGTERM signal.
  • I execute ./manage.py sync_projects -> It starts and waits.
  • I disable the celery worker in the Heroku console
  • The Celery worker goes down and I see WorkerLostError (as expected)
  • I execute ./manage.py sync_projects a couple of times when the worker is down.
  • New messages are added to the Celery queue, I'm able to see them on the RabbitMQ Admin panel (and they're marked as "ready")
  • After some time, I run the worker again.
  • All messages in the celery queue change their state from "ready" to "unacknowledged", but nothing happens: the worker is doing nothing (or is doing something that is not visible in the logs).
  • I run the sync task again
  • Then the worker picks up all tasks and executes them :(
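The artificial slowdown from the first step can be sketched like this (a hypothetical stand-in for the line added in the linked branch; the function name and the delay are assumptions):

```python
import time

# Hypothetical sketch of the slowdown added to the sync task for
# testing (the real change lives in the linked branch). Keeping the
# task busy leaves a window in which the worker can be stopped with
# SIGTERM while the task is still running.
def slow_sync_project(project_pk, delay=180):
    time.sleep(delay)  # give ourselves time to stop the worker
    # ... the real sync work would happen here ...
    return project_pk
```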

Did I reproduce it correctly?
BTW, I have a few questions/concerns:

  • What's the current value of https://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-task_acks_late on Prod? Celery should acknowledge a message after a Celery worker picks it up (by default).
  • Can you look at your RabbitMQ admin console and show the number of messages from around the time the queue gets stuck? A screenshot could help. Here's how it looks on my instance: https://imgur.com/a/rHzv6jX
  • Every execution of ./manage.py sync_projects should publish a new message to the celery queue -> does that happen after every cycling event? How many messages are produced during the downtime of the Celery workers? I don't know why the workers don't pick up new messages and process them (yet).
Flags: needinfo?(m)

(In reply to Jarek Śmiejczak [:jotes] from comment #19)

Did I reproduce it correctly?

No. :) On prod, sync jobs following the worker restart just add messages to the queue, but none of them gets executed.

Have you set CELERY_ALWAYS_EAGER to True in your environment?

What's the current value of https://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-task_acks_late on Prod? Celery should acknowledge a message after a Celery worker picks it up (by default).

We use the default value. These are the only Celery settings we override:
https://github.com/mozilla/pontoon/blob/ad502fe9475a1c223cccb22baac9be0a47545476/pontoon/settings/base.py#L718-L733

Just a guess: looking at CELERYD_MAX_TASKS_PER_CHILD, I wonder if the number of projects needs to be higher than 20 (e.g. 30) to reproduce the error reliably.

Can you look at your RabbitMQ admin console and show the number of messages from around the time the queue gets stuck? A screenshot could help. Here's how it looks on my instance: https://imgur.com/a/rHzv6jX

I can check that the next time the queue gets stuck.

Every execution of ./manage.py sync_projects should publish a new message to the celery queue -> does that happen after every cycling event?

Yes.

How many messages are produced during the downtime of the Celery workers? I don't know why the workers don't pick up new messages and process them (yet).

I think the number is usually 0. We have around 35 projects to sync, so when the worker restarts, 0-35 jobs get stuck. Since sync_projects is executed every 20 minutes and the worker restarts in a matter of seconds, it's not very likely that the next sync_projects task hits the window when the worker is down.
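As a back-of-the-envelope check of that reasoning (the sync duration used here is an assumption, not a measured value):

```python
# Rough estimate: if a sync of the given duration starts every
# `period_s` seconds, a dyno restart landing at a uniformly random
# moment interrupts a running sync with probability duration/period.
def interruption_probability(sync_duration_s, period_s):
    return min(1.0, sync_duration_s / period_s)

# Assuming a 2-minute sync every 20 minutes, roughly 10% of restarts
# land mid-sync; at a 10-minute period that doubles to ~20%, matching
# the earlier observation that a higher sync frequency made the
# problem show up more often.
```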

Flags: needinfo?(m)

It looks like the problem doesn't occur on prod anymore (probably due to the recent upgrade of Celery). Is it okay to close it?
I'll create a follow-up bug about the Docker configuration that may help people debug issues with Celery/RabbitMQ environments locally.

Flags: needinfo?(m)

Thank you very much jotes!

This bug has been biting us for a very long time. The fix not only saves us from the occasional trouble of releasing the queue manually, but also allows us to increase the sync frequency by a factor of 2 (and opens the door for more in the future).

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Flags: needinfo?(m)
Resolution: --- → FIXED
See Also: → 1702385
Product: Webtools → Webtools Graveyard