Closed Bug 1477570 Opened 7 years ago Closed 6 years ago

[sync] The queue is getting stuck

Categories

(Webtools Graveyard :: Pontoon, enhancement, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mathjazz, Assigned: jotes)

References

Details

Sync queue is getting stuck on RabbitMQ every few days and needs to be released manually: http://mozilla-pontoon.readthedocs.io/en/latest/admin/maintenance.html#releasing-the-queue We need to investigate what's going on.
For starters, it would be nice to set up some alerts to inform us when the queue gets stuck, so that we know we need to release it. That might not be entirely trivial in e.g. Papertrail, because there are no error messages in the log (no sync activity means no sync activity in the log as well).
We should also look into our CloudAMQP (Rabbitmq) plan and consider upgrading.
We should also upgrade to the latest Heroku stack (probably unrelated, but who knows): https://devcenter.heroku.com/articles/upgrading-to-the-latest-stack
Note: Pontoon has been on the latest stack (heroku-16) since Monday.
The queue got stuck again and I noticed it happened right after Cycling (restart) of the worker: https://devcenter.heroku.com/articles/dynos#automatic-dyno-restarts

Looking into the log I noticed the pattern: every time the queue gets stuck, it happens when Cycling interrupts the sync process. Messages (sync tasks) that are left in the queue when the worker dies never leave the queue. Messages added after each subsequent sync period just pile up. Note that the probability of this scenario increases with the sync frequency, so it's not surprising we've been encountering the problem more often when we were syncing every 10 minutes.

For reference, this is how Cycling looks in the log:

heroku/worker.1: Cycling
heroku/worker.1: State changed from up to starting
heroku/worker.1: Stopping all processes with SIGTERM
app/worker.1: worker: Warm shutdown (MainProcess)
app/worker.1: [2018-07-26 10:10:39,464: ERROR/MainProcess] Process 'Worker-146' pid:28302 exited with 'signal 15 (SIGTERM)'
app/worker.1: [2018-07-26 10:10:39,986: ERROR/MainProcess] Process 'Worker-145' pid:28120 exited with 'signal 15 (SIGTERM)'
app/worker.1: [2018-07-26 10:10:39,997: ERROR/MainProcess] Task pontoon.sync.tasks.sync_project[b02a4c6e-6837-4598-b22d-c291e320be30] raised unexpected: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)
app/worker.1: Traceback (most recent call last):
app/worker.1:   File "/app/.heroku/python/lib/python2.7/site-packages/billiard/pool.py", line 1171, in mark_as_worker_lost
app/worker.1:     human_status(exitcode)),
app/worker.1: WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
heroku/worker.1: Process exited with status 0
app/worker.1: [2018-07-26 10:10:55,927: WARNING/MainProcess] /app/.heroku/python/lib/python2.7/site-packages/celery/apps/worker.py:161: CDeprecationWarning:
app/worker.1: Starting from version 3.2 Celery will refuse to accept pickle by default.
app/worker.1: The pickle serializer is a security concern as it may give attackers
app/worker.1: the ability to execute any command. It's important to secure
app/worker.1: your broker from unauthorized access when using pickle, so we think
app/worker.1: that enabling pickle should require a deliberate action and not be
app/worker.1: the default choice.
app/worker.1: If you depend on pickle then you should set a setting to disable this
app/worker.1: warning and to be sure that everything will continue working
app/worker.1: when you upgrade to Celery 3.2::
app/worker.1:     CELERY_ACCEPT_CONTENT = ['pickle', 'json', 'msgpack', 'yaml']
app/worker.1: You must only enable the serializers that you will actually use.
app/worker.1:   warnings.warn(CDeprecationWarning(W_PICKLE_DEPRECATED))
app/worker.1:  -------------- celery@7c76ca9a-1d45-4c14-9579-42153cdabbce v3.1.18 (Cipater)
app/worker.1: ---- **** -----
app/worker.1: --- * ***  * -- Linux-4.4.0-1019-aws-x86_64-with-debian-stretch-sid
app/worker.1: -- * - **** ---
app/worker.1: - ** ---------- [config]
app/worker.1: - ** ---------- .> app:         pontoon:0x7fc486c600d0
app/worker.1: - ** ---------- .> transport:   amqp://lzlaccmk:**@moose.rmq.cloudamqp.com:5672/lzlaccmk
app/worker.1: - ** ---------- .> results:     disabled
app/worker.1: - *** --- * --- .> concurrency: 2 (prefork)
app/worker.1: -- ******* ----
app/worker.1: --- ***** ----- [queues]
app/worker.1:  -------------- .> celery           exchange=celery(direct) key=celery
app/worker.1: [tasks]
app/worker.1:   . pontoon.checks.tasks.bulk_run_checks
app/worker.1:   . pontoon.sync.tasks.sync_project
app/worker.1:   . pontoon.sync.tasks.sync_translations
app/worker.1: 2018-07-26 10:10:55,944 (11/MainThread) newrelic.core.agent INFO - New Relic Python Agent (2.106.1.88)
app/worker.1: 2018-07-26 10:10:55,942 (10/MainThread) newrelic.core.agent INFO - New Relic Python Agent (2.106.1.88)
app/worker.1: [2018-07-26 10:10:55,944: INFO/Worker-2] New Relic Python Agent (2.106.1.88)
app/worker.1: [2018-07-26 10:10:55,942: INFO/Worker-1] New Relic Python Agent (2.106.1.88)
app/worker.1: [2018-07-26 10:10:55,973: INFO/MainProcess] Connected to amqp://lzlaccmk:**@moose.rmq.cloudamqp.com:5672/lzlaccmk
app/worker.1: [2018-07-26 10:10:56,031: WARNING/MainProcess] celery@7c76ca9a-1d45-4c14-9579-42153cdabbce ready.
See Also: → 1398338
I am definitely not an expert on message brokers, but it seems you are using RabbitMQ. From my experience with RMQ, this sounds like a misconfiguration of the broker. Do you have access to the configuration of connection keep-alives, heartbeats, or something similar to https://www.rabbitmq.com/confirms.html#automatic-requeueing ? I never did that, but IIRC it's possible to make the client "ignore" redelivered (at least once failed) messages. Isn't that your case by any chance? It might also be interesting to check how dead letter exchanges and message TTL work, so that in case of a "broken" message, the others won't get blocked. If you can point me to the code producing and consuming the messages, I can have a look and see if I can come up with a way to improve the configuration.

:mathjazz Could you check if the transport url is the same before and after a restart of Celery?

Flags: needinfo?(m)

As far as I can tell from the logs, it's always the same, it looks like this:

 -------------- celery@6c0e4709-ccf0-41bd-8276-7fa0f9567507 v3.1.18 (Cipater) 
---- **** -----  
--- * ***  * -- Linux-4.4.0-1031-aws-x86_64-with-debian-buster-sid 
-- * - **** ---  
- ** ---------- [config] 
- ** ---------- .> app:         pontoon:0x7f63ba8e5410 
- ** ---------- .> transport:   amqp://lzlaccmk:**@moose.rmq.cloudamqp.com:5672/lzlaccmk 
- ** ---------- .> results:     disabled 
- *** --- * --- .> concurrency: 8 (prefork) 
-- ******* ----  
--- ***** ----- [queues] 
 -------------- .> celery           exchange=celery(direct) key=celery 
Flags: needinfo?(m)

:mathjazz

(feel free to ignore this comment if you managed to get alerts on the prod).
One of the possible problems to fix in this bug is the lack of alarms coming from Papertrail.
As stated in the docs, Papertrail doesn't support inactivity alerts:
https://help.papertrailapp.com/kb/how-it-works/alerts/#inactivity-and-metrics-integrations

A small thing would be to add a Cron task which would produce an error message if the last sync happened >X minutes ago.
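The idea could be sketched like this (a hypothetical stand-in, not actual Pontoon code; the function name, the threshold, and the message format are all assumptions — the point is to emit a log line that a Papertrail alert can match on):

```python
import datetime

# Assumed threshold; would be tuned to the sync frequency on prod.
MAX_SYNC_AGE = datetime.timedelta(minutes=60)

def check_sync_health(last_sync_time, now=None):
    """Return an alert message if the last sync is too old, else None.

    A cron/periodic task would call this and log the returned message
    as an error, so a log-based alert can pick it up.
    """
    now = now or datetime.datetime.utcnow()
    age = now - last_sync_time
    if age > MAX_SYNC_AGE:
        return "ERROR: no sync for %d minutes" % (age.total_seconds() // 60)
    return None
```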

Flags: needinfo?(m)

Actually the docs say you can set up inactivity alerts:
https://help.papertrailapp.com/kb/how-it-works/alerts/#inactivity-alerts

I've set one up for when the "Sync complete." message doesn't show up for over 1 hour. Thanks for the idea! :)

I can add more people to the list of recipients if the alert proves to be working as expected and if we don't fix the bug soon, of course.

Flags: needinfo?(m)

:mathjazz

I misunderstood the docs, sorry for the confusion.

:mathjazz
Is it okay to take this one?

Absolutely yes. :)

Assignee: nobody → poke
Status: NEW → ASSIGNED

Just came across this, which might be useful:

"We assume that a system administrator deliberately killing the task does not want it to automatically restart."
https://docs.celeryproject.org/en/latest/userguide/tasks.html

"If you really want a task to be redelivered in these scenarios you should consider enabling the task_reject_on_worker_lost setting."
https://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-task_reject_on_worker_lost
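For reference, the two settings would be combined roughly like this in a Celery 4+ configuration (a sketch only; CELERY_OVERRIDES is a hypothetical name, not Pontoon's actual settings module):

```python
# Sketch of the Celery 4+ settings discussed above (not Pontoon's
# actual configuration). With task_acks_late, a message is acknowledged
# only after the task completes instead of when it is picked up;
# task_reject_on_worker_lost additionally makes the broker re-queue a
# task whose worker died mid-execution, e.g. from a SIGTERM during a
# Heroku dyno cycle.
CELERY_OVERRIDES = {
    "task_acks_late": True,
    "task_reject_on_worker_lost": True,
}
```

Note the caveat from the Celery docs: enabling redelivery this way can cause a task that crashes its worker to loop forever, which is why it is off by default.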

Hey,
Sadly, I don't have a solution yet. However, I've decided to dump my current thoughts:

To give you an idea, we sync every 20 minutes, and this problem happened twice in the last week (Dec 26, Dec 29). But it can as well keep working for several weeks without problems.

I think we need a way to reproduce it systematically (make the sync artificially longer + kill something?)

:jotes, thanks for the investigation!

According to that GH discussion, the issue seems specific to Heroku. AFAICT we never reproduced the problem locally, but you should be able to do so on Heroku by following Comment 5.

Looks like both potential solutions (REMAP_SIGTERM, task_reject_on_worker_lost) point towards upgrading to celery 4, which we might also need for the python3 upgrade.

(In reply to Jarek Śmiejczak [:jotes] from comment #15)

Note that we are very likely to upgrade celery to its latest version as part of the Python 3 move. More on this in early January.

After an evening spent on testing Celery on Heroku with CloudAMQP, I couldn't force the queue to get stuck. I probably don't understand what actually happens in the prod environment and I'll need a few more details.

This is how I tried to reproduce the problem (but without success):

  • I don't have many projects on my testing instance of Pontoon. To make things slower, I've added the following line: https://github.com/jotes/pontoon/blob/heroku-test-sync-ampq/pontoon/sync/tasks.py#L79 - I had to make the sync task slow enough to give me some time to trigger a SIGTERM signal.
  • I execute ./manage.py sync_projects -> It starts and waits.
  • I disable the celery worker in the Heroku console
  • The Celery worker goes down and I see WorkerLostError (as expected)
  • I execute ./manage.py sync_projects a couple of times when the worker is down.
  • New messages are added to the Celery queue, I'm able to see them on the RabbitMQ Admin panel (and they're marked as "ready")
  • After some time, I run the worker again.
  • All messages in the celery queue change their state from "ready" to "unacknowledged", but nothing happens: the worker is doing nothing (or is doing something that is not visible in the logs).
  • I run the sync task again
  • Then the worker picks up all tasks and executes them :(
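The artificial slowdown from the first step can be sketched like this (a hypothetical stand-in for the line added in the linked branch; the function name and the delay are assumptions):

```python
import time

# Hypothetical sketch of the slowdown added to the sync task for
# testing (the real change lives in the linked branch). Keeping the
# task busy leaves a window in which the worker can be stopped with
# SIGTERM while the task is still running.
def slow_sync_project(project_pk, delay=180):
    time.sleep(delay)  # give ourselves time to stop the worker
    # ... the real sync work would happen here ...
    return project_pk
```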

Did I reproduce it correctly?
BTW, I have a few questions/concerns:

  • What's the current value of https://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-task_acks_late on Prod? Celery should acknowledge a message after a Celery worker picks it up (by default).
  • Can you look at your RabbitMQ admin console and show the number of messages from around the time the queue gets stuck? A screenshot could help. Here's how it looks on my instance: https://imgur.com/a/rHzv6jX
  • Every execution of ./manage.py sync_projects should publish a new message to the celery queue -> does that happen after every cycling event? How many messages are produced during the downtime of the Celery workers? I don't know why the workers don't pick up new messages and process them (yet).
Flags: needinfo?(m)

(In reply to Jarek Śmiejczak [:jotes] from comment #19)

Did I reproduce it correctly?

No. :) On prod, sync jobs following the worker restart just add messages to the queue, but none of them gets executed.

Have you set CELERY_ALWAYS_EAGER to True in your environment?

What's the current value of https://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-task_acks_late on Prod? Celery should acknowledge a message after a Celery worker picks it up (by default).

We use the default value. These are the only Celery settings we override:
https://github.com/mozilla/pontoon/blob/ad502fe9475a1c223cccb22baac9be0a47545476/pontoon/settings/base.py#L718-L733

Just a guess: looking at CELERYD_MAX_TASKS_PER_CHILD, I wonder if the number of projects needs to be higher than 20 (e.g. 30) to reproduce the error reliably.

Can you look at your RabbitMQ admin console and show the number of messages from around the time the queue gets stuck? A screenshot could help. Here's how it looks on my instance: https://imgur.com/a/rHzv6jX

I can check that the next time the queue gets stuck.

Every execution of ./manage.py sync_projects should publish a new message to the celery queue -> does that happen after every cycling event?

Yes.

How many messages are produced during the downtime of the Celery workers? I don't know why the workers don't pick up new messages and process them (yet).

I think the number is usually 0. We have around 35 projects to sync, so when the worker restarts, 0-35 jobs get stuck. Since sync_projects is executed every 20 minutes and the worker restarts in a matter of seconds, it's not very likely that the next sync_projects task hits the window when the worker is down.
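As a back-of-the-envelope check of that reasoning (the sync duration used here is an assumption, not a measured value):

```python
# Rough estimate: if a sync of the given duration starts every
# `period_s` seconds, a dyno restart landing at a uniformly random
# moment interrupts a running sync with probability duration/period.
def interruption_probability(sync_duration_s, period_s):
    return min(1.0, sync_duration_s / period_s)

# Assuming a 2-minute sync every 20 minutes, roughly 10% of restarts
# land mid-sync; at a 10-minute period that doubles to ~20%, matching
# the earlier observation that a higher sync frequency made the
# problem show up more often.
```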

Flags: needinfo?(m)

It looks like the problem doesn't occur on prod anymore (probably due to the recent upgrade of Celery). Is it okay to close it?
I'll create a follow-up bug about the Docker configuration that may help people debug issues with Celery/RabbitMQ environments locally.

Flags: needinfo?(m)

Thank you very much jotes!

This bug has been biting us for a very long time. The fix not only saves us from the occasional trouble of releasing the queue manually, but also allows us to increase the sync frequency by a factor of 2 (and opens the door for more in the future).

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Flags: needinfo?(m)
Resolution: --- → FIXED
See Also: → 1702385
Product: Webtools → Webtools Graveyard