As per bug 716953, it looks like rabbit is having real problems on this server. Dumitru suggested filing a bug to look into why. If we could get some ganglia graphs of the rabbit queue and look at the logs, we might be able to come up with a plan. At the moment I have no idea why it's failing.
If the connection to RabbitMQ goes through our firewall, I know it likes to close connections after half an hour or so of inactivity. Freddo ran into this and we set up a cron job to ping Celery through Rabbit to keep it alive.
Looking at the load, I'd be tempted to turn RabbitMQ off if nothing obvious springs to mind. Celery is there to cope with servers sending huge bursts of traffic. AMO is under control for that now at the client end. Khan was getting 200k errors a day from Socorro and coped just fine. To turn it off, set CELERY_ALWAYS_EAGER = True in local_settings and see how it does.
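For reference, a minimal sketch of the settings change suggested above, assuming the old-style uppercase Celery setting names used in this bug. With this flag set, tasks run synchronously in the calling process instead of being sent to the broker, so RabbitMQ is bypassed entirely.

```python
# local_settings.py (sketch; the file name is taken from the comment above)

# Execute Celery tasks locally and synchronously instead of publishing
# them to the broker. With this enabled, nothing is sent to RabbitMQ.
CELERY_ALWAYS_EAGER = True
```

The trade-off is that error processing then happens inside the web request, which is only acceptable if the incoming traffic is as well controlled as described.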
I added the SJC and PHX1 hosts to Ganglia, so we should be able to monitor the RabbitMQ Celery queue here:
https://ganglia.mozilla.org/phx1/?c=Arecibo&h=arecibo1.dmz.phx1.mozilla.com&m=load_one&r=hour&s=descending&hc=4&mc=2
https://ganglia.mozilla.org/sjc1/?c=Infra&h=ganglia1.dmz.sjc1.mozilla.com&m=load_one&r=hour&s=descending&hc=4&mc=2
While checking RabbitMQ I noticed that a queue was being created for every Celery task completed. I tried to get the number of queues created, but that caused RabbitMQ to become unresponsive. The last output on my screen showed 70K+ queues. These queues do not appear to be consumed, and their number continues to grow. Could we set CELERY_IGNORE_RESULT = True in Arecibo?
[firstname.lastname@example.org normal]# /usr/sbin/rabbitmqctl list_queues -p arecibo name memory messages
Listing queues ...
e698f8e13e8340cab3b9e5d36f636370 21728 1
02b46117479e4dc48341b747faba2dd9 21728 1
b327402b7d3e4bd883d3996fabffac5e 21728 1
c6cb5b3fc605468597246c538726171d 21728 1
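To summarise a listing like the one above without scrolling through it, here is a rough sketch that parses captured `rabbitmqctl list_queues ... name memory messages` output and totals it. The names `QueueInfo`, `parse_list_queues` and `SAMPLE` are made up for illustration. It works on text that has already been saved, since running the listing itself against a broker with this many queues was what made it unresponsive.

```python
from typing import List, NamedTuple


class QueueInfo(NamedTuple):
    name: str
    memory: int    # bytes, as reported by rabbitmqctl
    messages: int


def parse_list_queues(output: str) -> List[QueueInfo]:
    """Parse captured output of `rabbitmqctl list_queues name memory messages`."""
    queues = []
    for line in output.splitlines():
        fields = line.split()
        # Skip banner lines such as "Listing queues ..." and malformed rows.
        if len(fields) != 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        queues.append(QueueInfo(fields[0], int(fields[1]), int(fields[2])))
    return queues


# Sample rows copied from the listing in this comment.
SAMPLE = """\
Listing queues ...
e698f8e13e8340cab3b9e5d36f636370 21728 1
02b46117479e4dc48341b747faba2dd9 21728 1
b327402b7d3e4bd883d3996fabffac5e 21728 1
c6cb5b3fc605468597246c538726171d 21728 1
"""

queues = parse_list_queues(SAMPLE)
total_memory = sum(q.memory for q in queues)
total_messages = sum(q.messages for q in queues)
print(len(queues), "queues,", total_memory, "bytes,", total_messages, "messages")
```

At roughly 21 KB and one unconsumed message per queue, as in the sample rows, 70K queues would account for well over a gigabyte of broker memory.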
That's useful info, thanks. Set CELERY_IGNORE_RESULT = True; we don't care whether a task succeeds or not. I will research why a queue is being created for each task completed.
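A sketch of that change, assuming the old-style uppercase setting name used here and that it goes in the same local settings file as the other Celery options:

```python
# Local settings override (sketch).

# Do not store task return values or states. Nothing in Arecibo reads
# them back, so there is no reason to keep a result around per task.
CELERY_IGNORE_RESULT = True
```

With results ignored, Celery has no result to publish for a finished task, which should stop the per-task queues from piling up if they are in fact result queues.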
Looks like this is a symptom of the task failing: it creates a temporary queue for each failure. Are the celery logs accessible anywhere, like syslog1 or similar?
Added CELERY_IGNORE_RESULT to settings_local.py. I dropped celeryd-arecibo.log.gz in your home directory on khan.
Thanks. Looked in the log; there's nothing there about the tasks failing, which is odd. Hopefully ignoring results helps.
Added nagios alerts for rabbitmq and celery on these hosts. The rabbit queue is being ingested fairly quickly but I will keep an eye on this for the next few days.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations