Closed Bug 1238476 Opened 10 years ago Closed 9 years ago

support-celery2.webapp.phx1.mozilla.com:Swap is WARNING

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dgarvey, Assigned: cliang)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/2430] )

Attachments

(2 files)

Mon 01:31:41 PST [1012] support-celery2.webapp.phx1.mozilla.com:Swap is WARNING: SWAP WARNING - 40% free (814 MB out of 2047 MB)
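For reference, the figure nagios reports can be approximated locally from free(1) output; this is a rough equivalent for sanity-checking, not the actual check plugin:

free -m | awk '/^Swap:/ {printf "%.0f%% free (%d MB out of %d MB)\n", $4/$2*100, $4, $2}'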
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/2430]
It's coming up to about 50%:

[dgarvey@support-celery2.webapp.phx1 ~]$ free
             total       used       free     shared    buffers     cached
Mem:       5992988    5878856     114132          4       3836      19496
-/+ buffers/cache:    5855524     137464
Swap:      2097148    1322512     774636
[dgarvey@support-celery2.webapp.phx1 ~]$
Don't see an improvement yet in memory usage: swap free climbs to about 50% but then drops back into the alert threshold.

[dgarvey@support-celery2.webapp.phx1 ~]$ ps axo %mem,pid,euser,cmd | sort -nr | head -n 10
4.1 10923 root /usr/bin/ruby /usr/bin/puppet agent --verbose --onetime --no-daemonize
3.8 1078 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
3.6 996 apache [same celeryd command line as above]
3.6 31274 apache [same celeryd command line as above]
3.6 31263 apache [same celeryd command line as above]
3.5 865 apache [same celeryd command line as above]
3.5 856 apache [same celeryd command line as above]
3.5 30688 apache [same celeryd command line as above]
3.5 2994 apache [same celeryd command line as above]
3.4 975 apache [same celeryd command line as above]

[dgarvey@support-celery2.webapp.phx1 ~]$ free
             total       used       free     shared    buffers     cached
Mem:       5992988    5687416     305572          4      56532     146380
-/+ buffers/cache:    5484504     508484
Swap:      2097148    1655440     441708

[dgarvey@support-celery2.webapp.phx1 ~]$ w
 10:17:27 up 124 days, 16:15, 1 user, load average: 0.08, 0.17, 0.14
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
dgarvey  pts/0    admin1a.private.  10:09    0.00s  0.08s  0.00s  w

[dgarvey@support-celery2.webapp.phx1 ~]$ sudo supervisorctl restart celery-kitsune-prod
celery-kitsune-prod: stopped
celery-kitsune-prod: started

[dgarvey@support-celery2.webapp.phx1 ~]$ ps axo %mem,pid,euser,cmd | sort -nr | head -n 10
1.6 12728 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
1.6 12724 apache [same celeryd command line as above]
1.6 12722 apache [same celeryd command line as above]
1.6 12718 apache [same celeryd command line as above]
1.6 12714 apache [same celeryd command line as above]
1.6 12710 apache [same celeryd command line as above]
1.6 12709 apache [same celeryd command line as above]
1.6 12706 apache [same celeryd command line as above]
1.6 12703 apache [same celeryd command line as above]
1.6 12701 apache [same celeryd command line as above]

[dgarvey@support-celery2.webapp.phx1 ~]$ free
             total       used       free     shared    buffers     cached
Mem:       5992988    2744252    3248736          4      89748     187572
-/+ buffers/cache:    2466932    3526056
Swap:      2097148      39348    2057800
[dgarvey@support-celery2.webapp.phx1 ~]$
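For the record, the sequence above is the usual swap-alert SOP on these boxes; a condensed sketch (all commands taken from the output above, program name per this host's supervisord config):

ps axo %mem,pid,euser,cmd | sort -nr | head -n 10   # identify top memory consumers (celeryd workers here)
sudo supervisorctl restart celery-kitsune-prod      # restart the supervised celery workers
free -m                                             # confirm swap has been released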
Assignee: server-ops-webops → cliang
Attached image sumo_celery_mem.png
TL;DR: restarted celery processes on sumocelery1.webapp.phx1. We'll need to wait at least one day to see if this fixes the issue.

Long form: Looking at the memory usage of the SUMO celery boxes [sumo_celery_mem.png], you can see that there is usually a nightly spike as the database is rebuilt. On the 11th, this pattern is broken:
- support-celery2 has the usual spike in memory usage, but higher than normal
- support-celery1 & 3 show a much slighter rise in memory usage but *stay* in that higher range
- sumocelery1 shows no peaks at all

Looking at the rabbitmq GUI -> Queues, I could see that one queue had a number of unacknowledged messages: vhost kitsune_prod, queue celery. [rabbitmq_queues_unacknowledged.png] If there are unacknowledged messages in a rabbit queue, odds are that those messages were sent from rabbit to a celery node but the celery node did not send back an acknowledgement that it received them. The SOP is to restart the celery process; that usually forces rabbit to realize that the unacknowledged messages it sent to the unresponsive celery node need to be resent.

Restarting celery on sumocelery1 took far longer than usual (on the order of minutes). However, once it was restarted, the number of unacknowledged messages dropped to zero. I prophylactically did a rolling restart of celery on support-celery1-3.webapp.phx1 so that they start from a "clean" memory pattern, and we'll see if the issues with memory crop up again tonight.
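The same check can be done from the command line on the rabbit node instead of the GUI; a sketch, assuming rabbitmqctl access and the vhost/queue named above:

rabbitmqctl list_queues -p kitsune_prod name messages_ready messages_unacknowledged

A messages_unacknowledged count that never drains is the signal that a consumer took deliveries but never acked them, which is what prompted the celery restart here.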
6:55 AM <nagios-phx1> Fri 06:54:59 PST [1015] support-celery3.webapp.phx1.mozilla.com:Swap is WARNING: SWAP WARNING - 40% free (815 MB out of 2046 MB) (http://m.mozilla.org/Swap)

[root@support-celery3.webapp.phx1 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          5894       5585        309          0        113        181
-/+ buffers/cache:       5289        605
Swap:         2046       1228        817
[root@support-celery3.webapp.phx1 ~]#
* Slightly lowered the number of celery processes running on the servers (from 30 to 24); a sketch of the corresponding supervisord change is below.
* Increased the threshold for the SUMO rabbitMQ alert in case this radically increases the backlog. (Looking at the data for the last week, it shouldn't.)
* Filed bug 1240213 for a better rabbit alert that gets closer to the root issue of why we care about a high number of messages in the queue.
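For context, the concurrency change corresponds to lowering celeryd's -c value in the supervisord program definition. A sketch of what the stanza might look like; only the command line, program name, and user are taken from the output above, while the file location and any other options are assumptions:

[program:celery-kitsune-prod]
; hypothetical stanza - adjust to the real config file for this host
command=/data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 24 -E
user=apache

After editing, running sudo supervisorctl update (or restarting the program) would pick up the change.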
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard