Closed Bug 1238476 (Opened 10 years ago, Closed 9 years ago)
support-celery2.webapp.phx1.mozilla.com:Swap is WARNING

Categories: Infrastructure & Operations Graveyard :: WebOps: Other (task)
Tracking: not tracked
Status: RESOLVED FIXED
People: Reporter dgarvey; Assignee cliang
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/2430]
Attachments: 2 files
Mon 01:31:41 PST [1012] support-celery2.webapp.phx1.mozilla.com:Swap is WARNING: SWAP WARNING - 40% free (814 MB out of 2047 MB)
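(For reference, the "% free" figure the alert reports can be reproduced on the host with free; a minimal sketch, assuming the classic procps layout where the Swap: line is total/used/free:)

# print swap "% free" the way the alert phrases it
free -m | awk '/^Swap:/ {printf "%.0f%% free (%d MB out of %d MB)\n", $4/$2*100, $4, $2}'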
Comment 1 (Reporter) • 10 years ago
It's coming up to about 50%:
[dgarvey@support-celery2.webapp.phx1 ~]$ free
total used free shared buffers cached
Mem: 5992988 5878856 114132 4 3836 19496
-/+ buffers/cache: 5855524 137464
Swap: 2097148 1322512 774636
[dgarvey@support-celery2.webapp.phx1 ~]$
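(One way to see which processes are actually holding that swap; a sketch, assuming a kernel that exposes VmSwap in /proc/<pid>/status:)

# per-process swapped-out memory in kB, largest first
for f in /proc/[0-9]*/status; do
  awk '/^Name:/ {name=$2} /^VmSwap:/ {print $2, name}' "$f"
done 2>/dev/null | sort -rn | head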
Comment 2 (Reporter) • 10 years ago
No improvement in memory usage yet. Free swap climbs back to about 50% but then drops back into the alert threshold.
[dgarvey@support-celery2.webapp.phx1 ~]$ ps axo %mem,pid,euser,cmd | sort -nr | head -n 10
4.1 10923 root /usr/bin/ruby /usr/bin/puppet agent --verbose --onetime --no-daemonize
3.8 1078 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
3.6 996 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
3.6 31274 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
3.6 31263 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
3.5 865 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
3.5 856 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
3.5 30688 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
3.5 2994 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
3.4 975 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
[dgarvey@support-celery2.webapp.phx1 ~]$ free
total used free shared buffers cached
Mem: 5992988 5687416 305572 4 56532 146380
-/+ buffers/cache: 5484504 508484
Swap: 2097148 1655440 441708
[dgarvey@support-celery2.webapp.phx1 ~]$ w
10:17:27 up 124 days, 16:15, 1 user, load average: 0.08, 0.17, 0.14
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
dgarvey pts/0 admin1a.private. 10:09 0.00s 0.08s 0.00s w
[dgarvey@support-celery2.webapp.phx1 ~]$ sudo supervisorctl restart celery-kitsune-prod
celery-kitsune-prod: stopped
celery-kitsune-prod: started
[dgarvey@support-celery2.webapp.phx1 ~]$ ps axo %mem,pid,euser,cmd | sort -nr | head -n 10
1.6 12728 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
1.6 12724 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
1.6 12722 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
1.6 12718 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
1.6 12714 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
1.6 12710 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
1.6 12709 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
1.6 12706 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
1.6 12703 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
1.6 12701 apache /data/www/support.mozilla.com/kitsune/virtualenv/bin/python /data/www/support.mozilla.com/kitsune/manage.py celeryd --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 30 -E
[dgarvey@support-celery2.webapp.phx1 ~]$ free
total used free shared buffers cached
Mem: 5992988 2744252 3248736 4 89748 187572
-/+ buffers/cache: 2466932 3526056
Swap: 2097148 39348 2057800
[dgarvey@support-celery2.webapp.phx1 ~]$
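(To watch whether swap stays down after a restart like this, a simple poll works; a sketch:)

# print swap used (MB) once a minute
while sleep 60; do free -m | awk '/^Swap:/ {print $3 " MB swap used"}'; done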
Comment 3 (Assignee) • 10 years ago
Attached file sumo_celery_mem.png
Comment 4 (Assignee) • 10 years ago
Attached file rabbitmq_queues_unacknowledged.png
Comment 5 (Assignee) • 10 years ago
TL;DR: restarted the celery processes on sumocelery1.webapp.phx1. We'll need to wait at least one day to see if this fixes the issue.
Long form:
Looking at the memory usage of the SUMO celery boxes [sumo_celery_mem.png], you can see that there is usually a nightly spike as the database is rebuilt. On the 11th, this pattern breaks:
- support-celery2 has the usual spike in memory usage, but higher than normal
- support-celery1 & 3 show a much slighter rise in memory usage but *stay* in that higher range
- sumocelery1 shows no peaks at all
Looking at the rabbitmq GUI -> Queues, I could see that one queue had a number of unacknowledged messages: vhost kitsune_prod, queue celery. [rabbitmq_queues_unacknowledged.png]
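(The same counts are visible from the CLI on the broker host; a sketch, assuming rabbitmqctl access:)

# ready vs. unacknowledged message counts per queue in the kitsune_prod vhost
rabbitmqctl list_queues -p kitsune_prod name messages_ready messages_unacknowledged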
If there are unacknowledged messages in a rabbit queue, odds are that those messages were sent from rabbit to a celery node but the celery node never acknowledged receiving them. The SOP is to restart the celery process; that usually forces rabbit to realize that the unacknowledged messages it sent to the unresponsive celery node need to be resent.
Restarting celery on sumocelery1 took far longer than usual (on the order of minutes). However, once it was restarted, the number of unacknowledged messages dropped to zero. I prophylactically did a rolling restart of celery on support-celery1-3.webapp.phx1 so that they start from a "clean" memory pattern; we'll see if the memory issues crop up again tonight.
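(The rolling restart is just the supervisorctl command from comment 2, repeated per host; a sketch with hostnames per this bug -- the ssh loop is an assumption about how it was driven:)

for h in support-celery1 support-celery2 support-celery3; do
  ssh "$h.webapp.phx1.mozilla.com" 'sudo supervisorctl restart celery-kitsune-prod'
done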
6:55 AM <nagios-phx1> Fri 06:54:59 PST [1015] support-celery3.webapp.phx1.mozilla.com:Swap is WARNING: SWAP WARNING - 40% free (815 MB out of 2046 MB) (http://m.mozilla.org/Swap)
[root@support-celery3.webapp.phx1 ~]# free -m
total used free shared buffers cached
Mem: 5894 5585 309 0 113 181
-/+ buffers/cache: 5289 605
Swap: 2046 1228 817
[root@support-celery3.webapp.phx1 ~]#
Comment 8 (Assignee) • 9 years ago
* Slightly lowered the number of celery processes running on the servers (from 30 to 24; see the sketch after this list)
* Increased the threshold for the SUMO rabbitMQ alert in case this radically increases the backlog.
(Looking at the data for the last week, it shouldn't.)
* Filed bug 1240213 for a better rabbit alert, one that gets closer to the root issue of why we care about a high number of messages in the queue.
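(Dropping concurrency from 30 to 24 amounts to changing the -c flag on the celeryd invocation seen in the transcripts above. The actual supervisord program config isn't shown in this bug, so the command line below is illustrative only:)

# same invocation as in the ps output above, with concurrency lowered to 24
/data/www/support.mozilla.com/kitsune/virtualenv/bin/python \
  /data/www/support.mozilla.com/kitsune/manage.py celeryd \
  --loglevel=INFO -f /var/log/newrelic/newrelic-support-celery.log -c 24 -E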
Updated (Assignee) • 9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated • 6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard