Closed
Bug 1129731
Opened 10 years ago
Closed 10 years ago
treeherder-rabbitmq1.private.scl3.mozilla.com:Swap is WARNING: SWAP WARNING - 50% free
Categories
(Tree Management :: Treeherder: Infrastructure, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: vinh, Assigned: pir)
Getting alerted in nagios for:
treeherder-rabbitmq1.private.scl3.mozilla.com:Swap is WARNING: SWAP WARNING - 50% free
I'm unable to remedy the swap usage. Can someone help?
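For reference, one quick way to see which processes are actually holding the swap (a generic diagnostic sketch, assuming a kernel that exposes VmSwap in /proc/<pid>/status; this was not run as part of this bug):
# Per-process swap usage, largest first.
grep VmSwap /proc/[0-9]*/status 2>/dev/null | sort -t: -k3 -rn | head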
Reporter | Updated•10 years ago
Component: Treeherder → Treeherder: Infrastructure
QA Contact: laura
Reporter | Updated•10 years ago
Group: mozilla-employee-confidential
Comment 1•10 years ago
(Unhiding since this doesn't contain anything confidential)
Mauro/Cameron, do we know why this has started occurring? More load from TaskCluster?
Group: mozilla-employee-confidential
OS: Other → All
Priority: -- → P1
Hardware: x86 → All
Comment 2•10 years ago
A bit of this as well:
Thu 23:50:09 PST [5209] treeherder-rabbitmq1.private.scl3.mozilla.com:Rabbit Unread Messages is CRITICAL: RABBITMQ_OVERVIEW CRITICAL - messages CRITICAL (131210) messages_ready CRITICAL (131161)
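For anyone inspecting the backlog on the node itself, the queue depths behind that check can be listed with the standard RabbitMQ CLI (illustrative only, assuming rabbitmqctl is available on the host; not taken from this bug's triage):
# Show each queue's total and ready message counts, deepest queues first.
rabbitmqctl list_queues name messages messages_ready | sort -k2 -rn | head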
Comment 3•10 years ago
fubar, any idea why the celery process memory usage has grown so much in the last week? Did we update anything? I don't recall a prod push in that window:
https://rpm.newrelic.com/accounts/677903/servers/5575925/processes?tw[end]=1423232521&tw[start]=1420554121#id=721569760
Flags: needinfo?(klibby)
Comment 4•10 years ago
See also bugs 1094814 and 1113115. Probably just a memory leak from general usage; we've seen it on the other celery nodes, as well as in celery tasks on other services.
Comment 5•10 years ago
I don't believe it's a memory leak in this case; we deployed to prod twice in that range, and the deploy script would have restarted the processes.
Assignee | Comment 6•10 years ago
<nagios-scl3:#sysadmins> Sun 02:54:09 PST [5804]
treeherder-rabbitmq1.private.scl3.mozilla.com:Swap is CRITICAL: SWAP CRITICAL
- 10% free (202 MB out of 2047 MB) (http://m.mozilla.org/Swap)
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20968 treeherd 20 0 846m 378m 3448 S 0.0 9.9 25:03.68 python
20970 treeherd 20 0 831m 365m 3472 S 0.3 9.5 25:12.17 python
20969 treeherd 20 0 801m 336m 3544 S 0.3 8.8 23:44.89 python
25661 treeherd 20 0 582m 134m 2416 S 0.0 3.5 47:39.14 celery
495 20968 5.6 9.8 866792 387512 ? Sl 03:37 25:03 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
├─supervisord,5479 /usr/bin/supervisord
│ ├─celery,20771 /usr/bin/celery -A treeherder beat -f /var/log/celery/celerybeat.log
│ ├─newrelic_plugin,1747 /usr/bin/newrelic_plugin_agent -c /etc/newrelic/agent.yml -f
│ ├─python,17210 /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs ...
│ │ ├─python,20968 /usr/bin/celery -A treeherder worker -c 3 -Q ...
│ │ │ └─{python},20971
│ │ ├─python,20969 /usr/bin/celery -A treeherder worker -c 3 -Q ...
│ │ │ └─{python},20975
│ │ └─python,20970 /usr/bin/celery -A treeherder worker -c 3 -Q ...
│ │ └─{python},20973
│ └─python,19437 /usr/bin/celery -A treeherder worker -c 1 -Q high_priority -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker_hp.log ...
│ └─python,20949 /usr/bin/celery -A treeherder worker -c 1 -Q high_priority -E --maxtasksperchild=500 ...
│ └─{python},20950
[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# supervisorctl
newrelic-plugin-agent RUNNING pid 25178, uptime 0:02:31
run_celery_worker RUNNING pid 25195, uptime 0:02:26
run_celery_worker_hp RUNNING pid 25194, uptime 0:02:26
run_celerybeat RUNNING pid 25179, uptime 0:02:31
supervisor> elp
*** Unknown syntax: elp
supervisor> help
default commands (type help <topic>):
=====================================
add clear fg open quit remove restart start stop update
avail exit maintail pid reload reread shutdown status tail version
supervisor> stop run_celery_worker
run_celery_worker: stopped
supervisor> stop run_celery_worker_hp
run_celery_worker_hp: stopped
supervisor> stop run_celerybeat
run_celerybeat: stopped
supervisor> stop newrelic-plugin-agent
newrelic-plugin-agent: stopped
[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | fgrep celery| wc -l
40
Some of these have been around longer than the others and haven't been restarted in the last week:
495 13831 0.1 1.5 568428 60012 ? Sl Jan30 18:29 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495 13832 0.1 1.9 562700 75608 ? Sl Jan30 18:58 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495 13833 0.1 0.9 556728 35504 ? Sl Jan30 19:39 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495 15701 0.2 2.3 518292 90852 ? Sl Jan30 38:54 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495 15713 0.2 1.7 601252 68512 ? Sl Jan30 36:57 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495 15876 0.2 1.3 508244 53924 ? Sl Jan30 29:14 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | fgrep celery | awk '{print $2}' | xargs kill
kill 25460: No such process
[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | fgrep celery | awk '{print $2}' | xargs kill -9
kill 25482: No such process
[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | fgrep celery | awk '{print $2}'
25489
[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | grep '[c]elery' | wc -l
0
supervisor> start newrelic-plugin-agent
newrelic-plugin-agent: started
supervisor> start run_celery_worker
run_celery_worker: started
supervisor> start run_celery_worker_hp
run_celery_worker_hp: started
supervisor> start run_celerybeat
run_celerybeat: started
[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | grep '[c]elery' | wc -l
7
<nagios-scl3:#sysadmins> Sun 03:14:09 PST [5805]
treeherder-rabbitmq1.private.scl3.mozilla.com:Swap is OK: SWAP OK - 99% free
(2015 MB out of 2047 MB) (http://m.mozilla.org/Swap)
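For reference, a heuristic sketch for spotting workers that have escaped supervisord like the ones above: celery processes that have been reparented to init (PPID 1) are no longer supervised. The fields and match pattern here are illustrative assumptions, not commands run during this incident.
# pid, ppid, start time and full command line for celery processes
# whose parent is init (pid 1), i.e. likely orphans.
ps -eo pid,ppid,lstart,args | grep '[c]elery' | awk '$2 == 1'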
Comment 7•10 years ago
Peter, thanks for spotting that - looks like we're getting zombie celery processes. I wonder if we could make the deploy script more aggressive about cleaning up processes once it has tried to stop them gracefully?
The running instance count here makes it pretty easy to spot, retrospectively:
https://rpm.newrelic.com/accounts/677903/servers/5575925/processes#id=721569760
This definitely had nothing to do with the leaks we've seen elsewhere.
Anyway, I'll file another bug for (a) a one-off check of all nodes for any other zombies, and (b) a longer-term solution to prevent them from being created in the first place, or else to clean them up automatically.
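As a rough sketch of what "more aggressive" could look like in the deploy script (the program names match the supervisord config shown in comment 6, but the grace periods and the pkill pattern are assumptions, not the actual deploy code):
# Ask supervisord to stop the workers gracefully.
supervisorctl stop run_celery_worker run_celery_worker_hp run_celerybeat
# Give in-flight tasks a chance to finish.
sleep 30
# Anything still matching celery has escaped supervisord: try SIGTERM,
# then force-kill whatever is left.
pkill -f 'celery -A treeherder' || true
sleep 10
pkill -9 -f 'celery -A treeherder' || true
# Bring everything back up under supervisord.
supervisorctl start run_celerybeat run_celery_worker run_celery_worker_hp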
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•10 years ago
Assignee: nobody → pradcliffe+bugzilla