Closed Bug 1129731 Opened 9 years ago Closed 9 years ago

treeherder-rabbitmq1.private.scl3.mozilla.com:Swap is WARNING: SWAP WARNING - 50% free

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: vinh, Assigned: pir)

References

Details

Getting alerted in nagios for:

treeherder-rabbitmq1.private.scl3.mozilla.com:Swap is WARNING: SWAP WARNING - 50% free

I'm unable to remedy the swap usage. Can someone help?
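
(For future reference, one quick way to see which processes are actually holding the swap; this wasn't run as part of this bug and assumes a kernel that exposes VmSwap in /proc:)

  # top swap users, largest first
  grep -H VmSwap /proc/[0-9]*/status 2>/dev/null | sort -k2 -n -r | head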
Component: Treeherder → Treeherder: Infrastructure
QA Contact: laura
Group: mozilla-employee-confidential
(Unhiding since this doesn't contain anything confidential)

Mauro/Cameron, do we know why this has started occurring? More load from TaskCluster?
Group: mozilla-employee-confidential
OS: Other → All
Priority: -- → P1
Hardware: x86 → All
A bit of this as well:

Thu 23:50:09 PST [5209] treeherder-rabbitmq1.private.scl3.mozilla.com:Rabbit Unread Messages is CRITICAL: RABBITMQ_OVERVIEW CRITICAL - messages CRITICAL (131210) messages_ready CRITICAL (131161)
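
(Not captured during this incident, but for reference: a backlog like that can be broken down per queue on the rabbit node with standard rabbitmqctl, e.g.)

  rabbitmqctl list_queues name messages messages_ready messages_unacknowledged | sort -k2 -n -r | head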
fubar, any idea why the celery processes' memory usage has grown so much in the last week? Did we update anything? I don't recall a prod push in that window:
https://rpm.newrelic.com/accounts/677903/servers/5575925/processes?tw[end]=1423232521&tw[start]=1420554121#id=721569760
Flags: needinfo?(klibby)
See also bugs 1094814 and 1113115. Probably just a memory leak from general usage; we've seen it on the other celery nodes, as well as in celery tasks on other services.
Depends on: 1113115
Flags: needinfo?(klibby)
See Also: → 1094814
I don't believe it's a memory leak in this case; we deployed to prod twice in that range, and the deploy script would have restarted the processes.
<nagios-scl3:#sysadmins> Sun 02:54:09 PST [5804] 
  treeherder-rabbitmq1.private.scl3.mozilla.com:Swap is CRITICAL: SWAP CRITICAL 
  - 10% free (202 MB out of 2047 MB) (http://m.mozilla.org/Swap)

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
20968 treeherd  20   0  846m 378m 3448 S  0.0  9.9  25:03.68 python             
20970 treeherd  20   0  831m 365m 3472 S  0.3  9.5  25:12.17 python             
20969 treeherd  20   0  801m 336m 3544 S  0.3  8.8  23:44.89 python             
25661 treeherd  20   0  582m 134m 2416 S  0.0  3.5  47:39.14 celery             

495      20968  5.6  9.8 866792 387512 ?       Sl   03:37  25:03 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
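
(For anyone reading the worker command line above, the flags are standard celery worker options:
  -c 3                     three prefork child processes (the three ~800 MB python PIDs in the top output above)
  -Q default,...           comma-separated list of queues this worker consumes
  -E                       emit task events for monitoring
  --maxtasksperchild=500   recycle each child after 500 tasks, to cap per-child memory growth
  -l INFO                  log level
  -n default.%h            worker node name; %h expands to the hostname)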


  ├─supervisord,5479 /usr/bin/supervisord
  │   ├─celery,20771 /usr/bin/celery -A treeherder beat -f /var/log/celery/celerybeat.log
  │   ├─newrelic_plugin,1747 /usr/bin/newrelic_plugin_agent -c /etc/newrelic/agent.yml -f
  │   ├─python,17210 /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs ...
  │   │   ├─python,20968 /usr/bin/celery -A treeherder worker -c 3 -Q ...
  │   │   │   └─{python},20971
  │   │   ├─python,20969 /usr/bin/celery -A treeherder worker -c 3 -Q ...
  │   │   │   └─{python},20975
  │   │   └─python,20970 /usr/bin/celery -A treeherder worker -c 3 -Q ...
  │   │       └─{python},20973
  │   └─python,19437 /usr/bin/celery -A treeherder worker -c 1 -Q high_priority -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker_hp.log ...
  │       └─python,20949 /usr/bin/celery -A treeherder worker -c 1 -Q high_priority -E --maxtasksperchild=500 ...
  │           └─{python},20950


[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# supervisorctl 
newrelic-plugin-agent            RUNNING    pid 25178, uptime 0:02:31
run_celery_worker                RUNNING    pid 25195, uptime 0:02:26
run_celery_worker_hp             RUNNING    pid 25194, uptime 0:02:26
run_celerybeat                   RUNNING    pid 25179, uptime 0:02:31
supervisor> elp
*** Unknown syntax: elp
supervisor> help

default commands (type help <topic>):
=====================================
add    clear  fg        open  quit    remove  restart   start   stop  update 
avail  exit   maintail  pid   reload  reread  shutdown  status  tail  version

supervisor> stop run_celery_worker
run_celery_worker: stopped
supervisor> stop run_celery_worker_hp 
run_celery_worker_hp: stopped
supervisor> stop run_celerybeat
run_celerybeat: stopped
supervisor> stop newrelic-plugin-agent
newrelic-plugin-agent: stopped

[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | fgrep celery| wc -l
40

Some of these have been around longer than the others and haven't been restarted in the last week:


495      13831  0.1  1.5 568428 60012 ?        Sl   Jan30  18:29 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495      13832  0.1  1.9 562700 75608 ?        Sl   Jan30  18:58 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495      13833  0.1  0.9 556728 35504 ?        Sl   Jan30  19:39 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495      15701  0.2  2.3 518292 90852 ?        Sl   Jan30  38:54 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495      15713  0.2  1.7 601252 68512 ?        Sl   Jan30  36:57 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495      15876  0.2  1.3 508244 53924 ?        Sl   Jan30  29:14 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h

[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | fgrep celery | awk '{print $2}' | xargs kill
kill 25460: No such process
[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | fgrep celery | awk '{print $2}' | xargs kill -9
kill 25482: No such process
[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | fgrep celery | awk '{print $2}'
25489
[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | grep '[c]elery' | wc -l
0
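
(The "No such process" errors above are the fgrep matching its own pipeline: the fgrep's PID lands in the kill list, but that process has already exited by the time kill runs. The '[c]elery' form avoids this, because the grep's own command line contains the literal string [c]elery, which the pattern doesn't match. pgrep is another option, and is documented not to report itself:)

  pgrep -fl celery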

supervisor> start newrelic-plugin-agent
newrelic-plugin-agent: started
supervisor> start run_celery_worker
run_celery_worker: started
supervisor> start run_celery_worker_hp
run_celery_worker_hp: started
supervisor> start run_celerybeat
run_celerybeat: started

[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | grep '[c]elery' | wc -l
7

<nagios-scl3:#sysadmins> Sun 03:14:09 PST [5805] 
  treeherder-rabbitmq1.private.scl3.mozilla.com:Swap is OK: SWAP OK - 99% free 
  (2015 MB out of 2047 MB) (http://m.mozilla.org/Swap)
Peter, thanks for spotting that; it looks like we're accumulating zombie celery processes. I wonder if we could make the deploy script more aggressive about cleaning up processes once it's tried to stop them gracefully?
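Something along these lines might do it (just a sketch against the supervisord program names shown above, not the actual deploy script):

  supervisorctl stop run_celery_worker run_celery_worker_hp run_celerybeat
  sleep 30                                  # give in-flight tasks a chance to finish
  pkill -f 'celery -A treeherder' || true   # anything still alive is an orphan
  sleep 5
  pkill -9 -f 'celery -A treeherder' || true
  supervisorctl start run_celerybeat run_celery_worker_hp run_celery_worker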

The running instance count here makes it pretty easy to spot, retrospectively:
https://rpm.newrelic.com/accounts/677903/servers/5575925/processes#id=721569760

This definitely wasn't related to the leaks we've seen elsewhere.

Anyway, I'll file another bug for (a) a one-off check of all nodes for any other zombies, and (b) a longer-term solution for preventing them from being created in the first place, or else cleaning them up automatically.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Assignee: nobody → pradcliffe+bugzilla
Depends on: 1131059