Closed Bug 1129731 Opened 9 years ago Closed 9 years ago

treeherder-rabbitmq1.private.scl3.mozilla.com:Swap is WARNING: SWAP WARNING - 50% free

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: vinh, Assigned: pir)

References

Details

Getting alerted in nagios for:

treeherder-rabbitmq1.private.scl3.mozilla.com:Swap is WARNING: SWAP WARNING - 50% free

I'm unable to remedy the swap usage. Can someone help?
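
(For future reference, one quick way to see which processes are actually holding the swap; this wasn't run as part of this bug and assumes a kernel that exposes VmSwap in /proc:)

  # top swap users, largest first
  grep -H VmSwap /proc/[0-9]*/status 2>/dev/null | sort -k2 -n -r | head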
Component: Treeherder → Treeherder: Infrastructure
QA Contact: laura
Group: mozilla-employee-confidential
(Unhiding since this doesn't contain anything confidential)

Mauro/Cameron, do we know why this has started occurring? More load from TaskCluster?
Group: mozilla-employee-confidential
OS: Other → All
Priority: -- → P1
Hardware: x86 → All
A bit of this as well:

Thu 23:50:09 PST [5209] treeherder-rabbitmq1.private.scl3.mozilla.com:Rabbit Unread Messages is CRITICAL: RABBITMQ_OVERVIEW CRITICAL - messages CRITICAL (131210) messages_ready CRITICAL (131161)
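
(Not captured during this incident, but for reference: a backlog like that can be broken down per queue on the rabbit node with standard rabbitmqctl, e.g.)

  rabbitmqctl list_queues name messages messages_ready messages_unacknowledged | sort -k2 -n -r | head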
fubar, any idea why the celery processes' memory usage has grown so much in the last week? Did we update anything? I don't recall a prod push in that window:
https://rpm.newrelic.com/accounts/677903/servers/5575925/processes?tw[end]=1423232521&tw[start]=1420554121#id=721569760
Flags: needinfo?(klibby)
See also bugs 1094814 and 1113115. Probably just a memory leak from general usage; we've seen it on the other celery nodes, as well as in celery tasks on other services.
Depends on: 1113115
Flags: needinfo?(klibby)
See Also: → 1094814
I don't believe it's a memory leak in this case; we deployed to prod twice in that range, and the deploy script would have restarted the processes.
<nagios-scl3:#sysadmins> Sun 02:54:09 PST [5804] 
  treeherder-rabbitmq1.private.scl3.mozilla.com:Swap is CRITICAL: SWAP CRITICAL 
  - 10% free (202 MB out of 2047 MB) (http://m.mozilla.org/Swap)

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
20968 treeherd  20   0  846m 378m 3448 S  0.0  9.9  25:03.68 python             
20970 treeherd  20   0  831m 365m 3472 S  0.3  9.5  25:12.17 python             
20969 treeherd  20   0  801m 336m 3544 S  0.3  8.8  23:44.89 python             
25661 treeherd  20   0  582m 134m 2416 S  0.0  3.5  47:39.14 celery             

495      20968  5.6  9.8 866792 387512 ?       Sl   03:37  25:03 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
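
(For anyone reading the worker command line above, the flags are standard celery worker options:
  -c 3                     three prefork child processes (the three ~800 MB python PIDs in the top output above)
  -Q default,...           comma-separated list of queues this worker consumes
  -E                       emit task events for monitoring
  --maxtasksperchild=500   recycle each child after 500 tasks, to cap per-child memory growth
  -l INFO                  log level
  -n default.%h            worker node name; %h expands to the hostname)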


  ├─supervisord,5479 /usr/bin/supervisord
  │   ├─celery,20771 /usr/bin/celery -A treeherder beat -f /var/log/celery/celerybeat.log
  │   ├─newrelic_plugin,1747 /usr/bin/newrelic_plugin_agent -c /etc/newrelic/agent.yml -f
  │   ├─python,17210 /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs ...
  │   │   ├─python,20968 /usr/bin/celery -A treeherder worker -c 3 -Q ...
  │   │   │   └─{python},20971
  │   │   ├─python,20969 /usr/bin/celery -A treeherder worker -c 3 -Q ...
  │   │   │   └─{python},20975
  │   │   └─python,20970 /usr/bin/celery -A treeherder worker -c 3 -Q ...
  │   │       └─{python},20973
  │   └─python,19437 /usr/bin/celery -A treeherder worker -c 1 -Q high_priority -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker_hp.log ...
  │       └─python,20949 /usr/bin/celery -A treeherder worker -c 1 -Q high_priority -E --maxtasksperchild=500 ...
  │           └─{python},20950


[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# supervisorctl 
newrelic-plugin-agent            RUNNING    pid 25178, uptime 0:02:31
run_celery_worker                RUNNING    pid 25195, uptime 0:02:26
run_celery_worker_hp             RUNNING    pid 25194, uptime 0:02:26
run_celerybeat                   RUNNING    pid 25179, uptime 0:02:31
supervisor> elp
*** Unknown syntax: elp
supervisor> help

default commands (type help <topic>):
=====================================
add    clear  fg        open  quit    remove  restart   start   stop  update 
avail  exit   maintail  pid   reload  reread  shutdown  status  tail  version

supervisor> stop run_celery_worker
run_celery_worker: stopped
supervisor> stop run_celery_worker_hp 
run_celery_worker_hp: stopped
supervisor> stop run_celerybeat
run_celerybeat: stopped
supervisor> stop newrelic-plugin-agent
newrelic-plugin-agent: stopped

[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | fgrep celery| wc -l
40

Some of these have been around longer than the others and haven't been restarted in the last week:


495      13831  0.1  1.5 568428 60012 ?        Sl   Jan30  18:29 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495      13832  0.1  1.9 562700 75608 ?        Sl   Jan30  18:58 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495      13833  0.1  0.9 556728 35504 ?        Sl   Jan30  19:39 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default,process_objects,cycle_data,calculate_eta,populate_performance_series,fetch_bugs -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495      15701  0.2  2.3 518292 90852 ?        Sl   Jan30  38:54 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495      15713  0.2  1.7 601252 68512 ?        Sl   Jan30  36:57 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h
495      15876  0.2  1.3 508244 53924 ?        Sl   Jan30  29:14 /usr/bin/python /usr/bin/celery -A treeherder worker -c 3 -Q default -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker.log -l INFO -n default.%h

[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | fgrep celery | awk '{print $2}' | xargs kill
kill 25460: No such process
[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | fgrep celery | awk '{print $2}' | xargs kill -9
kill 25482: No such process
[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | fgrep celery | awk '{print $2}'
25489
[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | grep '[c]elery' | wc -l
0
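
(The "No such process" errors above are the fgrep matching its own pipeline: the fgrep's PID lands in the kill list, but that process has already exited by the time kill runs. The '[c]elery' form avoids this, because the grep's own command line contains the literal string [c]elery, which the pattern doesn't match. pgrep is another option, and is documented not to report itself:)

  pgrep -fl celery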

supervisor> start newrelic-plugin-agent
newrelic-plugin-agent: started
supervisor> start run_celery_worker
run_celery_worker: started
supervisor> start run_celery_worker_hp
run_celery_worker_hp: started
supervisor> start run_celerybeat
run_celerybeat: started

[root@treeherder-rabbitmq1.private.scl3 pradcliffe]# ps auxww | grep '[c]elery' | wc -l
7

<nagios-scl3:#sysadmins> Sun 03:14:09 PST [5805] 
  treeherder-rabbitmq1.private.scl3.mozilla.com:Swap is OK: SWAP OK - 99% free 
  (2015 MB out of 2047 MB) (http://m.mozilla.org/Swap)
Peter, thanks for spotting that; it looks like we're accumulating zombie celery processes. I wonder if we could make the deploy script more aggressive about cleaning up processes once it's tried to stop them gracefully?
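Something along these lines might do it (just a sketch against the supervisord program names shown above, not the actual deploy script):

  supervisorctl stop run_celery_worker run_celery_worker_hp run_celerybeat
  sleep 30                                  # give in-flight tasks a chance to finish
  pkill -f 'celery -A treeherder' || true   # anything still alive is an orphan
  sleep 5
  pkill -9 -f 'celery -A treeherder' || true
  supervisorctl start run_celerybeat run_celery_worker_hp run_celery_worker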

The running instance count here makes it pretty easy to spot, retrospectively:
https://rpm.newrelic.com/accounts/677903/servers/5575925/processes#id=721569760

This definitely wasn't related to the leaks we've seen elsewhere.

Anyway, I'll file another bug for (a) a one-off check of all nodes for any other zombies, and (b) a longer-term solution for preventing them from being created in the first place, or else cleaning them up automatically.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Assignee: nobody → pradcliffe+bugzilla
Depends on: 1131059