Closed Bug 1094814 Opened 10 years ago Closed 9 years ago

High Memory & Swap on treeherder-etl[1-2].private.scl3.mozilla.com

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rbryce, Assigned: fubar)

Details

+++ This bug was initially created as a clone of Bug #1094517 +++

Automated alert report from nagios1.private.scl3.mozilla.com:

Hostname: treeherder-etl1.private.scl3.mozilla.com
Service:  Swap
State:    WARNING
Output:   SWAP WARNING - 46% free (928 MB out of 2047 MB)

Runbook:  http://m.allizom.org/Swap
https://graphite-scl3.mozilla.org/render/?width=586&height=308&from=-24days&target=hosts.treeherder-etl[1-2]_private_scl3_mozilla_com.memory.memory.used.value

Memory usage has been climbing over the past month. Standard swap clearing won't work while memory usage is this high.

The buildapi celery processes seem to be the greediest. I don't know enough about this service, but my guess is that the server could just use a bump in allocated memory.
I am not exactly sure who to contact in QA about this. CC :arr, :edmorley, ctalbert.
Summary: Memory & Swap on treeherder-etl[1-2].private.scl3.mozilla.com → High Memory & Swap on treeherder-etl[1-2].private.scl3.mozilla.com
emorley is probably your man for treeherder, but I'll cc jgriffin just in case, too.
Treeherder is an A-Team project rather than QA.

Best CC list for treeherder issues is:
treeherder@tree-management.bugs, :fubar, :edmorley, :mdoglio, :camd

The Bugzilla components for it are Tree Management::Treeherder and Tree Management::Infrastructure (the latter for Developer Services team tasks).

Could someone update whatever wiki MOC uses? :-)

fubar/mdoglio are the best ones to look at this particular issue (CCed).
(In reply to Ed Morley [:edmorley] from comment #4)
> Treeherder is an A-Team project rather than QA.
> 
> Best CC list for treeherder issues is:
> treeherder@tree-management.bugs, :fubar, :edmorley, :mdoglio, :camd
> 
> The Bugzilla components for it are Tree Management::Treeherder and Tree
> Management::Infrastructure (the latter for Developer Services team tasks).
> 
> Could someone update whatever wiki MOC uses? :-)

I've updated the mana docs at https://mana.mozilla.org/wiki/display/websites/treeherder.mozilla.org to make contacts a bit more obvious. I've also added a note under Troubleshooting on what to do if ETL nodes have memory issues again.

> fubar/mdoglio are the best ones to look at this particular issue (CCed).

I wonder if this might be a memory leak? The processes were running normally and swap dropped back to normal on restarting the celery jobs.

23834:   /usr/bin/python /usr/bin/celery -A treeherder worker -Q buildapi --concurrency=5 --logfile=/var/log/celery/celery_worker_buildapi.log -l INFO --maxtasksperchild=500 -n buildapi.%h
Address           Kbytes     RSS   Dirty Mode   Mapping
...
00000000015b0000   39976   36408   34296 rw---    [ anon ]
0000000003cba000  739896  739572  739556 rw---    [ anon ]
...

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  SWAP COMMAND
23834 treeherd  20   0 1172m 768m 3032 S  0.0 20.0   1:50.19 2776 /usr/bin/python /usr/bin/celery...
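For anyone repeating this check, here is a minimal sketch of pulling the largest anonymous mappings out of `pmap -x` output like the above. The function names are hypothetical, and the sample contains only the two rows shown in this comment (the elided rows are left out), so the numbers are partial.

```python
# Sample rows copied from the pmap -x output in this comment.
SAMPLE = """\
Address           Kbytes     RSS   Dirty Mode   Mapping
00000000015b0000   39976   36408   34296 rw---    [ anon ]
0000000003cba000  739896  739572  739556 rw---    [ anon ]
"""


def parse_pmap_rows(text):
    """Return one dict per mapping row; header and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue
        try:
            # A mapping row starts with a hex address and three integer columns.
            int(fields[0], 16)
            kbytes, rss_kb, dirty_kb = (int(f) for f in fields[1:4])
        except ValueError:
            continue
        rows.append({
            "address": fields[0],
            "kbytes": kbytes,
            "rss_kb": rss_kb,
            "dirty_kb": dirty_kb,
            "mapping": " ".join(fields[5:]),
        })
    return rows


def anon_rows_by_rss(rows):
    """Return anonymous mappings sorted by resident size, largest first."""
    anon = [r for r in rows if r["mapping"] == "[ anon ]"]
    return sorted(anon, key=lambda r: r["rss_kb"], reverse=True)


if __name__ == "__main__":
    for row in anon_rows_by_rss(parse_pmap_rows(SAMPLE)):
        print(row["address"], row["rss_kb"], "kB resident")
```

A single anonymous mapping holding most of the resident memory points at heap growth inside the worker rather than at loaded libraries, which fits the leak theory but does not prove it.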
We should set the maxtasksperchild parameter to something like 50. That means that on average each process in the worker pool will restart after about 200 minutes, based on the rough approximation that a worker runs a task every 4 minutes.
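The arithmetic behind that estimate, as a rough sketch. The 4 minutes per task figure is the approximation from this comment, 500 is the current --maxtasksperchild value from the worker command line quoted earlier, and the helper name is made up for illustration.

```python
def minutes_until_recycle(max_tasks_per_child, minutes_per_task):
    """Approximate lifetime of one pool process before it is replaced.

    Assumes each pool process runs tasks back to back at a steady rate,
    which is only a rough model of the real workload.
    """
    return max_tasks_per_child * minutes_per_task


MINUTES_PER_TASK = 4  # rough approximation, not a measured value

current = minutes_until_recycle(500, MINUTES_PER_TASK)
proposed = minutes_until_recycle(50, MINUTES_PER_TASK)

if __name__ == "__main__":
    print("current:  %d minutes (~%.1f hours)" % (current, current / 60.0))
    print("proposed: %d minutes (~%.1f hours)" % (proposed, proposed / 60.0))
```

Under these assumptions a process currently lives for roughly 2000 minutes before being recycled, so dropping the limit to 50 would give any leaked memory about a tenth of the time to accumulate.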
Still occurring. Multiple alerts today. Have had to restart twice already:
 supervisorctl restart celery_buildapi
Will look into bumping the memory on those nodes during or after the work week.
Moving this to the most appropriate component for us to track this.
There are also no IP addresses or anything else confidential, so making this bug open.
Group: infra
Component: MOC: Problems → Treeherder: Infrastructure
OS: Other → All
Product: Infrastructure & Operations → Tree Management
QA Contact: dmoore → laura
Hardware: Other → All
Version: other → ---
Depends on: 1113115
Priority: -- → P1
Assignee: nobody → klibby
Status: NEW → ASSIGNED
See Also: → 1129731
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED