Using datasource causes memory leaks in the API and possibly celery tasks



Tree Management
2 years ago
a year ago


(Reporter: emorley, Unassigned)


(Blocks: 1 bug)


In bug 1307785 we had to lower the maximum number of requests each gunicorn worker would handle before being respawned, to match the value we had for SCL3, since we are leaking pretty hard.

1) `vagrant up` on current master (a98f432)
2) Run each of these in a separate `vagrant ssh` shell:
  - `TREEHERDER_DEBUG=False gunicorn treeherder.config.wsgi:application --log-file - --timeout 29`
  - `top -o %MEM -u vagrant` (then hit `e` to switch to human-readable memory values)
  - `while true; do curl -sS -o /dev/null "" ; done`
3) Watch the memory usage of the gunicorn process for 30s
4) Ctrl+C out of the curl invocation, and replace it with:
  - `while true; do curl -sS -o /dev/null "" ; done`
5) Watch gunicorn memory usage for 30s

At step 3, the memory usage jumps slightly after the first request but remains constant.

At step 5, the memory usage continuously grows.
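
For anyone who would rather script the comparison between steps 3 and 5 than eyeball `top`, here is a minimal sketch (Python 3, Linux only, since it reads `/proc`). The helper names `rss_kib` and `grows_steadily` and the 0.8 threshold are assumptions made for illustration; none of this is part of Treeherder or gunicorn.

```python
def rss_kib(pid):
    """Return the resident set size of `pid` in KiB, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        for line in fh:
            if line.startswith("VmRSS:"):
                # Line looks like: "VmRSS:     123456 kB"
                return int(line.split()[1])
    raise RuntimeError(f"VmRSS not found for pid {pid}")


def grows_steadily(samples, min_fraction=0.8):
    """Return True if at least `min_fraction` of consecutive samples increase.

    A flat series (step 3) yields False; a series that keeps climbing
    (step 5) yields True. The threshold is an arbitrary illustrative choice.
    """
    if len(samples) < 2:
        return False
    rises = sum(1 for a, b in zip(samples, samples[1:]) if b > a)
    return rises / (len(samples) - 1) >= min_fraction
```

To use it, sample `rss_kib()` for the gunicorn worker pid once a second during each curl loop and pass the resulting list to `grows_steadily()`.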

The only difference between /api/repository/ and /api/repository/1/ is that the latter uses datasource in addition to the ORM (since it uses `retrieve()` rather than the default `list()`).

Using Dozer to inspect objects shows thousands of:
(and several builtin types that are presumably referenced by them)

I'm not sure if this leak is a fault in datasource or the way in which we use it.

I picked this endpoint only because it seemed like a reduced testcase (and an easy way to compare ORM vs datasource), but I'm presuming:
* all API datasource usage leaks
* celery tasks that use datasource leak too (which would explain why several of our ingestion task leaks went away recently, as we switched more things away from datasource)

Once we are no longer using datasource we can probably raise both the gunicorn `--max-requests` value and the celery `--maxtasksperchild` value specified in the Procfile, which should improve performance. (For several worker types we're restarting each celery worker after only 20 tasks, which for the short-lived Pulse ingestion tasks is pretty crazy.)
Dozer output (number of each object type in memory) after a few mins of the 2nd curl loop:

MySQLdb.connections.Connection: 16618
MySQLdb.cursors.DictCursor: 16617
__builtin__.cell: 53761
__builtin__.dict: 142570
__builtin__.function: 71459
__builtin__.getset_descriptor: 2148
__builtin__.instance: 16799
__builtin__.instancemethod: 183288
__builtin__.list: 56674
__builtin__.module: 1151 1263
__builtin__.set: 133663
__builtin__.tuple: 76677
__builtin__.type: 2935
__builtin__.weakproxy: 16619
__builtin__.weakref: 4212
__builtin__.wrapper_descriptor: 1422
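
A per-type count like the one above can be approximated without Dozer using the standard `gc` module. This is a rough sketch in Python 3 (so builtin types show up under `builtins` rather than `__builtin__`); the function names `type_counts` and `growth` are hypothetical, and this is not how Dozer itself is implemented.

```python
import gc
from collections import Counter


def type_counts(objects=None):
    """Count objects by 'module.TypeName'.

    Defaults to every object the garbage collector currently tracks.
    """
    if objects is None:
        objects = gc.get_objects()
    counts = Counter()
    for obj in objects:
        t = type(obj)
        counts[f"{t.__module__}.{t.__qualname__}"] += 1
    return counts


def growth(before, after, minimum=1):
    """Return (type name, increase) pairs that grew by at least `minimum`, largest first."""
    diffs = ((name, after[name] - before.get(name, 0)) for name in after)
    return sorted((item for item in diffs if item[1] >= minimum),
                  key=lambda item: -item[1])
```

Taking one `type_counts()` snapshot before the curl loop and another after it, then passing both to `growth()`, should surface the types whose instance counts keep climbing.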

The MySQLdb.cursors.DictCursor's parent is `MySQLdb.connections.Connection`, and the parent of that is:

140453918172336 dict (via its "'con_obj'" key)
dict of len 3: {'con_obj': <_mysql.connection closed at 649c130>, 'cursor': <MySQLdb.cursors.DictCur...


So datasource is closing the connection but we're still persisting the con_obj.
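
To illustrate the general mechanism (not datasource's actual code, which I haven't pinned down): closing a connection does not free it while a long-lived container still holds the dict that references it. In this sketch `FakeConnection`, `registry` and `handle_request` are all hypothetical stand-ins, and only the `con_obj` key name is taken from the output above.

```python
import gc
import weakref


class FakeConnection:
    """Hypothetical stand-in for a DB connection; not the MySQLdb API."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


# A long-lived container, standing in for whatever outlives each request.
registry = []


def handle_request(keep_reference):
    """Open and close a connection, optionally leaving its dict in `registry`.

    Returns a weak reference so the caller can check whether the
    connection object is still alive afterwards.
    """
    conn = FakeConnection()
    entry = {"con_obj": conn}
    if keep_reference:
        registry.append(entry)
    conn.close()
    return weakref.ref(conn)


leaked = handle_request(keep_reference=True)
freed = handle_request(keep_reference=False)
gc.collect()

# The closed connection is still alive because `registry` references its dict.
print(leaked() is not None and leaked().closed)
# Without the lingering reference the object has been released.
print(freed() is None)
```

If something similar is happening here, the fix would be to drop the entry holding `con_obj`, not just to close the connection.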

I've tried deleting the connection object and/or setting it to None in TreeherderModelBase's `__exit__`, like Django does, along with a number of other related things, but to no avail.

I'm not going to spend any more time on this. Whilst the leak is causing errors on stage every other day and on prod less frequently, we can just wait until we move away from datasource.
Duplicate of this bug: 1287489
This is also why people have had memory issues using `./ runserver` vs run_gunicorn, since the latter reloads after X requests, whereas runserver doesn't.
Fixed by not using datasource anymore.
Last Resolved: 2 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1178641