Making an infra specific bug, leaving the pager in MOC, let's see if that works out OK.

+++ This bug was initially created as a clone of Bug #1120192 +++

The python script to process hg data is taking up most of the memory on a10n.webapp.scl3. It looks like swap was a problem in November for some time, too, and the process has been running since last year. Does this job have a slow memory leak?

a10n 734 0.0 56.6 2423132 1088404 ? S 2014 20:39 /data/www/a10n.mozilla.org/src/a10n/env/bin/python /data/www/a10n.mozilla.org/src/a10n/scripts/a10n hg

Mem: 1922372k total, 1822556k used, 99816k free, 243248k buffers
Swap: 2097148k total, 1053808k used, 1043340k free, 175816k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
734 a10n 20 0 2366m 1.0g 2496 S 0.0 56.6 20:39.35 python
733 a10n 20 0 409m 32m 2292 S 0.3 1.7 187:51.83 twistd

CCing Gregory. Can you help me? I suspect this has something to do with in-memory caching in long-running hg processes. The process is https://github.com/Pike/a10n/blob/master/a10n/hg_elmo/worker.py, which in turn calls https://github.com/mozilla/elmo/blob/master/apps/pushes/utils.py#L108, and in the end mercurial.commands.pull/update. That's on python 2.7.3 and hg 2.8. I guess updating hg should be the first step? Might need code changes, though.
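As a sanity check on the numbers quoted above, here is a small sketch that recomputes the memory share of the python worker from the ps line and the Mem/Swap totals. The helper names (`rss_kb_from_ps`, `percent`) are made up for illustration; the figures are copied from the output in this bug.

```python
# Figures copied from the ps/top output quoted in this bug.
PS_LINE = (
    "a10n 734 0.0 56.6 2423132 1088404 ? S 2014 20:39 "
    "/data/www/a10n.mozilla.org/src/a10n/env/bin/python "
    "/data/www/a10n.mozilla.org/src/a10n/scripts/a10n hg"
)
MEM_TOTAL_KB = 1922372
SWAP_TOTAL_KB = 2097148
SWAP_USED_KB = 1053808


def rss_kb_from_ps(line):
    """Return the RSS column (in KiB) of a `ps aux` style line."""
    # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    fields = line.split(None, 10)
    return int(fields[5])


def percent(part, whole):
    """Share of `whole` taken by `part`, as a percentage."""
    return 100.0 * part / whole


rss_share = percent(rss_kb_from_ps(PS_LINE), MEM_TOTAL_KB)
swap_share = percent(SWAP_USED_KB, SWAP_TOTAL_KB)

# The resident share matches the 56.6 %MEM reported by ps,
# and roughly half of the swap is in use.
print("resident: %.1f%% of RAM" % rss_share)
print("swap used: %.1f%%" % swap_share)
```

So the single worker accounts for more than half of physical RAM while about half of swap is in use, which is consistent with the swap warnings.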
Purely informational: that wasn't a pager bug and could have been moved to a more appropriate place. I just didn't know where that was, so I put it in our problems queue.
https://github.com/Pike/a10n/commit/94227bfe8b830fe77bdf69a7099eac3d2f4614e9 landed, I'll now get this into production. I went ahead and landed that without review, as there's little real code change. Just some tidbits in the stage script which are boring.
This is deployed, and just upgrading mercurial didn't help. Gregory, is there a trick to flush/invalidate caches within a long-running mercurial process? I'd prefer to not shell out to the command server, in particular for changeset inspection in https://github.com/mozilla/elmo/blob/master/apps/pushes/utils.py#L38
mercurial.hg.repository instances won't automatically update if an external process modifies the repository. You should recreate these instances if you suspect an external process may have interacted with the repository.
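A minimal sketch of what recreating the instance per unit of work could look like. `with_fresh_repo` and `default_open_repo` are hypothetical names, not part of a10n or elmo. The only Mercurial calls used are `ui.ui()` and `hg.repository()`; on Python 3 builds of Mercurial the path would need to be bytes. Whether dropping the object actually releases the memory in question is an assumption this bug would have to verify.

```python
def default_open_repo(path):
    """Open a brand new repository object via Mercurial's internal API."""
    # Imported lazily so the helper can be exercised without Mercurial.
    from mercurial import hg, ui
    return hg.repository(ui.ui(), path)


def with_fresh_repo(path, task, open_repo=default_open_repo):
    """Run task(repo) against a newly created repository object.

    Nothing is cached between calls: each call opens its own object,
    so changes made by external processes are seen, and any per-object
    caches become garbage once the call returns.
    """
    repo = open_repo(path)
    return task(repo)
```

A worker would then call something like `with_fresh_repo(repo_path, handle_push)` for each incoming job instead of keeping one repository object alive for the lifetime of the process.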
Mercurial 2.8 is ancient and has known memory leaks in certain situations. You should be running 3.2.4.
Filed bug 1137668 to use hglib, that's gonna move the cache to a separate short-lived process. (We've been using mercurial 3.2.4 for a bit now, and it didn't help.)
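For illustration, a rough sketch of the hglib approach, not the actual patch in bug 1137668. `pull_and_update` and `default_open_client` are hypothetical names. The hglib calls used (`hglib.open`, `pull`, `update`, `tip`, `close`) are from python-hglib, which talks to an hg command server child process. Closing the client ends that child, so whatever it cached goes away with it.

```python
def default_open_client(path):
    """Start an hg command server for the repository at path."""
    # Imported lazily so the helper can be exercised without hglib.
    import hglib
    return hglib.open(path)


def pull_and_update(path, rev=None, open_client=default_open_client):
    """Pull and update in a short-lived hg child process.

    Returns the node of tip after the pull.
    """
    client = open_client(path)
    try:
        client.pull()
        client.update(rev=rev)
        return client.tip().node
    finally:
        # Shut the command server down even if pull/update failed,
        # so no hg state outlives this call.
        client.close()
```

The long-running worker then only holds the results it asked for, while the repository caches live in a process that exits after each job.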
Depends on: 1137668
and again:

<nagios-scl3:#sysadmins> Mon 03:41:50 PST  a10n.webapp.scl3.mozilla.com:Swap is WARNING: SWAP WARNING - 50% free (1021 MB out of 2047 MB) (http://m.mozilla.org/Swap)

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15280 a10n 20 0 2120m 811m 2572 S 0.0 43.2 14:08.54 python
15279 a10n 20 0 335m 30m 2296 S 0.3 1.6 78:20.14 twistd

[firstname.lastname@example.org pradcliffe]# supervisorctl avail
program-a10n in use auto 999:999
program-get-pushes in use auto 999:999
[email@example.com pradcliffe]# supervisorctl restart program-a10n
program-a10n: stopped
program-a10n: started
I think all the tooling changes that we thought would help this bug are done, and the regressions (also kicking off swap alerts) have been fixed, so I'm marking this FIXED.
Status: NEW → RESOLVED
Last Resolved: 11 months ago
Resolution: --- → FIXED