Closed
Bug 1120193
Opened 9 years ago
Closed 7 years ago
a10n.webapp.scl3.mozilla.com:Swap alerting
Categories
(Localization Infrastructure and Tools :: Automation, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Pike, Unassigned)
Making an infra-specific bug and leaving the pager in MOC; let's see if that works out OK.

+++ This bug was initially created as a clone of Bug #1120192 +++

The python script to process hg data is taking up most of the memory on a10n.webapp.scl3. It looks like swap was a problem for some time in November, too, and the process has been running since last year. Does this job have a slow memory leak?

    a10n 734 0.0 56.6 2423132 1088404 ? S 2014 20:39 /data/www/a10n.mozilla.org/src/a10n/env/bin/python /data/www/a10n.mozilla.org/src/a10n/scripts/a10n hg

    Mem:  1922372k total, 1822556k used,   99816k free, 243248k buffers
    Swap: 2097148k total, 1053808k used, 1043340k free, 175816k cached

    PID USER PR NI VIRT  RES  SHR  S %CPU %MEM TIME+     COMMAND
    734 a10n 20 0  2366m 1.0g 2496 S 0.0  56.6 20:39.35  python
    733 a10n 20 0  409m  32m  2292 S 0.3  1.7  187:51.83 twistd

CCing Gregory. Can you help me? I suspect this has something to do with in-memory caching in long-running hg processes.

The process is https://github.com/Pike/a10n/blob/master/a10n/hg_elmo/worker.py, which in turn calls https://github.com/mozilla/elmo/blob/master/apps/pushes/utils.py#L108, and in the end mercurial.commands.pull/update. That's on python 2.7.3 and hg 2.8.

I guess updating hg should be the first step? Might need code changes, though.
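To watch for a slow leak, the `ps` line quoted above can be turned into numbers and logged over time. The following is only a minimal sketch: it assumes the line is in `ps aux` column order (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND), which the quoted values are consistent with, and `parse_ps_aux_line` and `rss_mib` are hypothetical helpers, not part of a10n.

```python
from collections import namedtuple

ProcSample = namedtuple(
    "ProcSample", "user pid mem_percent vsz_kib rss_kib command"
)


def parse_ps_aux_line(line):
    """Parse one line assumed to be in `ps aux` column order:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND."""
    # Split into at most 11 fields so the command keeps its spaces.
    fields = line.split(None, 10)
    if len(fields) < 11:
        raise ValueError("expected 11 columns, got %d" % len(fields))
    return ProcSample(
        user=fields[0],
        pid=int(fields[1]),
        mem_percent=float(fields[3]),
        vsz_kib=int(fields[4]),  # virtual size, KiB
        rss_kib=int(fields[5]),  # resident set size, KiB
        command=fields[10],
    )


def rss_mib(sample):
    """Resident memory in MiB."""
    return sample.rss_kib / 1024.0


# The line quoted in the description above.
SAMPLE = (
    "a10n 734 0.0 56.6 2423132 1088404 ? S 2014 20:39 "
    "/data/www/a10n.mozilla.org/src/a10n/env/bin/python "
    "/data/www/a10n.mozilla.org/src/a10n/scripts/a10n hg"
)
```

Parsing `SAMPLE` gives a resident size of roughly 1063 MiB for PID 734, in line with the 1.0g that `top` reports for the same process.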
Comment 1•9 years ago
Purely informational: that wasn't a pager bug and could have been moved to a more appropriate place. I just didn't know where that was, so I put it in our problems queue.
Reporter
Comment 3•9 years ago
https://github.com/Pike/a10n/commit/94227bfe8b830fe77bdf69a7099eac3d2f4614e9 landed, I'll now get this into production. I went ahead and landed that without review, as there's little real code change. Just some tidbits in the stage script which are boring.
Reporter
Comment 4•9 years ago
This is deployed, and just upgrading mercurial didn't help. Gregory, is there a trick to flush/invalidate caches within a long-running mercurial process? I'd prefer not to shell out to the command server, in particular for changeset inspection in https://github.com/mozilla/elmo/blob/master/apps/pushes/utils.py#L38
Flags: needinfo?(gps)
Comment 5•9 years ago
mercurial.hg.repository instances won't automatically update if an external process modifies the repository. You should recreate these instances if you suspect an external process may have interacted with the repository.
Flags: needinfo?(gps)
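One way to act on the advice in comment 5 is to reopen the repository handle whenever the on-disk state looks different. The sketch below is an illustration under stated assumptions, not a10n or elmo code: `FreshRepoCache` is a hypothetical name, the opener is injected rather than hard-wired to Mercurial's internal API, and using `.hg/store/00changelog.i` as the change signal assumes the usual store layout.

```python
import os


class FreshRepoCache(object):
    """Hand out a repository handle, reopening it when the on-disk
    changelog differs from when the handle was created."""

    def __init__(self, path, open_repo):
        # open_repo is injected. With Mercurial's internal API it might be
        # something like `lambda p: hg.repository(ui.ui(), p)` (assumption).
        self.path = path
        self._open_repo = open_repo
        self._repo = None
        self._signature = None

    def _changelog_signature(self):
        # Assumed location of the changelog index in a store-based repo.
        changelog = os.path.join(self.path, ".hg", "store", "00changelog.i")
        try:
            st = os.stat(changelog)
        except OSError:
            return None  # empty repository or unexpected layout
        return (st.st_mtime, st.st_size)

    def get(self):
        """Return a handle, recreating it if the changelog changed."""
        signature = self._changelog_signature()
        if self._repo is None or signature != self._signature:
            self._repo = self._open_repo(self.path)
            self._signature = signature
        return self._repo
```

Dropping the old handle also releases whatever that instance had cached, although this sketch makes no claim about how much memory that actually frees.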
Comment 6•9 years ago
Mercurial 2.8 is ancient and has known memory leaks in certain situations. You should be running 3.2.4.
Reporter
Comment 7•9 years ago
Filed bug 1137668 to use hglib, that's gonna move the cache to a separate short-lived process. (We've been using mercurial 3.2.4 for a bit now, and that didn't help.)
Depends on: 1137668
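The hglib approach from comment 7 could look roughly like the sketch below. It is not the code from bug 1137668: `pull_and_update` is a hypothetical helper, and the opener is a parameter so the sketch can be exercised without Mercurial installed. By default it uses `hglib.open` from python-hglib, whose client talks to a command server child process. Whether paths need to be bytes depends on the hglib and Python versions in use.

```python
def pull_and_update(repo_path, source=None, open_client=None):
    """Pull and update through a short-lived Mercurial command server.

    open_client defaults to hglib.open (python-hglib); it is a parameter
    only to keep this sketch testable without Mercurial.
    """
    if open_client is None:
        import hglib  # third-party: python-hglib
        open_client = hglib.open
    client = open_client(repo_path)
    try:
        pulled = client.pull(source=source)
        client.update()
        return pulled
    finally:
        # Closing the client ends the command server child process, and
        # with it whatever that process had cached in memory.
        client.close()
```

Opening one client per task keeps the long-running worker free of Mercurial's in-process caches, at the cost of starting a child process each time.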
Comment 8•9 years ago
and again:

    <nagios-scl3:#sysadmins> Mon 03:41:50 PST [5184] a10n.webapp.scl3.mozilla.com:Swap is WARNING: SWAP WARNING - 50% free (1021 MB out of 2047 MB) (http://m.mozilla.org/Swap)

    PID   USER PR NI VIRT  RES  SHR  S %CPU %MEM TIME+    COMMAND
    15280 a10n 20 0  2120m 811m 2572 S 0.0  43.2 14:08.54 python
    15279 a10n 20 0  335m  30m  2296 S 0.3  1.6  78:20.14 twistd

    [root@a10n.webapp.scl3 pradcliffe]# supervisorctl avail
    program-a10n in use auto 999:999
    program-get-pushes in use auto 999:999
    [root@a10n.webapp.scl3 pradcliffe]# supervisorctl restart program-a10n
    program-a10n: stopped
    program-a10n: started
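The swap figure behind an alert like this can be reproduced from `/proc/meminfo` on Linux. The sketch below is an assumption-laden illustration, not the Nagios check itself: `swap_free_percent` and `swap_is_low` are hypothetical helpers, and the 50% default threshold is only inferred from the alert text above.

```python
def swap_free_percent(meminfo_text):
    """Return free swap as a percentage, from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in ("SwapTotal", "SwapFree"):
            values[key] = int(rest.split()[0])  # value is in kB
    total = values["SwapTotal"]
    if total == 0:
        return 100.0  # no swap configured, so nothing to run out of
    return 100.0 * values["SwapFree"] / total


def swap_is_low(meminfo_text, warn_below_percent=50.0):
    """True when free swap is below the (assumed) warning threshold."""
    return swap_free_percent(meminfo_text) < warn_below_percent
```

On a live host the input would come from reading `/proc/meminfo`. With the swap totals quoted in the description (2097148k total, 1043340k free), the result is about 49.75% free, which falls under the assumed threshold.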
Reporter
Comment 9•7 years ago
I think all the tooling changes that we thought would help this bug are done, and the regressions (also kicking off swap alerts) have been fixed, so I'm marking this FIXED.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED