a10n.webapp.scl3.mozilla.com:Swap alerting

RESOLVED FIXED

Status

Localization Infrastructure and Tools
Automation
3 years ago
11 months ago

People

(Reporter: Pike, Unassigned)

(Reporter)

Description

3 years ago
Making an infra specific bug, leaving the pager in MOC, let's see if that works out OK.

+++ This bug was initially created as a clone of Bug #1120192 +++

The Python script that processes hg data is taking up most of the memory on a10n.webapp.scl3. Swap also appears to have been a problem for some time in November, and the process has been running since last year. Does this job have a slow memory leak?


a10n       734  0.0 56.6 2423132 1088404 ?     S     2014  20:39 /data/www/a10n.mozilla.org/src/a10n/env/bin/python /data/www/a10n.mozilla.org/src/a10n/scripts/a10n hg

Mem:   1922372k total,  1822556k used,    99816k free,   243248k buffers
Swap:  2097148k total,  1053808k used,  1043340k free,   175816k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
  734 a10n      20   0 2366m 1.0g 2496 S  0.0 56.6  20:39.35 python             
  733 a10n      20   0  409m  32m 2292 S  0.3  1.7 187:51.83 twistd
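
As a rough check on the numbers above, the swap summary line from top can be turned into a percentage. This is only a sketch: the helper names are made up for illustration and the regular expression assumes the exact `NNNk total, NNNk used, NNNk free` layout pasted here.

```python
import re

# Summary line copied from the top output above.
SWAP_LINE = "Swap:  2097148k total,  1053808k used,  1043340k free,   175816k cached"


def parse_kb_fields(line):
    """Return the total/used/free values (in k, as printed by top) as a dict."""
    pairs = re.findall(r"(\d+)k\s+(total|used|free)", line)
    return {name: int(value) for value, name in pairs}


def percent_free(fields):
    """Share of the total that is still free, in percent."""
    return 100.0 * fields["free"] / fields["total"]


swap = parse_kb_fields(SWAP_LINE)
swap_free_pct = percent_free(swap)
print("swap free: %.1f%%" % swap_free_pct)
```

With these figures the box has just under half of its swap free, and the single python process accounts for 56.6% of physical memory according to the %MEM column.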

CCing Gregory.

Can you help me? I suspect this has something to do with in-memory caching in long-running hg processes.

The process is https://github.com/Pike/a10n/blob/master/a10n/hg_elmo/worker.py, which in turn calls https://github.com/mozilla/elmo/blob/master/apps/pushes/utils.py#L108, and in the end mercurial.commands.pull/update.

That's on Python 2.7.3 and hg 2.8.

I guess updating hg should be the first step? Might need code changes, though.
(Reporter)

Updated

3 years ago
Blocks: 1120192
No longer depends on: 1120192
(Reporter)

Updated

3 years ago
Depends on: 1120196
Purely informational; that wasn't a pager bug and could have been moved to a more appropriate place. I just didn't know where that was so put it in our problems queue.

Updated

3 years ago
Duplicate of this bug: 1120192
(Reporter)

Comment 3

3 years ago
https://github.com/Pike/a10n/commit/94227bfe8b830fe77bdf69a7099eac3d2f4614e9 landed, I'll now get this into production.

I went ahead and landed that without review, as there's little real code change. Just some tidbits in the stage script which are boring.
(Reporter)

Comment 4

3 years ago
This is deployed, and just upgrading mercurial didn't help.

Gregory, is there a trick to flush/invalidate caches within a long-running mercurial process? I'd prefer not to shell out to the command server, in particular for changeset inspection in https://github.com/mozilla/elmo/blob/master/apps/pushes/utils.py#L38
Flags: needinfo?(gps)

Comment 5

3 years ago
mercurial.hg.repository instances won't automatically update if an external process modifies the repository. You should recreate these instances if you suspect an external process may have interacted with the repository.
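
One way to follow that advice is to open a new repository object for each unit of work rather than holding one for the life of the worker. The sketch below is an assumption about how that could look, not the a10n or elmo code: `with_fresh_repo` and `open_repo` are hypothetical names, and `FakeRepo` is a stand-in so the example runs without Mercurial installed. In a real worker, `open_repo` would presumably wrap `mercurial.hg.repository(ui, path)`.

```python
def with_fresh_repo(open_repo, path, action):
    """Open a new repository object for one unit of work and do not cache it.

    open_repo is a hypothetical callable; in a real worker it might wrap
    mercurial.hg.repository(ui, path).
    """
    repo = open_repo(path)  # new instance for every call
    return action(repo)     # nothing keeps a reference after this returns


class FakeRepo(object):
    """Stand-in for a repository object; counts how often it is opened."""
    opened = 0

    def __init__(self, path):
        FakeRepo.opened += 1
        self.path = path


# Two units of work lead to two separate repository objects.
results = [
    with_fresh_repo(FakeRepo, "/tmp/example-repo", lambda repo: repo.path)
    for _ in range(2)
]
```

Whether dropping the instance also releases the memory seen in this bug is a separate question; the sketch only shows the recreate-per-use pattern.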
Flags: needinfo?(gps)

Comment 6

3 years ago
Mercurial 2.8 is ancient and has known memory leaks in certain situations. You should be running 3.2.4.
(Reporter)

Comment 7

3 years ago
Filed bug 1137668 to use hglib; that's gonna move the cache to a separate short-lived process.

(We've been using mercurial 3.2.4 for a bit now; it didn't help.)
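
The reasoning behind moving the work to a short-lived process is that whatever the child caches is returned to the OS when it exits, so the long-running worker stays small. A minimal sketch of that idea using only the standard library follows. It is not hglib's API; `run_short_lived` is a hypothetical helper name, and the child here is just a Python one-liner so the example is self-contained.

```python
import os
import subprocess
import sys


def run_short_lived(argv):
    """Run argv in a child process and return its stdout as bytes.

    Raises subprocess.CalledProcessError if the child exits non-zero.
    The child's memory, caches included, is released when it exits.
    """
    return subprocess.check_output(argv)


# Demonstration: the work happens under a different PID than the caller.
out = run_short_lived([sys.executable, "-c", "import os; print(os.getpid())"])
child_pid = int(out.decode("ascii").strip())
parent_pid = os.getpid()
```

For the worker in this bug, the command would be something like `["hg", "-R", path, "pull"]` instead of the Python one-liner, at the cost of starting a new process per operation.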
Depends on: 1137668
(Reporter)

Updated

3 years ago
No longer blocks: 1120192
and again:

 <nagios-scl3:#sysadmins> Mon 03:41:50 PST [5184] 
  a10n.webapp.scl3.mozilla.com:Swap is WARNING: SWAP WARNING - 50% free (1021 
  MB out of 2047 MB) (http://m.mozilla.org/Swap)

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
15280 a10n      20   0 2120m 811m 2572 S  0.0 43.2  14:08.54 python             
15279 a10n      20   0  335m  30m 2296 S  0.3  1.6  78:20.14 twistd             

[root@a10n.webapp.scl3 pradcliffe]# supervisorctl avail
program-a10n                     in use    auto      999:999
program-get-pushes               in use    auto      999:999
[root@a10n.webapp.scl3 pradcliffe]# supervisorctl restart program-a10n
program-a10n: stopped
program-a10n: started
(Reporter)

Comment 9

11 months ago
I think all the tooling changes that we thought would help this bug are done, and the regressions (also kicking off swap alerts) have been fixed, so I'm marking this FIXED.
Status: NEW → RESOLVED
Last Resolved: 11 months ago
Resolution: --- → FIXED