Closed Bug 1443984 Opened 6 years ago Closed 6 years ago

hgweb1[1234].dmz.scl3.mozilla.com have been alerting at 4pm PST

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sal, Unassigned)

Details

Attachments

(1 file)

These boxes have been alerting the past couple of days at around 4pm PST. Did anything change recently? 

Today 3/7:
@nagios-scl3> (IRC) Thu 00:01:48 UTC [5578] [devservices] hgweb14.dmz.scl3.mozilla.com:Out of memory - killed process is WARNING: Log errors: Mar  8 00:01:17 hgweb14.dmz.scl3.mozilla.com kernel: [7102054.080461] Out of memory: Kill process 31481 (httpd) score 59 or sacrifice child (http://m.mozilla.org/Out+of+memory+-+killed+process)
16:07:36 <@nagios-scl3> (IRC) Thu 00:07:36 UTC [5580] [devservices] hgweb12.dmz.scl3.mozilla.com:Out of memory - killed process is WARNING: Log errors: Mar  8 00:01:10 hgweb12.dmz.scl3.mozilla.com kernel: [7103627.937814] Out of memory: Kill process 4549 (httpd) score 61 or sacrifice child (http://m.mozilla.org/Out+of+memory+-+killed+process)
16:08:19 <@nagios-scl3> (IRC) Thu 00:08:19 UTC [5581] [devservices] hgweb11.dmz.scl3.mozilla.com:Out of memory - killed process is WARNING: Log errors: Mar  8 00:01:04 hgweb11.dmz.scl3.mozilla.com kernel: [7104442.862549] Out of memory: Kill process 4783 (httpd) score 54 or sacrifice child (http://m.mozilla.org/Out+of+memory+-+killed+process)
16:09:59 <@nagios-scl3> (IRC) Thu 00:09:58 UTC [5582] [devservices] hgweb13.dmz.scl3.mozilla.com:Out of memory - killed process is WARNING: Log errors: Mar  8 00:01:05 hgweb13.dmz.scl3.mozilla.com kernel: [7103230.242897] Out of memory: Kill process 12690 (httpd) score 62 or sacrifice child (http://m.mozilla.org/Out+of+memory+-+killed+process)

Tuesday 3/6:
@nagios-scl3> Wed 00:01:46 UTC [5952] [devservices] hgweb13.dmz.scl3.mozilla.com:Zookeeper - hg is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.mozilla.org/Zookeeper+-+hg)
16:01:50 <@nagios-scl3> Wed 00:01:50 UTC [5955] [devservices] hgweb11.dmz.scl3.mozilla.com:Zookeeper - hg is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.mozilla.org/Zookeeper+-+hg)
16:01:56 <@nagios-scl3> Wed 00:01:56 UTC [5958] [devservices] hgweb12.dmz.scl3.mozilla.com:Zookeeper - hg is WARNING: NODE CRITICAL - not in read/write mode: null (http://m.mozilla.org/Zookeeper+-+hg)
16:01:57 <@nagios-scl3> (IRC) Wed 00:01:57 UTC [5959] [devservices] hgweb12.dmz.scl3.mozilla.com:Out of memory - killed process is WARNING: Log errors: Mar  7 00:01:07 hgweb12.dmz.scl3.mozilla.com kernel: [7017223.798692] Out of memory: Kill process 6078 (httpd) score 53 or sacrifice child (http://m.mozilla.org/Out+of+memory+-+killed+process)
16:02:02 <@nagios-scl3> Wed 00:02:02 UTC [5962] [devservices] hgweb14.dmz.scl3.mozilla.com:Zookeeper - hg is WARNING: ENSEMBLE WARNING - node (hgweb12.dmz.scl3.mozilla.com) is alive but not available (http://m.mozilla.org/Zookeeper+-+hg)

Monday 3/5:
@nagios-scl3> Tue 00:01:27 UTC [5382] [devservices] hgweb14.dmz.scl3.mozilla.com:Zookeeper - hg is WARNING: ENSEMBLE WARNING - socket error connecting to hgweb12.dmz.scl3.mozilla.com: timed out (http://m.mozilla.org/Zookeeper+-+hg)
16:01:28 <@nagios-scl3> Tue 00:01:27 UTC [5385] [devservices] hgweb12.dmz.scl3.mozilla.com:Zookeeper - hg is CRITICAL: socket error connecting to hgweb12.dmz.scl3.mozilla.com: timed out (http://m.mozilla.org/Zookeeper+-+hg)
16:01:29 <@nagios-scl3> Tue 00:01:28 UTC [5388] [devservices] hgweb11.dmz.scl3.mozilla.com:Zookeeper - hg is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.mozilla.org/Zookeeper+-+hg)
16:01:37 <@nagios-scl3> Tue 00:01:36 UTC [5391] [devservices] hgssh4.dmz.scl3.mozilla.com:Zookeeper - hg is WARNING: ENSEMBLE WARNING - node (hgweb12.dmz.scl3.mozilla.com) is alive but not available (http://m.mozilla.org/Zookeeper+-+hg)
Again today; had to contact :fubar because I was not able to log in to hgssh4.



16:01:39 <@nagios-scl3> Sun 00:01:40 UTC [5288] [devservices] hgssh4.dmz.scl3.mozilla.com:Kafka - hg heartbeat is CRITICAL: CHECK_NRPE: Socket timeout after 20 seconds. (http://m.mozilla.org/Kafka+-+hg+heartbeat)
16:02:27 <@nagios-scl3> Sun 00:02:28 UTC [5292] [devservices] hgssh5.dmz.scl3.mozilla.com:Kafka - hg heartbeat is CRITICAL: CHECK_NRPE: Socket timeout after 20 seconds. (http://m.mozilla.org/Kafka+-+hg+heartbeat)
16:03:57 <@sal> nagios-scl3: recheck 5288
16:03:57 <@nagios-scl3> sal: rechecking all services on host hgssh4.dmz.scl3.mozilla.com
16:04:05 <@nagios-scl3> Sun 00:04:06 UTC [5296] [devservices] hgweb13.dmz.scl3.mozilla.com:procs - hg Kafka is CRITICAL: PROCS CRITICAL: 0 processes with regex args 'kafka.Kafka' (http://m.mozilla.org/procs+-+hg+Kafka)
16:04:07 <@sal> nagios-scl3: recheck 5292
16:04:07 <@nagios-scl3> sal: rechecking all services on host hgssh5.dmz.scl3.mozilla.com
16:04:08 <@nagios-scl3> Sun 00:04:09 UTC [5299] [devservices] hgssh5.dmz.scl3.mozilla.com:Kafka - hg heartbeat is WARNING: NRPE: Unable to read output (http://m.mozilla.org/Kafka+-+hg+heartbeat)
16:04:27 <@nagios-scl3> Sun 00:04:28 UTC [5302] [devservices] hgssh5.dmz.scl3.mozilla.com:Kafka - hg heartbeat is CRITICAL: CHECK_NRPE: Socket timeout after 20 seconds. (http://m.mozilla.org/Kafka+-+hg+heartbeat)
16:05:05 <@nagios-scl3> Sun 00:05:06 UTC [5305] [devservices] hgweb13.dmz.scl3.mozilla.com:procs - hg Kafka is OK: PROCS OK: 1 process with regex args 'kafka.Kafka' (http://m.mozilla.org/procs+-+hg+Kafka)
16:05:58 <@nagios-scl3> Sun 00:05:59 UTC [5309] [devservices] hgssh4.dmz.scl3.mozilla.com:Kafka - hg heartbeat is OK: OK - heartbeat message sent to 8/8 partitions successfully (http://m.mozilla.org/Kafka+-+hg+heartbeat)
16:06:07 <@nagios-scl3> Sun 00:06:08 UTC [5312] [devservices] hgssh5.dmz.scl3.mozilla.com:Kafka - hg heartbeat is OK: OK - heartbeat message sent to 8/8 partitions successfully (http://m.mozilla.org/Kafka+-+hg+heartbeat)
16:07:58 <@nagios-scl3> Sun 00:07:59 UTC [5315] [devservices] hgssh4.dmz.scl3.mozilla.com:hg push data aggregator lag is CRITICAL: CRITICAL - 92 messages from 8 partitions behind (http://m.mozilla.org/hg+push+data+aggregator+lag)
16:10:35 <@sal> nagios-scl3: recheck 5315
16:10:35 <@nagios-scl3> sal: rechecking all services on host hgssh4.dmz.scl3.mozilla.com
16:10:36 <@nagios-scl3> Sun 00:10:37 UTC [5320] [devservices] hgssh4.dmz.scl3.mozilla.com:procs - hg vcsreplicator aggregator is CRITICAL: PROCS CRITICAL: 0 processes with regex args 'vcsreplicator-aggregator' (http://m.mozilla.org/procs+-+hg+vcsreplicator+aggregator)
Here are yesterday's alerts from around the same time:


3/8 alerts:
@nagios-scl3> (IRC) Fri 00:07:26 UTC [5626] [devservices] hgweb14.dmz.scl3.mozilla.com:Out of memory - killed process is WARNING: Log errors: Mar  9 00:01:19 hgweb14.dmz.scl3.mozilla.com kernel: [7188454.524513] Out of memory: Kill process 21354 (httpd) score 57 or sacrifice child (http://m.mozilla.org/Out+of+memory+-+killed+process)
16:07:46 <@ryanc> Ope
16:08:25 <@ryanc> Right on time
16:20:23 <@nagios-scl3> Fri 00:20:24 UTC [5635] [devservices] hgssh4.dmz.scl3.mozilla.com:hg sns notifier lag is CRITICAL: CRITICAL - 1/1 partitions out of sync (http://m.mozilla.org/hg+sns+notifier+lag)
I also get this when trying to reach hgssh4, which is weird because I can access hgssh5 and both are configured the same way:


sal@work-mb  ~  ssh -v sespinoza@hgssh4.dmz.scl3.mozilla.com
OpenSSH_7.3p1, LibreSSL 2.4.1
debug1: Reading configuration data /Users/sal/.ssh/config
debug1: /Users/sal/.ssh/config line 1: Applying options for *
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 20: Applying options for *
debug1: Connecting to hgssh4.dmz.scl3.mozilla.com [10.22.74.45] port 22.
debug1: Connection established.
debug1: identity file /Users/sal/.ssh/R4096_sepinoza type 1
debug1: key_load_public: No such file or directory
debug1: identity file /Users/sal/.ssh/R4096_sepinoza-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.3
debug1: Remote protocol version 2.0, remote software version OpenSSH_7.4
debug1: match: OpenSSH_7.4 pat OpenSSH* compat 0x04000000
debug1: Authenticating to hgssh4.dmz.scl3.mozilla.com:22 as 'sespinoza'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256@libssh.org
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:6vutHUPKyWXcZc0kQB3Bn6t9t3prRTToTWP6u4YqYkI
debug1: Host 'hgssh4.dmz.scl3.mozilla.com' is known and matches the ECDSA host key.
debug1: Found key in /Users/sal/.ssh/known_hosts:1
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS received
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<rsa-sha2-256,rsa-sha2-512>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey
debug1: Next authentication method: publickey
debug1: Offering RSA public key: /Users/sal/.ssh/R4096_sepinoza
debug1: Authentications that can continue: publickey
debug1: No more authentication methods to try.
Permission denied (publickey).
Connor or Greg, is this something you can look into to determine the root cause?

Kendall, what steps did you take in comment #1 to address the problem?
Flags: needinfo?(sheehan)
Flags: needinfo?(klibby)
Flags: needinfo?(gps)
We've been seeing more and more of these OOM alerts for httpd over the past several weeks. I'm not sure what the root cause is.

Also, the alerts may be a bit misleading. We have the httpd processes and we have the WSGI process pool (spawned by the parent httpd process). It is highly unlikely the httpd processes are leaking memory. However, the WSGI processes can, and are expected to, use >1 GB of memory for some requests.

We do have the WSGI processes configured to serve at most 20 requests before they are re-spawned. This should keep memory leaks in check. If their RSS is growing between requests, that could represent a memory leak in Mercurial - possibly due to one of our extensions.
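
One way to confirm whether worker RSS really does creep up between requests is a quick check against /proc. This is a Linux-only sketch for manual poking, not something wired into the servers today; the simulated allocation just stands in for "serve one request":

    def rss_kb():
        """Return the current resident set size in kB (Linux /proc only)."""
        with open('/proc/self/status') as fh:
            for line in fh:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])
        return 0

    before = rss_kb()
    junk = ['x' * 1024 for _ in range(10000)]   # stand-in for serving a request
    after = rss_kb()
    print('RSS grew by %d kB' % (after - before))

Logging this before and after each WSGI request in a worker would show whether memory is returned between requests or steadily retained.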

Also, I'd be shocked if any memory leaks were so bad that we ran out of system memory. We have 32 GB of RAM on these machines plus 16 GB of swap. We almost always have >10 GB of headroom for the httpd processes to use (expanding into what is normally utilized by the page cache). We run a max of 24 WSGI processes. So unless every process consumes >1 GB, we should have plenty of memory to avoid OOMing, even under load.

Yet I look at /var/log/messages and I do see OOM events. And swap is apparently full, with wsgi processes utilizing swap. So *something* is definitely going on here. I'll keep poking around.
I think I have a lead.

Shortly after 00:00:00 UTC, we start seeing tons of requests to /mozilla-central/raw-file/tip/security/nss/lib/ckfw/builtins/certdata.txt. I reckon this is a bunch of cron jobs all waking up at the same time to update the Mozilla certificate database. We're flooded with hundreds of these requests within a short time window.

We /should/ be able to handle this load. If I run a local Mercurial server with `hg serve`, this URL is served in <100ms after the first load.

However, our individual request CPU logging on the servers (/var/log/hg/parsed-*) says these requests are taking ~1.2s of CPU time. If I run a local server on hgweb via `hg serve --hgmo`, requests complete fast. But if I hit the actual production server process, requests are taking ~1.2s, just like the logs show.
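
For a rough client-side comparison, timing repeated fetches of the same URL against each server works. Note this measures wall time, not the server-side CPU time the parsed logs report, and the localhost URL below assumes a local hgwebdir-style server on the default `hg serve` port:

    import time
    import urllib.request

    # Point this at the local test server or the production vhost to compare.
    URL = ('http://localhost:8000/mozilla-central/raw-file/tip/'
           'security/nss/lib/ckfw/builtins/certdata.txt')

    for i in range(5):
        start = time.perf_counter()
        with urllib.request.urlopen(URL) as resp:
            resp.read()
        print('request %d: %.3fs wall time' % (i + 1, time.perf_counter() - start))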

Furthermore, I can see the WSGI processes apparently leaking RSS as requests are made.

So something is awry with the production servers serving this request. It doesn't reproduce in the standalone server. The main differences between the standalone server and the production server are:

* httpd + mod_wsgi versus Python's built-in HTTP server
* hgwebdir WSGI application instead of hgweb WSGI application

I'll keep poking around until I isolate the problem.
I've isolated this to the hgwebdir WSGI application. I can reproduce slowness and memory growth by running `/var/hg/venv_hgweb/bin/hg serve --web-config /repo/hg/webroot_wsgi/hgweb.config`, which runs a local server the way httpd would.

What appears to be happening is that the process is leaking repository objects or some data associated with repositories. The WSGI application does cache repository objects, and it is supposed to reuse them for subsequent requests. For reasons I don't yet understand, that repository reuse is happening in a local `hg serve` but not when using hgwebdir. This causes a slowdown in request processing (because each request has to read the repository data again instead of having it cached) and is contributing to a memory leak.
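
Conceptually, the reuse that should be happening looks like the cache below. This is only an illustration of the idea; Repository is a hypothetical stand-in, not Mercurial's real localrepository/hgweb code:

    class Repository(object):
        """Hypothetical stand-in for an expensive-to-open repository object."""
        def __init__(self, path):
            self.path = path
            # Opening a real repo re-reads changelog indexes, the dirstate,
            # hgrc, etc., which is exactly the work a per-process cache avoids.

    _repo_cache = {}

    def get_repo(path):
        """Return a cached Repository for `path`, creating it only once."""
        repo = _repo_cache.get(path)
        if repo is None:
            repo = Repository(path)
            _repo_cache[path] = repo
        return repo

    # Two requests for the same repo should hit the same object:
    assert get_repo('/repo/hg/mozilla/mozilla-central') is \
           get_repo('/repo/hg/mozilla/mozilla-central')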
Flags: needinfo?(sheehan)
Flags: needinfo?(klibby)
Flags: needinfo?(gps)
Group: infra
Bleh. hgwebdir isn't using the repository caching layer.

https://www.mercurial-scm.org/repo/hg/file/b434965f984e/mercurial/hgweb/hgweb_mod.py is where the hgweb WSGI application does its repo caching.

https://www.mercurial-scm.org/repo/hg/file/b434965f984e/mercurial/hgweb/hgwebdir_mod.py#l291 is where hgwebdir calls the hgweb application.

That in and of itself is bad because it means hgwebdir will be much slower than it could be if it were to cache repo instances.

And we must be leaking the "repo" object in hgwebdir somehow, otherwise memory use wouldn't increase between requests.
I was hoping this would be an easy fix upstream. But the often simple fix of "drop a reference to an object or call del" isn't working. I suspect there is a cycle within the localrepository class that is keeping the object alive and preventing its refcount from reaching 0 so it can be GC'd.
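
As a toy illustration of the suspicion (not the actual localrepository internals): two objects that reference each other never drop to a refcount of zero, so `del` alone does nothing and they sit in memory until the cyclic collector runs.

    import gc

    class Node(object):
        def __init__(self):
            self.other = None

    a, b = Node(), Node()
    a.other, b.other = b, a   # reference cycle
    del a, b                  # refcounts stay above zero; nothing is freed yet

    # Only the cyclic garbage collector can reclaim the pair.
    print('unreachable objects collected:', gc.collect())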

We can work around this by sprinkling a `gc.collect()` in hgwebdir. But that's likely not appropriate for upstream. I may try sending the patch to see what happens.
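
The change would live in our serverlog extension, but the shape of the workaround, written as a generic WSGI wrapper, is roughly this (illustrative only, not the real patch):

    import gc

    class CollectAfterRequest(object):
        """WSGI middleware that forces a collection after each response so
        cyclic garbage (like leaked repo objects) can't accumulate."""

        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            try:
                for chunk in self.app(environ, start_response):
                    yield chunk
            finally:
                gc.collect()

    # application = CollectAfterRequest(application)  # hypothetical wiring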

Our best short-term mitigation is to drop the max requests per WSGI worker so the impact of the memory leak is reduced.
Here's a histogram of the requests to certdata.txt over a period of ~10 days from 1 server.

By Hour (UTC):

   7134 00
   6013 01
   4204 02
   4105 03
   5042 04
   4326 05
   4511 06
   4444 07
   4483 08
   4612 09
   4023 10
   3680 11
   3719 12
   8415 13
   3986 14
   4375 15
   5862 16
   5280 17
   4506 18
   4046 19
   3879 20
   4088 21
   4079 22
   4648 23

By Minute:

  13584 00
   2151 01
   1512 02
   1516 03
   1601 04
   1417 05
   1373 06
   1378 07
   1332 08
   1344 09
   2081 10
   2384 11
   2174 12
   1656 13
   1383 14
   6132 15
   1249 16
   1173 17
   1393 18
   1249 19
   1419 20
   1405 21
   1263 22
   1448 23
   1751 24
   1527 25
   1325 26
   1205 27
   1120 28
   1266 29
   4905 30
   2423 31
   1779 32
   1527 33
   1436 34
   1359 35
   1423 36
   1395 37
   1275 38
   1168 39
   1258 40
   1195 41
   1425 42
   1491 43
   1543 44
   4109 45
   1229 46
   1379 47
   1402 48
   1368 49
   1962 50
   2322 51
   1717 52
   1476 53
   1537 54
   1407 55
   1243 56
   1512 57
   1494 58
   1890 59

Requests by hour are pretty well distributed. But by minute, requests are heavily concentrated at the top of every hour, with smaller spikes at minutes 15, 30, and 45.
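
For reference, counts like those above can be reproduced with a short pass over the access logs. The timestamp regex below assumes a combined-log-style `[09/Mar/2018:00:01:19 +0000]` field, which may not match the production log format exactly:

    import collections
    import re
    import sys

    TS = re.compile(r'\[\d+/\w+/\d+:(\d\d):(\d\d):\d\d')

    by_hour = collections.Counter()
    by_minute = collections.Counter()

    for line in sys.stdin:
        if 'certdata.txt' not in line:
            continue
        m = TS.search(line)
        if m:
            by_hour[m.group(1)] += 1
            by_minute[m.group(2)] += 1

    for hour, count in sorted(by_hour.items()):
        print('%7d %s' % (count, hour))

Fed the relevant access logs on stdin, it prints the per-hour column; swap in by_minute for the second table.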

I can't wait for more people to adopt systemd's timers, which allow you to introduce random variations for periodic tasks so we don't see a clump of requests at the top of every hour.
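
Until then, clients can get the same effect by adding their own jitter. A sketch of what such a cron job could do (the URL is the real certdata path; the five-minute window is an arbitrary choice):

    import random
    import time
    import urllib.request

    CERTDATA_URL = ('https://hg.mozilla.org/mozilla-central/raw-file/tip/'
                    'security/nss/lib/ckfw/builtins/certdata.txt')

    # Sleep a random amount so hourly jobs on many machines don't all hit
    # the server in the same second at the top of the hour.
    time.sleep(random.uniform(0, 300))

    with urllib.request.urlopen(CERTDATA_URL) as resp:
        data = resp.read()
    print('fetched %d bytes of certdata' % len(data))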

To be clear, I have no problem with people hitting hg.mozilla.org to obtain this certificate data. If it weren't for the memory leak, we'd barely notice this load.
Comment on attachment 8958224 [details]
serverlog: perform garbage collection on every request (bug 1443984);

https://reviewboard.mozilla.org/r/227164/#review232956

lgtm.
Attachment #8958224 - Flags: review?(sheehan) → review+
Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/83f1cda27356
serverlog: perform garbage collection on every request ; r=sheehan
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
This is now deployed and it appears to be having the intended effect of reducing memory leaks.
Attachment #8958224 - Flags: review?(glob)
The same GC hack will be in Mercurial 4.5.2. https://www.mercurial-scm.org/repo/hg/rev/ff2370a70fe8