Closed Bug 1443984
Opened 6 years ago
Closed 6 years ago

hgweb1[1234].dmz.scl3.mozilla.com have been alerting at 4pm PST

Categories: Developer Services :: Mercurial: hg.mozilla.org (defect)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: sal, Unassigned
Attachments: 1 file
These boxes have been alerting for the past couple of days at around 4pm PST. Did anything change recently?

Today 3/7:

<@nagios-scl3> (IRC) Thu 00:01:48 UTC [5578] [devservices] hgweb14.dmz.scl3.mozilla.com:Out of memory - killed process is WARNING: Log errors: Mar 8 00:01:17 hgweb14.dmz.scl3.mozilla.com kernel: [7102054.080461] Out of memory: Kill process 31481 (httpd) score 59 or sacrifice child (http://m.mozilla.org/Out+of+memory+-+killed+process)
16:07:36 <@nagios-scl3> (IRC) Thu 00:07:36 UTC [5580] [devservices] hgweb12.dmz.scl3.mozilla.com:Out of memory - killed process is WARNING: Log errors: Mar 8 00:01:10 hgweb12.dmz.scl3.mozilla.com kernel: [7103627.937814] Out of memory: Kill process 4549 (httpd) score 61 or sacrifice child (http://m.mozilla.org/Out+of+memory+-+killed+process)
16:08:19 <@nagios-scl3> (IRC) Thu 00:08:19 UTC [5581] [devservices] hgweb11.dmz.scl3.mozilla.com:Out of memory - killed process is WARNING: Log errors: Mar 8 00:01:04 hgweb11.dmz.scl3.mozilla.com kernel: [7104442.862549] Out of memory: Kill process 4783 (httpd) score 54 or sacrifice child (http://m.mozilla.org/Out+of+memory+-+killed+process)
16:09:59 <@nagios-scl3> (IRC) Thu 00:09:58 UTC [5582] [devservices] hgweb13.dmz.scl3.mozilla.com:Out of memory - killed process is WARNING: Log errors: Mar 8 00:01:05 hgweb13.dmz.scl3.mozilla.com kernel: [7103230.242897] Out of memory: Kill process 12690 (httpd) score 62 or sacrifice child (http://m.mozilla.org/Out+of+memory+-+killed+process)

Tuesday 3/6:

<@nagios-scl3> Wed 00:01:46 UTC [5952] [devservices] hgweb13.dmz.scl3.mozilla.com:Zookeeper - hg is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.mozilla.org/Zookeeper+-+hg)
16:01:50 <@nagios-scl3> Wed 00:01:50 UTC [5955] [devservices] hgweb11.dmz.scl3.mozilla.com:Zookeeper - hg is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.mozilla.org/Zookeeper+-+hg)
16:01:56 <@nagios-scl3> Wed 00:01:56 UTC [5958] [devservices] hgweb12.dmz.scl3.mozilla.com:Zookeeper - hg is WARNING: NODE CRITICAL - not in read/write mode: null (http://m.mozilla.org/Zookeeper+-+hg)
16:01:57 <@nagios-scl3> (IRC) Wed 00:01:57 UTC [5959] [devservices] hgweb12.dmz.scl3.mozilla.com:Out of memory - killed process is WARNING: Log errors: Mar 7 00:01:07 hgweb12.dmz.scl3.mozilla.com kernel: [7017223.798692] Out of memory: Kill process 6078 (httpd) score 53 or sacrifice child (http://m.mozilla.org/Out+of+memory+-+killed+process)
16:02:02 <@nagios-scl3> Wed 00:02:02 UTC [5962] [devservices] hgweb14.dmz.scl3.mozilla.com:Zookeeper - hg is WARNING: ENSEMBLE WARNING - node (hgweb12.dmz.scl3.mozilla.com) is alive but not available (http://m.mozilla.org/Zookeeper+-+hg)

Monday 3/5:

<@nagios-scl3> Tue 00:01:27 UTC [5382] [devservices] hgweb14.dmz.scl3.mozilla.com:Zookeeper - hg is WARNING: ENSEMBLE WARNING - socket error connecting to hgweb12.dmz.scl3.mozilla.com: timed out (http://m.mozilla.org/Zookeeper+-+hg)
16:01:28 <@nagios-scl3> Tue 00:01:27 UTC [5385] [devservices] hgweb12.dmz.scl3.mozilla.com:Zookeeper - hg is CRITICAL: socket error connecting to hgweb12.dmz.scl3.mozilla.com: timed out (http://m.mozilla.org/Zookeeper+-+hg)
16:01:29 <@nagios-scl3> Tue 00:01:28 UTC [5388] [devservices] hgweb11.dmz.scl3.mozilla.com:Zookeeper - hg is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.mozilla.org/Zookeeper+-+hg)
16:01:37 <@nagios-scl3> Tue 00:01:36 UTC [5391] [devservices] hgssh4.dmz.scl3.mozilla.com:Zookeeper - hg is WARNING: ENSEMBLE WARNING - node (hgweb12.dmz.scl3.mozilla.com) is alive but not available (http://m.mozilla.org/Zookeeper+-+hg)
Reporter
Comment 1 • 6 years ago
Again today. Had to contact :fubar because I was not able to log in to hgssh4.

16:01:39 <@nagios-scl3> Sun 00:01:40 UTC [5288] [devservices] hgssh4.dmz.scl3.mozilla.com:Kafka - hg heartbeat is CRITICAL: CHECK_NRPE: Socket timeout after 20 seconds. (http://m.mozilla.org/Kafka+-+hg+heartbeat)
16:02:27 <@nagios-scl3> Sun 00:02:28 UTC [5292] [devservices] hgssh5.dmz.scl3.mozilla.com:Kafka - hg heartbeat is CRITICAL: CHECK_NRPE: Socket timeout after 20 seconds. (http://m.mozilla.org/Kafka+-+hg+heartbeat)
16:03:57 <@sal> nagios-scl3: recheck 5288
16:03:57 <@nagios-scl3> sal: rechecking all services on host hgssh4.dmz.scl3.mozilla.com
16:04:05 <@nagios-scl3> Sun 00:04:06 UTC [5296] [devservices] hgweb13.dmz.scl3.mozilla.com:procs - hg Kafka is CRITICAL: PROCS CRITICAL: 0 processes with regex args 'kafka.Kafka' (http://m.mozilla.org/procs+-+hg+Kafka)
16:04:07 <@sal> nagios-scl3: recheck 5292
16:04:07 <@nagios-scl3> sal: rechecking all services on host hgssh5.dmz.scl3.mozilla.com
16:04:08 <@nagios-scl3> Sun 00:04:09 UTC [5299] [devservices] hgssh5.dmz.scl3.mozilla.com:Kafka - hg heartbeat is WARNING: NRPE: Unable to read output (http://m.mozilla.org/Kafka+-+hg+heartbeat)
16:04:27 <@nagios-scl3> Sun 00:04:28 UTC [5302] [devservices] hgssh5.dmz.scl3.mozilla.com:Kafka - hg heartbeat is CRITICAL: CHECK_NRPE: Socket timeout after 20 seconds. (http://m.mozilla.org/Kafka+-+hg+heartbeat)
16:05:05 <@nagios-scl3> Sun 00:05:06 UTC [5305] [devservices] hgweb13.dmz.scl3.mozilla.com:procs - hg Kafka is OK: PROCS OK: 1 process with regex args 'kafka.Kafka' (http://m.mozilla.org/procs+-+hg+Kafka)
16:05:58 <@nagios-scl3> Sun 00:05:59 UTC [5309] [devservices] hgssh4.dmz.scl3.mozilla.com:Kafka - hg heartbeat is OK: OK - heartbeat message sent to 8/8 partitions successfully (http://m.mozilla.org/Kafka+-+hg+heartbeat)
16:06:07 <@nagios-scl3> Sun 00:06:08 UTC [5312] [devservices] hgssh5.dmz.scl3.mozilla.com:Kafka - hg heartbeat is OK: OK - heartbeat message sent to 8/8 partitions successfully (http://m.mozilla.org/Kafka+-+hg+heartbeat)
16:07:58 <@nagios-scl3> Sun 00:07:59 UTC [5315] [devservices] hgssh4.dmz.scl3.mozilla.com:hg push data aggregator lag is CRITICAL: CRITICAL - 92 messages from 8 partitions behind (http://m.mozilla.org/hg+push+data+aggregator+lag)
16:10:35 <@sal> nagios-scl3: recheck 5315
16:10:35 <@nagios-scl3> sal: rechecking all services on host hgssh4.dmz.scl3.mozilla.com
16:10:36 <@nagios-scl3> Sun 00:10:37 UTC [5320] [devservices] hgssh4.dmz.scl3.mozilla.com:procs - hg vcsreplicator aggregator is CRITICAL: PROCS CRITICAL: 0 processes with regex args 'vcsreplicator-aggregator' (http://m.mozilla.org/procs+-+hg+vcsreplicator+aggregator)
Reporter
Comment 2 • 6 years ago
Here are yesterday's alerts, at around the same time (3/8):

<@nagios-scl3> (IRC) Fri 00:07:26 UTC [5626] [devservices] hgweb14.dmz.scl3.mozilla.com:Out of memory - killed process is WARNING: Log errors: Mar 9 00:01:19 hgweb14.dmz.scl3.mozilla.com kernel: [7188454.524513] Out of memory: Kill process 21354 (httpd) score 57 or sacrifice child (http://m.mozilla.org/Out+of+memory+-+killed+process)
16:07:46 <@ryanc> Ope
16:08:25 <@ryanc> Right on time
16:20:23 <@nagios-scl3> Fri 00:20:24 UTC [5635] [devservices] hgssh4.dmz.scl3.mozilla.com:hg sns notifier lag is CRITICAL: CRITICAL - 1/1 partitions out of sync (http://m.mozilla.org/hg+sns+notifier+lag)
Reporter
Comment 3 • 6 years ago
I also get this when trying to reach hgssh4. Weird, because I can access hgssh5 and these are both configured the same way.

sal@work-mb ~ ssh -v sespinoza@hgssh4.dmz.scl3.mozilla.com
OpenSSH_7.3p1, LibreSSL 2.4.1
debug1: Reading configuration data /Users/sal/.ssh/config
debug1: /Users/sal/.ssh/config line 1: Applying options for *
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 20: Applying options for *
debug1: Connecting to hgssh4.dmz.scl3.mozilla.com [10.22.74.45] port 22.
debug1: Connection established.
debug1: identity file /Users/sal/.ssh/R4096_sepinoza type 1
debug1: key_load_public: No such file or directory
debug1: identity file /Users/sal/.ssh/R4096_sepinoza-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.3
debug1: Remote protocol version 2.0, remote software version OpenSSH_7.4
debug1: match: OpenSSH_7.4 pat OpenSSH* compat 0x04000000
debug1: Authenticating to hgssh4.dmz.scl3.mozilla.com:22 as 'sespinoza'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256@libssh.org
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:6vutHUPKyWXcZc0kQB3Bn6t9t3prRTToTWP6u4YqYkI
debug1: Host 'hgssh4.dmz.scl3.mozilla.com' is known and matches the ECDSA host key.
debug1: Found key in /Users/sal/.ssh/known_hosts:1
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS received
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<rsa-sha2-256,rsa-sha2-512>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey
debug1: Next authentication method: publickey
debug1: Offering RSA public key: /Users/sal/.ssh/R4096_sepinoza
debug1: Authentications that can continue: publickey
debug1: No more authentication methods to try.
Permission denied (publickey).
Comment 4 • 6 years ago
Connor or Greg, is this something you can look into to determine the root cause? Kendall, what steps did you take in comment #1 to address the problem?
Flags: needinfo?(sheehan)
Flags: needinfo?(klibby)
Flags: needinfo?(gps)
Comment 5 • 6 years ago
We've been seeing more and more of these OOM alerts for httpd over the past several weeks. I'm not sure what the root cause is.

Also, the alerts may be a bit misleading. We have the httpd processes and we have the WSGI process pool (spawned by the parent httpd process). It is highly unlikely the httpd processes themselves are leaking memory. However, the WSGI processes can and are expected to use >1 GB of memory for some requests. We do have the WSGI processes configured to serve at most 20 requests before they are re-spawned, which should keep memory leaks in check. If their RSS is growing between requests, that could represent a memory leak in Mercurial - possibly due to one of our extensions.

Also, I'd be shocked if any memory leak were so bad that we ran out of system memory. We have 32 GB RAM on these machines plus 16 GB swap. We almost always have >10 GB of headroom for the httpd processes to use (expanding into what is normally occupied by the page cache). We run a max of 24 WSGI processes, so unless every process consumes >1 GB, we should have plenty of memory to avoid an OOM, even under load.

Yet when I look at /var/log/messages I do see OOM events. And swap is apparently full, with WSGI processes utilizing swap. So *something* is definitely going on here. I'll keep poking around.
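For anyone wanting to check the RSS-growth hypothesis themselves, here is a minimal sketch that samples worker memory from /proc between requests. This is not part of our tooling; the 'wsgi' cmdline pattern is an assumption and must be adjusted to match the real worker processes.

```python
#!/usr/bin/env python
# Minimal sketch (assumes a Linux /proc layout; the 'wsgi' cmdline
# pattern is hypothetical - match it to the actual worker processes).
import os
import time

def rss_by_pid(pattern='wsgi'):
    """Return {pid: rss_kb} for processes whose cmdline contains pattern."""
    results = {}
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/cmdline' % pid, 'rb') as f:
                cmdline = f.read().replace(b'\0', b' ').decode('utf-8', 'replace')
            if pattern not in cmdline:
                continue
            with open('/proc/%s/status' % pid) as f:
                for line in f:
                    if line.startswith('VmRSS:'):
                        results[int(pid)] = int(line.split()[1])  # value in kB
        except (IOError, OSError):
            pass  # process exited while we were inspecting it
    return results

if __name__ == '__main__':
    # Print a sample every 5 seconds; a steadily climbing number for the
    # same PID between requests would point at a leak in that worker.
    while True:
        for pid, rss in sorted(rss_by_pid().items()):
            print('%6d %8d kB' % (pid, rss))
        print('---')
        time.sleep(5)
```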
Comment 6 • 6 years ago
I think I have a lead. Shortly after 00:00:00 UTC, we start seeing tons of requests to /mozilla-central/raw-file/tip/security/nss/lib/ckfw/builtins/certdata.txt. I reckon this is a bunch of cron jobs all waking up at the same time to update the Mozilla certificate database. We're flooded with hundreds of these requests within a short time window.

We /should/ be able to handle this load. If I run a local Mercurial server with `hg serve`, this URL is served in <100ms after the first load. However, our per-request CPU logging on the servers (/var/log/hg/parsed-*) says these requests are taking ~1.2s of CPU time. If I run a local server on hgweb via `hg serve --hgmo`, requests complete fast. But if I hit the actual production server process, requests take ~1.2s, just like the logs show. Furthermore, I can see the WSGI processes apparently leaking RSS as requests are made.

So something is awry with how the production servers serve this request, and it doesn't reproduce in the standalone server. The main differences between the standalone server and the production server are:

* httpd + mod_wsgi versus Python's built-in HTTP server
* the hgwebdir WSGI application instead of the hgweb WSGI application

I'll keep poking around until I isolate the problem.
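A rough way to reproduce the timing comparison above is to fetch the URL in a loop and watch the latency. A sketch, not the actual measurement tooling: the localhost:8000 address is an assumption, and with a single-repo `hg serve` the /mozilla-central prefix would be dropped from the path.

```python
# Sketch: time repeated fetches of certdata.txt against a test server,
# then against a production host, and compare.
import time
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

URL = ('http://localhost:8000/mozilla-central/raw-file/tip/'
       'security/nss/lib/ckfw/builtins/certdata.txt')

for i in range(10):
    start = time.time()
    body = urlopen(URL).read()
    print('request %d: %.3fs, %d bytes' % (i, time.time() - start, len(body)))
```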
Comment 7 • 6 years ago
I've isolated this to the hgwebdir WSGI application. I can reproduce the slowness and memory growth by running `/var/hg/venv_hgweb/bin/hg serve --web-config /repo/hg/webroot_wsgi/hgweb.config`, which runs a local server the way httpd would.

What appears to be happening is that the process is leaking repository objects, or some data associated with repositories. The WSGI application does cache repository objects and is supposed to reuse them for subsequent requests. For reasons I don't yet understand, that repository reuse is happening with a local `hg serve` but not when using hgwebdir. This causes a slowdown in request processing (because each request has to read the repository data again instead of having it cached) and is contributing to a memory leak.
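One way to confirm that repository objects are accumulating is to count live instances from inside the process via the gc module. This probe is illustrative only, not part of the deployed code; 'localrepository' is the name of Mercurial's repo class.

```python
import gc

def count_instances(type_name):
    """Count live objects whose type name matches, e.g. 'localrepository'."""
    return sum(1 for obj in gc.get_objects()
               if type(obj).__name__ == type_name)

# Sample this between requests; a count that grows monotonically with
# each request means instances are never reclaimed - i.e. a leak.
print(count_instances('localrepository'))
```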
Flags: needinfo?(sheehan)
Flags: needinfo?(klibby)
Flags: needinfo?(gps)
Updated • 6 years ago
Group: infra
Comment 8 • 6 years ago
Bleh. hgwebdir isn't using the repository caching layer.

https://www.mercurial-scm.org/repo/hg/file/b434965f984e/mercurial/hgweb/hgweb_mod.py is where the hgweb WSGI application does its repo caching.

https://www.mercurial-scm.org/repo/hg/file/b434965f984e/mercurial/hgweb/hgwebdir_mod.py#l291 is where hgwebdir calls the hgweb application.

That in and of itself is bad, because it means hgwebdir will be much slower than it could be if it cached repo instances. And we must be leaking the "repo" object in hgwebdir somehow, otherwise memory use wouldn't increase between requests.
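To make the difference concrete, here is a simplified model of the two code paths. The names and structure are illustrative only; this is not Mercurial's actual code.

```python
class Repo(object):
    """Stand-in for localrepository; construction is the expensive part."""
    def __init__(self, path):
        self.path = path
        self.data = list(range(100000))  # placeholder for revlog/cache state

    def refresh(self):
        pass  # stand-in for cheap revalidation against on-disk state


class CachingHgweb(object):
    """hgweb-style: one long-lived repo per application object,
    revalidated rather than rebuilt on each request."""
    def __init__(self, path):
        self.path = path
        self._repo = None

    def get_repo(self):
        if self._repo is None:
            self._repo = Repo(self.path)  # construction cost paid once
        self._repo.refresh()
        return self._repo


class NaiveHgwebdir(object):
    """What we're seeing here: a fresh repo is built per request. Slow,
    and if a reference cycle keeps each instance alive, it also leaks."""
    def get_repo(self, path):
        return Repo(path)  # construction cost paid on every request


if __name__ == '__main__':
    app = CachingHgweb('/repos/mozilla-central')
    assert app.get_repo() is app.get_repo()  # same object reused
```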
Comment 9 • 6 years ago
I was hoping this would be an easy fix upstream. But the often-simple fix of "drop a reference to an object or call del" isn't working. I suspect there is a cycle within the localrepository class that is keeping the object alive and preventing its refcount from going to 0 and being GCd.

We can work around this by sprinkling a `gc.collect()` into hgwebdir. That's likely not appropriate for upstream, but I may send the patch anyway to see what happens.

Our best short-term mitigation is to drop the max requests per WSGI worker so the impact of the memory leak is reduced.
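For illustration, the workaround amounts to forcing a collection after each request. A minimal sketch written as generic WSGI middleware - the actual fix lands in our serverlog extension, so the real hook point differs:

```python
import gc

class GCAfterRequest(object):
    """Wrap a WSGI app and run a full garbage collection once the
    response body is exhausted, reclaiming objects kept alive only by
    reference cycles (which plain refcounting cannot free)."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        try:
            for chunk in self.app(environ, start_response):
                yield chunk
        finally:
            # Runs on normal exhaustion and when the server calls close().
            gc.collect()

# Usage sketch: application = GCAfterRequest(hgwebdir_app)
```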
Comment 10 • 6 years ago
Here's a histogram of the requests to certdata.txt over a period of ~10 days from 1 server (count, then bucket).

By Hour (UTC):

7134 00    4511 06    3719 12    4506 18
6013 01    4444 07    8415 13    4046 19
4204 02    4483 08    3986 14    3879 20
4105 03    4612 09    4375 15    4088 21
5042 04    4023 10    5862 16    4079 22
4326 05    3680 11    5280 17    4648 23

By Minute:

13584 00    2081 10    1419 20    4905 30    1258 40    1962 50
 2151 01    2384 11    1405 21    2423 31    1195 41    2322 51
 1512 02    2174 12    1263 22    1779 32    1425 42    1717 52
 1516 03    1656 13    1448 23    1527 33    1491 43    1476 53
 1601 04    1383 14    1751 24    1436 34    1543 44    1537 54
 1417 05    6132 15    1527 25    1359 35    4109 45    1407 55
 1373 06    1249 16    1325 26    1423 36    1229 46    1243 56
 1378 07    1173 17    1205 27    1395 37    1379 47    1512 57
 1332 08    1393 18    1120 28    1275 38    1402 48    1494 58
 1344 09    1249 19    1266 29    1168 39    1368 49    1890 59

Requests by hour are pretty well distributed. But we see requests by minute are most concentrated at the top of every hour and then in minutes 15, 30, and 45. I can't wait for more people to adopt systemd's timers, which allow you to introduce random variations for periodic tasks so we don't see a clump of requests at the top of every hour.

To be clear, I have no problem with people hitting hg.mozilla.org to obtain this certificate data. If it weren't for the memory leak, we'd barely notice this load.
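For reference, a histogram like this can be derived from the HTTP access logs with a few lines of Python. The log path and timestamp format below are assumptions (an Apache-style timestamp such as [09/Mar/2018:00:01:19 +0000]); adjust both to the real hgweb logs.

```python
import collections
import re

by_hour = collections.Counter()
by_minute = collections.Counter()
# Captures the HH and MM fields of an Apache-style log timestamp.
ts = re.compile(r'\[\d{2}/\w{3}/\d{4}:(\d{2}):(\d{2}):\d{2}')

with open('/var/log/httpd/access_log') as f:  # assumed path
    for line in f:
        if 'certdata.txt' not in line:
            continue
        m = ts.search(line)
        if m:
            by_hour[m.group(1)] += 1
            by_minute[m.group(2)] += 1

for bucket, counts in (('Hour', by_hour), ('Minute', by_minute)):
    print('By %s:' % bucket)
    for key in sorted(counts):
        print('%6d %s' % (counts[key], key))
```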
Comment hidden (mozreview-request)
Comment 12 • 6 years ago
mozreview-review
Comment on attachment 8958224 [details]
serverlog: perform garbage collection on every request (bug 1443984)

https://reviewboard.mozilla.org/r/227164/#review232956

lgtm.
Attachment #8958224 - Flags: review?(sheehan) → review+
Comment 13 • 6 years ago
Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/83f1cda27356
serverlog: perform garbage collection on every request ; r=sheehan
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Comment 14 • 6 years ago
This is now deployed, and it appears to be having its intended effect of reducing the memory leaks.
Attachment #8958224 - Flags: review?(glob)
Comment 15 • 6 years ago
The same GC hack will be in Mercurial 4.5.2. https://www.mercurial-scm.org/repo/hg/rev/ff2370a70fe8