Closed Bug 1493209 Opened 7 years ago Closed 7 years ago

Investigate problem on Syslog-proxy1.dmz.mdc1.mozilla.com

Categories

(Infrastructure & Operations :: MOC: Service Requests, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: phrozyn, Assigned: Usul)

Details

Today I noticed Nagios errors showing that our workers that connect to this host were timing out, with "Sep 21 15:23:18 mozdef1.private.mdc1.mozilla.com contegix-auditd-worker: amqp.exceptions.NotFound: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'auditd' in vhost 'contegix' due to timeout". So I went to log into the admin interface for RabbitMQ running on that host. It gives me "Management API returned status code 500 -" and the entire page is blank, as though it can't access its queues. I pinged MOC and Usul responded. This bug is for tracking this issue.
Ran puppet:

Error: /Stage[main]/Yum/Resources[yumrepo]: Failed to generate additional resources using 'generate': Section "base" is already defined, cannot redefine in /etc/yum.repos.d/CentOS-Base.repo
Info: Applying configuration version 'db9abfeddeaae62e368e6cfb18154817f5d2dc11'
Notice: /Stage[main]/Rabbitmq::Updateconfig/Exec[update-rabbitmq-vhosts]/returns: executed successfully
Notice: Finished catalog run in 28.05 seconds

sudo yum-wrapper update was unhappy:

[root@syslog-proxy1.dmz.mdc1 yum.repos.d]# grep mirro * |wc -l
31
[root@syslog-proxy1.dmz.mdc1 yum.repos.d]#
[root@syslog-proxy1.dmz.mdc1 yum.repos.d]# grep mdc1 * |wc -l
10
[root@syslog-proxy1.dmz.mdc1 yum.repos.d]#

Tarred the files.

[root@syslog-proxy1.dmz.mdc1 yum.repos.d]# grep mirro * |awk -F: '{print $1}'|uniq
base.repo
CentOS-Base.repo
CentOS-CR.repo
CentOS-Debuginfo.repo
CentOS-fasttrack.repo
centosplus.repo
CentOS-Sources.repo
cr.repo
extras.repo
fasttrack.repo
updates.repo
[root@syslog-proxy1.dmz.mdc1 yum.repos.d]# grep mirro * |awk -F: '{print $1}'|uniq |xargs rm
rm -f C7*

Error: Package: erlang-ic-20.3.8.7-1.el7.x86_64 (erlang-solutions-direct)
    Requires: erlang-kernel(x86-64) = 20.3.8.7-1.el7
    Removing: erlang-kernel-20.3-1.el7.centos.x86_64 (@erlang-solutions-direct)
        erlang-kernel(x86-64) = 20.3-1.el7.centos
    Updated By: erlang-kernel-21.0.6-1.el7.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 21.0.6-1.el7
    Available: erlang-kernel-R16B-03.18.el7.x86_64 (epel)
        erlang-kernel(x86-64) = R16B-03.18.el7
    Available: erlang-kernel-17.1-1.1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 17.1-1.1.el7.centos
    Available: erlang-kernel-17.1-3.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 17.1-3.el7.centos
    Available: erlang-kernel-17.3-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 17.3-1.el7.centos
    Available: erlang-kernel-17.4-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 17.4-1.el7.centos
    Available: erlang-kernel-17.5-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 17.5-1.el7.centos
    Available: erlang-kernel-17.5.3-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 17.5.3-1.el7.centos
    Available: erlang-kernel-18.0-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 18.0-1.el7.centos
    Available: erlang-kernel-18.1-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 18.1-1.el7.centos
    Available: erlang-kernel-18.2-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 18.2-1.el7.centos
    Available: erlang-kernel-18.3-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 18.3-1.el7.centos
    Available: erlang-kernel-19.0-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 19.0-1.el7.centos
    Available: erlang-kernel-19.1-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 19.1-1.el7.centos
    Available: erlang-kernel-19.2-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 19.2-1.el7.centos
    Available: erlang-kernel-19.3-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 19.3-1.el7.centos
    Available: erlang-kernel-20.0-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 20.0-1.el7.centos
    Available: erlang-kernel-20.1-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 20.1-1.el7.centos
    Available: erlang-kernel-20.2-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 20.2-1.el7.centos
    Available: erlang-kernel-20.3.8.7-1.el7.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 20.3.8.7-1.el7
    Available: erlang-kernel-21.0-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 21.0-1.el7.centos
    Available: erlang-kernel-21.0.5-1.el7.x86_64 (erlang-solutions-direct)
        erlang-kernel(x86-64) = 21.0.5-1.el7
Error: Package: erlang-ic-20.3.8.7-1.el7.x86_64 (erlang-solutions-direct)
    Requires: erlang-erts(x86-64) = 20.3.8.7-1.el7
    Removing: erlang-erts-20.3-1.el7.centos.x86_64 (@erlang-solutions-direct)
        erlang-erts(x86-64) = 20.3-1.el7.centos
    Updated By: erlang-erts-21.0.6-1.el7.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 21.0.6-1.el7
    Available: erlang-erts-R16B-03.18.el7.x86_64 (epel)
        erlang-erts(x86-64) = R16B-03.18.el7
    Available: erlang-erts-17.1-1.1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 17.1-1.1.el7.centos
    Available: erlang-erts-17.1-3.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 17.1-3.el7.centos
    Available: erlang-erts-17.3-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 17.3-1.el7.centos
    Available: erlang-erts-17.4-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 17.4-1.el7.centos
    Available: erlang-erts-17.5-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 17.5-1.el7.centos
    Available: erlang-erts-17.5.3-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 17.5.3-1.el7.centos
    Available: erlang-erts-18.0-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 18.0-1.el7.centos
    Available: erlang-erts-18.1-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 18.1-1.el7.centos
    Available: erlang-erts-18.2-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 18.2-1.el7.centos
    Available: erlang-erts-18.3-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 18.3-1.el7.centos
    Available: erlang-erts-19.0-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 19.0-1.el7.centos
    Available: erlang-erts-19.1-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 19.1-1.el7.centos
    Available: erlang-erts-19.2-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 19.2-1.el7.centos
    Available: erlang-erts-19.3-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 19.3-1.el7.centos
    Available: erlang-erts-20.0-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 20.0-1.el7.centos
    Available: erlang-erts-20.1-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 20.1-1.el7.centos
    Available: erlang-erts-20.2-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 20.2-1.el7.centos
    Available: erlang-erts-20.3.8.7-1.el7.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 20.3.8.7-1.el7
    Available: erlang-erts-21.0-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 21.0-1.el7.centos
    Available: erlang-erts-21.0.5-1.el7.x86_64 (erlang-solutions-direct)
        erlang-erts(x86-64) = 21.0.5-1.el7
Error: Package: erlang-ic-20.3.8.7-1.el7.x86_64 (erlang-solutions-direct)
    Requires: erlang-stdlib(x86-64) = 20.3.8.7-1.el7
    Removing: erlang-stdlib-20.3-1.el7.centos.x86_64 (@erlang-solutions-direct)
        erlang-stdlib(x86-64) = 20.3-1.el7.centos
    Updated By: erlang-stdlib-21.0.6-1.el7.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 21.0.6-1.el7
    Available: erlang-stdlib-R16B-03.18.el7.x86_64 (epel)
        erlang-stdlib(x86-64) = R16B-03.18.el7
    Available: erlang-stdlib-17.1-1.1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 17.1-1.1.el7.centos
    Available: erlang-stdlib-17.1-3.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 17.1-3.el7.centos
    Available: erlang-stdlib-17.3-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 17.3-1.el7.centos
    Available: erlang-stdlib-17.4-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 17.4-1.el7.centos
    Available: erlang-stdlib-17.5-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 17.5-1.el7.centos
    Available: erlang-stdlib-17.5.3-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 17.5.3-1.el7.centos
    Available: erlang-stdlib-18.0-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 18.0-1.el7.centos
    Available: erlang-stdlib-18.1-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 18.1-1.el7.centos
    Available: erlang-stdlib-18.2-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 18.2-1.el7.centos
    Available: erlang-stdlib-18.3-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 18.3-1.el7.centos
    Available: erlang-stdlib-19.0-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 19.0-1.el7.centos
    Available: erlang-stdlib-19.1-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 19.1-1.el7.centos
    Available: erlang-stdlib-19.2-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 19.2-1.el7.centos
    Available: erlang-stdlib-19.3-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 19.3-1.el7.centos
    Available: erlang-stdlib-20.0-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 20.0-1.el7.centos
    Available: erlang-stdlib-20.1-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 20.1-1.el7.centos
    Available: erlang-stdlib-20.2-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 20.2-1.el7.centos
    Available: erlang-stdlib-20.3.8.7-1.el7.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 20.3.8.7-1.el7
    Available: erlang-stdlib-21.0-1.el7.centos.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 21.0-1.el7.centos
    Available: erlang-stdlib-21.0.5-1.el7.x86_64 (erlang-solutions-direct)
        erlang-stdlib(x86-64) = 21.0.5-1.el7
 You could try using --skip-broken to work around the problem
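The yum output above is long, but each "Error: Package:" block says the same thing: erlang-ic 20.3.8.7 wants an exact version of a sibling package, while the installed 20.3 copy is being replaced by 21.0.6. A minimal sketch for pulling out just those facts, assuming the output keeps the shape pasted in this bug (the function name and the regular expression are illustrative, not part of yum):

```python
import re

# Matches one "Error: Package: ... Requires: ... Removing: ... Updated By: ..."
# group. \s+ between fields lets it work on both wrapped and flattened output.
CONFLICT = re.compile(
    r"Error: Package: (?P<package>\S+) \((?P<repo>[^)]+)\)\s+"
    r"Requires: (?P<capability>\S+) = (?P<required>\S+)\s+"
    r"Removing: \S+ \([^)]+\)\s+\S+ = (?P<installed>\S+)\s+"
    r"Updated By: \S+ \([^)]+\)\s+\S+ = (?P<candidate>\S+)"
)


def summarize_conflicts(yum_output):
    """Return one dict per dependency error found in the yum output text."""
    return [match.groupdict() for match in CONFLICT.finditer(yum_output)]


def format_conflict(conflict):
    """Render a conflict dict as a single readable line."""
    return (
        f"{conflict['package']} requires {conflict['capability']} = "
        f"{conflict['required']} (installed {conflict['installed']}, "
        f"update candidate {conflict['candidate']})"
    )
```

Feeding the saved yum output through summarize_conflicts and printing format_conflict for each entry gives one line per error, which makes the version mismatch easier to see than the full "Available:" listing.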
Assignee: nobody → ludovic
sudo yum-wrapper update --skip-broken

Install 1 Package
Upgrade 72 Packages
Remove 1 Package
Skipped (dependency problems) 106 Packages

Total download size: 195 M
Is this ok [y/d/N]: y

Rebooted.
I'll log in and take a look at the erlang issue. We had digi update rabbit and erlang to later versions upon migration to work around some TLS issues. It may be that the erlang repo simply can't be accessed or something.
Summary: Investigate problem on Syslog-proxy1.private.mdc1.mozilla.com → Investigate problem on Syslog-proxy1.dmz.mdc1.mozilla.com
After the initial handshake timeout (caused by patching mozdef: any connections the workers had initiated timed out because the workers were shut down), we see a file descriptor limit alarm:

2018-09-21 03:47:34.603 [warning] <0.271.0> file descriptor limit alarm set.~n~n********************************************************************~n*** New connections will not be accepted until this alarm clears ***~n********************************************************************~n

About 40 seconds later it clears, then immediately retriggers and clears again:

2018-09-21 03:48:14.708 [warning] <0.271.0> file descriptor limit alarm cleared~n
2018-09-21 03:48:14.717 [warning] <0.271.0> file descriptor limit alarm set.~n~n********************************************************************~n*** New connections will not be accepted until this alarm clears ***~n********************************************************************~n
2018-09-21 03:48:14.719 [warning] <0.271.0> file descriptor limit alarm cleared~n

This happens one more time, and then we get:

2018-09-21 03:58:49.283 [warning] <0.1130.0> Ranch acceptor reducing accept rate: out of file descriptors
2018-09-21 03:58:50.263 [error] <0.273.0> ** Generic server vm_memory_monitor terminating

After the reboot:

File Descriptors: current system limit is 791357, and in use are 1408

We should add a Nagios alert to monitor this, based on:

cat /proc/sys/fs/file-nr

The first number is the number of file handles in use, and the last number is the maximum the system can handle.
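A minimal sketch of what such a Nagios check could look like, assuming the three whitespace-separated fields of /proc/sys/fs/file-nr (allocated, allocated but unused, maximum). The function names and the 80%/90% thresholds are assumptions for illustration and are not taken from any existing MOC check:

```python
def parse_file_nr(text):
    """Return (allocated, unused, maximum) from /proc/sys/fs/file-nr contents."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def check_file_nr(text, warn=0.80, crit=0.90):
    """Return (exit_code, message) using the usual Nagios plugin exit codes.

    0 = OK, 1 = WARNING, 2 = CRITICAL. The thresholds are fractions of the
    system-wide maximum and are placeholder values.
    """
    allocated, _unused, maximum = parse_file_nr(text)
    ratio = allocated / maximum
    if ratio >= crit:
        code, label = 2, "CRITICAL"
    elif ratio >= warn:
        code, label = 1, "WARNING"
    else:
        code, label = 0, "OK"
    message = f"{label} - {allocated} of {maximum} file handles allocated ({ratio:.1%})"
    return code, message
```

A plugin wrapper would read /proc/sys/fs/file-nr, pass its contents to check_file_nr, print the message, and exit with the returned code. Note that this watches the system-wide limit only; the per-process limit that RabbitMQ actually hit would need a separate check.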
ulimit -aH showed the open file limit to be 4096, which is way too low for the times when we are patching mozdef hosts. During patching, rabbit piles up messages and may even be caching them to disk for persistence. So I am upping the fd limit to 100k with a soft limit of 65535. It should only ever hit this limit during our patching, if it does at all. After a reboot we have the new limit:

[root@syslog-proxy1.dmz.mdc1 ~]# ulimit -aH
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 31204
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 100000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 31204
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
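For a scripted version of the same verification, here is a small sketch using Python's standard resource module. It reports the limits of the process that runs it, so it would have to be run as the user (or from the service environment) being checked. The helper names are hypothetical; the default targets of 65535 soft and 100000 hard are the values chosen in this bug:

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def meets_target(limits, soft_target=65535, hard_target=100000):
    """Check a (soft, hard) pair against the targets from this bug."""
    soft, hard = limits

    def at_least(value, target):
        # RLIM_INFINITY means "unlimited", which satisfies any target.
        return value == resource.RLIM_INFINITY or value >= target

    return at_least(soft, soft_target) and at_least(hard, hard_target)
```

Calling meets_target(nofile_limits()) returns True only when both limits are at or above the targets.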
[root@syslog-proxy1.dmz.mdc1 ~]# cat /proc/767/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             31204                31204                processes
Max open files            1024                 4096                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       31204                31204                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

[root@syslog-proxy1.dmz.mdc1 ~]# su - rabbitmq -s /bin/sh -c 'ulimit -n'
65535

[root@syslog-proxy1.dmz.mdc1 ~]# rabbitmqctl status | grep -A 4 limit
 {vm_memory_limit,3280923852},
 {disk_free_limit,50000000},
 {disk_free,50213949440},
 {file_descriptors,
     [{total_limit,924},   <-- this is hard coded but I believe it changes dynamically.
      {total_used,112},
      {sockets_limit,829},
      {sockets_used,90}]},
 {processes,[{limit,1048576},{used,1353}]},
 {run_queue,0},
 {uptime,453},
 {kernel,{net_ticktime,60}}]
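The output above shows PID 767 still at 1024 soft / 4096 hard while a fresh shell for the rabbitmq user reports 65535 (the thread does not say which process PID 767 is). Since the two can differ, it may be worth reading the limit from the running process directly. A minimal sketch, assuming the standard /proc/<pid>/limits layout; the function names are hypothetical:

```python
import re


def parse_open_files_limit(limits_text):
    """Return (soft, hard) for the 'Max open files' row of /proc/<pid>/limits.

    'unlimited' is returned as None. Whitespace between columns may be
    spaces or newlines, so flattened copies of the file also parse.
    """
    match = re.search(r"Max open files\s+(\S+)\s+(\S+)", limits_text)
    if match is None:
        raise ValueError("no 'Max open files' entry found")

    def to_number(value):
        return None if value == "unlimited" else int(value)

    return to_number(match.group(1)), to_number(match.group(2))


def read_open_files_limit(pid):
    """Read and parse the open-file limits of a running process."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_open_files_limit(handle.read())
```

Running read_open_files_limit against the PID of the RabbitMQ Erlang VM after a restart would show whether the service itself picked up the new limit, independent of what a login shell reports.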
Closing this bug, will reopen if we find this issue recurring.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME