Datadog alerts fired for:
- tiles.prod.infernyx.application_stats_last_complete_age
- tiles.prod.infernyx.impression_stats_last_complete_age
- tiles.prod.redshift.impression_stats_daily.rows
- tiles.prod.redshift.application_stats_daily.rows

These fired at approximately the same time, which points to a possible job processing issue with infernyx.
- Tried SSH'ing to infernyx from the bastion host; the connection timed out
- Will try restarting the server
- *Did not restart the server*
- Connected to VPN, was able to SSH into the server
- Server seems to be running fine, and inferno seems to be ok too
- lots of log lines like:
> 2015-08-04 08:33:19,808 ERROR 31876 [inferno.lib.archiver] Disco returned an empty list for a blob in tag processed:impression:2015-08-04
- Called :relud, not quite sure where the stall is in the system
1) It looks like the datadog-agent service got into a bad state. I ran `systemctl restart datadog-agent`, which failed, and the logs indicated that a port needed to be released. I looked through the code to determine which port (17123), then used `netstat -tulpn` to find the process holding 127.0.0.1:17123 and killed it. After that, `systemctl start datadog-agent` worked.
2) Once stats were reaching Datadog again, it became clear that application_stats was not running. `systemctl status inferno` showed the processes 'master', 'rules.blacklisted_impression_stats', and 'rules.ip_click_counter'. `ls /var/run/inferno/*.pid` showed that '/var/run/inferno/application_stats.pid' existed even though 'rules.application_stats' was not in that process list, so I rm'd the stale pid file and application_stats started again.
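The manual check in step 2 (comparing pid files against what is actually running) could be scripted roughly as follows. This is only a sketch, not something that was run during the incident: it assumes each `*.pid` file contains a single integer PID, which has not been verified for inferno, and the helper names `pid_is_running` and `find_stale_pid_files` are made up for illustration. It only reports candidates and does not delete anything.

```python
import errno
import glob
import os


def pid_is_running(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        # Signal 0 performs the existence/permission check without
        # delivering a signal to the process.
        os.kill(pid, 0)
    except OSError as exc:
        # EPERM: the process exists but belongs to another user.
        return exc.errno == errno.EPERM
    return True


def find_stale_pid_files(pid_dir):
    """Return paths of *.pid files in pid_dir whose recorded PID is not running.

    Assumption: each pid file holds one integer PID. Files that cannot be
    read or parsed are also reported, so a human can inspect them.
    """
    stale = []
    for path in sorted(glob.glob(os.path.join(pid_dir, "*.pid"))):
        try:
            with open(path) as handle:
                pid = int(handle.read().strip())
        except (OSError, ValueError):
            stale.append(path)
            continue
        if not pid_is_running(pid):
            stale.append(path)
    return stale
```

Calling `find_stale_pid_files("/var/run/inferno")` on the host would list pid files such as the leftover application_stats one. PIDs can be reused by unrelated processes, so a "running" result is not proof that the right job is alive; cross-checking against `systemctl status inferno` is still worthwhile.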
Pretty sure both of these problems were caused by the OOM killer, and we should increase the size of the infernyx host.
pr to fix oom issue: https://github.com/mozilla-services/svcops/pull/603