Closed Bug 1190747 Opened 10 years ago Closed 10 years ago

Alert: on tiles.prod.infernyx.impression_stats_last_complete_age

Categories

(Content Services Graveyard :: Tiles: Ops, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mostlygeek, Unassigned)

Details

(Whiteboard: alert)

Datadog alert for:
- tiles.prod.infernyx.application_stats_last_complete_age
- tiles.prod.infernyx.impression_stats_last_complete_age
- tiles.prod.redshift.impression_stats_daily.rows
- tiles.prod.redshift.application_stats_daily.rows

These fired at approximately the same time, which points to a possible job-processing issue on infernyx.

- Tried SSH'ing to infernyx from the bastion host; the connection timed out
- Will try restarting the server
- *Did not restart the server*
- Connected to the VPN and was able to SSH into the server
- Server seems to be running fine, and inferno seems to be OK too
- Lots of log lines like:
  > 2015-08-04 08:33:19,808 ERROR 31876 [inferno.lib.archiver] Disco returned an empty list for a blob in tag processed:impression:2015-08-04
- Called :relud; not quite sure where the stall is in the system
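To get a feel for how widespread the archiver errors are, the log lines quoted above can be tallied per tag. This is a rough sketch only: the field layout (timestamp, level, a numeric field assumed to be the PID, logger name, message) is inferred from the single line quoted in this bug, and `count_empty_blob_errors` is a hypothetical helper, not part of inferno.

```python
import re
from collections import Counter

# Layout inferred from the one log line quoted in this bug (an assumption):
#   <date> <time>,<ms> <LEVEL> <pid?> [<logger>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) (?P<pid>\d+) \[(?P<logger>[^\]]+)\] (?P<msg>.*)$"
)
TAG_RE = re.compile(r"Disco returned an empty list for a blob in tag (\S+)")


def count_empty_blob_errors(lines):
    """Count 'empty list for a blob' ERROR lines, keyed by Disco tag."""
    counts = Counter()
    for line in lines:
        parsed = LINE_RE.match(line.strip())
        if parsed is None or parsed.group("level") != "ERROR":
            continue  # not a log line in the assumed layout, or not an error
        tag = TAG_RE.search(parsed.group("msg"))
        if tag:
            counts[tag.group(1)] += 1
    return counts
```

Passing an open log file to `count_empty_blob_errors` returns a `Counter`, so `.most_common()` lists the tags with the most errors first.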
1) Looks like the datadog-agent service got into a bad state. I ran `systemctl restart datadog-agent`, which failed; the logs indicated that a port needed to be released. I looked through the code to determine which port (17123), ran `netstat -tulpn` to find the process holding 127.0.0.1:17123, and killed it. After that, `systemctl start datadog-agent` worked.

2) Once stats were reaching Datadog again, it became clear that application_stats was not running. `systemctl status inferno` showed the processes 'master', 'rules.blacklisted_impression_stats', and 'rules.ip_click_counter'. `ls /var/run/inferno/*.pid` showed that '/var/run/inferno/application_stats.pid' existed even though 'rules.application_stats' was not in that process list, so I rm'd the stale pid file, and that started application_stats again.
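The stale pid file check in step 2 could be automated along these lines. This is a minimal sketch, assuming each `*.pid` file contains a single numeric process ID (the bug does not say what inferno writes into these files); `pid_is_running` and `find_stale_pid_files` are hypothetical helpers. It relies on Unix semantics of `os.kill` with signal 0.

```python
import os
from pathlib import Path


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists (Unix only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence; nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def find_stale_pid_files(pid_dir: str) -> list[Path]:
    """List *.pid files whose recorded PID is missing or not running."""
    stale = []
    for pid_file in sorted(Path(pid_dir).glob("*.pid")):
        try:
            pid = int(pid_file.read_text().strip())
        except ValueError:
            stale.append(pid_file)  # non-numeric content: treat as stale
            continue
        if not pid_is_running(pid):
            stale.append(pid_file)
    return stale
```

For the directory in this bug the call would be `find_stale_pid_files("/var/run/inferno")`. PIDs get reused, so a pid file can point at a live but unrelated process; this check only catches the case where no process has that PID at all.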
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Pretty sure both of these problems were caused by the OOM killer; we should increase the size of the infernyx host.
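One way to check the OOM killer theory is to scan the kernel log for kill messages. The sketch below assumes the usual Linux wording ("Out of memory: Kill process N (name)" on older kernels, "Killed process" on newer ones); the exact text varies by kernel version, so treat the pattern as an assumption. `find_oom_kills` is a hypothetical helper.

```python
import re

# Assumed message shapes (they vary by kernel version):
#   "Out of memory: Kill process 1234 (name) ..."
#   "Out of memory: Killed process 1234 (name) ..."
OOM_RE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines):
    """Return (pid, process_name) for each OOM-killer message found."""
    kills = []
    for line in log_lines:
        match = OOM_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

Feeding it the lines of `dmesg` or `journalctl -k` output from the infernyx host would show whether datadog-agent or an inferno worker was among the killed processes.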