Alert: on tiles.prod.infernyx.impression_stats_last_complete_age

RESOLVED FIXED

Status

RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: mostlygeek, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: alert)

(Reporter)

Description

4 years ago
Datadog alert for: 

- tiles.prod.infernyx.application_stats_last_complete_age 
- tiles.prod.infernyx.impression_stats_last_complete_age 
- tiles.prod.redshift.impression_stats_daily.rows 
- tiles.prod.redshift.application_stats_daily.rows

These fired at approximately the same time, which points to a possible job-processing issue with infernyx.

- Tried SSH'ing to infernyx from the bastion host; the connection timed out
- Will try restarting the server
(Reporter)

Comment 1

4 years ago
- *Did not restart the server* 
- Connected to VPN, was able to SSH into the server
- Server seems to be running fine, and inferno seems to be ok too
- lots of log lines like: 

> 2015-08-04 08:33:19,808 ERROR 31876 [inferno.lib.archiver] Disco returned an empty list for a blob in tag processed:impression:2015-08-04
(Reporter)

Comment 2

4 years ago
- Called :relud; not yet clear where the stall is in the system
(Reporter)

Updated

4 years ago
Whiteboard: alert
1) Looks like the datadog-agent service got into a bad state.

I did `systemctl restart datadog-agent`, which failed, and the logs indicated that a port needed to be released. I looked through the code to determine which port (17123), then ran `netstat -tulpn` to find the process claiming 127.0.0.1:17123 and killed it. After that, `systemctl start datadog-agent` worked.

2) Now that stats were reaching Datadog, it became clear that application_stats was not running. `systemctl status inferno` showed the processes 'master', 'rules.blacklisted_impression_stats', and 'rules.ip_click_counter'. `ls /var/run/inferno/*.pid` showed that '/var/run/inferno/application_stats.pid' existed even though 'rules.application_stats' was not in that process list, so I rm'd the pid file, and that started application_stats again.
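
For future occurrences, the two checks above could be scripted roughly as follows. This is only a sketch: the function names `port_owner_pids` and `stale_pid_files` are made up for illustration, and it assumes each inferno pid file holds a single numeric PID on its first line, which has not been verified here.

```shell
#!/usr/bin/env bash

# Print the PID(s) listening on a port, given `netstat -tulpn` output on stdin.
# netstat only shows PIDs for other users' processes when run as root.
port_owner_pids() {
  local port="$1"
  awk -v port="$port" '
    # Field 4 is the local address, e.g. 127.0.0.1:17123.
    $4 ~ (":" port "$") {
      # The PID/Program column is the first later field shaped like "1234/name".
      for (i = 5; i <= NF; i++) {
        if ($i ~ /^[0-9]+\//) {
          split($i, parts, "/")
          print parts[1]
          break
        }
      }
    }
  '
}

# Print every *.pid file in a directory whose recorded PID is not running.
# ASSUMPTION: the first line of each pid file is a bare numeric PID.
stale_pid_files() {
  local dir="$1" f pid
  for f in "$dir"/*.pid; do
    [ -e "$f" ] || continue                  # glob matched nothing
    pid=$(head -n 1 "$f" | tr -d '[:space:]')
    case "$pid" in
      ''|*[!0-9]*) continue ;;               # skip empty or non-numeric content
    esac
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    # Run as root, otherwise live processes owned by other users look stale.
    if ! kill -0 "$pid" 2>/dev/null; then
      echo "$f"
    fi
  done
}

# Usage (as root):
#   netstat -tulpn | port_owner_pids 17123
#   stale_pid_files /var/run/inferno
```

Both helpers only report; killing the port holder or removing a pid file is left as a manual step so that nothing is deleted on a false positive.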
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Pretty sure both of these problems were caused by the OOM killer; we should increase the size of the infernyx host.
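
One way to confirm the OOM-killer theory would be to filter the kernel log for its messages. A minimal sketch, assuming the usual kernel wording ("invoked oom-killer", "Out of memory: Kill process"), which varies between kernel versions; the helper name `oom_events` is made up for illustration.

```shell
#!/usr/bin/env bash

# Keep only the kernel log lines on stdin that mention the OOM killer.
# ASSUMPTION: the patterns match this host's kernel message wording.
oom_events() {
  grep -Ei 'invoked oom-killer|out of memory: kill(ed)? process'
}

# Usage:
#   dmesg | oom_events
#   journalctl -k | oom_events
```

If this prints nothing for the time of the alert, the OOM killer is probably not the cause, or the relevant log lines have already rotated out.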