
Alert: on tiles.prod.infernyx.impression_stats_last_complete_age

RESOLVED FIXED

Status

Content Services Graveyard
Tiles: Ops
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: mostlygeek, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: alert)

(Reporter)

Description

2 years ago
Datadog alert for: 

- tiles.prod.infernyx.application_stats_last_complete_age 
- tiles.prod.infernyx.impression_stats_last_complete_age 
- tiles.prod.redshift.impression_stats_daily.rows 
- tiles.prod.redshift.application_stats_daily.rows

These fired at approximately the same time, which points to a possible job-processing issue with infernyx.

- Tried SSHing to infernyx from the bastion host; the connection timed out
- Will retry restarting the server
(Reporter)

Comment 1

2 years ago
- *Did not restart the server* 
- Connected to VPN, was able to SSH into the server
- Server seems to be running fine, and inferno seems to be ok too
- lots of log lines like: 

> 2015-08-04 08:33:19,808 ERROR 31876 [inferno.lib.archiver] Disco returned an empty list for a blob in tag processed:impression:2015-08-04
(Reporter)

Comment 2

2 years ago
- Called :relud, not quite sure where the stall is in the system
(Reporter)

Updated

2 years ago
Whiteboard: alert
1) looks like the datadog-agent service got into a bad state.

I ran `systemctl restart datadog-agent`, which failed; the logs indicated that a port needed to be released. I looked through the code to find which port (17123), used `netstat -tulpn` to find the process holding 127.0.0.1:17123, and killed it. After that, `systemctl start datadog-agent` worked.
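For next time, the port lookup in step 1 could be scripted roughly like this. It is only a sketch: `listener_pid` is a hypothetical helper name, and it assumes the usual net-tools `netstat -tulpn` column layout, where a listening TCP socket has its `PID/Program name` in the seventh field.

```shell
# Print the PID of the process listening on a TCP address, reading
# `netstat -tulpn` output on stdin. Prints nothing if no listener matches
# or if netstat could not show the PID (it prints "-" in that case).
# listener_pid is a hypothetical helper, not part of any existing tooling.
listener_pid() {
    awk -v addr="$1" '
        $4 == addr && $6 == "LISTEN" {
            split($7, p, "/")              # field 7 looks like "PID/program"
            if (p[1] ~ /^[0-9]+$/) print p[1]
        }'
}

# Possible usage (run as root so netstat can see PIDs of other users):
#   pid=$(netstat -tulpn 2>/dev/null | listener_pid 127.0.0.1:17123)
#   [ -n "$pid" ] && kill "$pid"
#   systemctl start datadog-agent
```

Parsing stdin rather than calling `netstat` inside the function keeps the helper easy to check against captured output before using it on the host.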

2) now that stats were reaching datadog, it became clear that application_stats was not running. `systemctl status inferno` showed the processes 'master', 'rules.blacklisted_impression_stats', and 'rules.ip_click_counter'. `ls /var/run/inferno/*.pid` showed that '/var/run/inferno/application_stats.pid' existed even though 'rules.application_stats' was not in that process list, so I rm'd the pid file and that started application_stats again.
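The stale pid file check in step 2 could be automated along these lines. This is a sketch under assumptions: `stale_pidfiles` is a hypothetical helper name, and it assumes each `*.pid` file holds a single numeric PID, which I have not confirmed for inferno. It only lists candidates and deletes nothing.

```shell
# List pid files in a directory whose recorded process is no longer running.
# Assumption: each *.pid file contains one numeric PID.
# stale_pidfiles is a hypothetical helper, not part of inferno.
stale_pidfiles() {
    dir="$1"
    for f in "$dir"/*.pid; do
        [ -e "$f" ] || continue              # glob matched nothing
        pid=$(tr -d '[:space:]' < "$f")
        case "$pid" in
            ''|*[!0-9]*) continue ;;         # not a plain PID; inspect by hand
        esac
        # kill -0 sends no signal; it only tests whether the PID can be signalled.
        # Run as root, otherwise a live process owned by another user looks stale.
        kill -0 "$pid" 2>/dev/null || echo "$f"
    done
}

# Possible usage:
#   stale_pidfiles /var/run/inferno
```

Whether removing a listed file is enough to make inferno respawn the job is only what was observed here for application_stats, so review the output before running `rm` on anything.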
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
Pretty sure both of these problems were caused by the OOM killer; we should increase the size of the infernyx host.
PR to fix the OOM issue: https://github.com/mozilla-services/svcops/pull/603