Closed
Bug 1190747
Opened 10 years ago
Closed 10 years ago
Alert: on tiles.prod.infernyx.impression_stats_last_complete_age
Categories
(Content Services Graveyard :: Tiles: Ops, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mostlygeek, Unassigned)
Details
(Whiteboard: alert)
Datadog alert for:
- tiles.prod.infernyx.application_stats_last_complete_age
- tiles.prod.infernyx.impression_stats_last_complete_age
- tiles.prod.redshift.impression_stats_daily.rows
- tiles.prod.redshift.application_stats_daily.rows
These fired at approximately the same time, which suggests a job-processing issue with infernyx.
- Tried to SSH to infernyx from the bastion host; the connection timed out
- Will try restarting the server
Comment 1•10 years ago (Reporter)
- *Did not restart the server*
- Connected to VPN, was able to SSH into the server
- Server seems to be running fine, and inferno seems to be ok too
- lots of log lines like:
> 2015-08-04 08:33:19,808 ERROR 31876 [inferno.lib.archiver] Disco returned an empty list for a blob in tag processed:impression:2015-08-04
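To see which Disco tags are affected and how often, the archiver errors can be tallied per tag. This is a minimal sketch: the regular expression is modeled only on the single log line quoted above, so other inferno messages may need a different pattern, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern modeled on the one log line quoted in this bug (assumption:
# all empty-blob errors from inferno.lib.archiver share this shape).
EMPTY_BLOB = re.compile(
    r"ERROR \d+ \[inferno\.lib\.archiver\] "
    r"Disco returned an empty list for a blob in tag (?P<tag>\S+)"
)


def count_empty_blob_errors(lines):
    """Return a Counter mapping each tag to its number of empty-blob errors."""
    counts = Counter()
    for line in lines:
        match = EMPTY_BLOB.search(line)
        if match:
            counts[match.group("tag")] += 1
    return counts
```

Passing an open log file (or any iterable of lines) returns the per-tag counts; `counts.most_common()` then lists the noisiest tags first.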
Comment 2•10 years ago (Reporter)
- Called :relud, not quite sure where the stall is in the system
Updated•10 years ago (Reporter)
Whiteboard: alert
Comment 3•10 years ago
1) Looks like the datadog-agent service got into a bad state.
I ran `systemctl restart datadog-agent`, which failed; the logs indicated that a port needed to be released. I looked through the code to determine which port (17123), used `netstat -tulpn` to find the process holding 127.0.0.1:17123, and killed it. After that, `systemctl start datadog-agent` worked.
2) Once stats were reaching Datadog again, it became clear that application_stats was not running. `systemctl status inferno` showed the processes 'master', 'rules.blacklisted_impression_stats', and 'rules.ip_click_counter'. `ls /var/run/inferno/*.pid` showed that '/var/run/inferno/application_stats.pid' existed even though 'rules.application_stats' was not in that process list, so I removed the pid file, and that started application_stats again.
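Both manual checks above (who is holding 127.0.0.1:17123, and which pid files point at a process that no longer exists) can be scripted. The following is a rough sketch rather than the tooling actually used here: the function names are hypothetical, `find_listener` assumes the usual column layout of `netstat -tulpn` output, and the pid check is POSIX-only.

```python
import os
from pathlib import Path


def find_listener(netstat_output, address):
    """Return (pid, program) for the socket bound to `address`, or None.

    Expects the text printed by `netstat -tulpn`, where the local address
    is the fourth column and a later column holds "PID/Program name".
    """
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[3] != address:
            continue
        for field in fields[4:]:
            pid, sep, program = field.partition("/")
            if sep and pid.isdigit():
                return int(pid), program
    return None


def pid_is_running(pid):
    """Check whether a process with this pid exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence check
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stale_pid_files(directory):
    """Return the *.pid files in `directory` whose process is not running.

    Files without a numeric pid are also reported as stale (assumption).
    A pid reused by an unrelated process will not be detected.
    """
    stale = []
    for path in sorted(Path(directory).glob("*.pid")):
        text = path.read_text().strip()
        if not text.isdigit() or not pid_is_running(int(text)):
            stale.append(path)
    return stale
```

For example, `find_listener(output, "127.0.0.1:17123")` on captured netstat output names the process to kill, and `stale_pid_files("/var/run/inferno")` lists pid files that are candidates for removal. Run netstat as root, since otherwise the PID column shows `-` for other users' sockets.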
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 4•10 years ago
Pretty sure both of these problems were caused by the OOM killer; we should increase the size of the infernyx host.
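One way to confirm the OOM-killer theory is to scan the kernel log for its kill messages. This is a sketch under assumptions: the pattern covers the "Out of memory: Kill process" and "Out of memory: Killed process" wordings that Linux kernels commonly emit, the exact text varies by kernel version, and the function name is hypothetical.

```python
import re

# Matches both the older "Kill process" and the newer "Killed process"
# wording (assumption: the host's kernel uses one of these forms).
OOM_KILL = re.compile(
    r"Out of memory: Kill(?:ed)? process (?P<pid>\d+) \((?P<name>[^)]+)\)"
)


def find_oom_kills(kernel_log_lines):
    """Return a list of (pid, process_name) for each OOM kill found."""
    kills = []
    for line in kernel_log_lines:
        match = OOM_KILL.search(line)
        if match:
            kills.append((int(match.group("pid")), match.group("name")))
    return kills
```

Feeding it the lines of `dmesg` output would show whether datadog-agent or an inferno worker was among the killed processes.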
Comment 5•10 years ago
PR to fix the OOM issue: https://github.com/mozilla-services/svcops/pull/603