Closed Bug 1156354 Opened 10 years ago Closed 10 years ago

Adjust infernyx alerts

Categories

(Content Services Graveyard :: Tiles: Ops, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: relud, Assigned: relud)

Details

Now that we are getting infernyx metrics in Datadog, I see that over the weekend it took almost 30 minutes between impression_stats completions. I believe this is due to how long it takes inferno to collect 128 impression blobs. Based on the discussion in bug 1151989, I made an alert that fires if the impression_stats.last_complete age stays over 30 minutes for the entire last 5 minutes (so basically once it reaches 35 minutes). We also have alerts that fire if impression_stats_daily doesn't get new rows for 30 minutes, and if data for a new day takes more than 30 minutes to arrive in Redshift. I think it would be best to raise the acceptable delay for these alerts from 30 minutes to 1 hour.
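For illustration, the alert condition described above (age over the threshold at all times in the last 5 minutes) can be sketched as a check that the minimum age in the window exceeds the threshold. This is only a rough sketch, not the actual Datadog monitor definition: the function names, the one-sample-per-minute spacing, and the choice not to fire when no samples exist are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def should_alert(samples, now,
                 threshold=timedelta(minutes=30),
                 window=timedelta(minutes=5)):
    """Return True if every age sample inside the window exceeds the threshold.

    samples: list of (timestamp, age) pairs, where age is the time elapsed
    since the last completion when the sample was taken (hypothetical shape).
    """
    recent = [age for ts, age in samples if now - window <= ts <= now]
    # Assumption: with no samples in the window, do not fire.
    return bool(recent) and min(recent) > threshold


def ages_by_minute(last_complete, now, minutes=10):
    """Build one (timestamp, age) sample per minute, oldest first (hypothetical helper)."""
    return [
        (now - timedelta(minutes=m), now - timedelta(minutes=m) - last_complete)
        for m in range(minutes, -1, -1)
    ]


if __name__ == "__main__":
    now = datetime(2015, 4, 20, 12, 0)
    stalled = ages_by_minute(now - timedelta(minutes=36), now)
    print(should_alert(stalled, now))                              # 30-minute threshold
    print(should_alert(stalled, now, threshold=timedelta(hours=1)))  # 1-hour threshold
```

Under these assumptions, a job that last completed 36 minutes ago trips the 30-minute check but not a 1-hour one, which is the kind of false positive a longer threshold is meant to avoid.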
I'm +1 on tuning the alert up to 1 hour. It fits the data.
A number of factors determine how long the job takes to complete. Right now, the cluster isn't very busy, so the buffering of blobs to reach 128 is the deciding factor. As more jobs are run and more data is added to the cluster, this may change. Also, because inferno can process logs fairly rapidly and the business doesn't really operate on an 'hour-to-hour' basis, I think we can push the alert up to 4 hours. This should eliminate false-positive alerts for the foreseeable future.
I'm okay with 4 hours. Benson?
Flags: needinfo?(bwong)
I'm good with 4 hours.
Flags: needinfo?(bwong)
adjusting to 4 hours
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED