Closed
Bug 1214691
Opened 9 years ago
Closed 9 years ago
Whittle down number of hosts that have collectd log check
Categories
(Infrastructure & Operations :: IT-Managed Tools, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ericz, Assigned: ericz)
Details
(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/1917] )
The collectd carbon connection log check has had a sordid history. It runs on nearly all of our hosts and is sensitive to network disruptions, so every time a network or Zeus problem happens, every host alerts, clogging up #sysadmins. It also has a history of false alarms because something is wrong with Nagios' log_warn check (see bug 1185353). It is a valid check in that I'd like to know if a host is no longer sending data to Graphite, but the cost is too high for what in practice has turned out to be little value. It's more a canary in a coal mine than an indicator of collectd's health on a host.
I propose to reduce the number of hosts this runs on to a few in SCL3 and PHX1, so that when it alerts it doesn't flood the channel. We'll still have it as at least some indicator of Graphite problems, just with more limited coverage.
Comment 1•9 years ago
Perhaps we could add a puppet fact that detects problems? Is there a better way to see if a host is not currently sending data to graphite than looking at the logs?
Assignee
Comment 2•9 years ago
Collectd's log is the only way I'm aware of to tell if it is sending data to Graphite or not.
Comment 3•9 years ago
If that's the only way to get the current state, perhaps a script could parse the log and, instead of just checking whether any bad things are there, check whether the bad things are the last items in the log?
If there's a connection failure followed by a recovery, there's no need to alert. That logic could be used for a puppet fact too, so we could have a small number of hosts set up in Nagios and the rest report via puppet dashboard?
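A rough sketch of that "only alert if the last event is a failure" logic, in Python. The two regexes and the default log path are assumptions made up for illustration, since the actual collectd messages aren't quoted in this bug; they would need to be matched against real log lines. The exit codes are the usual Nagios plugin ones (0 = OK, 2 = CRITICAL).

```python
import re
import sys

# Hypothetical patterns: the real collectd log messages are not quoted in this
# bug, so these must be adjusted to match the actual failure/recovery lines.
FAILURE_RE = re.compile(r"write_graphite.*(failed|error)", re.IGNORECASE)
RECOVERY_RE = re.compile(r"write_graphite.*(connected|success)", re.IGNORECASE)

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def last_connection_state(lines):
    """Return 'failed', 'recovered', or None, based on the last matching line."""
    state = None
    for line in lines:
        if FAILURE_RE.search(line):
            state = "failed"
        elif RECOVERY_RE.search(line):
            state = "recovered"
    return state


def check(lines):
    """Map the final connection state to a Nagios (exit_code, message) pair."""
    if last_connection_state(lines) == "failed":
        return CRITICAL, "CRITICAL: last carbon connection event is a failure"
    return OK, "OK: no unrecovered carbon connection failure in log"


def main(argv):
    """Read the log named in argv[1] (hypothetical default path otherwise)."""
    path = argv[1] if len(argv) > 1 else "/var/log/collectd.log"
    with open(path, errors="replace") as handle:
        code, message = check(handle)
    print(message)
    return code
```

To use it as a plugin, the entry point would call `sys.exit(main(sys.argv))`. Note that a failure followed by a recovery comes back OK, and so does an empty or freshly rotated log, so this only catches failures that are still the most recent event in the file.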
Assignee
Comment 4•9 years ago
Yes, a better check could be created, but it's a low priority. Does anyone watch puppet dashboard for errors?
Assignee
Comment 5•9 years ago
Reduction in scope committed in r109148. A custom Nagios script that acts as a smarter log_warn would still be worth doing sometime.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED