Closed Bug 1214691 Opened 9 years ago Closed 9 years ago

Whittle down number of hosts that have collectd log check

Categories

(Infrastructure & Operations :: IT-Managed Tools, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ericz, Assigned: ericz)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/1917] )

The collectd carbon connection log check has had a sordid history. It runs on nearly all of our hosts and is sensitive to network disruptions, so every time a network or Zeus problem happens, every host alerts and clogs up #sysadmins. It also has a history of false alarms because something is wrong with Nagios' log_warn check (see bug 1185353). It is a valid check in that I'd like to know if a host is no longer sending data to Graphite, but the cost is too high for what in practice has turned out to be little value. It's more a canary in a coal mine than an indicator of collectd's health on a host. I propose reducing the number of hosts it runs on to a few in SCL3 and PHX1, so that when it alerts it doesn't flood the channel with huge numbers of alerts. We'd still have it as at least some indicator of Graphite problems should it alert, just with more limited coverage.
Perhaps we could add a Puppet fact that detects problems? Is there a better way to see if a host is not currently sending data to Graphite than looking at the logs?
Collectd's log is the only way I'm aware of to tell if it is sending data to Graphite or not.
If that's the only way to get the current state, perhaps a script could parse the log and, instead of just checking whether any bad things are there, check whether the bad things are the last items in the log. If there's a connection failure followed by a recovery, there's no need to alert. The same logic could be used for a Puppet fact too, so we could have a small number of hosts set up in Nagios and the rest report via the Puppet dashboard?
Yes, a better check could be created, but it's a low priority. Does anyone watch the Puppet dashboard for errors?
Reduction in scope committed in r109148. A custom script for Nagios that acts as a smarter log_warn would be a good thing to do sometime.
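For whoever picks this up later, here is a rough sketch of the "only alert if the failure is the last thing in the log" idea discussed above. It is not what was committed in r109148. The two regular expressions are assumptions standing in for whatever connection failure and recovery messages collectd actually logs on our hosts, and the function names and the script name in the usage message are made up for illustration. The exit codes follow the usual Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
import re

# ASSUMPTION: placeholder patterns; replace them with the exact messages
# your collectd version writes on connection failure and on reconnect.
FAILURE = re.compile(r"connect(ing|ion).*fail", re.IGNORECASE)
RECOVERY = re.compile(r"successfully connected", re.IGNORECASE)

# Conventional Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def last_connection_state(lines):
    """Return 'failed', 'recovered', or None, based on the last matching line."""
    state = None
    for line in lines:
        if FAILURE.search(line):
            state = "failed"
        elif RECOVERY.search(line):
            state = "recovered"
    return state


def check(lines):
    """Map the last connection event to a Nagios exit code and message."""
    if last_connection_state(lines) == "failed":
        return CRITICAL, "CRITICAL: last collectd connection event is a failure"
    return OK, "OK: no unrecovered collectd connection failure in log"


def main(argv):
    """Hypothetical entry point: argv[1] is the path to the collectd log."""
    if len(argv) != 2:
        print("UNKNOWN: usage: check_collectd_log.py LOGFILE")
        return UNKNOWN
    try:
        with open(argv[1], errors="replace") as handle:
            code, message = check(handle)
    except OSError as exc:
        print(f"UNKNOWN: cannot read log: {exc}")
        return UNKNOWN
    print(message)
    return code
```

To run it as a plugin, the file would end with `import sys; sys.exit(main(sys.argv))`. The sketch reads the whole log on every run and ignores rotation, so a real check would probably want to limit itself to the tail of the file. The same `last_connection_state` logic could back a Puppet fact.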
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED