Add a flag, called something like `--has-errors`, that makes the command exit with code 0 if there are no errors and with a higher exit code (a la the way Nagios escalates warnings and errors) if there are. With this in place, we can decide on higher-level rules, such as only returning an exit code >0 if a backfill-based job has failed more than once.
The only docs I can find are http://nagios.sourceforge.net/docs/nrpe/NRPE.pdf
Nagios calls the plugin script, and if the script returns 0 everything is OK. A return code of 1 indicates WARNING and 2 indicates CRITICAL. Whatever is printed to STDOUT will be available in Nagios as the reason it is WARNING or CRITICAL. So basically, this seems reasonable to me, provided the logic in crontabber can reduce the amount of alerting it will do.
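To make the convention above concrete, here is a minimal sketch of what a plugin-style check could look like. The names `format_line` and `LABELS` are made up for illustration and are not taken from crontabber; only the 0/1/2 exit codes and the use of STDOUT come from the description above.

```python
# Exit codes as described above: 0 = OK, 1 = WARNING, 2 = CRITICAL.
OK, WARNING, CRITICAL = 0, 1, 2

# Hypothetical labels used to prefix the message printed to STDOUT.
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def format_line(status, reason):
    """Build the single line a check would print for Nagios to display."""
    return "%s - %s" % (LABELS[status], reason)


# Example values only; a real check would derive these from job state.
status = WARNING
line = format_line(status, "example-job failed once")
```

A real check would then `print(line)` and call `sys.exit(status)` so that Nagios sees both the reason and the severity.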
:rhelmer I'm still curious about specific business logic to apply.

Here's one possible solution::

1. If there is any error with a count > 1, yield a CRITICAL
2. If there is any error with count == 1, yield a WARNING

Another solution is::

1. Same as 1 above but...
2. If the app is NOT backfill based, yield a CRITICAL

Any thoughts?
(In reply to Peter Bengtsson [:peterbe] from comment #3)
> :rhelmer ...
> Another solution is::
>
> 1. Same as 1 above but...
> 2. If the app is NOT backfill based, yield a CRITICAL
>
> Any thoughts?

I like this one ^

We want to know right away if anything requires manual intervention, and this should just about cover it I think.
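For reference, the preferred rule could be sketched roughly as follows. This is only an illustration of the logic discussed in this bug, not the code from the pull request: `JobState`, `job_status` and `overall_status` are hypothetical names, and treating a backfill-based app with exactly one error as a WARNING is an assumption carried over from the first proposal.

```python
from dataclasses import dataclass

# Nagios-style exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
OK, WARNING, CRITICAL = 0, 1, 2


@dataclass
class JobState:
    """Hypothetical summary of one job's recorded state."""
    name: str
    error_count: int
    is_backfill: bool


def job_status(job):
    """Apply the preferred rule to a single job."""
    if job.error_count <= 0:
        return OK
    # Rule 1: any error seen more than once is CRITICAL.
    # Rule 2: any error in an app that is NOT backfill based is CRITICAL.
    if job.error_count > 1 or not job.is_backfill:
        return CRITICAL
    # Assumption: a backfill-based app with a single error is only a WARNING.
    return WARNING


def overall_status(jobs):
    """The exit code is the worst status across all jobs (OK if none)."""
    return max((job_status(job) for job in jobs), default=OK)
```

Taking the maximum means a single CRITICAL job decides the exit code, which matches the goal of knowing right away when something needs manual intervention.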
Pull request: https://github.com/mozilla/socorro/pull/1074
Commit pushed to master at https://github.com/mozilla/socorro https://github.com/mozilla/socorro/commit/fd35afc6aa9f537c7475c2de8163640887274fc7 bug 836425 - nagios alerts introspection, r=rhelmer