celery-dev1.addons.phx1.mozilla.com has been down for a day or so, and developers burned a bunch of time trying to figure out why tasks weren't getting executed. Jeremy added a host check for it in Nagios to alert IT, but I'd also like notifications for the developers and QA so they're aware of what's going on. What do you think of configuring Nagios to alert in #amo and #flightdeck for critical problems with dev1.addons.phx1.mozilla.com, celery-dev1.addons.phx1.mozilla.com, and services-stage1.addons.phx1.mozilla.com? We're already alerting in #flightdeck for after-hours problems there, so we could probably copy those rules.
Alerting to #flightdeck is trivial, but we're not currently alerting anything to #amo. Is it necessary to go to both, or is just #flightdeck OK? Which checks do you want to report?
(In reply to Rob Tucker [:rtucker] from comment #1)
> Alerting to flightdeck is trivial, but we're not currently alerting anything
> to #amo.
> Is it necessary to go to both, or is to just flightdeck ok?
Needs to be both, they are separate projects.
> What checks do you want to report?
Anything CRITICAL? I don't really know. I don't want a noisy bot, I just want to know when something is to the point that it's affecting the project.
Bringing oremj in, as I don't have any way of knowing what is critical. Jeremy, thoughts?
I'd start by just sending all alerts from those servers. We can trim later if needed.
Added the checks and new contact groups, and edited the Nagios bot to report correctly to the requested IRC channels. Confirmed with clouserw that things are set, and he r+'d. Marking this RESOLVED/FIXED.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard