Bug 687957
Opened 14 years ago
Closed 14 years ago
Need nagios/heartbeat monitor checks for builder-addons and builder-addons-next hosts
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: stephend, Assigned: rtucker)
Details
We need a nagios/heartbeat check on both https://builder-addons.allizom.org/ and https://builder-addons-next.allizom.org/, please.
(See bug 686946.)
Comment 1•14 years ago
What should we be checking for? Just a basic http response?
Rob, set this up to page only during work hours.
Assignee: server-ops → rtucker
Comment 2•14 years ago (Reporter)
(In reply to Corey Shields [:cshields] from comment #1)
> What should we be checking for? Just a basic http response?
A 200 OK would be a good start, yes.
> Rob, set this up only to page during work-hours
Zac Campbell, who works on FlightDeck, shouldn't be blocked by a down web server; this is something I talked to mrz about a while ago -- we have team members working literally around the clock on our projects. Can you reconsider?
Comment 3•14 years ago (Assignee)
Stephen,
I can search through the content returned for a string. This typically gives a bit more strength to the check.
If there's a string in the content you want me to search for, just let me know and I will make it so.
Comment 4•14 years ago (Reporter)
(In reply to Rob Tucker [:rtucker] from comment #3)
> Stephen,
> I can search through the content returned for a string. This typically gives
> a bit more strength to the check.
>
> If there's a string in the content you want me to search for, just let me
> know and I will make it so.
Thanks, Rob, I think "Fastest Way to Build" (case-insensitive) might be a good check.
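For illustration only, the check discussed so far (a 200 OK plus a case-insensitive match on "Fastest Way to Build") could be sketched roughly as below. This is not the check that was actually deployed; the function names `evaluate` and `check_url` and the message wording are hypothetical. Only the exit codes follow the standard Nagios plugin convention (0 OK, 2 CRITICAL).

```python
import re
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(status, body, needle="Fastest Way to Build"):
    """Map an HTTP status and page body to a (nagios_code, message) pair.

    Hypothetical helper: the default needle is the string suggested in this bug.
    """
    if status != 200:
        return CRITICAL, f"HTTP CRITICAL: status {status}"
    # Case-insensitive literal match, as requested in the comment above.
    if not re.search(re.escape(needle), body, re.IGNORECASE):
        return CRITICAL, f"HTTP CRITICAL: string {needle!r} not found"
    return OK, "HTTP OK: 200 and expected string present"


def check_url(url, timeout=10):
    """Fetch url and evaluate it; network failures are reported as CRITICAL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate(resp.status, body)
    except urllib.error.HTTPError as err:
        # Non-2xx responses (for example a 500) are raised as HTTPError.
        return evaluate(err.code, "")
    except (urllib.error.URLError, OSError) as err:
        return CRITICAL, f"HTTP CRITICAL: {err}"
```

A wrapper script would call `check_url("https://builder-addons.allizom.org/")`, print the message, and exit with the returned code so that Nagios can interpret the result.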
Comment 5•14 years ago
AMO also has this page, but it additionally has a JSON view that nagios hits, which gives the nagios job a little more flexibility in figuring out what is wrong. I don't think flightdeck has that view right now. For what it's worth, if anything on that page goes wrong right now it'll return a 500 error, so a string check is probably unnecessary.
Monitoring the dev site is kind of a catch-22. If the db is dead or something, IT needs to be paged, but if we commit a new feature and need a new setting added, that's an IT bug but not necessarily page-worthy. Since IT doesn't write the code, they'd have to evaluate the situation when they get paged. I have great confidence that they'd do fine, but I'm not sure they'd want to. :)
Perhaps we could put "devs in #flightdeck" in the nagios message, and if it's not something obvious IT could always jump into the IRC channel and see if one of the devs needs a hand?
Comment 6•14 years ago (Assignee)
clouserw:
It's also possible that I could have the check not page oncall, but send alerts to irc.
I could also have the nagios bot join #flightdeck and send alerts there as well.
Comment 7•14 years ago
(In reply to Stephen Donner [:stephend] from comment #2)
> Zac Campbell, who works on FlightDeck, shouldn't be blocked by a down web
> server; this is something I talked to mrz about a while ago -- we have team
> members working literally around the clock on our projects. Can you
> reconsider?
Not for dev resources. The issue is that dev is a moving target we don't have total control over. For example, it updates on every check-in. If a dev is working late at night and checks in a piece of code that causes a 500, I'd rather not have my guys woken up for that.
What we do for others is what Rob suggests, where the IRC bot sends those notices out to a channel.
Hope that explains our position there. Host checks, hardware checks, etc. still all page 24/7.
Comment 8•14 years ago
Alerts to IRC sound good. You're saying:
Work hours:
- Bot announces in regular channels and #flightdeck
- IT paged as usual
Off hours:
- Bot announces in #flightdeck
Is that right? I think that'd work. If it becomes a problem we can always revisit, too.
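As a rough sketch of the routing proposed in this comment (not how nagios itself is configured), the policy could be expressed as below. Only `#flightdeck` comes from the thread. The target names `regular-channels` and `it-oncall-pager`, and the definition of work hours as Monday to Friday, 09:00 to 17:59, are assumptions made purely for illustration.

```python
from datetime import datetime


def is_work_hours(when):
    """Assumed definition of work hours: Monday-Friday, 09:00-17:59 local time."""
    return when.weekday() < 5 and 9 <= when.hour < 18


def alert_targets(when):
    """Return where an alert raised at `when` would be sent under the proposal.

    '#flightdeck' is named in the thread; the other two targets are
    hypothetical placeholders for the regular channels and the IT pager.
    """
    targets = ["#flightdeck"]  # announced at all hours
    if is_work_hours(when):
        targets += ["regular-channels", "it-oncall-pager"]
    return targets
```

Under these assumptions, `alert_targets(datetime(2011, 9, 20, 3, 0))` returns only `["#flightdeck"]`, since 03:00 on a Tuesday falls outside the assumed work hours.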
Comment 9•14 years ago
Yup.
Rob, let's ship it!
Comment 10•14 years ago (Assignee)
The checks are added for pm-app-amo24 and are visible here:
https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?navbarsearch=1&host=pm-app-amo24
All hardware checks are 24x7. These checks will never page oncall, but will alert to our IRC channel, where they'll be caught.
I've added nagios contacts and groups for flightdeck. I've set the only contact for the group to be sdonner@mozilla.com. If you want others added to the contact group, please let me know as well.
I've joined the nagios-sjc1 bot to #flightdeck on irc where this check will alert.
If you don't need any other changes to this, please close it out.
Thanks!
Comment 11•14 years ago (Reporter)
(In reply to Rob Tucker [:rtucker] from comment #10)
> The checks are added for pm-app-amo24 and are visible here:
> https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?navbarsearch=1&host=pm-app-amo24
>
> All hardware checks are 24x7, these checks won't page oncall ever, but will
> alert to our IRC channel where they'll be caught.
>
> I've added nagios contacts and groups for flightdeck. I've set the only
> contact for the group to be sdonner@mozilla.com. If you want others added to
> the contact group, please let me know as well.
>
> I've joined the nagios-sjc1 bot to #flightdeck on irc where this check will
> alert.
>
> If you don't need any other changes to this, please close it out.
>
> Thanks!
Hey Rob -
I don't need to be notified by email when it fails, thanks; our CI (Jenkins) already does that -- just needed this monitored on the IT side.
Comment 12•14 years ago (Assignee)
I've removed your email address from the contact. I'm going to close this one out. Please feel free to reopen it if you have any other issues.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 13•14 years ago (Reporter)
Thanks, Rob; appreciate it!
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard