Closed Bug 881228 Opened 11 years ago Closed 11 years ago

Please enable downtime alerts for nagios-releng in #buildduty

Categories

(mozilla.org Graveyard :: Server Operations, task)

Platform: x86_64, Windows 7
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: ashish)

Details

This will allow nagios-releng to do alerts like what we see nagios-scl3 doing:

nagios-scl3	Mon 05:16:48 PDT ringring.mv.mozilla.com is DOWNTIMESTART (UP) :PING OK - Packet loss = 0%, RTA = 3.18 ms
nagios-scl3	Mon 05:17:47 PDT ringring.mv.mozilla.com is DOWNTIMEEND (UP) :PING OK - Packet loss = 0%, RTA = 2.41 ms

This lets us know whether a group of hosts is alerting because it has just come off downtime or because it is a new alert.
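
To make that distinction concrete, here is a minimal sketch of how a reader of the channel (or a script) could use DOWNTIMEEND notices to label later alerts. It assumes notifications have already been reduced to (timestamp, host, type) tuples. The `classify_alerts` function, the 10-minute grace window and the example hostnames are hypothetical, not anything nagios-releng itself does.

```python
from datetime import datetime, timedelta

# Hypothetical grace window: an alert this soon after DOWNTIMEEND is
# assumed to be an existing problem resurfacing, not a new one.
GRACE = timedelta(minutes=10)


def classify_alerts(events, grace=GRACE):
    """Label each PROBLEM event as 'post-downtime' or 'new'.

    events: iterable of (timestamp, host, notification_type) tuples in
    chronological order, where notification_type is e.g.
    'DOWNTIMESTART', 'DOWNTIMEEND' or 'PROBLEM'.
    """
    last_downtime_end = {}  # host -> timestamp of its latest DOWNTIMEEND
    labelled = []
    for ts, host, kind in events:
        if kind == "DOWNTIMEEND":
            last_downtime_end[host] = ts
        elif kind == "PROBLEM":
            ended = last_downtime_end.get(host)
            recent = ended is not None and ts - ended <= grace
            labelled.append((host, "post-downtime" if recent else "new"))
    return labelled


if __name__ == "__main__":
    t0 = datetime(2013, 6, 10, 5, 17, 47)
    sample = [
        (t0, "host-a.example.com", "DOWNTIMEEND"),
        (t0 + timedelta(minutes=2), "host-a.example.com", "PROBLEM"),
        (t0 + timedelta(minutes=3), "host-b.example.com", "PROBLEM"),
    ]
    for host, label in classify_alerts(sample):
        print(host, label)
```

Without the DOWNTIMEEND notice in the channel, both alerts in the sample would look identical.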

We could additionally turn it on for our release@ nagios notification e-mails, but I'm not as concerned about giving us more spam that way, since every buildduty person I know of uses the channel over e-mail for this task.
I did tell you in IRC there was no urgency here, but I'm curious what we're looking at for an ETA, since, for me if no one else, this would help a lot!
Flags: needinfo?(ashish)
:Callek, the last time we spoke you mentioned you'd bring this up in your team meeting and clarify whether emailing the list for downtime notifications (just as all other alerts do) was feasible. If it is, I can expedite this and push it out soon. Do let me know, thanks!
Flags: needinfo?(ashish) → needinfo?(bugspam.Callek)
redir needinfo to hal, since he took ownership of this item.
Flags: needinfo?(bugspam.Callek) → needinfo?(hwine)
Hal, poke.
Shyam -- aiui the ideal solution is to get the "depends on" relationships into Nagios, so we'll get the right notifications.

Sending downtime alerts is a stop-gap until the "depends on" relationships are established. We seem to be bogging down on that effort -- I can't even find a bug on it, so created bug 932598.

Since we've made some procedural changes on our side, back to Callek to confirm this is still wanted by the buildduty team.
Flags: needinfo?(hwine) → needinfo?(bugspam.Callek)
(In reply to Hal Wine [:hwine] (use needinfo) from comment #5)
> Shyam -- aiui the ideal solution is to get the "depends on" relationships
> into Nagios, so we'll get the right notifications.
> 
> Sending downtime alerts is a stop-gap until the "depends on" relationships
> are established. We seem to be bogging down on that effort -- I can't even
> find a bug on it, so created bug 932598.

This is indeed still wanted, regardless of whether the "depends on" relationships are accurate.

Getting the "depends on" relationships done will make this less useful as a global thing, but will not invalidate this bug in and of itself.
Flags: needinfo?(bugspam.Callek)
Pushed out, live and verified:

Hosts:
---8<---
23:03:14 < ashish> nagios-releng: downtime t-w732-ix-126.wintest.releng.scl3.mozilla.com 1m test
23:03:14 < nagios-releng> ashish: Downtime for host t-w732-ix-126.wintest.releng.scl3.mozilla.com scheduled for 0:01:00
23:03:15 < nagios-releng> Tue 23:03:14 PST t-w732-ix-126.wintest.releng.scl3.mozilla.com is DOWNTIMESTART (DOWN) :PING CRITICAL - Packet loss = 100%
23:04:15 < nagios-releng> Tue 23:04:15 PST t-w732-ix-126.wintest.releng.scl3.mozilla.com is DOWNTIMEEND (DOWN) :PING CRITICAL - Packet loss = 100%
---8<---

Services:
---8<---
22:18:53 < ashish> nagios-releng: downtime 4003 1m test
22:18:53 < nagios-releng> ashish: Downtime for service bm-remote.build.mtv1.mozilla.com:http scheduled for 0:01:00
22:18:57 < nagios-releng> Tue 22:18:57 PST bm-remote.build.mtv1.mozilla.com:http is DOWNTIMESTART (WARNING): HTTP WARNING: HTTP/1.1 403 Forbidden - 599 bytes in 0.007 second response time (http://m.allizom.org/http) (notify-by-email) HTTP WARNING: HTTP/1.1 403 Forbidden - 599 bytes in 0.007 second response time
22:19:53 < nagios-releng> Tue 22:19:53 PST bm-remote.build.mtv1.mozilla.com:http is DOWNTIMEEND (WARNING): HTTP WARNING: HTTP/1.1 403 Forbidden - 599 bytes in 0.007 second response time (http://m.allizom.org/http) (notify-by-email) HTTP WARNING: HTTP/1.1 403 Forbidden - 599 bytes in 0.007 second response time
---8<---
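
For anyone who wants to pick these notices out of a channel log, the following is a rough sketch of a parser for the line format shown above. It assumes the IRC timestamp and nick prefix (e.g. `23:03:15 < nagios-releng> `) has already been stripped. The regular expression and the `parse_notification` helper are illustrative guesses based only on the sample lines in this bug, not the bot's actual format definition.

```python
import re

# Pattern inferred from the sample lines in this bug (an assumption):
#   <day> <HH:MM:SS> <tz> <host[:service]> is <TYPE> (<STATE>) :<output>
LINE_RE = re.compile(
    r"^(?P<day>\w{3}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<tz>\w+) "
    r"(?P<target>\S+) is (?P<kind>[A-Z]+) "
    r"\((?P<state>[A-Z]+)\)\s*:\s*(?P<output>.*)$"
)


def parse_notification(line):
    """Return a dict of fields for a notification line, or None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    # Service notifications use "host:service"; host ones have no colon.
    host, _, service = fields.pop("target").partition(":")
    fields["host"] = host
    fields["service"] = service or None
    fields["is_downtime"] = fields["kind"] in ("DOWNTIMESTART", "DOWNTIMEEND")
    return fields


if __name__ == "__main__":
    sample = ("Tue 23:03:14 PST t-w732-ix-126.wintest.releng.scl3.mozilla.com "
              "is DOWNTIMESTART (DOWN) :PING CRITICAL - Packet loss = 100%")
    print(parse_notification(sample))
```

Lines that are not notifications, such as the bot's "Downtime for host ... scheduled" acknowledgement, do not match the pattern and come back as `None`.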
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard