Closed Bug 864877 Opened 12 years ago Closed 11 years ago

Nagios paging changes for vcs-sync machines

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hwine, Assigned: rbryce)

Details

(Whiteboard: [reit-ops])

We've updated our usage of the github-sync machines. Please adjust nagios as follows:

github-sync2.dmz.scl3.mozilla.com
- is now "owned" by :aki for development of next tools
- should not page on nagios alerts (email/irc only)

github-sync1-dev.dmz.scl3.mozilla.com
github-sync1.dmz.scl3.mozilla.com
github-sync3.dmz.scl3.mozilla.com
- continue to be production machines
- should page hwine on critical alerts

Thanks!
Assignee: server-ops → rbryce
Changes made. I added 2 extra contactgroups in nagios to direct alerts for just these hosts:

"githubsync" for asasaki
"releng" for hwine

Hal, I can make it so you only receive "Critical" alerts, but that would apply to other systems you get alerts for as well. Is that ok?
Flags: needinfo?(hwine)
Worked this out with Hal on IRC. Hal, you are set to receive only critical SMS alerts for these hosts. Your email alerts will remain the same.
Flags: needinfo?(hwine)
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Unfortunately, I did not get paged for the event in bug 872333 comment 0.

Please adjust so I would have.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Hal Wine [:hwine] from comment #3)
> Unfortunately, I did not get paged for the event in bug 872333 comment 0
>
> Please adjust so I would have.

Just sent some test pages to hwine. The config seems to be working as expected. This could be a lost SMS in the carrier system. At hwine's request, I updated his pager number to a gvoice number.
Paging should be good to go now.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
github-sync2.dmz.scl3 just paged oncall for low inodes. It looks like that host is set for IRC-only notifications as well as the stuff for :aki:

'github-sync2.dmz.scl3.mozilla.com' => {
    parents => 'seamicro-b1.r101-3.console.scl3.mozilla.com',
    contact_groups => 'sysalertsonly,githubsync',

Perhaps that doesn't override the generic disk check's contact_groups?
Status: RESOLVED → REOPENED
Flags: needinfo?(ashish)
Resolution: FIXED → ---
(In reply to Eric Ziegenhorn :ericz from comment #6)
> github-sync2.dmz.scl3 just paged oncall for low inodes. It looks like that
> host is set for IRC-only notifications as well as the stuff for :aki:
>
> 'github-sync2.dmz.scl3.mozilla.com' => {
>     parents => 'seamicro-b1.r101-3.console.scl3.mozilla.com',
>     contact_groups => 'sysalertsonly,githubsync',
>
> Perhaps that doesn't override the generic disk check's contact_groups?

That is correct - service checks default to 'sysalerts' unless changed and do not inherit the host's contact_groups.
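The behaviour described above can be illustrated with a small Python model. This is only a sketch of the rule as stated in this comment, not the actual nagios or puppet code: the function name, the dictionary layout, and the "disk" check entries are hypothetical, and only the group names 'sysalerts', 'sysalertsonly' and 'githubsync' come from this bug.

```python
# Hypothetical model of the rule described in this comment: a service check
# notifies its own contact_groups, falls back to 'sysalerts' when none are
# set, and never looks at the contact_groups defined on the host.
DEFAULT_SERVICE_CONTACT_GROUPS = ["sysalerts"]


def service_contact_groups(host, service):
    """Return the contact groups notified for a service check on a host."""
    explicit = service.get("contact_groups")
    if explicit:
        return [group.strip() for group in explicit.split(",")]
    # The host's contact_groups are deliberately ignored here.
    return list(DEFAULT_SERVICE_CONTACT_GROUPS)


host = {
    "name": "github-sync2.dmz.scl3.mozilla.com",
    "contact_groups": "sysalertsonly,githubsync",
}

# Assumed shape of a generic disk check with no contact_groups of its own.
generic_disk_check = {"name": "disk"}

# Assumed shape of an IRC-only variant of the same check.
irc_only_disk_check = {
    "name": "disk",
    "contact_groups": "sysalertsonly,githubsync",
}

print(service_contact_groups(host, generic_disk_check))
print(service_contact_groups(host, irc_only_disk_check))
```

Under this model the generic disk check still resolves to 'sysalerts' even though the host itself lists only 'sysalertsonly,githubsync', which matches the page that oncall received.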
Flags: needinfo?(ashish)
(In reply to Eric Ziegenhorn :ericz from comment #6)
> github-sync2.dmz.scl3 just paged oncall for low inodes. It looks like that
> host is set for IRC-only notifications as well as the stuff for :aki:
>
> 'github-sync2.dmz.scl3.mozilla.com' => {
>     parents => 'seamicro-b1.r101-3.console.scl3.mozilla.com',
>     contact_groups => 'sysalertsonly,githubsync',
>
> Perhaps that doesn't override the generic disk check's contact_groups?

Eric, I'm not sure what the action is here.
Can you answer comment 8, please?
Flags: needinfo?(eziegenhorn)
The action here is to remove these systems that should have irc-only alerts from the generic hostgroup and put them in / make irc-only versions of that same group of checks.
Flags: needinfo?(eziegenhorn)
Latest machine usage list:

(In reply to Hal Wine [:hwine] (use needinfo) from comment #0)
> We've updated our usage of the github-sync machines. Please adjust nagios as
> follows:
>
> github-sync2.dmz.scl3.mozilla.com
> - is again a production machine
> - should page on nagios alerts (email/irc only)
> - should page hwine on critical alerts
> The following remain production:
> github-sync1-dev.dmz.scl3.mozilla.com
> github-sync1.dmz.scl3.mozilla.com
> github-sync3.dmz.scl3.mozilla.com
> - continue to be production machines
> - should page hwine on critical alerts

New member of pod:
github-sync4.dmz.scl3.mozilla.com
- is a "spare" box, and likely to be in production shortly

My vote would be to keep things simple and treat all 5 boxes as production at this time. We will not have the kind of development again that led to this bug being required.
Sorry for the delay here. The ultimate solution is to move github-sync2 & 4 from the generic hostgroup to generic-preprod. This will suppress after-hours alerts and retain IRC alerts for system-level services.
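The move described above can be sketched as a change in hostgroup membership. This is a hypothetical Python model rather than the real nagios configuration: the helper name and the starting membership of the two groups are assumptions, and only the hostnames and the group names 'generic' and 'generic-preprod' come from this bug.

```python
def move_hosts(hostgroups, hosts, source, target):
    """Return a new hostgroup mapping with `hosts` moved from source to target."""
    # Copy the lists so the caller's mapping is left untouched.
    updated = {name: list(members) for name, members in hostgroups.items()}
    updated.setdefault(target, [])
    for host in hosts:
        if host in updated.get(source, []):
            updated[source].remove(host)
        if host not in updated[target]:
            updated[target].append(host)
    return updated


# Assumed starting point: all five boxes in the generic hostgroup.
hostgroups = {
    "generic": [
        "github-sync1-dev.dmz.scl3.mozilla.com",
        "github-sync1.dmz.scl3.mozilla.com",
        "github-sync2.dmz.scl3.mozilla.com",
        "github-sync3.dmz.scl3.mozilla.com",
        "github-sync4.dmz.scl3.mozilla.com",
    ],
    "generic-preprod": [],
}

updated = move_hosts(
    hostgroups,
    [
        "github-sync2.dmz.scl3.mozilla.com",
        "github-sync4.dmz.scl3.mozilla.com",
    ],
    source="generic",
    target="generic-preprod",
)

for group, members in updated.items():
    print(group, members)
```

Because the checks attach to the hostgroup, moving a host between groups changes which set of checks (and so which notification behaviour) applies to it without editing each check per host.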
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard