Bug 886637 (closed): opened 11 years ago, closed 11 years ago

change releng slave notifications to not alert host down until 7 hour duration

Categories

(mozilla.org Graveyard :: Server Operations, task)

Platform: x86 macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hwine, Assigned: ashish)

References

Details

(Whiteboard: [reit-nagios])

Attachments

(1 file, 1 obsolete file)

We now have automation to handle the majority of host down issues. The automation runs every 6 hours.

This means that no human will act on such a downed host until the automation has had a chance to fix things.

Please change the slave host-down notifications to not alert until a host has been down for over 7 hours (a config sketch follows the lists below). We don't want to change EC2 hosts.

This includes:
 - build slaves
 - try slaves
 - talos slaves
 - tegra slaves
 - panda slaves

The following hosts should NOT be modified - they continue to be handled by humans:
 - buildbot masters (build & test)
 - machines in ec2
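
For context, the knob involved here is Nagios' stock first_notification_delay directive. A minimal sketch of the requested change, assuming that directive and a made-up slave name (the real releng config is not part of this bug):

    # Minimal sketch only; host name and address are placeholders.
    define host {
        use                       generic-host
        host_name                 bld-linux64-ix-001   ; hypothetical build slave
        address                   10.26.0.1            ; placeholder
        ; 420 time units = 420 minutes = 7 hours with the default
        ; interval_length of 60s: one full 6-hour automation cycle,
        ; plus an hour of slack, before a human gets paged.
        first_notification_delay  420
    }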
Assignee: server-ops → ashish
Hal: I presume this should apply to all service checks as well as host checks, since they will alert much sooner than 7 hours?  Also, what mechanism will you be using to catch when large numbers (or entire silos) of hosts go offline at once due to a service disruption?  Will this be done manually by the sheriffs now?
Flags: needinfo?(hwine)
Amy:

1. Good question -- I think Nagios doesn't check services on "known down" boxes. But I don't know if it makes that decision on "soft" or "hard" downs. If "soft", then we're okay. Let's assume soft until proven otherwise.

2. The "entire silo" question is being addressed separately as part of the "let's get nagios working".

I.e. this is a "progress not perfection" step, and we'll be watching for the challenges.
Flags: needinfo?(hwine)
Ashish - what is the current ETR on this -- it will make life much better for buildduty.
Flags: needinfo?(ashish)
Re: loss of entire silos, the cluster checks on nagios1.private.releng.scl3.mozilla.com should catch problems.
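
For reference, this is the kind of check the stock check_cluster plugin provides. A sketch under assumptions: the command name, silo membership, and thresholds below are illustrative, not the real releng configuration.

    # Sketch only; member hosts and thresholds are made up.
    define command {
        command_name  check_host_cluster
        command_line  $USER1$/check_cluster --host -l $ARG1$ -w $ARG2$ -c $ARG3$ -d $ARG4$
    }

    define service {
        use                  generic-service
        host_name            nagios1.private.releng.scl3.mozilla.com
        service_description  silo check: prod-talos-r3-fed
        ; The $HOSTSTATEID:...$ on-demand macros feed each member's
        ; state (0 = UP) to check_cluster; warn when 5 or more members
        ; are non-UP, go critical at 10 or more.
        check_command        check_host_cluster!prod-talos-r3-fed!@5:!@10:!$HOSTSTATEID:talos-r3-fed-001$,$HOSTSTATEID:talos-r3-fed-002$
    }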
(In reply to Hal Wine [:hwine] from comment #3)
> Ashish - what is the current ETR on this -- it will make life much better
> for buildduty.

Need a little assistance here. A list of hostgroups would be most helpful [https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?hostgroup=all&style=grid]. Thanks!

:arr Is it ok to move the hosts to the "very lazy" category and bump first_notification_delay to 420m?
Flags: needinfo?(ashish) → needinfo?(arich)
Very lazy is just the tegras, I believe, and since they're included in this list, I think that would work.  For the pandas, we don't want to change the timing for the imaging servers, so I'm not sure what magic you would need to do to make that happen (since I'm not 100% sure how the panda check is set up to query mozpool).
Flags: needinfo?(arich)
Also, as part of this you probably want to revisit the cluster checks with releng to make sure that they're set at appropriate thresholds (and add cluster checks in silos where they don't exist).
(In reply to Ashish Vijayaram [:ashish] from comment #5)
> (In reply to Hal Wine [:hwine] from comment #3)
> > Ashish - what is the current ETR on this -- it will make life much better
> > for buildduty.
> 
> Need a little assistance here. A list of hostgroups would be most helpful
> [https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?hostgroup=all&style=grid]. Thanks!

See attachment; once :jhopkins approves, I hope that removes all blockers on this.
Comment on attachment 771370 [details]
nagios service groups to delay notification on

hwine: The list looks good except for a handful of possible omissions:

HP centos6 mock builders (bld-centos6-hp)
staging windows xp 32 bit talos servers (staging-talos-r3-xp)
windows 8 64 bit talos servers (t-w864-ix)
windows 2008R2 64 bit build hosts (w64-ix-slaves)
Attachment #771370 - Flags: review?(jhopkins) → review+
Flags: needinfo?(hwine)
With additions from comment 10.

:ashish - ETR?
Attachment #771370 - Attachment is obsolete: true
Flags: needinfo?(hwine) → needinfo?(ashish)
All of these hostgroups (plus one extra, mac-partner-repack) have had their first_notification_delay bumped to 420 mins:

bld-centos6-hp
linux64-ix-slaves
linux-ix-slaves
mac-partner-repack
mtv1-bld-linux64-ix
mw32-ix-slaves
pandas
prod-talos-linux32-ix
prod-talos-linux64-ix
prod-talos-mtnlion-r5
prod-talos-r3-fed
prod-talos-r3-fed64
prod-talos-r3-w7
prod-talos-r3-xp
prod-talos-r4-lion
prod-talos-r4-snow
prod-t-w732-ix
prod-t-xp32-ix
r5-production-builders
r5-try-builders
scl1-bld-linux64-ix
staging-talos-r3-xp
t-w864-ix
tegras
w64-ix-slaves

mac-partner-repack was in the same classification of hosts that all the other hostgroups belong to, so I included it as well for simplicity.
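
A guess at the shape of the applied change, assuming members of the listed hostgroups inherit from a shared host template (the "very lazy" classification in comment 5 hints at something like this); the template and host names here are made up:

    # Illustrative only; the real releng template names are not shown in this bug.
    define host {
        name                      releng-very-lazy   ; hypothetical shared template
        first_notification_delay  420                ; up from the default of 0 (notify immediately)
        register                  0                  ; template, not a monitored host
    }

    define host {
        use        releng-very-lazy,generic-host
        host_name  t-w864-ix-001                     ; placeholder member of t-w864-ix
        address    10.26.0.2                         ; placeholder
    }
    # Buildbot masters and EC2 slaves simply never use this template,
    # so they keep immediate notifications, per the description above.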
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?(ashish)
Resolution: --- → FIXED
Ashish - thanks, and thanks for noting the repack machine.
Product: mozilla.org → mozilla.org Graveyard