Closed Bug 932598 Opened 11 years ago Closed 10 years ago

Update all releng nagios information so only root causes alerts

Categories

(mozilla.org Graveyard :: Server Operations, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hwine, Assigned: ashish)

References

Details

(Whiteboard: :MOC)

The whole push behind the work in bug 885560 is to enable releng buildduty to quickly identify the scope of any issue.

Currently, we get notifications of individual machine outages, but not of the switch/database/colo that went offline and caused the issues. Our goal is to not spend time diagnosing issues when the cause is already known.

From various discussions, it appears that this is possible, but requires IT to add a lot of information into Nagios. This is the visible-to-RelEng tracker for that.
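To make the goal concrete, here is a minimal sketch of the root-cause idea, assuming each monitored host records a single parent (the device it depends on). The host names and the `PARENTS` map are hypothetical, and this is a simplification rather than Nagios's actual reachability logic: Nagios marks hosts behind a failed parent as UNREACHABLE instead of DOWN, which is what allows notifications to be limited to the root cause.

```python
from typing import Dict, List, Optional, Set

# Hypothetical topology: each host maps to the device it depends on.
# None marks the top of the tree.
PARENTS: Dict[str, Optional[str]] = {
    "colo-router": None,
    "rack1-switch": "colo-router",
    "foopy-01": "rack1-switch",
    "panda-01": "rack1-switch",
    "panda-02": "rack1-switch",
}


def root_causes(parents: Dict[str, Optional[str]], failing: Set[str]) -> List[str]:
    """Return the failing hosts whose parent is not itself failing.

    A host whose parent is also failing is treated as a symptom,
    not a cause, so it is left out of the result.
    """
    return sorted(h for h in failing if parents.get(h) not in failing)


# The switch fails and takes everything behind it offline.
failing = {"rack1-switch", "foopy-01", "panda-01", "panda-02"}
print(root_causes(PARENTS, failing))  # ['rack1-switch']
```

With this filter, buildduty would see one alert for the switch rather than one per machine behind it. A machine that fails on its own, with its parent healthy, is still reported individually.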
Moving to the correct component.
Assignee: nobody → server-ops
Component: Tools → Server Operations
Product: Release Engineering → mozilla.org
QA Contact: hwine → shyam
Summary: Update all releng nagios information so only root cause alerts → Update all releng nagios information so only root causes alerts
Version: unspecified → other
Dupe of bug 927941?
(In reply to Ed Morley [:edmorley UTC+1] from comment #2)
> Dupe of bug 927941?

Not quite -- bug 927941 asks for cluster/pool-level notifications (e.g. warn when 50% of tegras are offline; no common cause needed).

This bug asks for a single alert for the specific failure that takes out multiple machines (e.g. a chassis switch failure that affects foopies, pandas, and anything else in that physical rack).
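The difference between the two requests can be illustrated with a small sketch. Everything here is an assumption for illustration: the `INVENTORY` map, the machine and rack names, and the 50% threshold are hypothetical, and inferring a rack failure from its members is only a stand-in for monitoring the switch itself.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical inventory: machine -> (pool, physical rack).
INVENTORY: Dict[str, Tuple[str, str]] = {
    "foopy-01": ("foopies", "rack-a"),
    "foopy-02": ("foopies", "rack-b"),
    "foopy-03": ("foopies", "rack-b"),
    "panda-01": ("pandas", "rack-a"),
    "panda-02": ("pandas", "rack-b"),
    "panda-03": ("pandas", "rack-b"),
}


def pools_over_threshold(
    inventory: Dict[str, Tuple[str, str]], offline: Set[str], threshold: float = 0.5
) -> List[str]:
    """Pool-level check: pools where the offline fraction reaches the threshold."""
    total: Dict[str, int] = defaultdict(int)
    down: Dict[str, int] = defaultdict(int)
    for machine, (pool, _rack) in inventory.items():
        total[pool] += 1
        if machine in offline:
            down[pool] += 1
    return sorted(p for p in total if down[p] / total[p] >= threshold)


def racks_fully_offline(
    inventory: Dict[str, Tuple[str, str]], offline: Set[str]
) -> List[str]:
    """Common-cause check: racks in which every machine is offline."""
    by_rack: Dict[str, Set[str]] = defaultdict(set)
    for machine, (_pool, rack) in inventory.items():
        by_rack[rack].add(machine)
    return sorted(r for r, members in by_rack.items() if members <= offline)


# One rack goes dark, taking out one machine from each pool.
offline = {"foopy-01", "panda-01"}
print(pools_over_threshold(INVENTORY, offline))  # []
print(racks_fully_offline(INVENTORY, offline))   # ['rack-a']
```

In this scenario no pool crosses its threshold, so a pool-level check stays quiet, while the shared rack is the one thing worth alerting on.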
Ah - that makes sense, thank you :-)
Whiteboard: :MOC
With the extensive parenting work carried out in the past year (Bug 838959, et al.), I believe the scope of work in this bug has been covered.
Assignee: server-ops → ashish
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard