Update all releng nagios information so only root causes alerts



5 years ago
4 years ago


(Reporter: hwine, Assigned: ashish)



(Whiteboard: :MOC)



5 years ago
The whole push behind the work in bug 885560 is to enable releng buildduty to quickly identify the scope of any issue.

Currently, we get notifications of individual machine outages, but not of the switch/database/colo that went offline and caused the issues. Our goal is to not spend time diagnosing issues when the cause is already known.

From various discussions, it appears that this is possible, but requires IT to add a lot of information into Nagios. This is the visible-to-RelEng tracker for that.

Comment 1

5 years ago
move to correct component
Assignee: nobody → server-ops
Component: Tools → Server Operations
Product: Release Engineering → mozilla.org
QA Contact: hwine → shyam
Summary: Update all releng nagios information so only root cause alerts → Update all releng nagios information so only root causes alerts
Version: unspecified → other

Comment 2

5 years ago
Dupe of bug 927941?

Comment 3

5 years ago
(In reply to Ed Morley [:edmorley UTC+1] from comment #2)
> Dupe of bug 927941?

Not quite -- bug 927941 asks for cluster/pool-level notifications (warn when 50% of tegras are offline -- no common cause needed).

This asks for a single alert for the specific failure that takes out multiple machines (e.g. a chassis switch failure that affects foopies and pandas and anything else in that physical rack).
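
As a rough illustration of the distinction, here is a minimal Python sketch of root-cause filtering. It assumes each host declares the upstream devices it depends on, similar in spirit to Nagios host parenting. The host names, the topology, and the `root_causes` helper are all hypothetical and not taken from the actual releng configuration.

```python
from typing import Dict, List, Set


def root_causes(parents: Dict[str, List[str]], down: Set[str]) -> Set[str]:
    """Return the down hosts that should alert as root causes.

    A down host is a root cause if it has no parents, or if at least one
    of its parents is still up. A down host whose parents are all down is
    treated as a downstream symptom and suppressed.
    """
    roots = set()
    for host in down:
        host_parents = parents.get(host, [])
        if not host_parents or any(p not in down for p in host_parents):
            roots.add(host)
    return roots


# Hypothetical topology: every name here is invented for illustration.
parents = {
    "rack-switch1": [],
    "foopy1": ["rack-switch1"],
    "panda1": ["rack-switch1"],
    "tegra1": ["foopy1"],
}

# The switch fails and takes everything behind it offline.
down = {"rack-switch1", "foopy1", "panda1", "tegra1"}
print(root_causes(parents, down))  # {'rack-switch1'}
```

With this kind of dependency information, four outages collapse into one alert for the switch. A pool-level threshold such as the one in bug 927941 would instead count offline machines without needing to know why they are offline.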

Comment 4

5 years ago
Ah - that makes sense, thank you :-)


4 years ago
Whiteboard: :MOC

Comment 5

4 years ago
With the extensive parenting work carried out in the past year (Bug 838959, et al.), I believe the scope of work in this bug has been covered.
Assignee: server-ops → ashish
Last Resolved: 4 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard