The whole push behind the work in bug 885560 is to enable RelEng buildduty to quickly identify the scope of any issue. Currently, we get notifications of individual machine outages, but not of the switch/database/colo that went offline and caused them. Our goal is to not spend time diagnosing issues when the cause is already known. From various discussions, it appears that this is possible, but it requires IT to add a lot of information into Nagios. This is the visible-to-RelEng tracker for that.
move to correct component
Dupe of bug 927941?
(In reply to Ed Morley [:edmorley UTC+1] from comment #2)
> Dupe of bug 927941?
Not quite -- bug 927941 asks for cluster/pool-level notifications (warn when 50% of tegras are offline -- no common cause needed). This bug asks for a single alert for the specific failure that takes out multiple machines (e.g. a chassis switch failure that affects foopies and pandas and anything else in that physical rack).
Ah - that makes sense, thank you :-)
With the extensive parenting work carried out in the past year (Bug 838959, et al.), I believe the scope of work in this bug has been covered.