Closed Bug 932598 Opened 11 years ago Closed 10 years ago

Update all releng nagios information so only root causes alerts

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: hwine, Assigned: ashish)

References

Details

(Whiteboard: :MOC)

Hal Wine [:hwine] use NI!

Reporter

Description

11 years ago

The whole push behind the work in bug 885560 is to enable releng buildduty to quickly identify the scope of any issue. Currently, we get notifications of individual machine outages, but not of the switch/database/colo that went offline and caused the issues. Our goal is to not spend time diagnosing issues when the cause is already known. From various discussions, it appears that this is possible, but requires IT to add a lot of information into Nagios. This is the visible-to-RelEng tracker for that.

Hal Wine [:hwine] use NI!

Reporter

Comment 1

11 years ago

move to correct component

Assignee: nobody → server-ops

Component: Tools → Server Operations

Product: Release Engineering → mozilla.org

QA Contact: hwine → shyam

Summary: Update all releng nagios information so only root cause alerts → Update all releng nagios information so only root causes alerts

Version: unspecified → other

Ed Morley [:emorley]

Comment 2

11 years ago

Dupe of bug 927941?

Hal Wine [:hwine] use NI!

Reporter

Comment 3

11 years ago

(In reply to Ed Morley [:edmorley UTC+1] from comment #2)
> Dupe of bug 927941?

Not quite -- bug 927941 asks for cluster/pool-level notifications (warn when 50% of tegras are offline -- no common cause needed).

This asks for a single alert for the specific failure that takes out multiple machines (e.g. a chassis switch failure that affects foopies and pandas and anything else in that physical rack).
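To illustrate the distinction, here is a minimal sketch of root-cause filtering, assuming each host has at most one parent (similar in spirit to the `parents` directive in a Nagios host definition). The host names, the `parents` mapping, and the `root_causes` helper are hypothetical and are not taken from the actual releng Nagios configuration.

```python
from typing import Dict, List, Optional, Set


def root_causes(parents: Dict[str, Optional[str]], down: Set[str]) -> List[str]:
    """Return the down hosts whose parent is not itself down.

    A host with no parent, or with a parent that is still up, is treated
    as a root cause. Hosts behind a failed parent are suppressed.
    """
    roots = []
    for host in sorted(down):
        parent = parents.get(host)
        if parent is None or parent not in down:
            roots.append(host)
    return roots


# Hypothetical topology: one chassis switch in front of a foopy and two pandas.
parents = {
    "rack-switch-1": None,
    "foopy-01": "rack-switch-1",
    "panda-001": "rack-switch-1",
    "panda-002": "rack-switch-1",
    "tegra-010": None,
}

# The switch fails and takes everything behind it offline.
down = {"rack-switch-1", "foopy-01", "panda-001", "panda-002"}

print(root_causes(parents, down))  # ['rack-switch-1']
```

With this kind of relationship recorded, buildduty would see one alert for the switch instead of four separate host alerts. A pool-level check as in bug 927941 would instead count how many machines of a type are offline, regardless of cause.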

Ed Morley [:emorley]

Comment 4

11 years ago

Ah - that makes sense, thank you :-)

Rick Bryce [:rbryce]

Updated

10 years ago

Whiteboard: :MOC

Ashish Vijayaram [:ashish]

Assignee

Comment 5

10 years ago

With the extensive parenting work carried out in the past year (Bug 838959, et al.), I believe the scope of work in this bug has been covered.

Assignee: server-ops → ashish

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

10 years ago

Product: mozilla.org → mozilla.org Graveyard


Categories

(mozilla.org Graveyard :: Server Operations, task)
