Closed Bug 1043839 Opened 10 years ago Closed 10 years ago

Automatically downtime nagios alerts when rebooting slaves

Tracking

(Not tracked)

Status:

RESOLVED WONTFIX

People

(Reporter: pmoore, Unassigned)


Details

Attachments

(1 file)

Screen Shot 2014-07-25 at 09.43.07.png, 10 years ago, Pete Moore [:pmoore][:pete], 206.01 KB, image/png

Pete Moore [:pmoore][:pete]

Reporter

Description

•

10 years ago

Attached image: Screen Shot 2014-07-25 at 09.43.07.png

A cursory glance at https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?host=all&servicestatustypes=16&hoststatustypes=3&serviceprops=42&sorttype=1&sortoption=6&limit=0 shows many alerts; however, most of these have a duration < 3 min. My assumption is that these mostly (if not completely) come from nagios alerting when a slave gets rebooted after completing a job.

The problem is that the nagios service status page then contains a lot of noise, which hides any real problems that may be occurring.

I think we can solve this problem by downtiming nagios alerts for ~5 minutes immediately before rebooting a slave.

This should keep the nagios interface cleaner, and make it easier to spot real problems.
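As a rough sketch of what such a downtime could look like: Nagios accepts external commands such as SCHEDULE_HOST_DOWNTIME and SCHEDULE_HOST_SVC_DOWNTIME written to its command file. The helper names, the host name, the author and comment strings, and the command file path below are all hypothetical, and this assumes the reboot code path can reach the command file at all, which may not hold for our setup.

```python
import time


def downtime_commands(host, minutes=5, author="reboot-hook",
                      comment="planned reboot after job", now=None):
    """Build Nagios external-command lines that downtime a host and its services.

    The author and comment defaults are placeholders, not existing conventions.
    """
    start = int(time.time()) if now is None else int(now)
    duration = minutes * 60
    end = start + duration
    # Fields: host;start;end;fixed;trigger_id;duration;author;comment
    # fixed=1 means the downtime runs exactly from start to end;
    # trigger_id=0 means it is not triggered by another downtime.
    args = f"{host};{start};{end};1;0;{duration};{author};{comment}"
    return [
        f"[{start}] SCHEDULE_HOST_DOWNTIME;{args}",
        f"[{start}] SCHEDULE_HOST_SVC_DOWNTIME;{args}",
    ]


def submit(lines, command_file):
    """Append command lines to the Nagios command file (path is site-specific)."""
    with open(command_file, "a") as pipe:
        for line in lines:
            pipe.write(line + "\n")
```

The reboot path would call something like submit(downtime_commands("some-slave"), "/path/to/nagios.cmd") just before issuing the reboot; both arguments are placeholders.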

Screenshot attached of nagios display at time of bug creation.

I'm hoping there is a single code path where slaves are rebooted after completing a buildbot job, where this nagios downtime can be inserted... =)

Pete Moore [:pmoore][:pete]

Reporter

Comment 1

•

10 years ago

Maybe this is not needed, in light of bug 1028191?

This assumes my assumptions are correct - that the nagios alerts really do correspond to machine reboots after buildbot jobs complete...

Depends on: 1028191

Justin Wood (:Callek)

Comment 2

•

10 years ago

This is just a matter of the nagios web UI being confusing:

Here's the trick: notice there is an "Attempt" column. While nagios does flap the machine to down, it doesn't actually send *any* notification in that state; it checks X consecutive times and only then alerts.

The timings are set such that it won't actually alert if the machine is in the middle of a reboot, but if it reaches all X attempts there is usually something to do.
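To illustrate the "X consecutive times" behaviour, here is a simplified model, not how nagios is actually implemented. In nagios the X corresponds to the max_check_attempts setting; the function name and the boolean representation of check results are made up for this sketch, and real nagios also distinguishes check and retry intervals, which are ignored here.

```python
def first_alert_index(check_results, max_attempts):
    """Return the index of the check that would trigger a notification, or None.

    check_results: iterable of booleans, True meaning the check passed.
    max_attempts: consecutive failures required before alerting.
    """
    failures = 0
    for i, ok in enumerate(check_results):
        if ok:
            failures = 0  # a recovery resets the attempt counter
            continue
        failures += 1
        if failures >= max_attempts:
            return i  # all attempts used up: this is where an alert would fire
    return None
```

With max_attempts=3, a reboot that fails two checks and then recovers never alerts, while a machine that stays down reaches the third attempt and does.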

I'm also not sure we really want every host in our network trying to contact nagios to do acks. We used to try some nagios passive checks for buildbot instead, and iirc that melted nagios's performance:
http://mxr.mozilla.org/build/source/puppet/modules/buildslave/files/runslave.py#231

My views are:
https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?host=all&servicestatustypes=28&serviceprops=270346&hostprops=270346&limit=0&sorttype=1&sortoption=1

and 

https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?host=all&servicestatustypes=16&hoststatustypes=15

and I have never yet figured out how to show *only* services in that "all attempts done" state.
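For reference, the serviceprops value in these URLs is a bitmask. The sketch below decodes it using flag values as I recall them from the Nagios Core CGI sources; treat the table as an assumption and verify it against the installed version before relying on it. Only a subset of flags is listed, and the function name is made up.

```python
# Assumed flag values (subset); verify against the Nagios CGI sources in use.
SERVICE_PROPS = {
    1: "in scheduled downtime",
    2: "not in scheduled downtime",
    4: "acknowledged",
    8: "unacknowledged",
    16: "checks disabled",
    32: "checks enabled",
    4096: "notifications disabled",
    8192: "notifications enabled",
    262144: "hard state",
    524288: "soft state",
}


def decode_props(value):
    """List the known property flags set in a serviceprops bitmask."""
    return [name for bit, name in sorted(SERVICE_PROPS.items()) if value & bit]
```

If the table is right, decode_props(270346) includes "hard state" and decode_props(42) does not, which would suggest the hard/soft state bits are the ones to experiment with for an "all attempts done" filter.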

---

Additionally, we have bug 1033292, which is far more valuable, imo.

Due to all that, WONTFIXING for now.

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → WONTFIX

Nobody; OK to take it and work on it

Assignee

Updated

•

7 years ago

Component: Tools → General

Categories

(Release Engineering :: General, defect)