A cursory glance at the nagios service status page shows many alerts; however, most of these have a duration < 3 min. My assumption is that most (if not all) of these are nagios alerts raised when a slave gets rebooted after completing a job.

The problem is that the nagios service status page contains a lot of noise, which hides any real problems that may be occurring.

I think we can solve this problem by downtiming nagios alerts for ~5 minutes immediately before rebooting a slave.

This should keep the nagios interface cleaner, and make it easier to spot real problems.
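To make the proposal concrete, here is a minimal sketch of what the reboot path could hand to nagios. It only builds the external command line for SCHEDULE_HOST_DOWNTIME; the function name, the default author and comment strings, and the host name in the usage note are hypothetical. How the line reaches nagios (typically by appending it to the command file configured in nagios.cfg) is not shown, since that depends on our setup.

```python
import time


def schedule_host_downtime_command(host, duration_s=300, author="buildbot",
                                   comment="post-job reboot", now=None):
    """Build a nagios external command line for a fixed host downtime.

    Field order: host; start; end; fixed; trigger_id; duration; author; comment.
    The helper name and the default author/comment are illustrative only.
    """
    start = int(time.time()) if now is None else int(now)
    end = start + duration_s
    # fixed=1: the downtime runs exactly from start to end.
    # trigger_id=0: not triggered by another downtime entry.
    return (f"[{start}] SCHEDULE_HOST_DOWNTIME;{host};{start};{end};"
            f"1;0;{duration_s};{author};{comment}")
```

Calling schedule_host_downtime_command("some-slave") just before the reboot would give a ~5 minute window. This covers the host check only; service checks on the same host would need their own downtime command.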

Screenshot attached of nagios display at time of bug creation.

I'm hoping there is a single code path where slaves are rebooted after completing a buildbot job, where this nagios downtime can be inserted... =)
Maybe this is not needed, in light of bug 1028191?

This assumes I am right that the nagios alerts really do correspond to machine reboots after buildbot jobs complete...
This is just a matter of the nagios web UI being confusing:

Here's the trick: notice there is an "Attempt" column. While nagios does flap the machine to down, it doesn't actually do *any* notification in that state; it checks X consecutive times and only then alerts.

The timings are set such that it won't actually alert if the machine is in the middle of a reboot, but if it reaches all X attempts there usually is something to do.
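As a rough illustration of the attempt logic described above, here is a simplified model, not nagios's actual implementation. It assumes a threshold of 3 consecutive failed checks (the real value comes from max_check_attempts in the check definition), and the function name is made up for this sketch.

```python
def first_notification_attempt(check_results, max_check_attempts=3):
    """Return the 1-based index of the check at which an alert would fire.

    check_results is a sequence of booleans (True = check passed).
    An alert fires only after max_check_attempts consecutive failures;
    any passing check resets the counter. Returns None if no alert fires.
    """
    consecutive_failures = 0
    for index, ok in enumerate(check_results, start=1):
        if ok:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= max_check_attempts:
                return index
    return None
```

A reboot that causes two failed checks and then recovers, e.g. [False, False, True], never reaches the threshold, so the host shows as down in the web UI for a while but nobody gets notified.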

I'm also not sure we really want every host in our network trying to contact nagios to do acks. We used to try some nagios passive checks for buildbot instead, and iirc that melted nagios's performance:

My views are:


and I have never yet figured out how to show *only* services in that "all attempts done" state.


Additionally we have Bug 1033292 -- which is far more valuable, imo.

Due to all that, WONTFIXING for now.
