Adjust nagios settings to be less noisy during expected machine reboots

RESOLVED FIXED

Status

mozilla.org Graveyard
Server Operations
8 years ago
3 years ago

People

(Reporter: joduinn, Assigned: bkero)

All the RelEng build and test slaves reboot after each job. For each of these reboots, we get a nagios CRITICAL alert, followed soon after by a nagios OK as the machine comes back online. Increasing the nagios time limit would make that noise go away and make it easier to spot valid problems, such as a machine that reboots but never comes back until a physical power cycle.

For each RelEng build slave, in
#        service_description             PING

...please change:
#        retry_check_interval            15
#        max_check_attempts              20


This means that a rebooting machine will not be reported until 5 minutes (15 seconds x 20 retries) have passed, which is plenty of time for a normal reboot cycle.

Updated

8 years ago
Assignee: server-ops → bkero
I don't know if mobile phones (maemo-n810-* and n900-*) are in the same service check, but can we make sure that the phones are given at least 30 minutes to reboot and come back online before they become CRITICAL? The devices take anywhere from three to five minutes to reboot, and then need more time to establish a wifi connection.

Something like:
#        retry_check_interval            90
#        max_check_attempts              20

would be great.
bkero: From a quick chat with jabba and jhford, it seems some tests were done here recently, but there's no update in the bug. What's the latest status?
Note that retry_interval is measured in minutes, not seconds, by default. It's something to be aware of. You can change the unit by setting interval_length in the main config; I don't know whether we've done that.
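
To make the unit question concrete, here is a rough sketch of the arithmetic for the values proposed in this bug. It follows the simple "interval x attempts" estimate used above. The function name `alert_delay_seconds` is made up for illustration, and the default of 60 reflects the stock nagios `interval_length`; whether our main config overrides it is not known here. Nagios counts the first failed check as an attempt, so the real delay may be about one interval shorter than this estimate.

```python
def alert_delay_seconds(retry_check_interval, max_check_attempts, interval_length=60):
    """Approximate delay before a failing check is reported as CRITICAL.

    interval_length is the number of seconds per nagios "time unit";
    60 is the stock default, 1 would make the intervals plain seconds.
    """
    return retry_check_interval * max_check_attempts * interval_length


# Values proposed in this bug: 15 x 20 for desktop slaves, 90 x 20 for phones.
for label, retry in (("desktop slaves", 15), ("phones", 90)):
    if_seconds = alert_delay_seconds(retry, 20, interval_length=1)
    if_minutes = alert_delay_seconds(retry, 20, interval_length=60)
    print(f"{label}: {if_seconds / 60:.0f} min if units are seconds, "
          f"{if_minutes / 3600:.0f} h if units are minutes")
```

With interval_length left at 60, the proposed 15 x 20 works out to roughly 5 hours rather than 5 minutes, so the numbers would need to be scaled down (or interval_length set to 1) to get the intended window.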
(Assignee)

Comment 4

8 years ago
joduinn: I did change the retry_check_interval for machines, however haven't found a way to do it cleanly for just mobile devices yet.  After discussing it with jabba, it seemed like it would take significant retooling of our nagios config generator.

Apologies for not attending to the second part of the ticket sooner, and for no response to the first part.  FWIW I did coordinate a test reboot with the buildduty person at the time and ensured that the changes went smoothly.

jabba: can you confirm that I'm correct in my statement about retooling nagios?
(In reply to comment #4)
> joduinn: I did change the retry_check_interval for machines, however haven't
> found a way to do it cleanly for just mobile devices yet.  After discussing it
> with jabba, it seemed like it would take significant retooling of our nagios
> config generator.
> 
> Apologies for not attending to the second part of the ticket sooner, and for no
> response to the first part.  FWIW I did coordinate a test reboot with the
> buildduty person at the time and ensured that the changes went smoothly.

bkero: ok, glad to know there's some progress underway. Do I read this correctly that you have a change in place that would at least work for all the desktop slaves, and be no worse for our mobile devices?



> jabba: can you confirm that I'm correct in my statement about retooling nagios?
I know we talked about treating desktop masters differently - how does that fit into this?
(Assignee)

Comment 6

8 years ago
joduinn: Correct, desktop slaves should be less noisy now.  I'm talking with jabba about what to do about a change for mobile devices.

Comment 7

8 years ago
joduinn: I'm looking through scrollback in #build and don't actually see any ping criticals due to reboots. The only ping critical I see is actually a down host. The rest of the spam flooding the channel is hung slave warnings. I think the hung slave checks need to be modified differently if those are indeed false alarms?

Comment 8

8 years ago
I think the hung slave notifications are valid, and were helpful when our infrastructure was being slammed by requests. The tree's been very quiet since the summit, however (actually, all of July due to the holidays+summit), so we have machines that aren't running anything.

I think the real fix for that is bug 565397. Dunno if people mind the hung slave warnings in the meantime.
I agree with all of comments 7 and 8, except when the ESX servers are busy. Then we get one or a few OK/CRITICAL/OK flaps a day as machines reboot.
(Assignee)

Comment 10

8 years ago
joduinn: I've added the n900s to the veryLazyHost class, which has a notification delay of 30 minutes. I think this was the best way to implement the delay you wanted. Try it out and let me know if it doesn't work for you.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard