Closed Bug 685980 Opened 14 years ago Closed 14 years ago

New nagios checks for buildbot masters

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: arich)

Details

Attachments

(1 file)

Please add checks right now for:
check_pulsepublisher
check_commandrunner

These can be added now too if you want, but they're not yet deployed on the masters:
check_pulsequeue
check_commandqueue
I'm assuming there are no arguments, and these should go on all buildbot masters?

What settings would you like for the checks?

service_description
normal_check_interval
retry_check_interval
max_check_attempts
first_notification_delay
notification_options
notification_interval
Assignee: server-ops-releng → arich
(In reply to Amy Rich [:arich] from comment #1)
> I'm assuming there are no arguments, and these should go on all buildbot
> masters?

No arguments, correct...and I think all buildbot masters.

> What settings would you like for the checks?
>
> service_description
check that {command,pulse} processor is running

> normal_check_interval
15 minutes

> retry_check_interval
what's this?

> max_check_attempts
this is max attempts before notifying? it should notify if it ever fails

> first_notification_delay
0

> notification_options
not sure

> notification_interval
default "something is busted on the master" interval is fine. 30 minutes?
Hey, catlee, if you're interested, this may be a good general resource: http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html

So are the first two checks just doing process checks (to see if a process is running)? If that's the case, then we should use the builtin nagios process check plugin. If you're doing something more complicated, then a separate check might be warranted, but it's best to keep it simple (since the check definitions also have to go on the central and distributed servers if it's not a stock nagios check).

If you are doing your own check, the service_description is what nagios displays as the check name (and how it tracks data). So we'd probably want something very short and descriptive like "pulse processor" and ... not sure what to call the command processor without making it sound too generic (command).

The normal_check_interval for things on the "important" servers (including buildbot servers) is 5 minutes at this point (that was an earlier request) with a retry_check_interval of 3. We also have the notification_interval set at 5 minutes for the services on the "important" machines like buildbot servers.

first_notification_delay is used if you want to wait some period of time after the condition is triggered before sending any notifications.

The typical notification_options we use are some combination of w,c,u,r (warning, critical, unknown, recovery).
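On the earlier question about max_check_attempts and retry_check_interval: in Nagios, the first failed result counts as attempt 1, the check is then retried every retry_check_interval until max_check_attempts is reached, and only then does the service go into a hard problem state and start notifying. A rough sketch of that arithmetic follows. The function name is made up for illustration, and the max_check_attempts value of 3 in the usage lines is a hypothetical value, not one agreed on in this bug.

```python
def minutes_until_hard_state(max_check_attempts: int, retry_check_interval: int) -> int:
    """Minutes between the first failed check and the hard problem state.

    Assumes interval values are in minutes (the default Nagios time unit
    of 60 seconds) and ignores check latency and first_notification_delay.
    """
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    # The first failure is attempt 1, so only the remaining attempts
    # each cost one retry interval.
    return (max_check_attempts - 1) * retry_check_interval


# "Notify if it ever fails" corresponds to max_check_attempts = 1.
print(minutes_until_hard_state(1, 3))
# Hypothetical: three attempts with the retry_check_interval of 3 used here.
print(minutes_until_hard_state(3, 3))
```

So a setting of 1 matches "it should notify if it ever fails", while larger values trade notification speed for fewer alerts on one-off failures.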
check_pulsepublisher and check_commandrunner are already defined in the masters' nrpe.cfg files, but are just using:

command[check_commandrunner]=/usr/lib/nagios/plugins/check_procs -c 1:1 -a command_runner.py
command[check_pulsepublisher]=/usr/lib/nagios/plugins/check_procs -c 1:1 -a pulse_publisher.py

If you'd rather use check_procs explicitly, go ahead.
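For anyone unfamiliar with what `-c 1:1 -a command_runner.py` asks for: check_procs counts the processes whose arguments contain the given string and goes critical when that count falls outside the range, so 1:1 means "exactly one such process". The sketch below only illustrates that logic; it is not the plugin's code. The function name, the message wording, and the sample command lines are assumptions, and it works on a list of command-line strings rather than reading the process table.

```python
OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_procs_sketch(command_lines, arg_substring, crit_min=1, crit_max=1):
    """Return (status, message) for a range check on matching processes.

    command_lines: full command lines, one string per running process.
    arg_substring: text to look for, like the value passed to -a.
    crit_min, crit_max: inclusive bounds, like the "1:1" passed to -c.
    """
    count = sum(1 for line in command_lines if arg_substring in line)
    if crit_min <= count <= crit_max:
        return OK, f"PROCS OK: {count} matching '{arg_substring}'"
    return CRITICAL, f"PROCS CRITICAL: {count} matching '{arg_substring}'"


# Hypothetical process list for a master.
running = [
    "python command_runner.py -q /dev/shm/queue/commands",
    "python buildbot start master",
]
print(check_procs_sketch(running, "command_runner.py"))
print(check_procs_sketch(running, "pulse_publisher.py"))
```

With this sample list the first call is OK (one match) and the second is CRITICAL (no match). Two copies of the same script would also be CRITICAL, since the range has an upper bound of 1.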
Getting back to this now that the all hands is over and we've somewhat caught up on the critical stuff... The build machines don't seem to have the check definition for check_procs_regex that we're using on other hosts. I've put it in a file called procs_regex.cfg on dev-master01 to test out the check, and both the command_runner and pulse_publisher checks are working there now. Do you want to add this check to the default list for all machines (it's a useful generic check for processes which takes a regex and looks for X - Y number of processes that match)?
These checks have been enabled on all buildbot-masters and are currently acked, waiting for releng to push the nrpe check definition to all the machines.
Attachment #563150 - Flags: review? → review?(arich)
Attachment #563150 - Flags: review?(arich) → review+
check_pulsequeue, check_commandqueue are ready to be checked as well.
I've added in both queue checks as well. I've removed the checks for talos-master and downtimed the checks on production-mobile-master since they are non-functional (intended).
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations