Closed
Bug 685980
Opened 14 years ago
Closed 14 years ago
New nagios checks for buildbot masters
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: catlee, Assigned: arich)
Details
Attachments
(1 file)
1.03 KB, patch (arich: review+)
Please add checks right now for:
check_pulsepublisher
check_commandrunner
These can be added now too if you want, but they're not yet deployed on the masters:
check_pulsequeue
check_commandqueue
Comment 1 • 14 years ago (Assignee)
I'm assuming there are no arguments, and these should go on all buildbot masters? What settings would you like for the checks?
service_description
normal_check_interval
retry_check_interval
max_check_attempts
first_notification_delay
notification_options
notification_interval
Assignee: server-ops-releng → arich
Comment 2 • 14 years ago (Reporter)
(In reply to Amy Rich [:arich] from comment #1)
> I'm assuming there are no arguments, and these should go on all buildbot
> masters?
No arguments, correct...and I think all buildbot masters.
> What settings would you like for the checks?
>
> service_description
check that {command,pulse} processor is running
> normal_check_interval
15 minutes
> retry_check_interval
what's this?
> max_check_attempts
this is max attempts before notifying? it should notify if it ever fails
> first_notification_delay
0
> notification_options
not sure
> notification_interval
default "something is busted on the master" interval is fine. 30 minutes?
Comment 3 • 14 years ago (Assignee)
Hey, catlee, if you're interested, this may be a good general resource:
http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html
So are the first two checks just doing process checks (to see if a process is running)? If that's the case, then we should use the builtin nagios process check plugin. If you're doing something more complicated, then a separate check might be warranted, but it's best to keep it simple (since the check definitions also have to go on the central and distributed servers if it's not a stock nagios check).
If you are doing your own check, the service_description is what nagios displays as the check name (and how it tracks data). So we'd probably want something very short and descriptive like "pulse processor" and ... not sure what to call the command processor without making it sound too generic (command).
The normal_check_interval for things on the "important" servers (including buildbot servers) is 5 minutes at this point (that was an earlier request) with a retry_check_interval of 3. We also have the notification_interval set at 5 minutes for the services on the "important" machines like buildbot servers.
first_notification_delay is used if you want to wait some period of time after the condition is triggered before sending any notifications.
The typical notification_options we use are some combination of w,c,u,r (warning, critical, unknown, recovery).
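How retry_check_interval and max_check_attempts interact can be sketched with a little arithmetic. This is a minimal illustration, assuming standard Nagios soft/hard state handling (the first failed check counts as attempt 1, and each further attempt waits one retry interval) and the default time unit of one minute. The function name is hypothetical and not part of Nagios.

```python
def minutes_until_hard_state(retry_check_interval, max_check_attempts):
    """Minutes from the first failed check until the problem becomes HARD.

    Assumes the service was OK beforehand and every recheck also fails.
    With first_notification_delay set to 0, notifications can go out as
    soon as the HARD state is reached.
    """
    # The first failure is attempt 1; only the remaining attempts add delay.
    rechecks = max(max_check_attempts - 1, 0)
    return rechecks * retry_check_interval


# retry_check_interval of 3 is the value quoted above for "important" servers.
for attempts in (1, 3):
    print(attempts, minutes_until_hard_state(3, attempts))
```

With max_check_attempts set to 1, a single failed check is enough, which matches the request in comment 2 that the check notify whenever it fails.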
Comment 4 • 14 years ago (Reporter)
check_pulsepublisher and check_commandrunner are already defined in the masters' nrpe.cfg files, but are just using:
command[check_commandrunner]=/usr/lib/nagios/plugins/check_procs -c 1:1 -a command_runner.py
command[check_pulsepublisher]=/usr/lib/nagios/plugins/check_procs -c 1:1 -a pulse_publisher.py
If you'd rather use check_procs explicitly, go ahead.
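As a rough illustration of what `-c 1:1 -a command_runner.py` asks for: count the processes whose command line contains the given string, and report CRITICAL unless exactly one is found. The sketch below models only that logic on a list of command-line strings. It is not the check_procs implementation, and the function name, message format, and sample command lines are assumptions made for illustration.

```python
OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes


def check_process_count(command_lines, arg_substring, low=1, high=1):
    """Return (exit_code, message) for processes whose args contain a string."""
    count = sum(1 for line in command_lines if arg_substring in line)
    # A range of 1:1 means the count must be exactly one.
    status = OK if low <= count <= high else CRITICAL
    label = "OK" if status == OK else "CRITICAL"
    return status, f"{label}: {count} process(es) matching '{arg_substring}'"


# Hypothetical process list for a master with one command runner.
sample = [
    "python command_runner.py -v",
    "python pulse_publisher.py",
    "/usr/sbin/nrpe -d",
]
print(check_process_count(sample, "command_runner.py"))
```

Because the range is 1:1, a second copy of the same script running by accident is flagged just like a missing one.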
Comment 5 • 14 years ago (Assignee)
Getting back to this now that the all hands is over and we've somewhat caught up on the critical stuff... The build machines don't seem to have the check definition for check_procs_regex that we're using on other hosts. I've put it in a file called procs_regex.cfg on dev-master01 to test out the check, and both the command_runner and pulse_publisher checks are working there now.
Do you want to add this check to the default list for all machines? It's a useful generic process check: it takes a regex and checks that between X and Y matching processes are running.
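The idea of a regex-based process check with a minimum and maximum count can be sketched as follows. The actual check_procs_regex definition is not shown in this bug, so this is only a guess at the general shape. The function names, arguments, and sample command lines are hypothetical.

```python
import re


def count_matching(command_lines, pattern):
    """Count command lines that match the regular expression anywhere."""
    regex = re.compile(pattern)
    return sum(1 for line in command_lines if regex.search(line))


def regex_process_check(command_lines, pattern, minimum, maximum):
    """Return 0 (OK) if the match count is within [minimum, maximum], else 2."""
    count = count_matching(command_lines, pattern)
    return 0 if minimum <= count <= maximum else 2


# Hypothetical process list, as it might look on a test master.
procs = [
    "python command_runner.py -v",
    "python pulse_publisher.py",
    "/usr/sbin/nrpe -d",
]
print(regex_process_check(procs, r"command_runner\.py", 1, 1))
print(regex_process_check(procs, r"(command_runner|pulse_publisher)\.py", 2, 2))
```

A regex lets one definition cover several process names or tolerate a varying interpreter path, which a plain substring match cannot do.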
Comment 6 • 14 years ago (Assignee)
These checks have been enabled on all buildbot-masters and are currently acked, waiting for releng to push the nrpe check definition to all the machines.
Comment 7 • 14 years ago (Reporter)
Attachment #563150 - Flags: review?
Updated • 14 years ago (Reporter)
Attachment #563150 - Flags: review? → review?(arich)
Updated • 14 years ago (Assignee)
Attachment #563150 - Flags: review?(arich) → review+
Comment 8 • 14 years ago (Reporter)
check_pulsequeue and check_commandqueue are ready to be checked as well.
Comment 9 • 14 years ago (Assignee)
I've added in both queue checks as well.
I've removed the checks for talos-master and downtimed the checks on production-mobile-master, since they are non-functional there (this is intended).
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated • 12 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations