Closed
Bug 709108
Opened 14 years ago
Closed 14 years ago
nagios checks for the signing servers
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: arich)
Details
Attachments
(1 file)
patch, 777 bytes; catlee: review+
Now that things have settled down in signing server land, I think we're ready to add nagios checks for the server processes. Each server has two instances, each of which will have 3 processes in total because of forking. So, each machine should have 6 processes running, where one of the arguments is "tools/release/signing/signing-server.py". I tried to verify this locally, but Nagios seems to think there's an extra one:
./check_nrpe -H localhost -c check_procs_regex -a tools/release/signing/signing-server.py 6 6
PROCS CRITICAL: 7 processes with regex args 'tools/release/signing/signing-server.py'
[cltsign@signing1 plugins]$ ps auxwww | grep tools/release/signing/signing-server.py
cltsign 1469 0.0 0.0 61176 752 pts/0 R+ 08:57 0:00 grep tools/release/signing/signing-server.py
cltsign 31198 0.0 0.2 199488 10756 ? Sl 07:19 0:00 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d
cltsign 31199 0.0 0.2 178340 9580 ? S 07:19 0:00 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d
cltsign 31200 0.0 0.2 178340 9580 ? S 07:19 0:00 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d
cltsign 31225 0.0 0.3 201112 13352 ? Sl 07:20 0:03 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d --restart
cltsign 31226 2.9 0.2 179324 10740 ? S 07:20 2:51 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d --restart
cltsign 31227 3.0 0.2 179328 10740 ? S 07:20 2:53 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d --restart
(I wouldn't expect the grep one to show up through check_procs.)
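For reference, the expected count of 6 can be reproduced from the ps output above with a short sketch. This is only an illustration, not how check_procs works internally: the sample lines are trimmed to user, PID, and command (an assumption of the sketch), and `count_matching` is a hypothetical helper. It does not explain where the seventh process reported by Nagios comes from.

```python
import re

# Lines taken from the `ps auxwww` output above, trimmed to USER, PID, and
# the command line (the trimmed layout is an assumption of this sketch).
PS_LINES = [
    "cltsign 1469 grep tools/release/signing/signing-server.py",
    "cltsign 31198 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d",
    "cltsign 31199 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d",
    "cltsign 31200 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d",
    "cltsign 31225 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d --restart",
    "cltsign 31226 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d --restart",
    "cltsign 31227 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d --restart",
]


def count_matching(lines, pattern, exclude_commands=("grep",)):
    """Count lines whose command line matches `pattern` (hypothetical helper).

    Processes whose executable is listed in `exclude_commands` are skipped,
    so the interactive grep is not counted.
    """
    regex = re.compile(pattern)
    count = 0
    for line in lines:
        _user, _pid, args = line.split(None, 2)
        executable = args.split()[0]
        if executable in exclude_commands:
            continue
        if regex.search(args):
            count += 1
    return count


PATTERN = r"tools/release/signing/signing-server\.py"
server_count = count_matching(PS_LINES, PATTERN)
count_with_grep = count_matching(PS_LINES, PATTERN, exclude_commands=())
```

With the grep line excluded the sketch counts 6 processes; counting it as well gives 7.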
Comment 1•14 years ago
Hm, I'm not sure that counting # of processes is going to be a good thing to do since the # of worker processes can vary.
Can we check the process referred to by the .pid file instead?
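A minimal sketch of the pid-file idea, assuming the server writes a single integer PID to its .pid file (the thread does not show the file's location or format). `pid_from_file` and `process_is_running` are hypothetical helper names, and the liveness test relies on the POSIX convention that signal 0 checks for existence without delivering a signal.

```python
import os


def pid_from_file(path):
    """Read an integer PID from a pid file (format assumed: one number)."""
    with open(path) as handle:
        return int(handle.read().strip())


def process_is_running(pid):
    """Return True if a process with this PID exists (POSIX only).

    Signal 0 performs the existence and permission checks without
    actually sending a signal to the process.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

A check built this way tracks only the process named in the pid file, so it does not depend on how many workers have been forked.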
Comment 2•14 years ago (Assignee)
You can tell nagios to check for a min and a max range of processes, but check_procs can't determine the number to check for from an external source.
I had already set the check for 6 as bhearsum requested, but I can change that.
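The min/max idea can be sketched as follows. This is an illustration of the range logic only, not check_procs itself: `check_process_count` is a hypothetical function, and the only Nagios facts it relies on are the standard plugin exit codes (0 for OK, 2 for CRITICAL).

```python
OK = 0        # standard Nagios plugin exit code for OK
CRITICAL = 2  # standard Nagios plugin exit code for CRITICAL


def check_process_count(count, minimum, maximum):
    """Return (exit_code, message) for a count checked against a fixed range.

    Hypothetical helper: the range is hard-coded by the caller, which
    mirrors the limitation described above (the bounds cannot be looked
    up from an external source).
    """
    if minimum <= count <= maximum:
        return OK, "PROCS OK: %d processes" % count
    return CRITICAL, "PROCS CRITICAL: %d processes" % count


# A fixed range of exactly 6, as in the check from the description.
exact_six = check_process_count(7, 6, 6)
# A wider range tolerates a varying number of workers.
ranged = check_process_count(7, 2, 10)
```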
Updated•14 years ago (Assignee)
Assignee: server-ops-releng → arich
Comment 3•14 years ago (Reporter)
Looks like we can use the "-p" flag to only find the root processes:
-p, --ppid=PPID
Only scan for children of the parent process ID indicated.
[cltsign@signing1 plugins]$ ./check_procs -p 1 --ereg-argument-array tools/release/signing/signing-server.py
PROCS OK: 2 processes with PPID = 1, regex args 'tools/release/signing/signing-server.py'
Probably need to add a new command in the config, since we use check_procs_regex on Buildbot masters and elsewhere IIRC.
How does this sound to you two?
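The effect of filtering on PPID can be sketched like this. The parent PIDs below are assumptions: the ps output earlier in the bug does not include a PPID column, so the sketch guesses that each instance's lowest PID is the daemonized root (PPID 1) and the other two are its forked workers. `count_children_of` is a hypothetical helper, not part of check_procs.

```python
import re

SERVER = "bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d"

# (pid, ppid, command line). PIDs come from the ps output in the
# description; the PPID values are ASSUMED for illustration.
PROCESSES = [
    (31198, 1, SERVER),
    (31199, 31198, SERVER),
    (31200, 31198, SERVER),
    (31225, 1, SERVER + " --restart"),
    (31226, 31225, SERVER + " --restart"),
    (31227, 31225, SERVER + " --restart"),
]


def count_children_of(processes, ppid, pattern):
    """Count processes with the given parent whose arguments match `pattern`."""
    regex = re.compile(pattern)
    return sum(
        1
        for _pid, parent, args in processes
        if parent == ppid and regex.search(args)
    )


PATTERN = r"tools/release/signing/signing-server\.py"
root_count = count_children_of(PROCESSES, 1, PATTERN)
```

Under these assumptions the count stays at 2 (one root per instance) no matter how many workers each instance forks.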
Comment 4•14 years ago (Assignee)
Okay, based on our irc conversation:
define command{
command_name check_nrpe_child_procs_regex
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_child_procs_regex -a $ARG1$ $ARG2$ $ARG3$ $ARG4$
}
child_procs_regex&contact_groups build:signing-server::$signing-servers:tools/release/signing/signing-server.py!1!2!2
Which matches up with the client-side definition you put in irc.
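To show how the `!`-separated values end up in `$ARG1$` through `$ARG4$`, here is a rough sketch of Nagios-style macro expansion. It is not Nagios code: `expand_command` is a hypothetical function, the plugin directory and host name are made-up values, and the `check_command` string is an assumption about what the service line above is turned into.

```python
import re

# command_line from the definition above.
COMMAND_LINE = (
    "$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_child_procs_regex "
    "-a $ARG1$ $ARG2$ $ARG3$ $ARG4$"
)


def expand_command(command_line, check_command, macros):
    """Expand $NAME$ and $ARGn$ macros (hypothetical, simplified)."""
    name, *args = check_command.split("!")
    expanded = command_line
    for key, value in macros.items():
        expanded = expanded.replace("$%s$" % key, value)
    for index, value in enumerate(args, start=1):
        expanded = expanded.replace("$ARG%d$" % index, value)
    # Any $ARGn$ without a supplied value becomes an empty string.
    expanded = re.sub(r"\$ARG\d+\$", "", expanded).strip()
    return name, expanded


name, expanded = expand_command(
    COMMAND_LINE,
    # Assumed form of the generated check_command.
    "check_nrpe_child_procs_regex!tools/release/signing/signing-server.py!1!2!2",
    # Made-up values for illustration only.
    {"USER1": "/usr/lib/nagios/plugins", "HOSTADDRESS": "signing1"},
)
```

With these inputs the expanded command passes four arguments after `-a`: the regex followed by 1, 2 and 2.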
Comment 5•14 years ago (Reporter)
Attachment #581273 -
Flags: review?(catlee)
Updated•14 years ago
Attachment #581273 -
Flags: review?(catlee) → review+
Updated•14 years ago (Reporter)
Attachment #581273 -
Attachment description: update nrpe.cfg template with check_child_procs_regex command → [checked in] update nrpe.cfg template with check_child_procs_regex command
Comment 6•14 years ago (Reporter)
Both the signing servers picked up the change, and the checks are now green - I think we're all done here?
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•12 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations