Closed Bug 709108 Opened 14 years ago Closed 14 years ago

nagios checks for the signing servers

Categories

(Infrastructure & Operations :: RelOps: General, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: arich)

Details

Attachments

(1 file)

Now that things have settled down in signing server land, I think we're ready to add nagios checks for the server processes. Each server has two instances, each of which will have 3 processes in total because of forking. So, each machine should have 6 processes running, where one of the arguments is "tools/release/signing/signing-server.py". I tried to verify this locally, but Nagios seems to think there's an extra one:

    ./check_nrpe -H localhost -c check_procs_regex -a tools/release/signing/signing-server.py 6 6
    PROCS CRITICAL: 7 processes with regex args 'tools/release/signing/signing-server.py'

    [cltsign@signing1 plugins]$ ps auxwww | grep tools/release/signing/signing-server.py
    cltsign 1469 0.0 0.0 61176 752 pts/0 R+ 08:57 0:00 grep tools/release/signing/signing-server.py
    cltsign 31198 0.0 0.2 199488 10756 ? Sl 07:19 0:00 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d
    cltsign 31199 0.0 0.2 178340 9580 ? S 07:19 0:00 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d
    cltsign 31200 0.0 0.2 178340 9580 ? S 07:19 0:00 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d
    cltsign 31225 0.0 0.3 201112 13352 ? Sl 07:20 0:03 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d --restart
    cltsign 31226 2.9 0.2 179324 10740 ? S 07:20 2:51 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d --restart
    cltsign 31227 3.0 0.2 179328 10740 ? S 07:20 2:53 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d --restart

(I wouldn't expect the grep one to show up through check_procs.)
Hm, I'm not sure that counting # of processes is going to be a good thing to do since the # of worker processes can vary. Can we check the process referred to by the .pid file instead?
You can tell nagios to check for a min and a max range of processes, but check_procs can't determine the number to check for from an external source. I had already set the check for 6 as bhearsum requested, but I can change that.
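For illustration, the min/max range idea can be sketched in Python. This is not how check_procs is implemented: `classify_count` and its output format are hypothetical, made up for this sketch. It reuses the 6/6 bounds and the count of 7 reported in the first comment, and the wider 6 to 8 range is only an example of a more tolerant setting.

```python
def classify_count(count: int, minimum: int, maximum: int) -> str:
    """Return a Nagios-style status line for a process count.

    Hypothetical helper: a count outside [minimum, maximum] is CRITICAL.
    """
    status = "OK" if minimum <= count <= maximum else "CRITICAL"
    return f"PROCS {status}: {count} processes"


# Exact match required (min == max == 6): 7 processes is out of range.
print(classify_count(7, 6, 6))

# A wider, made-up range tolerates a varying number of workers.
print(classify_count(7, 6, 8))
```

A fixed range still has to be chosen by hand, which is the limitation raised above when the number of worker processes varies.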
Assignee: server-ops-releng → arich
Looks like we can use the "-p" flag to only find the root processes:

    -p, --ppid=PPID
        Only scan for children of the parent process ID indicated.

    [cltsign@signing1 plugins]$ ./check_procs -p 1 --ereg-argument-array tools/release/signing/signing-server.py
    PROCS OK: 2 processes with PPID = 1, regex args 'tools/release/signing/signing-server.py'

Probably need to add a new command in the config, since we use check_procs_regex on Buildbot masters and elsewhere IIRC. How does this sound to you two?
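A minimal Python sketch of why filtering on PPID gives a stable count. The PIDs and command lines are taken from the ps listing in the first comment, but ps auxwww does not show parent PIDs, so the PPID values below are assumptions (each daemonized root reparented to PID 1, each worker a child of its root). `Proc` and `count_children` are hypothetical names, not part of check_procs.

```python
import re
from typing import NamedTuple


class Proc(NamedTuple):
    pid: int
    ppid: int  # ASSUMED values; not visible in the original ps output
    args: str


def count_children(procs, ppid, pattern):
    """Count processes whose parent is `ppid` and whose args match `pattern`."""
    regex = re.compile(pattern)
    return sum(1 for p in procs if p.ppid == ppid and regex.search(p.args))


CMD = "bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d"

PROCS = [
    Proc(31198, 1, CMD),
    Proc(31199, 31198, CMD),
    Proc(31200, 31198, CMD),
    Proc(31225, 1, CMD + " --restart"),
    Proc(31226, 31225, CMD + " --restart"),
    Proc(31227, 31225, CMD + " --restart"),
]

PATTERN = r"tools/release/signing/signing-server\.py"

# Only the two root processes have PPID 1, however many workers they fork.
print(count_children(PROCS, 1, PATTERN))  # prints 2
```

Under these assumptions the result depends only on the number of server instances, not on how many workers each one forks.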
Okay, based on our irc conversation:

    define command{
        command_name check_nrpe_child_procs_regex
        command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_child_procs_regex -a $ARG1$ $ARG2$ $ARG3$ $ARG4$
    }

    child_procs_regex&contact_groups build:signing-server::$signing-servers:tools/release/signing/signing-server.py!1!2!2

Which matches up with the client-side definition you put in irc.
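As a rough sketch of how the "!"-separated values end up in the command line: Nagios splits the arguments on "!" and substitutes them for $ARG1$, $ARG2$, and so on. The Python below only imitates that substitution for illustration. `expand_args` is a hypothetical helper, $USER1$ and $HOSTADDRESS$ are deliberately left unexpanded, and what the values 1, 2 and 2 mean depends on the client-side definition, which is not shown in this bug.

```python
COMMAND_LINE = (
    "$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_child_procs_regex "
    "-a $ARG1$ $ARG2$ $ARG3$ $ARG4$"
)


def expand_args(command_line: str, arg_string: str) -> str:
    """Replace $ARG1$..$ARGn$ with the '!'-separated values (illustrative only)."""
    for index, value in enumerate(arg_string.split("!"), start=1):
        command_line = command_line.replace(f"$ARG{index}$", value)
    return command_line


expanded = expand_args(
    COMMAND_LINE, "tools/release/signing/signing-server.py!1!2!2"
)
print(expanded)
```

The expanded line ends with `-a tools/release/signing/signing-server.py 1 2 2`, which is what gets passed on to the NRPE daemon on the signing server.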
Attachment #581273 - Flags: review?(catlee) → review+
Attachment #581273 - Attachment description: update nrpe.cfg template with check_child_procs_regex command → [checked in] update nrpe.cfg template with check_child_procs_regex command
Both the signing servers picked up the change, and the checks are now green - I think we're all done here?
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations