Closed Bug 709108 Opened 13 years ago Closed 13 years ago

nagios checks for the signing servers

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86_64
All
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: arich)

Details

Attachments

(1 file)

Now that things have settled down in signing server land, I think we're ready to add nagios checks for the server processes. Each server has two instances, each of which will have 3 processes in total because of forking. So each machine should have 6 processes running, where one of the arguments is "tools/release/signing/signing-server.py". I tried to verify this locally, but Nagios seems to think there's an extra one:
./check_nrpe -H localhost -c check_procs_regex -a tools/release/signing/signing-server.py 6 6
PROCS CRITICAL: 7 processes with regex args 'tools/release/signing/signing-server.py'

[cltsign@signing1 plugins]$ ps auxwww | grep tools/release/signing/signing-server.py
cltsign   1469  0.0  0.0  61176   752 pts/0    R+   08:57   0:00 grep tools/release/signing/signing-server.py
cltsign  31198  0.0  0.2 199488 10756 ?        Sl   07:19   0:00 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d
cltsign  31199  0.0  0.2 178340  9580 ?        S    07:19   0:00 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d
cltsign  31200  0.0  0.2 178340  9580 ?        S    07:19   0:00 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d
cltsign  31225  0.0  0.3 201112 13352 ?        Sl   07:20   0:03 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d --restart
cltsign  31226  2.9  0.2 179324 10740 ?        S    07:20   2:51 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d --restart
cltsign  31227  3.0  0.2 179328 10740 ?        S    07:20   2:53 bin/python tools/release/signing/signing-server.py signing.ini -v -l signing.log -d --restart

(I wouldn't expect the grep one to show up through check_procs.)
Hm, I'm not sure that counting # of processes is going to be a good thing to do since the # of worker processes can vary.

Can we check the process referred to by the .pid file instead?
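
For reference, a pid-file check along these lines could look roughly like the sketch below. This is only an illustration, not what the signing servers ship: the function name `pid_file_alive` and any pid file path passed to it are assumptions, and it only confirms that some process with that PID exists, not that the process is actually signing-server.py.

```python
import os


def pid_file_alive(pid_path):
    """Return True if the PID stored in pid_path belongs to a live process."""
    try:
        with open(pid_path) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        # A missing, unreadable, or malformed pid file counts as "not running".
        return False
    if pid <= 0:
        # Non-positive values address process groups, not a single process.
        return False
    try:
        # Signal 0 only performs the existence/permission check.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but is owned by another user.
        return True
    return True
```

A check built this way would not depend on how many worker processes each instance forks.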
You can tell nagios to check for a min and a max range of processes, but check_procs can't determine the number to check for from an external source.

I had already set the check for 6 as bhearsum requested, but I can change that.
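
To illustrate the min/max idea, here is a rough sketch of how a count is judged against a fixed range. It assumes a simplified `min:max` range string (real Nagios ranges have more forms), and `check_count` is a hypothetical helper, not part of check_procs.

```python
def check_count(count, range_spec):
    """Return a check_procs-style status line for count against 'min:max'."""
    low_text, high_text = range_spec.split(":")
    low, high = int(low_text), int(high_text)
    # The count is fine only if it falls inside the inclusive range.
    status = "OK" if low <= count <= high else "CRITICAL"
    return f"PROCS {status}: {count} processes"


# With min and max both set to 6, a seventh matching process trips the check.
print(check_count(7, "6:6"))
```

The bounds are still fixed in the check definition, which is the limitation mentioned above: widening the range tolerates a varying worker count but cannot track it.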
Assignee: server-ops-releng → arich
Looks like we can use the "-p" flag to only find the root processes:
 -p, --ppid=PPID
   Only scan for children of the parent process ID indicated.

[cltsign@signing1 plugins]$ ./check_procs -p 1 --ereg-argument-array tools/release/signing/signing-server.py
PROCS OK: 2 processes with PPID = 1, regex args 'tools/release/signing/signing-server.py'
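
What the PPID filter does can be sketched in Python as below. This is not how check_procs is implemented; `count_root_procs` is a hypothetical helper, and the PPID column in the sample is an assumption for illustration (the ps listing earlier in this bug doesn't show PPIDs). It assumes the two parent processes were reparented to PID 1 and that the forked workers are their direct children.

```python
import re


def count_root_procs(ps_lines, pattern, ppid=1):
    """Count processes whose parent is ppid and whose args match pattern."""
    regex = re.compile(pattern)
    count = 0
    for line in ps_lines:
        fields = line.split(None, 2)  # pid, ppid, full argument string
        if len(fields) < 3 or not fields[1].isdigit():
            continue  # skip header or malformed lines
        if int(fields[1]) == ppid and regex.search(fields[2]):
            count += 1
    return count


# Illustrative "pid ppid args" lines; the PPID values are assumed.
SAMPLE = [
    "  PID  PPID COMMAND",
    "31198     1 bin/python tools/release/signing/signing-server.py signing.ini -d",
    "31199 31198 bin/python tools/release/signing/signing-server.py signing.ini -d",
    "31200 31198 bin/python tools/release/signing/signing-server.py signing.ini -d",
    "31225     1 bin/python tools/release/signing/signing-server.py signing.ini -d --restart",
    "31226 31225 bin/python tools/release/signing/signing-server.py signing.ini -d --restart",
    "31227 31225 bin/python tools/release/signing/signing-server.py signing.ini -d --restart",
]

print(count_root_procs(SAMPLE, "tools/release/signing/signing-server.py"))
```

Under those assumptions the count stays at one per instance no matter how many workers each instance forks.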

Probably need to add a new command in the config, since we use check_procs_regex on Buildbot masters and elsewhere IIRC.

How does this sound to you two?
Okay, based on our irc conversation:

define command{
    command_name    check_nrpe_child_procs_regex
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_child_procs_regex -a $ARG1$ $ARG2$ $ARG3$ $ARG4$
    }

child_procs_regex&contact_groups build:signing-server::$signing-servers:tools/release/signing/signing-server.py!1!2!2

Which matches up with the client-side definition you put in irc.
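
For anyone reading along, here is a rough sketch of how the "!"-separated arguments end up in the command line above. It assumes the trailing field of the check definition follows the usual Nagios convention of "!"-separated values mapped to $ARG1$, $ARG2$, and so on. `expand_command_line` is a hypothetical helper, and $USER1$ and $HOSTADDRESS$ are deliberately left unexpanded. What the three numbers mean on the client side isn't shown in this bug.

```python
def expand_command_line(command_line, check_args):
    """Substitute $ARGn$ macros with the n-th '!'-separated value."""
    for index, value in enumerate(check_args.split("!"), start=1):
        command_line = command_line.replace(f"$ARG{index}$", value)
    return command_line


COMMAND_LINE = (
    "$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_child_procs_regex "
    "-a $ARG1$ $ARG2$ $ARG3$ $ARG4$"
)

print(expand_command_line(
    COMMAND_LINE, "tools/release/signing/signing-server.py!1!2!2"
))
```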
Attachment #581273 - Flags: review?(catlee) → review+
Attachment #581273 - Attachment description: update nrpe.cfg template with check_child_procs_regex command → [checked in] update nrpe.cfg template with check_child_procs_regex command
Both the signing servers picked up the change, and the checks are now green - I think we're all done here?
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations
