Closed Bug 723815 Opened 12 years ago Closed 12 years ago

Redo nagios checks for rsync module checks on surf

Categories

(Infrastructure & Operations :: RelOps: General, task)



RESOLVED FIXED

People

(Reporter: nthomas, Assigned: arich)

Details

The checks in question appear as
 mozilla-releases rsync current
 mozilla-releases rsync pre-releases
 mozilla-releases rsync releases
in the web/irc interfaces, and are defined as
 check_rsync_current_size
 check_rsync_releases_size
 check_rsync_prereleases_size
in surf:/etc/nagios/nrpe.d/check-release-size.cfg

These checks add load to the netapps by running rsync -n to get module sizes, although probably less than the machines actually fetching the files. My main objection is the nagios behavior of checking more frequently once a check goes WARNING/CRITICAL, which tends to pound a server that is already behind.
I agree that the less load on surf/the netapp the better, but is there some other way we intend to monitor this? I presume we had a reason to in the past.
Assignee: server-ops-releng → arich
Over in bug 725711 I've enhanced the rsync motd so that it gives sizes for mozilla-prereleases, mozilla-releases, and mozilla-current, updated daily at about 4am Pacific. We can use that to add checks on mozilla-prereleases and mozilla-current, and replace the current one for mozilla-releases.

At surf:/usr/lib/nagios/plugins/contrib/check_rsync_releases_size.py there is a simple python script that reads the motd file and extracts the size. It is based on rtucker's version, which did an rsync -n but took too long for nagios to cope with (backed up at ~nthomas/check_rsync_releases_size.py.rtucker). I'm hoping puppet won't come along and wipe the new version, but there's a copy in ~nthomas if it does.
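
For anyone reading along without access to surf, here is a rough sketch of what a motd-based size check could look like. It is not the script deployed on surf. The motd line format ('mozilla-releases: 110 GB'), the function names, and the upper-bound-only threshold logic are all assumptions made for illustration; only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard nagios plugin convention.

```python
import re

# Standard nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_module_size(motd_text, module):
    """Return the size in GB listed for `module`, or None if not found.

    Assumes a hypothetical motd line such as 'mozilla-releases: 110 GB';
    the real motd layout may differ.
    """
    pattern = re.compile(
        r"^[ \t]*" + re.escape(module) + r"[ \t]*:[ \t]*(\d+(?:\.\d+)?)[ \t]*GB[ \t]*$",
        re.MULTILINE,
    )
    match = pattern.search(motd_text)
    return float(match.group(1)) if match else None


def evaluate(module, size_gb, warn_gb, crit_gb):
    """Map a module size onto a (exit_code, message) pair.

    Only sizes at or above the thresholds alert (an assumption);
    a module that is smaller than expected is reported as OK.
    """
    if size_gb is None:
        return UNKNOWN, "UNKNOWN: no size found for %s" % module
    if size_gb >= crit_gb:
        return CRITICAL, "CRITICAL: %s is %g GB (limit %g GB)" % (module, size_gb, crit_gb)
    if size_gb >= warn_gb:
        return WARNING, "WARNING: %s is %g GB (limit %g GB)" % (module, size_gb, warn_gb)
    return OK, "OK: %s is %g GB" % (module, size_gb)
```

A real plugin would read the motd file, print the message, and call sys.exit() with the code. Because the motd is only regenerated daily, a check like this costs a file read rather than an rsync -n against the netapp.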

I suggest we call the commands defined in surf:/etc/nagios/nrpe.d/check-release-size.cfg, where I've adjusted the thresholds to sensible values. If you prefer to consolidate that into one definition and pass the three arguments from the nagios server, feel free. We should also deprecate the call to check_moz_rel_rsync (defined in surf:/etc/nagios/nrpe.d/rsync-size.cfg).
To make things clear and sane (unlike the old rsync checks which had all sorts of different names pointing at different things), the configuration below is now being used.  nthomas, can you please verify that the checks on surf match what you were after (and that I haven't disabled anything that should still be there)?  

Also note that prereleases is currently over size and that releases is *under* size, but only prereleases is alerting.  Was that the intended behavior of your script?

====

The individual service check definitions for each category in /etc/nagios/mpt/services.cfg:

define service{
    use                             generic-service
    host_name                       surf
    service_description             rsync size mozilla-current
    contact_groups                  build
    notification_options            u,c,r
    normal_check_interval           360
    max_check_attempts              4
    retry_check_interval            360
    notification_interval           1440
    check_command                   check_rsync_releases_size!mozilla-current!30!40
}
define service{
    use                             generic-service
    host_name                       surf
    service_description             rsync size mozilla-prereleases
    contact_groups                  build
    notification_options            u,c,r
    normal_check_interval           360
    max_check_attempts              4
    retry_check_interval            360
    notification_interval           1440
    check_command                   check_rsync_releases_size!mozilla-prereleases!230!250
}

define service{
    use                             generic-service
    host_name                       surf
    service_description             rsync size mozilla-releases
    contact_groups                  build
    notification_options            u,c,r
    normal_check_interval           360
    max_check_attempts              4
    retry_check_interval            360
    notification_interval           1440
    check_command                   check_rsync_releases_size!mozilla-releases!125!140
}
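
As a sanity check on the timing in these definitions, the arithmetic below converts the interval values to hours. It assumes nagios is running with its default interval_length of 60 seconds per unit, which has not been confirmed for these servers; the helper names are made up for this sketch.

```python
# Assumed: nagios default interval_length of 60 seconds per time unit.
INTERVAL_LENGTH_SECONDS = 60


def to_hours(units):
    """Convert a nagios interval value (in time units) to hours."""
    return units * INTERVAL_LENGTH_SECONDS / 3600.0


def hours_until_hard_state(max_check_attempts, retry_check_interval):
    """Hours from the first non-OK result until the state goes hard.

    The first non-OK check counts as attempt 1, so the remaining
    (max_check_attempts - 1) attempts each wait one retry interval.
    """
    return to_hours((max_check_attempts - 1) * retry_check_interval)


print(to_hours(360))                   # normal and retry interval, in hours
print(hours_until_hard_state(4, 360))  # delay before a hard state
print(to_hours(1440))                  # gap between repeat notifications
```

Under that assumption the checks run every 6 hours, and because retry_check_interval equals normal_check_interval, a WARNING/CRITICAL result does not make nagios check any more often.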


========================

The global check definition in /etc/nagios/checkcommands.cfg  on mradm01 and dm-nagios01:

define command{
    command_name    check_rsync_releases_size
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 15 -c check_rsync_releases_size -a $ARG1$ $ARG2$ $ARG3$
}
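
To illustrate how the arguments flow, the sketch below mimics the way nagios splits a check_command on "!" and substitutes the pieces into $ARG1$, $ARG2$, and so on. It is a simplified model written for this bug, not nagios code: expand_args is a hypothetical helper, and other macros such as $USER1$ and $HOSTADDRESS$ are left untouched.

```python
def expand_args(check_command, command_line):
    """Split 'name!arg1!arg2...' and fill $ARGn$ macros in command_line.

    Returns (command_name, expanded_command_line). Only $ARGn$ macros
    are handled; every other macro is left as-is.
    """
    name, *args = check_command.split("!")
    for index, value in enumerate(args, start=1):
        command_line = command_line.replace("$ARG%d$" % index, value)
    return name, command_line


name, line = expand_args(
    "check_rsync_releases_size!mozilla-current!30!40",
    "$USER1$/check_nrpe -H $HOSTADDRESS$ -t 15 "
    "-c check_rsync_releases_size -a $ARG1$ $ARG2$ $ARG3$",
)
print(name)
print(line)
```

So the module name and the two thresholds travel from the service definition through check_nrpe to the script on surf, which is why one command definition can serve all three modules.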

============================

The check definition in /etc/nagios/nrpe.d/check_rsync_releases_size.cfg on surf:

command[check_rsync_releases_size]=/usr/lib/nagios/plugins/contrib/check_rsync_releases_size.py $ARG1$ $ARG2$ $ARG3$

============================

And then the script itself is located on surf at:

/usr/lib/nagios/plugins/contrib/check_rsync_releases_size.py
This all looks great to me, thanks for cleaning it all up and documenting it here.

I'm not surprised prereleases is alerting at the moment given the recent gaggle of releases, and the limits on it are an estimate because we haven't monitored it before now. We've had issues with cn-adm01.cn getting low on space from this module, so I don't want to increase them until I've looked at doing some cleanup first. There should be lots there that can be removed as the update traffic dies away on the older releases.
Summary: Disable nagios checks for rsync module checks on surf → Redo nagios checks for rsync module checks on surf
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations