Closed Bug 674504 Opened 13 years ago Closed 13 years ago

please add nagios check for stage-rsync.m.o:mozilla-prereleases module size

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: rtucker)

Details

We've got an existing nagios check on releases-rsync.m.o:mozilla-releases that watches that module size, can we get one set up for stage-rsync.m.o:mozilla-prereleases with the same thresholds?
Could you point me at the existing check?  I couldn't find anything obvious by looking at the web interface or the config files.
Assignee: server-ops-releng → arich
I'm not sure if it's part of the check_releasesrsynclag plugin or something else, but it occurs to me that I'll need to hand this over to infra since I don't have permissions to install plugins on these hosts, anyway.
Assignee: arich → server-ops
Component: Server Operations: RelEng → Server Operations
QA Contact: zandr → mrz
Assignee: server-ops → rtucker
(In reply to comment #1)
> Could you point me at the existing check?  I couldn't find anything obvious
> by looking at the web interface or the config files.

Whoops, it's on surf, apparently:
https://nagios.mozilla.org/nagios/cgi-bin/extinfo.cgi?type=2&host=surf&service=mozilla-releases+rsync+size
I just want to make sure that I'm understanding this properly.

Duplicate this check:
https://nagios.mozilla.org/nagios/cgi-bin/extinfo.cgi?type=2&host=surf&service=mozilla-releases+rsync+size

for host:
stage-rsync.m.o:mozilla-prereleases

Once this is confirmed, I'll set it up.
Right now we're syncing the modules onto stage so that we can run du on them and populate /pub/mozilla.org/zz/rsyncd-motd, which the nagios check then reads. Those copies take up as much disk space as the rsync modules themselves: 135G at the moment for -releases and -current, and more like 255G if we add prereleases.

I think we can do better by having a cron job that does rsync -nav, and uses the info in the last couple of lines, eg for mozilla-releases:
 sent 242564 bytes  received 2027041 bytes  6744.74 bytes/sec
 total size is 114676442966  speedup is 50527.05 (DRY RUN)
That total size (106G) is what we want to know for the motd/nagios, without using up the space.
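As a worked check of the arithmetic above, the byte count on the 'total size' line can be converted to gigabytes as in the sketch below. The helper name `bytes_to_gib` is hypothetical, and the sketch assumes the "106G" figure refers to binary gigabytes (2**30 bytes) rather than decimal ones.

```python
GIB = 1024 ** 3  # bytes per binary gigabyte (assumption: "G" here means GiB)


def bytes_to_gib(size_bytes: int) -> float:
    """Convert a byte count to binary gigabytes."""
    return size_bytes / GIB


# Byte count taken from the mozilla-releases dry run quoted above.
total = 114676442966

print(f"{bytes_to_gib(total):.1f}G (binary)")
print(f"{total / 10 ** 9:.1f}GB (decimal)")
```

Truncating the binary figure gives 106 and rounding it gives 107, so either whole number is a fair reading of the same byte count.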
Also, we need to be careful how stuff is set up:

 stage.m.o::mozilla-releases is really mozilla-prereleases (for stage-rsync)
 stage.m.o::mozilla-releases-mirrors is really mozilla-releases (for motd)

 stage-rsync.m.o:mozilla-releases is mozilla-releases

 mozilla-current is mozilla-current everywhere
Nick,
Did you ever figure out all the criteria for this?
Rob, my thoughts are 

* we need a (new?) nagios plugin which does an rsync -nav on a given host::module, looks for a line starting 'total size', and reports the fourth word in that line, converted from bytes into gigabytes. Eg
  total size is 114676442966  speedup is 50527.05 (DRY RUN) 
becomes
  module size is 107G
 
* three checks using that plugin, running on surf.m.o
 * 'mozilla-releases rsync size',    using localhost::mozilla-releases-mirrors
 * 'mozilla-prereleases rsync size', using localhost::mozilla-releases
 * 'mozilla-current rsync size',     using localhost::mozilla-current

If you can give me the limits for the existing check ('mozilla-releases rsync size') we can work up some for the latter two.
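The plugin described in this comment could be sketched roughly as follows. This is an illustration only, not the plugin that was deployed: the names `CHECKS`, `parse_total_size`, `module_size_message` and `dry_run_command` are hypothetical, the destination `.` is an assumption, and binary gigabytes are assumed for the conversion.

```python
from typing import List, Optional

# Check name -> rsync source, as listed in the comment above.
CHECKS = {
    "mozilla-releases rsync size": "localhost::mozilla-releases-mirrors",
    "mozilla-prereleases rsync size": "localhost::mozilla-releases",
    "mozilla-current rsync size": "localhost::mozilla-current",
}


def dry_run_command(source: str, dest: str = ".") -> List[str]:
    """Build the dry-run command line: rsync -nav <host::module> <dest>."""
    return ["rsync", "-nav", source, dest]


def parse_total_size(rsync_output: str) -> Optional[int]:
    """Return the byte count from the 'total size' line, or None if absent."""
    for line in rsync_output.splitlines():
        if line.startswith("total size"):
            # The fourth word is the size in bytes. Some rsync versions print
            # digit-group separators, so strip commas before converting.
            return int(line.split()[3].replace(",", ""))
    return None


def module_size_message(rsync_output: str) -> str:
    """Format the size as whole binary gigabytes, e.g. 'module size is 107G'."""
    size = parse_total_size(rsync_output)
    if size is None:
        return "module size is unknown"
    return f"module size is {round(size / 1024 ** 3)}G"
```

Keeping the parsing separate from the command invocation means the size logic can be exercised against saved rsync output without touching the network.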
Nick,
So the thresholds for the existing check are 110GB. The check is simply doing:
SIZE_REL=`grep releases /pub/mozilla.org/zz/rsyncd-motd | sed -re 's/.*: ([0-9 ]+)GB/\1/'`

I don't see the directories you're referring to on surf (mozilla-releases, mozilla-prereleases). Do you have any additional information?
Nick,
So I did some more digging, on pv-mirror01:

[root@pv-mirror01 tmp]# rsync -nav /root/mozilla-current
sending incremental file list
-rw-r--r--      495713 2010/04/20 08:32:35 mozilla-current

sent 40 bytes  received 12 bytes  104.00 bytes/sec
total size is 495713  speedup is 9532.94 (DRY RUN)

[root@pv-mirror01 tmp]# rsync -nav /root/mozilla-releases
sending incremental file list
-rw-r--r--     6207751 2010/04/20 08:38:46 mozilla-releases

sent 41 bytes  received 12 bytes  106.00 bytes/sec
total size is 6207751  speedup is 117127.38 (DRY RUN)


Still nothing for mozilla-prereleases though
The double colon notation in comment #8 is rsync's way of specifying host::module. 
eg: this is prereleases
nthomas@surf:~$ time rsync -nav localhost::mozilla-releases . | grep ^total
total size is 129286759754  speedup is 55571.09 (DRY RUN)

real	1m13.785s

NB: in comment #8, the nagios check name isn't the same as the module name for historical reasons.
Nick,
I've got nrpe checks that work for both localhost::mozilla-current and localhost::mozilla-releases, but nothing for localhost::mozilla-prereleases, as I can't seem to get rsync -nav to work against that one.

Do you know of a different way to access that rsync module?

I changed the nrpe script execution timeout to 300 for this to work since the default value of 60 seconds isn't long enough. I'm not sure if this is going to get clobbered by puppet or not. Here are the script executions from mradm01 and the responses

check_nrpe -H 10.2.74.116 -t 300 -c check_rsync_releases_size
OK: RSYNC SIZE is 132.77GB

/usr/lib/nagios/plugins/check_nrpe -H 10.2.74.116 -t 120 -c check_rsync_current_size
OK: RSYNC SIZE is 34.22GB

How often should these be checked?
What should the warning and critical values for each be set at?
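Since the dry run can take longer than the default 60-second execution timeout, a plugin may also want to bound its own run time and fail cleanly instead of being killed. The sketch below shows one way to do that with Python's `subprocess` module. The function name `run_with_timeout` and the choice to report a timeout as a failed result are assumptions, not what the deployed check does.

```python
import subprocess
from typing import List, Tuple


def run_with_timeout(command: List[str], timeout_seconds: float) -> Tuple[bool, str]:
    """Run a command and return (succeeded, text).

    On success, text is the command's standard output. On a timeout or a
    non-zero exit status, text is a short explanation instead.
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,  # collect stdout and stderr
            text=True,            # decode output as str
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return False, f"command timed out after {timeout_seconds}s"
    if result.returncode != 0:
        return False, f"command exited with status {result.returncode}"
    return True, result.stdout
```

The plugin's own timeout would need to sit below the nrpe timeout (300 seconds in the commands above) so that the plugin reports the failure itself.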
(In reply to Rob Tucker [:rtucker] from comment #12)

$ rsync localhost::
<snip motd>
mozilla-all    	Mozilla FTP
mozilla-releases	Mozilla Software Releases
mozilla-releases-mirrors	Mozilla Software Releases (for Mirrors)
mozilla-current	Mozilla Current Release Only - high bandwidth low disk space
releases-com   	Mozilla Corporation Partner Releases
$ grep pre /etc/rsyncd.conf
<nothing relevant>

There doesn't seem to be a mozilla-prereleases rsync module on stage.m.o.
So that leads me to believe that we're good with just mozilla-current and mozilla-releases, though I may well be wrong.

If that is the case I just need to know how often to check, the thresholds and possibly fix puppet clobbering my updated config file.
Please take another look at the second paragraph of comment #8 for the details of the names for the checks and modules.

A limit of 110GB is fine for the mozilla-prereleases and mozilla-releases modules, and let's use 40GB for mozilla-current (for now anyway). The checks can go directly to CRITICAL on crossing those values. Checking once a day will still be fine. Notifications should go to #build and RelEng people (see the existing check).
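The limits requested here can be expressed as a small threshold function. This is a sketch under stated assumptions: the names `LIMITS_GB` and `evaluate` are hypothetical, "crossing" is read as strictly greater than the limit, and the message format simply imitates the "RSYNC SIZE is ...GB" output shown earlier in this bug. The exit codes 0 and 2 are the conventional Nagios plugin codes for OK and CRITICAL.

```python
from typing import Tuple

OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes

# Limits in GB as requested above. There is no separate WARNING level:
# a check goes straight from OK to CRITICAL.
LIMITS_GB = {
    "mozilla-releases": 110,
    "mozilla-prereleases": 110,
    "mozilla-current": 40,
}


def evaluate(name: str, size_gb: float) -> Tuple[int, str]:
    """Return (exit_code, status_line) for a module of the given size."""
    limit = LIMITS_GB[name]
    if size_gb > limit:  # assumption: exactly at the limit is still OK
        return CRITICAL, f"CRITICAL: RSYNC SIZE is {size_gb:.2f}GB (limit {limit}GB)"
    return OK, f"OK: RSYNC SIZE is {size_gb:.2f}GB"
```

With these limits, a module at 34.22GB against the 40GB limit stays OK, while anything above 110GB on the releases or prereleases modules is reported as CRITICAL.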
I just finished adding the 3 checks, they are green and setup as requested. Closing this one out!
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Almost there, just would like a few more tweaks:

* 'mozilla-releases rsync pre-releases' is 'OK: RSYNC SIZE is 133.98GB', but that should be over the 110G limit and CRITICAL ?
* verify how frequently the checks run, seems to be more frequent than once a day. Does it depend on the state of the check ?
* at one point I saw all three checks with 'Service Check Timed Out', may need a longer timeout and/or to make sure the checks don't run at the same time
* the old check 'mozilla-releases rsync size' can be removed
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I originally raised the warning and critical thresholds so that these checks wouldn't page. Evidently you want the checks set up while the services are in a failed state, so I have now set the thresholds exactly where you wanted them, and they will page.

I also copied the settings from the existing rsync check, as you said. I have just backtracked on that and hard-set them to 1440.

I cannot increase the timeout any further; it's at 5 minutes. If a check needs longer than that, the problem isn't the check, it's the box.
Status: REOPENED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard