Closed Bug 656916 Opened 13 years ago Closed 13 years ago

please add nagios http redirect check to buildbot masters

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: arich)

References

Details

Attachments

(3 files, 3 obsolete files)

In bug 646076 I've set up Bouncer to limit the build network to Mozilla-internal mirrors only. There's some concern that these changes could be lost at some point, so I'd like nagios checks to verify.

The most stable Bouncer product that'll work for this -- that I can find -- is http://download.mozilla.org/?product=firefox-4.0-euballot&os=win&lang=en-GB.

Is it possible to check that hitting that URL gets a 403 to a host that resolves to an internal IP address? I'm hoping we can check that instead of for a specific hostname, to future-proof us against the hostname changing.

The URL will need updating at every major release (read: every quarter).

I'd like this check run on all hosts in the releng-buildbot-master and 'releng-buildbot-masters-scl1 groups, which will get us coverage in every colo.
If the check is run *on* the masters, then it can be arbitrarily complex, and just added to the nrpe.cfg on the masters.

Ben, if you can write up a short shell script that will do the check (following http://nagios.sourceforge.net/docs/3_0/pluginapi.html), and add that script to nrpe.cfg, and wrap the whole thing up in a puppet patch, then adding the check on the nagios servers can happen once it's installed.
Assignee: server-ops-releng → bhearsum
Sounds good, thanks for the pointer!
This patch centralizes all of the Master and Slave Nagios installation/configuration into one module, used by all slave and buildbot master nodes. Dustin, can you tell me if this looks sane or requires any big changes, before I spend time testing it everywhere?

One thing that's a little weird is that num_masters gets set in the Master node definition, but is used in the template defined by nagios::service. With that being a shared class between masters/slaves I wasn't sure how else to do it.
Attachment #532664 - Flags: feedback?(dustin)
Comment on attachment 532664 [details] [diff] [review]
centralize nagios for masters/slaves into one module

looks good.  The num_masters thing should eventually be taken care of through puppet mechanisms, so it's OK for now.
Attachment #532664 - Flags: feedback?(dustin) → feedback+
This patch centralizes all the existing Nagios stuff in Puppet into one module. Other than the check_http_redirect bits, all of this is just re-arranging existing checks, and importing CentOS/Darwin config files into the repository.

I tested on all of: 32-bit CentOS, 64-bit CentOS, 10.5 build, 10.6 build, dev-master01.

For CentOS, the only change made was whitespace in nrpe.cfg (because the 32 and 64 versions differed slightly in that way). For Darwin, it was a complete no-op. For dev-master01, the expected changes were made: nrpe.cfg was updated, check_http_redirect was created.

I also tested that check_http_redirect will work for our purposes:
./check_nrpe -H localhost -c check_http_redirect -a "http%3A%2F%2Fdownload.mozilla.org%2F%3Fproduct%3Dfirefox-4.0-euballot%26os%3Dwin%26lang%3Den-GB" 10.
Redirect OK: Host: dm-download02.mozilla.org, IP: 10.2.74.17
Attachment #532664 - Attachment is obsolete: true
Attachment #533007 - Flags: review?(dustin)
Attachment #533007 - Flags: review?(catlee)
No longer blocks: 646046
Comment on attachment 533007 [details] [diff] [review]
nagios puppet module for masters/slaves, tested version

Review of attachment 533007 [details] [diff] [review]:
-----------------------------------------------------------------

::: classes/buildslave.pp
@@ +4,4 @@
>  ### in the /manifests/packages/ tree
>  
>  class buildslave {
> +    include nagios::install

I'd rather spell this with a simple 'include nagios' - class nagios {} in init.pp can then include the sub-modules.

::: modules/nagios/manifests/service.pp
@@ +11,5 @@
> +                i686   => "lib",
> +                x86_64 => "lib64"
> +            }
> +            file {
> +                "/etc/nagios/nrpe.cfg":

Can you encode the nrpe.cfg conditionals into a single template, instead?  Most of the file is identical between systems..
Attachment #533007 - Flags: review?(dustin) → review-
(In reply to comment #6)
> >  class buildslave {
> > +    include nagios::install
> 
> I'd rather spell this with a simple 'include nagios' - class nagios {} in
> init.pp can then include the sub-modules.

Just to be clear: You want me to drop the "include nagios::service" lines from the buildslaves classes and master nodes, and add one to the "nagios" class?

> ::: modules/nagios/manifests/service.pp
> @@ +11,5 @@
> > +                i686   => "lib",
> > +                x86_64 => "lib64"
> > +            }
> > +            file {
> > +                "/etc/nagios/nrpe.cfg":
> 
> Can you encode the nrpe.cfg conditionals into a single template, instead? 
> Most of the file is identical between systems..

Sure.
(In reply to comment #7)
> Just to be clear: You want me to drop the "include nagios::service" lines
> from the buildslaves classes and master nodes, and add one to the "nagios"
> class?

Yep:

modules/nagios/manifests/init.pp:
class nagios {
    include nagios::service, nagios::install
}

buildslave.pp:
class buildslave {
    include nagios
}

> Sure.

Thanks, I know that's a bunch of work, but I think it will make future maintenance easier as we try to do things across all machines.  Keep in mind that having extra, unused lines in there isn't a big deal -- that might make the logic simpler.
Attachment #533007 - Attachment is obsolete: true
Attachment #533007 - Flags: review?(catlee)
Attachment #533067 - Flags: review?(dustin)
Attachment #533067 - Flags: review?(catlee)
Comment on attachment 533067 [details] [diff] [review]
[checked in] v3, dustin's review comments addressed

Review of attachment 533067 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #533067 - Flags: review?(dustin) → review+
Comment on attachment 533067 [details] [diff] [review]
[checked in] v3, dustin's review comments addressed

I'd rather not have import "nagios", but not going to block on that.
Attachment #533067 - Flags: review?(catlee) → review+
Comment on attachment 533067 [details] [diff] [review]
[checked in] v3, dustin's review comments addressed

Landed this, without the "import nagios" bit (because it didn't end up being necessary), and updated all of the Puppet masters:
changeset:   352:a6bdc303a09f
Attachment #533067 - Attachment description: v3, dustin's review comments addressed → [checked in] v3, dustin's review comments addressed
Getting some errors:
 Could not retrieve catalog: Could not find class secrets in namespaces nagios::service at /etc/puppet/modules/nagios/manifests/service.pp:4 on node mv-moz2-linux-ix-slave03.build.mozilla.org

Turns out my bustage fix created more bustage, because secrets isn't applicable, except on Masters. This bustage fix fix should get things going again.
Attachment #534768 - Flags: review?(dustin)
Comment on attachment 534768 [details] [diff] [review]
[checked in] fix secrets in nagios::service

Please add a comment as to why the include is buried there - but otherwise, looks good.
Attachment #534768 - Flags: review?(dustin) → review+
Comment on attachment 534768 [details] [diff] [review]
[checked in] fix secrets in nagios::service

Landed, with a comment.
changeset:   354:6fd021161433
Attachment #534768 - Attachment description: fix secrets in nagios::service → [checked in] fix secrets in nagios::service
Rolled out nrpe.cfg/check_http_redirect changes on the following machines by hand (because they're not in Puppet):
buildbot-master1
buildbot-master2
buildbot-master3
buildbot-master5

I didn't bother with older masters on dev/pp because a) most of them are going away and b) production alone gets us the coverage we need. Anything set-up through Puppet in the future will get them, though.
Amy,

We're now ready to add these checks. Specifically, buildbot-master1, 2, 3, 04, 5, and 6 (as well as 07 - 14 when they get added to Nagios), and dev-master01 need a "check_http_redirect" added. It needs two arguments, the first is "http%3A%2F%2Fdownload.mozilla.org%2F%3Fproduct%3Dfirefox-4.0-euballot%26os%3Dwin%26lang%3Den-GB" (encoded because NRPE won't let ampresands through), the second is "10.". This will check that the IP the URL redirects to somewhere inside of our network.
Assignee: bhearsum → arich
In addition to the nagios plugin (which I've copied to the buildbot servers), this check is also going to require the local installation of LWP::UserAgent and an nrpe.cfg modification.  Who takes care of adding perl modules to the buildbot servers (I don't want to interfere with normal operation).

Also, these things (the check file, the nrpe.cfg change, and the perl dependencies) should be handled by puppet on any new buildbot-master servers.  Catlee, do you want to look into doing this since you're working on the master config?
I'm confused - how can a Python script can require a perl module?

Perl modules should be installed as RPMs via puppet.

Maybe it's time to start writing puppet patches ;)
I'm confused too. To be clear, the check_http_redirect installed on the masters is a python script of my own creation, because the Perl version of the same name didn't work for our use case. I've already tested it through check_nrpe, too.
Ah, sorry, I thought we were still using the perl version of the check_http_redirect script.  If not, great! Lets name it something else to avoid confusion.
Agreed - it's one more patch to land, but I think it will disambiguate later.  I didn't realize there was already a "built-in" nagios test of that name.
The version I found didn't seem to be any official supported thing (http://www.purple.org.ua/wp-content/uploads/2009/09/check_http_redirect.pl), but I have a follow-up patch for another issue already....so may as well!
I built this patch on top of your ganglia one, since you did some refactoring. I didn't include anything to delete the existing check_http_redirect file because it's simple enough to do by hand.

Ran on dev-master01, with the following output:
notice: //buildmaster/Package[mysql-devel]/ensure: ensure changed '5.0.77-4.el5_5.5' to '5.0.77-4.el5_6.6'
notice: //nagios::install/Nagios::Install::Plugin[check_http_redirect_ip]/File[/usr/lib64/nagios/plugins/check_http_redirect_ip]/ensure: content changed '{md5}194d961e0185ef9991eacce89defec2f' to '{md5}194d961e0185ef9991eacce89defec2f'
--- /etc/hosts	2010-10-29 00:20:34.000000000 -0700
+++ /tmp/puppet-diffing.10707.0	2011-05-25 05:29:46.000000000 -0700
@@ -1,4 +1,3 @@
-# Do not remove the following line, or various programs
-# that require network functionality will fail.
-127.0.0.1		localhost.localdomain localhost
-::1		localhost6.localdomain6 localhost6
+# (managed by puppet)
+127.0.0.1               localhost.localdomain localhost
+::1             localhost6.localdomain6 localhost6
notice: //network::hosts/File[/etc/hosts]/content: content changed '{md5}3f6bc5fdac37347e2c6913259470ae9c' to '{md5}2d131c811c19033d550adfffa208a08e'
--- /etc/nagios/nrpe.cfg	2011-05-17 13:54:49.000000000 -0700
+++ /tmp/puppet-diffing.10707.0	2011-05-25 05:29:46.000000000 -0700
@@ -28,9 +28,10 @@
 # Masters only
 command[check_swap]=/usr/lib64/nagios/plugins/check_swap -w $ARG1$ -c $ARG2$
 command[check_buildbot]=/usr/lib64/nagios/plugins/check_procs -c 0:0 -C buildbot
-command[check_mysql]=/usr/lib64/nagios/plugins/check_mysql -H tm-b01-master01.mozilla.org
+command[check_mysql]=/usr/lib64/nagios/plugins/check_mysql -H tm-b01-master01.mozilla.org -u buildbot -p H63732fW54vm
 command[check_ntp_time]=/usr/lib64/nagios/plugins/check_ntp_time -H $ARG1$ -w $ARG2$ -c $ARG3$
-command[check_http_redirect]=/usr/lib64/nagios/plugins/check_http_redirect -U $ARG1$ -I $ARG2$
+command[check_http_redirect_ip]=/usr/lib64/nagios/plugins/check_http_redirect_ip -U $ARG1$ -I $ARG2$
+command[check_ganglia]=/usr/lib64/nagios/plugins/check_ganglia -h $ARG1$ -s $ARG2$ -m $ARG3$ -w $ARG4$ -c $ARG5$
 
 
 
notice: //nagios::service/File[/etc/nagios/nrpe.cfg]/content: content changed '{md5}cb6d55f80b368aa6eb4f299fde2bf6ab' to 'unknown checksum'
info: //nagios::service/File[/etc/nagios/nrpe.cfg]: Scheduling refresh of Service[nrpe]
notice: //nagios::service/Service[nrpe]: Triggering 'refresh' from 1 dependencies
notice: //nagios::install/Nagios::Install::Plugin[check_ganglia]/File[/usr/lib64/nagios/plugins/check_ganglia]/ensure: content changed '{md5}fd3d08d9950b235b7f2a9639422e6f39' to '{md5}fd3d08d9950b235b7f2a9639422e6f39'
notice: Finished catalog run in 16.71 seconds

I'm not really sure why /etc/hosts is being synced, but dev-master01 doesn't run puppetd regularly, so I'm not surprised it's a bit out of sync.
Attachment #535039 - Flags: review?(dustin)
Comment on attachment 535039 [details] [diff] [review]
move "include nagios" to master class, rename http plugin

I see the change to nrpe.cfg in your output, but I don't see it in this patch?
Attachment #535039 - Flags: review?(dustin) → review+
Sorry, forgot to rediff before I attached.
Attachment #535039 - Attachment is obsolete: true
Attachment #535103 - Flags: review?(dustin)
Attachment #535103 - Flags: review?(dustin) → review+
Attachment #535103 - Attachment description: again, with the right diff → [checked in] again, with the right diff
Amy, this is ready to go for buildbot-master1,2,3,04,5,06 now.
These checks are now active on the following servers:

SCL1:

$production-buildbot-masters = buildbot-master1.build.scl1,buildbot-master2.build.scl1,buildbot-master04.build.scl1,buildbot-master5.build.scl1,buildbot-master06.build.scl1


MTV1:

$production-mtv1-buildbot-masters = buildbot-master3.build.mtv1,test-master1.build.mtv1,production-mobile-master.build.mtv1

There are no servers in sjc1 that are currently running the check.  The group that contains the buildbot servers for that datacenter is:

SJC1:

$production-sjc1-buildbot-masters = production-master1.build.sjc1,production-master2.build.sjc1,production-master3.build.sjc1,preproduction-master.build.sjc1


Let me know if/when you would like to add the hosts in sjc1.
Checks have been added to the sjc1 hosts as well.  Tested with all three colos, and looks to be working.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: