please add nagios http redirect check to buildbot masters

RESOLVED FIXED

Status

Infrastructure & Operations
RelOps
RESOLVED FIXED
7 years ago
4 years ago

People

(Reporter: bhearsum, Assigned: arr)

Tracking

Details

Attachments

(3 attachments, 3 obsolete attachments)

(Reporter)

Description

7 years ago
In bug 646076 I've set up Bouncer to limit the build network to Mozilla-internal mirrors only. There's some concern that these changes could be lost at some point, so I'd like nagios checks to verify.

The most stable Bouncer product that'll work for this -- that I can find -- is http://download.mozilla.org/?product=firefox-4.0-euballot&os=win&lang=en-GB.

Is it possible to check that hitting that URL gets a 403 to a host that resolves to an internal IP address? I'm hoping we can check that instead of for a specific hostname, to future-proof us against the hostname changing.

The URL will need updating at every major release (read: every quarter).

I'd like this check run on all hosts in the releng-buildbot-master and 'releng-buildbot-masters-scl1 groups, which will get us coverage in every colo.
If the check is run *on* the masters, then it can be arbitrarily complex, and just added to the nrpe.cfg on the masters.

Ben, if you can write up a short shell script that will do the check (following http://nagios.sourceforge.net/docs/3_0/pluginapi.html), and add that script to nrpe.cfg, and wrap the whole thing up in a puppet patch, then adding the check on the nagios servers can happen once it's installed.
Assignee: server-ops-releng → bhearsum
(Reporter)

Comment 2

7 years ago
Sounds good, thanks for the pointer!
(Reporter)

Comment 3

7 years ago
Created attachment 532664 [details] [diff] [review]
centralize nagios for masters/slaves into one module

This patch centralizes all of the Master and Slave Nagios installation/configuration into one module, used by all slave and buildbot master nodes. Dustin, can you tell me if this looks sane or requires any big changes, before I spend time testing it everywhere?

One thing that's a little weird is that num_masters gets set in the Master node definition, but is used in the template defined by nagios::service. With that being a shared class between masters/slaves I wasn't sure how else to do it.
Attachment #532664 - Flags: feedback?(dustin)
Comment on attachment 532664 [details] [diff] [review]
centralize nagios for masters/slaves into one module

looks good.  The num_masters thing should eventually be taken care of through puppet mechanisms, so it's OK for now.
Attachment #532664 - Flags: feedback?(dustin) → feedback+
(Reporter)

Comment 5

7 years ago
Created attachment 533007 [details] [diff] [review]
nagios puppet module for masters/slaves, tested version

This patch centralizes all the existing Nagios stuff in Puppet into one module. Other than the check_http_redirect bits, all of this is just re-arranging existing checks, and importing CentOS/Darwin config files into the repository.

I tested on all of: 32-bit CentOS, 64-bit CentOS, 10.5 build, 10.6 build, dev-master01.

For CentOS, the only change made was whitespace in nrpe.cfg (because the 32 and 64 versions differed slightly in that way). For Darwin, it was a complete no-op. For dev-master01, the expected changes were made: nrpe.cfg was updated, check_http_redirect was created.

I also tested that check_http_redirect will work for our purposes:
./check_nrpe -H localhost -c check_http_redirect -a "http%3A%2F%2Fdownload.mozilla.org%2F%3Fproduct%3Dfirefox-4.0-euballot%26os%3Dwin%26lang%3Den-GB" 10.
Redirect OK: Host: dm-download02.mozilla.org, IP: 10.2.74.17
Attachment #532664 - Attachment is obsolete: true
Attachment #533007 - Flags: review?(dustin)
Attachment #533007 - Flags: review?(catlee)
(Reporter)

Updated

7 years ago
No longer blocks: 646046
Comment on attachment 533007 [details] [diff] [review]
nagios puppet module for masters/slaves, tested version

Review of attachment 533007 [details] [diff] [review]:
-----------------------------------------------------------------

::: classes/buildslave.pp
@@ +4,4 @@
>  ### in the /manifests/packages/ tree
>  
>  class buildslave {
> +    include nagios::install

I'd rather spell this with a simple 'include nagios' - class nagios {} in init.pp can then include the sub-modules.

::: modules/nagios/manifests/service.pp
@@ +11,5 @@
> +                i686   => "lib",
> +                x86_64 => "lib64"
> +            }
> +            file {
> +                "/etc/nagios/nrpe.cfg":

Can you encode the nrpe.cfg conditionals into a single template, instead?  Most of the file is identical between systems..
Attachment #533007 - Flags: review?(dustin) → review-
(Reporter)

Comment 7

7 years ago
(In reply to comment #6)
> >  class buildslave {
> > +    include nagios::install
> 
> I'd rather spell this with a simple 'include nagios' - class nagios {} in
> init.pp can then include the sub-modules.

Just to be clear: You want me to drop the "include nagios::service" lines from the buildslaves classes and master nodes, and add one to the "nagios" class?

> ::: modules/nagios/manifests/service.pp
> @@ +11,5 @@
> > +                i686   => "lib",
> > +                x86_64 => "lib64"
> > +            }
> > +            file {
> > +                "/etc/nagios/nrpe.cfg":
> 
> Can you encode the nrpe.cfg conditionals into a single template, instead? 
> Most of the file is identical between systems..

Sure.
(In reply to comment #7)
> Just to be clear: You want me to drop the "include nagios::service" lines
> from the buildslaves classes and master nodes, and add one to the "nagios"
> class?

Yep:

modules/nagios/manifests/init.pp:
class nagios {
    include nagios::service, nagios::install
}

buildslave.pp:
class buildslave {
    include nagios
}

> Sure.

Thanks, I know that's a bunch of work, but I think it will make future maintenance easier as we try to do things across all machines.  Keep in mind that having extra, unused lines in there isn't a big deal -- that might make the logic simpler.
(Reporter)

Comment 9

7 years ago
Created attachment 533067 [details] [diff] [review]
[checked in] v3, dustin's review comments addressed
Attachment #533007 - Attachment is obsolete: true
Attachment #533007 - Flags: review?(catlee)
Attachment #533067 - Flags: review?(dustin)
Attachment #533067 - Flags: review?(catlee)
Comment on attachment 533067 [details] [diff] [review]
[checked in] v3, dustin's review comments addressed

Review of attachment 533067 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #533067 - Flags: review?(dustin) → review+
Comment on attachment 533067 [details] [diff] [review]
[checked in] v3, dustin's review comments addressed

I'd rather not have import "nagios", but not going to block on that.
Attachment #533067 - Flags: review?(catlee) → review+
(Reporter)

Comment 12

7 years ago
Comment on attachment 533067 [details] [diff] [review]
[checked in] v3, dustin's review comments addressed

Landed this, without the "import nagios" bit (because it didn't end up being necessary), and updated all of the Puppet masters:
changeset:   352:a6bdc303a09f
Attachment #533067 - Attachment description: v3, dustin's review comments addressed → [checked in] v3, dustin's review comments addressed
(Reporter)

Comment 13

7 years ago
Created attachment 534768 [details] [diff] [review]
[checked in] fix secrets in nagios::service

Getting some errors:
 Could not retrieve catalog: Could not find class secrets in namespaces nagios::service at /etc/puppet/modules/nagios/manifests/service.pp:4 on node mv-moz2-linux-ix-slave03.build.mozilla.org

Turns out my bustage fix created more bustage, because secrets isn't applicable, except on Masters. This bustage fix fix should get things going again.
Attachment #534768 - Flags: review?(dustin)
Comment on attachment 534768 [details] [diff] [review]
[checked in] fix secrets in nagios::service

Please add a comment as to why the include is buried there - but otherwise, looks good.
Attachment #534768 - Flags: review?(dustin) → review+
(Reporter)

Comment 15

7 years ago
Comment on attachment 534768 [details] [diff] [review]
[checked in] fix secrets in nagios::service

Landed, with a comment.
changeset:   354:6fd021161433
Attachment #534768 - Attachment description: fix secrets in nagios::service → [checked in] fix secrets in nagios::service
(Reporter)

Comment 16

7 years ago
Rolled out nrpe.cfg/check_http_redirect changes on the following machines by hand (because they're not in Puppet):
buildbot-master1
buildbot-master2
buildbot-master3
buildbot-master5

I didn't bother with older masters on dev/pp because a) most of them are going away and b) production alone gets us the coverage we need. Anything set-up through Puppet in the future will get them, though.
(Reporter)

Comment 17

7 years ago
Amy,

We're now ready to add these checks. Specifically, buildbot-master1, 2, 3, 04, 5, and 6 (as well as 07 - 14 when they get added to Nagios), and dev-master01 need a "check_http_redirect" added. It needs two arguments, the first is "http%3A%2F%2Fdownload.mozilla.org%2F%3Fproduct%3Dfirefox-4.0-euballot%26os%3Dwin%26lang%3Den-GB" (encoded because NRPE won't let ampresands through), the second is "10.". This will check that the IP the URL redirects to somewhere inside of our network.
Assignee: bhearsum → arich
(Assignee)

Comment 18

7 years ago
In addition to the nagios plugin (which I've copied to the buildbot servers), this check is also going to require the local installation of LWP::UserAgent and an nrpe.cfg modification.  Who takes care of adding perl modules to the buildbot servers (I don't want to interfere with normal operation).

Also, these things (the check file, the nrpe.cfg change, and the perl dependencies) should be handled by puppet on any new buildbot-master servers.  Catlee, do you want to look into doing this since you're working on the master config?
I'm confused - how can a Python script can require a perl module?

Perl modules should be installed as RPMs via puppet.

Maybe it's time to start writing puppet patches ;)
(Reporter)

Comment 20

7 years ago
I'm confused too. To be clear, the check_http_redirect installed on the masters is a python script of my own creation, because the Perl version of the same name didn't work for our use case. I've already tested it through check_nrpe, too.
(Assignee)

Comment 21

7 years ago
Ah, sorry, I thought we were still using the perl version of the check_http_redirect script.  If not, great! Lets name it something else to avoid confusion.
Agreed - it's one more patch to land, but I think it will disambiguate later.  I didn't realize there was already a "built-in" nagios test of that name.
(Reporter)

Comment 23

7 years ago
The version I found didn't seem to be any official supported thing (http://www.purple.org.ua/wp-content/uploads/2009/09/check_http_redirect.pl), but I have a follow-up patch for another issue already....so may as well!
(Reporter)

Comment 24

7 years ago
Created attachment 535039 [details] [diff] [review]
move "include nagios" to master class, rename http plugin

I built this patch on top of your ganglia one, since you did some refactoring. I didn't include anything to delete the existing check_http_redirect file because it's simple enough to do by hand.

Ran on dev-master01, with the following output:
notice: //buildmaster/Package[mysql-devel]/ensure: ensure changed '5.0.77-4.el5_5.5' to '5.0.77-4.el5_6.6'
notice: //nagios::install/Nagios::Install::Plugin[check_http_redirect_ip]/File[/usr/lib64/nagios/plugins/check_http_redirect_ip]/ensure: content changed '{md5}194d961e0185ef9991eacce89defec2f' to '{md5}194d961e0185ef9991eacce89defec2f'
--- /etc/hosts	2010-10-29 00:20:34.000000000 -0700
+++ /tmp/puppet-diffing.10707.0	2011-05-25 05:29:46.000000000 -0700
@@ -1,4 +1,3 @@
-# Do not remove the following line, or various programs
-# that require network functionality will fail.
-127.0.0.1		localhost.localdomain localhost
-::1		localhost6.localdomain6 localhost6
+# (managed by puppet)
+127.0.0.1               localhost.localdomain localhost
+::1             localhost6.localdomain6 localhost6
notice: //network::hosts/File[/etc/hosts]/content: content changed '{md5}3f6bc5fdac37347e2c6913259470ae9c' to '{md5}2d131c811c19033d550adfffa208a08e'
--- /etc/nagios/nrpe.cfg	2011-05-17 13:54:49.000000000 -0700
+++ /tmp/puppet-diffing.10707.0	2011-05-25 05:29:46.000000000 -0700
@@ -28,9 +28,10 @@
 # Masters only
 command[check_swap]=/usr/lib64/nagios/plugins/check_swap -w $ARG1$ -c $ARG2$
 command[check_buildbot]=/usr/lib64/nagios/plugins/check_procs -c 0:0 -C buildbot
-command[check_mysql]=/usr/lib64/nagios/plugins/check_mysql -H tm-b01-master01.mozilla.org
+command[check_mysql]=/usr/lib64/nagios/plugins/check_mysql -H tm-b01-master01.mozilla.org -u buildbot -p H63732fW54vm
 command[check_ntp_time]=/usr/lib64/nagios/plugins/check_ntp_time -H $ARG1$ -w $ARG2$ -c $ARG3$
-command[check_http_redirect]=/usr/lib64/nagios/plugins/check_http_redirect -U $ARG1$ -I $ARG2$
+command[check_http_redirect_ip]=/usr/lib64/nagios/plugins/check_http_redirect_ip -U $ARG1$ -I $ARG2$
+command[check_ganglia]=/usr/lib64/nagios/plugins/check_ganglia -h $ARG1$ -s $ARG2$ -m $ARG3$ -w $ARG4$ -c $ARG5$
 
 
 
notice: //nagios::service/File[/etc/nagios/nrpe.cfg]/content: content changed '{md5}cb6d55f80b368aa6eb4f299fde2bf6ab' to 'unknown checksum'
info: //nagios::service/File[/etc/nagios/nrpe.cfg]: Scheduling refresh of Service[nrpe]
notice: //nagios::service/Service[nrpe]: Triggering 'refresh' from 1 dependencies
notice: //nagios::install/Nagios::Install::Plugin[check_ganglia]/File[/usr/lib64/nagios/plugins/check_ganglia]/ensure: content changed '{md5}fd3d08d9950b235b7f2a9639422e6f39' to '{md5}fd3d08d9950b235b7f2a9639422e6f39'
notice: Finished catalog run in 16.71 seconds

I'm not really sure why /etc/hosts is being synced, but dev-master01 doesn't run puppetd regularly, so I'm not surprised it's a bit out of sync.
(Reporter)

Updated

7 years ago
Attachment #535039 - Flags: review?(dustin)
Comment on attachment 535039 [details] [diff] [review]
move "include nagios" to master class, rename http plugin

I see the change to nrpe.cfg in your output, but I don't see it in this patch?
Attachment #535039 - Flags: review?(dustin) → review+
(Reporter)

Comment 26

7 years ago
Created attachment 535103 [details] [diff] [review]
[checked in] again, with the right diff

Sorry, forgot to rediff before I attached.
Attachment #535039 - Attachment is obsolete: true
Attachment #535103 - Flags: review?(dustin)
Attachment #535103 - Flags: review?(dustin) → review+
(Reporter)

Updated

7 years ago
Attachment #535103 - Attachment description: again, with the right diff → [checked in] again, with the right diff
(Reporter)

Comment 27

7 years ago
Amy, this is ready to go for buildbot-master1,2,3,04,5,06 now.
(Assignee)

Comment 28

7 years ago
These checks are now active on the following servers:

SCL1:

$production-buildbot-masters = buildbot-master1.build.scl1,buildbot-master2.build.scl1,buildbot-master04.build.scl1,buildbot-master5.build.scl1,buildbot-master06.build.scl1


MTV1:

$production-mtv1-buildbot-masters = buildbot-master3.build.mtv1,test-master1.build.mtv1,production-mobile-master.build.mtv1

There are no servers in sjc1 that are currently running the check.  The group that contains the buildbot servers for that datacenter is:

SJC1:

$production-sjc1-buildbot-masters = production-master1.build.sjc1,production-master2.build.sjc1,production-master3.build.sjc1,preproduction-master.build.sjc1


Let me know if/when you would like to add the hosts in sjc1.
(Assignee)

Comment 29

7 years ago
Checks have been added to the sjc1 hosts as well.  Tested with all three colos, and looks to be working.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED

Updated

6 years ago
Duplicate of this bug: 624000
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations
You need to log in before you can comment on or make changes to this bug.