Closed
Bug 656916
Opened 13 years ago
Closed 13 years ago
please add nagios http redirect check to buildbot masters
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: arich)
References
Details
Attachments
(3 files, 3 obsolete files)
22.48 KB,
patch
|
dustin
:
review+
catlee
:
review+
|
Details | Diff | Splinter Review |
454 bytes,
patch
|
dustin
:
review+
|
Details | Diff | Splinter Review |
3.08 KB,
patch
|
dustin
:
review+
|
Details | Diff | Splinter Review |
In bug 646076 I've set up Bouncer to limit the build network to Mozilla-internal mirrors only. There's some concern that these changes could be lost at some point, so I'd like nagios checks to verify. The most stable Bouncer product that'll work for this -- that I can find -- is http://download.mozilla.org/?product=firefox-4.0-euballot&os=win&lang=en-GB. Is it possible to check that hitting that URL gets a 403 to a host that resolves to an internal IP address? I'm hoping we can check that instead of for a specific hostname, to future-proof us against the hostname changing. The URL will need updating at every major release (read: every quarter). I'd like this check run on all hosts in the releng-buildbot-master and 'releng-buildbot-masters-scl1 groups, which will get us coverage in every colo.
Comment 1•13 years ago
|
||
If the check is run *on* the masters, then it can be arbitrarily complex, and just added to the nrpe.cfg on the masters. Ben, if you can write up a short shell script that will do the check (following http://nagios.sourceforge.net/docs/3_0/pluginapi.html), and add that script to nrpe.cfg, and wrap the whole thing up in a puppet patch, then adding the check on the nagios servers can happen once it's installed.
Updated•13 years ago
|
Assignee: server-ops-releng → bhearsum
Reporter | ||
Comment 2•13 years ago
|
||
Sounds good, thanks for the pointer!
Reporter | ||
Comment 3•13 years ago
|
||
This patch centralizes all of the Master and Slave Nagios installation/configuration into one module, used by all slave and buildbot master nodes. Dustin, can you tell me if this looks sane or requires any big changes, before I spend time testing it everywhere? One thing that's a little weird is that num_masters gets set in the Master node definition, but is used in the template defined by nagios::service. With that being a shared class between masters/slaves I wasn't sure how else to do it.
Attachment #532664 -
Flags: feedback?(dustin)
Comment 4•13 years ago
|
||
Comment on attachment 532664 [details] [diff] [review] centralize nagios for masters/slaves into one module looks good. The num_masters thing should eventually be taken care of through puppet mechanisms, so it's OK for now.
Attachment #532664 -
Flags: feedback?(dustin) → feedback+
Reporter | ||
Comment 5•13 years ago
|
||
This patch centralizes all the existing Nagios stuff in Puppet into one module. Other than the check_http_redirect bits, all of this is just re-arranging existing checks, and importing CentOS/Darwin config files into the repository. I tested on all of: 32-bit CentOS, 64-bit CentOS, 10.5 build, 10.6 build, dev-master01. For CentOS, the only change made was whitespace in nrpe.cfg (because the 32 and 64 versions differed slightly in that way). For Darwin, it was a complete no-op. For dev-master01, the expected changes were made: nrpe.cfg was updated, check_http_redirect was created. I also tested that check_http_redirect will work for our purposes: ./check_nrpe -H localhost -c check_http_redirect -a "http%3A%2F%2Fdownload.mozilla.org%2F%3Fproduct%3Dfirefox-4.0-euballot%26os%3Dwin%26lang%3Den-GB" 10. Redirect OK: Host: dm-download02.mozilla.org, IP: 10.2.74.17
Attachment #532664 -
Attachment is obsolete: true
Attachment #533007 -
Flags: review?(dustin)
Attachment #533007 -
Flags: review?(catlee)
Comment 6•13 years ago
|
||
Comment on attachment 533007 [details] [diff] [review] nagios puppet module for masters/slaves, tested version Review of attachment 533007 [details] [diff] [review]: ----------------------------------------------------------------- ::: classes/buildslave.pp @@ +4,4 @@ > ### in the /manifests/packages/ tree > > class buildslave { > + include nagios::install I'd rather spell this with a simple 'include nagios' - class nagios {} in init.pp can then include the sub-modules. ::: modules/nagios/manifests/service.pp @@ +11,5 @@ > + i686 => "lib", > + x86_64 => "lib64" > + } > + file { > + "/etc/nagios/nrpe.cfg": Can you encode the nrpe.cfg conditionals into a single template, instead? Most of the file is identical between systems..
Attachment #533007 -
Flags: review?(dustin) → review-
Reporter | ||
Comment 7•13 years ago
|
||
(In reply to comment #6) > > class buildslave { > > + include nagios::install > > I'd rather spell this with a simple 'include nagios' - class nagios {} in > init.pp can then include the sub-modules. Just to be clear: You want me to drop the "include nagios::service" lines from the buildslaves classes and master nodes, and add one to the "nagios" class? > ::: modules/nagios/manifests/service.pp > @@ +11,5 @@ > > + i686 => "lib", > > + x86_64 => "lib64" > > + } > > + file { > > + "/etc/nagios/nrpe.cfg": > > Can you encode the nrpe.cfg conditionals into a single template, instead? > Most of the file is identical between systems.. Sure.
Comment 8•13 years ago
|
||
(In reply to comment #7) > Just to be clear: You want me to drop the "include nagios::service" lines > from the buildslaves classes and master nodes, and add one to the "nagios" > class? Yep: modules/nagios/manifests/init.pp: class nagios { include nagios::service, nagios::install } buildslave.pp: class buildslave { include nagios } > Sure. Thanks, I know that's a bunch of work, but I think it will make future maintenance easier as we try to do things across all machines. Keep in mind that having extra, unused lines in there isn't a big deal -- that might make the logic simpler.
Reporter | ||
Comment 9•13 years ago
|
||
Attachment #533007 -
Attachment is obsolete: true
Attachment #533007 -
Flags: review?(catlee)
Attachment #533067 -
Flags: review?(dustin)
Attachment #533067 -
Flags: review?(catlee)
Comment 10•13 years ago
|
||
Comment on attachment 533067 [details] [diff] [review] [checked in] v3, dustin's review comments addressed Review of attachment 533067 [details] [diff] [review]: -----------------------------------------------------------------
Attachment #533067 -
Flags: review?(dustin) → review+
Comment 11•13 years ago
|
||
Comment on attachment 533067 [details] [diff] [review] [checked in] v3, dustin's review comments addressed I'd rather not have import "nagios", but not going to block on that.
Attachment #533067 -
Flags: review?(catlee) → review+
Reporter | ||
Comment 12•13 years ago
|
||
Comment on attachment 533067 [details] [diff] [review] [checked in] v3, dustin's review comments addressed Landed this, without the "import nagios" bit (because it didn't end up being necessary), and updated all of the Puppet masters: changeset: 352:a6bdc303a09f
Attachment #533067 -
Attachment description: v3, dustin's review comments addressed → [checked in] v3, dustin's review comments addressed
Reporter | ||
Comment 13•13 years ago
|
||
Getting some errors: Could not retrieve catalog: Could not find class secrets in namespaces nagios::service at /etc/puppet/modules/nagios/manifests/service.pp:4 on node mv-moz2-linux-ix-slave03.build.mozilla.org Turns out my bustage fix created more bustage, because secrets isn't applicable, except on Masters. This bustage fix fix should get things going again.
Attachment #534768 -
Flags: review?(dustin)
Comment 14•13 years ago
|
||
Comment on attachment 534768 [details] [diff] [review] [checked in] fix secrets in nagios::service Please add a comment as to why the include is buried there - but otherwise, looks good.
Attachment #534768 -
Flags: review?(dustin) → review+
Reporter | ||
Comment 15•13 years ago
|
||
Comment on attachment 534768 [details] [diff] [review] [checked in] fix secrets in nagios::service Landed, with a comment. changeset: 354:6fd021161433
Attachment #534768 -
Attachment description: fix secrets in nagios::service → [checked in] fix secrets in nagios::service
Reporter | ||
Comment 16•13 years ago
|
||
Rolled out nrpe.cfg/check_http_redirect changes on the following machines by hand (because they're not in Puppet): buildbot-master1 buildbot-master2 buildbot-master3 buildbot-master5 I didn't bother with older masters on dev/pp because a) most of them are going away and b) production alone gets us the coverage we need. Anything set-up through Puppet in the future will get them, though.
Reporter | ||
Comment 17•13 years ago
|
||
Amy, We're now ready to add these checks. Specifically, buildbot-master1, 2, 3, 04, 5, and 6 (as well as 07 - 14 when they get added to Nagios), and dev-master01 need a "check_http_redirect" added. It needs two arguments, the first is "http%3A%2F%2Fdownload.mozilla.org%2F%3Fproduct%3Dfirefox-4.0-euballot%26os%3Dwin%26lang%3Den-GB" (encoded because NRPE won't let ampresands through), the second is "10.". This will check that the IP the URL redirects to somewhere inside of our network.
Assignee: bhearsum → arich
Assignee | ||
Comment 18•13 years ago
|
||
In addition to the nagios plugin (which I've copied to the buildbot servers), this check is also going to require the local installation of LWP::UserAgent and an nrpe.cfg modification. Who takes care of adding perl modules to the buildbot servers (I don't want to interfere with normal operation). Also, these things (the check file, the nrpe.cfg change, and the perl dependencies) should be handled by puppet on any new buildbot-master servers. Catlee, do you want to look into doing this since you're working on the master config?
Comment 19•13 years ago
|
||
I'm confused - how can a Python script can require a perl module? Perl modules should be installed as RPMs via puppet. Maybe it's time to start writing puppet patches ;)
Reporter | ||
Comment 20•13 years ago
|
||
I'm confused too. To be clear, the check_http_redirect installed on the masters is a python script of my own creation, because the Perl version of the same name didn't work for our use case. I've already tested it through check_nrpe, too.
Assignee | ||
Comment 21•13 years ago
|
||
Ah, sorry, I thought we were still using the perl version of the check_http_redirect script. If not, great! Lets name it something else to avoid confusion.
Comment 22•13 years ago
|
||
Agreed - it's one more patch to land, but I think it will disambiguate later. I didn't realize there was already a "built-in" nagios test of that name.
Reporter | ||
Comment 23•13 years ago
|
||
The version I found didn't seem to be any official supported thing (http://www.purple.org.ua/wp-content/uploads/2009/09/check_http_redirect.pl), but I have a follow-up patch for another issue already....so may as well!
Reporter | ||
Comment 24•13 years ago
|
||
I built this patch on top of your ganglia one, since you did some refactoring. I didn't include anything to delete the existing check_http_redirect file because it's simple enough to do by hand. Ran on dev-master01, with the following output: notice: //buildmaster/Package[mysql-devel]/ensure: ensure changed '5.0.77-4.el5_5.5' to '5.0.77-4.el5_6.6' notice: //nagios::install/Nagios::Install::Plugin[check_http_redirect_ip]/File[/usr/lib64/nagios/plugins/check_http_redirect_ip]/ensure: content changed '{md5}194d961e0185ef9991eacce89defec2f' to '{md5}194d961e0185ef9991eacce89defec2f' --- /etc/hosts 2010-10-29 00:20:34.000000000 -0700 +++ /tmp/puppet-diffing.10707.0 2011-05-25 05:29:46.000000000 -0700 @@ -1,4 +1,3 @@ -# Do not remove the following line, or various programs -# that require network functionality will fail. -127.0.0.1 localhost.localdomain localhost -::1 localhost6.localdomain6 localhost6 +# (managed by puppet) +127.0.0.1 localhost.localdomain localhost +::1 localhost6.localdomain6 localhost6 notice: //network::hosts/File[/etc/hosts]/content: content changed '{md5}3f6bc5fdac37347e2c6913259470ae9c' to '{md5}2d131c811c19033d550adfffa208a08e' --- /etc/nagios/nrpe.cfg 2011-05-17 13:54:49.000000000 -0700 +++ /tmp/puppet-diffing.10707.0 2011-05-25 05:29:46.000000000 -0700 @@ -28,9 +28,10 @@ # Masters only command[check_swap]=/usr/lib64/nagios/plugins/check_swap -w $ARG1$ -c $ARG2$ command[check_buildbot]=/usr/lib64/nagios/plugins/check_procs -c 0:0 -C buildbot -command[check_mysql]=/usr/lib64/nagios/plugins/check_mysql -H tm-b01-master01.mozilla.org +command[check_mysql]=/usr/lib64/nagios/plugins/check_mysql -H tm-b01-master01.mozilla.org -u buildbot -p H63732fW54vm command[check_ntp_time]=/usr/lib64/nagios/plugins/check_ntp_time -H $ARG1$ -w $ARG2$ -c $ARG3$ -command[check_http_redirect]=/usr/lib64/nagios/plugins/check_http_redirect -U $ARG1$ -I $ARG2$ +command[check_http_redirect_ip]=/usr/lib64/nagios/plugins/check_http_redirect_ip -U $ARG1$ -I $ARG2$ +command[check_ganglia]=/usr/lib64/nagios/plugins/check_ganglia -h $ARG1$ -s $ARG2$ -m $ARG3$ -w $ARG4$ -c $ARG5$ notice: //nagios::service/File[/etc/nagios/nrpe.cfg]/content: content changed '{md5}cb6d55f80b368aa6eb4f299fde2bf6ab' to 'unknown checksum' info: //nagios::service/File[/etc/nagios/nrpe.cfg]: Scheduling refresh of Service[nrpe] notice: //nagios::service/Service[nrpe]: Triggering 'refresh' from 1 dependencies notice: //nagios::install/Nagios::Install::Plugin[check_ganglia]/File[/usr/lib64/nagios/plugins/check_ganglia]/ensure: content changed '{md5}fd3d08d9950b235b7f2a9639422e6f39' to '{md5}fd3d08d9950b235b7f2a9639422e6f39' notice: Finished catalog run in 16.71 seconds I'm not really sure why /etc/hosts is being synced, but dev-master01 doesn't run puppetd regularly, so I'm not surprised it's a bit out of sync.
Reporter | ||
Updated•13 years ago
|
Attachment #535039 -
Flags: review?(dustin)
Comment 25•13 years ago
|
||
Comment on attachment 535039 [details] [diff] [review] move "include nagios" to master class, rename http plugin I see the change to nrpe.cfg in your output, but I don't see it in this patch?
Attachment #535039 -
Flags: review?(dustin) → review+
Reporter | ||
Comment 26•13 years ago
|
||
Sorry, forgot to rediff before I attached.
Attachment #535039 -
Attachment is obsolete: true
Attachment #535103 -
Flags: review?(dustin)
Updated•13 years ago
|
Attachment #535103 -
Flags: review?(dustin) → review+
Reporter | ||
Updated•13 years ago
|
Attachment #535103 -
Attachment description: again, with the right diff → [checked in] again, with the right diff
Reporter | ||
Comment 27•13 years ago
|
||
Amy, this is ready to go for buildbot-master1,2,3,04,5,06 now.
Assignee | ||
Comment 28•13 years ago
|
||
These checks are now active on the following servers: SCL1: $production-buildbot-masters = buildbot-master1.build.scl1,buildbot-master2.build.scl1,buildbot-master04.build.scl1,buildbot-master5.build.scl1,buildbot-master06.build.scl1 MTV1: $production-mtv1-buildbot-masters = buildbot-master3.build.mtv1,test-master1.build.mtv1,production-mobile-master.build.mtv1 There are no servers in sjc1 that are currently running the check. The group that contains the buildbot servers for that datacenter is: SJC1: $production-sjc1-buildbot-masters = production-master1.build.sjc1,production-master2.build.sjc1,production-master3.build.sjc1,preproduction-master.build.sjc1 Let me know if/when you would like to add the hosts in sjc1.
Assignee | ||
Comment 29•13 years ago
|
||
Checks have been added to the sjc1 hosts as well. Tested with all three colos, and looks to be working.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations
You need to log in
before you can comment on or make changes to this bug.
Description
•