Closed
Bug 629698
Opened 14 years ago
Closed 14 years ago
new nagios passive check, 'buildbot_start', for all desktop slaves
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: dustin)
Per bug 627126, this check should alert after 6h with no report. Its results will be supplied passively by the buildslaves.
To get started, these checks should have notifications turned off, and if possible only go to the 'warning' state after 6 hours. There are a number of dependent bugs to get this working correctly, but having visibility into the workings immediately will make that go more smoothly.
I worry that buildslaves have different notions of their hostname than nagios does, but we'll cross that bridge when we come to it.
Assignee
Comment 1•14 years ago
zandr - ping?
Assignee
Comment 2•14 years ago
I assume we're running NSCA. I thought the protocol was a trivial serialization of a command file, but it seems it's also encrypted, and not with something easy like SSL. There do not appear to be any Python libraries available to accomplish this.
I wonder if we'd consider setting up NSCAweb?
http://exchange.nagios.org/directory/Addons/Passive-Checks/NSCAweb/details
Assignee
Comment 3•14 years ago
Um, *are* we running nsca? What sort of encryption and passwords does it require? I could play around with an existing service to figure this out, but I'd rather be told than try to guess (and potentially send incorrect results).
Comment 4•14 years ago
Just to answer your question, the encryption method on dm-nagios01 is set to blowfish. On bm-admin01 it's set to XOR.
Assignee
Comment 5•14 years ago
Amy - let's talk about this tomorrow once the builds start rollin'. This is the next project to unstick on POSIX once slavealloc is up and running.
Comment 6•14 years ago
Here's what I've blocked out for the server side stuff (I haven't added it yet):
',
'buildbot_start' => '
define service {
    use                      generic-service
    host_name                replace_with_host_name
    service_description      buildbot-start
    contact_groups           build
    active_checks_enabled    0
    passive_checks_enabled   1
    check_freshness          1
    normal_check_interval    1
    max_check_attempts       1
    freshness_threshold      21600
    notification_options     w,u,c,r
    check_command            notify-no-buildbot-start
    notification_period      24x7
}
define command {
    command_name    notify-no-buildbot-start
    command_line    /usr/lib/nagios/check_dummy 2 "CRITICAL: Slave has not reported buildbot startup!"
}
As you suggested, when we start testing, we can initially make the dummy check return OK instead of CRITICAL.
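To make the freshness settings above concrete, here is a minimal sketch of the behaviour they are meant to produce: once the last passive result is older than freshness_threshold (21600 seconds, i.e. 6 hours), Nagios runs the check_command, and check_dummy 2 reports CRITICAL. The function name and arguments are hypothetical, and real Nagios adds some scheduling slack around the threshold, so treat this as an illustration only.

```python
# Illustrative model of the freshness check; not Nagios code.
FRESHNESS_THRESHOLD = 6 * 60 * 60  # 21600 seconds, matching freshness_threshold

OK, WARNING, CRITICAL = 0, 1, 2


def effective_state(seconds_since_last_report, last_reported_state,
                    threshold=FRESHNESS_THRESHOLD):
    """Return the state the service would show under this simplified model.

    A stale result is replaced by the output of the dummy check command,
    which in the config above exits with 2 (CRITICAL).
    """
    if seconds_since_last_report > threshold:
        return CRITICAL
    return last_reported_state
```

For the initial testing phase mentioned above, the stale branch would return OK instead, mirroring a check_dummy 0 command line.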
Assignee
Comment 7•14 years ago
This is the code from the nagios generation scripts that converts a host's IP into a hostname for nagios:
$hostname = scalar (gethostbyaddr ($ip, AF_INET));
if ($hostname) {
    $hostname =~ s/\.(com|int|org)$//;
    $hostname =~ s/\.mozilla$//;
So basically, reverse DNS, then strip .com, .int, .org, and .mozilla. That should be easy enough to replicate on the slaves.
Once this check is in place for all desktop slaves, I'll start trying to emulate the XOR encryption in Python so that I can embed it in runslave.py. Hopefully it's not too horrendous!
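A minimal sketch of how the slave side might replicate that hostname mangling in Python, assuming the slave can reverse-resolve its own IP. The function names are hypothetical and nothing here is taken from runslave.py; only the two substitutions come from the Perl above.

```python
import re
import socket


def strip_nagios_suffixes(fqdn):
    """Apply the same two substitutions as the nagios generation script."""
    name = re.sub(r'\.(com|int|org)$', '', fqdn)
    name = re.sub(r'\.mozilla$', '', name)
    return name


def nagios_hostname_for_ip(ip):
    """Reverse-resolve ip, then strip the suffixes (needs working DNS)."""
    fqdn = socket.gethostbyaddr(ip)[0]
    return strip_nagios_suffixes(fqdn)
```

Under this sketch, strip_nagios_suffixes('linux-ix-slave10.build.mozilla.org') gives 'linux-ix-slave10.build'. Whether the slave's own idea of its IP and reverse DNS matches what the nagios scripts see is exactly the open question raised in the description.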
Assignee: server-ops-releng → arich
Assignee
Comment 8•14 years ago
By the way, so we're clear on where this sits, I need the check set up before I can start hacking runslave.py to use it. If it's easier to only add it for one host or one hostgroup, that's fine. I just need something I can watch in nagios to know if I'm doing it right.
Comment 9•14 years ago
Ah, sorry, I thought you were doing work on your end first. Let's pick one host to apply this to first. Which one do you suggest?
Also it would probably be prudent to use blowfish (like the central server) instead of xor, I think. Will that make this substantially more difficult?
Assignee
Comment 10•14 years ago
(In reply to comment #9)
> Ah, sorry, I thought you were doing work on your end first. Let's pick one
> host to apply this to first. Which one do you suggest?
Any POSIX slave should be fine, then. Let's say linux-ix-slave10.
> Also it would probably be prudent to use blowfish (like the central server)
> instead of xor, I think. Will that make this substantially more difficult?
It requires implementing blowfish in dependency-free Python. Assuming your crypto-coding skills aren't several orders of magnitude better than mine, yes, I can say without hesitation that it will make this substantially more difficult :)
Comment 11•14 years ago
Check added for linux-ix-slave10.
Updated•14 years ago
Assignee: arich → dustin
Assignee
Comment 12•14 years ago
I've disabled the buildslave startup on this machine. I'm going to install Net::NSCA::Client on it to verify that I can submit results with a known-good implementation; then I'll try to write my own implementation.
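For reference, a dependency-free sketch of what the XOR mode could look like in Python. It assumes, and this should be verified against the NSCA source and against a packet from the known-good client, that the packet is XORed byte-by-byte with the IV sent by the server and then with the password, each repeated cyclically. The function names are hypothetical, and packet layout, CRC, and the network handshake are not covered.

```python
from itertools import cycle


def xor_cycle(data, key):
    """XOR data with key, repeating key as often as needed."""
    if not key:
        return bytes(data)
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def nsca_xor(packet, iv, password):
    """Assumed NSCA XOR scheme: first the server IV, then the password."""
    return xor_cycle(xor_cycle(packet, iv), password)
```

Because XOR is its own inverse, applying nsca_xor twice with the same IV and password returns the original bytes, which makes it easy to compare output against a capture from the known-good client.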
Assignee
Comment 13•14 years ago
OK, here's what I get:
[root@linux-ix-slave10 ~]# cat nsca_input
linux-ix-slave10.build buildbot-start 1 testy testy
[root@linux-ix-slave10 ~]# send_nsca -H bm-admin01.mozilla.org < nsca_input
1 data packet(s) sent to host successfully.
But no change in the status on the web (I should see "testy testy" and a WARNING status, I think). Am I sending that to the right host?
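The input file above follows send_nsca's default format for a service check: host name, service description, return code, and plugin output, separated by tabs and ending with a newline. A minimal sketch of producing that line and handing it to send_nsca from Python follows; the function names are hypothetical, and it assumes send_nsca is on the PATH with a working default config.

```python
import subprocess

VALID_CODES = (0, 1, 2, 3)  # OK, WARNING, CRITICAL, UNKNOWN


def format_result(host, service, code, output):
    """Build one tab-delimited service check result line for send_nsca."""
    if code not in VALID_CODES:
        raise ValueError("return code must be 0, 1, 2 or 3")
    return "%s\t%s\t%d\t%s\n" % (host, service, code, output)


def submit(line, server, send_nsca="send_nsca"):
    """Pipe the line to send_nsca, as in the shell session above."""
    subprocess.run([send_nsca, "-H", server],
                   input=line.encode("ascii"), check=True)
```

With this sketch, format_result("linux-ix-slave10.build", "buildbot-start", 1, "testy testy") reproduces the test line, with return code 1 meaning WARNING. Note that a successful send only means the NSCA daemon accepted the packet, not that nagios processed it.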
Comment 14•14 years ago
For some reason the nsca.cfg file had been modified to use a command file that did not match that of the nagios.cfg config file. I've changed nsca.cfg back and restarted it. Give it a try now.
Assignee
Comment 15•14 years ago
This one is fixed - the passive check is working and it only remains to implement and deploy, which is handled in dependent bugs.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 16•14 years ago
(In reply to comment #15)
> This one is fixed - the passive check is working and it only remains to
> implement and deploy, which is handled in dependent bugs.
Linking depbug#629701.
Blocks: 629701
Updated•12 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations