Closed Bug 629698 Opened 14 years ago Closed 14 years ago

new nagios passive check, 'buildbot_start', for all desktop slaves

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)


Per bug 627126, this check should alert after 6h with no report. Its results will be supplied passively by the buildslaves. To get started, these checks should have notifications turned off, and if possible only go to the 'warning' state after 6 hours. There are a number of dependent bugs to get this working correctly, but having visibility into the workings immediately will make that go more smoothly. I worry that buildslaves have different notions of their hostname than nagios does, but we'll cross that bridge when we come to it.
zandr - ping?
I assume we're running NSCA. I thought the protocol was a trivial serialization of a command file, but it seems it's also encrypted, and not with something easy like SSL. There do not appear to be any Python libraries available to accomplish this. I wonder if we'd consider setting up NSCAweb? http://exchange.nagios.org/directory/Addons/Passive-Checks/NSCAweb/details
Um, *are* we running nsca? What sort of encryption and passwords does it require? I could play around with an existing service to figure this out, but I'd rather be told than try to guess (and potentially send incorrect results).
Just to answer your question, the encryption method on dm-nagios01 is set to blowfish. On bm-admin01 it's set to XOR.
Amy - let's talk about this tomorrow once the builds start rollin'. This is the next project to unstick on POSIX once slavealloc is up and running.
Here's what I've blocked out for the server side stuff (I haven't added it yet):
    ', 'buildbot_start' => '
    define service {
        use                     generic-service
        host_name               replace_with_host_name
        service_description     buildbot-start
        contact_groups          build
        active_checks_enabled   0
        passive_checks_enabled  1
        check_freshness         1
        normal_check_interval   1
        max_check_attempts      1
        freshness_threshold     21600
        notification_options    w,u,c,r
        check_command           notify-no-buildbot-start
        notification_period     24x7
    }
    define command {
        command_name    notify-no-buildbot-start
        command_line    /usr/lib/nagios/check_dummy 2 "CRITICAL: Slave has not reported buildbot startup!"
    }
As you suggested, when we start testing, we can initially make the dummy check return OK instead of CRITICAL.
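For anyone reading along, here is a rough Python model of what the freshness settings above amount to: freshness_threshold 21600 is 6 hours in seconds, and once the last passive result is older than that, Nagios runs check_command, which here is check_dummy 2 (CRITICAL). The helper names are made up for illustration, and the exact comparison Nagios uses at the boundary is an assumption.

```python
import time

# 6 hours in seconds, matching freshness_threshold in the service definition.
FRESHNESS_THRESHOLD = 21600

# Standard Nagios plugin exit codes (check_dummy takes one of these as its first argument).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def is_stale(last_report, now=None, threshold=FRESHNESS_THRESHOLD):
    """Return True if the last passive result is older than the threshold.

    Assumption: a strict "older than" comparison; Nagios itself may differ
    slightly at the boundary.
    """
    if now is None:
        now = time.time()
    return (now - last_report) > threshold


def effective_state(last_state, last_report, now=None):
    """Hypothetical helper: the state the service would show at time `now`."""
    if is_stale(last_report, now):
        # Stale result: Nagios runs check_command, i.e. `check_dummy 2`.
        return CRITICAL
    return last_state
```

Swapping CRITICAL for OK in effective_state corresponds to making the dummy check return OK during testing.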
This is the code from the nagios generation scripts that converts a host's IP into a hostname for nagios:
    $hostname = scalar (gethostbyaddr ($ip, AF_INET));
    if ($hostname) {
        $hostname =~ s/\.(com|int|org)$//;
        $hostname =~ s/\.mozilla$//;
So basically, reverse DNS, then strip .com, .int, .org, and .mozilla. That should be easy enough to replicate on the slaves. Once this check is in place for all desktop slaves, I'll start trying to emulate the XOR encryption in Python so that I can embed it in runslave.py. Hopefully it's not too horrendous!
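A minimal Python sketch of the same reverse-DNS-then-strip logic, as it might look on the slave side. The function names are hypothetical and nothing here is taken from runslave.py; it only mirrors the two Perl substitutions quoted above.

```python
import re
import socket


def strip_nagios_suffixes(fqdn):
    """Mirror the two Perl substitutions: drop a trailing .com/.int/.org,
    then drop a trailing .mozilla."""
    name = re.sub(r"\.(com|int|org)$", "", fqdn)
    name = re.sub(r"\.mozilla$", "", name)
    return name


def nagios_hostname(ip):
    """Reverse-resolve `ip` and return the name nagios would use, or None
    if the reverse lookup fails."""
    try:
        fqdn = socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None
    return strip_nagios_suffixes(fqdn)
```

Assuming a slave's reverse DNS returns something like linux-ix-slave10.build.mozilla.org, strip_nagios_suffixes would give linux-ix-slave10.build.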
Assignee: server-ops-releng → arich
By the way, so we're clear on where this sits, I need the check set up before I can start hacking runslave.py to use it. If it's easier to only add it for one host or one hostgroup, that's fine. I just need something I can watch in nagios to know if I'm doing it right.
Ah, sorry, I thought you were doing work on your end first. Let's pick one host to apply this to first. Which one do you suggest? Also it would probably be prudent to use blowfish (like the central server) instead of xor, I think. Will that make this substantially more difficult?
(In reply to comment #9)
> Ah, sorry, I thought you were doing work on your end first. Let's pick one
> host to apply this to first. Which one do you suggest?
Any POSIX slave should be fine, then. Let's say linux-ix-slave10.
> Also it would probably be prudent to use blowfish (like the central server)
> instead of xor, I think. Will that make this substantially more difficult?
It requires implementing blowfish in dependency-free Python. Assuming your crypto-coding skills aren't several orders of magnitude better than mine, yes, I can say without hesitation that it will make this substantially more difficult :)
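To show why XOR is the easy case, here is a dependency-free sketch of the XOR step as I understand NSCA's XOR mode: the packet bytes are XORed against the initialization vector the server sends on connect, cycling over it, and then against the shared password, also cycling. That description is an assumption on my part and should be checked against the NSCA source before anyone relies on it. The function name is hypothetical, and building the packet itself (header, CRC, fixed-width fields) is not covered.

```python
def nsca_xor(packet, iv, password):
    """XOR `packet` with the server IV, then with the shared password.

    All arguments are bytes. `iv` must be non-empty. An empty password
    skips the second pass. Sketch only; the real NSCA packet layout is
    not handled here.
    """
    out = bytearray(packet)
    # First pass: cycle over the IV received from the server.
    for i in range(len(out)):
        out[i] ^= iv[i % len(iv)]
    # Second pass: cycle over the shared password, if one is set.
    if password:
        for i in range(len(out)):
            out[i] ^= password[i % len(password)]
    return bytes(out)
```

Because XOR is its own inverse, applying nsca_xor twice with the same IV and password returns the original bytes, which makes a quick sanity check possible without a server.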
Check added for linux-ix-slave10.
Assignee: arich → dustin
I've disabled the buildslave startup on this machine. I'm going to install Net::NSCA::Client on it to verify that I can submit results with a known-good implementation; then I'll try to write my own implementation.
OK, here's what I get:
    [root@linux-ix-slave10 ~]# cat nsca_input
    linux-ix-slave10.build buildbot-start 1 testy testy
    [root@linux-ix-slave10 ~]# send_nsca -H bm-admin01.mozilla.org < nsca_input
    1 data packet(s) sent to host successfully.
But no change in the status on the web (I should see "testy testy" and a WARNING status, I think). Am I sending that to the right host?
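For reference, send_nsca reads one service result per line on stdin, with the host name, service description, return code, and plugin output separated by tabs by default. A small sketch of building such a line in Python follows; the function name is made up for illustration.

```python
def format_passive_result(host, service, return_code, output):
    """Build one send_nsca input line: host, service, return code, and
    plugin output, tab-separated and newline-terminated."""
    fields = [host, service, str(int(return_code)), output]
    for field in fields:
        # Tabs or newlines inside a field would corrupt the record.
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields) + "\n"
```

The test above would then be format_passive_result("linux-ix-slave10.build", "buildbot-start", 1, "testy testy"), with the result piped to send_nsca -H bm-admin01.mozilla.org.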
For some reason the nsca.cfg file had been modified to use a command file that did not match that of the nagios.cfg config file. I've changed nsca.cfg back and restarted it. Give it a try now.
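A mismatch like this can be caught mechanically, since both nsca.cfg and nagios.cfg carry a command_file= directive and the two need to name the same file for NSCA's results to reach nagios. A rough sketch of such a check, working on the config text so that no file locations have to be assumed; the function names are hypothetical.

```python
def command_file_setting(config_text):
    """Return the value of the first `command_file=` directive, or None."""
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments and anything that is not a key=value line.
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "command_file":
            return value.strip()
    return None


def command_files_match(nsca_cfg_text, nagios_cfg_text):
    """True only if both configs set command_file to the same path."""
    nsca = command_file_setting(nsca_cfg_text)
    nagios = command_file_setting(nagios_cfg_text)
    return nsca is not None and nsca == nagios
```

Reading the two files and passing their contents to command_files_match would have flagged the problem before send_nsca reported a successful send that never showed up in the web UI.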
This one is fixed - the passive check is working and it only remains to implement and deploy, which is handled in dependent bugs.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to comment #15)
> This one is fixed - the passive check is working and it only remains to
> implement and deploy, which is handled in dependent bugs.
Linking depbug#629701.
Blocks: 629701
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations