I'm not sure whether these are a Nagios failure (Nagios isn't really built to ping boxes that restart all the time) or a problem with the slaves. I'll track the alerts in the Google spreadsheet in the URL to see if I can find a pattern.
I can't tell if these alerts are bogus or not - too much other chaos. I'll comment on that in the parent bug, and hopefully work it out in person tomorrow.
Sometimes these look like:

15:08 < nagios>  try-linux64-slave09.build:buildbot is CRITICAL: Connection refused by host

and they seem to happen while the slave is restarting, as this one did. I checked the web interface and saw the CRITICAL I expected. A few moments later I navigated back to the same page and saw "NRPE: Unable to read output".

What I don't understand is that in the web interface this service (indeed, all of the services for this host, and for a few other hosts I've checked) is marked as passive, with active checks disabled. I didn't think that was possible with NRPE: aren't NRPE checks triggered when the master connects to the slave and requests the check? There's something here I don't understand, and it's blocking me from diagnosing further.
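For anyone trying to spot a pattern in these alerts, here is a minimal sketch that parses lines shaped like the one quoted above and counts them per host and message. The regular expression is only inferred from that single example, so the exact bot output format is an assumption, and the names parse_alert and count_by_host_and_message are hypothetical helpers, not part of Nagios or NRPE.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the one alert quoted in this bug:
#   HH:MM < nagios>  host:service is STATE: message
ALERT_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2})\s+<\s*nagios>\s+"
    r"(?P<host>[^:\s]+):(?P<service>\S+)\s+is\s+"
    r"(?P<state>[A-Z]+):\s*(?P<message>.*)$"
)


def parse_alert(line):
    """Return a dict of alert fields, or None if the line does not match."""
    match = ALERT_RE.match(line.strip())
    return match.groupdict() if match else None


def count_by_host_and_message(lines):
    """Count alerts per (host, message) pair, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        alert = parse_alert(line)
        if alert is not None:
            counts[(alert["host"], alert["message"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "15:08 < nagios>  try-linux64-slave09.build:buildbot is CRITICAL: "
        "Connection refused by host",
    ]
    for key, n in count_by_host_and_message(sample).items():
        print(key, n)
```

Grouping on the message text keeps "Connection refused by host" separate from other failure messages, which may help tell restart-related alerts apart from the rest.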
I was mixing up some "Connection refused" (which was due to a typo in my puppet deployment of the nrpe.cnf changes) with the ping failures, which are better described in bug 625867. So, dup'ing.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 625867
Product: mozilla.org → mozilla.org Graveyard