Closed Bug 756839 Opened 13 years ago Closed 13 years ago

nagios-sjc1 is broken

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: afernandez, Assigned: rbryce)

Details

Creating this bug for historical purposes; this was a previously known issue. nagios-sjc1 is still not working. I tried troubleshooting (tried pretty much everything) but still get the "no activity" alerts (which could possibly be commented out to avoid the spam). nagios-sjc1 is indeed doing checks and dm-nagios01 sees them, but the script still reports "No activity". The script that performs the "No activity in the last 5 mins" check is /root/bin/check_nagios_master on dm-nagios01. I tried what's on https://mana.mozilla.org/wiki/display/SYSADMIN/Nagios#Nagios-TroubleshootingStartupShutdown and various other things, but still no luck. It would start working and then stop after 2-3 minutes or a few checks. The actual nagios setup is a bit messy, more complicated than it needs to be, and poorly documented, which makes troubleshooting non-trivial.
dm-nagios01 isn't executing host checks consistently. Going by the last recorded status in nagios ("Network unreachable"), it stopped doing checks around the time of the netops activity on the evening of 05/18. After the activity, very few hosts "recovered", although there were no problems in the network. nagios was run in debug mode to troubleshoot, but very little came out of it. There is no clear cause as to why dm-nagios01 isn't executing host checks. Service checks for all those hosts are ignored because nagios has the host as hard down. Most of the hosts/services in production being monitored here are build/releng. CC'ing relevant folks.
OS: FreeBSD → Linux
Summary: nagios-sjc1 → nagios-sjc1 is broken
I saw this when I woke up this morning and started debugging since this means that we've had no working monitoring on releng systems since Friday.

1) Instead of being a named pipe, /var/log/nagios/rw/nagios.cmd was a regular file that had stuff going back since August (ugh!). I'm sure nsca had a fit every time it got restarted because of this. I removed the file and restarted nagios, which correctly created the pipe.
2) There were also some ancient check files hanging around in /var/nagios/spool/checkresults/ which I deleted.
3) I babysat and kept restarting nsca and nagios until /usr/bin/nagiostats started consistently reporting that active host checks and passive service checks were being performed correctly (cross-verifying with the last check date in the hosts and services views of the gui).

Dumitru uncommented the cron job that checks this and pages oncall if things aren't working. I *think* this is all working now, but I'm leaving it open so you guys can keep an eye on it and add any more information as the day goes on.
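For anyone repeating items 1) and 2) above, here is a rough sketch of how the two conditions could be detected before touching anything. It is not the tooling that was used on dm-nagios01; the function names `is_named_pipe` and `stale_files` and the age threshold are made up for illustration, and it only reports what it finds rather than deleting anything.

```python
import os
import stat
import time


def is_named_pipe(path):
    """Return True if path exists and is a FIFO (named pipe)."""
    try:
        mode = os.stat(path).st_mode  # stat() does not block on a FIFO
    except FileNotFoundError:
        return False
    return stat.S_ISFIFO(mode)


def stale_files(directory, max_age_seconds, now=None):
    """Return regular files in directory last modified more than max_age_seconds ago."""
    now = time.time() if now is None else now
    stale = []
    for entry in os.scandir(directory):
        # Skip sub-directories, pipes and symlinks; only plain files count.
        if entry.is_file(follow_symlinks=False):
            age = now - entry.stat(follow_symlinks=False).st_mtime
            if age > max_age_seconds:
                stale.append(entry.path)
    return sorted(stale)
```

With the paths from this bug, `is_named_pipe("/var/log/nagios/rw/nagios.cmd")` returning False while the file exists would point at the regular-file problem, and `stale_files("/var/nagios/spool/checkresults/", 3600)` would list leftover result files older than an hour (the one-hour cutoff is an assumption, not something from the nagios config).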
A bit more followup on this after some more digging... I went back to set command_check_interval to -1 like other nagios servers, which means it will check the command queue as often as it can. I'm not sure why someone had commented out 15s and -1 and had put 1s. I've been running an ls of /var/nagios/spool/checkresults/ in conjunction with nagiostats, and I can see that we're definitely doing service checks and that the 5 minute service check counter increments when files are added to /var/nagios/spool/checkresults/ and then processed. I suspect that dm-nagios01 is now the cs for so few systems that, between the time it takes the ds to process logs on its end (it only sends them over every 60 seconds) and the time the cs takes to process them on its end (it seems like 10-30s), checks never show up in the 1 minute bucket. My guess is that it's looking at the timestamp on the original check submission instead of the time that the check was processed on the cs. As a result, I think the cron job that pages oncall if there haven't been any checks in the past minute may no longer be valid. As far as host checks go, I still see them being done on demand, but a few minutes after a restart, they don't show up as scheduled (and even then, the number that show up as scheduled after a restart varies from 8 to 1000 sometimes). I'm not entirely certain that they did before, so I'm not sure if this is aberrant behavior or not. Jabba, if you guys in infra could take a look tomorrow to make sure that things are running properly, that would be much appreciated.
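The timing guess above can be put into numbers. This is only arithmetic on the figures quoted in the comment (60 second send interval on the ds, 10-30s processing on the cs), and it assumes, as the comment does, that the bucket is keyed on the original check timestamp. The function names are hypothetical and nothing here calls nagios or nagiostats.

```python
def age_when_processed(wait_before_send, processing_delay):
    """Seconds between the original check timestamp and processing on the cs."""
    return wait_before_send + processing_delay


def lands_in_bucket(age_seconds, bucket_seconds):
    """True if a result of this age still falls inside the bucket window."""
    return age_seconds <= bucket_seconds


SEND_INTERVAL = 60           # from the comment: the ds sends results every 60 seconds
PROCESSING_RANGE = (10, 30)  # from the comment: the cs seems to take 10-30s

# A check made right after a send waits a full interval, then the slowest processing.
worst_case = age_when_processed(SEND_INTERVAL, PROCESSING_RANGE[1])
# A check made right before a send waits almost nothing, then the fastest processing.
best_case = age_when_processed(0, PROCESSING_RANGE[0])

print(worst_case, lands_in_bucket(worst_case, 60), lands_in_bucket(worst_case, 300))
print(best_case, lands_in_bucket(best_case, 60))
```

Under those assumptions a result can already be up to 90 seconds old when the cs processes it, which is outside a 60 second window but well inside a 5 minute one. That would be consistent with the 5 minute counter incrementing while the 1 minute bucket stays mostly empty when only a few hosts report.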
Group: infra
After the network blip over the weekend, nagios-sjc1 failed to pick up the recoveries until ashish restarted nagios. Can someone on the SRE team please look into this? Having nagios fail to recognize recoveries is not a good thing.
There was another event last night and everything was working as expected.
(In reply to Phong Tran [:phong] from comment #5)
> There was another event last night and everything was working as expected.

So...can we close this out?
sure thing
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
There were two instances of network activity causing Nagios to not see hosts in MTV1. In both instances, nagios and nsca had to be kicked to get nagios into action and start performing host checks.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
And, actually, even then it did not appear to be running active checks again. Something definitely still needs investigation on dm-nagios01. Or we need to move off of it completely (which is the plan) sooner rather than later.
Assignee: server-ops → rbryce
This server is about to die.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
(In reply to Rick Bryce [:rbryce] from comment #10)
> This server is about to die.

To be clear, this isn't going to affect releng nagios coverage, is it? I remember hearing from Amy (who will correct me if I'm wrong) that releng still needed to rely on nagios-sjc1 rather than nagios-releng or nagios-scl3 after the move to scl3. If those other nagios services will be picking up the slack, or if the IRC bot names are irrelevant here, I'll just be quiet.
coop - it does rely on nagios-sjc1. We have another bug to sever that relationship, which is going to take place very soon.
Product: mozilla.org → mozilla.org Graveyard