Every couple of minutes for the last couple of days (starting 6/1 at 3:53am ET), nagios has paged on all of the PostgreSQL monitors on Socorro, as follows:

ERROR: Password for user nagiosdaemon:

While this doesn't sound like a horrible problem, it's critical because it means we can't see any real problems with PostgreSQL, so it's effectively been unmonitored all weekend. The concerning thing is that 50 or so minutes earlier on Friday, we were getting real pages ("Could not find a suitable psql executable"), and I don't know the root cause of those either. If there's a maintenance bug for around that time window, I don't think I'm CC'd on it - if there is one, please add me.
Taking this for now - I've paged mpressman.
Assignee: server-ops-database → scabral
I'll note that https://dp-nagios01.phx.mozilla.com/nagios/cgi-bin/status.cgi?host=tp-socorro01-master01.phx1 shows that master01 is all green. Also, here's a sample of the page from Laura - so it *is* coming from phx nagios...

Subject: ** PROBLEM alert - tp-socorro01-master01.phx1.mozilla.com/PostgreSQL Connection is UNKNOWN **
Date: Sun, 03 Jun 2012 07:59:33 -0700
From: email@example.com (nagios)
To: firstname.lastname@example.org

***** Nagios *****

Notification Type: PROBLEM
Service: PostgreSQL Connection
Host: tp-socorro01-master01.phx1.mozilla.com
Address: 10.8.70.100
State: UNKNOWN
Date/Time: 06-03-2012 07:59:33
Additional Info: ERROR: Password for user nagiosdaemon:
I also worked on this with rhelmer on Friday and was able to manually connect from nagios to tp-socorro01-master01 just fine as nagiosdaemon. I think we need to give this to someone who knows nagios to figure out why this is alerting but the alerts aren't showing in the system.
The check_postgres.pl for socorro is the same as the check_postgres.pl for the general postgres check
Indeed, this is the URL that shows the problems: http://nagios1.private.phx1.mozilla.com/phx1/cgi-bin/status.cgi?host=tp-socorro01-master01.phx1.mozilla.com And there was already another ticket open for this: https://bugzilla.mozilla.org/show_bug.cgi?id=758856
I think I found the problem - the nagios check doesn't specify a password for the nagiosdaemon user. On the old phx monitoring system, there was a .pgpass file in the nagios user's homedir. I've copied that and the permissions, so hopefully that will fix it (I have to wait a few minutes for the next check to go).
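For anyone setting this up on another monitoring host, here is a minimal Python sketch of the fix described above: write a .pgpass entry and lock the file down to its owner. The function names and the "secret" password are hypothetical. The line format (hostname:port:database:username:password) and the rule that libpq ignores the file when group or others have any access come from the PostgreSQL documentation. The sketch does not escape colons or backslashes inside fields.

```python
import os
import stat


def write_pgpass(path, host, port, database, user, password):
    """Write a single-entry password file readable only by its owner.

    Hypothetical helper for illustration; it overwrites any existing file
    and does not escape ':' or '\\' inside the field values.
    """
    line = ":".join([host, str(port), database, user, password]) + "\n"
    # Create with mode 0600 from the start so the password is never exposed.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(line)
    # The mode passed to os.open is ignored if the file already existed.
    os.chmod(path, 0o600)


def permissions_ok(path):
    """Return True if neither group nor others have any access to the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```

In this case the path would be .pgpass in the nagios user's home directory, with tp-socorro01-master01.phx1.mozilla.com as the host and nagiosdaemon as the user. The port and database values depend on how the check is configured.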
sheeri: nice catch!
Well, you did the first steps - make sure that the nagios user could connect with the right password. So that made me think, "what password is the nagiosdaemon user *trying* to use, and why would it be different from one nagios to the next". I'm familiar with Nagios, and our setup doesn't have any differences, so it had to be that somehow magically one machine used a password and the other didn't. MySQL has .my.cnf files, so I figured pg might have something similar.... w00t! 2 of the 4 paging checks have gone green, I'll wait until they're all green to close this ticket.
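To make the "what password is it trying to use" question concrete, here is a rough Python sketch of how a .pgpass lookup works as I understand it from the PostgreSQL docs: the first line whose host, port, database and user fields each match (or are `*`) supplies the password. The function name is hypothetical, and the sketch ignores backslash escaping. When nothing matches, the client has no password to send, which would be consistent with the "Password for user nagiosdaemon:" prompt in the alert.

```python
def lookup_password(lines, host, port, database, user):
    """Return the password from the first matching entry, or None.

    Hypothetical helper for illustration; `lines` is an iterable of
    .pgpass lines. Escaped ':' characters inside fields are not handled.
    """
    wanted = [host, str(port), database, user]
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(":", 4)
        if len(fields) != 5:
            continue  # malformed entry
        # Each of the first four fields must be a wildcard or an exact match.
        if all(f == "*" or f == w for f, w in zip(fields[:4], wanted)):
            return fields[4]
    return None
```

Running this against the old host's .pgpass and against an empty file on the new host would show the difference: one returns a password, the other returns None.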
See Also: → bug 758856
Closing, it was the .pgpass file that fixed it. http://nagios1.private.phx1.mozilla.com/phx1/cgi-bin/status.cgi?host=tp-socorro01-master01.phx1.mozilla.com and http://nagios1.private.phx1.mozilla.com/phx1/cgi-bin/status.cgi?host=tp-socorro01-master02.phx1.mozilla.com are all green now.
Status: NEW → RESOLVED
Resolution: --- → FIXED
Whoops, sorry, I didn't realize this was alerting Laura! I set this up in -dev on Fri and had opened Bug 760596 to grant access. Had no idea about .pgpass, sorry again!
It was sending to an e-mail alias, so it wasn't actually paging anyone (but it should have...that's a matter for another ticket).
Product: mozilla.org → Data & BI Services Team