Closed
Bug 760987
Opened 14 years ago
Closed 14 years ago
PostgreSQL has been paging for three days
Categories
(Data & BI Services Team :: DB: MySQL, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: laura, Assigned: scabral)
Details
Every couple of minutes for the last couple of days (starting 6/1 at 3:53am ET), Nagios has paged on all of the PostgreSQL monitors on Socorro, as follows:
ERROR: Password for user nagiosdaemon:
While this doesn't sound like a horrible problem, the reason this is critical is that it means we can't see any real problems with PostgreSQL, so it's effectively been unmonitored all weekend.
The concerning thing is that 50 or so minutes earlier on Friday, we were getting real pages ("Could not find a suitable psql executable"), for which I also don't know the root cause.
If there's a maintenance bug for around that time window, I don't think I'm cced on it - if there is one, please add me.
| Assignee | ||
Comment 1•14 years ago
Taking this for now - I've paged mpressman.
Assignee: server-ops-database → scabral
| Assignee | ||
Comment 2•14 years ago
I'll note that https://dp-nagios01.phx.mozilla.com/nagios/cgi-bin/status.cgi?host=tp-socorro01-master01.phx1 shows that master01 is all green.
Also, here's a sample of the page from Laura - so it *is* coming from phx nagios...
Subject: ** PROBLEM alert - tp-socorro01-master01.phx1.mozilla.com/PostgreSQL Connection is UNKNOWN **
Date: Sun, 03 Jun 2012 07:59:33 -0700
From: nagios@nagios1.private.phx1.mozilla.com (nagios)
To: cron-socorro@mozilla.com
***** Nagios *****
Notification Type: PROBLEM
Service: PostgreSQL Connection
Host: tp-socorro01-master01.phx1.mozilla.com
Address: 10.8.70.100
State: UNKNOWN
Date/Time: 06-03-2012 07:59:33
Additional Info:
ERROR: Password for user nagiosdaemon:
Comment 3•14 years ago
I also worked on this with rhelmer on Friday and was able to manually connect from nagios to tp-socorro01-master01 just fine as nagiosdaemon. I think we need to give this to someone who knows Nagios well enough to work out why it is alerting without the alerts showing up in the system.
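A manual login can succeed where an unattended check fails, because an interactive session can answer a password prompt. One way to reproduce what a non-interactive check sees is to run psql with `-w` (`--no-password`), which makes it fail instead of prompting when no password is available. The sketch below builds and optionally runs such a command. The database name `postgres` and the helper names are assumptions for illustration, not taken from the actual Nagios configuration.

```python
import subprocess


def build_probe_command(host, user, dbname="postgres"):
    """Build a psql command that never prompts for a password.

    dbname is a hypothetical default; the database used by the real
    check is not stated in this bug.
    """
    return [
        "psql",
        "-h", host,        # server to connect to
        "-U", user,        # role used by the monitoring check
        "-d", dbname,
        "-w",              # --no-password: fail rather than prompt
        "-c", "SELECT 1",  # trivial query to prove the login works
    ]


def probe(host, user, dbname="postgres"):
    """Run the probe and return (ok, stderr_text)."""
    result = subprocess.run(
        build_probe_command(host, user, dbname),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr.strip()
```

For the result to be comparable, the probe would need to run as the same OS user as the Nagios daemon, since the password file is looked up in that user's home directory.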
| Assignee | ||
Comment 4•14 years ago
Thanx, Matt.
Comment 5•14 years ago
The check_postgres.pl used for Socorro is the same as the check_postgres.pl used for the general Postgres check.
| Assignee | ||
Comment 6•14 years ago
Indeed, this is the URL that shows the problems:
http://nagios1.private.phx1.mozilla.com/phx1/cgi-bin/status.cgi?host=tp-socorro01-master01.phx1.mozilla.com
And there was already another ticket open for this:
https://bugzilla.mozilla.org/show_bug.cgi?id=758856
| Assignee | ||
Comment 7•14 years ago
I think I found the problem - the nagios check doesn't specify a password for the nagiosdaemon user. On the old phx monitoring system, there was a .pgpass file in the nagios user's homedir. I've copied that file over, along with its permissions, so hopefully that will fix it (I have to wait a few minutes for the next check to run).
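For reference, a .pgpass file holds one `hostname:port:database:username:password` entry per line, and on Unix libpq ignores the file unless its permissions are 0600 or stricter, which is why copying the permissions matters. A minimal sketch of creating such a file follows. The helper names, the port, the wildcard fields, and the password are placeholders, not the values from the old phx system.

```python
import os
import stat


def escape_field(value):
    """Escape backslashes and colons, as the .pgpass format requires."""
    return value.replace("\\", "\\\\").replace(":", "\\:")


def write_pgpass(path, host, port, database, user, password):
    """Write a one-entry .pgpass file readable only by its owner."""
    entry = ":".join(
        escape_field(str(field))
        for field in (host, port, database, user, password)
    )
    # Create the file with mode 0600 from the start ...
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(entry + "\n")
    # ... and enforce it in case the file already existed with looser bits.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return entry
```

A call might look like `write_pgpass(os.path.expanduser("~/.pgpass"), "*", 5432, "*", "nagiosdaemon", "<password>")`, run as the OS user that executes the checks.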
Comment 8•14 years ago
sheeri: nice catch!
| Assignee | ||
Comment 9•14 years ago
Well, you did the first step - making sure that the nagios user could connect with the right password. So that made me think, "what password is the nagiosdaemon user *trying* to use, and why would it be different from one nagios to the next?" I'm familiar with Nagios, and our setup doesn't have any differences, so it had to be that somehow magically one machine used a password and the other didn't. MySQL has .my.cnf files, so I figured pg might have something similar...
w00t!
2 of the 4 paging checks have gone green, I'll wait until they're all green to close this ticket.
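The "which password is it trying to use" question can be sketched as a lookup over .pgpass entries: the first line whose host, port, database, and user fields match (with `*` as a wildcard) supplies the password, and no match means the client falls back to prompting. The code below is a simplified illustration of that rule, not libpq's actual implementation. The function names and the sample entries in the usage note are hypothetical.

```python
def split_fields(line):
    """Split a .pgpass line on unescaped colons into at most five fields."""
    fields, current, escaped = [], [], False
    for ch in line:
        if escaped:
            # Character after a backslash is taken literally.
            current.append(ch)
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == ":" and len(fields) < 4:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def lookup_password(lines, host, port, database, user):
    """Return the password from the first matching entry, or None."""
    wanted = (host, str(port), database, user)
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = split_fields(line)
        if len(fields) != 5:
            continue  # malformed entry
        if all(f == "*" or f == w for f, w in zip(fields[:4], wanted)):
            return fields[4]
    return None
```

With entries such as `*:5432:*:nagiosdaemon:<password>`, a lookup for nagiosdaemon on any host returns that password. With no file or no matching entry it returns `None`, which corresponds to the "Password for user nagiosdaemon:" prompt the checks were reporting.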
| Assignee | ||
Comment 10•14 years ago
Closing, it was the .pgpass file that fixed it.
http://nagios1.private.phx1.mozilla.com/phx1/cgi-bin/status.cgi?host=tp-socorro01-master01.phx1.mozilla.com
and
http://nagios1.private.phx1.mozilla.com/phx1/cgi-bin/status.cgi?host=tp-socorro01-master02.phx1.mozilla.com
are all green now.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
| Reporter | ||
Comment 11•14 years ago
Thanks guys!
Comment 12•14 years ago
Whoops, sorry, I didn't realize this was alerting laura! I set this up in -dev on Fri and had opened Bug 760596 to grant access. Had no idea about .pgpass, sorry again!
| Assignee | ||
Comment 13•14 years ago
It was sending to an e-mail alias, so it wasn't actually paging anyone (but it should have... that's a matter for another ticket).
Updated•11 years ago
Product: mozilla.org → Data & BI Services Team