PostgreSQL has been paging for three days

RESOLVED FIXED

Status

Data & BI Services Team
DB: MySQL
--
blocker
RESOLVED FIXED
6 years ago
4 years ago

People

(Reporter: laura, Assigned: sheeri)

Tracking

Details

(Reporter)

Description

6 years ago
Every couple of minutes for the last couple of days (starting 6/1 at 3:53am ET), Nagios has paged on all of the PostgreSQL monitors on Socorro, as follows:

ERROR: Password for user nagiosdaemon:

While this doesn't sound like a horrible problem, it is critical because it hides any real problems with PostgreSQL, so it has effectively been unmonitored all weekend.

The concerning thing is that 50 or so minutes earlier on Friday, we were getting real pages ("Could not find a suitable psql executable"), for which I also don't know the root cause.

If there's a maintenance bug for around that time window, I don't think I'm CCed on it. If there is one, please add me.
(Assignee)

Comment 1

6 years ago
Taking this for now - I've paged mpressman.
Assignee: server-ops-database → scabral
(Assignee)

Comment 2

6 years ago
I'll note that https://dp-nagios01.phx.mozilla.com/nagios/cgi-bin/status.cgi?host=tp-socorro01-master01.phx1 shows that master01 is all green.

Also, here's a sample of the page from Laura - so it *is* coming from phx nagios...

Subject: 	** PROBLEM alert - tp-socorro01-master01.phx1.mozilla.com/PostgreSQL Connection is UNKNOWN **
Date: 	Sun, 03 Jun 2012 07:59:33 -0700
From: 	nagios@nagios1.private.phx1.mozilla.com (nagios)
To: 	cron-socorro@mozilla.com


***** Nagios  *****

Notification Type: PROBLEM

Service: PostgreSQL Connection
Host: tp-socorro01-master01.phx1.mozilla.com
Address: 10.8.70.100
State: UNKNOWN

Date/Time: 06-03-2012 07:59:33

Additional Info:

ERROR: Password for user nagiosdaemon:
I also worked on this with rhelmer on Friday and was able to manually connect from nagios to tp-socorro01-master01 just fine as nagiosdaemon. I think we need to give this to someone who knows Nagios and can work out why this is alerting without the alerts showing up in the system.
(Assignee)

Comment 4

6 years ago
Thanx, Matt.
The check_postgres.pl for socorro is the same as the check_postgres.pl for the general postgres check
(Assignee)

Comment 6

6 years ago
Indeed, this is the URL that shows the problems:

http://nagios1.private.phx1.mozilla.com/phx1/cgi-bin/status.cgi?host=tp-socorro01-master01.phx1.mozilla.com

And there was already another ticket open for this:

https://bugzilla.mozilla.org/show_bug.cgi?id=758856
(Assignee)

Comment 7

6 years ago
I think I found the problem - the nagios check doesn't specify a password for the nagiosdaemon user. On the old phx monitoring system, there was a .pgpass file in the nagios user's homedir. I've copied that file along with its permissions, so hopefully that will fix it (I have to wait a few minutes for the next check to run).
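
For anyone hitting the same thing, here is a minimal sketch of a sanity check for a .pgpass file. It relies on two documented libpq behaviours: each line has the form hostname:port:database:username:password (with * as a wildcard in the first four fields), and the file is ignored if group or others have any access to it. The function names (split_fields, pgpass_problems, find_password) are made up for this example, and the wildcard matching is simplified, so treat it as an illustration rather than what libpq actually runs.

```python
import os
import stat


def split_fields(line):
    """Split one .pgpass line on ':' while honouring backslash escapes."""
    fields, current, chars = [], [], iter(line)
    for ch in chars:
        if ch == "\\":
            # An escaped character (for example "\:") is taken literally.
            current.append(next(chars, ""))
        elif ch == ":":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def pgpass_problems(path):
    """Return reasons why a client would skip this password file."""
    if not os.path.isfile(path):
        return [f"{path} does not exist"]
    problems = []
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # The file is ignored when group or others have any access to it.
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(f"permissions {mode:o} are too open, expected 600")
    return problems


def find_password(path, host, port, dbname, user):
    """Return the password from the first matching line, or None."""
    wanted = (host, str(port), dbname, user)
    with open(path) as fh:
        for raw in fh:
            line = raw.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            fields = split_fields(line)
            if len(fields) != 5:
                continue  # malformed line
            if all(f == "*" or f == w for f, w in zip(fields[:4], wanted)):
                return fields[4]
    return None
```

Running pgpass_problems on the nagios user's ~/.pgpass and then find_password with the host, port, database and user from the check definition should show quickly whether the file is missing, too permissive, or simply has no matching line.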
sheeri: nice catch!
(Assignee)

Comment 9

6 years ago
Well, you did the first steps - making sure that the nagios user could connect with the right password. So that made me think, "what password is the nagiosdaemon user *trying* to use, and why would it be different from one nagios to the next?" I'm familiar with Nagios, and our setup doesn't have any differences, so it had to be that somehow magically one machine used a password and the other didn't. MySQL has .my.cnf files, so I figured pg might have something similar....

w00t!

2 of the 4 paging checks have gone green. I'll wait until they're all green to close this ticket.
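
The "what password is it trying to use" question can be roughly sketched in code. The snippet below assumes the documented libpq behaviour that a PGPASSWORD environment variable is consulted before the password file, and that PGPASSFILE overrides the default ~/.pgpass location. The function name password_source is hypothetical. It ignores a password given directly in the connection string and does not check file permissions, so it is only a first-pass comparison between two monitoring hosts.

```python
import os


def password_source(env, home):
    """Name the first place a libpq client would look for a password.

    env is a dict of environment variables and home is the home directory
    of the account that runs the check; both are supplied by the caller.
    """
    if env.get("PGPASSWORD"):
        return "PGPASSWORD environment variable"
    # PGPASSFILE overrides the default location of the password file.
    path = env.get("PGPASSFILE") or os.path.join(home, ".pgpass")
    if os.path.isfile(path):
        return f"password file {path}"
    return "none found (the client would have to prompt)"


if __name__ == "__main__":
    # Run this as the account that executes the checks on each host.
    print(password_source(dict(os.environ), os.path.expanduser("~")))
```

If the old host reports a password file and the new one reports none, that points at the same missing .pgpass described here.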
See Also: → bug 758856
(Assignee)

Comment 10

6 years ago
Closing, it was the .pgpass file that fixed it.

http://nagios1.private.phx1.mozilla.com/phx1/cgi-bin/status.cgi?host=tp-socorro01-master01.phx1.mozilla.com
and
http://nagios1.private.phx1.mozilla.com/phx1/cgi-bin/status.cgi?host=tp-socorro01-master02.phx1.mozilla.com

are all green now.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
(Reporter)

Comment 11

6 years ago
Thanks guys!
Whoops, sorry, I didn't realize this was alerting Laura! I set this up in -dev on Friday and had opened Bug 760596 to grant access. I had no idea about .pgpass, sorry again!
(Assignee)

Comment 13

6 years ago
It was sending to an e-mail alias, so it wasn't actually paging anyone (but it should have...that's a matter for another ticket).
Product: mozilla.org → Data & BI Services Team