Closed Bug 935682 Opened 12 years ago Closed 10 years ago

new postgres Nagios check for stage/dev refresh

Categories

(Infrastructure & Operations :: MOC: Service Requests, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: scabral, Assigned: rwatson)

References

Details

(Whiteboard: :MOC)

Not sure if we have a check for this, but on the breakpad db on both stage and dev, we should have a check that checks for stage freshness: select max(date_processed) from reports; should give a date that's >10 days old. Let me know if we have to write this check or if we have something we're already using that can work.
Depends on: 935635
There was a mysql backup check that could be used as a framework for these checks. I believe they checked for the existence of a file that was written after the backup completed. We could follow the same idea and have the refresh scripts write out to a file when completed.
The check we want to emulate is check_mysql_backups and the plugin is at modules/nrpe/files/plugins/check_mysql_backups - we can use this in the same way, but tailored to the refresh. So, instead of /var/run/backup-status we can use /var/run/refresh-status.
Whiteboard: :MOC
Sheeri, Can you please tell me exactly which boxes need this check?
Flags: needinfo?(scabral)
The dev hosts would be socorro1.dev.db.phx1.mozilla.com and socorro2.dev.db.phx1.mozilla.com. The stage hosts would be socorro1.stage.db.phx1.mozilla.com and socorror2.stage.db.phx1.mozilla.com
Flags: needinfo?(scabral)
Flags: needinfo?(dgarvey)
Hi Sheeri, I am not sure where to get the password for monitor or if I can get by without one? Also, I don't understand the reason for the comments [2-3] by mpressman. Why would I write a file? [dgarvey@socorro1.stage.db.phx1 ~]$ sudo su - postgres Please run /usr/local/sbin/install_hp_firmware.sh before rebooting this machine to install: eth_broadcom_update.sh Please run /usr/local/sbin/install_hp_firmware.sh before rebooting this machine to install: eth_broadcom_update.sh [postgres@socorro1 ~]$ psql -U monitor breakpad -c 'select max(date_processed) from reports;' Password for user monitor: psql: fe_sendauth: no password supplied [postgres@socorro1 ~]$ psql breakpad -c 'select max(date_processed) from reports;' max ------------------------------- 2014-12-18 23:06:25.188223+00 (1 row) [postgres@socorro1 ~]$ I did find this script regarding psql auth. What would you recommend? check_foreman_latest:latest_report=$(sudo -u postgres psql -t -c "select extract(epoch from(current_timestamp - updated_at)) from reports order by updated_at desc limit 1" puppetdashboard_mozilla_org | tr -d ' ') dgarvey@dgarvey-mozilla:~/svn/sysadmins/puppet/trunk/modules/nrpe/files/plugins$
Group: mozilla-employee-confidential
Flags: needinfo?(dgarvey)
Sheeri, Ashish pointed me to the password stuff, nevermind. However, I still am not clear on the comments [2-3] by mpressman?
David - he wanted you to look at the script at modules/nrpe/files/plugins/check_mysql_backups and use a similar idea to check that the postgres backups were done. However, I think it's much more simple to just run the query: select max(date_processed) from reports; And that should give a date that's >10 days old. A simple check, like this - http://exchange.nagios.org/directory/Plugins/Databases/PostgresQL/check_psql_query/details
Here is the script I have so far. Verify it if you would please. import psycopg2 from datetime import datetime, timedelta import sys try: conn = psycopg2.connect( host = 'socorro1.dev.db.phx1.mozilla.com', database = 'breakpad', user = 'nagiosdaemon', password = '' ) except: print "I am unable to connect to the database." cur = conn.cursor() try: cur.execute("select max(date_processed) from reports;") except: print "I can't SELECT from reports" rows = cur.fetchall() for row in rows: query_date = row[0] #remove timezone awareness query_date = query_date.replace(tzinfo=None) #Subtract 10 days from current date to get working date working_date = datetime.now() - timedelta(days=10) print working_date print query_date if working_date > query_date: print "Ok: query date is greater than 10 days" sys.exit(0) else: print "Critical: query date is less than 10 days" sys.exit(2) sys.exit(3)
I would want to wrap the whole first section getting data from the db in a try timeout with a configurable value. We've had problems before with monitoring script hanging on database access and although the incoming connection from Nagios gets timed out the scripts themselves hanging around until the machine alerts for load.
I think that's what the script I recommended does - http://exchange.nagios.org/directory/Plugins/Databases/PostgresQL/check_psql_query/details I'm not sure if it's more useful to write one from scratch or just modify the existing plugin to handle our situation. I thought the latter would be better, because then we can use it with any query on any postgres server, without having to add a new script. I was wondering why it was taking so long to put that plugin in place with a query like "SELECT CURRENT_DATE - DATE(MAX(date_processed)) FROM reports;" and making sure it's not greater than 1. Could we just do that instead of reinventing the wheel?
That no star perl script? Sorry my perl is not as good as it used to be.
I see the connection wrapped in an eval in that plugin perl script but I don't see a timeout and don't know if the eval helps with that at all, not being a perl person.
ok I found a solution to the connection timeout. import psycopg2 from datetime import datetime, timedelta import sys #Create connection timeout. import os os.environ['PGOPTIONS'] = '-c statement_timeout=60' try: conn = psycopg2.connect( host = 'socorro1.dev.db.phx1.mozilla.com', database = 'breakpad', user = 'nagiosdaemon', password = '', ) except: print "I am unable to connect to the database." cur = conn.cursor() try: cur.execute("select max(date_processed) from reports;") except: print "I can't SELECT from reports" rows = cur.fetchall() for row in rows: query_date = row[0] #remove timezone awareness query_date = query_date.replace(tzinfo=None) #Subtract 10 days from current date to get working date working_date = datetime.now() - timedelta(days=10) #print working_date #print query_date if working_date > query_date: print "Ok: query date is greater than 10 days" sys.exit(0) else: print "Critical: query date is less than 10 days" sys.exit(2) sys.exit(3)
I was thinking that I can create a ssh tunnel with paramiko to make this more secure but it wouldn't work. [dgarvey@nagios1.private.phx1 ~]$ sudo su - nagios Please run /usr/local/sbin/install_hp_firmware.sh before rebooting this machine to install: eth_broadcom_update.sh -sh-4.1$ ssh socorro1.dev.db.phx1.mozilla.com The authenticity of host 'socorro1.dev.db.phx1.mozilla.com (10.8.70.124)' can't be established. RSA key fingerprint is 2b:26:98:2f:ab:5a:07:9e:01:ab:6e:2d:bc:5c:09:3c. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'socorro1.dev.db.phx1.mozilla.com,10.8.70.124' (RSA) to the list of known hosts. Permission denied (publickey,gssapi-keyex,gssapi-with-mic). -sh-4.1$
Please avoid ssh tunnels. The nagiosdaemon user shouldn't have enough privileges that a compromise should hurt...
Assignee: server-ops → rwatson
Group: infra, mozilla-employee-confidential
Component: Server Operations → MOC: Service Requests
Product: mozilla.org → Infrastructure & Operations
QA Contact: shyam → lypulong
When I originally tested this I only used socorro1.dev.db.phx1.mozilla.com which works. However, I don't see this script running on the other boxes requested for the following reasons. socorro2.dev.db.phx1.mozilla.com: No breakpad database. socorro1.stage.db.phx1.mozilla.com: nagiosdaemon has no read privileges on table reports. socorro2.stage.db.phx1.mozilla.com: No breakpad database. The script in current incarnation. import psycopg2 from datetime import datetime, timedelta import sys, getopt # Get the args try: opts, args = getopt.getopt(sys.argv[1:], "h:p:H:", ["help", "passwd=", "hostname="]) except getopt.GetoptError: print " Args are: -p password -H hostname" sys.exit(3) for opt, arg in opts: if opt in ("-h", "--help"): print "Args are: -p password -H hostname" sys.exit(3) elif opt == '-p': if arg: passwd = arg elif opt == '-H': if arg: hostname = arg #Config vars: db = 'breakpad' #database name dbuser = 'nagiosdaemon' #database username conn_string = "host='%s' dbname='%s' user='%s' password='%s'" %(hostname, db, dbuser, passwd) #Create connection timeout. import os os.environ['PGOPTIONS'] = '-c statement_timeout=60' try: conn = psycopg2.connect(conn_string) except: print "I am unable to connect to the database." cur = conn.cursor() try: cur.execute("select max(date_processed) from reports;") except: print "I can't SELECT from reports" rows = cur.fetchall() for row in rows: query_date = row[0] #remove timezone awareness query_date = query_date.replace(tzinfo=None) #Subtract 10 days from current date to get working date working_date = datetime.now() - timedelta(days=10) #print working_date #print query_date if working_date > query_date: print "Ok: query date is greater than 10 days" sys.exit(0) else: print "Critical: query date is less than 10 days" sys.exit(2) sys.exit(3)
Flags: needinfo?(scabral)
I don't think it needs to run on anything other than socorro1.dev. Flipping to Matt for his opinion.
Flags: needinfo?(scabral) → needinfo?(mpressman)
actually, socorro1.stage.db.phx1 is all this needs to be run on. The dev machine is no longer really used by anyone for refresh issues.
Flags: needinfo?(mpressman)
To be clear, since this bug was created things have evolved and for awhile it was indeed the dev machine that was the most important host, but now the group has changed how they utilize it and now stage is the one that matters
I just tested and nagiosdaemon does have read on the reports table [postgres@socorro1 ~]$ psql -U nagiosdaemon breakpad Password for user nagiosdaemon: psql (9.2.9) Type "help" for help. breakpad=> select max(date_processed) from reports; max ------------------------------- 2015-02-20 13:30:29.396055+00 (1 row)
Matt, What box was that? Do I have the wrong boxes? [dgarvey@socorro1.stage.db.phx1 ~]$ sudo -u postgres psql -d breakpad -c "SELECT grantee, privilege_type FROM information_schema.role_table_grants where table_name='reports';" grantee | privilege_type -------------+---------------- breakpad | SELECT breakpad_ro | SELECT ganglia | SELECT monitoring | SELECT breakpad_rw | INSERT breakpad_rw | SELECT breakpad_rw | UPDATE breakpad_rw | DELETE breakpad_rw | TRUNCATE breakpad_rw | REFERENCES breakpad_rw | TRIGGER (11 rows) [dgarvey@socorro1.stage.db.phx1 ~]$ [dgarvey@socorro2.stage.db.phx1 ~]$ sudo -u postgres psql -d breakpad -c "SELECT grantee, privilege_type FROM information_schema.role_table_grants where table_name='reports';" psql: FATAL: database "breakpad" does not exist [dgarvey@socorro2.stage.db.phx1 ~]$
Flags: needinfo?(mpressman)
The box is correct, socorro1.stage.db.phx1 and the database is breakpad. nagiosdaemon does indeed have select privs on reports. You can verify it works by logging in as nagiosdaemon and running the query as I did in comment 21. You can also use the has_table_privilege function to verify it has select privs on reports: As the nagiosdaemon user from the psql shell: breakpad=# select has_table_privilege('nagiosdaemon','reports','select'); has_table_privilege --------------------- t Or from the shell using a privileged user, in your example you used the postgres inherited user: sudo -u postgres psql -d breakpad -c "select has_table_privilege('nagiosdaemon','reports','select')" has_table_privilege --------------------- t (1 row) Here is the nagiosdaemon user getting the results: Using the query from your script in comment 17 as the nagiosdaemon user in the breakpad database psql -U nagiosdaemon -d breakpad -c "select max(date_processed) from reports" Password for user nagiosdaemon: max ------------------------------- 2015-02-25 02:12:13.688662+00 (1 row)
Flags: needinfo?(mpressman)
I thought I added the following in my response last night, but I must not have submitted it. The select from role_table_grants does show the roles that have access to the reports table. Essentially, the monitoring role listed contains the nagiosdaemon role so nagiosdaemon inherits the privs from monitoring.
Humm, I must be using the wrong password then? I am using this one: puppet/trunk/hiera/secrets/site.yaml:secrets_nagiosdaemon_password: [dgarvey@socorro1.stage.db.phx1 ~]$ sudo -u postgres psql -U nagiosdaemon -d breakpad -c "select max(date_processed) from reports;" Password for user nagiosdaemon: psql: FATAL: password authentication failed for user "nagiosdaemon" [dgarvey@socorro1.stage.db.phx1 ~]$ It works from dev... [dgarvey@socorro1.dev.db.phx1 ~]$ sudo -u postgres psql -U nagiosdaemon -d breakpad -c "select max(date_processed) from reports;" Password for user nagiosdaemon: max ------------------------------- 2014-02-09 19:59:54.115552+00 (1 row) [dgarvey@socorro1.dev.db.phx1 ~]$ Yes, After further investigation I noticed that this password works;) It looks like dev postgres password is diff than stage. thanks puppet/trunk/hiera/secrets/site.yaml:secrets_nagiosdaemon_postgres_password:
snap, its already alerting;) [dgarvey@nagios1.private.phx1 ~]$ python querydb.py -p secret -H socorro1.stage.db.phx1.mozilla.com Critical: query date is less than 10 days [dgarvey@nagios1.private.phx1 ~]$ I am not sure if i want to push this out right away. the contacts for the alert will be dba's?
Flags: needinfo?(mpressman)
It will come to the dba's. However, I think I see the issue with the check. We want to make sure that the max(date_processed) date is less than 10 days. I think we want to reverse it and make sure that the date_processed isn't greater than 10 days
Flags: needinfo?(mpressman)
To be a little clearer, we want to make sure the the value max(processed_date) isn't older than 10 days To get the values and logic via SQL, you can just run the query which returns three columns. The first column is the max processed_date from reports assigned the name query_date. The second column is the current date less 10 days and assigned the name working_date The third column returns the conditional result of the logic check assigned the name check_result which returns either 'Ok: query date is less than 10 days old' or 'Critical: query date is greater than 10 days old by checking that the working_date is greater than 10 days old: breakpad=# SELECT to_char(MAX(date_processed), 'YYYY-MM-DD HH24:MI:SS') as query_date, to_char((now() - interval '10 days'),'YYYY-MM-DD HH24:MI:SS') as working_date, CASE WHEN MAX(date_processed) > (now() - interval '10 days') THEN 'Ok: query date is less than 10 days old' ELSE 'Critical: query date: ' || to_char(MAX(date_processed), 'YYYY-MM-DD HH24:MI:SS') || ' is greater than 10 days old' END as check_result FROM reports; currently run that returns: query_date | working_date | check_result -------------------------------+-------------------------------+----------------------------------------- 2015-03-07 01:01:21.833951+00 | 2015-02-25 01:05:54.513884+00 | Ok: query date is less than 10 days old Or you can just run it to get check_result: breakpad=# SELECT CASE WHEN MAX(date_processed) > (now() - interval '10 days') THEN 'Ok: query date is less than 10 days old' ELSE 'Critical: query date: ' || to_char(MAX(date_processed), 'YYYY-MM-DD HH24:MI:SS') || ' is greater than 10 days old' END as check_result FROM reports; check_result ----------------------------------------- Ok: query date is less than 10 days old
That's better thanks for the clarification. [dgarvey@nagios1.private.phx1 ~]$ python querydb.py -p secret -H socorro1.stage.db.phx1.mozilla.com 2015-02-28 02:27:59.334742 2015-03-10 09:25:14.959181 OK: query date is less than 10 days [dgarvey@nagios1.private.phx1 ~]$ I changed this: #if working_date > query_date: if query_date > working_date: print "OK: query date is less than 10 days" sys.exit(0) else: print "CRITICAL: query date is greater than 10 days" sys.exit(2) to match your query: [dgarvey@nagios1.private.phx1 ~]$ sudo -u nagios psql -U nagiosdaemon -h socorro1.stage.db.phx1.mozilla.com -d breakpad -W -c "SELECT CASE WHEN MAX(date_processed) > (now() - interval '10 days') THEN 'Ok: query date is less than 10 days old' ELSE 'Critical: query date: ' || to_char(MAX(date_processed), 'YYYY-MM-DD HH24:MI:SS') || ' is greater than 10 days old' END as check_result FROM reports;" Password for user nagiosdaemon: check_result ----------------------------------------- Ok: query date is less than 10 days old (1 row) [dgarvey@nagios1.private.phx1 ~]$
Pushed it out with puppet with the following notification options. notification_period => 'usworkinghours', contact_groups => 'mysqldbalerts',
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.