Closed Bug 947938 Opened 11 years ago Closed 10 years ago

debug why newrelic pgbouncer isn't working

Categories

(Data & BI Services Team :: DB: MySQL, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: scabral, Assigned: mpressman)

Details

(Whiteboard: [2014q2] April)

Matt - could you check the pgbouncer auth files for the machines listed at https://rpm.newrelic.com/accounts/263620/plugins?type=6283 and figure out why newrelic (on the scalebase hosts) cannot connect to pgbouncer?
Whiteboard: [2014q1]
Assignee: server-ops-database → mpressman
Whiteboard: [2014q1] → [2014q2] April
The link in the description just brings up an error page and doesn't provide a list of machines. Is there a list of machines that are affected elsewhere?
Also, since pgbouncer is only running on the socorro machines, I went in and verified the pgbouncer auth files there. I was able to connect using the credentials provided, though only locally, which confirms that the password matches the hash in the auth files.
Finally, pg_hba.conf only grants newrelic access from scalebase1, and the only postgres plugin running there showed no pgbouncer ports. There was an additional plugin config in the postgresql dir named pgbouncer.cfg, and its pgbouncer section did not contain a password for access. The postgresql section listings did contain the proper password for the newrelic user:
pgbouncer:
-host:
 name:
 port:
 user: newrelic
One last thing, since I'm flying blind without debug messages or a list of affected machines: the newrelic pgbouncer plugin may be failing because of a permissions/role issue. We only allow three roles access to the show commands, and newrelic is neither listed nor a member of the allowed roles. Adding the newrelic role to stats_users in the pgbouncer config files would give newrelic show access.
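A minimal sketch of that check, assuming the pgbouncer config is a standard INI file whose [pgbouncer] section carries comma-separated admin_users and stats_users lists. The sample config text, the role names other than newrelic, and the helper name can_run_show are hypothetical and only illustrate the idea:

```python
import configparser

# Hypothetical sample config; the real files are managed by puppet and
# their contents are not shown in this bug.
SAMPLE_INI = """
[databases]
breakpad = host=127.0.0.1 port=5432

[pgbouncer]
listen_port = 6432
admin_users = postgres
stats_users = monitor, nagios
"""


def can_run_show(ini_text, role):
    """Return True if role appears in admin_users or stats_users."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    section = parser["pgbouncer"]
    allowed = set()
    for key in ("admin_users", "stats_users"):
        raw = section.get(key, "")
        # Both settings are comma-separated lists of role names.
        allowed.update(name.strip() for name in raw.split(",") if name.strip())
    return role in allowed


# Before the change the role is missing; after appending it to
# stats_users the same check passes.
print(can_run_show(SAMPLE_INI, "newrelic"))
patched = SAMPLE_INI.replace("monitor, nagios", "monitor, nagios, newrelic")
print(can_run_show(patched, "newrelic"))
```

Running this against each host's actual config file would be one way to confirm the puppet change landed before reloading pgbouncer.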
So, scalebase1.db.phx1.mozilla.com is running newrelic, check out /var/log for the logs, so you can see the error messages.

It's probably the pgbouncer ACLs you mention in comment 4. Is that something that's under puppet control?
I did check the logs in /var/log, both /var/log/newrelic and /var/log/newrelic-plugin-agent-supervisor.log[1-3], and there isn't any mention of pgbouncer. The current process list shows newrelic_plugin_agent using the config /usr/local/newrelic/postgresql/pg_newrelic_config.cfg. A pgbouncer.cfg also exists in the same postgresql dir. I assume that is the plugin config that isn't working, and that it is simply disabled because it hasn't been working.

Regardless, I'm with you and think the issue has to do with the pgbouncer ACLs. Without privileges on pgbouncer's special administration database, the newrelic user cannot run the show commands and the plugin will fail. I checked the pgbouncer logs while manually trying to connect as the newrelic user and saw the following message:
WARNING C-0x1725748: pgbouncer/newrelic@unix:6432 Pooler Error: not allowed

Also, yes the pgbouncer configs are under puppet control. I have updated the config and added the newrelic user (revision 86437). On socorro1.stage.db.phx1 I manually reloaded the pgbouncer configs and retested connecting as the newrelic user and it succeeded.

The next step will be to re-enable the pgbouncer new relic plugin.
The config is in /usr/local/newrelic/pgbouncer/newrelic_plugin_agent.cfg - I put socorro3 in it.

You can run it by doing:
/usr/bin/newrelic_plugin_agent  -c /usr/local/newrelic/pgbouncer/newrelic_plugin_agent.cfg -f

(-c is the location of the config file, -f is to run in the foreground, you can press Ctrl-C to get out of it)

I'm getting:
INFO       2014-04-25 15:35:04 31336  MainProcess     MainThread clihelper                                     run                       L382   : newrelic_plugin_agent 1.1.0 started
CRITICAL   2014-04-25 15:35:04 31336  MainProcess     MainThread newrelic_plugin_agent.plugins.postgresql      poll                      L256   : Could not connect to PgBouncer, skipping stats run: could not connect to server: Connection refused
        Is the server running on host "socorro3.db.phx1.mozilla.com" and accepting
        TCP/IP connections on port 6000?


I got the same result when I tried socorro1.db.phx1.mozilla.com.
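"Connection refused" on port 6000 suggests nothing is listening there, independent of any ACL problem. A quick way to see which ports accept TCP connections is sketched below. The helper name port_accepts_connections and the 127.0.0.1 placeholder host are assumptions; 6432 and 6433 are the pgbouncer instance ports reported as working later in this bug:

```python
import socket


def port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    # Placeholder host; substitute the pgbouncer host being checked.
    for port in (6000, 6432, 6433):
        ok = port_accepts_connections("127.0.0.1", port, timeout=0.5)
        print(port, "open" if ok else "closed or unreachable")
```

This only confirms that something is listening on the port; it does not exercise pgbouncer authentication or the stats_users check.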
YAY - It worked on stage, where I had reloaded the config. I also had to put the password in it, but here are the results:

newrelic_plugin_agent  -c /usr/local/newrelic/pgbouncer/newrelic_plugin_agent.cfg -f
INFO       2014-04-25 21:37:05 19236  MainProcess     MainThread clihelper                                     run                       L382   : newrelic_plugin_agent 1.1.0 started
INFO       2014-04-25 21:37:05 19236  MainProcess     MainThread newrelic_plugin_agent.plugins.base            finish                    L141   : PgBouncer poll successful, completed in 0.05 seconds
INFO       2014-04-25 21:37:05 19236  MainProcess     MainThread newrelic_plugin_agent.agent                   send_components           L209   : Sending 31 metrics to NewRelic
INFO       2014-04-25 21:37:10 19236  MainProcess     MainThread newrelic_plugin_agent.agent                   process                   L122   : All stats processed in 4.44 seconds, next wake in 55.56
INFO       2014-04-25 21:37:33 19236  MainProcess     MainThread clihelper                                     run                       L746   : CTRL-C caught, shutting down
INFO       2014-04-25 21:37:33 19236  MainProcess     MainThread clihelper                                     stop                      L463   : Attempting to stop the process
INFO       2014-04-25 21:37:33 19236  MainProcess     MainThread clihelper                                     run                       L761   : clihelper.run exiting cleanly
I'll go on to the rest of the socorro hosts and reload pgbouncer's config, which now has the newrelic user added to stats_users.
All socorro hosts report success on port 6432 (the pgbouncer-web instance):
socorro1.db.phx1
socorro2.db.phx1
socorro3.db.phx1
socorro1.stage.db.phx1
socorro-reporting1.db.phx1
All socorro hosts report success on port 6433 (the pgbouncer-processor instance).
I added all the socorro hosts in comment 10 to /usr/local/newrelic/pgbouncer/newrelic_plugin_agent.cfg, both the web instance and the processor instance, and the output appears successful.

newrelic_plugin_agent  -c /usr/local/newrelic/pgbouncer/newrelic_plugin_agent.cfg -f
INFO       2014-04-25 22:00:04 22281  MainProcess     MainThread clihelper                                     run                       L382   : newrelic_plugin_agent 1.1.0 started
INFO       2014-04-25 22:00:04 22281  MainProcess     MainThread newrelic_plugin_agent.plugins.base            finish                    L141   : PgBouncer poll successful, completed in 0.01 seconds
INFO       2014-04-25 22:00:04 22281  MainProcess     MainThread newrelic_plugin_agent.plugins.base            finish                    L141   : PgBouncer poll successful, completed in 0.00 seconds
INFO       2014-04-25 22:00:04 22281  MainProcess     MainThread newrelic_plugin_agent.plugins.base            finish                    L141   : PgBouncer poll successful, completed in 0.01 seconds
INFO       2014-04-25 22:00:04 22281  MainProcess     MainThread newrelic_plugin_agent.plugins.base            finish                    L141   : PgBouncer poll successful, completed in 0.00 seconds
INFO       2014-04-25 22:00:04 22281  MainProcess     MainThread newrelic_plugin_agent.plugins.base            finish                    L141   : PgBouncer poll successful, completed in 0.01 seconds
INFO       2014-04-25 22:00:04 22281  MainProcess     MainThread newrelic_plugin_agent.plugins.base            finish                    L141   : PgBouncer poll successful, completed in 0.00 seconds
INFO       2014-04-25 22:00:04 22281  MainProcess     MainThread newrelic_plugin_agent.plugins.base            finish                    L141   : PgBouncer poll successful, completed in 0.01 seconds
INFO       2014-04-25 22:00:04 22281  MainProcess     MainThread newrelic_plugin_agent.plugins.base            finish                    L141   : PgBouncer poll successful, completed in 0.01 seconds
INFO       2014-04-25 22:00:04 22281  MainProcess     MainThread newrelic_plugin_agent.plugins.base            finish                    L141   : PgBouncer poll successful, completed in 0.01 seconds
INFO       2014-04-25 22:00:04 22281  MainProcess     MainThread newrelic_plugin_agent.plugins.base            finish                    L141   : PgBouncer poll successful, completed in 0.00 seconds
INFO       2014-04-25 22:00:04 22281  MainProcess     MainThread newrelic_plugin_agent.agent                   send_components           L209   : Sending 274 metrics to NewRelic
INFO       2014-04-25 22:00:04 22281  MainProcess     MainThread newrelic_plugin_agent.agent                   process                   L122   : All stats processed in 0.56 seconds, next wake in 59.44
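The ten "PgBouncer poll successful" lines are consistent with five hosts times two pgbouncer instances. A rough sketch of generating such a pgbouncer section follows, assuming the host/name/port/user keys from the fragment quoted earlier in this bug and a YAML list layout. The render_pgbouncer_section helper and the name values are hypothetical, and the password entry is deliberately left out:

```python
# Short host names as listed earlier in this bug; the real config may
# use fully qualified names.
HOSTS = [
    "socorro1.db.phx1",
    "socorro2.db.phx1",
    "socorro3.db.phx1",
    "socorro1.stage.db.phx1",
    "socorro-reporting1.db.phx1",
]
# Ports for the pgbouncer-web and pgbouncer-processor instances.
INSTANCES = {"web": 6432, "processor": 6433}


def render_pgbouncer_section(hosts, instances, user="newrelic"):
    """Render one list entry per (host, instance) pair as config text."""
    lines = ["pgbouncer:"]
    for host in hosts:
        for label, port in instances.items():
            lines.append(f"  - host: {host}")
            lines.append(f"    name: {host}-{label}")  # hypothetical naming
            lines.append(f"    port: {port}")
            lines.append(f"    user: {user}")
            # The password entry is omitted here on purpose.
    return "\n".join(lines)


section = render_pgbouncer_section(HOSTS, INSTANCES)
print(section)
print(section.count("- host:"), "entries")
```

Generating the section rather than hand-editing it keeps the two instances in step when a host is added or removed.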
The link at https://rpm.newrelic.com/accounts/263620/plugins?type=6283 is now populated and all hosts are green, indicating "Component normal".
I think we're good here. I'll leave this bug open. Please close if it is working, otherwise please let me know what I'm missing.
The point was to debug why it's not working, and it's working now, so this bug's goal is met. Resolving.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team