[amo] Database access denied sporadically


Status

Priority: --
Severity: critical
Status: RESOLVED FIXED
Opened: 9 years ago
Updated: 5 years ago

People

(Reporter: jbalogh, Assigned: justdave)

(Reporter)

Description

9 years ago
This is from python: OperationalError: (1045, "Access denied for user 'remora'@'10.8.70.201' (using password: YES)")

But I saw it in a php cron job too.  I saw it twice this morning and a lot at 3:29pm and 3:50pm.

Comment 1

9 years ago
15:37 < oremj> it's because the queries are going through the load balancers
15:37 < oremj> and if a different load balancer picks up the ip
15:37 < oremj> it starts originating from that host
15:38 < oremj> so need to do db grants for all of them


Sorry, I told Wil about this and forgot to mention it in #webdev.
Assignee: server-ops → jeremy.orem+bugs
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
(Reporter)

Comment 2

9 years ago
I'm seeing this again for 10.8.70.201.
(Reporter)

Updated

9 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Comment 3

9 years ago
Fixed that now too. There shouldn't be any more.
Status: REOPENED → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
Got this at 1:28am:
Cron <root@ip-admin02> cd /data/amo/www/addons.mozilla.org-remora/bin; /usr/bin/python26 import-personas.py
"Can't connect to MySQL server on '10.2.70.147' (110)"

Got this at 8:33am:
Cron <root@ip-admin02> cd /data/amo/www/addons.mozilla.org-remora/bin; /usr/bin/python26 maintenance.py personas_adu
"Can't connect to MySQL server on '10.2.70.147' (110)"

Got this from zprod at 9:25am:
Error (EXTERNAL IP): /fr/thunderbird/api/1.2/list/featured/all/10/WINNT/3.0
OperationalError: (2003, "Can't connect to MySQL server on '10.8.70.19' (111)")
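
For anyone triaging these reports, the number in front of the message and the number in parentheses at the end tell different stories. Below is a minimal sketch of how a script could spell both out instead of only saying it can't connect. The `describe` helper and both lookup tables are illustrative and not part of the AMO code; the errno values assume a Linux client.

```python
import re

# MySQL error codes relevant to this bug (1xxx come from the server,
# 2xxx come from the client library).
MYSQL_ERRORS = {
    1040: "too many connections (server refused at its connection limit)",
    1045: "access denied (no matching grant for this user@host, or bad password)",
    2003: "can't connect to MySQL server (TCP-level failure)",
}

# Linux errno values that show up in parentheses after a 2003 message.
LINUX_ERRNO = {
    110: "ETIMEDOUT: connection attempt timed out",
    111: "ECONNREFUSED: connection refused by the remote end",
}


def describe(code, message):
    """Return a one-line explanation for a (code, message) error pair."""
    summary = MYSQL_ERRORS.get(code, "unrecognized MySQL error")
    parts = ["{0}: {1}".format(code, summary)]
    if code == 2003:
        # The OS errno is the trailing "(NNN)" of the client message.
        match = re.search(r"\((\d+)\)\s*$", message)
        if match:
            os_errno = int(match.group(1))
            parts.append(LINUX_ERRNO.get(os_errno, "errno {0}".format(os_errno)))
    return "; ".join(parts)


if __name__ == "__main__":
    print(describe(2003, "Can't connect to MySQL server on '10.8.70.19' (111)"))
    print(describe(1045, "Access denied for user 'remora'@'10.8.70.201' (using password: YES)"))
```

Read this way, the 1045 reports point at grants, while the 2003 reports with (110) or (111) point at something on the network path or at the listener.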

Comment 5

9 years ago
The personas crons should be fixed now.  I think the last one was probably a fluke.
We've gotten "Access denied for user 'personas'@'10.8.70.200' (using password: YES)" twice since yesterday.  It can connect now but doesn't have permissions.  Can we just clone all the access rules from the original boxes?

php -f maintenance.php l10n_rss
php -f maintenance.php l10n_stats
both failed at 3:01am with "Can't connect to MySQL server on '10.8.70.10'"

and we've had production zamboni errors complaining about the same thing all through the night, a total of 19 errors.

Are we bumping up against max_connections again?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Comment 7

9 years ago
The personas permissions are fixed for real this time. Just ran both of those scripts to double check.  Not sure about the unable to connect errors.

Dave or Tim, do you have any graphs on what was going on with 10.8.70.200 @ 3:00am?
bug 555880 says max_connections is 1500.  Was that carried over to the new servers?
We've been experiencing this all weekend.  Can someone reply to comment 8?
Has absolutely nothing to do with max connections, this is a permission issue.  Apparently nobody set up the ACLs in MySQL when the app moved to Phoenix.  I'll poke.
Assignee: jeremy.orem+bugs → justdave
OK, looks like someone did attempt to set them up, but granted them individually for each load balancer relay IP rather than using the wildcard that gets all of them at once and allows for future expansion.  It should still work though.  This probably is the max clients thing then, which you could probably figure out better if you had your script actually report the error it gets instead of just saying it can't connect.
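
To illustrate the difference between the two approaches, here is a rough sketch that only builds the GRANT statements as strings. The `grant_statements` helper, the `addons` database name, the `SELECT` privilege, and the `10.8.70.%` pattern are assumptions for the example (the pattern is guessed from the relay IPs quoted in this bug), not the grants actually in place. `%` is MySQL's wildcard in the host part of an account name.

```python
def grant_statements(user, hosts, privileges="SELECT", database="addons"):
    """Build one GRANT statement per host pattern.

    The default database name and privilege list are placeholders for
    the example, not the real AMO configuration.
    """
    template = "GRANT {privs} ON `{db}`.* TO '{user}'@'{host}';"
    return [
        template.format(privs=privileges, db=database, user=user, host=host)
        for host in hosts
    ]


if __name__ == "__main__":
    # One grant per load balancer relay IP: breaks when a new relay is added.
    for stmt in grant_statements("remora", ["10.8.70.200", "10.8.70.201"]):
        print(stmt)

    # A single wildcard grant: covers every relay address in the range.
    for stmt in grant_statements("remora", ["10.8.70.%"]):
        print(stmt)
```

The per-IP form works as long as the list is complete, which is exactly what went wrong earlier in this bug when 10.8.70.201 was missed.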

The max connections thing isn't going to get solved without the app fixing the queries it's running.  We decided that after the last discussion on this.  No matter how high we set max_connections, the app will always hit it when it does the queries that back everything up.
This has started happening far more frequently since the move to phoenix.  We're working on fixing the scripts, but can you make sure the limit is still 1500 on these servers?
it's still talking to the same server.  The master database server didn't change.
mysql> show global variables like 'max_connections';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 1500  |
+-----------------+-------+
1 row in set (0.00 sec)
fwiw, Dave bumped up the value on the slaves from 1200 this morning, I haven't seen a problem since.
We had a per-ip connection limit on the zeus VIP that was actually interfering.  I bumped that from 1200 to 20000 just to remove it from the equation.  Seems to have fixed the problem as far as I can tell.  AMO has moved back to SJC since, which makes it moot anyhow (but good to have it fixed when we eventually move back to phx again).
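
As a back-of-the-envelope illustration of why the VIP limit mattered, assume every app connection passes through the VIP and counts against the same per-IP bucket (a simplification, and the `effective_ceiling` name is made up for this example). The usable ceiling is then whichever cap is lower:

```python
def effective_ceiling(max_connections, per_ip_limit):
    """Return the tightest connection cap on the path to the database.

    Simplified model: all connections count against one per-IP bucket
    on the load balancer, so the lower of the two limits wins.
    """
    return min(max_connections, per_ip_limit)


if __name__ == "__main__":
    # Before the change the VIP limit was hit before MySQL's own limit.
    print(effective_ceiling(max_connections=1500, per_ip_limit=1200))
    # After raising the VIP limit, MySQL's max_connections is the cap again.
    print(effective_ceiling(max_connections=1500, per_ip_limit=20000))
```

Under that assumption, connections were being capped at 1200 before MySQL's own 1500 ever came into play.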
Status: REOPENED → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations