AMO occasionally can't connect to its services

Status

Product: Infrastructure & Operations
Component: WebOps: Other
Status: RESOLVED FIXED
Opened: 6 years ago
Updated: 4 years ago

People

(Reporter: clouserw, Assigned: oremj)

(Reporter)

Description

6 years ago
Every week or so, production AMO logs anywhere from a couple hundred to a couple thousand errors saying it's unable to connect to a service. Examples:

> OperationalError: (2003, "Can't connect to MySQL server on 'db-amo-ro' (110)")

> TimeoutError: Request timed out after 5.000000 seconds

> OperationalError: (1135, "Can't create a new thread (errno 11); if you are not out of available memory, you can consult the manual for a possible OS-dependent bug")

The first is MySQL, and we have also seen it for db-amo-rw. The second is elastic search. The third is MySQL again. With our amount of traffic these are generally not a big deal, because the person can refresh the page and move on with their lives. Occasionally, though, it is a write that fails, and we've also been getting emails from PayPal that their IPN requests to us are failing.

It's tough to debug periodic failures like that when it might be the infrastructure itself.  Is there something we can do to increase the reliability of the servers?

Some data from today:
- at 9:45pm tonight we had a bunch of failures to connect to db-amo-ro
- at 11:09am and 11:18am today we had a bunch of failures to connect to elastic search
- at 4:57am, 5:02, 5:04, 5:05, and 5:15am we got a bunch of thread errors from mysql
- sprinkled between all those main fails were dozens of random failures to connect - just 1 or 2 scattered around.
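
One way to separate the bursts listed above from the scattered one-off failures is to bucket the tracebacks by service and by minute. This is only a minimal sketch: it assumes the errors are available as (timestamp, message) pairs, and the `classify` rules, the helper names, and the default `threshold` of 10 are illustrative choices, not anything taken from AMO's logging.

```python
from collections import Counter
from datetime import datetime

def classify(message):
    """Map an error message to the service it most likely points at."""
    if "TimeoutError" in message:
        return "elasticsearch"      # per the report, the timeouts are elastic search
    if "(1135," in message:
        return "mysql-threads"      # "Can't create a new thread"
    if "(2003," in message:
        return "mysql-connect"      # "Can't connect to MySQL server"
    return "other"

def bucket_by_minute(events):
    """Count errors per (service, minute) so bursts stand out."""
    counts = Counter()
    for stamp, message in events:
        minute = stamp.replace(second=0, microsecond=0)
        counts[(classify(message), minute)] += 1
    return counts

def bursts(counts, threshold=10):
    """Return (minute, service, count) for buckets at or above the threshold."""
    hits = [(minute, service, n)
            for (service, minute), n in counts.items()
            if n >= threshold]
    return sorted(hits)
```

Buckets below the threshold are the "1 or 2 scattered around" kind, while the ones `bursts` returns are candidates to line up against infrastructure events at the same minute.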
(Assignee)

Updated

6 years ago
Assignee: server-ops → oremj
(Assignee)

Comment 1

6 years ago
We've raised the ulimits on mysql, so this shouldn't happen any more.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
(Reporter)

Comment 2

6 years ago
This includes elastic search too
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Reporter)

Comment 3

6 years ago
(In reply to Wil Clouser [:clouserw] from comment #2)
> This includes elastic search too

We just got a few hundred more of these also:

> OperationalError: (2013, 'Lost connection to MySQL server during query')

> OperationalError: (2003, "Can't connect to MySQL server on 'db-amo-ro' (111)")

So there are still mysql problems
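
The trailing number in the 2003 errors is the operating system errno from the failed connect, so (110) and (111) are different failure modes. A small sketch that decodes them, assuming the servers run Linux (the numeric values below are the Linux ones and differ on other platforms); `decode` is a hypothetical helper, not part of any MySQL client library.

```python
import re

# Linux errno values seen in this bug; other platforms number these differently.
LINUX_ERRNO = {
    11: "EAGAIN (resource temporarily unavailable)",
    110: "ETIMEDOUT (connection timed out)",
    111: "ECONNREFUSED (connection refused)",
}

# OS errno as it appears in the messages: "(110)", "(111)" or "(errno 11)".
OS_ERRNO = re.compile(r"\((?:errno )?(\d+)\)")

def decode(message):
    """Return (mysql_error_code, os_errno_description) for an error string."""
    code = re.search(r"\((\d+),", message)
    os_err = OS_ERRNO.search(message)
    mysql_code = int(code.group(1)) if code else None
    description = None
    if os_err:
        description = LINUX_ERRNO.get(int(os_err.group(1)), "unknown")
    return mysql_code, description
```

A timeout (110) means nothing answered within the connect timeout, while a refusal (111) means the host was reachable but nothing accepted the connection on that port. Errno 11 on thread creation is what hitting a process or thread limit typically looks like.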
(Assignee)

Comment 4

6 years ago
The fix was just for the thread errors. We upgraded Zeus this morning. I think that's where the "can't connect" errors came from.
(Assignee)

Comment 5

6 years ago
Considering the amount of traffic, the traceback rate has been pretty light. We are continuing to work with Zeus on their issues. I'm not sure there is any benefit in keeping this bug open.
Status: REOPENED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
This happened again when we upgraded to Percona's MySQL 5.1. The cause, and how we tracked it down:

First, we looked for the ulimit setting in /etc/security/limits.conf and /etc/security/limits.d, which turned up nothing.

The ulimit setting was set in /etc/sysconfig/mysqld, which the Oracle startup script sources:

[ -e /etc/sysconfig/$prog ] && . /etc/sysconfig/$prog

At first we assumed the setting was lost because Oracle's startup script is /etc/init.d/mysqld and Percona's is /etc/init.d/mysql. A fix based on that assumption did not work, though: Percona's startup script does NOT source that file at all.
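
A quick way to check this kind of assumption before relying on a sysconfig file is to look for the sourcing line in the startup script itself. A rough sketch, assuming the script sources the file with `.` or `source` as in the Oracle line quoted above; the function name is made up for illustration, and a script that pulls in its settings some other way would not be detected.

```python
import re

# Matches shell lines that source something under /etc/sysconfig,
# e.g. ". /etc/sysconfig/$prog" or "source /etc/sysconfig/mysqld".
SOURCES_SYSCONFIG = re.compile(
    r"(?:^|[\s;&|])(?:\.|source)\s+/etc/sysconfig/\S+"
)

def sysconfig_lines(script_text):
    """Return non-comment lines of an init script that source /etc/sysconfig files."""
    found = []
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue                      # ignore commented-out lines
        if SOURCES_SYSCONFIG.search(stripped):
            found.append(stripped)
    return found
```

Feeding it the contents of /etc/init.d/mysqld and /etc/init.d/mysql in turn would be expected to return the line above for the former and an empty list for the latter.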

So then we put the following in /etc/security/limits.d/99-nproc-mysql.conf:
mysql   soft    nproc   32768
mysql   hard    nproc   65535
root    soft    nproc   32768
root    hard    nproc   65535

We restarted MySQL and all was good. We tried with just the "mysql" user first. Technically the "mysql" entries aren't needed, because mysqld_safe starts as root and mysqld inherits its limits from that, but better safe than sorry.
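
To confirm that new limits actually reached the running server rather than just the config file, the kernel's view can be read from /proc/<pid>/limits. A minimal sketch, assuming a Linux host; `parse_limits` and `nproc_for_pid` are hypothetical helper names, and finding the mysqld pid is left to the caller.

```python
import re

def parse_limits(text):
    """Parse the contents of /proc/<pid>/limits into {name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:            # skip the header row
        # Columns are separated by runs of spaces; names contain single spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits

def nproc_for_pid(pid):
    """Return the (soft, hard) 'Max processes' limit of a running process."""
    with open("/proc/%d/limits" % pid) as handle:
        return parse_limits(handle.read())["Max processes"]
```

The values come back as strings because a limit can also read "unlimited". If `nproc_for_pid` on the mysqld pid still shows the old numbers after a restart, whatever file was edited is not the one the startup path consults.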

Updated

5 years ago
Blocks: 804406
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations