It's not unusual to see a couple hundred to a couple thousand errors every week or so from production AMO saying it's unable to connect to a service. Examples:

> OperationalError: (2003, "Can't connect to MySQL server on 'db-amo-ro' (110)")
> TimeoutError: Request timed out after 5.000000 seconds
> OperationalError: (1135, "Can't create a new thread (errno 11); if you are not out of available memory, you can consult the manual for a possible OS-dependent bug")

The first has also appeared for db-amo-rw. The second is elastic search; the third is MySQL.

With our amount of traffic these are generally not a big deal, since the person can refresh the page and move on with their lives, but occasionally they are writes that fail, and we've also been getting emails from PayPal that their IPN requests to us are failing.

It's tough to debug periodic failures like this when the cause might be the infrastructure itself. Is there something we can do to increase the reliability of the servers?

Some data from today:

- at 9:45pm tonight we had a bunch of failures to connect to db-amo-ro
- at 11:09am and 11:18am today we had a bunch of failures to connect to elastic search
- at 4:57am, 5:02am, 5:04am, 5:05am, and 5:15am we got a bunch of thread errors from MySQL
- sprinkled between all those main failures were dozens of random failures to connect, just 1 or 2 scattered around
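Since most of these are transient can't-connect errors that succeed on a refresh, one common mitigation on the application side is a bounded retry with backoff for idempotent operations. Below is a minimal sketch of that idea; `with_retries` and `TRANSIENT_ERRNOS` are illustrative names, not anything from the AMO codebase, and a real MySQL client would raise `OperationalError` rather than the stand-in `OSError` used here.

```python
import time

# MySQL client error codes that usually indicate a transient connection
# problem worth retrying: 2003 (can't connect) and 2013 (lost connection).
TRANSIENT_ERRNOS = {2003, 2013}

def with_retries(fn, attempts=3, delay=0.5):
    """Call fn(), retrying with exponential backoff on transient errors.

    Non-transient errors, and the final failed attempt, are re-raised.
    OSError stands in here for the DB driver's OperationalError, whose
    first argument is likewise the numeric error code.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except OSError as exc:
            code = exc.args[0] if exc.args else None
            if attempt == attempts - 1 or code not in TRANSIENT_ERRNOS:
                raise
            time.sleep(delay * (2 ** attempt))
```

Retries only make sense for reads or idempotent writes; the failing PayPal IPN handlers would need dedup logic before a retry like this is safe.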
We've raised the ulimits on mysql, so this shouldn't happen any more.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
This includes elastic search too
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Wil Clouser [:clouserw] from comment #2)
> This includes elastic search too

We just got a few hundred more of these as well:

> OperationalError: (2013, 'Lost connection to MySQL server during query')
> OperationalError: (2003, "Can't connect to MySQL server on 'db-amo-ro' (111)")

So there are still MySQL problems.
The fix was just for the thread errors. We upgraded Zeus this morning; I think that's where the can't-connect errors came from.
Considering the amount of traffic, the traceback rate has been pretty light. We are continuing to work with Zeus on their issues. I'm not sure there is any benefit to keeping this bug open.
Status: REOPENED → RESOLVED
Last Resolved: 6 years ago → 6 years ago
Resolution: --- → FIXED
This happened again when we upgraded to Percona's MySQL 5.1. Here's why:

First, we looked for the ulimit setting in /etc/security/limits.conf and /etc/security/limits.d, which turned up nothing. The ulimit setting was actually in /etc/sysconfig/mysqld, which Oracle's startup script sources:

[ -e /etc/sysconfig/$prog ] && . /etc/sysconfig/$prog

At first we assumed the problem was that Oracle's startup script is /etc/init.d/mysqld while Percona's is /etc/init.d/mysql, but when we tried that, it did not work, in fact because Percona's startup script does NOT source that file at all.

So we put the following in /etc/security/limits.d/99-nproc-mysql.conf:

mysql soft nproc 32768
mysql hard nproc 65535
root soft nproc 32768
root hard nproc 65535

Restarted MySQL and all was good. We tried with just the "mysql" user first; technically the "mysql" entries aren't needed, because mysqld_safe starts as root and mysqld inherits its limits from that, but better safe than sorry.
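A quick way to verify a change like this actually took effect, sketched below under the assumption of a Linux host with a /proc filesystem: a daemon's limits are inherited from whatever started it, so checking the shell's own `ulimit` is only a proxy; for a running mysqld the authoritative source is /proc/<pid>/limits.

```shell
# Effective nproc limits for the current shell (what a child process
# started from here would inherit). "Max processes" is the limit that
# MySQL's "Can't create a new thread (errno 11)" error runs into.
soft=$(ulimit -Su)
hard=$(ulimit -Hu)
echo "soft nproc: $soft"
echo "hard nproc: $hard"

# For a live daemon, check the process itself rather than the shell:
#   grep 'Max processes' /proc/$(pidof mysqld)/limits
```

Note that limits.d entries are applied by pam_limits at login, so they affect services started through an init script's su/runuser path but not necessarily processes launched outside PAM.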
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations