Closed
Bug 964876
Opened 10 years ago
Closed 10 years ago
Detect a gone-away MySQL server in buildapi
Categories
(Infrastructure & Operations :: RelOps: General, task)
Infrastructure & Operations
RelOps: General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: dustin)
Details
Attachments
(1 file)
1.01 KB,
patch
|
bhearsum
:
review+
dustin
:
checked-in+
|
Details | Diff | Splinter Review |
bug 863268 attachment 8363708 [details] [diff] [review] isn't working: Traceback (most recent call last): File "/data/www/buildapi/buildapi/buildapi/lib/mq.py", line 213, in receive r = s.query(buildapidb.JobRequest).get(message_data['request_id']) File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 820, in get return loading.load_on_ident(self, key) File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/loading.py", line 226, in load_on_ident return q.one() File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2310, in one ret = list(self) File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2353, in __iter__ return self._execute_and_instances(context) File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2366, in _execute_and_instances close_with_result=True) File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2357, in _connection_from_session **kw) File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 799, in connection close_with_result=close_with_result) File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 803, in _connection_for_bind return self.transaction._connection_for_bind(engine) File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 299, in _connection_for_bind conn = bind.contextual_connect() File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1661, in contextual_connect self.pool.connect(), File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 272, in connect return _ConnectionFairy(self).checkout() File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 510, in checkout self) File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/event.py", line 409, in __call__ fn(*args, **kw) File "/data/www/buildapi/buildapi/buildapi/config/environment.py", line 24, in checkout_listener cursor.execute("SELECT 1") File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/MySQLdb/cursors.py", line 205, in execute self.errorhandler(self, exc, value) File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler raise errorclass, errorvalue OperationalError: (2006, 'MySQL server has gone away') It's running the listener's query.. but not handling the OperationalError correctly.
Assignee | ||
Comment 1•10 years ago
|
||
This *might* have been a reload problem: I updated python-MySQL yesterday, and only restarted httpd on web2, not web1. This error was from web1, while I don't see any errors on web2. So it's possible that the OperationalError class being raised is a different class instance than the one being caught. I'll keep an eye on it.
Assignee | ||
Comment 2•10 years ago
|
||
This is still failing after an httpd restart on web1 yesterday - and still only failing on web1.
Assignee | ||
Comment 3•10 years ago
|
||
working: [root@web2.releng.webapp.scl3 ~]# lsof -p 17357 | grep _mysql httpd 17357 buildapi mem REG 8,3 143284 1711258 /data/www/buildapi/virtualenv/lib/python2.6/site-packages/_mysql.so httpd 17357 buildapi mem REG 8,3 57304 136539 /usr/lib64/python2.6/site-packages/_mysql.so httpd 17357 buildapi mem REG 8,3 30112 2240294 /usr/lib64/php/modules/pdo_mysql.so non-working: [root@web1.releng.webapp.scl3 ~]# lsof -p 28807 | grep _mysql httpd 28807 buildapi mem REG 8,3 143284 1711075 /data/www/buildapi/virtualenv/lib/python2.6/site-packages/_mysql.so httpd 28807 buildapi mem REG 8,3 30112 6801 /usr/lib64/php/modules/pdo_mysql.so That *really* doesn't make sense. This virtualenv was originally built with --system-site-packages to include the system Python-MySQL. But relengwebadm doesn't have Python-MySQL installed, and installing it would change the minor version of the MySQL libraries, so I installed it in the virtualenv. And now I have two problems. I think I'll recreate the virtualenv without --system-site-packages and carefully ship it to the two webhead nodes.
Assignee | ||
Comment 4•10 years ago
|
||
OK, new virtualenv is created and deployed. Let's see if this problem is gone.
Assignee | ||
Comment 5•10 years ago
|
||
Nope.
Assignee | ||
Comment 6•10 years ago
|
||
I can't reproduce this in staging, either. I compared installed packages on the two webheads, and the only difference is in the patchlevel of the kernel package - *very* unlikely to be causing this. I added some additional logging and deployed it. That should at least narrow down what exactly is wrong here.
Assignee | ||
Comment 7•10 years ago
|
||
And it finally failed again: 2014-01-29 09:15:38,560 WARNI [setup_engine] [Dummy-1] caught something else so the OperationalError MySQLdb is throwing is not the same one the code is catching :(
Assignee | ||
Comment 8•10 years ago
|
||
Just got some emails about this from web2, so at least the two identical machines are not acting differently!
Assignee | ||
Comment 9•10 years ago
|
||
Got some more: 2014-01-29 17:09:26,176 WARNI [setup_engine] [Dummy-1] caught something else: got (<class '_mysql_exceptions.OperationalError'>, OperationalError(2006, 'MySQL server has gone away'), <traceback object at 0x7f70194b4200>) exp <class '_mysql_exceptions.OperationalError'> The first bit is sys.exc_info(); the second is the expected exception class. I guess I need to log their id() too.
Assignee | ||
Comment 10•10 years ago
|
||
ID's differ, and interestingly it's the *expected* class that seems to differ between worker processes: 2014-01-29 21:11:31,617 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119812896880 2014-01-29 22:02:03,644 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119575681968 2014-01-29 22:02:04,347 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119812896880 2014-01-29 22:39:05,110 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119575681968 2014-01-29 22:39:06,813 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119812896880 2014-01-29 23:21:51,000 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119575681968 2014-01-29 23:25:33,922 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119812896880
Assignee | ||
Comment 11•10 years ago
|
||
http://docs.sqlalchemy.org/en/rel_0_8/core/pooling.html#disconnect-handling-pessimistic actually recommends catching all exceptions. The code in the attachment mentioned in the description is from https://stackoverflow.com/questions/15753102/python-sqlalchemy-how-do-i-ensure-connection-not-stale-using-new-event-system. I added some code to _mysql_exceptions.py to see if I could catch it creating multiple instances of the exception class, and I can't. So, really, I have no idea what's going on here. Hoewever, it seems best to follow the SQLAlchemy advice anyway. I will replace the bare except with 'except Exception', though.
Assignee | ||
Comment 12•10 years ago
|
||
Running this in staging now, but staging doesn't tend to show this kind of failure so that only demonstrates "no SyntaxErrors"
Assignee: relops → dustin
Attachment #8368049 -
Flags: review?(bhearsum)
Comment 13•10 years ago
|
||
Comment on attachment 8368049 [details] [diff] [review] bug964876.patch Review of attachment 8368049 [details] [diff] [review]: ----------------------------------------------------------------- Huh, that recipe from sqlalchemy raises a red flag. Is it worthwhile logging the exception to make sure we don't make any weird undiscovered bugs harder to debug? I guess it's not so bad when we're only catching from execute()...
Attachment #8368049 -
Flags: review?(bhearsum) → review+
Assignee | ||
Comment 14•10 years ago
|
||
If there is some weird error, sqlalchemy will retry this checkout three times, then raise it. So we'd get *some* warning.
Comment 15•10 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] (I ignore NEEDINFO) from comment #14) > If there is some weird error, sqlalchemy will retry this checkout three > times, then raise it. So we'd get *some* warning. Ah, OK. This WFM then.
Assignee | ||
Comment 16•10 years ago
|
||
Comment on attachment 8368049 [details] [diff] [review] bug964876.patch ..and deployed to production
Attachment #8368049 -
Flags: checked-in+
Assignee | ||
Updated•10 years ago
|
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•