Closed Bug 964876 Opened 10 years ago Closed 10 years ago

Detect a gone-away MySQL server in buildapi

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Details

Attachments

(1 file)

bug 863268 attachment 8363708 [details] [diff] [review] isn't working:

Traceback (most recent call last):
  File "/data/www/buildapi/buildapi/buildapi/lib/mq.py", line 213, in receive
    r = s.query(buildapidb.JobRequest).get(message_data['request_id'])
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 820, in get
    return loading.load_on_ident(self, key)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/loading.py", line 226, in load_on_ident
    return q.one()
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2310, in one
    ret = list(self)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2353, in __iter__
    return self._execute_and_instances(context)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2366, in _execute_and_instances
    close_with_result=True)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2357, in _connection_from_session
    **kw)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 799, in connection
    close_with_result=close_with_result)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 803, in _connection_for_bind
    return self.transaction._connection_for_bind(engine)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 299, in _connection_for_bind
    conn = bind.contextual_connect()
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1661, in contextual_connect
    self.pool.connect(),
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 272, in connect
    return _ConnectionFairy(self).checkout()
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 510, in checkout
    self)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/event.py", line 409, in __call__
    fn(*args, **kw)
  File "/data/www/buildapi/buildapi/buildapi/config/environment.py", line 24, in checkout_listener
    cursor.execute("SELECT 1")
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/MySQLdb/cursors.py", line 205, in execute
    self.errorhandler(self, exc, value)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
OperationalError: (2006, 'MySQL server has gone away')

It's running the listener's query.. but not handling the OperationalError correctly.
This *might* have been a reload problem: I updated python-MySQL yesterday, and only restarted httpd on web2, not web1.  This error was from web1, while I don't see any errors on web2.  So it's possible that the OperationalError class being raised is a different class instance than the one being caught.  I'll keep an eye on it.
This is still failing after an httpd restart on web1 yesterday - and still only failing on web1.
working:

[root@web2.releng.webapp.scl3 ~]# lsof -p 17357 | grep _mysql
httpd   17357 buildapi  mem    REG                8,3   143284    1711258 /data/www/buildapi/virtualenv/lib/python2.6/site-packages/_mysql.so
httpd   17357 buildapi  mem    REG                8,3    57304     136539 /usr/lib64/python2.6/site-packages/_mysql.so
httpd   17357 buildapi  mem    REG                8,3    30112    2240294 /usr/lib64/php/modules/pdo_mysql.so

non-working:

[root@web1.releng.webapp.scl3 ~]# lsof -p 28807 | grep _mysql
httpd   28807 buildapi  mem    REG                8,3   143284    1711075 /data/www/buildapi/virtualenv/lib/python2.6/site-packages/_mysql.so
httpd   28807 buildapi  mem    REG                8,3    30112       6801 /usr/lib64/php/modules/pdo_mysql.so

That *really* doesn't make sense.  This virtualenv was originally built with --system-site-packages to include the system Python-MySQL.  But relengwebadm doesn't have Python-MySQL installed, and installing it would change the minor version of the MySQL libraries, so I installed it in the virtualenv.  And now I have two problems.

I think I'll recreate the virtualenv without --system-site-packages and carefully ship it to the two webhead nodes.
OK, new virtualenv is created and deployed.  Let's see if this problem is gone.
Nope.
I can't reproduce this in staging, either.

I compared installed packages on the two webheads, and the only difference is in the patchlevel of the kernel package - *very* unlikely to be causing this.

I added some additional logging and deployed it.  That should at least narrow down what exactly is wrong here.
And it finally failed again:

2014-01-29 09:15:38,560 WARNI [setup_engine] [Dummy-1] caught something else

so the OperationalError MySQLdb is throwing is not the same one the code is catching :(
Just got some emails about this from web2, so at least the two identical machines are not acting differently!
Got some more:

2014-01-29 17:09:26,176 WARNI [setup_engine] [Dummy-1] caught something else: got (<class '_mysql_exceptions.OperationalError'>, OperationalError(2006, 'MySQL server has gone away'), <traceback object at 0x7f70194b4200>) exp <class '_mysql_exceptions.OperationalError'>

The first bit is sys.exc_info(); the second is the expected exception class.  I guess I need to log their id() too.
ID's differ, and interestingly it's the *expected* class that seems to differ between worker processes:

2014-01-29 21:11:31,617 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119812896880
2014-01-29 22:02:03,644 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119575681968
2014-01-29 22:02:04,347 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119812896880
2014-01-29 22:39:05,110 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119575681968
2014-01-29 22:39:06,813 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119812896880
2014-01-29 23:21:51,000 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119575681968
2014-01-29 23:25:33,922 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119812896880
http://docs.sqlalchemy.org/en/rel_0_8/core/pooling.html#disconnect-handling-pessimistic actually recommends catching all exceptions.  The code in the attachment mentioned in the description is from https://stackoverflow.com/questions/15753102/python-sqlalchemy-how-do-i-ensure-connection-not-stale-using-new-event-system.

I added some code to _mysql_exceptions.py to see if I could catch it creating multiple instances of the exception class, and I can't.  So, really, I have no idea what's going on here.

Hoewever, it seems best to follow the SQLAlchemy advice anyway.  I will replace the bare except with 'except Exception', though.
Attached patch bug964876.patchSplinter Review
Running this in staging now, but staging doesn't tend to show this kind of failure so that only demonstrates "no SyntaxErrors"
Assignee: relops → dustin
Attachment #8368049 - Flags: review?(bhearsum)
Comment on attachment 8368049 [details] [diff] [review]
bug964876.patch

Review of attachment 8368049 [details] [diff] [review]:
-----------------------------------------------------------------

Huh, that recipe from sqlalchemy raises a red flag. Is it worthwhile logging the exception to make sure we don't make any weird undiscovered bugs harder to debug? I guess it's not so bad when we're only catching from execute()...
Attachment #8368049 - Flags: review?(bhearsum) → review+
If there is some weird error, sqlalchemy will retry this checkout three times, then raise it.  So we'd get *some* warning.
(In reply to Dustin J. Mitchell [:dustin] (I ignore NEEDINFO) from comment #14)
> If there is some weird error, sqlalchemy will retry this checkout three
> times, then raise it.  So we'd get *some* warning.

Ah, OK. This WFM then.
Comment on attachment 8368049 [details] [diff] [review]
bug964876.patch

..and deployed to production
Attachment #8368049 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: