Detect a gone-away MySQL server in buildapi

RESOLVED FIXED

Status

Infrastructure & Operations
RelOps
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: dustin, Assigned: dustin)

Tracking

Details

Attachments

(1 attachment)

(Assignee)

Description

4 years ago
bug 863268 attachment 8363708 [details] [diff] [review] isn't working:

Traceback (most recent call last):
  File "/data/www/buildapi/buildapi/buildapi/lib/mq.py", line 213, in receive
    r = s.query(buildapidb.JobRequest).get(message_data['request_id'])
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 820, in get
    return loading.load_on_ident(self, key)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/loading.py", line 226, in load_on_ident
    return q.one()
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2310, in one
    ret = list(self)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2353, in __iter__
    return self._execute_and_instances(context)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2366, in _execute_and_instances
    close_with_result=True)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2357, in _connection_from_session
    **kw)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 799, in connection
    close_with_result=close_with_result)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 803, in _connection_for_bind
    return self.transaction._connection_for_bind(engine)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 299, in _connection_for_bind
    conn = bind.contextual_connect()
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1661, in contextual_connect
    self.pool.connect(),
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 272, in connect
    return _ConnectionFairy(self).checkout()
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 510, in checkout
    self)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/event.py", line 409, in __call__
    fn(*args, **kw)
  File "/data/www/buildapi/buildapi/buildapi/config/environment.py", line 24, in checkout_listener
    cursor.execute("SELECT 1")
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/MySQLdb/cursors.py", line 205, in execute
    self.errorhandler(self, exc, value)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
OperationalError: (2006, 'MySQL server has gone away')

It's running the listener's query.. but not handling the OperationalError correctly.
(Assignee)

Comment 1

4 years ago
This *might* have been a reload problem: I updated python-MySQL yesterday, and only restarted httpd on web2, not web1.  This error was from web1, while I don't see any errors on web2.  So it's possible that the OperationalError class being raised is a different class instance than the one being caught.  I'll keep an eye on it.
(Assignee)

Comment 2

4 years ago
This is still failing after an httpd restart on web1 yesterday - and still only failing on web1.
(Assignee)

Comment 3

4 years ago
working:

[root@web2.releng.webapp.scl3 ~]# lsof -p 17357 | grep _mysql
httpd   17357 buildapi  mem    REG                8,3   143284    1711258 /data/www/buildapi/virtualenv/lib/python2.6/site-packages/_mysql.so
httpd   17357 buildapi  mem    REG                8,3    57304     136539 /usr/lib64/python2.6/site-packages/_mysql.so
httpd   17357 buildapi  mem    REG                8,3    30112    2240294 /usr/lib64/php/modules/pdo_mysql.so

non-working:

[root@web1.releng.webapp.scl3 ~]# lsof -p 28807 | grep _mysql
httpd   28807 buildapi  mem    REG                8,3   143284    1711075 /data/www/buildapi/virtualenv/lib/python2.6/site-packages/_mysql.so
httpd   28807 buildapi  mem    REG                8,3    30112       6801 /usr/lib64/php/modules/pdo_mysql.so

That *really* doesn't make sense.  This virtualenv was originally built with --system-site-packages to include the system Python-MySQL.  But relengwebadm doesn't have Python-MySQL installed, and installing it would change the minor version of the MySQL libraries, so I installed it in the virtualenv.  And now I have two problems.

I think I'll recreate the virtualenv without --system-site-packages and carefully ship it to the two webhead nodes.
(Assignee)

Comment 4

4 years ago
OK, new virtualenv is created and deployed.  Let's see if this problem is gone.
(Assignee)

Comment 5

4 years ago
Nope.
(Assignee)

Comment 6

4 years ago
I can't reproduce this in staging, either.

I compared installed packages on the two webheads, and the only difference is in the patchlevel of the kernel package - *very* unlikely to be causing this.

I added some additional logging and deployed it.  That should at least narrow down what exactly is wrong here.
(Assignee)

Comment 7

4 years ago
And it finally failed again:

2014-01-29 09:15:38,560 WARNI [setup_engine] [Dummy-1] caught something else

so the OperationalError MySQLdb is throwing is not the same one the code is catching :(
(Assignee)

Comment 8

4 years ago
Just got some emails about this from web2, so at least the two identical machines are not acting differently!
(Assignee)

Comment 9

4 years ago
Got some more:

2014-01-29 17:09:26,176 WARNI [setup_engine] [Dummy-1] caught something else: got (<class '_mysql_exceptions.OperationalError'>, OperationalError(2006, 'MySQL server has gone away'), <traceback object at 0x7f70194b4200>) exp <class '_mysql_exceptions.OperationalError'>

The first bit is sys.exc_info(); the second is the expected exception class.  I guess I need to log their id() too.
(Assignee)

Comment 10

4 years ago
ID's differ, and interestingly it's the *expected* class that seems to differ between worker processes:

2014-01-29 21:11:31,617 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119812896880
2014-01-29 22:02:03,644 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119575681968
2014-01-29 22:02:04,347 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119812896880
2014-01-29 22:39:05,110 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119575681968
2014-01-29 22:39:06,813 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119812896880
2014-01-29 23:21:51,000 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119575681968
2014-01-29 23:25:33,922 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119812896880
(Assignee)

Comment 11

4 years ago
http://docs.sqlalchemy.org/en/rel_0_8/core/pooling.html#disconnect-handling-pessimistic actually recommends catching all exceptions.  The code in the attachment mentioned in the description is from https://stackoverflow.com/questions/15753102/python-sqlalchemy-how-do-i-ensure-connection-not-stale-using-new-event-system.

I added some code to _mysql_exceptions.py to see if I could catch it creating multiple instances of the exception class, and I can't.  So, really, I have no idea what's going on here.

Hoewever, it seems best to follow the SQLAlchemy advice anyway.  I will replace the bare except with 'except Exception', though.
(Assignee)

Comment 12

4 years ago
Created attachment 8368049 [details] [diff] [review]
bug964876.patch

Running this in staging now, but staging doesn't tend to show this kind of failure so that only demonstrates "no SyntaxErrors"
Assignee: relops → dustin
Attachment #8368049 - Flags: review?(bhearsum)
Comment on attachment 8368049 [details] [diff] [review]
bug964876.patch

Review of attachment 8368049 [details] [diff] [review]:
-----------------------------------------------------------------

Huh, that recipe from sqlalchemy raises a red flag. Is it worthwhile logging the exception to make sure we don't make any weird undiscovered bugs harder to debug? I guess it's not so bad when we're only catching from execute()...
Attachment #8368049 - Flags: review?(bhearsum) → review+
(Assignee)

Comment 14

4 years ago
If there is some weird error, sqlalchemy will retry this checkout three times, then raise it.  So we'd get *some* warning.
(In reply to Dustin J. Mitchell [:dustin] (I ignore NEEDINFO) from comment #14)
> If there is some weird error, sqlalchemy will retry this checkout three
> times, then raise it.  So we'd get *some* warning.

Ah, OK. This WFM then.
(Assignee)

Comment 16

4 years ago
Comment on attachment 8368049 [details] [diff] [review]
bug964876.patch

..and deployed to production
Attachment #8368049 - Flags: checked-in+
(Assignee)

Updated

4 years ago
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.