964876 - Detect a gone-away MySQL server in buildapi

Assignee

Description

•

10 years ago

bug 863268 attachment 8363708 [details] [diff] [review] isn't working:

Traceback (most recent call last):
  File "/data/www/buildapi/buildapi/buildapi/lib/mq.py", line 213, in receive
    r = s.query(buildapidb.JobRequest).get(message_data['request_id'])
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 820, in get
    return loading.load_on_ident(self, key)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/loading.py", line 226, in load_on_ident
    return q.one()
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2310, in one
    ret = list(self)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2353, in __iter__
    return self._execute_and_instances(context)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2366, in _execute_and_instances
    close_with_result=True)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2357, in _connection_from_session
    **kw)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 799, in connection
    close_with_result=close_with_result)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 803, in _connection_for_bind
    return self.transaction._connection_for_bind(engine)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 299, in _connection_for_bind
    conn = bind.contextual_connect()
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1661, in contextual_connect
    self.pool.connect(),
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 272, in connect
    return _ConnectionFairy(self).checkout()
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 510, in checkout
    self)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/sqlalchemy/event.py", line 409, in __call__
    fn(*args, **kw)
  File "/data/www/buildapi/buildapi/buildapi/config/environment.py", line 24, in checkout_listener
    cursor.execute("SELECT 1")
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/MySQLdb/cursors.py", line 205, in execute
    self.errorhandler(self, exc, value)
  File "/data/www/buildapi/virtualenv/lib/python2.6/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
OperationalError: (2006, 'MySQL server has gone away')

It's running the listener's query.. but not handling the OperationalError correctly.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 1

•

10 years ago

This *might* have been a reload problem: I updated python-MySQL yesterday, and only restarted httpd on web2, not web1.  This error was from web1, while I don't see any errors on web2.  So it's possible that the OperationalError class being raised is a different class instance than the one being caught.  I'll keep an eye on it.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 2

•

10 years ago

This is still failing after an httpd restart on web1 yesterday - and still only failing on web1.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 3

•

10 years ago

working:

[root@web2.releng.webapp.scl3 ~]# lsof -p 17357 | grep _mysql
httpd   17357 buildapi  mem    REG                8,3   143284    1711258 /data/www/buildapi/virtualenv/lib/python2.6/site-packages/_mysql.so
httpd   17357 buildapi  mem    REG                8,3    57304     136539 /usr/lib64/python2.6/site-packages/_mysql.so
httpd   17357 buildapi  mem    REG                8,3    30112    2240294 /usr/lib64/php/modules/pdo_mysql.so

non-working:

[root@web1.releng.webapp.scl3 ~]# lsof -p 28807 | grep _mysql
httpd   28807 buildapi  mem    REG                8,3   143284    1711075 /data/www/buildapi/virtualenv/lib/python2.6/site-packages/_mysql.so
httpd   28807 buildapi  mem    REG                8,3    30112       6801 /usr/lib64/php/modules/pdo_mysql.so

That *really* doesn't make sense.  This virtualenv was originally built with --system-site-packages to include the system Python-MySQL.  But relengwebadm doesn't have Python-MySQL installed, and installing it would change the minor version of the MySQL libraries, so I installed it in the virtualenv.  And now I have two problems.

I think I'll recreate the virtualenv without --system-site-packages and carefully ship it to the two webhead nodes.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 4

•

10 years ago

OK, new virtualenv is created and deployed.  Let's see if this problem is gone.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 5

•

10 years ago

Nope.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 6

•

10 years ago

I can't reproduce this in staging, either.

I compared installed packages on the two webheads, and the only difference is in the patchlevel of the kernel package - *very* unlikely to be causing this.

I added some additional logging and deployed it.  That should at least narrow down what exactly is wrong here.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 7

•

10 years ago

And it finally failed again:

2014-01-29 09:15:38,560 WARNI [setup_engine] [Dummy-1] caught something else

so the OperationalError MySQLdb is throwing is not the same one the code is catching :(

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 8

•

10 years ago

Just got some emails about this from web2, so at least the two identical machines are not acting differently!

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 9

•

10 years ago

Got some more:

2014-01-29 17:09:26,176 WARNI [setup_engine] [Dummy-1] caught something else: got (<class '_mysql_exceptions.OperationalError'>, OperationalError(2006, 'MySQL server has gone away'), <traceback object at 0x7f70194b4200>) exp <class '_mysql_exceptions.OperationalError'>

The first bit is sys.exc_info(); the second is the expected exception class.  I guess I need to log their id() too.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 10

•

10 years ago

ID's differ, and interestingly it's the *expected* class that seems to differ between worker processes:

2014-01-29 21:11:31,617 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119812896880
2014-01-29 22:02:03,644 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119575681968
2014-01-29 22:02:04,347 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119812896880
2014-01-29 22:39:05,110 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119575681968
2014-01-29 22:39:06,813 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119812896880
2014-01-29 23:21:51,000 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119575681968
2014-01-29 23:25:33,922 WARNI [setup_engine] [Dummy-1] caught something else: got 140119780647184 exp 140119812896880

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 11

•

10 years ago

http://docs.sqlalchemy.org/en/rel_0_8/core/pooling.html#disconnect-handling-pessimistic actually recommends catching all exceptions.  The code in the attachment mentioned in the description is from https://stackoverflow.com/questions/15753102/python-sqlalchemy-how-do-i-ensure-connection-not-stale-using-new-event-system.

I added some code to _mysql_exceptions.py to see if I could catch it creating multiple instances of the exception class, and I can't.  So, really, I have no idea what's going on here.

Hoewever, it seems best to follow the SQLAlchemy advice anyway.  I will replace the bare except with 'except Exception', though.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 12

•

10 years ago

Attached patch bug964876.patch — Details — Splinter Review

Running this in staging now, but staging doesn't tend to show this kind of failure so that only demonstrates "no SyntaxErrors"

Assignee: relops → dustin

Attachment #8368049 - Flags: review?(bhearsum)

bhearsum@mozilla.com (:bhearsum)

Comment 13

•

10 years ago

Comment on attachment 8368049 [details] [diff] [review]
bug964876.patch

Review of attachment 8368049 [details] [diff] [review]:
-----------------------------------------------------------------

Huh, that recipe from sqlalchemy raises a red flag. Is it worthwhile logging the exception to make sure we don't make any weird undiscovered bugs harder to debug? I guess it's not so bad when we're only catching from execute()...

Attachment #8368049 - Flags: review?(bhearsum) → review+

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 14

•

10 years ago

If there is some weird error, sqlalchemy will retry this checkout three times, then raise it.  So we'd get *some* warning.

bhearsum@mozilla.com (:bhearsum)

Comment 15

•

10 years ago

(In reply to Dustin J. Mitchell [:dustin] (I ignore NEEDINFO) from comment #14)
> If there is some weird error, sqlalchemy will retry this checkout three
> times, then raise it.  So we'd get *some* warning.

Ah, OK. This WFM then.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 16

•

10 years ago

Comment on attachment 8368049 [details] [diff] [review]
bug964876.patch

..and deployed to production

Attachment #8368049 - Flags: checked-in+

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Updated

•

10 years ago

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Bugzilla

Quick Search

Detect a gone-away MySQL server in buildapi

Categories

(Infrastructure & Operations :: RelOps: General, task)

Tracking

(Not tracked)

People

(Reporter: dustin, Assigned: dustin)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Comment 12

Comment 13

Comment 14

Comment 15

Comment 16

Updated

Attachment

General

Description

File Name

Content Type