1034377 - "Lock wait timeout exceeded" produces a 500 error; should be 409

Assignee

Description

•

11 years ago

Traceback from loadtest: Traceback (most recent call last): File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/SyncStorage-1.5.6-py2.6.egg/syncstorage/storage/sql/dbconnect.py", line 422, in report_backend_errors_wrapper return func(self, *args, **kwds) File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/SyncStorage-1.5.6-py2.6.egg/syncstorage/storage/sql/dbconnect.py", line 530, in execute return self._exec_with_cleanup(connection, query_str, **params) File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/mozsvc/metrics.py", line 183, in timed_func return func(*args, **kwds) File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/SyncStorage-1.5.6-py2.6.egg/syncstorage/storage/sql/dbconnect.py", line 550, in _exec_with_cleanup return connection.execute(sqltext(query_str), **params) File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 720, in execute return meth(self, multiparams, params) File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/sqlalchemy/sql/elements.py", line 317, in _execute_on_connection return connection._execute_clauseelement(self, multiparams, params) File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 817, in _execute_clauseelement compiled_sql, distilled_params File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 947, in _execute_context context) File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1108, in _handle_dbapi_exception exc_info File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/sqlalchemy/util/compat.py", line 185, in raise_from_cause reraise(type(exception), exception, tb=exc_tb) File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 940, in _execute_context context) File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/sqlalchemy/engine/default.py", line 435, in do_execute cursor.execute(statement, parameters) File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/pymysql/cursors.py", line 132, in execute result = self._query(query) File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/pymysql/cursors.py", line 271, in _query conn.query(q) File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/pymysql/connections.py", line 726, in query self._affected_rows = self._read_query_result(unbuffered=unbuffered) File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/pymysql/connections.py", line 861, in _read_query_result result.read() File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/pymysql/connections.py", line 1064, in read first_packet = self.connection._read_packet() File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/pymysql/connections.py", line 826, in _read_packet packet.check_error() File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/pymysql/connections.py", line 370, in check_error raise_mysql_exception(self._data) File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/pymysql/err.py", line 116, in raise_mysql_exception _check_mysql_exception(errinfo) File "/media/ephemeral0/server-syncstorage/lib/python2.6/site-packages/pymysql/err.py", line 112, in _check_mysql_exception raise InternalError(errno, errorvalue) InternalError: (InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') '/* [queryName=LOCK_COLLECTION_WRITE, retry=1] */ SELECT last_modified FROM user_collections WHERE userid=%s AND collection=%s FOR UPDATE' (567217, 6) This should have been caught and turned into a "409 Conflict" but it doesn't look like it was. Probably not catching the right error subclass.

Ryan Kelly [:rfkelly]

Assignee

Updated

•

11 years ago

Assignee: nobody → rfkelly

Ryan Kelly [:rfkelly]

Assignee

Updated

•

11 years ago

Summary: "Lock wait timeout exceeded" prouces a 500 error; should be 409 → "Lock wait timeout exceeded" produces a 500 error; should be 409

James Bonacci [:jbonacci]

Updated

•

11 years ago

Whiteboard: [qa+]

Ryan Kelly [:rfkelly]

Assignee

Comment 1

•

11 years ago

AFAICT, the fact that this traceback starts in "report_backend_errors_wrapper" rather than in the top-level "catch_backend_errors_tween" means that it's logging the error before (correctly) turning it into a BackendError instance. So maybe this is actually being caught and reported properly, and it's just the logged traceback that is confusing. I'll dig a little more into the behaviour here though.

Ryan Kelly [:rfkelly]

Assignee

Comment 2

•

11 years ago

Attached patch syncstorage-better-conflict-errors.diff — Details — Splinter Review

Here are some improvements to our "409 Conflict" generation based on my invetigation for this traceback: * Add an explicit message when logging such tracebacks, so that we can more easily distinguish "uncaught error" from "caught and logged". * Treat "1213: deadlock found when trying to get lock" as a retryable locking-related error * Move the "db locking error" => "ConflictError" mapping down onto the session object, so we correctly detect these if they show up on any query rather than just on the lock-taking query These changes helped the "409 Conflict" response show up more reliably in my local testing on both sqlite and mysql (as opposed to these errors being reported as a generic "503 Service Unavailable"). Pinging :callahad for review in the interests of disseminating knowledge, and :bobm for feedback to sanity-check adding "1213" to the list of retryable errors.

Attachment #8457839 - Flags: review?(dan.callahan)

Attachment #8457839 - Flags: feedback?(bobm)

Bob Micheletto [:bobm]

Comment 3

•

11 years ago

(In reply to Ryan Kelly [:rfkelly] from comment #2) > Pinging :callahad for review in the interests of disseminating knowledge, > and :bobm for feedback to sanity-check adding "1213" to the list of > retryable errors. Sounds good to me. Is there / will there be a retry limit?

Ryan Kelly [:rfkelly]

Assignee

Comment 4

•

11 years ago

> Sounds good to me. Is there / will there be a retry limit? Yes, this flows into the existing retry logic which will retry exactly once, then fail back to the client with a 503 or 409, depending on the error.

Bob Micheletto [:bobm]

Comment 5

•

11 years ago

Comment on attachment 8457839 [details] [diff] [review] syncstorage-better-conflict-errors.diff Sounds like a good strategy to me.

Attachment #8457839 - Flags: feedback?(bobm) → feedback+

Ryan Kelly [:rfkelly]

Assignee

Comment 6

•

11 years ago

Comment on attachment 8457839 [details] [diff] [review] syncstorage-better-conflict-errors.diff Bumping review to :telliott, let's get this landed before the next round of deployments.

Attachment #8457839 - Flags: review?(dan.callahan) → review?(telliott)

Toby Elliott [:telliott]

Updated

•

11 years ago

Attachment #8457839 - Flags: review?(telliott) → review+

Richard Newman [:rnewman]

Comment 7

•

11 years ago

Bear in mind that desktop currently doesn't support 409 Conflict: Bug 959034.

Ryan Kelly [:rfkelly]

Assignee

Comment 8

•

11 years ago

> Bear in mind that desktop currently doesn't support 409 Conflict: Bug 959034. Ugh, I had forgotten about the error-bar aspect of that. I'll ponder the implications before committing.

Ryan Kelly [:rfkelly]

Assignee

Comment 9

•

11 years ago

I committed this because it gives better internal retry handling, and because we already generate 409s under some circumstances anyway: https://github.com/mozilla-services/server-syncstorage/commit/ee8349145405285bb0f4b78b28320c272fec156f There's some risk that this might surface an error bar in situations where it wouldn't previously, but it seems acceptably small to me. We should separately consider the best path forward there.

Ryan Kelly [:rfkelly]

Assignee

Updated

•

11 years ago

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

James Bonacci [:jbonacci]

Comment 10

•

11 years ago

OK. I checked my notes. I do not see any specific deployments coming up, so I will just hold this one.

Bob Micheletto [:bobm]

Comment 11

•

11 years ago

(In reply to Ryan Kelly [:rfkelly] from comment #9) > I committed this because it gives better internal retry handling, and > because we already generate 409s under some circumstances anyway: Should we add a graph with a 409 + UA break down?

Ryan Kelly [:rfkelly]

Assignee

Comment 12

•

11 years ago

> Should we add a graph with a 409 + UA break down? Nah. See Bug 1051739 where I just disabled them entirely until we get proper client support.

James Bonacci [:jbonacci]

Comment 13

•

11 years ago

server-syncstorage 1.5.9 is now in Production

Status: RESOLVED → VERIFIED

BMO Automation

Updated

•

2 years ago

Product: Cloud Services → Cloud Services Graveyard