Bug 668664 - kill pending sql query on thread exit
Status: VERIFIED FIXED
Whiteboard: [needs loadtest][qa+]
Product: Cloud Services
Classification: Client Software
Component: Server: Core
Version: unspecified
Hardware: All, OS: All
Priority: P1, Severity: normal
Target Milestone: ---
Assigned To: Ryan Kelly [:rfkelly]
QA Contact: James Bonacci [:jbonacci]
Mentors:
Depends on:
Blocks: 907479 623604 676423 784598
Reported: 2011-06-30 15:57 PDT by Tarek Ziadé (:tarek)
Modified: 2014-03-14 11:56 PDT (History)
CC: 6 users
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
kill pending queries (6.49 KB, patch)
2011-07-11 09:36 PDT, Tarek Ziadé (:tarek)
no flags
kill pending *select* queries (6.67 KB, patch)
2011-07-11 11:08 PDT, Tarek Ziadé (:tarek)
no flags
kill pending *select* queries (6.43 KB, patch)
2011-07-21 17:18 PDT, Tarek Ziadé (:tarek)
no flags
patch to kill the current query when execution is interrupted (2.94 KB, patch)
2013-06-02 22:25 PDT, Ryan Kelly [:rfkelly]
telliott: review+
patch to kill the current query when execution is interrupted (3.46 KB, patch)
2013-06-03 15:34 PDT, Ryan Kelly [:rfkelly]
telliott: review+
sync15-cleanup-query-on-error.diff (5.17 KB, patch)
2014-03-05 20:55 PST, Ryan Kelly [:rfkelly]
telliott: review+
tarek: feedback+

Description Tarek Ziadé (:tarek) 2011-06-30 15:57:24 PDT
If the current thread gets killed, we need to kill any pending SQL query sent to mysql.

notes to myself

1. register queries being executed in a threadlocal
2. catch SIGINT (maybe SIGHUP -- need to check gunicorn)
3. see how to kill the query
4. quit and let the lib close the socket
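The four steps above can be sketched roughly as follows. This is an illustrative outline only, not the actual patch; the `register_query`, `kill_pending`, and `admin_execute` names are invented for the sketch, and the MySQL thread id would come from the driver's connection object:

```python
import signal
import threading

# Step 1: registry of the query each worker thread is currently executing,
# keyed by Python thread id.
_registry = {}          # python thread id -> (mysql thread id, sql)
_registry_lock = threading.Lock()

def register_query(mysql_thread_id, sql):
    """Record the query a worker thread is about to execute."""
    with _registry_lock:
        _registry[threading.get_ident()] = (mysql_thread_id, sql)

def unregister_query():
    """Forget the current thread's query once it completes."""
    with _registry_lock:
        _registry.pop(threading.get_ident(), None)

def kill_pending(admin_execute):
    """Step 3: issue a KILL for every registered query.

    `admin_execute` is any callable that runs SQL on a separate
    "backdoor" connection, so we never block on the busy pool.
    """
    with _registry_lock:
        pending = list(_registry.values())
    for mysql_thread_id, sql in pending:
        admin_execute("KILL %d" % mysql_thread_id)

def install_signal_handler(admin_execute):
    """Step 2: run the cleanup when gunicorn interrupts the worker,
    then exit and let the library close the socket (step 4)."""
    def handler(signum, frame):
        kill_pending(admin_execute)
        raise SystemExit(1)
    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)
```

Separating `kill_pending` from the signal handler keeps the KILL logic testable without a live MySQL server.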
Comment 1 Richard Soderberg [:atoll] 2011-06-30 15:59:00 PDT
mysql(1) binary has a SIGINT handler that does this, in a way that actually tells the server to abort the query, and gets back a confirmation.  we should investigate how this is done.
Comment 2 Tarek Ziadé (:tarek) 2011-06-30 15:59:11 PDT
+ timed out queries. probably via a decorator
Comment 3 Tarek Ziadé (:tarek) 2011-07-11 09:33:16 PDT
I have a first prototype that will trigger query KILLS on SIGINT/SIGTERM I would like to try on high load on the load infra.

The goal is to check that when the request > 30 seconds and Gunicorn restarts the worker, all pending requests are killed in MySQL.

If the prototype works as expected, I'll finish the code + tests in a cleaner version.

Pete, do you think we can do a load session with this patch?
Comment 4 Tarek Ziadé (:tarek) 2011-07-11 09:36:55 PDT
Created attachment 545187 [details] [diff] [review]
kill pending queries

Just a first prototype to validate the idea. Don't look at the code :-)
Comment 5 Toby Elliott [:telliott] 2011-07-11 09:46:40 PDT
This actually seems kind of dangerous. Do we really want to kill a delete if it takes longer than the timeout?
Comment 6 Tarek Ziadé (:tarek) 2011-07-11 09:54:46 PDT
I think we want to kill any operation that lasts longer than the Zeus timeout, in fact.

Having a query last more than 30 seconds while the web app that triggered it is already dead (killed by Zeus/Nginx/Gunicorn) is already a lost cause for the client side: the client already gets a 5xx.
Comment 7 Richard Soderberg [:atoll] 2011-07-11 09:55:27 PDT
Probably not. Some filtering of what to kill would be sensible - like, initially, SELECTs only. Especially since MySQL rollbacks are 10x as slow as letting the query finish under any sort of load (until recent, fixed versions).
Comment 8 Richard Soderberg [:atoll] 2011-07-11 09:57:36 PDT
It's okay if the client gets a 5xx for a DELETE and then later on we finish their DELETE. That way in 5-10 minutes it'll be done when they try to DELETE again!
Comment 9 Tarek Ziadé (:tarek) 2011-07-11 10:00:59 PDT
(In reply to comment #8)
> It's okay if the client gets a 5xx for a DELETE and then later on we finish
> their DELETE. That way in 5-10 minutes it'll be done when they try to DELETE
> again!

What are the queries that are on the edge right now ? 

select(s) or batch post(s)? Or do we have a mix of all kinds under high load? I made the assumption that the batch posts were the longest ones.
Comment 10 Richard Soderberg [:atoll] 2011-07-11 10:04:41 PDT
The three I know of are, in order of estimated(!) frequency, below.  This is from memory, so any actual evidence takes precedence.

DELETE FROM collection WHERE uid=?
(often; any time a user resets sync or clears sync data)

SELECT MAX(timestamp) GROUP BY collection
(rarely; unless we flushed memcache, in which case continuously for several hours)

INSERT INTO collection (), (), (), (), (), (), ()
(unknown; usually requires write saturation of db, which is an incident-class event anyways)
Comment 11 Richard Soderberg [:atoll] 2011-07-11 10:05:16 PDT
For the SELECT case, we currently have helper scripts that detect that specific query and terminate it at 100 seconds.  We don't touch DELETE or INSERT in any automated fashion.
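Those helper scripts aren't shown in this bug, but the selection rule described above (SELECTs only, terminated at 100 seconds, DELETE/INSERT untouched) can be sketched as a filter over `SHOW FULL PROCESSLIST` output. The tuple row format here is an assumption for illustration:

```python
THRESHOLD_SECS = 100  # the 100-second cutoff mentioned above

def queries_to_kill(processlist_rows):
    """Given rows of (thread_id, command, seconds_running, sql),
    return the thread ids of long-running SELECTs only.

    DELETE and INSERT are deliberately left alone: rolling back a
    long DELETE can take far longer than letting it finish.
    """
    to_kill = []
    for thread_id, command, seconds, sql in processlist_rows:
        if command != "Query" or sql is None:
            continue  # idle connections have no query to kill
        if seconds >= THRESHOLD_SECS and sql.lstrip().upper().startswith("SELECT"):
            to_kill.append(thread_id)
    return to_kill
```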
Comment 12 Tarek Ziadé (:tarek) 2011-07-11 10:10:48 PDT
Mmm so:

1- the goal is to kill some queries so the DB does not get saturated when the web app is force-restarted at 30s

2- the SELECT seems to be a rare event

What would be the danger of killing all pending queries on the kill/restart, even DELETEs? I can't think of a case where it could be a problem
Comment 13 Richard Soderberg [:atoll] 2011-07-11 10:12:21 PDT
Rolling back a 100 second DELETE can take up to 1000 seconds. Terrifically unsafe.

The SELECT is a condition that crops up when the DB is saturated and/or unexpectedly busy.  Which occurs any time we trigger more traffic than usual, through either memcache or storage upgrade or etc. events.
Comment 14 Tarek Ziadé (:tarek) 2011-07-11 10:45:19 PDT
(In reply to comment #13)
> Rolling back a 100 second DELETE can take up to 1000 seconds. Terrifically
> unsafe.

That's very long indeed.

> The SELECT is a condition that crops up when the DB is saturated and/or
> unexpectedly busy.  Which occurs any time we trigger more traffic than
> usual, through either memcache or storage upgrade or etc. events.

Fair enough. Adding some filtering options.
Comment 15 Tarek Ziadé (:tarek) 2011-07-11 11:08:44 PDT
Created attachment 545216 [details] [diff] [review]
kill pending *select* queries
Comment 16 Pete Fritchman [:petef] 2011-07-14 12:51:52 PDT
Comment on attachment 545216 [details] [diff] [review]
kill pending *select* queries

kill_pending_queries looks like it hardcodes localhost.  Don't we have to run kill_pending_queries on every sqluri we know about?
Comment 17 Tarek Ziadé (:tarek) 2011-07-21 17:18:12 PDT
Created attachment 547567 [details] [diff] [review]
kill pending *select* queries
Comment 18 Tarek Ziadé (:tarek) 2011-09-05 04:02:09 PDT
Pete, can we work on this this week? The patch is getting a bit old and stale.
Comment 19 James Bonacci [:jbonacci] 2012-08-25 18:56:25 PDT
:atoll :tarek :rfkelly
Any specific load testing still needed for this?
Comment 20 Tarek Ziadé (:tarek) 2012-09-26 07:03:37 PDT
I did all the work but people lost interest - to be reopened on the next sync fire
Comment 21 Ryan Kelly [:rfkelly] 2013-06-02 17:18:44 PDT
CC'ing Mark and Bob since this topic has come up in IRC discussions.
Comment 22 Ryan Kelly [:rfkelly] 2013-06-02 17:38:21 PDT
Tentatively re-opening this - we need to solve this situation, but we may be able to do it via some mysql-level scripting rather than by hacking around in the connection.
Comment 23 Ryan Kelly [:rfkelly] 2013-06-02 22:25:52 PDT
Created attachment 757229 [details] [diff] [review]
patch to kill the current query when execution is interrupted

Here's an updated patch based on IRC discussions of our current needs.  Rather than killing all running queries at shutdown time, it tries to handle it at the level of individual calls to safe_execute().

Basic idea is that if query execution is interrupted by a control-flow exception (such as KeyboardInterrupt or gevent.Timeout) then we open a backdoor connection to mysql and kill the running query.  We depend on some external trigger to generate the interrupt, for example the timeout logic in Bug 878668.

For now, I've made it kill any kind of query rather than just SELECT queries.  The rationale is simple: if there has been an interrupt, then there's nothing around to call COMMIT at the end of the query, so it will be rolled back anyway when the connection is returned to the pool.

IIUC this is the behavior we're seeing in production - the query runs to completion but then gets rolled back at the end.  It should be simple to filter so it only kills certain types of query; I'll let Ops decide what's best in that regard based on current production behavior.
Comment 24 Toby Elliott [:telliott] 2013-06-03 02:17:49 PDT
Comment on attachment 757229 [details] [diff] [review]
patch to kill the current query when execution is interrupted

assuming the kill command doesn't somehow get wedged, this should do the trick. Please double-check that the permissions make this work right.
Comment 25 Ryan Kelly [:rfkelly] 2013-06-03 03:19:55 PDT
I'll await feedback from Ops before committing this, re: whether to kill just the SELECTs or all interrupted queries.
Comment 26 Ryan Kelly [:rfkelly] 2013-06-03 03:27:23 PDT
(In reply to Toby Elliott [:telliott] from comment #24)
> assuming the kill command doesn't somehow get wedged, this should do the
> trick. Please double-check that the permissions make this work right.

Noting that my local testing shows this working correctly, both with "root" and "sync" mysql user accounts.
Comment 27 Ryan Kelly [:rfkelly] 2013-06-03 15:34:15 PDT
Created attachment 757683 [details] [diff] [review]
patch to kill the current query when execution is interrupted

Two small tweaks to the previous patch:  only kill SELECT/INSERT/UPDATE queries and not DELETEs, and force the connection to be removed from the pool after cleanup.
Comment 28 Toby Elliott [:telliott] 2013-06-03 15:53:57 PDT
Comment on attachment 757683 [details] [diff] [review]
patch to kill the current query when execution is interrupted

Patch will work (hopefully), but will break if we ever stick a comment at the start of the query string. Worth noting somewhere.
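The fragility Toby describes is easy to demonstrate: a plain prefix check for SELECT/INSERT/UPDATE stops matching as soon as a comment is prepended to the query string. (The eventual sync1.5 regex, quoted in comment 43 below, tolerates one optional leading `/* ... */` for exactly this reason; the `queryName` annotation string here is made up for illustration.)

```python
import re

# Naive prefix check, as in the patch under review here.
NAIVE = re.compile(r"^\s*(SELECT|INSERT|UPDATE)\s", re.I)
# The later version, tolerating one leading comment.
COMMENT_AWARE = re.compile(r"^\s*(/\*.*\*/)?\s*(SELECT|INSERT|UPDATE)\s", re.I)

plain = "SELECT id FROM bso WHERE userid = 42"
annotated = "/* queryName=get_items */ SELECT id FROM bso WHERE userid = 42"

assert NAIVE.match(plain)
assert NAIVE.match(annotated) is None      # the breakage predicted above
assert COMMENT_AWARE.match(annotated)      # the later fix
```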
Comment 29 Ryan Kelly [:rfkelly] 2013-06-03 17:18:39 PDT
Committed with extra comments in http://hg.mozilla.org/services/server-core/rev/ccb65155bf49

Leaving bug open while we test this out in the real world...
Comment 30 Ryan Kelly [:rfkelly] 2013-06-05 15:51:48 PDT
Thinking on this for a while, there are a couple of things that may go wrong with this in production.  Not worth panicking about, but worth noting for reference:

  * this depends on the ability to quickly create a fresh connection to the server, in order to execute the out-of-band kill operation.  If the server is really overloaded, we may not be able to do that quickly.  We could consider taking one at startup time to use as an "administration channel" and just holding it open until we need it.

  * we should try to measure whether the KILL command itself will get caught up in the MySQL backlog and fail to run in a timely manner.
Comment 31 Ryan Kelly [:rfkelly] 2013-06-05 18:20:07 PDT
This failed in stage due to a PyMySQL bug, where it incorrectly reports the thread id once ids get above a certain size:

  https://github.com/petehunt/PyMySQL/pull/150

I'll include a local fix for this in the next deployment.
Comment 32 Toby Elliott [:telliott] 2013-06-06 01:42:46 PDT
(In reply to Ryan Kelly [:rfkelly] from comment #30)

>   * we should try to measure whether the KILL command itself will get caught
> up in the MySQL backlog and fail to run in a timely manner.

It is quite possible for a KILL command to take minutes/hours to run (though I don't know that we have anything likely to take that long here). Don't assume KILL is instant.
Comment 33 Tarek Ziadé (:tarek) 2013-06-06 04:42:40 PDT
Another approach would be to build a separate application that gets notified of queries to kill - so it can deal with them asynchronously without adding more workload to the main app
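A minimal sketch of that idea, assuming a simple in-process queue and a pluggable `execute_kill` callable standing in for the separate application's own MySQL connection (all names here are illustrative, not from any patch in this bug):

```python
import queue
import threading

class AsyncQueryKiller:
    """Background worker that receives (server, thread_id) kill requests
    and executes them off the request path, so the main app never blocks
    on an overloaded database."""

    def __init__(self, execute_kill):
        self._execute_kill = execute_kill  # callable(server, sql)
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def request_kill(self, server, thread_id):
        """Non-blocking: the web app just enqueues and moves on."""
        self._queue.put((server, thread_id))

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:  # shutdown sentinel
                break
            server, thread_id = item
            try:
                self._execute_kill(server, "KILL %d" % thread_id)
            except Exception:
                pass  # best-effort; the query may already be gone

    def close(self):
        """Drain the queue, then stop the worker."""
        self._queue.put(None)
        self._worker.join()
```

Because the queue is drained in order, `close()` guarantees all earlier kill requests were attempted before shutdown.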
Comment 34 Ryan Kelly [:rfkelly] 2013-06-06 15:27:56 PDT
(In reply to Toby Elliott [:telliott] from comment #32)
> (In reply to Ryan Kelly [:rfkelly] from comment #30)
> >   * we should try to measure whether the KILL command itself will get caught
> > up in the MySQL backlog and fail to run in a timely manner.
> It is quite possible for a KILL command to take minutes/hours to run (though
> I don't know that we have anything likely to take that long here). Don't
> assume KILL is instant.

Do you mean that executing the actual "KILL X" command can take a long time, or it can take a long time for the query to actually be killed?  From what I understand of the implementation of KILL, it just sets a flag somewhere and the query is killed asynchronously.  So the actual sql query should run pretty fast...?


(In reply to Tarek Ziadé (:tarek) from comment #33)
> Another approach would be to build a separate application that gets notified
> of queries to kill - so it can deal with in in an asynchronously manner
> without adding more workload to the main app

Nice, this would neatly solve the two issues I mentioned above.  We could also combine it with the reaping functionality currently used to kill super-long-running queries.  Something to keep in mind if this goes pear-shaped in production.
Comment 35 Toby Elliott [:telliott] 2013-06-06 15:38:04 PDT
(In reply to Ryan Kelly [:rfkelly] from comment #34)
> (In reply to Toby Elliott [:telliott] from comment #32)
> > (In reply to Ryan Kelly [:rfkelly] from comment #30)
> > >   * we should try to measure whether the KILL command itself will get caught
> > > up in the MySQL backlog and fail to run in a timely manner.
> > It is quite possible for a KILL command to take minutes/hours to run (though
> > I don't know that we have anything likely to take that long here). Don't
> > assume KILL is instant.
> 
> Do you mean that executing the actual "KILL X" command can take a long time,
> or it can take a long time for the query to actually be killed?  From what I
> understand of the implementation of KILL, it just sets a flag somewhere and
> the query is killed asynchronously.  So the actual sql query should run
> pretty fast...?
> 

Yes, the latter. You can't magically assume the query has gone away once you issue the kill - it can get caught in the backlog. I imagine a box could get sufficiently wedged that the kill command itself got stuck, but I suspect the only thing that's going to fix that box is pushing the red button on the front of it.
Comment 36 Ryan Kelly [:rfkelly] 2013-06-06 15:38:35 PDT
James, Bob and I took this for a spin in stage yesterday, and it seemed to behave itself under our simulated load conditions.  To force queries to time out, we basically:

  * dropped the gunicorn '-t <timeout>' argument to two seconds
  * ran a loadtest pointing at a single db machine
  * did a lot of disk I/O on the db machine

Timeouts were raised, queries were killed, good times were had by all.  Such a simulation ain't the real world of course, but it's a good start.
Comment 37 James Bonacci [:jbonacci] 2013-06-10 09:28:14 PDT
Clearing NeedInfo field since Ops has already worked with this in Stage, at least.
:rfkelly - what are the next steps?
Comment 38 Ryan Kelly [:rfkelly] 2013-06-10 15:15:02 PDT
(In reply to James Bonacci [:jbonacci] from comment #37)
> :rfkelly - what are the next steps?

I'm not real confident rolling this out until it's had a full 24-hour loadtest in stage.  We need to nurse stage back to health for long enough to run this thing.
Comment 39 James Bonacci [:jbonacci] 2013-06-10 15:26:15 PDT
:rfkelly I was afraid you'd say that ;-)
See my email to the group on that subject...
Not sure how w/o some real work 
:bomb ^^
Comment 40 Ryan Kelly [:rfkelly] 2014-03-05 20:55:02 PST
Created attachment 8386556 [details] [diff] [review]
sync15-cleanup-query-on-error.diff

We should try to get this into sync1.5.  I've updated the patch for the new codebase, and addressed a couple of nits from the previous review.  I've tested this locally by simulating KeyboardInterrupt errors and it appears to be working as expected.

We'd also need to get the gunicorn timeout logic from Bug 878668 into sync1.5 prod in order to make this a complete solution.
Comment 41 Toby Elliott [:telliott] 2014-03-07 11:30:30 PST
Comment on attachment 8386556 [details] [diff] [review]
sync15-cleanup-query-on-error.diff

Review of attachment 8386556 [details] [diff] [review]:
-----------------------------------------------------------------

::: syncstorage/storage/sql/dbconnect.py
@@ +553,5 @@
> +                # getting the threadid is specific to the PyMySQL driver.
> +                # Other drivers will cause an AttributeError, failing through
> +                # to the "finally" clause at the end of this block.
> +                thread_id = connection.connection.server_thread_id[0]
> +                self.logger.debug("  killing connection %d", thread_id)

It's a little strange to log this before we know that we're actually going to do anything (for non-mySQL implementation). I guess we don't really know until we try, though.
Comment 42 Ryan Kelly [:rfkelly] 2014-03-09 15:27:30 PDT
Non-mysql implementations fail at `thread_id=connection.connection.server_thread_id[0]` and so won't actually hit the log line below.
Comment 43 Tarek Ziadé (:tarek) 2014-03-10 08:16:45 PDT
Comment on attachment 8386556 [details] [diff] [review]
sync15-cleanup-query-on-error.diff

>diff --git a/syncstorage/storage/sql/dbconnect.py b/syncstorage/storage/sql/dbconnect.py
>index 71459ce..98f2ee8 100644
>--- a/syncstorage/storage/sql/dbconnect.py
>+++ b/syncstorage/storage/sql/dbconnect.py
>@@ -19,6 +19,7 @@ This behaviour is off by default; pass shard=True to enable it.
> 
> import os
> import re
>+import sys
> import copy
> import urlparse
> import traceback
>@@ -43,8 +44,15 @@ from syncstorage.storage.sql import (queries_generic,
>                                      queries_mysql)
> 
> 
>+# Regex to match safe dataase field/column names.

s/dataase/database

> SAFE_FIELD_NAME_RE = re.compile("^[a-zA-Z0-9_]+$")
> 
>+# Regex to match specific kinds of query that are safe to kill.
>+# It's a SELECT, INSERT or UPDATE with optional leading comment.
>+SAFE_TO_KILL_QUERY = r"^\s*(/\*.*\*/)?\s*(SELECT|INSERT|UPDATE)\s"

shouldn't we have re.M ? I suspect we can have queries with multiple lines

>+SAFE_TO_KILL_QUERY = re.compile(SAFE_TO_KILL_QUERY, re.I)
>+
>+# The ttl to use for rows that are never supposted to expire.

s/supposted/supposed

> MAX_TTL = 2100000000
> 
> metadata = MetaData()
>@@ -493,7 +501,7 @@ class DBConnection(object):
>             # successfully used as part of this transaction.
>             try:
>                 query_str = self._render_query(query, params, annotations)
>-                return connection.execute(sqltext(query_str), **params)
>+                return self._exec_with_cleanup(connection, query_str, **params)
>             except DBAPIError, exc:
>                 if not is_retryable_db_error(self._connector.engine, exc):
>                     raise
>@@ -507,7 +515,7 @@ class DBConnection(object):
>                 transaction = connection.begin()
>                 annotations["retry"] = "1"
>                 query_str = self._render_query(query, params, annotations)
>-                return connection.execute(sqltext(query_str), **params)
>+                return self._exec_with_cleanup(connection, query_str, **params)
>         finally:
>             # Now that the underlying connection has been used, remember it
>             # so that all subsequent queries are part of the same transaction.
>@@ -515,6 +523,65 @@ class DBConnection(object):
>                 self._connection = connection
>                 self._transaction = transaction
> 
>+    def _exec_with_cleanup(self, connection, query_str, **params):
>+        """Execution wrapper that kills queries if it is interrupted.
>+
>+        This is a wrapper around connection.execute() that will clean up
>+        any running query if the execution is interrupted by a control-flow
>+        exception such as KeyboardInterrupt or gevent.Timeout.
>+
>+        The cleanup currently works only for the PyMySQL driver.  Other
>+        drivers will still execute fine, they just won't get the cleanup.
>+        """
>+        try:
>+            return connection.execute(sqltext(query_str), **params)
>+        except Exception:
>+            # Normal exceptions are passed straight through.
>+            raise
>+        except BaseException:
>+            # Control-flow exceptions trigger the cleanup logic.
>+            exc, val, tb = sys.exc_info()
>+            self.logger.debug("query was interrupted by %s", val)
>+            # Only cleanup SELECT, INSERT or UPDATE statements.
>+            # There are concerns that rolling back DELETEs is too costly.
>+            if not SAFE_TO_KILL_QUERY.match(query_str):
>+                msg = "  refusing to kill unsafe query: %s"
>+                self.logger.debug(msg, query_str[:100])
>+                raise
>+            try:
>+                # The KILL command is specific to MySQL, and this method of
>+                # getting the threadid is specific to the PyMySQL driver.
>+                # Other drivers will cause an AttributeError, failing through
>+                # to the "finally" clause at the end of this block.
>+                thread_id = connection.connection.server_thread_id[0]
>+                self.logger.debug("  killing connection %d", thread_id)
>+                cleanup_query = "KILL %d" % (thread_id,)
>+                # Use a freshly-created connection so that we don't block
>+                # waiting for something from the pool.  Unfortunately this
>+                # requires use of a private API and raw cursor access.
>+                cleanup_conn = self._connector.engine.pool._create_connection()
>+                try:
>+                    cleanup_cursor = cleanup_conn.connection.cursor()
>+                    try:
>+                        cleanup_cursor.execute(cleanup_query)
>+                        msg = "  successfully killed %d"
>+                        self.logger.debug(msg, thread_id)
>+                    except Exception:
>+                        msg = "  failed to kill %d"

Failed to kill *or* failed to log.. I would put the logger.debug call in the finally statement - or after


>+                        self.logger.exception(msg, thread_id)
>+                        raise
>+                    finally:
>+                        cleanup_cursor.close()
>+                finally:
>+                    cleanup_conn.close()
>+            finally:
>+                try:
>+                    # Don't return this connection to the pool.
>+                    connection.invalidate()
>+                finally:
>+                    # Always re-raise the original error.
>+                    raise exc, val, tb
>+
>     def _render_query(self, query, params, annotations):
>         """Render a query into its final string form, to send to database.
>
Comment 44 Ryan Kelly [:rfkelly] 2014-03-10 19:28:40 PDT
Thanks Tarek!

> shouldn't we have re.M ? I suspect we can have queries with multiple lines

We might have a single query spanning multiple lines, but I don't think we want the re.M semantics in this case - we still want '^' to match only the very start of the query.  We do not send multiple queries in a single string.
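The distinction is easy to check. `re.M` changes where `'^'` can match; since the patch anchors with `.match`, a single multi-line query already matches without it. Using `search` to make the flag's effect visible (the query strings are illustrative):

```python
import re

# The pattern from the patch under review (comment 43).
PATTERN = r"^\s*(/\*.*\*/)?\s*(SELECT|INSERT|UPDATE)\s"
SAFE = re.compile(PATTERN, re.I)
SAFE_M = re.compile(PATTERN, re.I | re.M)

# A single query spanning multiple lines matches fine without re.M,
# because '^' already matches the very start of the string.
multiline = "SELECT id, payload\nFROM bso\nWHERE userid = 42"
assert SAFE.match(multiline)

# With re.M, '^' would also match after every embedded newline, so a
# dangerous statement could slip through if SELECT starts a later line.
sneaky = "DELETE FROM bso;\nSELECT 1"
assert SAFE.search(sneaky) is None
assert SAFE_M.search(sneaky)
```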
Comment 45 Ryan Kelly [:rfkelly] 2014-03-10 19:40:13 PDT
Committed with Tarek's suggestions in https://github.com/mozilla-services/server-syncstorage/commit/7dc7ccb716f7eedb43e799985cdf1107b0472cdb
Comment 46 James Bonacci [:jbonacci] 2014-03-14 11:56:22 PDT
Verified in code plus latest sync-related load tests...
