Closed Bug 1323840 Opened 8 years ago Closed 7 years ago

Redash seems to regularly run multiple copies of a query

Categories

(Cloud Services :: Metrics: Data Tools, defect, P2)

Points: 1

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rfkelly, Assigned: robotblake)

Details

(Whiteboard: [SvcOps])

Attachments

(2 files)

Attached is a screenshot of the queries currently running against the FxA redshift instance. There are five queries, all scheduled jobs from redash, and:
* Two of them have query hash "001515..."
* Three of them have query hash "278660..."
These queries are set to refresh either every 24 hours, or every 7 days. Is it expected that redash would run multiple copies of the same query?
I should add, I regularly see this behaviour, where several instances of the same query are being run and backing up the queue for other work.
I'll also note that our queries currently have pretty long runtimes, several hours in many cases (although we're working on speeding them up!). So I wonder if redash is doing something like:
* Checking whether the query needs to be refreshed, and firing off a job to refresh it
* Coming back an hour later and checking again
* Finding that the query has not been refreshed (because the job hasn't finished yet) and so kicking off another attempt
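The suspected sequence above can be sketched as a small simulation. This is only an illustration of the hypothesis, not redash's actual scheduler code: `ScheduledQuery`, `is_due`, `naive_tick`, `guarded_tick` and the query ID `42` are all hypothetical names invented for the example. It contrasts a check that only looks at the last completed refresh with one that also remembers which queries already have a job in flight.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class ScheduledQuery:
    # Hypothetical model of a scheduled query; not a redash class.
    query_id: int
    interval_s: int                          # refresh interval, in seconds
    last_refreshed_s: Optional[int] = None   # set only when a job finishes


def is_due(q: ScheduledQuery, now_s: int) -> bool:
    """A query is due if it has never finished or its last result is too old."""
    return q.last_refreshed_s is None or now_s - q.last_refreshed_s >= q.interval_s


def naive_tick(queries: List[ScheduledQuery], now_s: int) -> List[int]:
    """Launch a job for every due query, ignoring jobs that are still running."""
    return [q.query_id for q in queries if is_due(q, now_s)]


def guarded_tick(queries: List[ScheduledQuery], now_s: int,
                 in_flight: Set[int]) -> List[int]:
    """Launch a job only if the query is due and has no job already in flight."""
    launched = []
    for q in queries:
        if is_due(q, now_s) and q.query_id not in in_flight:
            in_flight.add(q.query_id)   # the caller removes it when the job ends
            launched.append(q.query_id)
    return launched


if __name__ == "__main__":
    hour = 3600
    q = ScheduledQuery(query_id=42, interval_s=24 * hour, last_refreshed_s=0)
    ticks = [24 * hour, 25 * hour, 26 * hour]   # hourly checks; job never finishes
    print([qid for t in ticks for qid in naive_tick([q], t)])
    in_flight: Set[int] = set()
    print([qid for t in ticks for qid in guarded_tick([q], t, in_flight)])
```

With a job that runs for longer than the check interval, the naive check launches one copy per tick, while the guarded check launches a single copy. Note that an in-memory guard like this would be lost if a worker restarts, so it only addresses the pattern described in this comment.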
Summary: Redash seems to regularly run multiple copes of a query → Redash seems to regularly run multiple copies of a query
I've noticed issues in the past with the re:dash workers crashing / restarting and losing state and I'm guessing that's what's happening here. It currently shows a single copy of each of those queries running (screenshot attached). I'll do some digging in the logs and see if that looks like the issue. Regardless of the cause (and beyond trying to change some of the re:dash code to make the workers more resilient to failures), I wonder if it would make sense to figure out a way to cancel duplicate queries on some sort of schedule (cron or the like). Amazon has docs on cancelling long running queries at [1].
[1] http://docs.aws.amazon.com/redshift/latest/dg/cancel_query.html
> I wonder if it would make sense to figure out a way to cancel duplicate queries on some sort of schedule

Fwiw, I don't think this would fix the problems we're having because the scheduled queries get immediately respawned after you cancel them. I see this frequently. A recent example:

> fxa=# select pid, trim(user_name), starttime, substring(query,1,85)
> from stv_recents
> where status='Running';
>   pid  | btrim |         starttime          | substring
> -------+-------+----------------------------+---------------------------------------------------------------------------------------
>  25097 | fxa   | 2016-12-20 14:45:55.025846 | /* Username: Scheduled, Task ID: 1f947d96-0e97-4b78-b77d-53351edd4d88, Query ID: 1856
>  28109 | fxa   | 2016-12-20 15:46:56.427289 | /* Username: Scheduled, Task ID: da326a0c-110d-4df8-947f-4bce19151cde, Query ID: 1856
> (2 rows)

Here we can see a single scheduled query demonstrating the problem that :rfkelly described initially. Notice that the two copies of it are spaced around an hour apart; that's usually the case I find, presumably the 1 hour interval is significant somewhere in the code. I have also seen cases where a 3rd copy is staggered a further hour behind the 2nd. Anyway, these two were maxing out the CPU and preventing me measuring some performance improvements I was working on for another query. So I cancelled both of them:

> fxa=# cancel 25097;
> CANCEL
> fxa=# cancel 28109;
> CANCEL

Then, seconds later:

> fxa=# select pid, trim(user_name), starttime, substring(query,1,85)
> from stv_recents
> where status='Running';
>   pid  | btrim |         starttime          | substring
> -------+-------+----------------------------+---------------------------------------------------------------------------------------
>  28930 | fxa   | 2016-12-20 16:02:56.953624 | /* Username: Scheduled, Task ID: 57b93d72-783e-4ac1-8af0-899658b72b73, Query ID: 1856
> (1 row)

The same query id, respawned. Is there any config in redash that controls this respawn behaviour?
If that were possible then the mooted cron job may become a possibility.
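For reference, the detection half of such a cron job could look roughly like the sketch below. It assumes rows shaped like the `stv_recents` output quoted in this comment (pid, starttime, query text) and relies on the `Username: Scheduled, Task ID: ..., Query ID: ...` header visible there; `HEADER_RE` and `duplicate_pids` are made-up names for illustration, and the sketch deliberately stops short of issuing any `cancel`, since cancelled scheduled queries were seen to respawn.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Matches the comment header seen on scheduled queries in the output above.
HEADER_RE = re.compile(
    r"Username: Scheduled, Task ID: ([0-9a-f-]+), Query ID: (\d+)"
)

# (pid, starttime as 'YYYY-MM-DD HH:MM:SS.ffffff', query text)
Row = Tuple[int, str, str]


def duplicate_pids(rows: Iterable[Row]) -> List[int]:
    """Return the pids of every running copy except the earliest, per Query ID."""
    by_query: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
    for pid, starttime, text in rows:
        match = HEADER_RE.search(text)
        if match:  # rows without the scheduled-query header are left alone
            by_query[int(match.group(2))].append((starttime, pid))

    dupes: List[int] = []
    for copies in by_query.values():
        copies.sort()  # timestamps in this fixed format sort chronologically
        dupes.extend(pid for _, pid in copies[1:])
    return sorted(dupes)


if __name__ == "__main__":
    rows = [
        (25097, "2016-12-20 14:45:55.025846",
         "/* Username: Scheduled, Task ID: 1f947d96-0e97-4b78-b77d-53351edd4d88, Query ID: 1856"),
        (28109, "2016-12-20 15:46:56.427289",
         "/* Username: Scheduled, Task ID: da326a0c-110d-4df8-947f-4bce19151cde, Query ID: 1856"),
    ]
    print(duplicate_pids(rows))
```

One caveat: the 85-character substring used in the query above ends exactly at the four-digit query id, so a longer id would be truncated; a real job would want to select a longer prefix of the query text.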
This was down to two schedulers running simultaneously and has been fixed.
Assignee: nobody → bimsland
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Attaching a screenshot showing that this issue is not resolved for us yet, sorry. The screenshot is of the AWS console for our redshift cluster. In it you can see 9 separate instances of query 2412, all invoked by the scheduler, in a single 3-hour window. You can see that the first 5 instances of it all completed successfully, yet it still got rescheduled. I terminated the 6th instance in a forlorn attempt to speed up another query I was running. I quickly regretted that decision because, shortly afterwards, instances 7, 8 and 9 were started. According to redash, query 2412 [1] is scheduled to run once every 24 hours. Any chance somebody could take another look at this to try and figure out what's going wrong?
Re-opening because this problem is not fixed for us, see comment 6.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Fwiw, these scheduled queries appear to have settled down again now. Not sure if anything's been intentionally fixed in the last couple of days?
Fwiw, this has started happening again in the last day or two, e.g. right now we have three copies of scheduled query #2255 running, started within 30 minutes of each other.
I don't think we're being bitten by this any more. Recent changes to the redshift schemata improved query speed, plus the cluster was beefed up too. I'm assuming that all helped, closing this down again.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Resolution: --- → FIXED