Closed Bug 1323840 Opened 8 years ago Closed 7 years ago

Redash seems to regularly run multiple copies of a query

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: rfkelly, Assigned: robotblake)

Details

(Whiteboard: [SvcOps])

Attachments

(2 files)

Screen Shot 2016-12-16 at 09.11.27.png 8 years ago Ryan Kelly [:rfkelly] 85.20 KB, image/png		Details
Screenshot showing the problem still happening with query 2412. 8 years ago Phil Booth [:pb] 518.35 KB, image/png		Details

Ryan Kelly [:rfkelly]

Reporter

Description

8 years ago

Attached image Screen Shot 2016-12-16 at 09.11.27.png — Details

Attached is a screenshot of the queries currently running against the FxA redshift instance. There are five queries, all scheduled jobs from redash, and:

* Two of them have query hash "001515..."
* Three of them have query hash "278660..."

These queries are set to refresh either every 24 hours, or every 7 days. Is it expected that redash would run multiple copies of the same query?

Ryan Kelly [:rfkelly]

Reporter

Comment 1

8 years ago

I should add, I regularly see this behaviour, where several instances of the same query are being run and backing up the queue for other work.

Ryan Kelly [:rfkelly]

Reporter

Comment 2

8 years ago

I'll also note that our queries have pretty long runtimes currently, several hours in many cases (although we're working on speeding them up!). So I wonder if redash is doing something like:

* Checking whether the query needs to be refreshed, and firing off a job to refresh it
* Coming back an hour later and checking again
* Finding that the query has not been refreshed (because the job hasn't finished yet) and so kicking off another attempt
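The suspected sequence can be modelled in a few lines. This is a toy sketch only, not Redash's actual scheduler: the `ToyScheduler` class, its `tick` method and the `track_in_flight` flag are hypothetical names invented for illustration, and the 24-hour interval and 3-hour runtime are example values. It shows how an hourly check that looks only at the last completed refresh would enqueue extra copies of a long-running query, and how remembering in-flight jobs would avoid that.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Hypothetical model of a periodic refresh check (not Redash code)."""
    refresh_interval_h: int          # how often the query should refresh
    runtime_h: int                   # how long one execution takes
    track_in_flight: bool            # remember jobs that have not finished yet
    last_refreshed_h: int = -10**9   # hour of the last completed refresh (never, initially)
    running: list = field(default_factory=list)  # finish hours of in-flight jobs

    def tick(self, now_h: int) -> int:
        """Run one hourly check and return the number of copies now running."""
        # Retire jobs that have finished and record when the result was refreshed.
        for finish_h in [f for f in self.running if f <= now_h]:
            self.running.remove(finish_h)
            self.last_refreshed_h = finish_h
        due = now_h - self.last_refreshed_h >= self.refresh_interval_h
        blocked = self.track_in_flight and bool(self.running)
        if due and not blocked:
            # Fire off a refresh job that will finish runtime_h hours from now.
            self.running.append(now_h + self.runtime_h)
        return len(self.running)


def max_concurrent(track_in_flight: bool, hours: int = 6) -> int:
    """Peak number of simultaneous copies over a run of hourly checks."""
    sched = ToyScheduler(refresh_interval_h=24, runtime_h=3,
                         track_in_flight=track_in_flight)
    return max(sched.tick(h) for h in range(hours))
```

Under these assumptions `max_concurrent(False)` is greater than one, because each hourly check still sees a stale result, while `max_concurrent(True)` stays at one.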

Summary: Redash seems to regularly run multiple copes of a query → Redash seems to regularly run multiple copies of a query

Blake Imsland [:robotblake]

Assignee

Comment 3

8 years ago

I've noticed issues in the past with the re:dash workers crashing / restarting and losing state and I'm guessing that's what's happening here. It currently shows a single copy of each of those queries running (screenshot attached). I'll do some digging in the logs and see if that looks like the issue.

Regardless of the cause (and beyond trying to change some of the re:dash code to make the workers more resilient to failures), I wonder if it would make sense to figure out a way to cancel duplicate queries on some sort of schedule (cron or the like). Amazon has docs on cancelling long running queries at [1].

[1] http://docs.aws.amazon.com/redshift/latest/dg/cancel_query.html
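One way such a scheduled clean-up could choose its targets is sketched below. It is only an illustration under stated assumptions: the `RunningQuery` record and the `pids_to_cancel` helper are hypothetical, the list of running queries is assumed to have been fetched already, and the policy of keeping the oldest copy of each query is a guess rather than anything decided in this bug. The returned PIDs would then be passed to Redshift's `CANCEL` command.

```python
from collections import defaultdict
from datetime import datetime
from typing import NamedTuple


class RunningQuery(NamedTuple):
    """Hypothetical record describing one query currently running."""
    pid: int             # process ID, as accepted by CANCEL
    query_id: int        # re:dash query ID the execution belongs to
    starttime: datetime  # when this execution started


def pids_to_cancel(running):
    """Return PIDs of duplicate executions, keeping the oldest copy per query."""
    by_query = defaultdict(list)
    for row in running:
        by_query[row.query_id].append(row)

    doomed = []
    for copies in by_query.values():
        # Assumed policy: the earliest execution has made the most progress,
        # so keep it and cancel every later copy of the same query.
        copies.sort(key=lambda r: r.starttime)
        doomed.extend(r.pid for r in copies[1:])
    return sorted(doomed)
```

A cron job could call `pids_to_cancel` on each run and issue one `CANCEL` per returned PID; queries with a single running copy are never touched.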

Phil Booth [:pb]

Comment 4

8 years ago

> I wonder if it would make sense to figure out a way to cancel duplicate queries on some sort of schedule

Fwiw, I don't think this would fix the problems we're having because the scheduled queries get immediately respawned after you cancel them. I see this frequently. A recent example:

> fxa=# select pid, trim(user_name), starttime, substring(query,1,85)
> from stv_recents
> where status='Running';
>   pid  | btrim |         starttime          | substring
> -------+-------+----------------------------+---------------------------------------------------------------------------------------
>  25097 | fxa   | 2016-12-20 14:45:55.025846 | /* Username: Scheduled, Task ID: 1f947d96-0e97-4b78-b77d-53351edd4d88, Query ID: 1856
>  28109 | fxa   | 2016-12-20 15:46:56.427289 | /* Username: Scheduled, Task ID: da326a0c-110d-4df8-947f-4bce19151cde, Query ID: 1856
> (2 rows)

Here we can see a single scheduled query demonstrating the problem that :rfkelly described initially. Notice that the two copies of it are spaced around an hour apart; that's usually the case I find, presumably the 1 hour interval is significant somewhere in the code. I have also seen cases where a 3rd copy is staggered a further hour behind the 2nd.

Anyway, these two were maxing out the CPU and preventing me measuring some performance improvements I was working on for another query. So I cancelled both of them:

> fxa=# cancel 25097;
> CANCEL
> fxa=# cancel 28109;
> CANCEL

Then, seconds later:

> fxa=# select pid, trim(user_name), starttime, substring(query,1,85)
> from stv_recents
> where status='Running';
>   pid  | btrim |         starttime          | substring
> -------+-------+----------------------------+---------------------------------------------------------------------------------------
>  28930 | fxa   | 2016-12-20 16:02:56.953624 | /* Username: Scheduled, Task ID: 57b93d72-783e-4ac1-8af0-899658b72b73, Query ID: 1856
> (1 row)

The same query id, respawned. Is there any config in redash that controls this respawn behaviour? If that were possible then the mooted cron job may become a possibility.
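The duplicates in the output above are identifiable because each scheduled query begins with a comment carrying a Task ID and a Query ID. A minimal sketch of extracting those fields and counting running copies per query follows. The regular expression is inferred only from the three fields visible in that output, and `parse_annotation` and `copies_per_query` are hypothetical helper names, not part of redash.

```python
import re
from collections import Counter

# Pattern inferred from the leading comment seen in the stv_recents output;
# the real annotation may carry more fields after "Query ID".
ANNOTATION = re.compile(
    r"/\*\s*Username:\s*(?P<username>[^,]+),\s*"
    r"Task ID:\s*(?P<task_id>[0-9a-fA-F-]+),\s*"
    r"Query ID:\s*(?P<query_id>\d+)"
)


def parse_annotation(sql_text):
    """Return the annotation fields as a dict, or None if no annotation is found."""
    match = ANNOTATION.search(sql_text)
    if match is None:
        return None
    return {
        "username": match["username"].strip(),
        "task_id": match["task_id"],
        "query_id": int(match["query_id"]),
    }


def copies_per_query(sql_texts):
    """Count running copies of each scheduled query ID."""
    counts = Counter()
    for text in sql_texts:
        info = parse_annotation(text)
        if info is not None and info["username"] == "Scheduled":
            counts[info["query_id"]] += 1
    return counts
```

Any query ID whose count exceeds one has more than one scheduled copy running at once, which is the condition reported here.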

Blake Imsland [:robotblake]

Assignee

Comment 5

8 years ago

This was down to two schedulers running simultaneously and has been fixed.

Assignee: nobody → bimsland

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

Phil Booth [:pb]

Comment 6

8 years ago

Attached image Screenshot showing the problem still happening with query 2412. — Details

Attaching a screenshot showing that this issue is not resolved for us yet, sorry.

The screenshot is of the AWS console for our redshift cluster. In it you can see 9 separate instances of query 2412, all invoked by the scheduler, in a single 3-hour window. You can see that the first 5 instances of it all completed successfully, yet it still got rescheduled. I terminated the 6th instance in a forlorn attempt to speed up another query I was running. I quickly regretted that decision because, shortly afterwards, instances 7, 8 and 9 were started.

According to redash, query 2412 [1] is scheduled to run once every 24 hours. Any chance somebody could take another look at this to try and figure out what's going wrong?

Phil Booth [:pb]

Comment 7

8 years ago

Re-opening because this problem is not fixed for us, see comment 6.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Phil Booth [:pb]

Comment 8

8 years ago

Missing link to the infamous query 2412 from comment 6: [1] https://sql.telemetry.mozilla.org/queries/2412?p_start_date=2017-01-01

Phil Booth [:pb]

Comment 9

8 years ago

Fwiw, these scheduled queries appear to have settled down again now. Not sure if anything's been intentionally fixed in the last couple of days?

Phil Booth [:pb]

Comment 10

8 years ago

Fwiw, this has started happening again in the last day or two, e.g. right now we have three copies of scheduled query #2255 running, started within 30 minutes of each other.

Phil Booth [:pb]

Comment 11

7 years ago

I don't think we're being bitten by this any more. Recent changes to the redshift schemata improved query speed, plus the cluster was beefed up too. I'm assuming that all helped, so I'm closing this down again.

Status: REOPENED → RESOLVED

Closed: 8 years ago → 7 years ago

Resolution: --- → FIXED

Categories

(Cloud Services :: Metrics: Data Tools, defect, P2)