Closed Bug 1500473 Opened 7 years ago Closed 4 years ago

Add monitoring for the re:dash query queue

Categories

(Data Platform and Tools :: Monitoring & Alerting, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: mreid, Unassigned)

Details

As a user of STMO, it can be frustrating when a query that is normally quick to run gets stuck in the queue and ends up taking much longer. We should add some monitoring to get a sense of the impact of queueing on STMO users.
We already have monitoring for queues and alert based on thresholds. I would give you the link to the Datadog page, but the page is down right now :\ I don't think we need additional monitoring here; I think we need to give users feedback when their jobs are queued, and possibly their position in the queue.
Which of these series corresponds to the queue for interactive querying? As I read it, it looks like "redash.queues.queries", but that consistently shows zero.
Flags: needinfo?(jthomas)
big_queries = Athena interactive queries, specifically the Athena and Athena Search data sources
presto_queries = Presto interactive queries, specifically the Presto and Presto Search data sources

If you mouse over the tooltip, it will list which data source is associated with the queue.

There is a 'waiting' queue, but I can't find much documentation on how it relates to actual scheduling [1]. Even though some queues are empty, there are query tasks in the 'waiting' queue. I am assuming queries get pulled off the normal queues and put into the waiting queue when they are about to get executed. CC'ing :jezdez for possible insight.

[1] https://sql.telemetry.mozilla.org/admin/queries/tasks#waiting
Flags: needinfo?(jthomas)
Is it possible to quantify some sort of average time in queue? Queue length is one thing, but I feel that time in queue is a better barometer of user impact: queue times can change fairly dramatically even though the queue size stays consistent, if our average query time grows.
Just got another report on Slack of queries being stuck in the queue. All the queues shown under "Queues" on the Admin Status page [1] are showing zero, but there are several queries that appear on the "Queries Queue" page [2].

This query queue can be monitored using the Redash API at [3]. Is that something we can easily add to Datadog? Inside that JSON response, we're interested in tracking the length of the "waiting" element.

It would also be interesting to keep track of the age of the oldest item in that array as an indicator of "time in queue", per Sunah's suggestion above.

[1] https://sql.telemetry.mozilla.org/admin/status
[2] https://sql.telemetry.mozilla.org/admin/queries/tasks#waiting
[3] https://sql.telemetry.mozilla.org/api/admin/queries/tasks
I can't verify whether that's the case here (I don't see how many workers we have), but this sounds somewhat similar: https://github.com/getredash/redash/issues/1782#issuecomment-313104463
tl;dr: by default a Celery worker prefetches 4 tasks at a time, so if the first one is long-running (scheduled queries?), the others need to wait.
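To illustrate the prefetch effect described in the linked comment, here is a toy simulation in plain Python. It does not use Celery at all: the function name `start_times`, the round-robin delivery order, and the task durations are assumptions made purely for illustration, and real broker and worker behavior differs in detail.

```python
import heapq
from collections import deque


def start_times(durations, workers, prefetch):
    """Toy model (an assumption, not Celery itself): return the start time of each task.

    Each worker may hold up to `prefetch` reserved tasks, counting the one
    it is running. Tasks are handed out round-robin at time 0, and a worker
    reserves one more task from the shared queue each time it finishes one.
    """
    pending = deque(range(len(durations)))
    local = [deque() for _ in range(workers)]

    # Initial round-robin delivery until every worker holds `prefetch` tasks.
    for _ in range(prefetch):
        for w in range(workers):
            if pending:
                local[w].append(pending.popleft())

    starts = [None] * len(durations)
    events = []  # min-heap of (finish_time, worker_index)

    # Every worker starts its first reserved task at time 0.
    for w in range(workers):
        if local[w]:
            task = local[w].popleft()
            starts[task] = 0
            heapq.heappush(events, (durations[task], w))

    while events:
        now, w = heapq.heappop(events)
        # Finishing a task frees one reservation slot on this worker.
        if pending:
            local[w].append(pending.popleft())
        if local[w]:
            task = local[w].popleft()
            starts[task] = now
            heapq.heappush(events, (now + durations[task], w))

    return starts


if __name__ == "__main__":
    durations = [100, 1, 1, 1, 1, 1]  # one long query followed by short ones
    print(start_times(durations, workers=2, prefetch=4))
    print(start_times(durations, workers=2, prefetch=1))
```

In this model, with two workers and a prefetch of 4, two of the short tasks are reserved behind the long one and only start at 100 and 101 time units, even though the other worker goes idle after 3. With a prefetch of 1, every short task has started within 4 time units. In Celery the corresponding setting is `worker_prefetch_multiplier` (`CELERYD_PREFETCH_MULTIPLIER` in older versions); whether changing it would help on STMO is not something this sketch can show.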

(In reply to Mark Reid [:mreid] from comment #6)

> This query queue can be monitored using the Redash API at [3], is that
> something we can easily add to DataDog? Inside that JSON response, we're
> interested in tracking the length of the "waiting" element.
>
> It would also be interesting to keep track of the age of the oldest item in
> that array as an indicator of "time in queue" per Sunah's suggestion above.
>
> [1] https://sql.telemetry.mozilla.org/admin/status
> [2] https://sql.telemetry.mozilla.org/admin/queries/tasks#waiting
> [3] https://sql.telemetry.mozilla.org/api/admin/queries/tasks

Thanks, I didn't realize that this data was available. We should be able to poll this, send it to Datadog, and add alerting around it.

A large waiting queue may indicate that data sources are overloaded; based on that, we can figure out what to do next.
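A minimal sketch of what that polling could look like, in Python. Several things here are assumptions rather than anything confirmed in this bug: that each item in "waiting" carries a `created_at` Unix timestamp, that the endpoint accepts an `Authorization: Key ...` header with an admin API key, and the metric names, which are hypothetical. Sending the values to Datadog is left out, since it depends on the client in use.

```python
import json
import time
import urllib.request


def fetch_tasks(base_url, api_key):
    """Fetch the admin task list as parsed JSON.

    Assumption: the endpoint accepts a Redash API key in the
    Authorization header; adjust to however the deployment authenticates.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/admin/queries/tasks",
        headers={"Authorization": "Key " + api_key},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def waiting_queue_metrics(tasks_response, now=None):
    """Compute the two values discussed above from the JSON response.

    Assumption: each waiting task has a `created_at` Unix timestamp.
    Tasks without one are counted but ignored for the age calculation.
    The metric names in the returned dict are hypothetical.
    """
    now = time.time() if now is None else now
    waiting = tasks_response.get("waiting", [])
    created = [t["created_at"] for t in waiting if t.get("created_at") is not None]
    oldest_age = max(0.0, now - min(created)) if created else 0.0
    return {
        "waiting_count": len(waiting),
        "oldest_waiting_age_seconds": oldest_age,
    }
```

Both values would be reported as gauges on each poll; the age of the oldest waiting task is the closer proxy for "time in queue", while the count shows how many users are affected.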

We won't be putting further investment into Redash; we're favoring Looker instead.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX