Closed Bug 1291692 Opened 8 years ago Closed 8 years ago

Presto cluster unusable

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: robotblake)

Details

User Story

Long-running queries (mostly over Android datasets) are thrashing our small Presto cluster, causing every query to run extremely slowly and defeating the main purpose of getting answers from data as quickly as possible. The size of the cluster has to be increased, and the Fennec team should look into optimizing their queries to reduce running time.
      No description provided.
Severity: normal → critical
Flags: needinfo?(bimsland)
Assignee: nobody → bimsland
Points: --- → 1
:robotblake spun up a larger cluster yesterday, which alleviated some issues but unfortunately not all of them. We've cut back to the old cluster and temporarily disabled scheduled queries while we continue to investigate. The number of scheduled queries has been increasing (unused queries are probably not being removed), and many of them are scheduled to run at the same time, which can overload the cluster. For now, ad-hoc queries should work.
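
As a stopgap while scheduled queries are disabled, the scheduler could also check the backlog before submitting anything. A minimal sketch, assuming the presto-python-client package and Presto's built-in system.runtime.queries table; the coordinator hostname and the threshold below are placeholders, not our actual config:

    import time
    import prestodb  # presto-python-client

    MAX_BACKLOG = 50  # hypothetical threshold, tune against real numbers

    def queued_query_count(conn):
        # Presto exposes query state through the system connector.
        cur = conn.cursor()
        cur.execute(
            "SELECT count(*) FROM system.runtime.queries WHERE state = 'QUEUED'"
        )
        return cur.fetchone()[0]

    def submit_when_idle(conn, sql):
        # Back off instead of piling more scheduled work onto an
        # already-overloaded cluster.
        while queued_query_count(conn) > MAX_BACKLOG:
            time.sleep(60)
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()

    conn = prestodb.dbapi.connect(
        host="presto-coordinator.example.internal",  # placeholder host
        port=8080,
        user="scheduler",
        catalog="hive",
        schema="default",
    )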
(In reply to Wesley Dawson [:whd] from comment #1)
> :robotblake spun up a larger cluster yesterday, which alleviated some
> issues but unfortunately not all of them. We've cut back to the old
> cluster and temporarily disabled scheduled queries while we continue to
> investigate. The number of scheduled queries has been increasing (unused
> queries are probably not being removed), and many of them are scheduled
> to run at the same time, which can overload the cluster. For now, ad-hoc
> queries should work.

What about having a separate cluster for scheduled queries and one for interactive ones?
We're going to spin up a second cluster for scheduled queries.
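
On the client side, the split could look roughly like the sketch below, routing scheduled queries to the new coordinator while ad-hoc queries keep hitting the existing one. The hostnames, the is_scheduled flag, and the use of presto-python-client are assumptions for illustration, not actual config:

    import prestodb  # presto-python-client

    ADHOC_HOST = "presto-adhoc.example.internal"          # placeholder
    SCHEDULED_HOST = "presto-scheduled.example.internal"  # placeholder

    def connect_for(is_scheduled):
        # Scheduled jobs get their own cluster so a pile-up there can no
        # longer starve interactive users.
        host = SCHEDULED_HOST if is_scheduled else ADHOC_HOST
        return prestodb.dbapi.connect(
            host=host, port=8080, user="query-scheduler",
            catalog="hive", schema="default",
        )

    def run(sql, is_scheduled=False):
        cur = connect_for(is_scheduled).cursor()
        cur.execute(sql)
        return cur.fetchall()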
Points: 1 → 2
Flags: needinfo?(bimsland)
Priority: -- → P1
To mitigate these issues from the other side of the fence, :bbermes is going to do some cleanup on the Fennec scheduled queries. If that isn't enough, we can try sampling the datasets, as we do with the longitudinal dataset.
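
For reference, sampling a heavy query could look something like the sketch below. The table name is a placeholder, the sample_id filter assumes the dataset carries such a column (the way the longitudinal dataset does), and Presto's TABLESAMPLE is shown as a row-level fallback; this again assumes presto-python-client:

    import prestodb  # presto-python-client

    # 1% of clients, deterministic across runs; assumes the dataset has a
    # sample_id column (0-99). The table name is a placeholder.
    SAMPLED_BY_CLIENT = """
        SELECT submission_date, count(*) AS pings
        FROM fennec_events
        WHERE sample_id = 42
        GROUP BY 1
    """

    # Row-level fallback when there is no sample_id column: TABLESAMPLE
    # keeps ~1% of rows, so scale the counts back up accordingly.
    SAMPLED_BY_ROW = """
        SELECT submission_date, count(*) * 100 AS estimated_pings
        FROM fennec_events TABLESAMPLE BERNOULLI (1)
        GROUP BY 1
    """

    conn = prestodb.dbapi.connect(
        host="presto-coordinator.example.internal",  # placeholder host
        port=8080, user="analyst", catalog="hive", schema="default",
    )
    cur = conn.cursor()
    cur.execute(SAMPLED_BY_CLIENT)
    print(cur.fetchall())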
It looks like what :bbermes did improved the situation. :robotblake, can you confirm that the query backlog is down to a normal size?
(In reply to Mauro Doglio [:mdoglio] from comment #5)
> It looks like what :bbermes did improved the situation. :robotblake, can
> you confirm that the query backlog is down to a normal size?

It does look like things are slightly better, but they're not completely solved. We're still seeing some "insufficient resources" failures, and the backlog is averaging ~80-100 queries. While that's not as bad as it had been (a backlog of > 1000 queries and 6+ hour runtimes), more work will still be needed to scale and / or move some of the heavier queries to a dedicated cluster.

Expanding on that: while scaling may help (hide?) the immediate issues we're seeing (and huge thanks to both :mdoglio and :bbermes for their work on this!), I feel like we're still flailing around in the dark as to how to deal with this as usage increases. We should definitely look into instrumenting the Presto clusters via JMX to get a better idea of what's going on, and set up some sort of alerting so we know when things start to go south (see the sketch at the end of this comment), but that alone doesn't solve the problems we're running into.

I'm planning on reaching out to the Presto team to see if any of them have time to talk, so if anyone is interested in being involved in that conversation, please let me know.
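
As a possible starting point for that instrumentation, Presto ships a jmx connector that can be queried like any other catalog, so a first pass at monitoring/alerting could be as small as the sketch below. The MBean and column names are written from memory, the hostname and threshold are placeholders, and this is a sketch of the idea rather than what the follow-up bug will actually implement:

    import prestodb  # presto-python-client

    QUEUE_ALERT_THRESHOLD = 100  # hypothetical threshold

    conn = prestodb.dbapi.connect(
        host="presto-coordinator.example.internal",  # placeholder host
        port=8080, user="monitor", catalog="jmx", schema="current",
    )
    cur = conn.cursor()
    # The jmx connector exposes each MBean as a table; QueryManager carries
    # per-node counts of running and queued queries.
    cur.execute("""
        SELECT node, runningqueries, queuedqueries
        FROM jmx.current."com.facebook.presto.execution:name=QueryManager"
    """)
    for node, running, queued in cur.fetchall():
        if queued > QUEUE_ALERT_THRESHOLD:
            # Wire this up to whatever alerting we end up choosing.
            print("ALERT: %s has %d queued queries (%d running)"
                  % (node, queued, running))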
I'll create a new bug for adding the JMX instrumentation, but after splitting the clusters we're no longer seeing the failures we were seeing before.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard