Closed
Bug 1291692
Opened 9 years ago
Closed 9 years ago
Presto cluster unusable
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rvitillo, Assigned: robotblake)
Details
User Story
Long-running queries (mostly over Android datasets) are thrashing our small Presto cluster, causing every query to run extremely slowly and defeating the main purpose of getting answers from data as quickly as possible. The size of the cluster has to be increased, and the Fennec team should look into optimizing its queries to reduce running time.
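As a starting point, a query along these lines against Presto's built-in system.runtime.queries table lists what is currently running, longest-running first (a sketch; exact column names vary across Presto versions):

    -- List running queries, oldest first, to spot the long-running
    -- Android-dataset queries (column names vary by Presto version).
    SELECT query_id,
           "user",
           source,
           created,
           substr(query, 1, 80) AS query_prefix
    FROM system.runtime.queries
    WHERE state = 'RUNNING'
    ORDER BY created;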
No description provided.
Reporter
Updated • 9 years ago
Severity: normal → critical
Reporter
Updated • 9 years ago
Flags: needinfo?(bimsland)
Updated • 9 years ago
Assignee: nobody → bimsland
Points: --- → 1
Comment 1 • 9 years ago (Wesley Dawson [:whd])
:robotblake spun up a large cluster yesterday which alleviated some issues, but unfortunately not all of them. We've cut back to the old cluster and temporarily disabled scheduled queries while we continue to investigate. The number of scheduled queries has been increasing (unused queries are probably not being removed), and they are often scheduled to run at the same time, which can overload the cluster. For now, ad-hoc queries should work.
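A breakdown along these lines would show whether the scheduled traffic is what's piling up (a sketch; it assumes the scheduler tags its queries with a distinguishing source value, and column names vary by Presto version):

    -- Break down cluster load by query state and source to see whether
    -- scheduled queries dominate the queue.
    SELECT state, source, count(*) AS n
    FROM system.runtime.queries
    GROUP BY state, source
    ORDER BY n DESC;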
Reporter
Comment 2 • 9 years ago
(In reply to Wesley Dawson [:whd] from comment #1)
> :robotblake spun up a large cluster yesterday which alleviated some issues,
> but unfortunately not all of them. We've cut back to the old cluster and
> temporarily disabled scheduled queries while we continue to investigate.
> The number of scheduled queries has been increasing (unused queries are
> probably not being removed), and they are often scheduled to run at the
> same time, which can overload the cluster. For now, ad-hoc queries should
> work.
What about having a separate cluster for scheduled queries and one for interactive ones?
Assignee
Comment 3 • 9 years ago
We're going to spin up a second cluster for scheduled queries.
Points: 1 → 2
Flags: needinfo?(bimsland)
Priority: -- → P1
Comment 4 • 9 years ago
To mitigate these issues from the other side of the fence, :bbermes is going to do some cleanup on the Fennec scheduled queries. If that isn't enough, we can try sampling the datasets, as we do with the longitudinal dataset.
Comment 5 • 9 years ago (Mauro Doglio [:mdoglio])
It looks like what :bbermes did improved the situation. :robotblake can you confirm the query backlog is down to a normal size?
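Something along these lines against system.runtime.queries should show the current backlog (a sketch; column names vary by Presto version):

    -- Count queries waiting in the queue.
    SELECT count(*) AS queued_queries
    FROM system.runtime.queries
    WHERE state = 'QUEUED';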
Assignee
Comment 6 • 9 years ago
(In reply to Mauro Doglio [:mdoglio] from comment #5)
> It looks like what :bbermes did improved the situation. :robotblake can you
> confirm the query backlog is down to a normal size?
Things do look slightly better, but they're not completely solved. We're still seeing some "insufficient resources" failures, and the backlog averages ~80-100 queries. While we're not backing up as badly as before (a backlog of more than 1,000 queries and 6+ hour runtimes), more work is still needed to scale the cluster and/or move some of the heavier queries to a dedicated one.
Expanding on that: while scaling may help (or merely hide?) the immediate issues we're seeing (huge thanks to both :mdoglio and :bbermes for their work on this!), I feel like we're still flailing around in the dark about how to handle this as usage increases. We should definitely instrument the Presto clusters via JMX to get a better idea of what's going on, and set up some sort of alerting so we know when things start to go south, but that alone doesn't solve the problems we're running into.
I'm planning to reach out to the Presto team to see if any of them have time to talk, so if anyone is interested in being involved in that conversation, please let me know.
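For reference, once the JMX connector is enabled, coordinator and worker MBeans become queryable from Presto itself, for example (a sketch, assuming a catalog named jmx is configured; the schema name differs across Presto versions):

    -- Basic JVM info per node via the JMX connector.
    SELECT node, vmname, uptime
    FROM jmx.current."java.lang:type=runtime";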
Assignee
Comment 7 • 9 years ago
I'll file a new bug for adding the JMX instrumentation, but after splitting the clusters we're no longer seeing the failures we saw before.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated • 6 years ago
Product: Cloud Services → Cloud Services Graveyard