Closed Bug 1291692 Opened 8 years ago Closed 8 years ago

Presto cluster unusable

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: robotblake)

Details

User Story

Long-running queries (mostly over Android datasets) are thrashing our small Presto cluster, causing every query to run extremely slowly and defeating the main purpose of getting answers from data as quickly as possible. The size of the cluster has to be increased, and the Fennec team should look into optimizing their queries to reduce running time.
      No description provided.
Severity: normal → critical
Flags: needinfo?(bimsland)
Assignee: nobody → bimsland
Points: --- → 1
:robotblake spun up a larger cluster yesterday, which alleviated some issues but unfortunately not all of them. We've cut back to the old cluster and temporarily disabled scheduled queries while we continue to investigate. The number of scheduled queries has been increasing (unused queries are probably not being removed), and many of them are scheduled to run at the same time, which can overload the cluster. For now, ad-hoc queries should work.
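
As a stopgap while scheduled queries are disabled, the scheduler could also check the backlog before submitting anything. A minimal sketch, assuming the presto-python-client package and Presto's built-in system.runtime.queries table; the coordinator hostname and the threshold below are placeholders, not our actual config:

    import time
    import prestodb  # presto-python-client

    MAX_BACKLOG = 50  # hypothetical threshold, tune against real numbers

    def queued_query_count(conn):
        # Presto exposes query state through the system connector.
        cur = conn.cursor()
        cur.execute(
            "SELECT count(*) FROM system.runtime.queries WHERE state = 'QUEUED'"
        )
        return cur.fetchone()[0]

    def submit_when_idle(conn, sql):
        # Back off instead of piling more scheduled work onto an
        # already-overloaded cluster.
        while queued_query_count(conn) > MAX_BACKLOG:
            time.sleep(60)
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()

    conn = prestodb.dbapi.connect(
        host="presto-coordinator.example.internal",  # placeholder host
        port=8080,
        user="scheduler",
        catalog="hive",
        schema="default",
    )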
(In reply to Wesley Dawson [:whd] from comment #1)
> :robotblake spun up a larger cluster yesterday, which alleviated some
> issues but unfortunately not all of them. We've cut back to the old
> cluster and temporarily disabled scheduled queries while we continue to
> investigate. The number of scheduled queries has been increasing (unused
> queries are probably not being removed), and many of them are scheduled
> to run at the same time, which can overload the cluster. For now, ad-hoc
> queries should work.

What about having a separate cluster for scheduled queries and one for interactive ones?
We're going to spin up a second cluster for scheduled queries.
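
On the client side, the split could look roughly like the sketch below, routing scheduled queries to the new coordinator while ad-hoc queries keep hitting the existing one. The hostnames, the is_scheduled flag, and the use of presto-python-client are assumptions for illustration, not actual config:

    import prestodb  # presto-python-client

    ADHOC_HOST = "presto-adhoc.example.internal"          # placeholder
    SCHEDULED_HOST = "presto-scheduled.example.internal"  # placeholder

    def connect_for(is_scheduled):
        # Scheduled jobs get their own cluster so a pile-up there can no
        # longer starve interactive users.
        host = SCHEDULED_HOST if is_scheduled else ADHOC_HOST
        return prestodb.dbapi.connect(
            host=host, port=8080, user="query-scheduler",
            catalog="hive", schema="default",
        )

    def run(sql, is_scheduled=False):
        cur = connect_for(is_scheduled).cursor()
        cur.execute(sql)
        return cur.fetchall()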
Points: 1 → 2
Flags: needinfo?(bimsland)
Priority: -- → P1
To mitigate these issues from the other side of the fence, :bbermes is going to do some cleanup on the Fennec scheduled queries. If that isn't enough, we can try sampling the datasets, as we do with the longitudinal dataset.
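
For reference, sampling a heavy query could look something like the sketch below. The table name is a placeholder, the sample_id filter assumes the dataset carries such a column (the way the longitudinal dataset does), and Presto's TABLESAMPLE is shown as a row-level fallback; this again assumes presto-python-client:

    import prestodb  # presto-python-client

    # 1% of clients, deterministic across runs; assumes the dataset has a
    # sample_id column (0-99). The table name is a placeholder.
    SAMPLED_BY_CLIENT = """
        SELECT submission_date, count(*) AS pings
        FROM fennec_events
        WHERE sample_id = 42
        GROUP BY 1
    """

    # Row-level fallback when there is no sample_id column: TABLESAMPLE
    # keeps ~1% of rows, so scale the counts back up accordingly.
    SAMPLED_BY_ROW = """
        SELECT submission_date, count(*) * 100 AS estimated_pings
        FROM fennec_events TABLESAMPLE BERNOULLI (1)
        GROUP BY 1
    """

    conn = prestodb.dbapi.connect(
        host="presto-coordinator.example.internal",  # placeholder host
        port=8080, user="analyst", catalog="hive", schema="default",
    )
    cur = conn.cursor()
    cur.execute(SAMPLED_BY_CLIENT)
    print(cur.fetchall())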
It looks like what :bbermes did improved the situation. :robotblake, can you confirm that the query backlog is down to a normal size?
(In reply to Mauro Doglio [:mdoglio] from comment #5)
> It looks like what :bbermes did improved the situation. :robotblake, can
> you confirm that the query backlog is down to a normal size?

It does look like things are slightly better, but they're not completely solved. We're still seeing some "insufficient resources" failures, and the backlog is averaging ~80-100 queries. While that's not as bad as it had been (a backlog of > 1000 queries and 6+ hour runtimes), more work will still be needed to scale and / or move some of the heavier queries to a dedicated cluster.

Expanding on that: while scaling may help (hide?) the immediate issues we're seeing (and huge thanks to both :mdoglio and :bbermes for their work on this!), I feel like we're still flailing around in the dark as to how to deal with this as usage increases. We should definitely look into instrumenting the Presto clusters via JMX to get a better idea of what's going on, and set up some sort of alerting so we know when things start to go south (see the sketch at the end of this comment), but that alone doesn't solve the problems we're running into.

I'm planning on reaching out to the Presto team to see if any of them have time to talk, so if anyone is interested in being involved in that conversation, please let me know.
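
As a possible starting point for that instrumentation, Presto ships a jmx connector that can be queried like any other catalog, so a first pass at monitoring/alerting could be as small as the sketch below. The MBean and column names are written from memory, the hostname and threshold are placeholders, and this is a sketch of the idea rather than what the follow-up bug will actually implement:

    import prestodb  # presto-python-client

    QUEUE_ALERT_THRESHOLD = 100  # hypothetical threshold

    conn = prestodb.dbapi.connect(
        host="presto-coordinator.example.internal",  # placeholder host
        port=8080, user="monitor", catalog="jmx", schema="current",
    )
    cur = conn.cursor()
    # The jmx connector exposes each MBean as a table; QueryManager carries
    # per-node counts of running and queued queries.
    cur.execute("""
        SELECT node, runningqueries, queuedqueries
        FROM jmx.current."com.facebook.presto.execution:name=QueryManager"
    """)
    for node, running, queued in cur.fetchall():
        if queued > QUEUE_ALERT_THRESHOLD:
            # Wire this up to whatever alerting we end up choosing.
            print("ALERT: %s has %d queued queries (%d running)"
                  % (node, queued, running))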
I'll create a new bug for adding the JMX instrumentation, but after splitting the clusters we're no longer seeing the failures we were seeing before.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard