
Status

Product: Cloud Services
Component: Metrics: Pipeline
Priority: P1
Severity: critical
Status: RESOLVED FIXED
Reported: a year ago
Modified: a year ago

People

(Reporter: rvitillo, Assigned: robotblake)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

User Story

Long-running queries (mostly over Android datasets) are thrashing our small Presto cluster, causing every query to run extremely slowly and defeating the main purpose of the service: getting answers from data as quickly as possible. The size of the cluster has to be increased, and the Fennec team should look into optimizing their queries to reduce running time.
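For illustration, here is a minimal sketch of how the long-running queries could be spotted from Presto's system.runtime.queries table, assuming the presto-python-client package is available; the coordinator host, user, and 30-minute cutoff are placeholders, not the actual pipeline configuration.

```python
# Minimal sketch: flag queries that have been running on the Presto cluster
# for a long time, using system.runtime.queries. Host, port, user, and the
# 30-minute cutoff are placeholder assumptions, not the real configuration.
import prestodb  # pip install presto-python-client

conn = prestodb.dbapi.connect(
    host="presto.example.internal",  # hypothetical coordinator host
    port=8080,
    user="metrics",
    catalog="system",
    schema="runtime",
)
cur = conn.cursor()
cur.execute(
    """
    SELECT query_id, "user",
           date_diff('minute', created, now()) AS minutes_running,
           query
    FROM system.runtime.queries
    WHERE state = 'RUNNING'
      AND created < now() - INTERVAL '30' MINUTE
    ORDER BY created
    """
)
for query_id, user, minutes_running, sql in cur.fetchall():
    print(f"{query_id} ({user}) running for {minutes_running} min: {sql[:80]}")
```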
(Reporter)

Updated

a year ago
Severity: normal → critical
(Reporter)

Updated

a year ago
Flags: needinfo?(bimsland)

Updated

a year ago
Assignee: nobody → bimsland
Points: --- → 1
Comment 1

a year ago
:robotblake spun up a large cluster yesterday, which alleviated some issues, but unfortunately not all of them. We've cut back to the old cluster and disabled scheduled queries temporarily while we continue to investigate. The number of scheduled queries has been increasing (unused queries are probably not being removed), and many are scheduled to run at the same time, which can lead to cluster overloading. For now, ad-hoc queries should work.
Comment 2

a year ago
(In reply to Wesley Dawson [:whd] from comment #1)
> :robotblake spun up a large cluster yesterday, which alleviated some issues,
> but unfortunately not all of them. We've cut back to the old cluster and
> disabled scheduled queries temporarily while we continue to investigate. The
> number of scheduled queries has been increasing (unused queries are probably
> not being removed), and many are scheduled to run at the same time, which can
> lead to cluster overloading. For now, ad-hoc queries should work.

What about having a separate cluster for scheduled queries and one for interactive ones?
(Assignee)

Comment 3

a year ago
We're going to spin up a second cluster for scheduled queries.
Points: 1 → 2
Flags: needinfo?(bimsland)
Priority: -- → P1
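As a rough illustration of the split described in comment 3, the sketch below routes a query to one of two coordinators depending on whether it comes from a scheduled job or an interactive user; both hostnames are hypothetical and the source tag value is arbitrary.

```python
# Sketch of the two-cluster split from comment 3: scheduled jobs go to one
# Presto coordinator, ad-hoc/interactive queries to another, so a backlog of
# scheduled work cannot starve interactive users. Both hostnames are made up.
import prestodb

COORDINATORS = {
    "scheduled": "presto-scheduled.example.internal",  # hypothetical host
    "adhoc": "presto-adhoc.example.internal",          # hypothetical host
}

def run_query(sql, *, scheduled=False, user="metrics"):
    """Execute `sql` on the cluster matching the workload type."""
    host = COORDINATORS["scheduled" if scheduled else "adhoc"]
    conn = prestodb.dbapi.connect(
        host=host,
        port=8080,
        user=user,
        catalog="hive",
        schema="default",
        # Tag the workload so it is visible in system.runtime.queries.source.
        source="scheduled-job" if scheduled else "adhoc",
    )
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()

# An Airflow/cron job would call run_query(..., scheduled=True);
# interactive users default to the ad-hoc cluster.
print(run_query("SELECT 1"))
```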
Comment 4

a year ago
To mitigate these issues from the other side of the fence, :bbermes is going to do some cleanup on the Fennec scheduled queries. If that isn't enough, we can try sampling the datasets, as we do with the longitudinal dataset.
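On the sampling idea: Presto supports TABLESAMPLE, so a heavy scheduled query could scan only a fraction of the rows. A minimal sketch follows; the table name, column names, and the 1% rate are placeholders, not the real datasets.

```python
# Sketch of the sampling idea from comment 4: let heavy scheduled queries scan
# only a fraction of the data via Presto's TABLESAMPLE. The table name,
# column names, and the 1% sampling rate are placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.internal",  # hypothetical coordinator host
    port=8080,
    user="metrics",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute(
    """
    SELECT submission_date, count(*) AS pings
    FROM android_events TABLESAMPLE BERNOULLI (1)  -- scan roughly 1% of rows
    GROUP BY submission_date
    ORDER BY submission_date
    """
)
print(cur.fetchall())
```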
Comment 5

a year ago
It looks like what :bbermes did improved the situation. :robotblake can you confirm the query backlog is down to a normal size?
(Assignee)

Comment 6

a year ago
(In reply to Mauro Doglio [:mdoglio] from comment #5)
> It looks like what :bbermes did improved the situation. :robotblake can you
> confirm the query backlog is down to a normal size?

It does look like things are slightly better, but they're not completely solved. We're still seeing some "insufficient resources" failures, and we're backed up by an average of ~80-100 queries. While we're not backing up as badly as before (> 1000 queries in the backlog / 6+ hour runtimes), some more work will still need to be done to scale and/or move some of the heavier queries to a dedicated cluster.

Expanding on that: while scaling may help (or just hide) the immediate issues we're seeing (and huge thanks to both :mdoglio and :bbermes for their work on this!), I feel like we're still flailing around in the dark as to how to deal with this as usage increases. We should definitely look into instrumenting the Presto clusters via JMX to get a better idea of what's going on, and set up some sort of alerting so we know when things start to go south, but that by itself doesn't solve the problems we're running into.

I'm planning on reaching out to the Presto team to see if any of them have time to talk so if anyone is interested in being involved in that conversation please let me know.
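For the JMX point: Presto ships a JMX connector that, when enabled as a catalog (conventionally named jmx), exposes MBeans as tables under jmx.current, so the same SQL path could feed basic monitoring and alerting. The sketch below assumes that catalog is enabled; the host is a placeholder.

```python
# Sketch of the JMX-instrumentation idea: if Presto's JMX connector is enabled
# as the "jmx" catalog, coordinator/worker MBeans can be read with plain SQL
# from the jmx.current schema. The host below is a placeholder.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.internal",  # hypothetical coordinator host
    port=8080,
    user="metrics",
    catalog="jmx",
    schema="current",
)
cur = conn.cursor()
# JVM runtime info per node; other MBeans (memory, GC, Presto query stats)
# are exposed the same way as jmx.current."<mbean object name>" tables.
cur.execute('SELECT node, vmname, vmversion FROM jmx.current."java.lang:type=runtime"')
for node, vmname, vmversion in cur.fetchall():
    print(node, vmname, vmversion)
```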
(Assignee)

Comment 7

a year ago
I'll create a new bug for adding the JMX instrumentation, but after splitting the clusters we're no longer seeing the failures we were hitting before.
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED