Closed Bug 1319086 Opened 8 years ago Closed 7 years ago

Presto is basically unusable on European mornings (error 65537)

Categories

(Data Platform and Tools Graveyard :: Presto, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: robotblake)

Details

(Whiteboard: [SvcOps])

It looks like every new day has some new interesting redash failure waiting for me :-) I was unable to access re:dash for the whole day, and I heard that other people are having the same issue.

Here's the output for the problem:

> Error running query: {"errorCode":65537,"message":"Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes. (getting task status http://172.31.22.226:8889/v1/task/20161121_152736_03251_9rh2e.1.4 - 20 failures, time since last success 122.81s)","errorType":"INTERNAL_ERROR","failureInfo":{"type":"com.facebook.presto.spi.PrestoException","message":"Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes. (getting task status http://172.31.22.226:8889/v1/task/20161121_152736_03251_9rh2e.1.4 - 20 failures, time since last success 122.81s)","suppressed":[{"type":"com.facebook.presto.server.remotetask.SimpleHttpResponseHandler.ServiceUnavailableException","message":"Server returned SERVICE_UNAVAILABLE: http://172.31.22.226:8889/v1/task/20161121_152736_03251_9rh2e.1.4/status","suppressed":[],"stack":["com.facebook.presto.server.remotetask.SimpleHttpResponseHandler.onSuccess(SimpleHttpResponseHandler.java:52)","com.facebook.presto.server.remotetask.SimpleHttpResponseHandler.onSuccess(SimpleHttpResponseHandler.java:27)","com.google.common.util.concurrent.Futures$6.run(Futures.java:1319)","io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:77)","java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)","java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)","java.lang.Thread.run(Thread.java:745)"]},{"type":"com.facebook.presto.server.remotetask.SimpleHttpResponseHandler.ServiceUnavailableException","message":"Server returned SERVICE_UNAVAILABLE: http://172.31.22.226:8889/v1/task/20161121_152736_03251_9rh2e.1.4/status","suppressed":[],"stack":
Blake, is there anything we can do to make redash more resilient/reliable during European working hours? :'( This is starting to drive me crazy.
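Since the error message itself calls this "probably a transient issue" and suggests retrying in a few minutes, a client-side stopgap could be sketched as below. This is only a minimal sketch under assumptions: `QueryError`, `run_with_retry`, and the `run_query` callable are hypothetical names, not part of redash or Presto, and treating error code 65537 as retryable is inferred from the message above rather than from any documented contract.

```python
import time

# Error code taken from the report above; treating it as retryable is an assumption.
TRANSIENT_ERROR_CODES = {65537}


class QueryError(Exception):
    """Hypothetical exception carrying the errorCode field of a failed query."""

    def __init__(self, error_code, message):
        super().__init__(message)
        self.error_code = error_code


def run_with_retry(run_query, max_attempts=4, base_delay=60.0, sleep=time.sleep):
    """Call run_query(), retrying with exponential backoff on transient errors.

    run_query is a hypothetical zero-argument callable that returns the query
    result or raises QueryError. The delay doubles after each failed attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except QueryError as err:
            # Re-raise non-transient errors, or give up once attempts run out.
            if err.error_code not in TRANSIENT_ERROR_CODES or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Retrying only masks the symptom on the client side; it does nothing about the worker nodes returning SERVICE_UNAVAILABLE, which still needs a fix on the cluster.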
Whiteboard: [SvcOps]
Severity: normal → blocker
Summary: Redash is basically unusable on European mornings (error 65537) → Presto is basically unusable on European mornings (error 65537)
Assignee: nobody → bimsland
Points: --- → 3
Priority: -- → P1
I'm getting this error today: https://pastebin.mozilla.org/8956710
Can we please prioritize this bug?
Just wanted to give a long-overdue update. We've worked out a plan for Presto over the next quarter that involves moving important and/or heavy queries to Athena, as well as moving our cluster off of EMR. As part of that process we'll be hooking up proper monitoring and instrumentation so we can proactively resolve issues and get better insight into root causes. If you've got any questions, please feel free to send me a message via email or on IRC.
Severity: blocker → normal
Component: Metrics: Pipeline → Presto
Product: Cloud Services → Data Platform and Tools
We've moved Presto off of EMR and it's proven to be more stable. As part of that process we also now have proactive monitoring that alerts us to errors and node failures. This was always sort of a meta bug, so I'm going to resolve it, and we can open new bugs if there end up being additional failures or stability issues.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Data Platform and Tools → Data Platform and Tools Graveyard