Closed Bug 1319086 Opened 8 years ago Closed 7 years ago

Presto is basically unusable on European mornings (error 65537)

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: robotblake)

References

Details

(Whiteboard: [SvcOps])

Alessio Placitelli [:Dexter]

Reporter

Description

•

8 years ago

It looks like every new day has some new interesting re:dash failure waiting for me :-) I was unable to access re:dash for the whole day, and I heard other people are having the same issues. Here's the output for the problem:

> Error running query: {"errorCode":65537,"message":"Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes. (getting task status http://172.31.22.226:8889/v1/task/20161121_152736_03251_9rh2e.1.4 - 20 failures, time since last success 122.81s)","errorType":"INTERNAL_ERROR","failureInfo":{"type":"com.facebook.presto.spi.PrestoException","message":"Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes. (getting task status http://172.31.22.226:8889/v1/task/20161121_152736_03251_9rh2e.1.4 - 20 failures, time since last success 122.81s)","suppressed":[{"type":"com.facebook.presto.server.remotetask.SimpleHttpResponseHandler.ServiceUnavailableException","message":"Server returned SERVICE_UNAVAILABLE: http://172.31.22.226:8889/v1/task/20161121_152736_03251_9rh2e.1.4/status","suppressed":[],"stack":["com.facebook.presto.server.remotetask.SimpleHttpResponseHandler.onSuccess(SimpleHttpResponseHandler.java:52)","com.facebook.presto.server.remotetask.SimpleHttpResponseHandler.onSuccess(SimpleHttpResponseHandler.java:27)","com.google.common.util.concurrent.Futures$6.run(Futures.java:1319)","io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:77)","java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)","java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)","java.lang.Thread.run(Thread.java:745)"]},{"type":"com.facebook.presto.server.remotetask.SimpleHttpResponseHandler.ServiceUnavailableException","message":"Server returned SERVICE_UNAVAILABLE: http://172.31.22.226:8889/v1/task/20161121_152736_03251_9rh2e.1.4/status","suppressed":[],"stack":
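Since the error text itself suggests retrying after a few minutes, a client-side workaround could look roughly like the sketch below. It is only an illustration: `QueryError`, `run_query`, and the delay values are hypothetical names and numbers, not part of re:dash or Presto, and the only detail taken from this report is the `errorCode` value 65537.

```python
import json
import time

# errorCode taken from the error output quoted in this report.
TOO_MANY_WORKER_ERRORS = 65537


class QueryError(Exception):
    """Hypothetical exception carrying the raw error text of a failed query."""


def is_transient(error_text):
    """Return True if the error text is a JSON payload with errorCode 65537."""
    prefix = "Error running query: "
    if error_text.startswith(prefix):
        error_text = error_text[len(prefix):]
    try:
        payload = json.loads(error_text)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("errorCode") == TOO_MANY_WORKER_ERRORS


def run_with_retry(run_query, attempts=3, base_delay=60.0, sleep=time.sleep):
    """Call run_query(), retrying with exponential backoff on the transient error.

    run_query is a hypothetical callable that raises QueryError on failure.
    """
    for attempt in range(attempts):
        try:
            return run_query()
        except QueryError as exc:
            last_attempt = attempt == attempts - 1
            if last_attempt or not is_transient(str(exc)):
                raise
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
```

This only hides the symptom for individual queries; it does nothing about the worker nodes becoming unavailable in the first place.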

Alessio Placitelli [:Dexter]

Reporter

Comment 1

•

8 years ago

Blake, is there anything we can do to make redash more resilient/reliable during European working hours? :'( This is starting to drive me crazy.

Whiteboard: [SvcOps]

Roberto Agostino Vitillo (:rvitillo)

Updated

•

8 years ago

Severity: normal → blocker

Roberto Agostino Vitillo (:rvitillo)

Updated

•

8 years ago

Summary: Redash is basically unusable on European mornings (error 65537) → Presto is basically unusable on European mornings (error 65537)

Roberto Agostino Vitillo (:rvitillo)

Updated

•

8 years ago

Blocks: 1255751

Thomas Huelbert

Updated

•

8 years ago

Assignee: nobody → bimsland

Points: --- → 3

Priority: -- → P1

Mauro Doglio [:mdoglio]

Comment 2

•

8 years ago

I'm getting this error today (https://pastebin.mozilla.org/8956710). Can we please prioritize this bug?

Blake Imsland [:robotblake]

Assignee

Comment 4

•

8 years ago

Just wanted to give a long overdue update. We've worked on a plan for Presto over the next quarter that involves moving important and/or heavy queries to Athena and moving our Presto cluster off of EMR. As part of that process we'll be hooking up proper monitoring and instrumentation so we can proactively resolve issues and get better insight into root causes. If you've got any questions, please feel free to send me a message via email or on IRC.

Jason Thomas [:jason]

Updated

•

8 years ago

Severity: blocker → normal

Mark Reid [:mreid]

Updated

•

8 years ago

Component: Metrics: Pipeline → Presto

Product: Cloud Services → Data Platform and Tools

Blake Imsland [:robotblake]

Assignee

Comment 5

•

7 years ago

We've moved Presto off of EMR and it's proven to be more stable. As part of that process we also now have proactive monitoring that alerts us to errors and node failures. This was always sort of a meta bug, so I'm going to resolve it, and we can open new bugs if there end up being additional failures or stability issues.
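As a rough illustration of what alerting on node failures could key on, the sketch below pulls the worker address out of error messages shaped like the one in the description and counts failures per node. It is not the monitoring that was actually deployed; the pattern, function names, and threshold are assumptions made for this example.

```python
import re
from collections import Counter

# Pattern modelled on the message text quoted in the description of this bug.
TASK_STATUS = re.compile(
    r"getting task status http://(?P<node>[\d.]+):(?P<port>\d+)/v1/task/\S+"
    r" - (?P<failures>\d+) failures, time since last success (?P<since>[\d.]+)s"
)


def failing_nodes(messages):
    """Count how many error messages name each worker node address."""
    counts = Counter()
    for message in messages:
        match = TASK_STATUS.search(message)
        if match:
            counts[match.group("node")] += 1
    return counts


def nodes_to_alert(messages, threshold=3):
    """Return the nodes named in at least `threshold` messages (hypothetical rule)."""
    return sorted(
        node for node, count in failing_nodes(messages).items() if count >= threshold
    )
```

Grouping by node address is one way to tell a single unhealthy worker apart from a cluster-wide load problem.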

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

5 years ago

Product: Data Platform and Tools → Data Platform and Tools Graveyard


Categories

(Data Platform and Tools Graveyard :: Presto, defect, P1)
