Closed Bug 1313345 Opened 8 years ago Closed 8 years ago

Getting Redash errorCode 65544 on European mornings

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect)

Type: defect
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RT, Assigned: robotblake)

References

Details

(Whiteboard: [SvcOps])

This is the second time I've hit the following issue when running Redash queries in the European morning (when I hit it, re-running the query gives me the same error). I got the error below when running https://sql.telemetry.mozilla.org/queries/1451/source#2575 at 9:40 am CEST (the same query now runs fine at 2:50 pm CEST):

"Error running query: {"errorCode":65544,"message":"Could not communicate with the remote task. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes.","errorType":"INTERNAL_ERROR","failureInfo":{"type":"com.facebook.presto.spi.PrestoException","message":"Could not communicate with the remote task. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes.","suppressed":[],"stack":["com.facebook.presto.server.remotetask.ContinuousTaskStatusFetcher.updateTaskStatus(ContinuousTaskStatusFetcher.java:234)","com.facebook.presto.server.remotetask.ContinuousTaskStatusFetcher.success(ContinuousTaskStatusFetcher.java:168)","com.facebook.presto.server.remotetask.ContinuousTaskStatusFetcher.success(ContinuousTaskStatusFetcher.java:52)","com.facebook.presto.server.remotetask.SimpleHttpResponseHandler.onSuccess(SimpleHttpResponseHandler.java:49)","com.facebook.presto.server.remotetask.SimpleHttpResponseHandler.onSuccess(SimpleHttpResponseHandler.java:27)","com.google.common.util.concurrent.Futures$6.run(Futures.java:1319)","io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:77)","java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)","java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)","java.lang.Thread.run(Thread.java:745)"]},"errorName":"REMOTE_TASK_MISMATCH"}
Flags: needinfo?(bimsland)
Whiteboard: [SvcOps]
This appears to be down to timing and a potential memory leak in the Hive + Parquet reader that Presto uses.  

Regarding timing, it looks like we've got a ton of scheduled queries running right around the time these issues occur. I don't have a good solution for this other than suggesting we spend some time auditing our scheduled queries in the near future.
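
A rough sketch of that audit, assuming Redash's standard REST API (a paginated GET /api/queries whose entries carry a `schedule` field) and a user API key; the endpoint shape and field names should be checked against the version running on sql.telemetry.mozilla.org.

```python
import requests

REDASH_URL = "https://sql.telemetry.mozilla.org"
API_KEY = "..."  # placeholder: a Redash user API key with access to the query list

def scheduled_queries():
    """Yield (id, schedule, name) for every query that has a schedule set."""
    page, page_size = 1, 100
    while True:
        resp = requests.get(
            REDASH_URL + "/api/queries",
            params={"page": page, "page_size": page_size},
            headers={"Authorization": "Key " + API_KEY},
        )
        resp.raise_for_status()
        body = resp.json()
        for query in body.get("results", []):
            if query.get("schedule"):
                yield query["id"], query["schedule"], query["name"]
        if page * page_size >= body.get("count", 0):
            break
        page += 1

# Dump everything that runs on a schedule, to see what piles up in the morning window.
for qid, schedule, name in scheduled_queries():
    print(qid, schedule, name)
```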

Regarding the memory leak, while we're not experiencing the complete meltdowns we ran into a couple of months ago, it can still take 10-20 minutes for the cluster to recover, which is really not ideal. There are fixes in newer releases of Presto (0.152 in particular; we're running 0.150) that may keep this issue from occurring, so we should look at getting that upgrade prioritized in the next couple of sprints.
Flags: needinfo?(bimsland)
Presto has been upgraded to 0.152.3, which should help with these issues. Feel free to reopen with additional data if it occurs again.
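
One way to confirm the rollout, assuming Presto's HTTP query protocol (POST the SQL to /v1/statement, then follow nextUri) and the built-in system.runtime.nodes table; the coordinator URL below is a placeholder, not the real cluster address.

```python
import requests

COORDINATOR = "http://localhost:8080"  # placeholder coordinator URL

def presto_query(sql, user="version-check"):
    """Submit a statement to the coordinator and collect all result rows."""
    resp = requests.post(COORDINATOR + "/v1/statement", data=sql,
                         headers={"X-Presto-User": user})
    resp.raise_for_status()
    payload = resp.json()
    rows = []
    while True:
        rows.extend(payload.get("data", []))
        next_uri = payload.get("nextUri")
        if not next_uri:
            return rows
        payload = requests.get(next_uri).json()

# Every node should report node_version 0.152.3 after the upgrade.
for row in presto_query(
        "SELECT node_id, node_version, coordinator, state FROM system.runtime.nodes"):
    print(row)
```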
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
I am getting this again, several times in a row, when running a query (https://sql.telemetry.mozilla.org/queries/1487/source#2628) this morning (within the last hour):

Error running query: {"errorCode":65540,"message":"Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes. (http://172.31.30.156:8889/v1/task/20161107_110311_02421_ty2ep.2.23/results/16/0 - requests failed for 60.01s)","errorType":"INTERNAL_ERROR","failureInfo":{"cause":{"cause":{"type":"java.util.concurrent.TimeoutException","message":"Idle timeout 30000 ms","suppressed":[],"stack":["org.eclipse.jetty.client.http.HttpConnectionOverHTTP.onIdleExpired(HttpConnectionOverHTTP.java:104)","org.eclipse.jetty.io.AbstractEndPoint.onIdleExpired(AbstractEndPoint.java:162)","org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:166)","org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)","java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)","java.util.concurrent.FutureTask.run(FutureTask.java:266)","java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)","java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)","java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)","java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)","java.lang.Thread.run(Thread.java:745)"]},"message":"java.util.concurrent.TimeoutException: Idle timeout 30000 ms","suppressed":[],"stack":["com.google.common.base.Throwables.propagate(Throwables.java:160)","io.airlift.http.client.ResponseHandlerUtils.propagate(ResponseHandlerUtils.java:22)","com.facebook.presto.operator.HttpPageBufferClient$PageResponseHandler.handleException(HttpPageBufferClient.java:540)","com.facebook.presto.operator.HttpPageBufferClient$PageResponseHandler.handleException(HttpPageBufferClient.java:527)","io.airlift.http.client.jetty.JettyHttpClient$JettyResponseFuture.failed(JettyHttpClient.java:870)","io.airlift.http.client.jetty.JettyHttpClient$BufferingResponseListener.onComplete(JettyHttpClient.java:1104)","org.eclipse.jetty.client.ResponseNotifier.notifyComplete(ResponseNotifier.java:193)","org.eclipse.jetty.client.ResponseNotifier.notifyComplete(ResponseNotifier.java:185)","org.eclipse.jetty.client.HttpReceiver.terminateResponse(HttpReceiver.java:457)","org.eclipse.jetty.client.HttpReceiver.abort(HttpReceiver.java:528)","org.eclipse.jetty.client.HttpChannel.abortResponse(HttpChannel.java:129)","org.eclipse.jetty.client.HttpChannel.abort(HttpChannel.java:122)","org.eclipse.jetty.client.HttpExchange.abort(HttpExchange.java:257)","org.eclipse.jetty.client.HttpConversation.abort(HttpConversation.java:141)","org.eclipse.jetty.client.HttpRequest.abort(HttpRequest.java:704)","org.eclipse.jetty.client.http.HttpConnectionOverHTTP.abort(HttpConnectionOverHTTP.java:157)","org.eclipse.jetty.client.http.HttpConnectionOverHTTP.close(HttpConnectionOverHTTP.java:143)","org.eclipse.jetty.client.http.HttpConnectionOverHTTP.onIdleExpired(HttpConnectionOverHTTP.java:104)","org.eclipse.jetty.io.AbstractEndPoint.onIdleExpired(AbstractEndPoint.java:162)","org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:166)","org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)","java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)","java.util.concurrent.FutureTask.run(FutureTask.java:266)","java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)","java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)","java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)","java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)","java.lang.Thread.run(Thread.java:745)"],"type":"java.lang.RuntimeException"},"message":"Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes. (http://172.31.30.156:8889/v1/task/20161107_110311_02421_ty2ep.2.23/results/16/0 - requests failed for 60.01s)","suppressed":[],"stack":["com.facebook.presto.operator.HttpPageBufferClient$1.onFailure(HttpPageBufferClient.java:375)","com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)","java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)","java.util.concurrent.FutureTask.run(FutureTask.java:266)","java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)","java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)","java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)","java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)","java.lang.Thread.run(Thread.java:745)"],"type":"com.facebook.presto.operator.PageTransportTimeoutException"},"errorName":"PAGE_TRANSPORT_TIMEOUT"}

Is this the same issue or do I need to open another bug?
Flags: needinfo?(bimsland)
Looks like it's a related issue. We need to get some monitoring / stats in place so we can figure out what's triggering it; I'll work on getting bugs filed for those. :(
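
A possible starting point for that monitoring, assuming the coordinator exposes the same /v1/cluster JSON the Presto web UI polls; the field names may differ between Presto versions, and the URL below is a placeholder.

```python
import time
import requests

COORDINATOR = "http://localhost:8080"  # placeholder coordinator URL
FIELDS = ("runningQueries", "queuedQueries", "blockedQueries",
          "activeWorkers", "reservedMemory")

def sample():
    """Grab a snapshot of cluster-wide load counters from the coordinator."""
    stats = requests.get(COORDINATOR + "/v1/cluster").json()
    return {field: stats.get(field) for field in FIELDS}

if __name__ == "__main__":
    # One sample per minute; graphing these around the European-morning window
    # should show whether worker loss lines up with the scheduled-query pileup.
    while True:
        print(time.strftime("%Y-%m-%d %H:%M:%S"), sample())
        time.sleep(60)
```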
Flags: needinfo?(bimsland)
Severity: normal → blocker
Assignee: nobody → bimsland
Reopened as we received more complaints on Slack (#fx-metrics).
Product: Cloud Services → Cloud Services Graveyard