Closed Bug 1121422 Opened 5 years ago Closed 4 years ago

Etherpad.mozilla.org slow, unresponsive

Categories

(Infrastructure & Operations :: IT-Managed Tools, task)

x86
macOS
task
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rwatson, Assigned: nmaul)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/306] )

User reported problems. Frequent disconnects. I'm getting frequent:
Service Unavailable

The service is temporarily unavailable. Please try again later.

Have reloaded etherpad and the issue still persists.
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/272]
etherpad1: 
at java.lang.Thread.run(Thread.java:745)
2015-01-14 11:52:21.486::WARN:  handle failed
java.lang.IllegalStateException
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:385)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Task checkForStalePads execution failed with non-200 response: 500
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/272]
restarted the service, had to kill some proc's but sprang back to life after that.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/273]
Just had the same issue now. Usul will add more details.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
got this stack trace on 3 when I logged in :

.scala:75)
        at net.appjet.bodylock.BodyLock$.runInContext(bodylock.scala:111)
        at net.appjet.bodylock.InnerExecutable.execute(bodylock.scala:75)
        at net.appjet.oui.execution$$anonfun$execute$4.apply(execution.scala:508)
        at net.appjet.oui.NoninheritedDynamicVariable.withValue(dynamicvar.scala:38)
        at net.appjet.oui.ExecutionContextUtils$.withContext(execution.scala:334)
        at net.appjet.oui.execution$.execute(execution.scala:504)
        at net.appjet.oui.execution$.runOutOfBand(execution.scala:631)
        at net.appjet.ajstdlib.execution$.runTask(ajstdlib.scala:140)
        at net.appjet.ajstdlib.execution$TaskRunner.call(ajstdlib.scala:177)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
^C[2015-01-16 08:05:11.079-0800]: Shutting down...
[2015-01-16 08:05:14.080-0800]: ...done, running onshutdown.
[2015-01-16 08:05:14.398-0800]: ...done, stopping server.
[2015-01-16 08:05:14.507-0800]: ...done, flushing logs.
[2015-01-16 08:05:14.508-0800]: ...done.n
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/273]
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/306]
The stack trace doesn't indicate that it's "broken" in the sense we care about here (etherpad being unresponsive)... that can happen throughout normal operation, and is rarely related to why etherpad has stopped working altogether.

The real thing to look for is which types of pads aren't working... if team site pads are broken, then etherpad1.webapp.phx1 needs etherpad restarted. If normal public pads are broken, then it's etherpad3.

I will use this bug to work on automation to restart etherpad periodically to try and avoid this. It will still mean a downtime while etherpad restarts, but at least it'll come right back and we're much less likely to get a bug, IRC poke, or nagios alert about it.

I'll also talk to :gcox and :cknowles about more RAM in the VMs... that can prolong the inevitable, but not entirely prevent it.


In the meantime, if/when it breaks, please report here as to which node broke, when, and (if known) how we detected the breakage (user report, nagios, or whatever). Thank you!
Assignee: server-ops-webops → nmaul
etherpad1 is upgraded from 12GB RAM to 20GB. This *should* reduce the frequency of failures, though I suspect it will still fail eventually. I'm still planning on some sort of script that will kick it when it starts to go bad, but detecting that is somewhat difficult.
Had to restart the team pad today.
It's happening again at least for the ateam team etherpad. It's trying to reconnect all the time which makes it impossible to work with this site.
Severity: normal → critical
Kicked etherpad1.
Severity: critical → normal
Depends on: 1134037
Two things I have noticed here in the last couple of days when this problem started to happen for me:

1. I'm using my notebook in a docking station. So a WIFI and an ethernet connection were active, and caused a constant reconnection of the etherpad. Once I killed WIFI and only had the ethernet connection, all was working fine.

2. Today I simply had to restart Firefox to make the problem go away. It was open since Feb 19th and the machine was in hibernation mode a couple of times. Might be that there is something broken in Firefox on Linux in combination with etherpad at least. On OS X the problem didn't appear.
Bounced etherpad on etherpad1 for team pads not saving, again.

screen session contained (at the end:

Exception in thread "654803997@qtp-705990479-10071" net.appjet.ajstdlib.SocketManager$HandlerException: An error occurred while handling a request: 500 - You like apples? An error occurred in the error handler while handling an error. How do you like <i>them</i> apples?<br>
net.appjet.bodylock.JSRuntimeException: Error while executing: TypeError: Can't use instanceof on a non-object. (module etherpad/log.js#121)<br>
        at net.appjet.bodylock.InnerExecutable.execute(bodylock.scala:84)<br>  
        at net.appjet.oui.execution$$anonfun$execute$4.apply(execution.scala:517)<br>
        at net.appjet.oui.NoninheritedDynamicVariable.withValue(dynamicvar.scala:38)<br>
        at net.appjet.oui.ExecutionContextUtils$.withContext(execution.scala:334)<br>
        at net.appjet.oui.execution$.execute(execution.scala:504)<br>
        at net.appjet.ajstdlib.SocketManager$$anon$11$$anon$12.run(streaming.scala:89)<br>
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)<br>
Caused by: org.mozilla.javascript.EcmaError: TypeError: Can't use instanceof on a non-object. (module etherpad/log.js#121)<br>
        at org.mozilla.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3559)<br>
        at org.mozilla.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3537)<br>
        at org.mozilla.javascript.ScriptRuntime.typeError(ScriptRuntime.java:3565)<br>
        at org.mozilla.javascript.ScriptRuntime.typeError0(ScriptRuntime.java:3571)<br>
        at org.mozilla.javascript.ScriptRuntime.in(ScriptRuntime.java:2912)<br>
        at JS$module$etherpad$log$js$12._c8(module etherpad/log.js:121)<br>
        at JS$module$etherpad$log$js$12.call(module etherpad/log.js)<br>
        at org.mozilla.javascript.optimizer.OptRuntime.callName0(OptRuntime.java:108)<br>
        at JS$module$etherpad$log$js$12._c18(module etherpad/log.js:204)<br>
        at JS$module$etherpad$log$js$12.call(module etherpad/log.js)<br>
        at org.mozilla.javascript.optimizer.OptRuntime.callName(OptRuntime.java:97)<br>
        at JS$module$main$js$8._c6(module main.js:131)<br>
        at JS$module$main$js$8.call(module main.js)<br>
        at org.mozilla.javascript.optimizer.OptRuntime.call1(OptRuntime.java:66)<br>
        at JS$module$onerror$js$354._c0(module onerror.js:24)<br>
        at JS$module$onerror$js$354.call(module onerror.js)<br>
        at org.mozilla.javascript.ContextFactory.doTopCall(ContextFactory.java:401)<br>
        at org.mozilla.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3004)<br>
        at JS$module$onerror$js$354.call(module onerror.js)<br>
        at JS$module$onerror$js$354.exec(module onerror.js)<br>
        at net.appjet.bodylock.InnerExecutable$$anonfun$execute$1.apply(bodylock.scala:76)<br>
        at net.appjet.bodylock.InnerExecutable$$anonfun$execute$1.apply(bodylock.scala:75)<br>
        at net.appjet.bodylock.BodyLock$.runInContext(bodylock.scala:111)<br>
        at net.appjet.bodylock.InnerExecutable.execute(bodylock.scala:75)<br>
        ... 6 more<br>
Caused by:<br>
net.appjet.bodylock.JSRuntimeException: Error while executing: Communications link failure<br>
<br>
The last packet successfully received from the server was 426,602 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.<br>
        at net.appjet.bodylock.InnerExecutable.execute(bodylock.scala:84)<br>
        at net.appjet.oui.execution$$anonfun$execute$4.apply(execution.scala:508)<br>
        at net.appjet.oui.NoninheritedDynamicVariable.withValue(dynamicvar.scala:38)<br>
        at net.appjet.oui.ExecutionContextUtils$.withContext(execution.scala:334)<br>
        at net.appjet.oui.execution$.execute(execution.scala:504)<br>
        at net.appjet.ajstdlib.SocketManager$$anon$11$$anon$12.run(streaming.scala:89)<br>
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)<br>
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure<br>
<br>
The last packet successfully received from the server was 426,602 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.<br>
        at sun.reflect.GeneratedConstructorAccessor69.newInstance(Unknown Source)<br>
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)<br>
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)<br>
        at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)<br>
        at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1116)<br>
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3090)<br>
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2979)<br>
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3520)<br>
        at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1990)<br>
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2151)<br>
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2619)<br>
        at com.mysql.jdbc.ConnectionImpl.setAutoCommit(ConnectionImpl.java:4997)<br>
        at com.mchange.v2.c3p0.impl.NewProxyConnection.setAutoCommit(NewProxyConnection.java:881)<br>
        at net.appjet.ajstdlib.SQLBase.getConnectionFromPool(sqlbase.scala:74)<br>
        at net.appjet.ajstdlib.SQLBase.withConnection(sqlbase.scala:89)<br>
        at net.appjet.ajstdlib.SQLBase.inTransaction(sqlbase.scala:105)<br>
        at sun.reflect.GeneratedMethodAccessor71.invoke(Unknown Source)<br>
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)<br>
        at java.lang.reflect.Method.invoke(Method.java:606)<br>
        at org.mozilla.javascript.MemberBox.invoke(MemberBox.java:160)<br>
        at org.mozilla.javascript.NativeJavaMethod.call(NativeJavaMethod.java:243)<br>
        at org.mozilla.javascript.optimizer.OptRuntime.call1(OptRuntime.java:66)<br>
        at JS$module$sqlbase$sqlcommon$js$16._c6(module sqlbase/sqlcommon.js:64)<br>
        at JS$module$sqlbase$sqlcommon$js$16.call(module sqlbase/sqlcommon.js)<br>
        at org.mozilla.javascript.optimizer.OptRuntime.call1(OptRuntime.java:66)<br>
        at JS$module$etherpad$pad$model$js$37._c6(module etherpad/pad/model.js:109)<br>
        at JS$module$etherpad$pad$model$js$37.call(module etherpad/pad/model.js)<br>
        at org.mozilla.javascript.optimizer.OptRuntime.call0(OptRuntime.java:57)<br>
        at JS$module$sync$js$15._c7(module sync.js:75)<br>
        at JS$module$sync$js$15.call(module sync.js)<br>
        at org.mozilla.javascript.optimizer.OptRuntime.call2(OptRuntime.java:76)<br>
        at JS$module$etherpad$pad$model$js$37._c48(module etherpad/pad/model.js:503)<br>
        at JS$module$etherpad$pad$model$js$37.call(module etherpad/pad/model.js)<br>
        at org.mozilla.javascript.optimizer.OptRuntime.callName(OptRuntime.java:97)<br>
        at JS$module$etherpad$pad$model$js$37._c5(module etherpad/pad/model.js:108)<br>
        at JS$module$etherpad$pad$model$js$37.call(module etherpad/pad/model.js)<br>
        at org.mozilla.javascript.optimizer.OptRuntime.call2(OptRuntime.java:76)<br>
        at JS$module$etherpad$collab$collab$server$js$34._c59(module etherpad/collab/collab_server.js:580)<br>
        at JS$module$etherpad$collab$collab$server$js$34.call(module etherpad/collab/collab_server.js)<br>
        at org.mozilla.javascript.optimizer.OptRuntime.call1(OptRuntime.java:66)<br>
        at JS$module$etherpad$collab$collabroom$server$js$71._c29(module etherpad/collab/collabroom_server.js:264)<br>
        at JS$module$etherpad$collab$collabroom$server$js$71.call(module etherpad/collab/collabroom_server.js)<br>
        at org.mozilla.javascript.optimizer.OptRuntime.callName(OptRuntime.java:97)<br>
        at JS$module$etherpad$collab$collabroom$server$js$71._c36(module etherpad/collab/collabroom_server.js:327)<br>
        at JS$module$etherpad$collab$collabroom$server$js$71.call(module etherpad/collab/collabroom_server.js)<br>
        at org.mozilla.javascript.optimizer.OptRuntime.callN(OptRuntime.java:86)<br>
        at JS$module$main$js$8._c16(module main.js:185)<br>
        at JS$module$main$js$8.call(module main.js)<br>
        at org.mozilla.javascript.optimizer.OptRuntime.call2(OptRuntime.java:76)<br>
        at JS$module$oncomet$js$278._c0(module oncomet.js:31)<br>
        at JS$module$oncomet$js$278.call(module oncomet.js)<br>
        at org.mozilla.javascript.ContextFactory.doTopCall(ContextFactory.java:4
01)<br>
        at org.mozilla.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3004)<br>
        at JS$module$oncomet$js$278.call(module oncomet.js)<br>
        at JS$module$oncomet$js$278.exec(module oncomet.js)<br>
        at net.appjet.bodylock.InnerExecutable$$anonfun$execute$1.apply(bodylock.scala:76)<br>
        at net.appjet.bodylock.InnerExecutable$$anonfun$execute$1.apply(bodylock.scala:75)<br>
        at net.appjet.bodylock.BodyLock$.runInContext(bodylock.scala:111)<br>
        at net.appjet.bodylock.InnerExecutable.execute(bodylock.scala:75)<br>
        ... 6 more<br>
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.<br>
        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2540)<br>
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2990)<br>
        ... 59 more
        at net.appjet.ajstdlib.SocketManager$$anon$11$$anon$12$$anonfun$run$1.apply(streaming.scala:92)
        at net.appjet.ajstdlib.SocketManager$$anon$11$$anon$12$$anonfun$run$1.apply(streaming.scala:91)
        at net.appjet.oui.execution$$anonfun$execute$4.apply(execution.scala:534)
        at net.appjet.oui.NoninheritedDynamicVariable.withValue(dynamicvar.scala:38)
        at net.appjet.oui.ExecutionContextUtils$.withContext(execution.scala:334)
        at net.appjet.oui.execution$.execute(execution.scala:504)
        at net.appjet.ajstdlib.SocketManager$$anon$11$$anon$12.run(streaming.scala:89)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Task checkForStalePads execution failed with non-200 response: 500
Task checkForStalePads execution failed with non-200 response: 500
Task writePad execution failed with non-200 response: 500
Task checkForStalePads execution failed with non-200 response: 500
com.sun.star.connection.NoConnectException: java.net.ConnectException: Connection refused
        at com.sun.star.lib.connections.socket.socketConnector.connect(socketConnector.java:166)
        at com.sun.star.comp.connections.Connector.connect(Connector.java:141)
        at com.etherpad.openofficeservice.OpenOfficeFileConverter.convertFile(importexport.scala:66)
        at com.etherpad.openofficeservice.OpenOfficeService$.convertFile(importexport.scala:194)
        at com.etherpad.openofficeservice.OpenOfficeService.convertFile(importexport.scala)
        at sun.reflect.GeneratedMethodAccessor210.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.mozilla.javascript.MemberBox.invoke(MemberBox.java:160)
        at org.mozilla.javascript.NativeJavaMethod.call(NativeJavaMethod.java:243)
        at org.mozilla.javascript.optimizer.OptRuntime.callN(OptRuntime.java:86)
        at JS$module$etherpad$importexport$importexport$js$138._c9(module etherpad/importexport/importexport.js:148)
        at JS$module$etherpad$importexport$importexport$js$138.call(module etherpad/importexport/importexport.js)
        at org.mozilla.javascript.optimizer.OptRuntime.callName(OptRuntime.java:97)
        at JS$module$etherpad$importexport$importexport$js$138._c7(module etherpad/importexport/importexport.js:88)
        at JS$module$etherpad$importexport$importexport$js$138.call(module etherpad/importexport/importexport.js)
        at org.mozilla.javascript.optimizer.OptRuntime.callN(OptRuntime.java:86)
        at JS$module$main$js$8._c12(module main.js:165)
        at JS$module$main$js$8.call(module main.js)
        at org.mozilla.javascript.ScriptRuntime.applyOrCall(ScriptRuntime.java:2348)
        at org.mozilla.javascript.BaseFunction.execIdCall(BaseFunction.java:272)
        at org.mozilla.javascript.IdFunctionObject.call(IdFunctionObject.java:127)
        at org.mozilla.javascript.optimizer.OptRuntime.call2(OptRuntime.java:76)
        at JS$module$onscheduledtask$js$173._c1(module onscheduledtask.js:32)
        at JS$module$onscheduledtask$js$173.call(module onscheduledtask.js)
        at org.mozilla.javascript.optimizer.OptRuntime.call0(OptRuntime.java:57)
        at JS$module$onscheduledtask$js$173._c0(module onscheduledtask.js:17)
        at JS$module$onscheduledtask$js$173.call(module onscheduledtask.js)
        at org.mozilla.javascript.ContextFactory.doTopCall(ContextFactory.java:401)
        at org.mozilla.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3004)
        at JS$module$onscheduledtask$js$173.call(module onscheduledtask.js)
        at JS$module$onscheduledtask$js$173.exec(module onscheduledtask.js)
        at net.appjet.bodylock.InnerExecutable$$anonfun$execute$1.apply(bodylock.scala:76)
        at net.appjet.bodylock.InnerExecutable$$anonfun$execute$1.apply(bodylock.scala:75)
        at net.appjet.bodylock.BodyLock$.runInContext(bodylock.scala:111)
        at net.appjet.bodylock.InnerExecutable.execute(bodylock.scala:75)
        at net.appjet.oui.execution$$anonfun$execute$4.apply(execution.scala:508)
        at net.appjet.oui.NoninheritedDynamicVariable.withValue(dynamicvar.scala:38)
        at net.appjet.oui.ExecutionContextUtils$.withContext(execution.scala:334)
        at net.appjet.oui.execution$.execute(execution.scala:504)
        at net.appjet.oui.execution$.runOutOfBand(execution.scala:631)
        at net.appjet.ajstdlib.execution$.runTask(ajstdlib.scala:140)
        at net.appjet.ajstdlib.execution$TaskRunner.call(ajstdlib.scala:177)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
com.etherpad.openofficeservice.OOSException: Unknown exception occurred: java.net.ConnectException: Connection refused
        at com.etherpad.openofficeservice.OpenOfficeFileConverter.convertFile(importexport.scala:122)
        at com.etherpad.openofficeservice.OpenOfficeService$.convertFile(importexport.scala:194)
        at com.etherpad.openofficeservice.OpenOfficeService.convertFile(importexport.scala)
        at sun.reflect.GeneratedMethodAccessor210.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.mozilla.javascript.MemberBox.invoke(MemberBox.java:160)
        at org.mozilla.javascript.NativeJavaMethod.call(NativeJavaMethod.java:243)
        at org.mozilla.javascript.optimizer.OptRuntime.callN(OptRuntime.java:86)
        at JS$module$etherpad$importexport$importexport$js$138._c9(module etherpad/importexport/importexport.js:148)
        at JS$module$etherpad$importexport$importexport$js$138.call(module etherpad/importexport/importexport.js)
        at org.mozilla.javascript.optimizer.OptRuntime.callName(OptRuntime.java:97)
        at JS$module$etherpad$importexport$importexport$js$138._c7(module etherpad/importexport/importexport.js:88)
        at JS$module$etherpad$importexport$importexport$js$138.call(module etherpad/importexport/importexport.js)
        at org.mozilla.javascript.optimizer.OptRuntime.callN(OptRuntime.java:86)
        at JS$module$main$js$8._c12(module main.js:165)
        at JS$module$main$js$8.call(module main.js)
        at org.mozilla.javascript.ScriptRuntime.applyOrCall(ScriptRuntime.java:2348)
        at org.mozilla.javascript.BaseFunction.execIdCall(BaseFunction.java:272)
        at org.mozilla.javascript.IdFunctionObject.call(IdFunctionObject.java:127)
        at org.mozilla.javascript.optimizer.OptRuntime.call2(OptRuntime.java:76)
        at JS$module$onscheduledtask$js$173._c1(module onscheduledtask.js:32)
        at JS$module$onscheduledtask$js$173.call(module onscheduledtask.js)
        at org.mozilla.javascript.optimizer.OptRuntime.call0(OptRuntime.java:57)
        at JS$module$onscheduledtask$js$173._c0(module onscheduledtask.js:17)
        at JS$module$onscheduledtask$js$173.call(module onscheduledtask.js)
        at org.mozilla.javascript.ContextFactory.doTopCall(ContextFactory.java:401)
        at org.mozilla.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3004)
        at JS$module$onscheduledtask$js$173.call(module onscheduledtask.js)
        at JS$module$onscheduledtask$js$173.exec(module onscheduledtask.js)
        at net.appjet.bodylock.InnerExecutable$$anonfun$execute$1.apply(bodylock.scala:76)
        at net.appjet.bodylock.InnerExecutable$$anonfun$execute$1.apply(bodylock.scala:75)
        at net.appjet.bodylock.BodyLock$.runInContext(bodylock.scala:111)
        at net.appjet.bodylock.InnerExecutable.execute(bodylock.scala:75)
        at net.appjet.oui.execution$$anonfun$execute$4.apply(execution.scala:508)
        at net.appjet.oui.NoninheritedDynamicVariable.withValue(dynamicvar.scala:38)
        at net.appjet.oui.ExecutionContextUtils$.withContext(execution.scala:334)
        at net.appjet.oui.execution$.execute(execution.scala:504)
        at net.appjet.oui.execution$.runOutOfBand(execution.scala:631)
        at net.appjet.ajstdlib.execution$.runTask(ajstdlib.scala:140)
        at net.appjet.ajstdlib.execution$TaskRunner.call(ajstdlib.scala:177)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Task checkForStalePads execution failed with non-200 response: 500
Task checkForStalePads execution failed with non-200 response: 500
Task checkForStalePads execution failed with non-200 response: 500
2015-02-23 11:52:03.062::WARN:  header full: java.lang.NullPointerException
Task checkForStalePads execution failed with non-200 response: 500
Task proPadmetaFlushEdits execution failed with non-200 response: 500
Task proPadmetaFlushEdits execution failed with non-200 response: 500
Task checkForStalePads execution failed with non-200 response: 500
Task checkForStalePads execution failed with non-200 response: 500
Task checkForStalePads execution failed with non-200 response: 500
Restarted the team etherpad again because of pads failing to stay connected
Okay, I'll try to roll out the auto-restart stuff today or tomorrow.
Duplicate of this bug: 1105754
I think this can be closed now.
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
Just had to kill -9 and restart the team etherpad process, it hard hung.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Had to bounce the team etherpad again today, didn't need kill -9 this time at least.
Kicked the team etherpad again.
Should be better now. I made a change yesterday to the init script so it sleeps slightly between stop/start, which should fix a memory allocation error.
Status: REOPENED → RESOLVED
Closed: 5 years ago4 years ago
Resolution: --- → FIXED
WHen I've been kicking the team etherpad recently I've been issuing a stop, waiting for it to be gone completely and then issuing a start. Sadly I don't thin the memory allocation fix is going to change the issues I've been seeing.
I agree with Comment 20. The 20s sleep time in the init script is insufficient. If a restart is issued, the init scripts forks another instance without ensuring the older instance has been stopped/killed.
It doesn't sleep 20s. "stop" actually waits until the process goes away. It will wait up to 20s for that, and then give up and report a failure.

    for counter in {1..20}; do
        if pgrep -U etherpad java > /dev/null; then
            echo -n "."
            sleep 1
        else
            break
        fi
    done

Similarly, on startup it waits for the app to start responding before returning... up to 30s:

    for counter in {1..30}; do
            echo "GET / HTTP/1.0" | nc -w 1 localhost 9000 > /dev/null 2>&1 && success && break
            echo -n "."
            sleep 1
    done


Since it waits for the process to exit, there should never (theoretically) be a problem. However as it happens... sometimes there is. The process is gone but the kernel hasn't marked the memory freed as available (I guess). That's why the "restart" command now does:

  restart)
    stop
    STOP_RV=$?
    sleep 1
    start
    START_RV=$?
    RETVAL=$(( STOP_RV + START_RV ))

So, the change in comment 19 is more significant than it perhaps first appears.


etherpad1 seems to be pretty reliable- last time it didn't restart properly was March 17.
etherpad3 has trouble more often... bad startups on 3/24, 3/28, 4/02, 4/13, and 4/16.

Given that frequency, it will be a while before we know for sure if things are any better or not. The problem on 4/16 was after my change, but nothing since then. The next thing to try IMO would be to through a 'sync' in between stop and start, to make sure any buffers are flushed out... or just wait longer.
You need to log in before you can comment on or make changes to this bug.