Status

Infrastructure & Operations
WebOps: Other
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: ashish, Assigned: jakem)

Tracking

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/938] )

(Reporter)

Description

3 years ago
etherpad.mozilla.org went down on Apr 13 at 11:55 Pacific. Filing this bug for tracking.

Updated

3 years ago
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/938]
(Assignee)

Comment 1

3 years ago
It's up again, but keeping this bug open.

There's a bug with the way etherpad gets restarted. The init script needs to sleep between stop and start.

"Start" determines how much memory to give to java based on available memory (free + cached).
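For illustration, a minimal sketch of what such a calculation could look like. This is not the actual init script: the function names `available_mb` and `heap_mb` and the 75% default are assumptions made for the example. Only the "free + cached" input comes from the description above.

```shell
# Hypothetical helper: sum MemFree and Cached (both reported in kB) from a
# meminfo-style file and print the total in MB. Defaults to /proc/meminfo.
available_mb() {
    meminfo="${1:-/proc/meminfo}"
    awk '$1 == "MemFree:" || $1 == "Cached:" { kb += $2 }
         END { printf "%d\n", kb / 1024 }' "$meminfo"
}

# Hypothetical policy: hand the JVM a fixed percentage of what looks
# available. The 75% default is an assumption, not the real script's value.
heap_mb() {
    avail_mb="$1"
    percent="${2:-75}"
    echo $(( avail_mb * percent / 100 ))
}
```

With a calculation of this shape, whatever the old process is still holding at the moment "start" runs is simply invisible to it, so the resulting heap size shrinks accordingly.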

However, "stop" doesn't necessarily free all its used memory before it returns.

Combined, this means "start" sometimes sees only a small amount of memory available (<2GB) instead of the proper amount (~16GB). Java then starts up with a much smaller heap, which quickly fills up. It hangs more and more often while doing garbage collection, and eventually runs out of memory when it can't collect enough to allocate whatever new thing it's trying to do.


The obvious solution is to have the init script sleep a bit between stop and start when doing a restart. Another solution might be to add a check to the start job that validates how much RAM is going to be allocated, and whines if things seem wrong.
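Both ideas can be sketched roughly as follows. This is only a sketch under assumptions: `stop` and `start` stand for the init script's existing functions, while `check_available`, `RESTART_DELAY`, the 10 second default and the 8192 MB floor are hypothetical names and values chosen for the example.

```shell
# Hypothetical sanity check: complain and fail if the memory visible to
# "start" is below a floor. The 8192 MB default is an assumed value.
check_available() {
    avail_mb="$1"
    min_mb="${2:-8192}"
    if [ "$avail_mb" -lt "$min_mb" ]; then
        echo "only ${avail_mb} MB available, expected at least ${min_mb} MB" >&2
        return 1
    fi
}

# Restart sketch: stop, pause so the old JVM has time to release its
# memory, then start. stop and start are assumed to be defined elsewhere
# in the init script; the 10 second default delay is an assumption.
restart() {
    stop
    sleep "${RESTART_DELAY:-10}"
    start
}
```

A fixed sleep is the simpler of the two but relies on guessing how long the old process takes to exit. The check makes a bad start visible instead of letting it fail later under load.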
(Assignee)

Comment 2

3 years ago
I committed a change to the init script, so this should hopefully be fixed now.
Assignee: server-ops-webops → nmaul
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED