Closed Bug 1460348 Opened 6 years ago Closed 3 years ago

[sync] Dyno running out of memory

Categories

(Webtools Graveyard :: Pontoon, enhancement, P4)

Tracking

(Not tracked)

RESOLVED MOVED

People

(Reporter: mathjazz, Unassigned)

Details

Note:

This is not a new bug. It has been hitting us since we migrated Pontoon to Heroku and was initially tracked under bug 1214411.

--

Details:

The Heroku dyno used as a worker for the sync process often runs out of memory. We see two types of errors in the logs:

R14 - Memory quota exceeded (degraded performance):
https://devcenter.heroku.com/articles/error-codes#r14-memory-quota-exceeded

R15 - Memory quota vastly exceeded (dyno is killed, sync breaks):
https://devcenter.heroku.com/articles/error-codes#r15-memory-quota-vastly-exceeded

--

Previous attempts at fixing the problem:

To address the problem, we made several optimizations to the sync process in the past, two of which stand out:

1. We fixed bug 1214411 (which initially tracked this problem) by detecting which files changed in VCS and syncing only those. After that, the aforementioned error messages stopped appearing constantly and now only show up when a bigger changeset is synced.

2. We fixed bug 1383252 by greatly reducing the number of costly hg clone operations. That reduced the average sync time from 20 minutes to 2 minutes and allowed us to switch from 3 Standard-2X dynos to 1 Standard-1X (also reducing the worker dyno cost by a factor of 6).

--

Current status:

We mostly see the error when bigger changesets are processed, e.g. when we run Fluent migrations or when projects that store translations in big bilingual files are synced (e.g. SUMO, AMO, MDN).

To avoid losing the worker (and damaging the sync process), we manually upgrade the sync worker to Performance-M before we run Fluent migrations, but that makes the process more manual than it could be and doesn't scale. We don't know, for example, when new SUMO strings will land.

--

Plan:

We should investigate the root cause of the problem and figure out whether we can fix it programmatically. One suspect is the reduced number of DB queries (which are now bigger) combined with extensive use of prefetching: both are needed for performance reasons, but they may be what increased memory consumption.
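
One way to check that suspicion outside of Pontoon is to measure peak memory for "load everything, then process" versus "process in batches". The sketch below is not Pontoon code: `fake_rows`, `process_all_at_once`, `process_in_batches` and `peak_bytes` are hypothetical names, and the generated rows only stand in for a large prefetched query result. It uses the standard-library `tracemalloc` module, which tracks Python-level allocations rather than the RSS figure Heroku reports.

```python
import tracemalloc
from itertools import islice


def fake_rows(n):
    """Hypothetical stand-in for a large query result; yields one row at a time."""
    for i in range(n):
        yield {"id": i, "string": "x" * 200 + str(i)}


def process_all_at_once(rows):
    """Materialize every row first, similar to one big prefetched query."""
    loaded = list(rows)
    return sum(len(row["string"]) for row in loaded)


def process_in_batches(rows, batch_size=1000):
    """Keep only about one batch of rows alive at a time."""
    rows = iter(rows)
    total = 0
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        total += sum(len(row["string"]) for row in batch)
    return total


def peak_bytes(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


if __name__ == "__main__":
    _, peak_all = peak_bytes(process_all_at_once, fake_rows(20000))
    _, peak_batched = peak_bytes(process_in_batches, fake_rows(20000), batch_size=500)
    print(f"all at once: {peak_all} bytes, batched: {peak_batched} bytes")
```

If the real sync code shows the same pattern, batching would trade some extra queries for a lower memory ceiling, which is the tension with the performance work described above.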

The other solution is to permanently upgrade the sync worker to a more expensive Performance-M dyno (https://www.heroku.com/pricing), which works reliably and is also dedicated.

We have not hit the problem since we started using the Performance-M worker.
Priority: P3 → P4

I was just trying out a Fluent migration on my Pontoon instance, and even with a Standard-2X dyno for the worker, I got some "R14 - Memory quota exceeded (degraded performance)" errors during the sync. More interestingly, even after the sync was complete, I kept seeing those errors every few seconds:

2019-09-19T13:48:36.067934+00:00 heroku[worker.1]: Process running mem=1088M(106.2%)
2019-09-19T13:48:36.068061+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2019-09-19T13:48:55.889321+00:00 heroku[worker.1]: Process running mem=1088M(106.2%)
2019-09-19T13:48:55.889443+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2019-09-19T13:49:15.906849+00:00 heroku[worker.1]: Process running mem=1088M(106.2%)
2019-09-19T13:49:15.906963+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2019-09-19T13:49:35.905572+00:00 heroku[worker.1]: Process running mem=1088M(106.2%)

Which makes me think that there may be a leak somewhere.

*This bug has been moved to GitHub.*

*Please check it out on https://github.com/mozilla/pontoon/issues.*
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → MOVED
Product: Webtools → Webtools Graveyard