Closed Bug 1620423 Opened 5 years ago Closed 5 years ago

Load on kintowe is excessive

Categories

(Cloud Services :: Operations: Kinto, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: glasserc, Assigned: glasserc)

References

Details

Over time the kintowe service has grown to the limits of its Google Cloud Platform capacity. We are currently failing with some regularity and traffic continues to grow. The bottleneck seems to be DB memory, and it is not clear why the database uses so much memory.

This bug tracks the overall issue, with updates to come as things are tried, information is gathered, and hypotheses are ruled out.

Component: Storage → Operations: Storage
Product: WebExtensions → Cloud Services
QA Contact: chartjes

Sven and I spent a lot of time over the last couple days getting "manual" access to the database and trying to gather what information we could. Some observations:

  • There are usually around 3,000 open connections to the database, of which most (80%?) are "idle" (i.e. not doing anything). (A sketch of the kind of diagnostic queries behind these numbers follows after this list.)
  • While there are lots of sequential scans of the user_principals table, that table is empty, so this can be disregarded.
  • There is an unused index, which we should probably drop, but that doesn't seem like a huge win.
  • Sven ran EXPLAIN ANALYZE on some common SELECT queries. They're generally relatively fast (<10 ms), entirely based on index scans, and do not seem to have obvious gains available. One query uses a temporary table, but it seems to only cost about 25 kB, which would not explain what we're seeing.
  • There are a bunch of PostgreSQL configuration parameters that we could try changing to reduce memory usage: https://www.postgresql.org/docs/10/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY. shared_buffers is large, but that makes sense given the size of the machine and the guidance of "25% of RAM". temp_buffers and work_mem both seem fine.
  • Hypothetically, we could PREPARE more statements in the application to cut down on query planning time, but the value is not clear given that we recycle connections so much and since some queries are very dynamic.
  • I added a panel to Wei's grafana dashboard to show frequency of Kinto database operations and to try to track them against DB memory use: https://earthangel-b40313e5.influxcloud.net/d/l0TKzwHZk/kintowe-by-wezhou?orgId=1&from=now-24h&to=now . No operation seems obviously costly.
  • Each kintowe web node has 10 connections open to the "storage" backend and 10 more to the "permission" backend (which are physically the same DB although logically separate). Sven wants to try reducing this number. If memory usage is dominated by open connections, reducing this to 9 and 9 might buy us a few percent. However, Adrian fought with this once upon a time, and there were consequences for the application when we reduced it too much, so maybe there's a threshold we can't cross.
  • We also have a recycle parameter which dictates how long we keep a connection open before throwing it out and making a new one. Currently it's set to 900 seconds (i.e. 15 minutes). If connections accumulate memory, lowering this number could help too. These settings (and many others) can be found in https://github.com/mozilla-services/cloudops-infra/blob/master/projects/webextstoragesync/k8s/charts/webextstoragesync/conf/kinto.ini
  • I have been tasked with putting together a small patch to land in Firefox which will reduce the frequency of syncing for the webextensions storage. This would translate into less load which would hopefully reduce CPU and memory usage.
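
For reference, numbers like the connection counts, sequential scans, and unused index above can be pulled from the standard Postgres statistics views. A rough sketch of such queries (not necessarily exactly what we ran; results depend on when the statistics were last reset):

    -- Connections by state; most of ours were reported as "idle".
    SELECT state, count(*)
      FROM pg_stat_activity
     GROUP BY state
     ORDER BY count(*) DESC;

    -- Sequential scans and live row counts per table (user_principals shows up here).
    SELECT relname, seq_scan, n_live_tup
      FROM pg_stat_user_tables
     ORDER BY seq_scan DESC;

    -- Indexes that have never been used since the last statistics reset.
    SELECT schemaname, relname, indexrelname
      FROM pg_stat_user_indexes
     WHERE idx_scan = 0;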

Another thing that may be worth noting is that we have 20 uwsgi workers per pod. If it turns out we don't need that many, maybe we can reduce that number too.

Here's a summary of the mitigation measures we plan to implement, sorted by the amount of effort in relation to their expected usefulness.

  • Tune configuration settings, both for Postgres and Kinto.

    • Reduce the number of database connections in the connection pools in Kinto.
    • Reduce the lifetime of connections in Kinto.
    • Reduce the number of uwsgi workers for Kinto.
    • Reduce the number of shared buffers in Postgres.
      We need to watch the table cache hit rate if we do so; it is currently around 99%.
    • Reduce the per-operation working memory (work_mem) in Postgres.
      We need to watch the size of temporary disk storage and I/O performance if we do so. (Monitoring queries for these two items are sketched after this list.)
  • Reduce the sync frequency on the client side.
    This has the potential to reduce the load significantly, and may be all we need. (If the effort is relatively low, this should be the first item in the list.)

  • Split out the permissions schema into a separate database on a different server.
    This should be relatively straightforward. The application code shouldn't need any changes. We will probably need to write a relatively simple script to migrate the data in the relevant tables from the old database to the new one, so we can run it on Kubernetes. We need to take the app offline for the duration of the migration. If we want to reduce the downtime, we can alternatively create a read replica of the current database, and only take the app offline to promote the read replica to a standalone server and change the config.

  • Upgrade Postgres to version 11.
    There are a number of changes in Postgres 11 that have the potential to reduce memory usage and improve performance. The migration involves thoroughly testing Kinto with Postgres 11, and taking the app offline for the whole duration of the data migration. I expect both the overall effort and the downtime to be higher than for the previous option, and the expected gains to be lower, so this is only a last resort.
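
For the two "watch" items above (table cache hit rate, temporary disk storage), queries along these lines should do. This is a sketch based on the standard Postgres statistics views, not something we have productionised:

    -- Table cache hit rate; currently around 99%, re-check after lowering shared_buffers.
    SELECT sum(heap_blks_hit)::float
           / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS table_hit_rate
      FROM pg_statio_user_tables;

    -- Temporary file usage; watch this after lowering work_mem.
    SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_size
      FROM pg_stat_database
     ORDER BY temp_bytes DESC;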

I see that each pod still has around 22 connections to the database. I suspect the storage connection pool has maxed out its 15 (pool_size + max_overflow) connections and the permissions connection pool has used all 5 in its pool_size plus some.

What if we set max_overflow to 0 and increase the uWSGI listen queue? This will probably increase the request latency a little, because uWSGI holds the requests in its queue while waiting for available DB connections. However, it may help reduce the database memory usage, as long as the pod CPU doesn't increase because of the change.
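
(For what it's worth, the per-pod connection count can be double-checked directly on the database with something like the query below. This assumes each pod shows up under its own client address, which may not hold depending on how the connections are proxied.)

    -- Connections per client address (roughly: per pod), with idle count.
    SELECT client_addr, count(*) AS connections,
           count(*) FILTER (WHERE state = 'idle') AS idle
      FROM pg_stat_activity
     GROUP BY client_addr
     ORDER BY connections DESC;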

Wei, this explanation seems plausible.

I tried to figure out what exactly our uwsgi configuration means, but the documentation is rather terse. I believe the uwsgi workers are started as (OS-level) threads with a single interpreter, and it looks like these threads all share a common DB connection pool, so there is only one connection pool per pod, as you say. I'm not completely confident that this is what the configuration means, but based on our connection numbers, that's the most reasonable assumption.

This means that we have 20 uwsgi workers competing for at most 15 database connections. At times of high load this may be a problem. Reducing the number of uwsgi workers and increasing the size of the listen queue seems like a good idea. I think I will simply revert the previous change – it obviously didn't help, since the database is not in any better shape now.

So as the next step, I'll make the changes above, but I don't expect them to help the DB.

In the meantime, tcsc has filed https://bugzilla.mozilla.org/show_bug.cgi?id=1621806 and started working on it. This change should give us a configurable way to reduce the load on the kintowe service, so it may be all we need.

After making the changes mentioned in the previous comment, I will wait a day to see how the DB reacts. After that, I'll try to somewhat reduce the amount of shared buffers. We currently use 32% of the total memory of the instance as Postgres shared buffers. The recommended amount is about 25%.
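
For the record, the current values of the memory-related settings can be inspected directly on the database; a minimal sketch using the standard pg_settings view:

    -- Current memory-related settings (shared_buffers is the dominant one).
    SELECT name, setting, unit
      FROM pg_settings
     WHERE name IN ('shared_buffers', 'work_mem', 'temp_buffers', 'max_connections');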

Google support got back to me and told me that the "memory usage" metric in the GCP console refers only to the memory usage of the database processes, though this doesn't sound completely convincing to me. In any case, I believe this number does not include operating system buffers and cache, so our large shared_buffers setting eats into the kernel's ability to cache and buffer disk I/O. I believe this hurts more than the additional in-process buffering helps, so I hope that reducing shared_buffers will also help here.

Apart from these steps, I don't think we need to look into the third and fourth bullet points (splitting up the DB, upgrading to Postgres 11) at this point – we can first wait for the results of the other two measures.

I also asked Google support about the DB process crashes we sometimes observe during times of high load. I don't know why they are happening. If we see several of them in a row, the database fails over, but this never seems to actually improve things.

All uWSGI workers share the same connection pool, because we set single-interpreter = true.

Reducing the number of uwsgi workers and increasing the size of the listen queue seems like a good idea.

I think so too. Besides, the k8s nodes we use provide at most 4 virtual CPU cores each, so it probably doesn't make sense to have 20 workers per pod, given that we currently run one pod per node.

I'm not sure if you want to try setting max_overflow to 0 as well; I'd think that may be worth trying too.

single-interpreter = true only means that each module gets loaded only once. It does not automatically imply that there is only one database connection pool, so I wasn't sure. I now found the relevant code in the Kinto codebase, and the connection pool is indeed stored in a global variable, so there will be only one.

Having more threads than CPUs is reasonable in general. The threads spend a lot of their time blocking and waiting for the database, and they don't need any CPU while they block.

However, since uwsgi is configured to use OS threads as workers rather than processes, I wonder how we can make use of more than one CPU anyway – doesn't Kinto spend most of its time executing Python byte code, so that we are subject to the GIL? The uwsgi master seems to be a separate process, but that still allows us to use at most 2 CPUs, yet the pods are configured with a CPU allocation of 3.25 CPUs – all rather confusing to me.

When you suggest reducing max_overflow to 0, should I also increase the pool size to 15, so the maximum stays the same? I'm not sure we can reduce the maximum number of connections in the pool given that the pool is shared between so many threads.

When you suggest reducing max_overflow to 0, should I also increase the pool size to 15, so the maximum stays the same?

If one of the goals is to reduce database memory usage, then I'd suggest maybe trying the following parameters:

  • pool_size = 10 for storage pool
  • pool_size = 5 for permissions pool
  • max_overflow = 0 for both pools
  • uWSGI workers = 4
  • uWSGI listen queue = 1000

That way, each pod has at most 15 connections to the DB; if we have 120 pods, that'll be 1,800 connections.

We need to watch the request latency and per-pod CPU usage though; if those don't get much higher than what we currently have, it may be a good sign.

Wow, that's a pretty radical proposal! I think I will be a bit more conservative and try not to change everything at once, but I'll play with these parameters a bit.

In particular, I think going from 20 uwsgi workers to 4 in a single step is too much. The typical CPU usage of our current pods is below 2, so we are not hitting the CPU limits with 20 workers – if anything, we are hitting GIL limits.

Google Cloud support got back to me and confirmed that the DB processes are OOMing. We never see high enough memory usage to actually explain that, but this may be due to the sampling frequency.

Unfortunately, we have a different problem now.

The external load increased by about 10% in the last two weeks, with most of the increase happening this week. This is about the amount of increase we previously saw over the course of three months. I don't know the cause for this increase – it still looks like organic growth to me. Maybe some moderately popular extension started using sync.

Currently, the database isn't OOMing anymore. Instead, we are running out of CPU again. Again, I don't know why.

I will still try to slightly reduce the number of shared buffers of the database, since this change has the potential to improve the overall performance of the database server by freeing some memory for OS buffers and caches.

Another note – I can't increase the uwsgi listen queue size significantly. It's currently set to its default of 100, and the default maximum on Linux is 128. I don't know whether it is possible to change this default on GKE, but I feel that if so many requests get queued, chances are we are out of DB capacity anyway, so queueing even more requests won't help. During the time the DB is overloaded, we are already seeing average request latencies of 20s. Queuing requests for even longer than that doesn't feel useful to me.

https://wiki.postgresql.org/wiki/Number_Of_Database_Connections also seems to suggest that fewer connections sometimes mean more throughput. Of course, our mileage may vary.

Pg will usually complete the same 10,000 transactions faster by doing them 5, 10 or 20 at a time than by doing them 500 at a time. Determining exactly how many should be done at once varies by workload and requires tuning.

Maybe some moderately popular extension started using sync.

The most recent release of multi-account containers included new sync functionality, and I've noticed a steady increase in chatter about it. There is also a bug where this addon causes high CPU usage when syncing (Bug 1615822), which could plausibly also result in it generating excessive server load.

Ryan, thanks for bringing this to my attention. I've asked on the bug you linked whether it is expected to cause higher load on the server.

I noticed the bug already existed for at least a month, while most of the traffic increase I observed happened over the last two weeks, so I'm not sure it's related. It doesn't hurt to look into it, though.

See Also: → 1569001

Update: I've been working on splitting up the database into a storage DB and a permissions DB. I have performed a few manual tests for stage. The required infrastructure changes are in https://github.com/mozilla-services/cloudops-infra/pull/1964.

The plan for deploying this change is roughly:

  • Create the new DB via Terraform.
  • Run the Jenkins pipeline until after the migrations. This populates the permissions DB with its schema.
  • Take the service offline.
  • Migrate the permissions tables from the old DB using pg_dump.
  • Finish the Jenkins pipeline.
  • Bring the service online again.

I didn't get as far as I hoped today, but I still think that I'll be able to roll this out tomorrow by end of my workday.

Thanks for this, Sven!

FYI, https://bugzilla.mozilla.org/show_bug.cgi?id=1621806 was approved for uplift by RelMan, on its way to BETA for Fx75. Fx75 is in BETA for another week, with RC on March 30, and live on April 7.

Julien, that's great to hear! I didn't really expect it would make it into the current beta.

The database split is complete, and the service is healthy again.

Migrating the data took far longer than anticipated, so we had a downtime of 5 hours.

Since the client side changes are done as well, I'm closing this ticket now.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED

Hey Sven, are we able to determine for how long the service will remain "healthy"? (This can wait until Monday of course).

Flags: needinfo?(sven)

Julien, we are able to determine this with unfortunate precision: until 12:42 UTC today.

The load on the service has increased far more than expected over the last two weeks. After splitting the DB, we are able to serve 25% more traffic than before, but load increased by about 40% over the last two weeks.

Status: RESOLVED → REOPENED
Flags: needinfo?(sven)
Resolution: FIXED → ---

A quick update on this:

Sven, who is on PTO today, bumped the database CPU cores to 80 yesterday, and database CPU usage has dropped from 100% to about 80% during peak hours.

We learned from GCP support that Cloud SQL now allows scaling DB instances to up to 96 CPUs via the API, so I decided to try going up to 80 CPUs. This looked very promising yesterday, but today the permissions DB ran out of memory again. I increased the memory for the permissions DB even further – it now has 143GB of RAM for a dataset of approximately 24GB total, which seems a bit out of control. I'll look into this tomorrow when I'm back from PTO.
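
(For context, the ~24GB dataset figure can be checked on the permissions DB with something along these lines, using standard Postgres size functions:)

    -- Total on-disk size of the current database, and its largest tables.
    SELECT pg_size_pretty(pg_database_size(current_database())) AS database_size;

    SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS total_size
      FROM pg_stat_user_tables
     ORDER BY pg_total_relation_size(relid) DESC
     LIMIT 10;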

The service has been mostly stable for the last 6 days. The permissions db is occasionally running out of memory, leading to very short spikes of higher error rates. Since these spikes are short, and the errors are not directly visible to normal users, I think this state is Good Enough for now, in particular given that the change to adjust traffic on the client side should land in release in a week.

The new database deployment is quite expensive to run, so we should look into reducing the sync frequency to keep the costs reasonable. Let's first see how the reduction to 80% of the traffic goes, though.

Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED

Ethan also implemented an optimisation for one of the most frequent queries to the permissions db that may help reduce the load on the permissions db, see https://github.com/Kinto/kinto/pull/2475.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

The PR linked in comment 29 is now deployed to production.

Neither memory usage nor CPU load is measurably better than they were before the change, so I guess the change didn't achieve much. As the cache warms up, the average time for the query we optimised continues to go down. We are now at an average execution time of 3.2ms at a cache hit rate of 98.3%. The average time is still dominated by the 1.7% cache misses – re-executing a recent query takes about 0.1ms. So I guess there just isn't much we can do with optimising the queries.
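
(Per-query timings and cache hit rates like the ones above can be pulled from pg_stat_statements, assuming that extension is enabled on the instance; a sketch, using the Postgres 10 column names:)

    -- Hottest statements by call count, with mean execution time and cache hit rate.
    SELECT left(query, 60) AS query,
           calls,
           round(mean_time::numeric, 2) AS mean_ms,
           round(100.0 * shared_blks_hit
                 / nullif(shared_blks_hit + shared_blks_read, 0), 1) AS hit_rate_pct
      FROM pg_stat_statements
     ORDER BY calls DESC
     LIMIT 10;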

With Firefox 75 being released to end users, we already see about 22% of the kintowe traffic coming from the new version.

The new version is expected to reduce the load on kintowe by 20%, and the initial data we see so far is in line with a 20% reduction, so everything seems to be working as expected. At this point, I don't expect kintowe to be overloaded again anytime soon.

However, we need to discuss whether we want to further reduce the sync frequency on the client side. We needed to scale the service up quite a bit, which adds significant operational costs. Syncing every 20 minutes should be acceptable in my opinion; it corresponds to a 50% reduction compared to the original state.

We now have about 35% of traffic coming from the newly released Firefox 75, so we expect an overall reduction of traffic by 7% (35% of clients times the 20% per-client reduction). The current total load is about 9% below the load one week ago, which is within the expected variance of the data.

Kintowe has stabilised, so I'm marking this ticket as fixed. Possible cost savings are discussed in the follow-up bug 1630222.

Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED