Closed Bug 1620423 Opened 5 years ago Closed 5 years ago

Load on kintowe is excessive

Categories

(Cloud Services :: Operations: Kinto, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: glasserc, Assigned: glasserc)

References

Details

Over time the kintowe service has grown to the limits of its Google Cloud Platform capacity. We are currently failing with some regularity and traffic continues to grow. The bottleneck seems to be DB memory, and it is not clear why the database uses so much memory.

This bug tracks the overall issue, with updates to come as things are tried, information is gathered, and hypotheses are ruled out.

Component: Storage → Operations: Storage
Product: WebExtensions → Cloud Services
QA Contact: chartjes

Sven and I spent a lot of time over the last couple days getting "manual" access to the database and trying to gather what information we could. Some observations:

  • There are usually around 3,000 open connections to the database, of which most (80%?) are "idle" (i.e. not doing anything). (A sketch of the kind of diagnostic queries behind these numbers follows after this list.)
  • While there are lots of sequential scans of the user_principals table, that table is empty, so this can be disregarded.
  • There is an unused index, which we should probably drop, but that doesn't seem like a huge win.
  • Sven ran EXPLAIN ANALYZE on some common SELECT queries. They're generally relatively fast (<10 ms), entirely based on index scans, and do not seem to have obvious gains available. One query uses a temporary table, but it seems to only cost about 25 kB, which would not explain what we're seeing.
  • There are a bunch of PostgreSQL configuration parameters that we could try changing to reduce memory usage: https://www.postgresql.org/docs/10/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY. shared_buffers is large, but that makes sense given the size of the machine and the guidance of "25% of RAM". temp_buffers and work_mem both seem fine.
  • Hypothetically, we could PREPARE more statements in the application to cut down on query planning time, but the value is not clear given that we recycle connections so much and since some queries are very dynamic.
  • I added a panel to Wei's grafana dashboard to show frequency of Kinto database operations and to try to track them against DB memory use: https://earthangel-b40313e5.influxcloud.net/d/l0TKzwHZk/kintowe-by-wezhou?orgId=1&from=now-24h&to=now . No operation seems obviously costly.
  • Each kintowe web node has 10 connections open to the "storage" backend and 10 more to the "permission" backend (which are physically the same DB although logically separate). Sven wants to try reducing this number. If memory usage is dominated by open connections, reducing this to 9 and 9 might buy us a few percent. However, Adrian fought with this once upon a time, and there were consequences for the application when we reduced it too much, so maybe there's a threshold we can't cross.
  • We also have a recycle parameter which dictates how long we keep a connection open before throwing it out and making a new one. Currently it's set to 900 seconds (i.e. 15 minutes). If connections accumulate memory, lowering this number could help too. These settings (and many others) can be found in https://github.com/mozilla-services/cloudops-infra/blob/master/projects/webextstoragesync/k8s/charts/webextstoragesync/conf/kinto.ini
  • I have been tasked with putting together a small patch to land in Firefox which will reduce the frequency of syncing for the webextensions storage. This would translate into less load which would hopefully reduce CPU and memory usage.
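
For reference, numbers like the connection counts, sequential scans, and unused index above can be pulled from the standard Postgres statistics views. A rough sketch of such queries (not necessarily exactly what we ran; results depend on when the statistics were last reset):

    -- Connections by state; most of ours were reported as "idle".
    SELECT state, count(*)
      FROM pg_stat_activity
     GROUP BY state
     ORDER BY count(*) DESC;

    -- Sequential scans and live row counts per table (user_principals shows up here).
    SELECT relname, seq_scan, n_live_tup
      FROM pg_stat_user_tables
     ORDER BY seq_scan DESC;

    -- Indexes that have never been used since the last statistics reset.
    SELECT schemaname, relname, indexrelname
      FROM pg_stat_user_indexes
     WHERE idx_scan = 0;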

Another thing that may be worth noting is that we have 20 uwsgi workers per pod. If it turns out we don't need that many, maybe we can reduce that number too.

Here's a summary of the mitigation measures we plan to implement, sorted by the amount of effort in relation to their expected usefulness.

  • Tune configuration settings, both for Postgres and Kinto.

    • Reduce the number of database connections in the connection pools in Kinto.
    • Reduce the lifetime of connections in Kinto.
    • Reduce the number of uwsgi workers for Kinto.
    • Reduce the number of shared buffers in Postgres.
      We need to watch the table cache hit rate if we do so; it is currently around 99%.
    • Reduce the per-operation working memory (work_mem) in Postgres.
      We need to watch the size of temporary disk storage and I/O performance if we do so. (Monitoring queries for these two items are sketched after this list.)
  • Reduce the sync frequency on the client side.
    This has the potential to reduce the load significantly, and may be all we need. (If the effort is relatively low, this should be the first item in the list.)

  • Split out the permissions schema into a separate database on a different server.
    This should be relatively straightforward. The application code shouldn't need any changes. We will probably need to write a relatively simple script to migrate the data in the relevant tables from the old database to the new one, so we can run it on Kubernetes. We need to take the app offline for the duration of the migration. If we want to reduce the downtime, we can alternatively create a read replica of the current database, and only take the app offline to promote the read replica to a standalone server and change the config.

  • Upgrade Postgres to version 11.
    There are a number of changes in Postgres 11 that have the potential to reduce memory usage and improve performance. The migration involves thoroughly testing Kinto with Postgres 11, and taking the app offline for the whole duration of the data migration. I expect both the overall effort and the downtime to be higher than for the previous option, and the expected gains to be lower, so this is only a last resort.
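
For the two "watch" items above (table cache hit rate, temporary disk storage), queries along these lines should do. This is a sketch based on the standard Postgres statistics views, not something we have productionised:

    -- Table cache hit rate; currently around 99%, re-check after lowering shared_buffers.
    SELECT sum(heap_blks_hit)::float
           / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS table_hit_rate
      FROM pg_statio_user_tables;

    -- Temporary file usage; watch this after lowering work_mem.
    SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_size
      FROM pg_stat_database
     ORDER BY temp_bytes DESC;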

I see that each pod still has around 22 connections to the database. I suspect the storage connection pool has maxed out its 15 (pool_size + max_overflow) connections and the permissions connection pool has used all 5 in its pool_size plus some.

What if we set max_overflow to 0 and increase the uWSGI listen queue? This will probably increase the request latency a little, because uWSGI holds the requests in its queue while waiting for available DB connections. However, it may help reduce the database memory usage, as long as the pod CPU doesn't increase because of the change.
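
(For what it's worth, the per-pod connection count can be double-checked directly on the database with something like the query below. This assumes each pod shows up under its own client address, which may not hold depending on how the connections are proxied.)

    -- Connections per client address (roughly: per pod), with idle count.
    SELECT client_addr, count(*) AS connections,
           count(*) FILTER (WHERE state = 'idle') AS idle
      FROM pg_stat_activity
     GROUP BY client_addr
     ORDER BY connections DESC;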

Wei, this explanation seems plausible.

I tried to figure out what exactly our uwsgi configuration means, but the documentation is rather terse. I believe the uwsgi workers are started as (OS-level) threads with a single interpreter, and it looks like these threads all share a common DB connection pool, so there is only one connection pool per pod, as you say. I'm not completely confident that this is what the configuration means, but based on our connection numbers, that's the most reasonable assumption.

This means that we have 20 uwsgi workers competing for at most 15 database connections. At times of high load this may be a problem. Reducing the number of uwsgi workers and increasing the size of the listen queue seems like a good idea. I think I will simply revert the previous change – it obviously didn't help, since the database is not in any better shape now.

So as the next step, I'll make the changes above, but I don't expect them to help the DB.

In the meantime, tcsc has filed https://bugzilla.mozilla.org/show_bug.cgi?id=1621806 and started working on it. This change should give us a configurable way to reduce the load on the kintowe service, so it may be all we need.

After making the changes mentioned in the previous comment, I will wait a day to see how the DB reacts. After that, I'll try to somewhat reduce the amount of shared buffers. We currently use 32% of the total memory of the instance as Postgres shared buffers. The recommended amount is about 25%.
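
For the record, the current values of the memory-related settings can be inspected directly on the database; a minimal sketch using the standard pg_settings view:

    -- Current memory-related settings (shared_buffers is the dominant one).
    SELECT name, setting, unit
      FROM pg_settings
     WHERE name IN ('shared_buffers', 'work_mem', 'temp_buffers', 'max_connections');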

Google support got back to me and told me that the "memory usage" metric in the GCP console refers only to the memory usage of the database processes, though this doesn't sound completely convincing to me. In any case, I believe this number does not include operating system buffers and cache, so our large shared_buffers setting eats into the kernel's ability to cache and buffer disk I/O. I believe this hurts more than the additional in-process buffering helps, so I hope that reducing shared_buffers will also help here.

Apart from these steps, I don't think we need to look into the third and fourth bullet points (splitting up the DB, upgrading to Postgres 11) at this point – we can first wait for the results of the other two measures.

I also asked Google support about the DB process crashes we sometimes observe during times of high load. I don't know why they are happening. If we see several of them in a row, the database fails over, but this never seems to actually improve things.

All uWSGI workers share the same connection pool, because we set single-interpreter = true.

Reducing the number of uwsgi workers and increasing the size of the listen queue seems like a good idea.

I think so too. Besides, the k8s nodes we use provide at most 4 virtual CPU cores each, so it probably doesn't make sense to have 20 workers per pod, given that we currently run one pod per node.

I'm not sure if you want to try setting max_overflow to 0 as well; I'd think that may be worth trying too.

single-interpreter = true only means that each module gets loaded only once. It does not automatically imply that there is only one database connection pool, so I wasn't sure. I now found the relevant code in the Kinto codebase, and the connection pool is indeed stored in a global variable, so there will be only one.

Having more threads than CPUs is reasonable in general. The threads spend a lot of their time blocking and waiting for the database, and they don't need any CPU while they block.

However, since uwsgi is configured to use OS threads as workers rather than processes, I wonder how we can make use of more than one CPU anyway – doesn't Kinto spend most of its time executing Python byte code, so that we are subject to the GIL? The uwsgi master seems to be a separate process, but that still allows us to use at most 2 CPUs, yet the pods are configured with a CPU allocation of 3.25 CPUs – all rather confusing to me.

When you suggest reducing max_overflow to 0, should I also increase the pool size to 15, so the maximum stays the same? I'm not sure we can reduce the maximum number of connections in the pool given that the pool is shared between so many threads.

When you suggest reducing max_overflow to 0, should I also increase the pool size to 15, so the maximum stays the same?

If one of the goals is to reduce database memory usage, then I'd suggest maybe trying the following parameters:

  • pool_size = 10 for storage pool
  • pool_size = 5 for permissions pool
  • max_overflow = 0 for both pools
  • uWSGI workers = 4
  • uWSGI listen queue = 1000

That way, each pod has at most 15 connections to the DB; if we have 120 pods, that'll be 1,800 connections.

We need to watch the request latency and per-pod CPU usage though; if those don't get much higher than what we currently have, it may be a good sign.

Wow, that's a pretty radical proposal! I think I will be a bit more conservative and try not to change everything at once, but I'll play with these parameters a bit.

In particular, I think going from 20 uwsgi workers to 4 in a single step is too much. The typical CPU usage of our current pods is below 2, so we are not hitting the CPU limits with 20 workers – if anything, we are hitting GIL limits.

Google Cloud support got back to me and confirmed that the DB processes are OOMing. We never see high enough memory usage to actually explain that, but this may be due to the sampling frequency.

Unfortunately, we have a different problem now.

The external load increased by about 10% in the last two weeks, with most of the increase happening this week. This is about the amount of increase we previously saw over the course of three months. I don't know the cause for this increase – it still looks like organic growth to me. Maybe some moderately popular extension started using sync.

Currently, the database isn't OOMing anymore. Instead, we are running out of CPU again. Again, I don't know why.

I will still try to slightly reduce the number of shared buffers of the database, since this change has the potential to improve the overall performance of the database server by freeing some memory for OS buffers and caches.

Another note – I can't increase the uwsgi listen queue size significantly. It's currently set to its default of 100, and the default maximum on Linux is 128. I don't know whether it is possible to change this default on GKE, but I feel that if so many requests get queued, chances are we are out of DB capacity anyway, so queueing even more requests won't help. During the time the DB is overloaded, we are already seeing average request latencies of 20s. Queuing requests for even longer than that doesn't feel useful to me.

https://wiki.postgresql.org/wiki/Number_Of_Database_Connections also seems to suggest that fewer connections sometimes mean more throughput. Of course, our mileage may vary.

Pg will usually complete the same 10,000 transactions faster by doing them 5, 10 or 20 at a time than by doing them 500 at a time. Determining exactly how many should be done at once varies by workload and requires tuning.

Maybe some moderately popular extension started using sync.

The most recent release of multi-account containers included new sync functionality, and I've noticed a steady increase in chatter about it. There is also a bug where this addon causes high CPU usage when syncing (Bug 1615822), which could plausibly also result in it generating excessive server load.

Ryan, thanks for bringing this to my attention. I've asked on the bug you linked whether it is expected to cause higher load on the server.

I noticed the bug already existed for at least a month, while most of the traffic increase I observed happened over the last two weeks, so I'm not sure it's related. It doesn't hurt to look into it, though.

See Also: → 1569001

Update: I've been working on splitting up the database into a storage DB and a permissions DB. I have performed a few manual tests for stage. The required infrastructure changes are in https://github.com/mozilla-services/cloudops-infra/pull/1964.

The plan for deploying this change is roughly:

  • Create the new DB via Terraform.
  • Run the Jenkins pipeline until after the migrations. This populates the permissions DB with its schema.
  • Take the service offline.
  • Migrate the permissions tables from the old DB using pg_dump.
  • Finish the Jenkins pipeline.
  • Bring the service online again.

I didn't get as far as I hoped today, but I still think that I'll be able to roll this out tomorrow by end of my workday.

Thanks for this, Sven!

FYI, https://bugzilla.mozilla.org/show_bug.cgi?id=1621806 was approved for uplift by RelMan, on its way to BETA for Fx75. Fx75 is in BETA for another week, with RC on March 30, and live on April 7.

Julien, that's great to hear! I didn't really expect it would make it into the current beta.

The database split is complete, and the service is healthy again.

Migrating the data took far longer than anticipated, so we had a downtime of 5 hours.

Since the client side changes are done as well, I'm closing this ticket now.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED

Hey Sven, are we able to determine for how long the service will remain "healthy"? (This can wait until Monday of course).

Flags: needinfo?(sven)

Julien, we are able to determine this with unfortunate precision: until 12:42 UTC today.

The load on the service has increased far more than expected over the last two weeks. After splitting the DB, we are able to serve 25% more traffic than before, but load increased by about 40% over the last two weeks.

Status: RESOLVED → REOPENED
Flags: needinfo?(sven)
Resolution: FIXED → ---

A quick update on this:

Sven, who is on PTO today, bumped the database CPU cores to 80 yesterday, and database CPU usage has dropped from 100% to about 80% during peak hours.

We learned from GCP support that Cloud SQL now allows scaling DB instances to up to 96 CPUs via the API, so I decided to try going up to 80 CPUs. This looked very promising yesterday, but today the permissions DB ran out of memory again. I increased the memory for the permissions DB even further – it now has 143GB of RAM for a dataset of approximately 24GB total, which seems a bit out of control. I'll look into this tomorrow when I'm back from PTO.
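
(For context, the ~24GB dataset figure can be checked on the permissions DB with something along these lines, using standard Postgres size functions:)

    -- Total on-disk size of the current database, and its largest tables.
    SELECT pg_size_pretty(pg_database_size(current_database())) AS database_size;

    SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS total_size
      FROM pg_stat_user_tables
     ORDER BY pg_total_relation_size(relid) DESC
     LIMIT 10;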

The service has been mostly stable for the last 6 days. The permissions db is occasionally running out of memory, leading to very short spikes of higher error rates. Since these spikes are short, and the errors are not directly visible to normal users, I think this state is Good Enough for now, in particular given that the change to adjust traffic on the client side should land in release in a week.

The new database deployment is quite expensive to run, so we should look into reducing the sync frequency to keep the costs reasonable. Let's first see how the reduction to 80% of the traffic goes, though.

Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED

Ethan also implemented an optimisation for one of the most frequent queries to the permissions db that may help reduce the load on the permissions db, see https://github.com/Kinto/kinto/pull/2475.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

The PR linked in comment 29 is now deployed to production.

Neither memory usage nor CPU load is measurably better than they were before the change, so I guess the change didn't achieve much. As the cache warms up, the average time for the query we optimised continues to go down. We are now at an average execution time of 3.2ms at a cache hit rate of 98.3%. The average time is still dominated by the 1.7% cache misses – re-executing a recent query takes about 0.1ms. So I guess there just isn't much we can do with optimising the queries.
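
(Per-query timings and cache hit rates like the ones above can be pulled from pg_stat_statements, assuming that extension is enabled on the instance; a sketch, using the Postgres 10 column names:)

    -- Hottest statements by call count, with mean execution time and cache hit rate.
    SELECT left(query, 60) AS query,
           calls,
           round(mean_time::numeric, 2) AS mean_ms,
           round(100.0 * shared_blks_hit
                 / nullif(shared_blks_hit + shared_blks_read, 0), 1) AS hit_rate_pct
      FROM pg_stat_statements
     ORDER BY calls DESC
     LIMIT 10;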

With Firefox 75 being released to end users, we already see about 22% of the kintowe traffic coming from the new version.

The new version is expected to reduce the load on kintowe by 20%, and the initial data we see so far is in line with a 20% reduction, so everything seems to be working as expected. At this point, I don't expect kintowe to be overloaded again anytime soon.

However, we need to discuss whether we want to further reduce the sync frequency on the client side. We needed to scale the service up quite a bit, which adds significant operational costs. Syncing every 20 minutes should be acceptable in my opinion; it corresponds to a 50% reduction compared to the original state.

We now have about 35% of traffic coming from the newly released Firefox 75, so we expect an overall reduction of traffic by 7% (35% of clients times the 20% per-client reduction). The current total load is about 9% below the load one week ago, which is within the expected variance of the data.

Kintowe has stabilised, so I'm marking this ticket as fixed. Possible cost savings are discussed in the follow-up bug 1630222.

Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED