Open Bug 1637361 Opened 4 years ago Updated 2 years ago

Databases swapping

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P3)

Tracking

(Not tracked)

People

(Reporter: armenzg, Unassigned)

Details

Attachments

(1 file)

I've opened an Amazon case. I want to know what the impact could be and what we can do.

Assignee: nobody → armenzg

My question:

Hello,
I've noticed that my treeherder-prod instance has been increasing its SWAP.
What's the impact? How can I determine what's causing it?

I noticed the same on treeherder-prototype which I rebooted later last week.

Thanks for your guidance.

Armen

AWS's response:

From the case note, I understand that the Swap Usage has been increasing on your RDS MySQL instance “treeherder-prod” and you want guidance on what the impact is and how to determine what is causing it to keep increasing. I will be happy to give you answers to your questions and guidance as requested.

After reviewing the cloud watch metrics for your MySQL instance, I have indeed observed the SwapUsage metric has been on a constant rise and steadily growing over a long period of time as seen in the below graph.
SwapUsage:
https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#metricsV2:graph=~%28region~%27us-east-1~metrics~%28~%28~%27AWS*2fRDS~%27SwapUsage~%27DBInstanceIdentifier~%27treeherder-prod%29%29~period~300~stat~%27Average~start~%272020-05-07T18*3a32*3a19Z~end~%272020-05-12T18*3a32*3a19Z%29

Question 1: What is the impact?
Usage of swap means there is memory pressure on the instance, that is the result/output of the read operation is been fetched from the disk instead of the memory. Please note that fetching results from the disk takes a longer return time compared to fetching from the memory which is why we recommend provision enough memory by selecting the instance class configuration which fits into the expected workload.

Swap memory is used when a system requires more memory than is allocated. An IO request is a disk-operation which could be a read, or a write. [1]
Also, the Swap usage is greater than 0 on the cloud watch metric which shows there is resource contention in the DB instance. Swap usage is usually meant to be at 0, if it is more than 0 then there is resource contention.
To increase the memory of your instance to accommodate the read/write operations and reduce Swap usage, I recommend you change your instance size to a larger instance size that has more memory. [2]

I also looked at the freeable memory on your instance and can see it has been intermittent as well.
FreeableMemory:
https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#metricsV2:graph=~%28region~%27us-east-1~metrics~%28~%28~%27AWS*2fRDS~%27FreeableMemory~%27DBInstanceIdentifier~%27treeherder-prod%29%29~period~300~stat~%27Average~start~%272020-05-07T18*3a32*3a19Z~end~%272020-05-12T18*3a32*3a19Z%29

The freeable memory includes the amount of physical memory left unused by the system plus the total amount of buffer or page cache memory that are free and available. So it's freeable memory across the entire system. While MySQL is the main consumer of memory on the db instance we do have internal processes in addition to the OS that use up a small amount of additional memory. Simplifying, Freeable=Free+Buffer+Cache. Free memory is the total of unassigned memory.

Question 2: How can I determine what's causing it?
The behavior we expect to see when it comes to swap usage is for the freeable to begin to drop as it is being used by system processes, once this drops to a certain level then swap space will be used in order to compensate for the low memory available to the system. This should be reflected in the cloud watch metrics by showing a drop in freeable memory, and an increase in swap usage. [3]

You should also see an increase in CPU utilization as reading/writing to disk is a more intensive process. By default RDS instances have swap enabled, the amount of available swap usage will vary depending on the instance class.

This happens at the OS level, however as RDS instances do not allow host access you do not have full control over what is installed. This effectively means that the swap usage could only be used for the DB process or by our monitoring software.

My recommendations:

  1. To increase the memory of your instance to accommodate the read/write operations and reduce Swap usage, I recommend you change your instance size to a larger instance size that has more memory. [3] Scale up to the instance class with more compute power that accommodates the workload carried out on your DB instances.
  2. Reviewing the poor performing queries in your database and tune them to reduce the workload on your DB instance.

I would recommend that you create CloudWatch alarms to monitor your freeable memory metrics so when it drops below a certain threshold you will be alerted.
Also, you can use Performance Insights (PI) to identify the SQL queries and wait events that are consuming excessive swap or memory on the DB instance and tune them.
PI will also allow you to see the true value for freeable memory. This will provide much more detail on the amount of usage and the OS load on the instance at the time.

I hope this answers your questions. Please let me know if there is anything else I can do for you, and I will be more than happy to help you. It is my absolute pleasure to be of assistance to you.

Have an AWSome rest of the day.

Resources:
[1] Amazon RDS instance using swap memory - https://aws.amazon.com/premiumsupport/knowledge-center/troubleshoot-rds-swap-memory/
[2] Changing Your DB Instance Class - https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.DBInstanceClass.html#Concepts.DBInstanceClass.Changing
[3]Troubleshoot freeable memory in an Amazon RDS for MySQL - https://aws.amazon.com/premiumsupport/knowledge-center/low-freeable-memory-rds-mysql-mariadb/
[4] How MySQL Uses Memory - https://dev.mysql.com/doc/refman/5.7/en/memory-use.html

Something changed on March 20th where the swap started increasing.
The Swap and freeable memory for stage/dev is more worrysome than production (since it's less than 100MB).

Currently March 20th is the beginning of the degradation and this change seems suspicious:
https://github.com/mozilla/treeherder/commit/ce35ba1b5d50140dd20a8f6f9c84db7ec8a647d5

Other changes on that timeframe:
https://github.com/mozilla/treeherder/commits/master?after=af41934af9871fdb064b2ea5a6c93076ea6dca4b+104

At a high level, the database is better at managing dirty pages than the OS; if only because the database knows what it is doing with those pages. We can reduce the amount of memory used by MySQL by adjusting the parameters mentioned in Troubleshoot freeable memory in an Amazon (above). Reducing MySQL memory usage to avoid swap should improve performance. There is a caveat, decreasing memory usage will increase peak IOPS as more pages are aggressively flushed to disk in the background to make room for future work.

The swap usage may be a result of particular queries that consume alot of memory; in which case we should observe periods of particular slowness when those queries run (or soon afterwards). The swap will linger, despite being not used, just because the OS is lazy. The swap may show artificially high if it is a few big queries that are causing the problem.

Assuming we do not move to an instance with more memory, and assuming there are no big wins by finding problem queries, I suggest

  1. We review the memory settings and number of allowed connections to verify our understanding of MySQL max memory usage. Our review should conclude we are allowing MySQL to use more memory than exists on the machine.
  2. Decide how to reduce the memory usage to fit in available memory: It may be as simple as limiting the number of connections, or limiting the memory resources dedicated to those connections. We might be required to coordinate the total number of connections between the various clients to limit total memory.

I am assuming the above config is applied atop some other configuration; there is no Innodb_buffer_pool_size mentioned in the config.

That's very good info Kyle.
I'm thinking of delaying this until June or July when manifest based scheduling is out of the woods.
If you're up to it before that let me know, otherwise, I would cherish your input when I tackle it.

I am assuming the above config is applied atop some other configuration; there is no Innodb_buffer_pool_size mentioned in the config.
The config is only for values we override.

Assignee: armenzg → nobody
Priority: -- → P3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: