Closed Bug 1562882 Opened 4 months ago Closed 3 months ago

Upgrade DB instances to current Amazon instance types

Categories

(Tree Management :: Treeherder: Infrastructure, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: armenzg)

Attachments

(7 files)

+++ This bug was initially created as a clone of Bug #1561415 +++

I want to change treeherder-prod, treeherder-prod-ro and treeherder-dev from the older-generation M4 instance types to the current-generation M5 instance types.

This is similar to bug 1561415, which upgraded treeherder-stage.

This will hopefully help with the performance issues tracked in bug 1553199.

Keeping sheriffs in the loop that we're going to upgrade the production database. This should not cause any downtime, as the instance is deployed Multi-AZ.

dividehex was not available yesterday. NI to let us know when it will be possible to upgrade. Thanks!

Flags: needinfo?(jwatkins)

I can proceed with the upgrades today, or we can wait until next week. Tomorrow is a US holiday (July 4th) and Fridays are not a good day to commit to operational changes. If today doesn't work, we can wait until Monday morning.

Flags: needinfo?(jwatkins) → needinfo?(armenzg)

via slack:
Armen (armenzg) [9:10 AM]
today works

Flags: needinfo?(armenzg)

I'm going to roll out these upgrades one at a time, starting with the prod-ro instance, which just completed.

aws_db_instance.treeherder-prod-ro-rds: Modifications complete after 6m29s (ID: treeherder-prod-ro)

Attached image image.png

There was a blip in database inserts.
There were 1372 store jobs operations that failed.
Nevertheless, these are celery tasks that are retryable:
https://github.com/mozilla/treeherder/blob/master/treeherder/etl/tasks/pulse_tasks.py#L11-L19

Hopefully we have not lost any tasks but we might have.
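
To illustrate why a short database interruption need not lose work when tasks are retryable, here is a minimal sketch of retry with exponential backoff in plain Python. It is not Treeherder's or Celery's implementation; `TransientDatabaseError`, `run_with_retries`, `make_flaky_store_job`, and the backoff values are hypothetical names and numbers chosen for illustration only.

```python
import time


class TransientDatabaseError(Exception):
    """Hypothetical stand-in for an error raised while the database fails over."""


def run_with_retries(operation, max_retries=5, base_delay=0.01, sleep=time.sleep):
    """Call operation(); on a transient error, wait and retry with exponential backoff.

    Re-raises the last error once max_retries retries are used up, which is
    the case where a task could actually be lost.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except TransientDatabaseError:
            if attempt >= max_retries:
                raise
            # Delay doubles on every retry: base, 2*base, 4*base, ...
            sleep(base_delay * (2 ** attempt))
            attempt += 1


def make_flaky_store_job(failures_before_success):
    """Build a fake store operation that fails a fixed number of times first."""
    state = {"calls": 0}

    def store_job():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientDatabaseError("database unavailable during failover")
        return state["calls"]

    return store_job
```

With this sketch, an operation that fails three times during the blip still succeeds on the fourth call, while one that keeps failing past `max_retries` surfaces the error. That second case is the "we might have lost tasks" scenario.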

This screenshot is to show the queue depth for the last week since the upgrade.

Attached image Swap usage

I just realized I can compare two instances.

I did not realize the Swap usage improvements until now.

The treeherder-prod upgrade was started and the failover has completed, but the instance is still in the "modifying" state. This may take a while to complete.

Wed, 03 Jul 2019 16:37:42 GMT Applying modification to database instance class
Wed, 03 Jul 2019 16:45:06 GMT DB instance shutdown
Wed, 03 Jul 2019 16:45:32 GMT Multi-AZ instance failover started.
Wed, 03 Jul 2019 16:45:41 GMT DB instance restarted
Wed, 03 Jul 2019 16:46:17 GMT Multi-AZ instance failover completed
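
As a rough way to read the interruption window off these event lines, the sketch below parses the timestamps and measures the gap between two events. The event text is copied from the log above; the helper names `parse_events` and `seconds_between` are made up for this example, and it assumes every line starts with a timestamp in exactly this format.

```python
from datetime import datetime, timezone

# Event lines copied from the RDS event log in this comment.
EVENT_LOG = """\
Wed, 03 Jul 2019 16:37:42 GMT Applying modification to database instance class
Wed, 03 Jul 2019 16:45:06 GMT DB instance shutdown
Wed, 03 Jul 2019 16:45:32 GMT Multi-AZ instance failover started.
Wed, 03 Jul 2019 16:45:41 GMT DB instance restarted
Wed, 03 Jul 2019 16:46:17 GMT Multi-AZ instance failover completed
"""

TIMESTAMP_FORMAT = "%a, %d %b %Y %H:%M:%S GMT"
TIMESTAMP_LENGTH = len("Wed, 03 Jul 2019 16:37:42 GMT")


def parse_events(log_text):
    """Return a list of (UTC datetime, message) tuples, one per non-empty line."""
    events = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        stamp = datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
        message = line[TIMESTAMP_LENGTH:].strip()
        events.append((stamp.replace(tzinfo=timezone.utc), message))
    return events


def seconds_between(events, start_prefix, end_prefix):
    """Seconds from the first event matching start_prefix to the first matching end_prefix."""
    start = next(t for t, msg in events if msg.startswith(start_prefix))
    end = next(t for t, msg in events if msg.startswith(end_prefix))
    return int((end - start).total_seconds())


events = parse_events(EVENT_LOG)
shutdown_to_failover_done = seconds_between(
    events, "DB instance shutdown", "Multi-AZ instance failover completed"
)
```

For the lines above, `shutdown_to_failover_done` is 71 seconds, and the failover itself (started to completed) took 45 seconds.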

Since the upgrade is taking much longer than I anticipated, I've filed a ticket with AWS to be sure it isn't stuck on their side.
"""
Hello, we initiated an upgrade on our production multi-AZ mysql instance today in order to migrate from db.m4.2xlarge to db.m5.2xlarge. The apply immediately option was used (via terraform/aws api). It looks like the failover successfully completed and the application utilizing the rds only saw a ~min or so of interruption during the failover. But currently, the rds instance is still showing 'modifying' and hasn't completed the upgrade as of yet. It has been over 3 hours.

Is this normal? When can we expect the operation to complete?
"""

Attached image Brief CloudAMQP alerts

Just keeping track that we passed a threshold and we came back within normal levels.

AWS responded. TL;DR: be patient, all is well.

"""
Hello,

Thank you for contacting the AWS Premium Support Database team, my name is Kajol and its my pleasure to assist you with this case today.

I understand you have concerns regarding scaling up your instance class from db.m4.2xlarge to db.m5.2xlarge on a multi AZ RDS instance ‘treeherder-prod’ and it is still in modifying. I will be happy to check on it for you.

I looked into the RDS instance ‘treeherder-prod’ and I can confirm that the API call was registered for modifying instance class at 2019-07-03 16:37:20 UTC. I can also confirm the following events which states that MAZ failover completed successfully and hence 60-120 sec of downtime.

2019-07-03 16:46:17 UTC Multi-AZ instance failover completed User

2019-07-03 16:45:41 UTC DB instance restarted User

2019-07-03 16:45:32 UTC Multi-AZ instance failover started. User

2019-07-03 16:45:06 UTC DB instance shutdown User

2019-07-03 16:37:42 UTC Applying modification to database instance class User

Currently your instance is available to use but the modifying status is because of the secondary. Looking further I could see that secondary is still being scaled up. Further checking the workflows and all details and researching on it, I can see that the secondary instance was having capacity issues in that Availability zone to scale to db.m5.2xlarge and hence had to be moved to different AZ and thus not only the underlying host but also volumes had to be replaced to be placed in the AZ and thus it is taking time.

With growing capacity’s we would want our customer to be in AZ with sufficient resources and hence this step is important and is taking a little while, it is not normal and occurs on a rare basis, no action is required from your end. I can confirm that the execution is progressing smoothly and its not stuck, please wait for a while and it should complete.

Hope this answers your question. Please reach out to me, should you have any further questions or concerns. I will be happy to help you resolve your issue.

We value your feedback, please take a moment to share your experience. You'll see a survey at the end of this communication. If you'd like to let me know how I'm doing, take few seconds to fill that out or rate this correspondence. This will help us to continue improving our support experience for you. I hope you have an amazing day ahead!

Best regards,

Kajol A.
Amazon Web Services
"""

It's still in progress. I'm glad it has not failed. I can't believe we're in this situation.
For now, Treeherder is still working and the performance is similar to prior to the update (which is good).
If we ever need to do this again, let's see if we have a different approach that would work OR assume it's perfectly fine to be waiting like this.
I'm so surprised we completed the treeherder-stage upgrade within an hour even though it contains the same amount of data.

Anyway, it is what it is.

Jake, this completed this morning:

Fri, 05 Jul 2019 00:46:35 GMT Finished applying modification to DB instance class

There are various messages after that time in case you want to review any of them (like "Amazon RDS has encountered a fatal error running enhanced monitoring on your instance treeherder-prod and this feature has been disabled.This is likely due to the rds-monitoring-role not being present or configured incorrectly in your account. Please refer to the troubleshooting section in the Amazon RDS documentation for further details.")

I didn't see anything worrisome, but I'm mentioning it in case something there is important to you.

When you have a chance could you please upgrade treeherder-dev? That's the last one.

(In reply to Armen [:armenzg] from comment #16)

There are various messages after that time in case you want to review any of them (like "Amazon RDS has encountered a fatal error running enhanced monitoring on your instance treeherder-prod and this feature has been disabled.This is likely due to the rds-monitoring-role not being present or configured incorrectly in your account. Please refer to the troubleshooting section in the Amazon RDS documentation for further details.")

I've updated the AWS support ticket with the reply below. We will need to re-enable enhanced monitoring if it is safe to do so. I'll upgrade treeherder-dev after treeherder-prod is squared away.

"""
Thank you for your explanation of the matter. The instance did ultimately finish after 1 day, 8 hours, 8 minutes and 53 seconds.
Wed, 03 Jul 2019 16:37:42 GMT Applying modification to database instance class
Fri, 05 Jul 2019 00:46:35 GMT Finished applying modification to DB instance class

Although, it looks as though enhanced monitoring was disabled as indicated by the errors in the rds log:
Fri, 05 Jul 2019 00:46:16 GMT Amazon RDS has encountered a fatal error running enhanced monitoring on your instance treeherder-prod and this feature has been disabled. This is likely due to the rds-monitoring-role not being present or configured incorrectly in your account. Please refer to the troubleshooting section in the Amazon RDS documentation for further details.
Fri, 05 Jul 2019 00:47:40 GMT Monitoring Interval changed to 0

Enhanced monitoring was enabled and working before the RDS scale up operation. As far as I can tell, the 'rds-monitoring-role' IAM role is still properly defined with the 'AmazonRDSEnhancedMonitoringRole' policy attached. We use the same role for the 'treeherder-stage' rds which underwent the same scale up operation without incident.

Why did this happen? Is it safe to apply enhanced monitoring again?
"""
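
The elapsed time quoted in the ticket can be checked from the two event timestamps. This is a small sketch using only the standard library; the `describe` helper is a name invented for this example.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%a, %d %b %Y %H:%M:%S GMT"

# Start and finish timestamps copied from the RDS event log quoted in the ticket.
started = datetime.strptime("Wed, 03 Jul 2019 16:37:42 GMT", TIMESTAMP_FORMAT)
finished = datetime.strptime("Fri, 05 Jul 2019 00:46:35 GMT", TIMESTAMP_FORMAT)


def describe(delta):
    """Format a timedelta as days, hours, minutes and seconds."""
    hours, remainder = divmod(delta.seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{delta.days} day(s), {hours} hours, {minutes} minutes, {seconds} seconds"


elapsed = finished - started
print(describe(elapsed))  # prints "1 day(s), 8 hours, 8 minutes, 53 seconds"
```

Both timestamps are in GMT, so no time zone conversion is needed before subtracting them.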

Response from AWS support:

"""

Hello,

Thank you for getting back.

I can confirm that EM was disabled, I cannot confirm if this was related to instance scale you have performed, to answer why this happened and can you enable it back again, I have reached out to my internal team with my analysis and getting further information on this.

2019-07-05 00:47:40 UTC Monitoring Interval changed to 0

2019-07-05 00:47:16 UTC Amazon RDS has encountered a fatal error running enhanced monitoring on your instance treeherder-prod and this feature has been disabled.This is likely due to the rds-monitoring-role not being present or configured incorrectly in your account. Please refer to the troubleshooting section in the Amazon RDS documentation for further details.

I will get back to you as soon as I have any updates.

Best regards,

Kajol A.
Amazon Web Services
"""

I've re-enabled enhanced monitoring on treeherder-prod.
aws_db_instance.treeherder-prod-rds: Modifications complete after 2m27s (ID: treeherder-prod)

I've started the upgrade on treeherder-dev
Sat, 06 Jul 2019 01:52:46 GMT Applying modification to database instance class

(In reply to Jake Watkins [:dividehex] from comment #20)

I've started the upgrade on treeherder-dev
Sat, 06 Jul 2019 01:52:46 GMT Applying modification to database instance class

Completed. aws_db_instance.treeherder-dev-rds: Modifications complete after 5m47s (ID: treeherder-dev)
Sat, 06 Jul 2019 01:57:16 GMT Finished applying modification to DB instance class
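
For comparing how long each Terraform apply took, the sketch below pulls the durations out of the "Modifications complete" lines posted in this bug. The three lines are copied from the comments; the regular expression and the `parse_apply_output` helper are assumptions written for this example, and the pattern only handles the `XmYs` and `Ys` forms seen here. Note that the treeherder-prod line is the apply that re-enabled enhanced monitoring, not the instance class change itself.

```python
import re

# Terraform apply output lines copied from the comments in this bug.
APPLY_OUTPUT = """\
aws_db_instance.treeherder-prod-ro-rds: Modifications complete after 6m29s (ID: treeherder-prod-ro)
aws_db_instance.treeherder-prod-rds: Modifications complete after 2m27s (ID: treeherder-prod)
aws_db_instance.treeherder-dev-rds: Modifications complete after 5m47s (ID: treeherder-dev)
"""

LINE_PATTERN = re.compile(
    r"^(?P<resource>\S+): Modifications complete after "
    r"(?:(?P<minutes>\d+)m)?(?P<seconds>\d+)s \(ID: (?P<id>[^)]+)\)$"
)


def parse_apply_output(text):
    """Map each instance ID to the reported apply duration in seconds."""
    durations = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            minutes = int(match.group("minutes") or 0)
            durations[match.group("id")] = minutes * 60 + int(match.group("seconds"))
    return durations


durations = parse_apply_output(APPLY_OUTPUT)
```

Here `durations` maps treeherder-prod-ro to 389 seconds, treeherder-prod to 147 seconds and treeherder-dev to 347 seconds.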

Thanks Jake for all your work here.
So far it all seems to work well.
I will keep an eye on treeherder-prod this week to see if our bottlenecks are now gone.

I think the biggest gains are the increased Write IOPS (now matching what treeherder-stage was achieving) and all swapping being gone.

Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Type: defect → task