Closed Bug 1562882 Opened 6 years ago Closed 6 years ago

Upgrade DB instances to current Amazon instance types

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: armenzg)

Attachments

(7 files)

Link to GitHub pull-request: https://github.com/mozilla-platform-ops/devservices-aws/pull/105 (6 years ago, GitHub Bugzilla PR Linker, 64 bytes, text/x-github-pull-request)
image.png (6 years ago, Armen [:armenzg], 308.31 KB, image/png)
CPU metrics for treeherder-stage in the last week (6 years ago, Armen [:armenzg], 175.52 KB, image/png)
Read IOPS for treeherder-stage (6 years ago, Armen [:armenzg], 210.37 KB, image/png)
Swap usage (6 years ago, Armen [:armenzg], 83.19 KB, image/png)
Brieft CloudAMQP alerts (6 years ago, Armen [:armenzg], 32.74 KB, image/png)
Write IOPS sum 6 hours period (6 years ago, Armen [:armenzg], 216.42 KB, image/png)

Armen [:armenzg]

Assignee

Description

6 years ago

+++ This bug was initially created as a clone of Bug #1561415 +++

I want to move treeherder-prod, treeherder-prod-ro and treeherder-dev from the older-generation M4 instance types to the equivalent current-generation M5 instance types.

This is similar to bug 1561415, which upgraded treeherder-stage.

This will hopefully help with the performance issues tracked in bug 1553199.

GitHub Bugzilla PR Linker

Comment 1

6 years ago

Attached file Link to GitHub pull-request: https://github.com/mozilla-platform-ops/devservices-aws/pull/105 — Details

Armen [:armenzg]

Assignee

Comment 2

6 years ago

Keeping sheriffs in the loop: we're going to upgrade the production database. This should not cause any downtime, since the instance is deployed across multiple availability zones.

Armen [:armenzg]

Assignee

Comment 3

6 years ago

dividehex was not available yesterday. Setting NI so he can let us know when the upgrade will be possible. Thanks!

Flags: needinfo?(jwatkins)

Jake Watkins [:dividehex]

Comment 4

6 years ago

I can proceed with the upgrades today, or we can wait until next week. Tomorrow is a US holiday (July 4th), and Fridays are not a good day to commit to operational changes. If today doesn't work, we can wait until Monday morning.

Flags: needinfo?(jwatkins) → needinfo?(armenzg)

Jake Watkins [:dividehex]

Comment 5

6 years ago

via slack:
Armen (armenzg) [9:10 AM]
today works

Flags: needinfo?(armenzg)

Jake Watkins [:dividehex]

Comment 6

6 years ago

I'm going to roll out these upgrades one at a time, starting with the prod-ro instance, which just completed.

aws_db_instance.treeherder-prod-ro-rds: Modifications complete after 6m29s (ID: treeherder-prod-ro)

Armen [:armenzg]

Assignee

Comment 7

6 years ago

Attached image image.png — Details

There was a blip in database inserts.
1372 store-jobs operations failed.
However, these are Celery tasks, and they are retryable:
https://github.com/mozilla/treeherder/blob/master/treeherder/etl/tasks/pulse_tasks.py#L11-L19

Hopefully we have not lost any tasks but we might have.
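
For anyone reading later, here is a minimal sketch of why retryable tasks can ride out a short database blip. It is not Treeherder's actual retry code (that lives in the file linked above); `run_with_retries` and `FlakyStore` are hypothetical names made up for this illustration, and the backoff schedule is an assumption.

```python
import time


def run_with_retries(operation, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**(attempt - 1) seconds between attempts and
    re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


class FlakyStore:
    """Hypothetical stand-in for a database insert that fails during a failover."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("database unavailable")
        return "stored"


# The first two attempts fail, the third succeeds.
store = FlakyStore(failures=2)
result = run_with_retries(store, max_attempts=3, base_delay=0.0)
print(result, store.calls)
```

A task is only lost in this model if the outage outlasts all of its attempts, which is why some loss cannot be ruled out.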

Armen [:armenzg]

Assignee

Comment 8

6 years ago

Attached image CPU metrics for treeherder-stage in the last week — Details

This screenshot shows the queue depth over the last week, since the upgrade.

Armen [:armenzg]

Assignee

Comment 9

6 years ago

Attached image Read IOPS for treeherder-stage — Details

Armen [:armenzg]

Assignee

Comment 10

6 years ago

Attached image Swap usage — Details

I just realized I can compare two instances.

I had not noticed the swap usage improvements until now.

Jake Watkins [:dividehex]

Comment 11

6 years ago

The treeherder-prod upgrade was started and the failover has completed, but the instance is still in the "modifying" state. This may take a while to complete.

Wed, 03 Jul 2019 16:37:42 GMT Applying modification to database instance class
Wed, 03 Jul 2019 16:45:06 GMT DB instance shutdown
Wed, 03 Jul 2019 16:45:32 GMT Multi-AZ instance failover started.
Wed, 03 Jul 2019 16:45:41 GMT DB instance restarted
Wed, 03 Jul 2019 16:46:17 GMT Multi-AZ instance failover completed
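
A rough way to read the interruption window off the events above is to parse the timestamps and subtract. This is only a sketch: `parse_events` and `seconds_between` are hypothetical helpers, and treating "DB instance shutdown" to "failover completed" as the outage is my assumption, not something RDS reports directly.

```python
from email.utils import parsedate_to_datetime

EVENTS = """\
Wed, 03 Jul 2019 16:37:42 GMT Applying modification to database instance class
Wed, 03 Jul 2019 16:45:06 GMT DB instance shutdown
Wed, 03 Jul 2019 16:45:32 GMT Multi-AZ instance failover started.
Wed, 03 Jul 2019 16:45:41 GMT DB instance restarted
Wed, 03 Jul 2019 16:46:17 GMT Multi-AZ instance failover completed
"""


def parse_events(text):
    """Return (timestamp, message) pairs from 'date GMT message' lines."""
    events = []
    for line in text.splitlines():
        stamp, _, message = line.partition(" GMT ")
        events.append((parsedate_to_datetime(stamp + " GMT"), message))
    return events


def seconds_between(events, start_message, end_message):
    """Seconds elapsed between the two named events."""
    times = {message: when for when, message in events}
    return (times[end_message] - times[start_message]).total_seconds()


events = parse_events(EVENTS)
outage = seconds_between(
    events, "DB instance shutdown", "Multi-AZ instance failover completed"
)
print(outage)  # 71.0
```

Under that assumption the window is a little over a minute.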

Jake Watkins [:dividehex]

Comment 12

6 years ago

Since the upgrade is taking much longer than I anticipated, I've filed a ticket with AWS to be sure it isn't stuck on their side.
"""
Hello, we initiated an upgrade on our production Multi-AZ MySQL instance today in order to migrate from db.m4.2xlarge to db.m5.2xlarge. The apply immediately option was used (via Terraform/AWS API). It looks like the failover completed successfully and the application using the RDS instance only saw a minute or so of interruption during the failover. But the RDS instance is still showing 'modifying' and hasn't completed the upgrade yet. It has been over 3 hours.

Is this normal? When can we expect the operation to complete?
"""
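
For reference, the request described in the ticket can be sketched as the parameters one would pass to the RDS API. The actual change here was applied through Terraform, so this is only an illustration. `to_m5` and `build_modify_request` are hypothetical helpers. The keyword names are the ones boto3's `modify_db_instance` accepts. The sketch does not check that the target size actually exists in the M5 family.

```python
import re


def to_m5(instance_class: str) -> str:
    """Map an M4 RDS instance class to the same size in the M5 family."""
    match = re.fullmatch(r"db\.m4\.(\w+)", instance_class)
    if match is None:
        raise ValueError(f"not an M4 instance class: {instance_class!r}")
    return f"db.m5.{match.group(1)}"


def build_modify_request(instance_id: str, current_class: str) -> dict:
    """Build keyword arguments for an RDS modify_db_instance call.

    Nothing is sent to AWS here; the dict is only constructed.
    """
    return {
        "DBInstanceIdentifier": instance_id,
        "DBInstanceClass": to_m5(current_class),
        # Apply now instead of waiting for the next maintenance window.
        "ApplyImmediately": True,
    }


request = build_modify_request("treeherder-prod", "db.m4.2xlarge")
print(request)
```

With boto3 installed and credentials configured, the dict could be passed as `client.modify_db_instance(**request)`.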

Armen [:armenzg]

Assignee

Comment 13

6 years ago

Attached image Brieft CloudAMQP alerts — Details

Just noting for the record that we briefly crossed an alert threshold and then came back within normal levels.

Jake Watkins [:dividehex]

Comment 14

6 years ago

AWS responded. TL;DR: be patient, all is well.

"""
Hello,

Thank you for contacting the AWS Premium Support Database team, my name is Kajol and its my pleasure to assist you with this case today.

I understand you have concerns regarding scaling up your instance class from db.m4.2xlarge to db.m5.2xlarge on a multi AZ RDS instance ‘treeherder-prod’ and it is still in modifying. I will be happy to check on it for you.

I looked into the RDS instance ‘treeherder-prod’ and I can confirm that the API call was registered for modifying instance class at 2019-07-03 16:37:20 UTC. I can also confirm the following events which states that MAZ failover completed successfully and hence 60-120 sec of downtime.

2019-07-03 16:46:17 UTC Multi-AZ instance failover completed User

2019-07-03 16:45:41 UTC DB instance restarted User

2019-07-03 16:45:32 UTC Multi-AZ instance failover started. User

2019-07-03 16:45:06 UTC DB instance shutdown User

2019-07-03 16:37:42 UTC Applying modification to database instance class User

Currently your instance is available to use but the modifying status is because of the secondary. Looking further I could see that secondary is still being scaled up. Further checking the workflows and all details and researching on it, I can see that the secondary instance was having capacity issues in that Availability zone to scale to db.m5.2xlarge and hence had to be moved to different AZ and thus not only the underlying host but also volumes had to be replaced to be placed in the AZ and thus it is taking time.

With growing capacity’s we would want our customer to be in AZ with sufficient resources and hence this step is important and is taking a little while, it is not normal and occurs on a rare basis, no action is required from your end. I can confirm that the execution is progressing smoothly and its not stuck, please wait for a while and it should complete.

Hope this answers your question. Please reach out to me, should you have any further questions or concerns. I will be happy to help you resolve your issue.

We value your feedback, please take a moment to share your experience. You'll see a survey at the end of this communication. If you'd like to let me know how I'm doing, take few seconds to fill that out or rate this correspondence. This will help us to continue improving our support experience for you. I hope you have an amazing day ahead!

Best regards,

Kajol A.
Amazon Web Services
"""

Armen [:armenzg]

Assignee

Comment 15

6 years ago

It's still in progress. I'm glad it has not failed. I can't believe we're in this situation.
For now, Treeherder is still working and performance is similar to what it was before the upgrade (which is good).
If we ever need to do this again, let's see whether a different approach would work, OR accept that it's perfectly fine to wait like this.
I'm surprised the treeherder-stage upgrade completed within an hour even though it contains the same amount of data.

Anyway, it is what it is.

Armen [:armenzg]

Assignee

Comment 16

6 years ago

Jake, this completed this morning:

Fri, 05 Jul 2019 00:46:35 GMT Finished applying modification to DB instance class

There are various messages after that time in case you want to review any of them (like "Amazon RDS has encountered a fatal error running enhanced monitoring on your instance treeherder-prod and this feature has been disabled.This is likely due to the rds-monitoring-role not being present or configured incorrectly in your account. Please refer to the troubleshooting section in the Amazon RDS documentation for further details.")

I didn't see anything worrisome, but I'm flagging it in case something there is important to you.

When you have a chance could you please upgrade treeherder-dev? That's the last one.

Jake Watkins [:dividehex]

Comment 17

6 years ago

(In reply to Armen [:armenzg] from comment #16)

There are various messages after that time in case you want to review any of them (like "Amazon RDS has encountered a fatal error running enhanced monitoring on your instance treeherder-prod and this feature has been disabled.This is likely due to the rds-monitoring-role not being present or configured incorrectly in your account. Please refer to the troubleshooting section in the Amazon RDS documentation for further details.")

I've updated the AWS support ticket with the reply below. We will need to re-enable enhanced monitoring if it is safe to do so. I'll upgrade treeherder-dev after treeherder-prod is squared away.

"""
Thank you for your explanation of the matter. The modification did ultimately finish after 1 day, 8 hours, 8 minutes and 53 seconds.
Wed, 03 Jul 2019 16:37:42 GMT Applying modification to database instance class
Fri, 05 Jul 2019 00:46:35 GMT Finished applying modification to DB instance class

Although, it looks as though enhanced monitoring was disabled as indicated by the errors in the rds log:
Fri, 05 Jul 2019 00:46:16 GMT Amazon RDS has encountered a fatal error running enhanced monitoring on your instance treeherder-prod and this feature has been disabled. This is likely due to the rds-monitoring-role not being present or configured incorrectly in your account. Please refer to the troubleshooting section in the Amazon RDS documentation for further details.
Fri, 05 Jul 2019 00:47:40 GMT Monitoring Interval changed to 0

Enhanced monitoring was enabled and working before the RDS scale up operation. As far as I can tell, the 'rds-monitoring-role' IAM role is still properly defined with the 'AmazonRDSEnhancedMonitoringRole' policy attached. We use the same role for the 'treeherder-stage' RDS instance, which underwent the same scale up operation without incident.

Why did this happen? Is it safe to apply enhanced monitoring again?
"""
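
The elapsed time quoted in the ticket can be double-checked from the two event timestamps. A small sketch follows; `breakdown` is a hypothetical helper, and the timestamps are copied from the RDS events and treated as UTC.

```python
from datetime import datetime, timezone

# "Applying modification to database instance class"
start = datetime(2019, 7, 3, 16, 37, 42, tzinfo=timezone.utc)
# "Finished applying modification to DB instance class"
finish = datetime(2019, 7, 5, 0, 46, 35, tzinfo=timezone.utc)


def breakdown(delta):
    """Split a timedelta into (days, hours, minutes, seconds)."""
    hours, remainder = divmod(delta.seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return delta.days, hours, minutes, seconds


elapsed = finish - start
print(breakdown(elapsed))  # (1, 8, 8, 53)
```

This agrees with the 1 day, 8 hours, 8 minutes and 53 seconds stated above.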

Jake Watkins [:dividehex]

Comment 18

6 years ago

Response from AWS support:

"""

Hello,

Thank you for getting back.

I can confirm that EM was disabled, I cannot confirm if this was related to instance scale you have performed, to answer why this happened and can you enable it back again, I have reached out to my internal team with my analysis and getting further information on this.

2019-07-05 00:47:40 UTC Monitoring Interval changed to 0

2019-07-05 00:47:16 UTC Amazon RDS has encountered a fatal error running enhanced monitoring on your instance treeherder-prod and this feature has been disabled.This is likely due to the rds-monitoring-role not being present or configured incorrectly in your account. Please refer to the troubleshooting section in the Amazon RDS documentation for further details.

I will get back to you as soon as I have any updates.

Best regards,

Kajol A.
Amazon Web Services
"""

Jake Watkins [:dividehex]

Comment 19

6 years ago

I've re-enabled enhanced monitoring on treeherder-prod.
aws_db_instance.treeherder-prod-rds: Modifications complete after 2m27s (ID: treeherder-prod)

Jake Watkins [:dividehex]

Comment 20

6 years ago

I've started the upgrade on treeherder-dev
Sat, 06 Jul 2019 01:52:46 GMT Applying modification to database instance class

Jake Watkins [:dividehex]

Comment 21

6 years ago

(In reply to Jake Watkins [:dividehex] from comment #20)

I've started the upgrade on treeherder-dev
Sat, 06 Jul 2019 01:52:46 GMT Applying modification to database instance class

Completed. aws_db_instance.treeherder-dev-rds: Modifications complete after 5m47s (ID: treeherder-dev)
Sat, 06 Jul 2019 01:57:16 GMT Finished applying modification to DB instance class

Armen [:armenzg]

Assignee

Comment 22

6 years ago

Attached image Write IOPS *sum* *6 hours* period — Details

Thanks Jake for all your work here.
So far it all seems to work well.
I will keep an eye on treeherder-prod this week to see if our bottlenecks are now gone.

I think the biggest gains are the increased Write IOPS (now matching what treeherder-stage was achieving) and all swapping being gone.

Armen [:armenzg]

Assignee

Updated

6 years ago

Status: NEW → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED

Armen [:armenzg]

Assignee

Updated

6 years ago

Type: defect → task
