RDS is massively overprovisioned
Categories
(Socorro :: Infra, task)
Tracking
(Not tracked)
People
(Reporter: brian, Assigned: brian)
Details
Our PostgreSQL RDS databases are massively overprovisioned in all environments.
This includes both the instance type and the storage (which currently uses Provisioned IOPS).
We should do whatever is easily done to reduce this and save some money.
At a minimum, I think we can change storage classes without any downtime now, so we should go to gp2.
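For reference, a minimal sketch of what the storage change could look like if done through boto3 rather than the console. The thread does not say which tool was used; the instance identifier and region below are hypothetical placeholders. `modify_db_instance` and its `StorageType` and `ApplyImmediately` parameters are part of the boto3 RDS client.

```python
def storage_change_request(instance_id, apply_immediately=True):
    """Build the arguments for an RDS ModifyDBInstance call that moves storage to gp2."""
    return {
        "DBInstanceIdentifier": instance_id,  # hypothetical identifier, not from the bug
        "StorageType": "gp2",
        # False would defer the change to the next maintenance window.
        "ApplyImmediately": apply_immediately,
    }


def apply_change(request, region="us-west-2"):
    """Send the request to RDS. The region is an assumption for illustration."""
    # Imported lazily so the request can be built and reviewed without the AWS SDK.
    import boto3

    client = boto3.client("rds", region_name=region)
    return client.modify_db_instance(**request)


if __name__ == "__main__":
    # Print the request for review; nothing is sent to AWS here.
    print(storage_change_request("example-stage-db"))
```

Building the request separately from sending it makes it easy to review the exact change before anyone calls `apply_change`.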
Comment 1•5 years ago
I'm +1 on this. I don't have any plans to change how we're using RDS in the foreseeable future.
I'm game for doing this first week of December if that works for you.
Comment 2•5 years ago (Assignee)
Looks like I was wrong about "no downtime". The docs at https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PIOPS.StorageTypes.html say "An immediate outage occurs when the storage type changes."
Oremj, do you know how long the outage lasted when you converted socorro stage, or any of the other DBs that you've converted recently?
Comment 3•5 years ago
For informational purposes, a db outage will affect the following things:
- cronrun jobs -- but they'll recover and run (including backfill where appropriate) as soon as the outage is over
- the webapp -- it'll show HTTP 500 errors during the outage, but will be fine after it's over
- crash report processing -- crash reports that aren't processed or fail to process during the outage will process afterwards
I think that's it.
Comment 4•5 years ago (Assignee)
After reviewing utilization, the other change I think we can make is moving prod from m4.4xlarge to m4.large.
This will also require downtime, on the order of ~5 minutes.
We should stagger the changes, at least a week between each.
Since instance type is the bigger cost savings, I vote we do it first.
Willkg, what do you need to do in advance? Send a description of what we're doing and the date+time to the stability list?
Miles, as secondary do you object to either of these changes?
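A minimal sketch of the plan in this comment: build the instance class change and space the changes at least a week apart. The instance identifier is a hypothetical placeholder. The RDS API spells instance classes with a `db.` prefix, so m4.large becomes `db.m4.large`; `DBInstanceClass` is a real `modify_db_instance` parameter in boto3.

```python
from datetime import date, timedelta

MIN_GAP = timedelta(days=7)  # "at least a week between each"


def instance_class_request(instance_id, new_class="db.m4.large"):
    """Build the arguments for an RDS ModifyDBInstance call that resizes the instance."""
    return {
        "DBInstanceIdentifier": instance_id,  # hypothetical identifier
        "DBInstanceClass": new_class,
        "ApplyImmediately": True,  # expect a short outage while the instance restarts
    }


def stagger(first_day, changes, gap=MIN_GAP):
    """Assign each change a date, spacing consecutive changes by `gap`."""
    return [(first_day + i * gap, change) for i, change in enumerate(changes)]
```

For example, `stagger(some_monday, ["instance type", "storage type"])` returns the two changes paired with dates one week apart, in the order given.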
Comment 5•5 years ago
I think sending an email to the stability list with date/time of expected outage 24 hours in advance is fine. I can do that when we've got it scheduled.
Comment 7•5 years ago
(In reply to Brian Pitts from comment #2)
> Looks like I was wrong about "no downtime". The docs at https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PIOPS.StorageTypes.html say "An immediate outage occurs when the storage type changes."
> Oremj, do you know how long the outage lasted when you converted socorro stage, or any of the other DBs that you've converted recently?
There should be zero downtime when making this change, although there can be degraded I/O performance while it occurs. jbuck, did you experience any problems when making this change to any of your databases?
Comment 8•5 years ago (Assignee)
Willkg, how about we announce this:
- 10AM eastern 12/9 we're going to change the storage type. No user impact expected.
- 10AM eastern 12/16 we're going to change the instance type. 5 minutes of webapp downtime expected.
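A small sketch for double-checking these windows in other time zones, assuming the year is 2019. It uses fixed UTC offsets, which is only valid because both dates fall in December, when US Eastern and Pacific are on standard time; the function name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets: correct for December dates (no daylight saving time in effect).
EST = timezone(timedelta(hours=-5), "EST")
PST = timezone(timedelta(hours=-8), "PST")


def window_in_zones(year, month, day, hour=10):
    """Return the maintenance start time as ISO 8601 strings in several zones."""
    local = datetime(year, month, day, hour, tzinfo=EST)
    return {
        "eastern": local.isoformat(),
        "utc": local.astimezone(timezone.utc).isoformat(),
        "pacific": local.astimezone(PST).isoformat(),
    }


if __name__ == "__main__":
    for day in (9, 16):
        print(window_in_zones(2019, 12, day))
```

For dates that might cross a daylight saving boundary, `zoneinfo.ZoneInfo("America/New_York")` would be the safer choice than a fixed offset.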
Comment 9•5 years ago
I set up these things because it makes it easier to convert time zones:
12/9 at 10am: https://www.timeanddate.com/worldclock/fixedtime.html?msg=Crash+Stats+maintenance%3A+phase+1&iso=20191209T1000&p1=43&am=00
I sent out a heads-up email just now. I'll plan to send out emails before and after each outage.
Comment 10•5 years ago (Assignee)
Thanks Will!
Comment 11•5 years ago (Assignee)
Storage conversion for prod is in progress.
Comment 12•5 years ago (Assignee)
Storage resize finished.
I've kicked off the instance type change in stage, as a preview of doing prod next week.
Comment 13•5 years ago (Assignee)
Stage is back.
Comment 14•5 years ago (Assignee)
Prod instance resize is complete.