Closed Bug 1599664 Opened 5 years ago Closed 5 years ago

RDS is massively overprovisioned

Product/Component: Socorro :: Infra
Type: task
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: brian
Assignee: brian

Our PostgreSQL RDS databases are massively overprovisioned in all environments.

This includes both the instance type and the storage (which currently uses Provisioned IOPS).

We should do whatever is easily done to reduce this and save some money.

At a minimum, I think we can now change storage types without any downtime, so we should switch to gp2.

I'm +1 on this. I don't have any plans to change how we're using RDS in the foreseeable future.

I'm game for doing this first week of December if that works for you.

Looks like I was wrong about "no downtime". The docs at https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PIOPS.StorageTypes.html say "An immediate outage occurs when the storage type changes."

Oremj, do you know how long the outage lasted when you converted socorro stage, or any of the other DBs that you've converted recently?

Flags: needinfo?(oremj)

For informational purposes, a db outage will affect the following things:

  1. cronrun jobs -- but they'll recover and run (including backfill where appropriate) as soon as the outage is over
  2. the webapp -- it'll show HTTP 500 errors during the outage, but will be fine after it's over
  3. crash report processing -- crash reports that aren't processed or fail to process during the outage will process afterwards

I think that's it.

After reviewing utilization, the other change I think we can make is moving prod from m4.4xlarge to m4.large.

This will also require downtime, on the order of 5 minutes.

We should stagger the changes, at least a week between each.

Since instance type is the bigger cost savings, I vote we do it first.
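For reference, here is a minimal sketch of how the two changes could be expressed as parameters to the RDS ModifyDBInstance API, assuming boto3 is used to apply them. The instance identifier is hypothetical, the `db.` prefix is added because RDS instance classes are named that way, and moving off Provisioned IOPS may need further parameters depending on the instance's current settings.

```python
def storage_type_change(instance_id, storage_type="gp2"):
    """Build ModifyDBInstance arguments that switch the storage type."""
    return {
        "DBInstanceIdentifier": instance_id,
        "StorageType": storage_type,
        # Apply now instead of waiting for the next maintenance window.
        "ApplyImmediately": True,
    }


def instance_class_change(instance_id, instance_class="db.m4.large"):
    """Build ModifyDBInstance arguments that switch the instance class."""
    return {
        "DBInstanceIdentifier": instance_id,
        "DBInstanceClass": instance_class,
        "ApplyImmediately": True,
    }


def apply_change(rds_client, change):
    """Send one change. rds_client is assumed to be boto3.client("rds")."""
    return rds_client.modify_db_instance(**change)


# Hypothetical usage, one change per maintenance window:
#   import boto3
#   rds = boto3.client("rds")
#   apply_change(rds, instance_class_change("socorro-prod"))  # name is made up
```

Keeping each change in its own call matches the plan to stagger them rather than combine both in a single modification.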

Willkg, what do you need to do in advance? Send a description of what we're doing and the date+time to the stability list?

Miles, as secondary do you object to either of these changes?

Flags: needinfo?(willkg)
Flags: needinfo?(miles)

I think sending an email to the stability list with date/time of expected outage 24 hours in advance is fine. I can do that when we've got it scheduled.

Flags: needinfo?(willkg)

This is a good change to make, all good by me!

Flags: needinfo?(miles)

(In reply to Brian Pitts from comment #2)

> Looks like I was wrong about "no downtime". The docs at https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PIOPS.StorageTypes.html say "An immediate outage occurs when the storage type changes."
>
> Oremj, do you know how long the outage lasted when you converted socorro stage, or any of the other DBs that you've converted recently?

There should be 0 downtime when making this change, although there can be degraded I/O performance while it occurs. jbuck, did you experience any problems when making this change to any of your databases?

Flags: needinfo?(oremj) → needinfo?(jbuckley)

Willkg, how about we announce this:

  1. 10AM eastern 12/9 we're going to change the storage type. No user impact expected.
  2. 10AM eastern 12/16 we're going to change the instance type. 5 minutes of webapp downtime expected.

Flags: needinfo?(willkg)

I set up these things because it makes it easier to convert time zones:

12/9 at 10am: https://www.timeanddate.com/worldclock/fixedtime.html?msg=Crash+Stats+maintenance%3A+phase+1&iso=20191209T1000&p1=43&am=00

12/16 at 10am: https://www.timeanddate.com/worldclock/fixedtime.html?msg=Crash+stats+maintenance%3A+phase+2+%28website+outage%29&iso=20191216T10&p1=43&am=15
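The same conversion can be sketched in a few lines of Python for anyone who prefers a script to the links. This is only an illustration: the `WINDOWS` names are made up here, and the fixed UTC-5 fallback is an assumption that holds because both dates fall in eastern standard time.

```python
from datetime import datetime, timedelta, timezone

try:
    from zoneinfo import ZoneInfo  # Python 3.9+

    EASTERN = ZoneInfo("America/New_York")
except (ImportError, KeyError):
    # No time zone database available; December dates are always EST (UTC-5).
    EASTERN = timezone(timedelta(hours=-5), "EST")

# The two announced maintenance windows, in eastern time.
WINDOWS = {
    "phase 1 (storage type)": datetime(2019, 12, 9, 10, 0, tzinfo=EASTERN),
    "phase 2 (instance type)": datetime(2019, 12, 16, 10, 0, tzinfo=EASTERN),
}


def to_utc(local_time):
    """Convert a timezone-aware datetime to UTC."""
    return local_time.astimezone(timezone.utc)


for name, start in WINDOWS.items():
    print(f"{name}: {start.isoformat()} = {to_utc(start).isoformat()}")
```

Both windows are in standard time, so 10AM eastern corresponds to 15:00 UTC.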

I sent out a heads-up email just now. I'll plan to send out emails before and after each outage.

Flags: needinfo?(willkg)

Thanks Will!

Storage conversion for prod is in progress.

Flags: needinfo?(jbuckley)

Storage resize finished.

I've kicked off the instance type change in stage, as a preview of doing prod next week.

Stage is back.

Status: NEW → ASSIGNED

Prod instance resize is complete.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED