
Grant IAM permission rds:RestoreDBClusterFromSnapshot to treeherder devs


Status

Tree Management
Treeherder: Infrastructure
P2
normal
RESOLVED FIXED
Opened: a year ago
Last modified: 2 months ago

People

(Reporter: emorley, Unassigned)

Tracking

(Depends on: 1 bug)

Details

(Reporter)

Description

a year ago
I was trying to spin up a new RDS instance from a prod snapshot, but got:

User emorley is not authorized to restore from database snapshot on arn:aws:rds:us-east-1:699292812394:og:default:mysql-5-6 (Service: AmazonRDS; Status Code: 403; Error Code: AccessDenied; Request ID: 32eab727-9146-11e6-8713-813b4fc88f9e). Please check with your administrator.

I'm guessing we need `RestoreDBClusterFromSnapshot`?
http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.IAM.ResourcePermissions.html
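For context, the shape of the IAM policy change being requested would be something like the following — a hypothetical sketch only (the action names come from the AWS docs above, and the resource ARN is the one from the error message; the exact resource list was only pinned down later in this bug):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "rds:RestoreDBInstanceFromDBSnapshot",
        "rds:RestoreDBClusterFromSnapshot"
      ],
      "Resource": "arn:aws:rds:us-east-1:699292812394:og:default:mysql-5-6"
    }
  ]
}
```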
(Reporter)

Comment 1

a year ago
To add some more context...

I think we'll have two use cases moving forward:
1) occasionally resetting treeherder-{prototype,stage} back to prod
2) spinning up RDS instances from prod's DB, for short-lived experiments

There isn't a way to reset an existing RDS instance to a different instance's snapshot directly, so doing #1 means "restore from snapshot to a new RDS instance", "delete the old instance", "rename the new to the old" (or some variation, eg delete first to avoid the rename). However doing this means the raw instance IDs will no longer match Terraform state, so would need re-syncing afterwards. 
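Sketched with the AWS CLI, that dance would look roughly like this (a hypothetical sketch — the instance and snapshot names are placeholders, and each step has to finish before the next):

```shell
# Hypothetical sketch of the restore/delete/rename dance; names are
# placeholders. Each step must complete before the next one starts.

# 1. Restore the prod snapshot to a brand-new instance.
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier treeherder-stage-new \
    --db-snapshot-identifier rds:treeherder-prod-2016-10-13-07-05
aws rds wait db-instance-available \
    --db-instance-identifier treeherder-stage-new

# 2. Delete the old stage instance.
aws rds delete-db-instance \
    --db-instance-identifier treeherder-stage --skip-final-snapshot

# 3. Rename the new instance to the old name.
aws rds modify-db-instance \
    --db-instance-identifier treeherder-stage-new \
    --new-db-instance-identifier treeherder-stage --apply-immediately
```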

Alternatively, Terraform's `snapshot_identifier` option could be used to create an instance from a snapshot. However this is suboptimal too, since (a) it's not clear how one gets Terraform to delete the existing instance prior to recreating it, (b) the snapshot identifier is dynamic (eg "rds:treeherder-prod-2016-10-13-07-05"; there doesn't appear to be an alias for "latest") so would require hand-editing each time, (c) it means treeherder devs can't do this themselves.

For prototype/stage, I'm open to which of the above approaches makes most sense.

However for short-lived experiments, I believe it may be better to not use Terraform to track them at all, and instead for treeherder devs to spin them up manually. Quite a bit of their configs will already be kept in sync anyway (eg the parameter groups). Spinning these up manually will require the IAM permissions in this bug.

Also for the prototype RDS instance and any created for short lived experiments, I was thinking we could do the following to save money:
* use single-AZ (halves both the instance and storage cost)
* turn off backups (reduces snapshot storage cost - we can just reset to prod if there are issues)
* turn off enhanced monitoring (reduces CloudWatch costs)
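For reference, those three settings map onto existing `modify-db-instance` options — a hypothetical sketch, with a placeholder instance name:

```shell
# Hypothetical sketch of the cost-saving settings on an existing instance.
aws rds modify-db-instance \
    --db-instance-identifier treeherder-dev-example \
    --no-multi-az \
    --backup-retention-period 0 \
    --monitoring-interval 0 \
    --apply-immediately
```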

Keen to know your thoughts? :-)

Comment 2

11 months ago
Thankfully, terraform has a nascent import function which will make it easier to handle #1. However, it will still require manipulating the terraform state and now that it's remote that might be ... interesting. So let's only do resets rarely and with coordination in advance. 

My major concern with #2 is tracking who's spinning up instances, for what, and ensuring that they ARE short lived. I'm going to require that all instances be tagged:
* BugID
* Name (this shows up under the DB Instance column in RDS)
* Owner (MoCo email or your devservices acct name)

If you like, we could craft a simple terraform config that you could use to create temporary RDS instances. It wouldn't use remote state, so you'd still have to look at the console to avoid stepping on toes, but it would be faster than doing it manually (and we could ensure all of the above).

Comment 3

11 months ago
Looking at the IAM policy - we can (and do) allow various actions on resources matching a pattern, currently treeherder-*. So that we don't let you accidentally delete treeherder-prod, how about we come up with a naming scheme that you use for your short-lived instances? That'll allow us to give you considerably more free rein on them.
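To illustrate the kind of pattern check such a naming scheme implies (this helper is hypothetical; the pattern shown matches the "treeherder-dev-*" scheme that comes up later in this bug):

```shell
# Hypothetical helper illustrating the proposed naming scheme:
# devs get free rein on short-lived instances matching an agreed
# pattern, but never on treeherder-prod*.
may_manage() {
  case "$1" in
    treeherder-prod*) return 1 ;;   # never touch prod
    treeherder-dev-*) return 0 ;;   # agreed short-lived pattern
    *)                return 1 ;;   # everything else: ask first
  esac
}
```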
(Reporter)

Comment 4

11 months ago
(In reply to Kendall Libby [:fubar] from comment #2)
> Thankfully, terraform has a nascent import function which will make it
> easier to handle #1. However, it will still require manipulating the
> terraform state and now that it's remote that might be ... interesting. So
> let's only do resets rarely and with coordination in advance. 

Hopefully they won't be too often, but I'd also really like to not let infra-restrictions get in the way of development velocity if possible (this was part of the benefit of moving to Heroku).

It looks like this is exactly what we need:
https://github.com/hashicorp/terraform/issues/8828

With that the process would presumably be:
* Manually delete the stage instance
* Re-run Terraform, at which point it notices the stage instance is missing and tries to re-create it with the appropriate `snapshot_identifier`

> If you like, we could craft a simple terraform config that you could use to
> create temporary RDS instances. It wouldn't use remote state, so you'd still
> have to look at the console to avoid stepping on toes, but it would be
> faster than doing it manually (and we could ensure all of the above).

This sounds good to me. When I tried to create a new instance in comment 0, there were a number of options that needed setting (and more that would need changing afterwards, since things like the parameter group can't be set until after creation), so having it scripted would actually save us time too.

> Looking at the IAM policy - we can (and do) allow various actions to
> resources matching a pattern, currently treeherder-*. So that we don't let
> you accidentally delete treeherder-prod, how about we come up with a naming
> scheme that you use for your short lived instances? That'll allow us to give
> you considerably more free reign on them?

Also sounds good :-)
(Reporter)

Comment 5

11 months ago
(In reply to Ed Morley [:emorley] from comment #4)
> It looks like this is exactly what we need:
> https://github.com/hashicorp/terraform/issues/8828
> 
> With that the process would presumably be:
> * Manually delete the stage instance
> * Re-run Terraform, at which point it notices the stage instance is missing
> and tries to re-create it with the appropriate `snapshot_identifier`

Actually an even better way for #1 is already possible in Terraform 0.7.5+, thanks to:
https://github.com/hashicorp/terraform/issues/8793
https://github.com/hashicorp/terraform/pull/8806

ie:
* Define treeherder-stage in terms of `snapshot_identifier = <hardcoded snapshot identifier>`
* Next time stage needs resetting, change the value for `snapshot_identifier`
* Terraform will automatically destroy the old stage instance and create a new one

Now it does mean having to hardcode a specific snapshot identifier that is updated at each reset, but perhaps that's a good thing - it's then easy to find the date at which stage was last reset (since the identifier contains the date).
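Sketched as Terraform config, the reset-by-snapshot-identifier approach would look something like this (attribute values here are placeholders; the real definition lives in devservices-aws):

```hcl
resource "aws_db_instance" "treeherder-stage" {
  identifier = "treeherder-stage"

  # Changing this value forces Terraform to destroy the existing
  # instance and recreate it from the named snapshot (Terraform 0.7.5+).
  snapshot_identifier = "rds:treeherder-prod-2016-10-13-07-05"

  instance_class = "db.m4.xlarge"
}
```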

Comment 6

11 months ago
(In reply to Ed Morley [:emorley] from comment #5)
> Actually an even better way for #1 is already possible in Terraform 0.7.5+,
> thanks to:
> https://github.com/hashicorp/terraform/issues/8793
> https://github.com/hashicorp/terraform/pull/8806
> 
> ie:
> * Define treeherder-stage in terms of `snapshot_identifier = <hardcoded
> snapshot identifier>`
> * Next time stage needs resetting, change the value for `snapshot_identifier`
> * Terraform will automatically destroy the old stage instance and create a
> new one
> 
> Now it does mean having to hardcode a specific snapshot identifer that is
> updated at each reset, but perhaps that's a good thing - it's then easy to
> find the date at which stage was last reset (since the identifier contains
> the date).

I like this. It makes it much easier for us to re-roll staging to keep things moving fast, while also keeping the infra at a stable/known state. I'll start poking at things!
(Reporter)

Comment 7

11 months ago
Until this issue is fixed, we'll have to manually update the password on stage after restoring from snapshot, otherwise it will have prod's password instead:
https://github.com/hashicorp/terraform/issues/8604
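Until then, the manual workaround is something like the following sketch (the new password value is a placeholder):

```shell
# Hypothetical sketch: after restoring stage from a prod snapshot,
# reset the master password so it no longer matches prod's.
aws rds modify-db-instance \
    --db-instance-identifier treeherder-stage \
    --master-user-password "$NEW_STAGE_PASSWORD" \
    --apply-immediately
```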

Comment 8

11 months ago
https://github.com/mozilla-platform-ops/devservices-aws/pull/11 updates the IAM policy to allow the above.
(Reporter)

Comment 9

3 months ago
Cameron and I were trying to create a new temporary test instance using the "restore from snapshot" feature here:
https://console.aws.amazon.com/rds/home?region=us-east-1#db-snapshots:

However got this IAM error:

User emorley is not authorized to restore from database snapshot on arn:aws:rds:us-east-1:699292812394:og:default:mysql-5-6 (Service: AmazonRDS; Status Code: 403; Error Code: AccessDenied; Request ID: cd4cb770-61ae-11e7-b981-dd655a9e2148). Please check with your administrator.

...this was even when using a new instance name of eg "treeherder-dev-foo", which should match the IAM exemption for "treeherder-dev-*" for performing management commands.

Jake, I don't suppose you could add the missing IAM permission? :-)
Flags: needinfo?(jwatkins)
(Reporter)

Comment 10

3 months ago
Ah searching for "RestoreDBInstanceFromDBSnapshot" on:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.IAM.ResourcePermissions.html

...shows we need:

db:db-instance-name
og:option-group-name
snapshot:snapshot-name
subgrp:subnet-group-name

...which in this case translates to:

db:treeherder-dev-foo
og:default:mysql-5-6
snapshot:treeherder-prod-... (or perhaps snapshot:rds:treeherder-prod-... - it's not clear)
subgrp:treeherder-dbgrp

The current IAM profile gives us `og:treeherder-dev*` but that option group does not exist - treeherder is using the default AWS group.
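Expressed in the Terraform policy's style (mirroring the existing `treeherder-*` patterns), the Resource list would be roughly as follows — a sketch only, with the snapshot-prefix ambiguity above left unresolved:

```hcl
# Hypothetical sketch of the Resource list the restore call needs;
# whether the snapshot ARN needs the "rds:" prefix is unclear (see above).
resources = [
  "arn:aws:rds:${var.region}:${account_id}:db:treeherder-dev-*",
  "arn:aws:rds:${var.region}:${account_id}:og:default:mysql-5-6",
  "arn:aws:rds:${var.region}:${account_id}:snapshot:rds:treeherder-prod-*",
  "arn:aws:rds:${var.region}:${account_id}:subgrp:treeherder-dbgrp",
]
```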
(Reporter)

Comment 11

3 months ago
I've opened:
https://github.com/mozilla-platform-ops/devservices-aws/pull/51
(Reporter)

Comment 12

3 months ago
Landed as:
https://github.com/mozilla-platform-ops/devservices-aws/commit/a5fa229776c822bc90cc833240d248542291358b

With that, we were able to create an RDS instance.

Points to note for anyone doing similar in the future:
* Use the "restore a snapshot" feature on: https://console.aws.amazon.com/rds/home?region=us-east-1#db-snapshots
* Make sure to give it a name that starts with "treeherder-dev" so it picks up the needed IAM permissions (for example "treeherder-dev-emorley")
* Choose options: multi-AZ: no; publicly accessible: yes; instance size: either leave as-is if you need it identical to prod, or reduce it (to say m4.xlarge) to match dev/stage.
* After the backup is restored you'll need to modify the instance to:
  - change the parameter group to "treeherder"
  - change the security group to "treeherder_heroku-sg"
  - edit the tags so:
    -> `Env` is say `dev` rather than `prod`
    -> `Name` is say `treeherder-emorley-rds` rather than `treeherder-prod-rds`
    -> `Bug` is the bug number for the project
  - (optionally) disable backups
  (Unfortunately these options cannot be changed when initiating the restore, only after)
* Remember to delete it when you're done! :-)
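Sketched with the AWS CLI, the post-restore modifications look roughly like this (a hypothetical sketch; the security group ID and bug number are placeholders, and the security group must be passed by ID rather than name):

```shell
# Hypothetical sketch of the post-restore steps; the security group ID
# and the Bug tag value are placeholders.
aws rds modify-db-instance \
    --db-instance-identifier treeherder-dev-emorley \
    --db-parameter-group-name treeherder \
    --vpc-security-group-ids sg-0123456789abcdef0 \
    --backup-retention-period 0 \
    --apply-immediately

aws rds add-tags-to-resource \
    --resource-name arn:aws:rds:us-east-1:699292812394:db:treeherder-dev-emorley \
    --tags Key=Env,Value=dev Key=Name,Value=treeherder-emorley-rds \
           Key=Bug,Value="$BUG_NUMBER"
```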

The bug as filed is fixed, so closing this out.
If we want to add Terraform scripts for this in the future (per comment 2), let's open a new bug :-)
Status: NEW → RESOLVED
Last Resolved: 3 months ago
Flags: needinfo?(jwatkins)
Resolution: --- → FIXED
(Reporter)

Updated

3 months ago
Depends on: 1378573
(Reporter)

Comment 13

3 months ago
(In reply to Ed Morley [:emorley] from comment #7)
> Until this issue is fixed, we'll have to manually update the password on
> stage after restoring from snapshot, otherwise it will have prod's password
> instead:
> https://github.com/hashicorp/terraform/issues/8604

This is now fixed in Terraform v0.8.2+:
https://github.com/hashicorp/terraform/pull/8622

Comment 14

(In reply to Ed Morley [:emorley] from comment #11)
> I've opened:
> https://github.com/mozilla-platform-ops/devservices-aws/pull/51

 +            "arn:aws:rds:${var.region}:${data.aws_caller_identity.current.account_id}:og:default:mysql-5-6", 


With dev and stage on 5.7, should we modify that to include 'og:default:mysql-5-7' ?
Flags: needinfo?(emorley)
(Reporter)

Comment 15

2 months ago
(In reply to Kendall Libby [:fubar] from comment #14)
> With dev and stage on 5.7, should we modify that to include
> 'og:default:mysql-5-7' ?

The PR that added the initial MySQL 5.7 parts landed after this, and changed this line too:
https://github.com/mozilla-platform-ops/devservices-aws/pull/53/files#diff-5627f73db5e7bfe1c9ef9b0c9cd94902R138
Flags: needinfo?(emorley)
That's what I get for relying only on the PR and not looking at what I had *just* pulled down. Thanks!