Closed Bug 1152982 Opened 9 years ago Closed 9 years ago

DEPLOY updated Resources stack to Prod

Categories

(Content Services Graveyard :: Tiles: Ops, defect)

Platform: x86 macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: mostlygeek, Assigned: relud)

References

Details

The resources CFN stack was recently changed:

- Changed from a deploy-time-generated JSON template back to a static JSON template with parameters
- Added an EIP so Infernyx is accessible via the VPN
- Added tags to shared security groups
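To illustrate the first change, here is a minimal sketch of what a static, parameterized template looks like versus deploy-time JSON generation. The parameter names, resource names, and the tiny `Ref` resolver below are hypothetical stand-ins, not the real stack's contents:

```python
import json

# Hypothetical static CloudFormation-style template that takes parameters at
# deploy time, instead of generating fresh JSON for every deployment.
STATIC_TEMPLATE = {
    "Parameters": {
        "Environment": {"Type": "String", "AllowedValues": ["stage", "prod"]},
    },
    "Resources": {
        "InfernyxEIP": {"Type": "AWS::EC2::EIP"},
        "SharedSG": {
            "Type": "AWS::EC2::SecurityGroup",
            "Properties": {"Tags": [{"Key": "Env", "Value": {"Ref": "Environment"}}]},
        },
    },
}

def render(template, params):
    """Resolve {"Ref": name} nodes against supplied parameters (a tiny subset of CFN)."""
    def walk(node):
        if isinstance(node, dict):
            if set(node) == {"Ref"} and node["Ref"] in params:
                return params[node["Ref"]]
            return {k: walk(v) for k, v in node.items()}
        if isinstance(node, list):
            return [walk(v) for v in node]
        return node
    return walk(template["Resources"])

rendered = render(STATIC_TEMPLATE, {"Environment": "stage"})
print(json.dumps(rendered["SharedSG"]["Properties"]["Tags"]))
# → [{"Key": "Env", "Value": "stage"}]
```

The same template file is reused for stage and prod; only the parameter values differ per deployment.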


This is a big change and it carries the risk of severely affecting the production system. It needs to be extensively tested. Let's discuss and agree on the testing plan in this bug before rolling it out to production.

Initial thoughts:

- spin up a brand-new stack in stage from the old (no-parameter) Jenkins-generated template
- apply the new CFN template changes on top of it
- ensure that: 
  - S3 bucket is not modified in any way
  - security groups are updated as expected
  - other changes are done as expected
- repeat the above procedure in production with a cloned stack
- apply the updated CFN template to the new production resources stack
- ensure all updates to resources are as expected
- apply the CFN update to the actual production stack
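The "ensure changes are as expected" steps above amount to diffing the old and new Resources sections before trusting the update. A hedged sketch, with toy templates standing in for the real jenkins-generated and static ones:

```python
# Diff two CFN Resources sections to see which logical IDs an update would touch.
# The templates below are illustrative stand-ins, not the real stack contents.
def changed_resources(old, new):
    """Logical IDs that were added, removed, or modified between templates."""
    return sorted(i for i in set(old) | set(new) if old.get(i) != new.get(i))

old_resources = {
    "TilesBucket": {"Type": "AWS::S3::Bucket"},
    "SharedSG": {"Type": "AWS::EC2::SecurityGroup", "Properties": {}},
}
new_resources = {
    "TilesBucket": {"Type": "AWS::S3::Bucket"},  # expected: untouched
    "SharedSG": {"Type": "AWS::EC2::SecurityGroup",
                 "Properties": {"Tags": [{"Key": "App", "Value": "tiles"}]}},
    "InfernyxEIP": {"Type": "AWS::EC2::EIP"},    # expected: new EIP
}

diff = changed_resources(old_resources, new_resources)
assert "TilesBucket" not in diff, "S3 bucket must not be modified in any way"
print(diff)  # → ['InfernyxEIP', 'SharedSG']
```

Anything in the diff that isn't on the expected-changes list is a reason to stop before touching production.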
Assignee: nobody → dthornton
Summary: Deploy updated resources stack to Prod → Deploy updated Resources stack to Prod
Summary: Deploy updated Resources stack to Prod → DEPLOY updated Resources stack to Prod
Blocks: 1153007
Blocks: 1153018
Deploying the resources stack and the onyx ELB are the same job in Jenkins, so this deployment will also update the onyx ELB to remove the stack tag from it.
Ignore that: the onyx ELB update is being split out into its own Jenkins job.
planned for 10:00 PDT on 2015/04/16
Since the resources stack spins up the core and manages all the interconnected resources (EIPs, security groups, SNS/SQS/S3, etc.), it is important that we are extra thorough in planning out the deployment.

Scheduled for 10:00am Thursday, April 15, 2015

Required before deployment:

- rollback / recovery plan if the deployment fails
- validation plan if stack successfully updated:
  - how do we check if everything is still working? 
  - onyx, splice, infernyx, redshift, s3, edgecast/cdn, other?
- if validation fails do we roll forward or back?
Thursday's the 16th, correct?
yes
Sorry April 16th/2015 @ 10am.
Rollback: we should record the CFN template that is live before we deploy, and deploy that as an update to the stack if the update succeeds but causes unexpected issues.
If the stack fails to the point that it cannot be updated and enters the 'rollback failed' state, then we would need to change the resources stack name in ansible and deploy a full new stack of *everything*.
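The record-before-deploy step could be sketched as below. The template body and file-naming scheme are hypothetical; in practice the live template would come from CloudFormation itself (e.g. `aws cloudformation get-template`):

```python
import json
import pathlib
import tempfile
import time

# Snapshot the live template to a timestamped file before deploying; the saved
# copy is what we would re-apply as a stack update if rollback is needed.
def snapshot_template(template: dict, stack_name: str, out_dir: pathlib.Path) -> pathlib.Path:
    path = out_dir / f"{stack_name}-{time.strftime('%Y%m%dT%H%M%S')}.json"
    path.write_text(json.dumps(template, indent=2, sort_keys=True))
    return path

# Placeholder for the live production template.
live = {"Resources": {"TilesBucket": {"Type": "AWS::S3::Bucket"}}}

with tempfile.TemporaryDirectory() as d:
    saved = snapshot_template(live, "resources-prod", pathlib.Path(d))
    restored = json.loads(saved.read_text())
    assert restored == live  # exactly what we would redeploy on rollback
    print("snapshot ok")
```

Keeping the snapshot outside the stack itself matters: if the stack enters an unrecoverable state, the saved JSON is still available for the rebuild-under-a-new-name path.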

steps:

- change resources stack name in ansible
- destroy the dns records for *_tiles.prod.mozaws.net
- deploy new resources stack
- file bug to update vpn routes for the groups vpn_cloudops_redshift, vpn_tiles_splice, and vpn_tiles_mapreduce
- deploy onyx, disco, splice stacks
- deploy processor, infernyx stacks
- scale old onyx to 0
- eventually destroy old stacks
:relud, could you also describe the testing you've done for this in stage? From my understanding you:

- took the jenkins generated CFN and applied your new static template to it 
- made sure it worked, that the CFN stack updated successfully?

Any other details I'm missing?
:relud would this be a valid test for updating the resources stack: 

- spin up a new stack w/ a copy of the current jenkins template, giving us a whole new set of resources
- have jenkins apply the new template to *that* stack
- if the update completes successfully with the prod params, then we have high confidence it will apply to the other stack the same way
Flags: needinfo?(dthornton)
testing I've done for this in stage:

- took the jenkins generated CFN and applied the new static template to it
- made sure it worked, that the CFN stack updated successfully
- applied the jenkins generated CFN as an update to the stack
- made sure it worked, that the CFN stack updated successfully
- applied the new static template as an update to the stack
- made sure it worked, that the CFN stack updated successfully

So basically, I tested the update and the rollback.

A valid test:

> spin up a new stack w/ a copy of the current jenkins template

This is not possible due to the route53 records that the stack creates. We could remove the route53 records from both stacks and do this, in order to test most of it. The closest we can come to testing changes to route53 is to test in stage, as above.
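The Route53 collision that blocks a full stack copy can be made concrete with a small check: two live copies of the template would both try to create the same record names. The record name is taken from this bug; the template shape and helper are illustrative:

```python
# Find DNS names that two CFN stacks would both try to create: Route53 does not
# allow two stacks to own the same RecordSet, which is why a straight copy of
# the production template cannot be spun up alongside the original.
def r53_names(resources):
    """Collect DNS names from Route53 RecordSet resources in a template."""
    return {
        r["Properties"]["Name"]
        for r in resources.values()
        if r.get("Type") == "AWS::Route53::RecordSet"
    }

prod_stack = {"DiscoR53": {"Type": "AWS::Route53::RecordSet",
                           "Properties": {"Name": "disco_tiles.prod.mozaws.net."}}}
copy_stack = {"DiscoR53": {"Type": "AWS::Route53::RecordSet",
                           "Properties": {"Name": "disco_tiles.prod.mozaws.net."}}}

conflicts = sorted(r53_names(prod_stack) & r53_names(copy_stack))
print(conflicts)  # → ['disco_tiles.prod.mozaws.net.']
```

An empty intersection is the precondition for running two copies of the stack side by side, which motivates the DnsSuffix approach discussed next in the bug.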
Flags: needinfo?(dthornton)
The R53 records are created in the "EipR53" resource. We should edit the DNS names manually so "disco_tiles.prod.mozaws.net." becomes "disco_test.prod.mozaws.net."

In the new template let's pass in DnsSuffix as "test". 

Everything else should be the same. 

:relud is there anything else that has baked-in logic?
that should work.
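The DnsSuffix scheme agreed above could be sketched as follows. The service names and record pattern come from this bug; the helper function itself is hypothetical:

```python
# Build Route53 record names from a suffix parameter, so a cloned test stack
# creates "disco_test..." records instead of colliding with "disco_tiles...".
def record_name(service: str, dns_suffix: str, zone: str = "prod.mozaws.net.") -> str:
    return f"{service}_{dns_suffix}.{zone}"

print(record_name("disco", "tiles"))  # production stack: disco_tiles.prod.mozaws.net.
print(record_name("disco", "test"))   # cloned test stack: disco_test.prod.mozaws.net.
```

With the suffix threaded through as a template parameter, everything else in the clone stays identical to production, which is exactly what makes the test meaningful.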
Update: 

- deployed a copy of the jenkins CFN stack
- ran the updated static cfn + params on the copy stack
- update ran cleanly
- rollback ran cleanly

Looking at the changes:

- IAM roles were changed due to some reordering of rules; the rules themselves don't appear to have changed
- SGs were changed since tags were added to them
- the EIP for infernyx was created as expected
- no SNS/SQS/S3 resources were changed

There shouldn't be any surprises when we run the same template against the real production stack tomorrow. We just need to figure out some verifications.
verifications:

- the onyx->redshift test should validate that security groups are still working for everything except: learnyx, redshift, splice, zenko
- doing a select on redshift over the VPN should validate the redshift security group
- accessing splice and zenko over the VPN should validate that those are working
- logging into learnyx over the VPN, running a list in disco, and connecting to redshift to do a select should validate the learnyx security group

I can't think of a way to validate IAM roles without deploying everything.

The infernyx EIP will be validated when infernyx is deployed.
Verification:

- test that auto-scaling works as usual with onyx
  - force-terminate a box
  - let auto-scale replace it
  - make sure the new box works as normal
    - using the end-to-end test

Re: IAM roles, those should take effect almost immediately. Where the IAM roles grant access to resources (SNS/SQS/S3), we should be able to validate that with the onyx->redshift test.

The permissions to access secrets/blackbox will require us to start a new server. For onyx I'll do that with the auto-scale test above.

For splice/zenko we should respin and validate again.
For splice we can kill the host and let it auto-scale back, like with onyx. Zenko can wait to be verified until the query thing is resolved.
As per :tblow's suggestion, delaying this bug until :oyiptong, :tspurway or :mardak has a chance to review it.
This looks good, guys. I am satisfied we have a good rollback plan in case things go pear-shaped. Let's roll this out.
deploy started
deploy completed

onyx-redshift test started at 10:25:01 PDT and completed at 10:30:20 PDT
completed successfully*
Onyx auto-scaling verification: successful. 

- Terminated an Onyx app server
- Autoscaling successfully replaced it with a working one
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED