Closed Bug 1152982 Opened 9 years ago Closed 9 years ago

DEPLOY updated Resources stack to Prod

Categories

(Content Services Graveyard :: Tiles: Ops, defect)

Platform: x86 macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: mostlygeek, Assigned: relud)

References

Details

The resources CFN stack was recently changed:

- Changed from a deploy-time-generated JSON template back to a static JSON template with parameters
- Added an EIP so Infernyx is accessible via the VPN
- Added tags to shared security groups
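To illustrate the first change, here is a minimal sketch of what a static, parameterized template looks like versus deploy-time JSON generation. The parameter names, resource names, and the tiny `Ref` resolver below are hypothetical stand-ins, not the real stack's contents:

```python
import json

# Hypothetical static CloudFormation-style template that takes parameters at
# deploy time, instead of generating fresh JSON for every deployment.
STATIC_TEMPLATE = {
    "Parameters": {
        "Environment": {"Type": "String", "AllowedValues": ["stage", "prod"]},
    },
    "Resources": {
        "InfernyxEIP": {"Type": "AWS::EC2::EIP"},
        "SharedSG": {
            "Type": "AWS::EC2::SecurityGroup",
            "Properties": {"Tags": [{"Key": "Env", "Value": {"Ref": "Environment"}}]},
        },
    },
}

def render(template, params):
    """Resolve {"Ref": name} nodes against supplied parameters (a tiny subset of CFN)."""
    def walk(node):
        if isinstance(node, dict):
            if set(node) == {"Ref"} and node["Ref"] in params:
                return params[node["Ref"]]
            return {k: walk(v) for k, v in node.items()}
        if isinstance(node, list):
            return [walk(v) for v in node]
        return node
    return walk(template["Resources"])

rendered = render(STATIC_TEMPLATE, {"Environment": "stage"})
print(json.dumps(rendered["SharedSG"]["Properties"]["Tags"]))
# → [{"Key": "Env", "Value": "stage"}]
```

The same template file is reused for stage and prod; only the parameter values differ per deployment.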


This is a big change and it carries the risk of severely affecting the production system. It needs to be extensively tested. Let's discuss and agree on the testing plan in this bug before rolling it out to production.

Initial thoughts:

- spin up a brand-new stack in stage from the old (no-parameter) Jenkins-generated template
- apply the new CFN template changes on top of it
- ensure that: 
  - S3 bucket is not modified in any way
  - security groups are updated as expected
  - other changes are done as expected
- repeat the above procedure in production with a cloned stack
- apply the updated CFN template to the new production resources stack
- ensure all updates to resources are as expected
- apply the CFN update to the actual production stack
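The "ensure changes are as expected" steps above amount to diffing the old and new Resources sections before trusting the update. A hedged sketch, with toy templates standing in for the real jenkins-generated and static ones:

```python
# Diff two CFN Resources sections to see which logical IDs an update would touch.
# The templates below are illustrative stand-ins, not the real stack contents.
def changed_resources(old, new):
    """Logical IDs that were added, removed, or modified between templates."""
    return sorted(i for i in set(old) | set(new) if old.get(i) != new.get(i))

old_resources = {
    "TilesBucket": {"Type": "AWS::S3::Bucket"},
    "SharedSG": {"Type": "AWS::EC2::SecurityGroup", "Properties": {}},
}
new_resources = {
    "TilesBucket": {"Type": "AWS::S3::Bucket"},  # expected: untouched
    "SharedSG": {"Type": "AWS::EC2::SecurityGroup",
                 "Properties": {"Tags": [{"Key": "App", "Value": "tiles"}]}},
    "InfernyxEIP": {"Type": "AWS::EC2::EIP"},    # expected: new EIP
}

diff = changed_resources(old_resources, new_resources)
assert "TilesBucket" not in diff, "S3 bucket must not be modified in any way"
print(diff)  # → ['InfernyxEIP', 'SharedSG']
```

Anything in the diff that isn't on the expected-changes list is a reason to stop before touching production.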
Assignee: nobody → dthornton
Summary: Deploy updated resources stack to Prod → Deploy updated Resources stack to Prod
Summary: Deploy updated Resources stack to Prod → DEPLOY updated Resources stack to Prod
Blocks: 1153007
Blocks: 1153018
Deploying the resources stack and the onyx ELB are the same job in Jenkins, so this deployment will also update the onyx ELB to remove the stack tag from it.
Ignore that: the onyx ELB update is being split out into its own Jenkins job.
planned for 10:00 PDT on 2015/04/16
Since the resources stack spins up the core and manages all the interconnected resources (EIPs, security groups, SNS/SQS/S3, etc.), it is important that we are extra thorough in planning out the deployment.

Scheduled for 10:00am Thursday, April 15, 2015

Required before deployment:

- rollback / recovery plan if the deployment fails
- validation plan if stack successfully updated:
  - how do we check if everything is still working? 
  - onyx, splice, infernyx, redshift, s3, edgecast/cdn, other?
- if validation fails do we roll forward or back?
Thursday's the 16th, correct?
yes
Sorry April 16th/2015 @ 10am.
Rollback: we should record the CFN template that is live before we deploy, and deploy that as an update to the stack if the update succeeds but causes unexpected issues.
If the stack fails to the point that it cannot be updated and enters the 'rollback failed' state, then we would need to change the resources stack name in ansible and deploy a full new stack of *everything*.
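The record-before-deploy step could be sketched as below. The template body and file-naming scheme are hypothetical; in practice the live template would come from CloudFormation itself (e.g. `aws cloudformation get-template`):

```python
import json
import pathlib
import tempfile
import time

# Snapshot the live template to a timestamped file before deploying; the saved
# copy is what we would re-apply as a stack update if rollback is needed.
def snapshot_template(template: dict, stack_name: str, out_dir: pathlib.Path) -> pathlib.Path:
    path = out_dir / f"{stack_name}-{time.strftime('%Y%m%dT%H%M%S')}.json"
    path.write_text(json.dumps(template, indent=2, sort_keys=True))
    return path

# Placeholder for the live production template.
live = {"Resources": {"TilesBucket": {"Type": "AWS::S3::Bucket"}}}

with tempfile.TemporaryDirectory() as d:
    saved = snapshot_template(live, "resources-prod", pathlib.Path(d))
    restored = json.loads(saved.read_text())
    assert restored == live  # exactly what we would redeploy on rollback
    print("snapshot ok")
```

Keeping the snapshot outside the stack itself matters: if the stack enters an unrecoverable state, the saved JSON is still available for the rebuild-under-a-new-name path.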

steps:

- change resources stack name in ansible
- destroy the dns records for *_tiles.prod.mozaws.net
- deploy new resources stack
- file bug to update vpn routes for the groups vpn_cloudops_redshift, vpn_tiles_splice, and vpn_tiles_mapreduce
- deploy onyx, disco, splice stacks
- deploy processor, infernyx stacks
- scale old onyx to 0
- eventually destroy old stacks
:relud, could you also describe the testing you've done for this in stage? From my understanding you:

- took the jenkins generated CFN and applied your new static template to it 
- made sure it worked, that the CFN stack updated successfully?

Any other details I'm missing?
:relud would this be a valid test for updating the resources stack: 

- spin up a new stack w/ a copy of the current jenkins template, giving us a whole new set of resources
- have jenkins apply the new template to *that* stack
- if the update completes successfully with the prod params, then we have high confidence it will apply to the other stack the same way
Flags: needinfo?(dthornton)
testing I've done for this in stage:

- took the jenkins generated CFN and applied the new static template to it
- made sure it worked, that the CFN stack updated successfully
- applied the jenkins generated CFN as an update to the stack
- made sure it worked, that the CFN stack updated successfully
- applied the new static template as an update to the stack
- made sure it worked, that the CFN stack updated successfully

So basically, I tested the update and the rollback.

A valid test:

> spin up a new stack w/ a copy of the current jenkins template

This is not possible due to the route53 records that the stack creates. We could remove the route53 records from both stacks and do this, in order to test most of it. The closest we can come to testing changes to route53 is to test in stage, as above.
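The Route53 collision that blocks a full stack copy can be made concrete with a small check: two live copies of the template would both try to create the same record names. The record name is taken from this bug; the template shape and helper are illustrative:

```python
# Find DNS names that two CFN stacks would both try to create: Route53 does not
# allow two stacks to own the same RecordSet, which is why a straight copy of
# the production template cannot be spun up alongside the original.
def r53_names(resources):
    """Collect DNS names from Route53 RecordSet resources in a template."""
    return {
        r["Properties"]["Name"]
        for r in resources.values()
        if r.get("Type") == "AWS::Route53::RecordSet"
    }

prod_stack = {"DiscoR53": {"Type": "AWS::Route53::RecordSet",
                           "Properties": {"Name": "disco_tiles.prod.mozaws.net."}}}
copy_stack = {"DiscoR53": {"Type": "AWS::Route53::RecordSet",
                           "Properties": {"Name": "disco_tiles.prod.mozaws.net."}}}

conflicts = sorted(r53_names(prod_stack) & r53_names(copy_stack))
print(conflicts)  # → ['disco_tiles.prod.mozaws.net.']
```

An empty intersection is the precondition for running two copies of the stack side by side, which motivates the DnsSuffix approach discussed next in the bug.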
Flags: needinfo?(dthornton)
The R53 records are created in the "EipR53" resource. We should edit the DNS names manually so "disco_tiles.prod.mozaws.net." becomes "disco_test.prod.mozaws.net."

In the new template let's pass in DnsSuffix as "test". 

Everything else should be the same. 

:relud is there anything else that has baked-in logic?
that should work.
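The DnsSuffix scheme agreed above could be sketched as follows. The service names and record pattern come from this bug; the helper function itself is hypothetical:

```python
# Build Route53 record names from a suffix parameter, so a cloned test stack
# creates "disco_test..." records instead of colliding with "disco_tiles...".
def record_name(service: str, dns_suffix: str, zone: str = "prod.mozaws.net.") -> str:
    return f"{service}_{dns_suffix}.{zone}"

print(record_name("disco", "tiles"))  # production stack: disco_tiles.prod.mozaws.net.
print(record_name("disco", "test"))   # cloned test stack: disco_test.prod.mozaws.net.
```

With the suffix threaded through as a template parameter, everything else in the clone stays identical to production, which is exactly what makes the test meaningful.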
Update: 

- deployed a copy of the jenkins CFN stack
- ran the updated static cfn + params on the copy stack
- update ran cleanly
- rollback ran cleanly

Looking at the changes:

- IAM roles were changed due to some reordering of rules; the rules themselves don't appear to have changed
- SGs were changed since tags were added to them
- the EIP for infernyx was created as expected
- no SNS/SQS/S3 resources were changed

There shouldn't be any surprises when we run the same template against the real production stack tomorrow. We just need to figure out some verifications.
verifications:

- the onyx->redshift test should validate that security groups are still working for everything except: learnyx, redshift, splice, zenko
- doing a select on redshift over the VPN should validate the redshift security group
- accessing splice and zenko over the VPN should validate that those are working
- logging into learnyx over the VPN, running a list in disco, and connecting to redshift to do a select should validate the learnyx security group

I can't think of a way to validate IAM roles without deploying everything.

The infernyx EIP will be validated when infernyx is deployed.
Verification:

- test that auto-scaling works as usual with onyx
  - force-terminate a box
  - let auto-scale replace it
  - make sure the new box works as normal
    - using the end-to-end test

Re: IAM roles, those should take effect almost immediately. Where the IAM roles grant access to resources (SNS/SQS/S3), we should be able to validate that with the onyx->redshift test.

The permissions to access secrets/blackbox will require us to start a new server. For onyx I'll do that with the auto-scale test above.

For splice/zenko we should respin and validate again.
For splice we can kill the host and let it auto-scale back, like with onyx. Zenko can wait to be verified until the query thing is resolved.
As per :tblow's suggestion, delaying this bug until :oyiptong, :tspurway or :mardak has a chance to review it.
This looks good, guys. I am satisfied we have a good rollback plan in case things go pear-shaped. Let's roll this out.
deploy started
deploy completed

onyx-redshift test started at 10:25:01 PDT and completed at 10:30:20 PDT
completed successfully*
Onyx auto-scaling verification: successful. 

- Terminated an Onyx app server
- Autoscaling successfully replaced it with a working one
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED