Bug 1152982 (Closed) - DEPLOY updated Resources stack to Prod
Opened 10 years ago; closed 10 years ago
Categories: Content Services Graveyard :: Tiles: Ops, defect
Tracking: not tracked
Status: VERIFIED FIXED
People: Reporter: mostlygeek, Assigned: relud
The resources CFN stack was recently changed:
- Changed from a deploy-time generated JSON template back to a static JSON template with parameters
- Added an EIP so infernyx is accessible via the VPN
- Added tags to shared security groups
This is a big change that risks severely affecting the production system, so it needs to be tested extensively. Let's discuss and agree on a testing plan in this bug before rolling it out to production.
Initial thoughts:
- spin up a brand-new stack in stage with the old (no-param), Jenkins-runtime stack
- apply the new CFN template changes on top of it
- ensure that:
  - the S3 bucket is not modified in any way
  - security groups are updated as expected
  - other changes are applied as expected
- repeat the above procedure in production with a cloned stack:
  - apply the updated CFN stack to the new production resources stack
  - ensure all updates to resources are as expected
- apply the CFN update to the actual production stack
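One way to sanity-check the plan above before CloudFormation ever sees the new template (a hypothetical sketch, not part of the actual Jenkins job) is to diff the old generated template against the new static one offline, so unexpected resource changes surface early. The mini-templates below are illustrative, not the real resources stack.

```python
import json

def template_resource_diff(old_template: str, new_template: str):
    """Compare the Resources sections of two CloudFormation JSON
    templates and report which logical resources were added,
    removed, or modified."""
    old = json.loads(old_template).get("Resources", {})
    new = json.loads(new_template).get("Resources", {})
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    modified = sorted(
        name for name in set(old) & set(new) if old[name] != new[name]
    )
    return {"added": added, "removed": removed, "modified": modified}

# Illustrative mini-templates: the new one adds an EIP and tags a
# shared security group, while the S3 bucket is untouched.
old_tpl = json.dumps({"Resources": {
    "Bucket": {"Type": "AWS::S3::Bucket"},
    "SharedSG": {"Type": "AWS::EC2::SecurityGroup",
                 "Properties": {"GroupDescription": "shared"}},
}})
new_tpl = json.dumps({"Resources": {
    "Bucket": {"Type": "AWS::S3::Bucket"},
    "SharedSG": {"Type": "AWS::EC2::SecurityGroup",
                 "Properties": {"GroupDescription": "shared",
                                "Tags": [{"Key": "App", "Value": "tiles"}]}},
    "InfernyxEIP": {"Type": "AWS::EC2::EIP"},
}})

print(template_resource_diff(old_tpl, new_tpl))
# {'added': ['InfernyxEIP'], 'removed': [], 'modified': ['SharedSG']}
```

An empty `removed` list and the expected entries in `added`/`modified` would line up with the "ensure that" checks in the plan.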
Reporter
Updated•10 years ago
Assignee: nobody → dthornton
Summary: Deploy updated resources stack to Prod → Deploy updated Resources stack to Prod
Reporter
Updated•10 years ago
Summary: Deploy updated Resources stack to Prod → DEPLOY updated Resources stack to Prod
Assignee
Comment 1•10 years ago
Deploying the resources stack and the onyx ELB are the same Jenkins job, so this deployment will also update the onyx ELB to remove the stack tag from it.
Assignee
Comment 2•10 years ago
Ignore that: the onyx ELB update is being split out into its own Jenkins job.
Assignee
Comment 3•10 years ago
planned for 10:00 PDT on 2015/04/16
Reporter
Comment 4•10 years ago
Since the resources stack spins up the core and manages all the interconnected resources (EIPs, security groups, SNS/SQS/S3, etc.), it is important that we are extra thorough in planning out the deployment.
Scheduled for 10:00am Thursday, April 15, 2015
Required before deployment:
- a rollback / recovery plan if the deployment fails
- a validation plan if the stack updates successfully:
  - how do we check that everything is still working? (onyx, splice, infernyx, redshift, S3, EdgeCast/CDN, other?)
  - if validation fails, do we roll forward or back?
Comment 5•10 years ago
Thursday's the 16th, correct?
Assignee
Comment 6•10 years ago
yes
Reporter
Comment 7•10 years ago
Sorry, April 16th, 2015 @ 10am.
Assignee
Comment 8•10 years ago
Rollback: we should record the CFN template that is live before we deploy, and re-apply it as an update to the stack if the update succeeds but causes unexpected issues.
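That recording step could be scripted; a minimal sketch under assumptions (the file name and helper functions are hypothetical; in practice the template body would come from the CloudFormation GetTemplate API, e.g. boto3's `get_template`):

```python
import json
from pathlib import Path

def save_template_backup(template_body: str, path: str) -> None:
    """Persist the currently-live stack template to disk before the
    deploy, so it can be re-applied as a stack update for rollback.
    Checks that the body is well-formed JSON before writing."""
    json.loads(template_body)  # raises ValueError if the template is corrupt
    Path(path).write_text(template_body)

def load_template_backup(path: str) -> str:
    """Read the recorded template back; the returned body is what
    would be passed to an UpdateStack call during rollback."""
    return Path(path).read_text()

# Hypothetical usage: in real life `live_body` would come from
# boto3.client("cloudformation").get_template(StackName=...)
live_body = json.dumps({"Resources": {"Bucket": {"Type": "AWS::S3::Bucket"}}})
save_template_backup(live_body, "resources-pre-deploy.json")
assert load_template_backup("resources-pre-deploy.json") == live_body
```

Validating the JSON before writing matters here: a corrupt backup would turn the rollback path itself into a failure mode.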
Assignee
Comment 9•10 years ago
If the stack fails to the point that it cannot be updated and enters the UPDATE_ROLLBACK_FAILED state, then we would need to change the resources stack name in Ansible and deploy a full new stack of *everything*.
Steps:
- change the resources stack name in Ansible
- destroy the DNS records for *_tiles.prod.mozaws.net
- deploy a new resources stack
- file a bug to update VPN routes for the groups vpn_cloudops_redshift, vpn_tiles_splice, and vpn_tiles_mapreduce
- deploy the onyx, disco, and splice stacks
- deploy the processor and infernyx stacks
- scale the old onyx to 0
- eventually destroy the old stacks
Reporter
Comment 10•10 years ago
:relud could you also describe the testing you've done for this in stage? From my understanding you:
- took the Jenkins-generated CFN and applied your new static template to it
- made sure it worked, i.e. that the CFN stack updated successfully?
Any other details I'm missing?
Reporter
Comment 11•10 years ago
:relud would this be a valid test for updating the resources stack:
- spin up a new stack w/ a copy of the current Jenkins template, giving us a whole new set of resources
- have Jenkins apply the new template to *that* stack
- if the update completes successfully with the prod params, then we have high confidence it will apply to the other stack the same way
Flags: needinfo?(dthornton)
Assignee
Comment 12•10 years ago
Testing I've done for this in stage:
- took the Jenkins-generated CFN and applied my new static template to it
  - made sure it worked, i.e. that the CFN stack updated successfully
- applied the Jenkins-generated CFN as an update to the stack
  - made sure it worked, i.e. that the CFN stack updated successfully
- applied my new static template as an update to the stack
  - made sure it worked, i.e. that the CFN stack updated successfully
So basically, I tested the update and the rollback.
A valid test:
> spin up a new stack w/ a copy of the current Jenkins template
This is not possible due to the Route53 records that the stack creates. We could remove the Route53 records from both stacks and then do this, in order to test most of it. The closest we can come to testing changes to Route53 is to test in stage, as above.
Flags: needinfo?(dthornton)
Reporter
Comment 13•10 years ago
The R53 records are created in the "EipR53" resource. We should edit the DNS names manually so "disco_tiles.prod.mozaws.net." becomes "disco_test.prod.mozaws.net."
In the new template let's pass in DnsSuffix as "test".
Everything else should be the same.
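A rough sketch of what the DnsSuffix idea could look like in the template (the resource shape, hosted zone, and the `DiscoEIP` reference are illustrative assumptions, not the actual EipR53 definition, which presumably covers more records):

```json
{
  "Parameters": {
    "DnsSuffix": {
      "Type": "String",
      "Default": "tiles",
      "Description": "DNS label suffix; pass \"test\" when updating the cloned stack"
    }
  },
  "Resources": {
    "EipR53": {
      "Type": "AWS::Route53::RecordSet",
      "Properties": {
        "HostedZoneName": "prod.mozaws.net.",
        "Name": { "Fn::Join": ["", ["disco_", { "Ref": "DnsSuffix" }, ".prod.mozaws.net."]] },
        "Type": "A",
        "TTL": "300",
        "ResourceRecords": [{ "Ref": "DiscoEIP" }]
      }
    }
  }
}
```

With the default, the record is disco_tiles.prod.mozaws.net.; with "test" the cloned stack writes disco_test.prod.mozaws.net. and never collides with production DNS.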
:relud is there anything else that has baked in logic?
Assignee
Comment 14•10 years ago
that should work.
Reporter
Comment 15•10 years ago
Update:
- deployed a copy of the jenkins CFN stack
- ran the updated static cfn + params on the copy stack
- update ran cleanly
- rollback ran cleanly
Looking at the changes:
- IAM roles were changed due to some reordering of rules; the rules themselves don't appear to have changed
- SGs were changed since tags were added to them
- the EIP for infernyx was created as expected
- no SNS/SQS/S3 resources were changed
There shouldn't be any surprises when we run the same template against the real production stack tomorrow. We just need to figure out some verifications.
Assignee
Comment 16•10 years ago
Verifications:
- the onyx->redshift test should validate that security groups are still working for everything except learnyx, redshift, splice, and zenko
- doing a SELECT on redshift over the VPN should validate the redshift security group
- accessing splice and zenko over the VPN should validate that those are working
- logging into learnyx over the VPN, running a list in disco, and connecting to redshift to do a SELECT should validate the learnyx security group
- I can't think of a way to validate IAM roles without deploying everything
- the infernyx EIP will be validated when infernyx is deployed
Reporter
Comment 17•10 years ago
Verification:
- test that auto-scaling works as usual with onyx:
  - force-terminate a box
  - let auto-scale replace it
  - make sure the new box works as normal, using the end-to-end test
Re: IAM roles, those should take effect almost immediately. Where the IAM roles grant access to resources (SNS/SQS/S3), we should be able to validate that with the onyx->redshift test.
The permissions to access secrets/blackbox will require us to start a new server. For onyx I'll do that with the auto-scale test above.
For splice/zenko we should respin and validate again.
Assignee
Comment 18•10 years ago
For splice we can kill the host and let it auto-scale back, like with onyx. Zenko can wait to be verified until the query thing is resolved.
Reporter
Comment 19•10 years ago
As per :tblow's suggestion, delaying this bug until :oyiptong, :tspurway or :mardak has a chance to review it.
Comment 20•10 years ago
This looks good, guys. I am satisfied we have a good rollback plan in case things go pear-shaped. Let's roll this out.
Assignee
Comment 21•10 years ago
deploy started
Assignee
Comment 22•10 years ago
deploy completed
onyx-redshift test started at 10:25:01 PDT and completed at 10:30:20 PDT
Assignee
Comment 23•10 years ago
completed successfully*
Reporter
Comment 24•10 years ago
Onyx auto-scaling verification: successful.
- Terminated an Onyx app server
- Autoscaling successfully replaced it with a working one
Updated•10 years ago
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•10 years ago
Status: RESOLVED → VERIFIED