Closed Bug 1306753 Opened 8 years ago Closed 8 years ago

Deploy balrog scriptworker to production environment

Categories

(Release Engineering :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mtabara, Assigned: mtabara)

References

Details

Attachments

(2 files, 1 obsolete file)

In bug 1289822 we managed to get a staging environment for balrog-scriptworker. However, in order to move forward with nightlies we need a production environment. For that, certain specifics need to be done.

A. Machine related things:

a1. I need an account to access the AWS console [0]
a2. Need to identify the right subnet for this. I could go with creating a new subnet or using an existing one. The key thing here is to have it suitable for communication with Balrog. I could potentially file a bug to netflows to open them up, but ideally I'll reuse something, most likely srv.releng.*.mozilla.com. Once I get access to Balrog, I should filter by srv.releng.*.mozilla.com to see if those hosts can talk to the Balrog admin API.
a3. Should we go with srv.releng.*.mozilla.com, I need to identify subnet ids for it per AWS region in order to fill in [1].
a4. Set up the firewall security groups to fill in [2].
a5. Include the corresponding firewall rule.
a6. Choose a proper FQDN.
a7. With subnets and firewall all set, I can go ahead and set up the configs [3] and user-data configs [4] (see the YAML sketch following this comment for the kind of entries involved).
a8. Have them reviewed and deployed with [5].
a9. Get myself access to the prod AWS manager with invtools available, to run the necessary scripts to start the production instance.

B. Puppet related things:

b10. Before migrating to production, I need to make sure the existing puppet patches work against a clean EC2 instance. I already got access in bug 1306610 but the machine lacks the certs, so I'll go ahead and create a new one based on [7].
b11. Have Coop (because git blame says so :P) review the cruncher configs pinning requests to 2.8.1 before puppetizing requests-2.10.0 into pypi.
b12. Add nagios checks - use bug 1295196 as an example. We also need another check for signing, on the pending queue; we may need a check like that for balrog scriptworker as well. We'll probably use Queue.pendingTasks for that: https://docs.taskcluster.net/reference/platform/queue/api-docs#pendingTasks
b13. We'll need a new client for the rolled out production instances that don't use dummy worker types. I set up https://tools.taskcluster.net/auth/clients/#project%252freleng%252fscriptworker%252fsigning-linux for the signing scriptworkers. We may want to have a clientId per instance, but for now they share one. As client scopes, I could try something like:

project:releng:balrog:*
queue:claim-task:scriptworker-prov-v1/balrog-*
queue:poll-task-urls:scriptworker-prov-v1/balrog-*
queue:worker-id:balrog-v1/balrog-*

or something alike. Those need to match the production worker group, provisioner id, and worker id. I'm not sure if there are other balrog scriptworker types, in which case I may want to have names that differentiate the different types.
b14. The secrets eventually need to be stored in hiera.
b15. The moco-config.pp #TC balrog scriptworker configs need to be altered to reflect the production environment.
b16. moco-nodes.pp for production should have high trust and security levels.
b17. Some of the config.json chain of trust vars may have to change with the next release of scriptworker.
b18. Since I have the FQDN I can go ahead and tweak puppet moco-nodes too, something like [6].
b19. Deploy all and close laptop lid :)

[0]: https://mozilla-releng.signin.aws.amazon.com/console
[1]: https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/subnets.yml
[2]: https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/securitygroups.yml
[3]: https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/signingworker
[4]: https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/signingworker.user-data
[5]: https://github.com/mozilla-releng/build-cloud-tools
[6]: http://hg.mozilla.org/build/puppet/file/tip/manifests/moco-nodes.pp#l1158
[7]: https://bugzilla.mozilla.org/show_bug.cgi?id=1289822#c3
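For orientation, the cloud-tools configs in [1] and [2] are plain YAML. Below is a rough sketch of the kind of entries a3-a5 ask for; the structure, subnet ids, and CIDRs are all illustrative, not the repo's actual schema - the linked files are authoritative.

# Illustrative sketch only: structure, subnet ids, and CIDRs are made up.
# The authoritative schema lives in configs/subnets.yml and
# configs/securitygroups.yml in build-cloud-tools.
subnets:
  srv.releng.use1.mozilla.com:
    us-east-1: [subnet-00000001, subnet-00000002]   # one id per AZ

securitygroups:
  balrogworker:
    us-east-1:
      inbound:
        - cidr: 10.0.0.0/8   # placeholder: releng networks only
          ports: [22]        # ssh for maintenance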
I recreated the environment from scratch in bug 1306610 and adjusted slightly.
Attachment #8796729 - Attachment is obsolete: true
(In reply to Mihai Tabara [:mtabara] from comment #0)
> b10. Before migrating to production, I need to make sure the existing puppet
> patches work against a clean EC2 instance. I already got access in bug
> 1306610 but the machine lacks the certs, so I'll go ahead and create a new
> one based on [7].

Done now.

> b11. Have Coop (because git blame says so :P) review the cruncher configs
> pinning requests to 2.8.1 before puppetizing requests-2.10.0 into pypi.

Taking a few steps in advance here to unblock myself for later today.

:coop:
r? https://bugzilla.mozilla.org/attachment.cgi?id=8797168&action=diff#a/modules/cruncher/manifests/reportor.pp_sec2
and
r? https://bugzilla.mozilla.org/attachment.cgi?id=8797168&action=diff#a/modules/cruncher/manifests/slave_health.pp_sec2

git blame says you added the files, so I thought it'd be safer if you glanced at these two cruncher changes before I push this to puppet later on.
Flags: needinfo?(coop)
(In reply to Mihai Tabara [:mtabara] from comment #2)
> Taking a few steps in advance here to unblock myself for later today.
> :coop:
> r? https://bugzilla.mozilla.org/attachment.cgi?id=8797168&action=diff#a/modules/cruncher/manifests/reportor.pp_sec2
> and
> r? https://bugzilla.mozilla.org/attachment.cgi?id=8797168&action=diff#a/modules/cruncher/manifests/slave_health.pp_sec2

Thanks for thinking of me! ;) Yes, pinning versions here is fine.
Flags: needinfo?(coop)
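For the record, pinning a pip-installed package in puppet generally looks like the snippet below - a generic sketch, not the actual cruncher manifest under review.

# Generic sketch, not the actual cruncher manifest: keep cruncher on
# requests 2.8.1 while requests-2.10.0 lands in the pypi mirror.
package { 'requests':
    ensure   => '2.8.1',
    provider => pip,
}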
Depends on: 1307218
@bhearsum: I need to create a client for the rolled-out production instance(s) for balrogworker. For now they'll share one, but we may end up having a clientId per production instance. For that reason I just created https://tools.taskcluster.net/auth/clients/#project%252freleng%252fscriptworker%252fbalrog and I expect the FQDNs for the production instances to be something like:

balrog-1.srv.releng.use1.mozilla.com
balrog-2.srv.releng.use1.mozilla.com
...

Is this naming convention fine with you or should I pick something else? I'm not sure if there are other balrog scriptworker types, in which case I may want to have names that differentiate.
Flags: needinfo?(bhearsum)
(In reply to Mihai Tabara [:mtabara] from comment #4)
> I need to create a client for the rolled-out production instance(s) for
> balrogworker. For now they'll share one, but we may end up having a clientId
> per production instance. For that reason I just created
> https://tools.taskcluster.net/auth/clients/#project%252freleng%252fscriptworker%252fbalrog
> and I expect the FQDNs for the production instances to be something like:
>
> balrog-1.srv.releng.use1.mozilla.com
> balrog-2.srv.releng.use1.mozilla.com
> ...
>
> Is this naming convention fine with you or should I pick something else?

"balrogworker" or something similar would be best. When I see "balrog" I think of the balrog server, not the balrogworker.
Flags: needinfo?(bhearsum)
(In reply to Mihai Tabara [:mtabara] from comment #0)
> A. Machine related things:
>
> a1. I need an account to access the AWS console [0]

Done now.

> a2. Need to identify the right subnet for this. I could go with creating a
> new subnet or using an existing one. The key thing here is to have it
> suitable for communication with Balrog. [...]

Done now - we'll stick to srv.releng.*.mozilla.com. I tested the netflows and communication towards the Balrog admin API is open.

> a3. Should we go with srv.releng.*.mozilla.com, I need to identify subnet
> ids for it per AWS region in order to fill in [1].
> a4. Set up the firewall security groups to fill in [2].
> a5. Include the corresponding firewall rule.

WIP now for these ^

> a6. Choose a proper FQDN.

balrogworker-1.srv.releng.use1.mozilla.com
balrogworker-2.srv.releng.usw2.mozilla.com
... (potentially others)

> a7. With subnets and firewall all set, I can go ahead and set up the
> configs [3] and user-data configs [4].
> a8. Have them reviewed and deployed with [5].
> a9. Get myself access to the prod AWS manager with invtools available, to
> run the necessary scripts to start the production instance.

TODO

> B. Puppet related things:
>
> b11. Have Coop (because git blame says so :P) review the cruncher configs
> pinning requests to 2.8.1 before puppetizing requests-2.10.0 into pypi.

Done now. Thanks Coop!

> b12. Add nagios checks - use bug 1295196 as an example. We also need another
> check for signing, on the pending queue; we may need a check like that for
> balrog scriptworker as well. We'll probably use Queue.pendingTasks for that:
> https://docs.taskcluster.net/reference/platform/queue/api-docs#pendingTasks

Filed separate bugs to deal with this once this bug is landed (a minimal check is sketched below).

> b13. We'll need a new client for the rolled out production instances that
> don't use dummy worker types. [...]

Done. Thanks bhearsum for the suggestion. I went ahead and created https://tools.taskcluster.net/auth/clients/#project%252freleng%252fscriptworker%252fbalrogworker

> b14. The secrets eventually need to be stored in hiera.
> b15. The moco-config.pp #TC balrog scriptworker configs need to be altered
> to reflect the production environment.
> b16. moco-nodes.pp for production should have high trust and security levels.
> b17. Some of the config.json chain of trust vars may have to change with the
> next release of scriptworker.
> b18. Since I have the FQDN I can go ahead and tweak puppet moco-nodes too,
> something like [6].
> b19. Deploy all and close laptop lid :)

TODO
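On b12, a minimal sketch of what a pending-queue check could look like with the taskcluster Python client; the worker type / provisioner names and the thresholds are assumptions, not the deployed check.

# Minimal sketch of a pending-queue check, assuming the production worker
# type is "balrogworker-v1" under the "scriptworker-prov-v1" provisioner
# (both names are assumptions) and nagios-style exit codes.
import sys

import taskcluster  # pip install taskcluster

WARN, CRIT = 5, 20  # made-up thresholds

queue = taskcluster.Queue()  # newer clients also need {'rootUrl': ...}
pending = queue.pendingTasks('scriptworker-prov-v1', 'balrogworker-v1')['pendingTasks']

if pending >= CRIT:
    print('CRITICAL: %d pending balrog tasks' % pending)
    sys.exit(2)
if pending >= WARN:
    print('WARNING: %d pending balrog tasks' % pending)
    sys.exit(1)
print('OK: %d pending balrog tasks' % pending)
sys.exit(0)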
See Also: → 1307565
Leftovers and status update:

(In reply to Mihai Tabara [:mtabara] from comment #0)
> A. Machine related things:
> a8. Have them reviewed and deployed with [5].

Aki suggested we start with only one instance first to ease our job whilst debugging the first iterations. Created a first draft PR at https://github.com/mozilla-releng/build-cloud-tools/pull/256

> a9. Get myself access to the prod AWS manager with invtools available, to
> run the necessary scripts to start the production instance.

Will use https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#AWS_machines once the aforementioned PR gets deployed.

> B. Puppet related things:
> b14. The secrets eventually need to be stored in hiera.

Will do this just before landing the puppet patch to production (see the sketch below).

> b15-b18: adapt staging puppet changes to production

Just pushed code to mozreview.
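On b14, the hiera side is plain key/value YAML in the private repo. A hypothetical sketch follows; the key names are made up (the real ones have to match what the puppet manifests look up) and the actual values never leave the private repo.

# Hypothetical hiera entries; key names are illustrative and the real
# values live only in the private hiera repo.
balrogworker_taskcluster_client_id: "project/releng/scriptworker/balrogworker"
balrogworker_taskcluster_access_token: "********"
balrog_api_username: "********"
balrog_api_password: "********"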
Comment on attachment 8797846 [details]
Bug 1306753 - add balrog scriptworker manifests.

https://reviewboard.mozilla.org/r/83466/#review82082

Looks good!
Attachment #8797846 - Flags: review?(aki) → review+
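For context, the reviewed manifests hook the new hosts into puppet via node definitions. A rough sketch of what such an entry in manifests/moco-nodes.pp looks like; the class name and aspect are assumptions - the reviewed attachment above is authoritative.

# Rough sketch only; the class name and aspect are assumptions, see the
# reviewed attachment for the real definition.
node /^balrogworker-\d+\.srv\.releng\..*\.mozilla\.com$/ {
    $aspects = [ 'high-security' ]
    include toplevel::server::balrogscriptworker
}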
Blocks: 1308053
Note to self: there are two mistakes I made:

1. I was supposed to run the PR through the build-puppet git repo to make sure whatever linter checks we have there pass. I failed to pass the checks once we merged to production.
2. I was supposed to add the secrets to hiera beforehand and deploy the puppet changes before we started following https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#AWS_machines

So the right order was supposed to be:

* make sure there are no leftovers - I almost forgot about the requests-2.11.1 deployment under puppet pypi
* feed the prod secrets to hiera
* land the puppet changes and merge to production (always PR the build-<repo> equivalent on github to have the checks pre-run)
* land the build-cloud-tools changes
* use https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#AWS_machines to create the instance(s)
* test the instance

I still have to add nagios alerts but that's tracked in a separate bug, chained to this one.

I've tested with a dummy task that fails as expected with a malformed payload - https://tools.taskcluster.net/task-inspector/#DEsK_aiDTrOqINHLl4AjBA/0 (see the sketch below for roughly what such a task looks like). Our new balrogworker is balrogworker-1.srv.releng.use1.mozilla.com

I'll close this now but will track the nagios alerts and the integration & testing with nightlies in separate bugs.
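For reference, the smoke test above amounts to creating a task against the new worker type with a payload that scriptworker's schema validation rejects. Roughly like this; the provisioner/worker type names match the scopes earlier in this bug, and the remaining values are illustrative.

{
  "provisionerId": "scriptworker-prov-v1",
  "workerType": "balrogworker-v1",
  "created": "2016-10-06T00:00:00.000Z",
  "deadline": "2016-10-06T01:00:00.000Z",
  "payload": {},
  "metadata": {
    "name": "balrogworker smoke test",
    "description": "deliberately malformed payload; the worker should fail it",
    "owner": "mtabara@mozilla.com",
    "source": "https://bugzilla.mozilla.org/show_bug.cgi?id=1306753"
  }
}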
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: General Automation → General