Closed Bug 1330276 Opened 3 years ago Closed 3 years ago

[tcmigration] Upgrade beetmover and balrog workers to CoT on date

Categories

(Release Engineering :: General, defect)


Tracking

(firefox53 fixed)

RESOLVED FIXED

People

(Reporter: mtabara, Assigned: mtabara)

References

Details

(Whiteboard: toverify)

Attachments

(9 files)

47 bytes, text/x-github-pull-request
47 bytes, text/x-github-pull-request
54 bytes, text/x-github-pull-request (aki: review+, mtabara: checked-in+)
59 bytes, text/x-review-board-request (aki: review+, mtabara: checked-in+)
59 bytes, text/x-review-board-request (mtabara: checked-in+)
59 bytes, text/x-review-board-request (aki: review+, mtabara: checked-in+)
59 bytes, text/x-review-board-request (aki: review+, mtabara: checked-in+)
59 bytes, text/x-review-board-request (aki: review+, mtabara: checked-in+)
54 bytes, text/x-github-pull-request (aki: review+, mtabara: checked-in+)

This bug tracks the work to make the existing {beetmover,balrog}worker instances on `date` CoT-enabled. It focuses mostly on the instances and the puppetization needed for a smooth cut-over.

Context:
a. for the past few months we have had only one worker each for beetmover[1] and balrog[2] on date
b. we've ramped up another instance for each, [3] and [4]; they are up, but not yet 100% tested to work at full capacity. Will do that in a bit.
c. none of the four instances[1][2][3][4] is CoT-enabled yet, as they have been pinned to scriptworker 0.7.x for a while now
d. the puppet patches that upgrade {beetmover,balrog}worker to the latest version of scriptworker are here[5][6]

My plan is as follows:

1. Fix [2] errors from b.
2. Create a new "0.7.x" environment in puppet that points to current production tip of [7]
3. Pin [3] and [4] to this newly created environment
4. Stop [3][4] instances from aws console in order to prevent them claiming tasks from the TC queue
5. Upgrade puppet with CoT-enabling patches and change the moco-nodes.pp regex to match the "use1" workers only.
6. Fix any potential bustage that may come up
7. Add GPG key pairs for [3] and [4] both in hiera and cot-gpg-keys

Once the 'upstreamArtifacts' in-tree patch lands on `date` and new nightlies are triggered, we can test the {beetmover,balrog}worker(s).
8. when everything is smooth, unpin [3][4] 

[1]: beetmoverworker-1.srv.releng.use1.mozilla.com
[2]: balrogworker-1.srv.releng.use1.mozilla.com
[3]: beetmoverworker-2.srv.releng.usw1.mozilla.com
[4]: balrogworker-2.srv.releng.usw1.mozilla.com
[5]: https://github.com/mozilla/build-puppet/pull/23
[6]: https://github.com/mozilla/build-puppet/pull/26
[7]: https://github.com/MihaiTabara/build-puppet
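
For illustration only, the "use1 only" match from step 5 could be sketched like this. The hostnames are the ones listed in [1]-[4]; the pattern itself is an assumption and is not the actual regex from moco-nodes.pp.

```python
import re

# Hostnames copied from references [1]-[4] above.
WORKERS = [
    "beetmoverworker-1.srv.releng.use1.mozilla.com",
    "balrogworker-1.srv.releng.use1.mozilla.com",
    "beetmoverworker-2.srv.releng.usw1.mozilla.com",
    "balrogworker-2.srv.releng.usw1.mozilla.com",
]

# Assumed pattern (hypothetical): any beetmover/balrog worker in use1.
USE1_ONLY = re.compile(
    r"^(beetmover|balrog)worker-\d+\.srv\.releng\.use1\.mozilla\.com$"
)


def matched_nodes(hostnames):
    """Return the hostnames that the use1-only pattern would select."""
    return [h for h in hostnames if USE1_ONLY.match(h)]


if __name__ == "__main__":
    for host in matched_nodes(WORKERS):
        print(host)
```

With a pattern like this, only the two "-1" workers would pick up the CoT-enabling manifests, while the "-2" workers stay untouched.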
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #0)
> My plan is as follows:
> 
> 1. Fix [2] errors from b.

Done. I had forgotten to pin [4] to the <production> branch of https://github.com/mozilla-releng/balrogscript, so puppet pulled HEAD of the repo, which points to the latest, CoT-able changes.

Turned off [1] and reran https://tools.taskcluster.net/task-group-inspector/#/XfgSRZ-PRV2g5z4d3tHDqw/b4lmOPnYTvumVpkR4QXeRQ?_k=1mn9c3 to make sure it is grabbed by [3] successfully.

Did the same with [2] and [4], successfully retriggering https://tools.taskcluster.net/task-group-inspector/#/dLIbdzkWTz-6Wnufa0v_Ig/VQMwuDjSSVKZa-BQ7gYnIQ?_k=mv6c3m
Temporarily pin the usw2 instances to the mtabara environment while the use1 instances stay on puppet production.
Attachment #8825792 - Flags: review?(bugspam.Callek)
Attachment #8825792 - Flags: review?(bugspam.Callek) → review?(arich)
Attachment #8825792 - Flags: review?(arich) → review?
I talked with mihai, and he's going to do development in his puppet env instead of checking things into the default/production branch at this point. No review needed.
Comment on attachment 8825792 [details] [review]
Bug 1330276 - temporarily pin workers to specific environment.

No longer r? needed. Thanks arr for showing me a better way to do this ;)
Attachment #8825792 - Flags: review?
Status update:
* All four machines are fully working, based on puppet production
* I am using https://github.com/MihaiTabara/build-puppet/tree/cot in my puppet environment to pin the configs and to upgrade iteratively both the beetmoverworker and balrogworker
* (balrogworker-2.srv.releng.usw2.mozilla.com, beetmoverworker-2.srv.releng.usw2.mozilla.com) will be our non-CoT backup, talking to puppet production. I will shut them down to prevent them claiming tasks from TC Queue to the detriment of the other set (balrogworker-1.srv.releng.use1.mozilla.com, beetmoverworker-1.srv.releng.use1.mozilla.com)
* the latter set will be pinned to my environment and upgraded to CoT-compatible
* once we're happy about the results and everything goes well, we can roll the changes to puppet default/production and unpin the (balrogworker-1.srv.releng.use1.mozilla.com, beetmoverworker-1.srv.releng.use1.mozilla.com)
* if all goes well, after that we can resurrect the (balrogworker-2.srv.releng.usw2.mozilla.com, beetmoverworker-2.srv.releng.usw2.mozilla.com) too - they will get the latest changes and hence will automatically update to CoT-enabled.

The main benefit of this strategy that :arr advised me to use is that we keep the "testing in production" contained as much as possible, without affecting the other machines nor touching puppet default/production branches. Additionally, we have two machines as backup stopped that can be resurrected anytime should we fail to upgrade to CoT.
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #0)
> My plan is as follows:
> 
> 1. Fix [2] errors from b.
> 2. Create a new "0.7.x" environment in puppet that points to current
> production tip of [7]
> 3. Pin [3] and [4] to this newly created environment
> 4. Stop [3][4] instances from aws console in order to prevent them claiming

Done so far.

(To rephrase: {beetmover/balrog}worker-1 have been pinned to my environment and are up next for CoT-enabling surgery. In the meantime, {beetmover/balrog}worker-2 are stopped and kept as fallback in case things go south.)
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #0)
> 5. Upgrade puppet with CoT-enabling patches and change the moco-nodes.pp
> regex to match the "use1" workers only.

Done. I merged the patches from [5] and [6] under https://github.com/MihaiTabara/build-puppet/tree/cot and landed the changes to reflect new CoT-able workers in [1] and [2]. Everything looks good so far; I still need to test the workers to confirm everything works as expected.

> [1]: beetmoverworker-1.srv.releng.use1.mozilla.com
> [2]: balrogworker-1.srv.releng.use1.mozilla.com
> [3]: beetmoverworker-2.srv.releng.usw1.mozilla.com
> [4]: balrogworker-2.srv.releng.usw1.mozilla.com
> [5]: https://github.com/mozilla/build-puppet/pull/23
> [6]: https://github.com/mozilla/build-puppet/pull/26
> [7]: https://github.com/MihaiTabara/build-puppet
Dropping the PR here for the moment to keep a reference.
Once the migration is completed and tested, I'll diff against tip of default/puppet and ask for review.
Preparing these in advance for when the migration is complete and tested. The backup machines can then be restarted and automatically puppetized with default tip of puppet, hence picking up the CoT changes. In order for the gpg homedirs to work, public keys need to be in place. Will add them in hiera shortly.
Attachment #8826153 - Flags: review?(aki)
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #0)
> 7. Add GPG key pairs for [3] and [4] both in hiera and cot-gpg-keys
> 
> [3]: beetmoverworker-2.srv.releng.usw2.mozilla.com
> [4]: balrogworker-2.srv.releng.usw2.mozilla.com

This is done now too, a bit in advance of whenever we roll out the backup instances [3][4], currently stopped from aws-console.
Comment on attachment 8826153 [details] [review]
Add GPG keys for usw2 instances for both beetmover and balrog new instances.

Merged and tagged per https://github.com/mozilla-releng/cot-gpg-keys/blob/master/README.md
Attachment #8826153 - Flags: review?(aki) → review+
Copy-pasting from IRC for later use.

<mtabara> Callek: I'm a bit lost with all the staging / production balrog conversation from the last 2 days. Let me see if I got it right. Are the following facts true?
<mtabara> 1. current graphs running are pushing to Firefox/Fennec-date-nightly-latest in Balrog staging and also have the firefox/mobile.js update server set to the staging instance.
<mtabara> 2. once the graphs are green and work as expected, hopefully ASAP, we need to land a patch on `date` that is basically reverting https://irccloud.mozilla.com/pastebin/eLtTKRWm/ 

on the nightly build *before* we switch to production balrog, we'll need to revert -- 

diff --git a/browser/app/profile/firefox.js b/browser/app/profile/firefox.js
--- a/browser/app/profile/firefox.js
+++ b/browser/app/profile/firefox.js
@@ -135,7 +135,7 @@
 pref("app.update.staging.enabled", true);
 
 // Update service URL:
-pref("app.update.url", "https://aus5.mozilla.org/update/6/%PRODUCT%/%VERSION%/%BUILD_ID%/%BUILD_TARGET%/%LOCALE%/%CHANNEL%/%OS_VERSION%/%SYSTEM_CAPABILITIES%/%DISTRIBUTION%/%DISTRIBUTION_VERSION%/update.xml");
+pref("app.update.url", "https://aus4.stage.mozaws.net/update/6/%PRODUCT%/%VERSION%/%BUILD_ID%/%BUILD_TARGET%/%LOCALE%/%CHANNEL%/%OS_VERSION%/%SYSTEM_CAPABILITIES%/%DISTRIBUTION%/%DISTRIBUTION_VERSION%/update.xml");
 // app.update.url.manual is in branding section
 // app.update.url.details is in branding section
 
diff --git a/mobile/android/app/mobile.js b/mobile/android/app/mobile.js
--- a/mobile/android/app/mobile.js
+++ b/mobile/android/app/mobile.js
@@ -540,7 +540,7 @@
 // used by update service to decide whether or not to
 // automatically download an update
 pref("app.update.autodownload", "wifi");
-pref("app.update.url.android", "https://aus5.mozilla.org/update/4/%PRODUCT%/%VERSION%/%BUILD_ID%/%BUILD_TARGET%/%LOCALE%/%CHANNEL%/%OS_VERSION%/%DISTRIBUTION%/%DISTRIBUTION_VERSION%/%MOZ_VERSION%/update.xml");
+pref("app.update.url.android", "https://aus4.stage.mozaws.net/update/4/%PRODUCT%/%VERSION%/%BUILD_ID%/%BUILD_TARGET%/%LOCALE%/%CHANNEL%/%OS_VERSION%/%DISTRIBUTION%/%DISTRIBUTION_VERSION%/%MOZ_VERSION%/update.xml");
 
 #ifdef MOZ_UPDATER
 /* prefs used specifically for updating the app */

and retrigger nightlies. (Basically this changes nothing on the surface, the only difference being that the end-user product binaries have that value flipped internally)
<mtabara> 3. Once the graphs from above are green again (updates are still sent to Firefox/Fennec-date-nightly-latest in Balrog staging) we need to flip https://github.com/MihaiTabara/build-puppet/blob/cot/modules/balrog_scriptworker/manifests/settings.pp#L17 to use Balrog Production username/password. Rebuild balrogworker with this change and retrigger nightlies. At this point, once the graph is green, we should be seeing `Firefox/Fennec-date-nightly-latest in Balrog production` and everyone is happy and we can move forward.
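
As a rough sketch of what the revert in step 2 amounts to: only the update server host changes, while the templated path stays the same. The two hosts and the desktop path are taken from the diff above; the helper name is hypothetical and this is not code from the tree.

```python
# Hosts taken from the firefox.js / mobile.js diff above.
PRODUCTION = "https://aus5.mozilla.org"
STAGING = "https://aus4.stage.mozaws.net"

# Desktop update path from firefox.js; the %...% placeholders are filled in
# by the client at runtime and are left untouched here.
DESKTOP_PATH = (
    "/update/6/%PRODUCT%/%VERSION%/%BUILD_ID%/%BUILD_TARGET%/%LOCALE%/"
    "%CHANNEL%/%OS_VERSION%/%SYSTEM_CAPABILITIES%/%DISTRIBUTION%/"
    "%DISTRIBUTION_VERSION%/update.xml"
)


def revert_to_production(pref_url):
    """Swap the staging host for the production one, keeping the path.

    Hypothetical helper: URLs that do not point at staging are returned
    unchanged, so applying it twice is harmless.
    """
    if not pref_url.startswith(STAGING):
        return pref_url
    return PRODUCTION + pref_url[len(STAGING):]


if __name__ == "__main__":
    print(revert_to_production(STAGING + DESKTOP_PATH))
```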
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #0)
> My plan is as follows:
> 
> 1. Fix [2] errors from b.
> 2. Create a new "0.7.x" environment in puppet that points to current
> production tip of [7]
> 3. Pin [3] and [4] to this newly created environment
> 4. Stop [3][4] instances from aws console in order to prevent them claiming
> tasks from the TC queue
> 5. Upgrade puppet with CoT-enabling patches and change the moco-nodes.pp
> regex to match the "use1" workers only.
> 6. Fix any potential bustage that may come up
> 7. Add GPG key pairs for [3] and [4] both in hiera and cot-gpg-keys
> 
> Once the 'upstreamArtifacts' in-tree patch lands on `date` and new nightlies
> are triggered, we can test the {beetmover,balrog}worker(s).
> 8. when everything is smooth, unpin [3][4] 
> [1]: beetmoverworker-1.srv.releng.use1.mozilla.com
> [2]: balrogworker-1.srv.releng.use1.mozilla.com
> [3]: beetmoverworker-2.srv.releng.usw1.mozilla.com
> [4]: balrogworker-2.srv.releng.usw1.mozilla.com
> [5]: https://github.com/mozilla/build-puppet/pull/23
> [6]: https://github.com/mozilla/build-puppet/pull/26
> [7]: https://github.com/MihaiTabara/build-puppet

All done now - in-tree patches have landed, various misc bugs have been addressed in puppet, scripts and task-graph. Graphs are now green for both Fennec and Desktop. CoT pieces have largely and nicely connected. I've unpinned the usw2 instances (non-CoT so far, stopped in the aws console and kept as backup) and upgraded them to CoT too.

Therefore we now have four workers, all CoT-able against https://github.com/mozilla/build-puppet/pull/28.

beetmoverworker-1.srv.releng.use1.mozilla.com
beetmoverworker-2.srv.releng.usw1.mozilla.com
balrogworker-1.srv.releng.use1.mozilla.com
balrogworker-2.srv.releng.usw1.mozilla.com

Having two workers per logic piece (beetmover, balrog) makes things twice as fast from now on.
I'll keep the bug open for another 1-2 days to make sure all four workers nicely claim work.
Whiteboard: toverify
Shortly, sometime this week, I'll dump the build-puppet patch, move it one last time through reviewboard and land it to default/production puppet. Then, I'll unpin the current workers from my environment and let them grab their stuff from puppet.
Is there anything you can think of that would make us back out to 0.7.x, that you would find after waiting a few more days?
I can't think of anything, so I'd lean towards rolling out to prod puppet and closing this bug sooner.
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #13)
> All done now - in-tree patches have landed, various misc bugs have been
> addressed in puppet, scripts and task-graph. Graphs are now green for both
> Fennec and Desktop. CoT pieces have largely and nicely connected. I've
> unpinned the usw2 (non-CoT so far stopped in aws console and kept as
> back-up) and upgraded them to CoT too.
> 
> Therefore we now have four workers, all CoT-able against
> https://github.com/mozilla/build-puppet/pull/28.

Forgot to thank :aki, :jlund and :Callek for help in dealing with the above hiccups!

(In reply to Aki Sasaki [:aki] from comment #15)
> Is there anything you can think of that would make us back out to 0.7.x,
> that you would find after waiting a few more days?
> I can't think of anything, so I'd lean towards rolling out to prod puppet
> and closing this bug sooner.

To be honest, I think we're far past the inflection point where we'd want to go back. I'd even take a 1-2 day bustage on date rather than go back. CoT FTW! :)

The only thing on my mind is the upcoming checksums munging in beetmoverscript, which is likely to require some beetmoverscript pypi bumps or the like. From that perspective I thought it'd be easier to bump in my environment rather than pushing to default and merging. But overall, I think you're right, that is actually outside of this bug's scope.
See Also: → 1331611
Comment on attachment 8827479 [details]
Bug 1330276 - Upgrade beetemover and balrog workers to CoT.

https://reviewboard.mozilla.org/r/105156/#review105978

Thanks! We can either make the /builds/scriptworker change now, or in a followup bug.

::: modules/balrog_scriptworker/manifests/settings.pp:2
(Diff revision 1)
>  class balrog_scriptworker::settings {
> -    include ::config
> +    $root = "/builds/balrogworker"

To allow for shared nagios checks, we'll need this to be /builds/scriptworker.

::: modules/beetmover_scriptworker/manifests/settings.pp:2
(Diff revision 1)
>  class beetmover_scriptworker::settings {
> -    include ::config
> +    $root = "/builds/beetmoverworker"

same here.
Attachment #8827479 - Flags: review?(aki) → review+
Attachment #8827517 - Flags: review?(aki)
Attachment #8827518 - Flags: review?(aki)
Comment on attachment 8827518 [details]
Tweak workers to use shared scriptworker location.

https://reviewboard.mozilla.org/r/105178/#review105990
Attachment #8827518 - Flags: review?(aki) → review+
Comment on attachment 8827538 [details]
Bug 1330276 - nicely handle the switch from dev to production.

https://reviewboard.mozilla.org/r/105192/#review105998

We can deal with separate beetmover creds in https://github.com/mozilla-releng/beetmoverscript/issues/28 .
Attachment #8827538 - Flags: review?(aki) → review+
Comment on attachment 8827517 [details]
Switch balrogworker to use Balrog production credentials.

Comments addressed via IRC and r+'ed already.
Attachment #8827517 - Flags: review?(aki)
Comment on attachment 8826153 [details] [review]
Add GPG keys for usw2 instances for both beetmover and balrog new instances.

https://hg.mozilla.org/build/puppet/rev/31b4ede393da and merged to production.
Attachment #8826153 - Flags: checked-in+
Comment on attachment 8827479 [details]
Bug 1330276 - Upgrade beetemover and balrog workers to CoT.

https://hg.mozilla.org/build/puppet/rev/31b4ede393da and merged to production.
Attachment #8827479 - Flags: checked-in+
Comment on attachment 8827517 [details]
Switch balrogworker to use Balrog production credentials.

https://hg.mozilla.org/build/puppet/rev/31b4ede393da and merged to production.
Attachment #8827517 - Flags: checked-in+
Comment on attachment 8827518 [details]
Tweak workers to use shared scriptworker location.

https://hg.mozilla.org/build/puppet/rev/31b4ede393da and merged to production.
Attachment #8827518 - Flags: checked-in+
Comment on attachment 8827538 [details]
Bug 1330276 - nicely handle the switch from dev to production.

https://hg.mozilla.org/build/puppet/rev/31b4ede393da and merged to production.
Attachment #8827538 - Flags: checked-in+
Comment on attachment 8827633 [details]
Bug 1330276 - remove unused signing_cert in balrogscript.

https://reviewboard.mozilla.org/r/105248/#review106080
Attachment #8827633 - Flags: review?(aki) → review+
Comment on attachment 8827651 [details] [review]
Remove signing-cert verification from balrogscript as it is not used.

As noted in the PR, this looks good, but highlights the fact we still need to fix https://github.com/mozilla-releng/balrogscript/issues/16 .
Attachment #8827651 - Flags: review?(aki) → review+
Comment on attachment 8827633 [details]
Bug 1330276 - remove unused signing_cert in balrogscript.

https://hg.mozilla.org/projects/date/rev/91e9227e02a7
Attachment #8827633 - Flags: checked-in+
Comment on attachment 8827651 [details] [review]
Remove signing-cert verification from balrogscript as it is not used.

https://github.com/mozilla-releng/balrogscript/commit/f125e9b886157e329fb03064c5f0e919fbfa8335
Attachment #8827651 - Flags: checked-in+
(In reply to Aki Sasaki [:aki] from comment #33)
> Comment on attachment 8827651 [details] [review]
> Remove signing-cert verification from balrogscript as it is not used.
> 
> As noted in the PR, this looks good, but highlights the fact we still need
> to fix https://github.com/mozilla-releng/balrogscript/issues/16 .

https://github.com/mozilla-releng/balrogscript/issues/16#issuecomment-273351095 expands on this comment.
https://hg.mozilla.org/mozilla-central/rev/0bf730c7d8cb
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Reopening this. Would like to keep the bug open for another 1-2 days until I add the scopes / channel check in balrogscript.
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #39)
> Reopening this. Would like to keep the bug open for another 1-2 days until I
> add the scopes / channel check in balrogscript.

Mihai: is this still in progress?
Flags: needinfo?(mtabara)
(In reply to Chris Cooper [:coop] from comment #40)
> (In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #39)
> > Reopening this. Would like to keep the bug open for another 1-2 days until I
> > add the scopes / channel check in balrogscript.
> 
> Mihai: is this still in progress?

Yep. :aki and I discussed last week adding this scopes check too, to improve security.
I'll address this as soon as I finish checksums.
Flags: needinfo?(mtabara)
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #39)
> Reopening this. Would like to keep the bug open for another 1-2 days until I
> add the scopes / channel check in balrogscript.

We now have this check described under https://github.com/mozilla-releng/balrogscript/issues/16 and landed in https://github.com/mozilla-releng/balrogscript/commit/b4b4d8e7c6281f669e092fd45413ec994c495321. Beetmoverworkers are to be updated to use 0.0.6 soon.

We can now close this.
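
For readers unfamiliar with the idea, a scopes / channel check can be sketched roughly as below. This is not the code that landed in balrogscript: the scope prefix, the allowed channel set and the function name are all hypothetical and only illustrate deriving a single channel from a task's scopes and refusing anything unexpected.

```python
# Hypothetical values, for illustration only; the real scope names and
# channels used by balrogscript may differ.
SCOPE_PREFIX = "project:releng:balrog:"
ALLOWED_CHANNELS = {"nightly"}


def channel_from_scopes(scopes):
    """Return the single channel granted by the task scopes.

    Raises ValueError when no channel scope is present, when more than one
    is present, or when the channel is not in the allowed set.
    """
    channels = {
        scope[len(SCOPE_PREFIX):]
        for scope in scopes
        if scope.startswith(SCOPE_PREFIX)
    }
    if len(channels) != 1:
        raise ValueError("expected exactly one channel scope, got %d" % len(channels))
    channel = channels.pop()
    if channel not in ALLOWED_CHANNELS:
        raise ValueError("channel %r is not allowed" % channel)
    return channel


if __name__ == "__main__":
    print(channel_from_scopes(["project:releng:balrog:nightly"]))
```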
Status: REOPENED → RESOLVED
Closed: 3 years ago → 3 years ago
Resolution: --- → FIXED
No longer blocks: 1329944
See Also: → 1329944
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
Component: General Automation → General