Closed
Bug 1330276
Opened 8 years ago
Closed 8 years ago
[tcmigration] Upgrade beetmover and balrog workers to CoT on date
Categories: Release Engineering :: General (defect)
Tracking: RESOLVED FIXED
firefox53: fixed
People
(Reporter: mtabara, Assigned: mtabara)
References
Details
(Whiteboard: toverify)
Attachments
(9 files)
1. 47 bytes, text/x-github-pull-request
2. 47 bytes, text/x-github-pull-request
3. 54 bytes, text/x-github-pull-request (mozilla: review+, mtabara: checked-in+)
4. 59 bytes, text/x-review-board-request (mozilla: review+, mtabara: checked-in+)
5. 59 bytes, text/x-review-board-request (mtabara: checked-in+)
6. 59 bytes, text/x-review-board-request (mozilla: review+, mtabara: checked-in+)
7. 59 bytes, text/x-review-board-request (mozilla: review+, mtabara: checked-in+)
8. 59 bytes, text/x-review-board-request (mozilla: review+, mtabara: checked-in+)
9. 54 bytes, text/x-github-pull-request (mozilla: review+, mtabara: checked-in+)
This bug tracks the progress of making the existing {beetmover,balrog}worker instances on `date` CoT-enabled. It focuses mostly on the instances and the puppetization that needs to be taken care of in order to make a smooth cut-over.
Context:
a. in the past months we had only one worker each for beetmover[1] and balrog[2] on date
b. we've ramped up another instance of each, namely [3] and [4]; they are currently ramped up but not yet fully tested at capacity. Will do that shortly.
c. none of the four instances [1][2][3][4] is CoT-enabled yet, as they have been pinned to scriptworker 0.7.x for a while now
d. the puppet work to upgrade {beetmover,balrog}worker to the latest version of scriptworker lives in [5][6]
My plan is as follows:
1. Fix [2] errors from b.
2. Create a new "0.7.x" environment in puppet that points to current production tip of [7]
3. Pin [3] and [4] to this newly created environment
4. Stop instances [3][4] from the AWS console to prevent them from claiming tasks from the TC queue
5. Upgrade puppet with CoT-enabling patches and change the moco-nodes.pp regex to match the "use1" workers only.
6. Fix any potential bustage that may come up
7. Add GPG key pairs for [3] and [4] both in hiera and cot-gpg-keys
Once the 'upstreamArtifacts' in-tree patch lands on `date` and new nightlies are triggered, we can test the {beetmover,balrog}worker(s).
8. When everything is smooth, unpin [3][4]
[1]: beetmoverworker-1.srv.releng.use1.mozilla.com
[2]: balrogworker-1.srv.releng.use1.mozilla.com
[3]: beetmoverworker-2.srv.releng.usw1.mozilla.com
[4]: balrogworker-2.srv.releng.usw1.mozilla.com
[5]: https://github.com/mozilla/build-puppet/pull/23
[6]: https://github.com/mozilla/build-puppet/pull/26
[7]: https://github.com/MihaiTabara/build-puppet
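Step 5 of the plan keys off a hostname regex in moco-nodes.pp so that only the use1 workers pick up the CoT-enabled config. A rough Python illustration of that gating logic follows; the regex and the `gets_cot_config` helper are hypothetical stand-ins (the real matching is done by puppet node definitions, not Python), but the hostnames are [1]-[4] above.

```python
import re

# Hypothetical equivalent of the moco-nodes.pp gating: only use1
# {beetmover,balrog}worker hosts should receive the CoT-enabled config.
USE1_WORKER_RE = re.compile(
    r"^(beetmover|balrog)worker-\d+\.srv\.releng\.use1\.mozilla\.com$"
)

def gets_cot_config(fqdn):
    """Return True if this host matches the use1-only worker pattern."""
    return USE1_WORKER_RE.match(fqdn) is not None

hosts = [
    "beetmoverworker-1.srv.releng.use1.mozilla.com",
    "balrogworker-1.srv.releng.use1.mozilla.com",
    "beetmoverworker-2.srv.releng.usw1.mozilla.com",
    "balrogworker-2.srv.releng.usw1.mozilla.com",
]
cot_hosts = [h for h in hosts if gets_cot_config(h)]
```

With this gating, the -2 instances stay on the old config and can serve as the untouched fallback while the -1 instances are upgraded.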
Comment 1 • 8 years ago (Assignee)
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #0)
> My plan is as follows:
>
> 1. Fix [2] errors from b.
Done - I had forgotten to pin [4] to the <production> branch of https://github.com/mozilla-releng/balrogscript, so puppet pulled the HEAD of the repo, which points to the latest, CoT-able changes.
Turned off [1] and reran https://tools.taskcluster.net/task-group-inspector/#/XfgSRZ-PRV2g5z4d3tHDqw/b4lmOPnYTvumVpkR4QXeRQ?_k=1mn9c3 to make sure the work is grabbed by [3] successfully.
Did the same with [2] and [4] by successfully retriggering https://tools.taskcluster.net/task-group-inspector/#/dLIbdzkWTz-6Wnufa0v_Ig/VQMwuDjSSVKZa-BQ7gYnIQ?_k=mv6c3m
Comment 2 • 8 years ago (Assignee)
Temporarily pin the usw2 instances to the mtabara environment while use1 stays on track for puppet production.
Attachment #8825792 - Flags: review?(bugspam.Callek)
Updated • 8 years ago (Assignee)
Attachment #8825792 - Flags: review?(bugspam.Callek) → review?(arich)
Updated • 8 years ago
Attachment #8825792 - Flags: review?(arich) → review?
Comment 3 • 8 years ago
I talked with Mihai, and he's going to do development in his puppet env instead of checking things into the default/production branch at this point. No review needed.
Comment 4 • 8 years ago (Assignee)
Comment on attachment 8825792 [details] [review]
Bug 1330276 - temporarily pin workers to specific environment.
No longer needs r?. Thanks arr for showing me a better way to do this ;)
Attachment #8825792 - Flags: review?
Comment 5 • 8 years ago (Assignee)
Status update:
* All four machines are fully working, based on puppet production
* I am using https://github.com/MihaiTabara/build-puppet/tree/cot in my puppet environment to pin the configs and to upgrade iteratively both the beetmoverworker and balrogworker
* (balrogworker-2.srv.releng.usw2.mozilla.com, beetmoverworker-2.srv.releng.usw2.mozilla.com) will be our non-CoT backup, talking to puppet production. I will shut them down to prevent them from claiming tasks from the TC queue at the expense of the other set (balrogworker-1.srv.releng.use1.mozilla.com, beetmoverworker-1.srv.releng.use1.mozilla.com)
* the latter set will be pinned to my environment and upgraded to CoT-compatible
* once we're happy with the results and everything goes well, we can roll the changes into puppet default/production and unpin (balrogworker-1.srv.releng.use1.mozilla.com, beetmoverworker-1.srv.releng.use1.mozilla.com)
* if all goes well, after that we can resurrect (balrogworker-2.srv.releng.usw2.mozilla.com, beetmoverworker-2.srv.releng.usw2.mozilla.com) too; they will pick up the latest changes and hence automatically become CoT-enabled.
The main benefit of this strategy, which :arr advised me to use, is that we keep the "testing in production" as contained as possible, without affecting the other machines or touching the puppet default/production branches. Additionally, we keep two stopped backup machines that can be resurrected at any time should we fail to upgrade to CoT.
Comment 6 • 8 years ago (Assignee)
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #0)
> My plan is as follows:
>
> 1. Fix [2] errors from b.
> 2. Create a new "0.7.x" environment in puppet that points to current
> production tip of [7]
> 3. Pin [3] and [4] to this newly created environment
> 4. Stop [3][4] instances from aws console in order to prevent them claiming
Done so far.
To rephrase: {beetmover,balrog}worker-1 have been pinned to my environment and are up next for CoT-enabling surgery. In the meantime, {beetmover,balrog}worker-2 are stopped and kept as a fallback in case things go south.
Comment 7 • 8 years ago (Assignee)
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #0)
> 5. Upgrade puppet with CoT-enabling patches and change the moco-nodes.pp
> regex to match the "use1" workers only.
Done. I merged the patches from [5] and [6] under https://github.com/MihaiTabara/build-puppet/tree/cot and landed the changes to reflect the new CoT-able workers [1] and [2]. Everything looks good so far; I still need to test the workers to confirm everything works as expected.
> [1]: beetmoverworker-1.srv.releng.use1.mozilla.com
> [2]: balrogworker-1.srv.releng.use1.mozilla.com
> [3]: beetmoverworker-2.srv.releng.usw1.mozilla.com
> [4]: balrogworker-2.srv.releng.usw1.mozilla.com
> [5]: https://github.com/mozilla/build-puppet/pull/23
> [6]: https://github.com/mozilla/build-puppet/pull/26
> [7]: https://github.com/MihaiTabara/build-puppet
Comment 8 • 8 years ago (Assignee)
Dropping the PR here for the moment to keep a reference.
Once the migration is completed and tested, I'll diff against tip of default/puppet and ask for review.
Comment 9 • 8 years ago (Assignee)
Preparing these in advance for when the migration is complete and tested. The backup machines can then be restarted and automatically puppetized with the default tip of puppet, hence picking up the CoT changes. For the gpg homedirs to work, the public keys need to be in place. Will add them to hiera shortly.
Attachment #8826153 - Flags: review?(aki)
Comment 10 • 8 years ago (Assignee)
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #0)
> 7. Add GPG keys pairs for [3] and [4] both in hiera and cot-gpg-keys
>
> [3]: beetmoverworker-2.srv.releng.usw2.mozilla.com
> [4]: balrogworker-2.srv.releng.usw2.mozilla.com
This is done now too, a bit in advance of whenever we roll out the backup instances [3][4], which are currently stopped from the AWS console.
Comment 11 • 8 years ago
Comment on attachment 8826153 [details] [review]
Add GPG keys for usw2 instances for both beetmover and balrog new instances.
Merged and tagged per https://github.com/mozilla-releng/cot-gpg-keys/blob/master/README.md
Attachment #8826153 - Flags: review?(aki) → review+
Comment 12 • 8 years ago (Assignee)
Copy-pasting from IRC for later use.
mtabara> Callek: I'm a bit lost with all the staging / production balrog conversation from the last 2 days. Let me see if I got it right. Are the following facts true?
<mtabara> 1. current graphs running are pushing to Firefox/Fennec-date-nightly-latest in Balrog staging and also have the firefox/mobile.js update server set to the staging instance.
<mtabara> 2. once the graphs are green and work as expected, hopefully ASAP, we need to land a patch on `date` that is basically reverting https://irccloud.mozilla.com/pastebin/eLtTKRWm/
on the nightly build *before* we switch to production balrog, we'll need to revert --
diff --git a/browser/app/profile/firefox.js b/browser/app/profile/firefox.js
--- a/browser/app/profile/firefox.js
+++ b/browser/app/profile/firefox.js
@@ -135,7 +135,7 @@
pref("app.update.staging.enabled", true);
// Update service URL:
-pref("app.update.url", "https://aus5.mozilla.org/update/6/%PRODUCT%/%VERSION%/%BUILD_ID%/%BUILD_TARGET%/%LOCALE%/%CHANNEL%/%OS_VERSION%/%SYSTEM_CAPABILITIES%/%DISTRIBUTION%/%DISTRIBUTION_VERSION%/update.xml");
+pref("app.update.url", "https://aus4.stage.mozaws.net/update/6/%PRODUCT%/%VERSION%/%BUILD_ID%/%BUILD_TARGET%/%LOCALE%/%CHANNEL%/%OS_VERSION%/%SYSTEM_CAPABILITIES%/%DISTRIBUTION%/%DISTRIBUTION_VERSION%/update.xml");
// app.update.url.manual is in branding section
// app.update.url.details is in branding section
diff --git a/mobile/android/app/mobile.js b/mobile/android/app/mobile.js
--- a/mobile/android/app/mobile.js
+++ b/mobile/android/app/mobile.js
@@ -540,7 +540,7 @@
// used by update service to decide whether or not to
// automatically download an update
pref("app.update.autodownload", "wifi");
-pref("app.update.url.android", "https://aus5.mozilla.org/update/4/%PRODUCT%/%VERSION%/%BUILD_ID%/%BUILD_TARGET%/%LOCALE%/%CHANNEL%/%OS_VERSION%/%DISTRIBUTION%/%DISTRIBUTION_VERSION%/%MOZ_VERSION%/update.xml");
+pref("app.update.url.android", "https://aus4.stage.mozaws.net/update/4/%PRODUCT%/%VERSION%/%BUILD_ID%/%BUILD_TARGET%/%LOCALE%/%CHANNEL%/%OS_VERSION%/%DISTRIBUTION%/%DISTRIBUTION_VERSION%/%MOZ_VERSION%/update.xml");
#ifdef MOZ_UPDATER
/* prefs used specifically for updating the app */
and retrigger nightlies. (Basically this changes nothing on the surface; the only difference being that the end-user product binaries have that value flipped internally.)
<mtabara> 3. Once the graphs from above are green again (updates are still sent to Firefox/Fennec-date-nightly-latest in Balrog staging) we need to flip https://github.com/MihaiTabara/build-puppet/blob/cot/modules/balrog_scriptworker/manifests/settings.pp#L17 to use the Balrog production username/password. Rebuild balrogworker with this change and retrigger nightlies. At this point, once the graph is green, we should be seeing `Firefox/Fennec-date-nightly-latest in Balrog production` and everyone is happy and we can move forward.
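The only delta in the diff above is the update-server host; the rest of the pref is a placeholder template that the updater expands at request time before querying Balrog. A small Python sketch of that expansion (the `expand` helper and the sample values are illustrative, not Firefox's actual implementation):

```python
# The staging pref value from the diff above; the production value only
# differs in the hostname.
STAGE = ("https://aus4.stage.mozaws.net/update/6/%PRODUCT%/%VERSION%/"
         "%BUILD_ID%/%BUILD_TARGET%/%LOCALE%/%CHANNEL%/%OS_VERSION%/"
         "%SYSTEM_CAPABILITIES%/%DISTRIBUTION%/%DISTRIBUTION_VERSION%/"
         "update.xml")
PROD = STAGE.replace("aus4.stage.mozaws.net", "aus5.mozilla.org")

def expand(template, values):
    """Substitute each %NAME% placeholder with its value."""
    for name, value in values.items():
        template = template.replace("%" + name + "%", value)
    return template

# Sample (made-up) values for a hypothetical nightly build.
url = expand(PROD, {
    "PRODUCT": "Firefox", "VERSION": "53.0a1", "BUILD_ID": "20170117030213",
    "BUILD_TARGET": "Linux_x86_64-gcc3", "LOCALE": "en-US",
    "CHANNEL": "nightly", "OS_VERSION": "Linux",
    "SYSTEM_CAPABILITIES": "SSE3", "DISTRIBUTION": "default",
    "DISTRIBUTION_VERSION": "default",
})
```

This is why the revert is invisible on the surface: the template shape is identical, and only the host the expanded URL points at changes.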
Comment 13 • 8 years ago (Assignee)
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #0)
> My plan is as follows:
>
> 1. Fix [2] errors from b.
> 2. Create a new "0.7.x" environment in puppet that points to current
> production tip of [7]
> 3. Pin [3] and [4] to this newly created environment
> 4. Stop [3][4] instances from aws console in order to prevent them claiming
> tasks from the TC queue
> 5. Upgrade puppet with CoT-enabling patches and change the moco-nodes.pp
> regex to match the "use1" workers only.
> 6. Fix any potential bustage that may come up
> 7. Add GPG keys pairs for [3] and [4] both in hiera and cot-gpg-keys
>
> Once the 'upstreamArtifacts' in-tree patch lands on `date` and new nightlies
> are triggered, we can test the {beetmover,balrog}worker(s).
> 8. when everything is smooth, unpin [3][4]
> [1]: beetmoverworker-1.srv.releng.use1.mozilla.com
> [2]: balrogworker-1.srv.releng.use1.mozilla.com
> [3]: beetmoverworker-2.srv.releng.usw1.mozilla.com
> [4]: balrogworker-2.srv.releng.usw1.mozilla.com
> [5]: https://github.com/mozilla/build-puppet/pull/23
> [6]: https://github.com/mozilla/build-puppet/pull/26
> [7]: https://github.com/MihaiTabara/build-puppet
All done now - the in-tree patches have landed, and various misc bugs have been addressed in puppet, the scripts, and task-graph. Graphs are now green for both Fennec and Desktop. The CoT pieces have connected nicely. I've unpinned the usw2 workers (non-CoT until now, stopped in the AWS console and kept as back-up) and upgraded them to CoT too.
Therefore we now have four workers, all CoT-able against https://github.com/mozilla/build-puppet/pull/28.
beetmoverworker-1.srv.releng.use1.mozilla.com
beetmoverworker-2.srv.releng.usw1.mozilla.com
balrogworker-1.srv.releng.use1.mozilla.com
balrogworker-2.srv.releng.usw1.mozilla.com
Having two workers per logical piece (beetmover, balrog) makes things twice as fast from now on.
I'll keep the bug open for another 1-2 days to make sure all four workers nicely claim work.
Whiteboard: toverify
Comment 14 • 8 years ago (Assignee)
Shortly, sometime this week, I'll dump the build-puppet patch, move it one last time through ReviewBoard, and land it to default/production puppet. Then I'll unpin the current workers from my environment and let them grab their config from puppet.
Comment 15 • 8 years ago
Is there anything you can think of that would make us back out to 0.7.x, that you would find after waiting a few more days?
I can't think of anything, so I'd lean towards rolling out to prod puppet and closing this bug sooner.
Comment 16 • 8 years ago (Assignee)
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #13)
> All done now - in-tree patches have landed, various misc bugs have been
> addressed in puppet, scripts and task-graph. Graphs are now green for both
> Fennec and Desktop. CoT pieces have largely and nicely connected. I've
> unpinned the usw2 (non-CoT so far stopped in aws console and kept as
> back-up) and upgraded them to CoT too.
>
> Therefore we now have four workers, all CoT-able against
> https://github.com/mozilla/build-puppet/pull/28.
Forgot to thank :aki, :jlund and :Callek for help in dealing with the above hiccups!
(In reply to Aki Sasaki [:aki] from comment #15)
> Is there anything you can think of that would make us back out to 0.7.x,
> that you would find after waiting a few more days?
> I can't think of anything, so I'd lean towards rolling out to prod puppet
> and closing this bug sooner.
To be honest, I think we're far past the inflection point where we'd want to go back. I'd even trade a 1-2 day bustage on date rather than going back. CoT FTW! :)
The only thing on my mind is the upcoming checksums munging in beetmoverscript, which is likely to require some beetmoverscript PyPI bumps or the like. From that perspective I thought it'd be easier to bump in my environment rather than pushing to default and merging. But overall, I think you're right; that is actually outside this bug's scope.
Comment hidden (mozreview-request)
Comment 18 • 8 years ago (mozreview-review)
Comment on attachment 8827479 [details]
Bug 1330276 - Upgrade beetemover and balrog workers to CoT.
https://reviewboard.mozilla.org/r/105156/#review105978
Thanks! We can either make the /builds/scriptworker change now, or in a followup bug.
::: modules/balrog_scriptworker/manifests/settings.pp:2
(Diff revision 1)
> class balrog_scriptworker::settings {
> - include ::config
> + $root = "/builds/balrogworker"
To allow for shared nagios checks, we'll need this to be /builds/scriptworker.
::: modules/beetmover_scriptworker/manifests/settings.pp:2
(Diff revision 1)
> class beetmover_scriptworker::settings {
> - include ::config
> + $root = "/builds/beetmoverworker"
same here.
Attachment #8827479 - Flags: review?(aki) → review+
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Updated • 8 years ago (Assignee)
Attachment #8827517 - Flags: review?(aki)
Attachment #8827518 - Flags: review?(aki)
Comment 21 • 8 years ago (mozreview-review)
Comment on attachment 8827518 [details]
Tweak workers to use shared scriptworker location.
https://reviewboard.mozilla.org/r/105178/#review105990
Attachment #8827518 - Flags: review?(aki) → review+
Comment hidden (mozreview-request)
Comment 23 • 8 years ago (mozreview-review)
Comment on attachment 8827538 [details]
Bug 1330276 - nicely handle the switch from dev to production.
https://reviewboard.mozilla.org/r/105192/#review105998
We can deal with separate beetmover creds in https://github.com/mozilla-releng/beetmoverscript/issues/28 .
Attachment #8827538 - Flags: review?(aki) → review+
Comment 24 • 8 years ago (Assignee)
Comment on attachment 8827517 [details]
Switch balrogworker to use Balrog production credentials.
Comments addressed via IRC and r+'ed already.
Attachment #8827517 - Flags: review?(aki)
Comment 25 • 8 years ago (Assignee)
Comment on attachment 8826153 [details] [review]
Add GPG keys for usw2 instances for both beetmover and balrog new instances.
https://hg.mozilla.org/build/puppet/rev/31b4ede393da and merged to production.
Attachment #8826153 - Flags: checked-in+
Comment 26 • 8 years ago (Assignee)
Comment on attachment 8827479 [details]
Bug 1330276 - Upgrade beetemover and balrog workers to CoT.
https://hg.mozilla.org/build/puppet/rev/31b4ede393da and merged to production.
Attachment #8827479 - Flags: checked-in+
Comment 27 • 8 years ago (Assignee)
Comment on attachment 8827517 [details]
Switch balrogworker to use Balrog production credentials.
https://hg.mozilla.org/build/puppet/rev/31b4ede393da and merged to production.
Attachment #8827517 - Flags: checked-in+
Comment 28 • 8 years ago (Assignee)
Comment on attachment 8827518 [details]
Tweak workers to use shared scriptworker location.
https://hg.mozilla.org/build/puppet/rev/31b4ede393da and merged to production.
Attachment #8827518 - Flags: checked-in+
Comment 29 • 8 years ago (Assignee)
Comment on attachment 8827538 [details]
Bug 1330276 - nicely handle the switch from dev to production.
https://hg.mozilla.org/build/puppet/rev/31b4ede393da and merged to production.
Attachment #8827538 - Flags: checked-in+
Comment hidden (mozreview-request)
Comment 31 • 8 years ago (mozreview-review)
Comment on attachment 8827633 [details]
Bug 1330276 - remove unused signing_cert in balrogscript.
https://reviewboard.mozilla.org/r/105248/#review106080
Attachment #8827633 - Flags: review?(aki) → review+
Comment 32 • 8 years ago (Assignee)
Attachment #8827651 - Flags: review?(aki)
Comment 33 • 8 years ago
Comment on attachment 8827651 [details] [review]
Remove signing-cert verification from balrogscript as it is not used.
As noted in the PR, this looks good, but highlights the fact we still need to fix https://github.com/mozilla-releng/balrogscript/issues/16 .
Attachment #8827651 - Flags: review?(aki) → review+
Comment 34 • 8 years ago (Assignee)
Comment on attachment 8827633 [details]
Bug 1330276 - remove unused signing_cert in balrogscript.
https://hg.mozilla.org/projects/date/rev/91e9227e02a7
Attachment #8827633 - Flags: checked-in+
Comment 35 • 8 years ago (Assignee)
Comment on attachment 8827651 [details] [review]
Remove signing-cert verification from balrogscript as it is not used.
https://github.com/mozilla-releng/balrogscript/commit/f125e9b886157e329fb03064c5f0e919fbfa8335
Attachment #8827651 - Flags: checked-in+
Comment 36 • 8 years ago
(In reply to Aki Sasaki [:aki] from comment #33)
> Comment on attachment 8827651 [details] [review]
> Remove signing-cert verification from balrogscript as it is not used.
>
> As noted in the PR, this looks good, but highlights the fact we still need
> to fix https://github.com/mozilla-releng/balrogscript/issues/16 .
https://github.com/mozilla-releng/balrogscript/issues/16#issuecomment-273351095 expands on this comment.
Comment 37 • 8 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/0bf730c7d8cb01d0c8474d71dfffa84849c6d2be
Bug 1330276 - remove unused signing_cert in balrogscript.r=aki
Comment 38 • 8 years ago (bugherder)
Comment 39 • 8 years ago (Assignee)
Reopening this. Would like to keep the bug open for another 1-2 days until I add the scopes / channel check in balrogscript.
Comment 40 • 8 years ago
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #39)
> Reopening this. Would like to keep the bug open for another 1-2 days until I
> add the scopes / channel check in balrogscript.
Mihai: is this still in progress?
Flags: needinfo?(mtabara)
Comment 41 • 8 years ago (Assignee)
(In reply to Chris Cooper [:coop] from comment #40)
> (In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #39)
> > Reopening this. Would like to keep the bug open for another 1-2 days until I
> > add the scopes / channel check in balrogscript.
>
> Mihai: is this still in progress
Yep. :aki and I discussed last week adding this scopes check too, to improve security.
I'll address this as soon as I finish checksums.
Flags: needinfo?(mtabara)
Comment 42 • 8 years ago (Assignee)
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #39)
> Reopening this. Would like to keep the bug open for another 1-2 days until I
> add the scopes / channel check in balrogscript.
We now have this check described under https://github.com/mozilla-releng/balrogscript/issues/16 and landed in https://github.com/mozilla-releng/balrogscript/commit/b4b4d8e7c6281f669e092fd45413ec994c495321. Beetmoverworkers are to be updated to use 0.0.6 soon.
We can now close this.
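For reference, the scope/channel check boils down to requiring that a task's TaskCluster scopes explicitly name the Balrog channel it wants to submit to. A minimal Python sketch of the idea; the scope prefix and the `verify_channel_scope` helper are assumptions for illustration, and the actual implementation is the balrogscript commit linked above.

```python
# Hypothetical sketch: a task must carry a balrog scope whose suffix
# matches the channel it wants to submit to, otherwise we refuse it.
SCOPE_PREFIX = "project:releng:balrog:"

def verify_channel_scope(task_scopes, channel):
    """Return the matching scope, or raise if no scope allows `channel`."""
    wanted = SCOPE_PREFIX + channel
    if wanted not in task_scopes:
        raise ValueError(
            "no scope grants balrog submission for channel %r" % channel
        )
    return wanted
```

This keeps a compromised nightly task from submitting to, say, the release channel, since the queue only grants it the nightly-channel scope.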
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Updated • 8 years ago
Comment 43 • 7 years ago
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
Updated • 7 years ago
Component: General Automation → General