Closed Bug 1150162 Opened 9 years ago Closed 9 years ago

rework release runner for release promotion/taskcluster

Categories

(Release Engineering :: Release Automation: Other, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

References

Details

Attachments

(6 files, 1 obsolete file)

We're going to have to do this at some point. I'm looking at doing it in the near term to make bug 1145774 easier.

Unless we move ALL of the jobs to taskcluster at the same time, this will depend on the bridge.
So, the bridge is coming along pretty nicely. It's far enough along that I think this could be started at any point. Rail, Nick and I chatted about this a little bit last week, the executive summary is:
* We intend to generate a task graph of tagging through "ready for release on channel X" at the start of releases.
** Releases that don't automatically push to mirrors will have two big sections. The first will go from tagging to "updates available on localtest", the second will block on a human decision step and takes care of "push to mirrors" -> "ready for release on channel X"
* Release runner will be responsible for creating and submitting the task graph
* The human decision task will be a button exposed on Ship It. Clicking it will make the right API calls (via JS) to mark the task as completed. The human clicking the button will need to have Taskcluster credentials in their browser to make this work.

Rail, Nick - did I miss anything?

Lawrence, Sylvestre - cc'ing you two because I suspect you're interested in the Ship It changes. This is looking like it will end up being a good proof of concept for how to expose additional actions to Ship It.
Sounds like a promising plan. Thanks for sharing
Unassigning myself for now because it's not clear if I'll be the one initially doing this. Also attaching this to the release promotion tracking bug.
Assignee: bhearsum → nobody
This patch is two things. Firstly, it rips out all of the stuff we don't need for "old style" (aka, non-prompted) releases, including: reconfigs, sendchanges, release config bumps, etc. There's a few things I've left commented out, such as release-drivers mail sending, because we may still need them.

The other part of this patch calls out to releasetasks to generate a task graph (such as https://pastebin.mozilla.org/8841222) after getting a release request. This seems to mostly work, although there's a bunch of stubbing and hardcoding right now. I also disabled task graph submission until it's a little more up to snuff. This version of release runner has no idea what release configs are, and looks up things in the branch config instead. I've got a buildbot-configs patch that adds a few necessary things to date's branch config that's needed to make this work.

One obvious implication of this patch is that we will not have a version of release runner that works for both release promotion, and old-style releases. I'm considering this a feature: we'll leave the old release runner up and running for Thunderbird only (and eventually handing that machine off to them), and we'll deploy a new production release runner for release promotion, when we're ready.
Assignee: nobody → bhearsum
Status: NEW → ASSIGNED
Attachment #8641690 - Flags: feedback?(rail)
Attachment #8641690 - Flags: feedback?(nthomas)
Attachment #8641691 - Flags: feedback?(rail)
Attachment #8641691 - Flags: feedback?(nthomas)
Attached patch update date branch config (obsolete) — Splinter Review
Attachment #8641692 - Flags: feedback?(rail)
Attachment #8641692 - Flags: feedback?(nthomas)
Comment on attachment 8641691 [details] [review]
Add initial framework for graph generation, with tons of todos

In overall this looks like a good start. I believe we'll have more ideas while we progress.
Attachment #8641691 - Flags: feedback?(rail) → feedback+
Comment on attachment 8641690 [details] [diff] [review]
update release runner to use releasetasks to generate task graphs

This is a good start. I'd probably replace it with something async in the future.
Attachment #8641690 - Flags: feedback?(rail) → feedback+
Comment on attachment 8641692 [details] [diff] [review]
update date branch config

Hmmm. I'm a bit worried about reproducible builds here. Because we don't rely on any tag of buildbot-coinfigs (we are going to stop tagging), we may need to find a better way to get these values. Maybe in the future, when the scheduling is in TC and we extend the graph instead of creating a new one, we can use something from the tree or the graph.
Attachment #8641692 - Flags: feedback?(rail)
Comment on attachment 8641690 [details] [diff] [review]
update release runner to use releasetasks to generate task graphs

Review of attachment 8641690 [details] [diff] [review]:
-----------------------------------------------------------------

Looks fine to me.

::: buildfarm/release/release-runner.py
@@ +78,5 @@
>          }
>      return partials
>  
>  
> +# TODO: replace release sanity with direct checks of en-US and l10n revisions (and other things if needed)

Definitely think we'll need this.

@@ +197,5 @@
> +            rr.update_status(release, 'Generating task graph')
> +
> +            # TODO: replace with parsing of l10n json, because plain l10n changesets
> +            # does not have platform specific exceptions
> +            def parse_l10n_changesets(*args, **kwargs):

Does this require some changes in the data supplied by l10n.mo via ship-it ?
Attachment #8641690 - Flags: feedback?(nthomas) → feedback+
Comment on attachment 8641691 [details] [review]
Add initial framework for graph generation, with tons of todos

I made some (probably TC-noob) comments over on the PR, mostly this was a learning exercise for me :-)
Attachment #8641691 - Flags: feedback?(nthomas) → feedback+
Attachment #8641692 - Flags: feedback?(nthomas) → feedback+
(In reply to Rail Aliiev [:rail] from comment #9)
> Hmmm. I'm a bit worried about reproducible builds here. Because we don't
> rely on any tag of buildbot-coinfigs (we are going to stop tagging), we may
> need to find a better way to get these values. Maybe in the future, when the
> scheduling is in TC and we extend the graph instead of creating a new one,
> we can use something from the tree or the graph.

Everything will be glorious when all the scheduling is TC, because it'll be derived from files in the tree ?
(In reply to Rail Aliiev [:rail] from comment #9)
> Comment on attachment 8641692 [details] [diff] [review]
> update date branch config
> 
> Hmmm. I'm a bit worried about reproducible builds here. Because we don't
> rely on any tag of buildbot-coinfigs (we are going to stop tagging), we may
> need to find a better way to get these values. Maybe in the future, when the
> scheduling is in TC and we extend the graph instead of creating a new one,
> we can use something from the tree or the graph.

I'm not sure how much I care about reproducible builds + Buildbot. We don't rely on a tag here, it's true, but if you try rerun the same release 6 months later against some tagged version of buildbot-configs, the masters will have moved on and perhaps deleted some builders you need.

When we're off Buildbot we'll need somewhere else to put these values for sure. The only place I can think of that makes sense is Ship It. We could certainly do it sooner, but I'm not sure I want to tack that work on.
(In reply to Nick Thomas [:nthomas] from comment #10)
> Comment on attachment 8641690 [details] [diff] [review]
> update release runner to use releasetasks to generate task graphs
> 
> Review of attachment 8641690 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks fine to me.
> 
> ::: buildfarm/release/release-runner.py
> @@ +78,5 @@
> >          }
> >      return partials
> >  
> >  
> > +# TODO: replace release sanity with direct checks of en-US and l10n revisions (and other things if needed)
> 
> Definitely think we'll need this.
> 
> @@ +197,5 @@
> > +            rr.update_status(release, 'Generating task graph')
> > +
> > +            # TODO: replace with parsing of l10n json, because plain l10n changesets
> > +            # does not have platform specific exceptions
> > +            def parse_l10n_changesets(*args, **kwargs):
> 
> Does this require some changes in the data supplied by l10n.mo via ship-it ?

Yes! Rail and I looked at this a bit and it looks like there's no way to get a Firefox-compatible JSON changesets from l10n.m.o at the moment. I'll follow-up on that and see if we can get it fixed.
(In reply to Nick Thomas [:nthomas] from comment #12)
> (In reply to Rail Aliiev [:rail] from comment #9)
> > Hmmm. I'm a bit worried about reproducible builds here. Because we don't
> > rely on any tag of buildbot-coinfigs (we are going to stop tagging), we may
> > need to find a better way to get these values. Maybe in the future, when the
> > scheduling is in TC and we extend the graph instead of creating a new one,
> > we can use something from the tree or the graph.
> 
> Everything will be glorious when all the scheduling is TC, because it'll be
> derived from files in the tree ?

Whoops, missed this comment before writing my reply. In tree might make sense, though it requires a rebuild if you want to change the platforms you're shipping...
(In reply to Ben Hearsum (:bhearsum) from comment #14)
> (In reply to Nick Thomas [:nthomas] from comment #10)
> > Comment on attachment 8641690 [details] [diff] [review]
> > update release runner to use releasetasks to generate task graphs
> > 
> > Review of attachment 8641690 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > Looks fine to me.
> > 
> > ::: buildfarm/release/release-runner.py
> > @@ +78,5 @@
> > >          }
> > >      return partials
> > >  
> > >  
> > > +# TODO: replace release sanity with direct checks of en-US and l10n revisions (and other things if needed)
> > 
> > Definitely think we'll need this.
> > 
> > @@ +197,5 @@
> > > +            rr.update_status(release, 'Generating task graph')
> > > +
> > > +            # TODO: replace with parsing of l10n json, because plain l10n changesets
> > > +            # does not have platform specific exceptions
> > > +            def parse_l10n_changesets(*args, **kwargs):
> > 
> > Does this require some changes in the data supplied by l10n.mo via ship-it ?
> 
> Yes! Rail and I looked at this a bit and it looks like there's no way to get
> a Firefox-compatible JSON changesets from l10n.m.o at the moment. I'll
> follow-up on that and see if we can get it fixed.

We decided out of band not to pursue using JSON changesets. We'll continue to use the plain changesets, and hardcode the ja/ja-JP-mac exception.
Comment on attachment 8641692 [details] [diff] [review]
update date branch config

Rail, any change of heart on this given comments #12 and 13?
Attachment #8641692 - Flags: review?(rail)
Morphing this bug into what it has become (oops): release runner modifications for release promotion/taskcluster, and probably a few buildbot-configs/buildbotcustom tweaks to get an initial job running. I'll consider this done when I'm able to submit a release to ship it dev, have release runner dev notice it & submit to taskcluster, and one of the jobs runs in Buildbot and reports status back to Taskcluster.

bug 1178315 is covering the actual graph generation code, which will be extended by other bugs as we're ready to move jobs.
No longer blocks: 1145774
Summary: move release automation scheduling to taskcluster → rework release runner for release promotion/taskcluster
Scheduling goes away because Taskcluster will be handling it, and I think partner repacks go away because of bug 1181183?

In looking at this I realized we already have a "date_source" builder enabled in Buildbot, which I think I'll target as my prototype job =).
Attachment #8645909 - Flags: review?(nthomas)
I've got a WIP branch to generate a source builder task, but it needs more work (tests, at the very least) before it's ready: https://github.com/rail/releasetasks/compare/rail:master...bhearsum:source?expand=1
Comment on attachment 8645909 [details] [diff] [review]
remove scheduling + partner repacks from release promotion buildbot code

Review of attachment 8645909 [details] [diff] [review]:
-----------------------------------------------------------------

Feel free to nuke the |topLevelBuilders = []| a little earlier too, seems unlikely we'll need that.
Attachment #8645909 - Flags: review?(nthomas) → review+
(In reply to Ben Hearsum (:bhearsum) from comment #20)
> I've got a WIP branch to generate a source builder task, but it needs more
> work (tests, at the very least) before it's ready:
> https://github.com/rail/releasetasks/compare/rail:master...bhearsum:
> source?expand=1

I'd like to keep in mind the possibility of partial updates from, eg 40.0build2 to 40.0build4, so that beta users get the RCs as fast as possible.
(In reply to Nick Thomas [:nthomas] from comment #22)
> (In reply to Ben Hearsum (:bhearsum) from comment #20)
> > I've got a WIP branch to generate a source builder task, but it needs more
> > work (tests, at the very least) before it's ready:
> > https://github.com/rail/releasetasks/compare/rail:master...bhearsum:
> > source?expand=1
> 
> I'd like to keep in mind the possibility of partial updates from, eg
> 40.0build2 to 40.0build4, so that beta users get the RCs as fast as possible.

Good point, I'll add the build numbers into the mix too.
Attachment #8645909 - Flags: checked-in+
Here's a small pull request just for updating this basename.
Attachment #8646510 - Flags: review?(nthomas)
Attachment #8646510 - Flags: review?(nthomas) → review+
Attachment #8646510 - Flags: checked-in+
Attachment #8641692 - Flags: review?(rail) → review+
Attachment #8641692 - Flags: checked-in+
Comment on attachment 8641692 [details] [diff] [review]
update date branch config

Had to back this out because we hit the builder limit, presumably because the gecko version was bumped.
Attachment #8641692 - Flags: checked-in+ → checked-in-
eeeeew :(
Depends on: 1194807
Rail and I did some work on the releasetasks code yesterday: https://github.com/rail/releasetasks/compare/9407516...master

We mostly focused on writing some initial tests for the funsize graph code, and I think we've managed to find a sustainable model by writing tests that only verify specific parts of the graph. We'll probably want some full-graph tests later, but these seem like a reasonable balance between completeness and maintainability for the majority of them.

We did a few other things to help improve testability:
* Added get_task_by_name and get_task_by_slugid helpers to find the things on the graph we want to verify. This is because a task graph is actually a list of dicts, rather than a dict with useful key names. We've got no choice but to iterate over it to find what we want.
* To facilitate this, we add a "task_name" key to the "extra" of each task when tests are running. We might want to consider including this all of the time.
* Started putting ifdefs around each subgraph include (eg, source_enabled, l10n_enabled, updates_enable). This lets us only generate the parts of the graph we need in tests, so we don't need to pass variables that aren't relevant. Eg: we don't need repo_path to test en-US funsize.

Writing these tests actually uncovered a bug too! We realized that certain scopes should or shouldn't be included based on the type of signing we're doing (both at the graph level, and task level).

Now that the master branch of https://github.com/rail/releasetasks has code that can generate a release graph with the source builder enabled, I'm going to work on getting the end to end flow of ship it submission -> release runner -> task graph -> buildbot job going.
Here's an updated version of the patch for date's config that disable tests to avoid the builder limit.
Attachment #8641692 - Attachment is obsolete: true
Attachment #8648219 - Flags: review?(rail)
Attachment #8648219 - Flags: review?(rail) → review+
Attachment #8648219 - Flags: checked-in+
I got pretty close to this working today. Release runner is responding to requests, and even generating graphs....but it's failing to submit because of authorization failures. I adjusted the scopes on its credentials to allow it to create task graphs, but it takes an hour for that to propagate, so I won't be able to test it again until Monday.

Note to self: submit a new release to release runner dev on Monday and watch the release runner log!
I got a task graph to submit successfully today \o/. I had to work through some additional issuess with scopes - I pushed fixes for those on the master releasetasks branch.

I submitted Firefox-40.0b1-build1 to https://ship-it-dev.allizom.org, marked it as ready, and release runner went off and generated this task graph: https://tools.taskcluster.net/task-graph-inspector/#j6SUKhdKSFy3GvUCylnVlg/4N-0yTkmQzuNmq3zjLf7cw/

You'll see that it failed, which has found two bugs
1) "date" is very out date, and doesn't have mozharness in it. I tried to do a merge but failed miserably.
2) The Buildbot status was not reflected to Taskcluster properly, and Taskcluster retried the task. It looks like Buildbot never sent out the build finished message based on my read of the bbb logs, though it's hard to be sure. I'm going to watch bbb more closely next time I try this and see if it repros.
In production
How hard would it be to modify the subject of the emails from the staging release runner from '[release-runner] failed' to '[staging release-runner] failed' ? Nice to have, assuming my filtering for the notifications-dev list works properly.
(In reply to Nick Thomas [:nthomas] from comment #34)
> How hard would it be to modify the subject of the emails from the staging
> release runner from '[release-runner] failed' to '[staging release-runner]
> failed' ? Nice to have, assuming my filtering for the notifications-dev list
> works properly.

I'll see what I can do, it should be possible...
(In reply to Ben Hearsum (:bhearsum) from comment #35)
> (In reply to Nick Thomas [:nthomas] from comment #34)
> > How hard would it be to modify the subject of the emails from the staging
> > release runner from '[release-runner] failed' to '[staging release-runner]
> > failed' ? Nice to have, assuming my filtering for the notifications-dev list
> > works properly.
> 
> I'll see what I can do, it should be possible...

Fixed in https://github.com/bhearsum/tools/commit/9eabe1417e963fee6c7e8bfdb4553dad09ba680d

I still need to figure out how to handle deployment here, since release runner is now forked :(
(In reply to Ben Hearsum (:bhearsum) from comment #36)
> (In reply to Ben Hearsum (:bhearsum) from comment #35)
> > (In reply to Nick Thomas [:nthomas] from comment #34)
> > > How hard would it be to modify the subject of the emails from the staging
> > > release runner from '[release-runner] failed' to '[staging release-runner]
> > > failed' ? Nice to have, assuming my filtering for the notifications-dev list
> > > works properly.
> > 
> > I'll see what I can do, it should be possible...
> 
> Fixed in
> https://github.com/bhearsum/tools/commit/
> 9eabe1417e963fee6c7e8bfdb4553dad09ba680d
> 
> I still need to figure out how to handle deployment here, since release
> runner is now forked :(

I came up with a short list of what's different (or expected to be different) between old and new release runner:
* The code (obviously)
* New release runner has additive changes to the config, and obsoletes some of the existing options
* New release runner has additional python package requirements

There's nothing in the list above that can't co-exist for now, so I'm not going to pursue improving the Python part of this deployment for now. My thinking is that once we move all Firefox and Fennec releases to the new release runner we can detach the old from Puppet and remove the obsolete parts of the config.

The only thing I'd like to now is get my release runner code reviewed and landed, which means putting the current release runner code on a branch, in case it needs updating before the new one is done.

I'm going to deal with the Puppet parts in bug 1194854, and the release runner changes here. If anyone thinks this idea is bad feel free to say so - I'm happy to pursue something different.
This has lots of TODOs and missing features, but it's what's currently running in dev, and I think it would be good to get it landed, and then continue to iterate on it. If someone would prefer it, we can also continue iterating without landing in a production repo.

Highlights include:
* Rip out everything that we know we don't need anymore (reconfigs, sendchanges, config bumping, etc.)
* Rip out release sanity, with the intent that we'll replace it with newer, better checks.
* Comment out RD mail sending, because it's not clear how to find the right address and some of the details needed to send the mail
* Hardcode l10n changesets, because we still need to write the code to generate a correct one based on the input l10n changesets
Attachment #8656622 - Flags: review?(rail)
Attachment #8656622 - Flags: review?(nthomas)
Attachment #8656622 - Flags: review?(rail) → review+
Comment on attachment 8656622 [details] [diff] [review]
initial implementation of new release runner

Review of attachment 8656622 [details] [diff] [review]:
-----------------------------------------------------------------

r+

::: buildfarm/release/release-runner.py
@@ +257,3 @@
>          except:
>              # We explicitly do not raise an error here because there's no
>              # reason not to start other releases if the sendchange fails for

nit, reference to a sendchange
Attachment #8656622 - Flags: review?(nthomas) → review+
Comment on attachment 8656622 [details] [diff] [review]
initial implementation of new release runner

I created the "old-release-runner" branch in https://hg.mozilla.org/build/tools/rev/62fc7aa0a999, and updated buildbot-master81's release runner instance to use it.

I landed the new code on default in https://hg.mozilla.org/build/tools/rev/5e4cc2e5d77f, and updated buildbot-master83's release runner instance to use that instead of my github repo.
Attachment #8656622 - Flags: checked-in+
OK, so the base version of the new release runner is deployed and running now. As with other parts of release promotion, we'll improve and fix it as we go along, in more targeted bugs. Closing this.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Looks like we can set releaserunner_tools_branch in puppet, so that an emergency rebuild gets the right code.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: