Closed Bug 1040013 Opened 10 years ago Closed 8 years ago

No manual reconfigs: buildbot masters and foopies to update themselves

Categories: Release Engineering :: General
Type: defect
Priority: Not set
Severity: normal
Tracking: Not tracked

RESOLVED DUPLICATE of bug 1057888

People: (Reporter: pmoore, Unassigned)

So, currently a reconfig is (a rough code sketch of steps b) and c) follows the list):

a) merge changes to production branches:
    * buildbotcustom:    default -> production-0.8
    * buildbot-configs:  default -> production
    * mozharness:        default -> production
b) ssh onto all buildbot masters, and run (see http://hg.mozilla.org/build/tools/file/tip/lib/python/util/fabric/actions.py):
    * show_revisions (hg ident -i against buildbotcustom, buildbot-configs)
    * update ('source bin/activate && make update' in basedir)
    * checkconfig ('make checkconfig' in basedir)
    * reconfig (copy over buildbot-wrangler.py to master, run 'rm -f *.pyc; python buildbot-wrangler.py reconfig <master_dir>')
    * show_revisions (hg ident -i against buildbotcustom, buildbot-configs)
c) ssh onto all foopies, and run (see http://hg.mozilla.org/build/tools/file/tip/buildfarm/maintenance/foopy_fabric.py):
    * show_revision (hg -R /builds/tools ident -i)
    * update ('cd /builds/tools; hg pull && hg update -r default; find /builds/tools -name \*.pyc -exec rm {} \; ; hg ident -i')
    * show_revision (hg -R /builds/tools ident -i)
d) publish changes to https://wiki.mozilla.org/ReleaseEngineering/Maintenance
e) (manually) update bugs listed in d) to say they have been deployed
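
To make the manual steps concrete, here is a rough sketch of what b) and c) amount to, written as Fabric-style tasks. This is illustrative only - the real code lives in actions.py and foopy_fabric.py linked above, and the basedir and master-directory names here are placeholders.

from fabric.api import cd, put, run

MASTER_BASEDIR = "/builds/buildbot/master"   # placeholder; real basedirs vary per master

def update_and_reconfig_master():
    # Step b): refresh and reconfig one buildbot master.
    with cd(MASTER_BASEDIR):
        run("hg -R buildbotcustom identify -i")       # show_revisions (before)
        run("source bin/activate && make update")     # update
        run("make checkconfig")                       # checkconfig
    put("buildbot-wrangler.py", MASTER_BASEDIR)
    with cd(MASTER_BASEDIR):
        run("rm -f *.pyc; python buildbot-wrangler.py reconfig master")  # 'master' dir is a placeholder
        run("hg -R buildbotcustom identify -i")       # show_revisions (after)

def update_foopy():
    # Step c): refresh the tools checkout on one foopy.
    run("hg -R /builds/tools identify -i")            # show_revision (before)
    run("cd /builds/tools && hg pull && hg update -r default")
    run(r"find /builds/tools -name '*.pyc' -exec rm {} \;")
    run("hg -R /builds/tools identify -i")            # show_revision (after)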

However, steps b) and c) could be handled by the foopies and masters themselves: if we set the foopies and buildbot masters to routinely pull from the production branches of buildbotcustom and buildbot-configs, and the default branch of tools, then whenever changes land, steps b) and c) happen automatically. Landing something in production then just becomes merging to the production branch and waiting up to 30 minutes (the same model we already have for mozharness). For emergencies, we could still maintain a fabric job to trigger the update immediately, so you don't have to wait the full 30 minutes.

Mozharness has already shown that this model works: it avoids manual work, and it avoids opening ssh connections from a personal machine to every foopy and buildbot master in the data centres.

As soon as b) and c) are automatic, d) becomes redundant - to see what is in production, just look at what is on the production branch. Then e) also becomes redundant for releng members who have permission to land changes on production branches, since they do not need to wait for another releng member to confirm that their change has been propagated to production. For non-releng members who can't land on the production branch, the merge to production can still be made by an authorized person, who could also still update the bugs (step e) above), but this will be a small subset of all changes.
Summary: Let's ditch reconfigs altogether → No manual reconfigs: buildbot masters and foopies to update themselves
From Simone (:simone):

What about cross-repo patch sets? With the current reconfig we can guarantee that sets of inter-related patches applied to different repos are deployed in a single reconfig, avoiding (temporary) inconsistencies.
Hey Simone,

Thanks for your feedback.

The short answer: we should build in something to handle this race condition. Thanks for highlighting this.

Current situation
=================
On foopies we only have the tools repo, so no coordination across repos is required there.

For buildbot masters, a manage_masters.py "update" results in pulling buildbotcustom, buildbot-configs and tools in sequence, e.g. see here: https://hg.mozilla.org/build/buildbot-configs/file/production/Makefile.master#l18

Once this update is triggered, it will get the latest versions of all three, which are hopefully in sync.
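
For illustration, the sequence that 'make update' walks through is roughly equivalent to the following (the checkout locations, and the assumption that all three repos live under the master's basedir, are mine, not taken from the Makefile):

import subprocess

# Branch per repo, as in comment 0; checkout locations are assumed.
REPOS = [
    ("buildbotcustom",   "production-0.8"),
    ("buildbot-configs", "production"),
    ("tools",            "default"),
]

def update_repos(basedir="/builds/buildbot/master"):
    # Pull and update each repo in sequence, as the Makefile's update target does.
    for name, branch in REPOS:
        path = "%s/%s" % (basedir, name)
        subprocess.check_call(["hg", "-R", path, "pull"])
        subprocess.check_call(["hg", "-R", path, "update", "-r", branch])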

Requirement of any new solution
===============================
Race condition avoidance: we need a way to make sure that a reconfig doesn't start after a human has updated one or more repos but before they have updated all of them.

Options
=======

Option 1) we could have the reconfig only refresh working directories if none of the repos has been updated in, say, the last X minutes - this gives people a window of time to push to all affected repos

Option 2) we could set a flag somewhere to disable reconfigs, and reset it once we have updated all the repos. This could be implemented in many ways, perhaps as a file checked into a repo somewhere.
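
A minimal sketch of how options 1) and 2) could be combined on the updating machine. The lockfile path, the window length, and the use of the tip changeset date as a proxy for "last updated" are all assumptions:

import os
import subprocess
import time

QUIET_WINDOW = 30 * 60                        # option 1: no updates for X minutes
LOCKFILE = "/builds/buildbot/reconfig.lock"   # option 2: hypothetical "reconfigs disabled" flag

def tip_age(repo_path):
    # Seconds since the tip changeset was committed; a rough proxy for the
    # last time someone pushed to this repo.
    out = subprocess.check_output(
        ["hg", "-R", repo_path, "log", "-r", "tip", "--template", "{date|hgdate}"]).decode()
    return time.time() - int(out.split()[0])

def ok_to_reconfig(repo_paths):
    if os.path.exists(LOCKFILE):
        return False                          # a human has disabled reconfigs
    return all(tip_age(path) > QUIET_WINDOW for path in repo_paths)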

Pete
I'm not sure these are my complete thoughts, but this is what I've got for now:
One of the reasons we have "push" reconfigs rather than "pull" is to keep things as in sync as possible. With the number of masters we have these days, it's a bit more difficult (ie, sometimes you don't start the reconfigs on every master at the same time), but because there's somebody looking at it, we don't get into a situation where one master is out of sync for a potentially long time. We need visibility into this sort of thing. I'm tempted to suggest that Nagios might be a good place for this. Perhaps a check that goes critical if a master hasn't reconfiged N minutes after a merge to production? Few things are as visible to us as Nagios.
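
A check along those lines could compare the master's working copy against the tip of the production branch upstream. The paths, the branch checked, and the exit-code convention below are assumptions - this is a sketch, not an existing check:

#!/usr/bin/env python
# Hypothetical Nagios-style check: is this master's buildbotcustom at the production tip?
import subprocess
import sys

OK, CRITICAL, UNKNOWN = 0, 2, 3

LOCAL_REPO = "/builds/buildbot/master/buildbotcustom"      # assumed checkout path
UPSTREAM = "https://hg.mozilla.org/build/buildbotcustom"   # production-0.8 lives here

def hg_id(args):
    return subprocess.check_output(["hg"] + args).strip().decode()

def main():
    try:
        local = hg_id(["-R", LOCAL_REPO, "identify", "-i"])
        remote = hg_id(["identify", "-i", "-r", "production-0.8", UPSTREAM])
    except subprocess.CalledProcessError as e:
        print("UNKNOWN: %s" % e)
        return UNKNOWN
    if local.rstrip("+") == remote:
        print("OK: master is at production tip %s" % remote)
        return OK
    # A real check would track how long the master has been out of date and
    # only go critical after N minutes, as suggested above.
    print("CRITICAL: master is at %s, production tip is %s" % (local, remote))
    return CRITICAL

if __name__ == "__main__":
    sys.exit(main())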

Another concern is around matching changes. Eg, buildbot-configs changes that depend on a buildbotcustom change. Do you have a plan for making sure that reconfigs don't start until both are landed? Maybe this is something that should be tied to CI? Eg, reconfigs don't start until Jenkins tests succeed on the production branches. There are also some rare cases where changes require human intervention. For example, whenever we add a new symlink in the master directory, a human needs to update the masters by hand.

To summarize, I think we need at least:
* Good visibility and error reporting for masters who somehow end up out of sync.
* A plan for dealing with matching changes to multiple repos.
Another thought:
Release runner (the script that polls Ship It and triggers releases) performs reconfigs of build+scheduler masters, and kicks off releases after the reconfig is finished. Whether push or pull, it needs some way to detect when the changes it makes are live on the build+scheduler masters, otherwise it can't safely trigger releases.
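
Whatever model we pick, release runner could poll the masters for the revision they are actually running before kicking off a release. The host list, repo path, and ssh probe below are illustrative - this is not the current release runner code:

import subprocess
import time

# Illustrative host list; in reality this would come from something like
# production-masters.json rather than being hard-coded.
MASTERS = ["bm81.example.com", "bm82.example.com"]
REPO = "/builds/buildbot/master/buildbot-configs"   # assumed checkout path

def deployed_rev(host):
    return subprocess.check_output(
        ["ssh", host, "hg", "-R", REPO, "identify", "-i"]).strip().decode()

def wait_until_live(expected_rev, timeout=45 * 60, poll=60):
    # Block until every master reports expected_rev, or give up after timeout.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if all(deployed_rev(host) == expected_rev for host in MASTERS):
            return True
        time.sleep(poll)
    raise RuntimeError("changes not live on all masters after %d seconds" % timeout)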
See Also: → 1010126
See Also: → 978928
(In reply to Ben Hearsum [:bhearsum] from comment #3)
> I'm not sure these are my complete thoughts, but this is what I've got for
> now:
> One of the reasons we have "push" reconfigs rather than "pull" is to keep
> things as in sync as possible. With the number of masters we have these
> days, it's a bit more difficult (ie, sometimes you don't start the reconfigs
> on every master at the same time), but because there's somebody looking at
> it, we don't get into a situation where one master is out of sync for a
> potentially long time. We need visibility into this sort of thing. I'm
> tempted to suggest that Nagios might be a good place for this. Perhaps a
> check that goes critical if a master hasn't reconfiged N minutes after a
> merge to production? Few things are as visible to us as Nagios.

This seems like an excellent idea. ++

> Another concern is around matching changes. Eg, buildbot-configs changes
> that depend on a buildbotcustom change. Do you have a plan for making sure
> that reconfigs don't start until both are landed?

see comment 2

> Maybe this is something
> that should be tied to CI? Eg, reconfigs don't start until Jenkins tests
> succeed on the production branches. There are also some rare cases where
> changes require human intervention. For example, whenever we add a new
> symlink in the master directory, a human needs to update the masters by
> hand.

I think tying this to CI is an excellent idea. I think we need a way to signal "these repos are now ready to be released to production; please update in this order (etc.)" that is separate from just a push to the production branch. It might make sense to design a few different possible workflows to allow this, bearing in mind use cases where manual changes need to be made to buildbot masters as part of a rollout.

Maybe there is some merit in tying changes to puppet, so that puppet is our only update mechanism; then changes on buildbot masters that would otherwise need to be made manually could also be written as puppet changes. I think we have several different options - my takeaway from this is to propose some alternative workflows for releasing changes and attach them to this bug. The *main* point of this bug is to avoid needing to log onto all machines to roll out changes - we may still need several steps for rolling out changes, but I'm hoping we can come up with a way where the state of a repository (or repositories) can be read by the updating machine, and it can update itself.

> 
> To summarize, I think we need at least:
> * Good visibility and error reporting for masters who somehow end up out of
> sync.
> * A plan for dealing with matching changes to multiple repos.
Consistency in approach is a good thing, and using puppet for deployments is certainly a good option.

The puppet deployment process could also use some additional automation and monitoring, if that would make things easier.  For example, we could coordinate updates to puppet across all masters, rather than relying on cronjobs, and announce the update in #releng (perhaps via https://github.com/mozilla/build-relengapi/issues/104).

We could also use something like mcollective to coordinate puppet runs on toplevel::server hosts, so that after landing a puppet change it'd be easy to trigger all buildmasters to adopt that change at roughly the same time.
I'm considering using something like https://forge.puppetlabs.com/puppetlabs/vcsrepo#mercurial
to keep the repos up-to-date every time puppet runs, and tying the downstream actions (steps a), b), c) in comment 0) to hg hooks, e.g.:

from: http://www.selenic.com/hg/help/hgrc

"post-<command>"
    Run after successful invocations of the associated command. The contents of the command line are passed as "$HG_ARGS" and the result code in "$HG_RESULT". Parsed command line arguments are passed as "$HG_PATS" and "$HG_OPTS". These contain string representations of the python data internally passed to <command>. "$HG_OPTS" is a dictionary of options (with unspecified options set to their defaults). "$HG_PATS" is a list of arguments. Hook failure is ignored.

With the combination of the two, we should be able to rely on puppet to keep the repos checked out and up-to-date, and on our hg hooks to call back when there has been a change to a repo.
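
As a concrete illustration, the callback could be an in-process Python changegroup hook on the checkout that puppet manages. The module path, the marker file, and the idea of deferring the actual reconfig to a separate job are all assumptions:

# reconfig_hook.py -- hypothetical hook, wired up in the managed checkout's hgrc with:
#   [hooks]
#   changegroup = python:/builds/buildbot/reconfig_hook.py:on_changegroup
# (a post-pull hook, as quoted from the hgrc help above, would work similarly)

def on_changegroup(ui, repo, node=None, **kwargs):
    # Runs after new changesets arrive in the puppet-managed checkout.
    ui.status("new changesets starting at %s; marking reconfig as pending\n" % node)
    # Rather than reconfiguring immediately, drop a marker that a separate
    # cron/puppet-driven job acts on once the quiet window (comment 2) has passed.
    with open("/builds/buildbot/reconfig-pending", "w") as f:
        f.write(node or "")
    return False   # falsy return value means the hook succeeded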

Using the callbacks, we could implement logic so that we only update if none of the repos has been updated for a chosen amount of time (e.g. 5 mins). That way, if somebody needs to make a change that touches multiple repos, they have a long enough window to push to all affected repos before the reconfig kicks in.

In addition I have raised bug 1087335 today to discuss the possibility of rolling several of our repos into a single repo (among other solutions). If this solution was chosen, the problem of "coordinating deployment of changes across multiple repos" goes away - you can update all the code in one go with an atomic push.

We would need to be careful that the hg hooks used in production would not have undue effects on user checkouts of the same codebase.

Lastly, I'm open to other suggestions about how to use puppet to keep our repos up-to-date and to ensure we run reconfig actions when the repos get updated. A simpler approach may also work, like watching the revision id of the working dir from a cron job running every minute and triggering the reconfig if it changes. Lots of possibilities.
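
A sketch of that simpler cron-based approach, assuming puppet (via vcsrepo) keeps the checkout itself up-to-date; the paths, state file, and master directory name are placeholders:

#!/usr/bin/env python
# Hypothetical one-minute cron job: reconfig when the working dir revision changes.
import os
import subprocess

BASEDIR = "/builds/buildbot/master"                  # assumed master basedir
REPO = os.path.join(BASEDIR, "buildbot-configs")     # repo whose revision we watch
STATE = os.path.join(BASEDIR, "last-reconfig-rev")   # revision we last reconfiged at

def current_rev():
    return subprocess.check_output(["hg", "-R", REPO, "identify", "-i"]).strip().decode()

def main():
    rev = current_rev()
    last = open(STATE).read().strip() if os.path.exists(STATE) else None
    if rev != last:
        # Puppet has pulled something new since the last run: reconfig, using the
        # same wrangler command the manual process uses ('master' dir is a placeholder).
        subprocess.check_call(
            ["python", "buildbot-wrangler.py", "reconfig", "master"], cwd=BASEDIR)
        with open(STATE, "w") as f:
            f.write(rev)

if __name__ == "__main__":
    main()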
See Also: → 1087335
Blocks: 1101631
coop, I think this is a dupe... (since we no longer care about foopies)
Flags: needinfo?(coop)
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(coop)
Resolution: --- → DUPLICATE
Component: Tools → General