self-host releng puppetmasters at Mozilla



Infrastructure & Operations
5 years ago
(Reporter: dustin, Assigned: dustin)



(Whiteboard: [2013Q2] [tracker])


(4 attachments, 2 obsolete attachments)

The PuppetAgain masters are currently built using infra puppet, but they're increasingly diverging from the infra puppet masters (which will soon be running 3.0, and using hiera, and all manner of other neat stuff).

These should be self-hosted, so they look to themselves (or other puppet masters) for their configuration.

This may go hand-in-hand with re-doing the certificate handling to be more in line with what PuppetLabs will support.
Assignee: server-ops-releng → dustin
Amy points out that self-hosting the masters may not be such a good idea: it would mean we have critical systems that aren't like all of the others and aren't set up to be cared for by the SRE team.
Assignee: dustin → server-ops-releng
Assignee: server-ops-releng → dustin
Depends on: 836014
Created attachment 708393 [details] [diff] [review]

Make the puppet startup type a parameter

Then the standalone puppetmaster uses type 'none', and installs a
separate update crontask.

This is a lead-up to writing a toplevel::server::puppetmaster::clustered that *does* use puppet::periodic.

I tested this on Darwin, CentOS, and Ubuntu (where only puppet::none and puppet::atboot are implemented), and with a CentOS standalone node definition.
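A rough sketch of what the parameterization could look like (the class name, parameter name, and branch set are assumptions based on the description above, not the actual patch):

```puppet
# Hypothetical sketch only -- names are assumed, not taken from the patch.
# Making the startup type a parameter lets the standalone puppetmaster
# opt out of puppet::periodic and install its own update crontask instead.
class puppet($startup_type = 'periodic') {
    case $startup_type {
        'periodic': { include puppet::periodic }  # recurring agent runs
        'atboot':   { include puppet::atboot }    # single run at boot
        'none':     { }  # no agent; the master manages its own updates
        default:    { fail("unknown puppet startup type '${startup_type}'") }
    }
}
```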
Attachment #708393 - Flags: review?(rail)
Comment on attachment 708393 [details] [diff] [review]

Nit: can we do s/none/manual/ to imply that puppet is executed manually (e.g. via cron)? 'none' reads [to me] like we don't intend puppet to ever run again on that machine.

I'm open to words other than 'none'.
Attachment #708393 - Flags: feedback+
Actually, that's exactly what this does mean.  Puppet agent never runs.  So I think it's a good name :)
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> Actually, that's exactly what this does mean.  Puppet agent never runs.  So
> I think it's a good name :)

"puppet apply" running from cron != "none" to me; that was more my point.
As far as the puppet module's concerned, it's none.  The puppet apply happens as part of the puppetmaster's operation.
Comment on attachment 708393 [details] [diff] [review]

Review of attachment 708393 [details] [diff] [review]:

Ship it!
Attachment #708393 - Flags: review?(rail) → review+
Landed and backed right out because:

[root@foopy105 ~]# puppet agent --test
info: Retrieving plugin
info: Caching catalog for
info: Applying configuration version '0af6a712faf6'
notice: /Stage[main]/Puppet::None/File[/etc/cron.d/puppetcheck.cron]/ensure: removed
notice: Finished catalog run in 6.17 seconds
I verified no systems were impacted during the 2 minutes this was landed.
...and duped to a bug from 2.6.2 that will never be fixed.
(In reply to Dustin J. Mitchell [:dustin] from comment #8)
> Landed and backed right out because:
> [root@foopy105 ~]# puppet agent --test
> info: Retrieving plugin
> info: Caching catalog for
> info: Applying configuration version '0af6a712faf6'
> notice: /Stage[main]/Puppet::None/File[/etc/cron.d/puppetcheck.cron]/ensure:
> removed
> notice: Finished catalog run in 6.17 seconds

(In reply to Dustin J. Mitchell [:dustin] from comment #9)
> I verified no systems were impacted during the 2 minutes this was landed.

Huh, comment #8 seems to be a clear sign that something *was* affected ;-) (--test is not a dry-run)

Did you restore foopy105 (and any other potentially affected machines) to a state that actually allows it to continue to run puppet?
I'm peeking in on this again.

Rail, the major sticking point here that led to this landing and backout is maintaining support for masters that use 'puppet apply' rather than running puppet::periodic.

How would you feel about changing that arrangement?  We could freeze updates to the existing AWS masters (just comment out the 'puppet apply') then build out new masters in AWS using this self-hosted method, and re-certify all of the AWS hosts there.  We'll need to re-certify all of the Mozilla-hosted clients anyway, so that's not a whole lot of additional work.  It would simplify the manifests substantially.
Flags: needinfo?(rail)
From some IRC conversations, it sounds like this is something we can try, at least.  I'll start putting it together.
Flags: needinfo?(rail)
I don't feel comfortable switching to the model we wanted to avoid when we created the puppetmaster manifests in the first place. The masters won't be standalone anymore...

BTW, you can use "include toplevel::server::puppetmaster" instead of "include toplevel::server::puppetmaster::standalone" to make it update files against other masters.
Right, the idea is to build masters that are part of a cluster, not standalone.  I'm not sure how that's related to how they configure themselves.

I've yet to hear an explanation of how 'puppet apply' is not worse than 'puppet agent'.  I explicitly decided not to argue about it when you landed your patches, so that things could keep moving.

The closest I've heard is that, with 'puppet agent', the master might restart itself, and this would either cause problems with the run in progress, or cause the master to fail.  I don't see it causing problems with the run in progress, as the catalog is already generated by the time the agent begins processing it.  And I don't see it causing problems with the master failing permanently, any more than you would see with 'puppet apply' -- if we land a bogus patch that breaks the masters, both methods will cause the master to fail.

There's also concern about bootstrapping.  This is something I'll need to address in building clustered puppet masters anyway.  I don't know exactly how I'll solve this yet, but if 'puppet agent' will work, even sort-of, for that, then that's probably adequate.  Infra puppet has this problem solved, so I'll probably crib from their work.

The downsides to running 'puppet apply' are:
 - it's tricky/hacky to configure servers to use that instead of puppet::periodic (see above)
 - 'puppet apply' is not the same as an agent, particularly around custom ruby code and references to the server.  This could cause unexpected problems down the road.

As a point of reference, infra puppet masters run 'puppet agent' against themselves with no issues whatsoever.

So, I'd like to try this out.  Certainly if you can demonstrate a concrete failure case, I'll take that into account.
Great! This more or less eliminates my concerns about the agent approach. One missing thing that would prevent us from switching is (easy) cert generation for slaves, especially the ones in EC2. The current mechanism should be replaced with something else to make the transition smooth.
Cert generation will be done via HTTP using getcert.cgi, and we can adjust that to allow remote generation as well (e.g., via SSH).  Thanks!
Whiteboard: [2013Q2] [tracker]
Duplicate of this bug: 798414
Created attachment 737013 [details] [diff] [review]

The attached patch was just successfully used to bootstrap a puppet master, with a version of puppet very close to what will be 3.2.0.  There are lots of TODO's left, but mostly around code quality and flexibility - the core functionality is in place except for cert revocation.
Created attachment 738707 [details] [diff] [review]
Attachment #737013 - Attachment is obsolete: true
Attachment #738707 - Flags: review?(bugspam.Callek)
Comment on attachment 738707 [details] [diff] [review]

If you'd prefer to look on github:
Attachment #738707 - Flags: review?(rail)
Comment on attachment 738707 [details] [diff] [review]

Wrong file? It doesn't look like a patch. Sorry, I just realized this. :/
Attachment #738707 - Flags: review?(rail)
Created attachment 739537 [details] [diff] [review]

Ugh, that paste-your-patch-here box... Apparently I put my comments there.

Everything on my checklist is finished here.  There's a *lot* to review, so I'll let you dive right in.  It's probably best to start with high-level stuff -- ask me where things aren't clear, or if you see something that's wrong.  We can get into nits later.
Attachment #738707 - Attachment is obsolete: true
Attachment #738707 - Flags: review?(bugspam.Callek)
Attachment #739537 - Flags: review?(rail)
Comment on attachment 739537 [details] [diff] [review]

Review of attachment 739537 [details] [diff] [review]:

Overall it looks great. I really like the fact that a lot of SSL-related files have been moved under version control.

Some nits below.

::: manifests/extlookup/relabs-config.csv
@@ +7,5 @@
> +global_authorized_keys,dustin
> +distinguished_puppetmaster,
> +puppet_again_repo,
> +xxpuppetmaster_upstream_rsync_source,rsync://
> +puppetmaster_upstream_rsync_source,relabs08:/data/

I assume it'll be fixed in the final version

@@ +8,5 @@
> +distinguished_puppetmaster,
> +puppet_again_repo,
> +xxpuppetmaster_upstream_rsync_source,rsync://
> +puppetmaster_upstream_rsync_source,relabs08:/data/
> +puppetmaster_upstream_rsync_args,--exclude=repos/apt 

trailing space

::: modules/puppetmaster/manifests/puppetsync_user.pp
@@ +6,5 @@
> +    include puppetmaster::settings
> +    $homedir = "/var/lib/puppetsync-home"
> +
> +    case $::operatingsystem {
> +        CentOS, Ubuntu: {

I think you can drop Ubuntu, since we don't support puppetmasters on this platform

::: modules/puppetmaster/manifests/rsync.pp
@@ +15,5 @@
> +
> +    $cron_schedule = $frequency ? {
> +        'often' => "*/5 * * * *",
> +        'half-hour' =>  "$rand_halfhour,$rand_secondhalf * * * *",
> +        'hourly' => "$rand_minute * * * *",

Can you fail if an unexpected value is passed?
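One way to do that (a sketch using the variables shown in the diff): a selector can't call the fail() statement function in this version of Puppet, so the usual idiom is a case statement whose default branch aborts catalog compilation.

```puppet
# Sketch of the requested change, not the final patch. An unexpected
# $frequency now fails loudly instead of silently producing no schedule.
case $frequency {
    'often':     { $cron_schedule = '*/5 * * * *' }
    'half-hour': { $cron_schedule = "$rand_halfhour,$rand_secondhalf * * * *" }
    'hourly':    { $cron_schedule = "$rand_minute * * * *" }
    default:     { fail("unexpected frequency value '${frequency}'") }
}
```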

::: modules/puppetmaster/templates/rsync.cron.erb
@@ +4,5 @@
> +MAILTO="<%= scope.lookupvar('::config::puppet_notif_email') %>"
> +# note that this rsync runs locally as root, but remotely as puppetsync; the -e
> +# instructs SSH to use puppetsync's SSH key and known_hosts, and to use those
> +# hosts appropriately
> +<%= @cron_schedule %> root rsync -e 'ssh -l puppetsync -l puppetsync -i <%= @puppetsync_home %>/.ssh/id_rsa -oUserKnownHostsFile=<%= @puppetsync_home %>/.ssh/known_hosts -oBatchMode=yes -oStrictHostKeyChecking=yes -oCheckHostIP=no' <% if @delete %>--delete<% end %> -rlpt <%- if @direction == 'to-distinguished-master' %><%= @from_dir %> <%= @distinguished_master %>:<%= @to_dir %><% else %><%= @distinguished_master %>:<%= @from_dir %> <%= @to_dir %><% end %>

"ssh -l puppetsync -l puppetsync" - please remove a duplicate

::: modules/puppetmaster/templates/ssl_git_config.erb
@@ +6,5 @@
> +	repositoryformatversion = 0
> +	filemode = true
> +	bare = false
> +	logallrefupdates = true
> +	sharedRepository = 0644

can you replace tabs in this file with spaces?

::: modules/puppetmaster/templates/
@@ +11,5 @@
> +# try pushing, but ignore errors
> +git push -q "${distinguished_common}" master 2>/dev/null || true
> +
> +# then pull and push until the push works
> +for i in range {1..10}; do

Please remove "range".
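For reference, `range` is Python syntax; in a shell the brace expansion alone drives the loop, and leaving `range` in adds an eleventh iteration with i set to the literal word "range". A minimal bash demonstration (not the template itself):

```shell
#!/bin/bash
# Count iterations with and without the stray "range" word.
with_range=0
for i in range {1..10}; do with_range=$((with_range + 1)); done

without_range=0
for i in {1..10}; do without_range=$((without_range + 1)); done

echo "with 'range': $with_range, without: $without_range"
# prints: with 'range': 11, without: 10
```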
Attachment #739537 - Flags: review?(rail) → review+
relabs-config.csv will likely change often -- it's just the config for the test cluster.  So don't worry too much about what you see there.

I need to drastically simplify that puppetmaster::rsync stuff.  I thought I'd be using it for lots of different syncs, so I tried to make it very general.  It turns out that's not necessary.

You've also caught some amusing other bugs in there - thanks :)
Depends on: 865215
Depends on: 865223
Blocks: 865799
Depends on: 872545
Depends on: 872549
Depends on: 872726

talos-linux*-ix-00{5,6} are yours.
Small issue with the Ubuntu slaves: installing the puppetlabs packages tries to restart the puppet service... which is, of course, already running.  In this situation, our initscript helpfully retries forever, hanging the puppet run.  Killing the "/etc/init.d/puppet start" process allows the puppet run to proceed (and fail, but the next run is fine).

I'm still thinking about how to work around this.  Hopefully it will not involve a custom package.  I think that packages automatically starting/stopping services is a serious bug, but I've already had this conversation with PuppetLabs about the RPMs, and they do not see it my way.
I also see

dpkg: error processing /var/cache/apt/archives/puppet_3.2.1-1puppetlabs1_all.deb (--unpack):
 trying to overwrite '/usr/share/man/man8/puppet-help.8.gz', which is also in package puppet-common 2.7.17-1mozilla1

during the upgrade process, but the install appears to succeed all the same, so I think that can be safely ignored.
Depends on: 875439
There are now six 3.2.0 puppetmasters in AWS; these are part of the moco cluster, rather than standalone - bug 872545.

There are nine buildmasters in scl3 running puppet-3.2.0 - bug 867593.

There are four Ubuntu iX systems running puppet-3.2.0 and ready for testing by releng - bug 872549.

Manifests are written for Lion builders (upgrading all the way from 0.24.8!) and ready for releng testing in June - bug 760093.

moved hosts:

servo buildmaster
buildbot-master8{1..9} - bug 867593
talos-linux64-ix-00{5,6} - bug 872549
Depends on: 882141
Depends on: 882739
I just landed some changes to the kickstart process for both CentOS and Ubuntu - there's now a "Puppet 3.2.0" menu tree in the PXE interface.
Blocks: 884426
Depends on: 884502
Depends on: 884506
 - lion builders - bug 760093
 - mountain lion talos - bug 882739
 - linux builders - bug 884506
 - buildmasters - bug 884502
OK, all of the deps are done.  Rail's even shut down puppetmaster-02 in AWS.  Tasks now:

1. Watch the logs on the 2.7.x masters for a bit, to verify we haven't forgotten a silo.
2. Shut the 2.7.x masters down.
3. Verify that all hosts on the 3.2.x masters are using certs issued by those masters.
4. Revoke the 2.7.x masters' SSL certs.
All of these - which were added early in the process - were still using certs on the old masters.  I've re-issued certs for all of them.  This list was generated from the diff between logged connections to the new masters and the list of certs issued by the new masters.
Depends on: 885823
I turned off the scl1 and mtv1 masters, but the scl3 master is still live to host the public mirror.  I've filed a flow request to switch that to the new scl3 master, and I'll also have to change the distinguished master to scl3.
I disabled access to the puppetmaster via HTTP on the scl3 host, so it's just serving /data via HTTP and rsync.
I'm running a "practice" switch of the relabs distinguished master.  It seems to be going well so far.
Created attachment 766730 [details] [diff] [review]

We need to do this because the distinguished master (DM) is the one that serves rsync and public HTTP, and that needs to be proxied by a load balancer in scl3.  That, and scl1 will be closing soon, so it's not a good spot for a DM.
Attachment #766730 - Flags: review?(rail)
Attachment #766730 - Flags: review?(rail) → review+
Attachment #766730 - Flags: checked-in+
Created attachment 767804 [details] [diff] [review]

The node-scope variable didn't work very well, and yuck anyway.
Attachment #767804 - Flags: review?(bugspam.Callek)
OK, releng-puppet2.srv.scl3 is now running the public mirror.  Servers have been swapped in zeus.  releng-puppet1.srv.releng.scl3 is no longer running apache or rsync.  I'm out for the rest of the week, so I'll finish the teardown next week.
Comment on attachment 767804 [details] [diff] [review]

r+ from Callek in irc
Attachment #767804 - Flags: review?(bugspam.Callek)
Attachment #767804 - Flags: review+
Attachment #767804 - Flags: checked-in+
OK, among many other things, I cleaned up the hacked DS deployments for mountain lion talos and snow leopard builders (the only two OS X flavors on puppetagain).
Last Resolved: 5 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: → Infrastructure & Operations