self-host releng puppetmasters at Mozilla

RESOLVED FIXED

Status

Infrastructure & Operations
RelOps
5 years ago
5 years ago

People

(Reporter: dustin, Assigned: dustin)

Tracking

Details

(Whiteboard: [2013Q2] [tracker])

Attachments

(4 attachments, 2 obsolete attachments)

The PuppetAgain masters are currently built using infra puppet, but they're increasingly diverging from the infra puppet masters (which will soon be running 3.0, and using hiera, and all manner of other neat stuff).

These should be self-hosted, so they look to themselves (or other puppet masters) for their configuration.

This may go hand-in-hand with re-doing the certificate handling to be more in line with what PuppetLabs will support.
Assignee: server-ops-releng → dustin
Amy points out that self-hosting the masters may not be such a good idea: it would mean we have critical systems that aren't like all of the others and aren't set up to be cared for by the SRE team.
Assignee: dustin → server-ops-releng
Assignee: server-ops-releng → dustin
Depends on: 836014
Created attachment 708393 [details] [diff] [review]
bug825056.patch

Make the puppet startup type a parameter

Then the standalone puppetmaster uses type 'none', and installs a
separate update crontask.

This is a lead-up to writing a toplevel::server::puppetmaster::clustered that *does* use puppet::periodic.

I tested this on Darwin, CentOS, and Ubuntu (where only puppet::none and puppet::atboot are implemented), and with a CentOS standalone node definition.
Attachment #708393 - Flags: review?(rail)
Comment on attachment 708393 [details] [diff] [review]
bug825056.patch

Nit: can we do s/none/manual/, to imply that puppet is executed manually (e.g. via cron)? To me, 'none' reads like we don't intend puppet to ever run on that machine again.

I'm open to other words than 'none'
Attachment #708393 - Flags: feedback+
Actually, that's exactly what this does mean.  Puppet agent never runs.  So I think it's a good name :)
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> Actually, that's exactly what this does mean.  Puppet agent never runs.  So
> I think it's a good name :)

puppet apply running in a cron !=== "none" to me, was more my point.
As far as the puppet module's concerned, it's none.  The puppet apply happens as part of the puppetmaster's operation.
Comment on attachment 708393 [details] [diff] [review]
bug825056.patch

Review of attachment 708393 [details] [diff] [review]:
-----------------------------------------------------------------

_~
    _~ )_)_~
    )_))_))_)
    _!__!__!_     it!
    \_______/
  ~~~~~~~~~~~~~
Attachment #708393 - Flags: review?(rail) → review+
Landed and backed right out because:

[root@foopy105 ~]# puppet agent --test --server=releng-puppet1.build.mtv1.mozilla.com
info: Retrieving plugin
info: Caching catalog for foopy105.p10.releng.scl1.mozilla.com
info: Applying configuration version '0af6a712faf6'
notice: /Stage[main]/Puppet::None/File[/etc/cron.d/puppetcheck.cron]/ensure: removed
notice: Finished catalog run in 6.17 seconds
I verified no systems were impacted during the 2 minutes this was landed.
http://projects.puppetlabs.com/issues/13537
..and duped to a bug from 2.6.2 that will never be fixed.
(In reply to Dustin J. Mitchell [:dustin] from comment #8)
> Landed and backed right out because:
> 
> [root@foopy105 ~]# puppet agent --test
> --server=releng-puppet1.build.mtv1.mozilla.com
> info: Retrieving plugin
> info: Caching catalog for foopy105.p10.releng.scl1.mozilla.com
> info: Applying configuration version '0af6a712faf6'
> notice: /Stage[main]/Puppet::None/File[/etc/cron.d/puppetcheck.cron]/ensure:
> removed
> notice: Finished catalog run in 6.17 seconds

(In reply to Dustin J. Mitchell [:dustin] from comment #9)
> I verified no systems were impacted during the 2 minutes this was landed.

huh c#8 seems to be a clear sign that something was affected ;-) (--test is not a dry-run)
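
For reference, `puppet agent` accepts `--noop`, which reports what a run would change without applying it; combined with `--test` it gives a dry run. A minimal sketch, where the `dry_run_cmd` helper is hypothetical and only prints the command rather than running it:

```shell
# Hypothetical helper: print the agent command for a dry run against
# a given master. --noop makes the agent report changes without
# applying them; --test alone applies the catalog.
dry_run_cmd() {
    server="$1"
    printf 'puppet agent --test --noop --server=%s\n' "$server"
}

dry_run_cmd releng-puppet1.build.mtv1.mozilla.com
```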

Did you restore foopy105 (and any other potentially run machines) to a state that actually allows it to continue to run puppet?
obviously
I'm peeking in on this again.

Rail, the major sticking point here that led to this landing and backout is maintaining support for masters that use 'puppet apply' rather than running puppet::periodic.

How would you feel about changing that arrangement?  We could freeze updates to the existing AWS masters (just comment out the 'puppet apply') then build out new masters in AWS using this self-hosted method, and re-certify all of the AWS hosts there.  We'll need to re-certify all of the Mozilla-hosted clients anyway, so that's not a whole lot of additional work.  It would simplify the manifests substantially.
Flags: needinfo?(rail)
From some IRC conversations, it sounds like this is something we can try, at least.  I'll start putting it together.
Flags: needinfo?(rail)
I don't feel comfortable switching to the model we wanted to avoid when we created the puppetmaster manifests in the first place. The masters won't be standalone anymore...

BTW, you can use "include toplevel::server::puppetmaster" instead of "include toplevel::server::puppetmaster::standalone" to make it update files against other masters.
Right, the idea is to build masters that are part of a cluster, not standalone.  I'm not sure how that's related to how they configure themselves.

I've yet to hear an explanation of how 'puppet apply' is not worse than 'puppet agent'.  I explicitly decided not to argue about it when you landed your patches, so that things could keep moving.

The closest I've heard is that, with 'puppet agent', the master might restart itself, and this would either cause problems with the run in progress, or cause the master to fail.  I don't see it causing problems with the run in progress, as the catalog is already generated by the time the agent begins processing it.  And I don't see it causing problems with the master failing permanently, any more than you would see with 'puppet apply' -- if we land a bogus patch that breaks the masters, both methods will cause the master to fail.

There's also concern about bootstrapping.  This is something I'll need to address in building clustered puppet masters anyway.  I don't know exactly how I'll solve this yet, but if 'puppet agent' will work, even sort-of, for that, then that's probably adequate.  Infra puppet has this problem solved, so I'll probably crib from their work.

The downsides to running 'puppet apply' are:
 - it's tricky/hacky to configure servers to use that instead of puppet::periodic (see above)
 - 'puppet apply' is not the same as an agent, particularly around custom ruby code and references to the server.  This could cause unexpected problems down the road.

As a point of reference, infra puppet masters run 'puppet agent' against themselves with no issues whatsoever.

So, I'd like to try this out.  Certainly if you can demonstrate a concrete failure case, I'll take that into account.
Great! This more or less eliminates my concerns about the agent approach. The one missing piece that would prevent us from switching is (easy) cert generation for slaves, especially the ones in EC2. http://hg.mozilla.org/build/puppet/file/b087ee71e52b/setup/ca-scripts/generate-cert.sh should be replaced with something else to make the transition smooth.
Cert generation will be done via HTTP using getcert.cgi, and we can adjust that to allow remote generation as well (e.g., via SSH).  Thanks!
Whiteboard: [2013Q2] [tracker]
Duplicate of this bug: 798414
Created attachment 737013 [details] [diff] [review]
self-host-preview.patch

The attached patch was just successfully used to bootstrap a puppet master, with a version of puppet very close to what will be 3.2.0.  There are lots of TODOs left, but mostly around code quality and flexibility - the core functionality is in place except for cert revocation.
Created attachment 738707 [details] [diff] [review]
bug825056-self-host.patch
Attachment #737013 - Attachment is obsolete: true
Attachment #738707 - Flags: review?(bugspam.Callek)
Comment on attachment 738707 [details] [diff] [review]
bug825056-self-host.patch

If you'd prefer to look on github:

https://github.com/djmitche/releng-puppet/commit/bug825056
Attachment #738707 - Flags: review?(rail)
Comment on attachment 738707 [details] [diff] [review]
bug825056-self-host.patch

Wrong file? It doesn't look like a patch. Sorry, I just realized this. :/
Attachment #738707 - Flags: review?(rail)
Created attachment 739537 [details] [diff] [review]
bug825056-self-host.patch

Ugh, that paste-your-patch-here box..  Apparently I put my comments there.

Everything on my checklist is finished here.  There's a *lot* to review, so I'll let you dive right in.  It's probably best to start with high-level stuff -- ask me where things aren't clear, or if you see something that's wrong.  We can get into nits later.
Attachment #738707 - Attachment is obsolete: true
Attachment #738707 - Flags: review?(bugspam.Callek)
Attachment #739537 - Flags: review?(rail)
Comment on attachment 739537 [details] [diff] [review]
bug825056-self-host.patch

Review of attachment 739537 [details] [diff] [review]:
-----------------------------------------------------------------

Overall it looks great. I really like the fact that a lot of SSL-related files have been moved under version control.

Some nits below.

::: manifests/extlookup/relabs-config.csv
@@ +7,5 @@
> +global_authorized_keys,dustin
> +distinguished_puppetmaster,relabs03.build.mtv1.mozilla.com
> +puppet_again_repo,http://hg.mozilla.org/users/dmitchell_mozilla.com/bug825056/
> +xxpuppetmaster_upstream_rsync_source,rsync://puppetagain.pub.build.mozilla.org/data/
> +puppetmaster_upstream_rsync_source,relabs08:/data/

I assume it'll be fixed in the final version

@@ +8,5 @@
> +distinguished_puppetmaster,relabs03.build.mtv1.mozilla.com
> +puppet_again_repo,http://hg.mozilla.org/users/dmitchell_mozilla.com/bug825056/
> +xxpuppetmaster_upstream_rsync_source,rsync://puppetagain.pub.build.mozilla.org/data/
> +puppetmaster_upstream_rsync_source,relabs08:/data/
> +puppetmaster_upstream_rsync_args,--exclude=repos/apt 

trailing space

::: modules/puppetmaster/manifests/puppetsync_user.pp
@@ +6,5 @@
> +    include puppetmaster::settings
> +    $homedir = "/var/lib/puppetsync-home"
> +
> +    case $::operatingsystem {
> +        CentOS, Ubuntu: {

I think you can drop Ubuntu, since we don't support puppetmasters on this platform

::: modules/puppetmaster/manifests/rsync.pp
@@ +15,5 @@
> +
> +    $cron_schedule = $frequency ? {
> +        'often' => "*/5 * * * *",
> +        'half-hour' =>  "$rand_halfhour,$rand_secondhalf * * * *",
> +        'hourly' => "$rand_minute * * * *",

Can you fail if an unexpected value is passed?
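
To illustrate the request, here is the same mapping sketched in shell rather than Puppet, with an explicit failure branch for unknown values. In the Puppet selector itself this would presumably be a `default` entry calling `fail()`. The `cron_schedule` function and its single `minute` argument are assumptions made for the sketch; the patch uses separate `$rand_*` variables.

```shell
# cron_schedule FREQUENCY MINUTE: print a cron time spec, or fail loudly.
# MINUTE is a hypothetical per-host offset in the range 0-29.
cron_schedule() {
    frequency="$1"
    minute="$2"
    case "$frequency" in
        often)     echo "*/5 * * * *" ;;
        half-hour) echo "$minute,$((minute + 30)) * * * *" ;;
        hourly)    echo "$minute * * * *" ;;
        *)
            # Unknown value: report it and return non-zero instead of
            # silently producing an empty schedule.
            echo "cron_schedule: unexpected frequency '$frequency'" >&2
            return 1
            ;;
    esac
}
```

With this shape, a typo such as `cron_schedule hourley 7` stops the caller instead of writing a broken cron file.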

::: modules/puppetmaster/templates/rsync.cron.erb
@@ +4,5 @@
> +MAILTO="<%= scope.lookupvar('::config::puppet_notif_email') %>"
> +# note that this rsync runs locally as root, but remotely as puppetsync; the -e
> +# instructs SSH to use puppetsync's SSH key and known_hosts, and to use those
> +# hosts appropriately
> +<%= @cron_schedule %> root rsync -e 'ssh -l puppetsync -l puppetsync -i <%= @puppetsync_home %>/.ssh/id_rsa -oUserKnownHostsFile=<%= @puppetsync_home %>/.ssh/known_hosts -oBatchMode=yes -oStrictHostKeyChecking=yes -oCheckHostIP=no' <% if @delete %>--delete<% end %> -rlpt <%- if @direction == 'to-distinguished-master' %><%= @from_dir %> <%= @distinguished_master %>:<%= @to_dir %><% else %><%= @distinguished_master %>:<%= @from_dir %> <%= @to_dir %><% end %>

"ssh -l puppetsync -l puppetsync" - please remove a duplicate

::: modules/puppetmaster/templates/ssl_git_config.erb
@@ +6,5 @@
> +	repositoryformatversion = 0
> +	filemode = true
> +	bare = false
> +	logallrefupdates = true
> +	sharedRepository = 0644

can you replace tabs in this file with spaces?

::: modules/puppetmaster/templates/ssl_git_sync.sh.erb
@@ +11,5 @@
> +# try pushing, but ignore errors
> +git push -q "${distinguished_common}" master 2>/dev/null || true
> +
> +# then pull and push until the push works
> +for i in range {1..10}; do

Please remove "range".
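
For context: with `range` left in, bash expands the list to `range 1 2 ... 10`, so the loop runs eleven times and the first pass has `i` set to the literal word `range`. A bounded retry can also be written portably with a counter, as in this sketch (the `retry` helper is hypothetical, not part of the patch):

```shell
# retry CMD [ARGS...]: run a command up to 10 times, stopping at the
# first success. Returns 0 on success, 1 if every attempt failed.
retry() {
    attempt=1
    while [ "$attempt" -le 10 ]; do
        "$@" && return 0
        attempt=$((attempt + 1))
    done
    return 1
}
```

In the sync script, the command passed to `retry` would be the pull-then-push step.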
Attachment #739537 - Flags: review?(rail) → review+
relabs-config.csv will likely change around often -- it's just the config for the test cluster.  So don't worry too much about what you see there.

I need to drastically simplify that puppetmaster::rsync stuff.  I thought I'd be using it for lots of different syncs, so I tried to make it very general.  It turns out that's not necessary.

Some amusing other bugs in there you've caught - thanks :)
Depends on: 865215
Depends on: 865223
Blocks: 865799
Depends on: 872545
Depends on: 872549
Depends on: 872726

talos-linux*-ix-00{5,6} are yours.
Small issue with the Ubuntu slaves: installing the puppetlabs packages tries to restart the puppet service.. which is, of course, already running.  In this situation, our initscript helpfully retries forever, hanging the puppet run.  Killing the /etc/puppet/init start process allows the puppet run to proceed (and fail, but the next run is fine).

I'm still thinking about how to work around this.  Hopefully it will not involve a custom package. I think that automatically starting/stopping services is a serious bug, but I've already had this conversation with PL about the RPMs, and they do not see it my way.
I also see

dpkg: error processing /var/cache/apt/archives/puppet_3.2.1-1puppetlabs1_all.deb (--unpack):
 trying to overwrite '/usr/share/man/man8/puppet-help.8.gz', which is also in package puppet-common 2.7.17-1mozilla1

during the upgrade process, but the install appears to succeed all the same, so I think that can be safely ignored.
Depends on: 875439
There are now six 3.2.0 puppetmasters in AWS; these are part of the moco cluster, rather than standalone - bug 872545.

There are nine buildmasters in scl3 running puppet-3.2.0 - bug 867593.

There are four Ubuntu iX systems running puppet-3.2.0 and ready for testing by releng - bug 872549.

Manifests are written for Lion builders (upgrading all the way from 0.24.8!) and ready for releng testing in June - bug 760093.

moved hosts:

mobile-imaging-stage1.p127.scl1
mobile-imaging-???.p?.scl1
foopy*
servo buildmaster
buildbot-master8{1..9} - bug 867593
talos-linux64-ix-00{5,6} - bug 872549
Depends on: 882141
Depends on: 882739
I just landed some changes to the kickstart process for both CentOS and Ubuntu - there's now a "Puppet 3.2.0" menu tree in the PXE interface.
Blocks: 884426
Depends on: 884502
Depends on: 884506
Remaining:
 - lion builders - bug 760093
 - mountain lion talos - bug 882739
 - linux builders - bug 884506
 - buildmasters - bug 884502
OK, all of the deps are done.  Rail's even shut down puppetmaster-02 in AWS.  Tasks now:

1. Watch the logs on the 2.7.x masters for a bit, to verify we haven't forgotten a silo.
2. Shut the 2.7.x masters down.
3. Verify that all hosts on the 3.2.x masters are using certs issued by those masters.
4. Revoke the 2.7.x masters' SSL certs.
All of these - which were added early in the process - were still using certs on the old masters.  I've re-issued certs for all of them.  This list was generated from the diff between logged connections to the new masters and the list of certs issued by the new masters.

buildbot-master81.srv.releng.scl3.mozilla.com 
buildbot-master82.srv.releng.scl3.mozilla.com 
buildbot-master83.srv.releng.scl3.mozilla.com 
buildbot-master84.srv.releng.scl3.mozilla.com 
buildbot-master85.srv.releng.scl3.mozilla.com 
buildbot-master86.srv.releng.scl3.mozilla.com 
buildbot-master87.srv.releng.scl3.mozilla.com 
buildbot-master88.srv.releng.scl3.mozilla.com 
buildbot-master89.srv.releng.scl3.mozilla.com
foopy100.p9.releng.scl1.mozilla.com
foopy101.p9.releng.scl1.mozilla.com
foopy102.p10.releng.scl1.mozilla.com
foopy103.p10.releng.scl1.mozilla.com
foopy104.p10.releng.scl1.mozilla.com
foopy105.p10.releng.scl1.mozilla.com
foopy106.p10.releng.scl1.mozilla.com
foopy108.p10.releng.scl1.mozilla.com
foopy110.build.mtv1.mozilla.com
foopy111.build.mtv1.mozilla.com
foopy112.build.mtv1.mozilla.com
foopy113.build.mtv1.mozilla.com
foopy114.build.mtv1.mozilla.com
foopy115.build.mtv1.mozilla.com
foopy116.build.mtv1.mozilla.com
foopy117.build.mtv1.mozilla.com
foopy118.build.mtv1.mozilla.com
foopy119.build.mtv1.mozilla.com
foopy120.build.mtv1.mozilla.com
foopy121.build.mtv1.mozilla.com
foopy122.build.mtv1.mozilla.com
foopy123.build.mtv1.mozilla.com
foopy124.build.mtv1.mozilla.com
foopy125.build.mtv1.mozilla.com
foopy126.build.mtv1.mozilla.com
foopy128.build.mtv1.mozilla.com
foopy25.build.scl1.mozilla.com
foopy26.build.mtv1.mozilla.com
foopy27.build.mtv1.mozilla.com
foopy28.build.mtv1.mozilla.com
foopy29.build.mtv1.mozilla.com
foopy30.build.mtv1.mozilla.com
foopy31.build.mtv1.mozilla.com
foopy32.build.mtv1.mozilla.com
foopy33.build.scl1.mozilla.com
foopy34.build.scl1.mozilla.com
foopy35.build.scl1.mozilla.com
foopy36.build.scl1.mozilla.com
foopy37.build.scl1.mozilla.com
foopy39.p1.releng.scl1.mozilla.com
foopy40.p1.releng.scl1.mozilla.com
foopy41.p1.releng.scl1.mozilla.com
foopy42.p1.releng.scl1.mozilla.com
foopy43.p1.releng.scl1.mozilla.com
foopy44.p1.releng.scl1.mozilla.com
foopy45.p1.releng.scl1.mozilla.com
foopy46.p2.releng.scl1.mozilla.com
foopy47.p2.releng.scl1.mozilla.com
foopy48.p2.releng.scl1.mozilla.com
foopy49.p2.releng.scl1.mozilla.com
foopy50.p2.releng.scl1.mozilla.com
foopy51.p2.releng.scl1.mozilla.com
foopy52.p2.releng.scl1.mozilla.com
foopy53.p3.releng.scl1.mozilla.com
foopy54.p3.releng.scl1.mozilla.com
foopy55.p3.releng.scl1.mozilla.com
foopy56.p3.releng.scl1.mozilla.com
foopy57.p3.releng.scl1.mozilla.com
foopy58.p3.releng.scl1.mozilla.com
foopy59.p3.releng.scl1.mozilla.com
foopy60.p4.releng.scl1.mozilla.com
foopy61.p4.releng.scl1.mozilla.com
foopy62.p4.releng.scl1.mozilla.com
foopy63.p4.releng.scl1.mozilla.com
foopy64.p4.releng.scl1.mozilla.com
foopy65.p4.releng.scl1.mozilla.com
foopy66.p4.releng.scl1.mozilla.com
foopy67.p5.releng.scl1.mozilla.com
foopy68.p5.releng.scl1.mozilla.com
foopy69.p5.releng.scl1.mozilla.com
foopy70.p5.releng.scl1.mozilla.com
foopy71.p5.releng.scl1.mozilla.com
foopy72.p5.releng.scl1.mozilla.com
foopy73.p5.releng.scl1.mozilla.com
foopy74.p6.releng.scl1.mozilla.com
foopy75.p6.releng.scl1.mozilla.com
foopy76.p6.releng.scl1.mozilla.com
foopy77.p6.releng.scl1.mozilla.com
foopy78.p6.releng.scl1.mozilla.com
foopy79.p6.releng.scl1.mozilla.com
foopy80.p6.releng.scl1.mozilla.com
foopy81.p7.releng.scl1.mozilla.com
foopy82.p7.releng.scl1.mozilla.com
foopy83.p7.releng.scl1.mozilla.com
foopy84.p7.releng.scl1.mozilla.com
foopy85.p7.releng.scl1.mozilla.com
foopy86.p7.releng.scl1.mozilla.com
foopy87.p7.releng.scl1.mozilla.com
foopy88.p8.releng.scl1.mozilla.com
foopy89.p8.releng.scl1.mozilla.com
foopy90.p8.releng.scl1.mozilla.com
foopy91.p8.releng.scl1.mozilla.com
foopy92.p8.releng.scl1.mozilla.com
foopy93.p8.releng.scl1.mozilla.com
foopy94.p8.releng.scl1.mozilla.com
foopy95.p9.releng.scl1.mozilla.com
foopy96.p9.releng.scl1.mozilla.com
foopy97.p9.releng.scl1.mozilla.com
foopy98.p9.releng.scl1.mozilla.com
foopy99.p9.releng.scl1.mozilla.com
mobile-imaging-001.p1.releng.scl1.mozilla.com
mobile-imaging-002.p2.releng.scl1.mozilla.com
mobile-imaging-003.p3.releng.scl1.mozilla.com
mobile-imaging-004.p4.releng.scl1.mozilla.com
mobile-imaging-005.p5.releng.scl1.mozilla.com
mobile-imaging-006.p6.releng.scl1.mozilla.com
mobile-imaging-007.p7.releng.scl1.mozilla.com
mobile-imaging-008.p8.releng.scl1.mozilla.com
mobile-imaging-009.p9.releng.scl1.mozilla.com
mobile-imaging-010.p10.releng.scl1.mozilla.com
releng-puppet2.srv.releng.scl3.mozilla.com
talos-linux64-ix-006.test.releng.scl3.mozilla.com
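
A comparison like the one described above can be sketched with `sort` and `comm`. This is a guess at the approach, not the commands that were actually run; the helper name and the two input files (one hostname per line) are hypothetical.

```shell
# hosts_without_new_cert CONNECTED ISSUED
#   CONNECTED: hostnames seen connecting to the new masters
#   ISSUED:    hostnames with certs issued by the new masters
# Prints hosts that connected but hold no cert from the new masters.
hosts_without_new_cert() {
    connected="$1"
    issued="$2"
    tmpdir=$(mktemp -d) || return 1
    # comm requires sorted input; -u also drops repeated log entries.
    sort -u "$connected" > "$tmpdir/connected"
    sort -u "$issued" > "$tmpdir/issued"
    # -23 hides lines unique to ISSUED and lines common to both,
    # leaving only lines unique to CONNECTED.
    comm -23 "$tmpdir/connected" "$tmpdir/issued"
    rm -rf "$tmpdir"
}
```

Usage would look like `hosts_without_new_cert connected-hosts.txt issued-certs.txt`, with both file names standing in for whatever the logs and CA inventory provide.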
Depends on: 885823
I turned off the scl1 and mtv1 masters, but the scl3 master is still live to host the public mirror.  I've filed a flow request to switch that to the new scl3 master, and I'll also have to change the distinguished master to scl3.
I disabled access to the puppetmaster via HTTP on the scl3 host, so it's just serving /data via HTTP and rsync.
I'm running a "practice" switch of the relabs distinguished master.  It seems to be going well so far.
Created attachment 766730 [details] [diff] [review]
bug825056.patch

We need to do this because the DM is the one that serves rsync and public HTTP, and that needs to be proxied by a load balancer in scl3.  That, and scl1 will be closing soon so it's not a good spot for a DM.
Attachment #766730 - Flags: review?(rail)
Attachment #766730 - Flags: review?(rail) → review+
Attachment #766730 - Flags: checked-in+
Created attachment 767804 [details] [diff] [review]
bug825056.patch

The node-scope variable didn't work very well, and yuck anyway.
Attachment #767804 - Flags: review?(bugspam.Callek)
OK, releng-puppet2.srv.scl3 is now running the public mirror.  Servers have been swapped in zeus.  releng-puppet1.srv.releng.scl3 is no longer running apache or rsync.  I'm out for the rest of the week, so I'll finish the teardown next week.
Comment on attachment 767804 [details] [diff] [review]
bug825056.patch

r+ from Callek in irc
Attachment #767804 - Flags: review?(bugspam.Callek)
Attachment #767804 - Flags: review+
Attachment #767804 - Flags: checked-in+
OK, among many other things, I cleaned up the hacked DS deployments for mountain lion talos and snow leopard builders (the only two OS X flavors on puppetagain).
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations