Closed Bug 825056 Opened 13 years ago Closed 12 years ago

self-host releng puppetmasters at Mozilla

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

(Whiteboard: [2013Q2] [tracker])

Attachments

(4 files, 2 obsolete files)

The PuppetAgain masters are currently built using infra puppet, but they're increasingly diverging from the infra puppet masters (which will soon be running 3.0, and using hiera, and all manner of other neat stuff). These should be self-hosted, so they look to themselves (or other puppet masters) for their configuration. This may go hand-in-hand with re-doing the certificate handling to be more in line with what PuppetLabs will support.
Assignee: server-ops-releng → dustin
Amy points out that self-hosting the masters may not be such a good idea: it would mean we have critical systems that aren't like all of the others and aren't set up to be cared for by the SRE team.
Assignee: dustin → server-ops-releng
Assignee: server-ops-releng → dustin
Depends on: 836014
Attached patch bug825056.patchSplinter Review
Make the puppet startup type a parameter. Then the standalone puppetmaster uses type 'none', and installs a separate update crontask. This is a lead-up to writing a toplevel::server::puppetmaster::clustered that *does* use puppet::periodic. I tested this on Darwin, CentOS, and Ubuntu (where only puppet::none and puppet::atboot are implemented), and with a CentOS standalone node definition.
Attachment #708393 - Flags: review?(rail)
Comment on attachment 708393 [details] [diff] [review]
bug825056.patch

nit: can we do s/none/manual/ to imply that puppet is manually (e.g. via cron) executed, rather than 'none', which reads [to me] like we don't intend puppet to ever run again on that machine. I'm open to words other than 'none'.
Attachment #708393 - Flags: feedback+
Actually, that's exactly what this does mean. Puppet agent never runs. So I think it's a good name :)
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> Actually, that's exactly what this does mean. Puppet agent never runs. So
> I think it's a good name :)

puppet apply running in a cron !== "none" to me, was more my point.
As far as the puppet module's concerned, it's none. The puppet apply happens as part of the puppetmaster's operation.
Comment on attachment 708393 [details] [diff] [review]
bug825056.patch

Review of attachment 708393 [details] [diff] [review]:
-----------------------------------------------------------------

         _~
      _~ )_)_~
      )_))_))_)
      _!__!__!_
 it!  \_______/
     ~~~~~~~~~~~~~
Attachment #708393 - Flags: review?(rail) → review+
Landed and backed right out because:

[root@foopy105 ~]# puppet agent --test --server=releng-puppet1.build.mtv1.mozilla.com
info: Retrieving plugin
info: Caching catalog for foopy105.p10.releng.scl1.mozilla.com
info: Applying configuration version '0af6a712faf6'
notice: /Stage[main]/Puppet::None/File[/etc/cron.d/puppetcheck.cron]/ensure: removed
notice: Finished catalog run in 6.17 seconds
I verified no systems were impacted during the 2 minutes this was landed.
http://projects.puppetlabs.com/issues/13537 ..and duped to a bug from 2.6.2 that will never be fixed.
(In reply to Dustin J. Mitchell [:dustin] from comment #8)
> Landed and backed right out because:
>
> [root@foopy105 ~]# puppet agent --test
> --server=releng-puppet1.build.mtv1.mozilla.com
> info: Retrieving plugin
> info: Caching catalog for foopy105.p10.releng.scl1.mozilla.com
> info: Applying configuration version '0af6a712faf6'
> notice: /Stage[main]/Puppet::None/File[/etc/cron.d/puppetcheck.cron]/ensure:
> removed
> notice: Finished catalog run in 6.17 seconds

(In reply to Dustin J. Mitchell [:dustin] from comment #9)
> I verified no systems were impacted during the 2 minutes this was landed.

huh, c#8 seems to be a clear sign that something was affected ;-) (--test is not a dry-run) Did you restore foopy105 (and any other potentially-run machines) to a state that actually allows it to continue to run puppet?
obviously
I'm peeking in on this again. Rail, the major sticking point here that led to this landing and backout is maintaining support for masters that use 'puppet apply' rather than running puppet::periodic. How would you feel about changing that arrangement? We could freeze updates to the existing AWS masters (just comment out the 'puppet apply') then build out new masters in AWS using this self-hosted method, and re-certify all of the AWS hosts there. We'll need to re-certify all of the Mozilla-hosted clients anyway, so that's not a whole lot of additional work. It would simplify the manifests substantially.
Flags: needinfo?(rail)
From some IRC conversations, it sounds like this is something we can try, at least. I'll start putting it together.
Flags: needinfo?(rail)
I don't feel comfortable switching to the model we wanted to avoid when we created the puppetmaster manifests in the first place. The masters won't be standalone anymore... BTW, you can use "include toplevel::server::puppetmaster" instead of "include toplevel::server::puppetmaster::standalone" to make it update files against other masters.
Right, the idea is to build masters that are part of a cluster, not standalone. I'm not sure how that's related to how they configure themselves.

I've yet to hear an explanation of how 'puppet apply' is not worse than 'puppet agent'. I explicitly decided not to argue about it when you landed your patches, so that things could keep moving. The closest I've heard is that, with 'puppet agent', the master might restart itself, and this would either cause problems with the run in progress, or cause the master to fail. I don't see it causing problems with the run in progress, as the catalog is already generated by the time the agent begins processing it. And I don't see it causing the master to fail permanently, any more than you would see with 'puppet apply' -- if we land a bogus patch that breaks the masters, both methods will cause the master to fail.

There's also concern about bootstrapping. This is something I'll need to address in building clustered puppet masters anyway. I don't know exactly how I'll solve this yet, but if 'puppet agent' will work, even sort-of, for that, then that's probably adequate. Infra puppet has this problem solved, so I'll probably crib from their work.

The downsides to running 'puppet apply' are:
- it's tricky/hacky to configure servers to use that instead of puppet::periodic (see above)
- 'puppet apply' is not the same as an agent, particularly around custom ruby code and references to the server. This could cause unexpected problems down the road.

As a point of reference, infra puppet masters run 'puppet agent' against themselves with no issues whatsoever. So, I'd like to try this out. Certainly if you can demonstrate a concrete failure case, I'll take that into account.
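For concreteness, the two arrangements differ only in the command cron runs on the master. A sketch (the schedule and paths here are illustrative, not the actual manifests):

```
# masterless: the master compiles and applies its own catalog directly
*/30 * * * * root cd /etc/puppet/environments/production && puppet apply manifests/site.pp

# agent-based: the master runs an ordinary agent pointed at itself
*/30 * * * * root puppet agent --onetime --no-daemonize --server localhost
```

In both cases the master's manifests come from the synced repo; the difference is whether a catalog is compiled in-process by 'puppet apply' or requested over SSL from the master's own puppetmaster service.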
Great! This more or less eliminates my concerns about the agent approach. One missing thing that would prevent us from switching is (easy) cert generation for slaves, especially the ones in EC2. http://hg.mozilla.org/build/puppet/file/b087ee71e52b/setup/ca-scripts/generate-cert.sh should be replaced with something else to make the transition smooth.
Cert generation will be done via HTTP using getcert.cgi, and we can adjust that to allow remote generation as well (e.g., via SSH). Thanks!
Whiteboard: [2013Q2] [tracker]
Attached patch self-host-preview.patch (obsolete) — Splinter Review
The attached patch was just successfully used to bootstrap a puppet master, with a version of puppet very close to what will be 3.2.0. There are lots of TODOs left, but mostly around code quality and flexibility - the core functionality is in place except for cert revocation.
Attached patch bug825056-self-host.patch (obsolete) — Splinter Review
Attachment #737013 - Attachment is obsolete: true
Attachment #738707 - Flags: review?(bugspam.Callek)
Attachment #738707 - Flags: review?(rail)
Comment on attachment 738707 [details] [diff] [review] bug825056-self-host.patch Wrong file? It doesn't look like a patch. Sorry, I just realized this. :/
Attachment #738707 - Flags: review?(rail)
Ugh, that paste-your-patch-here box.. Apparently I put my comments there. Everything on my checklist is finished here. There's a *lot* to review, so I'll let you dive right in. It's probably best to start with high-level stuff -- ask me where things aren't clear, or if you see something that's wrong. We can get into nits later.
Attachment #738707 - Attachment is obsolete: true
Attachment #738707 - Flags: review?(bugspam.Callek)
Attachment #739537 - Flags: review?(rail)
Comment on attachment 739537 [details] [diff] [review]
bug825056-self-host.patch

Review of attachment 739537 [details] [diff] [review]:
-----------------------------------------------------------------

Overall it looks great. I really liked the fact that a lot of SSL-related files have been moved under version control. Some nits below.

::: manifests/extlookup/relabs-config.csv
@@ +7,5 @@
> +global_authorized_keys,dustin
> +distinguished_puppetmaster,relabs03.build.mtv1.mozilla.com
> +puppet_again_repo,http://hg.mozilla.org/users/dmitchell_mozilla.com/bug825056/
> +xxpuppetmaster_upstream_rsync_source,rsync://puppetagain.pub.build.mozilla.org/data/
> +puppetmaster_upstream_rsync_source,relabs08:/data/

I assume it'll be fixed in the final version.

@@ +8,5 @@
> +distinguished_puppetmaster,relabs03.build.mtv1.mozilla.com
> +puppet_again_repo,http://hg.mozilla.org/users/dmitchell_mozilla.com/bug825056/
> +xxpuppetmaster_upstream_rsync_source,rsync://puppetagain.pub.build.mozilla.org/data/
> +puppetmaster_upstream_rsync_source,relabs08:/data/
> +puppetmaster_upstream_rsync_args,--exclude=repos/apt

Trailing space.

::: modules/puppetmaster/manifests/puppetsync_user.pp
@@ +6,5 @@
> + include puppetmaster::settings
> + $homedir = "/var/lib/puppetsync-home"
> +
> + case $::operatingsystem {
> + CentOS, Ubuntu: {

I think you can drop Ubuntu, since we don't support puppetmasters on this platform.

::: modules/puppetmaster/manifests/rsync.pp
@@ +15,5 @@
> +
> + $cron_schedule = $frequency ? {
> + 'often' => "*/5 * * * *",
> + 'half-hour' => "$rand_halfhour,$rand_secondhalf * * * *",
> + 'hourly' => "$rand_minute * * * *",

Can you fail if an unexpected value is passed?
::: modules/puppetmaster/templates/rsync.cron.erb
@@ +4,5 @@
> +MAILTO="<%= scope.lookupvar('::config::puppet_notif_email') %>"
> +# note that this rsync runs locally as root, but remotely as puppetsync; the -e
> +# instructs SSH to use puppetsync's SSH key and known_hosts, and to use those
> +# hosts appropriately
> +<%= @cron_schedule %> root rsync -e 'ssh -l puppetsync -l puppetsync -i <%= @puppetsync_home %>/.ssh/id_rsa -oUserKnownHostsFile=<%= @puppetsync_home %>/.ssh/known_hosts -oBatchMode=yes -oStrictHostKeyChecking=yes -oCheckHostIP=no' <% if @delete %>--delete<% end %> -rlpt <%- if @direction == 'to-distinguished-master' %><%= @from_dir %> <%= @distinguished_master %>:<%= @to_dir %><% else %><%= @distinguished_master %>:<%= @from_dir %> <%= @to_dir %><% end %>

"ssh -l puppetsync -l puppetsync" - please remove the duplicate.

::: modules/puppetmaster/templates/ssl_git_config.erb
@@ +6,5 @@
> + repositoryformatversion = 0
> + filemode = true
> + bare = false
> + logallrefupdates = true
> + sharedRepository = 0644

Can you replace the tabs in this file with spaces?

::: modules/puppetmaster/templates/ssl_git_sync.sh.erb
@@ +11,5 @@
> +# try pushing, but ignore errors
> +git push -q "${distinguished_common}" master 2>/dev/null || true
> +
> +# then pull and push until the push works
> +for i in range {1..10}; do

Please remove "range".
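For the record, on the "range" nit: bash brace expansion generates the loop values by itself, so the stray word "range" would be iterated over as an ordinary loop value rather than causing an error. A minimal demonstration:

```shell
# {1..3} expands to "1 2 3" before the loop runs; there is no 'range' keyword in sh/bash
bash -c 'for i in {1..3}; do echo "try $i"; done'
# → try 1
#   try 2
#   try 3
```

With the stray word left in, the first iteration would run with i=range, which is exactly the kind of silent bug the review caught.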
Attachment #739537 - Flags: review?(rail) → review+
relabs-config.csv will likely change around often -- it's just the config for the test cluster. So don't worry too much about what you see there. I need to drastically simplify that puppetmaster::rsync stuff. I thought I'd be using it for lots of different syncs, so I tried to make it very general. It turns out that's not necessary. Some amusing other bugs in there you've caught - thanks :)
Depends on: 865215
Depends on: 865223
Blocks: 865799
Depends on: 872545
Depends on: 872549
talos-linux*-ix-00{5,6} are yours.
Small issue with the Ubuntu slaves: installing the puppetlabs packages tries to restart the puppet service, which is, of course, already running. In this situation, our initscript helpfully retries forever, hanging the puppet run. Killing the /etc/puppet/init start process allows the puppet run to proceed (and fail, but the next run is fine). I'm still thinking about how to work around this. Hopefully it will not involve a custom package. I think that automatically starting/stopping services is a serious bug, but I've already had this conversation with PL about the RPMs, and they do not see it my way.
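One possible workaround (an assumption on my part, not necessarily what landed): Debian and Ubuntu maintainer scripts consult /usr/sbin/policy-rc.d before invoking initscripts, and an exit code of 101 means "action forbidden by policy", so dropping such a file in place for the duration of the upgrade would stop the package from touching the running service. A sketch, demonstrated against a temp directory rather than the real path:

```shell
# Sketch: a policy-rc.d that forbids maintainer scripts from starting/stopping
# services. Written to a temp dir here for demonstration; on a real slave it
# would go to /usr/sbin/policy-rc.d before the package upgrade and be removed after.
tmp=$(mktemp -d)
cat > "$tmp/policy-rc.d" <<'EOF'
#!/bin/sh
# 101 = "action forbidden by policy" per Debian's invoke-rc.d interface
exit 101
EOF
chmod +x "$tmp/policy-rc.d"
"$tmp/policy-rc.d" puppet restart || echo "rc=$?"
# → rc=101
```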
I also see

dpkg: error processing /var/cache/apt/archives/puppet_3.2.1-1puppetlabs1_all.deb (--unpack):
 trying to overwrite '/usr/share/man/man8/puppet-help.8.gz', which is also in package puppet-common 2.7.17-1mozilla1

during the upgrade process, but the install appears to succeed all the same, so I think that can be safely ignored.
Depends on: 875439
There are now six 3.2.0 puppetmasters in AWS; these are part of the moco cluster, rather than standalone - bug 872545. There are nine buildmasters in scl3 running puppet-3.2.0 - bug 867593. There are four Ubuntu iX systems running puppet-3.2.0 and ready for testing by releng - bug 872549. Manifests are written for Lion builders (upgrading all the way from 0.24.8!) and ready for releng testing in June - bug 760093.

Moved hosts:
- mobile-imaging-stage1.p127.scl1
- mobile-imaging-???.p?.scl1
- foopy*
- servo buildmaster
- buildbot-master8{1..9} - bug 867593
- talos-linux64-ix-00{5,6} - bug 872549
Depends on: 882141
Depends on: 882739
I just landed some changes to the kickstart process for both CentOS and Ubuntu - there's now a "Puppet 3.2.0" menu tree in the PXE interface.
Depends on: 884502
Depends on: 884506
Remaining:
- lion builders - bug 760093
- mountain lion talos - bug 882739
- linux builders - bug 884506
- buildmasters - bug 884502
OK, all of the deps are done. Rail's even shut down puppetmaster-02 in AWS. Tasks now:
1. Watch the logs on the 2.7.x masters for a bit, to verify we haven't forgotten a silo.
2. Shut the 2.7.x masters down.
3. Verify that all hosts on the 3.2.x masters are using certs issued by those masters.
4. Revoke the 2.7.x masters' SSL certs.
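Step 3 can be spot-checked per host by reading the issuer off the agent's certificate (the /var/lib/puppet/ssl path is the puppet 3 default; whether that is the path on every silo is an assumption). The same openssl invocation, run here against a throwaway self-signed cert just to show the output shape:

```shell
# On a client, the issuer of the agent cert shows which CA signed it:
#   openssl x509 -in /var/lib/puppet/ssl/certs/$(hostname -f).pem -noout -issuer
# Demonstrated with a throwaway self-signed cert standing in for the agent cert:
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -subj "/CN=releng-puppet1.srv.releng.scl3.mozilla.com" \
    -keyout "$tmp/key.pem" -out "$tmp/cert.pem" 2>/dev/null
openssl x509 -in "$tmp/cert.pem" -noout -issuer
# prints the issuer DN (exact formatting varies slightly by openssl version)
```

A cert issued by one of the 2.7.x masters would show that master's CA in the issuer DN, flagging the host for re-issuance.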
All of these - which were added early in the process - were still using certs on the old masters. I've re-issued certs for all of them. This list was generated from the diff between logged connections to the new masters and the list of certs issued by the new masters.

buildbot-master81.srv.releng.scl3.mozilla.com
buildbot-master82.srv.releng.scl3.mozilla.com
buildbot-master83.srv.releng.scl3.mozilla.com
buildbot-master84.srv.releng.scl3.mozilla.com
buildbot-master85.srv.releng.scl3.mozilla.com
buildbot-master86.srv.releng.scl3.mozilla.com
buildbot-master87.srv.releng.scl3.mozilla.com
buildbot-master88.srv.releng.scl3.mozilla.com
buildbot-master89.srv.releng.scl3.mozilla.com
foopy100.p9.releng.scl1.mozilla.com
foopy101.p9.releng.scl1.mozilla.com
foopy102.p10.releng.scl1.mozilla.com
foopy103.p10.releng.scl1.mozilla.com
foopy104.p10.releng.scl1.mozilla.com
foopy105.p10.releng.scl1.mozilla.com
foopy106.p10.releng.scl1.mozilla.com
foopy108.p10.releng.scl1.mozilla.com
foopy110.build.mtv1.mozilla.com
foopy111.build.mtv1.mozilla.com
foopy112.build.mtv1.mozilla.com
foopy113.build.mtv1.mozilla.com
foopy114.build.mtv1.mozilla.com
foopy115.build.mtv1.mozilla.com
foopy116.build.mtv1.mozilla.com
foopy117.build.mtv1.mozilla.com
foopy118.build.mtv1.mozilla.com
foopy119.build.mtv1.mozilla.com
foopy120.build.mtv1.mozilla.com
foopy121.build.mtv1.mozilla.com
foopy122.build.mtv1.mozilla.com
foopy123.build.mtv1.mozilla.com
foopy124.build.mtv1.mozilla.com
foopy125.build.mtv1.mozilla.com
foopy126.build.mtv1.mozilla.com
foopy128.build.mtv1.mozilla.com
foopy25.build.scl1.mozilla.com
foopy26.build.mtv1.mozilla.com
foopy27.build.mtv1.mozilla.com
foopy28.build.mtv1.mozilla.com
foopy29.build.mtv1.mozilla.com
foopy30.build.mtv1.mozilla.com
foopy31.build.mtv1.mozilla.com
foopy32.build.mtv1.mozilla.com
foopy33.build.scl1.mozilla.com
foopy34.build.scl1.mozilla.com
foopy35.build.scl1.mozilla.com
foopy36.build.scl1.mozilla.com
foopy37.build.scl1.mozilla.com
foopy39.p1.releng.scl1.mozilla.com
foopy40.p1.releng.scl1.mozilla.com
foopy41.p1.releng.scl1.mozilla.com
foopy42.p1.releng.scl1.mozilla.com
foopy43.p1.releng.scl1.mozilla.com
foopy44.p1.releng.scl1.mozilla.com
foopy45.p1.releng.scl1.mozilla.com
foopy46.p2.releng.scl1.mozilla.com
foopy47.p2.releng.scl1.mozilla.com
foopy48.p2.releng.scl1.mozilla.com
foopy49.p2.releng.scl1.mozilla.com
foopy50.p2.releng.scl1.mozilla.com
foopy51.p2.releng.scl1.mozilla.com
foopy52.p2.releng.scl1.mozilla.com
foopy53.p3.releng.scl1.mozilla.com
foopy54.p3.releng.scl1.mozilla.com
foopy55.p3.releng.scl1.mozilla.com
foopy56.p3.releng.scl1.mozilla.com
foopy57.p3.releng.scl1.mozilla.com
foopy58.p3.releng.scl1.mozilla.com
foopy59.p3.releng.scl1.mozilla.com
foopy60.p4.releng.scl1.mozilla.com
foopy61.p4.releng.scl1.mozilla.com
foopy62.p4.releng.scl1.mozilla.com
foopy63.p4.releng.scl1.mozilla.com
foopy64.p4.releng.scl1.mozilla.com
foopy65.p4.releng.scl1.mozilla.com
foopy66.p4.releng.scl1.mozilla.com
foopy67.p5.releng.scl1.mozilla.com
foopy68.p5.releng.scl1.mozilla.com
foopy69.p5.releng.scl1.mozilla.com
foopy70.p5.releng.scl1.mozilla.com
foopy71.p5.releng.scl1.mozilla.com
foopy72.p5.releng.scl1.mozilla.com
foopy73.p5.releng.scl1.mozilla.com
foopy74.p6.releng.scl1.mozilla.com
foopy75.p6.releng.scl1.mozilla.com
foopy76.p6.releng.scl1.mozilla.com
foopy77.p6.releng.scl1.mozilla.com
foopy78.p6.releng.scl1.mozilla.com
foopy79.p6.releng.scl1.mozilla.com
foopy80.p6.releng.scl1.mozilla.com
foopy81.p7.releng.scl1.mozilla.com
foopy82.p7.releng.scl1.mozilla.com
foopy83.p7.releng.scl1.mozilla.com
foopy84.p7.releng.scl1.mozilla.com
foopy85.p7.releng.scl1.mozilla.com
foopy86.p7.releng.scl1.mozilla.com
foopy87.p7.releng.scl1.mozilla.com
foopy88.p8.releng.scl1.mozilla.com
foopy89.p8.releng.scl1.mozilla.com
foopy90.p8.releng.scl1.mozilla.com
foopy91.p8.releng.scl1.mozilla.com
foopy92.p8.releng.scl1.mozilla.com
foopy93.p8.releng.scl1.mozilla.com
foopy94.p8.releng.scl1.mozilla.com
foopy95.p9.releng.scl1.mozilla.com
foopy96.p9.releng.scl1.mozilla.com
foopy97.p9.releng.scl1.mozilla.com
foopy98.p9.releng.scl1.mozilla.com
foopy99.p9.releng.scl1.mozilla.com
mobile-imaging-001.p1.releng.scl1.mozilla.com
mobile-imaging-002.p2.releng.scl1.mozilla.com
mobile-imaging-003.p3.releng.scl1.mozilla.com
mobile-imaging-004.p4.releng.scl1.mozilla.com
mobile-imaging-005.p5.releng.scl1.mozilla.com
mobile-imaging-006.p6.releng.scl1.mozilla.com
mobile-imaging-007.p7.releng.scl1.mozilla.com
mobile-imaging-008.p8.releng.scl1.mozilla.com
mobile-imaging-009.p9.releng.scl1.mozilla.com
mobile-imaging-010.p10.releng.scl1.mozilla.com
releng-puppet2.srv.releng.scl3.mozilla.com
talos-linux64-ix-006.test.releng.scl3.mozilla.com
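The cross-check described above - hosts seen connecting to the new masters versus certs issued by the new masters - boils down to a set difference over two sorted lists, which comm computes directly. A sketch with stand-in hostnames:

```shell
tmp=$(mktemp -d)
# hosts observed connecting to the new masters (e.g. scraped from access logs)
printf 'buildbot-master81\nfoopy100\nfoopy101\n' > "$tmp/connected"
# hosts holding certs issued by the new masters (e.g. from the CA's issued-cert list)
printf 'buildbot-master81\nfoopy100\n' > "$tmp/issued"
# lines only in 'connected': hosts still running on old-master certs
comm -23 "$tmp/connected" "$tmp/issued"
# → foopy101
```

Note comm requires both inputs to be sorted; real log-scraped lists would need a `sort -u` first.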
I turned off the scl1 and mtv1 masters, but the scl3 master is still live to host the public mirror. I've filed a flow request to switch that to the new scl3 master, and I'll also have to change the distinguished master to scl3.
I disabled access to the puppetmaster via HTTP on the scl3 host, so it's just serving /data via HTTP and rsync.
I'm running a "practice" switch of the relabs distinguished master. It seems to be going well so far.
Attached patch bug825056.patchSplinter Review
We need to do this because the DM is the one that serves rsync and public HTTP, and that needs to be proxied by a load balancer in scl3. That, and scl1 will be closing soon so it's not a good spot for a DM.
Attachment #766730 - Flags: review?(rail)
Attachment #766730 - Flags: review?(rail) → review+
Attachment #766730 - Flags: checked-in+
Attached patch bug825056.patchSplinter Review
The node-scope variable didn't work very well, and yuck anyway.
Attachment #767804 - Flags: review?(bugspam.Callek)
OK, releng-puppet2.srv.scl3 is now running the public mirror. Servers have been swapped in zeus. releng-puppet1.srv.releng.scl3 is no longer running apache or rsync. I'm out for the rest of the week, so I'll finish the teardown next week.
Comment on attachment 767804 [details] [diff] [review] bug825056.patch r+ from Callek in irc
Attachment #767804 - Flags: review?(bugspam.Callek)
Attachment #767804 - Flags: review+
Attachment #767804 - Flags: checked-in+
OK, among many other things, I cleaned up the hacked DS deployments for mountain lion talos and snow leopard builders (the only two OS X flavors on puppetagain).
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations