Closed Bug 733110 Opened 12 years ago Closed 12 years ago

PuppetAgain puppet::atboot systems use a list of masters

Categories

(Release Engineering :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

Attachments

(2 files)

When slaves start up, they use puppet::atboot (see https://wiki.mozilla.org/ReleaseEngineering/Puppet/Modules/puppet).  This is currently configured to act like the old puppet did: block startup until puppet runs successfully.

However, this makes puppet a hard requirement for operations to continue.  Hard requirements are best avoided; only where that's impossible should they be worked around with (much more expensive and difficult-to-get-right) HA systems.

My rough proposal would be to have slaves try several times to run puppet, and failing that, continue startup, while complaining via email or briarpatch.
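
As a rough sketch (not the actual atboot script -- the retry count, master name, and notification address are placeholders), the logic would be something like:

    #! /bin/bash
    # hypothetical retry-then-continue logic for puppet::atboot
    MASTER="puppet"                 # placeholder master hostname
    MAX_TRIES=5
    NOTIFY="release@mozilla.com"    # placeholder notification address

    for try in $(seq 1 $MAX_TRIES); do
        # --onetime/--no-daemonize: run puppet once in the foreground, then return
        puppet agent --onetime --no-daemonize --server "$MASTER" && exit 0
        sleep 60
    done

    # puppet never succeeded: complain loudly, but let startup continue anyway
    echo "puppet failed after $MAX_TRIES attempts on $(hostname)" \
        | mail -s "puppet failure: $(hostname)" "$NOTIFY"
    exit 0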

The expected failure modes for puppet are (a) incorrect manifests and (b) infrastructure failures.  It would be best to differentiate the two, as (a) will happen now and then as people commit bogus manifests and correct them, while (b) is cause for a bug and contacting IT (we will hopefully get concurrent alerts from nagios).

During "normal operations", puppet only serves to verify that a slave looks the way it should - in the vast majority of cases, the configuration will already be correct.  The times when this is not the case are when manifest changes land that have developer visibility (e.g., a perf-changing upgrade).  When that happens, there is generally a scheduled tree closure, or the change is being rolled out "early" before a buildbot or mozconfig change is made to use it -- in either case, the change is monitored by whoever landed it.  If the puppet masters are functioning correctly, no problems occur under my proposal.

We would only see problems when:
1. a developer-visible change lands
2. a puppet master fails
3. the releng person monitoring the change misses the alerts

I think that the confluence of 1 and 2 is unlikely, and that 3 is human error and can be avoided.

IMHO, this slight risk is not enough to justify adding a hard requirement to this new deployment.
(In reply to Dustin J. Mitchell [:dustin] from comment #0)
> My rough proposal would be to have slaves try several times to run puppet,
> and failing that, continue startup, while complaining via email or
> briarpatch.

As long as we use puppet as our guarantee that the machines are in the correct state needed for builds and tests, we must run puppet before the machine goes into buildbot.  As soon as we stop using puppet to establish the guarantee, we stop benefiting from puppet.  We could be booting computers into completely invalid states due to changes in the manifest or a dependency on commands run on the slave as part of the puppet job.

If we were to cache the catalogue on each client for up to X hours and attempt to apply that catalogue if the puppet master was unreachable, we might be able to survive puppet downtime safely.  I don't think that is going to be particularly easy to implement or keep working with each new version of puppet, further burdening an upgrade of our puppet system.  I think the priority should be placed on setting up our puppet infrastructure to be redundant and resilient while still maintaining the guarantee that a successful puppet run means a slave is in a good state to start buildbot.
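
(For what it's worth, stock puppet does have a setting in this direction -- usecacheonfailure, which re-applies the last good catalog when a fresh one can't be fetched or compiled -- though it doesn't answer the larger objections above.  A minimal puppet.conf sketch, assuming we enabled it on the clients:)

    # puppet.conf on the slave -- illustrative only
    [agent]
        # fall back to the last successfully retrieved catalog if the
        # master can't supply a new one
        usecacheonfailure = true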

The list of failure modes presented is not exhaustive.  I can think of a case we recently dealt with for 4.5 months with the dongles, where puppet runs failed because of a legitimate, intermittent machine configuration issue.  Under the proposal above, that would have been a very major problem: we would've been starting buildbot on slaves with the incorrect resolution, leading to useless performance data and failing tests.

The naive suggestion for avoiding this would be "move that check to a script we run before starting buildbot".  The problem there is that we'd need to figure out which checks belong in puppet, which in a script, and which in both.  We'd end up having to maintain a platform- and slave-type-specific set of scripts that roughly approximates what puppet does.  I'd rather not reinvent the wheel.

Instead of increasing complexity, random behaviour and unneeded human intervention, why not build puppet infrastructure in a highly available, scalable, fault tolerant way?  Has using a single CA and having multiple puppet masters that can each sync any slave been investigated?  Has Reductive Labs been contacted about helping to set up highly available puppet?  Has there been any investigation of caching catalogues?

Consistency and correctness are critical in a CI system.  Instead of letting our machines fail-dangerous, let's make sure we've done everything we can to have them never fail, and to have them fail-safe when they do.  There is a serious risk in allowing a failing puppet run to pass into buildbot, and we have a concrete example where doing so had major implications and caused a lot of work.

It is also worth mentioning that with the current behaviour, we can survive momentary puppet master outages.  The failure mode is that machines start to drop out of the pool as they finish their jobs, reboot and fail to puppet.
(In reply to John Ford [:jhford] from comment #1)
> It is also worth mentioning that with the current behaviour, we can survive
> momentary puppet master outages.  The failure mode is that machines start to
> drop out of the pool as they finish their jobs, reboot and fail to puppet.

This is a key point, and, coupled with the isolation from hardware errors that we get with VMs, suggests that we already have some flexibility.  Most typical puppet HA solutions don't gain much resiliency over this configuration.

Your point also suggests a solution, similar to how we'll handle repo mirrors:

The puppet-running scripts currently run puppet in a loop until it succeeds, falling back to a reboot after a few tries.  However, they do so against the same puppet master every time.  The change would be to add a list of all releng puppet masters, and have these scripts cycle through the known puppet masters to find one that's working.  We'll need to figure out how to handle certificates signed on one master and presented to another, but that can be done.
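
A rough sketch of that loop, assuming a hard-coded master list (the hostnames and retry counts here are illustrative, not the real configuration):

    #! /bin/bash
    # illustrative only: cycle through the known masters until one works
    MASTERS="puppet1.example.com puppet2.example.com puppet3.example.com"

    for attempt in 1 2 3; do
        for master in $MASTERS; do
            if puppet agent --onetime --no-daemonize --server "$master"; then
                exit 0          # found a working master; continue startup
            fi
        done
        sleep 60                # none worked; wait and walk the list again
    done

    # still nothing -- fall back to the current behaviour (reboot)
    reboot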

Incidentally, I do want to request an adjustment to how you're conceiving of puppet: puppet is designed to describe a desired state and make the machine look that way.

In cases where puppet cannot alter the state (e.g., dongles), puppet is the wrong tool -- and I think we should build a simple mechanism to add tests like this (with the mechanism maintained by puppet, so we still have a single source of configuration).  The trick for the dongles was an effective hack, but conflated puppet failures with hardware failures, which is not something we'll want when we have more monitoring and reporting from puppet and briarpatch.

Looked at this way, the machine's state *is* a cached version of the catalog.  Deviations from this (e.g., using exec{} as a replacement for @reboot crontab or rc.local or initscripts) are misuses of Puppet.  Again, there are a few of those in the manifests, and they have served us well, but there are better ways.

Do let me know what you think of the proposed change in my third paragraph.  The rest is just food for thought, and I'd be happy to continue that discussion in email.  Note that this change does not need to be made before B2G is up and running (the puppet::atboot class will work fine for now).
(In reply to Dustin J. Mitchell [:dustin] from comment #2)
> The puppet-running scripts currently run puppet in a loop until it succeeds,
> falling back to a reboot after a few tries.  However, they do so against the
> same puppet master every time.  The change would be to add a list of all
> releng puppet masters, and have these scripts cycle through the known puppet
> masters to find one that's working.  We'll need to figure out how to handle
> certificates signed on one master and presented to another, but that can be
> done.

I'm re-purposing this bug to the end described above.
Summary: soft failures in slave startup process for PuppetAgain → PuppetAgain puppet::atboot systems use a list of masters
http://projects.puppetlabs.com/projects/puppet/wiki/Puppet_Scalability#Centralised-Puppet-Infrastructure
says that a CA hierarchy will work with puppet -- except in 0.25.0 and higher.  The bugs it references are still open, despite a note from a year ago that they'd be worked on "in a few weeks".  Yay, puppetlabs.

I suspect that the doesn't-work is based on either failures in SSL verification by the puppetmaster (which we're not using -- Apache does the SSL termination and verification) or failures in the puppet CA infrastructure (which we don't have to use, since we're doing out-of-band signing).  So I got a copy of O'Reilly's "OpenSSL" on my tablet, and I'm goin' in.

Kim, do you want to come along for the ride?

Note that, even if we use a CA hierarchy, the top cert in that hierarchy will still be self-signed -- there's no need to get any external CAs involved here.
Actually, it looks like
  http://projects.puppetlabs.com/projects/puppet/wiki/Multiple_Certificate_Authorities
is exactly what I was hoping would work.  I'll give it a try in relabs.
Sure, I'd be interested in learning more about this setup.
So I'm using this script to generate a set of chained certs:

root
|
+ relabs-puptest1 CA
| |
| + relabs-puptest1 SSL cert
| |
| + relabs08 SSL cert
|
+ relabs-puptest2 CA
  |
  + relabs-puptest2 SSL cert

Now to verify that they actually work, using openssl s_client and s_server.


(and yes, these passwords are stupid - it's a test)


#! /bin/bash

set -x

cd /root
rm -rf puptest-certs || exit 1
mkdir puptest-certs || exit 1
cd puptest-certs

####
# make a self-signed top-level cert

cat <<EOF > openssl.conf
[req]
prompt = no
distinguished_name = root-ca_dn
x509_extensions = root-ca_extns
 
[root-ca_dn]
commonName = PuppetAgain Root CA
emailAddress = release@mozilla.com
organizationalUnitName = Release Engineering
organizationName = Mozilla, Inc.
 
[root-ca_extns]
authorityKeyIdentifier=keyid,issuer:always
basicConstraints = critical,CA:true
keyUsage = keyCertSign, cRLSign
EOF

openssl req -passout pass:rootcapass -x509 -newkey rsa -config openssl.conf -new -keyout root-ca.key -out root-ca.crt -outform PEM || exit 1
openssl x509 -text -in root-ca.crt || exit 1

####
# set up for signing certs with it

mkdir root-ca-certs
touch root-ca-database
echo '01' > root-ca-serial

cat <<EOF > root-ca-openssl.conf
[ca]
default_ca = root-ca
 
[root-ca]
certificate = /root/puptest-certs/root-ca.crt
private_key = /root/puptest-certs/root-ca.key
database = /root/puptest-certs/root-ca-database
new_certs_dir = /root/puptest-certs/root-ca-certs
serial = /root/puptest-certs/root-ca-serial
 
default_crl_days = 7
default_days = 1825
default_md = sha1
 
policy = general_policy
x509_extensions = general_exts
 
[general_policy]
commonName = supplied
emailAddress = supplied
organizationName = supplied
organizationalUnitName = supplied
 
[general_exts]
authorityKeyIdentifier=keyid,issuer:always
basicConstraints = critical,CA:true
keyUsage = keyCertSign, cRLSign
EOF

####
# create CA certs for each puppet master
PUPPETMASTERS="relabs-puptest1.build.mtv1.mozilla.com relabs-puptest2.build.mtv1.mozilla.com"
for pupserv in $PUPPETMASTERS; do
	# make a CA cert
	cat <<EOF > openssl.conf
[req]
prompt = no
distinguished_name = puppetmaster_ca_dn
x509_extensions = puppetmaster_ca_extns
 
[puppetmaster_ca_dn]
commonName = CA on $pupserv
emailAddress = release@mozilla.com
organizationalUnitName = Release Engineering
organizationName = Mozilla, Inc.
 
[puppetmaster_ca_extns]
authorityKeyIdentifier=keyid,issuer:always
basicConstraints = critical,CA:true
keyUsage = keyCertSign, cRLSign
EOF
	openssl genrsa -des3 -out ${pupserv}-ca.key -passout pass:pupcapass  2048 || exit 1
	openssl req -config openssl.conf -new -key ${pupserv}-ca.key -out ${pupserv}-ca.csr -passin pass:pupcapass || exit 1
	openssl req -text -in ${pupserv}-ca.csr || exit 1

	# sign it with the root ca
	openssl ca -config root-ca-openssl.conf -in ${pupserv}-ca.csr -notext -out ${pupserv}-ca.crt -batch -passin pass:rootcapass || exit 1
	openssl x509 -text -in ${pupserv}-ca.crt || exit 1

	# set up for signing certs with it
	mkdir ${pupserv}-ca-certs
	touch ${pupserv}-ca-database
	echo '01' > ${pupserv}-ca-serial

	cat <<EOF > ${pupserv}-ca-openssl.conf
[ca]
default_ca = server_ca
 
[server_ca]
certificate = /root/puptest-certs/${pupserv}-ca.crt
private_key = /root/puptest-certs/${pupserv}-ca.key
database = /root/puptest-certs/${pupserv}-ca-database
new_certs_dir = /root/puptest-certs/${pupserv}-ca-certs
serial = /root/puptest-certs/${pupserv}-ca-serial
 
default_crl_days = 7
default_days = 1825
default_md = sha1
 
policy = general_policy
x509_extensions = general_exts
 
[general_policy]
commonName = supplied
emailAddress = supplied
organizationName = supplied
organizationalUnitName = supplied
 
[general_exts]
authorityKeyIdentifier=keyid,issuer:always
basicConstraints = critical,CA:false
keyUsage = keyEncipherment, digitalSignature
EOF

	# make a server cert
	cat <<EOF > openssl.conf
[req]
prompt = no
distinguished_name = puppetmaster_ca_dn
x509_extensions = puppetmaster_ca_extns
 
[puppetmaster_ca_dn]
commonName = $pupserv
emailAddress = release@mozilla.com
organizationalUnitName = Release Engineering
organizationName = Mozilla, Inc.
 
[puppetmaster_ca_extns]
authorityKeyIdentifier=keyid,issuer:always
basicConstraints = critical,CA:false
keyUsage = keyEncipherment, digitalSignature
EOF
	openssl genrsa -des3 -out ${pupserv}.key -passout pass:pupservpass  2048 || exit 1
	openssl req -config openssl.conf -new -key ${pupserv}.key -out ${pupserv}.csr -passin pass:pupservpass || exit 1
	openssl req -text -in ${pupserv}.csr || exit 1

	# and sign it with the puppetmaster ca
	openssl ca -config ${pupserv}-ca-openssl.conf -in ${pupserv}.csr -notext -out ${pupserv}.crt -batch -passin pass:pupcapass || exit 1
	openssl x509 -text -in ${pupserv}.crt || exit 1
done

# generate a certificate bundle
(
	cat root-ca.crt
	for pupserv in $PUPPETMASTERS; do
		cat ${pupserv}-ca.crt
	done
) > ca-bundle.crt

# and sign a client key with one of the puppet masters' CA key
client=relabs08.build.mtv1.mozilla.com
for pupserv in $PUPPETMASTERS; do
	# generate a client cert
	cat <<EOF > openssl.conf
[req]
prompt = no
distinguished_name = puppetmaster_ca_dn
x509_extensions = puppetmaster_ca_extns
 
[puppetmaster_ca_dn]
commonName = $client
emailAddress = release@mozilla.com
organizationalUnitName = Release Engineering
organizationName = Mozilla, Inc.
 
[puppetmaster_ca_extns]
authorityKeyIdentifier=keyid,issuer:always
basicConstraints = critical,CA:false
keyUsage = keyEncipherment, digitalSignature
EOF
	openssl genrsa -des3 -out ${client}.key -passout pass:clientpass  2048 || exit 1
	openssl req -config openssl.conf -new -key ${client}.key -out ${client}.csr -passin pass:clientpass || exit 1
	openssl req -text -in ${client}.csr || exit 1

	# and sign it with the puppetmaster ca
	openssl ca -config ${pupserv}-ca-openssl.conf -in ${client}.csr -notext -out ${client}.crt -batch -passin pass:pupcapass || exit 1
	openssl x509 -text -in ${client}.crt || exit 1

	# just do this once
	break
done
Well, openssl verify, s_client, and s_server don't deal well with certificates that are bundled with their intermediaries, so I need to test this with Apache.
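
For what it's worth, openssl verify can handle the intermediate if it's passed explicitly rather than bundled -- roughly, using the filenames from the script above:

    # verify the server cert, supplying the intermediate CA explicitly
    openssl verify -CAfile root-ca.crt \
        -untrusted relabs-puptest1.build.mtv1.mozilla.com-ca.crt \
        relabs-puptest1.build.mtv1.mozilla.com.crt

But that's not what Apache will do in production, so testing against Apache is still the real check.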
I blogged what I've learned so far here:
  http://code.v.igoro.us/archives/72-TIL-about-SSL-certificate-chains.html

My intent is to distribute the SSLCACertificatePath directory among the puppetmasters using csync2, so that all certs will be available on all puppetmasters, both for validating client certs, and for sending to clients to provide an intermediate to validate the server cert.

Next up:
 - verify that the puppet client allows a verify depth of 2
 - get CRLs working - hopefully using the same technique of csync2'ing the CRLs around
(In reply to Dustin J. Mitchell [:dustin] from comment #10)
>  - verify that the puppet client allows a verify depth of 2

Success!  Copying into /var/lib/ssl:

./certs/relabs08.build.mtv1.mozilla.com.pem
  the relabs08 cert
./certs/ca.pem
  root-ca.crt
./private_keys/relabs08.build.mtv1.mozilla.com.pem
  the relabs08 key

gets puppet agent --test past the SSL verification stage
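
(For reference, a test invocation along those lines -- pointing puppet at that ssldir and at one of the test masters -- would be roughly:)

    puppet agent --test --ssldir=/var/lib/ssl \
        --server=relabs-puptest1.build.mtv1.mozilla.com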
With the certs I generated earlier, I was able to get a successful SSL connection to :8140.

I put all of the ca certs (root + each puppetmaster) in a cert dir, and added the necessary symlinks.  Then I pointed Apache to this dir with SSLCACertificatePath, and set SSLVerifyDepth to 2.

puppet --test is still erroring out, though.

Once this is done, the next task will be revoking certificates when new certs are issued, and if possible revoking certificates from puppetmasters different from the one that issued the certs.
OK, I've now run puppet agent --test against both test puppet masters, and gotten the errors I expect due to a missing CSV file (which means all of the SSL and auth stuff is right).

I have full notes in my notebook, and will write them up somewhere, but for posterity, the surprising things of note:
 * need to remove the password from the puppet server's private key
 * need to set the puppet server's CA's private key's password to the contents of $ssldir/ca/private/ca.pass
 * passenger config needs to be

        SSLCertificateFile /var/lib/puppet/ssl-master/certs/puppet.pem
        SSLCertificateKeyFile /var/lib/puppet/ssl-master/private_keys/puppet.pem
        SSLCACertificatePath /etc/httpd/ssl/certs

        # If Apache complains about invalid signatures on the CRL, you can try disabling
        # CRL checking by commenting the next line, but this is not recommended.
        #SSLCARevocationFile     /etc/puppet/ssl/ca/ca_crl.pem
        SSLVerifyClient optional
        SSLVerifyDepth  2
        SSLOptions +StdEnvVars

specifically, verify depth 2, and using a cert path that contains all of the CA certs, hashed as OpenSSL likes (a sketch of the hashing step follows this list).
 * set certificate_revocation = false on the client
 * sign client certs with *just* a CN -- no email, OU, etc.

All of this adds up to not using CRLs in either direction.
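
The hashing step mentioned above is the usual OpenSSL cert-dir dance -- assuming the CA certs have already been copied into the SSLCACertificatePath directory, either c_rehash or a manual symlink does the job:

    cd /etc/httpd/ssl/certs
    c_rehash .        # creates <hash>.0 symlinks for every cert in the dir

    # or, by hand, for a single cert:
    ln -s root-ca.crt "$(openssl x509 -hash -noout -in root-ca.crt).0"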

I need to figure out how puppet's certificate_revocation setting works, and see if I can convince it to distribute a combined CRL for all puppet masters.  Checking CRLs on the clients helps us avoid MITM attacks if a master's key is disclosed.  Evidence points to puppet not supporting this, and if that's the case I think we should file a bug with puppetlabs and move on.

I also need to figure out how to do CRL checking in Apache -- SSLCARevocationFile is, as you see above, commented out.  This kind of revocation is much more important, as it's how we mark old client certs as no longer valid, e.g., when retiring or re-imaging a machine.
No, puppet agent only handles one CRL, which conflicts directly with having multiple CAs (since there's one CRL per CA).  So I think we need to punt on the client-side CRL checking.  I filed https://projects.puppetlabs.com/issues/14550 to fix that.

On to the last step, CRL checking in Apache.
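
A sketch of what that will probably look like on the Apache side -- mod_ssl accepts either one concatenated CRL file or a hashed directory, the latter fitting the csync2 approach (paths are illustrative):

    # a single file containing all the CAs' CRLs concatenated together...
    SSLCARevocationFile /etc/httpd/ssl/crls/ca-crls.pem

    # ...or a directory of CRLs, hashed like the cert dir (symlinks end in .r0)
    #SSLCARevocationPath /etc/httpd/ssl/crls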
Success.  Concept proven.  Time to implement.
Attached patch bug733110.patch
https://github.com/djmitche/releng-puppet/commit/bug733110.patch

This disables client-side CRL checking, which doesn't work as described in previous comments.
Attachment #625304 - Flags: review?(kmoir)
Attachment #625304 - Flags: review?(kmoir) → review+
checked-in
Depends on: 757997
I installed this new stuff on relabs-puppet.build.mtv1, and successfully sent a client there that had already been signed on relabs-puptest1.

I'm also kickstarting relabs07 to see if the new puppetize.sh in bug 757997 works correctly, but I don't expect problems there.

My plan for deploying this is to land the change, manually set up releng-puppet1.build.scl1, and then manually re-sign all of the B2G nodes, using the new puppetize.sh from bug 757997.  This would entail probably 30m or so of partial downtime for the B2G builders, where they will fail to connect to puppet until everything is complete.  Given the low load, I don't expect this to be a significant problem.

I'd like to do this Thursday.  Can I get sign-off from someone in releng for this?
(In reply to Dustin J. Mitchell [:dustin] from comment #18) 
> I'd like to do this Thursday.  Can I get sign-off from someone in releng for
> this?

Dustin: works for me as buildduty.
Excellent!  This is landed, and all of the B2G hosts are re-puppetized (by just running ./puppetize.sh in /root, and entering the deploypass when prompted).

There's a bit more to do:
 - puppet::atboot's initscript needs to try other puppetmasters if 'puppet' fails
 - I need to automatically reload apache via csync2 when the certdir changes
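
For the second item, a rough csync2.cfg fragment (hostnames, key path, and the reload command are placeholders that would need to match the real deployment):

    group puppetcerts {
        host releng-puppet1.build.scl1.mozilla.com releng-puppet2.build.scl1.mozilla.com;
        key /etc/csync2/csync2.key;
        include /etc/httpd/ssl/certs;
        action {
            pattern /etc/httpd/ssl/certs/*;
            exec "/sbin/service httpd reload";
            logfile "/var/log/csync2-actions.log";
        }
    }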
Waiter, there's a fly in my ointment:
  https://bugs.ruby-lang.org/issues/6493

so server certs need to list the FQDN in the subjectAltName, too -- which is made more annoying by OpenSSL's insistence that such a value live in the openssl.conf file.
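
Concretely, that means the server-cert extensions in the generated openssl.conf grow a subjectAltName line, roughly like this (matching the heredoc in the script above, where $pupserv is expanded by the shell):

    [puppetmaster_ca_extns]
    authorityKeyIdentifier=keyid,issuer:always
    basicConstraints = critical,CA:false
    keyUsage = keyEncipherment, digitalSignature
    subjectAltName = DNS:$pupserv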
Attached patch bug733110.patch
https://github.com/djmitche/releng-puppet/commit/bug733110.patch

The problem in the previous comment was fixed by re-generating the server certificates on each master using the proper openssl.conf.  The wiki and infra puppet are updated accordingly.
Attachment #627047 - Flags: review?(jwatkins)
Comment on attachment 627047 [details] [diff] [review]
bug733110.patch

(not sure who's best to review this, so whoever gets there first wins!)
Attachment #627047 - Flags: review?(kmoir)
Apache is automatically reloading on certdir changes now.
Attachment #627047 - Flags: review?(kmoir) → review+
Comment on attachment 627047 [details] [diff] [review]
bug733110.patch

committed
Attachment #627047 - Flags: review?(jwatkins)
That about takes care of it, then - we're resilient again!
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering