Closed Bug 623849 Opened 14 years ago Closed 11 years ago

simplify puppet cert management

Categories

(Release Engineering :: General, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: dustin, Unassigned)

References

Details

(Whiteboard: [puppet])

Attachments

(1 obsolete file)

It's a pain to deal with Puppet's bidirectional certificate checking.  We often have to regenerate certificates on slaves, wait for them to appear on the master, and then sign them.  Generally, the next run of puppet will then blow away the certificates on the client, and the process gets repeated.

We should investigate enabling autosign, which basically lets puppet keep its SSL connections while sacrificing most of the authentication aspects.

In talking to Ben in December, the reason *not* to do this was that a puppetmaster should not accept connections either from clients not in its node list (lest we accidentally transfer files across datacenters) or from foreign clients.  The former isn't such a big deal - we're already transferring files across datacenters via NFS (bug 597912).  The latter is fixed via bug 609782, where we would refuse to hand out configuration to unknown clients.

The thing I'd like to check is that autosign does the rough equivalent of SSH's host checks: if it gets a connection from a slave with a key that differs from its existing, signed key, will autosign sign it (bad), or reject it (good)?
This is the best doc I can find about autosign:
  http://docs.puppetlabs.com/references/stable/configuration.html

The autosign parameter lets you specify a whitelist file containing hostnames or glob expressions that will be allowed, or just 'true' to autosign every incoming request.

I experimented with 'true'.  I found that, much like a client will sign the master's key on first connect and complain if that changes, with this config the master will also sign a client's key on first connect, but allow the connect to complete.

With the addition of a whitelist file containing all known slaves, this would be functionally identical to what we have now with the autosigning crontab, except it doesn't require multiple connects from the client.

I'm not sure the whitelist is worthwhile, though.  If we do not have a 'default' node, then unknown clients do not get any interesting configuration - no passwords, no keys, no nothing.

There's a maintenance burden of keeping the whitelist up to date, although this could be done from a shell script based on the node list.  It does violate the DRY principle though!

So I recommend that we remove the crontask and replace it with a simple
  autosign = true

We should also remove the automatic deletion of ssl credentials by puppet.  I don't think that's buying us anything but additional frustration, and it will become entirely moot when we begin selecting the puppet master based on IP (bug 623840)
Revised: apparently you can provide a key with one hostname on it and get configuration for a node that does not share that name.  Was under the impression that the hostname had to match the CN in the certificate, but apparently not:

[root@talos-r3-fed-001 ~]# puppetd --test --fqdn=cleanme --server staging-puppet.build.mozilla.org --logdest console
info: Creating a new certificate request for cleanme
info: Creating a new SSL key at /var/lib/puppet/ssl/private_keys/cleanme.pem
warning: peer certificate won't be verified in this SSL session
notice: Got signed certificate
info: Caching catalog at /var/lib/puppet/localconfig.yaml
notice: Starting catalog run
notice: //Node[talos-r3-fed-001.build.mozilla.org]/talosslave/talos_fedora/Service[ntpdate]/ensure: ensure changed 'stopped' to 'running'
info: Sent transaction report in 0.07 seconds
notice: Finished catalog run in 6.15 seconds

Note the Node[...] has the correct hostname in it (turns out, from the 'hostname' fact).  Hrrmph.

So we can't rely on what I wrote about a missing 'default' node.  That means we should use an autosigning whitelist, which as I wrote above is functionally identical to the crontab we have been using, except that it isn't delayed by 60s.
Changes to make by hand:

( cd /etc/puppet && ln -s manifests/puppet.conf puppet.conf )
( cd /etc/puppet && ln -s manifests/autosign.conf autosign.conf )
crontab -r # remove the autosigning crontab
service puppetmaster restart

This has the advantage that you can rm -rf /var/lib/puppet/ssl and run puppetd, and it will get a configuration on the first try -- or tell you to run --clean on the master if the master has an old key.  In other words, no more manually signing, no more extra restarts, no more waiting for the crontab to run.

The one thing I'm not sure of here is whether autosign.conf should be script-generated like this, or whether it should just be in hg (along with the shell script to regenerate it).  It'd be easier to double-check that new nodes are in autosign.conf if autosign.conf shows up in diffs, but it's definitely redundant.

We could generate it via crontab, but how often do we move machines between masters?
Attachment #502290 - Flags: review?(bhearsum)
Comment on attachment 502290 [details] [diff] [review]
m623849-puppet-manifests-r1.patch

I don't agree that transferring files over NFS isn't a big deal -- that's a bug, and one that we need to address. However, I don't think that should stop us from moving forward here.

What happens when a slave moves between colos? Do we have to delete the keys by hand?

I don't think we should rely on people remembering to run update-autosign.sh. Any reason that can't go in a cronjob?
When slaves move between colos, just run
 rm -rf /var/lib/puppset/ssl/certs/*
which deletes both the cached cert for the master, and the slave's cert.  If the new master has an old key for the slave, that will need to be puppetca --cleaned out.

As for running update-autosign.sh, IMHO putting that in a crontab makes slave management more complex, not less.  If you move a slave and it doesn't work, you'll go "huh, I wonder what's wrong", and either poke around long enough for the crontab to run (at which point you'll go "huh, now it's working, what the heck?") or you'll run update-autosign.sh yourself (obviating the need for the crontab).

We've had similar problems with the current autosigning crontab, where when you try to fix a slave, it doesn't work initially, and when you head over to the master to try to fix it, the crontab runs, at which point the slave runs puppetd again, blows away the fresh keys you just created, and restarts.  I would *much* rather things either worked or didn't work.  Having them work "maybe sometime in the next five minutes" creates new timing-based mysteries that we then spend our time trying to solve.

We could add something to ./test-manifests.sh to make sure that autosign.conf is up to date - would that help?
(In reply to comment #5)
> When slaves move between colos, just run
>  rm -rf /var/lib/puppset/ssl/certs/*
> which deletes both the cached cert for the master, and the slave's cert.  If
> the new master has an old key for the slave, that will need to be puppetca
> --cleaned out.
> 
> As for running update-autosign.sh, IMHO putting that in a crontab makes slave
> management more complex, not less.  If you move a slave and it doesn't work,
> you'll go "huh, I wonder what's wrong", and either poke around long enough for
> the crontab to run (at which point you'll go "huh, now it's working, what the
> heck?") or you'll run update-autosign.sh yourself (obviating the need for the
> crontab).

If this is a documented part of our Puppet masters I don't see how it's a bad thing. It's simply removing one manual step from updating Puppet masters. I don't think "confuses people who aren't aware of it" is a big deal.

If you insist on it being manually run, I'd rather have flushed out host files checked in, and have the script regenerate all of them. What I'm trying to avoid here is more manual steps other than "hg pull -u" when updating manifests.

> 
> We could add something to ./test-manifests.sh to make sure that autosign.conf
> is up to date - would that help?

I don't think so...that's not a script typically run on the puppet masters.
Running it in a crontab doesn't remove a manual step - it just means that when things don't work, they will start working a little while later, which isn't good from a troubleshooting perspective.

I'm not sure what you mean by "flushed out host files"?
(In reply to comment #7)
> Running it in a crontab doesn't remove a manual step - it just means that when
> things don't work, they will start working a little while later, which isn't
> good from a troubleshooting perspective.

To be clear, I'm talking about a crontab that runs every minute, like the existing hostkey accepter does. It's a trivially short period of time IMHO.

> I'm not sure what you mean by "flushed out host files"?

I mean, autosign.conf files for each master, which all the hosts listed.
I would be OK with (but I think it's overkill) running a crontab that detects discrepancies and emails someone or otherwise signals an alert - nagios, border collie, release@, whatever.  I wouldn't even mind if that alert consisted of "<engineer> forgot to update autosign.conf.  I fixed it, with the following diff: ..".  But I don't think we should be building systems that silently fix configuration errors, particularly if they only do so when the second hand is pointing to the 12.  We need *more* predictability in our systems and tools, not less.

I'm happy with either manually updating autosign.conf on the masters as they change (using a shell script), or putting the autosign.conf's in the hg repository and remembering to update them in the same commit as site.pp.

In either case, if I make a mistake with the landing, then things break, I fix the mistake, and move on.  Harkening back to the principle of least surprise, IMHO that is the least surprising behavior.
(In reply to comment #9)
> I'm happy with either manually updating autosign.conf on the masters as they
> change (using a shell script), or putting the autosign.conf's in the hg
> repository and remembering to update them in the same commit as site.pp.

Let's do the latter, then. It's much easier to remember that this needs doing if it's they're checked in files.
Attachment #502290 - Attachment is obsolete: true
Attachment #502290 - Flags: review?(bhearsum)
Blocks: 628313
Assignee: dustin → nobody
Blocks: 516808
Bug 635067 specifies adding a Makefile to the puppet-manifests directory.  This makefile could also generate the autosign.conf.
Depends on: 635067
We now do cert chaining.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → INVALID
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: