Closed Bug 884472 - opened 11 years ago, closed 10 years ago

[tracker] upgrade silos old releng puppet (2.6.x, 0.24.8)

Categories: Infrastructure & Operations :: RelOps: Puppet (task)
Type: task; Priority: Not set; Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: dustin; Assignee: Unassigned

Details

This vulnerability (CVE-2013-3567) allows arbitrary remote code execution on the puppet masters.  Every host in the releng network - including try hosts - has access to the masters.

Hosts running 3.2.0 will be easy to upgrade to 3.2.2, which is not vulnerable.
Hosts running 2.7.17 are already slated to be upgraded, and we'll accelerate that.
Hosts running older, unsupported versions are in a tough spot.
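
(Illustrative aside, not part of the plan above: a quick way to confirm which tier a given host falls into is to ask the agent for its version.  hosts.txt is a placeholder for whatever inventory list we use; really old 0.24.x agents only ship the "puppetd" binary, hence the fallback.)

  for h in $(cat hosts.txt); do
    # print "<host> <version>", falling back to puppetd for 0.24.x agents
    echo -n "$h "
    ssh "$h" 'puppet --version 2>/dev/null || puppetd --version'
  done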

Ideas:
 * Massively accelerate migration from old-puppet to new-puppet, perhaps leaving hosts unmanaged in the interim, and shut off the old masters immediately.
 * Try to backport PuppetLabs' fix and ship patched versions on the master.
 * Pretend it never happened.

Mostly-complete list of hosts still on old puppet:
 * some buildmasters
 * all signing servers
 * redis01
 * buildapi01
 * talos-r3-fed-xxx
 * talos-r3-fed64-xxx
 * talos-r4-snow-xxx
 * talos-r4-lion-xxx
 * bld-lion-r5-xxx (moving already in bug 760093)
 * bld-centos5-32-vmw-xxx
 * bld-centos5-64-vmw-xxx
Richard looked at backporting.  Not feasible.  The fix involves monkey-patching a bunch of Ruby libraries, plus a complete rewrite of the serializer relative to 3.1.x (so probably several complete rewrites since 2.6.x and 0.24.8).
So we're down to 2 options:
 * Massively accelerate migration from old-puppet to new-puppet, perhaps leaving hosts unmanaged in the interim, and shut off the old masters immediately.
 * Pretend it never happened.

Two scoping questions:
 - Any rough time estimate for relops/releng on the accelerated migration? 
 - Any way to tell from puppet how often these machines are actually making corrections/updates to the hosts?  (Rough sketch below.)

Seems a reasonable approach may be to break the list of affected host types into priority order. I suspect a number of them (all?) could safely be disconnected for a while, and others migrated quickly.
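
(Rough, illustrative sketch only: assuming the agents log to syslog, which is the default for a daemonized puppetd, one crude way to gauge this is to count the lines where the agent reports changing something.  The host name below is a placeholder.)

  # count log lines where the old agent reported changing a resource on this slave
  ssh talos-r3-fed-001 'grep puppetd /var/log/messages | grep -c changed'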
Flags: needinfo?(dustin)
I don't have an overall scope, but I'll share what I can.

the servers could easily go unmanaged for a while, but unfortunately they're also the easy ones to convert:
 * some buildmasters - you're working on this AFAIK
 * all signing servers - I have a start on this, but probably a week or two
 * redis01 - probably straightforward
 * buildapi01 - same

the slaves would need to be hacked to go unmanaged, and there'd be some loss of functionality (mostly around cleanup at startup).  Are the fedora hosts even used?
 * talos-r3-fed-xxx
 * talos-r3-fed64-xxx - puppetagain doesn't support fedora, so these will be hard
 * talos-r4-snow-xxx
 * talos-r4-lion-xxx - these will probably mostly work based on work for talos-r5-mtnlion
 * bld-lion-r5-xxx - moving already in bug 760093
 * bld-centos5-32-vmw-xxx
 * bld-centos5-64-vmw-xxx - puppetagain doesn't support centos5
Flags: needinfo?(dustin)
some of the cleanup could be moved to @reboot cron jobs on the slaves
Yes, or in the neutered run-puppet.sh scripts.
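(Purely illustrative sketch of the @reboot idea; the script name and paths are hypothetical, not the actual releng cleanup scripts:)

  # crontab entry on a slave: do the startup cleanup directly, without a puppet run,
  # and let the normal boot sequence start buildbot as usual
  @reboot /usr/local/bin/slave-cleanup.sh >> /var/log/slave-cleanup.log 2>&1
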
redis and buildapi are pretty static in terms of puppet right now, we can probably remove them from being managed until the new manifests are ready.
This vulnerability and its impact are serious enough to warrant immediate action:

* upgrade all instances of puppet to 2.7.22 or 3.2.2.

For the older puppet instances (0.24.8 and 2.6.x), which are unsupported and for which no patch is available, we will need to either:

* shut down these puppetmasters and leave them shut down, or
* shut down these puppetmasters and begin the task of upgrading to 3.2.2

Patching this vulnerability is a high priority and efforts should start now. I believe CAB approvals may be needed, so those tasked with upgrading should file CAB requests now.
So my proposal is:

 * some buildmasters - hal in bug 867583 (for panda/tegra masters) + new bug (others)
 * all signing servers - disconnect, $someone finishes bug 869498 quickly
 * redis01 - disconnect per comment 6
 * buildapi01 - disconnect per comment 6
 * talos-r3-fed-xxx - disconnect permanently (and power off?)
 * talos-r3-fed64-xxx -    --"--
 * talos-r4-snow-xxx - disconnect and dustin file bug for upgrade
 * talos-r4-lion-xxx -    --"--
 * bld-lion-r5-xxx - in progress in bug 760093
 * bld-centos5-32-vmw-xxx - disconnect permanently
 * bld-centos5-64-vmw-xxx - disconnect permanently
What about bld-linux64-ec2-xxx slaves? They still use puppet 2.7.x.
and try-linux64-ec2-xxx...
FTR, upgrading 2.7.x may introduce SSL certificate issues (see http://projects.puppetlabs.com/issues/15561). We should explicitly check that if we want to go that way.
Rail: Everything that's not old-puppet is out of scope on this bug.  In particular, upgrading linux builders is covered in bug 884506.  They'll be upgraded to 3.2.2 - we won't use 2.7.22 for anything.
Updated proposal from comment 8, as discussed with Hal and Callek:

    - masters
       - panda/tegra masters - disconnect; hal is creating new hosts in bug 867583, and Callek will be connecting them to slavealloc in the next few weeks, at which point they can be reconnected
       - scheduler master - disconnect, bug XXX to get re-hosted on new master
       - preproduction-master, dev-master01 - disconnect permanently
    - all signing servers - disconnect, $someone finishes bug 869498 quickly
    - redis01 - disconnect per comment 6
    - buildapi01 - disconnect per comment 6
    - releng-mirror01 - disconnect, ??? (Callek to figure out what to do)
    - talos-r3-fed-xxx - disconnect 
    - talos-r3-fed64-xxx -    --"--
    - talos-r4-snow-xxx - disconnect and dustin file bug for upgrade
    - talos-r4-lion-xxx -    --"--
    - bld-lion-r5-xxx - in progress in bug 760093
    - bld-centos5-32-vmw-xxx - disconnect permanently
    - bld-centos5-64-vmw-xxx - disconnect permanently

"Disconnect" means two different things.  On servers (masters..releng-mirror01), it means disabling the puppet agent and/or crontask.  On slaves (the remainder), it means deploying an updated version of the system startup script that will continue to start buildbot even when puppet fails.  I'll open bugs for those tasks.

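(Minimal sketch of what "disconnect" could look like on each side; the real run-puppet.sh differs, and the paths/variables here are assumptions, not the actual releng files:)

  # server side: disable the agent and/or its crontask (on old puppet,
  # "puppetd --disable" just drops a lock file)
  puppetd --disable

  # slave side: a neutered startup script treats a failed puppet run as
  # non-fatal so buildbot still comes up
  if ! puppetd --test --server "$PUPPET_SERVER"; then
      echo "puppet run failed; continuing unmanaged" >&2
  fi
  /etc/init.d/buildbot start
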
Unfortunately, the four talos silos are going to be around for a while, and leaving them completely unmanaged will lead to pain in a few weeks.  We'll do two things to deal with that:
  1. Inform the CAB that we may not be able to make changes quickly there, and it may take a while to re-image hosts, so capacity may sag (Hal)
  2. clean out one of the old-puppet masters of any secrets, and re-enable it either permanently or as-needed, serving only the talos hosts.  The latter is OK, IMHO, because test hosts only give test results to users -- they do not produce any artifacts that are shipped to users.  Basically, they are all at the same trust level as try slaves (scm_level_1).

Longer-term, for these hosts, we'll look at implementing each host type with puppetagain (which will be puppet-3.2.2 or higher by then).  That has varying levels of difficulty, and in some cases is partially completed.
From an etherpad copy/paste about the plan.

We will (tonight) disconnect/stop puppet on "server" style old-puppet machines:

    - masters
       - panda/tegra masters:
           * buildbot-master10.build.mtv1.mozilla.com
           * buildbot-master19.build.mtv1.mozilla.com
           * buildbot-master20.build.mtv1.mozilla.com
           * buildbot-master22.build.mtv1.mozilla.com
           * buildbot-master29.build.scl1.mozilla.com
           * buildbot-master42.build.scl1.mozilla.com
           * buildbot-master43.build.scl1.mozilla.com
           * buildbot-master44.build.scl1.mozilla.com
           * buildbot-master45.build.scl1.mozilla.com
       - scheduler master - disconnect, bug XXX to get re-hosted on new master
           * buildbot-master36.srv.releng.scl3.mozilla.com
       - preproduction-master, dev-master01 - disconnect permanently
           * preproduction-master.srv.releng.scl3.mozilla.com
           * dev-master01.build.scl1.mozilla.com
    - all signing servers - disconnect, $someone finishes bug 869498 quickly
           * signing1.build.scl1.mozilla.com
           * signing2.build.scl1.mozilla.com
           * signing3.srv.releng.scl3.mozilla.com
           * mac-signing1.srv.releng.scl3.mozilla.com
           * mac-signing2.srv.releng.scl3.mozilla.com
           * mac-signing3.build.scl1.mozilla.com
           * mac-signing4.build.scl1.mozilla.com
    - redis01 - disconnect per comment 6
           * redis01.build.scl1.mozilla.com
    - buildapi01 - disconnect per comment 6
           * buildapi01.build.scl1.mozilla.com
    - releng-mirror01 - disconnect, ??? (Callek to figure out what to do)
           * releng-mirror01.srv.releng.scl3.mozilla.com
(In reply to Dustin J. Mitchell [:dustin] from comment #13)
>   2. clean out one of the old-puppet masters of any secrets, and re-enable
> it either permanently or as-needed, serving only the talos hosts.  The
> latter is OK, IMHO, because test hosts only give test results to users --
> they do not produce any artifacts that are shipped to users.  Basically,
> they are all at the same trust level as try slaves (scm_level_1).

FWIW, if we do this, IMO we need test-host-only releng passwords due to the risks involved; that's doable but annoying (we can achieve it on the 3.2.0 test machines with puppet aspects, and in old puppet by just unconditionally changing the puppet secrets).
I ran:
  /etc/init.d/puppetmaster stop; chkconfig puppetmaster off
on all four old-puppet masters:
  master-puppet1.build.scl1.mozilla.com
  scl3-production-puppet.srv.releng.scl3.mozilla.com
  scl-production-puppet.build.scl1.mozilla.com
  mv-production-puppet.build.mozilla.org
I also shut down httpd for good measure.

So the status is, old-puppet is completely disabled (yay), leaving a whole pile of machines unmanaged (boo).
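
(Not from the bug, just a sanity-check sketch one could run to confirm both services stayed down on those four masters:)

  for h in master-puppet1.build.scl1.mozilla.com \
           scl3-production-puppet.srv.releng.scl3.mozilla.com \
           scl-production-puppet.build.scl1.mozilla.com \
           mv-production-puppet.build.mozilla.org; do
    echo "== $h =="
    ssh "$h" 'service puppetmaster status; service httpd status; chkconfig --list puppetmaster'
  done
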
(In reply to Dustin J. Mitchell [:dustin] from comment #17)
> I also shut down httpd for good measure.
> 
> So the status is, old-puppet is completely disabled (yay), leaving a whole
> pile of machines unmanaged (boo).

Missed one: staging-puppet.build.mozilla.org (too tired to trust myself logging in as root anywhere atm though, so it can wait until tomorrow)
Thanks - got it.  All that remains here is to file some dep bugs for the unmanaged silos.
Depends on: 884833
Depends on: 884837
Depends on: 863266
Depends on: 884843
OK, current status + future plans:

    - masters
       - panda/tegra masters - unmanaged, bug 867583, bug XXX (Callek?)
       - scheduler master - unmanaged, bug 884833
       - preproduction-master, dev-master01 - unmanaged forever
    - all signing servers - unmanaged, bug 869498
    - redis01 - unmanaged, bug 884837
    - buildapi01 - unmanaged, bug 804334
    - releng-mirror01 - unmanaged, bug 884843
    - talos-r3-fed-xxx &
    - talos-r3-fed64-xxx &
    - talos-r4-snow-xxx &
    - talos-r4-lion-xxx - unmanaged, bug 884847
    - bld-lion-r5-xxx - unmanaged, upgrade progress in bug 760093
    - bld-centos5-32-vmw-xxx - unmanaged forever
    - bld-centos5-64-vmw-xxx - unmanaged forever

So from here on out this is a tracker bug.
Depends on: 804334
Summary: Mitigate CVE-2013-3567 on old releng puppet (2.6.x, 0.24.8) → [tracker] upgrade silos old releng puppet (2.6.x, 0.24.8)
Added bug 863275 to track the decommissioning of the old puppet servers.
Depends on: 863275
Per meeting yesterday, we'll be managing the *r4* talos hosts with puppetagain.  I'll file new dep bugs for that purpose.
Depends on: 891880
Depends on: 891881
Group: infra
Component: Server Operations: RelEng → RelOps: Puppet
Product: mozilla.org → Infrastructure & Operations
QA Contact: arich → dustin
Depends on: 894133
No longer depends on: 863266
remaining work:

    - masters
       - panda/tegra masters - bug 867583
    - OS X signing servers - bug 891561
    - redis01 - unmanaged, bug 884837
    - buildapi01 - unmanaged, bug 804334
    - talos-r4-snow-xxx - bug 891881

and unmanaged forever:
    - talos-r3-{fed,fed64}-xxx - unmanaged forever
    - bld-centos5-32-vmw-xxx - unmanaged forever
    - bld-centos5-64-vmw-xxx - unmanaged forever
Assignee: dustin → relops
remaining work:

    - panda/tegra masters - bug 867583
    - self-serve - bug 894133
    - redis01 - unmanaged, bug 884837
    - buildapi01 - unmanaged, bug 804334
Depends on: 867583
No longer depends on: 884837
Adding one class to the list:

    - dev-stage & preproduction-stage - bug 775091
Depends on: 775091
I'm going to close this out.  dev-stage seems to be stuck in a world of do-we-really-need-this -- bug 775091, bug 833024, bug 808025, and bug 772177.  I'll see if I can clean those up, but in the interim, that single host isn't a "silo" -- it's just an old, unmanaged, legacy thingie.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED