Closed Bug 884472 - opened 11 years ago, closed 10 years ago

[tracker] upgrade silos old releng puppet (2.6.x, 0.24.8)

Categories: Infrastructure & Operations :: RelOps: Puppet (task)
Type: task; Priority: Not set; Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: dustin; Assignee: Unassigned

Details

This vulnerability (CVE-2013-3567) allows arbitrary remote code execution on the puppet masters.  Every host in the releng network - including try hosts - has access to the masters.

Hosts running 3.2.0 will be easy to upgrade to 3.2.2, which is not vulnerable.
Hosts running 2.7.17 are already slated to be upgraded, and we'll accelerate that.
Hosts running older, unsupported versions are in a tough spot.
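
(Illustrative aside, not part of the plan above: a quick way to confirm which tier a given host falls into is to ask the agent for its version.  hosts.txt is a placeholder for whatever inventory list we use; really old 0.24.x agents only ship the "puppetd" binary, hence the fallback.)

  for h in $(cat hosts.txt); do
    # print "<host> <version>", falling back to puppetd for 0.24.x agents
    echo -n "$h "
    ssh "$h" 'puppet --version 2>/dev/null || puppetd --version'
  done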

Ideas:
 * Massively accelerate migration from old-puppet to new-puppet, perhaps leaving hosts unmanaged in the interim, and shut off the old masters immediately.
 * Try to backport PuppetLabs' fix and ship patched versions on the master.
 * Pretend it never happened.

Mostly-complete list of hosts still on old puppet:
 * some buildmasters
 * all signing servers
 * redis01
 * buildapi01
 * talos-r3-fed-xxx
 * talos-r3-fed64-xxx
 * talos-r4-snow-xxx
 * talos-r4-lion-xxx
 * bld-lion-r5-xxx (moving already in bug 760093)
 * bld-centos5-32-vmw-xxx
 * bld-centos5-64-vmw-xxx
Richard looked at backporting.  Not feasible.  The fix involves monkey-patching a bunch of Ruby libraries, plus a complete rewrite of the serializer relative to 3.1.x (so probably several complete rewrites since 2.6.x and 0.24.8).
So we're down to 2 options:
 * Massively accelerate migration from old-puppet to new-puppet, perhaps leaving hosts unmanaged in the interim, and shut off the old masters immediately.
 * Pretend it never happened.

Two scoping questions:
 - Any rough time estimate for relops/releng on the accelerated migration? 
 - Any way to tell from puppet how often these machines are actually making corrections/updates to the hosts?  (Rough sketch below.)

Seems a reasonable approach may be to break the list of affected host types into priority order. I suspect a number of them (all?) could safely be disconnected for a while, and others migrated quickly.
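
(Rough, illustrative sketch only: assuming the agents log to syslog, which is the default for a daemonized puppetd, one crude way to gauge this is to count the lines where the agent reports changing something.  The host name below is a placeholder.)

  # count log lines where the old agent reported changing a resource on this slave
  ssh talos-r3-fed-001 'grep puppetd /var/log/messages | grep -c changed'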
Flags: needinfo?(dustin)
I don't have an overall scope, but I'll share what I can.

the servers could easily go unmanaged for a while, but unfortunately they're also the easy ones to convert:
 * some buildmasters - you're working on this AFAIK
 * all signing servers - I have a start on this, but probably a week or two
 * redis01 - probably straightforward
 * buildapi01 - same

the slaves would need to be hacked to go unmanaged, and there'd be some loss of functionality (mostly around cleanup at startup).  Are the fedora hosts even used?
 * talos-r3-fed-xxx
 * talos-r3-fed64-xxx - puppetagain doesn't support fedora, so these will be hard
 * talos-r4-snow-xxx
 * talos-r4-lion-xxx - these will probably mostly work based on work for talos-r5-mtnlion
 * bld-lion-r5-xxx - moving already in bug 760093
 * bld-centos5-32-vmw-xxx
 * bld-centos5-64-vmw-xxx - puppetagain doesn't support centos5
Flags: needinfo?(dustin)
some of the cleanup could be moved to @reboot cron jobs on the slaves
Yes, or in the neutered run-puppet.sh scripts.
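(Purely illustrative sketch of the @reboot idea; the script name and paths are hypothetical, not the actual releng cleanup scripts:)

  # crontab entry on a slave: do the startup cleanup directly, without a puppet run,
  # and let the normal boot sequence start buildbot as usual
  @reboot /usr/local/bin/slave-cleanup.sh >> /var/log/slave-cleanup.log 2>&1
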
redis and buildapi are pretty static in terms of puppet right now, we can probably remove them from being managed until the new manifests are ready.
This vulnerability and its impact are serious enough to warrant immediate action:

* upgrade all instances of puppet to 2.7.22 or 3.2.2.

For the older puppet instances (0.24.8 and 2.6.x), which are unsupported and for which no patch is available, we will need to either:

* shut down these puppetmasters and leave them shut down, or
* shut down these puppetmasters and begin the task of upgrading to 3.2.2

Patching this vulnerability is a high priority and efforts should start now. I believe CAB approvals may be needed, so those tasked with upgrading should file CAB requests now.
So my proposal is:

 * some buildmasters - hal in bug 867583 (for panda/tegra masters) + new bug (others)
 * all signing servers - disconnect, $someone finishes bug 869498 quickly
 * redis01 - disconnect per comment 6
 * buildapi01 - disconnect per comment 6
 * talos-r3-fed-xxx - disconnect permanently (and power off?)
 * talos-r3-fed64-xxx -    --"--
 * talos-r4-snow-xxx - disconnect and dustin file bug for upgrade
 * talos-r4-lion-xxx -    --"--
 * bld-lion-r5-xxx - in progress in bug 760093
 * bld-centos5-32-vmw-xxx - disconnect permanently
 * bld-centos5-64-vmw-xxx - disconnect permanently
What about bld-linux64-ec2-xxx slaves? They still use puppet 2.7.x.
and try-linux64-ec2-xxx...
FTR, upgrading 2.7.x may introduce SSL certificate issues (see http://projects.puppetlabs.com/issues/15561). We should explicitly check that if we want to go that way.
Rail: Everything that's not old-puppet is out of scope on this bug.  In particular, upgrading linux builders is covered in bug 884506.  They'll be upgraded to 3.2.2 - we won't use 2.7.22 for anything.
Updated proposal from comment 8, as discussed with Hal and Callek:

    - masters
       - panda/tegra masters - disconnect; hal is creating new hosts in bug 867583, and Callek will be connecting them to slavealloc in the next few weeks, at which point they can be reconnected
       - scheduler master - disconnect, bug XXX to get re-hosted on new master
       - preproduction-master, dev-master01 - disconnect permanently
    - all signing servers - disconnect, $someone finishes bug 869498 quickly
    - redis01 - disconnect per comment 6
    - buildapi01 - disconnect per comment 6
    - releng-mirror01 - disconnect, ??? (Callek to figure out what to do)
    - talos-r3-fed-xxx - disconnect 
    - talos-r3-fed64-xxx -    --"--
    - talos-r4-snow-xxx - disconnect and dustin file bug for upgrade
    - talos-r4-lion-xxx -    --"--
    - bld-lion-r5-xxx - in progress in bug 760093
    - bld-centos5-32-vmw-xxx - disconnect permanently
    - bld-centos5-64-vmw-xxx - disconnect permanently

"Disconnect" means two different things.  On servers (masters..releng-mirror01), it means disabling the puppet agent and/or crontask.  On slaves (the remainder), it means deploying an updated version of the system startup script that will continue to start buildbot even when puppet fails.  I'll open bugs for those tasks.

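(Minimal sketch of what "disconnect" could look like on each side; the real run-puppet.sh differs, and the paths/variables here are assumptions, not the actual releng files:)

  # server side: disable the agent and/or its crontask (on old puppet,
  # "puppetd --disable" just drops a lock file)
  puppetd --disable

  # slave side: a neutered startup script treats a failed puppet run as
  # non-fatal so buildbot still comes up
  if ! puppetd --test --server "$PUPPET_SERVER"; then
      echo "puppet run failed; continuing unmanaged" >&2
  fi
  /etc/init.d/buildbot start
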
Unfortunately, the four talos silos are going to be around for a while, and leaving them completely unmanaged will lead to pain in a few weeks.  We'll do two things to deal with that:
  1. Inform the CAB that we may not be able to make changes quickly there, and it may take a while to re-image hosts, so capacity may sag (Hal)
  2. clean out one of the old-puppet masters of any secrets, and re-enable it either permanently or as-needed, serving only the talos hosts.  The latter is OK, IMHO, because test hosts only give test results to users -- they do not produce any artifacts that are shipped to users.  Basically, they are all at the same trust level as try slaves (scm_level_1).

Longer-term, for these hosts, we'll look at implementing each host type with puppetagain (which will be puppet-3.2.2 or higher by then).  That has varying levels of difficulty, and in some cases is partially completed.
From an etherpad copy/paste about the plan.

We will (tonight) disconnect/stop puppet on "server" style old-puppet machines:

    - masters
       - panda/tegra masters:
           * buildbot-master10.build.mtv1.mozilla.com
           * buildbot-master19.build.mtv1.mozilla.com
           * buildbot-master20.build.mtv1.mozilla.com
           * buildbot-master22.build.mtv1.mozilla.com
           * buildbot-master29.build.scl1.mozilla.com
           * buildbot-master42.build.scl1.mozilla.com
           * buildbot-master43.build.scl1.mozilla.com
           * buildbot-master44.build.scl1.mozilla.com
           * buildbot-master45.build.scl1.mozilla.com
       - scheduler master - disconnect, bug XXX to get re-hosted on new master
           * buildbot-master36.srv.releng.scl3.mozilla.com
       - preproduction-master, dev-master01 - disconnect permanently
           * preproduction-master.srv.releng.scl3.mozilla.com
           * dev-master01.build.scl1.mozilla.com
    - all signing servers - disconnect, $someone finishes bug 869498 quickly
           * signing1.build.scl1.mozilla.com
           * signing2.build.scl1.mozilla.com
           * signing3.srv.releng.scl3.mozilla.com
           * mac-signing1.srv.releng.scl3.mozilla.com
           * mac-signing2.srv.releng.scl3.mozilla.com
           * mac-signing3.build.scl1.mozilla.com
           * mac-signing4.build.scl1.mozilla.com
    - redis01 - disconnect per comment 6
           * redis01.build.scl1.mozilla.com
    - buildapi01 - disconnect per comment 6
           * buildapi01.build.scl1.mozilla.com
    - releng-mirror01 - disconnect, ??? (Callek to figure out what to do)
           * releng-mirror01.srv.releng.scl3.mozilla.com
(In reply to Dustin J. Mitchell [:dustin] from comment #13)
>   2. clean out one of the old-puppet masters of any secrets, and re-enable
> it either permanently or as-needed, serving only the talos hosts.  The
> latter is OK, IMHO, because test hosts only give test results to users --
> they do not produce any artifacts that are shipped to users.  Basically,
> they are all at the same trust level as try slaves (scm_level_1).

FWIW, if we do this, IMO we need test-host-only releng passwords due to the risks involved; that's doable but annoying (we can achieve it on the 3.2.0 test machines with puppet aspects, and in old puppet by just unconditionally changing the puppet secrets).
I ran:
  /etc/init.d/puppetmaster stop; chkconfig puppetmaster off
on all four old-puppet masters:
  master-puppet1.build.scl1.mozilla.com
  scl3-production-puppet.srv.releng.scl3.mozilla.com
  scl-production-puppet.build.scl1.mozilla.com
  mv-production-puppet.build.mozilla.org
I also shut down httpd for good measure.

So the status is, old-puppet is completely disabled (yay), leaving a whole pile of machines unmanaged (boo).
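
(Not from the bug, just a sanity-check sketch one could run to confirm both services stayed down on those four masters:)

  for h in master-puppet1.build.scl1.mozilla.com \
           scl3-production-puppet.srv.releng.scl3.mozilla.com \
           scl-production-puppet.build.scl1.mozilla.com \
           mv-production-puppet.build.mozilla.org; do
    echo "== $h =="
    ssh "$h" 'service puppetmaster status; service httpd status; chkconfig --list puppetmaster'
  done
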
(In reply to Dustin J. Mitchell [:dustin] from comment #17)
> I also shut down httpd for good measure.
> 
> So the status is, old-puppet is completely disabled (yay), leaving a whole
> pile of machines unmanaged (boo).

Missed one: staging-puppet.build.mozilla.org (too tired to trust myself logging in as root anywhere atm though, so it can wait until tomorrow)
Thanks - got it.  All that remains here is to file some dep bugs for the unmanaged silos.
Depends on: 884833
Depends on: 884837
Depends on: 863266
Depends on: 884843
OK, current status + future plans:

    - masters
       - panda/tegra masters - unmanaged, bug 867583, bug XXX (Callek?)
       - scheduler master - unmanaged, bug 884833
       - preproduction-master, dev-master01 - unmanaged forever
    - all signing servers - unmanaged, bug 869498
    - redis01 - unmanaged, bug 884837
    - buildapi01 - unmanaged, bug 804334
    - releng-mirror01 - unmanaged, bug 884843
    - talos-r3-fed-xxx &
    - talos-r3-fed64-xxx &
    - talos-r4-snow-xxx &
    - talos-r4-lion-xxx - unmanaged, bug 884847
    - bld-lion-r5-xxx - unmanaged, upgrade progress in bug 760093
    - bld-centos5-32-vmw-xxx - unmanaged forever
    - bld-centos5-64-vmw-xxx - unmanaged forever

So from here on out this is a tracker bug.
Depends on: 804334
Summary: Mitigate CVE-2013-3567 on old releng puppet (2.6.x, 0.24.8) → [tracker] upgrade silos old releng puppet (2.6.x, 0.24.8)
Added bug 863275 to track the decommissioning of the old puppet servers.
Depends on: 863275
Per meeting yesterday, we'll be managing the *r4* talos hosts with puppetagain.  I'll file new dep bugs for that purpose.
Depends on: 891880
Depends on: 891881
Group: infra
Component: Server Operations: RelEng → RelOps: Puppet
Product: mozilla.org → Infrastructure & Operations
QA Contact: arich → dustin
Depends on: 894133
No longer depends on: 863266
remaining work:

    - masters
       - panda/tegra masters - bug 867583
    - OS X signing servers - bug 891561
    - redis01 - unmanaged, bug 884837
    - buildapi01 - unmanaged, bug 804334
    - talos-r4-snow-xxx - bug 891881

and unmanaged forever:
    - talos-r3-{fed,fed64}-xxx - unmanaged forever
    - bld-centos5-32-vmw-xxx - unmanaged forever
    - bld-centos5-64-vmw-xxx - unmanaged forever
Assignee: dustin → relops
remaining work:

    - panda/tegra masters - bug 867583
    - self-serve - bug 894133
    - redis01 - unmanaged, bug 884837
    - buildapi01 - unmanaged, bug 804334
Depends on: 867583
No longer depends on: 884837
Adding one class to the list:

    - dev-stage & preproduction-stage - bug 775091
Depends on: 775091
I'm going to close this out.  dev-stage seems to be stuck in a world of do-we-really-need-this -- bug 775091, bug 833024, bug 808025, and bug 772177.  I'll see if I can clean those up, but in the interim, that single host isn't a "silo" -- it's just an old, unmanaged, legacy thingie.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED