Bug 884472 (Closed) - [tracker] upgrade silos old releng puppet (2.6.x, 0.24.8)
Opened 11 years ago · Closed 10 years ago
Categories: Infrastructure & Operations :: RelOps: Puppet (task)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: dustin; Assignee: Unassigned
This vulnerability (CVE-2013-3567) allows arbitrary remote code execution on the masters. Every host in the releng network - including try hosts - has access to the masters. Hosts running 3.2.0 will be easy to upgrade to 3.2.2, which is not vulnerable. Hosts running 2.7.17 are already slated to be upgraded, and we'll accelerate that. Hosts running older, unsupported versions are in a tough spot.

Ideas:
* Massively accelerate migration from old-puppet to new-puppet, perhaps leaving hosts unmanaged in the interim, and shut off the old masters immediately.
* Try to backport PuppetLabs' fix and ship patched versions on the master.
* Pretend it never happened.

Mostly-complete list of hosts still on old puppet:
* some buildmasters
* all signing servers
* redis01
* buildapi01
* talos-r3-fed-xxx
* talos-r3-fed64-xxx
* talos-r4-snow-xxx
* talos-r4-lion-xxx
* bld-lion-r5-xxx (moving already in bug 760093)
* bld-centos5-32-vmw-xxx
* bld-centos5-64-vmw-xxx
Reporter
Comment 1 • 11 years ago
Richard looked at backporting; it's not feasible. The fix involves monkey-patching a bunch of Ruby libraries and a complete rewrite of the serializer from 3.1.x (so probably several complete rewrites since 2.6.x and 0.24.8).
Comment 2 • 11 years ago
So we're down to 2 options:
* Massively accelerate migration from old-puppet to new-puppet, perhaps leaving hosts unmanaged in the interim, and shut off the old masters immediately.
* Pretend it never happened.

Two scoping questions:
- Any rough time estimate for relops/releng on the accelerated migration?
- Any way to tell from puppet how often these machines are actually making corrections/updates to the hosts?

It seems a reasonable approach may be to break the list of affected host types into priority order. I suspect a number of them (all?) could safely be disconnected for a while, and others migrated quickly.
Flags: needinfo?(dustin)
Reporter
Comment 3 • 11 years ago
I don't have an overall scope, but I'll share what I can.

The servers could easily go unmanaged for a while, but unfortunately they're also the easy ones to convert:
* some buildmasters - you're working on this AFAIK
* all signing servers - I have a start on this, but probably a week or two
* redis01 - probably straightforward
* buildapi01 - same

The slaves would need to be hacked to go unmanaged, and there'd be some loss of functionality (mostly around cleanup at startup). Are the Fedora hosts even used?
* talos-r3-fed-xxx
* talos-r3-fed64-xxx - puppetagain doesn't support Fedora, so these will be hard
* talos-r4-snow-xxx
* talos-r4-lion-xxx - these will probably mostly work based on work for talos-r5-mtnlion
* bld-lion-r5-xxx - moving already in bug 760093
* bld-centos5-32-vmw-xxx
* bld-centos5-64-vmw-xxx - puppetagain doesn't support CentOS 5
Flags: needinfo?(dustin)
Comment 4 • 11 years ago
Some of the cleanup could be moved to @reboot cron jobs on the slaves.
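A minimal sketch of such a job, as an /etc/cron.d entry (the cleanup script path here is hypothetical, not the real one on the slaves):

    # /etc/cron.d/slave-cleanup (sketch; script path is made up)
    # Run the startup cleanup at boot instead of from a puppet run.
    @reboot root /usr/local/bin/slave-cleanup.sh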
Reporter
Comment 5 • 11 years ago
Yes, or in the neutered run-puppet.sh scripts.
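For concreteness, a sketch of what a "neutered" run-puppet.sh could look like (the cleanup and buildbot commands are assumptions; the real script isn't shown in this bug):

    #!/bin/sh
    # "Neutered" run-puppet.sh (sketch): the old masters are down, so
    # skip the puppet run, do the old startup cleanup inline, and always
    # fall through to starting buildbot even if cleanup fails.
    /usr/local/bin/slave-cleanup.sh || true
    exec /etc/init.d/buildbot start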
Comment 6 • 11 years ago
redis and buildapi are pretty static in terms of puppet right now; we can probably remove them from being managed until the new manifests are ready.
Comment 7 • 11 years ago
This vulnerability and its impact are serious enough to warrant immediate action:
* upgrade all instances of puppet to 2.7.22 or 3.2.2.

On the older puppet instances (0.24.8 and 2.6.x) that are unsupported and where no patch is available, we will need to either:
* shut down these puppetmasters and leave them shut down, or
* shut down these puppetmasters and begin the task of upgrading to 3.2.2.

Patching this vulnerability is a high priority and efforts should start now. I believe CAB approvals may be needed, so those tasked with upgrading should file CAB requests now.
Reporter
Comment 8 • 11 years ago
So my proposal is:
* some buildmasters - hal in bug 867583 (for panda/tegra masters) + new bug (others)
* all signing servers - disconnect, $someone finishes bug 869498 quickly
* redis01 - disconnect per comment 6
* buildapi01 - disconnect per comment 6
* talos-r3-fed-xxx - disconnect permanently (and power off?)
* talos-r3-fed64-xxx - --"--
* talos-r4-snow-xxx - disconnect and dustin file bug for upgrade
* talos-r4-lion-xxx - --"--
* bld-lion-r5-xxx - in progress in bug 760093
* bld-centos5-32-vmw-xxx - disconnect permanently
* bld-centos5-64-vmw-xxx - disconnect permanently
Comment 9 • 11 years ago
What about bld-linux64-ec2-xxx slaves? They still use puppet 2.7.x.
Comment 10 • 11 years ago
and try-linux64-ec2-xxx...
Comment 11 • 11 years ago
FTR, upgrading 2.7.x may introduce SSL certificate issues (see http://projects.puppetlabs.com/issues/15561). We should explicitly check this if we want to go that way.
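If we do go that way, one way to sanity-check an agent's certificate against the master's CA (assuming the stock 2.7.x ssldir of /var/lib/puppet/ssl; adjust if customized):

    # Verify the agent's cert still chains to the CA that issued it
    openssl verify -CAfile /var/lib/puppet/ssl/certs/ca.pem \
        /var/lib/puppet/ssl/certs/$(hostname -f).pem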
Reporter
Comment 12 • 11 years ago
Rail: Everything that's not old-puppet is out of scope on this bug. In particular, upgrading linux builders is covered in bug 884506. They'll be upgraded to 3.2.2 - we won't use 2.7.22 for anything.
Reporter
Comment 13 • 11 years ago
Updated proposal from comment 8, as discussed with Hal and Callek:
- masters
  - panda/tegra masters - disconnect; hal creating new hosts in bug 867583, Callek connecting to slavealloc in the next few weeks, and connect at that point
  - scheduler master - disconnect, bug XXX to get re-hosted on new master
  - preproduction-master, dev-master01 - disconnect permanently
- all signing servers - disconnect, $someone finishes bug 869498 quickly
- redis01 - disconnect per comment 6
- buildapi01 - disconnect per comment 6
- releng-mirror01 - disconnect, ??? (Callek to figure out what to do)
- talos-r3-fed-xxx - disconnect
- talos-r3-fed64-xxx - --"--
- talos-r4-snow-xxx - disconnect and dustin file bug for upgrade
- talos-r4-lion-xxx - --"--
- bld-lion-r5-xxx - in progress in bug 760093
- bld-centos5-32-vmw-xxx - disconnect permanently
- bld-centos5-64-vmw-xxx - disconnect permanently

"Disconnect" means two different things. On servers (masters..releng-mirror01), it means disabling the puppet agent and/or crontask (see the sketch after this comment). On slaves (the remainder), it means deploying an updated version of the system startup script that will continue to start buildbot even when puppet fails. I'll open bugs for those tasks.

Unfortunately, the four talos silos are going to be around for a while, and leaving them completely unmanaged will lead to pain in a few weeks. We'll do two things to deal with that:
1. Inform the CAB that we may not be able to make changes quickly there, and it may take a while to re-image hosts, so capacity may sag (Hal)
2. Clean out one of the old-puppet masters of any secrets, and re-enable it either permanently or as-needed, serving only the talos hosts.

The latter is OK, IMHO, because test hosts only give test results to users -- they do not produce any artifacts that are shipped to users. Basically, they are all at the same trust level as try slaves (scm_level_1).

Longer-term, for these hosts, we'll look at implementing each host type with puppetagain (which will be puppet-3.2.2 or higher by then). That has varying levels of difficulty, and in some cases is partially completed.
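For the server-side disconnect, a sketch of the likely commands on each host (service and cron names are assumptions -- these hosts may drive puppet from an init script, a root crontab, or both):

    # Stop the puppet agent and keep it off across reboots
    /etc/init.d/puppet stop
    chkconfig puppet off
    # If puppet runs from cron instead, drop that entry too
    crontab -u root -l | grep -v puppet | crontab -u root -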
Comment 14 • 11 years ago
From etherpad copy/paste about the plan. We will (tonight) disconnect/stop puppet on "server" style old-puppet machines:
- masters
  - panda/tegra masters
    * buildbot-master10.build.mtv1.mozilla.com
    * buildbot-master19.build.mtv1.mozilla.com
    * buildbot-master20.build.mtv1.mozilla.com
    * buildbot-master22.build.mtv1.mozilla.com
    * buildbot-master29.build.scl1.mozilla.com
    * buildbot-master42.build.scl1.mozilla.com
    * buildbot-master43.build.scl1.mozilla.com
    * buildbot-master44.build.scl1.mozilla.com
    * buildbot-master45.build.scl1.mozilla.com
  - scheduler master - disconnect, bug XXX to get re-hosted on new master
    * buildbot-master36.srv.releng.scl3.mozilla.com
  - preproduction-master, dev-master01 - disconnect permanently
    * preproduction-master.srv.releng.scl3.mozilla.com
    * dev-master01.build.scl1.mozilla.com
- all signing servers - disconnect, $someone finishes bug 869498 quickly
  * signing1.build.scl1.mozilla.com
  * signing2.build.scl1.mozilla.com
  * signing3.srv.releng.scl3.mozilla.com
  * mac-signing1.srv.releng.scl3.mozilla.com
  * mac-signing2.srv.releng.scl3.mozilla.com
  * mac-signing3.build.scl1.mozilla.com
  * mac-signing4.build.scl1.mozilla.com
- redis01 - disconnect per comment 6
  * redis01.build.scl1.mozilla.com
- buildapi01 - disconnect per comment 6
  * buildapi01.build.scl1.mozilla.com
- releng-mirror01 - disconnect, ??? (Callek to figure out what to do)
  * releng-mirror01.srv.releng.scl3.mozilla.com
Comment 15 • 11 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #13)
> 2. clean out one of the old-puppet masters of any secrets, and re-enable
> it either permanently or as-needed, serving only the talos hosts. The
> latter is OK, IMHO, because test hosts only give test results to users --
> they do not produce any artifacts that are shipped to users. Basically,
> they are all at the same trust level as try slaves (scm_level_1).

FWIW, IFF we do this, IMO we need test-host-only releng passwords due to the risks involved; that's doable but annoying (we can achieve this in the 320 test machines with puppet aspects, and in old puppet by just unconditionally changing the puppet secrets).
Reporter
Comment 16 • 11 years ago
I ran:

    /etc/init.d/puppetmaster stop; chkconfig puppetmaster off

on all four old-puppet masters:
* master-puppet1.build.scl1.mozilla.com
* scl3-production-puppet.srv.releng.scl3.mozilla.com
* scl-production-puppet.build.scl1.mozilla.com
* mv-production-puppet.build.mozilla.org
Reporter
Comment 17 • 11 years ago
I also shut down httpd for good measure. So the status is, old-puppet is completely disabled (yay), leaving a whole pile of machines unmanaged (boo).
Comment 18 • 11 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #17)
> I also shut down httpd for good measure.
>
> So the status is, old-puppet is completely disabled (yay), leaving a whole
> pile of machines unmanaged (boo).

Missed one: staging-puppet.build.mozilla.org. (Too tired to trust myself logging in as root anywhere atm, though; it can wait until tomorrow.)
Reporter
Comment 19 • 11 years ago
Thanks - got it. All that remains here is to file some dep bugs for the unmanaged silos.
Reporter
Comment 20 • 11 years ago
OK, current status + future plans:
- masters
  - panda/tegra masters - unmanaged, bug 867583, bug XXX (Callek?)
  - scheduler master - unmanaged, bug 884833
  - preproduction-master, dev-master01 - unmanaged forever
- all signing servers - unmanaged, bug 869498
- redis01 - unmanaged, bug 884837
- buildapi01 - unmanaged, bug 804334
- releng-mirror01 - unmanaged, bug 884843
- talos-r3-fed-xxx &
- talos-r3-fed64-xxx &
- talos-r4-snow-xxx &
- talos-r4-lion-xxx - unmanaged, bug 884847
- bld-lion-r5-xxx - unmanaged, upgrade progress in bug 760093
- bld-centos5-32-vmw-xxx - unmanaged forever
- bld-centos5-64-vmw-xxx - unmanaged forever

So from here on out this is a tracker bug.
Reporter
Updated • 11 years ago
Summary: Mitigate CVE-2013-3567 on old releng puppet (2.6.x, 0.24.8) → [tracker] upgrade silos old releng puppet (2.6.x, 0.24.8)
Comment 21 • 11 years ago
Added bug 863275 to track the decommissioning of the old puppet servers.
Depends on: 863275
Reporter
Comment 22 • 11 years ago
Per meeting yesterday, we'll be managing the *r4* talos hosts with puppetagain. I'll file new dep bugs for that purpose.
Updated • 11 years ago
Group: infra
Component: Server Operations: RelEng → RelOps: Puppet
Product: mozilla.org → Infrastructure & Operations
QA Contact: arich → dustin
Reporter
Comment 23 • 11 years ago
remaining work:
- masters
  - panda/tegra masters - bug 867583
- OS X signing servers - bug 891561
- redis01 - unmanaged, bug 884837
- buildapi01 - unmanaged, bug 804334
- talos-r4-snow-xxx - bug 891881

and unmanaged forever:
- talos-r3-{fed,fed64}-xxx
- bld-centos5-32-vmw-xxx
- bld-centos5-64-vmw-xxx
Reporter
Updated • 11 years ago
Assignee: dustin → relops
Reporter
Comment 24 • 11 years ago
remaining work:
- panda/tegra masters - bug 867583
- self-serve - bug 894133
- redis01 - unmanaged, bug 884837
- buildapi01 - unmanaged, bug 804334
Depends on: 867583
Comment 26 • 11 years ago
Adding one class to the list:
- dev-stage & preproduction-stage - bug 775091
Depends on: 775091
Reporter
Comment 27 • 10 years ago
I'm going to close this out. dev-stage seems to be stuck in a world of do-we-really-need-this -- bug 775091, bug 833024, bug 808025, and bug 772177. I'll see if I can clean those up, but in the interim, that single host isn't a "silo" -- it's just an old, unmanaged, legacy thingie.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED