Closed Bug 552058 Opened 14 years ago Closed 14 years ago

re-architect puppet infrastructure

Categories

(Release Engineering :: General, defect, P4)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

References

Details

(Whiteboard: [puppet][q2goal])

Attachments

(11 files, 4 obsolete files)

166.71 KB, patch
catlee
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
24.07 KB, text/plain
bhearsum
: checked-in+
Details
7.77 KB, patch
bear
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
5.84 KB, patch
bear
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
1.85 KB, patch
bear
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
1.06 KB, patch
catlee
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
69.29 KB, patch
bear
: review+
catlee
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
599 bytes, patch
bear
: review+
catlee
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
114.10 KB, patch
rail
: review+
catlee
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
16.59 KB, patch
catlee
: review+
rail
: review+
Details | Diff | Splinter Review
726 bytes, patch
rail
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
Our Puppet infrastructure isn't scaling very well, and it certainly will not scale to remote datacenters. We need to make it handle a high volume of slaves better, cope with remote datacenters, and perhaps do other things. Here are a few ideas off the top of my head:
* Point slaves to a dumb proxy that feeds them to one of many masters
* Move off serving files over NFS to serving files with the Puppet file server


I think this needs to happen before we roll out machines in a second colo.
Whiteboard: [puppet]
We are hoping to have the new scratchbox ready for deployment in the not-too-distant future. This is going to be a 3.0-4.5GB file. Are we going to be able to handle this with the current infrastructure?
I have no way to know for sure, but I would *guess* that it would strain the NFS server significantly. There are ways to work around it; ping me for details.
I'll be driving work on this this quarter.
Assignee: nobody → bhearsum
Whiteboard: [puppet] → [puppet][q2goal]
So...this is a big patch with mostly just ground work for refactoring. Here's the high level details:
* Segregate nodes into platform/arch/type
* Use new locations for all files ($platform_{http,file}root)
* Create new fileserver configs
* Kill off some unused/pointless stuff (comment out bits, {build,sandbox}-network classes that only include one thing)
* Add test-manifests.sh for rudimentary testing
* Add DMG creator
* Switch all file {} types to use puppet:// (see the sketch after this list)
* Switch all Mac devtools tarballs and macports repo to DMGs
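
For anyone unfamiliar with the puppet:// switch mentioned above, here's a minimal sketch of the idea. The mount name, paths, and file name are assumptions for illustration, not the actual configs:

# fileserver.conf: expose the per-platform file store through the
# Puppet file server instead of relying on the NFS mount
[darwin9]
    path /N/production/darwin9
    allow *

# manifest: file {} sources now point at the Puppet file server
file { "/tools/python-2.6.4.dmg":
    source => "puppet:///darwin9/python-2.6.4.dmg",
    mode   => 644,
}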

I've tested this on every slave type, and it works correctly in staging.

Deployment is a little tricky, here's my plan:
* Do final sync of /N/puppet-files-reorg/staging -> /N/puppet-files-reorg/production
* Land manifests
* Update production-puppet repo, fileserver symlink
* Move /N/puppet-files to /N/puppet-files.old, set read-only
* Move /N/puppet-files-reorg to /N/puppet-files
* Do a test Puppet run on one of each slave type

My next planned steps are as follows:
* Convert Linux devtools tarballs to RPMs
* Convert scripts or other things that depend on /N to file{} + exec {} or some other thing that doesn't need the NFS share
* Separate site-production.pp into site-castro.pp and site-mpt.pp, and setup a new master in Castro to care for those slaves

I may also look at reorganizing the configs as a whole, as originally planned, but that's not the most crucial bit of this.
Attachment #444703 - Flags: review?(catlee)
Whoops, forgot to mention one big caveat here: all of the tarballs which were switched to DMGs will install *again*. This isn't an issue bandwidth-wise, because they'll hit the cache, but before deploying this it would be best to mark them all as already installed to avoid unnecessary installs. DMGs are tracked by creating files like /var/db/.puppet_pkgdmg_installed_python-2.6.4.dmg, so it's simply a matter of touching those files. The tricky part is finding a good way to do it.
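
A minimal sketch of that one-off marking step (the package names here are made up; the real list would be whatever DMGs the manifests now manage):

#!/bin/sh
# Pretend the converted DMGs are already installed so Puppet's pkgdmg
# provider doesn't reinstall them on the first run after deployment.
for pkg in python-2.6.4.dmg mercurial-1.5.1.dmg; do
    touch "/var/db/.puppet_pkgdmg_installed_${pkg}"
done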
I also uncommented the cleanup at the end of create-dmg.sh.
Attachment #444703 - Attachment is obsolete: true
Attachment #444937 - Flags: review?(catlee)
Attachment #444703 - Flags: review?(catlee)
Attachment #444937 - Attachment is obsolete: true
Attachment #445168 - Flags: review?(catlee)
Attachment #444937 - Flags: review?(catlee)
Blocks: 564914
No longer blocks: 564914
Attachment #445168 - Flags: review?(catlee) → review+
Comment on attachment 445168 [details] [diff] [review]
phase 1 with consistent node names

I landed this and am currently working through bustage.
Attachment #445168 - Flags: checked-in+
Blocks: 566337
Blocks: 566333
I had to fix a few bustages, mostly related to merging. I did a test on one of each of the platforms:
centos 64-bit
centos 32-bit
darwin10 build
darwin9 build
darwin10 test
darwin9 test
fedora 64-bit
fedora 32-bit

After fixing all the bustage, there is one new, but non-blocking, issue that came up:
* darwin10 slaves require updated pkgdmg.rb for Package to work correctly. Puppet syncs this file out, but because it's already running when the file is replaced any package {} checks will still fail. This sucks pretty bad for new slaves that come up since they'll almost certainly fail their first build.
For posterity, here's what I've been using for staging -> production syncs of files:
rsync --delete -av --include="**usr/local" --exclude=local /N/staging/ /N/production/

I'll be wrapping this with something soon.
Rail, I briefly tested this RPM and it seems to work fine. I'd love to hear some feedback from somebody with more experience building RPMs on the SPEC file itself, though.

I tried using built-in macros such as %setup and %configure, but %setup seems to require that %name is the same as the directory the source tarball creates. And %configure seems to set way too many flags that we don't need, and screws up the --prefix we want.

I had to set AutoReqProv: no because without it, the package somehow depended on /usr/local/bin/python and other nonsensical things. For Twisted, Zope, Buildbot, etc. I intend to set dependencies on the other custom packages (e.g., Twisted will have Requires: python25).

I had to set the package name to 'python25' to avoid overriding the system Python install.
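
For reference, a skeletal version of what such a spec looks like -- the prefix, version, and summary below are illustrative assumptions, not the attached file:

Name:           python25
Version:        2.5.1
Release:        1
Summary:        Python 2.5 for build slaves, installed under a custom prefix
License:        PSF
Source0:        Python-%{version}.tar.bz2
# without this, rpmbuild auto-detects bogus requirements like
# /usr/local/bin/python
AutoReqProv:    no

%description
Custom Python build that stays out of the way of the system Python.

%prep
# -n is needed because the tarball unpacks to Python-2.5.1, not python25-2.5.1
%setup -q -n Python-%{version}

%build
./configure --prefix=/tools/python-2.5.1
make

%install
make install DESTDIR=%{buildroot}

%files
/tools/python-2.5.1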
Attachment #446301 - Flags: feedback?(rail)
Attached file flexible python.spec
(In reply to comment #11)
> I tried using built-in Macros such as %setup and %configure, but %setup seems
> to require that %name is the same as the directory the source tarball creates.
> And configure seems to set way too many flags that we don't need, and screws up
> the --prefix we want.

Attaching a spec file based on the original used in Centos 5 and yours.

I removed all the patches because we don't need them, left the optimization flags as is, and redefined _prefix, so now we can use the default %configure (I had to tweak _mandir, _infodir and _localstatedir, because they somehow didn't pick up _prefix). Now you need to change only 3 lines to compile an RPM for Python 2.6.
 
> I had to set AutoReqProv: no because without it, the package somehow depended
> on /usr/local/bin/python and other non-sensible things.

sed -i 's,/usr/local/bin/python,/usr/bin/env python,g' Lib/cgi.py
solves the /usr/local problem. I'd remove AutoReqProv to be sure that we have all needed libraries installed. 

> For Twisted, Zope,
> Buildbot, etc. I intend to set dependencies on the other custom packages. (eg,
> Twisted will have Requires: python25

+1
 
> I had to set the package name to 'python25' to avoid overriding the system
> Python install.

+1

Feel free to ask for review of your specs/rpms!
Attachment #446301 - Flags: feedback?(rail) → feedback+
Thanks for your comments, Rail. I'll give that spec file a try.
Depends on: 567149
Attachment #446752 - Flags: review?(bear) → review+
Comment on attachment 446752 [details] [diff] [review]
fix busted linux nodes, dependencies for mac machines

changeset:   154:95f0f0595a8a
Attachment #446752 - Flags: checked-in+
Apparently resources within functions are private, and cannot be accessed outside of them.
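
Concretely (resource names here are illustrative), that means requiring the define instance itself rather than the Package resource declared inside it:

# before: fails -- Package["python-2.6.4"] is declared inside the
# install_package define and isn't visible from outside it
require => Package["python-2.6.4"],

# after: reference the define instance itself
require => Install_package["python-2.6.4.dmg"],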
Attachment #446760 - Flags: review?(bear)
Attachment #446760 - Flags: review?(bear) → review+
Comment on attachment 446760 [details] [diff] [review]
use Install_package[] instead of Package[]

changeset:   155:5576ec374af9
Attachment #446760 - Flags: checked-in+
I was overzealous in adding these, apparently. We only need to upgrade pkgdmg.rb on 10.6, thus we only need Package {}'s to depend on it in 10.6.
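
In manifest terms this boils down to something like the following sketch, which only belongs in the darwin10 nodes/classes (the provider path and mount are guesses for illustration):

# darwin10 only: every package {} must wait for the patched DMG provider,
# otherwise the first run on a fresh slave uses the stock pkgdmg.rb
Package {
    require => File["pkgdmg.rb"],
}

file { "pkgdmg.rb":
    # path is an assumption; it is wherever the slave's Puppet install
    # keeps its package providers
    path   => "/usr/lib/ruby/site_ruby/1.8/puppet/provider/package/pkgdmg.rb",
    source => "puppet:///darwin10/pkgdmg.rb",
}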
Attachment #446764 - Flags: review?(bear)
Attachment #446764 - Flags: review?(bear) → review+
Comment on attachment 446764 [details] [diff] [review]
remove pkgdmg.rb dependency for darwin9 installs

changeset:   156:622b58fa049d
Attachment #446764 - Flags: checked-in+
Attachment #446769 - Flags: review?(catlee) → review+
Comment on attachment 446769 [details] [diff] [review]
fix bad merge on staging slaves, the includes too

changeset:   157:b8a823a3a9c7
Attachment #446769 - Flags: checked-in+
Some (all?) of the darwin9 slaves have this sort of log:
notice: Starting catalog run
notice: //Node[build]/base/osx/Exec[check-for-macports]/returns: executed successfully
err: //Node[build]/base/osx/Exec[mount-nfs]/returns: change from notrun to 0 failed: /sbin/mount /N && /bin/sleep 10 returned 1 instead of 0 at /etc/puppet/manifests/os/osx.pp:232
notice: //Node[build]/base/osx/Exec[refresh-automount]/returns: executed successfully
notice: Finished catalog run in 1.84 seconds

I think this is because /N is already mounted.
(In reply to comment #22)
> Some (all?) of the darwin9 slaves have this sort of log:
> notice: Starting catalog run
> notice: //Node[build]/base/osx/Exec[check-for-macports]/returns: executed
> successfully
> err: //Node[build]/base/osx/Exec[mount-nfs]/returns: change from notrun to 0
> failed: /sbin/mount /N && /bin/sleep 10 returned 1 instead of 0 at
> /etc/puppet/manifests/os/osx.pp:232
> notice: //Node[build]/base/osx/Exec[refresh-automount]/returns: executed
> successfully
> notice: Finished catalog run in 1.84 seconds
> 
> I think this is because /N is already mounted.

Thanks for catching this. It's not hurting anything, merely annoying, so I'm going to roll it in with the next larger patch.
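
One way to roll that fix in (a sketch only, not necessarily how it will land) is to guard the exec so it only runs when /N isn't mounted yet:

exec { "mount-nfs":
    command => "/sbin/mount /N && /bin/sleep 10",
    # skip the mount when /N is already mounted, instead of failing
    unless  => "/sbin/mount | /usr/bin/grep -q ' /N '",
}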
I was expecting to see it run some other checks before the catalog finished, and so assumed it was aborting when mount-nfs had a non-zero exit status. Is that not the case?
(In reply to comment #24)
> I was expecting to see it check some other checks before the catalog finished,
> and so assumed it was aborting when mount-nfs had a non-zero exit status. Is
> that not the case ?

It would only abort checks which require[] the mount-nfs one. It would even explicitly list them in the log. AFAICT this isn't interfering with anything.
Summary of this patch:
* Convert Linux tarballs to RPMs
* Convert separate Mercurial class into RPM in devtools
* Separate install_package into install_rpm and install_dmg
* Remove fstab entirely from Mac machines
* Remove NFS mount from Linux 64-bit machines
* Remove unused files/vars

The check {} in install_rpm is a bit tricky, let me know if the comment doesn't sufficiently explain it.

I think that's it! I did test runs on 64 and 32-bit Linux machines which involved:
* Running Puppet, and verifying that no packages were re-installed
* Moving /tools away, running Puppet, and diffing (diff results here: http://spreadsheets.google.com/pub?key=tn9krgLEm8JhdFdLp_3LfIQ&output=html)
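
To make the shape of the split clearer, here's a rough sketch of what the two defines look like -- the commands, variables, paths, and timeout are assumptions, not the patch itself:

define install_rpm() {
    exec { "install-$name":
        # download first, then install; feeding the http:// URL straight
        # to rpm caused problems with the huge Scratchbox package
        command => "/usr/bin/wget -q -O/tmp/$name ${platform_httproot}/$name && /bin/rpm -U /tmp/$name; /bin/rm -f /tmp/$name",
        # only run when check-for-rpm.sh says the package (or an old
        # tarball install of the same thing) isn't already present
        unless  => "/usr/local/bin/check-for-rpm.sh ${platform_httproot}/$name",
        timeout => 3600,
    }
}

define install_dmg() {
    package { "$name":
        ensure   => installed,
        provider => pkgdmg,
        source   => "${platform_httproot}/$name",
    }
}

# usage
install_rpm { "gcc45-4.5.0-1.el5.i386.rpm": }
install_dmg { "python-2.6.4.dmg": }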
Attachment #448845 - Flags: review?(catlee)
Attachment #448845 - Flags: review?(bear)
Attached patch check-for-rpm.sh
Here's the check-for-rpm.sh script. I was originally just passing the http:// link to rpm, but I had some issues with the Scratchbox RPM causing a segfault.

Note that with the wget, we need at least 1.2GB free on each 32-bit slave to roll this out (which I've already verified we have). It sucks. We might be able to drop this check altogether after this is rolled out everywhere, since we won't have to worry about pre-existing installs anymore.
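
Roughly, the idea is the following (a sketch with assumed details, not the reviewed script; the assumed exit convention is 0 = already present, non-zero = install needed, so it can be used as an exec "unless" check):

#!/bin/sh
# Sketch only: decide whether an RPM still needs installing on this slave.
URL="$1"
PKG="/tmp/$(basename "$URL")"

# Download the package instead of pointing rpm at the http:// URL
# directly; this is why each 32-bit slave needs ~1.2GB free during
# the rollout.
/usr/bin/wget -q -O "$PKG" "$URL" || exit 2

# If any file the package would install is missing, we need to install.
for f in $(/bin/rpm -qlp "$PKG" 2>/dev/null); do
    if [ ! -e "$f" ]; then
        /bin/rm -f "$PKG"
        exit 1
    fi
done

/bin/rm -f "$PKG"
exit 0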
Attachment #448847 - Flags: review?(catlee)
Attachment #448847 - Flags: review?(bear)
Attachment #448847 - Flags: review?(bear) → review+
Comment on attachment 448845 [details] [diff] [review]
convert Linux tarballs to RPMs, a few other minor things

wow, it's a huge patch but it will make package management so much nicer.
Attachment #448845 - Flags: review?(bear) → review+
A couple things I forgot to mention:
* Rail is converting GCC 4.5 to an RPM in bug 559964
* Two slaves which have been rebuilt with the new manifests (mv-moz2-linux-slave01 and moz2-linux64-slave01) have been cycling in staging without issue.
Depends on: 570141
Attachment #448845 - Flags: review?(catlee) → review+
Comment on attachment 448847 [details] [diff] [review]
check-for-rpm.sh

r+ with the duplicate check for already-installed taken out, since it's handled in the puppet manifests.
Attachment #448847 - Flags: review?(catlee) → review+
I landed all of the spec files I could find to the newly created rpm-sources repository.

http://hg.mozilla.org/build/rpm-sources
Comment on attachment 448845 [details] [diff] [review]
convert Linux tarballs to RPMs, a few other minor things

changeset:   174:839dbf91332d
Attachment #448845 - Flags: checked-in+
Comment on attachment 448847 [details] [diff] [review]
check-for-rpm.sh

This is in staging+production now, minus the rpm -ql bit.
Attachment #448847 - Flags: checked-in+
Comment on attachment 448845 [details] [diff] [review]
convert Linux tarballs to RPMs, a few other minor things

This landing went very well. I manually tested each of: 32-bit centos, 64-bit centos, darwin9 build, darwin10 build -- which all synced up properly on the first and second runs.

I've been watching the log on the master and found one small bustage: talos_osx.pp was still using install_package instead of install_dmg. That has been fixed now, and I haven't seen any other issues.

All of the slaves which I synced up by hand have run a build successfully.
This is my first pass on multi-location support for our Puppet infrastructure. I worked through this based on what will need to happen when new slaves come up. For slaves in MPT, nothing changes, because the ref platforms are all hardcoded to talk with the Puppet server there.

Non-MPT slaves will connect to the MPT puppet master on first boot and receive an updated configuration file, pointing them to the right master. Next time they sync up (generally, at next boot) they'll connect to their local puppet master and receive whatever updates are available.

I've tested this and it works well for all of our build/test slaves, with one caveat: they will not be up to date with all of the changes until their second boot. For build machines this isn't so bad, because they head to staging before production. Test machines, however, connect directly to the production masters after syncing with Puppet once. Because of this, there's the possibility of developer-visible bustage when new machines come up.

Forcing a reboot after the configuration file is synced up may fix this issue, I haven't tested it yet.

I'd love to get any feedback anyone has on this patch before going further with it.
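
To make the bootstrap step concrete, here is a sketch of the idea (the config file name, mount, and ownership are assumptions, not the actual patch):

# Served by the MPT master to slaves that belong to another location.
# On the next run (normally the next boot) the slave reads this config
# and talks to its local master instead.
file { "/etc/puppet/puppetd.conf":
    source => "puppet:///configs/mv-production/puppetd.conf",
    owner  => "root",
    mode   => 644,
}

# The puppetd.conf handed out for mv slaves would contain something like:
#   [puppetd]
#       server = mv-production-puppet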
Attachment #456085 - Flags: feedback?(rail)
Attachment #456085 - Flags: feedback?(catlee)
Attachment #456085 - Flags: feedback?(bear)
Comment on attachment 456085 [details] [diff] [review]
mostly finished multi-location support

Except for the ownership of the /home/cltbld/.fonts.conf file (it should be cltbld:cltbld), this looks good.
Attachment #456085 - Flags: feedback?(rail) → feedback+
I chatted with some folks in #puppet and they said that the best way to reload the Puppet config mid-run is rebooting, so I'll be trying that.

A side-effect of making that change is the need to ensure that nothing runs before Puppet is done. This is already the case for test machines, 32-bit linux build machines, and possibly 64-bit linux build machines. Shouldn't be a big deal for the remaining platforms.
Comment on attachment 456085 [details] [diff] [review]
mostly finished multi-location support

Sorry for the delayed f+; I had looked this over a while ago but must not have saved it properly.
Attachment #456085 - Flags: feedback?(bear) → feedback+
Attachment #456085 - Flags: feedback?(catlee)
OK, this is finally ready for final review. Comment #35 still applies, plus the following:
- Started syncing all of the puppet runner scripts; added buildbot blocking and error catching to all of them
- Moved config files to ${local_fileroot} to allow for easier syncing between Puppet masters.
- Lots of merges from default, mostly affecting slaves in site files.

I've got a second patch to post, with diffs on the configuration files and startup scripts.

Tomorrow I'm going to write out my plan for rolling this out, as it's going to be a non-trivial affair.
Attachment #456085 - Attachment is obsolete: true
Attachment #462614 - Flags: review?(rail)
Attachment #462614 - Flags: review?(catlee)
This patch shows all of the startup scripts being updated for better error handling and buildbot blocking. Blocking works differently on different platforms. On CentOS, Puppet runs with --no-daemonize, which blocks any other init scripts from running until it's done. On Fedora, the Puppet runner script also takes care of starting Buildbot. On Mac, we use the "WatchPaths" feature of launchd, which launches buildbot when one of the files it is watching is modified.

This patch also shows all of the staging configuration files. Before landing I'll be creating/updating the mpt and mv production ones. I'll also need to give the MPT production master copies of the mv and staging configuration files so it can effectively move slaves.
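
For the Mac piece, the launchd side looks roughly like this (the label and paths are assumptions for illustration, not the shipped plist):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>org.mozilla.start-buildbot</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/start-buildbot.sh</string>
    </array>
    <!-- launchd starts buildbot when the watched file changes; the
         puppet runner script touches it once a run finishes -->
    <key>WatchPaths</key>
    <array>
        <string>/var/puppet/state/puppet-run-done</string>
    </array>
</dict>
</plist>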
Attachment #462758 - Flags: review?(rail)
Attachment #462758 - Flags: review?(catlee)
Comment on attachment 446301 [details]
spec file for Python 2.5.1 (32-bit centos)

This didn't end up landing as is: http://hg.mozilla.org/build/rpm-sources/file/c8fe426b3bdb/python25/centos5-i686/python25.spec
Attachment #446301 - Attachment is obsolete: true
Comment on attachment 446461 [details]
flexible python.spec

This one landed
Attachment #446461 - Flags: checked-in+
Depends on: 584409
I found that the host key accept script is really slow: O(number of hosts in site.pp). It's getting to the point where runs start to overlap each other, which just makes things worse. There's no reason for it to be this slow; this patch changes it to loop over unaccepted keys rather than all valid hosts.
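
Something along these lines (the hostname filter is an assumption standing in for whatever validation the real script does):

#!/bin/sh
# Iterate over pending certificate requests only, so the run time is
# proportional to the number of unaccepted keys, not to every host in
# site.pp.
for host in $(puppetca --list); do
    case "$host" in
        *.build.mozilla.org|*.mozilla.org)
            puppetca --sign "$host"
            ;;
    esac
done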
Attachment #462808 - Flags: review?(rail)
Roll-out plan:
* Sync updated scripts to production file store
* Create/verify configs files on production, staging, and mv-production
* Land puppet manifests patch

Once that's done, mv based slaves will start getting switched over to mv-production-puppet as they sync up. There will be burning on 10.5 and 10.6 build machines located in Castro because they sync up every 30 minutes, not just at boot.

The next part is kind of tricky. All Linux build machines (even the ones in MPT) will be re-creating their host keys because of the config file change. We need to clear their keys out on mpt-production-puppet, but not until after each one syncs up. It's going to be a bit of a cat-and-mouse game because of that: watch the logs for Linux slaves connecting, then clear their keys.

Mac build machines are not affected by this.

After that, I'll be watching for any failures and verifying that all slaves are connected to the correct master and syncing up.

Finally, once the Castro machines are all done migrating, we need to remove their old host keys from mpt-production-puppet.


Assuming positive reviews by the end of the week, I plan to land in a downtime on Monday morning.
Attachment #462808 - Flags: review?(rail) → review+
Attachment #462614 - Flags: review?(catlee) → review+
Comment on attachment 462758 [details] [diff] [review]
diff of production vs staging files

Looks fine.
Attachment #462758 - Flags: review?(rail) → review+
Attachment #462614 - Flags: review?(rail) → review+
Attachment #462758 - Flags: review?(catlee) → review+
Blocks: 585605
Comment on attachment 462808 [details] [diff] [review]
speed up host key accept script

Landed
Attachment #462808 - Flags: checked-in+
Comment on attachment 462614 [details] [diff] [review]
multilocation, ready for review

This landed in ac510f20b930

There were two bustage fixes and a few additions/removals of slaves from site files as unreviewed follow-ups.
Attachment #462614 - Flags: checked-in+
Most slaves are up and running successfully after this landing. A few aren't and there are a few more checks to do. That's being tracked in bug 586443. This bug is FIXED, woo!
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Depends on: 593734
Product: mozilla.org → Release Engineering