Closed Bug 760093 Opened 12 years ago Closed 11 years ago

create a basic r5 lion builder ref machine without puppet and puppetagain configs to handle config management

Categories

(Infrastructure & Operations :: RelOps: General, task)

Platform: x86 macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: arich)

Details

(Whiteboard: [buildduty])

Attachments

(5 files)

bhearsum isn't using these anymore and we're going to need ref machines for the r5 lion builders (we don't currently have one), and soon the r5 lion talos machines.
coop: should we call the 10.8 machines: talos-mtnlion-r5-NNN?

I presume we're sticking with the <purpose>-<os>-<platform>-NNN scheme we've been using for several months (bld-centos5-vmw, bld-centos6-hp, bld-lion-r5), but don't know what you want as <os>, since I doubt you want to type out mountainlion every time.
(In reply to Amy Rich [:arich] [:arr] from comment #1)
> coop: should we call the 10.8 machines: talos-mtnlion-r5-NNN?

Yes, that works for me.
DNS, inventory/DHCP, and nagios have been updated.
I've lost touch with both of these systems due to a bad image deployment.  Once I get the workflow and images fixed, I will file a bug with DCops to manually netboot them into their correct states.
Depends on: 770741
I've changed the cltbld and root passwords on bld-lion-r4-ref to match the new ones, but I'm not sure what else, if anything, should be done before an image is taken. I'm going to leave that to jake for when he returns.
Blocks: 759466
Recently imaged bld-lion-r5-* have the old cltbld password.
I:

* changed the cltbld password
* changed the root password
* entered the old keychain password the first time I logged in as cltbld
* turned off auto-login for cltbld, then turned it back on (entering the new password)
* after a reboot I was able to log in as cltbld with the new password via both screen sharing and ssh, and buildbot started and connected (dependent on having puppetized successfully)
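For reference, the manual steps above amount to roughly the following (a hedged sketch using standard OS X tools; the password variables are placeholders, and the exact keychain/auto-login handling may have differed in practice):

----
# change the cltbld and root passwords (run as root; $NEW_PASS/$OLD_PASS are placeholders)
dscl . -passwd /Users/cltbld "$NEW_PASS"
dscl . -passwd /Users/root "$NEW_PASS"

# update the login keychain so it stops prompting with the old password
security set-keychain-password -o "$OLD_PASS" -p "$NEW_PASS" \
    /Users/cltbld/Library/Keychains/login.keychain

# set the auto-login user; note this alone does not regenerate /etc/kcpassword --
# the steps above handled that by toggling auto-login off and on in System Preferences
defaults write /Library/Preferences/com.apple.loginwindow autoLoginUser cltbld
----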
Bear actually has a script that will automate the password changing for the 10.7 machines.
bld-lion-r5-ref had been imaged with a clean 10.7 image and not the actual bld-lion-r5 image since I was having problems with deploying some images.  I'm not sure if it is the images themselves or if it is an issue with deploying across VLANs.  I made another attempt at dropping the bld-lion-r5-ref image on bld-lion-r5-ref.svr.releng.scl3.m.c but it got stuck somewhere in the middle and is now unresponsive.
Depends on: 764948
Depends on: 780221
We've decided to put these into the vlan that their respective DS hosts are in to make image capture easier.

bld-lion-r5-ref.build.releng.scl3.mozilla.com
talos-mtnlion-r5-ref.test.releng.scl3.mozilla.com

dns, inventory, and nagios already updated, netops bug filed to move vlans.
Machines are now up on their appropriate vlans.
talos-mtnlion-r5-ref has been done.  Removing this as a dependency for bug 759466.

bld-lion-r5-ref will take significantly more work since it requires pulling apart the auto-puppetization stuff and rebuilding the default image without those bits.
No longer blocks: 759466
Summary: retask darwin11-signing1.srv.releng.scl3.mozilla.com and darwin11-signing2.srv.releng.scl3.mozilla.com as ref machines for the lion builders and mountain lion talos machine → dissect the image for the lion builders and create a ref machine without puppet
combined with the desire to move these off of old puppet, what we should likely do is start with a fresh install and use puppetagain like we do with mountain lion.
Assignee: jwatkins → dustin
Summary: dissect the image for the lion builders and create a ref machine without puppet → create a basic r5 lion builder ref machine without puppet and puppetagain configs to handle config management
So, the interim plan is actually the XOR of what this bug described: build PuppetAgain manifests that can manage an already-installed system, without figuring out the install process.  We'll then use this to manage hosts for servo (bug 861283).
bld-lion-r5-096 is mine to mess with.
I went through all of the old-puppet manifests to make sure that everything was ported to the new.  Since we hadn't done builders on OS X in puppetagain, and since we don't have any non-mock builders in puppetagain, this turned out to be a lot of stuff.

There are some open questions.  I'll need to find answers to these before finishing this up.

* buildbot master not installed as a DMG, in a different location (buildslave virtualenv), different version (0.8.4, not 0.8.5)

* Do we need to manage the contents of the OS X Dock?  The old manifests do, but that seems silly.

* ~cltbld/.ssh/config - I removed config for hgpvt.m.o, cvs.m.o, and dm-pvtbuild01 (gone now).  Is hgpvt still in use?

* is ~/.purge_builds.cfg required? The default in the script is 14 days, anyway.

* is pkg-dmg install required?  mxr suggests it's downloaded during build, and thus unnecessary.  It's also a bare executable with no installer or other history.

* is upx install required?  Looks like this was only used for partner repacks.  This is also a bare executable, although there's a SourceForge project we could use to build a DMG from source.

* Is there anything in particular in the cacert.pem that gets installed by old-puppet, or is it just making up for an ancient wget version?  We already install a newer wget for talos, without overriding cacert.pem.
(In reply to Dustin J. Mitchell [:dustin] from comment #15)
> I went through all of the old-puppet manifests to make sure that everything
> was ported to the new.  Since we hadn't done builders on OS X in
> puppetagain, and since we don't have any non-mock builders in puppetagain,
> this turned out to be a lot of stuff.
> 
> There are some open questions.  I'll need to find answers to these before
> finishing this up.
> 
> * buildbot master not installed as a DMG, in a different location
> (buildslave virtualenv), different version (0.8.4, not 0.8.5)

Huh, no idea about this. Seems like a bad thing if the old manifests are using an older version of Buildbot. (And I assume you meant "buildbot slave not installed").

> * Do we need to manage the contents of the OS X Dock?  The old manifests do,
> but that seems silly.

We do this on test machines because the dock contents impact Talos results IIRC.

> * is ~/.purge_builds.cfg required? The default in the script is 14 days,
> anyway.

I didn't even know this existed, should be fine to remove.

> * is pkg-dmg install required?  mxr suggests it's downloaded during build,
> and thus unnecessary.  It's also a bare executable with no installer or
> other history.

Not needed on the machines anymore.

> * is upx install required?  Looks like this was only used for partner
> repacks.  This is also a bare executable, although there's a SourceForge
> project we could use to build a DMG from source.

This is still needed. It's used during partner repacks (which all happen on OS X).

> * Is there anything in particular in the cacert.pem that gets installed by
> old-puppet, or is it just making up for an ancient wget version?  We already
> install a newer wget for talos, without overriding cacert.pem.

I can't find anything about why we added this in any bug... not sure. I would guess that a newer wget would be satisfactory, but I'm not terribly certain of that.
(In reply to Ben Hearsum [:bhearsum] from comment #16)
> (In reply to Dustin J. Mitchell [:dustin] from comment #15)
> > * Do we need to manage the contents of the OS X Dock?  The old manifests do,
> > but that seems silly.
> 
> We do this on test machines because the dock contents impact Talos results
> IIRC.
> 
IIRC it was to have a large enough screen resolution to test hardware acceleration.
Removing the dock gave us those extra pixels we were missing. I'm serious! :)
Oh, so the important setting is *hiding* the dock.

Interesting that we don't do that -- that I can see -- for talos in PuppetAgain.  Since these are builders, I'll omit it.

Per IRC, it looks like the buildbot-0.8.5 install isn't even used -- there's a version of Buildbot installed in the buildslave virtualenv that's earlier in the path.
Maybe with rev4 and rev5 it was not necessary since we might have had bigger screen resolutions (on Win7, for example, we could only get higher resolutions with a dongle attached).
(In reply to Ben Hearsum [:bhearsum] from comment #16)
> (In reply to Dustin J. Mitchell [:dustin] from comment #15)
> > * is ~/.purge_builds.cfg required? The default in the script is 14 days, anyway.
> 
> I didn't even know this existed, should be fine to remove.

It's still used by purge_builds.py:
http://hg.mozilla.org/build/tools/file/default/buildfarm/maintenance/purge_builds.py#l162
(In reply to Chris AtLee [:catlee] from comment #20)
> (In reply to Ben Hearsum [:bhearsum] from comment #16)
> > (In reply to Dustin J. Mitchell [:dustin] from comment #15)
> > > * is ~/.purge_builds.cfg required? The default in the script is 14 days, anyway.
> > 
> > I didn't even know this existed, should be fine to remove.
> 
> It's still used by purge_builds.py:
> http://hg.mozilla.org/build/tools/file/default/buildfarm/maintenance/purge_builds.py#l162

Right, but the default value is 14 days, which is the same value that puppet is putting in the config file, so the file is superfluous (and it's probably easier to change the default in the tools repo than in puppet).
The question about .ssh/config is still open, but I'll assume it's harmless.

From IRC, the screen configuration isn't necessary for builds, but should still be enforced, as should VNC.
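
(For context, screen sharing/VNC on OS X is typically switched on with the ARD kickstart tool; this is a sketch of the usual invocation, assumed rather than taken from the old manifests:)

----
KICKSTART=/System/Library/CoreServices/RemoteManagement/ARDAgent.app/Contents/Resources/kickstart
# enable Remote Management / Screen Sharing for all users
"$KICKSTART" -activate -configure -access -on -restart -agent -privs -all
# allow legacy VNC viewers with a password ($VNC_PASS is a placeholder)
"$KICKSTART" -configure -clientopts -setvnclegacy -vnclegacy yes -setvncpw -vncpw "$VNC_PASS"
----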

For the second-stage from-scratch deployment, I'll need to address:
* Xcode - https://groups.google.com/forum/?fromgroups=#!topic/puppet-users/Dvl4i9H1zNw
* Newer nagios with more sane install process
* Build a UPX DMG
* New DMGs for everything that's copied over
Almost ready for r? here.  Apparently the version of 'screenresolution' that works on mountain lion doesn't work on lion:

bld-lion-r5-096:~ root#  /usr/local/bin/screenresolution get
Illegal instruction: 4
Attached patch bug760093.patch (Splinter Review)
OK, this patch runs successfully on top of an already-configured Lion builder:
----
rm -rf /Library/Ruby/Site/1.8/puppet*  /Library/Ruby/Gems/1.8/gems/puppet*
install 2.7.17 DMG by hand:
  hdiutil mount puppet-2.7.17.dmg
  installer -pkg /Volumes/puppet-2.7.17/puppet-2.7.17.pkg -target / 
  hdiutil unmount /Volumes/puppet-2.7.17/ 
puppet --version
rm /etc/puppet/ssl
ln -s /var/lib/puppet/ssl /etc/puppet/ssl
curl http://hg.mozilla.org/users/dmitchell_mozilla.com/puppet320/raw-file/tip/modules/puppet/files/puppetize.sh > puppetize.sh
sh puppetize.sh
----

This doesn't:
* install Xcode, although  https://groups.google.com/forum/?fromgroups=#!topic/puppet-users/Dvl4i9H1zNw suggests it's possible
* rebuild any of the DMGs, or even describe how they're built.  I'd like to rebuild all of these and include the .sh scripts in the final analysis.

----

As for the review: this is primarily intended to get things ready to roll with the Servo buildout, and I strongly suspect this is perfectly adequate to that purpose.

However, this is written as a complete re-implementation of the lion builders, so I would like to get it tested and landed in that capacity reasonably quickly, before it bitrots.  My thinking for how we accomplish that is to land this as-is (with changes per review), then I build a DS image that will puppetize directly into PuppetAgain and fix up the DMGs as described above.  Once that's in place, we can rebuild a few hosts and run them in staging.  If that's successful, we can rebuild a bunch more - maybe just the try pool initially?

If we *can't* commit to actually using this to build production systems in a finite timeframe, then I think we should kill it now, and instead implement a much simpler configuration for Servo builders.
Attachment #744653 - Flags: review?(bhearsum)
Attachment #744653 - Flags: feedback?(kmoir)
Comment on attachment 744653 [details] [diff] [review]
bug760093.patch

Review of attachment 744653 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Dustin J. Mitchell [:dustin] from comment #24)
> If we *can't* commit to actually using this to build production systems in a
> finite timeframe, then I think we should kill it now, and instead implement
> a much simpler configuration for Servo builders.

I don't want to go down the road of using differently configured systems for Servo. We have enough platforms to maintain already. I'll help push this along however I can.

The only thing I see that might be missing vs. the original manifests is disable-screensaver.sh or equivalent. I can't seem to find any code that disables the screensaver. r+ assuming that's been taken care of somewhere.
Attachment #744653 - Flags: review?(bhearsum) → review+
(In reply to Dustin J. Mitchell [:dustin] from comment #24)
> However, this is written as a complete re-implementation of the lion
> builders, so I would like to get it tested and landed in that capacity
> reasonably quickly, before it bitrots.  My thinking for how we accomplish
> that is to land this as-is (with changes per review), then I build a DS
> image that will puppetize directly into PuppetAgain and fix up the DMGs as
> described above.  Once that's in place, we can rebuild a few hosts and run
> them in staging.  If that's successful, we can rebuild a bunch more - maybe
> just try initially?

I'm happy to do some running in staging. If that looks good I think it's better to just try a few hosts out of the try pool, just in case something goes horribly wrong. If they run OK for a period of time (a day? a few days) then we can roll out to the rest of the pool pretty safely, I think. I guess we have to do it somewhat batched or gradually since you'll need to reimage them though, right?
The screensaver is handled in disableservices::common.

So I'll get this landed now, with the understanding that no machines are using it yet (aside from servo when you get that started), and get to work on building a DS image for it.  I'd rather wait until then for staging tests, as otherwise we'll just need to do two trips through staging.
Comment on attachment 744653 [details] [diff] [review]
bug760093.patch

It looks good to me. I was going to check out a slave and puppetize it to verify, but Ben's idea to run it through staging for a few days sounds like a better idea.
Attachment #744653 - Flags: feedback?(kmoir) → feedback+
Comment on attachment 744653 [details] [diff] [review]
bug760093.patch

OK, it's landed.  There shouldn't be any effect.
Attachment #744653 - Flags: checked-in+
Over to Amy temporarily.  If we can get a clean-10.7+puppet setup running in DeployStudio (similar to what we have for MtnLion), then I can work on the second half of this.
Assignee: dustin → arich
Where do we stand on this?  Do we have a working lion ref image based on the "fat" image but with the puppetagain DS task sequences that we can use to deploy, or is that still left to be done?
Currently, the install process looks like this:
 - install the old lion builder image that jhford built
   - it doesn't matter if old-puppet runs on it or not
 - run some commands manually to remove the old puppet and add 2.7.17
 - run puppetize.sh to connect it to puppetagain
 - run puppet agent --test to configure it

The problems with this are:
 - it depends on some pre-installed stuff on the host (at least xcode)
 - several of the DMGs and files installed by PuppetAgain are of unknown origin, and not installed in proper PuppetAgain fashion

We have come up with various proposals for the next step:

1. (short-term) Strip old-puppet out of jhford's image, capture that, and use a DS process similar to that for mtnlion to install them.  This doesn't fix the problem, but makes installs easier.

2. (correct) Fix the problems listed above, and set up a DS process that starts with a bare 10.7 install and follows a process similar to mtnlion from there.

..and the long-term plan..

3. #2, with Shawn's help, in Casper Suite.

This project is competing for my time and space in my working directory with a few other puppetagain projects.  I think it would take about a week of focused time to do #2.
We're going with #2.  I have a loaner in bug 871627 to work on the DMG issues.  Once those are settled, I'll work on the DS part.
Attached patch bug760093-320.patch (Splinter Review)
Minor fixes to run 3.2.0 on Lion:

* check for the proper password hashes being empty
* specify correct puppet version
* use /var/lib/puppet, rather than $settings::vardir (which uses vardir on the master, not the agent).
Assignee: arich → dustin
Attachment #749286 - Flags: review?(bhearsum)
Comment on attachment 749286 [details] [diff] [review]
bug760093-320.patch

Review of attachment 749286 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Dustin J. Mitchell [:dustin] from comment #34)
> We're going with #2.  I have a loaner in bug 871627 to work on the DMG
> issues.  Once those are settled, I'll work on the DS part.

Does this mean that Xcode is going to be installed through Puppet after you're done? Or is that going to be considered part of the bare install?
Attachment #749286 - Flags: review?(bhearsum) → review+
Puppet will install it.  If that turns out to kill the puppetmasters, we'll consider ways to optimize it.
Xcode: 0; Dustin: 1

----
Info: Retrieving plugin
Info: Loading facts in /var/lib/puppet/lib/facter/num_masters.rb
Info: Caching catalog for bld-lion-r5-054.build.releng.scl3.mozilla.com
Info: Applying configuration version '54f5b584675c'
Notice: /Stage[main]/Packages::Xcode/Packages::Pkgdmg[xcode]/Package[xcode-4.1.dmg]/ensure: created
Notice: Finished catalog run in 617.45 seconds
----

And yes, it takes 10m.  It's a 3GB DMG.
For p7zip, yasm, and ccache, I found steps to reproduce the DMGs (and a spec for ccache), but I didn't alter the existing packages.  That will be less risky if done when upgrading than during the current transition.

For upx, which was not packaged, I've rebuilt it from source.  The result before and after:

----
bld-lion-r5-054:~ root# upx --version
upx 3.05
UCL data compression library 1.03
..
----

and it's statically linked, so there's not much else that could go wrong.

Remaining: nrpe (needs to be upgraded) and autoconf (which apparently needs a patch..)
Attached patch bug760093-packages.patch (Splinter Review)
* Xcode - installed using the metapackage from inside InstallXcode.app
* autoconf - DMG script, installed DMG rebuilt
* ccache - DMG script, but installed DMG *not* rebuilt; spec from bug 614848; moved to packages::mozilla
* nrpe - DMG rebuilt, upgraded with new paths and with user creation in the package, manifests adjusted to work with new paths
* p7zip - DMG script written, but installed DMG *not* rebuilt
* upx-3.05 - DMG script written, new DMG
* yasm-1.1.0 - DMG script written, but installed DMG *not* rebuilt
Attachment #751069 - Flags: review?(bhearsum)
Attachment #751069 - Flags: review?(bhearsum) → review+
Attached patch bug760093-screenres.patch (Splinter Review)
Minor tweak to avoid setting resolution on every run
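The idea, sketched in shell (the real change is in the Puppet exec's guard; the resolution value and output format here are illustrative, not necessarily what the manifest uses):

----
# only call "set" when the current resolution differs from the target
current=$(/usr/local/bin/screenresolution get 2>&1)
echo "$current" | grep -q "1600x1200x32" || /usr/local/bin/screenresolution set 1600x1200x32
----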
Attachment #751232 - Flags: review?(bhearsum)
Attachment #751232 - Flags: review?(bhearsum) → review+
OK, June is here, so the plan is:

* reimage a few hosts and put them in staging; then
* reimage all try hosts; then
* reimage all production hosts.

Rail, IIRC you were on the hook to do the releng side of this.  I'll do the reimaging and any fixing that's needed along the way.  Can you let me know which hosts to reimage?
Assignee: dustin → rail
(In reply to Dustin J. Mitchell [:dustin] from comment #42)
> Rail, IIRC you were on the hook to do the releng side of this.  I'll do the
> reimaging and any fixing that's needed along the way.  Can you let me know
> which hosts to reimage?

Not sure that it was me. Let's ask coop.
Assignee: rail → server-ops-releng
Flags: needinfo?(coop)
No longer blocks: 861283
Assignee: server-ops-releng → arich
Looks like bld-lion-r5-003 has already been re-imaged with this process in bug 877840, so let's see how that slave pans out with jobs today.

Aki's on buildduty this week, so please coordinate with him for the rest of the batch re-imaging.
Flags: needinfo?(coop)
Whiteboard: [buildduty]
We got rolling testing this on bug 877840.  Copying some posts over here for context.

---- from coop

Something's not right. It's failing quite early in the config checks during compile:

http://dev-master01.build.scl1.mozilla.com:8044/builders/OS%20X%2010.7%20mozilla-central%20build/builds/1/steps/compile/logs/stdio

Relevant excerpt:

creating cache ./config.cache
checking host system type... x86_64-apple-darwin11.2.0
checking target system type... i386-apple-darwin11.2.0
checking build system type... x86_64-apple-darwin11.2.0
checking for mawk... no
checking for gawk... no
checking for nawk... no
checking for awk... awk
checking for python2.7... /tools/buildbot/bin/python2.7
Creating Python environment
Using real prefix '/tools/python27'
New python executable in /builds/slave/m-cen-osx64-000000000000000000/build/obj-firefox/i386/_virtualenv/bin/python2.7
Also creating executable in /builds/slave/m-cen-osx64-000000000000000000/build/obj-firefox/i386/_virtualenv/bin/python
Overwriting /builds/slave/m-cen-osx64-000000000000000000/build/obj-firefox/i386/_virtualenv/lib/python2.7/distutils/__init__.py with new content
Traceback (most recent call last):
  File "/builds/slave/m-cen-osx64-000000000000000000/build/python/virtualenv/virtualenv.py", line 2563, in <module>
    main()
  File "/builds/slave/m-cen-osx64-000000000000000000/build/python/virtualenv/virtualenv.py", line 964, in main
    never_download=options.never_download)
  File "/builds/slave/m-cen-osx64-000000000000000000/build/python/virtualenv/virtualenv.py", line 1067, in create_environment
    install_distutils(home_dir)
  File "/builds/slave/m-cen-osx64-000000000000000000/build/python/virtualenv/virtualenv.py", line 1596, in install_distutils
    writefile(os.path.join(distutils_path, '__init__.py'), DISTUTILS_INIT)
  File "/builds/slave/m-cen-osx64-000000000000000000/build/python/virtualenv/virtualenv.py", line 458, in writefile
    f = open(dest, 'wb')
IOError: [Errno 13] Permission denied: '/builds/slave/m-cen-osx64-000000000000000000/build/obj-firefox/i386/_virtualenv/lib/python2.7/distutils/__init__.py'
Traceback (most recent call last):
  File "/builds/slave/m-cen-osx64-000000000000000000/build/build/virtualenv/populate_virtualenv.py", line 384, in <module>
    manager.ensure()
  File "/builds/slave/m-cen-osx64-000000000000000000/build/build/virtualenv/populate_virtualenv.py", line 103, in ensure
    return self.build()
  File "/builds/slave/m-cen-osx64-000000000000000000/build/build/virtualenv/populate_virtualenv.py", line 315, in build
    self.create()
  File "/builds/slave/m-cen-osx64-000000000000000000/build/build/virtualenv/populate_virtualenv.py", line 122, in create
    raise Exception('Error creating virtualenv.')
Exception: Error creating virtualenv.

---- me

The problem is that the __init__.py it's trying to write is a symlink to the file under /tools.  We've seen this in a few other bugs, too - bug 805091 and bug 758694 are particularly relevant.

Basically, if you run virtualenv using /tools/buildbot/bin/python2.7 (where /tools/buildbot is a virtualenv), it fails.  If you run virtualenv using /tools/python27/bin/python2.7, which is a real Python install, all is well.

I'm working on how to fix this.
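
For reference, the difference can be reproduced by running virtualenv.py (the copy in the build tree above) with each interpreter; the paths are from the tracebacks, the target directory is illustrative:

  # fails: /tools/buildbot is itself a virtualenv, so the new env's
  # distutils/__init__.py is a symlink into /tools that cltbld can't overwrite
  /tools/buildbot/bin/python2.7 virtualenv.py /tmp/venv-test

  # works: /tools/python27 is a real Python install
  /tools/python27/bin/python2.7 virtualenv.py /tmp/venv-test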

---- me

From this build:
  PATH=/tools/python/bin:/tools/buildbot/bin:/opt/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
...
checking for python2.7... /tools/buildbot/bin/python2.7

From a build on an old-puppet-configured lion system:

 PATH=/tools/python/bin:/tools/buildbot/bin:/opt/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
...
checking for python2.7... /tools/python/bin/python2.7

The problem being /tools/python vs. /tools/python27.

---- me

Created attachment 760593 [details] [diff] [review]
buildbot-configs.patch

So, in the interests of keeping things consistent across platforms, the options are
 1. change PATH in the configs (to include both dirs for now, with the option to remove /tools/python later); or
 2. add a symlink from /tools/python -> /tools/python27 on both OS X and Linux

The attached patch implements #1 - well, it might, that's pretty complicated.

I think I prefer #2, though - then any script that needs any old Python can use /tools/python/bin/python, and any script that needs 2.7.x can use /tools/python27/.

Thoughts?

---- callek

I would happily accept a patch for #2

I would also love if the brunt of this work was moved from the "update password bug" to a bug that can be public ;-)
Assignee: arich → dustin
Attached patch bug760093-python-symlinks.patch (Splinter Review)
This adds /tools/python and /tools/python2, which should help in the eventual migration to python-3.x.

This change will affect all Linux and OS X platforms, once they are upgraded to Puppet-3.2.0.  From what I can see, this can only do good - all of the PATHs for build processes have either /tools/python or /tools/python27 in them, and I see lots of scripts that look for /tools/python/bin/python and fall back to PATH if they don't find it.
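In shell terms, the effect is roughly the following (a sketch; the actual change is in the Puppet manifests, and I'm assuming both links point at /tools/python27 for now):

----
# let scripts use either name while the real install stays at /tools/python27
ln -sfn /tools/python27 /tools/python
ln -sfn /tools/python27 /tools/python2
----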
Attachment #760953 - Flags: review?(coop)
Comment on attachment 760953 [details] [diff] [review]
bug760093-python-symlinks.patch

Review of attachment 760953 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #760953 - Flags: review?(coop) → review+
Comment on attachment 760953 [details] [diff] [review]
bug760093-python-symlinks.patch

I re-ran puppet on bld-lion-r5-003, so this build job should succeed now.  Well, it should get past that point anyway.
Attachment #760953 - Flags: checked-in+
(In reply to Dustin J. Mitchell [:dustin] from comment #48)
> Comment on attachment 760953 [details] [diff] [review]
> bug760093-python-symlinks.patch
> 
> I re-ran puppet on bld-lion-r5-003, so this build job should succeed now. 
> Well, it should get past that point anyway.

Build was successful, but failed on upload. Pretty sure that's just me neglecting to sync the ssh keys before I started, though.

Log is here:

http://dev-master01.build.scl1.mozilla.com:8044/builders/OS%20X%2010.7%20mozilla-central%20build/builds/3
Very cool!

I had expected based on some conversations with callek that hg would fail, since it's not using the "system hg" but is using the hg DMG built for mtnlion.  I don't know what "system hg" is, since OS X doesn't ship with hg, but at any rate, clarity would be good before we deploy this.
Flags: needinfo?(bugspam.Callek)
(In reply to Dustin J. Mitchell [:dustin] from comment #50)
> Very cool!
> 
> I had expected based on some conversations with callek that hg would fail,
> since it's not using the "system hg" but is using the hg DMG built for
> mtnlion.  I don't know what "system hg" is, since OS X doesn't ship with hg,
> but at any rate, clarity would be good before we deploy this.

This did surprise me, but checking:

[cltbld@bld-lion-r5-003.build.releng.scl3.mozilla.com ~]$ hg --version
Mercurial Distributed SCM (version 2.1.1)
(see http://mercurial.selenic.com for more information)

Copyright (C) 2005-2012 Matt Mackall and others
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[cltbld@bld-lion-r5-003.build.releng.scl3.mozilla.com ~]$ which -a hg
/usr/local/bin/hg
[cltbld@bld-lion-r5-003.build.releng.scl3.mozilla.com ~]$ ls -la /usr/local/bin/hg
lrwxr-xr-x  1 501  com.apple.local.ard_admin  23 Jun  5 08:08 /usr/local/bin/hg -> /tools/mercurial/bin/hg

Means I have no blocks on deploying this; in fact I'll probably just build the new hg package this way for 10.7 and 10.8 and let this bug's work [new puppet] be my 10.7 deploy instead ;-)
Flags: needinfo?(bugspam.Callek)
(In reply to Dustin J. Mitchell [:dustin] from comment #14)
> bld-lion-r5-096 is mine to mess with.

Still using this?
Glad I did a bugzilla comment search and didn't just blindly file a dcops reimage bug.
Also somewhat glad other buildduty isn't proactive about buildduty bugs, kind of.
Nope, I returned it 3 weeks ago - bug 871627 comment 2.
I'm reimaging the following with 3.2.0:

bld-lion-r5-022.try.releng.scl3.mozilla.com
bld-lion-r5-023.try.releng.scl3.mozilla.com
bld-lion-r5-030.try.releng.scl3.mozilla.com
bld-lion-r5-031.try.releng.scl3.mozilla.com
bld-lion-r5-034.try.releng.scl3.mozilla.com
bld-lion-r5-035.try.releng.scl3.mozilla.com
bld-lion-r5-036.try.releng.scl3.mozilla.com
bld-lion-r5-037.try.releng.scl3.mozilla.com
bld-lion-r5-039.try.releng.scl3.mozilla.com
bld-lion-r5-042.build.releng.scl3.mozilla.com
bld-lion-r5-043.build.releng.scl3.mozilla.com
bld-lion-r5-051.build.releng.scl3.mozilla.com
bld-lion-r5-052.build.releng.scl3.mozilla.com
bld-lion-r5-054.build.releng.scl3.mozilla.com
bld-lion-r5-066.build.releng.scl3.mozilla.com
bld-lion-r5-078.build.releng.scl3.mozilla.com
bld-lion-r5-085.build.releng.scl3.mozilla.com
bld-lion-r5-088.build.releng.scl3.mozilla.com
bld-lion-r5-096.try.releng.scl3.mozilla.com
Callek just noticed that wget doesn't work - it was most likely built on mountain lion.  So, that will need to be fixed.
There are NRPE problems too, probably also fixed by bug 882869
Depends on: 882869
All of the hosts in comment 54 successfully puppetized.  Well, 078's still chugging but it will get there.  And 096 doesn't exist.
OK, puppet runs to completion, and

[root@bld-lion-r5-031.try.releng.scl3.mozilla.com ~]# wget
wget: missing URL
Usage: wget [OPTION]... [URL]...

Try `wget --help' for more options.

I've re-imaged -031 to check, and I'll run puppet on the others.  Then we should get these into staging to see how they fare.
Puppet's run everywhere.  Coop, can you delegate this for staging?
Assignee: dustin → coop
I'm working through the list of slaves now. 

Both bld-lion-r5-022 and bld-lion-r5-023 weren't doing very well before this process[1]. They'll be poor indicators of success here, but they are back in the try pool now.

bld-lion-r5-043 has better prospects[2], and is back in the build pool. 

Working on getting the ssh keys deployed to the rest of the slaves and will get them in service shortly.

1. https://secure.pub.build.mozilla.org/buildapi/recent/bld-lion-r5-022
   https://secure.pub.build.mozilla.org/buildapi/recent/bld-lion-r5-023

2. https://secure.pub.build.mozilla.org/buildapi/recent/bld-lion-r5-043
All 22 and 23 needed was to be reimaged, so there's no reason not to expect them to be better now.
Well, they didn't even need a reimage, just a clean out of the temp dir, but that's another bug ;)
Oops, they'll need more than a re-run of puppet, since they think wget's already installed.  I'll cook up a script and fix that.  Catlee points out that this host burned a production build :( -- these really should be in staging first!
[root@bld-lion-r5-023.try.releng.scl3.mozilla.com ~]# rm -f /var/db/.puppet_pkgdmg_installed_wget-1.12-1.dmg
[root@bld-lion-r5-023.try.releng.scl3.mozilla.com ~]# puppet agent --test

Info: Retrieving plugin
Info: Loading facts in /var/lib/puppet/lib/facter/facter_dot_d.rb
Info: Loading facts in /var/lib/puppet/lib/facter/num_masters.rb
Info: Loading facts in /var/lib/puppet/lib/facter/pe_version.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_vardir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/root_home.rb
Info: Loading facts in /var/lib/puppet/lib/facter/vmwaretools_version.rb
Info: Caching catalog for bld-lion-r5-023.try.releng.scl3.mozilla.com
Info: Applying configuration version 'b4c08ca56c4e'
Notice: /Stage[main]/Packages::Wget/Packages::Pkgdmg[wget]/Package[wget-1.12-1.dmg]/ensure: created
Notice: Finished catalog run in 44.22 seconds

[root@bld-lion-r5-023.try.releng.scl3.mozilla.com ~]# wget
wget: missing URL
Usage: wget [OPTION]... [URL]...

Try `wget --help' for more options.


(same on all of the hosts in comment 54 that exist).  This should be good to go again.  Coop, do you want to re-enable -043?
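The "script" here was essentially just the two commands above looped over the comment 54 hosts; a sketch, with $HOSTS standing in for that list:

----
for h in $HOSTS; do
  ssh "root@$h" 'rm -f /var/db/.puppet_pkgdmg_installed_wget-1.12-1.dmg && puppet agent --test'
done
----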
(In reply to Dustin J. Mitchell [:dustin] from comment #65) 
> (same on all of the hosts in comment 54 that exist).  This should be good to
> go again.  Coop, do you want to re-enable -043?

Done.
(In reply to Chris Cooper [:coop] from comment #66) 
> Done.

043 is building successfully in staging. I'll enable the rest.
(In reply to Chris Cooper [:coop] from comment #67) 
> 043 is building successfully in staging. I'll enable the rest.

Which is to say, I'll try to determine which of these slaves have other issues and only put the good ones back in production, since there's a real mix of good and bad slaves here.
FYI, the hosts in this silo that are *not* on 3.2.2 are now unmanaged.  Let me know when you feel comfortable with upgrading those.
(In reply to Dustin J. Mitchell [:dustin] from comment #69)
> FYI, the hosts in this silo that are *not* on 3.2.2 are now unmanaged.  Let
> me know when you feel comfortable with upgrading those.

I'm fine with fast-tracking this. How soon do you want to do this?
I'd like to knock this out this week or early next week - basically full speed with any necessary verification, and then I'll do the deployment.
The slaves I've checked from comment #54 are building correctly with 3.2.0. We can proceed with 3.2.2 whenever you're ready.
All of the hosts in comment 54 have been upgraded to 3.2.2 except:

043 which looks like it's in the middle of a build (and will upgrade itself when it next runs puppet)
052 which is down (and will upgrade itself when it next runs puppet)
096 which doesn't exist anymore
I have been working on reimaging these in batches (coop's been copying over keys and putting them back in slavealloc).  Status tracked in the etherpad in the URL field.

Coop: it appears that some of the ones that were done in comment 54 are also disabled.  I don't know if you want to enable those as well (I just copied them into the "done" column when I started).
Assignee: coop → arich
Hrmm, buildbot isn't starting on these re-imaged slaves after a reboot.

[cltbld@bld-lion-r5-005.build.releng.scl3.mozilla.com ~]$ python /usr/local/bin/runslave.py --verbose
writing /builds/slave/buildbot.tac
calculated nagios hostname 'bld-lion-r5-005.build.releng.scl3'
reporting to NSCA daemon on 'bm-admin01.mozilla.org'
Error sending notice to nagios (ignored)
Traceback (most recent call last):
  File "/usr/local/bin/runslave.py", line 385, in try_send_notice
    self.do_send_notice()
  File "/usr/local/bin/runslave.py", line 365, in do_send_notice
    sk.connect((monitoring_host, self.monitoring_port))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
gaierror: [Errno 8] nodename nor servname provided, or not known
The errors are from the bogus code that's leftover from ages ago when slaves were supposed to report to nagios when they rebooted.  That code should definitely be removed (is there a bug for that already?).  I'm not sure if that's what's preventing buildbot from starting, though.
So this python:

[cltbld@bld-lion-r5-005.build.releng.scl3.mozilla.com ~]$ md5 /tools/python27/bin/python2.7
MD5 (/tools/python27/bin/python2.7) = c143495169647a2ccfcef94cd282e2c7

which was built fresh on 10.7 in bug 882869, segfaults when running twistd.

This python:

[root@bld-lion-r5-030.try.releng.scl3.mozilla.com ~]# md5 /tools/python27/bin/python2.7
MD5 (/tools/python27/bin/python2.7) = 4c1ba9ce92aaa3adbb9089a2d1caa8ad

which was built a year ago on 10.8, works fine.
I would note that we also seem to have lost python 2.7.3 somewhere in the shuffle:

https://bugzilla.mozilla.org/show_bug.cgi?id=602908#c35
We're hacking in the 10.8 python by hand.  Boo.  Fix will be in bug 882869.
The r5 builder fleet has been reimaged with this new process and is back in production.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations