Closed
Bug 760093
Opened 12 years ago
Closed 10 years ago
create a basic r5 lion builder ref machine without puppet and puppetagain configs to handle config management
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: arich, Assigned: arich)
Details
(Whiteboard: [buildduty])
Attachments
(5 files)
38.26 KB, patch | bhearsum: review+, kmoir: feedback+, dustin: checked-in+
4.98 KB, patch | bhearsum: review+, dustin: checked-in+
38.26 KB, patch | bhearsum: review+, dustin: checked-in+
1.29 KB, patch | bhearsum: review+, dustin: checked-in+
1.65 KB, patch | coop: review+, dustin: checked-in+
bhearsum isn't using these anymore, and we're going to need ref machines for the r5 lion builders (we don't currently have one), and soon the r5 lion talos machines.
Assignee
Comment 1•12 years ago
coop: should we call the 10.8 machines talos-mtnlion-r5-NNN? I presume we're sticking with the <purpose>-<os>-<platform>-NNN scheme that we've been using for several months (bld-centos5-vmw, bld-centos6-hp, bld-lion-r5), but I don't know what you want as <os>, since I doubt you want to type out mountainlion every time.
Comment 2•12 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #1)
> coop: should we call the 10.8 machines: talos-mtnlion-r5-NNN?

Yes, that works for me.
Comment 3•11 years ago
DNS, inventory/DHCP, and nagios have been updated.
Comment 4•11 years ago
I've lost touch with both of these systems due to a bad image deployment. Once I get the workflow and images fixed, I will file a bug with DCops to manually netboot them into their correct states.
Assignee
Comment 5•11 years ago
I've changed the cltbld and root passwords on bld-lion-r4-ref to match the new ones, but I'm not sure what else, if anything, should be done before an image is taken. I'm going to leave that to jake for when he returns.
Comment 6•11 years ago
Recently imaged bld-lion-r5-* have the old cltbld password. I:

* changed the cltbld pass
* changed the root pass
* entered the old keychain pass the first time I logged in as cltbld
* turned off the auto-login for cltbld, then turned it back on (entered new pass)
* after a reboot I was able to log in as cltbld with the new pass via both screen sharing and ssh, and buildbot started + connected (dependent on having puppetized successfully)
Assignee
Comment 7•11 years ago
Bear actually has a script that will automate the password changing for the 10.7 machines.
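The password-change half of the comment 6 steps reduces to something like this minimal sketch (assuming root access and the stock dscl tool; NEWPASS is a placeholder, not how a real script would handle the secret). The keychain and auto-login steps would still need doing separately:

----
#!/bin/sh
# sketch: reset the slave account passwords on a 10.7 machine (run as root)
NEWPASS='...'   # placeholder; a real script would read this securely, not hardcode it
dscl . -passwd /Users/cltbld "$NEWPASS"
dscl . -passwd /Users/root "$NEWPASS"
----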
Comment 8•11 years ago
bld-lion-r5-ref had been imaged with a clean 10.7 image and not the actual bld-lion-r5 image, since I was having problems with deploying some images. I'm not sure if it is the images themselves or an issue with deploying across VLANs. I made another attempt at dropping the bld-lion-r5-ref image on bld-lion-r5-ref.svr.releng.scl3.m.c, but it got stuck somewhere in the middle and is now unresponsive.
Assignee
Comment 9•11 years ago
We've decided to put these into the vlan that their respective DS hosts are in, to make image capture easier:

bld-lion-r5-ref.build.releng.scl3.mozilla.com
talos-mtnlion-r5-ref.test.releng.scl3.mozilla.com

DNS, inventory, and nagios already updated; netops bug filed to move vlans.
Assignee
Comment 10•11 years ago
Machines are now up on their appropriate vlans.
Assignee
Comment 11•11 years ago
talos-mtnlion-r5-ref has been done. Removing this as a dependency for bug 759466. bld-lion-r5-ref will take significantly more work since it requires pulling apart the auto-puppetization stuff and rebuilding the default image without those bits.
No longer blocks: 759466
Assignee
Updated•11 years ago
Summary: retask darwin11-signing1.srv.releng.scl3.mozilla.com and darwin11-signing2.srv.releng.scl3.mozilla.com as ref machines for the lion builders and mountain lion talos machine → dissect the image for the lion builders and create a ref machine without puppet
Assignee
Comment 12•11 years ago
Combined with the desire to move these off of old puppet, what we should likely do is start with a fresh install and use PuppetAgain, like we do with mountain lion.
Assignee
Updated•11 years ago
Assignee: jwatkins → dustin
Summary: dissect the image for the lion builders and create a ref machine without puppet → create a basic r5 lion builder ref machine without puppet and puppetagain configs to handle config management
Comment 13•11 years ago
So, the interim plan is actually the XOR of what this bug described: build PuppetAgain manifests that can manage an already-installed system, without figuring out the install process. We'll then use this to manage hosts for servo (bug 861283).
Comment 14•11 years ago
bld-lion-r5-096 is mine to mess with.
Comment 15•11 years ago
I went through all of the old-puppet manifests to make sure that everything was ported to the new. Since we hadn't done builders on OS X in puppetagain, and since we don't have any non-mock builders in puppetagain, this turned out to be a lot of stuff.

There are some open questions. I'll need to find answers to these before finishing this up.

* buildbot master not installed as a DMG, in a different location (buildslave virtualenv), different version (0.8.4, not 0.8.5)
* Do we need to manage the contents of the OS X Dock? The old manifests do, but that seems silly.
* ~cltbld/.ssh/config - I removed config for hgpvt.m.o, cvs.m.o, and dm-pvtbuild01 (gone now). Is hgpvt still in use?
* is ~/.purge_builds.cfg required? The default in the script is 14 days, anyway.
* is pkg-dmg install required? mxr suggests it's downloaded during build, and thus unnecessary. It's also a bare executable with no installer or other history.
* is upx install required? Looks like this was only used for partner repacks. This is also a bare executable, although there's a SourceForge project we could use to build a DMG from source.
* Is there anything in particular in the cacert.pem that gets installed by old-puppet, or is it just making up for an ancient wget version? We already install a newer wget for talos, without overriding cacert.pem.
Comment 16•11 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #15)
> * buildbot master not installed as a DMG, in a different location
> (buildslave virtualenv), different version (0.8.4, not 0.8.5)

Huh, no idea about this. Seems like a bad thing if the old manifests are using an older version of Buildbot. (And I assume you meant "buildbot slave not installed".)

> * Do we need to manage the contents of the OS X Dock? The old manifests do,
> but that seems silly.

We do this on test machines because the dock contents impact Talos results, IIRC.

> * is ~/.purge_builds.cfg required? The default in the script is 14 days,
> anyway.

I didn't even know this existed, should be fine to remove.

> * is pkg-dmg install required? mxr suggests it's downloaded during build,
> and thus unnecessary. It's also a bare executable with no installer or
> other history.

Not needed on the machines anymore.

> * is upx install required? Looks like this was only used for partner
> repacks. This is also a bare executable, although there's a SourceForge
> project we could use to build a DMG from source.

This is still needed. It's used during partner repacks (which all happen on OS X).

> * Is there anything in particular in the cacert.pem that gets installed by
> old-puppet, or is it just making up for an ancient wget version? We already
> install a newer wget for talos, without overriding cacert.pem.

I can't find anything about why we added this in any bug... not sure. I would guess that a newer wget would be satisfactory, but I'm not terribly certain of that.
Comment 17•11 years ago
(In reply to Ben Hearsum [:bhearsum] from comment #16)
> > * Do we need to manage the contents of the OS X Dock? The old manifests do,
> > but that seems silly.
>
> We do this on test machines because the dock contents impact Talos results
> IIRC.

IIRC it was to have a large enough screen resolution to test hardware acceleration. Removing the dock gave us those extra pixels we were missing. I'm serious! :)
Comment 18•11 years ago
Oh, so the important setting is *hiding* the dock. Interesting that we don't do that -- that I can see -- for talos in PuppetAgain. Since these are builders, I'll omit it.

Per IRC, it looks like the buildbot-0.8.5 install isn't even used -- there's a version of Buildbot installed in the buildslave virtualenv that's earlier in the path.
Comment 19•11 years ago
Maybe with rev4 and rev5 it was not necessary, as we might have had bigger screen resolutions (on Win7, for example, we could only get higher screen resolutions with a dongle attached).
Comment 20•11 years ago
(In reply to Ben Hearsum [:bhearsum] from comment #16)
> (In reply to Dustin J. Mitchell [:dustin] from comment #15)
> > * is ~/.purge_builds.cfg required? The default in the script is 14 days,
> > anyway.
>
> I didn't even know this existed, should be fine to remove.

It's still used by purge_builds.py:
http://hg.mozilla.org/build/tools/file/default/buildfarm/maintenance/purge_builds.py#l162
Comment 21•11 years ago
(In reply to Chris AtLee [:catlee] from comment #20)
> > > * is ~/.purge_builds.cfg required? The default in the script is 14 days,
> > > anyway.
> >
> > I didn't even know this existed, should be fine to remove.
>
> It's still used by purge_builds.py:
> http://hg.mozilla.org/build/tools/file/default/buildfarm/maintenance/purge_builds.py#l162

Right, but the default value is 14, which is the same value that puppet's putting here, so the config file is superfluous (and it's probably easier to change it in the tools repo than in puppet).
Comment 22•11 years ago
The question about .ssh/config is still open, but I'll assume it's harmless. From IRC, the screen configuration isn't necessary for builds, but should still be enforced, as should VNC.

For the second-stage from-scratch deployment, I'll need to address:

* Xcode - https://groups.google.com/forum/?fromgroups=#!topic/puppet-users/Dvl4i9H1zNw
* Newer nagios with more sane install process
* Build a UPX DMG
* New DMGs for everything that's copied over
Comment 23•11 years ago
Almost ready for r? here. Apparently the version of 'screenresolution' that works on mountain lion doesn't work on lion:

bld-lion-r5-096:~ root# /usr/local/bin/screenresolution get
Illegal instruction: 4
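An "Illegal instruction" from a prebuilt binary usually means it was compiled on/for a newer OS or CPU than the machine running it. A quick way to check what a Mach-O binary targets, assuming the Xcode command-line tools are present:

----
file /usr/local/bin/screenresolution
# if present, LC_VERSION_MIN_MACOSX shows the minimum OS X version the binary was built for
otool -l /usr/local/bin/screenresolution | grep -A 3 LC_VERSION_MIN_MACOSX
----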
Comment 24•11 years ago
OK, this patch runs successfully on top of an already-configured Lion builder:

----
rm -rf /Library/Ruby/Site/1.8/puppet* /Library/Ruby/Gems/1.8/gems/puppet*

# install the 2.7.17 DMG by hand:
hdiutil mount puppet-2.7.17.dmg
installer -pkg /Volumes/puppet-2.7.17/puppet-2.7.17.pkg -target /
hdiutil unmount /Volumes/puppet-2.7.17/
puppet --version

rm /etc/puppet/ssl
ln -s /var/lib/puppet/ssl /etc/puppet/ssl

curl http://hg.mozilla.org/users/dmitchell_mozilla.com/puppet320/raw-file/tip/modules/puppet/files/puppetize.sh > puppetize.sh
sh puppetize.sh
----

This doesn't:

* install Xcode, although https://groups.google.com/forum/?fromgroups=#!topic/puppet-users/Dvl4i9H1zNw suggests it's possible
* rebuild any of the DMGs, or even describe how they're built. I'd like to rebuild all of these and include the .sh scripts in the final analysis.

----

As for the review: this is primarily intended to get things ready to roll with the Servo buildout, and I strongly suspect this is perfectly adequate to that purpose.

However, this is written as a complete re-implementation of the lion builders, so I would like to get it tested and landed in that capacity reasonably quickly, before it bitrots. My thinking for how we accomplish that is to land this as-is (with changes per review), then I build a DS image that will puppetize directly into PuppetAgain and fix up the DMGs as described above. Once that's in place, we can rebuild a few hosts and run them in staging. If that's successful, we can rebuild a bunch more - maybe just try initially?

If we *can't* commit to actually using this to build production systems in a finite timeframe, then I think we should kill it now, and instead implement a much simpler configuration for Servo builders.
Attachment #744653 - Flags: review?(bhearsum)
Attachment #744653 - Flags: feedback?(kmoir)
Comment 25•11 years ago
Comment on attachment 744653 [details] [diff] [review]
bug760093.patch

Review of attachment 744653 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Dustin J. Mitchell [:dustin] from comment #24)
> If we *can't* commit to actually using this to build production systems in a
> finite timeframe, then I think we should kill it now, and instead implement
> a much simpler configuration for Servo builders.

I don't want to go down the road of using differently configured systems for Servo. We have enough platforms to maintain already. I'll help push this along however I can.

The only thing I see that might be missing vs. the original manifests is disable-screensaver.sh or equivalent. I can't seem to find any code that disables the screensaver. r+ assuming that's been taken care of somewhere.
Attachment #744653 - Flags: review?(bhearsum) → review+
Comment 26•11 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #24)
> My thinking for how we accomplish
> that is to land this as-is (with changes per review), then I build a DS
> image that will puppetize directly into PuppetAgain and fix up the DMGs as
> described above. Once that's in place, we can rebuild a few hosts and run
> them in staging. If that's successful, we can rebuild a bunch more - maybe
> just try initially?

I'm happy to do some running in staging. If that looks good, I think it's better to just try a few hosts out of the try pool, just in case something goes horribly wrong. If they run OK for a period of time (a day? a few days) then we can roll out to the rest of the pool pretty safely, I think. I guess we have to do it somewhat batched or gradually since you'll need to reimage them though, right?
Comment 27•11 years ago
The screensaver is handled in disableservices::common. So I'll get this landed now, with the understanding that no machines are using it yet (aside from servo when you get that started), and get to work on building a DS image for it. I'd rather wait until then for staging tests, as otherwise we'll just need to do two trips through staging.
Comment 28•11 years ago
Comment on attachment 744653 [details] [diff] [review]
bug760093.patch

It looks good to me. I was going to check out a slave and puppetize it to verify, but Ben's idea to run it through staging for a few days sounds like a better idea.
Attachment #744653 - Flags: feedback?(kmoir) → feedback+
Comment 29•11 years ago
Comment on attachment 744653 [details] [diff] [review]
bug760093.patch

OK, it's landed. There shouldn't be any effect.
Attachment #744653 - Flags: checked-in+
Comment 30•11 years ago
Over to Amy temporarily. If we can get a clean-10.7+puppet setup running in DeployStudio (similar to what we have for MtnLion), then I can work on the second half of this.
Assignee: dustin → arich
Comment 31•11 years ago
My notes: https://etherpad.mozilla.org/FehXQtJAgu
Assignee
Comment 32•11 years ago
Where do we stand on this? Do we have a working lion ref image based on the "fat" image but with the puppetagain ds task sequences that we can use to deploy, or is that still left to be done?
Comment 33•11 years ago
Currently, the install process looks like this:

- install the old lion builder image that jhford built - it doesn't matter if old-puppet runs on it or not
- run some commands manually to remove the old puppet and add 2.7.17
- run puppetize.sh to connect it to puppetagain
- run puppet agent --test to configure it

The problems with this are:

- it depends on some pre-installed stuff on the host (at least xcode)
- several of the DMGs and files installed by PuppetAgain are of unknown origin, and not installed in proper PuppetAgain fashion

We have come up with various proposals for the next step:

1. (short-term) Strip old-puppet out of jhford's image, capture that, and use a DS process similar to that for mtnlion to install them. This doesn't fix the problem, but makes installs easier.
2. (correct) Fix the problems listed above, and set up a DS process that starts with a bare 10.7 install and follows a process similar to mtnlion from there.

..and the long-term plan..

3. #2 with Shawn's help, in Casper Suite.

This project is competing for my time and space in my working directory with a few other puppetagain projects. I think it would take about a week of focused time to do #2.
Comment 34•11 years ago
We're going with #2. I have a loaner in bug 871627 to work on the DMG issues. Once those are settled, I'll work on the DS part.
Comment 35•11 years ago
Minor fixes to run 3.2.0 on Lion:

* check for the proper password hashes being empty
* specify correct puppet version
* use /var/lib/puppet, rather than $settings::vardir (which uses vardir on the master, not the agent)
Assignee: arich → dustin
Attachment #749286 - Flags: review?(bhearsum)
Comment 36•11 years ago
Comment on attachment 749286 [details] [diff] [review]
bug760093-320.patch

Review of attachment 749286 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Dustin J. Mitchell [:dustin] from comment #34)
> We're going with #2. I have a loaner in bug 871627 to work on the DMG
> issues. Once those are settled, I'll work on the DS part.

Does this mean that Xcode is going to be installed through Puppet after you're done? Or is that going to be considered part of the bare install?
Attachment #749286 - Flags: review?(bhearsum) → review+
Comment 37•11 years ago
Puppet will install it. If that turns out to kill the puppetmasters, we'll consider ways to optimize it.
Updated•11 years ago
Attachment #749286 - Flags: checked-in+
Comment 38•11 years ago
Xcode: 0; Dustin: 1

----
Info: Retrieving plugin
Info: Loading facts in /var/lib/puppet/lib/facter/num_masters.rb
Info: Caching catalog for bld-lion-r5-054.build.releng.scl3.mozilla.com
Info: Applying configuration version '54f5b584675c'
Notice: /Stage[main]/Packages::Xcode/Packages::Pkgdmg[xcode]/Package[xcode-4.1.dmg]/ensure: created
Notice: Finished catalog run in 617.45 seconds
----

And yes, it takes 10m. It's a 3GB DMG.
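For anyone unfamiliar with the pkgdmg pattern used here: it boils down to mounting the DMG, running the installer package inside it, and dropping a marker file so later puppet runs skip the install. Roughly, as a sketch (the mountpoint is illustrative; the marker path matches what we see on the hosts):

----
hdiutil attach -nobrowse -mountpoint /tmp/pkgdmg xcode-4.1.dmg
installer -pkg /tmp/pkgdmg/*.pkg -target /
hdiutil detach /tmp/pkgdmg
# marker checked on subsequent runs; deleting it forces a reinstall
touch /var/db/.puppet_pkgdmg_installed_xcode-4.1.dmg
----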
Comment 39•11 years ago
For p7zip, yasm, and ccache, I found steps to reproduce the DMGs (and a spec for ccache), but I didn't alter the existing packages. That will be less risky if done when upgrading than during the current transition.

For upx, which was not packaged, I've rebuilt it from source. The result before and after:

----
bld-lion-r5-054:~ root# upx --version
upx 3.05
UCL data compression library 1.03
..
----

and it's statically linked, so there's not much else that could go wrong.

Remaining: nrpe (needs to be upgraded) and autoconf (which apparently needs a patch..)
Comment 40•11 years ago
* Xcode - installed using the metapackage from inside InstallXcode.app
* autoconf - DMG script, installed DMG rebuilt
* ccache - DMG script, but installed DMG *not* rebuilt; spec from bug 614848; moved to packages::mozilla
* nrpe - DMG rebuilt, upgraded with new paths and with user creation in the package, manifests adjusted to work with new paths
* p7zip - DMG script written, but installed DMG *not* rebuilt
* upx-3.05 - DMG script written, new DMG
* yasm-1.1.0 - DMG script written, but installed DMG *not* rebuilt
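The DMG scripts above all follow the same general shape: stage the tool's files under a fake root, wrap them in an installer package, and wrap that in a DMG. A minimal sketch for upx (the identifier, staging path, and built-binary path are illustrative, and the real scripts may use a different packaging tool):

----
# stage the statically-linked binary under a fake root
mkdir -p stage/usr/local/bin
cp ./upx stage/usr/local/bin/upx   # path to the freshly built binary (illustrative)
# wrap it in a pkg, then a DMG
pkgbuild --root stage --identifier org.mozilla.upx --version 3.05 upx-3.05.pkg
hdiutil create -volname upx-3.05 -srcfolder upx-3.05.pkg -ov upx-3.05.dmg
----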
Attachment #751069 - Flags: review?(bhearsum)
Updated•11 years ago
Attachment #751069 - Flags: review?(bhearsum) → review+
Updated•11 years ago
Attachment #751069 - Flags: checked-in+
Comment 41•11 years ago
Minor tweak to avoid setting resolution on every run
Attachment #751232 - Flags: review?(bhearsum)
Updated•11 years ago
Attachment #751232 - Flags: review?(bhearsum) → review+
Updated•11 years ago
Attachment #751232 - Flags: checked-in+
Comment 42•11 years ago
OK, June is here, so the plan is:

* reimage a few hosts and put them in staging; then
* reimage all try hosts; then
* reimage all production hosts.

Rail, IIRC you were on the hook to do the releng side of this. I'll do the reimaging and any fixing that's needed along the way. Can you let me know which hosts to reimage?
Assignee: dustin → rail
Comment 43•11 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #42)
> Rail, IIRC you were on the hook to do the releng side of this. I'll do the
> reimaging and any fixing that's needed along the way. Can you let me know
> which hosts to reimage?

Not sure that it was me. Let's ask coop.
Assignee: rail → server-ops-releng
Flags: needinfo?(coop)
Assignee
Updated•10 years ago
Assignee: server-ops-releng → arich
Comment 44•10 years ago
Looks like bld-lion-r5-003 has already been re-imaged with this process in bug 877840, so let's see how that slave pans out with jobs today. Aki's on buildduty this week, so please coordinate with him for the rest of the batch re-imaging.
Flags: needinfo?(coop)
Whiteboard: [buildduty]
Comment 45•10 years ago
We got rolling testing this on bug 877840. Copying some posts over here for context.

----
from coop

Something's not right. It's failing quite early in the config checks during compile:
http://dev-master01.build.scl1.mozilla.com:8044/builders/OS%20X%2010.7%20mozilla-central%20build/builds/1/steps/compile/logs/stdio

Relevant excerpt:

creating cache ./config.cache
checking host system type... x86_64-apple-darwin11.2.0
checking target system type... i386-apple-darwin11.2.0
checking build system type... x86_64-apple-darwin11.2.0
checking for mawk... no
checking for gawk... no
checking for nawk... no
checking for awk... awk
checking for python2.7... /tools/buildbot/bin/python2.7
Creating Python environment
Using real prefix '/tools/python27'
New python executable in /builds/slave/m-cen-osx64-000000000000000000/build/obj-firefox/i386/_virtualenv/bin/python2.7
Also creating executable in /builds/slave/m-cen-osx64-000000000000000000/build/obj-firefox/i386/_virtualenv/bin/python
Overwriting /builds/slave/m-cen-osx64-000000000000000000/build/obj-firefox/i386/_virtualenv/lib/python2.7/distutils/__init__.py with new content
Traceback (most recent call last):
  File "/builds/slave/m-cen-osx64-000000000000000000/build/python/virtualenv/virtualenv.py", line 2563, in <module>
    main()
  File "/builds/slave/m-cen-osx64-000000000000000000/build/python/virtualenv/virtualenv.py", line 964, in main
    never_download=options.never_download)
  File "/builds/slave/m-cen-osx64-000000000000000000/build/python/virtualenv/virtualenv.py", line 1067, in create_environment
    install_distutils(home_dir)
  File "/builds/slave/m-cen-osx64-000000000000000000/build/python/virtualenv/virtualenv.py", line 1596, in install_distutils
    writefile(os.path.join(distutils_path, '__init__.py'), DISTUTILS_INIT)
  File "/builds/slave/m-cen-osx64-000000000000000000/build/python/virtualenv/virtualenv.py", line 458, in writefile
    f = open(dest, 'wb')
IOError: [Errno 13] Permission denied: '/builds/slave/m-cen-osx64-000000000000000000/build/obj-firefox/i386/_virtualenv/lib/python2.7/distutils/__init__.py'
Traceback (most recent call last):
  File "/builds/slave/m-cen-osx64-000000000000000000/build/build/virtualenv/populate_virtualenv.py", line 384, in <module>
    manager.ensure()
  File "/builds/slave/m-cen-osx64-000000000000000000/build/build/virtualenv/populate_virtualenv.py", line 103, in ensure
    return self.build()
  File "/builds/slave/m-cen-osx64-000000000000000000/build/build/virtualenv/populate_virtualenv.py", line 315, in build
    self.create()
  File "/builds/slave/m-cen-osx64-000000000000000000/build/build/virtualenv/populate_virtualenv.py", line 122, in create
    raise Exception('Error creating virtualenv.')
Exception: Error creating virtualenv.

----
me

The problem is that the __init__.py it's trying to write is a symlink to the file under /tools. We've seen this in a few other bugs, too - bug 805091 and bug 758694 are particularly relevant. Basically, if you run virtualenv using /tools/buildbot/bin/python2.7 (where /tools/buildbot is a virtualenv), it fails. If you run virtualenv using /tools/python27/bin/python2.7, which is a real Python install, all is well. I'm working on how to fix this.

----
me

From this build:

PATH=/tools/python/bin:/tools/buildbot/bin:/opt/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
...
checking for python2.7... /tools/buildbot/bin/python2.7

From a build on an old-puppet-configured lion system:

PATH=/tools/python/bin:/tools/buildbot/bin:/opt/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
...
checking for python2.7... /tools/python/bin/python2.7

The problem being /tools/python vs. /tools/python27.

----
me

Created attachment 760593 [details] [diff] [review]
buildbot-configs.patch

So, in the interests of keeping things consistent across platforms, the options are:

1. change PATH in the configs (to include both dirs for now, with the option to remove /tools/python later); or
2. add a symlink from /tools/python -> /tools/python27 on both OS X and Linux

The attached patch implements #1 - well, it might, that's pretty complicated. I think I prefer #2, though - then any script that needs any old Python can use /tools/python/bin/python, and any script that needs 2.7.x can use /tools/python27/. Thoughts?

----
callek

I would happily accept a patch for #2. I would also love if the brunt of this work was moved from the "update password bug" to a bug that can be public ;-)
Updated•10 years ago
Assignee: arich → dustin
Comment 46•10 years ago
This adds /tools/python and /tools/python2, which should help in the eventual migration to python-3.x. This change will affect all Linux and OS X platforms, once they are upgraded to Puppet-3.2.0. From what I can see, this can only do good - all of the PATHs for build processes have either /tools/python or /tools/python27 in them, and I see lots of scripts that look for /tools/python/bin/python and fall back to PATH if they don't find it.
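On each host the net effect is just two symlinks; assuming (as described above) both point at the 2.7 install, the by-hand equivalent would be:

----
# effect of the symlink patch, done manually (sketch; puppet manages these as file resources)
ln -sfn /tools/python27 /tools/python
ln -sfn /tools/python27 /tools/python2
----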
Attachment #760953 - Flags: review?(coop)
Comment 47•10 years ago
Comment on attachment 760953 [details] [diff] [review]
bug760093-python-symlinks.patch

Review of attachment 760953 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #760953 - Flags: review?(coop) → review+
Comment 48•10 years ago
Comment on attachment 760953 [details] [diff] [review]
bug760093-python-symlinks.patch

I re-ran puppet on bld-lion-r5-003, so this build job should succeed now. Well, it should get past that point anyway.
Attachment #760953 - Flags: checked-in+
Comment 49•10 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #48)
> I re-ran puppet on bld-lion-r5-003, so this build job should succeed now.
> Well, it should get past that point anyway.

Build was successful, but failed on upload. Pretty sure that's just me neglecting to sync the ssh keys before I started, though. Log is here:
http://dev-master01.build.scl1.mozilla.com:8044/builders/OS%20X%2010.7%20mozilla-central%20build/builds/3
Comment 50•10 years ago
Very cool!

I had expected based on some conversations with callek that hg would fail, since it's not using the "system hg" but is using the hg DMG built for mtnlion. I don't know what "system hg" is, since OS X doesn't ship with hg, but at any rate, clarity would be good before we deploy this.
Flags: needinfo?(bugspam.Callek)
Comment 51•10 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #50)
> I had expected based on some conversations with callek that hg would fail,
> since it's not using the "system hg" but is using the hg DMG built for
> mtnlion.

This did surprise me, but checking:

[cltbld@bld-lion-r5-003.build.releng.scl3.mozilla.com ~]$ hg --version
Mercurial Distributed SCM (version 2.1.1)
(see http://mercurial.selenic.com for more information)

Copyright (C) 2005-2012 Matt Mackall and others
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[cltbld@bld-lion-r5-003.build.releng.scl3.mozilla.com ~]$ which -a hg
/usr/local/bin/hg
[cltbld@bld-lion-r5-003.build.releng.scl3.mozilla.com ~]$ ls -la /usr/local/bin/hg
lrwxr-xr-x 1 501 com.apple.local.ard_admin 23 Jun 5 08:08 /usr/local/bin/hg -> /tools/mercurial/bin/hg

Means I have no blocks on deploying this; in fact, I'll probably just build the new hg package this way for 10.7 and 10.8 and let this bug's work [new puppet] be my 10.7 deploy instead ;-)
Flags: needinfo?(bugspam.Callek)
Comment 52•10 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #14)
> bld-lion-r5-096 is mine to mess with.

Still using this? Glad I did a bugzilla comment search and didn't just blindly file a dcops reimage bug. Also somewhat glad other buildduty isn't proactive about buildduty bugs, kind of.
Blocks: bld-lion-r5-096
Comment 53•10 years ago
Nope, I returned it 3 weeks ago - bug 871627 comment 2.
Comment 54•10 years ago
I'm reimaging the following with 3.2.0:

bld-lion-r5-022.try.releng.scl3.mozilla.com
bld-lion-r5-023.try.releng.scl3.mozilla.com
bld-lion-r5-030.try.releng.scl3.mozilla.com
bld-lion-r5-031.try.releng.scl3.mozilla.com
bld-lion-r5-034.try.releng.scl3.mozilla.com
bld-lion-r5-035.try.releng.scl3.mozilla.com
bld-lion-r5-036.try.releng.scl3.mozilla.com
bld-lion-r5-037.try.releng.scl3.mozilla.com
bld-lion-r5-039.try.releng.scl3.mozilla.com
bld-lion-r5-042.build.releng.scl3.mozilla.com
bld-lion-r5-043.build.releng.scl3.mozilla.com
bld-lion-r5-051.build.releng.scl3.mozilla.com
bld-lion-r5-052.build.releng.scl3.mozilla.com
bld-lion-r5-054.build.releng.scl3.mozilla.com
bld-lion-r5-066.build.releng.scl3.mozilla.com
bld-lion-r5-078.build.releng.scl3.mozilla.com
bld-lion-r5-085.build.releng.scl3.mozilla.com
bld-lion-r5-088.build.releng.scl3.mozilla.com
bld-lion-r5-096.try.releng.scl3.mozilla.com
Updated•10 years ago
No longer blocks: bld-lion-r5-096
Comment 55•10 years ago
Callek just noticed that wget doesn't work - it was most likely built on mountain lion. So, that will need to be fixed.
Comment 56•10 years ago
There are NRPE problems too, probably also fixed by bug 882869
Depends on: 882869
Comment 57•10 years ago
All of the hosts in comment 54 successfully puppetized. Well, 078's still chugging but it will get there. And 096 doesn't exist.
Comment 58•10 years ago
OK, puppet runs to completion, and:

[root@bld-lion-r5-031.try.releng.scl3.mozilla.com ~]# wget
wget: missing URL
Usage: wget [OPTION]... [URL]...

Try `wget --help' for more options.

I've re-imaged -031 to check, and I'll run puppet on the others. Then we should get these into staging to see how they fare.
Comment 59•10 years ago
Puppet's run everywhere. Coop, can you delegate this for staging?
Assignee: dustin → coop
Comment 60•10 years ago
I'm working through the list of slaves now.

Both bld-lion-r5-022 and bld-lion-r5-023 weren't doing very well before this process[1]. They'll be poor indicators of success here, but they are back in the try pool now.

bld-lion-r5-043 has better prospects[2], and is back in the build pool.

Working on getting the ssh keys deployed to the rest of the slaves and will get them in service shortly.

1. https://secure.pub.build.mozilla.org/buildapi/recent/bld-lion-r5-022
   https://secure.pub.build.mozilla.org/buildapi/recent/bld-lion-r5-023
2. https://secure.pub.build.mozilla.org/buildapi/recent/bld-lion-r5-043
Comment 61•10 years ago
All 22 and 23 needed was to be reimaged, so there's no reason not to expect them to be better now.
Comment 62•10 years ago
Well, they didn't even need a reimage, just a clean out of the temp dir, but that's another bug ;)
Comment 63•10 years ago
I've disabled bld-lion-r5-043 for this:

[cltbld@bld-lion-r5-043.build.releng.scl3.mozilla.com ~]$ wget http://hg.mozilla.org/releases/mozilla-release/raw-file/FIREFOX_22_0_RELEASE/build/package/mac_osx/pkg-dmg
Illegal instruction: 4
[cltbld@bld-lion-r5-043.build.releng.scl3.mozilla.com ~]$ wget
Illegal instruction: 4
Comment 64•10 years ago
Oops, they'll need more than a re-run of puppet, since they think wget's already installed. I'll cook up a script and fix that. Catlee points out that this host burned a production build :( -- these really should be in staging first!
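The script only needs to clear the pkgdmg marker and re-run puppet on each affected host; a sketch, with hosts.txt standing in for the comment 54 list (hypothetical file):

----
#!/bin/sh
# sketch: force a reinstall of the rebuilt wget DMG across the affected hosts
for h in $(cat hosts.txt); do
  ssh "root@$h" 'rm -f /var/db/.puppet_pkgdmg_installed_wget-1.12-1.dmg && puppet agent --test'
done
----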
Comment 65•10 years ago
[root@bld-lion-r5-023.try.releng.scl3.mozilla.com ~]# rm -f /var/db/.puppet_pkgdmg_installed_wget-1.12-1.dmg
[root@bld-lion-r5-023.try.releng.scl3.mozilla.com ~]# puppet agent --test
Info: Retrieving plugin
Info: Loading facts in /var/lib/puppet/lib/facter/facter_dot_d.rb
Info: Loading facts in /var/lib/puppet/lib/facter/num_masters.rb
Info: Loading facts in /var/lib/puppet/lib/facter/pe_version.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_vardir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/root_home.rb
Info: Loading facts in /var/lib/puppet/lib/facter/vmwaretools_version.rb
Info: Caching catalog for bld-lion-r5-023.try.releng.scl3.mozilla.com
Info: Applying configuration version 'b4c08ca56c4e'
Notice: /Stage[main]/Packages::Wget/Packages::Pkgdmg[wget]/Package[wget-1.12-1.dmg]/ensure: created
Notice: Finished catalog run in 44.22 seconds
[root@bld-lion-r5-023.try.releng.scl3.mozilla.com ~]# wget
wget: missing URL
Usage: wget [OPTION]... [URL]...

Try `wget --help' for more options.

(same on all of the hosts in comment 54 that exist). This should be good to go again. Coop, do you want to re-enable -043?
Comment 66•10 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #65)
> (same on all of the hosts in comment 54 that exist). This should be good to
> go again. Coop, do you want to re-enable -043?

Done.
Comment 67•10 years ago
(In reply to Chris Cooper [:coop] from comment #66)
> Done.

043 is building successfully in staging. I'll enable the rest.
Comment 68•10 years ago
(In reply to Chris Cooper [:coop] from comment #67)
> 043 is building successfully in staging. I'll enable the rest.

Which is to say, I'll try to determine which of these slaves have other issues and only put the good ones back in production, since there's a real mix of good and bad slaves here.
Comment 69•10 years ago
FYI, the hosts in this silo that are *not* on 3.2.2 are now unmanaged. Let me know when you feel comfortable with upgrading those.
Comment 70•10 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #69)
> FYI, the hosts in this silo that are *not* on 3.2.2 are now unmanaged. Let
> me know when you feel comfortable with upgrading those.

I'm fine with fast-tracking this. How soon do you want to do this?
Comment 72•10 years ago
I'd like to knock this out this week or early next week - basically full speed with any necessary verification, and then I'll do the deployment.
Comment 73•10 years ago
The slaves I've checked from comment #54 are building correctly with 3.2.0. We can proceed with 3.2.2 whenever you're ready.
Assignee
Comment 74•10 years ago
All of the hosts in comment 54 have been upgraded to 3.2.2 except:

043, which looks like it's in the middle of a build (and will upgrade itself when it next runs puppet)
052, which is down (and will upgrade itself when it next runs puppet)
096, which doesn't exist anymore
Assignee
Comment 75•10 years ago
I have been working on reimaging these in batches (coop's been copying over keys and putting them back in slavealloc). Status tracked in the etherpad in the URL field. Coop: it appears that some of the ones that were done in comment 54 are also disabled. I don't know if you want to enable those as well (I just copied in them in to the "done" column when I started).
Assignee: coop → arich
Comment 76•10 years ago
Hrmm, buildbot isn't starting on these re-imaged slaves after a reboot.

[cltbld@bld-lion-r5-005.build.releng.scl3.mozilla.com ~]$ python /usr/local/bin/runslave.py --verbose
writing /builds/slave/buildbot.tac
calculated nagios hostname 'bld-lion-r5-005.build.releng.scl3'
reporting to NSCA daemon on 'bm-admin01.mozilla.org'
Error sending notice to nagios (ignored)
Traceback (most recent call last):
  File "/usr/local/bin/runslave.py", line 385, in try_send_notice
    self.do_send_notice()
  File "/usr/local/bin/runslave.py", line 365, in do_send_notice
    sk.connect((monitoring_host, self.monitoring_port))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
gaierror: [Errno 8] nodename nor servname provided, or not known
Assignee
Comment 77•10 years ago
The errors are from the bogus code that's left over from ages ago, when slaves were supposed to report to nagios when they rebooted. That code should definitely be removed (is there a bug for that already?). I'm not sure if that's what's preventing buildbot from starting, though.
Comment 78•10 years ago
So this python:

[cltbld@bld-lion-r5-005.build.releng.scl3.mozilla.com ~]$ md5 /tools/python27/bin/python2.7
MD5 (/tools/python27/bin/python2.7) = c143495169647a2ccfcef94cd282e2c7

which was built fresh on 10.7 in bug 882869, segfaults when running twistd. This python:

[root@bld-lion-r5-030.try.releng.scl3.mozilla.com ~]# md5 /tools/python27/bin/python2.7
MD5 (/tools/python27/bin/python2.7) = 4c1ba9ce92aaa3adbb9089a2d1caa8ad

which was built a year ago on 10.8, works fine.
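A quick fleet-wide comparison makes the bad build easy to spot; a sketch, with hosts.txt again standing in for the comment 54 list (hypothetical file):

----
# print each host's python2.7 checksum side by side
for h in $(cat hosts.txt); do
  printf '%s: ' "$h"
  ssh "root@$h" md5 -q /tools/python27/bin/python2.7
done
----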
Comment 79•10 years ago
I would note that we also seem to have lost python 2.7.3 somewhere in the shuffle: https://bugzilla.mozilla.org/show_bug.cgi?id=602908#c35
Comment 80•10 years ago
We're hacking in the 10.8 python by hand. Boo. Fix will be in bug 882869.
Assignee
Comment 81•10 years ago
The r5 builder fleet has been reimaged with this new process and is back in production.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•10 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations