deploy Buildbot-0.8.4-pre-moz1 to the puppet buildslaves

RESOLVED FIXED

Status

P2
normal
8 years ago
5 years ago

People

(Reporter: dustin, Assigned: dustin)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [puppet])

Attachments

(2 attachments, 1 obsolete attachment)

For a number of bugs (listed as deps to this one), we'll need to get a new version of Buildbot deployed across the pool.

Right now the plan is to make this 0.8.0r1.  The code deployed now is 0.8.0 plus some backported fixes.

The alternative is to upgrade to 0.8.3, but that's a lot of changes to take. Upgrading to 0.8.3 will be easier later, when everything is run from runslave.py, since it skips 'buildbot start' (which has been renamed to 'buildslave start' in 0.8.1 and up).
Depends on: 631854
Depends on: 565397
Depends on: 626486
From discussion on bug 631854, we should upgrade the slaves to 0.8.3, or more accurately to the current upstream HEAD, which is 0.8.3 plus some patches we want.  From my probably-imperfect 'hg diff' invocations, it looks like the only local patch that must be applied is the mozilla properties command.

We can avoid drama due to 'buildbot'->'buildslave' by rolling this out at the same time as runslave.py on every machine.  runslave.py does not use the 'buildslave' command.

Once the blocker bugs are done, I'll need to stage this on each platform as a proof of concept before deploying globally.
Summary: deploy a new version of Buildbot to the buildslaves → deploy Buildbot-0.8.3+patches to the buildslaves
Priority: -- → P4
Whiteboard: [puppet]
Priority: P4 → P2
Assignee: nobody → dustin
Depends on: 635007
This will be changing the default location of the buildbot code from
 /tools/buildbot
to
 /tools/buildbot-slave
which nicely keeps new things out of the way of the old.

The existing install of /tools/buildbot is done differently on each sort of slave, with no clear reason.  In some cases, it's built on a hand-compiled version of Python.  In some cases, that version was hand-compiled and installed in /tools/python-X.Y.Z.  In some cases, it's the system Python.

The best I can discern from IRC is that some of the runtime scripts may be using /tools/buildbot/bin/python, and as such that had to be Python 2.6 or higher.

So, a few notes:

 1. As written, this will use the system Python (/usr/bin/python) everywhere.  This can be changed easily, but I won't change it until I see a reason to.

 2. /tools/buildbot/bin/python needs to stick around for the benefit of ateam scripts, with a new bug to explore this particular interaction (probably just blowing away the directory on staging and seeing what breaks)

 3. /tools/buildbot/bin and /tools/python/bin are in $PATH at least on try-mac-slaveNN.  They shouldn't be.  This may mess with some ateam scripts, too - in fact, maybe that's how they're using /tools/buildbot/bin/python?
Catlee rightly points out that the slaves need to have the master code installed, because they run 'buildbot sendchange'.  They also need to have buildbot in their PATH, so point 3, above, is invalid.
Depends on: 635296
My plan is to get this installed and operational on each flavor of slave, and then deploy it in such a way that 0.8.4 is installed everywhere but only active on staging slaves.  Then we can look for any trouble in staging before rolling it out everywhere.
Summary: deploy Buildbot-0.8.3+patches to the buildslaves → deploy Buildbot-0.8.4-pre-moz1 to the puppet buildslaves
Blocks: 627126
Blocks: 637349
Scripts that the slaves run depend on simplejson explicitly (and do not fall back to or from the built-in json module - bug 637508).  So that will need to go into the virtualenv as well.
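
For reference, the fallback those scripts lack would look roughly like the sketch below. This is only an illustration of the usual import pattern, not code from any of the slave scripts; parse_payload is a hypothetical helper name. The built-in json module only exists in Python 2.6 and later, so on older Pythons simplejson has to be installed either way.

```python
# Prefer simplejson when it is installed (e.g. in the virtualenv),
# and fall back to the standard-library json module otherwise.
try:
    import simplejson as json
except ImportError:
    import json  # built in since Python 2.6


def parse_payload(text):
    """Decode a JSON string with whichever module was importable.

    Hypothetical helper, here only to show that callers do not need
    to care which implementation was picked.
    """
    return json.loads(text)
```

With this pattern the virtualenv could drop simplejson on any platform whose Python is new enough, but until the scripts are changed it still needs to be installed.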
I'm also seeing a lot of

rm -rf build
 in dir /home/cltbld/talos-slave/test/. (timeout 1200 secs)
 watching logfiles {}
 argv: ['rm', '-rf', 'build']
 ...
 closing stdin
 using PTY: True
process killed by signal 1
program finished with exit code -1

which is caused by usepty=1 on the slaves.  I see this failure in production, too, so I'm not going to worry about it.  Once slavealloc is live and 0.8.4 is out, we can disable usepty.
Comment 6 is related to bug 631854, by the way.
Created attachment 515842 [details] [diff] [review]
m631851-puppet-manifests-r1.patch

I also ran into some trouble with idle slaves being marked as disconnected due to NAT timeouts (bug 637541).  Again, nothing new.

I've been running this on
 talos-r3-fed-001
 talos-r3-fed64-001
 talos-r3-snow-001
 talos-r3-leopard-002
 linux-ix-slave01
 moz2-darwin9-slave08
 moz2-darwin10-slave03

At this point, I've seen enough green runs that I'd like to deploy this new version universally to staging.  Then we can shake out any bugs before pushing to production.
Attachment #515842 - Flags: review?(bhearsum)
Comment on attachment 515842 [details] [diff] [review]
m631851-puppet-manifests-r1.patch

Correct me if I'm wrong, but this patch doesn't cause anything to be deployed, does it? I don't see buildslave::install::production being used anywhere...I'm probably missing something though.

>+    # platform_python is whatever's available on this platform. If that's not
>+    # good enough, we should start installing Pythons with Puppet.
>+    $platform_python = "/usr/bin/python"

This assumption isn't correct in all cases. We use a Python out of /tools for all linux/mac build slaves and one out of ~cltbld for fed, fed64, and snow leopard. (On snow leopard, it's just a symlink to /usr/bin/python, though). Leopard uses the default system Python. We need to continue using these.

Do you have a solution for removing old, unwanted versions of Buildbot?
Attachment #515842 - Flags: review?(bhearsum) → review-
(In reply to comment #9)
> Correct me if I'm wrong, but this patch doesn't cause anything to be deployed,
> does it? I don't see buildslave::install::production being used anywhere...I'm
> probably missing something though.

Apparently I forgot to qrefresh.  I'll do so for the next patch.

> >+    # platform_python is whatever's available on this platform. If that's not
> >+    # good enough, we should start installing Pythons with Puppet.
> >+    $platform_python = "/usr/bin/python"
> 
> This assumption isn't correct in all cases. We use a Python out of /tools for
> all linux/mac build slaves and one out of ~cltbld for fed, fed64, and snow
> leopard. (On snow leopard, it's just a symlink to /usr/bin/python, though).
> Leopard uses the default system Python. We need to continue using these.

Well, you're correct that /usr/bin/python isn't good enough in all cases, but your suggestions of which pythons to use are also incorrect :)

Based on md5's of the existing /tools/buildbot/bin/python and various other pythons (and noting that the md5's change when virtualenv copies the Python binary around on mac, probably due to a small resource fork):

 test
  linux - /usr/bin/python (2.6.2)
  linux64 - /usr/bin/python (2.6.2)
  darwin9 - /usr/bin/python (2.5.1)
  darwin10 - /usr/bin/python (2.6.1)
 build
  linux - /tools/python-2.6.5/bin/python
  darwin9 - /tools/python/bin/python
  darwin10 - /tools/python-2.6.4/bin/python

(I'm surprised by test-darwin9, honestly, but /tools/buildbot/bin/python and /usr/bin/python both give the same build info, with version 2.5.1)
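
The comparison behind the list above can be sketched as follows. This is a rough reconstruction of the idea, not the commands actually run; the function names are made up for the example. As noted, a hash match is only a hint on mac, since the md5 changes when virtualenv copies the binary.

```python
import hashlib


def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matching_pythons(reference, candidates):
    """Return the candidate paths whose contents hash the same as `reference`.

    Candidates that are missing or unreadable on this platform are skipped.
    """
    ref_digest = md5_of(reference)
    matches = []
    for path in candidates:
        try:
            if md5_of(path) == ref_digest:
                matches.append(path)
        except OSError:
            continue
    return matches


# Example call on a slave (paths taken from the list above):
# matching_pythons("/tools/buildbot/bin/python",
#                  ["/usr/bin/python", "/tools/python/bin/python"])
```

When nothing matches, comparing the build info and version reported by each interpreter is the fallback, which is how test-darwin9 was identified.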

This will need a new Puppet module to clean it up and do installs the same way everywhere (particularly since build-darwin9 has some mac-like /tools/python/Python.framework/Versions/2.6/bin/python thing going on), but that's not necessary at the moment.

> Do you have a solution for removing old, unwanted versions of Buildbot?

ensure => absent
Created attachment 516078 [details] [diff] [review]
m631851-puppet-manifests-r2.patch

I was wrong about the ensure => absent in the last patch, but I added it here :)
Attachment #515842 - Attachment is obsolete: true
Attachment #516078 - Flags: review?(bhearsum)
Comment on attachment 516078 [details] [diff] [review]
m631851-puppet-manifests-r2.patch

Sorry this sat for so long :(.
Attachment #516078 - Flags: review?(bhearsum) → review+
Comment on attachment 516078 [details] [diff] [review]
m631851-puppet-manifests-r2.patch

552df8913261

deployed everywhere, although it only affects staging (one hopes!)
Attachment #516078 - Flags: checked-in+
This seems to be going well so far - Aki's killing a bunch of builds on sm01 and the slaves are doing well at killing the underlying processes.
Created attachment 520308 [details] [diff] [review]
m631851-puppet-manifests-productiondeploy-r1.patch

This seems fine in staging so far - let's roll it out!
Attachment #520308 - Flags: review?(bhearsum)
Attachment #520308 - Flags: review?(bhearsum) → review+
Hm, it looks like the Python path is wrong for talos-r3-snow-*.  Sigh.
Scratch that, talos-r3-snow-002 has the wrong version of Mac OS X installed (10.2.0 instead of 10.6.0).  This is still OK to roll out, once jhford's done with the linux64 stuff.
Problems with talos-r3-fed-002, too: bug 645012
Depends on: 645012
Just need to test this on a moz2-linux64-slaveNN machine, and it will be ready to deploy.
Attachment #520308 - Flags: checked-in+
OK, this is deployed on all puppet masters now, and seems to be going smoothly - at least, I've seen a bunch of slaves come up in production with the appropriate version.  I'll keep watching puppet master logs to see if there are any machines constantly pinging (which would indicate a puppet failure).
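
A minimal sketch of the "constantly pinging" check, assuming the hostnames have already been pulled out of the puppet master logs for some time window (the log format itself is not assumed here). The function name and the threshold are hypothetical and would need tuning against how often a healthy slave checks in.

```python
from collections import Counter


def frequent_checkins(hostnames, threshold=5):
    """Return hosts that appear at least `threshold` times, sorted by name.

    `hostnames` is any iterable with one entry per check-in seen in the
    window being examined. A host that shows up far more often than its
    peers is a candidate for a failing puppet run.
    """
    counts = Counter(hostnames)
    return sorted(host for host, n in counts.items() if n >= threshold)
```

For example, frequent_checkins(["a", "a", "b"], threshold=2) flags only "a".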
hooray!
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Depends on: 646710
Product: mozilla.org → Release Engineering