Closed Bug 548146 Opened 14 years ago Closed 14 years ago

deploy maemo 5 scratchbox to linux32 build slaves

Categories

(Release Engineering :: General, defect, P2)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jhford, Assigned: jhford)

References

Details

(Whiteboard: [maemo5][missing documentation])

Attachments

(5 files, 12 obsolete files)

1.17 KB, text/plain
Details
656 bytes, text/plain
Details
3.22 KB, text/plain
Details
2.09 KB, text/plain
bhearsum
: review+
bhearsum
: checked-in+
Details
6.70 KB, patch
bhearsum
: review+
jhford
: checked-in+
Details | Diff | Splinter Review
We are going to need to put scratchbox onto the 32-bit linux slaves.  I am almost certain that the new scratchbox is a self-contained package, and we can create a tarball with the needed files.  Unlike the older scratchbox, the new scratchbox doesn't have /etc/init.d/sbox.  Instead, it has /<path>/scratchbox/sbin/sbox_ctl, which is used to start the needed daemons.  We shouldn't use the provided installation scripts as they fetch data from the network and may pull in updates.  There might be some negative interaction between the current chinook sdk and the new scratchbox, as both do the same cpu emulation for arm binaries and both have a lot of similar bind mounts.  We currently have chkconfig sbox on, but we are likely going to have to have both scratchbox installs disabled at boot, enabling them as part of the build process.  I am investigating whether it is possible to have both a diablo (N810) and a fremantle (N900) sdk in the same scratchbox5 install, as opposed to having a separate SB4 + Chinook (pre-N810) and SB5 + Fremantle (N900).  This has the advantage of only requiring one scratchbox service to be running and using a lot less disk space.
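For illustration, a minimal sketch of the "disabled at boot, started per build" idea (the paths match the ones used later in this bug; the chkconfig call assumes the old init script stays installed):
 # keep the old init-script-managed scratchbox from starting at boot
 chkconfig sbox off
 # start the scratchbox 5 daemons only when a build actually needs them
 /scratchbox/sbin/sbox_ctl start
 # ... run the maemo build ...
 /scratchbox/sbin/sbox_ctl stop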
Summary: deploy scratchbox to linux32 build slaves → deploy maemo 5 scratchbox to linux32 build slaves
to be clear, the current scratchbox is not capable of doing N900 (maemo5) builds.  We need to either have an updated Diablo+Fremantle scratchbox or keep the chinook and add a fremantle one.
Summary: deploy maemo 5 scratchbox to linux32 build slaves → deploy scratchbox to linux32 build slaves
Summary: deploy scratchbox to linux32 build slaves → deploy maemo 5 scratchbox to linux32 build slaves
Assignee: nobody → jhford
OS: Mac OS X → Linux
Hardware: x86 → ARM
Whiteboard: [maemo5]
Hardware: ARM → x86
takes the output of two different runs of:
find / -mount -type f -exec openssl md5 '{}' \; | tee -a /file-list
find /builds -mount -type f -exec openssl md5 '{}' \; | tee -a /file-list
it looks like we are safe to deploy this as a tarball of /builds/scratchbox

python fs-compare.py before-upg-sorted after-upg-sorted
new file -  /before-sb-upgrade
updated file -  /file-list
updated file -  /home/cltbld/.bash_history
new file -  /home/cltbld/maemo-scratchbox-install_5.0.sh
new file -  /home/cltbld/maemo-sdk-install_5.0.sh
updated file -  /root/.bash_history
new file -  /root/maemo-scratchbox-install_5.0.sh
new file -  /tmp/scratchbox-core-1.0.16-i386.tar.gz
new file -  /tmp/scratchbox-devkit-apt-https-1.0.10-i386.tar.gz
new file -  /tmp/scratchbox-devkit-debian-1.0.10-i386.tar.gz
new file -  /tmp/scratchbox-devkit-doctools-1.0.13-i386.tar.gz
new file -  /tmp/scratchbox-devkit-git-1.0.1-i386.tar.gz
new file -  /tmp/scratchbox-devkit-perl-1.0.4-i386.tar.gz
new file -  /tmp/scratchbox-devkit-qemu-0.10.0-0sb10-i386.tar.gz
new file -  /tmp/scratchbox-devkit-svn-1.0-i386.tar.gz
new file -  /tmp/scratchbox-libs-1.0.16-i386.tar.gz
new file -  /tmp/scratchbox-toolchain-cs2007q3-glibc2.5-arm7-1.0.14-2-i386.tar.gz
new file -  /tmp/scratchbox-toolchain-cs2007q3-glibc2.5-i486-1.0.14-1-i386.tar.gz
new file -  /tmp/scratchbox-toolchain-host-gcc-1.0.16-i386.tar.gz
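fs-compare.py itself isn't attached here, but a rough bash equivalent of that comparison might look like the following sketch (it assumes openssl's "MD5(path)= hash" output format from the find commands above; the actual script may work differently):
 #!/bin/bash
 # classify files as new or updated by comparing two sorted md5 listings
 parse() { sed 's/^MD5(\(.*\))= \(.*\)$/\1 \2/' "$1" | sort; }
 parse before-upg-sorted > /tmp/before.parsed
 parse after-upg-sorted  > /tmp/after.parsed
 # paths that only show up in the new listing
 comm -13 <(cut -d' ' -f1 /tmp/before.parsed | sort) <(cut -d' ' -f1 /tmp/after.parsed | sort) \
     | sed 's/^/new file -  /'
 # paths present in both listings whose hash changed
 join /tmp/before.parsed /tmp/after.parsed | awk '$2 != $3 {print "updated file - ", $1}'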
Beyond the md5 comparison, have you tested that installing through a tarball doesn't break any part of the build process? This sounds like something that should be run through staging a little bit before rolling out.
I am not currently able to get valid builds on my first machine so I can't test that it will work on another machine.
Blocks: 550945
I have a second vm that I can use to test the pave-over tarball deployment option with
currently testing a build on the deployed-to vm.  I used the following process:

=Create Tarball=
On source machine
 su -
 /scratchbox/sbin/sbox_ctl stop
 /scratchbox/sbin/sbox_umount_all
 cd /builds
 tar cvfps scratchbox-YYYY-MM-DD-HHMM.tar scratchbox
 scp scratchbox-YYYY-MM-DD-HHMM.tar <wherever>

=Deploy=
On target machine
 su - 
 /scratchbox/sbin/sbox_ctl stop
 /etc/init.d/sbox stop # above should do it, but be certain
 /scratchbox/sbin/sbox_umount_all
 mkdir -p /builds/slave /scratchbox/users/cltbld/builds/slave
 mount -o bind /builds/slave /scratchbox/users/cltbld/builds/slave
 #set up bind mount in /etc/fstab, must be there every boot or things break
 cd /builds
 rm -rf scratchbox
 tar xvfps scratchbox-YYYY-MM-DD-HHMM.tar
currently testing a build on the deployed-to vm.  I used the following process:

=Create Tarball=
On source machine
 su -
 /scratchbox/sbin/sbox_ctl stop
 /scratchbox/sbin/sbox_umount_all
 umount /builds/slave
 cd /builds
 tar cvfpsW scratchbox-YYYY-MM-DD-HHMM.tar scratchbox
 scp scratchbox-YYYY-MM-DD-HHMM.tar <wherever>
Status: NEW → ASSIGNED
Attached file run lots and lots of builds (obsolete) —
Going to run this script once I have deployed to the two slaves that phong created for me.
testing deployment (again) with

=Deploy=
On target machine
 su - 
 /scratchbox/sbin/sbox_ctl stop
 /etc/init.d/sbox stop # above should do it, but be certain
 /scratchbox/sbin/sbox_umount_all
 cd /builds
 rm -rf scratchbox
 tar xvfps scratchbox-YYYY-MM-DD-HHMM.tar
 mkdir -p /builds/slave /scratchbox/users/cltbld/builds/slave
 mount -o bind /builds/slave /scratchbox/users/cltbld/builds/slave
 #set up bind mount in /etc/fstab, must be there every boot or things break
Attached file run lots and lots of builds (obsolete) —
forgot to actually use scratchbox.  this script is correct.

these are the commands for one loop iteration:
/scratchbox/moz_scratchbox -p -d /builds/slave sb-conf select FREMANTLE_ARMEL
/scratchbox/moz_scratchbox -p -d /builds/slave make -f client.mk build
/scratchbox/moz_scratchbox -p -d /builds/slave/obj-2-fre make package
/scratchbox/moz_scratchbox -p -d /builds/slave/obj-2-fre make package-tests
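A sketch of what one such verify-builds loop might look like, wrapping the commands above (the iteration count and log file names are made up; the actual attachment contents are not preserved in this bug):
 #!/bin/bash
 # run repeated fremantle builds to shake out intermittent scratchbox problems
 set -e
 for i in $(seq 1 10); do
     /scratchbox/moz_scratchbox -p -d /builds/slave sb-conf select FREMANTLE_ARMEL
     /scratchbox/moz_scratchbox -p -d /builds/slave make -f client.mk build \
         > /builds/slave/verify-build-$i.log 2>&1
     /scratchbox/moz_scratchbox -p -d /builds/slave/obj-2-fre make package
     /scratchbox/moz_scratchbox -p -d /builds/slave/obj-2-fre make package-tests
 done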
Attachment #434129 - Attachment is obsolete: true
Attachment #434263 - Attachment is patch: false
John, I was thinking about the size of the tarball you mentioned and about deploying it through puppet.

Have you considered:
1) remove scratchbox
2) recreate scratchbox from scratch to the point that you wanted
rather than:
1) remove scratchbox
2) download tar ball (few gigabytes)
3) untar that tar ball (which is in the state that you wanted)

What do you think?
Do we have the steps needed to get us from scratch to the desired scratchbox state?
(In reply to comment #12)
> John I was thinking about the size of the tar ball you mentioned and to deploy
> through puppet.
> 
> Have you considered:
> 1) remove scratchbox
> 2) recreate scratchbox from scratch to the point that you wanted
> rather than:
> 1) remove scracthbox
> 2) download tar ball (few gigabytes)
> 3) untar that tar ball (which is in the state that you wanted)
> 
> What do you think?
> Do we have the steps needed to get us from scratch to the desired scratchbox
> desired state?

Yes, I thought about doing that but did not pursue it for three reasons.  First, the scratchbox and sdk use apt to install most of the libraries and apps, which means we could easily end up with mismatched versions of various packages.  To guard against this we could recursively hash the fs on a known-good sb and compare it to the fs on the deployed slave, but the logs to do that are fairly large and the comparison takes a long time.  Second, it takes a *very* long time to install scratchbox: we get a max of 20kbps to the scratchbox tarballs and would have to xfer 3gb at that speed.  We could use a proxy, but the same amount of bits still has to go to the slave; it would lower the load on nfs though.  The last concern is that we would have to do an in-place upgrade of the scratchbox, and there is a high probability that not all scratchboxes are identical.  If we do a pave-over install we know that all of our scratchbox installs are identical.

That being said, we could probably put that huge tarball on an internal http/FTP server and have one of the deployment steps be to wget it.  Do you think that would work?  We could do versioned tarballs so that we don't have issues with figuring out which package is where.

(written on iPhone, sorry for poor auto correction)
How big is the resulting tarball?  A few gigs is pretty large to pull down and then have sitting on every slave.
(In reply to comment #14)
> How big is the resulting tarball?  A few gigs is pretty large to pull down and
> then have sitting on every slave.

I have not compressed it yet, but uncompressed it is 3.0GB.  I just realised that I never included deleting the tarball in the posted steps above.  The only reason I could think of to keep the tarball is if we wanted to do binary diffs for future deployments, but that is probably overengineering the problem.
looking at the updated sdk notes (http://www.forum.nokia.com/info/sw.nokia.com/id/c05693a1-265c-4c7f-a389-fc227db4c465/Maemo_5_SDK.html) it looks like they are deploying sdk updates as debian packages.

This means that if we wish to keep our scratchboxes in sync with each other, we should not do any package deployment using apt/aptitude on the scratchbox installations unless we can do the same apt operation on all slaves plus the ref image before the package gets updated in the repository, and we never add new linux32 slaves to the pool.

One thing we could do is use rsync to keep /scratchbox up to date on each boot from a master copy of the scratchbox, but that seems like a more fragile solution.
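For what it's worth, that rsync idea would amount to something like the following at boot (the master host and module name are made up; --delete is what keeps the copies identical and also what makes the approach fragile):
 # sync /builds/scratchbox from a master copy; any local drift gets wiped out
 rsync -aH --delete rsync://sb-master.build.mozilla.org/scratchbox/ /builds/scratchbox/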
Attached file verify-builds does mozconfigs (obsolete) —
fixes
Attachment #434263 - Attachment is obsolete: true
Attached file verify-builds (obsolete) —
Sorry for the bugspam, but this is the only way I can get the script on the test slaves.  scp isn't working for me right now.
Attachment #434478 - Attachment is obsolete: true
Attached file verify-builds
Attachment #434482 - Attachment is obsolete: true
(In reply to comment #13)
> (In reply to comment #12)
> > John I was thinking about the size of the tar ball you mentioned and to deploy
> > through puppet.
> > 
> > Have you considered:
> > 1) remove scratchbox
> > 2) recreate scratchbox from scratch to the point that you wanted
> > rather than:
> > 1) remove scracthbox
> > 2) download tar ball (few gigabytes)
> > 3) untar that tar ball (which is in the state that you wanted)
> > 
> > What do you think?
> > Do we have the steps needed to get us from scratch to the desired scratchbox
> > desired state?
> 
> yes. I thought about doing that but did not pursue it because of three reasons.
> first is that the scratchbox and sdk use apt to install the most of the
> libraries and apps. this means that we could easily have mismatched versions of
> various packages. to guard against this we could recursively hash on the fs on
> a known good sb and compare it to code on the deployed slave but the logs to do
> that are fairly large and that takes a long time to do. second is that it takes
> a *very* long time to install scratchbox. we get a max of 20kbps to the
> scratchbox tarballs and would have to xfer 3gb at that speed. we could use a
> proxy but this is still going to be the same amount of bits going to the slave.
> this would lower the load on nfs though. the last concern I have is that we
> have to do an in-place upgrade of the scratchbox. there is a high probability
> that not all scratchboxes are identical. if we do a pave-over install we know
> that all of our scratchbox installs are identical. 

We can pretty easily deal with the load on NFS problem by staging out the rollout to 10 slaves or so at a time. It would take longer overall, but eliminate load concerns there.

> that being said, we could probably put that huge tarball on an internal
> http/FTP server and have one of the deployment steps be to wget it. do you
> think that it'd work?  we could do versioned tarballs so that we don't have
> issues with figuring out which package is where. 

Puppet has a built-in HTTP server that we should use for this. It's already got an Apache install in front of it, but I don't know if caching would help, since they're hosted on the same machine.
(In reply to comment #16)
> looking at the updated sdk notes
> (http://www.forum.nokia.com/info/sw.nokia.com/id/c05693a1-265c-4c7f-a389-fc227db4c465/Maemo_5_SDK.html)
> it looks like they are deploying sdk updates as debian packages.
> 
> This means that if we wish to keep our scratchboxes in sync with each other, we
> should not do any package deployment using apt/aptitude on the scratchbox
> installations unless we are able to do the same apt operation on all slaves +
> the ref image before the package gets updated on the repository and we don't
> add new linux32 slaves to the pool ever.
> 
> One thing we could do is use rsync to keep /scratchbox up to date on each boot
> from a master copy of the scratchbox but that seems like a more fragile
> solution.

Totally agree that keeping things in sync with apt is a bit tricky.  We _could_ store all the .deb's that get installed locally, and then use puppet to install those with dpkg.

I'm more concerned with the on-disk size of the whole environment.  Is there any way to get the size down below 3 GB?  After deleting the old scratchbox, what's the impact?
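A hedged sketch of the dpkg route for a single target (the /builds/slave/debs path is an assumption -- the .debs would need to live somewhere visible inside the target, such as under the bind-mounted /builds/slave -- and the moz_scratchbox wrapper and fakeroot usage mirror the build commands elsewhere in this bug):
 # install locally mirrored .debs inside the FREMANTLE_ARMEL target
 /scratchbox/moz_scratchbox -p -d /builds/slave sb-conf select FREMANTLE_ARMEL
 for deb in /builds/slave/debs/*.deb; do
     /scratchbox/moz_scratchbox -p -d /builds/slave fakeroot dpkg -i "$deb" || exit 1
 done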
(In reply to comment #22)
> Totally agree that keeping things in sync with apt is a bit tricky.  We _could_
> store all the .deb's that get installed locally, and then use puppet to install
> those with dpkg.

The only downside to this is that we can't use Puppet's "package" types to do it, which means we've got to deal with a lot of fiddly exec {}s -- working with a single tarball would be easier from that perspective.
(In reply to comment #22)
> Totally agree that keeping things in sync with apt is a bit tricky.  We _could_
> store all the .deb's that get installed locally, and then use puppet to install
> those with dpkg.
Yes, we could do that.  Approximately the same amount of data has to be pushed to the slave though.  There is a risk (however small) that the scratchbox install scripts don't assemble a working scratchbox.  We also have to build python and openssl from source in a weird way.  I would personally feel more confident with the final install if we did it as a snapshot of a known-to-work scratchbox installation.

> I'm more concerned with the on-disk size of the whole environment.  Is there
> any way to get the size down below 3 GB?  After deleting the old scratchbox,
> what's the impact?

This is a pave-over of the old scratchbox.  

Scratchbox is basically a chroot manager with a bunch of neat tricks for making programs think they are running on ARM, complete with ARM binary execution.  The end result is an entire debian system for each target environment you have in your scratchbox.  The old scratchbox had two targets for arm (one of which we didn't use) and one for x86 (which we didn't use).  The new one has two arm targets that we do use and no x86 ones (see https://bugzilla.mozilla.org/show_bug.cgi?id=528431#c42 for details on space usage).  I deleted the unused targets, which means that this deployment will actually cut our net disk usage by 1.2GB.

My main concern is that I am going to kill puppet with this deployment, not that the deployment method is flawed.  For future changes, if it is a matter of installing a couple of deb files, I would be in favour of downloading the dependent debs (so long as there aren't too many) and doing the exec{} thing in Ben's comment, but for adding a new target or upgrading the SDK version, this seems like the best way to have a working scratchbox that is identical on all slaves without over-engineering the solution.

I am waiting on 3 builds to finish up, and once they do I will have a tarball that is ready to deploy (at least to staging).  Is anyone around who can help me with this?

Oh, and the bind mount that needs to be set up -- does puppet understand how to manage fstab and bind mounting?
=Deploy=
 On target machine
  su - 
  /scratchbox/sbin/sbox_ctl stop
  /etc/init.d/sbox stop # above should do it, but be certain
  /scratchbox/sbin/sbox_umount_all
  cd /builds
  rm -rf scratchbox
  tar xvfps scratchbox-YYYY-MM-DD-HHMM.tar
  rm scratchbox-YYYY-MM-DD-HHMM.tar
  mkdir -p /builds/slave /scratchbox/users/cltbld/builds/slave
  mount -o bind /builds/slave /scratchbox/users/cltbld/builds/slave
  #set up bind mount in /etc/fstab, must be there every boot or things break
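For reference, the /etc/fstab entry that comment refers to would be a standard bind-mount line along these lines (paths taken from the steps above):
 /builds/slave  /scratchbox/users/cltbld/builds/slave  none  bind  0 0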
Well, it turns out that bzip2'ing the tarball yields a major improvement in archive size (1.2GB vs 3.0GB).

The steps should be updated with
-tar xvfps scratchbox-YYYY-MM-DD-HHMM.tar
-rm scratchbox-YYYY-MM-DD-HHMM.tar

+tar jxvfps scratchbox-YYYY-MM-DD-HHMM.tar.bz2
+rm scratchbox-YYYY-MM-DD-HHMM.tar.bz2
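The compression itself can simply be done on the source machine after the tarball is created, e.g. (a sketch; -9 only trades CPU time for a somewhat smaller archive):
 # produces scratchbox-YYYY-MM-DD-HHMM.tar.bz2 and removes the uncompressed .tar
 bzip2 -9 scratchbox-YYYY-MM-DD-HHMM.tar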
Attached patch taking a stab at it (obsolete) — Splinter Review
this is my guess at what needs to happen.
Attachment #434618 - Flags: feedback?(bhearsum)
Attachment #434618 - Flags: feedback?(armenzg)
Comment on attachment 434618 [details] [diff] [review]
taking a stab at it

Is there any reason you can't put all of this in a bash script and just execute that? It would make the Puppet side 
much simpler, like:
* copy tarball
* execute script
* delete tarball

...and you'd probably want to keep the symlink stuff in puppet, too.

One thing to watch out for is the multiple bind mounting of .ssh. It'd be easy to unmount in a loop in a bash script :-).

A few more comments are inline.

I don't think puppet will accept variables outside of classes; move it inside.

>+        "umount /builds/slave || /bin/true":
>+            cwd => "/",
>+            alias => "umount-builds-slave-bind",
>+            subscribe => Exec['umount-all-scratchbox-binds'];

This mount doesn't exist in our current scratchbox install, so you don't need to do this.
Attachment #434618 - Flags: feedback?(bhearsum) → feedback-
Comment on attachment 434618 [details] [diff] [review]
taking a stab at it

+1 to the script idea 

>+$sbtarball = "scratchbox-2010-03-22-1916.tar.bz2"
>+
> class scratchbox {
> 
I think it should go inside the class. I could be wrong.
    
>+            source => "${fileroot}/${sbtarball}";
source => "${fileroot}/centos/dist/${sbtarball}";
or something like that.

We are not going to use httproot but have a look at the structure in here for some ideas:
http://production-puppet.build.mozilla.org/darwin9/


For the next attachment, could you paste the scratchbox.pp file as a whole rather than the diff?

We can work on the diff when you ask for review instead of feedback
Attachment #434618 - Flags: feedback?(armenzg)
(In reply to comment #28)
> (From update of attachment 434618 [details] [diff] [review])
> Is there any reason you can't put all of this in a bash script and just execute

That is what I would prefer to do.  I suggested it and was told that it would be better to do all of it in puppet.

> that? It would make the Puppet side 
> much simpler, like:
> * copy tarball
> * execute script
> * delete tarball
> 
> ...and you'd probably want to keep the symlink stuff in puppet, too.
> 
> One thing to watch out for is the multiple bind mounting of .ssh. It'd be easy
> to unmount in a loop in a bash script :-).

Why don't I do something in my script that greps for the bind mount in fstab and adds it only if it isn't already there?
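Something along these lines, for example (the fstab line is the bind mount from the deployment steps; grep -F keeps the path from being treated as a regex):
 # append the bind mount to /etc/fstab only if it is not already there
 line='/builds/slave /scratchbox/users/cltbld/builds/slave none bind 0 0'
 grep -qF "$line" /etc/fstab || echo "$line" >> /etc/fstab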

When it comes time to test, I have absolutely zero clue how to test this.  Is anyone able to help me test this?

I am also not averse to rolling this out manually, as time is of the essence in this deployment.  If we do a manual rollout I would have to take slaves down as they go idle, run the script, then put them back in the pool.

(In reply to comment #29)
> (From update of attachment 434618 [details] [diff] [review])
> We are not going to use httproot but have a look at the structure in here for
> some ideas:
> http://production-puppet.build.mozilla.org/darwin9/

I will look there

 
> For the next attachment could you paste the scracthbox.pp file as a whole
> rather than the diff?
> We can work on the diff when you ask for review instead of feedback

yah, will do.
it is looking like the automation work that depends on this is moving a lot quicker than anticipated.
Severity: normal → major
Priority: -- → P2
Attached file deploy-sb.sh (obsolete) —
this is the script to deploy scratchbox that would go in /N/centos5/scratchbox/
Attachment #435099 - Flags: feedback?(bhearsum)
Attachment #435099 - Flags: feedback?(armenzg)
Attached file scratchbox.pp (obsolete) —
this is the whole file as requested
Attachment #434618 - Attachment is obsolete: true
Attachment #435100 - Flags: feedback?(bhearsum)
Attachment #435100 - Flags: feedback?(armenzg)
Comment on attachment 435100 [details]
scratchbox.pp

>    ${VERSION}=2010-03-25-2138

I am not sure if I need to surround it in quotes.  I do not want the quotes in the file string.

>    exec {
>        "VERSION=${VERSION} /N/centos5/scratchbox/deploy-sb.sh":
>            creates => "/scratchbox/deployed-${VERSION}",
>            cwd => "/builds/",
>            alias => "install-scratchbox",
>	    subscribe => File['/builds/scratchbox-${VERSION}.tar.bz2'];
>    }

1. I don't know if I can do the env var thing.  If this is passed to an exec*()-like function instead of a system()-like one, I could use /usr/bin/env to insert the variable I want.

2. Does deploy-sb.sh need to be in the files section?  It is only being executed, not installed, and will live on the NFS mount.

>        "/builds/scratchbox-${VERSION}.tar.bz2":
>            source => "${fileroot}/centos5/scratchbox/scratchbox-${VERSION}.tar.bz2";

Not really sure if this is correct or not.
Comment on attachment 435099 [details]
deploy-sb.sh

>su - cltbld -c 'buildbot stop /builds/slave'
>./scratchbox/sbin/sbox_ctl stop || error "stopping sbox" #turn off
>./scratchbox/sbin/sbox_umount_all || error "umounting sbox" #turn extra off
....
>touch ./scratchbox/deployed-${VERSION}
>./scratchbox/sbin/sbox_ctl start
>su - cltbld -c 'buildbot start /builds/slave'

Not sure if I should be doing the buildbot slave control, but I'd err on the side of caution.  Unless the slave is guaranteed to not be running, we can't drop the buildbot commands.
Comment on attachment 435099 [details]
deploy-sb.sh

>#!/bin/bash
># This is a script to deploy a scratchbox. At the end of deployment, this
># script will touch ${WORKDIR}/scratchbox/scratchbox-deployed and
># ${WORKDIR}/scratchbox/deployed-${VERSION}
>#
># This script must be run as root and will stop a buildbot slave at /builds/slave
># if it is there using the cltbld user account
>if [[ x"$VERSION" == "x" ]] ; then
>    VERSION=2
>fi
>WORKDIR='/builds/'
>
>error () {
>    echo "ERROR: $1" 1>&2
>    mv /builds/slave/buildbot.tac /builds/slave/buildbot.tac.off
>    cat >> /builds/slave/buildbot.tac.off <<EOF 
>##########################################################
>#                                                        #
>#                                                        #
># SLAVE IS OFF BECAUSE OF SCRATCHBOX DEPLOYMENT FAILURE  #
>#                                                        #
>#     DO NOT ENABLE SLAVE UNTIL SCRATCHBOX IS FIXED      #
>#                                                        #
>#                                                        #
>##########################################################
>EOF
>    exit 1
>}
Please don't add anything to buildbot.tac.

>mount | egrep "^/builds/slave" &> /dev/null
>if [[ $? != 0 ]] ; then
>    #We need to umount /builds/slave or die
>    umount /builds/slave || error "could not umount /builds/slave"
>fi
I am confused. We have to check for /builds/slave? I didn't know we mount that.


I like this script. I bet bhearsum will have more comments than me.


John, how long does this process take?
I am trying to picture what problems having our slaves go down for deployment will cause. It might be worth closing the trees for a couple of hours, just in case.
Comment on attachment 435100 [details]
scratchbox.pp

># scratchbox.pp
># installs the scratchbox files
># In this manifest we make use of some .expect scripts to automate
># the steps necessary. Sometimes these need to be done as the cltbld
># user so we do that by calling the .expect via su
>
>class scratchbox {
>    ${VERSION}=2010-03-25-2138
>    exec {
>        "VERSION=${VERSION} /N/centos5/scratchbox/deploy-sb.sh":
Use ${fileroot} instead of /N. Use $os instead of "centos5".

>            creates => "/scratchbox/deployed-${VERSION}",
>            cwd => "/builds/",
>            alias => "install-scratchbox",
>	    subscribe => File['/builds/scratchbox-${VERSION}.tar.bz2'];
>    }
>
>    file {
>        "/scratchbox":
>            ensure => '/builds/scratchbox',
>            mode => 755,
>            owner => root,
>            group => root;
>
>        "/builds/scratchbox-${VERSION}.tar.bz2":
>            source => "${fileroot}/centos5/scratchbox/scratchbox-${VERSION}.tar.bz2";
Use $os.

>
>        "/builds/scratchbox/moz_scratchbox":
>            source => "/N/centos5/scratchbox/moz_scratchbox",
>            mode => 755;
>
>        "/scratchbox/etc/resolv.conf":
>            source => "/N/centos5/scratchbox/etc/resolv.conf",
>            require => Exec["install-scratchbox"];
>
Use ${fileroot} instead of /N. Use $os instead of "centos5".
Attachment #435100 - Flags: feedback?(armenzg) → feedback+
Comment on attachment 435099 [details]
deploy-sb.sh

You can't go stopping the buildbot slave here. We cannot predict when this is going to run, or if the buildbot slave is going to be busy.

I agree with Armen about not modifying the buildbot.tac file. If we manage to screw it up on every machine that's a _ton_ of manual cleanup to do.

You need to unmount the .ssh bind mount in a loop here, I'm pretty sure.

Why do you need to test for old-scratchbox in a loop? rm -rf isn't asynchronous. If for some reason you do, set a maximum amount of time, and die after it to avoid the possibility of an infinite loop.

Rolling this out is definitely going to be tricky. Given the size and scope of this change it's best to do it in a downtime, I think. I'm also wondering if we should switch the Linux build machines to do what we do on Talos: only run Puppet once, at boot, and only launch Buildbot after it finishes. That would eliminate the possibility of a build running while we're trying to deploy.

Let's chat about this today.
Attachment #435099 - Flags: feedback?(bhearsum) → feedback-
Comment on attachment 435100 [details]
scratchbox.pp

Why not just pass $VERSION as an arg?

This seems pretty ok overall, but you need to delete the tarball afterwards. You can do this with an additional file {} check using 'ensure => absent, force => true'.

I'd like to see this in action before I review it, too.
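On the script side that is just a positional parameter, e.g. a sketch that keeps the current default of 2 from deploy-sb.sh:
 # deploy-sb.sh: take the scratchbox version as the first argument,
 # so puppet can call it as "deploy-sb.sh $sb_version"
 VERSION="${1:-2}"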
(In reply to comment #36)
> Please don't add anything to buildbot.tac.

sure, but if this deployment fails, the slave needs to be taken out of the pool.  Is moving buildbot.tac ok?

> I am confused. We have to check for /builds/slave? I didn't know we mount that.

We don't currently, but that is something that we are going to be doing in the future.  This check is to see if /builds/slave is mounted at all, and only if it is, unmount it.
 
> John, how long does this process take?
> I am trying to picture what problems will cause having our slaves going down
> for deployment. Might be worth closing the trees for a couple of hours just in
> case.

It takes 15+ minutes per slave.  There would be issues if there is a maemo build in progress.  There would also be issues if the slave rebooted during deployment.

(In reply to comment #37)
> Use ${fileroot} instead of /N. Use $os instead of "centos5".
> Use $os.
> Use ${fileroot} instead of /N. Use $os instead of "centos5".
Will Do.

(In reply to comment #38)
> (From update of attachment 435099 [details])
> You can't go stopping the buildbot slave here. We cannot predict when this is
> going to run, or if the buildbot slave is going to busy.

Unless we can guarantee that there is no Maemo build happening and the slave won't reboot midstream, this is not a safe deployment.

> I agree with Armen about not modifying the buildbot.tac file. If we manage to
> screw it up on every machine that's a _ton_ of manual cleanup to do.

Ok, I won't do that.

> You need to unmount the .ssh bind mount in a loop here, I'm pretty sure.

forgot that we mounted .ssh, will do this.
 
> Why do you need to test for old-scratchbox in a loop? rm -rf isn't
> asynchronous. If for some reason you do, set a maximum amount of time, and die
> after it to avoid the possibility of an infinite loop.

I am running the rm -rf in the background to speed up deployment.  This check is just to make sure that the script does not exit before the old scratchbox finishes being deleted, but I can set a maximum time.
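If the background rm were kept, a bounded wait would look something like this sketch (the old-scratchbox name comes from the review comment above; the 30 minute cap is an arbitrary assumption):
 # delete the old scratchbox in the background, but give up after 30 minutes
 rm -rf /builds/old-scratchbox &
 rm_pid=$!
 waited=0
 while kill -0 "$rm_pid" 2>/dev/null; do
     if [[ $waited -ge 1800 ]]; then
         echo "old scratchbox still not deleted after ${waited}s" 1>&2
         exit 1
     fi
     sleep 10
     waited=$((waited + 10))
 done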
 
> Rolling this out is definitely going to be tricky. Given the size and scope of
> this change it's best to do it in a downtime, I think. I'm also wondering if we
> should switch the Linux build machines to do what we do on Talos: only run
> Puppet once, at boot, and only launch Buildbot after it finishes. That would
> eliminate the possibility of a build running while we're trying to deploy.

In general, it seems that changing anything on the filesystem related to the actual build while builds are happening isn't a good thing.  Is there a way we can say that certain jobs are safe to deploy at any time and others are only safe before slaves start?

I would agree that this is best for a downtime.  Alternatively, if my automation patches are r+'d then I would not be opposed to taking slaves out of the pool manually and deploying this to our pm02 slaves.
 
> Let's chat about this today.

Sure
Depends on: 555267
the patches that depend on this have been r+'d.  I would like to be able to land them before they bitrot and we would also like to get this in production by the end of quarter as this is a goal.
Severity: major → critical
I am also including the libcurl3-dev package from bug 554999 in the CHINOOK-ARMEL-2007 tarball.
Blocks: 554999
Attached patch puppet patch to test (obsolete) — Splinter Review
Attachment #435100 - Attachment is obsolete: true
Attachment #435100 - Flags: feedback?(bhearsum)
Patch coming soon, but I got this output from running puppetd --test --server staging-puppet.build.mozilla.org


<snip>
notice: //Node[moz2-linux-slave03.build.mozilla.org]/staging-buildslave/scratchbox/Exec[install-scratchbox]/returns: executed successfully
<snip>
notice: Finished catalog run in 1123.42 seconds
Attached patch puppet manifest (obsolete) — Splinter Review
Here is the puppet manifest for deployment.  As discussed it requires buildbot to not be running during the deployment.

I have put the required tarballs and script in the /N/puppet-files/centos5/scratchbox directory of staging-puppet which aiui means it is also in that directory for production-puppet.  I have not checked them in.  I will attach the deployment script to this bug
Attachment #435959 - Attachment is obsolete: true
Attachment #436055 - Flags: review?(bhearsum)
Attachment #436055 - Flags: checked-in?
Attached file deploy-sb.sh (obsolete) —
updated.  Using cwd => /builds/slave was not valid after changing from ${fileroot} to NFS, so I have to store the pwd before changing to /builds/slave in order to know where the tarball is located.
Attachment #435099 - Attachment is obsolete: true
Attachment #436056 - Flags: review?(bhearsum)
Attachment #436056 - Flags: checked-in?
Attachment #435099 - Flags: feedback?(armenzg)
Attachment #436056 - Attachment mime type: application/x-sh → text/plain
Comment on attachment 436055 [details] [diff] [review]
puppet manifest

+            timeout => 1*60*60,

should be 

+            timeout => 7200,
I have had to update the scratchbox due to bug 554999.  I am tarring+bzip2'ing the new scratchbox and putting it on /N/.  I am going to test that this deployment works as well.  This does mean that the $sb_version variable in the puppet manifest change we land should be 2010-03-30-1129.
Attached file deploy-sb.sh (obsolete) —
-don't do rm asynchronously
-save build dir from existing scratchbox before deployment
Attachment #436056 - Attachment is obsolete: true
Attachment #436208 - Flags: review?(bhearsum)
Attachment #436056 - Flags: review?(bhearsum)
Attachment #436056 - Flags: checked-in?
Comment on attachment 436056 [details]
deploy-sb.sh

As per our IRC discussion, please do everything here synchronously. Doing the rm in the background is error prone, and in the worst case, ties up a slave needlessly for 5 hours.

If there's a problem when removing the old scratchbox, just error out.

umount'ing like you are doesn't work for things that are mounted multiple times, like .ssh can be. It's probably ok since we're running at boot, but for correctness can you please change the umount's to unmount by mount point?

umount doesn't return until it's finished unmounting, so the sleeps are unnecessary - remove them.
(In reply to comment #50)
> (From update of attachment 436056 [details])
> As per our IRC discussion, please do everything here synchronously. Doing the
> rm in the background is error prone, and in the worst case, ties up a slave
> needlessly for 5 hours.
> 
> If there's a problem when removing the old scratchbox, just error out.

new script attached

> umount'ing like you are doesn't work for things that are mounted multiple
> times, like .ssh can be. It's probably ok since we're running at boot, but for
> correctness can you please change the umount's to unmount by mount point?

The output redirection was breaking my loops; I have tested the fix below.

> unmount doesn't return until it's finished unmount, the sleep's are unnecessary
> - remove them.

Actually, in my testing that was not the case.  They returned before the umount had actually unmounted the file system.  I am happy to remove them, though.


[root@maemo5-test01 ~]# mount
/dev/sda1 on / type ext3 (rw,noatime)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sda2 on /var type ext3 (rw)
/dev/sdb1 on /builds type ext3 (rw,noatime)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
10.2.71.136:/export/buildlogs/puppet-files on /N type nfs (ro,addr=10.2.71.136)
/builds/scratchbox/users/cltbld/host_usr on /host_usr type none (rw,bind)
/builds/scratchbox/users/cltbld/host_usr on /host_usr type none (rw,bind,/scratchbox/users/cltbld/host_usr)
[root@maemo5-test01 ~]# mount -o bind /builds/slave/ lala
[root@maemo5-test01 ~]# mount -o bind /builds/slave/ lala
[root@maemo5-test01 ~]# mount -o bind /builds/slave/ lala
[root@maemo5-test01 ~]# mount -o bind /builds/slave/ lala
[root@maemo5-test01 ~]# mount -o bind /builds/slave/ lala
[root@maemo5-test01 ~]# mount -o bind /builds/slave/ lala
[root@maemo5-test01 ~]# mount -o bind /builds/slave/ lala
[root@maemo5-test01 ~]# mount -o bind /builds/slave/ lala
[root@maemo5-test01 ~]# mount -o bind /builds/slave/ lala
[root@maemo5-test01 ~]# mount
/dev/sda1 on / type ext3 (rw,noatime)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sda2 on /var type ext3 (rw)
/dev/sdb1 on /builds type ext3 (rw,noatime)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
10.2.71.136:/export/buildlogs/puppet-files on /N type nfs (ro,addr=10.2.71.136)
/builds/scratchbox/users/cltbld/host_usr on /host_usr type none (rw,bind)
/builds/scratchbox/users/cltbld/host_usr on /host_usr type none (rw,bind,/scratchbox/users/cltbld/host_usr)
/builds/slave on /root/lala type none (rw,bind)
/builds/slave on /root/lala type none (rw,bind)
/builds/slave on /root/lala type none (rw,bind)
/builds/slave on /root/lala type none (rw,bind)
/builds/slave on /root/lala type none (rw,bind)
/builds/slave on /root/lala type none (rw,bind)
/builds/slave on /root/lala type none (rw,bind)
/builds/slave on /root/lala type none (rw,bind)
/builds/slave on /root/lala type none (rw,bind)
[root@maemo5-test01 ~]# while [[ `mount | grep "^/builds/slave"` ]] ; do umount /builds/slave ; done
[root@maemo5-test01 ~]# mount
/dev/sda1 on / type ext3 (rw,noatime)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sda2 on /var type ext3 (rw)
/dev/sdb1 on /builds type ext3 (rw,noatime)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
10.2.71.136:/export/buildlogs/puppet-files on /N type nfs (ro,addr=10.2.71.136)
/builds/scratchbox/users/cltbld/host_usr on /host_usr type none (rw,bind)
/builds/scratchbox/users/cltbld/host_usr on /host_usr type none (rw,bind,/scratchbox/users/cltbld/host_usr)
[root@maemo5-test01 ~]#
Attached file deploy-sb.sh
-dealing with files that were mounted multiple times is fixed
Attachment #436208 - Attachment is obsolete: true
Attachment #436213 - Flags: review?(bhearsum)
Attachment #436208 - Flags: review?(bhearsum)
Attachment #436213 - Attachment mime type: application/x-sh → text/plain
Attachment #436055 - Attachment is obsolete: true
Attachment #436231 - Flags: review?(bhearsum)
Attachment #436055 - Flags: review?(bhearsum)
Attachment #436055 - Flags: checked-in?
Attachment #436208 - Attachment mime type: application/x-sh → text/plain
Comment on attachment 436213 [details]
deploy-sb.sh

This seems fine.
Attachment #436213 - Flags: review?(bhearsum) → review+
Comment on attachment 436231 [details] [diff] [review]
puppet manifest - should apply cleanly

>+        install-scratchbox:
>+            command => "/N/centos5/scratchbox/deploy-sb.sh $sb_version",
>+            creates => "/builds/scratchbox/deployed-$sb_version",
>+            timeout => 1*60*60,

This is still wrong. 

>+    mount {
>+        "/builds/scratchbox/users/cltbld/builds/slave":
>+            device => "/builds/slave",
>+            fstype => "auto",
>+            options => "bind",
>+            ensure => "mounted",
>+            require => Exec['install-scratchbox'];
>     }
> }


This is fine, but you need to remove the one in centos.pp
Attachment #436231 - Flags: review?(bhearsum) → review-
(In reply to comment #55)
> (From update of attachment 436231 [details] [diff] [review])
> >+        install-scratchbox:
> >+            command => "/N/centos5/scratchbox/deploy-sb.sh $sb_version",
> >+            creates => "/builds/scratchbox/deployed-$sb_version",
> >+            timeout => 1*60*60,
> 
> This is still wrong. 
> 
> >+    mount {
> >+        "/builds/scratchbox/users/cltbld/builds/slave":
> >+            device => "/builds/slave",
> >+            fstype => "auto",
> >+            options => "bind",
> >+            ensure => "mounted",
> >+            require => Exec['install-scratchbox'];
> >     }
> > }
> 
> 
> This is fine, but you need to remove the one in centos.pp

...of course, the one in centos.pp is .ssh, not the slave dir, so nevermind this.
Timeout fixed.  Set at 3 hours because of the slow link between MPT and MV.  As discussed in IRC, the /builds/slave bind mount is not done in centos.pp so it is safe to do here.
Attachment #436231 - Attachment is obsolete: true
Attachment #436300 - Flags: review?(bhearsum)
Attachment #436300 - Flags: review?(bhearsum) → review+
Attachment #436300 - Flags: checked-in?
Comment on attachment 436213 [details]
deploy-sb.sh

Checking in deploy-sb.sh;
/mofo/puppet-files/centos5/scratchbox/deploy-sb.sh,v  <--  deploy-sb.sh
initial revision: 1.1
done


Haven't been able to land the tarball yet, because CVS sucks:
cvs [commit aborted]: out of memory; can not allocate 1277911658 bytes
Attachment #436213 - Flags: checked-in+
Had to land a bustage fix because we forgot to make a case {} to stop 64-bit slaves from installing this. Landed it as http://hg.mozilla.org/build/puppet-manifests/rev/74d5fac79c34
This has landed.  I haven't seen any slaves fail out on a build due to missing scratchbox.  If there are any slaves that do not have the updated scratchbox, please file a new bug.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
John, I just remembered that the ref platform doc was never updated. Can you please document how you upgraded Scratchbox here: https://bugzilla.mozilla.org/show_bug.cgi?id=548146? (To be clear, we're looking for the actual upgrade instructions, not how to upgrade with the tarball.)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Yes, I was about to do that.  I am going to be putting these steps into https://wiki.mozilla.org/index.php?title=ReferencePlatforms/Linux-scratchbox as we basically treat that as a platform within a platform.  I will close this bug when I have finished.
Done
Status: REOPENED → RESOLVED
Closed: 14 years ago14 years ago
Resolution: --- → FIXED
There is still no reference to the upgraded scratchbox on the ref platform page. You pointed me to https://wiki.mozilla.org/ReferencePlatforms/Linux-scratchbox, but that either needs to be migrated to, or linked from the ref platform page -- otherwise it's going to be missing for anyone who tries to create our ref platform.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [maemo5] → [maemo5][missing documentation]
Do we have a linux32 ref machine at the moment? Did that get the update too?
(In reply to comment #66)
> Do we have a linux32 ref machine at the moment ? Did that get the update too ?

We've got a VM and ix ref for this platform. Not sure if either got updated.
jhford, what say you?
I don't know what the status of the ref machine/image is.
I booted up the 32-bit Linux ref platform today and it successfully synced up the new Scratchbox. The IX ref machine synced the new scratchbox along with the slaves on April 1st.

I also copied the Scratchbox upgrade instructions to the ref platform page:
https://wiki.mozilla.org/ReferencePlatforms/Linux-CentOS-5.0#Upgrade_Scratchbox
and deleted the Linux-scratchbox page, because it is now entirely duplicating content found on the CentOS page.

I think we're all done here.
Status: REOPENED → RESOLVED
Closed: 14 years ago14 years ago
Resolution: --- → FIXED
Thanks Ben, that's awesome.
Attachment #436300 - Flags: checked-in? → checked-in+
Product: mozilla.org → Release Engineering