slave-side slave-alloc support for linux build & test

RESOLVED FIXED

Status

RESOLVED FIXED
8 years ago
5 years ago

People

(Reporter: dustin, Assigned: dustin)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 6 obsolete attachments)

This puts slave-side support for the slave allocation system in place on linux hosts.  It has no immediate behavior change, although adjusting buildbot.tac will require adjusting files in a central location, rather than on the slaves.

1. Collect the buildbot.tac's for all linux slaves and place them in a build-net-accessible directory (fabric?!)

2. land runslave.py and (/etc/init.d/)buildbot in build/tools/buildbot-helpers/startup, and add to puppet-files.  Files will have appropriate header comments to describe their version control and deployment

3. adjust linux puppet manifests to distribute these files to linux slaves, and to delete /etc/init.d/buildbot-tac

4. remove build-tools.rpm (bug 615301)
Created attachment 494866 [details] [diff] [review]
puppet-manifests-r1.patch

Here are the corresponding changes to buildbot-files.  All of the new/changed files are in the build/tools patch I'll push momentarily.

::remove unused $platform/etc/init.d/buildbot.tac
::remove unused $platform/etc/default/buildbot

::change $platform/etc/init.d/buildbot (centos)
/N/staging/centos5-x86_64/build/etc/init.d/buildbot
/N/staging/centos5-i686/build/etc/init.d/buildbot

::change fedora's run-puppet-and-buildbot.sh
/N/staging/fedora12-x86_64/test/home/cltbld/run-puppet-and-buildbot.sh
/N/staging/fedora12-i686/test/home/cltbld/run-puppet-and-buildbot.sh

::add runslave.py
add centos5-i686/user/local/bin/runslave.py
add centos5-x86_64/user/local/bin/runslave.py
add fedora12-i686/test/user/local/bin/runslave.py
add fedora12-x86_64/test/user/local/bin/runslave.py


This has been staged successfully, and slaves start up and download the .tac from my people page.  I've even seen runslave.py successfully use the existing buildbot.tac after a failed download.
Attachment #494866 - Flags: review?(bhearsum)
Created attachment 494867 [details] [diff] [review]
tools-r1.patch

Let's keep all of these files in tools/buildbot-helpers/startup!

If they exist elsewhere in version control, I haven't found them - please point me there.
Attachment #494867 - Flags: review?(bhearsum)
Attachment #494867 - Flags: feedback?(catlee)
Created attachment 494870 [details] [diff] [review]
tools-r2.patch
Attachment #494867 - Attachment is obsolete: true
Attachment #494870 - Flags: review?
Attachment #494867 - Flags: review?(bhearsum)
Attachment #494867 - Flags: feedback?(catlee)
Comment on attachment 494870 [details] [diff] [review]
tools-r2.patch

just like -r1, but with some extra MOZILLA DEPLOYMENT NOTES comments.
Attachment #494870 - Flags: review?(bhearsum)
Attachment #494870 - Flags: review?
Attachment #494870 - Flags: feedback?(catlee)
When these are reviewed, I think it will be fine to throw them onto puppet with no further ado - most living slaves will check http://people.mozilla.com/~dmitchell/allocator/SLAVE, not find a .tac, and keep using the existing .tac -- just like they've been doing since they were set up.

We'll lose the ability to disable slaves by moving buildbot.tac to buildbot.off, as runslave.py will just create a fresh new buildbot.tac.  But we can fix this by creating DO_NOT_START in the basedir.

It might be nice to finish bug 616344 first, so we don't need to redeploy runslave.py, but honestly it's not that hard to change the URL later.
Blocks: 615301

Comment 6

8 years ago
A file called DO_NOT_START?
Comment on attachment 494870 [details] [diff] [review]
tools-r2.patch

>+# Required-Stop:     $local_fs
>+PATH=/sbin:/bin:/usr/sbin:/usr/bin
>+DESC="Buildbot"
>+
>+BUILDSLAVE_CMD=/tools/buildbot/bin/buildbot # will change to 'buildslave' in 0.8.1
>+LOCKFILE=/var/lock/subsys/buildbot
>+RUNSLAVE=/builds/runslave.py
>+PYTHON=/tools/python/bin/python
>+
>+test -x ${BUILDSLAVE_CMD} || exit 0
>+
>+. /lib/lsb/init-functions
>+
>+start_buildbot() {
>+    # note - spaces here will break things
>+    su - cltbld sh -c "${PYTHON} ${RUNSLAVE}"
>+    ret=$?
>+    if [ $ret == "0" ]; then
>+        touch ${LOCKFILE}
>+    fi
>+    return $ret
>+}

Is LOCKFILE useful to maintain?

>+    parser.add_option("-a", "--allocator-url", action="store", dest="allocator_url")
>+    parser.add_option("-c", "--buildslave-cmd", action="store", dest="buildslave_cmd")
>+    parser.add_option("-d", "--basedir", action="store", dest="basedir")
>+    parser.add_option("-n", "--slavename", action="store", dest="slavename")
>+    parser.add_option("-v", "--verbose", action="store_true", dest="verbose")
>+    parser.add_option(      "--no-start", action="store_true", dest="no_start")
>+
>+    (options, args) = parser.parse_args()
>+
>+    # apply some defaults
>+    if not options.slavename:
>+        options.slavename = socket.gethostname().split('.')[0]
>+    def slave_matches(*parts):
>+        "True if any of PARTS appears in options.slavename"
>+        for part in parts:
>+            if part in options.slavename:
>+                return True
>+        return False
>+
>+    if not options.basedir:
>+        if slave_matches('slave'):
>+            if slave_matches('linux', 'darwin'):
>+                options.basedir = '/builds/slave'
>+            elif slave_matches('win', 'w32'):
>+                options.basedir = r'e:\builds\moz2_slave'
>+        elif slave_matches('talos', '-try', 'r3'):
>+            if slave_matches('linux', 'ubuntu', 'fed'):
>+                options.basedir = '/home/cltbld/talos-slave'
>+            elif slave_matches('tiger', 'leopard', 'snow'):
>+                options.basedir = '/Users/cltbld/talos-slave'
>+            elif slave_matches('xp', 'w7'):
>+                options.basedir = r'c:\talos-slave'
>+
>+        if not options.basedir:
>+            parser.error("could not determine basedir")
>+
>+    if not os.path.exists(options.basedir):
>+        parser.error("basedir '%s' does not exist" % options.basedir)
>+
>+    # check for DO_NOT_START
>+    do_not_start = os.path.join(options.basedir, "DO_NOT_START")
>+    while os.path.exists(do_not_start):
>+        if options.verbose:
>+            print "waiting until '%s' goes away" % do_not_start
>+        time.sleep(10)
>+
>+    if not options.allocator_url:
>+        options.allocator_url = default_allocator_url

better to do parser.set_defaults(allocator_url=default_allocator_url) above?
Attachment #494870 - Flags: feedback?(catlee) → feedback+
(In reply to comment #7)
> Is LOCKFILE useful to maintain?

I don't know if it's necessary to the CentOS initscript infrastructure, so I figured I'd leave it in.  I'll take it out if you know that to be harmless.

> better to do parser.set_defaults(allocator_url=default_allocator_url) above?

Fair enough.
Created attachment 495099 [details] [diff] [review]
tools-r3.patch
Attachment #494870 - Attachment is obsolete: true
Attachment #495099 - Flags: review?(bhearsum)
Attachment #494870 - Flags: review?(bhearsum)
Comment on attachment 495099 [details] [diff] [review]
tools-r3.patch

I don't think production build machines should pull config information from people.m.c. Can we at least move this to cruncher if we don't have a permanent home yet?

Is RUNSLAVE correct? In the runslave.py header you claim that it is in /usr/local/bin/runslave.py. The latter makes more sense since /builds doesn't exist everywhere.

I think it would be better to assign all of the things from options that you want, rather than the whole dict. It makes it easier to get an overview of BuildbotTac that way.

If I'm reading this right, verbose mode will actually change the behaviour of the script if there's an error fetching or writing the tac file. If that's really what you want to do, I think it should be controlled if "--abort-on-error" or something like that. verbosity shouldn't change behaviour IMHO.

Other than the above, this seems good to me.
Attachment #495099 - Flags: review?(bhearsum) → review-
Comment on attachment 494866 [details] [diff] [review]
puppet-manifests-r1.patch

Can you put all of this in a new file/class? startup.pp maybe? buildslave/staging-buildslave is purposely sparse. Also..it's weird to have Fedora including the buildslave class....

I don't see the launchd startuptype being handled anywhere.
Attachment #494866 - Flags: review?(bhearsum) → review-
(In reply to comment #10)
> Comment on attachment 495099 [details] [diff] [review]
> tools-r3.patch
> 
> I don't think production build machines should pull config information from
> people.m.c. Can we at least move this to cruncher if we don't have a permanent
> home yet?

Well, there's nothing *on* people at the moment, so even a failure there won't affect production, but I'm happy to put it wherever you suggest.  It's easy enough to change now or later.  I don't really want to hold up landing this while we bikeshed about where to drop these files.

> Is RUNSLAVE correct? In the runslave.py header you claim that it is in
> /usr/local/bin/runslave.py. The latter makes more sense since /builds doesn't
> exist everywhere.

I'm not sure what RUNSLAVE is, but yes, it's in /usr/local/bin/runslave.py

> I think it would be better to assign all of the things from options that you
> want, rather than the whole dict. It makes it easier to get an overview of
> BuildbotTac that way.

This is probably a stylistic issue, but I don't like threading parameters through by repeating the parameter name everywhere (in the constructor invocation, constructor parameter list, and twice in the assignment in the constructor).  BuildbotTac could easily just become a function or two at this point, anyway.

> If I'm reading this right, verbose mode will actually change the behaviour of
> the script if there's an error fetching or writing the tac file. If that's
> really what you want to do, I think it should be controlled if
> "--abort-on-error" or something like that. verbosity shouldn't change behaviour
> IMHO.

Oops, that 'raise' shouldn't be in there.
(In reply to comment #11)
> Can you put all of this in a new file/class? startup.pp maybe?
> buildslave/staging-buildslave is purposely sparse. Also..it's weird to have
> Fedora including the buildslave class....

I can put it in a new class.  It is weird to have fedora including the buildslave class, but there's a lot of weird stuff going on in the puppet manifests, and this change seems in line with that.  The puppet manifests need an overhaul to separate concerns a bit better, but this is not the time or the bug for it..

> I don't see the launchd startuptype being handled anywhere.

Well, right - check the summary of the bug :)
Created attachment 495157 [details] [diff] [review]
m616003-puppet-manifests-r2.patch

moved to a new class as requested
Attachment #494866 - Attachment is obsolete: true
Attachment #495157 - Flags: review?(bhearsum)
Created attachment 495158 [details] [diff] [review]
m616003-tools-r4.patch

I moved all of the logic into the class and renamed it RunSlave, as well as fixing the error handling.

This still points to people, since I don't know where else to put it.  Open to suggestions there, but I think that can safely be changed without additional review.

This has been re-staged, with the new manifests, successfully.
Attachment #495099 - Attachment is obsolete: true
Attachment #495158 - Flags: review?(bhearsum)
(In reply to comment #12)
> (In reply to comment #10)
> > Comment on attachment 495099 [details] [diff] [review] [details]
> > tools-r3.patch
> > 
> > I don't think production build machines should pull config information from
> > people.m.c. Can we at least move this to cruncher if we don't have a permanent
> > home yet?
> 
> Well, there's nothing *on* people at the moment, so even a failure there won't
> affect production, but I'm happy to put it wherever you suggest.  It's easy
> enough to change now or later.  I don't really want to hold up landing this
> while we bikeshed about where to drop these files.

OK, please do change it. I'm just seeking to avoid having a landed, in production copy that's pulling from there.

> > Is RUNSLAVE correct? In the runslave.py header you claim that it is in
> > /usr/local/bin/runslave.py. The latter makes more sense since /builds doesn't
> > exist everywhere.
> 
> I'm not sure what RUNSLAVE is, but yes, it's in /usr/local/bin/runslave.py

I'm talking about the RUNSLAVE definition from linux-initd-buildbot.sh -- it says "RUNSLAVE=/builds/runslave.py"


(In reply to comment #13)
> (In reply to comment #11)
> > Can you put all of this in a new file/class? startup.pp maybe?
> > buildslave/staging-buildslave is purposely sparse. Also..it's weird to have
> > Fedora including the buildslave class....
> 
> I can put it in a new class.  It is weird to have fedora including the
> buildslave class, but there's a lot of weird stuff going on in the puppet
> manifests, and this change seems in line with that.The puppet manifests need
> an overhaul to separate concerns a bit better, but this is not the time or the
> bug for it..

I don't agree that fedora including "buildslave" is in line with existing weirdness. There's a very clear separation of build and test slave stuff in the manifests, and this breaks that.

> > I don't see the launchd startuptype being handled anywhere.
> 
> Well, right - check the summary of the bug :)

Duh, thanks for the reminder.
Comment on attachment 495157 [details] [diff] [review]
m616003-puppet-manifests-r2.patch

Per my other comment, Fedora should include buildbot-startup directly, not through buildslave.
Can you delete /etc/init.d/buildbot-tac in addition to disabling the inetd service?
Attachment #495157 - Flags: review?(bhearsum) → review-
Comment on attachment 495158 [details] [diff] [review]
m616003-tools-r4.patch

Sorry, but I like this even less than the previous one....done like this, it's not possible to re-use the RunSlave class in any non-command-line context.

If we can't agree on the explicit arg passing, can you use the previous version w/ the verbose mode bugfix?
Attachment #495158 - Flags: review?(bhearsum) → review-
Depends on: 617396
Created attachment 495911 [details] [diff] [review]
m616003-puppet-manifests-r3.patch

-r2, plus include buildslave-startup from the centos and fedora classes instead.  And delete /etc/init.d/buildbot-tac
Attachment #495157 - Attachment is obsolete: true
Attachment #495911 - Flags: review?(bhearsum)
Created attachment 495912 [details] [diff] [review]
m616003-tools-r5.patch

linux-initd-buildbot: fix RUNPYTHON value

runpython.py: revert to -r3 form, with 'raise' removed

Staged on darwin, fedora, and centos.  No change on darwin, and fedora and centos both did the right thing.
Attachment #495158 - Attachment is obsolete: true
Attachment #495912 - Flags: review?(bhearsum)
Attachment #495912 - Flags: review?(bhearsum) → review+
Comment on attachment 495911 [details] [diff] [review]
m616003-puppet-manifests-r3.patch

I'm still seeing fedora including "buildslave".
Weren't you going to remove the build-tools RPM, too?
> Weren't you going to remove the build-tools RPM, too?

That's a separate bug - bug 615301.  

As for the include in talosslave.pp, just an oversight.  I'll take the r+ to mean it's OK with that removed.
Comment on attachment 495911 [details] [diff] [review]
m616003-puppet-manifests-r3.patch

0395b9a64270 - review via IRC
Attachment #495911 - Flags: review?(bhearsum)
Attachment #495911 - Flags: review+
Attachment #495911 - Flags: checked-in+
Comment on attachment 495912 [details] [diff] [review]
m616003-tools-r5.patch

pushed as a1d4cb6cfafd
Attachment #495912 - Flags: checked-in+
Created attachment 496198 [details] [diff] [review]
m616003fix-r1.patch

I somehow managed to re-introduce a typo when reverting to the earlier runslave.py.  With this fix, staging systems seem to be starting without any trouble.
Attachment #496198 - Flags: review?(bhearsum)
Attachment #496198 - Flags: review?(bhearsum) → review+
Comment on attachment 496198 [details] [diff] [review]
m616003fix-r1.patch

59640dd90609
Attachment #496198 - Flags: checked-in+
All of the staging slaves are humming along perfectly, but we're backing this out to re-land it tomorrow.

puppet-manifests: 8a62a0769b1c
tools: 6dc81ad838ae
puppet-files: not in version control; rolled back by hand
landed again, and into production.

puppet-manifests: 5092d483dec1
tools: 21c07e261e2a
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Might have a regression from this - nagios is reporting some linux-ix-slaveN with 2 buildbot processes running. eg linux-ix-slave19 has an (abridged) 'ps xf' of
  PID TTY      STAT   TIME COMMAND
 3170 ?        Ss     0:00 /tools/python/bin/python /usr/local/bin/runslave.py
 3198 ?        S      0:00  \_ /tools/buildbot-0.8.0/bin/python /tools/buildbot/bin/buildbot start /builds/slave
 3199 ?        Z      0:00      \_ [buildbot] <defunct>
 3202 ?        Sl     0:00 /tools/buildbot-0.8.0/bin/python /tools/buildbot/bin/buildbot start /builds/slave
I saw that on Darwin in testing bug 616350, but assumed it had to do with un-deactivated plist files.

I'll look into this ASAP in bug 619082
No longer depends on: 617396
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.