Closed Bug 712205 Opened 9 years ago Closed 5 years ago

slaves should always have tools checked out and up to date

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Unassigned)

References

Details

(Whiteboard: [unittest][talos][automation])

Attachments

(1 file, 1 obsolete file)

Most of our jobs require hg.m.o/build/tools in some way.

we should stop clobbering this and checking it out for every build / test run and instead make sure it's checked out and up to date as part of machine startup.
Duplicate of this bug: 590969
Duplicate of this bug: 629361
Duplicate of this bug: 678182
Duplicate of this bug: 600967
Whiteboard: [unittest][talos]
Priority: -- → P4
Whiteboard: [unittest][talos] → [unittest][talos][automation]
Duplicate of this bug: 716888
Man, I could really use this for bug 660480.

A quick survey of our talos platforms indicates that some platforms have a longstanding tools checkout while some do not:

fed & fed64
/tools/build-tools

leopard
/tools/build-tools (NOT UP-TO-DATE)

xp
C:\tools

r4-lion, r4-snow, & w7
none

Based on the above, I would vote to standardize on /tools/build-tools and C:\tools\build-tools as locations for our permanent checkouts when we fix this.
If we had this (on a tangent), we could also move runslave.py into tools and always have an up-to-date version of it. This would be beneficial on Windows, since it would then live in a non-admin directory and hence not trigger a UAC prompt.
Duplicate of this bug: 733658
Component: Release Engineering → Release Engineering: Automation
QA Contact: release → catlee
Duplicate of this bug: 659344
Component: Release Engineering: Automation → Release Engineering: Machine Management
QA Contact: catlee → armenzg
taking per joduinn
Assignee: nobody → bugspam.Callek
Priority: P4 → P1
If I'm expected to test this myself, I'll need access to a puppet master, one of each type of system, and opsi, as well as a crash-course in how to stage stuff with puppet and how to actually *use* opsi rather than just write manifests for it.

Though this should be simple enough in terms of the puppet/opsi manifests that I can probably do the work with minimal testing by someone else. (If I need to test it myself, I'll file bugs for the access; otherwise I'll just chug away at this anyway.)
To be honest, I don't know why we want to use puppet/opsi for this.
I would prefer not to clobber tools, to share it amongst builders, and to keep it up to date.
If we can make the step smart enough to retry when hg is down, and then retry the job after a few attempts, that would be *my* ideal.

I say this because we will eventually move to mozharness and to newer slaves, and I don't think it would be good to carry over puppet and opsi dependencies.

On another note, we don't keep a copy of mozilla-central checked out with puppet/opsi and I don't know why the tools repo should be different.

What do you guys think?
I think tools is fundamentally different from mozilla-central. We rely on tools for all sorts of things; it's critical to the 'setup' of many types of builds.

We shouldn't be clobbering/checking out tools for every type of job we run. It should be always available up-to-date on the system.

I don't think that puppet/opsi necessarily have to do this. It probably belongs more as a 'pre-flight' type of task (see bug 712206).
Attached patch [puppet] WIP 0 (obsolete) — Splinter Review
*Very* rudimentary start to this.

The idea is that we'll have a preflight initd script which is itself responsible for launching buildbot if preflight tasks succeed.

catlee suggested run-parts, which this patch is beginning to prepare for. I'll just have to figure out if it's possible to have run-parts exit with an error if one of its sub-scripts exits with an error. If not, I'll figure out a different way to do this.

None of this is tested, but I wanted to dump it public before I head out for the day.

Since we can't rely on anything in tools/ during the clone/update of itself, I'm writing basic retry and clone/checkout logic directly in this script.

There will need to be changes to make buildbot recognize and use the tools dir(s) across slave types of course.
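Since nothing in tools/ can be relied on while tools/ itself is being fetched, the script has to carry its own retry and clone/update logic. The following is only a sketch of that approach; the `retry` and `clone_or_update` helper names, the attempt count, and the delay are illustrative, not taken from the attached patch.

```shell
#!/bin/bash
# Sketch: retry wrapper plus clone-or-update for the tools checkout.
# Assumed names; the real WIP patch may differ.

retry() {
    # retry <max_tries> <delay_seconds> <command...>
    local max=$1 delay=$2 try; shift 2
    for try in $(seq 1 "$max"); do
        if "$@"; then
            return 0
        fi
        echo "attempt ${try}/${max} failed: $*" >&2
        sleep "$delay"
    done
    return 1
}

clone_or_update() {
    # Pull/update an existing checkout, or clobber and clone fresh.
    local repo=$1 dir=$2
    if [ -d "${dir}/.hg" ]; then
        hg -R "${dir}" pull -u "${repo}"
    else
        rm -rf "${dir}"     # clobber any partial/strange checkout
        hg clone "${repo}" "${dir}"
    fi
}

# Usage sketch:
# retry 3 30 clone_or_update https://hg.mozilla.org/build/tools /tools/build-tools
```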
Comment on attachment 608457 [details] [diff] [review]
[puppet] WIP 0

Looks like a good start. Will need to pull/update the directory if it exists.

It really really sucks that we have to replicate hgtool in bash here...I wonder if it's worthwhile packaging hgtool and its dependencies as a standalone package that puppet can deploy and use.
(In reply to Chris AtLee [:catlee] from comment #15) 
> It really really sucks that we have to replicate hgtool in bash here...I
> wonder if it's worthwhile packaging hgtool and its dependencies as a
> standalone package that puppet can deploy and use.

Could we deploy a bundle with puppet and then just update it on every reboot?
(In reply to Chris AtLee [:catlee] from comment #15)
> Comment on attachment 608457 [details] [diff] [review]
> [puppet] WIP 0
> 
> Looks like a good start. Will need to pull/update the directory if it exists.
> 
> It really really sucks that we have to replicate hgtool in bash here...I
> wonder if it's worthwhile packaging hgtool and its dependencies as a
> standalone package that puppet can deploy and use.

I would love to, but part of our problem with deploying hgtool separately is that it's currently not really splittable from our tools repo magic; lines 11-13 of http://mxr.mozilla.org/build/source/tools/buildfarm/utils/hgtool.py show why.

(In reply to Chris Cooper [:coop] from comment #16)
> (In reply to Chris AtLee [:catlee] from comment #15) 
> > It really really sucks that we have to replicate hgtool in bash here...I
> > wonder if it's worthwhile packaging hgtool and its dependencies as a
> > standalone package that puppet can deploy and use.
> 
> Could we deploy a bundle with puppet and then just update it on every reboot?

I know we can probably deploy an initial hg bundle of tools and likely update from that, but keeping the bundle itself up to date is going to be a pain with our current setups. I'll first verify that updating tools from an existing checkout works, and if so pursue that solution.
So far untested, but early-feedback welcome
Attachment #608457 - Attachment is obsolete: true
Attachment #611075 - Flags: feedback?(coop)
Attachment #611075 - Flags: feedback?(catlee)
Comment on attachment 611075 [details] [diff] [review]
[puppet-manifests] WIP 1

Review of attachment 611075 [details] [diff] [review]:
-----------------------------------------------------------------

Some nits, but looking good for Linux so far.

::: modules/buildslave/files/preflights/10-toolsdir
@@ +21,5 @@
> +    echo "[paths]" > ${TOOLS_DIR}/.hg/hgrc
> +    echo "default = ${REPO_URL}" >> ${TOOLS_DIR}/.hg/hgrc
> +fi
> +
> +set +e # Turn off errors so we continue if hgtool fails

What's the non-fatal failure mode that this is meant to avoid?

::: modules/buildslave/manifests/preflights.pp
@@ +5,5 @@
> +#  include buildslave::preflights
> +#
> +# This class takes care of installing runslave.py and any other prerequisites,
> +# and hooking into the startup process as appropriate to the system.  The
> +# interface between this class and the puppet startup is somewhat ill-defined,

"ill-defined" how? Are we talking about when in the sequence this module gets run?

::: modules/buildslave/manifests/startup.pp
@@ +42,5 @@
>                      group  => "root",
> +                    mode   => 755,
> +                    notify => Exec["reset-preflight-service"],
> +                    require => File["/etc/preflight/preflight.d"];
> +                #"/etc/init.d/buildbot":

I would just remove this section rather than comment it out.
Attachment #611075 - Flags: feedback?(coop) → feedback+
Comment on attachment 611075 [details] [diff] [review]
[puppet-manifests] WIP 1

Review of attachment 611075 [details] [diff] [review]:
-----------------------------------------------------------------

I like the overall direction here. Some small things to clean up, but nothing major.

::: modules/buildslave/files/linux-initd-buildbot.sh
@@ +1,2 @@
>  #! /bin/bash
> +#  initscript for preflight checks

I can't tell from the diff if you've renamed this file or modified the original...hopefully you're creating a new one!

@@ +31,5 @@
> +start_preflights() {
> +    # Explicitly allow us to continue when something fails
> +    set +e
> +
> +    echo > $LOGFILE # Empty logfile on boot to keep it from growing too much

we may want to use some log rotation here instead. wiping it on boot could make future debugging hard.

@@ +35,5 @@
> +    echo > $LOGFILE # Empty logfile on boot to keep it from growing too much
> +
> +    # Unable to use run-parts because it does not exit on error
> +    # Ignore *~ and *, scripts
> +    for i in ${PREFLIGHT_DIR}/*[^~,] ; do

will this sort properly? will 2-foo be before or after 11-bar?

@@ +36,5 @@
> +
> +    # Unable to use run-parts because it does not exit on error
> +    # Ignore *~ and *, scripts
> +    for i in ${PREFLIGHT_DIR}/*[^~,] ; do
> +        [ -d $i ] && continue

what does this do? skip directories? comments FTW!

@@ +39,5 @@
> +    for i in ${PREFLIGHT_DIR}/*[^~,] ; do
> +        [ -d $i ] && continue
> +    
> +        if [ -x $i ]; then
> +            print "====== Starting Preflight: '$i' ======" \

itym echo?

::: modules/buildslave/files/preflights/10-toolsdir
@@ +8,5 @@
> +PYTHON=python
> +
> +if [ ! -d ${TOOLS_DIR}/.hg ]
> +then
> +    # Something is strange, clobber it

some logging here would be good!

@@ +21,5 @@
> +    echo "[paths]" > ${TOOLS_DIR}/.hg/hgrc
> +    echo "default = ${REPO_URL}" >> ${TOOLS_DIR}/.hg/hgrc
> +fi
> +
> +set +e # Turn off errors so we continue if hgtool fails

you can also wrap this in

if ( ! ${PYTHON} ... ); then
# something went wrong...
fi

commands that exit with non-zero inside () don't cause the script to exit when -e is set
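A standalone illustration of this errexit behavior (not from the patch; `fail_cmd` is a stand-in for the real hgtool invocation): negating the command inside a subshell lets a `set -e` script observe and handle the failure instead of dying on it.

```shell
#!/bin/bash
# Demonstrates handling a failing command under `set -e` without
# toggling errexit off and back on with set +e / set -e.
set -e

fail_cmd() { return 1; }   # stand-in for the real hgtool invocation

if ( ! fail_cmd ); then
    echo "hgtool stand-in failed; clean up and report the error here"
fi

echo "script is still running despite set -e"
```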

::: modules/buildslave/manifests/startup.pp
@@ +60,1 @@
>                      command => "/sbin/chkconfig --del buildbot && /sbin/chkconfig --add buildbot",

should this be resetting the preflight service instead?
Attachment #611075 - Flags: feedback?(catlee) → feedback+
(In reply to Chris Cooper [:coop] from comment #19)
> ::: modules/buildslave/files/preflights/10-toolsdir
> @@ +21,5 @@
> > +    echo "[paths]" > ${TOOLS_DIR}/.hg/hgrc
> > +    echo "default = ${REPO_URL}" >> ${TOOLS_DIR}/.hg/hgrc
> > +fi
> > +
> > +set +e # Turn off errors so we continue if hgtool fails
> 
> What's the non-fatal failure mode that this is meant to avoid?

Not so much non-fatal (since we'll still treat it as fatal), but we want to clean up properly after the failed attempt (the rm following that), so that if automation then reboots the slave (kittenherder?) it can try again.

(In reply to Chris AtLee [:catlee] from comment #20)
> ::: modules/buildslave/files/linux-initd-buildbot.sh
> @@ +1,2 @@
> >  #! /bin/bash
> > +#  initscript for preflight checks
> 
> I can't tell from the diff if you've renamed this file or modified the
> original...hopefully you're creating a new one!

Mentioned in IRC, but this is because hg and BMO fail to communicate renames: I did an hg rename/copy for some of this stuff, and BMO only shows the old filename.

> @@ +35,5 @@
> > +    echo > $LOGFILE # Empty logfile on boot to keep it from growing too much
> > +
> > +    # Unable to use run-parts because it does not exit on error
> > +    # Ignore *~ and *, scripts
> > +    for i in ${PREFLIGHT_DIR}/*[^~,] ; do
> 
> will this sort properly? will 2-foo be before or after 11-bar?

It does sort, but with the same lexical ordering `ls *` uses, which means 2-foo will come after 11-bar but before 2-zoo and 3-foo.
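A quick standalone check of that ordering (not part of the patch): glob expansion sorts lexically, not numerically, so numeric prefixes need consistent zero-padding if strict ordering matters. `LC_ALL=C` pins the collation so the result is deterministic.

```shell
#!/bin/bash
# Show the lexical order a glob produces for run-parts-style names.
export LC_ALL=C
dir=$(mktemp -d)
touch "$dir/2-foo" "$dir/11-bar" "$dir/2-zoo" "$dir/3-foo"
for f in "$dir"/*; do
    basename "$f"
done
# Output order: 11-bar, 2-foo, 2-zoo, 3-foo
rm -rf "$dir"
```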
Depends on: 743316
I filed bug 764102 to remove the dependency to the tools repo for *test* slaves.
Component: Release Engineering: Machine Management → Release Engineering: Platform Support
QA Contact: armenzg → coop
Callek: if you don't have time to work on this, please consider offloading it to Pete or Simone.

Fixing this bug will remove most of the need to fix bug 692715. Many of the failures in bug 692715 seem to occur while cleaning up tools prior to re-cloning.
Simone, Pete, is this a bug one of you would want to drive to completion?
Flags: needinfo?(sbruno)
Flags: needinfo?(pmoore)
(In reply to Justin Wood (:Callek) from comment #24)
> Simone, Pete, is this a bug one of you would want to drive to completion?

In case it's not clear from the comments, this represents a simplification of our current process, saves us a non-zero amount of time per build/test, and will likely fix or neuter several other bugs in the process.
Sure
Flags: needinfo?(pmoore)
Product: mozilla.org → Release Engineering
Hi, I've got a note for bld-centos6-hp-002 which says that sbruno is using this host.
Is that still the case? Fine if so.
Hi Armen, yes I am using that host for some puppet tests.
Flags: needinfo?(sbruno)
Had a brief chat with Simone and Pete this morning about how this could work. Capturing that conversation here for posterity.

Using puppet was brought up as a potential solution. The drawback here is that we would still need an alternate solution for Windows. Maybe this is an option if the follow-on work from bug 794612 goes well.

I think the best place to integrate this change is in the pre-flight code that launches runslave.py. There are different scripts per platform here:

https://hg.mozilla.org/build/puppet/file/618c178fd73e/modules/buildslave/files

Cloning/updating tools would be another pre-flight check: no tools checkout, no buildbot startup.

I had initially thought about adding the check directly to runslave.py, but I don't think that makes sense now. There are bound to be platform-specific foibles that are best contained in the existing platform-specific scripts (IMO).
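The gating described above could look something like the following sketch. The helper name and the paths in the usage line are illustrative, not the real manifests; the point is just that buildbot only starts when every preflight script succeeds.

```shell
#!/bin/bash
# Sketch: run each executable preflight script in order; any failure
# blocks buildbot startup. Names are assumed, not from the repo.

run_preflights() {
    local dir=$1 script
    for script in "$dir"/*; do
        [ -d "$script" ] && continue    # skip directories
        [ -x "$script" ] || continue    # skip non-executable leftovers
        if ! "$script"; then
            echo "preflight '$script' failed" >&2
            return 1
        fi
    done
    return 0
}

# Usage sketch: no tools checkout (or any other failed preflight),
# no buildbot startup.
# run_preflights /etc/preflight/preflight.d && python runslave.py
```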
Also, I should mention that the hard part here will be ripping out the existing tools checkout code from buildbotcustom and then verifying everything still works.

Might be worthwhile maintaining the current workflow in buildbotcustom for release branches until we're sure this is working for nightly CI.
Assignee: bugspam.Callek → sbruno
Planning to work on this next week (16-20 Sept)
No longer blocks: 712206
Depends on: 712206
Duplicate of this bug: 764102
Assignee: sbruno → nobody
Duplicate of this bug: 646580
Duplicate of this bug: 695220
I think this is fixed by runner?
Flags: needinfo?(catlee)
Yup. We need that deployed everywhere though.
Flags: needinfo?(catlee)
Runner does this now!
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard