slave pre-flight tasks (runner)

Status: RESOLVED FIXED
Product: Release Engineering
Component: Platform Support
Priority: P3
Severity: normal
Reported: 6 years ago
Last updated: 3 years ago

People

(Reporter: catlee, Assigned: mrrrgn)

Tracking

(Blocks: 2 bugs)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2714] [unittest][talos][automation][q4goal])

Attachments

(2 attachments, 5 obsolete attachments)

(Reporter)

Description

6 years ago
A lot of slave maintenance is done at the beginning of build and test jobs that could instead be done before buildbot even starts.

Things like doing clobbers, purging old directories, and making sure the tools checkout is up to date.

In the case where we have spare capacity in the pool, this can be a slight win for end-to-end time since builds don't have to waste time cleaning up. In the case where we're maxed out, it's no worse than the current situation.

Pre-flight checks could be added here as well (can I compile something? is my display correct?), and we could refuse to start buildbot if the checks fail.
Whiteboard: [unittest][talos]
(Reporter)

Updated

6 years ago
Depends on: 596748

Updated

5 years ago
Priority: -- → P4
Whiteboard: [unittest][talos] → [unittest][talos][automation]
Duplicate of this bug: 733400
(Reporter)

Updated

5 years ago
Component: Release Engineering → Release Engineering: Machine Management
QA Contact: release → armenzg
Among the things this should do is verify and, if necessary, correct the screen resolution on all affected platforms.

I can see two ways to implement this:
 1. add functionality to runslave.py to run a directory full of scripts
 2. insert a step *before* runslave.py on all platforms that runs a directory full of scripts
(Reporter)

Updated

5 years ago
Blocks: 801607
(Reporter)

Comment 3

5 years ago
I think we need this soon
Priority: P4 → P2
I can help with the Windows side as I have increased my expertise by setting up the Windows 8, 7, and 64-bit machines several times in the last two quarters.
I'm thinking that we should add bad slaves to the "see also" field and keep track of the weird ways they start burning jobs even when hardware diagnostics don't find any problems.
See Also: → bug 797245
I would also add the ability to report issues to a dashboard.
I would like to avoid having to go and check each individual slave when it fails to start.

Updated

4 years ago
Component: Release Engineering: Machine Management → Release Engineering: Platform Support
QA Contact: armenzg → coop
Product: mozilla.org → Release Engineering
Blocks: 906706
We need to clean up temporary files on bld-lion machines, as in bug 906706.
Blocks: 880003
It would be great if we had a script that would help us determine whether the right graphics card is installed on the win7 machines; see https://bugzilla.mozilla.org/show_bug.cgi?id=873566#c24.
An approach to slave pre-flight tasks for puppet:
https://github.com/catlee/puppet/compare/master...startup
Very cool!  I can't comment on the "compare", so a few notes here:

* /etc/preflight.d should be purged, so that puppet can remove no-longer-desired scripts

* the reboot should be delayed or, given the presence of slave-health/kittenherder, maybe just hang forever on the assumption it will be rebooted externally?  At any rate, trying to fix something that's rebooting every 30s is no fun at all.

* As for starting Buildbot, it might be easiest to just invoke this script from the various shell scripts used for puppet::atboot, since they already have a mechanism to delay buildbot startup.

* It'd be nicer to have preflight::step take a 'contents' attribute.  Then the in-module classes like preflight::steps::tools can use contents => template("${module_name}/tools.erb"), and preflight::step can also be used outside of the preflight module, with arbitrary scripts.

* Really minor, but grammatically I'd prefer preflight::step::tools (singular), just for the symmetry with preflight::step.

* It'd be great to have additional logging from these scripts, both to syslog and, potentially, to nagios via NSCA (a small syslog sketch follows at the end of this comment).

* Given the necessity of running these scripts on all platforms, would it be better to write them in Python or Ruby?  That would also allow libraries for platform-independent logging, etc.

I'm sad that we've been sitting on this since May :(
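
For the syslog half of that logging suggestion, a minimal Python sketch (the logger name and the /dev/log socket path are Linux-flavoured assumptions, not a final design; OS X and Windows would need different handlers):

    # Minimal sketch of per-task syslog logging; the logger name and the
    # socket path are assumptions for illustration only.
    import logging
    import logging.handlers

    log = logging.getLogger("preflight")
    log.setLevel(logging.INFO)
    log.addHandler(logging.handlers.SysLogHandler(address="/dev/log"))

    log.info("preflight task %s finished with status %d", "checkout_tools", 0)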

Updated

4 years ago
Assignee: nobody → armenzg
Whiteboard: [unittest][talos][automation] → [unittest][talos][automation][q4goal]
I would prefer writing this in Python to make it cross-platform.
It is also easier for people to contribute to it.

I assume there will be some Puppet and GPO to deploy some part of it.

I will do an analysis to see where to hook this up for each OS.
I can't look at this bug until past the work week.
I'm afraid this will be a missed goal.
Priority: P2 → P3
(Reporter)

Updated

4 years ago
Depends on: 930831
Simone: how do you feel about owning the whole thing, and not just the tools repo part?
Flags: needinfo?(sbruno)
Coop: it surely makes sense since I started working on Bug 712205. I'm taking this!
Assignee: armenzg → sbruno
Flags: needinfo?(sbruno)
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #10)
> * As for starting Buildbot, it might be easiest to just invoke this script
> from the various shell scripts used for puppet::atboot, since they already
> have a mechanism to delay buildbot startup.

+1

> * It'd be great to have additional logging from these scripts, both to
> syslog and, potentially, to nagios via NSCA.

I think we handle this. Windows will still suck.
 
> * Given the necessity of running these scripts on all platforms, would it be
> better to write them in Python or Ruby?  That would also allow libraries for
> platform-independent logging, etc.

Yes, I think we want python scripts/libs here.
Some other notes that have come out of conversations with releng this week:

* measuring improvement: we track current setup and teardown times per OS in brasstacks (http://brasstacks.mozilla.com/gofaster/#/, but the charts seem to be busted right now; I'll investigate). We should gather baseline stats now so we can assess whether (and hopefully how far) we've moved the needle at the end

* incremental gains: once we have our baselines, we can and should roll this out per OS as each step is ready, rather than waiting for a giant improvement across the board at the end. Making things a little bit better every week, even on one platform at a time, is progress.

* specific tasks:

** clobberer should run at start-up, although we will still need to check clobberer as part of the build, but only for the tree in question. Depending on how long a slave sits idle before its first job, there's a chance that a clobber request will come in between buildbot startup and job start.

** hg-shared: we should iterate through all the trees in the hg-shared dir and make sure they're updated to tip before starting buildbot (see the sketch after this comment). This will give jobs the minimum possible code delta to pull during the actual build process.

** moving post-flight tasks to pre-flight: obviously this makes sense too, but may require some special logic. Some of the post-flight tasks we do are specific to the current build, e.g. removing build and/or objdirs. Moving those tasks to pre-flight means you wouldn't have a current build as reference. Maybe post-flight tasks get merged in a later round of improvements here. I'm hoping better build system dependencies make this a non-issue in the near future.
** one pony, with rainbow hair

Let's get the basic framework in place before we start hammering every nail!
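
For the hg-shared item above, a rough sketch of what such a pre-flight task could look like (the /builds/hg-shared path and the plain `hg pull` per repository are assumptions, not the final runner task):

    # Rough sketch: refresh every repository under the hg-shared directory.
    import os
    import subprocess

    HG_SHARED = "/builds/hg-shared"  # assumed location of the share base

    def update_shared_repos(base=HG_SHARED):
        if not os.path.isdir(base):
            return
        for name in sorted(os.listdir(base)):
            repo = os.path.join(base, name)
            if not os.path.isdir(os.path.join(repo, ".hg")):
                continue
            # A failed pull for one repo shouldn't block the others or buildbot.
            try:
                subprocess.check_call(["hg", "-R", repo, "pull"])
            except subprocess.CalledProcessError as e:
                print("warning: pull failed for %s: %s" % (repo, e))

    if __name__ == "__main__":
        update_shared_repos()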

Updated

4 years ago
Depends on: 932877
(In reply to Chris Cooper [:coop] from comment #16)
> * measuring improvement: we track current setup and teardown times per OS in
> brasstacks (http://brasstacks.mozilla.com/gofaster/#/ but the charts seem to
> be busted right now,  I'll investigate). 

Filed bug 932877.

Updated

4 years ago
Duplicate of this bug: 783274

Updated

4 years ago
Depends on: 769384

Updated

4 years ago
Blocks: 827367
I am going to write a preflight_runner.py script that will be invoked just before runslave.py in the following files, which reside in http://hg.mozilla.org/build/puppet/file/default/modules/buildslave/files:

darwin-run-buildslave.sh
run-puppet-and-buildbot.sh
start-buildbot-win64.bat
startTalos-w7.bat
startTalos-w764.bat
startTalos-w864.bat
startTalos-xp.bat

Using these files as hook points for preflight calls will make it possible to roll out preflight scripts per OS, as proposed by coop.

preflight_runner.py will run the scripts in a directory, determining the order of execution based on some simple naming conventions.

I will start by implementing support for Python scripts, but it should not be difficult to extend this to direct support of native shell scripts (.sh scripts for Unix, .bat/.cmd scripts for Windows).
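
A minimal sketch of that runner, assuming a numeric-prefix naming convention (e.g. 10-clobber, 20-update-tools) and a task directory passed on the command line -- both are illustrative, not the final design:

    # Sketch of preflight_runner.py: run every executable script in a
    # directory in lexicographic order; refuse to continue (i.e. to start
    # buildbot) if any of them fails. The naming convention and default
    # directory are assumptions for illustration only.
    import os
    import subprocess
    import sys

    def run_preflight(task_dir):
        for name in sorted(os.listdir(task_dir)):
            path = os.path.join(task_dir, name)
            if not os.access(path, os.X_OK):
                continue
            print("running preflight task %s" % name)
            ret = subprocess.call([path])
            if ret != 0:
                sys.exit("preflight task %s failed with exit code %d" % (name, ret))

    if __name__ == "__main__":
        run_preflight(sys.argv[1] if len(sys.argv) > 1 else "preflight_tasks")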

Armen, can you please confirm that this plan is compatible with our last conversation?
Flags: needinfo?(armenzg)
First of all, could you please go to each Windows host and grab a recent version of start*.bat?
What is committed in puppet could easily be out of date, since we have refactored the Windows infrastructure with GPO and changed a bunch of assumptions.

This matches what we spoke about.
I have no problem with making the script more OS-centric; however, I would *suggest* not mixing the tasks in the same directory but splitting them by OS (e.g. win_preflight_tasks/, unix_preflight_tasks/, and osx_preflight_tasks/).
Flags: needinfo?(armenzg)
Hi Armen,

Thanks for your quick feedback.

I will have a look at the actual .bat files on the Windows hosts - even if there are differences, I believe the principle of using these files as hook points before runslave.py is called remains valid.

Regarding the preflight tasks, I completely agree with your recommendation; they will necessarily be written and stored separately for each OS, at least for two reasons:

1) because we may want to add different behavior per OS
2) because we may want to support native shell scripts
The scripts will, presumably, be installed by puppet, so I would recommend having the per-OS (or per-silo, really) logic handled in puppet, not by the runner script.
(Reporter)

Comment 24

3 years ago
I'd like to shift focus slightly for this. Instead of running pre-flight tasks on boot, we should run them before running buildbot. A minor distinction right now, but I'd like to get to the point where we're running some kind of loop like:

while True:
- run pre-flight tasks
- run buildbot for one job
- run post-flight tasks

then we can stop rebooting machines after every job.
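
As an illustrative sketch only (the three helpers are placeholders for whatever runner ends up calling, not an existing API):

    def run_preflight_tasks():
        pass  # clobbers, purge old dirs, update tools checkout, health checks

    def run_buildbot_for_one_job():
        pass  # start buildbot, take exactly one job, then return

    def run_postflight_tasks():
        pass  # per-job cleanup

    def slave_loop():
        while True:
            run_preflight_tasks()
            run_buildbot_for_one_job()
            run_postflight_tasks()
            # no reboot needed between iterations
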
Depends on: 1005936
(Reporter)

Comment 25

3 years ago
We're going to do this with runner:
https://github.com/catlee/runner
Assignee: sbruno → catlee
Duplicate of this bug: 1002398
(Reporter)

Updated

3 years ago
Blocks: 1019013
Created attachment 8436062 [details] [diff] [review]
runner.diff

Puppet module for installing and configuring Runner. Runner itself is already packaged as an sdist and uploaded. Unsure if :catlee wants a review on runner itself or not.
Attachment #8436062 - Flags: review?(dustin)
Attachment #8436062 - Flags: feedback?(catlee)
Comment on attachment 8436062 [details] [diff] [review]
runner.diff

Review of attachment 8436062 [details] [diff] [review]:
-----------------------------------------------------------------

It looks like this has a bunch of other stuff mixed in.  Can you re-post a diff containing only your changes?
Attachment #8436062 - Flags: review?(dustin)
Created attachment 8437030 [details] [diff] [review]
Puppet module for Catlee's Runner

A diff not created on a brain-addled Friday evening...
Attachment #8436062 - Attachment is obsolete: true
Attachment #8436062 - Flags: feedback?(catlee)
Attachment #8437030 - Flags: review?(dustin)
Attachment #8437030 - Flags: feedback?(catlee)
Comment on attachment 8437030 [details] [diff] [review]
Puppet module for Catlee's Runner

Review of attachment 8437030 [details] [diff] [review]:
-----------------------------------------------------------------

I think we talked a bit in irc about this, but just to follow up, it seems like dependencies are going to be the gotcha here, especially if this is going to be useful for other organizations (e.g., QA probably has some pre-flight tasks, but they're probably not very similar to releng's).

Using sorted, numbered scripts works OK if different numbers have meaning (e.g., "all repositories are up to date by level 10"), but they don't represent dependencies very well.  For example, if I rename 15-frobnicate.sh to 17-frobnicate.sh, how do I know that 16-barterize.sh didn't depend on it?

Since the runner itself is Python, it'd actually be pretty easy to specify dependencies explicitly, either in the scripts themselves (perhaps in comments) or in some nearby file (e.g., barterize.deps), and then flatten that dependency tree into a fixed order in the Python script.  That would also leave the door open to parallelizing the preflight tasks at a later date.  And if those dependencies are reflected in the Puppet configuration, then Puppet can ensure that every task's dependency tasks are also installed.

Systemd, launchd, and Gentoo all solve this a little bit differently, and they all have complicated syntax to say 'starts before this service *or* that service' or to refer to virtual services or whatever.  I don't see runner needing such levels of complexity.
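
Something like the following would be enough to flatten explicit dependencies into a run order (the task names reuse the made-up examples above):

    # Sketch: flatten per-task dependency declarations into a fixed order.
    def flatten(deps):
        """deps maps a task name to the list of tasks it must run after."""
        order, done = [], set()

        def visit(task, stack=()):
            if task in done:
                return
            if task in stack:
                raise ValueError("dependency cycle involving %s" % task)
            for dep in deps.get(task, ()):
                visit(dep, stack + (task,))
            done.add(task)
            order.append(task)

        for task in sorted(deps):
            visit(task)
        return order

    print(flatten({"frobnicate": [], "barterize": ["frobnicate"]}))
    # -> ['frobnicate', 'barterize'], regardless of file naming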

This, combined with the config handling I suggest below, would mean that toplevel::slave::build (or toplevel::slave::qa::tps_ci) only needs to have

  include runner::task::update_shared_repos
  include runner::task::clean_tempfiles

and all of the required configuration, dependent tasks, and runner itself would be installed automatically.

::: modules/runner/files/checkout_tools
@@ +6,5 @@
> +        exit $?
> +fi
> +HGTOOL=/usr/local/bin/hgtool.py
> +
> +TOOLS_REPO=$($RUNNER_CONFIG_CMD -g hg.tools_repo)

I don't see $RUNNER_CONFIG_CMD defined anywhere..

::: modules/runner/manifests/service.pp
@@ +6,5 @@
> +    include runner::settings
> +    case $::operatingsystem {
> +        'CentOS': {
> +            # Which service this relies on
> +            $initd_required_start = 'network'

What's this for?  Would this change?

::: modules/toplevel/manifests/slave/build.pp
@@ -33,5 @@
>      ccache::ccache_dir {
> -        "/builds/ccache":
> -            maxsize => "10G",
> -            owner => $users::builder::username,
> -            group => $users::builder::group;

Looks like you have some kind of automatic puppetlint thing going on?  It's OK for files you've modified, but not every file..

::: modules/toplevel/manifests/slave/build/mock.pp
@@ -17,5 @@
>      # good way to communicate the need to that class.
>      exec {
>          'add-builder-to-mock_mozilla':
> -            command => "/usr/bin/gpasswd -a $users::builder::username mock_mozilla",
> -            unless => "/usr/bin/groups $users::builder::username | grep '\\<mock_mozilla\\>'",

The addition of {..} here is legit, but we don't follow the style of lining up the => in a single column.

@@ +32,5 @@
> +            data        => template('toplevel/slave/build/hg.cfg.erb');
> +        'buildbot.cfg':
> +            sectionname => 'buildbot',
> +            data        => template('toplevel/slave/build/buildbot.cfg.erb');
> +    }

I think there are bits of the config stuff missing in this patch, so maybe I'm missing something, but -- why have all of these config files?  And it looks like each just has a single section?  You could use the 'concat' puppet module to concatenate those into a single file and make things easier for users to read in place.

From what I can tell, the configs are orthogonal to the tasks -- meaning that a task may use several configs, and several tasks may use a config.  What if each task included the configs it needed?

So runner::tasks::checkout_tools would have
  include runner::config::hg
and runner::tasks::update_shared_repos would have
  include runner::config::hg
  include runner::config::env
Attachment #8437030 - Flags: review?(dustin) → feedback+
(In reply to Dustin J. Mitchell [:dustin] from comment #30)
> Comment on attachment 8437030 [details] [diff] [review]
> Puppet module for Catlee's Runner
> 
> Review of attachment 8437030 [details] [diff] [review]:
> -----------------------------------------------------------------

Thanks a bunch for taking the time to review this so thoroughly, Dustin!

> I think we talked a bit in irc about this, but just to follow up, it seems
> like dependencies are going to be the gotcha here, especially if this is
> going to be useful for other organizations (e.g., QA probably has some
> pre-flight tasks, but they're probably not very similar to releng's).
> 
> Using sorted, numbered scripts works OK if different numbers have meaning
> (e.g., "all repositories are up to date by level 10"), but they don't
> represent dependencies very well.  For example, if I rename 15-frobnicate.sh
> to 17-frobnicate.sh, how do I know that 16-barterize.sh didn't depend on it?
> 
> Since the runner itself is Python, it'd actually be pretty easy to specify
> dependencies explicitly, either in the scripts themselves (perhaps in
> comments) or in some nearby file (e.g., barterize.deps), and then flatten
> that dependency tree into a fixed order in the Python script.  That would
> also leave the door open to parallelizing the preflight tasks at a later
> date.  And if those dependencies are reflected in the Puppet configuration,
> then Puppet can ensure that every task's dependency tasks are also installed.
> 
> Systemd, launchd, and Gentoo all solve this a little bit differently, and
> they all have complicated syntax to say 'starts before this service *or*
> that service' or to refer to virtual services or whatever.  I don't see
> runner needing such levels of complexity.
> 
> This, combined with the config handling I suggest below, would mean that
> toplevel::slave::build (or toplevel::slave::qa::tps_ci) only needs to have
> 
>   include runner::task::update_shared_repos
>   include runner::task::clean _tempfiles
> 
> and all of the reqiured configuration, dependent tasks, and runner itself
> would be installed automatically.

I need to grab :catlee to discuss this today. Will get back to you with my thoughts after that. 


> 
> ::: modules/runner/files/checkout_tools
> @@ +6,5 @@
> > +        exit $?
> > +fi
> > +HGTOOL=/usr/local/bin/hgtool.py
> > +
> > +TOOLS_REPO=$($RUNNER_CONFIG_CMD -g hg.tools_repo)
> 
> I don't see $RUNNER_CONFIG_CMD defined anywhere..

Runner sets this itself (and guarantees it is set).

Definitely worthy of a comment though. Will fix.


> 
> ::: modules/runner/manifests/service.pp
> @@ +6,5 @@
> > +    include runner::settings
> > +    case $::operatingsystem {
> > +        'CentOS': {
> > +            # Which service this relies on
> > +            $initd_required_start = 'network'
> 
> What's this for?  Would this change?

Good question. Will find out if it's still needed.

> ::: modules/toplevel/manifests/slave/build.pp
> @@ -33,5 @@
> >      ccache::ccache_dir {
> > -        "/builds/ccache":
> > -            maxsize => "10G",
> > -            owner => $users::builder::username,
> > -            group => $users::builder::group;
> 
> Looks like you have some kind of automatic puppetlint thing going on?  It's
> OK for files you've modified, but not every file..

Oh, damn. Apologies. 
> 
> ::: modules/toplevel/manifests/slave/build/mock.pp
> @@ -17,5 @@
> >      # good way to communicate the need to that class.
> >      exec {
> >          'add-builder-to-mock_mozilla':
> > -            command => "/usr/bin/gpasswd -a $users::builder::username mock_mozilla",
> > -            unless => "/usr/bin/groups $users::builder::username | grep '\\<mock_mozilla\\>'",
> 
> The addition of {..} here is legit, but we don't follow the style of lining
> up the => in a single column.

Will know for the future - thanks!


> 
> @@ +32,5 @@
> > +            data        => template('toplevel/slave/build/hg.cfg.erb');
> > +        'buildbot.cfg':
> > +            sectionname => 'buildbot',
> > +            data        => template('toplevel/slave/build/buildbot.cfg.erb');
> > +    }
> 
> I think there are bits of the config stuff missing in this patch, so maybe
> I'm missing something, but -- why have all of these config files?  And it
> looks like each just has a single section?  You could use the 'concat'
> puppet module to concatenate those into a single file and make things easier
> for users to read in place.
> 
> From what I can tell, the config's are orthogonal to the tasks - -meaning
> that a task may use several config's, and several tasks may use a config. 
> What if each task included the configs it needed?
> 
> So runner::tasks::checkout_tools would have
>   include runner::config::hg
> and runner::tasks::update_shared_repos would have
>   include runner::config::hg
>   include runner::config::env

There doesn't seem to be any config missing. Is there something in particular that made you think I may have forgotten something?

And that's a good point on being able to `include` the configs. I'll look into that today.

Thanks again.
RUNNER_CONFIG_CMD made me think something was missing, but I see the error of my ways now.
(Reporter)

Updated

3 years ago
Assignee: catlee → iconnolly
(Reporter)

Comment 33

3 years ago
Comment on attachment 8437030 [details] [diff] [review]
Puppet module for Catlee's Runner

Review of attachment 8437030 [details] [diff] [review]:
-----------------------------------------------------------------

We also spoke about this on IRC and Vidyo.

I like the idea of moving the configs into their own modules, related to which tasks are defined.

Let's leave inter-task dependencies for the next phase.
Attachment #8437030 - Flags: feedback?(catlee) → feedback+
Created attachment 8439356 [details] [diff] [review]
post-feedback diff

Updated runner module post-feedback with Dustin.

Just need a quick look at the config changes to make sure what I envisaged was aligned with what :dustin was thinking. 

Left inter-task dependencies to iteration n+1.
Attachment #8437030 - Attachment is obsolete: true
Attachment #8439356 - Flags: review?(dustin)

Updated

3 years ago
Duplicate of this bug: 764102

Updated

3 years ago
Blocks: 699195
Comment on attachment 8439356 [details] [diff] [review]
post-feedback diff

Review of attachment 8439356 [details] [diff] [review]:
-----------------------------------------------------------------

All pretty minor comments here -- nice work!

::: modules/runner/files/checkout_tools
@@ +1,1 @@
> +#!/bin/bash

Minor, but these files should have license headers on them

@@ +7,5 @@
> +fi
> +HGTOOL=/usr/local/bin/hgtool.py
> +
> +# $RUNNER_CONFIG_CMD is guaranteed to be set by Runner itself. See Runner
> +# documentation for more information.

Thanks :)

::: modules/runner/manifests/config.pp
@@ +1,5 @@
> +# This Source Code Form is subject to the terms of the Mozilla Public
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +# config file for runner
> +define runner::config($sectionname, $data) {

This appears unused now - remove?

::: modules/runner/manifests/service.pp
@@ +6,5 @@
> +    include runner::settings
> +    case $::operatingsystem {
> +        'CentOS': {
> +            # Which service this relies on
> +            $initd_required_start = 'network'

This is still hanging around..

::: modules/runner/manifests/tasks/checkout_tools.pp
@@ +3,5 @@
> +    file {
> +        '/tools/checkouts':
> +             ensure => directory,
> +             owner  => cltbld,
> +             group  => cltbld;

replace cltbld with $::config::builder_username

@@ +4,5 @@
> +        '/tools/checkouts':
> +             ensure => directory,
> +             owner  => cltbld,
> +             group  => cltbld;
> +     }

one space too many

::: modules/toplevel/manifests/slave/build/mock.pp
@@ +22,4 @@
>              require => [Class['packages::mozilla::mock_mozilla'], Class['users::builder']];
>      }
> +
> +    include runner

If you make each of the runner::tasks::* include runner, then you don't even need this here.  That might be a bit easier.
Attachment #8439356 - Flags: review?(dustin) → review+
(In reply to Dustin J. Mitchell [:dustin] from comment #36)
> Comment on attachment 8439356 [details] [diff] [review]
> post-feedback diff
> 
> Review of attachment 8439356 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> All pretty minor comments here -- nice work!

Thanks!

> 
> ::: modules/runner/files/checkout_tools
> @@ +1,1 @@
> > +#!/bin/bash
> 
> Minor, but these files should have license headers on them
> 

Fixed!


> 
> ::: modules/runner/manifests/config.pp
> @@ +1,5 @@
> > +# This Source Code Form is subject to the terms of the Mozilla Public
> > +# License, v. 2.0. If a copy of the MPL was not distributed with this
> > +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> > +# config file for runner
> > +define runner::config($sectionname, $data) {
> 
> This appears unused now - remove?

Done.

> 
> ::: modules/runner/manifests/service.pp
> @@ +6,5 @@
> > +    include runner::settings
> > +    case $::operatingsystem {
> > +        'CentOS': {
> > +            # Which service this relies on
> > +            $initd_required_start = 'network'
> 
> This is still hanging around..

Oops! My bad, I had already removed it but I must have generated the diff beforehand. Fixed.

> 
> ::: modules/runner/manifests/tasks/checkout_tools.pp
> @@ +3,5 @@
> > +    file {
> > +        '/tools/checkouts':
> > +             ensure => directory,
> > +             owner  => cltbld,
> > +             group  => cltbld;
> 
> replace cltbld with $::config::builder_username

Fixed.

> 
> @@ +4,5 @@
> > +        '/tools/checkouts':
> > +             ensure => directory,
> > +             owner  => cltbld,
> > +             group  => cltbld;
> > +     }
> 
> one space too many

Annoyed the linter didn't catch that. Thanks.

> 
> ::: modules/toplevel/manifests/slave/build/mock.pp
> @@ +22,4 @@
> >              require => [Class['packages::mozilla::mock_mozilla'], Class['users::builder']];
> >      }
> > +
> > +    include runner
> 
> If you make each of the runner::tasks::* include runner, then you don't even
> need this here.  That might be a bit easier.

Good idea. Done.
Created attachment 8441600 [details] [diff] [review]
runner.diff -- Changes made as per Dustin's 2nd review

Could I get a check-in? :)

Carrying the r+ from :dustin over from Attachment 8439356 [details] [diff].
Attachment #8439356 - Attachment is obsolete: true
Attachment #8441600 - Flags: review+
Attachment #8441600 - Flags: checked-in?
Comment on attachment 8441600 [details] [diff] [review]
runner.diff -- Changes made as per Dustin's 2nd review

remote:   https://hg.mozilla.org/build/puppet/rev/58a3c0e9ae64
remote:   https://hg.mozilla.org/build/puppet/rev/6fa2dc80d46a
Attachment #8441600 - Flags: checked-in? → checked-in+
(Reporter)

Comment 40

3 years ago
Comment on attachment 8441600 [details] [diff] [review]
runner.diff -- Changes made as per Dustin's 2nd review

We didn't end up with the proper boot order for some reason.
Attachment #8441600 - Flags: checked-in+ → checked-in-
Duplicate of this bug: 930831
Duplicate of this bug: 596748
Duplicate of this bug: 699195
Duplicate of this bug: 783253

Updated

3 years ago
Blocks: 785410
(Reporter)

Updated

3 years ago
Blocks: 1028191
Created attachment 8443701 [details] [diff] [review]
Runner - with boot order fixes

This should fix the boot order problems we were having previously. 

The problems were:

a) We had rc3.d/ symlinks for puppet and buildbot already on our existing machines. We were changing their boot priority in their init.d/ scripts, but the symlinks weren't being updated to reflect that. Puppet doesn't do this as part of its Service type unfortunately, but I've added an external command to reset these symlinks. (This was all working fine on *new* machines because the symlinks had the correct priority value if they were newly created.)

b) Another issue was that we were moving new services to run after puppet that weren't previously in the boot order after puppet (i.e. we were changing puppet's boot priority while it was running from its init script). This has been fixed by getting puppet to force a reboot whenever its initscript is changed.

These were the known issues that arose from the last deployment. Upon a reboot of a server configured with this patch, we correctly see puppet -> runner -> buildbot, so hopefully that means we're good.
Attachment #8441600 - Attachment is obsolete: true
Attachment #8443701 - Flags: review?(dustin)
Attachment #8443701 - Flags: checked-in?
Comment on attachment 8443701 [details] [diff] [review]
Runner - with boot order fixes

Review of attachment 8443701 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM. Just a couple of trailing space nazi comments below. :)

I can fix them when I land the patch.

::: manifests/moco-config.pp
@@ +225,4 @@
>      $selfserve_agent_carrot_exchange = "buildapi.control"
>      $selfserve_agent_carrot_queue = "buildapi-agent-rabbit2"
>  
> +

Can you remove the added empty line above?

::: modules/puppet/manifests/atboot.pp
@@ +27,5 @@
>      exec {
>          # ask the puppet startup script to reboot
>          "reboot-after-puppet":
> +            command => "touch /REBOOT_AFTER_PUPPET",
> +            path => ['/bin/', '/usr/bin/'],

From IRC, "touch" lives in /bin on CentOS. Good catch!

@@ +61,5 @@
> +                    # rc3.d symlink and we want to reboot because we've changed
> +                    # what we want to come after puppet in the boot order
> +                    notify => [ Exec['initd-refresh'], Exec['reboot-after-puppet'] ];
> +            }
> +            

.. and here

@@ +65,5 @@
> +            
> +            exec {
> +                'initd-refresh':
> +                    # resetpriorities tells chkconfig to update the 
> +                    # symlinks in rcX.d with the values from the service's 

and on these 2 lines

@@ +68,5 @@
> +                    # resetpriorities tells chkconfig to update the 
> +                    # symlinks in rcX.d with the values from the service's 
> +                    # init.d script
> +                    command => '/sbin/chkconfig puppet resetpriorities',
> +                    refreshonly => true;    

And here
(trailing spaces nazi mode off)

::: modules/puppet/templates/puppet-centos-initd.erb
@@ +8,5 @@
> +### BEGIN INIT INFO
> +# Provides:          puppet
> +# Required-Start:    $local_fs $network
> +# Required-Start:    $local_fs $network
> +# Should-Start:      $remote_fs 

Ooop, and here
Attachment #8443701 - Flags: review?(dustin) → review+
Created attachment 8445296 [details] [diff] [review]
cleaned runner.diff

Whitespace fixes from Rail's review.

Carrying over the r+.
Attachment #8443701 - Attachment is obsolete: true
Attachment #8443701 - Flags: checked-in?
Attachment #8445296 - Flags: review+
Attachment #8445296 - Flags: checked-in?
(Reporter)

Updated

3 years ago
Attachment #8445296 - Flags: checked-in? → checked-in+
Created attachment 8445328 [details] [diff] [review]
fix for the subset of machines that got the backed-out patch
Attachment #8445296 - Attachment is obsolete: true
We're sticking with this - will pull the fix out into a new bug.
(Reporter)

Updated

3 years ago
Attachment #8445296 - Attachment is obsolete: false
Depends on: 1029704
Depends on: 1029777
(Reporter)

Updated

3 years ago
Depends on: 1029903
(Reporter)

Updated

3 years ago
Blocks: 851123

Updated

3 years ago
Blocks: 1036468
Summary: slave pre-flight tasks → slave pre-flight tasks (runner)
Depends on: 1055806
(Reporter)

Updated

3 years ago
Depends on: 1069529
(Reporter)

Updated

3 years ago
Depends on: 1069530

Updated

3 years ago
Whiteboard: [unittest][talos][automation][q4goal] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2706] [unittest][talos][automation][q4goal]

Updated

3 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2706] [unittest][talos][automation][q4goal] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2714] [unittest][talos][automation][q4goal]
(Assignee)

Updated

3 years ago
Assignee: ian → winter2718
(Assignee)

Comment 50

3 years ago
It seems like this is essentially finished. I'm starting on getting rid of post-build reboots; I'll open bugs for any additional runner tasks that need creating, catch-as-catch-can.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED