Closed Bug 1393524 Opened 7 years ago Closed 7 years ago

Change mac-mini slaves to not always run puppet at boot.

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dhouse, Assigned: dhouse)

References

Details

Attachments

(5 files, 9 obsolete files)

2.58 KB, text/plain
Details
6.63 KB, patch
dividehex
: review+
dhouse
: checked-in+
Details | Diff | Splinter Review
648 bytes, patch
dividehex
: review+
dhouse
: checked-in+
Details | Diff | Splinter Review
772 bytes, patch
dividehex
: review+
dhouse
: checked-in+
Details | Diff | Splinter Review
3.57 KB, patch
dividehex
: review+
dragrom
: review+
dhouse
: checked-in+
Details | Diff | Splinter Review
Re: "firefox-ci > rebooting OS X 10.10 testers between jobs" (https://groups.google.com/a/mozilla.com/forum/?utm_medium=email&utm_source=footer#!topic/firefox-ci/mMEt8YmdsuA) On mac test/try slaves, we will stop running puppet at every boot to reduce the time spent for each reboot to get a slave back to running the next job. details: Puppet runs at boot. Mac-mini slaves are rebooting after each test. Puppet can take a few minutes to run. The reboot+puppetrun time slows down our testing. Because there may be residue left after try/tests and it would require unknown work to clean that up. Joel suggested "Personally I think rebooting every few jobs would be reasonable (maybe every 5th?)" Dustin suggested "The run-puppet script could add a timestamp on success and skip any runs within N seconds of that timestamp (say, 30 minutes)." Amy raised the concern that OSX screen resolution is set, or fixed, by puppet. So if we do not run puppet at every boot, there are some settings (only expecting screen resolution for now) that need to be set despite not running puppet (to put put into the at-boot script). https://hg.mozilla.org/build/puppet/file/tip/modules/screenresolution/manifests/init.pp
Through the gui module, https://hg.mozilla.org/build/puppet/file/tip/modules/gui/manifests/init.pp#l23 The screenresolution module is applied to test slave in: https://hg.mozilla.org/build/puppet/file/tip/modules/toplevel/manifests/slave/releng/test/headless.pp#l7 https://hg.mozilla.org/build/puppet/file/tip/modules/toplevel/manifests/slave/releng/test/gpu.pp#l7 # Noting:https://hg.mozilla.org/build/puppet/file/tip/modules/gui/manifests/init.pp#l24 # $on_gpu is irrelevant on Darwin - everything's onscreen, and gui also modifies appearance: https://hg.mozilla.org/build/puppet/file/tip/modules/gui/manifests/appearance.pp#l9 AppleShowScrollBars set-background-image restart-Dock So I'll want to test if these also need set (do I dare to search for other settings).
Blocks: 1386625
Puppet is set up to run at boot for macs in: https://hg.mozilla.org/build/puppet/file/tip/modules/puppet/manifests/atboot.pp#l100 Which places a script built from: https://hg.mozilla.org/build/puppet/file/tip/modules/puppet/templates/puppet-darwin-run-puppet.sh.erb which includes the generic puppet vars and run functions from: https://hg.mozilla.org/build/puppet/file/tip/modules/puppet/templates/puppet-atboot-common.erb So we can put the logic and always-settings into the darwin-only template shell script to only affect darwin boots. To restrict the logic to test slaves, the atboot.pp can include a stanza for tester class that tells the script what to do (flag, or other template or settings). For the screenresolution setting, to start with, we can clip from the module and assume the binary is installed: modules/screenresolution/manifests/init.pp 13 10.7: {¬ 14 $resolution = "${width}x${height}x${depth}"¬ 15 }¬ 16 10.8,10.9: {¬ 17 $resolution = "${width}x${height}x${depth}@${refresh}"¬ 18 }¬ 26 command => "/usr/local/bin/screenresolution set ${resolution}",¬ 27 unless => "/usr/local/bin/screenresolution get 2>&1 | /usr/bin/grep 'Display 0: ${resolution}'",
I haven't tested this yet. THis is a first pass to apply a time check and screenresolution for puppet:atboot on mac test slaves. When I have time to figure out the puppet details, the resolution seting would be better done by using the gui+screenresolution modules to create the resolution command instead of the copy-paste into atboot. It may be better to use a boot count, but the elapsed time seems good for puppet changes to be applied at next boot greater than 30min, and to have some consistency in the time added by puppet runs (only add the puppet run time every 30 minutes). Using a boot count could be better if we are accumulating changes at boot that puppet would have fixed, but I don't think we have anything like that.
Attachment #8900883 - Flags: review?(jwatkins)
I need to do something more to get puppet to re-run the puppet::atboot module, or I'm not matching the node definition: Testing on t-yosemite-r7-471.test.releng.mdc1.mozilla.com (:arr is on 472, :dragos is on 394): ``` [root@t-yosemite-r7-471.test.releng.mdc1.mozilla.com ~]# puppet agent --test --noop --environment=dhouse --server=releng-puppet2.srv.releng.scl3.mozilla.com [...] [root@t-yosemite-r7-471.test.releng.mdc1.mozilla.com ~]# puppet agent --test --environment=dhouse --server=releng-puppet2.srv.releng.scl3.mozilla.com [...] [root@t-yosemite-r7-471.test.releng.mdc1.mozilla.com ~]# ls -la /usr/local/bin/run-puppet.sh -rwxr-xr-x 1 root wheel 4778 May 17 12:44 /usr/local/bin/run-puppet.sh ```
My node regex was still from default, with only scl3 for t-.test -node /t.*-\d+\.test\.releng\.scl3\.mozilla\.com/ { +node /t.*-\d+\.test\.releng\.(scl3|mdc1)\.mozilla\.com/ {
Still needs some cleanup, but it works (tested t-yosemite-r7-471.mdc1) so I'd like to apply it Friday morning if it looks good enough for now.
Attachment #8900883 - Attachment is obsolete: true
Attachment #8900883 - Flags: review?(jwatkins)
Attachment #8900939 - Flags: review?(jwatkins)
I accidentally left a date format arg in the stat command in the script template.
Attachment #8900939 - Attachment is obsolete: true
Attachment #8900939 - Flags: review?(jwatkins)
Attachment #8900948 - Flags: review?(jwatkins)
Comment on attachment 8900948 [details] [diff] [review] tested config for 30min before puppet on reboot for macs We're looking to base the decision on whether or not puppet run on the number of taskcluster jobs that have been executed (number of time the machine has rebooted), not absolute wall clock time.
Attachment #8900948 - Flags: review?(jwatkins) → review-
Actually, thinking about this some more, we should probably check for either condition. We want to look for 5 reboots OR time > 1h (or 30m, whatever) to determine whether we should run puppet. We want to make sure that we aren't going too many jobs between puppet runs (something might be busted), but we also want to make sure that a machine doesn't sit idle for days at a time and not run puppet on the next reboot. My other concern is... what if we push out a busted puppet patch? This is going to increase the time between pushing a fix and having a machine pick it up. Can we build in a failsafe that will tell it to run puppet on the next reboot?
Attachment #8900948 - Attachment is obsolete: true
Attachment #8901249 - Flags: review?(jwatkins)
Attachment #8901249 - Flags: review?(arich)
I cleaned up the script. after review, I'll plan to apply it to a subset of mdc1 testers early next week.
Attachment #8901385 - Flags: review?(jwatkins)
Attachment #8901249 - Attachment is obsolete: true
Attachment #8901249 - Flags: review?(jwatkins)
Attachment #8901249 - Flags: review?(arich)
Comment on attachment 8901385 [details] [diff] [review] macs run_puppet if N runs skipped, if 30min since last run, or if hg hash changed. cleaned for testing next week Review of attachment 8901385 [details] [diff] [review]: ----------------------------------------------------------------- I really like the three-pronged approach here especially comparing the current repo hash. This is definitely a good comprehensive solution to this problem and should work well. Although do test it, I tend to miss things when it comes to shell scripts. ;-) The one BIG issue I see is this breaks the reboot semaphore function. If a reboot flag is set anytime during a puppet run, we want to make sure the host is rebooted and puppet is run again immediately at boot. This should cycle as many times as needed until puppet completes successfully and no reboot semaphore exists after. I think you can simply insert a check for the semaphore to ensure the limiting logic is overridden and puppet is run anyway. Also, let's move the 2 node scope vars to moco-config. Set them undef in config::base and set them to real values in moco-config (using a $::fqdn selector if needed) ::: manifests/moco-nodes.pp @@ +17,5 @@ > $slave_trustlevel = 'try' > + > + # Easy faster Darwin boot for bug 1393524 > + $puppet_run_if_more_than_seconds = 7200 > + $puppet_run_if_more_than_n_reboots = 5 These node scope variables should definitely be moved to moco-config. Helps keep node defs clean and put configurations in there proper place. @@ +67,5 @@ > $slave_trustlevel = 'try' > + > + # Easy faster boot for bug 1393524 > + $puppet_run_if_more_than_seconds = 7200 > + $puppet_run_if_more_than_n_reboots = 5 Same as above. Move vars to moco-config. Don't forget to add defaults to config::base @@ +1102,5 @@ > + include toplevel::worker::releng::generic_worker::test::gpu > +} > + > +# Dsiabled Linux and OS X taskcluster workers in mdc1 > +#node /t.*-\d+\.test\.releng\.mdc1\.mozilla\.com/ { Did you also comment out the rest of the node def here? This also might be bit rot since :arr enabled the osx tc workers in mdc1 last week
Attachment #8901385 - Flags: review?(jwatkins) → review-
Attached file osx_reboots.log
From test runs on t-yosemite-r7-471, this saves about a minute by skipping the puppet run and the wait for dns to resolve. A few times from reboot requested to next task started (from attached log clips): 0:2:37 with puppet 0:2:33 with puppet 0:1:56 0:3:52 with puppet 0:1:37 0:2:38 with puppet 0:1:43 0:1:33
This moves 2 node scope vars to moco-config and adds them as explicitly undefined in config::base. Does this look right? This adds a check for the boot flag file and removed the deletion of it from the puppet-run function (when the reboot is triggered, it was deleting the flag file. it also deletes it before puppet is run). This also adds a condition to the start of the checks to, by default, bypass performing any checks. That seems better than expecting the tests to fail because the limits are not defined.
Attachment #8901385 - Attachment is obsolete: true
Attachment #8903582 - Flags: review?(jwatkins)
Comment on attachment 8903582 [details] [diff] [review] include boot flag file check in osx run-puppet atboot conditions Review of attachment 8903582 [details] [diff] [review]: ----------------------------------------------------------------- I like it! Just fix the conditional stuff noted inline and r+. I ::: manifests/moco-config.pp @@ +135,5 @@ > + # first test: limit to 393-399 in mdc1 > + /t-yosemite-r7-3\d+\.test\.releng\.mdc1\.mozilla\.com/: { > + $puppet_run_atboot_if_more_than_n_reboots = 5 > + $puppet_run_atboot_if_more_than_seconds = 3600 > + } You might need an empty default case here since the future lint config might complain about it ::: modules/config/manifests/base.pp @@ +147,4 @@ > # a flag that controls which nodes should install and run MIG Agent > $enable_mig_agent = false > > + # Conditional puppet run atboot for macs Perfect! ::: modules/puppet/manifests/atboot.pp @@ +13,5 @@ > $puppet_servers = $::puppet::settings::puppet_servers > > case ($::operatingsystem) { > + Darwin: { > + if ( $::config::puppet_run_atboot_if_more_than_n_reboots != undef or First, I would add a comment about this conditional stating the reasoning. You should also be explicit in setting the true or false with an if/else since $puppet_run_conditions_set is a "true" or "false" string that is being inserted into a shell script. If it is UNDEF, the erb generator will set it as an empty string in the template. May work but that is not what you want in a shell script. :-) From my understanding, we want to check that BOTH variables ARE set and if either are undef, then the first conditional in the shell script will be true therefore short circuit the evaluation and puppet will always run. I don't think we would have a situation where only one var would be set. If this is an incorrect assumption, then let's talk. I would suggest removing the negatives. Negatives tend to muddy up logic and make it hard to interpret. eg. if $puppet_run_atboot_if_more_than_n_reboots and $puppet_run_atboot_if_more_than_seconds { $conditional_missing = false # both conditionals are set, let's continue evaluating } else { $conditional_missing = true # oh noes, conditional missing, always run puppet! } ::: modules/puppet/templates/puppet-atboot-common.erb @@ +21,4 @@ > # when the puppet exit status is incorrect. > tmp=`mktemp /tmp/puppet-outputXXXXXX` > . /usr/local/bin/proxy_reset_env.sh > + /usr/bin/puppet agent --detailed-exitcodes $PUPPET_OPTIONS --server ${puppet_server} > $tmp 2>&1 Good catch! ::: modules/puppet/templates/puppet-darwin-run-puppet.sh.erb @@ +94,5 @@ > +# boot semaphore is still set > +# every N reboots > +# or if N seconds elapsed since last puppet run > +# or if the configuration(repo hash) has changed > +if [ ! <%= @puppet_run_conditions_set %> ] || [ -f $REBOOT_FLAG_FILE ] || puppet_counter_exceeded || puppet_elapsed_time_exceeded || ! puppet_config_version_is_latest; then If you explicitly set the first var either true or false in the origin logic, you can remove the confusing ! negative
Attachment #8903582 - Flags: review?(jwatkins) → review+
Attached patch osx run-puppet atboot conditions (obsolete) — Splinter Review
Updated patch based on review comments: removed negatives in condition statements. always run unless both conditions are set (added comment explaining). clarified condition check by naming the variable for "always run"
Attachment #8903582 - Attachment is obsolete: true
Attachment #8904853 - Flags: review?(jwatkins)
Comment on attachment 8904853 [details] [diff] [review] osx run-puppet atboot conditions Review of attachment 8904853 [details] [diff] [review]: ----------------------------------------------------------------- r+ with changes mentioned. You're on the right track! Much cleaner with all the inverse logic. Ping me with you want a quick r? on the changes. Let's get this landed ::: modules/puppet/manifests/atboot.pp @@ +17,5 @@ > + # Always run puppet at boot unless maximums are set for both > + # n_reboots and seconds since last puppet run. > + if ( $::config::puppet_run_atboot_if_more_than_n_reboots and > + $::config::puppet_run_atboot_if_more_than_seconds ) { > + $puppet_run_atboot_always = false treat $puppet_run_atboot_always as a string and single quote it for both "true" and "false" ::: modules/puppet/templates/puppet-darwin-run-puppet.sh.erb @@ +90,5 @@ > + fi > +} > + > +puppet_run_atboot_always=<%= @puppet_run_atboot_always %> > +if [[ $puppet_run_atboot_always == true ]] \ No need to set this as a variable and no need to evaluate it will == Just treat is like a sting being inject with erb. Should look like this: if [[ $puppet_run_atboot_always ]] \ The result will look like: if [[ true ]] \ or if [[ false ]] \ @@ +94,5 @@ > +if [[ $puppet_run_atboot_always == true ]] \ > + || [ -f $REBOOT_FLAG_FILE ] \ > + || puppet_counter_exceeded \ > + || puppet_elapsed_time_exceeded \ > + || puppet_config_version_is_not_latest; then Nice clean up! I love easy to read.
Attachment #8904853 - Flags: review?(jwatkins) → review+
That should read, "Much cleaner withOUT all the inverse logic."
Attached patch osx run-puppet atboot conditions (obsolete) — Splinter Review
Updated based on comments
Attachment #8904853 - Attachment is obsolete: true
Attachment #8905147 - Flags: review?(jwatkins)
changed to single quotes for the always-run flag
Attachment #8905147 - Attachment is obsolete: true
Attachment #8905147 - Flags: review?(jwatkins)
Attachment #8905149 - Flags: review?(jwatkins)
Comment on attachment 8905149 [details] [diff] [review] osx run-puppet atboot conditions Review of attachment 8905149 [details] [diff] [review]: ----------------------------------------------------------------- LGTM. Ship IT!
Attachment #8905149 - Flags: review?(jwatkins) → review+
Attachment #8905149 - Flags: checked-in+
See http://logs.glob.uno/?c=releng#c312233 - this might have possibly caused a linux talos puppet issue. aobreja|buildduty is currently testing.
See Also: → 1397674
Attached patch osx run-puppet atboot conditions (obsolete) — Splinter Review
Expand mac puppet atboot conditional runs to all "t-yosemite-r7" instances.
Attachment #8908174 - Flags: review?(jwatkins)
Attachment #8908174 - Flags: review?(dcrisan)
Attachment #8908174 - Flags: feedback?
Attachment #8908174 - Flags: review?(dcrisan) → review+
We need to not check for the counter file. It will never exist unless we remove this check.
Attachment #8908252 - Flags: review?(jwatkins)
Attachment #8908252 - Flags: review?(jwatkins) → review+
Comment on attachment 8908252 [details] [diff] [review] was checking for run count file too soon Applied and pushed to default: remote: https://hg.mozilla.org/build/puppet/rev/f7cfa8646cbde8421177b268f579ce6b80ecae50 Travis passed. pushed to production: remote: https://hg.mozilla.org/build/puppet/rev/5ecfc7bf1c93e9906f9693a9bb663a2df66346d4
Attachment #8908252 - Flags: checked-in+
Attachment #8908174 - Flags: feedback? → checked-in+
often network/dns is not ready and so we cannot get the repo tip hash. This prevents us from skipping the puppet run as we're depending on getting the hash
Attachment #8908429 - Flags: review?(jwatkins)
Attachment #8908429 - Flags: review?(dcrisan)
Attachment #8908429 - Attachment description: stop checking remote repo version. network/dns is often not ready (wait_for_dns ...) → wait_for_dns always. network/dns is often not ready for the hg tip hash check
Attachment #8908429 - Flags: review?(jwatkins) → review+
Comment on attachment 8908429 [details] [diff] [review] wait_for_dns always. network/dns is often not ready for the hg tip hash check Applied and pushed to default: remote: https://hg.mozilla.org/build/puppet/rev/bd4e4860d911867598a0f72cc8558753c9dccbfa Travis passed. Still didn't fix the hg response being empty *need to determine cause since it doesn't look like dns. Do not compare the hash if empty (bustage): diff --git a/modules/puppet/templates/puppet-darwin-run-puppet.sh.erb b/modules/puppet/templates/puppet-darwin-run-puppet.sh.erb --- a/modules/puppet/templates/puppet-darwin-run-puppet.sh.erb +++ b/modules/puppet/templates/puppet-darwin-run-puppet.sh.erb @@ -57,6 +57,7 @@ function puppet_config_version_is_not_la # if hg fails or is not yet installed, this will leave a non-matching empty string local repo_version=$(hg identify https://hg.mozilla.org/build/puppet) echo "Repository tip hash: ${repo_version}" + [[ -z "${repo_version}" ]] && return 1 [[ "${last_run}" != "${repo_version}" ]] } remote: https://hg.mozilla.org/build/puppet/rev/b90518da429a5101c84565a4cf2e7e2eba73fd3d Travis passed. Pushed the production: remote: https://hg.mozilla.org/build/puppet/rev/d9b4a49865013673f1d5a92fe4cc731f8c377c44 I've collected reboot times (and whether they ran puppet) since the push and it is working correctly from what I see so far (and continuing on to run the generic_worker to take jobs) after reboots.
Attachment #8908429 - Flags: review?(dcrisan) → checked-in+
The t-yosemite-r7 machines are skipping puppet runs atboot based on the conditions and that seems to cut the down-time for reboots down by 50-60 seconds. t-yosemite-r7-0379 Sep 15 07:32:38, Sep 15 07:34:12, 94s, Creating file /etc/generic-worker.config..., t-yosemite-r7-0214 Sep 15 07:32:38, Sep 15 07:34:13, 95s, Creating file /etc/generic-worker.config..., t-yosemite-r7-0044 Sep 15 07:32:45, Sep 15 07:34:19, 94s, Creating file /etc/generic-worker.config..., t-yosemite-r7-0059 Sep 15 07:32:01, Sep 15 07:34:26, 145s, Creating file /etc/generic-worker.config..., Sep 15 07:34:25 t-yosemite-r7-0112 Sep 15 07:32:04, Sep 15 07:34:33, 149s, Creating file /etc/generic-worker.config..., Sep 15 07:34:32 t-yosemite-r7-0106 Sep 15 07:32:04, Sep 15 07:34:33, 149s, Creating file /etc/generic-worker.config..., Sep 15 07:34:33 t-yosemite-r7-0060 Sep 15 07:33:07, Sep 15 07:34:42, 95s, Creating file /etc/generic-worker.config..., t-yosemite-r7-0153 Sep 15 07:33:10, Sep 15 07:34:44, 94s, Creating file /etc/generic-worker.config..., t-yosemite-r7-0333 Sep 15 07:32:25, Sep 15 07:34:48, 143s, Creating file /etc/generic-worker.config..., Sep 15 07:34:47 t-yosemite-r7-0346 Sep 15 07:33:16, Sep 15 07:34:51, 95s, Creating file /etc/generic-worker.config..., t-yosemite-r7-0319 Sep 15 07:32:38, Sep 15 07:35:04, 146s, Creating file /etc/generic-worker.config..., Sep 15 07:35:03 t-yosemite-r7-0079 Sep 15 07:32:50, Sep 15 07:35:15, 145s, Creating file /etc/generic-worker.config..., Sep 15 07:35:13 t-yosemite-r7-0239 Sep 15 07:32:18, Sep 15 07:35:16, 178s, Creating file /etc/generic-worker.config..., Sep 15 07:35:15 So we'll see today if these minutes saved help the group of active servers to keep up with the queue.
Comment on attachment 8908174 [details] [diff] [review] osx run-puppet atboot conditions Review of attachment 8908174 [details] [diff] [review]: ----------------------------------------------------------------- I'd prefer to stick to being explicit as much as possible in regex. The .* should be (scl3|mdc1|usw2|use1).
Attachment #8908174 - Flags: review?(jwatkins) → review-
This: 1. Makes the fqdn matching specific to (scl3|mdc1|usw2|use1). 2. Fixes a counter bug: reset the boot counter every time we run puppet (not only when triggered by counter). 3. Adds logging for why puppet was run when a condition triggers it.
Attachment #8908174 - Attachment is obsolete: true
Attachment #8909439 - Flags: review?(jwatkins)
Attachment #8909439 - Flags: review?(dcrisan)
Comment on attachment 8909439 [details] [diff] [review] osx atboot conditions. logging, reset counter, specific fqdn lgtm
Attachment #8909439 - Flags: review?(dcrisan) → review+
Attachment #8909439 - Flags: review?(jwatkins) → review+
Attachment #8909439 - Flags: checked-in+
Some totals from the reboot timing collected since the 20th: ``` day hosts reboots puppet-runs puppet-skips hours-saved potential-taskhours percent-saved 21 457 17769 7018 10751 179 10968 1.63 22 378 12904 5556 7348 122 9072 1.34 23 378 7617 4083 3534 58 9072 .63 24 378 8573 4370 4203 70 9072 .77 25 378 16041 6388 9653 160 9072 1.76 26 379 16711 6290 10421 173 9096 1.90 27 382 18107 6609 11498 191 9168 2.08 ``` So on working week days, it looks like we are saving 1 to 2 percent of our mac processing time for tasks. I want to also sum up the time for reboots versus the time for puppet runs.
Did we reach to a conclusion on this bug?is there something else that need to be configured? We have bug 1386625 blocked by this one.
Flags: needinfo?(dhouse)
We reached a stable state (and confirmed it gave use a few percentage of total host time). It is something we can revisit when/if we need to review host/ops time on the workers.
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(dhouse)
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: