Closed Bug 1159111 Opened 9 years ago Closed 9 years ago

needs reboot semaphore not being cleared

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: dividehex)

Attachments

(1 obsolete file)

While doing some maintenance tonight I noticed (via MOTD) that a foopy needed a reboot; however, it looks like more foopies and some other hosts need rebooting as well. Likely TCW fodder.
https://foreman.pub.build.mozilla.org/fact_values?utf8=%E2%9C%93&search=name+%3D+needs_reboot
Fwiw I just rebooted foopy67 (since it had no pandas) twice, and it still says that in MOTD. I also ran puppet again and it hasn't changed MOTD... reboot and then puppet again... I'm turning this into an active foopy now, so it won't be able to be blindly restarted as of 4/28 (PT)

[root@foopy67.p5.releng.scl3.mozilla.com ~]# puppet agent --test --noop --environment production
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for foopy67.p5.releng.scl3.mozilla.com
Warning: The package type's allow_virtual parameter will be changing its default value from false to true in a future release. If you do not want to allow virtual packages, please explicitly set allow_virtual to false.
   (at /usr/lib/ruby/site_ruby/1.8/puppet/type/package.rb:430:in `default')
Info: Applying configuration version 'd798f627e8e8'
Notice: Finished catalog run in 26.31 seconds
[root@foopy67.p5.releng.scl3.mozilla.com ~]# sudo reboot

Broadcast message from jwood@foopy67.p5.releng.scl3.mozilla.com
        (/dev/pts/0) at 21:34 ...

The system is going down for reboot NOW!
[root@foopy67.p5.releng.scl3.mozilla.com ~]# Connection to foopy67.build.mozilla.org closed by remote host.
Connection to foopy67.build.mozilla.org closed.

jwood@PERSEUS /c/Sources/svn/sysadmins/modules/nagios
$ ssh jwood@foopy67.build.mozilla.org
Last login: Mon Apr 27 21:29:54 2015 from 10-22-248-6.vpn.scl3.mozilla.com
This host is set to follow security level "medium"
Unauthorized access prohibited
REBOOT REQUIRED: reboot_after_puppet
[jwood@foopy67.p5.releng.scl3.mozilla.com ~]$ sudo su -
[root@foopy67.p5.releng.scl3.mozilla.com ~]# puppet agent --test --noop
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for foopy67.p5.releng.scl3.mozilla.com
Warning: The package type's allow_virtual parameter will be changing its default value from false to true in a future release. If you do not want to allow virtual packages, please explicitly set allow_virtual to false.
   (at /usr/lib/ruby/site_ruby/1.8/puppet/type/package.rb:430:in `default')
Info: Applying configuration version 'd798f627e8e8'
Notice: Augeas[resolvconf](provider=augeas):
--- /etc/resolv.conf 2015-04-27 21:35:39.092620154 -0700
+++ /etc/resolv.conf.augnew 2015-04-27 21:37:25.496304154 -0700
@@ -1,4 +1,4 @@
 ; generated by /sbin/dhclient-script
-search p5.releng.scl3.mozilla.com.
 nameserver 10.26.75.40
 nameserver 10.26.75.41
+domain p5.releng.scl3.mozilla.com
Notice: /Stage[main]/Network::Resolv/Augeas[resolvconf]/returns: current_value need_to_run, should be 0 (noop)
Notice: Class[Network::Resolv]: Would have triggered 'refresh' from 1 events
Notice: /Stage[main]/Tweaks::I82574l_aspm/Exec[setpci-aspm-off]/returns: current_value notrun, should be 0 (noop)
Notice: Class[Tweaks::I82574l_aspm]: Would have triggered 'refresh' from 1 events
Notice: Stage[main]: Would have triggered 'refresh' from 2 events
Notice: Finished catalog run in 32.13 seconds
[root@foopy67.p5.releng.scl3.mozilla.com ~]# cat /etc/issue
CentOS release 6.5 (Final)
Kernel \r on an \m
Kickstart Date: Thu Apr 16 12:29:21 PDT 2015
Kickstart OS: CentOS 6.5
System Installed: Thu Apr 16 12:40:43 PDT 2015
[root@foopy67.p5.releng.scl3.mozilla.com ~]# uptime
 21:39:06 up 3 min, 1 user, load average: 0.12, 0.12, 0.04
Assignee: relops → jwatkins
Summary: Hosts in need of reboot → needs reboot semaphore not being cleared
A few observations here:

First, even though puppetize.sh is forcing a reboot, it should still include an rm to remove the reboot flag if it had been set during the puppet run.

Second, the semaphore is being enforced if the $needs_reboot fact is true, but it should be enforced on $needs_reboot_for_reboot_after_puppet, since that fact is specific to that semaphore file. $needs_reboot is a more general fact that gets set when any reboot reason is true. I'm not even sure we need to be enforcing this with a file resource. It is a bit redundant.

And third, the atboot puppet logic handles clearing the flag before automatically rebooting when the flag is present, but this doesn't happen on hosts that run puppet periodically. So we either need to "rm -rf /REBOOT_AFTER_PUPPET; reboot" manually on those systems or add "rm -rf /REBOOT_AFTER_PUPPET" somewhere upon shutdown (or on startup, before puppet runs).
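For illustration only, the startup-time cleanup suggested in the third point could look roughly like the sketch below. The path /REBOOT_AFTER_PUPPET comes from the comment above; the helper name clear_reboot_semaphore and the demo function are hypothetical, and this is not the code from the attached patch.

```python
import os
import tempfile

# Path named in the comment above; everything else here is a hypothetical sketch.
SEMAPHORE_PATH = "/REBOOT_AFTER_PUPPET"


def clear_reboot_semaphore(path=SEMAPHORE_PATH):
    """Delete the semaphore file if it exists.

    Returns True if a file was removed, False if there was nothing to remove.
    """
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False


def demo():
    # Exercise the helper against a throwaway file rather than the real path.
    with tempfile.TemporaryDirectory() as tmp:
        fake = os.path.join(tmp, "REBOOT_AFTER_PUPPET")
        open(fake, "w").close()
        first = clear_reboot_semaphore(fake)   # file exists, gets removed
        second = clear_reboot_semaphore(fake)  # already gone, nothing to do
        return first, second


if __name__ == "__main__":
    print(demo())
```

Running something like this early in boot (before the first puppet run) or during shutdown would give periodic hosts the same cleanup that atboot hosts already get from the puppet initscript.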
Yeah, this logic is so circular it's hard to get right. Let's see:

Within a puppet run:
- $needs_reboot_for_reboot_after_puppet is set if /REBOOT_AFTER_PUPPET exists
- $needs_reboot_for_reboot_after_kernel_upgrade is set if the kernel version is different from $(</.kernel_release)
- $needs_reboot is set if either of the above are set
- /REBOOT_AFTER_PUPPET is touched if some puppet change has notify => Exec['reboot_semaphore']
- /REBOOT_AFTER_PUPPET is touched if $needs_reboot is set

Part of the point is to maintain the indication that a reboot is needed even if puppet is run multiple times (so, for example, if you run puppet and it upgrades the kernel, then on the next run it won't update the kernel -- we don't want it to remove the indication that a reboot is required at that point).

So following a kernel upgrade process for a periodic host:
* kernel upgrade change lands
* puppet runs
* no needs_reboot* facts are set
* upgrades kernel, writes /.kernel_release
* notifies Exec['reboot_semaphore']
* creates /REBOOT_AFTER_PUPPET
* 30m passes
* puppet runs
* needs_reboot_for_reboot_after_puppet is set
* needs_reboot_for_reboot_after_kernel_upgrade is set
* needs_reboot is set
* no kernel upgrade occurs
* File["/REBOOT_AFTER_PUPPET"] included (with no change since the file exists)

Similarly for a puppet::atboot host:
* kernel upgrade change lands
* host reboots
* puppet runs
* no needs_reboot* facts are set
* upgrades kernel, writes /.kernel_release
* notifies Exec['reboot_semaphore']
* creates /REBOOT_AFTER_PUPPET
* puppet initscript reboots
* puppet initscript removes /REBOOT_AFTER_PUPPET
* puppet runs
* no needs_reboot* facts are set
* no kernel upgrade occurs

So I think your first and third points are right:
* any time we reboot automatically, removing /REBOOT_AFTER_PUPPET is appropriate
* puppet::periodic hosts need a way to delete /REBOOT_AFTER_PUPPET on startup

And your second point can be strengthened: there's no reason to use File["/REBOOT_AFTER_PUPPET"] at all, as long as we can guarantee that Exec['reboot_semaphore'] is notified for anything that would require a reboot. Once /REBOOT_AFTER_PUPPET is touched, it won't be deleted until a reboot. The only place I see this notify missing is in packages::kernel, where it should be set when `/.kernel_release` is modified. Also, that file resource should have ensure => file, otherwise it won't be updated if it already exists.
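To make the periodic-host failure easier to see, here is a toy model of the fact logic walked through above. It is a simplification under stated assumptions: the Host class, the function names, and the kernel labels "k1"/"k2" are all hypothetical, facts are evaluated before the catalog is applied, and the File["/REBOOT_AFTER_PUPPET"] resource is left out (so only the notify path touches the semaphore). It is not the real puppet code.

```python
from dataclasses import dataclass


@dataclass
class Host:
    semaphore: bool = False      # True when /REBOOT_AFTER_PUPPET exists
    running_kernel: str = "k1"   # kernel the host is currently booted into
    recorded_kernel: str = "k1"  # contents of /.kernel_release


def facts(host):
    """Compute the three needs_reboot* facts as described in the thread."""
    after_puppet = host.semaphore
    after_kernel = host.running_kernel != host.recorded_kernel
    return {
        "needs_reboot_for_reboot_after_puppet": after_puppet,
        "needs_reboot_for_reboot_after_kernel_upgrade": after_kernel,
        "needs_reboot": after_puppet or after_kernel,
    }


def puppet_run(host, target_kernel):
    """One puppet run: facts are gathered first, then changes are applied."""
    gathered = facts(host)
    if host.recorded_kernel != target_kernel:
        host.recorded_kernel = target_kernel  # upgrade kernel, write /.kernel_release
        host.semaphore = True                 # notify Exec['reboot_semaphore']
    return gathered


def reboot(host, clear_semaphore):
    """Boot into the installed kernel, optionally removing the semaphore."""
    host.running_kernel = host.recorded_kernel
    if clear_semaphore:
        host.semaphore = False


if __name__ == "__main__":
    periodic = Host()
    print(puppet_run(periodic, "k2")["needs_reboot"])  # first run, before upgrade
    print(puppet_run(periodic, "k2")["needs_reboot"])  # 30m later
    reboot(periodic, clear_semaphore=False)            # manual reboot, flag kept
    print(facts(periodic)["needs_reboot"])             # still reports a reboot
    reboot(periodic, clear_semaphore=True)             # reboot with cleanup
    print(facts(periodic)["needs_reboot"])
```

In this model a plain reboot clears the kernel-upgrade fact but not the semaphore, which matches what was seen on foopy67: the MOTD keeps reporting reboot_after_puppet until something removes the file.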
Attached patch bug1159111-1.patch (obsolete) — Splinter Review
* puts semaphore removal back in puppetize.sh
* removes redundant file resource for /REBOOT_AFTER_PUPPET
* adds notify and ensure => file for .kernel_release
* adds semaphore removal startup/shutdown methods for all OS cases

Tested on ubuntu 14.04, centos 6.5, win2008 and osx 10.6
Attachment #8653645 - Flags: review?(dustin)
Comment on attachment 8653645 bug1159111-1.patch
Also added a $runas param for creating tasks. Defaults to SYSTEM
Attachment #8653645 - Attachment is obsolete: true
Attachment #8653645 - Flags: review?(dustin)
Attachment #8653645 - Attachment is obsolete: false
Attachment #8653645 - Attachment is obsolete: true
Attachment #8653645 - Flags: review+
This is the list of hosts that are reporting the need for a reboot. They will need to be manually cycled to clear the semaphore file, assuming they actually need a reboot.

{
  "releng-puppet1.srv.releng.usw2.mozilla.com": { "needs_reboot": "reboot_after_puppet" },
  "signingworker-1.srv.releng.use1.mozilla.com": { "needs_reboot": "reboot_after_puppet" },
  "buildbot-master67.bb.releng.use1.mozilla.com": { "needs_reboot": "reboot_after_puppet" },
  "proxxy1.srv.releng.usw2.mozilla.com": { "needs_reboot": "reboot_after_puppet" },
  "tst-linux32-ec2-shu.test.releng.use1.mozilla.com": { "needs_reboot": "reboot_after_kernel_upgrade" },
  "foopy103.p10.releng.scl3.mozilla.com": { "needs_reboot": "reboot_after_puppet" },
  "signingworker-3.srv.releng.use1.mozilla.com": { "needs_reboot": "reboot_after_puppet" },
  "funsize-dev1.srv.releng.usw2.mozilla.com": { "needs_reboot": "reboot_after_kernel_upgrade, reboot_after_puppet" },
  "signingworker-2.srv.releng.usw2.mozilla.com": { "needs_reboot": "reboot_after_puppet" },
  "signingworker-4.srv.releng.usw2.mozilla.com": { "needs_reboot": "reboot_after_puppet" },
  "foopy65.p4.releng.scl3.mozilla.com": { "needs_reboot": "reboot_after_puppet" },
  "releng-puppet1.srv.releng.use1.mozilla.com": { "needs_reboot": "reboot_after_puppet" },
  "buildbot-master01.bb.releng.use1.mozilla.com": { "needs_reboot": "reboot_after_puppet" },
  "foopy69.p5.releng.scl3.mozilla.com": { "needs_reboot": "reboot_after_puppet" },
  "foopy56.p3.releng.scl3.mozilla.com": { "needs_reboot": "reboot_after_puppet" },
  "foopy67.p5.releng.scl3.mozilla.com": { "needs_reboot": "reboot_after_puppet" },
  "proxxy1.srv.releng.use1.mozilla.com": { "needs_reboot": "reboot_after_puppet" },
  "signingworker-dev2.srv.releng.usw2.mozilla.com": { "needs_reboot": "reboot_after_puppet" }
}
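If it helps with triage, a report in this shape can be grouped by reboot reason with a few lines of Python. This is only a sketch: the function name hosts_by_reason is hypothetical, the sample holds three entries copied from the list above rather than the whole report, and it assumes multiple reasons are always separated by commas as in the funsize-dev1 entry.

```python
import json
from collections import defaultdict

# Three entries copied from the report above, used as sample input.
SAMPLE_REPORT = """
{
  "tst-linux32-ec2-shu.test.releng.use1.mozilla.com": { "needs_reboot": "reboot_after_kernel_upgrade" },
  "funsize-dev1.srv.releng.usw2.mozilla.com": { "needs_reboot": "reboot_after_kernel_upgrade, reboot_after_puppet" },
  "foopy67.p5.releng.scl3.mozilla.com": { "needs_reboot": "reboot_after_puppet" }
}
"""


def hosts_by_reason(report):
    """Map each reboot reason to a sorted list of hosts reporting it."""
    grouped = defaultdict(list)
    for host, host_facts in report.items():
        # The fact value may hold several comma-separated reasons.
        for reason in host_facts["needs_reboot"].split(","):
            grouped[reason.strip()].append(host)
    return {reason: sorted(hosts) for reason, hosts in grouped.items()}


if __name__ == "__main__":
    for reason, hosts in sorted(hosts_by_reason(json.loads(SAMPLE_REPORT)).items()):
        print(reason, len(hosts))
        for host in hosts:
            print("  " + host)
```

Hosts that only report reboot_after_puppet are the ones where a stale semaphore is plausible; a host reporting reboot_after_kernel_upgrade is running a kernel that differs from /.kernel_release, so it more likely does need the reboot.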
Is there any indication of why the reboot is needed? I'm wondering how many of these are false positives. In any case, I think this bug is done?
I don't think there is any way to tell whether these are false positives or not. I only assumed it was better to be safe than sorry and give them a reboot just to be sure. But if you're OK with just calling it done and leaving them be, I will go along.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED