Closed Bug 1330695 Opened 7 years ago Closed 5 years ago

Upgrade linux kernel on centos6.5

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: dividehex, Assigned: dividehex)

References

Details

Attachments

(4 files)

The linux kernel packages need to be upgraded to the latest 2.6.32 version.
The current version is pinned at 2.6.32-504.3.3.  There is a puppet kernel module already in place, but it will need to be modified so we can isolate the upgrade to 'toplevel::server' only.
See bug 1319455
Depends on: 1339219
This will install the latest (2.6.32-642.13.1.el6) linux kernel package on all centos 6.5 hosts.  A subsequent reboot will be required to actually run the new kernel.  Conveniently, once all puppet clients have had a chance to run, we can query foreman for a list of hosts that have the 'needs_reboot' fact.  With that we can schedule downtimes for servers and do rolling reboots.  A handful of the buildbot masters will also need to be tested before waiting for the weekend reboots.

The aws centos instances generated by the golden ami will also pick up this change.
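For reference, a rough sketch of the kind of check that sits behind the 'needs_reboot' fact (the real fact is produced by the puppet kernel module; the comparison below is an assumption):

# compare the running kernel to the newest installed kernel package
running=$(uname -r)
latest=$(rpm -q kernel --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' | sort -V | tail -n 1)
if [ "$running" != "$latest" ]; then
    echo "needs_reboot: running $running, newest installed $latest"
fi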
Attachment #8837746 - Flags: review?(catlee)
Attachment #8837746 - Flags: review?(catlee) → review?(nthomas)
Attachment #8837746 - Flags: review?(nthomas) → review+
I've disabled buildbot-master01 and buildbot-master51 in slavealloc and initiated clean shutdowns for both.  Waiting on jobs to finish in the meantime.
buildbot-master51.bb.releng.use1 has been rebooted and new kernel is in place
[jwatkins@buildbot-master51.bb.releng.use1.mozilla.com ~]$ uname -r
2.6.32-642.13.1.el6.x86_64

Still waiting on buildbot-master01.
buildbot-master01 has been rebooted.  Both masters are running the new kernel.
I've attached a list of hosts that have taken the kernel upgrade and need to be rebooted.
The list of hosts essentially breaks down to:

aws-managers
balrogworker
beetmoverworker
buildbot-masters
cruncher-aws
dev-master2
log-aggregators
pushapkworker
puppetmasters
signing-linux
signing
signingworker
slaveapi(prod/dev)

plus various one-off user test instances
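For the rolling reboots themselves, something along these lines per batch would do (a hypothetical helper; hosts.txt with one FQDN per line is an assumption, and real runs would downtime each host in nagios first):

while read -r host; do
    ssh "$host" 'sudo /sbin/shutdown -r now' || continue
    sleep 60
    # wait for the host to come back, then confirm the new kernel
    until ssh -o ConnectTimeout=5 "$host" 'uname -r' 2>/dev/null; do
        sleep 15
    done
done < hosts.txt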
Puppetmasters have all been rebooted and sport the new kernel!
I rebooted pushapkworker (when updating jdk17):
pushapkworker-1.srv.releng.use1.mozilla.com

and I rebooted two (one hardware in scl3 and one in aws) log-aggregators:
log-aggregator2.srv.releng.scl3.mozilla.com
log-aggregator1.srv.releng.usw2.mozilla.com

I'll wait and reboot the other log-aggregators tomorrow if there are no problems.
Blocks: 1340311
No longer blocks: 1340311
Depends on: 1340311
Depends on: 1340324
I rebooted slaveapi-dev1 and the kernel update applied successfully. SlaveAPI prod update planned for tomorrow night 7pm pacific (email notice sent to sheriffs and release).
I will reboot balrogworker1 this afternoon after it completes the daily run/work. Planning on a reboot for balrogworker2 next week.
balrogworker1 had no active jobs and the queue had been empty:
tail -f /builds/scriptworker/logs/worker.log showed 200's

Rebooted at 20:32 UTC (not found in nagios)
$ uname -r
2.6.32-642.13.1.el6.x86_64
[dhouse@balrogworker-1.srv.releng.use1.mozilla.com ~]$ uptime
 20:33:22 up 0 min,  1 user,  load average: 0.00, 0.00, 0.00
rebooted:
log-aggregator2.srv.releng.usw2.mozilla.com
log-aggregator1.srv.releng.scl3.mozilla.com
rebooted:
aws-manager1.srv.releng.scl3.mozilla.com
buildduty-tools.srv.releng.usw2.mozilla.com

sent a notice 10 minutes in advance to #buildduty, and confirmed no users or activity (for two+ days for both)
rebooted slaveapi1 as scheduled. (downtimed, confirmed no users recently)

There was a .pid file left over from before the reboot, so the puppet slaveapi module thought that slaveapi was already running. I moved the pid file to /tmp, ran puppet, and it started up correctly.
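Roughly what the cleanup looked like (the pid file path here is an assumption; the real one comes from the slaveapi config):

PIDFILE=/var/run/slaveapi.pid
# only move the pid file aside if no process with that pid is still alive
if [ -f "$PIDFILE" ] && ! kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    mv "$PIDFILE" /tmp/
fi
puppet agent --test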
Depends on: 1340759
:dhouse, nice job on the reboots!
Here is the latest list of hosts needing a reboot (pulled via foreman):

[u'aws-manager2.srv.releng.scl3.mozilla.com',
 u'balrogworker-2.srv.releng.usw2.mozilla.com',
 u'beetmoverworker-1.srv.releng.use1.mozilla.com',
 u'beetmoverworker-2.srv.releng.usw2.mozilla.com',
 u'buildbot-master69.bb.releng.use1.mozilla.com',
 u'cruncher-aws.srv.releng.usw2.mozilla.com',
 u'dev-master2.bb.releng.use1.mozilla.com',
 u'log-aggregator1.srv.releng.use1.mozilla.com',
 u'log-aggregator2.srv.releng.use1.mozilla.com',
 u'log-aggregator3.srv.releng.scl3.mozilla.com',
 u'signing-linux-1.srv.releng.use1.mozilla.com',
 u'signing-linux-2.srv.releng.usw2.mozilla.com',
 u'signing-linux-3.srv.releng.use1.mozilla.com',
 u'signing-linux-4.srv.releng.usw2.mozilla.com',
 u'signing5.srv.releng.scl3.mozilla.com',
 u'signing6.srv.releng.scl3.mozilla.com',
 u'signingscriptworker-1.srv.releng.use1.mozilla.com',
 u'signingworker-1.srv.releng.use1.mozilla.com',
 u'signingworker-2.srv.releng.usw2.mozilla.com',
 u'signingworker-3.srv.releng.use1.mozilla.com',
 u'signingworker-4.srv.releng.usw2.mozilla.com']
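For anyone repeating this, a hedged sketch of pulling the same list via the Foreman API (the foreman hostname, credentials, and the exact fact-search syntax are assumptions):

curl -s -u "$FOREMAN_USER:$FOREMAN_PASS" \
    "https://foreman.example.com/api/v2/hosts?search=facts.needs_reboot%3Dtrue&per_page=500" \
    | grep -o '"name":"[^"]*"'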
I rebooted buildbot-master69 too.
Depends on: 1342096
(In reply to Jake Watkins [:dividehex] from comment #1)
>
> This will install the latest (2.6.32-642.13.1.el6) linux kernel package 

No rest for the wicked, 2.6.32-642.13.2.el6 came out yesterday to address a potential local privilege escalation.
rebooted:
aws-manager2
log-aggregator1.srv.releng.use1.mozilla.com
log-aggregator2.srv.releng.use1.mozilla.com
log-aggregator3.srv.releng.scl3.mozilla.com
dev-master2.bb.releng.use1.mozilla.com

:aobreja will reboot cruncher tomorrow, Feb. 24

I'm planning to make a bug to request that the signing servers be rebooted in the first week of March (to catch the newer kernel and avoid conflicting with releases next week).
cc'ing :aobreja re: cruncher reboot tomorrow
Attachment #8840548 - Flags: review?(jwatkins)
I tested this on a vm, dhouse-1330169.srv.releng.scl3.mozilla.com, and puppet correctly updated the kernel (I rebooted manually to confirm).

I added rpms to the kernel custom repo (followed by createrepo and fixperms):
dhouse@releng-puppet2:/data/repos/yum/custom/kernel$ tree -I "repodata" -P "kernel*" .
.
├── i386
│   ├── kernel-2.6.32-504.3.3.el6.i686.rpm
│   ├── kernel-2.6.32-642.13.1.el6.i686.rpm
│   ├── kernel-2.6.32-642.13.2.el6.i686.rpm
│   ├── kernel-devel-2.6.32-642.13.1.el6.i686.rpm
│   ├── kernel-devel-2.6.32-642.13.2.el6.i686.rpm
│   ├── kernel-firmware-2.6.32-504.3.3.el6.noarch.rpm
│   ├── kernel-firmware-2.6.32-642.13.1.el6.noarch.rpm
│   ├── kernel-firmware-2.6.32-642.13.2.el6.noarch.rpm
│   ├── kernel-headers-2.6.32-504.3.3.el6.i686.rpm
│   ├── kernel-headers-2.6.32-642.13.1.el6.i686.rpm
│   └── kernel-headers-2.6.32-642.13.2.el6.i686.rpm
└── x86_64
    ├── kernel-2.6.32-504.3.3.el6.x86_64.rpm
    ├── kernel-2.6.32-642.13.1.el6.x86_64.rpm
    ├── kernel-2.6.32-642.13.2.el6.x86_64.rpm
    ├── kernel-devel-2.6.32-642.13.1.el6.x86_64.rpm
    ├── kernel-devel-2.6.32-642.13.2.el6.x86_64.rpm
    ├── kernel-firmware-2.6.32-504.3.3.el6.noarch.rpm
    ├── kernel-firmware-2.6.32-642.13.1.el6.noarch.rpm
    ├── kernel-firmware-2.6.32-642.13.2.el6.noarch.rpm
    ├── kernel-headers-2.6.32-504.3.3.el6.x86_64.rpm
    ├── kernel-headers-2.6.32-642.13.1.el6.x86_64.rpm
    └── kernel-headers-2.6.32-642.13.2.el6.x86_64.rpm
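The repo refresh itself was roughly the following (a minimal sketch; fixperms is assumed to be a local releng helper, so it is omitted here):

cd /data/repos/yum/custom/kernel
createrepo --update i386/
createrepo --update x86_64/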
Depends on: 1342203
Comment on attachment 8840548 [details] [diff] [review]
bug1330695_update_kernel_dot2.patch

Review of attachment 8840548 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm.  At some point, maybe once all hosts have been rebooted and are running the same kernel, we should add the old kernels to the $obsolete_kernels list.  This will ensure they get uninstalled and will help keep /boot from filling up.
Attachment #8840548 - Flags: review?(jwatkins) → review+
Attachment #8840548 - Flags: checked-in+
added the older kernel to the obsolete list to free some space on /boot for clients

tested against my vm and it worked
pushed to default and production:
remote:   https://hg.mozilla.org/build/puppet/rev/609cb121d53937a154d0e697690eaeff2f8ea532



--- a/manifests/moco-config.pp
+++ b/manifests/moco-config.pp
@@ -346,17 +346,18 @@ class config inherits config::base {
         default => undef,
     }
 
     # Specifying Ubuntu obsolete kernels is a different format than current kernel above
     # The format is aa.bb.xx-yy
     $obsolete_kernels = $operatingsystem ? {
         'CentOS' => $operatingsystemrelease ? {
             '6.2'   => [ '2.6.32-431.el6', '2.6.32-431.11.2.el6', '2.6.32-431.5.1.el6' ],
-            '6.5'   => [ '2.6.32-431.el6', '2.6.32-431.11.2.el6', '2.6.32-431.5.1.el6' ],
+            '6.5'   => [ '2.6.32-431.el6', '2.6.32-431.11.2.el6', '2.6.32-431.5.1.el6',
+                         '2.6.32-504.3.3.el6' ],
             default => [],
The removal of obsolete_kernels may not take place before installing a new one (I am still seeing failures for disk space on /boot). That ordering makes sense; otherwise one could mark the currently running kernel obsolete before a new one is installed.

I'll back out the new kernel pin so that the obsolete_kernels addition can apply first (removing -504.3.3).
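If the puppet ordering keeps biting, a hedged manual workaround on an affected client would be something like the following (package-cleanup is from yum-utils; it keeps the newest --count kernels and will not remove the running one):

df -h /boot
package-cleanup --oldkernels --count=2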
diff --git a/manifests/moco-config.pp b/manifests/moco-config.pp
--- a/manifests/moco-config.pp
pushed to default and tested against my vm again (a downgrade this time, but it exercises the syntax/etc.)
success and pushed to production:
remote:   https://hg.mozilla.org/build/puppet/rev/855647646c4de718aa6cbd071368292ea15cf900

+++ b/manifests/moco-config.pp
@@ -335,7 +335,7 @@ class config inherits config::base {
     $current_kernel = $operatingsystem ? {
         'CentOS' => $operatingsystemrelease ? {
             '6.2'   => '2.6.32-504.3.3.el6',
-            '6.5'   => '2.6.32-642.13.2.el6',
+            '6.5'   => '2.6.32-642.13.1.el6',
             default => undef,
         },
         'Ubuntu' => $operatingsystemrelease ? {
Confirmed the obsolete_kernel(504) is removed.
And there is space in the standard 100MB /boot for 642.13.1 and 642.13.2:

dhouse@dhouse-1330169:~$ rpm -q kernel
kernel-2.6.32-642.13.1.el6.x86_64
kernel-2.6.32-642.13.2.el6.x86_64
dhouse@dhouse-1330169:~$ df -k
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda3       36923072 2306692  32734116   7% /
tmpfs            1962288       0   1962288   0% /dev/shm
/dev/sda1          95054   63833     26101  71% /boot

So I will restore the original change to 642.13.2 after about an hour (giving time for all the servers to run puppet to remove the obsolete_kernel).
Restoring the original kernel update now that the old one will have been removed. Over 90 minutes have passed since the obsolete_kernel was set in production to clear space on the client /boot partitions.
Reverted ... some clients have a 60MB /boot and the kernel install reports it needs 2MB more.
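A quick way to spot the short hosts (assumed to be run across the affected clients via ssh or pdsh):

df -Pk /boot | awk 'NR==2 {print $4 " KB free on /boot"}'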
Attachment #8840994 - Flags: checked-in-
See Also: → 1342518
See Also: → 1342521

buildbot-related systems are gone. scriptworkers are currently still on centos 6, but are moving to docker images this quarter.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX