Closed
Bug 1330695
Opened 7 years ago
Closed 5 years ago
Upgrade linux kernel on centos6.5
Categories
(Infrastructure & Operations :: RelOps: Puppet, task)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: dividehex, Assigned: dividehex)
References
Details
Attachments
(4 files)
906 bytes, text/plain (nthomas: review+; dividehex: checked-in+)
4.87 KB, text/plain
995 bytes, patch (dividehex: review+; dhouse: checked-in+)
1007 bytes, patch (dhouse: checked-in-)
The linux kernel packages need to be upgraded to the latest 2.6.32 version. The current version is pinned at 2.6.32-504.3.3. There is a kernel module already in place, but it will need to be modified so we can isolate the upgrade to 'toplevel::server' only. See bug 1319455.
Comment 1 • 7 years ago (Assignee)
This will install the latest (2.6.32-642.13.1.el6) linux kernel package on all centos 6.5 hosts. A subsequent reboot will be required to actually run the new kernel. Conveniently, once all puppet clients have had a chance to run, we can query foreman for a list of hosts that have the 'needs_reboot' fact. With that we can schedule downtimes and do rolling reboots of the servers. A handful of the buildbot masters will also need to be tested before the weekend reboots. The aws centos instances generated by the golden ami will also pick up this change.
Attachment #8837746 - Flags: review?(catlee)
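The 'needs_reboot' decision described above amounts to comparing the running kernel (`uname -r`) against the newest installed kernel package. A minimal standalone sketch of that comparison, with version strings taken from this bug (the real fact is computed by facter, not by this helper):

```shell
#!/bin/sh
# Sketch: a host needs a reboot when the kernel it is running is not the
# newest kernel installed on disk. The version strings below are examples
# from this bug; on a real host they would come from `uname -r` and
# `rpm -q kernel`.

# Print the newest of the given versions (GNU sort -V handles
# rpm-style version strings numerically).
newest_of() {
    printf '%s\n' "$@" | sort -V | tail -n 1
}

# Exit 0 (true) when the running kernel differs from the newest installed.
needs_reboot() {
    [ "$1" != "$2" ]
}

running=2.6.32-504.3.3.el6.x86_64
newest=$(newest_of 2.6.32-504.3.3.el6.x86_64 2.6.32-642.13.1.el6.x86_64)
if needs_reboot "$running" "$newest"; then
    echo "reboot required (running $running, newest $newest)"
fi
```

Foreman then only has to collect this boolean fact per host to produce the reboot list.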
Updated • 7 years ago (Assignee)
Attachment #8837746 - Flags: review?(catlee) → review?(nthomas)
Updated • 7 years ago
Attachment #8837746 - Flags: review?(nthomas) → review+
Comment 2 • 7 years ago (Assignee)
Comment on attachment 8837746 [details] Upgrade centos linux kernel
https://hg.mozilla.org/build/puppet/rev/c8c1db899e6b8bea037184907dbe3cd298397e9b
https://hg.mozilla.org/build/puppet/rev/ae00088db15d6829209254ab61b8ac2f89881a6b
Attachment #8837746 - Flags: checked-in+
Comment 3 • 7 years ago (Assignee)
I've disabled buildbot-master01 and buildbot-master51 in slavealloc and initiated clean shutdowns for both. Waiting on jobs to finish in the meantime.
Comment 4 • 7 years ago (Assignee)
buildbot-master51.bb.releng.use1 has been rebooted and the new kernel is in place:
[jwatkins@buildbot-master51.bb.releng.use1.mozilla.com ~]$ uname -r
2.6.32-642.13.1.el6.x86_64
Still waiting on buildbot-master01.
Comment 5 • 7 years ago (Assignee)
buildbot-master01 has now been rebooted as well. Both masters are running the new kernel.
Comment 6 • 7 years ago (Assignee)
I've attached a list of hosts that have taken the kernel upgrade and need to be rebooted.
Comment 7 • 7 years ago (Assignee)
The list of hosts essentially breaks down to: aws-managers, balrogworker, beetmoverworker, buildbot-masters, cruncher-aws, dev-master2, log-aggregators, pushapkworker, puppetmasters, signing-linux, signing, signingworker, slaveapi (prod/dev), plus various one-off user test instances.
Comment 8 • 7 years ago (Assignee)
Puppetmasters have all been rebooted and sport the new kernel!
Comment 9 • 7 years ago
I rebooted pushapkworker (when updating jdk17):
pushapkworker-1.srv.releng.use1.mozilla.com
and I rebooted two log-aggregators (one hardware in scl3 and one in aws):
log-aggregator2.srv.releng.scl3.mozilla.com
log-aggregator1.srv.releng.usw2.mozilla.com
I'll wait and reboot the other log-aggregators tomorrow if there are no problems.
Comment 10 • 7 years ago
I rebooted slaveapi-dev1 and the kernel update applied successfully. SlaveAPI prod update planned for tomorrow night 7pm pacific (email notice sent to sheriffs and release).
Comment 11 • 7 years ago
I will reboot balrogworker1 this afternoon after it completes the daily run/work. Planning on a reboot for balrogworker2 next week.
Comment 12 • 7 years ago
balrogworker1 had no active jobs and the queue had been empty; tail -f /builds/scriptworker/logs/worker.log showed 200s. Rebooted at 20:32 UTC (not found in nagios).
$ uname -r
2.6.32-642.13.1.el6.x86_64
[dhouse@balrogworker-1.srv.releng.use1.mozilla.com ~]$ uptime
20:33:22 up 0 min, 1 user, load average: 0.00, 0.00, 0.00
Comment 13 • 7 years ago
Rebooted:
log-aggregator2.srv.releng.usw2.mozilla.com
log-aggregator1.srv.releng.scl3.mozilla.com
Comment 14 • 7 years ago
Rebooted:
aws-manager1.srv.releng.scl3.mozilla.com
buildduty-tools.srv.releng.usw2.mozilla.com
Sent a notice 10 minutes in advance to #buildduty, and confirmed no users or activity (for two+ days on both).
Comment 15 • 7 years ago
Rebooted slaveapi1 as scheduled (downtimed, confirmed no users recently). There was a .pid file left over after the reboot, so the puppet slaveapi module thought slaveapi was still running. I moved the pid file to /tmp and ran puppet, and it started up correctly.
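The stale-pidfile problem above is a classic init-script pitfall: the pid recorded before the reboot may now belong to nothing (or to an unrelated process). A hedged sketch of the check (the paths and pids here are illustrative, not the actual slaveapi layout):

```shell
#!/bin/sh
# Sketch: decide whether a pidfile is stale by testing whether the pid it
# records still exists. kill -0 sends no signal; it only checks existence.

pidfile_stale() {
    pidfile="$1"
    [ -f "$pidfile" ] || return 1       # no pidfile: nothing stale
    pid=$(cat "$pidfile")
    ! kill -0 "$pid" 2>/dev/null        # stale if no such process
}

pidfile=$(mktemp)
echo $$ > "$pidfile"                    # our own pid: definitely running
pidfile_stale "$pidfile" && echo stale || echo running
echo 99999999 > "$pidfile"              # exceeds pid_max everywhere
pidfile_stale "$pidfile" && echo stale || echo running
rm -f "$pidfile"
```

A puppet-managed service could run a check like this before concluding that the daemon is already up.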
Comment 16 • 7 years ago (Assignee)
:dhouse, nice job on the reboots! Here is the latest list of hosts needing a reboot (pulled via foreman):
aws-manager2.srv.releng.scl3.mozilla.com
balrogworker-2.srv.releng.usw2.mozilla.com
beetmoverworker-1.srv.releng.use1.mozilla.com
beetmoverworker-2.srv.releng.usw2.mozilla.com
buildbot-master69.bb.releng.use1.mozilla.com
cruncher-aws.srv.releng.usw2.mozilla.com
dev-master2.bb.releng.use1.mozilla.com
log-aggregator1.srv.releng.use1.mozilla.com
log-aggregator2.srv.releng.use1.mozilla.com
log-aggregator3.srv.releng.scl3.mozilla.com
signing-linux-1.srv.releng.use1.mozilla.com
signing-linux-2.srv.releng.usw2.mozilla.com
signing-linux-3.srv.releng.use1.mozilla.com
signing-linux-4.srv.releng.usw2.mozilla.com
signing5.srv.releng.scl3.mozilla.com
signing6.srv.releng.scl3.mozilla.com
signingscriptworker-1.srv.releng.use1.mozilla.com
signingworker-1.srv.releng.use1.mozilla.com
signingworker-2.srv.releng.usw2.mozilla.com
signingworker-3.srv.releng.use1.mozilla.com
signingworker-4.srv.releng.usw2.mozilla.com
Comment 17 • 7 years ago
I rebooted buildbot-master69 too.
Comment 18 • 7 years ago
(In reply to Jake Watkins [:dividehex] from comment #1)
> This will install the latest (2.6.32-642.13.1.el6) linux kernel package
No rest for the wicked: 2.6.32-642.13.2.el6 came out yesterday to address a potential local privilege escalation.
Comment 19 • 7 years ago
Rebooted:
aws-manager2
log-aggregator1.srv.releng.use1.mozilla.com
log-aggregator2.srv.releng.use1.mozilla.com
log-aggregator3.srv.releng.scl3.mozilla.com
dev-master2.bb.releng.use1.mozilla.com
:aobreja will reboot cruncher tomorrow, Feb. 24. I'm planning to file a bug to request that the signing servers be rebooted in the first week of March (to catch the newer kernel and avoid conflicting with releases next week).
Comment 20 • 7 years ago
cc'ing :aobreja re: cruncher reboot tomorrow
Comment 21 • 7 years ago
Attachment #8840548 - Flags: review?(jwatkins)
Comment 22 • 7 years ago
I tested this on a vm, dhouse-1330169.srv.releng.scl3.mozilla.com, and puppet correctly updated the kernel (I rebooted manually to confirm). I added rpms to the kernel custom repo (followed by createrepo and fixperms):

dhouse@releng-puppet2:/data/repos/yum/custom/kernel$ tree -I "repodata" -P "kernel*" .
.
├── i386
│   ├── kernel-2.6.32-504.3.3.el6.i686.rpm
│   ├── kernel-2.6.32-642.13.1.el6.i686.rpm
│   ├── kernel-2.6.32-642.13.2.el6.i686.rpm
│   ├── kernel-devel-2.6.32-642.13.1.el6.i686.rpm
│   ├── kernel-devel-2.6.32-642.13.2.el6.i686.rpm
│   ├── kernel-firmware-2.6.32-504.3.3.el6.noarch.rpm
│   ├── kernel-firmware-2.6.32-642.13.1.el6.noarch.rpm
│   ├── kernel-firmware-2.6.32-642.13.2.el6.noarch.rpm
│   ├── kernel-headers-2.6.32-504.3.3.el6.i686.rpm
│   ├── kernel-headers-2.6.32-642.13.1.el6.i686.rpm
│   └── kernel-headers-2.6.32-642.13.2.el6.i686.rpm
└── x86_64
    ├── kernel-2.6.32-504.3.3.el6.x86_64.rpm
    ├── kernel-2.6.32-642.13.1.el6.x86_64.rpm
    ├── kernel-2.6.32-642.13.2.el6.x86_64.rpm
    ├── kernel-devel-2.6.32-642.13.1.el6.x86_64.rpm
    ├── kernel-devel-2.6.32-642.13.2.el6.x86_64.rpm
    ├── kernel-firmware-2.6.32-504.3.3.el6.noarch.rpm
    ├── kernel-firmware-2.6.32-642.13.1.el6.noarch.rpm
    ├── kernel-firmware-2.6.32-642.13.2.el6.noarch.rpm
    ├── kernel-headers-2.6.32-504.3.3.el6.x86_64.rpm
    ├── kernel-headers-2.6.32-642.13.1.el6.x86_64.rpm
    └── kernel-headers-2.6.32-642.13.2.el6.x86_64.rpm
Comment 23 • 7 years ago (Assignee)
Comment on attachment 8840548 [details] [diff] [review] bug1330695_update_kernel_dot2.patch

Review of attachment 8840548 [details] [diff] [review]:
-----------------------------------------------------------------
lgtm. At some point, maybe once all hosts have been rebooted and are running the same kernel, we should add the old kernels to the $obsolete_kernels list. This will ensure they get uninstalled and will help keep /boot from filling up.
Attachment #8840548 - Flags: review?(jwatkins) → review+
Attachment #8840548 - Flags: checked-in+
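The $obsolete_kernels cleanup suggested above can be sketched as a simple filter: given what `rpm -q kernel` would report and the obsolete list from moco-config.pp, print the packages to erase. The inputs below are hard-coded examples from this bug; in the real setup puppet does this, not a shell script:

```shell
#!/bin/sh
# Sketch: match installed kernel packages against an obsolete-version list
# so old kernels can be erased and /boot does not fill up.

obsolete="2.6.32-431.el6 2.6.32-431.11.2.el6 2.6.32-431.5.1.el6 2.6.32-504.3.3.el6"
installed="kernel-2.6.32-504.3.3.el6.x86_64
kernel-2.6.32-642.13.1.el6.x86_64
kernel-2.6.32-642.13.2.el6.x86_64"

# Print each installed package whose version appears in the obsolete list.
to_remove() {
    for ver in $obsolete; do
        printf '%s\n' "$installed" | grep -F "kernel-$ver"
    done
    return 0
}

to_remove
# A real run would then erase them, e.g.: rpm -e $(to_remove)
```

Only the running kernel must never appear in the obsolete list, which is exactly the ordering problem comment 27 runs into.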
Comment 24 • 7 years ago
pushed to default: https://hg.mozilla.org/build/puppet/rev/87ff0db21186c77ae462679c6052c2be26eb45b0
Comment 25 • 7 years ago
Travis passed; pushed to production:
remote: https://hg.mozilla.org/build/puppet/rev/094352f1521b60292b314893b27e5435c299a995
Comment 26 • 7 years ago
Added the older kernel to the obsolete list to free some space on /boot for clients. Tested and worked against my vm; pushed to default and production:
remote: https://hg.mozilla.org/build/puppet/rev/609cb121d53937a154d0e697690eaeff2f8ea532

--- a/manifests/moco-config.pp
+++ b/manifests/moco-config.pp
@@ -346,17 +346,18 @@ class config inherits config::base {
         default => undef,
     }
     # Specifying Ubuntu obsolete kernels is a different format than current kernel above
     # The format is aa.bb.xx-yy
     $obsolete_kernels = $operatingsystem ? {
         'CentOS' => $operatingsystemrelease ? {
             '6.2' => [ '2.6.32-431.el6', '2.6.32-431.11.2.el6', '2.6.32-431.5.1.el6' ],
-            '6.5' => [ '2.6.32-431.el6', '2.6.32-431.11.2.el6', '2.6.32-431.5.1.el6' ],
+            '6.5' => [ '2.6.32-431.el6', '2.6.32-431.11.2.el6', '2.6.32-431.5.1.el6',
+                       '2.6.32-504.3.3.el6' ],
             default => [],
Comment 27 • 7 years ago
The removal of obsolete_kernels may not take place before installing a new one (I am still seeing disk-space failures on /boot). That makes sense, since otherwise one could try to make the currently running kernel obsolete before installing a new one. I'll back out the new kernel so that the obsolete_kernels addition can apply first (removing -504.3.3).
Comment 28 • 7 years ago
Pushed to default and tested against my vm again (a downgrade this time, but it checks syntax etc). Success, and pushed to production:
remote: https://hg.mozilla.org/build/puppet/rev/855647646c4de718aa6cbd071368292ea15cf900

diff --git a/manifests/moco-config.pp b/manifests/moco-config.pp
--- a/manifests/moco-config.pp
+++ b/manifests/moco-config.pp
@@ -335,7 +335,7 @@ class config inherits config::base {
     $current_kernel = $operatingsystem ? {
         'CentOS' => $operatingsystemrelease ? {
             '6.2' => '2.6.32-504.3.3.el6',
-            '6.5' => '2.6.32-642.13.2.el6',
+            '6.5' => '2.6.32-642.13.1.el6',
             default => undef,
         },
         'Ubuntu' => $operatingsystemrelease ? {
Comment 29 • 7 years ago
Confirmed the obsolete kernel (504) is removed, and there is space in the standard 100MB /boot for 642.13.1 and 642.13.2:

dhouse@dhouse-1330169:~$ rpm -q kernel
kernel-2.6.32-642.13.1.el6.x86_64
kernel-2.6.32-642.13.2.el6.x86_64
dhouse@dhouse-1330169:~$ df -k
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda3       36923072 2306692  32734116   7% /
tmpfs            1962288       0   1962288   0% /dev/shm
/dev/sda1          95054   63833     26101  71% /boot

So I will restore the original change to 642.13.2 after about an hour (giving time for all the servers to run puppet and remove the obsolete kernel).
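The /boot check here comes down to simple arithmetic: kilobytes available versus the size of one more kernel plus initramfs. A sketch using the df output from this comment (the ~24MB per-kernel estimate is an assumption for illustration, not a measured value):

```shell
#!/bin/sh
# Sketch: parse `df -k` output and decide whether another kernel fits on
# /boot. The df text is copied from this comment; on a real host it would
# come from running `df -k /boot`.

df_output='Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda1          95054   63833     26101  71% /boot'

# Available kilobytes is column 4 of the data row.
avail_kb=$(printf '%s\n' "$df_output" | awk 'NR==2 {print $4}')
need_kb=24000   # assumed size of one kernel+initramfs, for illustration

if [ "$avail_kb" -ge "$need_kb" ]; then
    echo "fits: ${avail_kb}KB free, ${need_kb}KB needed"
else
    echo "does not fit: ${avail_kb}KB free, ${need_kb}KB needed"
fi
```

On the 60MB /boot partitions mentioned later in this bug, the same arithmetic comes up roughly 2MB short, which matches the install failure reported there.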
Comment 30 • 7 years ago
Restoring the original kernel update now that the old one will have been removed. Over 90 minutes have passed since the obsolete_kernel was set in production to clear space on the client /boot partitions.
Comment 31 • 7 years ago
Pushed to default:
remote: https://hg.mozilla.org/build/puppet/rev/aa0ed04a70bf1d184fdf3c42c92015809ee52910
Travis passed; pushed to production:
remote: https://hg.mozilla.org/build/puppet/rev/de8627bb3f87112f154ea686149edf55b3ab804e
Comment 32 • 7 years ago
Reverted ... some clients have a 60MB /boot and the kernel install reports it needs 2MB more.
Attachment #8840994 - Flags: checked-in-
Comment 33 • 5 years ago
Buildbot-related systems are gone. Scriptworkers are currently still on CentOS 6, but are moving to docker images this quarter.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX