Closed
Bug 1141339
Opened 9 years ago
Closed 9 years ago
Update kernels on Linux 32bit EC2 slaves: signal handling bug in linux-image-3.2.0-23-generic-pae
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: shu, Assigned: dividehex)
Attachments
(1 file, 1 obsolete file)
2.13 KB, patch | rail: review+, dividehex: checked-in+
Linux 32bit EC2 instances currently have linux-image-3.2.0-23-generic-pae, which seems to have a bug in signal handling that causes very weird crashes, per bug 1139386. glandium helped confirm that upgrading to linux-image-3.2.0-76-generic-pae fixes the crashes. The kernel is already installed in the image, but not booted by grub.
Reporter
Comment 1•9 years ago
For the curious, here's the kernel bug: http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=a349e23d1cf746f8bdc603dcc61fae9ee4a695f6 It's consistent with my debugging.
Comment 2•9 years ago
If upgrading the kernel doesn't break any tests, it should be as easy as bumping the version in http://hg.mozilla.org/build/puppet/file/06467c0bceb9/manifests/moco-config.pp#l266 (with packages synced).
Comment 3•9 years ago
CC'ing Jake, who added support for that in bug 1113328.
Comment 4•9 years ago
(In reply to Rail Aliiev [:rail] from comment #2)
> If upgrading the kernel doesn't break any tests, it's going to be easy just
> like bumping the version in
> http://hg.mozilla.org/build/puppet/file/06467c0bceb9/manifests/moco-config.pp#l266
> (with packages synced).

You'd think that, but it's not, because that's actually the version we should be using and aren't. Because grub uses something else.
Comment 5•9 years ago
Also note that bug 1113328 supposedly upgraded from -38 to -75, which was later upgraded to -76, and neither -38 nor -75 is installed. BUT the kernel actually being used is -23.
Comment 6•9 years ago
Which suggests none of the previous kernel upgrades did anything in practice.
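One way to catch this kind of silent no-op is to compare the running kernel against the newest installed one. The following is only a minimal sketch, not something used in this bug: the helper names are hypothetical, and it assumes Ubuntu-style version strings such as 3.2.0-76-generic-pae.

```python
import re


def kernel_sort_key(version):
    """Turn '3.2.0-76-generic-pae' into (3, 2, 0, 76) so versions compare numerically."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized kernel version: {version!r}")
    return tuple(int(part) for part in match.groups())


def newest_installed(versions):
    """Return the highest version from a list of installed kernel versions."""
    return max(versions, key=kernel_sort_key)


def is_running_newest(running, installed):
    """True if the running kernel is at least as new as every installed kernel."""
    return kernel_sort_key(running) >= kernel_sort_key(newest_installed(installed))
```

On a real host, `running` could come from `platform.release()` and `installed` from the suffixes of the `/boot/vmlinuz-*` file names. With -23 running and -76 installed, `is_running_newest` returns False, which is the situation described above.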
Comment 7•9 years ago
And it all seems to be caused by the contents of /boot/grub/menu.lst. If I remove the file and let update-grub generate one, I get the expected content with all the installed kernels and the latest one being the default, while with the current menu.lst content, it's never updated when update-grub runs.
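To see which kernel a given menu.lst would actually boot, the file can be read directly. This is a rough sketch under assumptions: it handles only the common GRUB Legacy keywords (default, title, kernel), treats a non-numeric default such as "saved" as unknown, and the function name is made up for illustration.

```python
import re


def default_kernel(menu_text):
    """Return the kernel path of the default entry in a GRUB Legacy menu.lst, or None."""
    default_index = 0  # GRUB Legacy boots entry 0 when no "default" line is present
    kernels = []       # kernel path of each "title" entry, in file order
    for raw_line in menu_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = re.split(r"[\s=]+", line, maxsplit=1)
        keyword = parts[0].lower()
        value = parts[1] if len(parts) > 1 else ""
        if keyword == "default":
            # A non-numeric value (e.g. "saved") cannot be resolved from the file alone.
            default_index = int(value) if value.isdigit() else None
        elif keyword == "title":
            kernels.append(None)
        elif keyword == "kernel" and kernels and kernels[-1] is None:
            fields = value.split()
            kernels[-1] = fields[0] if fields else None
    if default_index is None or default_index >= len(kernels):
        return None
    return kernels[default_index]
```

Calling `default_kernel(open("/boot/grub/menu.lst").read())` on an affected instance would be expected to point at the -23 image even though newer kernels are installed.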
Comment 8•9 years ago
(In reply to Mike Hommey [:glandium] from comment #7)
> And it all seems to be caused by the contents of /boot/grub/menu.lst. If I
> remove the file and let update-grub generate one, I get the expected content
> with all the installed kernels and the latest one being the default

Except that doesn't seem to boot... the kernel command line differences, notably for the root fs, must be responsible for that.
Assignee
Comment 9•9 years ago
We have a pretty big package difference between test slaves in scl3 and ec2 slaves. In this case, it looks like the ec2 AMI uses the legacy version of grub. /boot/grub/menu.lst is deprecated in favor of /boot/grub/grub.cfg, which is what we assume and manage in puppet.

-ii grub 0.97-29ubuntu66 GRand Unified Bootloader (Legacy version)
+ii grub-gfxpayload-lists 0.6 GRUB gfxpayload blacklist
+ii grub-pc 1.99-21ubuntu3 GRand Unified Bootloader, version 2 (PC/BIOS version)
+ii grub-pc-bin 1.99-21ubuntu3 GRand Unified Bootloader, version 2 (PC/BIOS binaries)
+ii grub2-common 1.99-21ubuntu3 GRand Unified Bootloader (common files for version 2)
Reporter
Updated•9 years ago
Summary: Update kernels on Linux 32bit EC2 slaves: possible signal handling bug in linux-image-3.2.0-23-generic-pae → Update kernels on Linux 32bit EC2 slaves: signal handling bug in linux-image-3.2.0-23-generic-pae
Reporter
Updated•9 years ago
Severity: normal → critical
Reporter
Comment 10•9 years ago
What are the next steps here? Could we start testing if -76 breaks any tests?
Assignee
Comment 12•9 years ago
I wasn't able to get update-grub to properly update menu.lst either so I defaulted to a file resource for it instead. This will get applied during the golden ami generation process and should ensure the ec2 spot instances come up with the proper kernel as defined in puppet.
Assignee: nobody → jwatkins
Attachment #8575749 -
Flags: review?(rail)
Comment 13•9 years ago
Comment on attachment 8575749 [details] [diff] [review]
bug1141339-1.patch

Review of attachment 8575749 [details] [diff] [review]:
-----------------------------------------------------------------

I think, if you install grub-legacy-ec2 package you wouldn't even need to add these (except the package definition). Can you try this option? Otherwise this should work, with some comments below.

::: modules/grub/manifests/defaults.pp
@@ +24,5 @@
> + # as generated by update-grub above
> + file {
> + '/boot/grub/menu.lst':
> + ensure => present,
> + content => template("grub/menu.lst.erb"),

Can you rename the file to something like Ubuntu-menu.lst.erb, because it's Ubuntu specific?

::: modules/grub/templates/menu.lst.erb
@@ +3,5 @@
> +hiddenmenu
> +title default kernel
> +root (hd0)
> +kernel /boot/vmlinuz-<%= @kern %> ro root=LABEL=cloudimg-rootfs
> +initrd /boot/initrd.img-<%= @kern %>

Please remove all comments below. They are useless because we don't use update-grub.
Attachment #8575749 -
Flags: review?(rail) → review+
Assignee
Comment 14•9 years ago
So I did attempt to use grub-legacy-ec2 but ran into more complications. grub-legacy-ec2 removes update-grub and installs its own under a different name (update-grub-legacy-ec2), so the existing exec wouldn't handle that. It also ignores the installed kernels unless the conf is set not to detect Xen kernels. And when it does generate the kernel menu, it sets the rootfs to the existing UUID of that instance, which should be OK, but it really should stay pointed at the generic LABEL of cloudimg-rootfs. The worst part about using update-grub (on legacy) to generate menu.lst is that it reads variables from menu.lst. Total chicken/egg crap. Let's just play it safe and have puppet set the menu.lst on ec2 instances only.
Attachment #8575749 -
Attachment is obsolete: true
Attachment #8576136 -
Flags: review?(rail)
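For readers without the patch at hand, the idea of a fully static menu.lst parameterized only by the kernel version can be sketched in Python. This is an illustration, not the actual Puppet template: the lines from hiddenmenu onward follow the excerpt quoted in comment 13, while the "default 0" and "timeout 0" lines and the function name are assumptions.

```python
from string import Template

# "default 0" and "timeout 0" are assumed; the remaining lines mirror the
# template excerpt quoted in the review of bug1141339-1.patch.
MENU_LST = Template(
    "default 0\n"
    "timeout 0\n"
    "hiddenmenu\n"
    "title default kernel\n"
    "root (hd0)\n"
    "kernel /boot/vmlinuz-$kern ro root=LABEL=cloudimg-rootfs\n"
    "initrd /boot/initrd.img-$kern\n"
)


def render_menu_lst(kern):
    """Render a single-entry menu.lst that boots the given kernel version."""
    if not kern or any(ch.isspace() for ch in kern):
        raise ValueError(f"invalid kernel version: {kern!r}")
    return MENU_LST.substitute(kern=kern)
```

Because the root filesystem is addressed by the generic cloudimg-rootfs label rather than a per-instance UUID, the same rendered file is valid on every instance launched from the image.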
Comment 15•9 years ago
Comment on attachment 8576136 [details] [diff] [review]
bug1141339-2.patch

Review of attachment 8576136 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/grub/manifests/defaults.pp
@@ +7,4 @@
> 
> case $operatingsystem {
> 'Ubuntu': {
> + if $ec2_ami_id {

Can you use the same approach used in http://hg.mozilla.org/build/puppet/file/29334cc3129e/modules/toplevel/manifests/slave/releng/test.pp#l27 ? I remember that we had some issues with those variables...

Otherwise LGTM
Attachment #8576136 -
Flags: review?(rail) → review+
Assignee
Comment 16•9 years ago
Comment on attachment 8576136 [details] [diff] [review]
bug1141339-2.patch

remote: https://hg.mozilla.org/build/puppet/rev/276eb3d7a57e
remote: https://hg.mozilla.org/build/puppet/rev/01e7390b2c56
Attachment #8576136 -
Flags: checked-in+
Assignee
Comment 17•9 years ago
This is looking good. I see ~450 spot instances with the updated kernel.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Reporter
Comment 18•9 years ago
I'm still seeing -23 on spot instances in the gum tree. Is this expected?

See http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/gum-linux-debug/1426187997/gum_ubuntu32_vm-debug_test-mochitest-devtools-chrome-2-bm04-tests1-linux32-build5.txt.gz

Excerpt:
14:25:11 INFO - Operating system: Linux
14:25:11 INFO - 0.0.0 Linux 3.2.0-23-generic-pae #36-Ubuntu SMP Tue Apr 10 22:19:09 UTC 2012 i686
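Checking many logs for stale kernels by hand gets tedious, so a small scan can help. A minimal sketch, assuming the log contains a line shaped like the second excerpt line above; the regular expression and function name are illustrative only.

```python
import re

# Matches e.g. "Linux 3.2.0-23-generic-pae #36-Ubuntu" and captures the version.
KERNEL_LINE = re.compile(r"\bLinux (\d+\.\d+\.\d+-\d+-[\w-]+) #")


def kernels_in_log(log_text):
    """Return the sorted, de-duplicated kernel versions mentioned in a test log."""
    return sorted(set(KERNEL_LINE.findall(log_text)))
```

The logs are gzip-compressed, so the text could be obtained with something like `gzip.open(path, "rt").read()` before calling `kernels_in_log`. Any result that still contains 3.2.0-23-generic-pae points to an instance launched from an old AMI.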
Assignee
Comment 19•9 years ago
(In reply to Shu-yu Guo [:shu] from comment #18)
> I'm still seeing -23 on spot instances in the gum tree. Is this expected?

Something is definitely wrong here. There are way too many tst-linux32 spot instances running old AMIs.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 20•9 years ago
We are not entirely sure why, but it looks like the golden image that was created the day before the grub patch landed didn't include the check_ami task. This meant that the instances weren't terminating themselves. :mrrrgn invoked check_ami via ansible and it looks like the outdated instances all successfully terminated. We shouldn't see any more tests running on the old kernel.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED
Updated•6 years ago
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard