Update kernels on Linux 32bit EC2 slaves: signal handling bug in linux-image-3.2.0-23-generic-pae

RESOLVED FIXED

Status

--
critical
RESOLVED FIXED
4 years ago
3 months ago

People

(Reporter: shu, Assigned: dividehex)

Tracking

Details

Attachments

(1 attachment, 1 obsolete attachment)

(Reporter)

Description

4 years ago
Linux 32bit EC2 instances currently have linux-image-3.2.0-23-generic-pae, which seems to have a bug in signal handling and causes very weird crashes per bug 1139386.

glandium helped confirm that upgrading to linux-image-3.2.0-76-generic-pae fixes the crashes. That kernel is already installed in the image, but grub does not boot it.
(Reporter)

Updated

4 years ago
Blocks: 1139386
(Reporter)

Comment 1

4 years ago
For the curious, here's the kernel bug: http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=a349e23d1cf746f8bdc603dcc61fae9ee4a695f6

It's consistent with my debugging.
If upgrading the kernel doesn't break any tests, it should be as easy as bumping the version in http://hg.mozilla.org/build/puppet/file/06467c0bceb9/manifests/moco-config.pp#l266 (with packages synced).
CC Jake who added support for that in bug 1113328
(In reply to Rail Aliiev [:rail] from comment #2)
> If upgrading the kernel doesn't break any tests, it's going to be easy just
> like bumping the version in
> http://hg.mozilla.org/build/puppet/file/06467c0bceb9/manifests/moco-config.
> pp#l266 (with packages synced).

You'd think that, but it's not, because that's actually the version we should already be using and aren't: grub boots something else.
Also note that bug 1113328 supposedly upgraded from -38 to -75, which was later upgraded to -76, and neither -38 nor -75 is installed. BUT the kernel actually being used is -23.
Which suggests none of the previous kernel upgrades did anything in practice.
And it all seems to be caused by the contents of /boot/grub/menu.lst. If I remove the file and let update-grub generate one, I get the expected content with all the installed kernels and the latest one being the default, while with the current menu.lst content, it's never updated when update-grub runs.
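The mismatch described above (newer kernels installed, an older one actually booted) can be checked mechanically. The following is a minimal sketch, not something used in this bug: the function names are hypothetical, and it assumes kernel images appear as vmlinuz-<release> files with Ubuntu-style release strings such as 3.2.0-76-generic-pae.

```python
import re


def kernel_key(release):
    """Turn a release such as '3.2.0-76-generic-pae' into (3, 2, 0, 76)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def newest_installed(boot_files):
    """Return the newest release among vmlinuz-* file names, or None."""
    prefix = "vmlinuz-"
    releases = [name[len(prefix):] for name in boot_files
                if name.startswith(prefix)]
    return max(releases, key=kernel_key) if releases else None


def is_stale(running, boot_files):
    """True if a newer kernel is installed than the one that is running."""
    newest = newest_installed(boot_files)
    return newest is not None and kernel_key(running) < kernel_key(newest)
```

On a live instance this could be called as is_stale(platform.release(), os.listdir("/boot")). The ABI number is compared as an integer so that, for example, -9 sorts below -76.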
(In reply to Mike Hommey [:glandium] from comment #7)
> And it all seems to be caused by the contents of /boot/grub/menu.lst. If I
> remove the file and let update-grub generate one, I get the expected content
> with all the installed kernels and the latest one being the default

Except that doesn't seem to boot... the kernel command line differences, notably for the root fs, must be responsible for that.
(Assignee)

Comment 9

4 years ago
We have a pretty big package difference between test slaves in scl3 and ec2 slaves.  In this case, it looks like the ec2 ami uses the legacy version of grub.  /boot/grub/menu.lst is deprecated in favor of /boot/grub/grub.cfg, which is what we assume and manage in puppet.

-ii  grub                     0.97-29ubuntu66          GRand Unified Bootloader (Legacy version)

+ii  grub-gfxpayload-lists    0.6                      GRUB gfxpayload blacklist
+ii  grub-pc                  1.99-21ubuntu3           GRand Unified Bootloader, version 2 (PC/BIOS version)
+ii  grub-pc-bin              1.99-21ubuntu3           GRand Unified Bootloader, version 2 (PC/BIOS binaries)
+ii  grub2-common             1.99-21ubuntu3           GRand Unified Bootloader (common files for version 2)
(Reporter)

Updated

4 years ago
Summary: Update kernels on Linux 32bit EC2 slaves: possible signal handling bug in linux-image-3.2.0-23-generic-pae → Update kernels on Linux 32bit EC2 slaves: signal handling bug in linux-image-3.2.0-23-generic-pae
(Reporter)

Updated

4 years ago
Severity: normal → critical
(Reporter)

Comment 10

4 years ago
What are the next steps here? Could we start testing if -76 breaks any tests?
Duplicate of this bug: 1140487
(Assignee)

Comment 12

4 years ago
Created attachment 8575749 [details] [diff] [review]
bug1141339-1.patch

I wasn't able to get update-grub to properly update menu.lst either so I defaulted to a file resource for it instead.  This will get applied during the golden ami generation process and should ensure the ec2 spot instances come up with the proper kernel as defined in puppet.
Assignee: nobody → jwatkins
Attachment #8575749 - Flags: review?(rail)
Comment on attachment 8575749 [details] [diff] [review]
bug1141339-1.patch

Review of attachment 8575749 [details] [diff] [review]:
-----------------------------------------------------------------

I think that if you install the grub-legacy-ec2 package, you wouldn't even need to add these (except the package definition). Can you try this option?

Otherwise this should work, with some comments below.

::: modules/grub/manifests/defaults.pp
@@ +24,5 @@
> +            # as generated by update-grub above
> +            file {
> +                '/boot/grub/menu.lst':
> +                    ensure => present,
> +                    content => template("grub/menu.lst.erb"),

Can you rename the file to something like Ubuntu-menu.lst.erb, because it's Ubuntu specific?

::: modules/grub/templates/menu.lst.erb
@@ +3,5 @@
> +hiddenmenu
> +title default kernel
> +root (hd0)
> +kernel /boot/vmlinuz-<%= @kern %> ro root=LABEL=cloudimg-rootfs
> +initrd /boot/initrd.img-<%= @kern %>

Please remove all comments below. They are useless because we don't use update-grub.
Attachment #8575749 - Flags: review?(rail) → review+
(Assignee)

Comment 14

4 years ago
Created attachment 8576136 [details] [diff] [review]
bug1141339-2.patch

So I did attempt to use grub-legacy-ec2 but ran into more complications.  grub-legacy-ec2 removes update-grub and installs its own under a different name (update-grub-legacy-ec2), so the existing exec wouldn't handle that.  It also ignores the installed kernels unless the conf is set not to detect Xen kernels.  And when it does generate a kernel menu, it sets the rootfs to the existing UUID of that instance.  That should be ok, but it really should stay pointed at the generic LABEL of cloudimg-rootfs.

The worst part about using update-grub (on legacy) to generate menu.lst is that it reads its variables from menu.lst itself.  Total chicken/egg crap.

Let's just play it safe and have puppet set the menu.lst on ec2 instances only.
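For illustration, here is a minimal Python sketch of what the templated file resource produces. It is not the Puppet template itself: it renders only the five template lines visible in the review excerpt (the lines preceding them in the template are not shown in this bug and are left out), and the function and parameter names are hypothetical.

```python
def render_menu_lst(kern, root_label="cloudimg-rootfs"):
    """Render the menu.lst stanza shown in the review for one kernel release.

    kern is a release string such as '3.2.0-76-generic-pae'; it plays the
    role of the @kern variable in the ERB template.
    """
    lines = [
        "hiddenmenu",
        "title default kernel",
        "root (hd0)",
        # Root is addressed by filesystem label, not by a per-instance UUID.
        "kernel /boot/vmlinuz-%s ro root=LABEL=%s" % (kern, root_label),
        "initrd /boot/initrd.img-%s" % kern,
    ]
    return "\n".join(lines) + "\n"
```

Because the root filesystem is referenced by label, the same rendered file stays valid on every instance launched from the AMI, which is the reason given above for not letting update-grub-legacy-ec2 write a UUID.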
Attachment #8575749 - Attachment is obsolete: true
Attachment #8576136 - Flags: review?(rail)
Comment on attachment 8576136 [details] [diff] [review]
bug1141339-2.patch

Review of attachment 8576136 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/grub/manifests/defaults.pp
@@ +7,4 @@
>  
>      case $operatingsystem {
>          'Ubuntu': {
> +            if $ec2_ami_id {

Can you use the same approach used in http://hg.mozilla.org/build/puppet/file/29334cc3129e/modules/toplevel/manifests/slave/releng/test.pp#l27 ?
I remember that we had some issues with those variables...

Otherwise LGTM
Attachment #8576136 - Flags: review?(rail) → review+
(Assignee)

Comment 17

4 years ago
This is looking good.  I see ~450 spot instances with the updated kernel.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
(Reporter)

Comment 18

4 years ago
I'm still seeing -23 on spot instances in the gum tree. Is this expected?

See http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/gum-linux-debug/1426187997/gum_ubuntu32_vm-debug_test-mochitest-devtools-chrome-2-bm04-tests1-linux32-build5.txt.gz

Excerpt:

14:25:11     INFO -  Operating system: Linux
14:25:11     INFO -                    0.0.0 Linux 3.2.0-23-generic-pae #36-Ubuntu SMP Tue Apr 10 22:19:09 UTC 2012 i686
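A check like the one above can be automated across many test logs. This is a minimal sketch, assuming the log format matches the excerpt; the regular expression and function name are illustrative and not part of any existing tooling.

```python
import re

# Matches 'Linux <release>' where the release looks like 3.2.0-23-generic-pae.
# The bare 'Operating system: Linux' line does not match because no
# version string follows it.
KERNEL_LINE = re.compile(r"\bLinux (\d+\.\d+\.\d+-\d+\S*)")


def kernel_from_log(lines):
    """Return the first kernel release string found in the log lines, or None."""
    for line in lines:
        match = KERNEL_LINE.search(line)
        if match:
            return match.group(1)
    return None
```

Applied to the two excerpt lines, this yields 3.2.0-23-generic-pae, which can then be compared against the kernel version expected from puppet.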
(Assignee)

Comment 19

4 years ago
(In reply to Shu-yu Guo [:shu] from comment #18)
> I'm still seeing -23 on spot instances in the gum tree. Is this expected?

Something is definitely wrong here.  There are way too many tst-linux32 spot instances running old AMIs.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 20

4 years ago
We are not entirely sure why, but it looks like the golden image that was created the day before the grub patch landed didn't include the check_ami task.  This meant that the instances weren't terminating themselves.  :mrrrgn invoked check_ami via ansible, and it looks like the outdated instances all terminated successfully.  We shouldn't see any more tests running on the old kernel.
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations