Closed Bug 808397 Opened 12 years ago Closed 11 years ago

"eth0: reset adaptor" on ix multinode systems

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: dustin)

Attachments

(2 files, 2 obsolete files)

Over the weekend, foopy64 lost network connectivity.  On the console (which was still responsive and showed a login screen) was the following message:

e1000e 0000:03:00.0: eth0: Reset adapter

I suspect that's not a good sign.

https://bugzilla.redhat.com/show_bug.cgi?id=625776 indicates that this should be fixed with a kernel upgrade. Opening this bug to track this issue and see what we're running.
That suggests upgrading to 3.3.4, which is a long jump, and not available for CentOS 6 from what I can tell.  There is 2.6.32-279.11.1.el6.centos.plus available (in centosplus, which we'd have to mirror).

I have installed that version manually on foopy42 and foopy65.  Let's see what happens there?

For the record, it's foopy65 that failed, not foopy64.

> [root@foopy65 ~]# yum shell
> Loaded plugins: security
> Setting up Yum Shell
> > install http://linux.mirrors.es.net/centos/6/centosplus/x86_64/Packages/kernel-2.6.32-279.11.1.el6.centos.plus.x86_64.rpm
> Setting up Install Process
> kernel-2.6.32-279.11.1.el6.centos.plus.x86_64.rpm                                                                                                                                                                      |  26 MB     00:00
> Examining /var/tmp/yum-root-OPtoS8/kernel-2.6.32-279.11.1.el6.centos.plus.x86_64.rpm: kernel-2.6.32-279.11.1.el6.centos.plus.x86_64
> Marking /var/tmp/yum-root-OPtoS8/kernel-2.6.32-279.11.1.el6.centos.plus.x86_64.rpm as an update to kernel-2.6.32-220.el6.x86_64
> Marking /var/tmp/yum-root-OPtoS8/kernel-2.6.32-279.11.1.el6.centos.plus.x86_64.rpm as an update to kernel-2.6.32-220.7.1.el6.x86_64
> > install http://linux.mirrors.es.net/centos/6/centosplus/x86_64/Packages/kernel-firmware-2.6.32-279.11.1.el6.centos.plus.noarch.rpm
> kernel-firmware-2.6.32-279.11.1.el6.centos.plus.noarch.rpm                                                                                                                                                             | 8.7 MB     00:00
> Examining /var/tmp/yum-root-OPtoS8/kernel-firmware-2.6.32-279.11.1.el6.centos.plus.noarch.rpm: kernel-firmware-2.6.32-279.11.1.el6.centos.plus.noarch
> Marking /var/tmp/yum-root-OPtoS8/kernel-firmware-2.6.32-279.11.1.el6.centos.plus.noarch.rpm as an update to kernel-firmware-2.6.32-220.7.1.el6.noarch
> > install http://linux.mirrors.es.net/centos/6/centosplus/x86_64/Packages/kernel-headers-2.6.32-279.11.1.el6.centos.plus.x86_64.rpm
> kernel-headers-2.6.32-279.11.1.el6.centos.plus.x86_64.rpm                                                                                                                                                              | 1.9 MB     00:00
> Examining /var/tmp/yum-root-OPtoS8/kernel-headers-2.6.32-279.11.1.el6.centos.plus.x86_64.rpm: kernel-headers-2.6.32-279.11.1.el6.centos.plus.x86_64
> Marking /var/tmp/yum-root-OPtoS8/kernel-headers-2.6.32-279.11.1.el6.centos.plus.x86_64.rpm to be installed
> > run 
> --> Running transaction check
> ---> Package kernel.x86_64 0:2.6.32-279.11.1.el6.centos.plus will be installed
> ---> Package kernel-firmware.noarch 0:2.6.32-220.7.1.el6 will be updated
> ---> Package kernel-firmware.noarch 0:2.6.32-279.11.1.el6.centos.plus will be an update
> ---> Package kernel-headers.x86_64 0:2.6.32-279.11.1.el6.centos.plus will be installed
> --> Finished Dependency Resolution
> 
> ==============================================================================================================================================================================================================================================
>  Package                                       Arch                                 Version                                                       Repository                                                                             Size
> ==============================================================================================================================================================================================================================================
> Installing:
>  kernel                                        x86_64                               2.6.32-279.11.1.el6.centos.plus                               /kernel-2.6.32-279.11.1.el6.centos.plus.x86_64                                        117 M
>  kernel-headers                                x86_64                               2.6.32-279.11.1.el6.centos.plus                               /kernel-headers-2.6.32-279.11.1.el6.centos.plus.x86_64                                2.4 M
> Updating:
>  kernel-firmware                               noarch                               2.6.32-279.11.1.el6.centos.plus                               /kernel-firmware-2.6.32-279.11.1.el6.centos.plus.noarch                                16 M
> 
> Transaction Summary
> ==============================================================================================================================================================================================================================================
> Install       2 Package(s)
> Upgrade       1 Package(s)
> 
> Total size: 135 M
> Is this ok [y/N]: y
> Downloading Packages:
> Running rpm_check_debug
> Running Transaction Test
> Transaction Test Succeeded
> Running Transaction
>   Updating   : kernel-firmware-2.6.32-279.11.1.el6.centos.plus.noarch                                                                                                                                                                     1/4 
>   Installing : kernel-2.6.32-279.11.1.el6.centos.plus.x86_64                                                                                                                                                                              2/4 
>   Installing : kernel-headers-2.6.32-279.11.1.el6.centos.plus.x86_64                                                                                                                                                                      3/4 
>   Cleanup    : kernel-firmware-2.6.32-220.7.1.el6.noarch                                                                                                                                                                                  4/4 
> 
> Installed:
>   kernel.x86_64 0:2.6.32-279.11.1.el6.centos.plus                                                                   kernel-headers.x86_64 0:2.6.32-279.11.1.el6.centos.plus                                                                      
> 
> Updated:
>   kernel-firmware.noarch 0:2.6.32-279.11.1.el6.centos.plus                                                                                                                                                                                        
> 
> Finished Transaction
Well, with no further failures on either upgraded or non-upgraded machines, there's nothing to suggest this upgrade was worth it -- or that it wasn't.  So we wait.
I'm calling this WFM.  We've had worse problems with the machines staying on, in other bugs.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
From mobile-imaging-009:

Duplicate check
=====


Common information
=====
package
-----
kernel

architecture
-----
x86_64

kernel
-----
2.6.32-220.7.1.el6.x86_64



Additional information
=====
kernel_tainted_long
-----
Taint on warning.

kernel_tainted
-----
512

backtrace
-----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: iX21X4-STIBTRF
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64 #1
Call Trace:
<IRQ>  [<ffffffff81069a17>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff81069b06>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff8144a60d>] ? dev_watchdog+0x26d/0x280
[<ffffffff8108b3fd>] ? insert_work+0x6d/0xb0
[<ffffffff8107bbe5>] ? internal_add_timer+0xb5/0x110
[<ffffffff8144a3a0>] ? dev_watchdog+0x0/0x280
[<ffffffff8107c7f7>] ? run_timer_softirq+0x197/0x340
[<ffffffff810a0b20>] ? tick_sched_timer+0x0/0xc0
[<ffffffff8102af2d>] ? lapic_next_event+0x1d/0x30
[<ffffffff81072001>] ? __do_softirq+0xc1/0x1d0
[<ffffffff81095610>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de85>] ? do_softirq+0x65/0xa0
[<ffffffff81071de5>] ? irq_exit+0x85/0x90
[<ffffffff814f4eb0>] ? smp_apic_timer_interrupt+0x70/0x9b
[<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20
<EOI>  [<ffffffff812c4b0e>] ? intel_idle+0xde/0x170
[<ffffffff812c4af1>] ? intel_idle+0xc1/0x170
[<ffffffff813fa027>] ? cpuidle_idle_call+0xa7/0x140
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e5ffc>] ? start_secondary+0x202/0x245


hostname
-----
mobile-imaging-009.p9.releng.scl1.mozilla.com

component
-----
kernel

time
-----
1365335034

reason
-----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)

analyzer
-----
Kerneloops

kernel_tainted_short
-----
---------W

cmdline
-----
ro root=UUID=e89fe5eb-b1e9-47e6-8346-4b51dda35a8b rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD quiet SYSFONT=latarcyrheb-sun16 rhgb crashkernel=129M@0M  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM

os_release
-----
CentOS release 6.2 (Final)
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Summary: foopy65 lost network conenctivity → "eth0: reset adaptor" on ix multinode systems
foopy121 back on April 5th (bug 858565).

Duplicate check
=====


Common information
=====
package
-----
kernel

architecture
-----
x86_64

kernel
-----
2.6.32-220.7.1.el6.x86_64



Additional information
=====
kernel_tainted_long
-----
Taint on warning.

kernel_tainted
-----
512

backtrace
-----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: X8SIL
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 microcode i2c_i801 i2c_core serio_raw sg iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64 #1
Call Trace:
<IRQ>  [<ffffffff81069a17>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff81069b06>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff8144a60d>] ? dev_watchdog+0x26d/0x280
[<ffffffff8108b3fd>] ? insert_work+0x6d/0xb0
[<ffffffff8109b743>] ? ktime_get+0x63/0xe0
[<ffffffff8107bbe5>] ? internal_add_timer+0xb5/0x110
[<ffffffff8144a3a0>] ? dev_watchdog+0x0/0x280
[<ffffffff8107c7f7>] ? run_timer_softirq+0x197/0x340
[<ffffffff81095610>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff81072001>] ? __do_softirq+0xc1/0x1d0
[<ffffffff810d94a0>] ? handle_IRQ_event+0x60/0x170
[<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de85>] ? do_softirq+0x65/0xa0
[<ffffffff81071de5>] ? irq_exit+0x85/0x90
[<ffffffff814f4dc5>] ? do_IRQ+0x75/0xf0
[<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
<EOI>  [<ffffffff812c4b0e>] ? intel_idle+0xde/0x170
[<ffffffff812c4af1>] ? intel_idle+0xc1/0x170
[<ffffffff813fa027>] ? cpuidle_idle_call+0xa7/0x140
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e5ffc>] ? start_secondary+0x202/0x245


hostname
-----
foopy121.build.mtv1.mozilla.com

component
-----
kernel

cmdline
-----
ro root=UUID=80d9716b-c3c1-40b8-b7e3-7fe1113c9f9b rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD quiet SYSFONT=latarcyrheb-sun16 rhgb crashkernel=129M@0M  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM

reason
-----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)

kernel_tainted_short
-----
---------W

analyzer
-----
Kerneloops

time
-----
1365171322

os_release
-----
CentOS release 6.2 (Final)
Component: Server Operations: RelEng → Server Operations
QA Contact: arich → shyam
Server Ops folks:

I need some help figuring out how to track this error down.  It's happening on new Supermicro hardware (iX X8SIL), with CentOS 6.2 x86_64.  It doesn't seem to occur with other OSes on the same hardware (we're running Ubuntu on 200 nodes and Win8 on 100).
foopy121's second failure was the same:

WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: X8SIL
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: ipv6 microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64 #1
Call Trace:
<IRQ>  [<ffffffff81069a17>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff81069b06>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff8144a60d>] ? dev_watchdog+0x26d/0x280
[<ffffffff8109b743>] ? ktime_get+0x63/0xe0
[<ffffffff8109f0e4>] ? clockevents_program_event+0x54/0xa0
[<ffffffff810a0635>] ? tick_dev_program_event+0x65/0xc0
[<ffffffff8144a3a0>] ? dev_watchdog+0x0/0x280
[<ffffffff8107c7f7>] ? run_timer_softirq+0x197/0x340
[<ffffffff81095610>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff81072001>] ? __do_softirq+0xc1/0x1d0
[<ffffffff810d94a0>] ? handle_IRQ_event+0x60/0x170
[<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de85>] ? do_softirq+0x65/0xa0
[<ffffffff81071de5>] ? irq_exit+0x85/0x90
[<ffffffff814f4dc5>] ? do_IRQ+0x75/0xf0
[<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
<EOI>  [<ffffffff812c4b0e>] ? intel_idle+0xde/0x170
[<ffffffff812c4af1>] ? intel_idle+0xc1/0x170
[<ffffffff813fa027>] ? cpuidle_idle_call+0xa7/0x140
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e5ffc>] ? start_secondary+0x202/0x245
Correction: this is happening on new and old hardware.  That suggests it's a bug in CentOS.

In the LKML discussion (http://thread.gmane.org/gmane.linux.kernel/1233566), it's eventually decided that the culprit is ASPM.  The device has known failures with ASPM L0s, and the drivers attempt to disable that level.  On our systems, they're successful in doing so - L0s appears in the capabilities but not in the LnkCtl.  However, L1 is still enabled, and later in that thread, Chris Boot determines that both levels are problematic.

Nix was having problems with NICs never coming up at all.  We don't seem to have that problem, so we can probably just use setpci to disable ASPM L1 after boot.
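For anyone re-checking the ASPM state described above, a minimal sketch follows. It assumes lspci -vv prints a "LnkCtl: ASPM ...;" line for the device, which is the layout I know from pciutils but may differ between versions. The function name lnkctl_aspm is made up for this example.

```shell
# Hypothetical helper: print the ASPM field of every LnkCtl line on stdin.
# Assumes the "LnkCtl: ASPM <state>; ..." layout of lspci -vv output.
lnkctl_aspm() {
    sed -n 's/^[[:space:]]*LnkCtl:[[:space:]]*ASPM \([^;]*\);.*/\1/p'
}

# Usage (run as root so the capability list is readable):
#   lspci -d 8086:10d3 -vv | lnkctl_aspm
```

On an affected host this should print something like "L1 Enabled" for each 82574L, and "Disabled" once ASPM has been turned off.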
This totally reminds me of peek and poke in my Applesoft Basic days, but this appears to disable ASPM on all 82574L's on the system:

 setpci -d 8086:10d3 CAP_EXP+10.b=40

That's easy to automate with puppet, again counting on the fact that these hosts don't fall over immediately on boot.
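To make the magic number a little less magic: CAP_EXP+10 is the Link Control register, and as I read the PCI Express layout, bits 1:0 of its low byte are the ASPM control field (00 off, 01 L0s, 10 L1, 11 both) while 0x40 is the Common Clock Configuration bit. So writing 40 keeps common clock set and clears both ASPM bits. The sketch below decodes a byte read back with setpci; aspm_state is a made-up name for illustration, not part of any patch here.

```shell
# Hypothetical helper: decode the ASPM control field (bits 1:0) of a
# Link Control low byte, given as two hex digits without a 0x prefix.
aspm_state() {
    case $(( 0x$1 & 3 )) in
        0) echo "disabled" ;;
        1) echo "L0s" ;;
        2) echo "L1" ;;
        3) echo "L0s+L1" ;;
    esac
}

# Usage (as root), with the slot taken from the console message above:
#   aspm_state "$(setpci -s 03:00.0 CAP_EXP+10.b)"
```

A value of 42 read back before the change would decode as L1, which matches what the LnkCtl line shows on these hosts.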
Attached patch bug808397.patch (obsolete) — Splinter Review
Attachment #736943 - Flags: review?(bugspam.Callek)
Comment on attachment 736943 [details] [diff] [review]
bug808397.patch

So as far as risk here:
 * this doesn't cause any interruption to the behavior of the system
 * this occurs at runtime, and doesn't persist over boots
 * I ran this on the three types of affected hardware, with no ill effects

The biggest risk I see is that hard-coding the PCIE register like that is probably not ideal.  Given that this is exactly one model of hardware that we've been careful to buy exactly one lot of, I think we're OK.  At any rate, I can't find any better ways to do this anyway.
Attached patch bug808397-r2.patch (obsolete) — Splinter Review
Greg mentioned in IRC that I should probably be more careful about the pattern used in grep.  They're hex bytes, so I expect two characters, but who knows what kind of error message or other stuff could appear in the output.
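One way to express that stricter check, as a sketch only and not what the attached patch does: use grep -x so a line counts as good only if it is exactly "40", and anything else (another value, an error message, stray text) counts as "not yet disabled". The function name all_aspm_off is hypothetical.

```shell
# Hypothetical helper: succeed only if no line on stdin differs from "40".
# -x matches whole lines, -v selects the non-matching ones, -q stays quiet.
# Empty input also succeeds, so a host with no matching NIC is left alone.
all_aspm_off() {
    ! grep -qvx '40'
}

# Usage (as root):
#   setpci -d 8086:10d3 CAP_EXP+10.b | all_aspm_off || echo "ASPM still enabled somewhere"
```

The empty-input behaviour is a design choice: it means the check never triggers a write on hardware that lacks the device.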
Attachment #736943 - Attachment is obsolete: true
Attachment #736943 - Flags: review?(bugspam.Callek)
Attachment #736969 - Flags: review?(bugspam.Callek)
Attachment #736969 - Flags: feedback?(gcox)
Comment on attachment 736969 [details] [diff] [review]
bug808397-r2.patch

I share :dustin's concern that he's hard-coded a peek/poke value in there, but it's in such an isolated branch (which you get to by matching hard-coded model names; hi there, irony!) that I think he's taken a good swipe at it, and the foundation is there for re-solving it if it comes back in a later hw release.
Attachment #736969 - Flags: feedback?(gcox) → feedback+
Comment on attachment 736969 [details] [diff] [review]
bug808397-r2.patch

Review of attachment 736969 [details] [diff] [review]:
-----------------------------------------------------------------

Besides the comments below, I'd prefer that we land this only when you and at least one DCOps person are expected to be around and able to dive in (in case anything needs hands-on work as weird fallout from this) -- so likely Monday morning instead of late Friday.

::: modules/hardware/manifests/init.pp
@@ +32,5 @@
>      # Nodes running IPMI-compliant hardware should install OpenIPMI
>      if (($::manufacturer == "HP" and $::productname =~ /ProLiant/) or
>          ($::manufacturer == "iXsystems" and $::productname == "iX700-C") or
> +        # iX700-C's can show up as X8SIL, too
> +        ($::manufacturer == "iXsystems" and $::productname == "X8SIL") or

this is a surprising ride-along. Not opposed to taking it, just making sure it's not part of another bug that snuck in here

@@ +46,5 @@
> +        # particular devices, that register is at the offset below.  This is
> +        # unlikely to work on other hardware (and likely to make things go boom)!
> +        # see bug 808397
> +        exec {
> +            'setpci-aspm-off':

as mentioned in IRC my r+ is on puppet syntax and not on the command or risk-assessment of change itself.

@@ +50,5 @@
> +            'setpci-aspm-off':
> +                command => '/sbin/setpci -d 8086:10d3 CAP_EXP+10.b=40',
> +                unless => '/usr/bin/test -z "`/sbin/setpci -d 8086:10d3 CAP_EXP+10.b | grep -v ^40$`"';
> +        }
> +    }

I would much prefer this exec to be thrown into a tweaks:: module and included from here similar to how we include ipmi above. Rather than making the tweak directly in this module.

It is just my own idealism and is a minor nit, so if you disagree even slightly go ahead and land-as-is.
Attachment #736969 - Flags: review?(bugspam.Callek) → review+
How's this?

The ride-along is intentional.  It looks like some models of the old ix hardware identify as X8SIL, while others identify as iX700-C.
Attachment #736969 - Attachment is obsolete: true
Attachment #737466 - Flags: review?(bugspam.Callek)
Comment on attachment 737466 [details] [diff] [review]
bug808397-r3.patch

Other than a pretty cryptic tweaks name, r+ (I have no better color for the shed in mind though, so go ahead and paint it unless a prettier color exists in your mind)
Attachment #737466 - Flags: review?(bugspam.Callek) → review+
landed - thanks!
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
It seems iX systems may now have manufacturer 'Supermicro', rather than 'ixSystems'.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Although I can't replicate 'ixSystems' there.  I wonder if we have slightly different HW revs across the fleet :(
Attached patch bug808397.patch — Splinter Review
Attachment #757389 - Flags: review?(bugspam.Callek)
Comment on attachment 757389 [details] [diff] [review]
bug808397.patch

Review of attachment 757389 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Dustin J. Mitchell [:dustin] from comment #20)
> It seems iX systems may now have manufacturer 'Supermicro', rather than
> 'ixSystems'.

I'm going to r+ this based on this, but I admit when on foopy125 running |facter -p| I still don't see a manufacturer fact to see this, so would love to know how you're finding it.
Attachment #757389 - Flags: review?(bugspam.Callek) → review+
Facter has some incorrect/missing results when run as non-root, so run it as root.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Component: Server Operations → Server Operations: RelEng
QA Contact: shyam → arich
Resolution: --- → FIXED
Attachment #737466 - Flags: checked-in+
Attachment #757389 - Flags: checked-in+
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations