Closed
Bug 808397
Opened 12 years ago
Closed 11 years ago
"eth0: reset adaptor" on ix multinode systems
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: arich, Assigned: dustin)
Attachments
(2 files, 2 obsolete files)
2.49 KB, patch | Callek: review+ | dustin: checked-in+
1.15 KB, patch | Callek: review+ | dustin: checked-in+
Over the weekend, foopy64 lost network connectivity. On the console (which was still responsive and had a login screen) was the following message:
e1000e 0000:03:00.0: eth0: Reset adapter
I suspect that's not a good sign. https://bugzilla.redhat.com/show_bug.cgi?id=625776 indicates that this should be fixed with a kernel upgrade. Opening this bug to track this issue and see what we're running.
Comment 1 • 12 years ago (Assignee)
That suggests upgrading to 3.3.4, which is a long jump, and not available for CentOS 6 from what I can tell. There is 2.6.32-279.11.1.el6.centos.plus available (in centosplus, which we'd have to mirror).
I have installed that version manually on foopy42 and foopy65. Let's see what happens there?
For the record, it's foopy65 that failed, not foopy64.
> [root@foopy65 ~]# yum shell
> Loaded plugins: security
> Setting up Yum Shell
> > install http://linux.mirrors.es.net/centos/6/centosplus/x86_64/Packages/kernel-2.6.32-279.11.1.el6.centos.plus.x86_64.rpm
> Setting up Install Process
> kernel-2.6.32-279.11.1.el6.centos.plus.x86_64.rpm | 26 MB 00:00
> Examining /var/tmp/yum-root-OPtoS8/kernel-2.6.32-279.11.1.el6.centos.plus.x86_64.rpm: kernel-2.6.32-279.11.1.el6.centos.plus.x86_64
> Marking /var/tmp/yum-root-OPtoS8/kernel-2.6.32-279.11.1.el6.centos.plus.x86_64.rpm as an update to kernel-2.6.32-220.el6.x86_64
> Marking /var/tmp/yum-root-OPtoS8/kernel-2.6.32-279.11.1.el6.centos.plus.x86_64.rpm as an update to kernel-2.6.32-220.7.1.el6.x86_64
> > install http://linux.mirrors.es.net/centos/6/centosplus/x86_64/Packages/kernel-firmware-2.6.32-279.11.1.el6.centos.plus.noarch.rpm
> kernel-firmware-2.6.32-279.11.1.el6.centos.plus.noarch.rpm | 8.7 MB 00:00
> Examining /var/tmp/yum-root-OPtoS8/kernel-firmware-2.6.32-279.11.1.el6.centos.plus.noarch.rpm: kernel-firmware-2.6.32-279.11.1.el6.centos.plus.noarch
> Marking /var/tmp/yum-root-OPtoS8/kernel-firmware-2.6.32-279.11.1.el6.centos.plus.noarch.rpm as an update to kernel-firmware-2.6.32-220.7.1.el6.noarch
> > install http://linux.mirrors.es.net/centos/6/centosplus/x86_64/Packages/kernel-headers-2.6.32-279.11.1.el6.centos.plus.x86_64.rpm
> kernel-headers-2.6.32-279.11.1.el6.centos.plus.x86_64.rpm | 1.9 MB 00:00
> Examining /var/tmp/yum-root-OPtoS8/kernel-headers-2.6.32-279.11.1.el6.centos.plus.x86_64.rpm: kernel-headers-2.6.32-279.11.1.el6.centos.plus.x86_64
> Marking /var/tmp/yum-root-OPtoS8/kernel-headers-2.6.32-279.11.1.el6.centos.plus.x86_64.rpm to be installed
> > run
> --> Running transaction check
> ---> Package kernel.x86_64 0:2.6.32-279.11.1.el6.centos.plus will be installed
> ---> Package kernel-firmware.noarch 0:2.6.32-220.7.1.el6 will be updated
> ---> Package kernel-firmware.noarch 0:2.6.32-279.11.1.el6.centos.plus will be an update
> ---> Package kernel-headers.x86_64 0:2.6.32-279.11.1.el6.centos.plus will be installed
> --> Finished Dependency Resolution
>
> ==============================================================================================================================================================================================================================================
> Package Arch Version Repository Size
> ==============================================================================================================================================================================================================================================
> Installing:
> kernel x86_64 2.6.32-279.11.1.el6.centos.plus /kernel-2.6.32-279.11.1.el6.centos.plus.x86_64 117 M
> kernel-headers x86_64 2.6.32-279.11.1.el6.centos.plus /kernel-headers-2.6.32-279.11.1.el6.centos.plus.x86_64 2.4 M
> Updating:
> kernel-firmware noarch 2.6.32-279.11.1.el6.centos.plus /kernel-firmware-2.6.32-279.11.1.el6.centos.plus.noarch 16 M
>
> Transaction Summary
> ==============================================================================================================================================================================================================================================
> Install 2 Package(s)
> Upgrade 1 Package(s)
>
> Total size: 135 M
> Is this ok [y/N]: y
> Downloading Packages:
> Running rpm_check_debug
> Running Transaction Test
> Transaction Test Succeeded
> Running Transaction
> Updating : kernel-firmware-2.6.32-279.11.1.el6.centos.plus.noarch 1/4
> Installing : kernel-2.6.32-279.11.1.el6.centos.plus.x86_64 2/4
> Installing : kernel-headers-2.6.32-279.11.1.el6.centos.plus.x86_64 3/4
> Cleanup : kernel-firmware-2.6.32-220.7.1.el6.noarch 4/4
>
> Installed:
> kernel.x86_64 0:2.6.32-279.11.1.el6.centos.plus kernel-headers.x86_64 0:2.6.32-279.11.1.el6.centos.plus
>
> Updated:
> kernel-firmware.noarch 0:2.6.32-279.11.1.el6.centos.plus
>
> Finished Transaction
Comment 2 • 12 years ago (Assignee)
Well, with no further failures on either upgraded or non-upgraded machines, there's nothing to suggest this upgrade was worth it -- or that it wasn't. So we wait.
Comment 3 • 12 years ago (Assignee)
I'm calling this WFM. We've had worse problems with the machines staying on, in other bugs.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Comment 4 • 11 years ago (Assignee)
From mobile-imaging-009:

Duplicate check
=====
Common information
=====
package ----- kernel
architecture ----- x86_64
kernel ----- 2.6.32-220.7.1.el6.x86_64

Additional information
=====
kernel_tainted_long ----- Taint on warning.
kernel_tainted ----- 512
backtrace -----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: iX21X4-STIBTRF
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64 #1
Call Trace:
<IRQ> [<ffffffff81069a17>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff81069b06>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff8144a60d>] ? dev_watchdog+0x26d/0x280
[<ffffffff8108b3fd>] ? insert_work+0x6d/0xb0
[<ffffffff8107bbe5>] ? internal_add_timer+0xb5/0x110
[<ffffffff8144a3a0>] ? dev_watchdog+0x0/0x280
[<ffffffff8107c7f7>] ? run_timer_softirq+0x197/0x340
[<ffffffff810a0b20>] ? tick_sched_timer+0x0/0xc0
[<ffffffff8102af2d>] ? lapic_next_event+0x1d/0x30
[<ffffffff81072001>] ? __do_softirq+0xc1/0x1d0
[<ffffffff81095610>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de85>] ? do_softirq+0x65/0xa0
[<ffffffff81071de5>] ? irq_exit+0x85/0x90
[<ffffffff814f4eb0>] ? smp_apic_timer_interrupt+0x70/0x9b
[<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20
<EOI> [<ffffffff812c4b0e>] ? intel_idle+0xde/0x170
[<ffffffff812c4af1>] ? intel_idle+0xc1/0x170
[<ffffffff813fa027>] ? cpuidle_idle_call+0xa7/0x140
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e5ffc>] ? start_secondary+0x202/0x245
hostname ----- mobile-imaging-009.p9.releng.scl1.mozilla.com
component ----- kernel
time ----- 1365335034
reason ----- WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
analyzer ----- Kerneloops
kernel_tainted_short ----- ---------W
cmdline ----- ro root=UUID=e89fe5eb-b1e9-47e6-8346-4b51dda35a8b rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD quiet SYSFONT=latarcyrheb-sun16 rhgb crashkernel=129M@0M KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM
os_release ----- CentOS release 6.2 (Final)
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Summary: foopy65 lost network conenctivity → "eth0: reset adaptor" on ix multinode systems
Comment 6 • 11 years ago (Assignee)
foopy121 back on April 5th (bug 858565).

Duplicate check
=====
Common information
=====
package ----- kernel
architecture ----- x86_64
kernel ----- 2.6.32-220.7.1.el6.x86_64

Additional information
=====
kernel_tainted_long ----- Taint on warning.
kernel_tainted ----- 512
backtrace -----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: X8SIL
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 microcode i2c_i801 i2c_core serio_raw sg iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64 #1
Call Trace:
<IRQ> [<ffffffff81069a17>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff81069b06>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff8144a60d>] ? dev_watchdog+0x26d/0x280
[<ffffffff8108b3fd>] ? insert_work+0x6d/0xb0
[<ffffffff8109b743>] ? ktime_get+0x63/0xe0
[<ffffffff8107bbe5>] ? internal_add_timer+0xb5/0x110
[<ffffffff8144a3a0>] ? dev_watchdog+0x0/0x280
[<ffffffff8107c7f7>] ? run_timer_softirq+0x197/0x340
[<ffffffff81095610>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff81072001>] ? __do_softirq+0xc1/0x1d0
[<ffffffff810d94a0>] ? handle_IRQ_event+0x60/0x170
[<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de85>] ? do_softirq+0x65/0xa0
[<ffffffff81071de5>] ? irq_exit+0x85/0x90
[<ffffffff814f4dc5>] ? do_IRQ+0x75/0xf0
[<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
<EOI> [<ffffffff812c4b0e>] ? intel_idle+0xde/0x170
[<ffffffff812c4af1>] ? intel_idle+0xc1/0x170
[<ffffffff813fa027>] ? cpuidle_idle_call+0xa7/0x140
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e5ffc>] ? start_secondary+0x202/0x245
hostname ----- foopy121.build.mtv1.mozilla.com
component ----- kernel
cmdline ----- ro root=UUID=80d9716b-c3c1-40b8-b7e3-7fe1113c9f9b rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD quiet SYSFONT=latarcyrheb-sun16 rhgb crashkernel=129M@0M KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM
reason ----- WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
kernel_tainted_short ----- ---------W
analyzer ----- Kerneloops
time ----- 1365171322
os_release ----- CentOS release 6.2 (Final)
Component: Server Operations: RelEng → Server Operations
QA Contact: arich → shyam
Comment 7 • 11 years ago (Assignee)
Server Ops folks: I need some help figuring out how to track this error down. It's happening on new Supermicro hardware (IX X8SIL), with CentOS 6.2 x86_64. It doesn't seem to occur with other OSes on the same hardware (we're running Ubuntu on 200 nodes and Win8 on 100).
Comment 9 • 11 years ago (Assignee)
foopy121's second failure was the same:

WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: X8SIL
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: ipv6 microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64 #1
Call Trace:
<IRQ> [<ffffffff81069a17>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff81069b06>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff8144a60d>] ? dev_watchdog+0x26d/0x280
[<ffffffff8109b743>] ? ktime_get+0x63/0xe0
[<ffffffff8109f0e4>] ? clockevents_program_event+0x54/0xa0
[<ffffffff810a0635>] ? tick_dev_program_event+0x65/0xc0
[<ffffffff8144a3a0>] ? dev_watchdog+0x0/0x280
[<ffffffff8107c7f7>] ? run_timer_softirq+0x197/0x340
[<ffffffff81095610>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff81072001>] ? __do_softirq+0xc1/0x1d0
[<ffffffff810d94a0>] ? handle_IRQ_event+0x60/0x170
[<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de85>] ? do_softirq+0x65/0xa0
[<ffffffff81071de5>] ? irq_exit+0x85/0x90
[<ffffffff814f4dc5>] ? do_IRQ+0x75/0xf0
[<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
<EOI> [<ffffffff812c4b0e>] ? intel_idle+0xde/0x170
[<ffffffff812c4af1>] ? intel_idle+0xc1/0x170
[<ffffffff813fa027>] ? cpuidle_idle_call+0xa7/0x140
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e5ffc>] ? start_secondary+0x202/0x245
Comment 10 • 11 years ago (Assignee)
Correction: this is happening on new and old hardware. That suggests it's a bug in CentOS. In the LKML discussion (http://thread.gmane.org/gmane.linux.kernel/1233566), it's eventually decided that the culprit is ASPM. The device has known failures with ASPM L0s, and the drivers attempt to disable that level. On our systems, they're successful in doing so - L0s appears in the capabilities but not in the LnkCtl. However, L1 is still enabled, and later in that thread, Chris Boot determines that both levels are problematic. Nix was having problems with NICs never coming up at all. We don't seem to have that problem, so we can probably just use setpci to disable ASPM L1 after boot.
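For reference, the ASPM setting that LnkCtl reports lives in the low two bits of the PCIe Link Control register: 00 is disabled, 01 is L0s, 10 is L1, 11 is both. The following is only a sketch of decoding one register byte in the form setpci prints it (two hex digits, no 0x prefix). The helper name `aspm_state` is made up for illustration and is not part of pciutils.

```shell
#!/bin/sh
# Decode the ASPM Control field (bits 1:0) of a PCIe Link Control byte.
# $1: one hex byte without a 0x prefix, e.g. "42".
# aspm_state is a hypothetical helper name used only in this sketch.
aspm_state() {
    byte=$((0x$1))
    case $((byte & 3)) in
        0) echo "disabled" ;;
        1) echo "L0s" ;;
        2) echo "L1" ;;
        3) echo "L0s+L1" ;;
    esac
}
```

On a host with the NIC present, the input could come from something like `setpci -d 8086:10d3 CAP_EXP+10.b`, assuming it prints one byte per matching device.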
Comment 11 • 11 years ago (Assignee)
This totally reminds me of peek and poke in my Applesoft Basic days, but this appears to disable ASPM on all 82574L's on the system:

setpci -d 8086:10d3 CAP_EXP+10.b=40

That's easy to automate with puppet, again counting on the fact that these hosts don't fall over immediately on boot.
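Note that writing 40 replaces the whole low byte of Link Control: it clears the ASPM bits (1:0), but it also forces the other six bits to a fixed pattern. A more conservative variant would clear only bits 1:0 and write everything else back unchanged. The arithmetic for that is sketched below; `clear_aspm_bits` is a hypothetical helper, and this is not what the patch in this bug does.

```shell
#!/bin/sh
# Given the current Link Control low byte (hex, no 0x prefix), print the
# byte with only the ASPM Control bits (1:0) cleared, as two hex digits.
# clear_aspm_bits is a hypothetical helper name used only in this sketch.
clear_aspm_bits() {
    printf '%02x\n' $(( 0x$1 & ~3 & 0xff ))
}
```

A read-modify-write for the single device from the original report might then look like `setpci -s 03:00.0 CAP_EXP+10.b=$(clear_aspm_bits "$(setpci -s 03:00.0 CAP_EXP+10.b)")`. That one-liner is untested here and would need root.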
Comment 12 • 11 years ago (Assignee)
Attachment #736943 - Flags: review?(bugspam.Callek)
Comment 13 • 11 years ago (Assignee)
Comment on attachment 736943
bug808397.patch

So as far as risk here:
* this doesn't cause any interruption to the behavior of the system
* this occurs at runtime, and doesn't persist over boots
* I ran this on the three types of affected hardware, with no ill effects

The biggest risk I see is that hard-coding the PCIE register like that is probably not ideal. Given that this is exactly one model of hardware that we've been careful to buy exactly one lot of, I think we're OK. At any rate, I can't find any better ways to do this anyway.
Comment 14 • 11 years ago (Assignee)
Greg mentioned in IRC that I should probably be more careful about the pattern used in grep. They're hex bytes, so I expect two characters, but who knows what kind of error message or other stuff could appear in the output.
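One way to be strict about the pattern is to require that every output line is exactly the expected byte, so that error text or an unexpected value never counts as "already set". A minimal sketch, assuming setpci prints one hex byte per matching device on its own line; `aspm_already_off` is an illustrative name, not something from the patch.

```shell
#!/bin/sh
# Read register bytes on stdin, one per line. Succeed only if no line
# differs from "40". grep -x matches whole lines, -v inverts the match,
# and -q suppresses output, so grep succeeds when an offending line exists;
# the leading "!" flips that. Empty input (no matching devices) also
# succeeds, meaning there is nothing to change.
# aspm_already_off is a hypothetical helper name used only in this sketch.
aspm_already_off() {
    ! grep -qvx '40'
}
```

Used as a guard, `setpci -d 8086:10d3 CAP_EXP+10.b | aspm_already_off` would return success only when every matching device already reads back 40.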
Attachment #736943 - Attachment is obsolete: true
Attachment #736943 - Flags: review?(bugspam.Callek)
Attachment #736969 - Flags: review?(bugspam.Callek)
Attachment #736969 - Flags: feedback?(gcox)
Comment 15 • 11 years ago
Comment on attachment 736969
bug808397-r2.patch

I share :dustin's concern that he's hard-coded a peek/poke value in there, but it's in such an isolated branch (which you get to by matching hard-coded model names, hi there irony!) that I think he's taken a good swipe at it, and the foundation is there for re-solving it if it comes back in a later hw release.
Attachment #736969 - Flags: feedback?(gcox) → feedback+
Comment 16 • 11 years ago
Comment on attachment 736969
bug808397-r2.patch

Review of attachment 736969:
-----------------------------------------------------------------

besides comments below, I'd prefer if we landed only when you + at least 1 DCOps person is expected to be around and able to dive in (in case anything needs hands on as a weird fallout from this) -- so likely monday morning instead of late friday

::: modules/hardware/manifests/init.pp
@@ +32,5 @@
> # Nodes running IPMI-compliant hardware should install OpenIPMI
> if (($::manufacturer == "HP" and $::productname =~ /ProLiant/) or
> ($::manufacturer == "iXsystems" and $::productname == "iX700-C") or
> + # iX700-C's can show up as X8SIL, too
> + ($::manufacturer == "iXsystems" and $::productname == "X8SIL") or

this is a surprising ride-along. Not opposed to taking it, just making sure it's not part of another bug that snuck in here

@@ +46,5 @@
> + # particular devices, that register is at the offset below. This is
> + # unlikely to work on other hardware (and likely to make things go boom)!
> + # see bug 808397
> + exec {
> + 'setpci-aspm-off':

as mentioned in IRC my r+ is on puppet syntax and not on the command or risk-assessment of change itself.

@@ +50,5 @@
> + 'setpci-aspm-off':
> + command => '/sbin/setpci -d 8086:10d3 CAP_EXP+10.b=40',
> + unless => '/usr/bin/test -z "`/sbin/setpci -d 8086:10d3 CAP_EXP+10.b | grep -v ^40$`"';
> + }
> + }

I would much prefer this exec to be thrown into a tweaks:: module and included from here similar to how we include ipmi above. Rather than making the tweak directly in this module. It is just my own idealism and is a minor nit, so if you disagree even slightly go ahead and land-as-is.
Attachment #736969 - Flags: review?(bugspam.Callek) → review+
Comment 17 • 11 years ago (Assignee)
How's this? The ride-along is intentional. It looks like some models of the old ix hardware identify as X8SIL, while others identify as iX700-C.
Attachment #736969 - Attachment is obsolete: true
Attachment #737466 - Flags: review?(bugspam.Callek)
Comment 18 • 11 years ago
Comment on attachment 737466
bug808397-r3.patch

Other than a pretty cryptic tweaks name, r+ (I have no better color for the shed in mind though, so go ahead and paint it unless a prettier color exists in your mind)
Attachment #737466 - Flags: review?(bugspam.Callek) → review+
Comment 19 • 11 years ago (Assignee)
landed - thanks!
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Comment 20 • 11 years ago (Assignee)
It seems iX systems may now have manufacturer 'Supermicro', rather than 'ixSystems'.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 21 • 11 years ago (Assignee)
Although I can't replicate 'ixSystems' there. I wonder if we have slightly different HW revs across the fleet :(
Comment 22 • 11 years ago (Assignee)
Attachment #757389 - Flags: review?(bugspam.Callek)
Comment 23 • 11 years ago
Comment on attachment 757389
bug808397.patch

Review of attachment 757389:
-----------------------------------------------------------------

(In reply to Dustin J. Mitchell [:dustin] from comment #20)
> It seems iX systems may now have manufacturer 'Supermicro', rather than
> 'ixSystems'.

I'm going to r+ this based on this, but I admit when on foopy125 running |facter -p| I still don't see a manufacturer fact to see this, so would love to know how you're finding it.
Attachment #757389 - Flags: review?(bugspam.Callek) → review+
Comment 24 • 11 years ago (Assignee)
Facter has some incorrect/missing results when run as non-root, so run it as root.
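As a rough illustration of the kind of model matching discussed in this bug, here is a sketch of a check on the manufacturer and productname values. The exact list in the landed patches is not shown in this bug, so the combinations below are an assumption pieced together from the names mentioned in the comments, and `is_affected_ix_host` is a hypothetical helper.

```shell
#!/bin/sh
# Succeed if the given manufacturer/productname pair looks like one of the
# iX hosts named in this bug. The value list is an assumption for
# illustration only, not the list from the landed Puppet patches.
# is_affected_ix_host is a hypothetical helper name.
is_affected_ix_host() {
    manufacturer=$1
    productname=$2
    case "$manufacturer" in
        iXsystems|Supermicro) ;;
        *) return 1 ;;
    esac
    case "$productname" in
        X8SIL|iX700-C) return 0 ;;
        *) return 1 ;;
    esac
}
```

Run as root, it could be fed with `is_affected_ix_host "$(facter manufacturer)" "$(facter productname)"`.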
Updated • 11 years ago (Assignee)
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Component: Server Operations → Server Operations: RelEng
QA Contact: shyam → arich
Resolution: --- → FIXED
Updated • 11 years ago (Assignee)
Attachment #737466 - Flags: checked-in+
Updated • 11 years ago (Assignee)
Attachment #757389 - Flags: checked-in+
Updated • 11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations