Closed
Bug 853835
Opened 12 years ago
Closed 12 years ago
replace foopy127 with another ix machine
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: arich, Assigned: arich)
References
Details
(Whiteboard: [buildduty])
foopy127 started alerting for ping at Fri 02:12:08 PDT. When I checked on it this morning, I noticed that ifconfig eth0 showed a massive number of errors and dropped packets. Can someone check on the cable and the switch port, please?
RX packets: 74189222 errors:2293512535530 dropped:302252089255 overruns:0 frame:1529008357020
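To put those counters in perspective, here is a minimal Python sketch that parses the RX line quoted above and compares the error count with the packet count. The function name `parse_rx_counters` is made up for illustration, and the regex assumes the classic net-tools `ifconfig` layout shown in this report.

```python
import re

# The RX line quoted above, copied verbatim from the ifconfig output.
RX_LINE = (
    "RX packets: 74189222 errors:2293512535530 dropped:302252089255 "
    "overruns:0 frame:1529008357020"
)


def parse_rx_counters(line):
    """Return a name -> integer mapping for the counters on an ifconfig RX line."""
    # Each counter is printed as "name:value", sometimes with a space after the colon.
    return {name: int(value) for name, value in re.findall(r"(\w+):\s*(\d+)", line)}


counters = parse_rx_counters(RX_LINE)
errors_per_packet = counters["errors"] / counters["packets"]
print(counters)
print(f"errors per received packet: {errors_per_packet:.0f}")
```

An error counter several orders of magnitude larger than the packet counter is hard to explain by real traffic alone, which is one reason to suspect the NIC or driver state and not just the cable.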
Comment 1•12 years ago
foopy127 should have steady connectivity now. I had to reboot the host.
ping 10.250.49.171
PING 10.250.49.171 (10.250.49.171): 56 data bytes
64 bytes from 10.250.49.171: icmp_seq=0 ttl=63 time=2.058 ms
64 bytes from 10.250.49.171: icmp_seq=1 ttl=63 time=2.320 ms
64 bytes from 10.250.49.171: icmp_seq=2 ttl=63 time=2.246 ms
64 bytes from 10.250.49.171: icmp_seq=3 ttl=63 time=1.783 ms
64 bytes from 10.250.49.171: icmp_seq=4 ttl=63 time=47.888 ms
64 bytes from 10.250.49.171: icmp_seq=5 ttl=63 time=32.533 ms
64 bytes from 10.250.49.171: icmp_seq=6 ttl=63 time=3.627 ms
Assignee: server-ops-dcops → vhua
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 2•12 years ago
Did you find a root cause for the errors in the first place? I didn't just want to do a reboot because I wanted to leave things as is for you guys to troubleshoot. If nothing else changed, my concern is that we're just going to run into the same issue again.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•12 years ago
colo-trip: --- → mtv1
Comment 3•12 years ago
I swapped out the Ethernet cable and tried a different port on the switch, but the issue was not resolved. I ended up patching the cable back to its original port and rebooting the host. eth0 connectivity came back online afterwards.
Status: REOPENED → RESOLVED
colo-trip: mtv1 → ---
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Comment 4•12 years ago
Sooo, I'd rather not sweep this under the rug, based on what I now see in e-mail.
I could use some understanding of this before we move on with a production device:
----- Original Message -----
From: user@localhost.build.mtv1.mozilla.com
To: root@localhost.build.mtv1.mozilla.com
Sent: Friday, March 22, 2013 5:05:06 AM
Subject: [abrt] full crash report
Duplicate check
=====
Common information
=====
package
-----
kernel
architecture
-----
x86_64
kernel
-----
2.6.32-220.7.1.el6.x86_64
Additional information
=====
kernel_tainted_long
-----
Taint on warning.
time
-----
1363943103
backtrace
-----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: X8SIL
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64 #1
Call Trace:
<IRQ> [<ffffffff81069a17>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff81069b06>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff8144a60d>] ? dev_watchdog+0x26d/0x280
[<ffffffff8108b3fd>] ? insert_work+0x6d/0xb0
[<ffffffff8109b743>] ? ktime_get+0x63/0xe0
[<ffffffff8107bbe5>] ? internal_add_timer+0xb5/0x110
[<ffffffff8144a3a0>] ? dev_watchdog+0x0/0x280
[<ffffffff8107c7f7>] ? run_timer_softirq+0x197/0x340
[<ffffffff81095610>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff81072001>] ? __do_softirq+0xc1/0x1d0
[<ffffffff810d94a0>] ? handle_IRQ_event+0x60/0x170
[<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de85>] ? do_softirq+0x65/0xa0
[<ffffffff81071de5>] ? irq_exit+0x85/0x90
[<ffffffff814f4dc5>] ? do_IRQ+0x75/0xf0
[<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
<EOI> [<ffffffff812c4b0e>] ? intel_idle+0xde/0x170
[<ffffffff812c4af1>] ? intel_idle+0xc1/0x170
[<ffffffff813fa027>] ? cpuidle_idle_call+0xa7/0x140
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e5ffc>] ? start_secondary+0x202/0x245
hostname
-----
foopy127.build.mtv1.mozilla.com
component
-----
kernel
reason
-----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
kernel_tainted
-----
512
kernel_tainted_short
-----
---------W
analyzer
-----
Kerneloops
cmdline
-----
ro root=UUID=dbaf19d1-f9bf-4c96-b900-1b5b56fd96cc rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD quiet SYSFONT=latarcyrheb-sun16 rhgb crashkernel=129M@0M KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM
os_release
-----
CentOS release 6.2 (Final)
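The `time` field in the report above is a Unix epoch value. As a rough sketch, it can be converted to wall-clock time to compare against the ping alert in the description. The fixed UTC-7 offset used for PDT below is an assumption for illustration, not a full time zone database lookup.

```python
from datetime import datetime, timedelta, timezone

CRASH_EPOCH = 1363943103  # the "time" field from the abrt report above

# Assumption: PDT is modelled as a fixed UTC-7 offset.
PDT = timezone(timedelta(hours=-7), "PDT")

crash_utc = datetime.fromtimestamp(CRASH_EPOCH, tz=timezone.utc)
crash_pdt = crash_utc.astimezone(PDT)

print(crash_utc.isoformat())  # 2013-03-22T09:05:03+00:00
print(crash_pdt.isoformat())  # 2013-03-22T02:05:03-07:00
```

If the ping alert at Fri 02:12:08 PDT refers to the same Friday, the watchdog warning was logged about seven minutes before the alert.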
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 5•12 years ago
The important line from this crash dump is:
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
That's usually indicative of a network driver or hardware failure. The E1000 drivers are pretty stable, but there's no surefire way to reproduce or confirm (especially since the issue cleared up after a power cycle).
If it happens again, our only recourse is to replace the motherboard.
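To catch a recurrence early, the watchdog message can be searched for in saved kernel log text. The following is a minimal Python sketch, not an existing monitoring check. The function name `find_watchdog_timeouts` and the regex group names are hypothetical, and the pattern is based only on the single message quoted in this bug.

```python
import re

# Pattern modelled on the watchdog line quoted above; group names are illustrative.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_timeouts(log_text):
    """Return an (interface, driver, queue) tuple for each watchdog timeout found."""
    return [
        (m.group("iface"), m.group("driver"), int(m.group("queue")))
        for m in WATCHDOG_RE.finditer(log_text)
    ]


# Three lines taken from the backtrace in comment 4.
sample = """Hardware name: X8SIL
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64 #1"""

print(find_watchdog_timeouts(sample))  # [('eth0', 'e1000e', 0)]
```

The same function could be pointed at the contents of a kernel log file, with a non-empty result treated as the signal to revisit this bug.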
Comment 6•12 years ago
If it happened again, we'd just pitch the box since it's out of warranty. We'd have to take one of the four back from the bld-linux64-ix ones I did.
Comment 7•12 years ago
There've been no further errors. callek: do you want to decommission this machine and use a different one, or do you want to resolve this ticket and call it good?
Flags: needinfo?(bugspam.Callek)
Updated•12 years ago
colo-trip: --- → mtv1
Comment 8•12 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #7)
> There've been no further errors. callek: do you want to decommission this
> machine and use a different one, or do you want to resolve this ticket and
> call it good?
Let's call it good for now. Based on the prior conversation here, if it resurfaces with a similar issue we'll drop it onto the floor, thwap it with a bat, and play with some thermite, and figure out next steps at that point.
Status: REOPENED → RESOLVED
colo-trip: mtv1 → ---
Closed: 12 years ago → 12 years ago
Flags: needinfo?(bugspam.Callek)
Resolution: --- → FIXED
Comment 9•12 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #6)
> If it happened again, we'd just pitch the box since it's out of warranty.
> We'd have to take one of the four back from the bld-linux64-ix ones I did.
On second thought, this is bad again.
[02:51:55] nagios-releng Tue 23:51:54 PDT [454] foopy127.build.mtv1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
Assuming the issue now is the same as what this bug describes...
Let's find another ix host to replace it with.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: tons of errors and dropped packets on foopy127 → replace foopy127 with another ix machine
Whiteboard: [buildduty]
Comment 10•12 years ago
callek: please tell us which ix machine in mtv1 you would like retasked to be a foopy.
Assignee: vhua → arich
Component: Server Operations: DCOps → Server Operations: RelEng
Flags: needinfo?(bugspam.Callek)
QA Contact: dmoore → arich
Comment 11•12 years ago
bld-linux64-ix-054 is ready and waiting
Comment 12•12 years ago
bld-linux64-ix-054 has been turned into foopy128
I'll open a separate bug to decomm foopy127.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Updated•12 years ago
Flags: needinfo?(bugspam.Callek)
Updated•12 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations