Closed Bug 853835 Opened 12 years ago Closed 12 years ago

replace foopy127 with another ix machine

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: arich)

Details

(Whiteboard: [buildduty])

foopy127 started alerting for ping at Fri 02:12:08 PDT. When I checked on it this morning, I noticed that ifconfig eth0 showed a massive number of errors and dropped packets. Can someone check on the cable and the switch port, please?
RX packets: 74189222 errors:2293512535530 dropped:302252089255 overruns:0 frame:1529008357020
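For anyone wanting to check counters like these automatically, here is a minimal sketch that parses the RX line quoted above and flags an interface whose error and drop counts are out of proportion to its packet count. The function names and the 1% threshold are illustrative assumptions, not anything used on foopy127, and the regex only targets the old net-tools ifconfig layout shown in this comment.

```python
import re

# Matches the RX counter line as quoted in this bug (old net-tools ifconfig layout).
RX_LINE = re.compile(
    r"RX packets:\s*(?P<packets>\d+)\s+errors:\s*(?P<errors>\d+)"
    r"\s+dropped:\s*(?P<dropped>\d+)\s+overruns:\s*(?P<overruns>\d+)"
    r"\s+frame:\s*(?P<frame>\d+)"
)


def parse_rx_counters(text):
    """Return the RX counters as a dict of ints, or None if no RX line is found."""
    match = RX_LINE.search(text)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def looks_unhealthy(counters, max_error_ratio=0.01):
    """Hypothetical check: errors plus drops exceed a fraction of received packets."""
    packets = counters["packets"]
    bad = counters["errors"] + counters["dropped"]
    if packets == 0:
        return bad > 0
    return bad / packets > max_error_ratio


sample = (
    "RX packets: 74189222 errors:2293512535530 dropped:302252089255 "
    "overruns:0 frame:1529008357020"
)
counters = parse_rx_counters(sample)
print(counters, looks_unhealthy(counters))
```

With the numbers from this bug the error count is far larger than the packet count, so any sensible threshold would flag the interface.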
Foopy127 should have steady connectivity now. Had to perform a reboot on host.
ping 10.250.49.171
PING 10.250.49.171 (10.250.49.171): 56 data bytes
64 bytes from 10.250.49.171: icmp_seq=0 ttl=63 time=2.058 ms
64 bytes from 10.250.49.171: icmp_seq=1 ttl=63 time=2.320 ms
64 bytes from 10.250.49.171: icmp_seq=2 ttl=63 time=2.246 ms
64 bytes from 10.250.49.171: icmp_seq=3 ttl=63 time=1.783 ms
64 bytes from 10.250.49.171: icmp_seq=4 ttl=63 time=47.888 ms
64 bytes from 10.250.49.171: icmp_seq=5 ttl=63 time=32.533 ms
64 bytes from 10.250.49.171: icmp_seq=6 ttl=63 time=3.627 ms
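The ping output above answers, but two of the seven replies are much slower than the rest. As a rough sketch, the round-trip times can be pulled out of pasted ping output and compared against the median; the function names and the 5x factor are assumptions chosen for illustration only.

```python
import re
import statistics

# Captures the sequence number and round-trip time from each reply line.
REPLY = re.compile(r"icmp_seq=(\d+)\s+ttl=\d+\s+time=([\d.]+)\s*ms")


def round_trip_times(ping_output):
    """Return a list of (icmp_seq, milliseconds) pairs found in ping output."""
    return [(int(seq), float(ms)) for seq, ms in REPLY.findall(ping_output)]


def latency_spikes(samples, factor=5.0):
    """Hypothetical check: replies slower than `factor` times the median RTT."""
    if not samples:
        return []
    median = statistics.median(ms for _, ms in samples)
    return [(seq, ms) for seq, ms in samples if ms > factor * median]


output = (
    "64 bytes from 10.250.49.171: icmp_seq=0 ttl=63 time=2.058 ms "
    "64 bytes from 10.250.49.171: icmp_seq=1 ttl=63 time=2.320 ms "
    "64 bytes from 10.250.49.171: icmp_seq=2 ttl=63 time=2.246 ms "
    "64 bytes from 10.250.49.171: icmp_seq=3 ttl=63 time=1.783 ms "
    "64 bytes from 10.250.49.171: icmp_seq=4 ttl=63 time=47.888 ms "
    "64 bytes from 10.250.49.171: icmp_seq=5 ttl=63 time=32.533 ms "
    "64 bytes from 10.250.49.171: icmp_seq=6 ttl=63 time=3.627 ms"
)
print(latency_spikes(round_trip_times(output)))
```

A couple of slow replies right after a reboot do not prove anything on their own; the sketch only makes them easier to spot.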
Assignee: server-ops-dcops → vhua
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Did you find a root cause for the errors in the first place? I didn't just want to do a reboot because I wanted to leave things as is for you guys to troubleshoot. If nothing else changed, my concern is that we're just going to run into the same issue again.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
colo-trip: --- → mtv1
I swapped out the ethernet cable and tried a different port on the switch, but the issue was not resolved. I ended up patching the cable back to its original port and rebooting the host. Eth0 connectivity came back online afterwards.
Status: REOPENED → RESOLVED
colo-trip: mtv1 → ---
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Sooo, I'd rather not sweep this under the rug, based on what I now see in e-mail. I could use some understanding of this before we move on for a production device:

----- Original Message -----
From: user@localhost.build.mtv1.mozilla.com
To: root@localhost.build.mtv1.mozilla.com
Sent: Friday, March 22, 2013 5:05:06 AM
Subject: [abrt] full crash report

Duplicate check =====
Common information =====
package ----- kernel
architecture ----- x86_64
kernel ----- 2.6.32-220.7.1.el6.x86_64
Additional information =====
kernel_tainted_long ----- Taint on warning.
time ----- 1363943103
backtrace -----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: X8SIL
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64 #1
Call Trace:
<IRQ>
[<ffffffff81069a17>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff81069b06>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff8144a60d>] ? dev_watchdog+0x26d/0x280
[<ffffffff8108b3fd>] ? insert_work+0x6d/0xb0
[<ffffffff8109b743>] ? ktime_get+0x63/0xe0
[<ffffffff8107bbe5>] ? internal_add_timer+0xb5/0x110
[<ffffffff8144a3a0>] ? dev_watchdog+0x0/0x280
[<ffffffff8107c7f7>] ? run_timer_softirq+0x197/0x340
[<ffffffff81095610>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff81072001>] ? __do_softirq+0xc1/0x1d0
[<ffffffff810d94a0>] ? handle_IRQ_event+0x60/0x170
[<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de85>] ? do_softirq+0x65/0xa0
[<ffffffff81071de5>] ? irq_exit+0x85/0x90
[<ffffffff814f4dc5>] ? do_IRQ+0x75/0xf0
[<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
<EOI>
[<ffffffff812c4b0e>] ? intel_idle+0xde/0x170
[<ffffffff812c4af1>] ? intel_idle+0xc1/0x170
[<ffffffff813fa027>] ? cpuidle_idle_call+0xa7/0x140
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e5ffc>] ? start_secondary+0x202/0x245
hostname ----- foopy127.build.mtv1.mozilla.com
component ----- kernel
reason ----- WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
kernel_tainted ----- 512
kernel_tainted_short ----- ---------W
analyzer ----- Kerneloops
cmdline ----- ro root=UUID=dbaf19d1-f9bf-4c96-b900-1b5b56fd96cc rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD quiet SYSFONT=latarcyrheb-sun16 rhgb crashkernel=129M@0M KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM
os_release ----- CentOS release 6.2 (Final)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The important line from this crash dump is:
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
That's usually indicative of a network driver or hardware failure. The E1000 drivers are pretty stable, but there's no surefire way to reproduce or confirm (especially since the issue cleared up after a power cycle). If it happens again, our only recourse is to replace the motherboard.
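Since a recurrence is the trigger for replacing hardware, it may help to watch for that message. Below is a minimal sketch that scans kernel log text for the NETDEV WATCHDOG line quoted in this comment and reports the interface, driver, and queue. The function name is hypothetical, and how the log text is collected (dmesg, syslog, an abrt mail) is left as an assumption.

```python
import re

# Pattern built from the exact message quoted in this bug.
WATCHDOG = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(log_text):
    """Return one dict per transmit-queue timeout found in the log text."""
    return [
        {
            "iface": match.group("iface"),
            "driver": match.group("driver"),
            "queue": int(match.group("queue")),
        }
        for match in WATCHDOG.finditer(log_text)
    ]


log = (
    "Hardware name: X8SIL\n"
    "NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out\n"
    "Modules linked in: cpufreq_ondemand acpi_cpufreq\n"
)
print(find_tx_timeouts(log))
```

A match only says the watchdog fired; it does not distinguish a driver problem from failing hardware.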
If it happened again, we'd just pitch the box since it's out of warranty. We'd have to take one of the four back from the bld-linux64-ix ones I did.
There've been no further errors. callek: do you want to decommission this machine and use a different one, or do you want to resolve this ticket and call it good?
Flags: needinfo?(bugspam.Callek)
colo-trip: --- → mtv1
(In reply to Amy Rich [:arich] [:arr] from comment #7)
> There've been no further errors. callek: do you want to decommission this
> machine and use a different one, or do you want to resolve this ticket and
> call it good?

Let's call it good for now. Based on the prior conversation here, if it resurfaces with a similar issue we'll drop it onto the floor, thwap it with a bat, and play with some thermite, then figure out next steps at that point.
Status: REOPENED → RESOLVED
colo-trip: mtv1 → ---
Closed: 12 years ago → 12 years ago
Flags: needinfo?(bugspam.Callek)
Resolution: --- → FIXED
(In reply to Amy Rich [:arich] [:arr] from comment #6)
> If it happened again, we'd just pitch the box since it's out of warranty.
> We'd have to take one of the four back from the bld-linux64-ix ones I did.

On second thought, this is bad again.

[02:51:55] nagios-releng Tue 23:51:54 PDT [454] foopy127.build.mtv1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%

Assuming the issue now is the same as what this bug describes, let's find another ix host to replace it with.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: tons of errors and dropped packets on foopy127 → replace foopy127 with another ix machine
Whiteboard: [buildduty]
callek: please tell us which ix machine in mtv1 you would like retasked to be a foopy.
Assignee: vhua → arich
Component: Server Operations: DCOps → Server Operations: RelEng
Flags: needinfo?(bugspam.Callek)
QA Contact: dmoore → arich
bld-linux64-ix-054 is ready and waiting
bld-linux64-ix-054 has been turned into foopy128. I'll open a separate bug to decomm foopy127.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Flags: needinfo?(bugspam.Callek)
Blocks: tegra-282
Blocks: tegra-276
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations