853835 - replace foopy127 with another ix machine

Assignee

Description

•

12 years ago

foopy127 started alerting for ping at Fri 02:12:08 PDT. When I checked on it this morning, I noticed that ifconfig eth0 showed a massive number of errors and dropped packets. Can someone check on the cable and the switch port, please?

RX packets: 74189222 errors:2293512535530 dropped:302252089255 overruns:0 frame:1529008357020
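For anyone wanting to compare these counters across hosts, here is a minimal sketch that pulls the numbers out of an RX line like the one quoted above and computes errors per received packet. The function names are hypothetical and the ratio is only an illustrative heuristic, not something DCOps or nagios uses.

```python
import re

def parse_rx_counters(line):
    """Return the name:value counters from an ifconfig-style RX line as ints."""
    # Allow optional whitespace after the colon, as in "packets: 74189222".
    return {name: int(value) for name, value in re.findall(r"(\w+):\s*(\d+)", line)}

def errors_per_packet(counters):
    """Ratio of RX errors to RX packets; None if no packets were counted."""
    if counters.get("packets", 0) == 0:
        return None
    return counters["errors"] / counters["packets"]

# The RX line reported in the description of this bug.
rx_line = ("RX packets: 74189222 errors:2293512535530 "
           "dropped:302252089255 overruns:0 frame:1529008357020")

counters = parse_rx_counters(rx_line)
print(counters)
print(errors_per_packet(counters))
```

A ratio far above 1, as here, means the interface logged many more errors than good packets, which is consistent with the "massive number of errors" described.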

Vinh Hua [:vinh]

Comment 1

•

12 years ago

Foopy127 should have steady connectivity now. Had to perform a reboot on host.

ping 10.250.49.171
PING 10.250.49.171 (10.250.49.171): 56 data bytes
64 bytes from 10.250.49.171: icmp_seq=0 ttl=63 time=2.058 ms
64 bytes from 10.250.49.171: icmp_seq=1 ttl=63 time=2.320 ms
64 bytes from 10.250.49.171: icmp_seq=2 ttl=63 time=2.246 ms
64 bytes from 10.250.49.171: icmp_seq=3 ttl=63 time=1.783 ms
64 bytes from 10.250.49.171: icmp_seq=4 ttl=63 time=47.888 ms
64 bytes from 10.250.49.171: icmp_seq=5 ttl=63 time=32.533 ms
64 bytes from 10.250.49.171: icmp_seq=6 ttl=63 time=3.627 ms

Assignee: server-ops-dcops → vhua

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Amy Rich [:arr] [:arich]

Assignee

Comment 2

•

12 years ago

Did you find a root cause for the errors in the first place? I didn't just want to do a reboot because I wanted to leave things as is for you guys to troubleshoot. If nothing else changed, my concern is that we're just going to run into the same issue again.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Van Le [:van]

Updated

•

12 years ago

colo-trip: --- → mtv1

Vinh Hua [:vinh]

Comment 3

•

12 years ago

I swapped out the ethernet cable and tried a different port on the switch, but the issue was not resolved. Ended up patching the cable to its original port and performing a host reboot. Eth0 connectivity came back online afterwards.

Status: REOPENED → RESOLVED

colo-trip: mtv1 → ---

Closed: 12 years ago → 12 years ago

Resolution: --- → FIXED

Justin Wood (:Callek)

Comment 4

•

12 years ago

Sooo, I'd rather not sweep this under the rug based on what I now see in e-mail. Could use some understanding of this before we move on for a production device:

----- Original Message -----
From: user@localhost.build.mtv1.mozilla.com
To: root@localhost.build.mtv1.mozilla.com
Sent: Friday, March 22, 2013 5:05:06 AM
Subject: [abrt] full crash report

Duplicate check
=====

Common information
=====
package
-----
kernel

architecture
-----
x86_64

kernel
-----
2.6.32-220.7.1.el6.x86_64

Additional information
=====
kernel_tainted_long
-----
Taint on warning.

time
-----
1363943103

backtrace
-----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: X8SIL
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64 #1
Call Trace:
<IRQ> [<ffffffff81069a17>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff81069b06>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff8144a60d>] ? dev_watchdog+0x26d/0x280
[<ffffffff8108b3fd>] ? insert_work+0x6d/0xb0
[<ffffffff8109b743>] ? ktime_get+0x63/0xe0
[<ffffffff8107bbe5>] ? internal_add_timer+0xb5/0x110
[<ffffffff8144a3a0>] ? dev_watchdog+0x0/0x280
[<ffffffff8107c7f7>] ? run_timer_softirq+0x197/0x340
[<ffffffff81095610>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff81072001>] ? __do_softirq+0xc1/0x1d0
[<ffffffff810d94a0>] ? handle_IRQ_event+0x60/0x170
[<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de85>] ? do_softirq+0x65/0xa0
[<ffffffff81071de5>] ? irq_exit+0x85/0x90
[<ffffffff814f4dc5>] ? do_IRQ+0x75/0xf0
[<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
<EOI> [<ffffffff812c4b0e>] ? intel_idle+0xde/0x170
[<ffffffff812c4af1>] ? intel_idle+0xc1/0x170
[<ffffffff813fa027>] ? cpuidle_idle_call+0xa7/0x140
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e5ffc>] ? start_secondary+0x202/0x245

hostname
-----
foopy127.build.mtv1.mozilla.com

component
-----
kernel

reason
-----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)

kernel_tainted
-----
512

kernel_tainted_short
-----
---------W

analyzer
-----
Kerneloops

cmdline
-----
ro root=UUID=dbaf19d1-f9bf-4c96-b900-1b5b56fd96cc rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD quiet SYSFONT=latarcyrheb-sun16 rhgb crashkernel=129M@0M KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM

os_release
-----
CentOS release 6.2 (Final)
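As a side note on the kernel_tainted fields in that report: the numeric value is a bitmask, and the short string shows one character per bit position. A minimal sketch (helper names are hypothetical) that checks the two fields agree with each other:

```python
def set_bits(value):
    """Return the bit positions that are set in an integer taint mask."""
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]

def flagged_positions(short):
    """Return the positions in the short taint string that are not '-'."""
    return [index for index, char in enumerate(short) if char != "-"]

# Values copied from the abrt report in this comment.
kernel_tainted = 512
kernel_tainted_short = "-" * 9 + "W"  # nine dashes followed by W

print(set_bits(kernel_tainted))
print(flagged_positions(kernel_tainted_short))
```

Both point at the same single position, matching the report's "Taint on warning." line: the only taint recorded is the warning itself, not a proprietary or forced module.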

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Derek Moore [:dmoore]

Comment 5

•

12 years ago

The important line from this crash dump is:

NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out

That's usually indicative of a network driver or hardware failure. The E1000 drivers are pretty stable, but there's no surefire way to reproduce or confirm (especially since the issue cleared up after a power cycle). If it happens again, our only recourse is to replace the motherboard.
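If we want to watch for a recurrence, one option is to scan kernel log text for that message. This is only a rough sketch: the regular expression is derived from the single line quoted in this bug, and the function name and the idea of feeding it log lines are assumptions, not an existing check.

```python
import re

# Pattern based on the one message seen in this bug's crash report.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (\S+) \(([^)]+)\): transmit queue (\d+) timed out"
)

def find_watchdog_timeouts(lines):
    """Return (interface, driver, queue) for each transmit-timeout line."""
    hits = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append((match.group(1), match.group(2), int(match.group(3))))
    return hits

sample = [
    "Hardware name: X8SIL",
    "NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out",
]
print(find_watchdog_timeouts(sample))
```

Any hit on a host that has already been power cycled once would be the "happens again" case described above.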

Amy Rich [:arr] [:arich]

Assignee

Comment 6

•

12 years ago

If it happened again, we'd just pitch the box since it's out of warranty. We'd have to take one of the four back from the bld-linux64-ix ones I did.

Amy Rich [:arr] [:arich]

Assignee

Comment 7

•

12 years ago

There've been no further errors. callek: do you want to decommission this machine and use a different one, or do you want to resolve this ticket and call it good?

Flags: needinfo?(bugspam.Callek)

Van Le [:van]

Updated

•

12 years ago

colo-trip: --- → mtv1

Justin Wood (:Callek)

Comment 8

•

12 years ago

(In reply to Amy Rich [:arich] [:arr] from comment #7)
> There've been no further errors. callek: do you want to decommission this
> machine and use a different one, or do you want to resolve this ticket and
> call it good?

Let's call it good for now. Based on prior convo here, if it resurfaces with a similar issue we'll drop it onto the floor, thwap it with a bat, and play with some thermite, and figure out next steps at that point.

Status: REOPENED → RESOLVED

colo-trip: mtv1 → ---

Closed: 12 years ago → 12 years ago

Flags: needinfo?(bugspam.Callek)

Resolution: --- → FIXED

Justin Wood (:Callek)

Comment 9

•

12 years ago

(In reply to Amy Rich [:arich] [:arr] from comment #6)
> If it happened again, we'd just pitch the box since it's out of warranty.
> We'd have to take one of the four back from the bld-linux64-ix ones I did.

On second thought, this is bad again.

[02:51:55] nagios-releng Tue 23:51:54 PDT [454] foopy127.build.mtv1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%

Assuming the issue now is the same as what this bug describes... Let's find another ix host to replace it with.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Summary: tons of errors and dropped packets on foopy127 → replace foopy127 with another ix machine

Whiteboard: [buildduty]

Amy Rich [:arr] [:arich]

Assignee

Comment 10

•

12 years ago

callek: please tell us which ix machine in mtv1 you would like retasked to be a foopy.

Assignee: vhua → arich

Component: Server Operations: DCOps → Server Operations: RelEng

Flags: needinfo?(bugspam.Callek)

QA Contact: dmoore → arich

Chris AtLee [:catlee]

Comment 11

•

12 years ago

bld-linux64-ix-054 is ready and waiting

Amy Rich [:arr] [:arich]

Assignee

Comment 12

•

12 years ago

bld-linux64-ix-054 has been turned into foopy128. I'll open a separate bug to decomm foopy127.

Status: REOPENED → RESOLVED

Closed: 12 years ago → 12 years ago

Resolution: --- → FIXED

Justin Wood (:Callek)

Updated

•

12 years ago

Flags: needinfo?(bugspam.Callek)

Justin Wood (:Callek)

Updated

•

12 years ago

Blocks: tegra-282

Justin Wood (:Callek)

Updated

•

12 years ago

Blocks: tegra-276

Nobody; OK to take it and work on it

Updated

•

12 years ago

Component: Server Operations: RelEng → RelOps

Product: mozilla.org → Infrastructure & Operations

Categories

(Infrastructure & Operations :: RelOps: General, task)

Tracking

(Not tracked)

People

(Reporter: arich, Assigned: arich)

References

Details

(Whiteboard: [buildduty])

Crash Data

Security

(public)
