858565 - foopy121 is down

Down again now... wondering if it has a loose cable or something:

[20:12:54] nagios-releng Fri 17:12:37 PDT [491] foopy121.build.mtv1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Justin Wood (:Callek)

Comment 4

•

12 years ago

Tossing over the wall to DCOps. I know we can PDU cycle this, but since it happened again I want DCOps to first check the wire connections in case they are seated loose. Failing that, I'm guessing either a bad power supply on the system or the kernel faulting for some other reason. Please lob back over to releng after you check cables and repower.

Assignee: nobody → server-ops-dcops

Component: Release Engineering: Machine Management → Server Operations: DCOps

QA Contact: armenzg → dmoore

Justin Wood (:Callek)

Comment 5

•

12 years ago

Dropping to normal as it's not weekend work (sorry for the pre-emptive page).

Severity: critical → normal

colo-trip: --- → mtv1

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Updated

•

12 years ago

Status: REOPENED → RESOLVED

Closed: 12 years ago → 12 years ago

Resolution: --- → DUPLICATE

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 7

•

12 years ago

This is duped on the basis of an abrtd email from 7:15 pacific on April 5. That was the first crash (that Ben saw). From what's onscreen, I think the second failure was the same. This failure seems to hit IX systems running CentOS 6.2 randomly. I don't think it's worth reading too much into the back-to-back nature of these failures. I restarted the host via iLO, so we should get another email. Let's give it a few days to see if it fails again, and if not, restart its devices.

Assignee: server-ops-dcops → dustin

Component: Server Operations: DCOps → Server Operations: RelEng

QA Contact: dmoore → arich

Justin Wood (:Callek)

Comment 8

•

12 years ago

This happened again for foopy121; :rbryce brought it back up for us. :dustin, do I remember correctly that the puppet change in 808397 should have solved this for us?

LOG:
=========
Duplicate check
=====

Common information
=====
package ----- kernel
architecture ----- x86_64
kernel ----- 2.6.32-220.7.1.el6.x86_64

Additional information
=====
kernel_tainted_long ----- Taint on warning.
kernel_tainted ----- 512
backtrace -----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: X8SIL
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: ipv6 microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64 #1
Call Trace:
<IRQ> [<ffffffff81069a17>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff81069b06>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff8144a60d>] ? dev_watchdog+0x26d/0x280
[<ffffffff8109b743>] ? ktime_get+0x63/0xe0
[<ffffffff8109f0e4>] ? clockevents_program_event+0x54/0xa0
[<ffffffff810a0635>] ? tick_dev_program_event+0x65/0xc0
[<ffffffff8144a3a0>] ? dev_watchdog+0x0/0x280
[<ffffffff8107c7f7>] ? run_timer_softirq+0x197/0x340
[<ffffffff81095610>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff81072001>] ? __do_softirq+0xc1/0x1d0
[<ffffffff810d94a0>] ? handle_IRQ_event+0x60/0x170
[<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de85>] ? do_softirq+0x65/0xa0
[<ffffffff81071de5>] ? irq_exit+0x85/0x90
[<ffffffff814f4dc5>] ? do_IRQ+0x75/0xf0
[<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
<EOI> [<ffffffff812c4b0e>] ? intel_idle+0xde/0x170
[<ffffffff812c4af1>] ? intel_idle+0xc1/0x170
[<ffffffff813fa027>] ? cpuidle_idle_call+0xa7/0x140
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e5ffc>] ? start_secondary+0x202/0x245
hostname ----- foopy121.build.mtv1.mozilla.com
component ----- kernel
cmdline ----- ro root=UUID=80d9716b-c3c1-40b8-b7e3-7fe1113c9f9b rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD quiet SYSFONT=latarcyrheb-sun16 rhgb crashkernel=129M@0M KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM
reason ----- WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
kernel_tainted_short ----- ---------W
analyzer ----- Kerneloops
time ----- 1370134859
os_release ----- CentOS release 6.2 (Final)
===============
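For anyone triaging these abrtd mails in bulk, the line that matters in the report above is the NETDEV WATCHDOG warning. Here is a minimal sketch of pulling the interface, driver, and queue number out of a report body. The function name `find_tx_timeouts` and the `sample` string are made up for illustration; the pattern only assumes the message looks like the one quoted in this bug.

```python
import re

# Matches the transmit-timeout warning as it appears in the report quoted
# in this bug, e.g. "NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out".
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(report_text):
    """Return a list of (interface, driver, queue) tuples, one per warning found.

    Hypothetical helper, not part of abrt or any Mozilla tooling.
    """
    return [
        (m.group("iface"), m.group("driver"), int(m.group("queue")))
        for m in WATCHDOG_RE.finditer(report_text)
    ]


# Excerpt copied from the report in this bug (flattened onto one line,
# the way it arrives when pasted into a comment).
sample = (
    "Hardware name: X8SIL NETDEV WATCHDOG: eth0 (e1000e): "
    "transmit queue 0 timed out Modules linked in: ipv6 microcode"
)

print(find_tx_timeouts(sample))
```

Because the regex does not depend on line breaks, it works on both the raw mail and the flattened text pasted into comments.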

Justin Wood (:Callek)

Comment 9

•

12 years ago

And of note, when :rbryce brought it up, the hardware clock was 8 *hours* behind; ntpd corrected the skew and then :rbryce fixed the hwclock.

Justin Wood (:Callek)

Comment 10

•

12 years ago

Got around today, and saw another instance in e-mail. Able to log in manually and see:

[cltbld@foopy121 ~]$ uptime
13:34:24 up 3:36, 1 user, load average: 0.02, 0.14, 0.12

hardware clock is also accurate at the moment...

Log of issue follows
=================
Duplicate check
=====

Common information
=====
package ----- kernel
architecture ----- x86_64
kernel ----- 2.6.32-220.7.1.el6.x86_64

Additional information
=====
kernel_tainted_long ----- Taint on warning.
kernel_tainted ----- 512
backtrace -----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: X8SIL
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: ipv6 microcode sg serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64 #1
Call Trace:
<IRQ> [<ffffffff81069a17>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff81069b06>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff8144a60d>] ? dev_watchdog+0x26d/0x280
[<ffffffff8109b743>] ? ktime_get+0x63/0xe0
[<ffffffff8107bbe5>] ? internal_add_timer+0xb5/0x110
[<ffffffff8144a3a0>] ? dev_watchdog+0x0/0x280
[<ffffffff8107c7f7>] ? run_timer_softirq+0x197/0x340
[<ffffffff81095610>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff81072001>] ? __do_softirq+0xc1/0x1d0
[<ffffffff810d94a0>] ? handle_IRQ_event+0x60/0x170
[<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de85>] ? do_softirq+0x65/0xa0
[<ffffffff81071de5>] ? irq_exit+0x85/0x90
[<ffffffff814f4dc5>] ? do_IRQ+0x75/0xf0
[<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
<EOI> [<ffffffff812c4b0e>] ? intel_idle+0xde/0x170
[<ffffffff812c4af1>] ? intel_idle+0xc1/0x170
[<ffffffff813fa027>] ? cpuidle_idle_call+0xa7/0x140
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e5ffc>] ? start_secondary+0x202/0x245
hostname ----- foopy121.build.mtv1.mozilla.com
component ----- kernel
cmdline ----- ro root=UUID=80d9716b-c3c1-40b8-b7e3-7fe1113c9f9b rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD quiet SYSFONT=latarcyrheb-sun16 rhgb crashkernel=129M@0M KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM
reason ----- WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
kernel_tainted_short ----- ---------W
analyzer ----- Kerneloops
time ----- 1370190113
os_release ----- CentOS release 6.2 (Final)
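To line the two reports in this bug up against nagios alerts and uptime, it helps to convert their `time` fields (1370134859 and 1370190113) to wall-clock time. The sketch below assumes those fields are Unix epoch seconds, which is how I read them; the dictionary labels and the `to_utc` helper are only illustrative names.

```python
from datetime import datetime, timezone

# "time" values copied from the two kerneloops reports quoted in this bug.
# Assumption: they are seconds since the Unix epoch.
REPORT_TIMES = {
    "first report": 1370134859,
    "second report": 1370190113,
}


def to_utc(epoch_seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


for label, ts in REPORT_TIMES.items():
    print(label, to_utc(ts).isoformat())

# Interval between the two watchdog warnings.
gap = to_utc(REPORT_TIMES["second report"]) - to_utc(REPORT_TIMES["first report"])
print("gap between reports:", gap)
```

Under that assumption the reports land at 2013-06-02 01:00:59 UTC and 2013-06-02 16:21:53 UTC, about 15 hours 21 minutes apart.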

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 11

•

12 years ago

Yes, but on that host, facter gives:

manufacturer => Supermicro

I wonder if this is due to the facter upgrade.

Justin Wood (:Callek)

Comment 12

•

12 years ago

[13:39:54] nagios-releng Sat 10:39:53 PDT [410] foopy121.build.mtv1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%

Powercycling via OOB didn't work, but I was still able to access its console that way, so I logged in via remote console as root and rebooted.

Nobody; OK to take it and work on it

Updated

•

12 years ago

Component: Server Operations: RelEng → RelOps

Product: mozilla.org → Infrastructure & Operations

Categories

(Infrastructure & Operations :: RelOps: General, task)

Tracking

(Not tracked)

People

(Reporter: bhearsum, Assigned: dustin)
