Closed Bug 858565 Opened 12 years ago Closed 12 years ago

foopy121 is down

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 808397

People

(Reporter: bhearsum, Assigned: dustin)

Details

10:22 < nagios-releng> Fri 07:22:27 PDT [446] foopy121.build.mtv1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
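For anyone who wants to compare these alerts mechanically, here is a minimal sketch that pulls the host, state and packet loss out of an alert line like the one above. The regular expression is an assumption based only on the lines quoted in this bug, not an official Nagios format, and `parse_alert` is a hypothetical helper name.

```python
import re

# Assumed layout, inferred from the alert lines pasted in this bug:
#   <day> <HH:MM:SS> <tz> [<id>] <host> is <STATE> :PING <LEVEL> - Packet loss = <n>%
ALERT_RE = re.compile(
    r"(?P<day>\w{3}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<tz>\w+) "
    r"\[(?P<id>\d+)\] (?P<host>\S+) is (?P<state>\w+) "
    r":PING (?P<level>\w+) - Packet loss = (?P<loss>\d+)%"
)


def parse_alert(line):
    """Return the alert fields as a dict, or None if the line does not match."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["loss"] = int(fields["loss"])  # packet loss as an integer percentage
    return fields


alert = parse_alert(
    "10:22 < nagios-releng> Fri 07:22:27 PDT [446] "
    "foopy121.build.mtv1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%"
)
print(alert["host"], alert["state"], alert["loss"])
```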
Trying a PDU powercycle.
PDU reboot worked. I restarted watch_devices.sh in screen, and hopefully its tegras will return now.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Down again now... wondering if it has a loose cable or something: [20:12:54] nagios-releng Fri 17:12:37 PDT [491] foopy121.build.mtv1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Tossing over the wall to DCOps. I know we can PDU cycle this, but since it happened again I want DCOps to first check the cable connections in case they are loosely seated. Failing that, I'm guessing either a bad power supply on the system or the kernel faulting for some other reason. Please lob back over to releng after you check cables and repower.
Assignee: nobody → server-ops-dcops
Component: Release Engineering: Machine Management → Server Operations: DCOps
QA Contact: armenzg → dmoore
Dropping to normal as it's not weekend work (sorry for the pre-emptive page)
Severity: critical → normal
colo-trip: --- → mtv1
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → DUPLICATE
This is duped on the basis of an abrtd email from 7:15 pacific on April 5. That was the first crash (that Ben saw). From what's onscreen, I think the second failure was the same. This failure seems to hit IX systems running CentOS 6.2 randomly. I don't think it's worth reading too much into the back-to-back nature of these failures. I restarted the host via iLO, so we should get another email. Let's give it a few days to see if it fails again, and if not, restart its devices.
Assignee: server-ops-dcops → dustin
Component: Server Operations: DCOps → Server Operations: RelEng
QA Contact: dmoore → arich
This happened again for foopy121, :rbryce brought it back up for us. :dustin do I remember correctly that the puppet change in 808397 should have solved this for us?

LOG:
=========
Duplicate check =====
Common information =====
package ----- kernel
architecture ----- x86_64
kernel ----- 2.6.32-220.7.1.el6.x86_64
Additional information =====
kernel_tainted_long ----- Taint on warning.
kernel_tainted ----- 512
backtrace -----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: X8SIL
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: ipv6 microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64 #1
Call Trace:
<IRQ>
[<ffffffff81069a17>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff81069b06>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff8144a60d>] ? dev_watchdog+0x26d/0x280
[<ffffffff8109b743>] ? ktime_get+0x63/0xe0
[<ffffffff8109f0e4>] ? clockevents_program_event+0x54/0xa0
[<ffffffff810a0635>] ? tick_dev_program_event+0x65/0xc0
[<ffffffff8144a3a0>] ? dev_watchdog+0x0/0x280
[<ffffffff8107c7f7>] ? run_timer_softirq+0x197/0x340
[<ffffffff81095610>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff81072001>] ? __do_softirq+0xc1/0x1d0
[<ffffffff810d94a0>] ? handle_IRQ_event+0x60/0x170
[<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de85>] ? do_softirq+0x65/0xa0
[<ffffffff81071de5>] ? irq_exit+0x85/0x90
[<ffffffff814f4dc5>] ? do_IRQ+0x75/0xf0
[<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
<EOI>
[<ffffffff812c4b0e>] ? intel_idle+0xde/0x170
[<ffffffff812c4af1>] ? intel_idle+0xc1/0x170
[<ffffffff813fa027>] ? cpuidle_idle_call+0xa7/0x140
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e5ffc>] ? start_secondary+0x202/0x245
hostname ----- foopy121.build.mtv1.mozilla.com
component ----- kernel
cmdline ----- ro root=UUID=80d9716b-c3c1-40b8-b7e3-7fe1113c9f9b rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD quiet SYSFONT=latarcyrheb-sun16 rhgb crashkernel=129M@0M KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM
reason ----- WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
kernel_tainted_short ----- ---------W
analyzer ----- Kerneloops
time ----- 1370134859
os_release ----- CentOS release 6.2 (Final)
===============
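The `time` field in the report above is a Unix epoch timestamp, which is awkward to line up against the Nagios alerts in Pacific time. A minimal sketch of the conversion follows; the fixed UTC-7 offset is an assumption matching the "PDT" label on the alerts, and `oops_time` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the alerts in this bug are stamped PDT, so use a fixed UTC-7 offset.
PDT = timezone(timedelta(hours=-7), "PDT")


def oops_time(epoch_seconds):
    """Convert the report's epoch 'time' field to (UTC, PDT) datetimes."""
    utc = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return utc, utc.astimezone(PDT)


utc, local = oops_time(1370134859)
print(utc.isoformat())    # 2013-06-02T01:00:59+00:00
print(local.isoformat())  # 2013-06-01T18:00:59-07:00
```

Since the hardware clock was reported as skewed on this host, the converted time is only as reliable as the system clock was when the report was generated.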
And of note, when :rbryce brought it up, the hardware clock was 8 *hours* behind; ntpd corrected the skew and then :rbryce fixed the hwclock.
Got around today, and saw another instance in e-mail. Able to log in manually and see:

[cltbld@foopy121 ~]$ uptime
 13:34:24 up 3:36, 1 user, load average: 0.02, 0.14, 0.12

hardware clock is also accurate at the moment...

Log of issue follows
=================
Duplicate check =====
Common information =====
package ----- kernel
architecture ----- x86_64
kernel ----- 2.6.32-220.7.1.el6.x86_64
Additional information =====
kernel_tainted_long ----- Taint on warning.
kernel_tainted ----- 512
backtrace -----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: X8SIL
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: ipv6 microcode sg serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64 #1
Call Trace:
<IRQ>
[<ffffffff81069a17>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff81069b06>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff8144a60d>] ? dev_watchdog+0x26d/0x280
[<ffffffff8109b743>] ? ktime_get+0x63/0xe0
[<ffffffff8107bbe5>] ? internal_add_timer+0xb5/0x110
[<ffffffff8144a3a0>] ? dev_watchdog+0x0/0x280
[<ffffffff8107c7f7>] ? run_timer_softirq+0x197/0x340
[<ffffffff81095610>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff81072001>] ? __do_softirq+0xc1/0x1d0
[<ffffffff810d94a0>] ? handle_IRQ_event+0x60/0x170
[<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de85>] ? do_softirq+0x65/0xa0
[<ffffffff81071de5>] ? irq_exit+0x85/0x90
[<ffffffff814f4dc5>] ? do_IRQ+0x75/0xf0
[<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
<EOI>
[<ffffffff812c4b0e>] ? intel_idle+0xde/0x170
[<ffffffff812c4af1>] ? intel_idle+0xc1/0x170
[<ffffffff813fa027>] ? cpuidle_idle_call+0xa7/0x140
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e5ffc>] ? start_secondary+0x202/0x245
hostname ----- foopy121.build.mtv1.mozilla.com
component ----- kernel
cmdline ----- ro root=UUID=80d9716b-c3c1-40b8-b7e3-7fe1113c9f9b rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD quiet SYSFONT=latarcyrheb-sun16 rhgb crashkernel=129M@0M KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM
reason ----- WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
kernel_tainted_short ----- ---------W
analyzer ----- Kerneloops
time ----- 1370190113
os_release ----- CentOS release 6.2 (Final)
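Both reports carry the same NETDEV WATCHDOG line, which is the quickest way to confirm they describe the same failure. As a rough sketch, the interface, driver and queue can be extracted and compared as below; the pattern is inferred only from the text pasted in this bug, and `watchdog_signature` is a hypothetical helper name.

```python
import re

# Assumed shape of the watchdog message, taken from the two reports in this bug.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def watchdog_signature(report_text):
    """Return (interface, driver, queue) from a report, or None if absent."""
    match = WATCHDOG_RE.search(report_text)
    if match is None:
        return None
    return (match.group("iface"), match.group("driver"), int(match.group("queue")))


# Excerpts from the two reports; the module lists differ only in ordering.
first = (
    "Hardware name: X8SIL NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 "
    "timed out Modules linked in: ipv6 microcode serio_raw"
)
second = (
    "Hardware name: X8SIL NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 "
    "timed out Modules linked in: ipv6 microcode sg serio_raw"
)

same_failure = watchdog_signature(first) == watchdog_signature(second)
print(watchdog_signature(first), same_failure)
```

A matching signature only shows that the same interface and driver timed out; it says nothing about the underlying cause.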
Yes, but on that host, facter gives:

manufacturer => Supermicro

I wonder if this is due to the facter upgrade.
[13:39:54] nagios-releng Sat 10:39:53 PDT [410] foopy121.build.mtv1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%

Powercycling via OOB didn't work, but I was still able to access its console that way, so I logged in via remote console as root and rebooted.
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations