Closed
Bug 858565
Opened 12 years ago
Closed 12 years ago
foopy121 is down
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED DUPLICATE of bug 808397
People
(Reporter: bhearsum, Assigned: dustin)
Details
10:22 < nagios-releng> Fri 07:22:27 PDT [446] foopy121.build.mtv1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
Reporter
Comment 1•12 years ago
trying a pdu powercycle
Reporter
Comment 2•12 years ago
PDU reboot worked. I restarted watch_devices.sh in screen, and hopefully its tegras will return now.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 3•12 years ago
Down again now... wondering if it has a loose cable or something:
[20:12:54] nagios-releng Fri 17:12:37 PDT [491] foopy121.build.mtv1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
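For anyone tallying how often this host drops off, the nagios-releng lines quoted in this bug can be pulled apart with a small regex. This is only a sketch: the group names are made up for illustration, and the meaning of the bracketed number (here called `seq`) is not stated anywhere in this bug.

```python
import re

# Pattern for the nagios-releng IRC alert lines quoted in this bug.
# Group names are illustrative assumptions, not a Nagios-defined schema;
# "seq" is the bracketed number, whose meaning this bug does not explain.
ALERT_RE = re.compile(
    r"(?P<day>\w{3}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<tz>\w+) "
    r"\[(?P<seq>\d+)\] "
    r"(?P<host>\S+) is (?P<state>\w+) :"
    r"(?P<check>.*?) - Packet loss = (?P<loss>\d+)%"
)


def parse_alert(line):
    """Return a dict of alert fields, or None if the line does not match."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["loss"] = int(fields["loss"])  # percentage as an integer
    return fields


alert = parse_alert(
    "[20:12:54] nagios-releng Fri 17:12:37 PDT [491] "
    "foopy121.build.mtv1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%"
)
print(alert["host"], alert["state"], alert["loss"])
```

The pattern uses `search` rather than `match`, so the differing IRC-client timestamp prefixes on the quoted lines are skipped.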
Comment 4•12 years ago
Tossing over the wall to DCOps. I know we can PDU cycle this, but since it happened again I want DCOps to first check the wire connections in case they are loosely seated.
Failing that, I'm guessing either a bad power supply on the system or the kernel faulting for some other reason.
Please lob back over to releng after you check cables and repower.
Assignee: nobody → server-ops-dcops
Component: Release Engineering: Machine Management → Server Operations: DCOps
QA Contact: armenzg → dmoore
Comment 5•12 years ago
Dropping to normal as it's not weekend work (sorry for the pre-emptive page).
Severity: critical → normal
colo-trip: --- → mtv1
Assignee
Updated•12 years ago
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → DUPLICATE
Assignee
Comment 7•12 years ago
This is duped on the basis of an abrtd email from 7:15 pacific on April 5. That was the first crash (that Ben saw).
From what's onscreen, I think the second failure was the same. This failure seems to hit IX systems running CentOS 6.2 randomly. I don't think it's worth reading too much into the back-to-back nature of these failures.
I restarted the host via iLO, so we should get another email. Let's give it a few days to see if it fails again, and if not, restart its devices.
Assignee: server-ops-dcops → dustin
Component: Server Operations: DCOps → Server Operations: RelEng
QA Contact: dmoore → arich
Comment 8•12 years ago
This happened again for foopy121; :rbryce brought it back up for us.
:dustin, do I remember correctly that the puppet change in bug 808397 should have solved this for us?
LOG:
=========
Duplicate check
=====
Common information
=====
package
-----
kernel
architecture
-----
x86_64
kernel
-----
2.6.32-220.7.1.el6.x86_64
Additional information
=====
kernel_tainted_long
-----
Taint on warning.
kernel_tainted
-----
512
backtrace
-----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: X8SIL
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: ipv6 microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64 #1
Call Trace:
<IRQ> [<ffffffff81069a17>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff81069b06>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff8144a60d>] ? dev_watchdog+0x26d/0x280
[<ffffffff8109b743>] ? ktime_get+0x63/0xe0
[<ffffffff8109f0e4>] ? clockevents_program_event+0x54/0xa0
[<ffffffff810a0635>] ? tick_dev_program_event+0x65/0xc0
[<ffffffff8144a3a0>] ? dev_watchdog+0x0/0x280
[<ffffffff8107c7f7>] ? run_timer_softirq+0x197/0x340
[<ffffffff81095610>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff81072001>] ? __do_softirq+0xc1/0x1d0
[<ffffffff810d94a0>] ? handle_IRQ_event+0x60/0x170
[<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de85>] ? do_softirq+0x65/0xa0
[<ffffffff81071de5>] ? irq_exit+0x85/0x90
[<ffffffff814f4dc5>] ? do_IRQ+0x75/0xf0
[<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
<EOI> [<ffffffff812c4b0e>] ? intel_idle+0xde/0x170
[<ffffffff812c4af1>] ? intel_idle+0xc1/0x170
[<ffffffff813fa027>] ? cpuidle_idle_call+0xa7/0x140
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e5ffc>] ? start_secondary+0x202/0x245
hostname
-----
foopy121.build.mtv1.mozilla.com
component
-----
kernel
cmdline
-----
ro root=UUID=80d9716b-c3c1-40b8-b7e3-7fe1113c9f9b rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD quiet SYSFONT=latarcyrheb-sun16 rhgb crashkernel=129M@0M KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM
reason
-----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
kernel_tainted_short
-----
---------W
analyzer
-----
Kerneloops
time
-----
1370134859
os_release
-----
CentOS release 6.2 (Final)
===============
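The kernel_tainted value of 512 in the report above is a bitmask. A minimal sketch of decoding it follows; the table lists only a handful of the kernel's taint bits, not the full set. Bit 9 is the W (warning) flag, which agrees with the W in kernel_tainted_short, and presumably explains why the backtrace itself still says "Not tainted": the flag is set as a result of this warning.

```python
# Partial table of kernel taint flags: bit position -> (letter, meaning).
# Only a few bits are listed here; the kernel defines more than these.
TAINT_FLAGS = {
    0: ("P", "proprietary module loaded"),
    1: ("F", "module force-loaded"),
    7: ("D", "kernel died recently (oops or BUG)"),
    9: ("W", "kernel issued a warning"),
    12: ("O", "out-of-tree module loaded"),
}


def decode_taint(value):
    """Return (letter, meaning) for each listed taint bit set in value."""
    return [flag for bit, flag in sorted(TAINT_FLAGS.items()) if value & (1 << bit)]


print(decode_taint(512))
```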
Comment 9•12 years ago
Of note: when :rbryce brought it up, the hardware clock was 8 *hours* behind. ntpd corrected the skew, and then :rbryce fixed the hwclock.
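A rough way to check for this kind of skew after a reboot is to compare the RTC reading in sysfs with the system clock. This is a sketch under assumptions: the device name rtc0 is a guess for this host, and the sysfs value is only meaningful if the hardware clock is kept in UTC. The example readings at the bottom are made up to mirror the 8-hour skew described here.

```python
import time
from pathlib import Path

# Linux exposes the RTC reading as seconds since the epoch in sysfs.
# "rtc0" is an assumption; a host may expose a different device name.
RTC_PATH = Path("/sys/class/rtc/rtc0/since_epoch")


def clock_skew_seconds(rtc_epoch, system_epoch):
    """Positive result means the hardware clock is behind the system clock."""
    return system_epoch - rtc_epoch


def read_skew(rtc_path=RTC_PATH):
    """Read the RTC from sysfs and compare it with the current system time."""
    rtc_epoch = int(rtc_path.read_text().strip())
    return clock_skew_seconds(rtc_epoch, time.time())


# Worked example with hypothetical readings: an RTC that is 8 hours behind.
system_now = 1370134859
rtc_now = system_now - 8 * 3600
print(clock_skew_seconds(rtc_now, system_now) / 3600, "hours behind")
```

On a live host, `read_skew()` would be called right after boot, before ntpd has had a chance to step the system clock.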
Comment 10•12 years ago
Got around to this today and saw another instance in e-mail. I was able to log in manually and see:
[cltbld@foopy121 ~]$ uptime
13:34:24 up 3:36, 1 user, load average: 0.02, 0.14, 0.12
hardware clock is also accurate at the moment...
Log of issue follows
=================
Duplicate check
=====
Common information
=====
package
-----
kernel
architecture
-----
x86_64
kernel
-----
2.6.32-220.7.1.el6.x86_64
Additional information
=====
kernel_tainted_long
-----
Taint on warning.
kernel_tainted
-----
512
backtrace
-----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: X8SIL
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: ipv6 microcode sg serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-220.7.1.el6.x86_64 #1
Call Trace:
<IRQ> [<ffffffff81069a17>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff81069b06>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff8144a60d>] ? dev_watchdog+0x26d/0x280
[<ffffffff8109b743>] ? ktime_get+0x63/0xe0
[<ffffffff8107bbe5>] ? internal_add_timer+0xb5/0x110
[<ffffffff8144a3a0>] ? dev_watchdog+0x0/0x280
[<ffffffff8107c7f7>] ? run_timer_softirq+0x197/0x340
[<ffffffff81095610>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff81072001>] ? __do_softirq+0xc1/0x1d0
[<ffffffff810d94a0>] ? handle_IRQ_event+0x60/0x170
[<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de85>] ? do_softirq+0x65/0xa0
[<ffffffff81071de5>] ? irq_exit+0x85/0x90
[<ffffffff814f4dc5>] ? do_IRQ+0x75/0xf0
[<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
<EOI> [<ffffffff812c4b0e>] ? intel_idle+0xde/0x170
[<ffffffff812c4af1>] ? intel_idle+0xc1/0x170
[<ffffffff813fa027>] ? cpuidle_idle_call+0xa7/0x140
[<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
[<ffffffff814e5ffc>] ? start_secondary+0x202/0x245
hostname
-----
foopy121.build.mtv1.mozilla.com
component
-----
kernel
cmdline
-----
ro root=UUID=80d9716b-c3c1-40b8-b7e3-7fe1113c9f9b rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD quiet SYSFONT=latarcyrheb-sun16 rhgb crashkernel=129M@0M KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM
reason
-----
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
kernel_tainted_short
-----
---------W
analyzer
-----
Kerneloops
time
-----
1370190113
os_release
-----
CentOS release 6.2 (Final)
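To see how close together the two reports in comments 8 and 10 are, the abrt time fields can be converted and subtracted. This sketch assumes those fields are Unix epoch seconds.

```python
from datetime import datetime, timezone

# "time" fields copied from the two abrt reports quoted in this bug,
# assumed to be Unix epoch seconds.
first_report = 1370134859
second_report = 1370190113


def to_utc(epoch_seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


gap = to_utc(second_report) - to_utc(first_report)
print(to_utc(first_report).isoformat())   # 2013-06-02T01:00:59+00:00
print(to_utc(second_report).isoformat())  # 2013-06-02T16:21:53+00:00
print(gap)                                # 15:20:54
```

So the two watchdog warnings were logged a little over 15 hours apart.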
Assignee
Comment 11•12 years ago
Yes, but on that host, facter gives:
manufacturer => Supermicro
I wonder if this is due to the facter upgrade.
Comment 12•12 years ago
[13:39:54] nagios-releng Sat 10:39:53 PDT [410] foopy121.build.mtv1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
Powercycling via OOB didn't work, but I was still able to access its console that way, so I logged in via remote console as root and rebooted.
Updated•12 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations