Closed Bug 781643 Opened 12 years ago Closed 12 years ago

paas-dea1.webapp.scl3 has died several times

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mburns, Assigned: dumitru)

References

Details

(Whiteboard: SeaMicro C-2391 case)

Attachments

(1 file)

[13:38:52] <nagios-scl3> Thu 13:38:51 PDT [548] paas-dea1.webapp.scl3.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%

This box already paged as DOWN earlier this week. On boot, it drops to single-user mode and forces a manual fsck.
The last crash was yesterday. I downtimed the host in Nagios until we figure out what is going on.
It happened again Saturday night, 19:48:48 PDT [540].
Aug 18 20:06:23 paas-dea1 kernel: ------------[ cut here ]------------
Aug 18 20:06:23 paas-dea1 kernel: WARNING: at fs/buffer.c:677 __set_page_dirty+0xcb/0xf0() (Tainted: G    B      ---------------   )
Aug 18 20:06:23 paas-dea1 kernel: Hardware name: SM10000-XE
Aug 18 20:06:23 paas-dea1 kernel: Modules linked in: bridge bonding 8021q garp stp llc ipv6 microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp e1000 sg ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Aug 18 20:06:23 paas-dea1 kernel: Pid: 24897, comm: puppet Tainted: G    B      ---------------    2.6.32-279.2.1.el6.x86_64 #1
Aug 18 20:06:23 paas-dea1 kernel: Call Trace:
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff8106b79a>] ? warn_slowpath_null+0x1a/0x20
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff811adbeb>] ? __set_page_dirty+0xcb/0xf0
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff811ade48>] ? __set_page_dirty_buffers+0x88/0xc0
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff81128419>] ? set_page_dirty+0x39/0x60
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff8113d09f>] ? unmap_vmas+0x9df/0xc30
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff81142d27>] ? exit_mmap+0x87/0x170
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff8106897c>] ? mmput+0x6c/0x120
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff811814d4>] ? flush_old_exec+0x484/0x690
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff811d1b6d>] ? load_elf_binary+0x3ad/0x1b10
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff8113adff>] ? follow_page+0x31f/0x470
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff81140110>] ? __get_user_pages+0x110/0x430
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff811ceadc>] ? load_misc_binary+0xac/0x3e0
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff811404c9>] ? get_user_pages+0x49/0x50
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff81182abb>] ? search_binary_handler+0x11b/0x360
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff81183c49>] ? do_execve+0x239/0x340
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff810d53ae>] ? __audit_getname+0xbe/0xd0
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff810095ea>] ? sys_execve+0x4a/0x80
Aug 18 20:06:23 paas-dea1 kernel: [<ffffffff8100b54a>] ? stub_execve+0x6a/0xc0
Aug 18 20:06:23 paas-dea1 kernel: ---[ end trace f2bd86a0c32c30c2 ]---
Assignee: server-ops → dgherman
Whiteboard: SeaMicro C-2391 case
A SeaMicro technician logged into the chassis and had a look. So far nothing proves that the issue is related to SeaMicro.
We did see, however, that the server's system clock was set to UTC rather than PDT, while the chassis's clock is in PDT.
Per his suggestion, I changed the server's timezone (still wondering how it got set to UTC in the first place), and we'll see if it goes down again.
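One practical consequence of a host logging in UTC while Nagios and the chassis report PDT is a seven-hour skew when lining up alert times against syslog entries. A minimal Python sketch of the conversion follows. It is not something used in this bug. The helper names are hypothetical, and the date 2012-08-18 is an assumption inferred from the "Aug 18" syslog lines and the "Saturday night" alert above.

```python
from datetime import datetime, timedelta, timezone

# PDT is UTC-7; a fixed offset avoids depending on a tz database.
PDT = timezone(timedelta(hours=-7), "PDT")


def pdt_to_utc(naive_pdt: datetime) -> datetime:
    """Interpret a naive timestamp as PDT and return it in UTC."""
    return naive_pdt.replace(tzinfo=PDT).astimezone(timezone.utc)


def utc_to_pdt(naive_utc: datetime) -> datetime:
    """Interpret a naive timestamp as UTC and return it in PDT."""
    return naive_utc.replace(tzinfo=timezone.utc).astimezone(PDT)


# Nagios alert time from this bug (year and day are assumed, see above).
alert_utc = pdt_to_utc(datetime(2012, 8, 18, 19, 48, 48))
print(alert_utc.isoformat())  # 2012-08-19T02:48:48+00:00
```

When comparing a Nagios alert with a syslog line, convert both to the same zone first; otherwise events that are minutes apart can look hours apart.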
Seeing something else now, but maybe that was the initial cause:

Message from syslogd@paas-dea1 at Aug 27 10:38:11 ...
 kernel:BUG: soft lockup - CPU#6 stuck for 67s! [flush-8:0:483]

Message from syslogd@paas-dea1 at Aug 27 10:38:34 ...
 kernel:BUG: soft lockup - CPU#1 stuck for 67s! [rhsmcertd-worke:3212]

Message from syslogd@paas-dea1 at Aug 27 10:39:10 ...
 kernel:BUG: soft lockup - CPU#3 stuck for 67s! [sosreport:3171]

Message from syslogd@paas-dea1 at Aug 27 10:39:35 ...
 kernel:BUG: soft lockup - CPU#6 stuck for 67s! [flush-8:0:483]


seamicro-a# server console connect 56
Using local telnet client for loopback connection to server: 56.
Standard telnet commands apply.

Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
Connecting to server 56... Success!
000000036381370
FS:  00007f685882b700(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6852045000 CR3: 00000004335e0000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process sosreport (pid: 3171, threadinfo ffff8804326ea000, task ffff880432559500)
Stack:
 ffff8804326ebbc8 ffffffff81113ffe ffff88043319ef00 ffff8804326ebc78
<d> ffff8804326ebc38 ffffffff8111550b ffff880432559500 ffff88042df88de0
<d> 0000000000000002 0000002881127061 ffff8804326ebc18 ffff88043319ef70
Call Trace:
 [<ffffffff81113ffe>] ? find_get_page+0x1e/0xa0
 [<ffffffff8111550b>] ? filemap_fault+0x8b/0x500
 [<ffffffff8113ed44>] ? __do_fault+0x54/0x510
 [<ffffffff8113f2f7>] ? handle_pte_fault+0xf7/0xb50
 [<ffffffff8115c30a>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff81048ac7>] ? pte_alloc_one+0x37/0x50
 [<ffffffff8113ff34>] ? handle_mm_fault+0x1e4/0x2b0
 [<ffffffff81044479>] ? __do_page_fault+0x139/0x480
 [<ffffffff810d358d>] ? audit_filter_rules+0x2d/0xdd0
 [<ffffffff81145fca>] ? do_mmap_pgoff+0x33a/0x380
 [<ffffffff8150339e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff81500755>] ? page_fault+0x25/0x30
Code: 00 8b 45 e8 48 83 c4 10 5b 41 5c c9 c3 90 90 90 55 48 8d 47 08 48 8b 7f 08 48 89 e5 48 85 ff 74 54 40 f6 c7 01 74 49 48 83 e7 fe <8b> 17 89 d0 48 3b 34 c5 a0 57 c0 81 77 3c 8d 0c 52 8d 4c 09 fa
Call Trace:
 [<ffffffff81113ffe>] ? find_get_page+0x1e/0xa0
 [<ffffffff8111550b>] ? filemap_fault+0x8b/0x500
 [<ffffffff8113ed44>] ? __do_fault+0x54/0x510
 [<ffffffff8113f2f7>] ? handle_pte_fault+0xf7/0xb50
 [<ffffffff8115c30a>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff81048ac7>] ? pte_alloc_one+0x37/0x50
 [<ffffffff8113ff34>] ? handle_mm_fault+0x1e4/0x2b0
 [<ffffffff81044479>] ? __do_page_fault+0x139/0x480
 [<ffffffff810d358d>] ? audit_filter_rules+0x2d/0xdd0
 [<ffffffff81145fca>] ? do_mmap_pgoff+0x33a/0x380
 [<ffffffff8150339e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff81500755>] ? page_fault+0x25/0x30
BUG: soft lockup - CPU#6 stuck for 67s! [flush-8:0:483]
Modules linked in: bridge bonding 8021q garp stp llc ipv6 microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp e1000 sg ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
CPU 6
Modules linked in: bridge bonding 8021q garp stp llc ipv6 microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp e1000 sg ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 483, comm: flush-8:0 Tainted: G    B      ---------------    2.6.32-279.2.1.el6.x86_64 #1 SeaMicro SM10000-XE/Type2 - Board Product Name1
RIP: 0010:[<ffffffff81113d1b>]  [<ffffffff81113d1b>] find_get_pages_tag+0x5b/0x120
RSP: 0018:ffff880430735940  EFLAGS: 00000246
RAX: ffff880435fa6708 RBX: ffff880430735990 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffea000e990ac8
RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000002
R10: 000000000000000e R11: ffff880435fa68f8 R12: ffffffff81faff00
R13: 0000000000000000 R14: ffffffff81faff00 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff880028380000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000010bf0f0 CR3: 0000000001a85000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process flush-8:0 (pid: 483, threadinfo ffff880430734000, task ffff880430a16ae0)
Stack:
 ffffea000ea53130 ffffea000000000d ffff880430735970 ffff88042df88de8
<d> 0000000000000000 ffff880430735a00 0000000000000000 ffff88042df88de0
<d> 0000000000008000 ffff880430735a00 ffff8804307359b0 ffffffff8112a965
Call Trace:
 [<ffffffff8112a965>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffffa0083f9a>] ? ext4_num_dirty_pages+0xda/0x260 [ext4]
 [<ffffffff811b34a0>] ? blkdev_get_block+0x0/0x70
 [<ffffffff811af820>] ? block_write_full_page_endio+0xe0/0x120
 [<ffffffff8112b7e6>] ? __pagevec_release+0x26/0x40
 [<ffffffffa0088906>] ? ext4_da_writepages+0x416/0x620 [ext4]
 [<ffffffff81271a29>] ? cpumask_next_and+0x29/0x50
 [<ffffffff81056a64>] ? find_busiest_group+0x244/0x9f0
 [<ffffffff81129b11>] ? do_writepages+0x21/0x40
 [<ffffffff811a513d>] ? writeback_single_inode+0xdd/0x2c0
 [<ffffffff811a557e>] ? writeback_sb_inodes+0xce/0x180
 [<ffffffff811a56db>] ? writeback_inodes_wb+0xab/0x1b0
 [<ffffffff811a5a7b>] ? wb_writeback+0x29b/0x3f0
 [<ffffffff814fd960>] ? thread_return+0x4e/0x76e
 [<ffffffff8107eb42>] ? del_timer_sync+0x22/0x30
 [<ffffffff811a5d69>] ? wb_do_writeback+0x199/0x240
 [<ffffffff811a5e73>] ? bdi_writeback_task+0x63/0x1b0
 [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff81138770>] ? bdi_start_fn+0x0/0x100
 [<ffffffff811387f6>] ? bdi_start_fn+0x86/0x100
 [<ffffffff81138770>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81091d66>] ? kthread+0x96/0xa0
 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
 [<ffffffff81091cd0>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
Code: 7d c8 48 89 de 45 89 e8 44 89 f1 e8 10 41 16 00 85 c0 89 c6 0f 84 b0 00 00 00 49 89 df 31 d2 31 c9 0f 1f 80 00 00 00 00 49 8b 07 <48> 8b 38 40 f6 c7 01 75 c6 48 85 ff 74 3c 48 83 ff ff 74 bb 44
Call Trace:
 [<ffffffff8112a965>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffffa0083f9a>] ? ext4_num_dirty_pages+0xda/0x260 [ext4]
 [<ffffffff811b34a0>] ? blkdev_get_block+0x0/0x70
 [<ffffffff811af820>] ? block_write_full_page_endio+0xe0/0x120
 [<ffffffff8112b7e6>] ? __pagevec_release+0x26/0x40
 [<ffffffffa0088906>] ? ext4_da_writepages+0x416/0x620 [ext4]
 [<ffffffff81271a29>] ? cpumask_next_and+0x29/0x50
 [<ffffffff81056a64>] ? find_busiest_group+0x244/0x9f0
 [<ffffffff81129b11>] ? do_writepages+0x21/0x40
 [<ffffffff811a513d>] ? writeback_single_inode+0xdd/0x2c0
 [<ffffffff811a557e>] ? writeback_sb_inodes+0xce/0x180
 [<ffffffff811a56db>] ? writeback_inodes_wb+0xab/0x1b0
 [<ffffffff811a5a7b>] ? wb_writeback+0x29b/0x3f0
 [<ffffffff814fd960>] ? thread_return+0x4e/0x76e
 [<ffffffff8107eb42>] ? del_timer_sync+0x22/0x30
 [<ffffffff811a5d69>] ? wb_do_writeback+0x199/0x240
 [<ffffffff811a5e73>] ? bdi_writeback_task+0x63/0x1b0
 [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff81138770>] ? bdi_start_fn+0x0/0x100
 [<ffffffff811387f6>] ? bdi_start_fn+0x86/0x100
 [<ffffffff81138770>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81091d66>] ? kthread+0x96/0xa0
 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
 [<ffffffff81091cd0>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
BUG: soft lockup - CPU#1 stuck for 67s! [rhsmcertd-worke:3212]
Modules linked in: bridge bonding 8021q garp stp llc ipv6 microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp e1000 sg ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
CPU 1
Modules linked in: bridge bonding 8021q garp stp llc ipv6 microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp e1000 sg ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 3212, comm: rhsmcertd-worke Tainted: G    B      ---------------    2.6.32-279.2.1.el6.x86_64 #1 SeaMicro SM10000-XE/Type2 - Board Product Name1
RIP: 0010:[<ffffffff81277170>]  [<ffffffff81277170>] radix_tree_lookup_slot+0x0/0x70
RSP: 0000:ffff880432e03bb0  EFLAGS: 00000246
RAX: ffffea000e990ac7 RBX: ffff880432e03bc8 RCX: ffff880435fa6708
RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff88042df88de8
RBP: ffffffff8100bc0e R08: 0000000000000002 R09: 0000000000000028
R10: ffff88042df88de0 R11: 0000000000000002 R12: ffff880432e03ba8
R13: ffffffff8100bc0e R14: ffff880000038b08 R15: 0000000000000000
FS:  00007f81d7667700(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f81d082e000 CR3: 000000042f54a000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process rhsmcertd-worke (pid: 3212, threadinfo ffff880432e02000, task ffff8804307d1540)
Stack:
 ffffffff81113ffe ffff88042f580300 ffff880432e03c78 ffff880432e03c38
<d> ffffffff8111550b ffff8804307d1540 ffff88042df88de0 0000000000000002
<d> 0000002881127061 ffff88042fe49380 ffff88042f580370 ffffea000e976c40
Call Trace:
 [<ffffffff81113ffe>] ? find_get_page+0x1e/0xa0
 [<ffffffff8111550b>] ? filemap_fault+0x8b/0x500
 [<ffffffff8113ed44>] ? __do_fault+0x54/0x510
 [<ffffffff8113f2f7>] ? handle_pte_fault+0xf7/0xb50
 [<ffffffff8115c30a>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff81048ac7>] ? pte_alloc_one+0x37/0x50
 [<ffffffff8113ff34>] ? handle_mm_fault+0x1e4/0x2b0
 [<ffffffff81044479>] ? __do_page_fault+0x139/0x480
 [<ffffffff810d358d>] ? audit_filter_rules+0x2d/0xdd0
 [<ffffffff81145fca>] ? do_mmap_pgoff+0x33a/0x380
 [<ffffffff8150339e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff81500755>] ? page_fault+0x25/0x30
Code: f6 48 c7 c2 20 e5 fc 81 e8 5e f5 00 00 85 c0 74 dc 4c 89 e7 89 45 e8 e8 ff f4 00 00 8b 45 e8 48 83 c4 10 5b 41 5c c9 c3 90 90 90 <55> 48 8d 47 08 48 8b 7f 08 48 89 e5 48 85 ff 74 54 40 f6 c7 01
Call Trace:
 [<ffffffff81113ffe>] ? find_get_page+0x1e/0xa0
 [<ffffffff8111550b>] ? filemap_fault+0x8b/0x500
 [<ffffffff8113ed44>] ? __do_fault+0x54/0x510
 [<ffffffff8113f2f7>] ? handle_pte_fault+0xf7/0xb50
 [<ffffffff8115c30a>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff81048ac7>] ? pte_alloc_one+0x37/0x50
 [<ffffffff8113ff34>] ? handle_mm_fault+0x1e4/0x2b0
 [<ffffffff81044479>] ? __do_page_fault+0x139/0x480
 [<ffffffff810d358d>] ? audit_filter_rules+0x2d/0xdd0
 [<ffffffff81145fca>] ? do_mmap_pgoff+0x33a/0x380
 [<ffffffff8150339e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff81500755>] ? page_fault+0x25/0x30
BUG: soft lockup - CPU#3 stuck for 67s! [sosreport:3171]
Modules linked in: bridge bonding 8021q garp stp llc ipv6 microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp e1000 sg ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
CPU 3
Modules linked in: bridge bonding 8021q garp stp llc ipv6 microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp e1000 sg ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 3171, comm: sosreport Tainted: G    B      ---------------    2.6.32-279.2.1.el6.x86_64 #1 SeaMicro SM10000-XE/Type2 - Board Product Name1
RIP: 0010:[<ffffffff812771c1>]  [<ffffffff812771c1>] radix_tree_lookup_slot+0x51/0x70
RSP: 0000:ffff8804326ebba8  EFLAGS: 00000282
RAX: ffff880435fa6708 RBX: ffff8804326ebba8 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffffea000e990ac8
RBP: ffffffff8100bc0e R08: 0000000000000002 R09: 0000000000000028
R10: ffff88042df88de0 R11: 0000000000000002 R12: ffff880000038b08
R13: 0000000000000000 R14: 00000040ffffffff R15: 0000000036381370
FS:  00007f685882b700(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6852045000 CR3: 00000004335e0000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process sosreport (pid: 3171, threadinfo ffff8804326ea000, task ffff880432559500)
Stack:
 ffff8804326ebbc8 ffffffff81113ffe ffff88043319ef00 ffff8804326ebc78
<d> ffff8804326ebc38 ffffffff8111550b ffff880432559500 ffff88042df88de0
<d> 0000000000000002 0000002881127061 ffff8804326ebc18 ffff88043319ef70
Call Trace:
 [<ffffffff81113ffe>] ? find_get_page+0x1e/0xa0
 [<ffffffff8111550b>] ? filemap_fault+0x8b/0x500
 [<ffffffff8113ed44>] ? __do_fault+0x54/0x510
 [<ffffffff8113f2f7>] ? handle_pte_fault+0xf7/0xb50
 [<ffffffff8115c30a>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff81048ac7>] ? pte_alloc_one+0x37/0x50
 [<ffffffff8113ff34>] ? handle_mm_fault+0x1e4/0x2b0
 [<ffffffff81044479>] ? __do_page_fault+0x139/0x480
 [<ffffffff810d358d>] ? audit_filter_rules+0x2d/0xdd0
 [<ffffffff81145fca>] ? do_mmap_pgoff+0x33a/0x380
 [<ffffffff8150339e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff81500755>] ? page_fault+0x25/0x30
Code: 81 77 3c 8d 0c 52 8d 4c 09 fa eb 09 66 0f 1f 44 00 00 83 e9 06 48 89 f0 48 d3 e8 83 e0 3f 48 8d 44 c7 18 48 8b 38 48 85 ff 74 14 <83> ea 01 75 e2 c9 c3 0f 1f 84 00 00 00 00 00 48 85 f6 74 f1 31
Call Trace:
 [<ffffffff81113ffe>] ? find_get_page+0x1e/0xa0
 [<ffffffff8111550b>] ? filemap_fault+0x8b/0x500
 [<ffffffff8113ed44>] ? __do_fault+0x54/0x510
 [<ffffffff8113f2f7>] ? handle_pte_fault+0xf7/0xb50
 [<ffffffff8115c30a>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff81048ac7>] ? pte_alloc_one+0x37/0x50
 [<ffffffff8113ff34>] ? handle_mm_fault+0x1e4/0x2b0
 [<ffffffff81044479>] ? __do_page_fault+0x139/0x480
 [<ffffffff810d358d>] ? audit_filter_rules+0x2d/0xdd0
 [<ffffffff81145fca>] ? do_mmap_pgoff+0x33a/0x380
 [<ffffffff8150339e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff81500755>] ? page_fault+0x25/0x30
BUG: soft lockup - CPU#6 stuck for 67s! [flush-8:0:483]
Modules linked in: bridge bonding 8021q garp stp llc ipv6 microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp e1000 sg ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
CPU 6
Modules linked in: bridge bonding 8021q garp stp llc ipv6 microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp e1000 sg ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 483, comm: flush-8:0 Tainted: G    B      ---------------    2.6.32-279.2.1.el6.x86_64 #1 SeaMicro SM10000-XE/Type2 - Board Product Name1
RIP: 0010:[<ffffffff81113d24>]  [<ffffffff81113d24>] find_get_pages_tag+0x64/0x120
RSP: 0018:ffff880430735940  EFLAGS: 00000246
RAX: ffff880435fa6708 RBX: ffff880430735990 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffea000e990ac8
RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000002
R10: 000000000000000e R11: ffff880435fa68f8 R12: ffffffff81faff00
R13: 0000000000000000 R14: ffffffff81faff00 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff880028380000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000010bf0f0 CR3: 0000000001a85000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process flush-8:0 (pid: 483, threadinfo ffff880430734000, task ffff880430a16ae0)
Stack:
 ffffea000ea53130 ffffea000000000d ffff880430735970 ffff88042df88de8
<d> 0000000000000000 ffff880430735a00 0000000000000000 ffff88042df88de0
<d> 0000000000008000 ffff880430735a00 ffff8804307359b0 ffffffff8112a965
Call Trace:
 [<ffffffff8112a965>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffffa0083f9a>] ? ext4_num_dirty_pages+0xda/0x260 [ext4]
 [<ffffffff811b34a0>] ? blkdev_get_block+0x0/0x70
 [<ffffffff811af820>] ? block_write_full_page_endio+0xe0/0x120
 [<ffffffff8112b7e6>] ? __pagevec_release+0x26/0x40
 [<ffffffffa0088906>] ? ext4_da_writepages+0x416/0x620 [ext4]
 [<ffffffff81271a29>] ? cpumask_next_and+0x29/0x50
 [<ffffffff81056a64>] ? find_busiest_group+0x244/0x9f0
 [<ffffffff81129b11>] ? do_writepages+0x21/0x40
 [<ffffffff811a513d>] ? writeback_single_inode+0xdd/0x2c0
 [<ffffffff811a557e>] ? writeback_sb_inodes+0xce/0x180
 [<ffffffff811a56db>] ? writeback_inodes_wb+0xab/0x1b0
 [<ffffffff811a5a7b>] ? wb_writeback+0x29b/0x3f0
 [<ffffffff814fd960>] ? thread_return+0x4e/0x76e
 [<ffffffff8107eb42>] ? del_timer_sync+0x22/0x30
 [<ffffffff811a5d69>] ? wb_do_writeback+0x199/0x240
 [<ffffffff811a5e73>] ? bdi_writeback_task+0x63/0x1b0
 [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff81138770>] ? bdi_start_fn+0x0/0x100
 [<ffffffff811387f6>] ? bdi_start_fn+0x86/0x100
 [<ffffffff81138770>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81091d66>] ? kthread+0x96/0xa0
 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
 [<ffffffff81091cd0>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
Code: 89 f1 e8 10 41 16 00 85 c0 89 c6 0f 84 b0 00 00 00 49 89 df 31 d2 31 c9 0f 1f 80 00 00 00 00 49 8b 07 48 8b 38 40 f6 c7 01 75 c6 <48> 85 ff 74 3c 48 83 ff ff 74 bb 44 8b 47 08 45 85 c0 74 e3 45
Call Trace:
 [<ffffffff8112a965>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffffa0083f9a>] ? ext4_num_dirty_pages+0xda/0x260 [ext4]
 [<ffffffff811b34a0>] ? blkdev_get_block+0x0/0x70
 [<ffffffff811af820>] ? block_write_full_page_endio+0xe0/0x120
 [<ffffffff8112b7e6>] ? __pagevec_release+0x26/0x40
 [<ffffffffa0088906>] ? ext4_da_writepages+0x416/0x620 [ext4]
 [<ffffffff81271a29>] ? cpumask_next_and+0x29/0x50
 [<ffffffff81056a64>] ? find_busiest_group+0x244/0x9f0
 [<ffffffff81129b11>] ? do_writepages+0x21/0x40
 [<ffffffff811a513d>] ? writeback_single_inode+0xdd/0x2c0
 [<ffffffff811a557e>] ? writeback_sb_inodes+0xce/0x180
 [<ffffffff811a56db>] ? writeback_inodes_wb+0xab/0x1b0
 [<ffffffff811a5a7b>] ? wb_writeback+0x29b/0x3f0
 [<ffffffff814fd960>] ? thread_return+0x4e/0x76e
 [<ffffffff8107eb42>] ? del_timer_sync+0x22/0x30
 [<ffffffff811a5d69>] ? wb_do_writeback+0x199/0x240
 [<ffffffff811a5e73>] ? bdi_writeback_task+0x63/0x1b0
 [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff81138770>] ? bdi_start_fn+0x0/0x100
 [<ffffffff811387f6>] ? bdi_start_fn+0x86/0x100
 [<ffffffff81138770>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81091d66>] ? kthread+0x96/0xa0
 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
 [<ffffffff81091cd0>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
BUG: soft lockup - CPU#1 stuck for 66s! [rhsmcertd-worke:3212]
Modules linked in: bridge bonding 8021q garp stp llc ipv6 microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp e1000 sg ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
CPU 1
Modules linked in: bridge bonding 8021q garp stp llc ipv6 microcode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp e1000 sg ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 3212, comm: rhsmcertd-worke Tainted: G    B      ---------------    2.6.32-279.2.1.el6.x86_64 #1 SeaMicro SM10000-XE/Type2 - Board Product Name1
RIP: 0010:[<ffffffff812771ae>]  [<ffffffff812771ae>] radix_tree_lookup_slot+0x3e/0x70
RSP: 0000:ffff880432e03ba8  EFLAGS: 00000297
RAX: 0000000000000002 RBX: ffff880432e03ba8 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff880435fa66e0
RBP: ffffffff8100bc0e R08: 0000000000000002 R09: 0000000000000028
R10: ffff88042df88de0 R11: 0000000000000002 R12: ffff880000038b08
R13: 0000000000000000 R14: 00000040ffffffff R15: 00000000361cb570
FS:  00007f81d7667700(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f81d082e000 CR3: 000000042f54a000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process rhsmcertd-worke (pid: 3212, threadinfo ffff880432e02000, task ffff8804307d1540)
Stack:
 ffff880432e03bc8 ffffffff81113ffe ffff88042f580300 ffff880432e03c78
<d> ffff880432e03c38 ffffffff8111550b ffff8804307d1540 ffff88042df88de0
<d> 0000000000000002 0000002881127061 ffff88042fe49380 ffff88042f580370
Call Trace:
 [<ffffffff81113ffe>] ? find_get_page+0x1e/0xa0
 [<ffffffff8111550b>] ? filemap_fault+0x8b/0x500
 [<ffffffff8113ed44>] ? __do_fault+0x54/0x510
 [<ffffffff8113f2f7>] ? handle_pte_fault+0xf7/0xb50
 [<ffffffff8115c30a>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff81048ac7>] ? pte_alloc_one+0x37/0x50
 [<ffffffff8113ff34>] ? handle_mm_fault+0x1e4/0x2b0
 [<ffffffff81044479>] ? __do_page_fault+0x139/0x480
 [<ffffffff810d358d>] ? audit_filter_rules+0x2d/0xdd0
 [<ffffffff81145fca>] ? do_mmap_pgoff+0x33a/0x380
 [<ffffffff8150339e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff81500755>] ? page_fault+0x25/0x30
Code: c7 01 74 49 48 83 e7 fe 8b 17 89 d0 48 3b 34 c5 a0 57 c0 81 77 3c 8d 0c 52 8d 4c 09 fa eb 09 66 0f 1f 44 00 00 83 e9 06 48 89 f0 <48> d3 e8 83 e0 3f 48 8d 44 c7 18 48 8b 38 48 85 ff 74 14 83 ea
Call Trace:
 [<ffffffff81113ffe>] ? find_get_page+0x1e/0xa0
 [<ffffffff8111550b>] ? filemap_fault+0x8b/0x500
 [<ffffffff8113ed44>] ? __do_fault+0x54/0x510
 [<ffffffff8113f2f7>] ? handle_pte_fault+0xf7/0xb50
 [<ffffffff8115c30a>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff81048ac7>] ? pte_alloc_one+0x37/0x50
 [<ffffffff8113ff34>] ? handle_mm_fault+0x1e4/0x2b0
 [<ffffffff81044479>] ? __do_page_fault+0x139/0x480
 [<ffffffff810d358d>] ? audit_filter_rules+0x2d/0xdd0
 [<ffffffff81145fca>] ? do_mmap_pgoff+0x33a/0x380
 [<ffffffff8150339e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff81500755>] ? page_fault+0x25/0x30
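
For anyone triaging a console dump like the one above, a minimal Python sketch that pairs each "soft lockup" header with the RIP symbol that follows it, so it is easier to see where each CPU was spinning. This is only an illustration, not a tool used in this bug. `summarize`, `LOCKUP_RE` and `RIP_RE` are hypothetical names, and the sample lines are copied from the dump above.

```python
import re
from collections import Counter

# Matches e.g. "BUG: soft lockup - CPU#6 stuck for 67s! [flush-8:0:483]"
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)
# Matches e.g. "RIP: 0010:[<...>]  [<...>] find_get_pages_tag+0x5b/0x120"
RIP_RE = re.compile(r"RIP: .*\]\s+(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize(text):
    """Return one dict per soft-lockup report, with the first RIP symbol after it."""
    reports = []
    current = None
    for line in text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            current = {
                "cpu": int(m["cpu"]),
                "comm": m["comm"],
                "pid": int(m["pid"]),
                "rip": None,
            }
            reports.append(current)
            continue
        m = RIP_RE.search(line)
        if m and current is not None and current["rip"] is None:
            current["rip"] = m["sym"]
    return reports


SAMPLE = """\
BUG: soft lockup - CPU#6 stuck for 67s! [flush-8:0:483]
RIP: 0010:[<ffffffff81113d1b>]  [<ffffffff81113d1b>] find_get_pages_tag+0x5b/0x120
BUG: soft lockup - CPU#1 stuck for 67s! [rhsmcertd-worke:3212]
RIP: 0010:[<ffffffff81277170>]  [<ffffffff81277170>] radix_tree_lookup_slot+0x0/0x70
BUG: soft lockup - CPU#3 stuck for 67s! [sosreport:3171]
RIP: 0010:[<ffffffff812771c1>]  [<ffffffff812771c1>] radix_tree_lookup_slot+0x51/0x70
"""

for report in summarize(SAMPLE):
    print(report)
print(Counter(r["rip"] for r in summarize(SAMPLE)))
```

On the sample lines, two of the three reports stop in `radix_tree_lookup_slot` and one in `find_get_pages_tag`. All three stuck tasks were in page-cache lookups, which says where the CPUs were spinning but not why.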




After talking to the SeaMicro tech, we'll re-image the node and see if the problem persists.
Node re-imaged. Let's see now.
Status: NEW → ASSIGNED
Same issues.
Still discussing with SeaMicro to see what to do next.

They said, "We have not tested RHEL6.3 in our lab and qualified it." I have just replied that it's ridiculous to assume this is a RHEL issue when we have over 100 nodes running 6.3 with no problems.

I'll try assigning a different disk to this node, to isolate the issue.
Server crashed again with the new disk.
Sending this info to SeaMicro, but it looks like the c-card is busted.
We swapped the c-card.
Keeping an eye on it now.
QA Contact: jdow → shyam
Closing out.
Server decommissioned per bug 805945.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard