We saw a couple of hosts go offline at the same time:

21:50:39 <@nagios-phx> engagement1.db.phx1 is DOWN: PING CRITICAL - Packet loss = 100%
21:59:08 <@nagios-phx> engagement2.db.phx1 is DOWN: PING CRITICAL - Packet loss = 100%
22:07:21 <@nagios-phx> support5.webapp.phx1 is DOWN: PING CRITICAL - Packet loss = 100%

And on further investigation:

22:22:53 < cshields> :(
22:23:09 <@fox2mike> I rest my case
22:23:13 < dmoore> cshields: meaning it's running DHCP and it didn't get a lease?
22:24:19 < cshields> No DHCPOFFERS received
22:24:35 < cshields> sounds like leases expired at the same time for these
22:25:09 < dmoore> 8(
22:25:15 < dmoore> SAD. FACE.

So netops is poking.
This appears to be a performance problem with the host/VM:

INFO: task dhcpd:15480 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dhcpd         D ffff81000c093828     0 15480      1         21416  8591 (NOTLB)
 ffff810035b95d88 0000000000000082 ffff810035b95d98 ffffffff80062ff8
 0000000000001000 0000000000000009 ffff81003f126820 ffff81003fdf0080
 000a008fcf53872e 00000000000004d2 ffff81003f126a08 0000000000004be5
Call Trace:
 [<ffffffff80062ff8>] thread_return+0x62/0xfe
 [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a09d8>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
 [<ffffffff8002fbdf>] __writeback_single_inode+0x1e9/0x328
 [<ffffffff800f333b>] sync_inode+0x24/0x33
 [<ffffffff8804c36d>] :ext3:ext3_sync_file+0xc9/0xdc
 [<ffffffff80050127>] do_fsync+0x52/0xa4
 [<ffffffff800e0f84>] __do_fsync+0x23/0x36
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
VMCIUtil: Updating context id from 0xffffffff to 0x5adb40a1 on event 0.
VMCIUtil: Updating context id from 0x5adb40a1 to 0x5adb40a1 on event 0.
We rebooted both ip-ns01 and ns02 and still no go:

[root@support5 ~]# /etc/init.d/network restart
Shutting down interface bond0:                             [  OK  ]
Shutting down loopback interface:                          [  OK  ]
Bringing up loopback interface:                            [  OK  ]
Bringing up interface bond0:
Determining IP information for bond0... failed.            [FAILED]

Seems like the DHCP responses aren't making it back to the machines.
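One way to confirm that the OFFERs are being dropped upstream is to watch the DHCP ports on the affected host while the client retries. A minimal sketch of that check (the interface name bond0 is taken from the output above; the capture itself needs root, so the command is only assembled and printed here):

```shell
#!/bin/sh
# Interface from the failing 'network restart' output above.
iface="bond0"
# DHCP runs over UDP: servers/relays use port 67, clients port 68.
filter='udp and (port 67 or port 68)'
# Run this (as root) while restarting networking in another terminal.
# Outbound DISCOVERs with no matching OFFERs point at the relay path,
# not at the client.
cmd="tcpdump -lni $iface $filter"
echo "$cmd"
```

If the DISCOVER broadcasts show up but nothing comes back, the client side is fine and the fault is between the relay and the DHCP server, which matches what we're seeing.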
We lost a bunch of Hadoop machines in phx1 because of this issue; cc'ing socorro folks. Sorry :(

23:14:50 <@nagios-phx> hp-node62.phx1 is DOWN: PING CRITICAL - Packet loss = 100%
23:15:08 <@nagios-phx> hp-node65.phx1 is DOWN: PING CRITICAL - Packet loss = 100%
23:15:08 <@nagios-phx> hp-node67.phx1 is DOWN: PING CRITICAL - Packet loss = 100%
23:15:09 <@nagios-phx> hp-node61.phx1 is DOWN: PING CRITICAL - Packet loss = 100%
23:15:16 <@nagios-phx> hp-node69.phx1 is DOWN: PING CRITICAL - Packet loss = 100%

Lost pulse too:

23:19:56 <@nagios-phx> dp-pulse01.phx is DOWN: PING CRITICAL - Packet loss = 100%
Lost sumo as well.
Things are coming back up now; we have a temporary DHCP server up in phx1.
(In reply to comment #4)
> Lost sumo as well.

The sumo outage was a total of 8 minutes, 11:32-11:40.
We've confirmed the firewall was a component of this failure. The DHCP relay feature appears to be non-functional after the upgrade that took place during this evening's maintenance window. It is likely that DHCP was broken beginning around 18:30 PST. At this time, ip-ns01 and ip-ns02 are only able to issue DHCP leases for their local network (10.8.75.0/24). I have migrated our DHCP configuration to natasha, which is directly connected to all networks in phx1. We are tracking this issue with Juniper, and I hope to have further updates in the morning.
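Because natasha has a leg on each network, ISC dhcpd can answer every subnet locally and the firewall's broken relay drops out of the path entirely. A hedged sketch of the shape that dhcpd.conf takes (the 10.8.75.0/24 network is from the comment above; every other address, router, and range is an illustrative placeholder, not our real config):

```
# Authoritative server directly attached to each network, so no
# DHCP relay is needed anywhere.
authoritative;

# The ns hosts' local network, per the comment above.
subnet 10.8.75.0 netmask 255.255.255.0 {
  option routers 10.8.75.1;          # assumed gateway
  range 10.8.75.100 10.8.75.200;     # assumed pool
}

# One such block per additional directly-attached phx1 network
# (addresses here are made up for illustration).
subnet 10.8.76.0 netmask 255.255.255.0 {
  option routers 10.8.76.1;
  range 10.8.76.100 10.8.76.200;
}
```

dhcpd matches each incoming broadcast to a subnet block by the interface it arrived on, which is why direct attachment to every network is the thing that makes this work without relay.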
JTAC case 2011-0308-0058
We have confirmed systemic problems with the firewall on this new code version (10.4R2). We are investigating an appropriate version for rollback with the assistance of JTAC.
We have two outstanding issues with the firewall:

1) DHCP relay is not functional. This is a regression in the JUNOS code.

2) The 5 service processing cards in the SRX are individually crashing and rebooting, which interrupts any traffic being handled by that particular card. This means 20% of our sessions are being interrupted every few minutes.
We are rolling back to JUNOS 10.3R2 now.
10.3R2 fixed issue #1, above, but still manifested #2 under load. We have rolled back yet again, to 10.2S7. This code appears to be solid, and we consider the firewall to have stabilized at this time.
Revision 10.2S7 continues to be stable. We have rebuilt the redundant cluster on this version.
Tomorrow morning we will need vlan6 DHCP fixed on natasha; tacking this on to the bug. There will be much rejoicing (yay).
vlan6 is now configured and checked into svn. We can test and tune it as needed.
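For the test-and-tune loop, a candidate config checked out of svn can be syntax-checked before it ever touches the running daemon. A sketch under assumed paths (the config location and the init-script name are guesses for a RHEL-era host, not confirmed from this bug):

```shell
#!/bin/sh
# Validate a candidate dhcpd.conf (e.g. fresh from svn) without
# restarting anything. Path is an assumption.
conf="/etc/dhcpd.conf"
# -t asks dhcpd to test-parse the config and exit; -cf names the file.
check="dhcpd -t -cf $conf"
echo "$check"
# If the parse succeeds, pick up the new vlan6 stanza with something like:
#   /etc/init.d/dhcpd restart
```

Running the parse check first means a typo in the new vlan6 stanza can't take DHCP down for every other network natasha is now serving.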
(In reply to comment #16)
> vlan6 is now configured and checked into svn. We can test and tune it as
> needed.

VERIFIED FIXED