Bug 639745 - Possible DHCP issues in phx1
Status: RESOLVED FIXED
:
Product: Infrastructure & Operations
Classification: Other
Component: NetOps
: other
: All Other
: -- blocker (vote)
: ---
Assigned To: Derek Moore [:dmoore]
: matthew zeier [:mrz]
:
Mentors:
Duplicates: 639812
Depends on:
Blocks:
Reported: 2011-03-07 22:27 PST by Shyam Mani [:fox2mike]
Modified: 2013-08-08 05:50 PDT
See Also:
Due Date:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments

Description Shyam Mani [:fox2mike] 2011-03-07 22:27:57 PST
We saw several hosts go offline within a few minutes of each other:

21:50:39 <@nagios-phx> [148] engagement1.db.phx1 is DOWN: PING CRITICAL - Packet loss = 100%
21:59:08 <@nagios-phx> [149] engagement2.db.phx1 is DOWN: PING CRITICAL - Packet loss = 100%
22:07:21 <@nagios-phx> [151] support5.webapp.phx1 is DOWN: PING CRITICAL - Packet loss = 100%

And on further investigation:

22:22:53 < cshields> :(
22:23:09 <@fox2mike> I rest my case
22:23:13 < dmoore> cshields: meaning it's running DHCP and it didn't get a lease?
22:24:19 < cshields> No DHCPOFFERS received
22:24:35 < cshields> sounds like leases expired at the same time for these
22:25:09 < dmoore> 8(
22:25:15 < dmoore> SAD. FACE.

So netops is poking.
Comment 1 Derek Moore [:dmoore] 2011-03-07 22:39:25 PST
This appears to be a performance problem with the host/VM:

INFO: task dhcpd:15480 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dhcpd         D ffff81000c093828     0 15480      1         21416  8591 (NOTLB)
 ffff810035b95d88 0000000000000082 ffff810035b95d98 ffffffff80062ff8
 0000000000001000 0000000000000009 ffff81003f126820 ffff81003fdf0080
 000a008fcf53872e 00000000000004d2 ffff81003f126a08 0000000000004be5
Call Trace:
 [<ffffffff80062ff8>] thread_return+0x62/0xfe
 [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a09d8>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
 [<ffffffff8002fbdf>] __writeback_single_inode+0x1e9/0x328
 [<ffffffff800f333b>] sync_inode+0x24/0x33
 [<ffffffff8804c36d>] :ext3:ext3_sync_file+0xc9/0xdc
 [<ffffffff80050127>] do_fsync+0x52/0xa4
 [<ffffffff800e0f84>] __do_fsync+0x23/0x36
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

VMCIUtil: Updating context id from 0xffffffff to 0x5adb40a1 on event 0.
VMCIUtil: Updating context id from 0x5adb40a1 to 0x5adb40a1 on event 0.
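
The trace above shows dhcpd blocked in fsync, waiting on an ext3 journal commit. A rough way to check whether storage on the host is slow is to time a few write+fsync rounds on the filesystem in question. This is only a sketch: the function name, payload size, and round count are arbitrary choices for illustration, and the directory to test (wherever the lease file lives) is an assumption to fill in per host.

```python
import os
import tempfile
import time


def time_fsync(directory, payload=b"x" * 4096, rounds=5):
    """Return the wall-clock seconds taken by each fsync after a small write."""
    durations = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(rounds):
            os.write(fd, payload)
            start = time.monotonic()
            os.fsync(fd)  # same syscall dhcpd was blocked in
            durations.append(time.monotonic() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return durations


if __name__ == "__main__":
    # Hypothetical target: replace with the directory holding the lease file.
    results = time_fsync(tempfile.gettempdir())
    print("slowest fsync: %.3f s" % max(results))
```

Values in the range of whole seconds would be consistent with the hung-task warning; values in the low milliseconds would point elsewhere.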
Comment 2 Shyam Mani [:fox2mike] 2011-03-07 23:01:06 PST
We rebooted both ip-ns01 and ip-ns02 and still no go:

[root@support5 ~]# /etc/init.d/network restart
Shutting down interface bond0:  [  OK  ]
Shutting down loopback interface:  [  OK  ]
Bringing up loopback interface:  [  OK  ]
Bringing up interface bond0:  
Determining IP information for bond0... failed.
[FAILED]

Seems like the DHCP responses aren't making it back to the machines.
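
For reference, this is roughly what a client broadcasts when it asks for a lease. The sketch below only builds a minimal DHCPDISCOVER payload following the BOOTP layout in RFC 2131; it does not send anything (that needs root and UDP port 68). The function name and the example MAC and transaction id are made up for illustration.

```python
import struct

DHCP_MAGIC_COOKIE = b"\x63\x82\x53\x63"


def build_dhcp_discover(mac, xid):
    """Build a minimal DHCPDISCOVER payload (BOOTP header + options)."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    if len(mac_bytes) != 6:
        raise ValueError("expected a 6-byte Ethernet MAC address")
    header = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,            # op: BOOTREQUEST
        1,            # htype: Ethernet
        6,            # hlen: MAC length
        0,            # hops (a relay increments this)
        xid,          # transaction id
        0,            # secs
        0x8000,       # flags: ask for a broadcast reply
        b"\x00" * 4,  # ciaddr
        b"\x00" * 4,  # yiaddr
        b"\x00" * 4,  # siaddr
        b"\x00" * 4,  # giaddr (a relay fills this in)
        mac_bytes,    # chaddr, zero-padded to 16 bytes by struct
        b"",          # sname, zero-padded to 64 bytes
        b"",          # file, zero-padded to 128 bytes
    )
    # Option 53 (message type) = 1 (DISCOVER), then option 255 (end).
    options = bytes([53, 1, 1, 255])
    return header + DHCP_MAGIC_COOKIE + options


if __name__ == "__main__":
    packet = build_dhcp_discover("00:11:22:33:44:55", 0x12345678)  # made-up values
    print(len(packet), "bytes")
```

If a packet like this leaves the client and no DHCPOFFER comes back, the loss is somewhere between the client's segment and the server, which is where a relay would sit.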
Comment 3 Shyam Mani [:fox2mike] 2011-03-07 23:22:17 PST
We lost a bunch of hadoop machines in phx1 because of this issue, cc'ing socorro folks. Sorry :(

23:14:50 <@nagios-phx> [157] hp-node62.phx1 is DOWN: PING CRITICAL - Packet loss = 100%
23:15:08 <@nagios-phx> [158] hp-node65.phx1 is DOWN: PING CRITICAL - Packet loss = 100%
23:15:08 <@nagios-phx> [159] hp-node67.phx1 is DOWN: PING CRITICAL - Packet loss = 100%
23:15:09 <@nagios-phx> [160] hp-node61.phx1 is DOWN: PING CRITICAL - Packet loss = 100%
23:15:16 <@nagios-phx> [161] hp-node69.phx1 is DOWN: PING CRITICAL - Packet loss = 100%

Lost pulse too:

23:19:56 <@nagios-phx> [165] dp-pulse01.phx is DOWN: PING CRITICAL - Packet loss = 100%
Comment 4 Shyam Mani [:fox2mike] 2011-03-07 23:42:15 PST
Lost sumo as well.
Comment 5 Shyam Mani [:fox2mike] 2011-03-07 23:45:34 PST
Things are coming back up now; we have a temporary DHCP server up in phx1.
Comment 6 Corey Shields [:cshields] 2011-03-08 00:04:47 PST
(In reply to comment #4)
> Lost sumo as well.

The sumo outage lasted a total of 8 minutes, 11:32-11:40.
Comment 7 Derek Moore [:dmoore] 2011-03-08 02:20:31 PST
We've confirmed the firewall was a component of this failure. The DHCP relay feature appears to be non-functional after the upgrade which took place during this evening's maintenance window. It is likely that DHCP was broken beginning around 18:30 PST.

At this time, ip-ns01 and ip-ns02 are only able to issue DHCP leases for their local network (10.8.75.0/24). I have migrated our DHCP configuration to natasha, which is directly connected to all networks in phx1.
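
To make the relay dependency concrete, here is a small sketch that tells which client subnets a server on 10.8.75.0/24 can reach without a relay. Only 10.8.75.0/24 comes from this bug; the other subnet in the example and the function name are hypothetical.

```python
import ipaddress

# Network the DHCP servers sit on, per the comment above.
LOCAL_NET = ipaddress.ip_network("10.8.75.0/24")


def needs_relay(client_subnet):
    """True if clients on client_subnet cannot reach the server by broadcast."""
    subnet = ipaddress.ip_network(client_subnet)
    # Broadcast DISCOVERs stay on their own segment, so anything outside
    # the server's network depends on a working relay.
    return not subnet.subnet_of(LOCAL_NET)


if __name__ == "__main__":
    for net in ("10.8.75.0/24", "10.8.81.0/24"):  # second subnet is hypothetical
        print(net, "needs relay:", needs_relay(net))
```

With the relay broken, every subnet for which this returns True loses DHCP, which is why moving the configuration to a host attached to all networks works around it.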

We are tracking this issue with Juniper, and I hope to have further updates in the morning.
Comment 8 Derek Moore [:dmoore] 2011-03-08 02:43:46 PST
JTAC case 2011-0308-0058
Comment 9 Derek Moore [:dmoore] 2011-03-08 09:17:21 PST
*** Bug 639812 has been marked as a duplicate of this bug. ***
Comment 10 Derek Moore [:dmoore] 2011-03-08 09:20:27 PST
We have confirmed systemic problems with the firewall on this new code version (10.4R2).

We are investigating an appropriate version for rollback with the assistance of JTAC.
Comment 11 Derek Moore [:dmoore] 2011-03-08 09:52:34 PST
We have two outstanding issues with the firewall:

1) DHCP relay is not functional. This is a regression in the JUNOS code.

2) The 5 service processing cards in the SRX are individually crashing and rebooting, which interrupts any traffic being handled by that particular card. This means 20% of our sessions are being interrupted every few minutes.
Comment 12 Derek Moore [:dmoore] 2011-03-08 09:52:48 PST
We are rolling back to JUNOS 10.3R2 now.
Comment 13 Derek Moore [:dmoore] 2011-03-08 12:08:54 PST
10.3R2 fixed issue #1 above, but still exhibited #2 under load. We have rolled back yet again, to 10.2S7.

This code appears to be solid, and we consider the firewall to have stabilized at this time.
Comment 14 Derek Moore [:dmoore] 2011-03-08 22:57:12 PST
Revision 10.2S7 continues to be stable. We have rebuilt the redundant cluster on this version.
Comment 15 Corey Shields [:cshields] 2011-03-08 23:01:04 PST
Tomorrow morning we will need vlan6 DHCP fixed on natasha; tacking this on to the bug. There will be much rejoicing.

(yay)
Comment 16 Derek Moore [:dmoore] 2011-03-08 23:14:18 PST
vlan6 is now configured and checked into svn. We can test and tune it as needed.
Comment 17 Corey Shields [:cshields] 2011-03-08 23:37:36 PST
(In reply to comment #16)
> vlan6 is now configured and checked into svn. We can test and tune it as
> needed.

VERIFIED FIXED
