Possible DHCP issues in phx1

Status: RESOLVED FIXED
Product: Infrastructure & Operations
Component: NetOps
Priority: --
Severity: blocker
Reported: 7 years ago
Last modified: 4 years ago

People

(Reporter: fox2mike, Assigned: dmoore)

(Reporter)

Description

7 years ago
We saw a couple of hosts go offline at the same time:

21:50:39 <@nagios-phx> [148] engagement1.db.phx1 is DOWN: PING CRITICAL - Packet loss = 100%
21:59:08 <@nagios-phx> [149] engagement2.db.phx1 is DOWN: PING CRITICAL - Packet loss = 100%
22:07:21 <@nagios-phx> [151] support5.webapp.phx1 is DOWN: PING CRITICAL - Packet loss = 100%

And on further investigation:

22:22:53 < cshields> :(
22:23:09 <@fox2mike> I rest my case
22:23:13 < dmoore> cshields: meaning it's running DHCP and it didn't get a lease?
22:24:19 < cshields> No DHCPOFFERS received
22:24:35 < cshields> sounds like leases expired at the same time for these
22:25:09 < dmoore> 8(
22:25:15 < dmoore> SAD. FACE.

So netops is poking.
(Reporter)

Updated

7 years ago
Assignee: server-ops → dmoore
Component: Server Operations → Server Operations: Netops
(Assignee)

Comment 1

7 years ago
This appears to be a performance problem with the host/VM:

INFO: task dhcpd:15480 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dhcpd         D ffff81000c093828     0 15480      1         21416  8591 (NOTLB)
 ffff810035b95d88 0000000000000082 ffff810035b95d98 ffffffff80062ff8
 0000000000001000 0000000000000009 ffff81003f126820 ffff81003fdf0080
 000a008fcf53872e 00000000000004d2 ffff81003f126a08 0000000000004be5
Call Trace:
 [<ffffffff80062ff8>] thread_return+0x62/0xfe
 [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a09d8>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
 [<ffffffff8002fbdf>] __writeback_single_inode+0x1e9/0x328
 [<ffffffff800f333b>] sync_inode+0x24/0x33
 [<ffffffff8804c36d>] :ext3:ext3_sync_file+0xc9/0xdc
 [<ffffffff80050127>] do_fsync+0x52/0xa4
 [<ffffffff800e0f84>] __do_fsync+0x23/0x36
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

VMCIUtil: Updating context id from 0xffffffff to 0x5adb40a1 on event 0.
VMCIUtil: Updating context id from 0x5adb40a1 to 0x5adb40a1 on event 0.
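As a side note, a minimal way to confirm this kind of storage stall on the guest (illustrative commands, not taken from this incident) would be to look for further hung-task reports and watch I/O wait while the box is struggling:

# Illustrative only -- not run as part of this incident.
# Any other hung-task reports logged by the kernel (dhcpd was blocked >120s above).
dmesg | grep -B1 -A3 "blocked for more than 120 seconds"
# Current hung-task warning threshold referenced in the trace.
cat /proc/sys/kernel/hung_task_timeout_secs
# Watch iowait and per-device latency for a minute to confirm slow storage on the VM.
vmstat 5 12
iostat -x 5 12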
(Reporter)

Comment 2

7 years ago
We rebooted both ip-ns01 and ip-ns02 and still no go:

[root@support5 ~]# /etc/init.d/network restart
Shutting down interface bond0:  [  OK  ]
Shutting down loopback interface:  [  OK  ]
Bringing up loopback interface:  [  OK  ]
Bringing up interface bond0:  
Determining IP information for bond0... failed.
[FAILED]

Seems like the DHCP responses aren't making it back to the machines.
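A quick way to confirm that from the client side (a hypothetical debugging sketch, not commands from the original report) is to watch DHCP traffic on the bonded interface while forcing a new request:

# Hypothetical client-side check -- interface name taken from the output above.
# Requests leave on port 67; offers should come back on port 68.
tcpdump -n -e -i bond0 port 67 or port 68 &
# Force a fresh DHCPDISCOVER in the foreground.
dhclient -d bond0
# If DISCOVERs go out but no OFFERs return, the relay/server path is at fault.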
(Reporter)

Comment 3

7 years ago
We lost a bunch of Hadoop machines in phx1 because of this issue; cc'ing the Socorro folks. Sorry :(

23:14:50 <@nagios-phx> [157] hp-node62.phx1 is DOWN: PING CRITICAL - Packet loss = 100%
23:15:08 <@nagios-phx> [158] hp-node65.phx1 is DOWN: PING CRITICAL - Packet loss = 100%
23:15:08 <@nagios-phx> [159] hp-node67.phx1 is DOWN: PING CRITICAL - Packet loss = 100%
23:15:09 <@nagios-phx> [160] hp-node61.phx1 is DOWN: PING CRITICAL - Packet loss = 100%
23:15:16 <@nagios-phx> [161] hp-node69.phx1 is DOWN: PING CRITICAL - Packet loss = 100%

Lost pulse too:

23:19:56 <@nagios-phx> [165] dp-pulse01.phx is DOWN: PING CRITICAL - Packet loss = 100%
(Reporter)

Comment 4

7 years ago
Lost sumo as well.
(Reporter)

Comment 5

7 years ago
Things are coming back up now; we have a temp DHCP server up in phx1.

Comment 6

7 years ago
(In reply to comment #4)
> Lost sumo as well.

sumo outage was a total of 8 minutes, 11:32-11:40
(Assignee)

Comment 7

7 years ago
We've confirmed the firewall was a component of this failure. The DHCP relay feature appears to be non-functional after the upgrade that took place during this evening's maintenance window. It is likely that DHCP was broken beginning around 18:30 PST.

At this time, ip-ns01 and ip-ns02 are only able to issue DHCP leases for their local network (10.8.75.0/24). I have migrated our DHCP configuration to natasha, which is directly connected to all networks in phx1.

We are tracking this issue with Juniper, and I hope to have further updates in the morning.
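For reference, a hypothetical way to see the relay failure from the server side (commands are illustrative; the interface name is an assumption) is to watch port 67 on the DHCP server: relayed requests arrive as unicast from the relay agent with giaddr set, so if only local broadcasts from 10.8.75.0/24 show up, the firewall's relay isn't forwarding anything.

# Illustrative check on the DHCP server -- interface name is an assumption.
tcpdump -n -v -i eth0 port 67
# A server plumbed directly into every client network (as natasha is) answers the
# broadcasts itself, so it does not depend on the firewall's DHCP relay at all.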
(Assignee)

Comment 8

7 years ago
JTAC case 2011-0308-0058
(Assignee)

Updated

7 years ago
Duplicate of this bug: 639812
(Assignee)

Comment 10

7 years ago
We have confirmed systemic problems with the firewall on this new code version (10.4R2).

We are investigating an appropriate version for rollback with the assistance of JTAC.
(Assignee)

Comment 11

7 years ago
We have two outstanding issues with the firewall:

1) DHCP relay is not functional. This is a regression in the JUNOS code.

2) The 5 service processing cards in the SRX are individually crashing and rebooting, which interrupts any traffic being handled by the affected card. Since each card carries roughly one fifth of our sessions, about 20% of sessions are being interrupted every few minutes.
(Assignee)

Comment 12

7 years ago
We are rolling back to JUNOS 10.3R2 now.

Updated

7 years ago
Group: infra
(Assignee)

Comment 13

7 years ago
10.3R2 fixed issue #1, above, but still manifested #2 under load. We have rolled back yet again, to 10.2S7.

This code appears to be solid, and we consider the firewall to have stabilized at this time.
(Assignee)

Comment 14

7 years ago
Revision 10.2S7 continues to be stable. We have rebuilt the redundant cluster on this version.

Comment 15

7 years ago
Tomorrow morning we will need vlan6 DHCP fixed on natasha; tacking this on to the bug. There will be much rejoicing.

(yay)
(Assignee)

Comment 16

7 years ago
vlan6 is now configured and checked into svn. We can test and tune it as needed.
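A minimal smoke test for the new scope (hypothetical; host and interface names are assumptions, not from this bug) is to request a lease from a host sitting on vlan6 and check the address it gets:

# Hypothetical test from a host on vlan6 -- names are assumptions.
dhclient -r eth0       # release the current lease, if any
dhclient -d eth0       # request a new lease in the foreground
ip addr show dev eth0  # confirm the address falls in the vlan6 range served by natasha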
Comment 17

7 years ago
(In reply to comment #16)
> vlan6 is now configured and checked into svn. We can test and tune it as
> needed.

VERIFIED FIXED

Updated

6 years ago
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations