Closed Bug 979421 Opened 11 years ago Closed 10 years ago

Force boxes to arp after core network problems

Categories

(Infrastructure & Operations :: Infrastructure: Other, task)

x86
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: ericz, Unassigned)

Details

Attachments

(1 file, 1 obsolete file)

As per bug 977818, some systems drop off the network after core network issues. In this case, a network outage in bug 977775 caused one or more web*.bugs.scl3.mozilla.com boxes to drop off the network. Pinging the default router from the offline box caused it to start working again, likely because the ping forced its MAC address back into the switch's tables. If we could get systems to ARP more frequently, it might fix this issue. From a brief look, there is a related sysctl, arp_notify, which defaults to doing nothing but can be set to generate gratuitous ARP requests when the device is brought up or its hardware address changes. I'm not sure that's enough to help in this situation, but it's the best I've found so far.
Eric, did the kernel logs for the server indicate that the Ethernet link went down during the outage?
I didn't explicitly check at the time, but I don't see anything in dmesg or messages and somewhat doubt it.
If there wasn't a link down/up event, then arp_notify would be of no use, since neither the IP address nor the link state changed during the event.
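For reference, a minimal sketch of how the arp_notify setting discussed above could be checked on a box. The function name and the conf_root parameter are made up for illustration, and eth0 is only assumed to be the relevant interface. As noted, this setting only fires on link-up or a hardware address change, so it likely would not have helped here.

```shell
# Report the arp_notify mode for one interface.
# conf_root defaults to the real procfs tree; it is a parameter only so the
# function can be pointed at a scratch directory.
arp_notify_mode() {
    iface=$1
    conf_root=${2:-/proc/sys/net/ipv4/conf}
    case "$(cat "$conf_root/$iface/arp_notify" 2>/dev/null)" in
        0) echo "$iface: arp_notify off (default)" ;;
        1) echo "$iface: gratuitous ARP on link-up or MAC change" ;;
        *) echo "$iface: arp_notify unavailable" ;;
    esac
}
```

Calling `arp_notify_mode eth0` reads the live value. Enabling it would be something like `sysctl -w net.ipv4.conf.eth0.arp_notify=1` as root.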
I'm not sure what there is to do here for Netops. As soon as a frame from a server goes through the switch, the switch learns the associated MAC. If the switch doesn't know which port a MAC (server) is on, it floods the frame to all ports, and the server replies (which is what happens when pinging the box).
Flags: needinfo?(eziegenhorn)
I guess the question is if we can improve the situation so some boxes don't drop off the network after core network failures. The boxes in question were not pingable remotely, so if the switch was flooding traffic to all ports, something wasn't working right about it. To get these boxes back online, from the OOB console of the box itself I had to ping the default router and then everything worked normally. That's obviously fairly labor intensive.
Flags: needinfo?(eziegenhorn)
:XioNoX doesn't think there is anything Netops can do about this, and thought it could be a bug on the server side. :jabba et al, do you have any input on something that could be done host-side? If not, we'll just close this out.
Assignee: network-operations → infra
Component: NetOps: Other → Infrastructure: Other
QA Contact: adam → jdow
I think the key here is pinging the impacted systems from a host in the same subnet. It will send an ARP who-has and L2 switches should pick it up. Alternatively, we could cook up a script that pings the gateway and sends gratuitous announcements if the ping fails.
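A rough sketch of what such a script could look like, assuming iputils ping and arping are installed. The function name and its three arguments (interface, gateway IP, the host's own IP) are placeholders for illustration, not anything deployed.

```shell
# Send one gratuitous ARP for our own address if the gateway stops answering.
# Arguments: interface, gateway IP, this host's IP (all placeholders here).
announce_if_gateway_down() {
    iface=$1 gateway=$2 self_ip=$3
    # One ICMP echo with a 2 second timeout.
    if ping -c 1 -W 2 "$gateway" >/dev/null 2>&1; then
        echo "gateway reachable"
        return 0
    fi
    # -U: unsolicited (gratuitous) ARP; -c 1: send a single packet.
    arping -U -c 1 -I "$iface" "$self_ip" >/dev/null 2>&1
    echo "gateway unreachable, sent gratuitous ARP"
}
```

On web1.bugs this might be called as `announce_if_gateway_down eth0 10.8.82.1 10.8.82.11`. Whether the announcement is actually needed on top of the ping itself is an open question, since the ping attempt already puts a frame on the wire.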
Attached patch arping.patch (obsolete)
Wanted to revisit this bug - I think a simple solution might be to cron arping to send some arp packets every few minutes. See attached proof of concept patch.
Sounds like a great idea to me. According to the man page, I believe the "deadline" argument should be replaced with a time in seconds; as is, this relies on "-c 1" sending only a single packet for the command to exit. Also s/conetnts/contents/.
I think the keyword 'deadline' has special meaning to -w. I tested below and when I used '-w deadline' it stopped after sending 1 announcement. The intention behind this is to go easy on the l2 switch gear. I tested this on web1.bugs:

[root@web1.bugs.phx1 ~]# arping -w deadline -c 1 10.8.82.1
ARPING 10.8.82.1 from 10.8.82.11 eth0
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 2.282ms
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 3.249ms
Sent 1 probes (1 broadcast(s))
Received 2 response(s)

[root@web1.bugs.phx1 ~]# arping -w 5 -c 1 10.8.82.1
ARPING 10.8.82.1 from 10.8.82.11 eth0
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 2.553ms
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 2.660ms
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 3.315ms
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 3.954ms
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 3.797ms
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 2.461ms
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 2.256ms
Sent 6 probes (1 broadcast(s))
Received 7 response(s)

[root@web1.bugs.phx1 ~]# arping -w 1 -c 1 10.8.82.1
ARPING 10.8.82.1 from 10.8.82.11 eth0
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 3.330ms
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 3.531ms
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 4.053ms
Sent 2 probes (1 broadcast(s))
Received 3 response(s)
[root@web1.bugs.phx1 ~]#
Attached patch arping.patch
As discussed on IRC I dropped `-w deadline`, left -c 1, and added a flock. I think arping will re-try sending arp packets until it receives a response. If a network device was down for an extended period of time we might end up with tons of arping packets doing nefarious things to the network. After this passes your review I'm going to run it by netops and submit to the CAB for deployment.
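The patch itself is not reproduced in this bug, so the following is only a sketch of the approach described above (a single arping under flock), not the attached patch. The function name, lock path, interface and gateway are placeholders.

```shell
# Hypothetical wrapper for a cron job; lock path, interface and gateway are
# placeholders, not values from the attached patch.
arp_keepalive() {
    lock=$1 iface=$2 gateway=$3
    # flock -n: give up immediately instead of queueing if a previous run
    # still holds the lock. arping -q -c 1: quiet, send a single ARP request.
    flock -n "$lock" arping -q -c 1 -I "$iface" "$gateway"
}
```

A matching /etc/cron.d entry could look like `*/5 * * * * root flock -n /var/lock/arping.lock arping -q -c 1 -I eth0 10.8.82.1`. The non-blocking lock is what keeps stalled arping runs from piling up if the gateway stays down.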
Attachment #8415347 - Attachment is obsolete: true
Attachment #8415548 - Flags: review?(eziegenhorn)
Comment on attachment 8415548 arping.patch
Review of attachment 8415548:
Looks good to me.
Attachment #8415548 - Flags: review?(eziegenhorn) → review+
Comment on attachment 8415548 arping.patch
Adam, I would like your input here. I intend to present this to the CAB tomorrow and I would like your (Netops) input before I move forward.
Attachment #8415548 - Flags: review?(adam)
Per conversation with :digi, we'll be looking into the problem further. We're interested to find why the MAC address going away causes a problem since that is intended behavior on the networking equipment.
Attachment #8415548 - Flags: review?(adam)
Spoke with :adam - I think the tl;dr is that this behavior is a feature, not a bug. Please re-open if you feel otherwise.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INVALID
Just an update - spoke to ericz about this, we'll leave this as R/I for now. If the issue comes up again during a future network event we'd like to revisit this and get a plan in place to force arp announcements faster or delay the purging of cached entries.