Bug 979421
Opened 11 years ago
Closed 10 years ago
Force boxes to arp after core network problems
Categories
(Infrastructure & Operations :: Infrastructure: Other, task)
Tracking
(Not tracked)
RESOLVED
INVALID
People
(Reporter: ericz, Unassigned)
Details
Attachments
(1 file, 1 obsolete file)
850 bytes, patch | ericz: review+
As per bug 977818, some systems drop off the network after core network issues. In this situation, a network outage in bug 977775 caused one or more web*.bugs.scl3.mozilla.com boxes to drop off the network. Pinging the default router from the offline box caused it to start working again, likely because the ping forced it into the switch's tables. If we could get systems to arp more frequently, it might fix this issue.
Looking at this briefly, there is a related sysctl, arp_notify, which defaults to doing nothing but can be set to generate gratuitous ARP requests when the device is brought up or its hardware address changes. I'm not sure that's enough to help in this situation, but it's the best I've found so far.
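For reference, a minimal sketch of how that sysctl could be switched on for one interface. The helper name `enable_arp_notify` is made up for illustration, and eth0 in the usage note is an assumption taken from the arping output later in this bug. Per the description above, this only generates announcements on device-up or hardware address changes, so it may not cover this case.

```shell
# enable_arp_notify IFACE
# Turns on arp_notify for one interface (needs root). With the value 1 the
# kernel sends gratuitous ARP when the device is brought up or its
# hardware address changes; the default of 0 does nothing.
enable_arp_notify() {
    sysctl -w "net.ipv4.conf.$1.arp_notify=1"
}
```

Usage would be something like `enable_arp_notify eth0`. To survive a reboot, the same key would also need to go into the sysctl configuration.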
Eric, did the kernel logs for the server indicate that the Ethernet link went down during the outage?
Comment 2•11 years ago (Reporter)
I didn't explicitly check at the time, but I don't see anything in dmesg or messages and somewhat doubt it.
If there wasn't a link down/up event, then arp_notify would be of no use, since neither the IP address nor the link state changed during the event.
Comment 4•11 years ago
I'm not sure what there is to do here for Netops. As soon as a frame from a server goes through the switch, the switch learns the associated MAC. If the switch doesn't know which port a MAC (server) is on, it sends the frame out on all ports, and the server replies (which is what happens when pinging the box).
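The learn-or-flood behavior described above can be sketched as a toy model. This is only an illustration: the function names, the table format and the MAC address in the usage note are invented, and real switches do this in hardware with their own aging timers.

```shell
# Toy model of a switch's MAC address table (illustrative only).
MAC_TABLE=""   # one "MAC PORT" pair per line

# forget MAC: drop the entry for MAC, as if it had aged out.
forget() {
    MAC_TABLE=$(printf '%s\n' "$MAC_TABLE" | awk -v mac="$1" 'NF && $1 != mac')
}

# learn MAC PORT: record the port on which a frame from MAC arrived,
# replacing any older entry for the same MAC.
learn() {
    forget "$1"
    MAC_TABLE=$(printf '%s\n%s %s\n' "$MAC_TABLE" "$1" "$2")
}

# egress MAC: print the port for a known MAC, or "flood" if it is unknown
# and the frame would be sent out on all ports.
egress() {
    port=$(printf '%s\n' "$MAC_TABLE" | awk -v mac="$1" '$1 == mac { print $2 }')
    echo "${port:-flood}"
}
```

In this model, `egress 00:00:5e:00:53:0b` prints `flood` until `learn 00:00:5e:00:53:0b 3` has run, which mirrors how any frame sent by the server repopulates the table.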
Flags: needinfo?(eziegenhorn)
Comment 5•11 years ago (Reporter)
I guess the question is if we can improve the situation so some boxes don't drop off the network after core network failures. The boxes in question were not pingable remotely, so if the switch was flooding traffic to all ports, something wasn't working right about it. To get these boxes back online, from the OOB console of the box itself I had to ping the default router and then everything worked normally. That's obviously fairly labor intensive.
Flags: needinfo?(eziegenhorn)
Comment 6•11 years ago (Reporter)
:XioNoX doesn't think there is anything Netops can do about this, and thought it could be a bug on the server side. :jabba et al, do you have any input on something that could be done host-side? If not, we'll just close this out.
Assignee: network-operations → infra
Component: NetOps: Other → Infrastructure: Other
QA Contact: adam → jdow
Comment 7•11 years ago
I think the key here is pinging the impacted systems from a host in the same subnet. It will send an ARP who-has, and the L2 switches should pick it up. Alternatively, we could cook up a script that pings the gateway and sends gratuitous announcements when/if the ping fails.
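A rough sketch of what such a script might look like, assuming the iputils versions of ping and arping. The function name and the two-second timeout are placeholders, not anything that was deployed.

```shell
# announce_if_gateway_unreachable GATEWAY IFACE OWN_IP
# Sends one ICMP echo to the gateway. If no reply arrives, sends a single
# unsolicited (gratuitous) ARP for our own address so that devices on the
# segment see a frame from this host.
announce_if_gateway_unreachable() {
    gateway=$1 iface=$2 own_ip=$3
    if ping -c 1 -W 2 "$gateway" >/dev/null 2>&1; then
        return 0
    fi
    # -U: unsolicited ARP mode; -c 1: send a single packet and exit.
    arping -U -c 1 -I "$iface" "$own_ip"
}
```

With the addresses seen later in this bug, a call could look like `announce_if_gateway_unreachable 10.8.82.1 eth0 10.8.82.11`.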
Comment 8•11 years ago
Wanted to revisit this bug - I think a simple solution might be to cron arping to send some arp packets every few minutes. See attached proof of concept patch.
Comment 9•11 years ago (Reporter)
Sounds like a great idea to me. I believe that the "deadline" argument should be replaced with a time in seconds, according to the man page; as is, this relies on only sending a single packet with "-c 1" for the command to exit. Also s/conetnts/contents/.
Comment 10•11 years ago
I think the keyword 'deadline' has special meaning to -w. I tested below and when I used '-w deadline' it stopped after sending 1 announcement. The intention behind this is to go easy on the l2 switch gear.
I tested this on web1.bugs:
[root@web1.bugs.phx1 ~]# arping -w deadline -c 1 10.8.82.1
ARPING 10.8.82.1 from 10.8.82.11 eth0
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 2.282ms
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 3.249ms
Sent 1 probes (1 broadcast(s))
Received 2 response(s)
[root@web1.bugs.phx1 ~]# arping -w 5 -c 1 10.8.82.1
ARPING 10.8.82.1 from 10.8.82.11 eth0
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 2.553ms
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 2.660ms
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 3.315ms
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 3.954ms
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 3.797ms
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 2.461ms
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 2.256ms
Sent 6 probes (1 broadcast(s))
Received 7 response(s)
[root@web1.bugs.phx1 ~]# arping -w 1 -c 1 10.8.82.1
ARPING 10.8.82.1 from 10.8.82.11 eth0
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 3.330ms
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 3.531ms
Unicast reply from 10.8.82.1 [00:10:DB:FF:10:00] 4.053ms
Sent 2 probes (1 broadcast(s))
Received 3 response(s)
[root@web1.bugs.phx1 ~]#
Comment 11•11 years ago
As discussed on IRC I dropped `-w deadline`, left -c 1, and added a flock. I think arping will re-try sending arp packets until it receives a response. If a network device was down for an extended period of time we might end up with tons of arping packets doing nefarious things to the network.
After this passes your review I'm going to run it by netops and submit to the CAB for deployment.
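For readers without access to the attachment, the approach described in this comment could look roughly like the following. This is not the attached patch: the function name and lock file path are assumptions, and flock here is the util-linux command.

```shell
# arp_keepalive GATEWAY IFACE [LOCKFILE]
# Meant to be run from cron every few minutes. flock -n exits immediately
# if a previous run still holds the lock, so arping processes cannot pile
# up while the gateway is unreachable.
arp_keepalive() {
    gateway=$1 iface=$2 lockfile=${3:-/var/lock/arp_keepalive.lock}
    # -q: quiet; -c 1: send a single ARP request, with no -w deadline.
    flock -n "$lockfile" arping -q -c 1 -I "$iface" "$gateway"
}
```

The non-blocking lock is the piece that addresses the worry about retries: at most one arping is ever in flight per host.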
Attachment #8415347 -
Attachment is obsolete: true
Attachment #8415548 -
Flags: review?(eziegenhorn)
Comment 12•11 years ago (Reporter)
Comment on attachment 8415548 [details] [diff] [review]
arping.patch
Review of attachment 8415548 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good to me.
Attachment #8415548 -
Flags: review?(eziegenhorn) → review+
Comment 13•11 years ago
Comment on attachment 8415548 [details] [diff] [review]
arping.patch
Adam,
I would like your input here - I intend to present this to the CAB tomorrow and I would like your (netops) input before I move forward.
Attachment #8415548 -
Flags: review?(adam)
Comment 14•11 years ago
Per conversation with :digi, we'll be looking into the problem further. We're interested in finding out why the MAC address going away causes a problem, since that is intended behavior on the networking equipment.
Updated•10 years ago
Attachment #8415548 -
Flags: review?(adam)
Comment 15•10 years ago
Spoke with :adam - I think the tl;dr is that this behavior is a feature, not a bug. Please re-open if you feel otherwise.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INVALID
Comment 16•10 years ago
Just an update - spoke to ericz about this, we'll leave this as R/I for now. If the issue comes up again during a future network event we'd like to revisit this and get a plan in place to force arp announcements faster or delay the purging of cached entries.