【Problem Description】
The customer has two data centers (DCs): the "home DC" (DC SPB) and the "guest DC" (DC MSK).
Each DC has a CSS of two CE12808 switches in the core.
The DCs are connected by a backbone network, and DCI is configured using EVN in all-active gateway mode.
VRRP MAC filtering is configured between the two DCs.
The gateway for internal subnets is the VRRP virtual IP on the CE12808.
At DC SPB (home), devices advertise the subnet 10.245.8.128/25 to the backbone.
At DC MSK (guest), devices advertise only /32 host routes to the backbone, generated from ARP entries (arp direct-route enable detect virtual-ip). This DC does not advertise the whole /25 subnet to the backbone.
The customer moved a few VMs from DC SPB (home) to DC MSK (guest). All routes switched automatically and no problems occurred.
However, after some time in operation the customer found a problem: the ARP table on devices in DC MSK (guest) was updated incorrectly, which made servers unreachable. Details are described below.
The main question is how to protect the network from such behavior and allow devices to automatically restore access to migrated VMs without a long service interruption.
【Problem Analysis】
Server1 migrates to Server2, and access works fine.
After the ARP entries on device CE6800-2 are reset, users cannot telnet to Server2, because CE6800-2 no longer has the ARP entry that is needed to create the host route on the NE.
After a ping operation triggers ARP learning on CE6800-2, access works fine again.
In the normal state, the ARP entry is kept even if there is no traffic between CE6800-2 and Server2 for a long time: the ARP aging time is 20 minutes, and when there is no traffic the switch triggers ARP aging detection before the aging time expires, so the entry is refreshed.
If the ARP entries are reset on CE6800-2, Server2 does not know that its entry was deleted on the switch. Server2 has its own ARP aging time and triggers ARP aging detection before that time expires; any traffic in between (for example a ping) also triggers ARP learning. But if there is no traffic on either the server side or the switch side during this period, it may take a long time (up to the remaining aging time) before the server triggers ARP learning.
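The worst case described above can be illustrated with a small timing model. This is only a sketch under simplifying assumptions: the function name and its parameters are hypothetical, the server is assumed to re-probe its ARP entry once per aging period, and any traffic is assumed to trigger ARP learning immediately. The server aging time of 1200 seconds in the usage note is an illustrative value, not a measured one.

```python
def recovery_delay_s(server_aging_s, since_last_server_probe_s, next_traffic_in_s=None):
    """Estimate the seconds between the ARP reset on the switch and relearning.

    server_aging_s: ARP aging time on the server (assumed value).
    since_last_server_probe_s: time since the server last refreshed its entry.
    next_traffic_in_s: time until the next traffic, or None if there is none.
    """
    if not 0 <= since_last_server_probe_s <= server_aging_s:
        raise ValueError("since_last_server_probe_s must be within the aging time")

    # Without traffic, recovery waits for the server's next aging detection.
    until_server_probe = server_aging_s - since_last_server_probe_s
    if next_traffic_in_s is None:
        return until_server_probe

    # Traffic (for example a ping) can trigger ARP learning earlier.
    return min(until_server_probe, next_traffic_in_s)
```

With an assumed server aging time of 1200 seconds and a refresh 100 seconds before the reset, `recovery_delay_s(1200, 100)` gives 1100 seconds without traffic, while `recovery_delay_s(1200, 100, next_traffic_in_s=30)` gives 30 seconds.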
In version V2R3 there is a command, "arp smart-discover enable", configured in the VLANIF view. It makes the switch actively probe for the VM or server when there is no ARP entry. However, it is not recommended for normal operation on a very large network, because handling the ARP packets consumes CPU resources.
We think the VM or server should send packets itself in the normal state, for example GARP.
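As an illustration of what such a GARP packet looks like, the following is a minimal sketch that builds a gratuitous ARP request frame with the Python standard library. The function names, the MAC address, the interface name, and the host address 10.245.8.130 are hypothetical examples, and the send helper assumes a Linux host with root privileges. In practice the hypervisor or the guest operating system normally sends GARP itself.

```python
import socket
import struct


def build_garp_frame(mac: str, ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)

    # Ethernet header: broadcast destination, sender MAC, EtherType 0x0806 (ARP).
    eth_header = b"\xff" * 6 + mac_bytes + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), lengths 6 and 4, opcode 1 (request).
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)

    # In a gratuitous ARP the sender IP and the target IP are the same address.
    arp_body = mac_bytes + ip_bytes + b"\x00" * 6 + ip_bytes
    return eth_header + arp_header + arp_body


def send_garp(interface: str, mac: str, ip: str) -> None:
    """Send one GARP frame (Linux only, requires root)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((interface, 0))
        sock.send(build_garp_frame(mac, ip))
```

For example, `send_garp("eth0", "00:11:22:33:44:55", "10.245.8.130")` would broadcast one frame from the migrated VM so that the switch can relearn its ARP entry.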
If the goal is to avoid a long service interruption in this situation, ping is also a good way to shorten the interruption during the migration. After that, however, step (a) should still be applied.
【Root Cause】
When the ARP entries are reset on device CE6800-2, Server2 does not know that its entry was deleted on the switch. Server2 has its own ARP aging time and triggers ARP aging detection before that time expires; any traffic in between (for example a ping) also triggers ARP learning. But if there is no traffic on either the server side or the switch side during this period, it may take a long time (up to the remaining aging time) before the server triggers ARP learning.
【Solution Description】
The VM should send packets (e.g. GARP) to trigger ARP learning. Currently the server triggers ARP learning only through ARP aging detection, so it takes a long time for ARP learning to be triggered on the server side.
If the server reduces its GARP timer, ARP learning will happen sooner when a VM is migrated.
Ping is also a good way to reduce the service interruption time.
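The ping approach could be automated with a small script, for example run on the migrated VM against its gateway. This is a sketch only: the function names and the interval are hypothetical, and the `-c` and `-W` options assume the Linux iputils version of ping.

```python
import subprocess
import time


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Send a single ICMP echo request and report whether a reply arrived."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def keepalive(host, interval_s, count, ping=ping_once, sleep=time.sleep):
    """Ping the host `count` times, waiting `interval_s` seconds after each ping."""
    results = []
    for _ in range(count):
        # Each ping generates traffic, which gives the devices a chance to relearn ARP.
        results.append(ping(host))
        sleep(interval_s)
    return results
```

An interval well below the ARP aging time keeps traffic flowing between the VM and the switch, at the cost of a small amount of extra ICMP traffic.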