Hello everyone,
I'd like to share with you a case about packet loss during BGP route switching.
Problem

As shown in the preceding networking diagram, the three devices are connected through Layer 3 Eth-Trunk interfaces. S12700_1 is using a NEP board. The OSPF process advertises the addresses of all interfaces (including loopback interfaces).
An IBGP peer relationship is established between S12700_2 and S12700_1, and an EBGP peer relationship is established between S12700_2 and the tester Test2/2. The BGP fast refresh function is enabled by default.
Use Test2/2 to send routes to S12700_2 over EBGP; S12700_2 then sends them to S12700_1 as IBGP routes. Use Test2/1 to send Layer 3 traffic to Test2/2.
The traffic on the active path is sent to S12700_2 through Eth-Trunk11 of S12700_1, and the traffic on the standby path is sent to S6720EI through Eth-Trunk22 and then forwarded to S12700_2.
Simulate a failure of the active path by shutting down Eth-Trunk11 on S12700_2 while the tester is sending traffic. The traffic switches to the standby path.

Then bring Eth-Trunk11 on S12700_2 back up so that the traffic switches back to the active path.

Packet loss lasts for about 100 ms.
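
To relate the loss duration to the lost-packet count that a tester reports, a rough conversion can be sketched as follows. The function name and the 10,000 pps rate are assumptions for illustration only; the rate used in this test is not stated in this post.

```python
def estimate_lost_packets(loss_ms: int, rate_pps: int) -> int:
    """Estimate packets lost during an outage of loss_ms milliseconds
    when traffic is sent at a constant rate of rate_pps packets per second."""
    return loss_ms * rate_pps // 1000


# Hypothetical rate: 10,000 pps during a 100 ms loss window
print(estimate_lost_packets(100, 10_000))
```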

In addition, packet loss also occurs when the BGP fast refresh function is disabled, but fewer packets are lost than when it is enabled.
After the fault is rectified, the Eth-Trunk interface goes Up and immediately sends gratuitous ARP packets. However, for security and attack prevention reasons, a Layer 3 main interface does not learn ARP entries from gratuitous ARP packets. As a result, the local end fails to learn the ARP entry of the peer end.
However, because the OSPF network type of the interfaces is P2P, all OSPF packets are sent in multicast mode. The OSPF neighbor relationship can therefore be established, and the OSPF routes from the peer end can be learned, without learning the ARP entry of the peer end. As a result, the BGP route is immediately iterated to the new outbound interface, but the ARP entry for the next hop on that interface has not been learned yet, so packets are lost. The traffic recovers only after the ARP entry is learned. When BGP fast refresh is enabled, route switching is faster, so more packets are discarded before the ARP entry is learned.
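
The timing relationship described above can be illustrated with a small model. This is only a sketch: the function name and all timestamps are hypothetical values chosen for illustration, not measurements from this case. Times are in milliseconds after the Eth-Trunk goes Up.

```python
def loss_window_ms(route_switch_ms: float, arp_learned_ms: float) -> float:
    """Return how long traffic is forwarded to the new outbound interface
    before the next-hop ARP entry exists (0 if ARP is learned first)."""
    return max(0.0, arp_learned_ms - route_switch_ms)


# Hypothetical timings (ms after the interface goes Up), for illustration only
ARP_LEARNED_LATE = 150.0   # main interface ignores gratuitous ARP, so ARP is learned late
ARP_LEARNED_EARLY = 20.0   # ARP learned before the route switches

# Faster route switching (fast refresh enabled) gives a longer loss window
print(loss_window_ms(route_switch_ms=50.0, arp_learned_ms=ARP_LEARNED_LATE))
# Slower route switching (fast refresh disabled) gives a shorter loss window
print(loss_window_ms(route_switch_ms=120.0, arp_learned_ms=ARP_LEARNED_LATE))
# If the ARP entry is learned before the route switches, nothing is lost
print(loss_window_ms(route_switch_ms=50.0, arp_learned_ms=ARP_LEARNED_EARLY))
```

In this model, the loss is removed by making sure the ARP entry is learned before the route switches, not by slowing down route convergence.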
Solution
Solution 1: Change the Layer 3 main interface on the live network to a VLANIF interface. A VLANIF interface supports gratuitous ARP learning, so it learns the ARP entry from the gratuitous ARP packet sent by the peer end as soon as the interface goes Up. In this way, packet loss does not occur.
Solution 2: Change the OSPF network type to broadcast. After the change, DD packets are sent in unicast mode while the OSPF neighbor relationship is being established. This triggers an ARP miss, so the ARP entry is learned before the route is switched. In this way, no packet is lost during the switchback.
I hope it is of help to you.


