How to handle bond negotiation failure in LACP mode


This post describes how to handle a bond negotiation failure in LACP mode. Please read below for more information.


Symptom


The site runs version fs6.3.0, and the upper layer is configured with a bond in LACP mode. However, LACP negotiation with the switch is abnormal, so an alarm is generated.


[Figure omitted]


Cause analysis


The possible causes are as follows:


•   LACP aggregation is not correctly configured on the switch.


•   Packet loss occurs on the physical network adapter.


•   The NIC driver reports an error, which may be caused by a hardware fault or a firmware mismatch.

Impact and risks


LACP negotiation is abnormal, which affects the external communication of the VM.


Solution


Step 1. Log in to the host as the root user.


Step 2. Run the following command to check the bond configuration. Confirm that the bond mode is LACP and note which network adapters make up the bond. In the following figure, the slave network adapters are eth2 and eth3.


cat /proc/net/bonding/trunkX


[Figure: output of cat /proc/net/bonding/trunkX, showing the slave network adapters eth2 and eth3]
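
The check in Step 2 can also be condensed into a short helper. This is only a minimal sketch: the function name bond_summary is made up for illustration, and it assumes the bonding status file uses the usual "Bonding Mode:" and "Slave Interface:" labels.

```shell
# Summarize a bonding status file: print the bonding mode and each slave.
# bond_summary is a hypothetical helper name, not part of any product.
bond_summary() {
    awk -F': ' '
        /^Bonding Mode:/    { print "mode=" $2 }
        /^Slave Interface:/ { print "slave=" $2 }
    ' "$1"
}
```

For example, run bond_summary /proc/net/bonding/trunkX on the host. For a bond in LACP mode, the mode line should report IEEE 802.3ad Dynamic link aggregation.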


Step 3. Contact the physical switch engineers to confirm that LACP aggregation is configured on the switch, that is, that both the Eth-Trunk and LACP are configured.

 

In LACP mode, the switch must support IEEE 802.3ad dynamic link aggregation. For how to configure LACP aggregation on a switch, see the usage guide of the corresponding switch.


Identification method: Capture packets on the network adapter. If only the host sends LACP negotiation packets (the source is the MAC address of the physical NIC) and no reply is received, the switch is not taking part in the negotiation, as shown in the following figure:


[Figure: packet capture in which only the host sends LACP negotiation packets and no reply is received]
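
A capture like the one above could be taken roughly as follows. This sketch assumes tcpdump is installed on the host (the original post does not say which capture tool was used), and the function names are hypothetical. LACP frames use the Slow Protocols EtherType 0x8809, so the filter matches on that.

```shell
# Capture a few LACP frames on one slave adapter (needs root and tcpdump).
# -e prints the Ethernet header so the source MAC of each frame is visible.
# Arguments: adapter name, number of frames to capture (default 10).
capture_lacp() {
    tcpdump -i "$1" -e -nn -c "${2:-10}" ether proto 0x8809
}

# List the distinct source MACs found in "tcpdump -e" output read from stdin.
lacp_sources() {
    awk '$3 == ">" { print $2 }' | sort -u
}
```

For example, capture_lacp eth2 | lacp_sources. If the only MAC listed is that of the host's physical NIC, the switch is not replying. Note that with a slow LACP rate the capture may take several minutes to collect the requested number of frames.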


Step 4. Check whether the switch network is correct. Run the following commands to check the link status of each bond slave network adapter. Normally, the value of Link detected is yes, which means that the network adapter is up and the connection to the switch is normal. Then execute the eth-trunk down operation on the switch side and run the commands on the host again to check whether Link detected changes to no for both network adapters. If both values change to no, the switch network is as expected. Otherwise, the switch configuration is incorrect and a different port is in use, which must be rectified.


ethtool eth2

ethtool eth3


[Figure: ethtool output showing the value of Link detected]
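
To compare the link state before and after the Eth-Trunk is shut down, the two ethtool calls can be wrapped as below. This is a sketch with hypothetical function names. It relies only on the "Link detected:" line that ethtool prints.

```shell
# Print the "Link detected" value (yes/no) from ethtool output on stdin.
link_state() {
    awk -F': ' '/Link detected:/ { print $2 }'
}

# Report the link state of every slave adapter passed as an argument.
check_links() {
    for nic in "$@"; do
        printf '%s: %s\n' "$nic" "$(ethtool "$nic" | link_state)"
    done
}
```

For example, run check_links eth2 eth3 once before and once after the switch-side change and compare the two results.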


Step 5. Run the following command to check the NIC driver and firmware versions. If the driver name is be2net, the driver and firmware versions must match. Obtain the specific version mapping from the hardware vendor.


ethtool -i eth2


[Figure: output of ethtool -i eth2]
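
When collecting this information for the hardware vendor, the three relevant fields can be pulled onto one line. The helper below is a sketch with a hypothetical name. It assumes the usual ethtool -i field labels driver, version, and firmware-version.

```shell
# Reduce "ethtool -i" output (stdin) to driver name, driver version,
# and firmware version on a single line.
nic_driver_info() {
    awk -F': ' '
        $1 == "driver"           { d = $2 }
        $1 == "version"          { v = $2 }
        $1 == "firmware-version" { f = $2 }
        END { printf "driver=%s version=%s firmware=%s\n", d, v, f }
    '
}
```

For example: ethtool -i eth2 | nic_driver_info. Whether a given driver and firmware pair is compatible still has to be confirmed with the vendor.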


Run the following command to check whether the NIC driver reports an error. A common case is UE errors printed by the be2net driver. If there is a driver error, contact the hardware vendor to check whether the hardware is faulty or the driver version does not match the firmware version.


dmesg | grep "driver_name"

Replace driver_name with the network adapter driver name displayed by the previous command, for example: ixgbe, be2net, bnx2x, i40e, or mlx4_en.


[Figure: dmesg output filtered by the NIC driver name]
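
The driver lookup and the log search can be chained so that the driver name does not have to be typed by hand. This is a sketch only, and both function names are hypothetical.

```shell
# Keep only the kernel log lines (stdin) that mention the given driver name.
driver_messages() {
    grep -F -- "$1"
}

# Look up the driver of an adapter with ethtool -i, then filter dmesg by it.
nic_driver_log() {
    drv=$(ethtool -i "$1" | awk -F': ' '$1 == "driver" { print $2 }')
    dmesg | driver_messages "$drv"
}
```

For example: nic_driver_log eth2. The output still has to be read manually to decide whether a line is an error.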


Step 6. Run the following command repeatedly to check whether the numbers of packets received and sent by the adapter increase, and whether any abnormal statistics increase.


ifconfig eth2


[Figure: output of ifconfig eth2, in which the value of RX dropped increases]


If the value of RX packets does not increase, the switch is not sending negotiation packets, and the switch should be checked.

 

If the value of TX packets does not increase, the LACP bond is not sending negotiation packets, and the bond configuration should be checked.
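
Instead of comparing ifconfig output by eye, the growth of the RX and TX packet counters can be measured over a fixed interval. The sketch below reads the standard Linux counters under /sys/class/net. The function name and the optional third argument (added only so that the function can be tried against a test directory) are assumptions for illustration.

```shell
# Print how much rx_packets and tx_packets grew on an adapter over an interval.
# Arguments: adapter name, interval in seconds (default 5),
#            statistics root (default /sys/class/net).
packet_growth() {
    nic=$1; secs=${2:-5}; root=${3:-/sys/class/net}
    dir=$root/$nic/statistics
    rx1=$(cat "$dir/rx_packets"); tx1=$(cat "$dir/tx_packets")
    sleep "$secs"
    rx2=$(cat "$dir/rx_packets"); tx2=$(cat "$dir/tx_packets")
    echo "rx_delta=$((rx2 - rx1)) tx_delta=$((tx2 - tx1))"
}
```

For example: packet_growth eth2 10. A delta of 0 for RX or TX corresponds to the two cases described above.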

 

If abnormal statistics increase (in the preceding figure, the value of RX dropped increases), run the following command to check which packets are lost. In this example, the loss is caused by an increase in rx_fifo_errors.

 

ethtool -S eth2 | grep -e drop -e error -e discard


[Figure: ethtool -S output in which rx_fifo_errors increases]
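
Because ethtool -S prints cumulative counters, it is easier to see which one is growing by comparing two snapshots. The following sketch (hypothetical function name) prints only the counters whose value increased between two saved outputs.

```shell
# Compare two saved "ethtool -S" snapshots and print counters that increased.
# Arguments: file with the earlier snapshot, file with the later snapshot.
grown_counters() {
    awk -F': *' '
        NR == FNR { old[$1] = $2; next }          # first file: remember values
        ($1 in old) && ($2 + 0 > old[$1] + 0) {   # second file: report growth
            name = $1; sub(/^[ \t]+/, "", name)
            printf "%s +%.0f\n", name, $2 - old[$1]
        }
    ' "$1" "$2"
}
```

For example: ethtool -S eth2 > before; sleep 10; ethtool -S eth2 > after; grown_counters before after | grep -e drop -e error -e discard.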


The increase in rx_fifo_errors indicates that the network adapter queue is full, so check the host load. The top command shows that the overall load of the server is low. The sar -n DEV 1 command, which shows the traffic of each network adapter, reveals that the packet receiving rate of eth2 and eth3 is about 3 million pps. A loop on the external switch is therefore suspected.


[Figure omitted]
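
To spot adapters with an unusually high receive rate in the sar output, the rxpck/s column can be filtered against a threshold. This is a sketch: the function name is hypothetical and the threshold is whatever the operator considers abnormal for the site. Column positions are read from each header line, so 12-hour and 24-hour timestamp layouts are both handled.

```shell
# Read "sar -n DEV" output on stdin and print each adapter sample whose
# rxpck/s value exceeds the threshold given as the first argument.
busy_receivers() {
    awk -v limit="$1" '
        {
            hdr = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "IFACE")   { ifc = i; hdr = 1 }
                if ($i == "rxpck/s") { rxc = i }
            }
            if (hdr) next                 # header line: only record columns
        }
        ifc && rxc && NF >= rxc && $rxc + 0 > limit + 0 {
            print $ifc, $rxc
        }
    '
}
```

For example: LC_ALL=C sar -n DEV 1 5 | busy_receivers 1000000, where 1000000 is only an illustrative threshold.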


A check of the physical switch shows heavy traffic on multiple ports. Remove and reinsert some network cables as advised by the switch engineers to eliminate the loop. The packet receiving rate of the physical NIC on the host then returns to normal, no packet loss occurs on the physical network adapter, and LACP negotiation is normal.


[Figure omitted]


Summary and suggestions


For Xen, the bond is created by the nc component of FusionCompute.


In the KVM server virtualization scenario, FusionCompute delivers the ifcfg configuration, and the network service then creates the bond.


In the KVM private cloud scenario, the bond is created by the neutron component.


For a bond in the LACP mode, check based on the preceding steps.

