Hi Community!
I am pleased to share my experience troubleshooting an issue in which a software upgrade operation caused a physical port between a Huawei NE40E router and a Cisco switch to change status from up to down. Read on for the details.
ISSUE DESCRIPTION
The Network Operations Center (NOC) reported an incident in which many datacenter services went down. On the customer side, users could not access some applications or short codes and could not make voice calls. This degraded the customer experience and led to a loss of revenue. The incident happened during an upgrade operation.
HANDLING PROCESS
Because the incident was critical, the following steps were taken :
STEP 1 : The NOC immediately opened an incident ticket with severity tagged as critical and assigned it to the Back Office datacom team for investigation and resolution.
STEP 2 : The Back Office team confirmed that an upgrade operation was in progress and quickly informed the team of engineers handling the operation so that they could investigate and resolve the issue.
STEP 3 : The team of engineers in charge of the operation immediately started checking the logs on the NE40E using the "display logbuffer" command. They noticed that interface Gx/y/z, which connects to the Cisco switch, was down. It had changed state after the operation.

The down reason was reported as PCS_unLock, AutoNegotiation_Fail, which indicates that the issue is with the peer device.
Checking the interface with the "display interface" command confirmed that the interface had gone down.

STEP 4 : Since the down reason pointed to the peer device, the next step was to log in to the peer device, a Cisco switch, and check its logs. The logs specific to this event had the output below.

A loopback error was detected on the Cisco switch, which put the interface into the error-disabled state. This moved the interface from up to down.
STEP 5 : After checking the logs from the Huawei NE40E and the Cisco switch, our next move was to look at the configurations of the Huawei router to detect what was causing the loop.

STEP 6 : Since the loop was taking down many other services, we removed the VSI configuration from the interface using the "undo l2 binding vsi" command.
This brought the interface back up, leaving only the service linked to the VSI down.
STEP 7 : At this point, with the criticality reduced, we built the same environment on a test bed to investigate why the issue occurred. The same results were obtained with V600R009C20SPC600.
ROOT CAUSE
The MAC forwarding table of the Network Processor on the NE40E was checked and found to be incorrect.
When the Cisco switch sends a keep-alive message (used to detect loops) to the NE40E, the NE40E sends the keep-alive message back to the Cisco switch. The switch therefore detects a loop in the network and changes its interface to the error-disabled state (down state).

A packet capture on the Cisco switch shows that the source and destination MAC addresses are the same (loop).
This is a VRP issue in the V6R9 version. When the NE40E receives packets with the same source and destination MAC in the Ethernet header, the MAC forwarding table on the Network Processor chip becomes incorrect. The NE40E then forwards the packets back to the Cisco switch, which causes the loop.
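To make the mechanism concrete, here is a minimal Python sketch of the loop check described above. It is only an illustration, not the switch's actual implementation. It assumes the keep-alive is an Ethernet loopback frame (EtherType 0x9000) whose source and destination MAC are both the sending port's own MAC. The names healthy_peer and faulty_peer are hypothetical stand-ins for a correctly behaving router and for the NE40E behaviour seen in this case.

```python
import struct

# Assumption: the keep-alive is an Ethernet loopback frame (EtherType 0x9000).
LOOPBACK_ETHERTYPE = 0x9000


def build_keepalive(port_mac: bytes) -> bytes:
    """Build a keep-alive whose source and destination MAC are both the port's MAC."""
    if len(port_mac) != 6:
        raise ValueError("a MAC address must be 6 bytes")
    payload = bytes(46)  # zero padding up to the minimum Ethernet payload size
    return struct.pack("!6s6sH", port_mac, port_mac, LOOPBACK_ETHERTYPE) + payload


def parse_header(frame: bytes):
    """Return (destination MAC, source MAC, EtherType) from an Ethernet frame."""
    if len(frame) < 14:
        raise ValueError("frame is shorter than an Ethernet header")
    return struct.unpack("!6s6sH", frame[:14])


def is_own_keepalive(frame: bytes, port_mac: bytes) -> bool:
    """True if the received frame is the keep-alive this port sent out itself."""
    dst, src, ethertype = parse_header(frame)
    return ethertype == LOOPBACK_ETHERTYPE and src == port_mac and dst == port_mac


def healthy_peer(frame: bytes):
    """Hypothetical peer that drops the keep-alive (expected behaviour)."""
    return None


def faulty_peer(frame: bytes):
    """Hypothetical peer that forwards the keep-alive back out of the same link."""
    return frame


def port_state_after_keepalive(port_mac: bytes, peer) -> str:
    """Send one keep-alive and report the resulting port state."""
    received = peer(build_keepalive(port_mac))
    if received is not None and is_own_keepalive(received, port_mac):
        return "err-disabled"  # the port saw its own keep-alive: loop detected
    return "up"


if __name__ == "__main__":
    mac = bytes.fromhex("001122334455")  # made-up MAC, for illustration only
    print(port_state_after_keepalive(mac, healthy_peer))
    print(port_state_after_keepalive(mac, faulty_peer))
```

With the peer that drops the frame the port stays up. With the peer that reflects it, the port sees its own keep-alive and is error-disabled, which matches the behaviour observed between the Cisco switch and the NE40E.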
TEMPORARY SOLUTION
Move the VSI configuration from the main interface to a sub-interface on the NE40E
Stop the Cisco switch from sending keep-alive messages by using the "no keepalive" command
SOLUTION
The bug was corrected in the patch release : V600R009SPH018




