Handling Process
1. Capture traffic on the sw2 device using port mirroring, starting at 00:55 hrs.
· Perform tracert and display the route from the sw2 device, using the source interface next to the RNC, and do the same from the RNC. This is to be performed before and after the failure.
· Perform tracert and display the route from the PE of the fixed network and from the RNC. This is to be performed before and after the failure.
· Run shutdown, test, and then undo shutdown on each vlanif in turn, until the faulty VLAN is found.
2. For these tests the customer replaced the NPGEP2 (RNC 2) with their laptop: GE3/0/18 -> UCHL2_NPGEP3_IFGE1 (VLAN 520 and 891). The IP address used is 10.178.199.46/30 (at the moment of the problem, the ping to the GGSN continued OK).
After testing and analysis, the root cause was confirmed: when the device is configured with dot1q termination plus load balancing, the V6R7 software version sometimes makes a mistake while updating the FIB table, which can affect data forwarding.
Root Cause
When the device tries to switch from the “single nexthop mapping table” to the “load-balancing multi-nexthop mapping table”, it mis-releases the index of the “single nexthop mapping table”.
If another route applies for an index resource inside the device and is given this wrongly released index, the forwarding service will have forwarding problems.
Scenarios that trigger the problem:
a. When route A switches to the “load-balancing multi-nexthop mapping table”, the device mis-releases the index of the “single nexthop mapping table” (A doesn’t need the index in the data forwarding plane, but A mis-releases the index while still remembering it as its own index).
b. A new route B is being established and applies for an index resource inside the device; it is given the old index that belonged to A.
c. When the routing table is refreshed, both A and B try to refresh the route item, and the order in which they do so causes different problems:
· If route A refreshes the route item and then B refreshes it, route A causes a very short problem in the forwarding table; the service is interrupted and recovers at once.
· If route B refreshes the route item and then A refreshes it, the normal information of route B is overwritten by route A; the service is interrupted and never recovers by itself.
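The scenario above can be illustrated with a small Python model. This is only a sketch under assumptions, not the device's actual implementation: the names IndexPool, Route and simulate are hypothetical, the forwarding table is modelled as a plain dictionary, and the pool is assumed to hand out the most recently released index first.

```python
class IndexPool:
    """Hypothetical index allocator; reuses the most recently freed index first."""

    def __init__(self, size):
        self.free = list(range(size))

    def alloc(self):
        return self.free.pop(0)

    def release(self, idx):
        self.free.insert(0, idx)


class Route:
    def __init__(self, name, nexthop):
        self.name = name
        self.nexthop = nexthop
        self.index = None  # index this route believes it owns


def simulate(refresh_order):
    """Return the next hop stored at route B's index after the refresh."""
    pool = IndexPool(4)
    table = {}  # index -> next hop (stand-in for the forwarding table)

    a = Route("A", "nh-A")
    b = Route("B", "nh-B")

    # Route A starts with a single-nexthop entry.
    a.index = pool.alloc()
    table[a.index] = a.nexthop

    # Step a: A switches to load balancing. The index is released,
    # but A keeps remembering it as its own (the modelled defect).
    pool.release(a.index)

    # Step b: new route B applies for an index and gets A's old one.
    b.index = pool.alloc()
    table[b.index] = b.nexthop

    # Step c: both routes refresh "their" entry in the given order.
    routes = {"A": a, "B": b}
    for name in refresh_order:
        route = routes[name]
        table[route.index] = route.nexthop

    return table[b.index]


print(simulate(["A", "B"]))  # nh-B: B's entry is restored after a brief error
print(simulate(["B", "A"]))  # nh-A: B's entry stays overwritten by A
```

In this model the last writer wins, which is why the A-then-B order heals itself while the B-then-A order leaves route B forwarding with route A's information until something else rewrites the entry.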
Solution
【Resolution Summary】
1.1 Temporary Solution
1. Disable the periodic FIB refresh; this has been confirmed to have no impact on live network services.
2. Because route flapping also triggers this problem, disabling the periodic FIB refresh cannot completely prevent the problem from recurring.
3. The FIB mistake is triggered by dot1q termination plus sub-interface load balancing, and in the live network the dot1q termination sub-interface currently terminates only one VLAN. The customer can therefore change the sub-interface mode from dot1q-termination to vlan-type dot1q to avoid the problem. After changing the interface mode, the board must be reset for the configuration to take effect.
【Resolution Details】
Develop a new patch to solve the problem.
The planning and effort estimation for the solution are under internal discussion.