Good day!
This post describes a case in which the ISIS protocol flapped on an Internet Gateway Router (IGW) and disturbed internet service. Please see the details below.
ISSUE DESCRIPTION
The Internet Gateway Router (IGW) in the network setup connects to the following equipment to provide internet service to the end-customer:
Provider Edge routers, which act as the Service Gateways, mostly through iBGP connections;
the upstream provider through an eBGP connection.
Our Network Operations Center noticed fluctuations in the internet connectivity graph, and during the impact period we also received complaints from customers about poor internet experience.
HANDLING PROCESS
Step 1. The Network Operations Center noticed some fluctuation on the internet graph and immediately created an incident ticket, which was assigned to the back office team for investigation and resolution.
Step 2. The Back Office team logged on to the Internet Gateway Router, since it is the convergence point for internet connectivity, and issued the display logbuffer command to check the recent logs on the equipment. The log can be seen below, showing the different times of occurrence.

The log shows that BFD was flapping on the interface linking the Internet Gateway Router and the Provider Edge router. As a consequence, the ISIS protocol configured between the two router interfaces also flapped.
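The link between the two symptoms can be checked by pairing each ISIS flap with a BFD flap that happened just before it. The following is only a minimal sketch in Python: the function name, the 5-second window, and the timestamps are assumptions made for illustration, and none of the values come from the actual log of this case.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def match_isis_to_bfd(
    bfd_down: List[datetime],
    isis_down: List[datetime],
    window: timedelta = timedelta(seconds=5),  # assumed correlation window
) -> Dict[datetime, Optional[datetime]]:
    """Map each ISIS-down time to the latest BFD-down time that precedes it
    by at most `window`, or to None if no such BFD event exists."""
    matches: Dict[datetime, Optional[datetime]] = {}
    for isis_time in isis_down:
        candidates = [
            bfd_time
            for bfd_time in bfd_down
            if timedelta(0) <= isis_time - bfd_time <= window
        ]
        matches[isis_time] = max(candidates) if candidates else None
    return matches


if __name__ == "__main__":
    # Hypothetical timestamps, NOT taken from the incident log.
    bfd_events = [datetime(2024, 1, 1, 10, 0, 0), datetime(2024, 1, 1, 10, 15, 30)]
    isis_events = [datetime(2024, 1, 1, 10, 0, 1), datetime(2024, 1, 1, 10, 15, 31)]
    for isis_time, bfd_time in match_isis_to_bfd(bfd_events, isis_events).items():
        print(isis_time, "<-", bfd_time)
```

If every ISIS flap maps to a BFD flap, that supports reading the ISIS flapping as a consequence of the BFD flapping rather than as an independent fault.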
Step 3. A deeper check of the diagnose logs was conducted to understand why the BFD session was flapping. Each time the BFD flapping occurred, there was an LPU TM chip soft reset, and packet forwarding is affected while the TM chip resets. This is shown in the screenshot below:

Step 4. After noticing this, the next step was to check the patch release notes for the running patch version (V800R011SPH032) or a higher version and verify the conditions that cause the TM chip to reset. The release notes of the higher version, V800R011SPH036, clearly state that the conditions that cause this reset are related to:
the type of board running on the node;
the presence of blackhole routes in the configuration of the device.
An extract of the document is shown in the screenshot below:

Step 5. Check the patch version running on the IGW:

The above shows that we are running a patch version lower than the one indicated in the patch release notes, so the running patch is also affected by the conditions that cause the reset of the TM chip.
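The version comparison in this step can be expressed as a small check on the SPH number. This is a minimal sketch in Python, assuming patch names follow the pattern seen in this post (for example V800R011SPH032); the function names are hypothetical and the optional Cxx field in the pattern is an assumption about other naming variants.

```python
import re
from typing import Tuple

# Base version (e.g. V800R011), optional Cxx field (assumed), then the SPH number.
PATCH_RE = re.compile(r"^(V\d+R\d+(?:C\d+)?)SPH(\d+)$")


def parse_patch(version: str) -> Tuple[str, int]:
    """Split a patch name into its base version and numeric SPH level."""
    match = PATCH_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized patch version: {version!r}")
    return match.group(1), int(match.group(2))


def is_older_patch(running: str, reference: str) -> bool:
    """Return True if `running` has a lower SPH level than `reference`.

    Both patches must belong to the same base version; comparing patches
    across different base versions is refused rather than guessed.
    """
    running_base, running_sph = parse_patch(running)
    reference_base, reference_sph = parse_patch(reference)
    if running_base != reference_base:
        raise ValueError("patches belong to different base versions")
    return running_sph < reference_sph


if __name__ == "__main__":
    print(is_older_patch("V800R011SPH032", "V800R011SPH036"))
```

Comparing the SPH level as an integer avoids mistakes that a plain string comparison could make if the number of digits ever differs.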
Step 6. Next, we checked the board types on the device where the flapping occurred. The device had slot1_CR57LPUF50C, slot2_CR57LPUF50C and slot3_CR57LPUF120A, which match the board condition described in the release notes. This was done using the display device slot_ID command:

Step 7. We then checked the second condition, which is the presence of blackhole routes. Checking the configuration with the display current-configuration command showed that some blackhole routes were configured:

ROOT CAUSE
The root causes of the issue are:
the patch version running on the device;
the boards used on the node;
the presence of the blackhole route configuration on the device.
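The third condition can be checked quickly on a saved copy of the configuration. The sketch below is in Python and assumes that blackhole routes appear as static routes whose outgoing interface is NULL0; the function name and the sample routes (documentation prefixes) are hypothetical and are not taken from the device in this case.

```python
from typing import List


def find_blackhole_routes(config_text: str) -> List[str]:
    """Return the static-route lines that point to the NULL0 interface.

    Assumption: a blackhole route is written as an "ip route-static ..."
    line that contains NULL0 as its outgoing interface.
    """
    routes = []
    for line in config_text.splitlines():
        tokens = line.split()
        is_static_route = tokens[:2] == ["ip", "route-static"]
        if is_static_route and any(t.upper() == "NULL0" for t in tokens[2:]):
            routes.append(line.strip())
    return routes


if __name__ == "__main__":
    # Hypothetical configuration extract, for illustration only.
    sample_config = """
    ip route-static 192.0.2.0 255.255.255.0 NULL0
    ip route-static 198.51.100.0 255.255.255.0 10.0.0.1
    ip route-static 203.0.113.0 24 NULL0 preference 250
    """
    for route in find_blackhole_routes(sample_config):
        print(route)
```

An empty result would suggest that the blackhole-route condition is not met, although the configuration should still be reviewed by hand before drawing that conclusion.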
SOLUTION
Upgrade the patch on the node from V800R011SPH032 to V800R011SPH058, which solves this issue. Because the issue had already occurred, the affected LPU boards also need to be reset.


