ECC storm troubleshooting and preventive measures

Latest reply: Oct 17, 2021 05:43:35

Hello, everyone. 

This post will analyze the ECC storm phenomenon, its causes and its solutions. Then, based on the characteristics of the ECC protocol, it will provide reasonable suggestions from the perspective of network planning in order to reduce the probability of ECC storm problems and improve the efficiency of solving such problems.

Problem Description

Symptom 1: A large number of NEs go offline.

After an ECC storm occurs, a large number of NEs go offline. In most cases, no NE except the gateway NE can be logged in to. Querying the routing table shows random routes whose distances are very large.

[Figure: ECC storm]

Symptom 2: Network flapping.

The ECC network flaps and a large number of ECC route update packets are continuously transmitted on the network. As a result, the ECC routes of some NEs change continuously and the ECC communication is intermittent and unstable.

ECC storms occur on both SDH and WDM networks because both use the HWECC protocol to transmit network management information.

Cause analysis

ECC storms are caused by the following reasons:

1. The ECC network scale is too large

Like RIP in IP networks, HWECC is a distance-vector protocol. It cannot prevent routing loops, and it must broadcast bulk routing information after every route change. HWECC is therefore designed as an IGP routing protocol for small networks. With RIP, a route spans at most 15 hops (a distance of 16 means unreachable). With HWECC, however, a route spans up to 64 hops by default. After a route change, looping invalid routing information must therefore travel 64 hops before it is discarded. On a large network, this places a disastrous load on the network bandwidth and leaves little bandwidth for ordinary routing data. In addition, the other NEs need bandwidth to set up new MAC connections and routes. A vicious circle forms: the routes fail to converge for a long time, and an ECC storm finally occurs. This frequently happens when a fiber is cut or when a main control board or line board is being replaced.
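To make the hop-limit argument concrete, here is a toy Python sketch. It is not HWECC itself: it assumes that a stale route bounces around a loop, gains one unit of distance per exchange, and is dropped once its distance reaches the configured maximum. The function name, the starting distance of 2, and the drop rule are assumptions for illustration only.

```python
def exchanges_until_discarded(max_distance: int, stale_distance: int = 2) -> int:
    """Count update exchanges before a looping stale route reaches max_distance.

    Toy model (assumption): every exchange adds 1 to the advertised distance,
    and the route is discarded when the distance equals max_distance.
    """
    if max_distance < 1 or stale_distance < 1:
        raise ValueError("distances must be at least 1")
    distance = stale_distance
    exchanges = 0
    while distance < max_distance:
        distance += 1      # one more trip between neighbours in the loop
        exchanges += 1
    return exchanges


for limit in (5, 16, 64):
    print(limit, exchanges_until_discarded(limit))
# prints:
# 5 3
# 16 14
# 64 62
```

In this model, a limit of 64 keeps the stale route alive for 62 exchanges, compared with 14 for a RIP-style limit of 16 and 3 for a limit of 5. Every one of those exchanges consumes ECC bandwidth on every NE in the loop.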

2. In a complex or ring network, ECC storms may occur when fibers are severely degraded

When the optical line has many bit errors or severe packet loss, the ECC protocol generates disordered or outdated routes when broadcasting packets. On a ring network, these routes are forwarded around the ring in the other direction and take a long time to disappear, which is why many routes with very large distances appear. The more bit errors there are, the more disordered routes are generated, and the ring network amplifies them. As a result, the routes of NEs switch frequently. The ECC routing task has a higher priority than the login task. When routes on an NE switch frequently, the routing task keeps the CPU busy and the lower-priority login task cannot run for a long time. As a result, the NE is unreachable from the NMS.

3. NE ID conflict

NEs identify each other by NE ID for ECC communication, so each NE must have a unique ID. If IDs are duplicated, ECC route calculation fails and the NMS cannot find the destination when delivering connection requests. As a result, a large number of connection request packets are generated on the network, and an ECC storm occurs.

4. Communication with a large amount of data

When the configuration data of multiple NEs is uploaded or the version of multiple NEs is upgraded at the same time, a large amount of data is transmitted on the network, causing network congestion. In this case, ECC storms may occur.

Of these four causes, the first and second are the most likely to trigger an ECC storm. The third and fourth are relatively unlikely.


Solution

The solution consists of two steps:

Step 1: Restore NMS management of the NEs.

Method 1: Set the maximum distance of ECC.

The maximum distance of ECC is 64 by default. An actual network does not require such a large distance. In addition, the maximum distance affects the search scope of ECC routes.

Reducing the maximum distance of ECC narrows the range over which ECC routes are refreshed on the network, which reduces the probability of an ECC storm. When an ECC storm occurs, set the maximum distance of ECC to 5. After the network becomes stable, increase the maximum distance gradually.

Run the :cm-set-maxdist command to set the maximum route distance.

For example:

:cm-set-maxdist:5      # The maximum route distance cannot be set to 0.
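The gradual increase after the network stabilizes could be scripted as in the following Python sketch. It only builds command strings in the :cm-set-maxdist:N form shown in the example. The function name, the step size, the target value of 15, and the use of the default of 64 as an upper bound are assumptions. Apply each command only after the network has been confirmed stable at the previous value.

```python
def maxdist_commands(start: int = 5, target: int = 15, step: int = 5) -> list[str]:
    """Build one :cm-set-maxdist command per stage, from start up to target."""
    # The distance cannot be 0; 64 is the default and is assumed to be the ceiling.
    if not 1 <= start <= target <= 64:
        raise ValueError("need 1 <= start <= target <= 64")
    if step < 1:
        raise ValueError("step must be at least 1")
    # Intermediate stages, then the target itself as the final stage.
    values = list(range(start, target, step)) + [target]
    return [f":cm-set-maxdist:{value}" for value in values]


for command in maxdist_commands():
    print(command)
# prints:
# :cm-set-maxdist:5
# :cm-set-maxdist:10
# :cm-set-maxdist:15
```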


Method 2: Disable the ECC links around the backbone node.

Before disabling ECC links, make sure that you are familiar with the fiber connections on the network. First, disable the loops at the access layer, and then isolate certain devices from the current ECC network. After the ECC network is stable, gradually restore the isolated devices.

When disabling ECC on remote optical ports, do not cut the route to the NMS. Make sure that the NE on which ECC is disabled can still be accessed from the NMS.

Step 2: Eliminate the root cause of the fault.

1. Troubleshoot bit errors on the line.

Locate the optical port with a large number of bit errors and disable its DCC channel. Then check the fiber or replace the optical board.

2. Check for NE ID conflicts. 

Change any duplicate NE ID so that the IDs of all NEs managed by the same NMS are unique. If the NMS manages NEs in the transmission domain, access domain, and IP domain at the same time, the NE IDs must be unique across all of these domains.

3. Suspend the operations that require large amounts of data to be transferred across the DCC network.
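The NE ID check in item 2 can be run offline against an exported NE list. The following Python sketch assumes the inventory is available as (NE name, NE ID) pairs; the function name, the NE names, and the sample IDs are hypothetical.

```python
from collections import defaultdict


def find_duplicate_ne_ids(inventory):
    """Return {ne_id: [ne_name, ...]} for every NE ID used by more than one NE.

    inventory: iterable of (ne_name, ne_id) pairs, pooled across all domains
    managed by the same NMS.
    """
    names_by_id = defaultdict(list)
    for ne_name, ne_id in inventory:
        names_by_id[ne_id].append(ne_name)
    return {ne_id: names for ne_id, names in names_by_id.items() if len(names) > 1}


# Hypothetical inventory: two NEs share the ID 9-100.
inventory = [
    ("Site-A-OSN", "9-100"),
    ("Site-B-OSN", "9-101"),
    ("Site-C-WDM", "9-100"),
]
print(find_duplicate_ne_ids(inventory))
# prints: {'9-100': ['Site-A-OSN', 'Site-C-WDM']}
```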

Preventive measures

In actual application scenarios, there is no way to prevent bit errors from occurring on fibers or ring networks. However, the following precautions can be taken to reduce the probability of ECC storms and improve the efficiency of handling ECC storms:

1. Proper subnetting

Properly control the ECC network scale. It is recommended that each subnet contain no more than 64 NEs.
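The 64-NE guideline can be checked against a planning list. The following is a minimal Python sketch, assuming the plan is available as a mapping from NE name to subnet name. The function name, NE names, and subnet names are hypothetical.

```python
from collections import Counter

MAX_NES_PER_SUBNET = 64  # recommended limit per ECC subnet


def oversized_subnets(ne_to_subnet: dict[str, str],
                      limit: int = MAX_NES_PER_SUBNET) -> dict[str, int]:
    """Return {subnet: NE count} for every subnet with more than limit NEs."""
    counts = Counter(ne_to_subnet.values())
    return {subnet: count for subnet, count in counts.items() if count > limit}


# Hypothetical plan: 65 NEs in subnet-1, 3 NEs in subnet-2.
plan = {f"NE-{i}": "subnet-1" for i in range(65)}
plan.update({f"NE-x{i}": "subnet-2" for i in range(3)})
print(oversized_subnets(plan))
# prints: {'subnet-1': 65}
```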

2. Do not use automatic extended ECC.

When three or more NEs are connected at a site, do not use automatic ECC extension. If the devices are connected only through network cables, manual ECC extension is recommended, because automatically extended ECC forms a very complex ring network.

3. Properly arrange the gateway location.

When an ECC storm occurs, only the gateway can be logged in to. The location of the gateway on the network is therefore very important for fault prevention and recovery. During an ECC storm, you can configure the DCC on the gateway or set the maximum route distance to cut the loop, so that disordered routes caused by bit errors no longer circulate on the loop. For this reason, ensure that each ring on a ring network has a GNE. For example, in the following network, NE (9-2335) is the most suitable gateway and NE (9-1213) is the least suitable.

[Figure: ECC storm]
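One way to compare gateway candidates during planning is to rank NEs by their worst-case hop count to any other NE, so that a small maximum route distance still reaches the whole subnet. This criterion is an assumption for illustration, not a rule stated in this post, and the topology below is hypothetical (it is not the network in the figure). It is a minimal Python sketch using breadth-first search.

```python
from collections import deque


def hop_distances(adjacency, source):
    """Breadth-first search: hop count from source to every reachable NE."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def rank_gateway_candidates(adjacency):
    """Return (worst_case_hops, ne) pairs, best candidate first."""
    ranked = []
    for ne in adjacency:
        distance = hop_distances(adjacency, ne)
        if len(distance) != len(adjacency):
            raise ValueError("topology is not fully connected")
        ranked.append((max(distance.values()), ne))
    return sorted(ranked)


# Hypothetical topology: two four-node rings that share the NE named HUB.
adjacency = {
    "HUB": ["A", "C", "D", "F"],
    "A": ["HUB", "B"], "B": ["A", "C"], "C": ["B", "HUB"],
    "D": ["HUB", "E"], "E": ["D", "F"], "F": ["E", "HUB"],
}
ranked = rank_gateway_candidates(adjacency)
print(ranked[0], ranked[-1])
# prints: (2, 'HUB') (4, 'E')
```

In this example the shared node sits on both rings and reaches every NE within 2 hops, while the NEs farthest from it need 4 hops.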

4. Properly plan the network to avoid duplicate NE IDs.


Thanks!

very good and valuable

liqiang185 Created Jul 30, 2021 01:08:16
Thank you for your review. Let's share more useful knowledge about transmission networks.
Very Good information

Good solution

Sokrin Created Aug 15, 2021 08:46:30

I really appreciate that, thanks a bunch

lucian2003 Created Aug 10, 2021 01:28:11

Saqib123 Created Jul 30, 2021 17:06:56

BAZ Created Jul 30, 2021 17:29:33
Gateway planning is important
liqiang185 Reply BAZ Created Aug 1, 2021 01:50:49
Yes
VinceD Created Aug 2, 2021 16:45:17
nice
csk99 Created Aug 3, 2021 07:48:59

Thanks for sharing

andersoncf1
MVE Author Created Jul 30, 2021 17:13:23

Good solution. Thanks for sharing

Sokrin Created Aug 15, 2021 08:46:22

Vlada85
MVE Author Created Jul 30, 2021 17:55:46

Good article!