[troubleshooting] An unexpected network failure occurred after an S5720 was replaced

Latest reply: Dec 27, 2018 07:44:33

fault description:

  • The client reported that connectivity between headquarters and the branch office was suddenly lost. After checking with the client, we learned that no configuration had been changed before the problem occurred, so the problem appeared to be caused by the network equipment itself. According to the client, the only symptom was that employees in the branch office could not access headquarters. All other access was normal.


main configuration:

  1. The branch office connects to headquarters over two links: the first has a bandwidth of 1 Gb/s and the second 100 Mb/s. The two links work in active/standby mode.

  2. OSPF runs between the branch office and headquarters so that employees in the branch office can reach the servers in headquarters.

  3. In headquarters, intranet access and internet access are independent: a PC can access either the intranet or the internet at any one time, but not both.

  4. In headquarters, all gateway IP addresses are configured on the core switch, shown as LSW5 in the topology.

topology



Troubleshooting Procedure:

  

step 1:

OSPF is what allows employees in the branch office to access headquarters, so we check the OSPF neighbor state first.

ospf_peer

From the output, we see that no OSPF neighbors exist on LSW5.


step 2:

We suspected that a failure of the remote switch in the branch office was causing the problem, so we contacted the technical staff there. He replied that the switch was working fine and that employees in the branch office could access the internet. He also mentioned that the switch had raised alarms about an OSPF neighbor going down.

alarm


step 3:

Was the problem caused by a link failure or a failed switchover between headquarters and the branch office? We pinged the branch office from LSW5 and the result was fine, and the ISP also confirmed that both links between headquarters and the branch office were fine.

ping_result 


Back to the alarm on the switch: it shows that the OSPF neighbor went down because no Hello packets were received, even though the links are fine and no configuration was changed.
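The reason a neighbor can go down while ping still works can be sketched with the OSPF dead timer: the adjacency depends only on Hello packets arriving in time, not on general reachability. The snippet below is a minimal illustration. The 10 s and 40 s intervals are common defaults for broadcast links and are an assumption here, since the case does not state the configured timers. The router ID is a placeholder.

```python
from dataclasses import dataclass

HELLO_INTERVAL = 10  # seconds; assumed default, not taken from the case
DEAD_INTERVAL = 4 * HELLO_INTERVAL  # seconds; assumed default


@dataclass
class Neighbor:
    router_id: str     # placeholder identifier
    last_hello: float  # time (in seconds) at which the last Hello arrived

    def is_down(self, now: float, dead_interval: float = DEAD_INTERVAL) -> bool:
        """The neighbor is declared down once no Hello arrived within the dead interval."""
        return now - self.last_hello >= dead_interval


# The last Hello was received at t=100 s; after that, Hellos are dropped on the path.
nbr = Neighbor(router_id="hypothetical-LSW7", last_hello=100.0)
print(nbr.is_down(now=130.0))  # False: still inside the dead interval
print(nbr.is_down(now=140.0))  # True: dead interval expired, neighbor goes down
```

Ping results do not enter this calculation at all, which is consistent with the observation that the link tests succeed while the neighbor stays down.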

 

step 4:

Capture packets on LSW1 and LSW2:


LSW1  G0/0/1

11

           G0/0/2

12

LSW2   G0/0/1

21

           G0/0/2

22

From the capture results, we see something interesting:

  1. On the link between LSW1 and LSW5, there are only Hello packets from LSW7 to LSW5.

  2. On the link between LSW2 and LSW5, the opposite is true: there are only Hello packets from LSW5 to LSW7.

  3. As in case 1, only Hello packets from LSW7 to LSW5 exist on the link between LSW1 and LSW7.

  4. As in case 2, only Hello packets from LSW5 to LSW7 exist on the link between LSW2 and LSW7.

The four situations are shown in the figure below.

 hello packet

 

It seems that G0/0/1 on LSW5 has been blocked unexpectedly.
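The four capture findings can be tabulated to make the asymmetry explicit. The sketch below is only an illustration of that reasoning: the dictionary restates the observations from step 4, and the helper `one_way_links` is a hypothetical name, not part of any tool used in this case.

```python
# The two OSPF speakers whose Hellos we expect to see.
SPEAKERS = frozenset({"LSW5", "LSW7"})

# Hello senders seen on each captured link, as listed in the four findings.
OBSERVED = {
    ("LSW1", "LSW5"): {"LSW7"},
    ("LSW2", "LSW5"): {"LSW5"},
    ("LSW1", "LSW7"): {"LSW7"},
    ("LSW2", "LSW7"): {"LSW5"},
}


def one_way_links(observed, speakers=SPEAKERS):
    """Return {link: (heard, silent)} for links carrying Hellos from one speaker only."""
    result = {}
    for link, senders in observed.items():
        seen = speakers & senders
        if len(seen) == 1:
            (heard,) = seen
            (silent,) = speakers - seen
            result[link] = (heard, silent)
    return result


for link, (heard, silent) in one_way_links(OBSERVED).items():
    print(f"{link[0]}-{link[1]}: Hellos from {heard} only, none from {silent}")
```

Every captured link is one-way, and LSW5 never sends toward LSW1, which is what points the investigation at G0/0/1 on LSW5.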

 

step 5:

Check G0/0/1 on LSW5.

The physical state of the port is normal, and no traffic policy or ACL is configured on this interface.

phy state

    interface_configuration

check spanning-tree

    stp port state

Eth-Trunk 1, which G0/0/1 belongs to, is in the Discarding state. This is clearly abnormal, because LSW5 should be the root switch for MSTP instance 0.

 

step 6:

check stp state on the LSW5

root

From the result, we learn that the CIST root bridge MAC address is 4c1f-cc44-5063 and the CIST root bridge priority is 0.

 

step 7:

Since Eth-Trunk 2 is the root port on LSW5, we check LSW2 next. G0/0/5 is the root port on LSW2.

sw2root


step 8:

check LSW4

 sw4root

LSW4 has no root port, which is what we would expect of the root bridge itself.

 

checking stp state on LSW4

 sw4 stp

from the output, we notice that LSW4 is the CIST root bridge.


step 9:

checking the configuration of the LSW4

 sw4cu

From the result, we find that LSW4 has been configured with root primary, which makes LSW4 the CIST root for MSTP instance 0. The traffic path from LSW7 to LSW5 is therefore adjusted as below.

 trafficflow
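Why a bridge with priority 0 takes over as root can be sketched with the spanning-tree election rule: the bridge with the numerically lowest bridge ID (priority first, then MAC address) wins. In the snippet below, LSW4's priority and MAC are the values reported in step 6. The priority and MAC for LSW5 are placeholders chosen for illustration, because the case does not state them.

```python
from typing import NamedTuple


class Bridge(NamedTuple):
    name: str
    priority: int
    mac: str  # written as xxxx-xxxx-xxxx


def bridge_id(bridge: Bridge) -> tuple:
    """Bridge ID used for comparison: priority first, then MAC as a tie-breaker."""
    return (bridge.priority, int(bridge.mac.replace("-", ""), 16))


def elect_root(bridges) -> Bridge:
    """The bridge with the lowest bridge ID becomes the root."""
    return min(bridges, key=bridge_id)


bridges = [
    Bridge("LSW4", 0, "4c1f-cc44-5063"),     # values reported in step 6
    Bridge("LSW5", 4096, "4c1f-cc00-0005"),  # hypothetical priority and MAC
]

print(elect_root(bridges).name)  # LSW4
```

Even if LSW5 also had priority 0, the lower MAC address would decide the election, so a replacement switch carrying a leftover root configuration can displace the intended root either way.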

 

But a change of traffic path alone would not bring the OSPF neighbor down. There must be something else on the new path.

 

 step 10:

Check the configuration of G0/0/4 on LSW1 and G0/0/5 on LSW2:

120319i0yzudmp3chuc3zg.png

120320df1fxlatbfs99kwt.png

From the output, we learn that a traffic policy is configured on these interfaces. After confirming with the client, we know that this traffic policy was configured to allow PCs in the intranet area to access only the servers and the core switch. The policy blocks all other traffic, including OSPF packets, which causes the OSPF neighbor to go down.
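How such a policy ends up dropping OSPF can be sketched as a permit list with a default deny. The rules and addresses below are hypothetical, since the actual policy is only visible in the screenshots. The one fixed fact is that OSPF Hellos on broadcast links are sent to the multicast address 224.0.0.5, which a policy written only for server and gateway destinations will not match.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    action: str                  # "permit" or "deny"
    dst: ipaddress.IPv4Network   # destination the rule matches


# Hypothetical policy: only servers and the core switch gateway are permitted.
POLICY = [
    Rule("permit", ipaddress.ip_network("10.0.10.0/24")),  # hypothetical server subnet
    Rule("permit", ipaddress.ip_network("10.0.0.1/32")),   # hypothetical gateway on the core switch
]


def evaluate(dst: str, policy=POLICY, default: str = "deny") -> str:
    """Return the action of the first matching rule, or the default action."""
    address = ipaddress.ip_address(dst)
    for rule in policy:
        if address in rule.dst:
            return rule.action
    return default  # assumed behaviour: everything not permitted is blocked


print(evaluate("10.0.10.20"))  # permit: traffic to a server
print(evaluate("224.0.0.5"))   # deny: OSPF Hello to AllSPFRouters
```

The policy was harmless while OSPF traffic did not cross these interfaces. It only became a problem once the spanning-tree change moved the OSPF path onto them.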

 

The client had told us that no configuration had been changed, so what caused LSW4 to become the root switch?


root cause: 

  • We were told that the switch had been replaced because the former LSW4 was broken. The new switch was not reset to factory settings before the new configuration was applied, and this caused LSW4 to become the root switch. The topology then changed, OSPF traffic was blocked by the traffic policy, and finally the OSPF neighbor went down.

 

Advice:

  1. Before replacing an old network device, reset the new one to factory settings and only then apply the new configuration.
  2. Configure root protection, BPDU protection and loop protection to prevent an unexpected switch from joining the network and changing the topology. For the differences between these features, refer to https://forum.huawei.com/enterprise/en/thread-475377.html
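The effect of root protection in a case like this one can be sketched as follows. This is a simplified model, not the switch's actual state machine: it ignores timers and port role recalculation, and the class and bridge IDs are illustrative. The idea is that a designated port with root protection enabled refuses a superior BPDU and stops forwarding, instead of letting the sender become the new root.

```python
class DesignatedPort:
    """Simplified designated port on the intended root bridge."""

    def __init__(self, name: str, root_protection: bool = False):
        self.name = name
        self.root_protection = root_protection
        self.state = "Forwarding"

    def receive_bpdu(self, advertised_root: tuple, current_root: tuple) -> tuple:
        """Return the root bridge ID the switch keeps after receiving this BPDU."""
        if advertised_root >= current_root:
            return current_root        # not a superior BPDU, nothing changes
        if self.root_protection:
            self.state = "Discarding"  # block the port, keep the current root
            return current_root
        return advertised_root         # the sender takes over as root


# Bridge IDs as (priority, MAC). The current root's values are hypothetical.
current_root = (4096, 0x4C1FCC000005)
replaced_switch = (0, 0x4C1FCC445063)  # priority and MAC reported in step 6

unprotected = DesignatedPort("hypothetical-port-a")
protected = DesignatedPort("hypothetical-port-b", root_protection=True)

print(unprotected.receive_bpdu(replaced_switch, current_root) == replaced_switch)  # True
print(protected.receive_bpdu(replaced_switch, current_root) == current_root)       # True
print(protected.state)                                                             # Discarding
```

With protection in place, the misconfigured replacement switch would have been isolated on one port instead of reshaping the whole topology.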


Created Nov 14, 2018 09:06:13

A very detailed case, and the troubleshooting is very clear. Good example: LSW4 had been configured with root primary, which caused LSW4 to become the CIST root for MSTP instance 0. Often the root cause does not seem to match the symptoms at first, so we need to check all the configuration carefully. Thanks for sharing.

Created Nov 20, 2018 12:51:00

niu x

Created Nov 23, 2018 03:07:03

thanks

Created Dec 24, 2018 01:44:02

This post was last edited by xiaomumu at 2018-12-27 02:49. I hope this kind of failure can be avoided in the future

Created Dec 27, 2018 07:44:33

"Client had told that none configuration has been implemented, what is the reason caused the LSW4 to became the root switch." How do you understand this sentence?