Fault description:
The client reported that the branch office could suddenly no longer be reached from headquarters. After checking with the client, we learned that no configuration had been changed before the problem occurred, so the problem appeared to be caused by the network equipment itself. The client also confirmed that the only symptom was that employees in the branch office could not access headquarters; all other access was normal.
Main configuration:
The branch office connects to headquarters over two links. The first link has a bandwidth of 1 Gb/s and the second has a bandwidth of 100 Mb/s. The two links work in active/standby mode.
OSPF runs between the branch office and headquarters so that employees in the branch office can connect to servers in headquarters.
In headquarters, intranet access and internet access are independent, which means a PC can access either the intranet or the internet at any given time, but not both.
In headquarters, all the gateway IP addresses are configured on the core switch, which is shown as LSW5 in the topology.
Troubleshooting Procedure:
step 1:
Because OSPF is what lets employees in the branch office reach headquarters, we checked the OSPF neighbor state first.
The output showed that LSW5 had no OSPF neighbors.
step 2:
We suspected that a failure of the remote switch in the branch office was causing the problem, so we contacted the technical staff there. He replied that the switch was working fine and that employees in the branch office could access the internet. He also mentioned that the switch had raised alarms about the OSPF neighbor going down.
step 3:
Was the problem caused by a link failure or a failed switchover between headquarters and the branch office? A ping from LSW5 to the branch office succeeded, and the ISP confirmed that both links between headquarters and the branch office were fine.
Going back to the alarm on the switch: it showed that the OSPF neighbor went down because no Hello packets were received, even though the link was fine and no configuration had been changed.
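A working ping and a lost OSPF neighbor are not contradictory: the neighbor is kept alive only by Hello packets, so it expires if they stop arriving even while other traffic still passes. A minimal Python sketch of that timeout follows. The 10-second Hello interval and 40-second dead interval are the usual defaults on broadcast links and are assumed here; the source does not state the timers in use, and the `Neighbor` class and router ID are hypothetical.

```python
from dataclasses import dataclass

HELLO_INTERVAL = 10  # seconds; common default on broadcast links (assumed, not from the case)
DEAD_INTERVAL = 40   # seconds; conventionally 4 x Hello interval (assumed, not from the case)


@dataclass
class Neighbor:
    router_id: str        # hypothetical identifier, for illustration only
    last_hello_at: float  # time the last Hello was received, in seconds


def neighbor_is_down(neighbor: Neighbor, now: float,
                     dead_interval: float = DEAD_INTERVAL) -> bool:
    """Return True once no Hello has arrived for dead_interval seconds."""
    return now - neighbor.last_hello_at >= dead_interval


if __name__ == "__main__":
    branch = Neighbor(router_id="7.7.7.7", last_hello_at=0.0)
    # Hellos still arriving on schedule: the neighbor stays up.
    print(neighbor_is_down(branch, now=15.0))
    # Hellos dropped somewhere on the path: the neighbor expires,
    # regardless of whether a ping would succeed.
    print(neighbor_is_down(branch, now=45.0))
```

This is why the investigation moves on to finding where the Hello packets are lost rather than whether the link is up.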
step 4:
We captured packets on LSW1 and LSW2 at the following interfaces:
LSW1: G0/0/1, G0/0/2
LSW2: G0/0/1, G0/0/2
The capture results showed something interesting:
1. On the link between LSW1 and LSW5, there were only Hello packets from LSW7 to LSW5.
2. On the link between LSW2 and LSW5, in contrast to point 1, there were only Hello packets from LSW5 to LSW7.
3. As in point 1, only Hello packets from LSW7 to LSW5 appeared on the link between LSW1 and LSW7.
4. As in point 2, only Hello packets from LSW5 to LSW7 appeared on the link between LSW2 and LSW7.
The four situations are shown in the figure below.
It seemed that G0/0/1 on LSW5 had been blocked unexpectedly.
step 5:
We checked G0/0/1 on LSW5.
The physical state of the port was normal, and no traffic policy or ACL was configured on the interface.
We then checked spanning tree.
Eth-trunk 1, to which G0/0/1 belongs, was in the discarding state. This was clearly abnormal, because LSW5 should be the root switch for MSTP instance 0.
step 6:
We checked the STP state on LSW5.
The output showed that the CIST root bridge MAC address was 4c1f-cc44-5063 and the CIST root bridge priority was 0.
step 7:
Because Eth-trunk 2 was the root port on LSW5, we checked LSW2 next. On LSW2, G0/0/5 was the root port.
step 8:
We checked LSW4 and found that it had no root port.
We then checked the STP state on LSW4.
The output showed that LSW4 was the CIST root bridge.
step 9:
We checked the configuration of LSW4.
It showed that LSW4 had been configured with root primary, which made LSW4 the CIST root for MSTP instance 0, so the traffic path from LSW7 to LSW5 was adjusted as below.
However, a change in the traffic path alone would not bring the OSPF neighbor down. There had to be something else on the new path.
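The root election behind this step can be sketched in Python. Spanning tree picks the bridge with the lowest bridge ID, comparing priority first and MAC address second, so a switch that arrives with priority 0 wins against switches with higher priority values. Only LSW4's priority and MAC address come from the outputs above; the other priorities and MAC addresses are hypothetical values chosen for illustration (32768 is the standard default priority).

```python
from typing import Dict, NamedTuple


class BridgeId(NamedTuple):
    priority: int  # lower value wins
    mac: str       # tie-breaker; the numerically lower MAC wins


def bridge_key(bridge: BridgeId) -> tuple:
    """Turn a bridge ID into a sortable (priority, MAC-as-integer) pair."""
    return (bridge.priority, int(bridge.mac.replace("-", ""), 16))


def elect_root(bridges: Dict[str, BridgeId]) -> str:
    """Return the name of the bridge with the lowest bridge ID."""
    return min(bridges, key=lambda name: bridge_key(bridges[name]))


bridges = {
    "LSW4": BridgeId(0, "4c1f-cc44-5063"),      # from the case: priority 0, this MAC
    "LSW5": BridgeId(4096, "4c1f-cc00-0005"),   # hypothetical priority and MAC
    "LSW1": BridgeId(32768, "4c1f-cc00-0001"),  # hypothetical MAC, default priority
    "LSW2": BridgeId(32768, "4c1f-cc00-0002"),  # hypothetical MAC, default priority
}

if __name__ == "__main__":
    print(elect_root(bridges))  # LSW4 takes the root role with priority 0
    without_lsw4 = {name: b for name, b in bridges.items() if name != "LSW4"}
    print(elect_root(without_lsw4))  # the intended core switch is root again
```

Under these assumed values, removing LSW4 from the election returns the root role to LSW5, which matches the expectation that LSW5 should be the root for instance 0.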
step 10:
We checked the configuration of G0/0/4 on LSW1 and G0/0/5 on LSW2.
The output showed that a traffic policy was configured on these interfaces. The client confirmed that the policy was meant to allow PCs in the intranet area to access only the servers and the core switch. Because the policy blocks all other traffic, it also blocked the OSPF packets, which brought the OSPF neighbor down.
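The effect of such a policy on OSPF can be illustrated with a small Python sketch. It assumes the policy matches on destination address and that Hello packets are sent to the AllSPFRouters multicast address 224.0.0.5, as is usual on broadcast and point-to-point links; the actual rules are not shown in this case, and the server subnet and core switch address below are hypothetical.

```python
import ipaddress

ALL_SPF_ROUTERS = "224.0.0.5"  # usual destination of OSPF Hello packets

# Hypothetical permit list standing in for the real traffic policy.
PERMITTED_DESTINATIONS = [
    ipaddress.ip_network("10.0.100.0/24"),  # hypothetical server subnet
    ipaddress.ip_network("10.0.0.5/32"),    # hypothetical core switch address
]


def policy_permits(destination: str) -> bool:
    """Permit a packet only if its destination is in the permit list."""
    address = ipaddress.ip_address(destination)
    return any(address in network for network in PERMITTED_DESTINATIONS)


if __name__ == "__main__":
    print(policy_permits("10.0.100.20"))     # traffic to a server passes
    print(policy_permits("10.0.0.5"))        # traffic to the core switch passes
    print(policy_permits(ALL_SPF_ROUTERS))   # OSPF Hello is dropped
```

A policy written for PC traffic has no reason to permit routing protocol packets, so it causes no harm until a topology change sends OSPF traffic through the interfaces it is applied to.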
The client had told us that no configuration had been changed, so what caused LSW4 to become the root switch?
Root cause:
We were told that the original LSW4 had broken down and been replaced. The replacement switch was not restored to factory settings before the new configuration was applied, and this caused LSW4 to become the root switch. The topology then changed, OSPF traffic was blocked by the traffic policy, and finally the OSPF neighbor went down.
Advice:
Before replacing an old network device, restore the new one to factory settings and only then apply the new configuration.
It is also advisable to configure root protection, BPDU protection, and loop protection so that an unexpected switch joining the network cannot change the topology. For the differences between these features, see https://forum.huawei.com/enterprise/en/thread-475377.html
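To show why root protection would have helped in this case, here is a simplified Python sketch of the decision it makes, as the feature is commonly described: a protected port that receives a BPDU advertising a better root than the current one stops forwarding instead of accepting the new root. The function name and LSW5's bridge ID are hypothetical, and recovery timers are left out.

```python
from typing import Tuple

# (priority, MAC as integer); the lower tuple is the better bridge ID.
BridgeId = Tuple[int, int]


def state_with_root_protection(current_root: BridgeId, bpdu_root: BridgeId) -> str:
    """Return the state of a root-protected port after it receives a BPDU."""
    if bpdu_root < current_root:
        # A superior BPDU arrived: block the port rather than change the root.
        return "discarding"
    return "forwarding"


LSW5_ID: BridgeId = (4096, 0x4C1FCC000005)  # hypothetical intended root
LSW4_ID: BridgeId = (0, 0x4C1FCC445063)     # priority and MAC seen in this case

if __name__ == "__main__":
    # The replacement LSW4 advertises itself as root with priority 0.
    print(state_with_root_protection(LSW5_ID, LSW4_ID))
    # An ordinary switch with default priority does not trigger the protection.
    print(state_with_root_protection(LSW5_ID, (32768, 0x4C1FCC000009)))
```

With this behavior on the ports facing LSW4, the misconfigured switch would have been isolated and the forwarding path between LSW7 and LSW5 would have stayed as designed.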
If you have any problems, please post them in our Community. We are happy to solve them for you!