Hello,
This is a case of OSPF neighbor flapping caused by a QoS configuration problem.
Problem Description
The OSPF peer started flapping after services were migrated to the NE40E.
<****NE40E****>display ospf peer last-nbr-down
OSPF Process 501 with Router ID ***.***.92.31
Last Down OSPF Peer
Neighbor Ip Address : ***.***.4.152
Neighbor Area Id : 0.0.0.0
Neighbor Router Id : ***.***.1.158
Interface :Eth-Trunk8 (89)
Immediate Reason : Neighbor Down Due to Kill Neighbor
Primary Reason : BFD Session Down
Down Time :2021-02-14 23:18-06:00
Neighbor Ip Address : ***.***.4.152
Neighbor Area Id : 0.0.0.0
Neighbor Router Id : ***.***.1.158
Interface :Eth-Trunk8 (89)
Immediate Reason : Neighbor Down Due to Kill Neighbor
Primary Reason : BFD Session Down
Down Time :2021-02-14 23:18-06:00
Neighbor Ip Address : ***.***.4.152
Neighbor Area Id : 0.0.0.0
Neighbor Router Id : ***.***.1.158
Interface :Eth-Trunk8 (89)
Immediate Reason : Neighbor Down Due to Kill Neighbor
Primary Reason : BFD Session Down
Down Time :2021-02-14 23:18-06:00
Handling Process
(1) Checked the ping results. It was confirmed that only pings from the ATN to the NE40E had packet loss, while pings from the NE40E to the ATN were normal.
ATN (pinging ***.***.4.153):
2021-02-14 23:18+00:00 Reply from ***.***.4.153: bytes=56 Sequence=1595 ttl=255 time=1 ms
2021-02-14 23:18+00:00 Reply from ***.***.4.153: bytes=56 Sequence=1596 ttl=255 time=1 ms
2021-02-14 23:18+00:00 Reply from ***.***.4.153: bytes=56 Sequence=1597 ttl=255 time=1 ms
2021-02-14 23:18+00:00 Reply from ***.***.4.153: bytes=56 Sequence=1598 ttl=255 time=1 ms
2021-02-14 23:18+00:00 Reply from ***.***.4.153: bytes=56 Sequence=1599 ttl=255 time=1 ms
2021-02-14 23:18+00:00 Reply from ***.***.4.153: bytes=56 Sequence=1600 ttl=255 time=1 ms
2021-02-14 23:18+00:00 Request time out
2021-02-14 23:18+00:00 Request time out
2021-02-14 23:18+00:00 Request time out
2021-02-14 23:18+00:00 Request time out
2021-02-14 23:18+00:00 Request time out
2021-02-14 23:18+00:00 Request time out
2021-02-14 23:18+00:00 Request time out
2021-02-14 23:18+00:00 Request time out
2021-02-14 23:18+00:00 Request time out
2021-02-14 23:18+00:00 Request time out
2021-02-14 23:18+00:00 Reply from ***.***.4.153: bytes=56 Sequence=1611 ttl=255 time=772 ms
2021-02-14 23:18+00:00 Reply from ***.***.4.153: bytes=56 Sequence=1612 ttl=255 time=98 ms
2021-02-14 23:18+00:00 Reply from ***.***.4.153: bytes=56 Sequence=1613 ttl=255 time=1 ms
2021-02-14 23:18+00:00 Reply from ***.***.4.153: bytes=56 Sequence=1614 ttl=255 time=1 ms
2021-02-14 23:18+00:00 Reply from ***.***.4.153: bytes=56 Sequence=1615 ttl=255 time=1 ms
2021-02-14 23:18+00:00 Reply from ***.***.4.153: bytes=56 Sequence=1616 ttl=255 time=1 ms

NE40E (pinging ***.***.4.152):
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=451 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=452 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=453 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=454 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=455 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=456 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=457 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=458 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=459 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=460 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=461 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=462 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=463 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=464 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=465 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=466 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=467 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=468 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=469 ttl=255 time=1 ms
2021-02-14 23:18-06:00 Reply from ***.***.4.152: bytes=56 Sequence=470 ttl=255 time=1 ms
(2) While displaying the device logs to collect information, we found that every display log operation also triggered the same BFD down event at that moment.
TAC therefore concluded that the QoS configuration was limiting the outbound traffic rate.
Feb 15 2021 21:13:20 ****ATN**** D/4/STACHG_TODWN(l)[20568383]:BFD session changed to Down. (SlotNumber=0, Discriminator=1298, Diagnostic=NeighborDown, Applications=OSPF, ProcessPST=False, BindInterfaceName=Eth-Trunk1, InterfacePhysicalState=Up, InterfaceProtocolState=Up)
Feb 15 2021 21:13:20 ****ATN**** %OSPF/3/NBR_CHG_DOWN(l)[20568384]:Neighbor event:neighbor state changed to Down. (ProcessId=501, NeighborAddress=***.***.4.153, NeighborEvent=KillNbr, NeighborPreviousState=Full, NeighborCurrentState=Down)
Feb 15 2021 21:13:20 ****ATN**** %OSPF/3/NBR_DOWN_REASON(l)[20568385]:Neighbor state leaves full or changed to Down. (ProcessId=501, NeighborRouterId=***.***.92.31, NeighborAreaId=0, NeighborInterface=Eth-Trunk1, NeighborDownImmediate reason=Neighbor Down Due to Kill Neighbor, NeighborDownPrimeReason=BFD Session Down, NeighborChangeTime=2021-02-14 23:18+00:00)
Feb 15 2021 21:13:22 ****ATN**** %L2V/5/VPLSVC_DWN_ME(l)[20568386]:The status of the VPLS VC turned DOWN. (VsiName=Tele_Digital, RemoteIp=177.241.247.21, PwId=2993, Reason=Tunnel was Down, SysUpTime=1022585146)
Feb 15 2021 21:17:07 ****ATN**** %SNMP/4/SNMP_FAIL(s)[20568554]:Failed to login through SNMP. (Ip=167.71.186.157, Times=3, Reason=the community was incorrect, VPN=internet_mca)
Feb 15 2021 21:17:08 ****ATN**** D/4/STACHG_TODWN(l)[20568556]:BFD session changed to Down. (SlotNumber=0, Discriminator=1299, Diagnostic=NeighborDown, Applications=OSPF, ProcessPST=False, BindInterfaceName=Eth-Trunk1, InterfacePhysicalState=Up, InterfaceProtocolState=Up)
Feb 15 2021 21:17:08 ****ATN**** %OSPF/3/NBR_CHG_DOWN(l)[20568557]:Neighbor event:neighbor state changed to Down. (ProcessId=501, NeighborAddress=***.***.4.153, NeighborEvent=KillNbr, NeighborPreviousState=Full, NeighborCurrentState=Down)
Feb 15 2021 21:17:08 ****ATN**** %OSPF/3/NBR_DOWN_REASON(l)[20568558]:Neighbor state leaves full or changed to Down. (ProcessId=501, NeighborRouterId=***.***.92.31, NeighborAreaId=0, NeighborInterface=Eth-Trunk1, NeighborDownImmediate reason=Neighbor Down Due to Kill Neighbor, NeighborDownPrimeReason=BFD Session Down, NeighborChangeTime=2021-02-14 23:18+00:00)
(3) By checking the configurations of the NE40E and the ATN, it was confirmed that the QoS bandwidth settings on the ATN were wrong: they permit only 70 kbps for the CS6 and CS7 queues, and those queues are scheduled as WFQ (the same as AF1~AF4) rather than PQ, even though the interface rate is already 70 Mbps.
(4) After the port-queue configuration for CS6 and CS7 was removed on the ATN device, the issue was resolved.
User : metrocarrier, VT1, 192.168.3.1 Time : 2021-02-14 23:18+00:00 Command: undo port-queue cs6 outbound
User : metrocarrier, VT1, 192.168.3.1 Time : 2021-02-14 23:18+00:00 Command: undo port-queue cs7 outbound
User : metrocarrier, VT1, 192.168.3.1 Time : 2021-02-14 23:18+00:00 Command: dis th
User : metrocarrier, VT1, 192.168.3.1 Time : 2021-02-14 23:18+00:00 Command: undo port-queue cs6 wfq outbound
User : metrocarrier, VT1, 192.168.3.1 Time : 2021-02-14 23:18+00:00 Command: undo port-queue cs7 wfq outbound
User : metrocarrier, VT1, 192.168.3.1 Time : 2021-02-14 23:18+00:00 Command: dis th
User : metrocarrier, VT1, 192.168.3.1 Time : 2021-02-14 23:18+00:00 Command: interface Eth-Trunk 1
Area 0.0.0.0 interface 10.3.7.126(Eth-Trunk8)'s neighbors
Router ID: ***.***.92.63          Address: 10.3.7.127
  State: Full  Mode:Nbr is Master  Priority: 1
  DR: None   BDR: None   MTU: 0
  Dead timer due in 39 sec
  Retrans timer interval: 0
  Neighbor is up for 06:22:53
  Authentication Sequence: [ 22936]
Root Cause
The QoS configuration on the ATN is not suitable for the current service rate. It increases the queueing delay, which makes BFD time out, which in turn makes OSPF flap.
The configuration allocates only 70 kbps to the CS6 and CS7 priorities, and because those queues are scheduled as WFQ, CS6 and CS7 must compete for bandwidth with AF1~AF4, so the delay grows as the service traffic increases.
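These numbers leave almost no headroom. A rough back-of-the-envelope check (a sketch in Python; the BFD timers come from the interface configuration in this case, while the 1500-byte packet size is a typical full-size frame, not a measured value):

```python
# Rough timing budget left for BFD by the 70 kbps CS6 shaper.
# Timer values are from the interface configuration in this case;
# the packet size is an assumed typical full-size frame.

BFD_TX_INTERVAL_S = 0.5        # ospf bfd min-tx-interval 500 (ms)
BFD_MULTIPLIER = 6             # detect-multiplier 6
SHAPER_BPS = 70_000            # port-queue cs6 ... shaping 70 (kbps)
PACKET_BITS = 1500 * 8         # one full-size packet queued ahead of BFD

detect_time_s = BFD_TX_INTERVAL_S * BFD_MULTIPLIER   # 3.0 s detection window
drain_per_packet_s = PACKET_BITS / SHAPER_BPS        # ~0.171 s per packet

# Backlog that exhausts the whole BFD detection window:
packets_to_kill_bfd = detect_time_s / drain_per_packet_s   # 17.5 packets

print(f"BFD detect time: {detect_time_s:.1f} s")
print(f"Drain time per 1500-byte packet at 70 kbps: {drain_per_packet_s:.3f} s")
print(f"Backlog that exceeds the detect window: ~{packets_to_kill_bfd:.1f} packets")
```

So a queue backlog of fewer than twenty full-size packets in the shared CS6 queue is enough to hold BFD packets past the 3-second detection window, which matches the flaps seen whenever traffic (or even a display log burst) increased.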
interface Eth-Trunk1
 description ***-ATN910-324-***-Gi0/2/27-***.***.51.1
 mtu 9042
 ip address ***.***.4.152 255.255.255.254
 pim sm
 ospf network-type p2p
 ospf bfd min-tx-interval 500 min-rx-interval 500 detect-multiplier 6
 ospf bfd block
 ospf ldp-sync
 mpls
 mpls te
 mpls rsvp-te
 mpls ldp
 mode lacp-static
 port-queue af1 wfq weight 15 shaping 556 outbound
 port-queue af2 wfq weight 15 shaping 1042 outbound
 port-queue af3 wfq weight 15 shaping 70 outbound
 port-queue af4 pq shaping 3000 outbound
 port-queue cs6 wfq shaping 70 outbound
 port-queue cs7 wfq shaping 70 outbound
#

Solution
(1) After getting the customer's permission, remove the port-queue configuration for CS6 and CS7; this resolved the issue.
[****ATN****-Eth-Trunk1]undo port-queue cs7 outbound
[****ATN****-Eth-Trunk1]undo port-queue cs6 outbound
(2) The correct shaping values for CS6 and CS7 need to be double-checked by the customer against the live network traffic rates.
In addition, the CS6 and CS7 queues should be scheduled with PQ instead of WFQ.
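The effect of PQ versus WFQ on control traffic can be illustrated with a minimal scheduler sketch (hypothetical Python, not the device's implementation): under strict priority, the CS6/CS7 queue is always drained before the data queues, so BFD/OSPF packets never wait behind a data backlog.

```python
from collections import deque

def pq_schedule(queues, order):
    """Strict priority: always drain the highest-priority non-empty queue.

    `order` lists queue names from highest to lowest priority.
    """
    out = []
    while any(queues.values()):
        for name in order:
            if queues[name]:
                out.append(queues[name].popleft())
                break
    return out

# Hypothetical backlog: two control packets behind a pile of data packets.
queues = {
    "cs7": deque(["bfd1", "bfd2"]),                 # control traffic (BFD/OSPF)
    "af1": deque(["data1", "data2", "data3"]),      # service data traffic
}
print(pq_schedule(queues, ["cs7", "af1"]))
# -> ['bfd1', 'bfd2', 'data1', 'data2', 'data3']
```

With WFQ the two queues would instead share the link in proportion to their weights, so control packets interleave with, and wait behind, data packets, which is exactly how the delay built up in this case.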
interface Eth-Trunk1
 description TMCA1-PAN-ATN910-324-HUB1-Gi0/2/27-10.4.51.1
 mtu 9042
 ip address ***.***.4.152 255.255.255.254
 pim sm
 ospf network-type p2p
 ospf bfd min-tx-interval 500 min-rx-interval 500 detect-multiplier 6
 ospf bfd block
 ospf ldp-sync
 mpls
 mpls te
 mpls rsvp-te
 mpls ldp
 mode lacp-static
 port-queue af1 wfq weight 15 shaping 556 outbound
 port-queue af2 wfq weight 15 shaping 1042 outbound
 port-queue af3 wfq weight 15 shaping 70 outbound
 port-queue af4 pq shaping 3000 outbound
 port-queue cs6 pq shaping <suitable shaping value> outbound
 port-queue cs7 pq shaping <suitable shaping value> outbound
#

