On a stack of two S5720-28X-SI-AC switches, one of the switches (slot 2) rebooted with no visible cause and no external influence. According to the logs, the problem occurred at 2020-12-31 03:02.
Logs:
Dec 31 2020 08:02:01.990.1+03:00 switch-stack FSP/4/STACK_SWITCHOVER:OID 1.3.6.1.4.1.2011.5.25.183.1.22.4 After switchover, slot 1 is selected as master.
Dec 31 2020 08:02:01.990.2+03:00 switch-stack FSP/4/STACKMEMBER_LEAVE:OID 1.3.6.1.4.1.2011.5.25.183.1.22.7 Slot 2 leaves from stack.
Dec 31 2020 08:07:41.124.1+03:00 switch-stack %IFNET/4/BOARD_ENABLE(l)[416]:Board 2 has been available.
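To put a number on the outage, the three log entries above can be parsed and the time between slot 2 leaving the stack and board 2 becoming available again computed. The following is a minimal sketch: the regular expression is an assumption fitted to the three lines quoted here, not a general parser for every VRP log format, and the helper name parse_events is made up for illustration.

```python
import re
from datetime import datetime, timedelta

LOG = """\
Dec 31 2020 08:02:01.990.1+03:00 switch-stack FSP/4/STACK_SWITCHOVER:OID 1.3.6.1.4.1.2011.5.25.183.1.22.4 After switchover, slot 1 is selected as master.
Dec 31 2020 08:02:01.990.2+03:00 switch-stack FSP/4/STACKMEMBER_LEAVE:OID 1.3.6.1.4.1.2011.5.25.183.1.22.7 Slot 2 leaves from stack.
Dec 31 2020 08:07:41.124.1+03:00 switch-stack %IFNET/4/BOARD_ENABLE(l)[416]:Board 2 has been available.
"""

# Assumed layout: timestamp with milliseconds, a sequence number, the UTC
# offset, the hostname, then MODULE/SEVERITY/MNEMONIC.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{1,2} \d{4} \d{2}:\d{2}:\d{2}\.\d{3})\.\d+[+-]\d{2}:\d{2} "
    r"\S+ %?(?P<module>\w+)/(?P<sev>\d)/(?P<mnemonic>\w+)"
)

def parse_events(log):
    """Map each log mnemonic to the local timestamp of its last occurrence."""
    events = {}
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m:
            events[m["mnemonic"]] = datetime.strptime(m["ts"], "%b %d %Y %H:%M:%S.%f")
    return events

events = parse_events(LOG)
outage = events["BOARD_ENABLE"] - events["STACKMEMBER_LEAVE"]
print("Slot 2 was out of the stack for", outage)
```

All three entries carry the same +03:00 offset, so the sketch compares local timestamps directly and ignores the offset.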
Let's go through the diagnostics to find the cause of the reboot and check the current state of the stack.
1. The current state of the devices is fine
===============display device===============
==================================================
S5720-28X-SI-AC's Device status:
Slot Sub  Type          Online   Power    Register    Status  Role
-------------------------------------------------------------------------------
1    -    S5720-28X-SI  Present  PowerOn  Registered  Normal  Master
     PWR1 POWER         Present  PowerOn  Registered  Normal  NA
     PWR2 POWER         Present  PowerOn  Registered  Normal  NA
2    -    S5720-28X-SI  Present  PowerOn  Registered  Normal  Standby
     PWR1 POWER         Present  PowerOn  Registered  Normal  NA
     PWR2 POWER         Present  PowerOn  Registered  Normal  NA
===============display stack===============
=================================================
Stack mode: Service-port
Stack topology type: Ring
Stack system MAC: 4xxx-xxxx-xxx0
MAC switch delay time: 10 min
Stack reserved VLAN: 4093
Slot of the active management port: 1
Slot  Role     MAC Address     Priority  Device Type
-------------------------------------------------------------
1     Master   4xxx-xxxx-xxx0  250       S5720-28X-SI-AC
2     Standby  4xxx-xxxx-xx40  100       S5720-28X-SI-AC
2. Look at the reboot reason
===============display reboot-info===============
=======================================================
Slot ID  Times  Reboot Type  Reboot Time(DST)
===========================================================================
2        1      FSP          2020/12/31 08:04:47
FSP, the stack management protocol, sent slot 2 into a reboot. This means the problem is in software, so the stack logic needs to be checked. The stack interfaces themselves were of course checked as well: there were no errors at the physical layer.
===============display stack port===============
Logic Port     Phy Port               Online   Status
----------------------------------------------------------------------------
stack-port1/1  XGigabitEthernet1/0/1  present  up
stack-port1/2  XGigabitEthernet1/0/2  present  up
stack-port2/1  XGigabitEthernet2/0/1  present  up
stack-port2/2  XGigabitEthernet2/0/2  present  up
===============display stack channel all===============
! : Port have received packets with CRC error.
Slot  L-Port  P-Port    Speed  State  ||  P-Port    Speed  State  L-Port  Slot
---------------------------------------------------------------------------------------
1     1/1     XGE1/0/1  10G    UP         XGE2/0/2  10G    UP     2/2     2
1     1/2     XGE1/0/2  10G    UP         XGE2/0/1  10G    UP     2/1     2
2     2/1     XGE2/0/1  10G    UP         XGE1/0/2  10G    UP     1/2     1
2     2/2     XGE2/0/2  10G    UP         XGE1/0/1  10G    UP     1/1     1
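The physical check above can be scripted as a rough sanity test: every channel row should be UP on both ends, and every link should be reported from both sides of the ring. This is only a sketch. The row layout (ten whitespace-separated fields) is an assumption taken from the output quoted here, and parse_channels and check_channels are hypothetical helper names.

```python
CHANNELS = """\
1 1/1 XGE1/0/1 10G UP XGE2/0/2 10G UP 2/2 2
1 1/2 XGE1/0/2 10G UP XGE2/0/1 10G UP 2/1 2
2 2/1 XGE2/0/1 10G UP XGE1/0/2 10G UP 1/2 1
2 2/2 XGE2/0/2 10G UP XGE1/0/1 10G UP 1/1 1
"""

def parse_channels(text):
    """Return one (local_port, local_state, peer_port, peer_state) tuple per row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 10:  # skip headers, rulers and anything unexpected
            continue
        rows.append((fields[2], fields[4], fields[5], fields[7]))
    return rows

def check_channels(rows):
    """Check that all ends are UP and each link is seen from both members."""
    all_up = all(ls == "UP" and ps == "UP" for _, ls, _, ps in rows)
    links = {(local, peer) for local, _, peer, _ in rows}
    symmetric = all((peer, local) in links for local, peer in links)
    return all_up, symmetric

all_up, symmetric = check_channels(parse_channels(CHANNELS))
print("all ends UP:", all_up, "| links seen from both sides:", symmetric)
```

A passing check only confirms what the output shows at collection time. It says nothing about the state of the links at the moment of the failure.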
3. Examine the stack traces (display stack trace memory | nvram )
What slot 1 saw (some events omitted):
2020-12-31 08:04:38.914:Stack port 2 became Down.
2020-12-31 08:04:38.914:Slot 1 changed to standalone at state "RUNNING".
2020-12-31 08:04:38.914:Only slot 1 is present when designating the standby and master.
2020-12-31 08:04:38.914:A routing table is built. Topo is 0(1: ring, 0: link) state is idle, and strategy is 0xcd (0xcc: ring, 0xcd: link).
2020-12-31 08:04:38.914:Stack event "STAND_ALONE" occurs(flag is 0, timer is 0, state is 3, link status is 0(port 0) and 0(port 1).
2020-12-31 08:04:38.914:The link status on stack-port 2 neighbor(slot 4294967295) is 0(0: down, 1: up).
2020-12-31 08:04:38.914:Stack port 1 became Down.
2020-12-31 08:02:01.990:Notify that the device in slot 2 has been removed.
2020-12-31 08:02:01.990:Slot 2 is not present. The reason is stand by smooth.
2020-12-31 08:02:01.990:Notify that slot 1 changed from standby to master.
2020-12-31 08:02:01.990:The master in slot 2 was lost, and the standby in slot 1 changed to master.
2020-12-31 08:02:01.990:Stack event "MASTER_DOWN" occurred.
2020-12-31 08:02:01.990:The standby did not receive any SPDU packet from the master.
2020-12-31 08:02:01.0:Stack port 2 does not receive any hello packet for 15 second(s).
2020-12-31 08:02:01.0:Stack port 1 does not receive any hello packet for 15 second(s).
2020-12-31 08:01:55.990:Stack port 2 does not receive any hello packet for 10 second(s).
2020-12-31 08:01:55.990:Stack port 1 does not receive any hello packet for 10 second(s).
What slot 2 saw (some events omitted):
2020-12-31 08:04:47.938:Slot 2 is restarted. The reason is "Reset for stack combine".
2020-12-31 08:04:47.928:Reset for merge.
2020-12-31 08:04:39.928:Stack event "MASTER_CHANGE" occurred.
2020-12-31 08:02:12.568:Stack event "STAND_ALONE" occurs(flag is 0, timer is 0, state is 3, link status is 1(port 0) and 1(port 1).
2020-12-31 08:02:12.558:Slot 1 is not present. The reason is removed alone.
2020-12-31 08:02:12.558:Stack port 2 does not receive any hello packet for 25 second(s).
2020-12-31 08:02:12.538:The link status on stack-port 1 neighbor(slot 1) is 1(0: down, 1: up).
2020-12-31 08:02:12.538:Set the status of stack port 1 to 2 (1: forwarding, 2: block-all, 3+: block-some). The result is 0(0: ok).
2020-12-31 08:02:12.538:A routing table is built. Topo is 0(1: ring, 0: link) state is idle, and strategy is 0xcc (0xcc: ring, 0xcd: link).
2020-12-31 08:02:12.538:Topology changed from 1 to 0 (0: link, 1: ring).
2020-12-31 08:02:12.538:Stack port 1 does not receive any hello packet for 25 second(s).
2020-12-31 08:02:03.428:Stack-port 2 did not receive a hello packet.
2020-12-31 08:02:03.428:Stack-port 1 did not receive a hello packet.
2020-12-31 08:02:02.588:Stack port 2 does not receive any hello packet for 15 second(s).
2020-12-31 08:02:02.568:Stack port 1 does not receive any hello packet for 15 second(s).
2020-12-31 08:01:57.548:Stack port 2 does not receive any hello packet for 10 second(s).
2020-12-31 08:01:57.548:Stack port 1 does not receive any hello packet for 10 second(s).
So both members stopped receiving valid stack hello packets from each other while the stack links stayed up: a failure in the hello exchange logic of the internal FSP protocol.
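The traces are stored newest-first and separately per member, which makes the sequence hard to follow. One way to read them is to merge both views into a single chronological timeline. The sketch below does this for a handful of lines copied from the traces in step 3. It assumes the "timestamp:message" layout seen there and that the clocks of the two members are close enough to compare. The function names are illustrative.

```python
import re
from datetime import datetime

SLOT1 = """\
2020-12-31 08:04:38.914:Stack port 2 became Down.
2020-12-31 08:02:01.990:Stack event "MASTER_DOWN" occurred.
2020-12-31 08:01:55.990:Stack port 1 does not receive any hello packet for 10 second(s).
"""
SLOT2 = """\
2020-12-31 08:04:47.938:Slot 2 is restarted. The reason is "Reset for stack combine".
2020-12-31 08:04:39.928:Stack event "MASTER_CHANGE" occurred.
2020-12-31 08:02:12.558:Slot 1 is not present. The reason is removed alone.
2020-12-31 08:01:57.548:Stack port 1 does not receive any hello packet for 10 second(s).
"""

# Assumed trace line layout: "YYYY-MM-DD HH:MM:SS.fraction:message"
TRACE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+):(.*)$")

def parse_trace(text, slot):
    """Return (timestamp, slot, message) tuples for every recognised line."""
    rows = []
    for line in text.splitlines():
        m = TRACE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            rows.append((ts, slot, m.group(2)))
    return rows

def merged_timeline(*traces):
    """Merge several parsed traces and sort them oldest-first."""
    rows = [row for trace in traces for row in trace]
    return sorted(rows, key=lambda row: row[0])

timeline = merged_timeline(parse_trace(SLOT1, 1), parse_trace(SLOT2, 2))
for ts, slot, msg in timeline:
    print(ts.strftime("%H:%M:%S.%f")[:-3], f"slot{slot}", msg)
```

Read in this order, the sequence is: both members miss hellos, each declares the other gone, and slot 2 is finally reset for the stack merge.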
4. The switches run software V200R011C10SPC600 + patch V200R011SPH010
Patch V200R011SPH010 is too old: the current patch at the time the problem was found was V200R011SPH020. New patches include the changes from all previous patches.
Going through the list of fixed bugs in the accompanying patch release notes, we see that patch V200R011SPH011 fixed a problem matching this description on devices of our product line.
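Since patches are cumulative, checking whether a stack already contains a given fix comes down to comparing patch numbers. The sketch below assumes the VxxxRxxxSPHxxx naming seen in the patch names above and that a higher SPH number within the same V/R release is a later patch. The helper names are hypothetical, and the release notes remain the authoritative source.

```python
import re

# Assumed naming scheme, based on the patch names quoted in the text.
PATCH_RE = re.compile(r"^V(\d{3})R(\d{3})SPH(\d{3})$")

def parse_patch(name):
    """Split a patch name into (V, R, SPH) integers."""
    m = PATCH_RE.match(name)
    if not m:
        raise ValueError(f"unrecognised patch name: {name!r}")
    return tuple(int(group) for group in m.groups())

def contains_fix(installed, fixed_in):
    """True if the installed cumulative patch is at or past the fixing patch."""
    iv, ir, isph = parse_patch(installed)
    fv, fr, fsph = parse_patch(fixed_in)
    if (iv, ir) != (fv, fr):
        raise ValueError("patches belong to different V/R releases")
    return isph >= fsph

print(contains_fix("V200R011SPH010", "V200R011SPH011"))  # installed patch
print(contains_fix("V200R011SPH020", "V200R011SPH011"))  # latest patch
```

With the versions from this case, the installed patch predates the fix and the latest patch includes it.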
To prevent the problem from recurring, the latest patch must be installed on the stack.
5. Using the recommended software versions and their latest patches is an important vendor recommendation aimed at improving the stability of devices and services in the network.
To get access to software updates, the device must be registered in the personal account by its serial number:
https://support.huawei.com/enterprisemysupport/mysupport#click=productreg
To keep track of software releases, you can subscribe to these mailings and receive timely notifications about new patches and software versions, as well as product newsletters and bulletins.
https://support.huawei.com/enterprisemysupport/mysupport#click=techsupportsubscription