Hello, everyone!
This post describes a case in which the 1+1 linear MSP protocol on an OptiX OSN 3500 on a live network remained in the Starting state.
Problem description:
At one site, 32 linear MSP protection groups were configured on OptiX OSN 3500 NEs. The 23rd to 32nd 1+1 linear MSP protection groups were configured in single-ended non-revertive mode. However, the protocols of these groups remained in the Starting state. Restarting the protocols resolved the problem.
#0x90cbe:cfg-get-lmsstate:23;
LMS-SWITCH-STATE
PG-ID PU-ID SWITCH-REQUEST SWITCH-STATE
23 0 LPS_NR Starting .
Handling procedure:
NE software version: 5.21.18.50P01
Cross-connect board version: SSN1UXCSA 8.13
Step 1 Analyzed the black box of the board in slot 9 and found that the board in slot 9 had been switched to standby, and that the board in slot 10 had been the active board since September 16. When the board in slot 10 became the active board, the timer failed to start. As a result, the protocols of some linear MSP groups remained in the Starting state.
3904 2010-09-15 20:14 0x77 0C 40 01 //The active and standby cross-connect boards were manually switched.
3905 2010-09-15 20:14 0x95 0C E2 09 01 //The board in slot 9 was switched to the standby board.
Step 2 Analyzed the black box of the board in slot 10, and found that the board was warm reset for the last time at 4:11:42.
4 2010-09-15 20:14 10 0xF0000010 0x3
The timer failed to start because sending the message failed with error code 524299, which indicates that the number of messages in the queue had reached the maximum value.
243 2010-09-15 20:14 0xAC Level:2, Apsadpt.cpp, Line:8230, dwRc[524299], SendMsg Err
244 2010-09-15 20:14 0xAC Level:2, LpsAdpt.cpp, Line:5736, dwRe[1], Timer ERR
245 2010-09-15 20:14 0xAC Level:2, Apsadpt.cpp, Line:8230, dwRc[524299], SendMsg Err
When the protocol was restarted, four timers were stopped and one timer was started for each linear MSP group. However, the timer queue could hold only 128 bytes, and 28 bytes were configured in the MS. Therefore, the message queue of the timer module overflowed, and some timers of the linear MSP groups could not be started. As a result, those linear MSP groups remained in the Starting state.
10 1 EXTCMDTIMER_STOP 0x0000 2010-09-15 20:14 0x042f8faa
10 2 EXTCMDTIMER_STOP 0x0001 2010-09-15 20:14 0x042f90ff
10 3 T1_STOP 0x0000 2010-09-15 20:14 0x042f9249
10 4 T2_STOP 0x0000 2010-09-15 20:14 0x042f9394
10 5 TK12_STOP 0x0000 2010-09-15 20:14 0x042f94df
10 6 K_ON_OFF 0x0000 2010-09-15 20:14 0x042f9694
10 7 K_ON_OFF 0x0001 2010-09-15 20:14 0x042f9775
10 8 T1_START 0x0000 2010-09-15 20:14 0x042f9dc5
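The overflow described above can be illustrated with a small sketch. This is not the board software; it is a toy model that assumes the timer queue holds 128 entries, that each group sends five timer messages in one burst, and that nothing is drained during the burst. The names `BoundedQueue` and `restart_protocols` are hypothetical, and the real unit of the queue length and the real drain rate are not given in this case, so the exact group numbers will differ from the live network.

```python
from collections import deque


class BoundedQueue:
    """Fixed-capacity FIFO that rejects new messages once it is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def send(self, msg):
        # A rejected send stands in for the "SendMsg Err" seen in the log.
        if len(self.items) >= self.capacity:
            return False
        self.items.append(msg)
        return True


def restart_protocols(num_groups, msgs_per_group, capacity):
    """Return the IDs of groups whose timer messages were not all accepted."""
    queue = BoundedQueue(capacity)
    affected = []
    for group in range(1, num_groups + 1):
        results = [queue.send((group, i)) for i in range(msgs_per_group)]
        if not all(results):
            affected.append(group)
    return affected


# Assumed values: 32 groups, 5 timer messages per group, 128 queue entries.
affected = restart_protocols(num_groups=32, msgs_per_group=5, capacity=128)
print(affected)
```

In this model the earlier groups get all of their timer messages through, and the later ones do not, which is the same pattern as the higher-numbered groups being stuck on the live network.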
Step 3 Analyzed the ocplog and found that the NE was upgraded at 03:45:29 on September 16, 2010. Before the upgrade, the board in slot 10 was the active board, and the board in slot 9 was the standby board. At 04:14:45 on September 16, 2010, the standby board was upgraded first, followed by the active board. However, the versions of the active and standby boards were different. As a result, the protocol running on the board in slot 10 was restarted after the active/standby switching.
Init OCP Log OK 2010-09-15 20:14 18 1
NESOFT_VER: 5.21.18.50P01 Feb 25 2010 11:49:45
The 32nd MSP group was in the Stop state. Why?
Step 4 Analyzed the bb4.log and found that the command for starting the protocol of the 32nd MSP group was not received.
4030 2010-09-15 20:14 0x46 25 00 1A 00 00 00 01 //The 26th MSP group was created.
4186 2010-09-15 20:14 0xAC 25 08 19 //The 25th MSP group was enabled.
4223 2010-09-15 20:14 0x77 0C 40 01 //The board with a larger slot number was switched to the active board.
4224 2010-09-15 20:14 0x77 0C 40 01 //The board with a larger slot number was switched to the active board.
4225 2010-09-15 20:14 0x95 0C E2 0A 00 //The board in slot 10 was switched to the active board.
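When reading these entries, note that the group number appears in hexadecimal: `1A` is 26 and `19` is 25, which matches the comments above. The helper below is a minimal sketch for converting such a field. The function name `decode_group_id` is hypothetical, and the byte position of the group ID is an assumption inferred from the two commented entries, not from a documented log format.

```python
def decode_group_id(hex_bytes, index):
    """Convert one space-separated hex field of a log payload to decimal."""
    fields = hex_bytes.split()
    return int(fields[index], 16)


# Payloads copied from the bb4.log entries; index 2 is an assumed position.
created = decode_group_id("25 00 1A 00 00 00 01", 2)
enabled = decode_group_id("25 08 19", 2)
print(created, enabled)
```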
Step 5 Recovered the NE database from the live network and reset the standby board in the lab. After the standby board came online, the problem was reproduced by switching the active and standby cross-connect boards. The commands were lost after the configurations of the 25th MSP group were received.
bb4.log 2010-09-15 20:14 25 a0 19 00
bb4.log 2010-09-15 20:14 25 04 19 02 58
The length of the receive queue of the board software was 128. When the protocol of each MSP group was started, six commands were delivered. If too many commands were delivered at the same time, the commands delivered later were discarded. As a result, the MSP protocol was in the Stop state because no command for starting the protocol was received.
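A rough way to see why only the later groups lose commands is to track the queue depth group by group. The sketch below uses the figures from this step (a queue length of 128 and six commands per group) and adds one hypothetical parameter, `drained_per_group`, for how many commands the receiver consumes between groups. That drain rate is not stated in this case, so the results are illustrative only.

```python
def first_dropped_group(num_groups, cmds_per_group, capacity, drained_per_group):
    """Return the first group that loses a command, or None if none is lost."""
    depth = 0
    for group in range(1, num_groups + 1):
        for _ in range(cmds_per_group):
            if depth >= capacity:
                return group  # queue is full, so this command is discarded
            depth += 1
        # Assumed: the receiver consumes a fixed number of commands per group.
        depth = max(0, depth - drained_per_group)
    return None


no_drain = first_dropped_group(32, 6, 128, drained_per_group=0)
slow_drain = first_dropped_group(32, 6, 128, drained_per_group=1)
print(no_drain, slow_drain)  # prints "22 26"
```

The point of the sketch is only that the group at which commands start to be lost depends on both the queue length and how fast the receiver drains it, so a longer queue or a slower command burst would move or remove the threshold.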
Root cause:
When more than 22 linear MSP groups were configured, the message queues of the linear MSP timer module and of the command receiving module were too short. As a result, the protocol state of some groups was abnormal.
Solution:
Workaround: Manually restart the linear MSP group protocol.
Solution: Upgrade the software of the cross-connect board.
That's all. Feel free to leave a message and share your experience in the comment area!
Thank you!