Hello,
Today I would like to share a case in which some optical ports intermittently reported the lsr_will_die alarm because of a software problem on the data board.
Problem Description
On OSN3500 equipment deployed on the customer's live network, the pex1 and peg16 boards intermittently reported the lsr_will_die alarm several times, for about 20 seconds. After the bias current threshold was modified, some optical ports on the live network still reported the alarm. The customer attached great importance to the issue and required a root cause analysis and a thorough solution.
Host version: v200r011c00spc200
Data boards: ssn1peg16, ssn1pex1
Alarm information
lsr_will_die
Handling process
Temporary solution: mask the lsr_will_die alarms of the GE and 10GE ports on the NMS, set the alarm threshold to 900, enable lsr_bcm_alm alarm monitoring on the NMS, and judge whether an optical module is abnormal by observing the lsr_bcm_alm alarm.
Complete solution: a patch or a later version needs to be developed to fix the abnormal floating-point operation.
Root cause
The lsr_will_die alarm is reported when the bias current exceeds the configured threshold. It indicates that the laser is approaching the end of its life. The alarm does not mean that the optical module will fail immediately; the module can continue to be used for a period of time, and a replacement optical module should be prepared during that time.
Based on the symptoms on the live network and the data collected by the front-line engineers, multiple sites reported the alarm at the same time, the optical modules had been in use for no more than 2 years (the life of an optical module is generally 3 to 5 years), and the alarm was intermittent. It was therefore preliminarily inferred that multiple optical modules were unlikely to have failed at the same time, and that the problem should be located in both the software and the hardware of the board.
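The reasoning above (short-lived alarms raised at several sites at about the same time) can be sketched as a small check over exported alarm records. This is only an illustration: the record fields (`site`, `raise_s`, `clear_s`), the 30-second and 60-second limits, and the sample data are all invented here and do not come from the NMS or from the live network.

```python
from collections import defaultdict


def transient_alarms(records, max_duration_s=30):
    """Keep alarms that cleared by themselves within max_duration_s seconds."""
    return [r for r in records if r["clear_s"] - r["raise_s"] <= max_duration_s]


def sites_by_window(records, window_s=60):
    """Group site names by the time window in which their alarm was raised."""
    windows = defaultdict(set)
    for r in records:
        windows[r["raise_s"] // window_s].add(r["site"])
    return {w: sorted(sites) for w, sites in windows.items()}


# Invented sample records; times are seconds from an arbitrary start point.
records = [
    {"site": "site-A", "raise_s": 0, "clear_s": 20},
    {"site": "site-B", "raise_s": 5, "clear_s": 24},
    {"site": "site-C", "raise_s": 12, "clear_s": 33},
    {"site": "site-D", "raise_s": 4000, "clear_s": 9000},
]

flashing = transient_alarms(records)
grouped = sites_by_window(flashing)
print(grouped)
```

Several sites landing in the same window with short-lived alarms points away from independent optical module failures and toward a common cause.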
Hardware aspect:
1. Analyze the manufacturing information of the boards. Some optical modules were found to be unsupported by the data board, and the same board used optical modules from several different manufacturers. This is a potential risk, but it is not the key to the problem, because the optical modules in question also work normally at other sites.
2. Set up a mirror environment in the laboratory to reproduce the alarm.
3. Run a temperature-chamber test in the laboratory to observe the relationship between temperature and the alarm.
Software aspect:
1. The software developers reviewed the code. In theory, the lsr_will_die alarm and the lsr_bcm_alm alarm should appear as a pair. Although lsr_bcm_alm is masked on the live network, it can still be viewed through the Navigator.
2. In coordination with the front-line engineers, the bias current high-alarm threshold was set to 1 on the live network, and the alarm was reported after about 25 seconds. In this test the bias current high alarm and the laser end-of-life alarm appeared as a pair. In the logs of the earlier alarms on the live network, however, the two alarms did not appear as a pair. The only difference between the two alarms in the judgment logic is the additional floating-point operations, so the floating-point operations are the prime suspect.
3. Laboratory simulation gave results consistent with this analysis, so the lsr_will_die alarm can be judged to be a false alarm caused by an abnormal floating-point operation. An abnormal power_abnormal alarm report caused by an abnormal floating-point operation has also occurred at other sites in the past. A temporary version that prints the floating-point operation results was run in the laboratory, and abnormal floating-point results were observed there as well.
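The pairing check described in the steps above can be sketched as a scan over an alarm log. This is a minimal sketch under assumptions: the event layout (`port`, `alarm`, `time`), the 30-second matching window, and the sample log are invented for illustration and are not the actual log format of the board or the NMS.

```python
from datetime import datetime, timedelta


def unpaired_alarms(events, primary="lsr_will_die", partner="lsr_bcm_alm",
                    window=timedelta(seconds=30)):
    """Return primary alarms with no partner alarm on the same port within window."""
    partners = [e for e in events if e["alarm"] == partner]
    unpaired = []
    for e in events:
        if e["alarm"] != primary:
            continue
        matched = any(
            p["port"] == e["port"] and abs(p["time"] - e["time"]) <= window
            for p in partners
        )
        if not matched:
            unpaired.append(e)
    return unpaired


# Invented sample log: one paired alarm and one lsr_will_die without a partner.
t0 = datetime(2020, 1, 1, 12, 0, 0)
log = [
    {"port": "slot3-port1", "alarm": "lsr_will_die", "time": t0},
    {"port": "slot3-port1", "alarm": "lsr_bcm_alm", "time": t0 + timedelta(seconds=2)},
    {"port": "slot5-port2", "alarm": "lsr_will_die", "time": t0 + timedelta(minutes=5)},
]

suspects = unpaired_alarms(log)
print([s["port"] for s in suspects])
```

An lsr_will_die event with no matching lsr_bcm_alm event is the pattern that made the alarm look like a false report rather than a real bias current excursion.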
So far, it can be determined that the lsr_will_die alarm is a false alarm caused by floating-point arithmetic exceptions.
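To illustrate how two checks on the same bias current reading can disagree when only one of them goes through floating-point arithmetic, here is a hypothetical sketch. It is not the board software: the function names, the `scale` factor, and the use of NaN as the abnormal result are assumptions, and the post does not say which kind of floating-point abnormality actually occurred or which comparison form the real code uses.

```python
import math


def bias_high_alarm(raw_bias: int, threshold: int) -> bool:
    """Integer-only comparison (stand-in for the lsr_bcm_alm judgment)."""
    return raw_bias > threshold


def laser_eol_alarm(raw_bias: int, threshold: int, scale: float) -> bool:
    """Same judgment after a floating-point conversion (stand-in for lsr_will_die)."""
    scaled = raw_bias * scale
    # Written as "not within limit": any comparison with NaN is False,
    # so an abnormal NaN result raises the alarm here.
    return not (scaled <= threshold)


def laser_eol_alarm_guarded(raw_bias: int, threshold: int, scale: float) -> bool:
    """Hypothetical guard: ignore samples whose floating-point result is not finite."""
    scaled = raw_bias * scale
    if not math.isfinite(scaled):
        return False  # treat as an invalid sample instead of raising an alarm
    return scaled > threshold


raw, limit = 100, 900                  # reading well below the threshold
normal = laser_eol_alarm(raw, limit, 1.0)
abnormal = laser_eol_alarm(raw, limit, float("nan"))  # simulated abnormal result
print(bias_high_alarm(raw, limit), normal, abnormal,
      laser_eol_alarm_guarded(raw, limit, float("nan")))
```

With a healthy reading, the integer check and the normal floating-point check both stay quiet, while the simulated abnormal result raises only the floating-point alarm, which matches the unpaired pattern seen in the logs. The guarded variant is only one possible defensive style; the actual patch may take a different approach.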
Suggestions and conclusions
The lsr_will_die alarm on the optical ports of the data board was designed with reference to the PTN product. Data boards of the traditional MSTP product do not have this alarm, so the applicability and sensitivity of an alarm should be considered when it is designed. In particular, some sensitive overseas customers took this alarm very seriously and asked for the root cause, which caused us unnecessary trouble. It is recommended to rename the alarm or to mask it directly.
In addition, there must be a unified and effective standard for setting alarm thresholds. It is best not to modify a threshold lightly, because doing so can easily cause dissatisfaction and doubt among customers.
You are welcome to leave a comment and discuss in the comment area. Thank you!



