Hello, guys!
Have a nice day!
I ran into a problem a while ago that has since been resolved, and I want to share it with you!
Symptom
A large number of cameras went offline, and users could not log in to the IVS client for about half an hour.
Analysis
The SMU log shows that the service restarted at 12:44:41, 12:56:43, and 13:52:02.
The server detection script log shows a time jump at 12:36:57: the system time changed from 12:32:06 to 12:36:57, which triggered the service restart protection mechanism and restarted all service modules.
The server NTP logs confirm the time jump at 12:36:57.
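To put a number on the jump, the two timestamps from the detection script log can be subtracted. The sketch below is only an illustration: the function names and the 5-second tolerance are my own assumptions, not the logic of the actual VCN detection script. It also shows one common way to detect a clock step, by comparing how much wall-clock time passed against how much monotonic time passed.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def jump_seconds(before: str, after: str) -> float:
    """Size of a clock step in seconds (positive means the clock moved forward)."""
    delta = datetime.strptime(after, FMT) - datetime.strptime(before, FMT)
    return delta.total_seconds()


def is_time_jump(wall_elapsed: float, monotonic_elapsed: float,
                 tolerance: float = 5.0) -> bool:
    """Hypothetical check: the wall clock and the monotonic clock should
    advance by about the same amount between two samples. A large gap
    means the wall clock was stepped. The tolerance is an assumed value."""
    return abs(wall_elapsed - monotonic_elapsed) > tolerance


# Timestamps taken from the detection script log quoted above.
step = jump_seconds("12:32:06", "12:36:57")
print(f"clock stepped by {step:.0f} s")
```

So the server clock moved forward by a little under five minutes in a single step, which is far more than any reasonable tolerance would allow.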
Because the master server is a two-node VMU, the two-node software must track the status of every service to judge whether the services are running. Each service takes a different amount of time to stop and start, so a full restart of the two-node VMU takes about 20 minutes.
The other MPUs use the VMU floating IP as their NTP clock source. Once NTP synchronized the new time to each MPU, the MPU server time changed as well, so their services also restarted and the cameras went offline.
The customer site uses a Windows server as the NTP clock source, and the customer adjusted the time on that NTP server at around 12:30, which caused the VCN server time to jump.
After the services of the two-node VMU recovered, a large number of CU/eSDK users logged in. The MPU services had only just recovered at that point and were reconnecting to the SMU. After each reconnection, the SMU reported all online user information to that MPU's SCU module.
Because there were so many online users and device groups, the reporting took longer than the maximum thread detection time of 700s.
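The thread detection behaviour described above can be pictured as a simple watchdog. This is a minimal sketch under my own assumptions: the class names are hypothetical, the 800 s report duration is an invented example (the logs only tell us it exceeded 700s), and the real SMU implementation is not known to me.

```python
import time


class FakeClock:
    """Manually advanced clock so the example runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


class ThreadWatchdog:
    """Flags a worker as hung if no heartbeat arrives within timeout_s."""

    def __init__(self, timeout_s: float, clock=time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_beat = clock()

    def heartbeat(self) -> None:
        """Called by the worker whenever it finishes a unit of work."""
        self._last_beat = self._clock()

    def is_hung(self) -> bool:
        return self._clock() - self._last_beat > self.timeout_s


def report_trips_watchdog(report_duration_s: float, timeout_s: float) -> bool:
    """Simulate one long report with no heartbeat in between."""
    clock = FakeClock()
    watchdog = ThreadWatchdog(timeout_s, clock)
    clock.now += report_duration_s  # the thread is busy reporting
    return watchdog.is_hung()


# 800 s is an assumed duration, used only for illustration.
print(report_trips_watchdog(800, timeout_s=700))   # True: thread gets restarted
print(report_trips_watchdog(800, timeout_s=1000))  # False: report can finish
```

A thread that is busy but healthy looks the same to this kind of watchdog as a thread that is stuck, which is why a long report right after recovery can trigger another restart.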
Root Cause
A time jump on the NTP server changed the VCN server time, which caused the VCN service modules to restart automatically. As a result, the IPCs went offline and IVS client logins failed.
While the services were recovering, the reporting of online user information took so long that it triggered a second restart of the SMU thread. Services recovered once all servers had finished time synchronization.
Solution
Modify the NTP parameters of the two-node VMU so that NTP uses micro-synchronization, that is, it adjusts the clock by only 5 milliseconds per synchronization.
Extend the SMU thread timeout limit of the two-node VMU to 1000s by modifying the file /home/ivs_smu/config/service.xml.
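The idea behind micro-synchronization is that the clock is nudged by a small capped amount each time instead of being stepped in one go. Here is a rough sketch of that idea, assuming the 5 ms cap from above and the roughly 291 s jump seen in this case. The function names are hypothetical and this is not the actual VCN NTP code.

```python
def slew_ms(offset_ms: int, max_adjust_ms: int = 5) -> int:
    """Correction applied in one sync, capped at max_adjust_ms either way."""
    return max(-max_adjust_ms, min(max_adjust_ms, offset_ms))


def syncs_to_converge(offset_ms: int, max_adjust_ms: int = 5) -> int:
    """Number of syncs needed to remove the whole offset (ceiling division)."""
    return -(-abs(offset_ms) // max_adjust_ms)


offset_ms = 291_000  # 12:32:06 -> 12:36:57, expressed in milliseconds
print(slew_ms(offset_ms))            # 5: never more than 5 ms at once
print(syncs_to_converge(offset_ms))  # 58200 syncs to absorb the full jump
```

The trade-off is that a large offset takes a long time to correct, but the server clock never moves far enough in one step to trigger the service restart protection.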
That is all, thanks for reading!