Check whether the eSight alarm list contains the alarm indicating that file replication failed in the OMMHA two-node cluster. If it does, rectify the fault described in the alarm information.
Check whether the network between the active and standby nodes is reachable. If the network is unreachable, rectify the network fault.
Log in to the active eSight server as the ossuser user.
Ensure that the standby server is running properly and the network is reachable. Run the ping command against the system IP address and the heartbeat IP address of the standby server.
ossuser@eSightServer:~> ping -c 4 10.137.63.225
PING 10.137.63.225 (10.137.63.225) 56(84) bytes of data.
64 bytes from 10.137.63.225: icmp_seq=1 ttl=64 time=0.477 ms
64 bytes from 10.137.63.225: icmp_seq=2 ttl=64 time=0.439 ms
64 bytes from 10.137.63.225: icmp_seq=3 ttl=64 time=0.437 ms
64 bytes from 10.137.63.225: icmp_seq=4 ttl=64 time=0.384 ms
--- 10.137.63.225 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2999ms
rtt min/avg/max/mdev = 0.384/0.434/0.477/0.036 ms
ossuser@eSightServer:~> ping -c 4 192.168.122.1
PING 192.168.122.1 (192.168.122.1) 56(84) bytes of data.
64 bytes from 192.168.122.1: icmp_seq=1 ttl=64 time=0.018 ms
64 bytes from 192.168.122.1: icmp_seq=2 ttl=64 time=0.016 ms
64 bytes from 192.168.122.1: icmp_seq=3 ttl=64 time=0.015 ms
64 bytes from 192.168.122.1: icmp_seq=4 ttl=64 time=0.014 ms
--- 192.168.122.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2998ms
rtt min/avg/max/mdev = 0.014/0.015/0.018/0.005 ms
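The two ping checks can also be scripted. The following is a minimal sketch, not part of eSight: the function names packet_loss and check_reachable are hypothetical, and the sketch assumes a Linux ping whose summary line has the form "N% packet loss", as in the output above.

```shell
#!/bin/sh
# Hypothetical helper (not part of eSight): read ping output on stdin and
# print the packet loss percentage found in the summary line.
packet_loss() {
    sed -n 's/.*[, ]\([0-9][0-9.]*\)% packet loss.*/\1/p'
}

# Hypothetical helper: ping each address given as an argument four times and
# report whether every reply was received.
check_reachable() {
    for ip in "$@"; do
        loss=$(ping -c 4 "$ip" 2>/dev/null | packet_loss)
        if [ "$loss" = "0" ]; then
            echo "$ip: reachable"
        else
            echo "$ip: unreachable or lossy (packet loss: ${loss:-unknown}%)"
        fi
    done
}
```

For example, check_reachable 10.137.63.225 192.168.122.1 probes the two example addresses used above; replace them with the system IP address and heartbeat IP address of your standby server.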
To avoid the impact of intermittent network disconnection, observe the network for 10 minutes after the network connection is restored to check whether the alarm is cleared.
If yes, the environment has been recovered and no manual rectification is required.
If no, the environment is abnormal. You need to perform the following steps to manually rectify the fault.
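The 10-minute observation can be partly automated on the network side. The following is a rough sketch under stated assumptions: observe_link is a hypothetical helper (not an eSight command), the 30-second polling interval is an arbitrary choice, and the default probe assumes the Linux iputils ping, where -W sets the reply timeout in seconds. It only counts failed probes; whether the alarm has been cleared must still be checked in the eSight alarm list.

```shell
#!/bin/sh
# Hypothetical helper (not part of eSight): probe a host repeatedly and print
# the number of failed probes.
# Usage: observe_link HOST [DURATION_SECONDS] [INTERVAL_SECONDS]
# The PROBE variable may override the probe command (default: one ping with a
# 2-second timeout).
observe_link() {
    host=$1
    duration=${2:-600}   # 10 minutes, as recommended above
    interval=${3:-30}    # assumed polling interval
    probe=${PROBE:-ping -c 1 -W 2}
    elapsed=0
    failures=0
    while [ "$elapsed" -lt "$duration" ]; do
        # $probe is intentionally unquoted so it splits into command and options.
        if ! $probe "$host" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "$failures"
}
```

For example, observe_link 10.137.63.225 prints 0 if none of the probes sent during the 10 minutes failed. A nonzero result suggests the link is still intermittent.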
Determine the correct host.
If the data replication failure alarm remains unhandled for a long time and O&M operations (for example, adding a device) are performed on eSight during this period, the data on the active and standby nodes diverges significantly. If an active/standby switchover then occurs, the original standby server becomes the new active server, and the results of those O&M operations are lost (for example, the added device is missing).
To prevent this problem, query service data based on run logs and select the eSight server with the latest service data as the active server.
Log in to the eSight web page.
Check whether key service information, such as NE configurations, performance data, and alarms, is normal.
If service data is lost (for example, an added device is missing), contact Huawei technical support.
Select a suitable rectification procedure based on the following scenarios:
No requirement on the service interruption time
Log in to the active server using the management port or VNC as the root user.
cd /opt/ommha/config
sh config.sh
No service interruption
Method 1: Log in to the active server and synchronize data.
Log in to the active server using the management port or VNC as the root user.
cd /opt/ommha/config
sh config_online.sh
Method 2: Log in to the standby server and synchronize data.
Log in to the standby server using the management port or VNC as the ossuser user.
Cancel the bash timeout setting.
Run the following command to query the OMMHA status and ensure that the local server is the standby server:
Run the following command to stop OMMHA:
Run the following commands to rebuild the database:
sed -i "s#ENABLE_SYSDBA_LOGIN[[:space:]]=.*#ENABLE_SYSDBA_LOGIN = TRUE#g" data/cfg/zengine.ini
python app/bin/zctl.py -t kill
rm -f data/archive_log/arch*.arc
python app/bin/zctl.py -t build
If the following information is displayed, the database is successfully rebuilt:
Successfully build database
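If the build output is saved to a file, the success message can be checked automatically. This is a minimal sketch: build_succeeded and the log path /tmp/zctl_build.log are hypothetical and not part of eSight or zctl.py. Only the message text is taken from the step above.

```shell
#!/bin/sh
# Hypothetical helper (not part of eSight or zctl.py): return success only if
# the given log file contains the success message quoted above.
build_succeeded() {
    grep -q "Successfully build database" "$1"
}

# Assumed usage, with a hypothetical log path:
#   python app/bin/zctl.py -t build 2>&1 | tee /tmp/zctl_build.log
#   build_succeeded /tmp/zctl_build.log && echo "database rebuilt"
```

Capturing the output with tee keeps the full build log available for Huawei technical support if the message does not appear.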
Start OMMHA.
Log out of the system.