Symptom:
After an IB switch is powered off, I/O access in the storage pool is abnormal, the database cluster restarts, and services are abnormal.
Diagnosis:
Check whether the network is faulty or multiple nodes are faulty. If not, this section does not apply.
Use PuTTY to log in to a storage node by running the ssh User name@IP address command, where User name is the login user name for the node and IP address is the management IP address of the node. Enter the password for that user when prompted.
Run the following command to query the MDC node to which the faulty storage pool belongs:
mdc_cmd.sh 165 -1
If the command output contains the mapping between the storage pool ID and the storage IP address of the owning MDC node, log in to the MDC node.
Run the following command to switch to the log directory of the MDC node:
cd /var/log/dsware/plog/mdc/bak
Run the following command to check whether logs generated around the time of the fault show that the storage pool status failed to be reported:
zcat * | grep -a "connect zk less 60s, cann't return incorrect pool status"
If such logs exist, rectify the fault by following the operations described in Solution.
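The log search above prints matching lines but not the files they came from. The following is a minimal sketch, not part of the product tooling, that lists the archived log files containing the message so that their timestamps can be compared with the time of the fault. The function name find_delayed_status_logs is hypothetical. The default directory and the search string are taken from the steps above, and the sketch assumes GNU gzip and grep are available on the node.

```shell
#!/bin/sh
# Sketch only: find_delayed_status_logs is a hypothetical helper name.
# Prints the path of every file in the log directory that contains
# the delayed pool status message.
find_delayed_status_logs() {
    # Default directory is the MDC log backup directory from the steps above.
    log_dir="${1:-/var/log/dsware/plog/mdc/bak}"
    pattern="connect zk less 60s, cann't return incorrect pool status"
    for f in "$log_dir"/*; do
        [ -f "$f" ] || continue
        # gzip -dcf decompresses gzip files and passes other files through unchanged.
        if gzip -dcf -- "$f" 2>/dev/null | grep -a -q -F "$pattern"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling find_delayed_status_logs with no argument searches the default directory. Passing a directory path as the first argument searches that directory instead.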
Cause:
After the network recovers from the fault, the MDC node cannot immediately obtain the accurate status of the storage pool, so it reports the storage pool status one minute later. Because upper-layer services depend on the storage pool status, they cannot complete the startup process in a timely manner.
Solution:
Run the following command to check whether the storage pool status is normal (pool_id indicates the storage pool ID):
mdc_cmd.sh 120 pool_id
If pool_status in the command output does not contain STATUS = POOL HAS FAILURE PT, the storage pool status is normal.
Restart the affected services and check whether the services are restored.
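The storage pool status check in the first step of this solution can be scripted. The following is a minimal sketch under stated assumptions: the function name pool_status_is_normal is hypothetical, and the only thing it relies on is the failure marker string quoted above. It makes no other assumption about the format of the command output.

```shell
#!/bin/sh
# Sketch only: pool_status_is_normal is a hypothetical helper name.
# Reads the output of "mdc_cmd.sh 120 pool_id" on standard input.
# Returns 0 if the failure marker is absent, 1 if it is present,
# and 2 if there was no output to check.
pool_status_is_normal() {
    output=$(cat)
    if [ -z "$output" ]; then
        echo "no command output to check" >&2
        return 2
    fi
    if printf '%s\n' "$output" | grep -a -q -F "STATUS = POOL HAS FAILURE PT"; then
        return 1
    fi
    return 0
}
```

For example, mdc_cmd.sh 120 pool_id | pool_status_is_normal && echo "pool status normal" prints the message only when the marker is absent. Empty output is treated as an error rather than as a normal status, so a failed query is not mistaken for a healthy pool.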
Check After Recovery:
Check whether the upper-layer services are restored.