Symptom:
In a three-network deployment (where the storage network is split into front-end and back-end storage networks) with I/O suspension (matched with HCS or NFV) disabled, a back-end storage network fault causes the redundancy of a storage pool to exceed the threshold. As a result, the storage pool stays in the degraded state for a long time and I/Os are suspended without any I/O error being returned.
Diagnosis:
Check whether the system uses a three-network deployment (the storage network is split into front-end and back-end storage networks), whether I/O suspension (matched with HCS or NFV) is disabled, and whether a back-end storage network fault alarm has been generated. If any of these conditions is not met, this section does not apply.
Run the following command to query the MDC node to which the faulty storage pool belongs:
mdc_cmd.sh 165 -1
If the command output contains the mapping between the storage pool ID and the storage IP address of the owning MDC node, log in to the MDC node.
Run the following command to switch to the log directory of the MDC node:
cd /var/log/dsware/plog/mdc/bak
Check whether the message "record down will over redundancy, cannot down" is printed by the mdc_handle_debug_be_check_notify_event function around the time when the fault occurred. If yes, the problem described in this section has occurred.
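The log check above can be sketched as a small shell helper. This is a minimal sketch, not a vendor tool: the function name find_redundancy_block is hypothetical, it assumes the log files in the directory are uncompressed text, that the function name and the message appear on the same log line, and that grep supports recursive search (-r, as in GNU grep). Archived or compressed logs would need to be unpacked first.

```shell
# Hypothetical helper: search MDC logs for the message that indicates
# this problem. The default directory is the one named in the procedure.
find_redundancy_block() {
    log_dir="${1:-/var/log/dsware/plog/mdc/bak}"
    # -r: search every file under the directory; -F: match the text literally.
    # Assumption: message and function name are on the same line.
    grep -rF -- 'record down will over redundancy, cannot down' "$log_dir" 2>/dev/null \
        | grep -F 'mdc_handle_debug_be_check_notify_event'
}
```

Calling find_redundancy_block with no argument searches the default directory. A non-empty result (exit status 0) means a matching line was found; compare its timestamp with the time of the back-end network fault before concluding that this section applies.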
Cause:
If the redundancy of a storage pool exceeds the threshold due to a back-end storage network fault, the current version of the system ignores the fault. Therefore, if I/O suspension is disabled, subsequent storage pool faults cannot be reported and I/Os keep retrying. As a result, no I/O error is returned.
Solution:
Log in to the active FSM node using its floating IP address as user dsware.
Run the following command to go to the specified directory:
cd /opt/dsware/client/bin
Run the following command to set the abort timeout interval:
./dswareTool.sh --op globalParametersOperation -opType modify -parameter abort_timeout:90
Restore the back-end network of the faulty node and check whether services in the storage pool are restored.
Check After Recovery:
Check whether the storage pool status is restored to normal.