Hi team, here's a new case: the Bypass function of the NFVI upper-layer cloud core does not take effect.
Problem Symptom
In an NFVI scenario, after all storage nodes are powered off,
all upper-layer cloud core VMs hang, and the Bypass function does not take effect.
Problem Diagnosis
1. Based on the logs, confirm that all storage nodes are powered off. I/Os on a compute node keep returning error codes to the operating system.
2. The message log of the compute node's operating system contains no I/O error records, indicating that the operating system keeps retrying the I/Os and does not return the errors to the upper layer.
3. The Bypass function is triggered only when the accessed block device reports an I/O error. Because the operating system keeps retrying the I/Os and never returns an error, the Bypass function does not take effect and the upper-layer VMs hang.
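The check in step 2 can be sketched as a small shell helper. This is only an illustration under assumptions: the helper name count_io_errors is hypothetical, the message log is assumed to be /var/log/messages, and kernel I/O failures are assumed to contain the text "I/O error". Adjust the path and pattern to your environment.

```shell
#!/bin/sh
# Hypothetical helper: count I/O error records in an OS message log.
# Assumptions (not from the original case): the log is /var/log/messages
# and kernel I/O failures are logged with the text "I/O error".
count_io_errors() {
    log_file="${1:-/var/log/messages}"
    # grep -c prints the number of matching lines. It exits with status 1
    # when nothing matches, so "|| true" keeps the helper's status clean.
    grep -c 'I/O error' "$log_file" || true
}
```

For example, run count_io_errors /var/log/messages on the compute node. A count of 0 while all storage nodes are powered off matches the symptom described above: the I/Os are being retried rather than failed back to the upper layer.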
Causes
Storage systems of version 6.3 allow I/O suspension to be disabled when interconnected with NFVI/HCS. With I/O suspension disabled, I/O errors are returned within 120s of a storage fault, so that upper-layer applications can detect the fault and enable the Bypass function. However, I/O suspension cannot be disabled in 8.0 versions earlier than FusionStorage 8.0.1.SPH8.
Solution
1. Log in to the active FSM node and run the su - dsware command to switch to user dsware.
2. Run the /opt/dsware/client/bin/dswareTool.sh --op queryGlobalParameters -paraName g_dsware_io_hanging_switch command to check whether I/O suspension is disabled. If it is enabled, contact the owner of the upper-layer application to disable I/O suspension so that the Bypass function can take effect. If it is disabled, go to the next step.
3. Check the current storage system version. If the version is earlier than FusionStorage 8.0.1.SPH8, I/O suspension cannot be disabled, and you need to upgrade the system to FusionStorage 8.0.1.SPH8 or later.
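The query in step 2 can be wrapped as follows, as a minimal sketch. The command line is taken verbatim from step 2. The function name query_io_hanging_switch and the DSWARE_TOOL variable are hypothetical conveniences. The tool's output format is not described in this case, so the sketch prints it unmodified instead of parsing it.

```shell
#!/bin/sh
# Run on the active FSM node as user dsware (step 1: su - dsware).
# DSWARE_TOOL is a hypothetical override; the default path is the one from step 2.
DSWARE_TOOL="${DSWARE_TOOL:-/opt/dsware/client/bin/dswareTool.sh}"

# Hypothetical wrapper around the query from step 2.
query_io_hanging_switch() {
    # The output format is not documented in this case, so it is shown as-is
    # for the operator to read.
    "$DSWARE_TOOL" --op queryGlobalParameters -paraName g_dsware_io_hanging_switch
}
```

After switching to user dsware, call query_io_hanging_switch and read the reported value of g_dsware_io_hanging_switch to decide which branch of step 2 applies.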
Check After Recovery
After the system is upgraded to a version that supports disabling I/O suspension, simulate a scenario where all storage nodes are powered off and verify that the Bypass function of the cloud core works normally.
Applicable Versions
Versions earlier than FusionStorage 8.0.1.SPH8.
This is my solution. How about yours? Go ahead and share it with us!