Hello, everyone!
This post summarizes some FAQs about UltraPath link degrade scenarios and recovery mechanisms.
Question:
Why are multipathing links set to Degraded? How do they recover automatically?
Answer:
When the host multipathing software detects that a path is abnormal, it sets the path to the Degraded state and sends an alarm to the storage system indicating that the link is unstable. If the host multipathing software later detects that the link has recovered, it pushes a link recovery alarm to the storage system.
The common link degrade and recovery conditions are as follows:
1. Busy degrade
A path in the Busy state is displayed as Degraded in the logical path.
Search for the keywords “to busy” and “XMP_PATH_IO_BUSY” in the xmp_log0.txt file of Huawei UltraPath, and check when the error and the alarm were generated. If these keywords appear, the logical path of a VLUN is in the Busy state.
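The keyword search above can be scripted. The following is a minimal sketch, assuming only that the log is a plain-text file; the function names find_keyword_lines and scan_file are hypothetical, and no assumption is made about the layout of each log line, so matching is a plain substring test.

```python
from typing import Iterable, List, Tuple

# Keywords quoted in this post for the Busy degrade case.
BUSY_KEYWORDS = ("to busy", "XMP_PATH_IO_BUSY")


def find_keyword_lines(
    lines: Iterable[str],
    keywords: Tuple[str, ...] = BUSY_KEYWORDS,
) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line containing any keyword."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(keyword in line for keyword in keywords):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path: str, keywords: Tuple[str, ...] = BUSY_KEYWORDS):
    """Scan a log file on disk; the path is supplied by the caller."""
    # errors="replace" keeps the scan going if the log has odd bytes.
    with open(path, encoding="utf-8", errors="replace") as log:
        return find_keyword_lines(log, keywords)
```

Calling scan_file("xmp_log0.txt") from the directory that holds the log returns the matching lines, whose timestamps can then be compared with the alarm time. Passing a different keywords tuple reuses the same scan for other keyword pairs.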
Alarm principles:
When the storage device returns an I/O BUSY error code, the error handling process is started. When an I/O returns a BUSY error for the first time, the I/O is retried. If the I/O still returns BUSY for more than 120 seconds, the logical path is set to Degraded.
For a path degraded due to Busy, if the path stays normal for a period of time (120 minutes), the path is restored to the Normal state.
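To make the two timers concrete, here is a rough sketch of the Busy degrade and recovery rule as a small state tracker. It is not UltraPath code: the class BusyPathTracker and its method are hypothetical, and only the 120-second and 120-minute values come from the description above.

```python
BUSY_DEGRADE_SECONDS = 120        # BUSY persisting longer than this degrades the path
BUSY_RECOVER_SECONDS = 120 * 60   # path must stay normal this long to recover


class BusyPathTracker:
    """Hypothetical model of the Busy degrade/recover timers."""

    def __init__(self):
        self.state = "normal"
        self.busy_since = None    # time of the first BUSY in the current run
        self.clean_since = None   # time of the first good I/O after degrading

    def on_io_result(self, now: float, busy: bool) -> str:
        if busy:
            self.clean_since = None
            if self.busy_since is None:
                self.busy_since = now          # first BUSY: start retrying
            elif now - self.busy_since > BUSY_DEGRADE_SECONDS:
                self.state = "degraded"
        else:
            self.busy_since = None
            if self.state == "degraded":
                if self.clean_since is None:
                    self.clean_since = now
                elif now - self.clean_since >= BUSY_RECOVER_SECONDS:
                    self.state = "normal"
                    self.clean_since = None
        return self.state
```

Timestamps are plain seconds passed in by the caller, which keeps the sketch easy to test without waiting for real time to pass.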
2. I/O timeout degrade
I/O timeout degradation is displayed as Degraded in the logical path.
Search for the keywords “to timeout” and “XMP_PATH_IO_TIMEOUT” in the xmp_log0.txt file of Huawei UltraPath, and check when the error and the alarm were generated. If these keywords appear, the logical path of a VLUN is degraded due to I/O timeout.
Alarm principles:
When an I/O error is returned to UltraPath and the I/O processing time exceeds 60 seconds, timeout processing is triggered and the logical path is set to Degraded.
Detection I/Os are then sent periodically for 10 minutes. If no I/O timeout occurs within those 10 minutes, the path status is restored to Normal and no alarm is cleared.
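The timeout rule can be written as two small checks. This is only a sketch under stated assumptions: is_timeout and path_state are hypothetical names, the 60-second and 10-minute values come from the description above, and recovery is modeled as "10 minutes have passed since the last timeout".

```python
from typing import Sequence

IO_TIMEOUT_SECONDS = 60             # processing time beyond this counts as a timeout
RECOVERY_WINDOW_SECONDS = 10 * 60   # timeout-free period needed to recover


def is_timeout(io_failed: bool, elapsed_seconds: float) -> bool:
    """An I/O counts as timed out if it failed and took longer than 60 s."""
    return io_failed and elapsed_seconds > IO_TIMEOUT_SECONDS


def path_state(now: float, timeout_times: Sequence[float]) -> str:
    """Return 'degraded' if a timeout occurred within the last 10 minutes."""
    if not timeout_times:
        return "normal"
    last_timeout = max(timeout_times)
    if now - last_timeout < RECOVERY_WINDOW_SECONDS:
        return "degraded"
    return "normal"
```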
3. Link unstable degrade
Unstable link degradation is displayed as Degraded in the logical path.
Search the xmp_log0.txt file for the keywords “to not steady” and “XMP_PATH_IO_NOT_STEADY”, and check when the error and the alarm were generated. If these keywords appear, the logical path of a VLUN is unstable and degraded.
Alarm principles:
If the link is disconnected three times within 30 minutes, the link is considered unstable and is degraded.
If the link is not disconnected again within one hour, the link status is restored to Normal, and the link alarm is not cleared.
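The "three disconnections within 30 minutes" rule is a sliding-window count. The sketch below illustrates it with a deque; LinkStabilityTracker and its methods are hypothetical names, and only the thresholds (three disconnections, 30 minutes, one hour) come from the description above.

```python
from collections import deque

DISCONNECT_LIMIT = 3               # disconnections that mark the link unstable
WINDOW_SECONDS = 30 * 60           # window in which they are counted
RECOVER_SECONDS = 60 * 60          # disconnect-free period needed to recover


class LinkStabilityTracker:
    """Hypothetical model of the unstable-link degrade/recover rule."""

    def __init__(self):
        self.state = "normal"
        self.disconnects = deque()   # timestamps inside the current window
        self.last_disconnect = None

    def on_disconnect(self, now: float) -> str:
        self.last_disconnect = now
        self.disconnects.append(now)
        # Drop disconnections that fell out of the 30-minute window.
        while now - self.disconnects[0] > WINDOW_SECONDS:
            self.disconnects.popleft()
        if len(self.disconnects) >= DISCONNECT_LIMIT:
            self.state = "degraded"
        return self.state

    def on_tick(self, now: float) -> str:
        """Periodic check: recover after one hour without a disconnection."""
        if self.state == "degraded" and now - self.last_disconnect >= RECOVER_SECONDS:
            self.state = "normal"
            self.disconnects.clear()
        return self.state
```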
4. Degrade due to unknown I/O errors
A path degraded due to unknown I/O errors is displayed as Degraded in the logical path.
Search for the keywords “to unknown error” and “XMP_PHYPATH_IO_UNKNOWN” in the xmp_log0.txt file of Huawei UltraPath, and check when the error and the alarm were generated. If these keywords appear, the logical path of a VLUN is degraded due to unknown I/O errors.
Alarm principles:
If more than 20% of 5000 I/Os return unknown errors, the logical path is degraded.
If no unknown I/O error is generated within 10 minutes, the path status is restored to Normal and no alarm is cleared.
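As a worked example of the threshold, the sketch below assumes the rule means "more than 20% of a sample of 5000 I/Os returned unknown errors"; that reading, and the function name should_degrade, are assumptions rather than documented UltraPath behavior.

```python
SAMPLE_IOS = 5000            # sample size taken from the rule above
ERROR_PERCENT_LIMIT = 20     # degrade when the error share exceeds this


def should_degrade(unknown_errors: int, total_ios: int = SAMPLE_IOS) -> bool:
    """Return True when unknown errors exceed 20% of the sampled I/Os."""
    if total_ios <= 0:
        raise ValueError("total_ios must be positive")
    # Integer comparison avoids floating-point rounding at the boundary.
    return unknown_errors * 100 > ERROR_PERCENT_LIMIT * total_ios
```

Under this reading, 1000 unknown errors out of 5000 I/Os is exactly 20% and does not degrade the path, while 1001 does.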
That's all, thanks!