Hi, guys!
This is a case about an unstable Fibre Channel link between a host and a disk array.
Problem Description
An alarm indicating that the link between the host and the disk array is unstable is generated on DeviceManager. The alarm may persist, or it may clear by itself at irregular intervals and then recur.
Alarm information
An alarm is generated on the disk array.
Alarm title: Link Between A Host And Storage Array Is Unstable
Alarm content: The link between the initiator (type FC, identifier 0x10000000c995b100) of the host (name hostname) and the host port (Engine 0, interface module A2, port number P0) is unstable.
Process
1. Check the possible causes one by one to determine the location where the bit error occurs.
2. Replace the optical fiber, optical module, and daughter card in the bit error range in sequence and check whether the bit error rate continues to increase.
Step 1: Check whether bit errors are increasing on the switch. Log in to the CLI of the Brocade or SNS series switch, run the statsclear command to clear the port statistics, and then run the porterrshow command to view the bit errors.
After the switch has run for a period of time (for example, one hour), run the porterrshow command again and check the port statistics. If the enc in and enc out counters increase rapidly, replace the cable or optical module for that port, or move the link to another port, to eliminate the bit errors.
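The before/after comparison in step 1 can be scripted. The following is only a rough sketch in Python: it assumes you have saved two porterrshow outputs as text, and it assumes that enc in is the third counter and enc out the ninth counter after the port number, which you should verify against the header printed by your own switch. The function names are made up for this example.

```python
import re

# ASSUMPTION: 0-based positions of the counters after the "N:" port label.
# Verify them against the porterrshow header on your switch.
ENC_IN_COL = 2
ENC_OUT_COL = 8

_SUFFIX = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def parse_count(token):
    """Convert a counter such as '12', '1.2k' or '3.4m' to an integer."""
    token = token.strip().lower()
    if token and token[-1] in _SUFFIX:
        return round(float(token[:-1]) * _SUFFIX[token[-1]])
    return int(token)


def parse_porterrshow(text):
    """Return {port: (enc_in, enc_out)} for every 'N: ...' row in the text."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+):\s+(.*)", line)
        if not match:
            continue  # header or blank line
        fields = match.group(2).split()
        if len(fields) <= ENC_OUT_COL:
            continue  # row is shorter than the assumed layout
        counters[int(match.group(1))] = (
            parse_count(fields[ENC_IN_COL]),
            parse_count(fields[ENC_OUT_COL]),
        )
    return counters


def growing_ports(before, after, threshold=0):
    """Return {port: (enc_in_delta, enc_out_delta)} for ports above threshold."""
    result = {}
    for port, (enc_in, enc_out) in after.items():
        old_in, old_out = before.get(port, (0, 0))
        delta = (enc_in - old_in, enc_out - old_out)
        if max(delta) > threshold:
            result[port] = delta
    return result
```

Feed the first snapshot and the one taken an hour later to parse_porterrshow, then pass both results to growing_ports. The ports it returns are the candidates for a cable, module, or port swap.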
Step 2: Check the bit errors of the corresponding port on the storage device. Check method: Determine the storage port based on the alarm information. For example, the alarm information is: The link between the initiator (type FC, identifier 0x10000000c995b100) of the host (name hostname) and the host port (Engine 0, interface module A2, port number P0) is unstable. This means that the link between the host and FC port P0 on interface module A2 of engine 0 is unstable.
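If you handle many of these alarms, the port location can be pulled out of the alarm content automatically. Here is a minimal sketch that assumes the alarm text always follows the wording shown in this case. The pattern and the function name are my own and are not part of DeviceManager.

```python
import re

# Pattern written against the alarm wording quoted in this post.
# \s+ is used between words so that line-wrapped alarm text still matches.
ALARM_PATTERN = re.compile(
    r"initiator\s*\(type\s+(?P<type>\w+),\s*identifier\s+(?P<wwn>0x[0-9a-fA-F]+)\)"
    r".*?host\s*\(name\s+(?P<host>[^)]+)\)"
    r".*?host\s+port\s*\(Engine\s+(?P<engine>\d+),"
    r"\s*interface\s+module\s+(?P<module>\w+),"
    r"\s*port\s+number\s+(?P<port>\w+)\)",
    re.DOTALL,
)


def parse_link_alarm(content):
    """Return the initiator and storage port fields, or None if no match."""
    match = ALARM_PATTERN.search(content)
    return match.groupdict() if match else None
```

For the alarm in this case, the result gives engine 0, module A2, and port P0, which is the port to check in the rest of step 2.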
Clear the bit error statistics of all Fibre Channel ports by following the instructions in the online help, which is available from the ? icon in the upper corner of the DeviceManager home page. Choose Manage Hardware Devices > Monitor Controllers > Manage FC Interface Modules.
After the system runs for a period of time (for example, one hour), check how much the line bit error statistics have increased for the port named in the alarm information. If the bit errors increase rapidly, replace the cable or optical module for that port, or move the link to another port, and check whether the bit errors stop.
Step 3: Observe for one hour after the bit errors on the switch and storage device are checked. If the alarm persists, or is cleared and then recurs, determine the controller that reports the link instability based on the alarm information in step 2. On the navigation bar of DeviceManager, choose Settings > Export Data > System Log to export the system log package, and then decompress it. Open the corresponding controller log (..\log_controller_x\Messages\messages_YYYYMMDDHHMMSS_mem, where YYYYMMDDHHMMSS is the timestamp; use the latest run log) and search for ABTS.
If the log contains the keyword ABTS (Abort Sequence), I/Os from the host are still timing out.
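The log search in step 3 can also be done with a small script. This sketch assumes the log package is already decompressed and that the run logs are readable as plain text. The function names are made up for this example, and the controller directory name is passed in because it depends on which controller reported the alarm.

```python
import re
from pathlib import Path

# Matches run log names of the form messages_YYYYMMDDHHMMSS_mem.
LOG_NAME = re.compile(r"messages_(\d{14})_mem")


def latest_run_log(package_root, controller_dir):
    """Return the run log with the newest timestamp in its name, or None."""
    messages_dir = Path(package_root) / controller_dir / "Messages"
    candidates = []
    for path in messages_dir.iterdir():
        match = LOG_NAME.match(path.name)
        if match and path.is_file():
            candidates.append((match.group(1), path))
    if not candidates:
        return None
    # Timestamps are fixed-width digits, so string order is time order.
    return max(candidates, key=lambda item: item[0])[1]


def find_keyword_lines(log_path, keyword="ABTS"):
    """Return (line_number, text) for every line that contains the keyword."""
    hits = []
    with open(log_path, "r", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if keyword in line:
                hits.append((number, line.rstrip()))
    return hits
```

An empty list from find_keyword_lines means no ABTS was logged in that run log. Any hits suggest that host I/Os are still timing out.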
Based on experience in analyzing this type of problem on live networks, the I/O timeout is caused by bit errors on the host and switch. The storage device cannot detect this type of bit error; it can only detect the host I/O timeout.
In this case, bit errors are suspected on the host HBA. Currently, there is no good general method for checking bit errors on host HBAs, because the method varies with the OS. Generally, replace the cables, optical modules, or ports on the host side to rectify the bit errors. Then collect the logs again to check whether the bit errors are cleared.
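As one OS-specific example, Linux hosts usually expose per-HBA link error counters under /sys/class/fc_host/hostN/statistics. The sketch below reads a few of them. Treat it as an assumption-laden illustration and not a general method: the counters that are present depend on the HBA driver, and the function name is my own.

```python
from pathlib import Path

# Link-level error counters that the Linux fc_host class commonly exposes.
# ASSUMPTION: your HBA driver populates them; some drivers do not.
ERROR_COUNTERS = (
    "invalid_crc_count",
    "invalid_tx_word_count",
    "link_failure_count",
    "loss_of_sync_count",
    "loss_of_signal_count",
)


def read_fc_host_errors(sysfs_root="/sys/class/fc_host"):
    """Return {host_name: {counter: value}} for each FC host that is found."""
    stats = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return stats  # not Linux, or no FC HBA present
    for host in sorted(root.iterdir()):
        counters = {}
        for name in ERROR_COUNTERS:
            path = host / "statistics" / name
            try:
                # sysfs reports these values as hexadecimal strings, e.g. 0x1f
                counters[name] = int(path.read_text().strip(), 16)
            except (OSError, ValueError):
                continue  # counter missing or unreadable on this driver
        stats[host.name] = counters
    return stats
```

As with the switch and storage checks, take one reading, wait about an hour, and take another. A counter that keeps growing points at the host-side cable, module, or port.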
Step 4: Observe for one hour after the bit errors of the switches, storage systems, and hosts are checked. If the alarm persists, or is cleared and then recurs, contact Huawei technical support.
This is my solution, how about yours? Go ahead and share it with us!