Intermittent Storage Replication Link Disconnection and High Host Latency Caused by SAN Network Exceptions

[Problem Description]

The storage replication link is intermittently disconnected and the host latency is high.


[Fault Symptom]

HyperMetro replication links on Huawei storage devices are intermittently disconnected and recover automatically within a short time.

2019-09-04 10:04 0xF0E10001 Fault Major Recovered 2019-09-04 10:04 Replication link (link ID 768, local controller 0D, local port CTE0.L4.IOM1.P1, remote controller 0D, remote port CTE0.L4.IOM1.P1, remote device name STO392-8901-OS6800-01, serial number 2102350UHP10HA000009) is disconnected. Therefore, the remote device cannot be accessed.

2019-09-04 10:04 0xF0E10001 Fault Major Recovered 2019-09-04 10:04 Replication link (link ID 512, local controller 0C, local port CTE0.R4.IOM1.P1, remote controller 0C, remote port CTE0.R4.IOM1.P1, remote device name STO392-8901-OS6800-01, serial number 2102350UHP10HA000009) is disconnected. Therefore, the remote device cannot be accessed.


[Root Cause]

The SAN network is abnormal. As a result, services time out.


[Location Method]

1. Analyze the storage logs. The HyperMetro heartbeat detection I/O timed out five consecutive times, which caused the HyperMetro replication link to be disconnected.
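
The "five consecutive timeouts" rule in step 1 can be sketched as follows. This is only an illustration of the counting logic, not the storage firmware's actual implementation; the function name and the boolean input format are assumptions made for this example.

```python
def link_should_disconnect(heartbeat_results, threshold=5):
    """Return True once `threshold` heartbeats in a row have timed out.

    heartbeat_results: iterable of booleans, True = heartbeat answered,
    False = heartbeat timed out (hypothetical input format).
    threshold: 5 matches the count reported in the storage logs.
    """
    consecutive = 0
    for ok in heartbeat_results:
        # A successful heartbeat resets the counter; a timeout extends the run.
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= threshold:
            return True
    return False
```

A single successful heartbeat resets the run, which is consistent with links that drop only when the network stays slow for several detection cycles and then recover quickly.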

2. A large number of ABTS (Abort Sequence) frames are found in the logs of both storage devices, which indicates that the timeouts are related to the network.

[2019-09-04 10:04][43123679.282249] [][15000000c3879][INFO][Receive ABTS. LPort(0x110101_0x21100)<---RPort(0x19341). OX_ID(0x120a), RX_ID(0xffff).][FC_UNF][UNF_RcvBlsReq,15644][110101_ElsTx]
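
To see which port pairs receive the most ABTS frames, log lines of the form shown above can be counted with a short script. The following is a minimal sketch: the regular expression is derived only from the single sample line above, so it is an assumption that other ABTS entries follow the same layout.

```python
import re
from collections import Counter

# Pattern inferred from the one sample log line; adjust if real logs differ.
ABTS_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\].*Receive ABTS\. "
    r"LPort\((?P<lport>[^)]+)\)<---RPort\((?P<rport>[^)]+)\)"
)

def count_abts(lines):
    """Count ABTS entries per (timestamp, local port, remote port)."""
    counts = Counter()
    for line in lines:
        match = ABTS_RE.search(line)
        if match:
            counts[(match["ts"], match["lport"], match["rport"])] += 1
    return counts
```

Sorting the result with `count_abts(lines).most_common()` lists the busiest port pairs first, which helps when mapping the timeouts onto the SAN topology.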

3. Sort out the SAN network topology based on the storage ports involved in the disconnected replication links. (When a SAN network problem occurs, you are advised to draw an overall network connection diagram so that the problem can be demarcated quickly based on where the abnormalities occur.) Almost all link disconnections occurred on the same fabric plane (SAN391-02 and SAN392-02).
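
The demarcation in step 3 amounts to grouping disconnection events by the fabric plane that each storage port is cabled to. A minimal sketch follows; the port names and fabric labels used in the usage note are hypothetical placeholders, and the real mapping has to come from the site's cabling records.

```python
from collections import Counter

def disconnects_per_fabric(disconnected_ports, port_to_fabric):
    """Return (fabric, count) pairs, most affected fabric first.

    disconnected_ports: one entry per disconnection event (storage port name).
    port_to_fabric: dict mapping storage port name -> fabric plane name.
    Ports missing from the map are counted under "unknown".
    """
    counts = Counter(
        port_to_fabric.get(port, "unknown") for port in disconnected_ports
    )
    return counts.most_common()
```

For example, with a placeholder map such as `{"port-A": "fabric-02", "port-B": "fabric-02", "port-C": "fabric-01"}`, a list of events dominated by `port-A` and `port-B` puts `fabric-02` at the top of the result, which is the pattern observed in this case.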

4. Check whether the transmit and receive optical power of the optical modules on the storage ports and the corresponding switch ports is normal (sfpshow), whether the corresponding switch ports have bit errors (porterrshow), whether the switch has generated error alarms (errdump), and whether the interfaces are intermittently disconnected (fabriclog). No abnormalities were found.

For details, see http://3ms.huawei.com/hi/group/2027453/wiki_5239085.html.

At this point, it can be preliminarily determined that the high network latency is caused by the cascading links between the switches, which is difficult to pinpoint. Possible causes include unstable cascading links (including DWDM links) and back pressure caused by slow devices on the network.

5. Run the statsclear command to clear the bit error records on the two optical switches. Observe for a period of time, then collect the logs again and check the cascading ports. Several cascading ports (271, 287, 303, and 319) do show class 3 frame discards (disc c3) and CRC errors, with port 271 showing slightly more than the others. Because all cascading ports are abnormal, it is suspected that the problem is caused by back pressure from slow devices on the network.

A device can become slow for many reasons, such as saturated port bandwidth, poor host or storage performance, HBA driver or firmware exceptions, and aging optical modules and cables. Some of these do not produce bit errors. The live network consists entirely of high-end SAN switches, each with hundreds of ports, so checking for slow devices one by one is very complicated. Therefore, the following operations are performed in parallel:

a. Rectify the faulty optical fibers or optical modules on the two optical switches. (Based on the bit errors displayed by the porterrshow command, replace or isolate the affected ports.)

b. Enable the monitoring functions (MAPS and FPI) of the Brocade switches. These functions require the Fabric Vision license. If the switches on the live network do not have the license, you can import a temporary license for fault locating. This does not affect services.

Introduction to MAPS: http://3ms.huawei.com/hi/group/2027453/wiki_5180523.html

Enable MAPS: mapspolicy --enable dflt_moderate_policy (the moderate policy is recommended)

mapsconfig --actions RASLOG,SW_CRITICAL,SW_MARGINAL

FPI introduction: http://3ms.huawei.com/hi/group/2027453/wiki_5285059.html

Enable FPI: FPI monitoring is enabled by default in versions later than 8.0. In versions earlier than 8.0, run the mapsconfig --enableFPImon command.

6. After the monitoring functions are enabled, a large number of performance alarms are generated on the switch. The frame timeout on cascading ports 1 and 47 reaches 100 ms, which indicates that the intermediate optical cable or DWDM device is faulty. The portperfshow command shows that the service load on the four cascading ports is not heavy, so the affected ports are temporarily disabled and isolated.

For details about how to disable an in-use cascading port without causing data frame loss, see:

http://3ms.huawei.com/hi/group/2027453/wiki_5506219.html

The check result shows that this counter is high on all cascading ports, which suggests that there is room for optimization by increasing the number of buffer credits. In addition, these cascading ports are configured in long-distance mode, in which buffer credits are allocated according to the distance configured when the port is initially set up. Therefore, it is suspected that the configured distance is too short.

According to the customer, the distance between the two data centers is about 10 km. The portshow command output shows that the configured distance is 35 km, which meets the requirement (the configured distance is recommended to be 1.5 to 2 times the actual distance). Therefore, buffer credits should not be a bottleneck.

However, the problem does occur in practice. It is suspected that the intermediate optical cable has too much coiled slack, so that the actual fiber length is greater than expected, or that the optical attenuation of the cable is high. Therefore, it is recommended that the D-Port function of the switch be used to measure the distance and delay between the two sites. Because D-Port diagnosis takes the port offline, first run the portperfshow command to check the service load on the other three ports and make sure that bandwidth will not become a bottleneck after one cascading port is disabled. The method of disabling the port is described earlier. For details about how to check the D-Port configuration, see:

http://3ms.huawei.com/hi/group/2027453/wiki_5684636.html

The delay measured by D-Port is 0.3 ms (normal), and the distance is 32.7 km, which is almost equal to the configured 35 km. This does not comply with the best practice for long-distance configuration, under which the configured distance should be 1.5 to 2 times the actual distance. Because the device has a large number of buffers, the distance is directly reconfigured to 80 km.
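
The 1.5 to 2 times rule can be checked with simple arithmetic. The sketch below applies it to the figures in this case; the function names are made up for this example, and the rule itself is the only input taken from the case.

```python
def recommended_distance_km(actual_km, low=1.5, high=2.0):
    """Return the (minimum, maximum) recommended configured distance in km."""
    return actual_km * low, actual_km * high

def is_configured_distance_adequate(configured_km, actual_km, low=1.5):
    """True if the configured distance is at least `low` times the actual one."""
    return configured_km >= actual_km * low

# Customer's estimate: 10 km actual, 35 km configured -> looks adequate.
print(is_configured_distance_adequate(35, 10))
# D-Port measurement: 32.7 km actual, 35 km configured -> too short.
print(is_configured_distance_adequate(35, 32.7))
# Recommended range for 32.7 km is roughly 49 km to 65 km.
print(recommended_distance_km(32.7))
# The new 80 km setting exceeds even the upper end of that range.
print(is_configured_distance_adequate(80, 32.7))
```

With the estimated 10 km, the 35 km setting passes the check, which is why the buffer credits were initially ruled out. With the measured 32.7 km, the same setting fails, and the 80 km value chosen here sits above the recommended range, which is acceptable only because the device has buffers to spare.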

In addition, the other ports were tested and modified in the same way.

After the port configuration is optimized, the service latency on the client side returns to normal.


[Solution]

1. Replace the faulty optical module and optical fiber on the network.

2. Load the Fabric Vision license to enable switch performance monitoring, then identify and isolate high-latency ports.

3. Check and optimize the long-distance configuration of the switches using the D-Port function.


[Applicability]

Brocade SAN switches and OceanStor V3 series.

