Hello all,
Today I want to share a case about optimizing reliability in Dorado 6000 V6 iSCSI scenarios.
Environment configuration
Dorado 6000 V6 (with 50 km remote replication), two Linux hosts, Veritas DMP multipathing, two dual-port 10GE HBAs, and four physical paths between each host and storage. Hosts and storage devices are connected through 10GE switches.
Symptom
In the Linux + Veritas DMP iSCSI scenario with the default configuration, I/Os drop to zero for 10 seconds after a controller is removed, and for more than 7 seconds when multiple physical links fail, which does not meet the user's requirements. After optimization, I/Os do not drop to zero when a single controller fails, and drop to zero for only 1 second when multiple physical links fail.
Optimization Solution
1. Modify the following parameters in the /etc/iscsi/iscsid.conf file:
node.session.timeo.replacement_timeout = 1
# Time (in seconds) to wait for session recovery before failing I/O back to the upper layer
node.conn[0].timeo.noop_out_interval = 1
# Interval (in seconds) between NOP-Out ping packets
node.conn[0].timeo.noop_out_timeout = 1
# Timeout (in seconds) for receiving the NOP-In heartbeat reply before the connection is declared failed
Note: After modifying the file, restart the host and re-establish the iSCSI connections so that the configuration takes effect permanently.
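With these values, the worst-case time for the host to notice a dead path can be roughly estimated as noop_out_interval + noop_out_timeout + replacement_timeout. This is a simplified model I am adding for illustration (it assumes the link fails just after a successful ping), not a figure from the original test:

```python
# Rough worst-case path-failure detection model (seconds).
# Assumption: the link fails immediately after a successful NOP-Out ping,
# so a full ping interval elapses before the next probe even starts.
noop_out_interval = 1    # gap between NOP-Out ping packets
noop_out_timeout = 1     # wait for the NOP-In reply before a connection error
replacement_timeout = 1  # wait for session recovery before failing I/O upward

worst_case = noop_out_interval + noop_out_timeout + replacement_timeout
print(worst_case)  # about 3 seconds before multipathing sees the error
```

This explains why the default values (noop_out_interval = 5, noop_out_timeout = 5, replacement_timeout = 120 on many distributions) lead to such long I/O stalls before tuning.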

After the preceding parameters are modified, removing one controller stops I/O for 6 seconds.

Shut down one storage port on the switch, and then the other three storage ports: I/Os stop for 7 seconds.

2. Optimize the iSCSI network mounting solution so that host HBAs and storage SmartIO cards are fully interconnected.
Solution before optimization: one-to-one connections are used. Each host is connected to the storage device, but each HBA port on a host is connected to only some of the SmartIO ports on the storage device, not all of them.

iSCSI mounting script:
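The original script was not included in the post. The sketch below is an illustrative reconstruction using standard iscsiadm commands; the target IQN and portal IP addresses are placeholders, not values from the original environment:

```shell
#!/bin/sh
# Illustrative one-to-one iSCSI mounting (hypothetical addresses):
# each host HBA port logs in to only one SmartIO port per controller.
TARGET_IQN="iqn.2006-08.com.huawei:oceanstor:2100xxxx"   # placeholder IQN

# HBA port 1 -> one SmartIO port on controller A
iscsiadm -m discovery -t sendtargets -p 192.168.10.11:3260
iscsiadm -m node -T "$TARGET_IQN" -p 192.168.10.11:3260 --login

# HBA port 2 -> one SmartIO port on controller B
iscsiadm -m discovery -t sendtargets -p 192.168.20.11:3260
iscsiadm -m node -T "$TARGET_IQN" -p 192.168.20.11:3260 --login
```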


After one controller is removed, only one HBA on each host carries traffic, and more I/O forwarding occurs between controllers.

After four ports on the storage side are shut down, each host is connected to only half of the SmartIO cards on the storage side, and more data is forwarded.

Optimized solution: each HBA port on the host is fully interconnected with every SmartIO port on the storage device.

iSCSI mounting script:
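Again, the original script was not included. A minimal full-mesh sketch, assuming hypothetical portal addresses and a placeholder IQN, would log each HBA port in to every SmartIO port:

```shell
#!/bin/sh
# Illustrative full-mesh iSCSI mounting (hypothetical addresses):
# every host HBA port logs in to every storage SmartIO port.
TARGET_IQN="iqn.2006-08.com.huawei:oceanstor:2100xxxx"   # placeholder IQN

# Placeholder portal list: all SmartIO ports on both controllers.
PORTALS="192.168.10.11 192.168.10.12 192.168.20.11 192.168.20.12"

for portal in $PORTALS; do
    iscsiadm -m discovery -t sendtargets -p "$portal":3260
    iscsiadm -m node -T "$TARGET_IQN" -p "$portal":3260 --login
done
```

With full interconnection, losing a controller or a port leaves every remaining HBA still logged in to every surviving SmartIO port, which is why I/O forwarding drops in the fault cases below.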


After one controller is removed, each HBA on the host is still connected to each SmartIO card on the storage device, and I/O forwarding is reduced.

After the four ports on the storage side are shut down, each host is still connected to each SmartIO card on the storage side, reducing I/O forwarding.

After the parameters in the iSCSI configuration file are modified and the host HBAs and storage SmartIO cards are fully interconnected, a single controller fault no longer causes I/O to drop to zero.

Shut down one storage port on the switch, and then the other three storage ports: I/Os stop for only 1 second.

Summary
In the controller-fault pretest in the iSCSI scenario, I/O stopped for 6 seconds. Of that, about 949 ms is spent on the storage side: the BSP detects the controller removal and reports an interrupt, the system controller receives the interrupt, and the system switches from dual-controller mode to single-controller mode. The remaining time is spent on the host. The host multipathing software takes little time between receiving the error code returned for an I/O and completing the path switchover, so most of the time is consumed by fault detection and I/O forwarding in the host HBA and iSCSI driver. The solution is therefore to modify the iSCSI timeout parameters and use full interconnection to reduce I/O forwarding during faults.
Thank you.


