Symptom:
Append latency is high, and the persistence layer returns busy messages (error code -994).
Diagnosis:
Based on the node with high append latency reported by the service layer, check whether the node's append latency increase is caused by Plog_Client or Plog_Server processes.
Log in to the node, go to the /var/FTDS/stat directory, and run ls -lrt | grep {plog_client_pid} to find the dotting log printed when the latency increased. The log file name ends with _show.ios, for example, DFV_FTDS_client_27650_20190325161742_show.ios.
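Because the collection timestamp is embedded in the dotting file name, the client's newest dotting file can also be picked out by sorting on the name. A minimal sketch, using sample file names that follow the pattern shown above (the directory and PID here are illustrative):

```shell
# Create sample dotting file names matching the pattern from the text.
mkdir -p /tmp/ftds_stat && cd /tmp/ftds_stat
touch DFV_FTDS_client_27650_20190325161742_show.ios
touch DFV_FTDS_client_27650_20190325170001_show.ios
# The timestamp is part of the file name, so a lexical sort puts the newest last.
ls | grep "client_27650" | grep "_show.ios" | sort | tail -n1
```

On a live node, ls -lrt as described in the text achieves the same thing by modification time.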
Search for keyword append_total_latency_d_all in the obtained dotting file, as shown in the following figure:
The second, third, fifth, and sixth parameters of append_total_latency_d_all indicate that the average latency is 996 ms and the maximum latency is 16 s. The parameters of append_total_latency_p99_all on the second line have similar meanings: among the I/Os collected during the 30 s statistical period, 184 append requests have high latency, with an average latency of 7.74 s and a maximum latency of 16.4 s. Together, these two lines confirm that the append latency at the persistence layer is indeed high.
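Average and maximum latency can be derived from a dotting line with a short awk filter. A minimal sketch, under the assumption (hypothetical, since the figure is not shown) that the line carries a request count, a total latency, and a maximum latency in microseconds; the real column positions must be confirmed against an actual _show.ios file:

```shell
# Hypothetical field layout: name, request count, total latency (us), max latency (us).
# 184 requests totalling 1424160000 us gives the 7.74 s average from the text.
line="append_total_latency_p99_all 184 1424160000 16400000"
echo "$line" | awk '{printf "avg=%.2fms max=%.2fms\n", $3/$2/1000, $4/1000}'
```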
Search for keyword ec_append_max_msg_total_d in the obtained dotting file, as shown in the following figure:
The ec_append_max_msg_total_d dotting field indicates the latency for Plog_Server processes to process each append operation. According to the preceding figure, the main cause of the long latency at the persistence layer is the long latency on Plog_Server processes.
Search for the keyword C_append_total_tcp in the obtained dotting file, as shown in the following figure:
The C_append_total_tcp dotting field indicates the append latency of a single Plog_Server process. According to the preceding figure, the append latency of a single Plog_Server process is not long, but the values of the P99 field are large and close to the dotting statistics of the ec_append_max_msg_total_d field. This indicates that the pressure on the cluster as a whole is not high, but the pressure on some disks is. As a result, the latency of the entire append request increases.
Based on the conclusion of the first step, identify the Plog_Server processes that increase the total append latency.
Go to the /var/log/dsware/stat/{client_nid} directory and run grep "\-994" errocodeinfo.stat | grep "append" | grep "{Time when the latency increased}", as shown in the following figure:
Records in green boxes are the Plog_Server processes that return error code -994 most frequently; 655xx indicates the NID of a Plog_Server process. Locate the corresponding disks according to the NIDs. Values in the fourth column indicate the total number of times a Plog_Server process has returned error code -994, and values in the fifth column indicate the number of times it returned error code -994 within the last minute. Most read and write operations that return error code -994 are performed on disks 65545 and 65570.
Select a Plog_Server process that returns error code -994 frequently, obtain the value of plog_server_nid (for example, 655XX), and run the insight command to obtain the IP address of the node and the ESN of the disk where the Plog_Server process is located, as shown in the following figure:
ip_str=`grep mdc_url /opt/dfv/persistence_layer/*/conf/*.c*f*|grep @|awk -F=' {print $2} '|head -n1|awk -F,' {print $1} '`;id_ip=`echo $ip_str|awk -F@' {print $1} '|awk -F:' {if($3){print $2,$3}else {print $1,$2}} '`;port=`echo $ip_str|awk -F@' {print $2} '|awk -F:' {print $2} '`;dsware_insight 0 $id_ip $port 8 101 {pool_id} | grep {plog_server_nid}
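The fourth-column totals can be aggregated per NID to rank the busiest Plog_Server processes. A minimal sketch, assuming a hypothetical errocodeinfo.stat layout in which field 2 is the Plog_Server NID and field 4 is the total -994 count; the real column positions must be verified against an actual file:

```shell
# Hypothetical errocodeinfo.stat excerpt: func, server NID, errcode, total, recent.
cat > /tmp/errocodeinfo.sample <<'EOF'
append 65545 -994 1200 80
append 65570 -994 900 65
append 65551 -994 12 0
EOF
# Sum the -994 totals per server NID and sort descending: top entries are the hot disks.
grep -- "-994" /tmp/errocodeinfo.sample | awk '{sum[$2]+=$4} END {for (n in sum) print sum[n], n}' | sort -rn
```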

Based on the node IP address and disk ESN obtained in step 2, log in to the node to find out why the disk returns error code -994 so many times.
View the disk logs stored in the /var/log/dsware directory; the log file name contains the disk ESN. Run grep "io request is full" {plog_server_log}. The output shows that the disk queue is full: a large number of requests, including append and read requests, are discarded and error code -994 is returned, as shown in the following figure:

Go to the /var/log/dsware directory and search for the log with the keyword iostat. View the log to check the disk pressure, as shown in the following figure:
The sdk lines show that the pressure on the physical disk is low, indicating that a large number of I/Os have been cached. Because the I/Os can be cached, the reason the Plog_Server process returns error code -994 so many times is that a large number of read requests are sent to the disk where the process is located, filling the disk queue. Next, locate the node that delivers the large number of read I/Os.
Go to the /var/log/dsware/stat/{plog_server_nid}/ directory and view the errocodeinfo.stat file, as shown in the following figure:

Field description:
ERRFUNCNAME: indicates the operation type. For example, Send_read_plog_ec_ack indicates returning the response of an EC read operation.
ERRORCODE: indicates the error code; -994 indicates busy.
TOTALTIMES: indicates the total number of times the error code has been generated.
CAPSTIMES: indicates the number of times the error code has been generated within the current statistical period.
PeerNID = 131075: indicates the NID of the peer node. In this example, it is the node that delivers the read requests.
Location method using the file:
(1) Find the content printed at the specified time. For example, find the error statistics collected at 10:22:37 on March 25, 2019.
(2) Check whether a large number of read requests returned error code -994 during that period.
(3) If yes, the read hotspot is most likely caused by the node with NID 131075.
(4) If no, the increase in append latency is not caused by a read hotspot, and further problem location is required.
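The four steps above can be sketched as a small grep filter, under the assumption (hypothetical, since the figure is not shown) that each statistics block in errocodeinfo.stat is prefixed by a timestamp line and that read responses appear as Send_read_plog_ec_ack records:

```shell
# Hypothetical errocodeinfo.stat fragment; the real layout must be checked on site.
cat > /tmp/errstat.sample <<'EOF'
[2019-03-25 10:22:37]
Send_read_plog_ec_ack -994 5300 410 PeerNID=131075
Send_append_plog_ack -994 12 1 PeerNID=131080
EOF
# Steps (1)+(2): pick the block at the given time and count read -994 records.
reads=$(grep -A2 "2019-03-25 10:22:37" /tmp/errstat.sample | grep "Send_read_plog_ec_ack" | grep -c -- "-994")
# Steps (3)/(4): many read -994 records point to a read hotspot on the peer node.
if [ "$reads" -gt 0 ]; then echo "read hotspot suspected"; else echo "not a read hotspot"; fi
```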
Use the NID of the Plog_Client that delivers a large number of read requests, obtained in step 3, to locate the corresponding node. Run the insight command ip_str=`grep mdc_url /opt/dfv/persistence_layer/*/conf/*.c*f*|grep @|awk -F=' {print $2} '|head -n1|awk -F,' {print $1} '`;id_ip=`echo $ip_str|awk -F@' {print $1} '|awk -F:' {if($3){print $2,$3}else {print $1,$2}} '`;port=`echo $ip_str|awk -F@' {print $2} '|awk -F:' {print $2} '`;dsware_insight 0 $id_ip $port 8 110 | grep {plog_client_nid} and the command output is shown in the following figure:

After the IP address of the node that delivers a large number of read requests is obtained, the service layer applies flow control measures to reduce the read requests and contain the read hotspot as an emergency action.
Cause:
Causes for the read hotspot:
When a large number of read requests are delivered to a disk and each request targets a small data block, the disk is under a high I/O workload. If append requests are delivered to the disk at this time, busy messages may be returned. After a busy message is returned, the append operation is retried periodically at an interval of 500 ms to 5 s. As a result, the append latency increases, and the append operation may even fail.
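The cost of these periodic retries adds up quickly. An illustrative sketch of the worst-case extra latency after several busy retries; the 500 ms to 5 s interval comes from the text, while the doubling policy and the retry count are assumptions made purely for illustration:

```shell
# Assumed retry schedule for a busy (-994) append: start at 500 ms,
# double each retry, cap at 5 s (the doubling policy is an assumption).
interval=500
total=0
for i in 1 2 3 4 5; do
  total=$((total + interval))
  interval=$((interval * 2))
  [ "$interval" -gt 5000 ] && interval=5000
done
echo "worst-case extra latency after 5 busy retries: ${total} ms"
```

Even a handful of busy retries can push a single append from milliseconds into the multi-second range, which matches the latencies seen in the dotting statistics.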
When busy messages are returned, read requests delivered to a disk can be rebuilt and dispatched to other disks, but write requests cannot. An append operation succeeds only when the request is processed successfully on all disks. Therefore, a read hotspot manifests as an increase in append latency.
Solution:
Emergency measures: Locate the node that delivers a large number of read requests at the persistence layer, and inform the service layer to enable flow control for the read requests.
Check After Recovery:
The append latency returns to the normal level.