Hello, everyone!
I will share with you how to deal with consumer lag (a backlog of unconsumed data) when using Kafka.
Cause Analysis
The CPU or I/O usage of the server is high.
The settings of num.io.threads and num.network.threads on the server are improper.
Too many requests are sent to the server, causing high latency.
The network on the server is poor.
The network on the client is poor.
The client and server versions do not match or are incompatible.
Solution
Run the iostat -x 1 command to check the CPU parameter %idle and the I/O parameter %util. (With the -d option, iostat prints only the device report and omits the CPU report.) The higher the %idle value, the idler the CPU. The higher the %util value, the higher the I/O usage. If %idle is less than 20%, run the top -Hp kafkaPid command to find the threads with high CPU usage, then locate those threads in the jstack output to analyze the cause. If %util is greater than 80%, check the read/write speed, await, and svctm values of the disk to determine what is driving the I/O. If the read/write speed is high, check the Kafka read and write requests and latency (see 3). If await is much greater than svctm, the I/O queue is too long and the application responds slowly.
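The thresholds above can be written down as a small check. This is only a sketch: the function names are made up for illustration, and the await_factor of 5 is my assumption for "much greater", not a value from any tool. The second helper converts a decimal thread ID from top -Hp into the hexadecimal form that jstack typically prints as nid=0x..., which makes it easier to find the hot thread in the dump.

```python
def diagnose_iostat(idle_pct, util_pct, await_ms, svctm_ms, await_factor=5.0):
    """Return follow-up hints for one iostat sample.

    await_factor is an assumed cutoff for "await much greater than svctm".
    """
    findings = []
    if idle_pct < 20.0:
        findings.append("cpu busy: inspect hot threads with top -Hp and jstack")
    if util_pct > 80.0:
        findings.append("disk busy: check read/write speed, await and svctm")
        if svctm_ms > 0 and await_ms > await_factor * svctm_ms:
            findings.append("disk queue long: await is much greater than svctm")
    return findings


def thread_id_to_nid(tid):
    """Convert a decimal thread ID (from top -Hp) to the hex nid form."""
    return hex(tid)


if __name__ == "__main__":
    print(diagnose_iostat(idle_pct=12.0, util_pct=93.0, await_ms=60.0, svctm_ms=2.0))
    print(thread_id_to_nid(6699))
```

For example, thread 6699 in top -Hp would be searched for as nid=0x1a2b in the jstack output.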
The default values of num.io.threads and num.network.threads are 8 and 3, respectively. In a production environment, both are generally set to a multiple of the number of disks.
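As a rough illustration of the "multiple of the number of disks" rule, the sketch below derives candidate values and prints them as broker properties. The multipliers (2 for I/O, 1 for network) are my assumptions, not recommended values, and the function names are hypothetical. The result is never allowed to drop below the defaults of 8 and 3, so on hosts with very few disks it falls back to the defaults instead of a strict multiple.

```python
def suggest_thread_settings(num_disks, io_multiple=2, network_multiple=1):
    """Suggest thread counts as multiples of the disk count.

    The multipliers are illustrative assumptions; tune them against real load.
    Values are floored at the broker defaults (8 and 3).
    """
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    return {
        "num.io.threads": max(8, num_disks * io_multiple),
        "num.network.threads": max(3, num_disks * network_multiple),
    }


def to_properties(settings):
    """Render the settings as key=value lines for the broker configuration."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


if __name__ == "__main__":
    print(to_properties(suggest_thread_settings(num_disks=12)))
```

Treat the output as a starting point and verify it against the CPU and latency metrics after the change.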
Check the traffic, request information, and latency information of the corresponding topic. If the number of requests on the node is large and the latency is high, check whether the number and distribution of partitions are proper and whether the distribution of partition leaders is balanced.
Choose Cluster > Kafka > KafkaTopic Monitor. On the page that is displayed, view the real-time and historical byte traffic and request counts of each topic.

Choose Cluster > Kafka > Dashboard and click Customize in the upper right corner to select the request, disk, traffic, and latency information.

Go to the Kafka Broker instance page. On the instance status page, click the inverted triangle icon in the upper right corner to view the request, disk, traffic, latency, and process memory of the node.

If the request, traffic, latency, disk, or memory metrics of a node are obviously higher than those of other nodes, check on the instance status page whether that node hosts obviously more partition leaders than the other nodes. If the leader counts of the nodes are similar, the leader distribution is balanced. If they differ greatly, check whether auto.leader.rebalance.enable of the broker instance is set to true. If it is false, manually run the kafka-preferred-replica-election.sh command on the Kafka client. (For details, see the client tool part.)
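If you copy the leader counts from the instance status page, a quick comparison against the average shows which brokers stand out. This is a minimal sketch: the function name and the 10% tolerance are my assumptions, and the measure used here (excess over the average leader count) is simpler than the preferred-leader ratio that Kafka itself uses for automatic rebalancing.

```python
def overloaded_leaders(leader_counts, tolerance=0.10):
    """Return brokers whose leader count exceeds the average by more than tolerance.

    leader_counts maps broker ID to the number of partition leaders it hosts.
    The result maps each overloaded broker to its relative excess (0.2 = 20%).
    """
    if not leader_counts:
        raise ValueError("leader_counts must not be empty")
    average = sum(leader_counts.values()) / len(leader_counts)
    if average == 0:
        return {}
    overloaded = {}
    for broker, count in leader_counts.items():
        excess = (count - average) / average
        if excess > tolerance:
            overloaded[broker] = round(excess, 2)
    return overloaded


if __name__ == "__main__":
    print(overloaded_leaders({1: 30, 2: 5, 3: 10}))
```

An empty result suggests the leaders are spread evenly. A non-empty result is a hint to check auto.leader.rebalance.enable as described above.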
On the FusionInsight page, check whether the read/write speed of the disk is always high. You can also use the NMON tool to check the real-time I/O and determine whether it has reached the bottleneck, that is, whether the I/O rate in bytes per second is close to the NIC rate in bits per second divided by 8.
For details about how to install and use nmon, see http://nmon.sourceforge.net/.
On the client, you can use NMON or other tools to check the I/O rate and determine whether the I/O reaches the bottleneck.
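The "NIC rate divided by 8" check is just a unit conversion from bits to bytes, and it works the same way on the server and on the client. A small sketch, where the function names and the 90% threshold for "close to" are my assumptions:

```python
def nic_capacity_bytes_per_sec(nic_rate_mbit):
    """Convert a NIC rate in Mbit/s to its theoretical capacity in bytes/s."""
    return nic_rate_mbit * 1_000_000 / 8


def is_network_saturated(observed_bytes_per_sec, nic_rate_mbit, threshold=0.9):
    """Report whether the observed rate is close to the NIC capacity.

    threshold is an assumed cutoff for "close to" (0.9 = 90% of capacity).
    """
    return observed_bytes_per_sec >= threshold * nic_capacity_bytes_per_sec(nic_rate_mbit)


if __name__ == "__main__":
    # A 10 Gbit/s NIC tops out at about 1.25 GB/s.
    print(nic_capacity_bytes_per_sec(10_000))
    print(is_network_saturated(1_200_000_000, 10_000))
```

In practice the usable rate is somewhat lower than this theoretical figure because of protocol overhead, so sustained readings near the threshold already deserve attention.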
Check whether the client version is consistent with the server version. If the C++ client is used, check whether its version is compatible. For example, the rdkafka version compatible with 6.5.X is 3.1.0; rdkafka 4.0.3 is not compatible with it.
You can use kafka-consumer-perf-test.sh to test the consumption performance on both the server and the client, and compare the results with the actual consumption performance of the client application, to further determine whether the problem is caused by the server or by the client (consumer). For details about how to use kafka-consumer-perf-test.sh, see the client tool part.
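One way to read the three numbers from that comparison is sketched below. The function name, the returned labels, and the 0.8 margin are all my assumptions for illustration, and the throughput values are whatever you measured yourself in MB/s. It only narrows down where to look next and does not prove the cause.

```python
def locate_bottleneck(perf_on_server, perf_on_client, app_throughput, margin=0.8):
    """Guess where consumption slows down from three throughput figures (MB/s).

    perf_on_server: kafka-consumer-perf-test.sh result when run on a broker host
    perf_on_client: the same test run on the client host
    app_throughput: what the real consumer application achieves
    margin: assumed ratio below which one figure counts as clearly lower
    """
    if perf_on_server <= 0 or perf_on_client <= 0:
        raise ValueError("perf test throughput must be positive")
    if perf_on_client < margin * perf_on_server:
        return "network or client host"
    if app_throughput < margin * perf_on_client:
        return "client application"
    return "server side"


if __name__ == "__main__":
    print(locate_bottleneck(perf_on_server=100.0, perf_on_client=95.0, app_throughput=30.0))
```

If the perf test is fast on both hosts but the application is slow, the consumer code or its configuration is the likely place to look. If all three figures are similar and still too low, go back to the server-side checks above.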
We warmly welcome you to enjoy our community!


