Hello, everyone!
I will share with you how to troubleshoot poor producer (write) performance when using Kafka.
Context and Symptom
The production performance of the Kafka cluster deteriorates, and the amount of written data decreases.
Cause Analysis
The data volume increases while the number of partitions is insufficient, or topics have been deleted and recreated with fewer partitions. As a result, a performance bottleneck occurs.
The CPU processing capability is underutilized (too few Kafka handler threads), and network requests are severely congested.
The network bandwidth is insufficient, causing poor production performance.
Procedure
Possible cause 1: The data volume increases while the number of partitions is insufficient, or topics have been deleted and recreated with fewer partitions. As a result, a performance bottleneck occurs.
Run the describe command to check the number of partitions of the current topic. Let num = number of broker nodes x number of disks on a single node.
If the number of partitions is far less than num and num is less than 200, increase the number of partitions of the topic to num. (For details about the command, see section "alter.")
If the number of partitions is far less than num and num is greater than 200, increase the number of partitions of the topic to 200. (For details about the command, see section "alter.")
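The two rules above amount to capping the target partition count at min(num, 200). A minimal sketch (the `target_partitions` helper is hypothetical, for illustration only, not a Kafka tool):

```shell
# target_partitions NODES DISKS_PER_NODE
# Computes the recommended partition count: nodes x disks, capped at 200.
target_partitions() {
  local nodes=$1 disks=$2
  local num=$((nodes * disks))
  if [ "$num" -gt 200 ]; then
    echo 200
  else
    echo "$num"
  fi
}

target_partitions 10 12   # 10 nodes x 12 disks -> prints 120
target_partitions 30 12   # 30 x 12 = 360 -> capped, prints 200
```

The actual expansion is then done with the alter command described in section "alter."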
If the number of partitions is not the cause, run the netstat -anp | grep port command (where port is the port number used by the producer) on each node to check whether network write stacking exists.
Figure 1 Network write stacking

If the values in the red box (the Send-Q column) are numerous and large, network write stacking exists. Locate the node with network stacking.
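The check above can be scripted. A sketch, assuming the standard netstat column layout where Send-Q is the third column (the filter function is illustrative, not part of Kafka):

```shell
# filter_stacked: reads netstat-style lines on stdin and prints only
# those whose Send-Q (3rd column) is greater than 0, i.e. sockets
# with unsent data queued.
filter_stacked() {
  awk '$3 > 0 {print}'
}

# Typical use (replace 21005 with the port your producer uses):
#   netstat -anp | grep 21005 | filter_stacked
```

Persistently large Send-Q values on one node point to that node as the stacking location.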
If no such node is found, check the data volume on each node: the data indicators are displayed in each node's data graph. Locate the node whose volume differs markedly from the others.
Figure 2 Topic data on a single node

If no such node is found, run the describe command to export the details of all topics and count the number of partition leaders on each node.
./kafka-topics.sh --describe --zookeeper zookeeper_IP:24002/kafka > topics
For example, run the following command to count the leaders on node 1 (grep -w prevents "Leader: 1" from also matching broker IDs such as 10):
cat topics | grep -cw "Leader: 1"
Find the node with the largest number of leaders in this way.
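Counting leaders for every broker can be wrapped in a small loop. A sketch, assuming the `topics` file produced by the describe command and numeric broker IDs (the `count_leaders` helper is hypothetical):

```shell
# count_leaders FILE ID...
# For each broker ID, counts lines in the describe output where that
# broker is the partition leader. -w avoids matching "Leader: 1"
# against "Leader: 10", "Leader: 11", and so on.
count_leaders() {
  local file=$1; shift
  local id
  for id in "$@"; do
    echo "broker $id: $(grep -cw "Leader: $id" "$file") leaders"
  done
}

# Typical use, for a cluster with brokers 1-4:
#   count_leaders topics 1 2 3 4
```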
If the node is still not found, check the client log, which records the node that responds slowly. Find that node.
After finding the node with heavy load, go to its data directory.
cd /srv/BigData/kafka/data*/kafka-logs
Sort topics by folder size and find topics with large files and few partitions. Expand the partitions of the topics.
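Sorting the folders by size can be done with du. A sketch (the `largest_dirs` helper is illustrative; run it inside the kafka-logs directory from the step above):

```shell
# largest_dirs DIR [N]
# Lists the N largest sub-folders of DIR, biggest first, so that
# topics with large data volumes but few partitions stand out.
largest_dirs() {
  du -s "${1:?dir}"/*/ 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Typical use:
#   largest_dirs /srv/BigData/kafka/data1/kafka-logs 10
```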
Possible cause 2: The CPU processing capability is underutilized (too few Kafka handler threads), and network requests are severely congested.
On the FusionInsight GUI, check whether the host CPU usage is low while data is severely stacked on each node; idle CPUs combined with congested queues indicate too few handler threads.
Increase the Kafka cluster performance parameters num.io.threads and num.network.threads to two to three times the number of disks mounted on the node, and then restart the Kafka cluster.
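For example, on a broker node with 12 data disks, applying the "two to three times" guidance above would give settings like the following in server.properties (the values are illustrative, not universal recommendations):

```properties
# server.properties -- illustrative values for a broker with 12 data disks
# (2-3x the disk count, per the guidance above)
num.io.threads=24
num.network.threads=24
```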
Possible cause 3: The network bandwidth is insufficient, causing poor production performance.
Install the FusionInsight client on the test node. For details about the installation precautions, see section "Installing the Client" in the product documentation.
Use a pressure test tool to send data to the test cluster and check whether the production performance bottleneck persists.
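One option is the perf-test tool bundled with Kafka. A command sketch, assuming a test topic named perf_test and a placeholder broker address (adjust both to your cluster; this requires a running test cluster, so it is shown as a fragment only):

```shell
# Send 1,000,000 records of 1 KB each with no throttling
# (--throughput -1) and report the achieved throughput.
./kafka-producer-perf-test.sh \
  --topic perf_test \
  --num-records 1000000 \
  --record-size 1024 \
  --throughput -1 \
  --producer-props bootstrap.servers=broker1:21005 acks=1
```

If the test cluster reaches much higher throughput than production over the same network path, the bottleneck is likely in the production cluster rather than in the network bandwidth.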
We warmly welcome you to enjoy our community!

