Hello, everyone!
I will share how to troubleshoot the problem where a producer fails to send data when using Kafka.
Context and Symptom
Error description on the client:
The client frequently reported that Kafka failed to send data, and the error did not recover for a long time. Services recovered only after Kafka was restarted. The following exception was displayed on the client:
can not send messages!
Error description on the server:
The engineer logged in to FusionInsight Manager, chose Services > Kafka, and checked the values of Total Number of Partitions That Are Not Fully Synchronized and Total Request Rate.
If the two parameters were not displayed on the right, the engineer could click Customize, select the two items, and click OK.
As shown in the figure above, the number of partitions that were not fully synchronized increased suddenly, and the request rate dropped at the same time. This indicated that the server was abnormal, so the following troubleshooting steps needed to be performed.
Information Collection
Collecting jstack output from the faulty Kafka process by running the required commands
Run the following command to query the ID of the Broker process.
ps -ef|grep kafka

Log in to each background node of Kafka and run the following commands to switch to user omm and capture the stack:
su - omm
jstack Kafka process ID >> /tmp/kafka.jstack
It is recommended that jstack output be captured every 10 seconds; three snapshots are required. Then collect the resulting kafka.jstack files.
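The collection steps above can be scripted as a small loop. This is only a sketch: the broker PID is assumed to have been found beforehand with `ps -ef | grep kafka`, and the output file and interval are parameterized (the 10-second default matches the recommendation above).

```shell
#!/bin/sh
# Sketch: capture three thread-stack snapshots of the Kafka broker.
# $1 = broker PID, $2 = output file (default /tmp/kafka.jstack),
# $3 = interval in seconds (default 10, per the procedure above).
PID="$1"
OUT="${2:-/tmp/kafka.jstack}"
INTERVAL="${3:-10}"
for i in 1 2 3; do
    # Tag each snapshot with its index and timestamp so the three
    # captures can be told apart when analyzing the file later.
    echo "=== jstack snapshot $i at $(date) ===" >> "$OUT"
    jstack "$PID" >> "$OUT" 2>&1
    if [ "$i" -lt 3 ]; then
        sleep "$INTERVAL"
    fi
done
```

Run it as user omm on each broker node, e.g. `sh collect_jstack.sh <Kafka process ID>`.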

Changing the log level of Kafka to DEBUG and keeping the configuration for five minutes
Log in to FusionInsight Manager, choose Services > Kafka > Service Configuration, and set Type to All.
Enter the name of the parameter to be modified in the search box on the right.
Change the following log-level parameters to DEBUG: kafka.log.level, kafka.network.requestchannel.log.level, kafka.request.log.level, and root.log.level.
Click Save in the upper left corner, and click OK on the displayed dialog box.
Wait for 5 minutes, change the value to INFO, and save the configuration.
On FusionInsight Manager, choose System > Log Download.
Set Services to kafka:broker as shown in the preceding figure.
The Hosts parameter does not need to be configured; all hosts where Kafka is deployed are selected by default.
Set Time to the period from one hour before the exception occurred to the current time.
Obtaining detailed information about all partitions
Run the following command to go to the client installation directory:
cd Installation directory of the Kafka client
Run the following command to configure environment variables:
source bigdata_env
If the cluster is in security mode, run the following command to authenticate the user; in normal mode, authentication is not required.
kinit Component service user
Run the following command on the Kafka client to collect partition-describe.log:
kafka-topics.sh --describe --zookeeper 26.3.X.X:24002/kafka > /tmp/partition-describe.log
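Once partition-describe.log has been collected, the partitions that are not fully synchronized can be picked out directly: any line whose Isr list is shorter than its Replicas list. Below is a minimal awk sketch, assuming the usual field layout printed by kafka-topics.sh --describe; the sample heredoc uses hypothetical topic data and stands in for the real /tmp/partition-describe.log.

```shell
# Sample lines in the format printed by kafka-topics.sh --describe
# (hypothetical data; use the real /tmp/partition-describe.log in practice).
cat > /tmp/sample-describe.log <<'EOF'
	Topic: test	Partition: 0	Leader: 1	Replicas: 1,2,3	Isr: 1,2,3
	Topic: test	Partition: 1	Leader: 2	Replicas: 2,3,1	Isr: 2
EOF

# Print partitions whose in-sync replica set (Isr) is smaller than the
# full replica list (Replicas), i.e. partitions not fully synchronized.
awk '/Leader:/ {
    for (i = 1; i <= NF; i++) {
        if ($i == "Replicas:") nr = split($(i + 1), R, ",")
        if ($i == "Isr:")      ni = split($(i + 1), I, ",")
    }
    if (ni < nr) print
}' /tmp/sample-describe.log
```

On the sample data, only the "Partition: 1" line is printed, because its Isr set (2) is smaller than its replica list (2,3,1).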
In summary, you need to collect kafka.jstack, partition-describe.log, and the Broker service logs.
Possible Causes
The network is abnormal.
The Broker service is faulty.
Cause Analysis
The network was checked, and no exception was found.
No error was reported in Broker logs.
When the jstack file was checked, the following error information was found:
The error information indicated that a deadlock had occurred between two request threads.
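jstack labels detected deadlocks explicitly, so the collected file can be checked with a plain text search. The excerpt below is an abbreviated, hypothetical sample of the marker jstack prints; in practice, run the grep commands against the /tmp/kafka.jstack collected earlier.

```shell
# Abbreviated, hypothetical jstack excerpt showing the deadlock marker
# and the request handler threads involved.
cat > /tmp/sample.jstack <<'EOF'
Found one Java-level deadlock:
=============================
"kafka-request-handler-5":
  waiting to lock a monitor held by "kafka-request-handler-7"
"kafka-request-handler-7":
  waiting to lock a monitor held by "kafka-request-handler-5"
EOF

# Count deadlock reports and list the threads involved.
grep -c "Found one Java-level deadlock" /tmp/sample.jstack
grep "kafka-request-handler" /tmp/sample.jstack
```

A non-zero count from the first grep means the broker's request threads are stuck, matching the symptom analyzed here.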


This was a known open-source problem; details are available at https://issues.apache.org/jira/browse/KAFKA-6042.
Solution
As a workaround, temporarily change the number of request handler threads on the server to 1 to prevent a deadlock between multiple threads. Note that this operation reduces server performance.
Log in to FusionInsight Manager, choose Services > Kafka > Service Configuration > Type > All, enter num.io.threads in the search box in the upper right corner, and change the value to 1.
After the modification, click Save Configuration, and click OK. (Note: Restarting affects the cluster, so do not restart the service or instance immediately. It is recommended that you manually restart the Kafka cluster after negotiation with the customer. The modification takes effect only after the restart.)
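For reference, the parameter set in FusionInsight Manager corresponds to the standard Kafka broker property in server.properties, so the workaround amounts to the following setting:

```properties
# Temporary workaround for KAFKA-6042: run a single request handler
# thread so the handler-thread deadlock cannot occur. This serializes
# request processing and reduces broker throughput.
num.io.threads=1
```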
To manually restart the service, log in to FusionInsight Manager and choose Services > Kafka > More Actions > Restart Service.
This problem has been resolved in the FusionInsight V100R002C80SPC002 patch. You are advised to install the official patch before starting production services.
We warmly welcome you to enjoy our community!


