
Occasional Long E2E Delay for 30000 Consumers


Hello, everyone! 

This post describes a case of occasional long end-to-end (E2E) delay with 30,000 consumers.



[Applicable Version]

6.5.x

[Context and Symptom]

The live network consisted of 2000 nodes, each running multiple consumers. The Kafka cluster ran on three bare metal servers, each with 64 cores and 512 GB of memory. There were 15 topics, each with nine partitions and three replicas, and 15 producers. Each topic received messages at a rate of 1 msg/s, and each message body was 2000 bytes. There were 2000 consumer groups; each group had 15 consumers and subscribed to the 15 topics.

A separate consumer was started to collect monitoring data, which was stored on the producer host. The delay was calculated as the difference between the time a message was sent and the time it was received. The producer and this consumer ran on the same host so that both timestamps came from the same clock.
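The delay measurement described above can be sketched as follows. This is only an illustration, not the tool used in the test: it assumes the producer writes its send time into the first 8 bytes of the 2000-byte body, and the names build_payload and e2e_delay_ms are hypothetical.

```python
import struct
import time

PAYLOAD_SIZE = 2000  # bytes, matching the message body size in the test
_HEADER = struct.Struct(">q")  # assumed layout: 8-byte big-endian send time in ms


def build_payload(sent_ms: int, size: int = PAYLOAD_SIZE) -> bytes:
    """Prefix the body with the send time and pad it to a fixed size."""
    return _HEADER.pack(sent_ms) + b"\x00" * (size - _HEADER.size)


def e2e_delay_ms(payload: bytes, received_ms: int) -> int:
    """Return the receive time minus the send time stored in the payload."""
    (sent_ms,) = _HEADER.unpack_from(payload)
    return received_ms - sent_ms


if __name__ == "__main__":
    payload = build_payload(int(time.time() * 1000))
    # A real consumer would take this timestamp when the record arrives.
    print(e2e_delay_ms(payload, int(time.time() * 1000)), "ms")
```

Because both timestamps are taken on the same host, the subtraction is not affected by clock differences between machines.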

According to the test result, when the producer sent one 2 KB message per second to each of the 15 topics, the delay was long, sometimes reaching several seconds. When the producer sent messages to only one topic, the problem did not occur.

1_en-us_image_0228243567.png

In this scenario, the producer traffic was small: the producer generated 10 messages per second, a traffic rate of 20 KB/s. The broker, however, sent 30,000 messages per second, a traffic rate of 60 MB/s.
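The broker-side figures follow from the setup in the context section, because every consumer group receives its own copy of each message. A rough check, using only the numbers stated in this post (the function name broker_fanout is hypothetical):

```python
TOPICS = 15
MSGS_PER_TOPIC_PER_SEC = 1
MESSAGE_BYTES = 2000
CONSUMER_GROUPS = 2000


def broker_fanout(topics: int, rate: int, message_bytes: int, groups: int):
    """Return (messages/s, bytes/s) that the brokers deliver to consumers."""
    produced_per_sec = topics * rate
    # Each group gets every message once, so delivery scales with the group count.
    delivered_per_sec = produced_per_sec * groups
    return delivered_per_sec, delivered_per_sec * message_bytes


if __name__ == "__main__":
    msgs, nbytes = broker_fanout(
        TOPICS, MSGS_PER_TOPIC_PER_SEC, MESSAGE_BYTES, CONSUMER_GROUPS
    )
    print(msgs, "msg/s,", nbytes / 1_000_000, "MB/s")
```

With these inputs the outbound load is 30,000 msg/s and 60 MB/s, which matches the broker figures above.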

The packet with Seq 1359049145 was sent as a produce request at 17:10:33 and was retransmitted at 17:10:35 (1.7 seconds later).

The server returned the ACK packet after 2.3 s. The packet capture on the server showed the following:

1_en-us_image_0228243568.png

The preceding figure shows the relative Seq. The following figure shows the absolute Seq.

1_en-us_image_0228243569.png

Only 239 ms elapsed between the server receiving the producer message and the server returning the Kafka response, which indicates that about two seconds were spent in transmission. The preliminary conclusion was that the underlying network of the test environment was faulty, causing a large number of retransmissions.
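The split between server time and network time can be checked with a short calculation. It assumes that the 2.3 s is the interval the client observed between sending the request and receiving the ACK; the function name transit_time_ms is hypothetical.

```python
def transit_time_ms(client_observed_ms: int, server_processing_ms: int) -> int:
    """Time not accounted for by the server, i.e. spent on the network path."""
    return client_observed_ms - server_processing_ms


if __name__ == "__main__":
    # 2.3 s seen by the client, 239 ms spent inside the server
    print(transit_time_ms(2300, 239), "ms spent in transmission")
```

Under that assumption, roughly 2.06 s of the 2.3 s is outside the broker, which is consistent with the retransmission seen in the capture.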

The network engineers suspected that the OVS network was blocked. However, the test environment was an OpenStack system running on VMs, which was different from the live network environment. After the test was moved to 20 physical nodes (without the OVS bridge), the result improved.

100 consumers and 15 topics were created on each node, so the total number of connections was still 30,000 (20 x 15 x 100).

1_en-us_image_0228243570.png

The average delay decreased from about 100 ms to a little over 70 ms. The maximum delay occurred during connection setup. At other times, the delay did not exceed 1 s.

[Solution]

Adjust the broker parameters fetch.purgatory.purge.interval.requests and producer.purgatory.purge.interval.requests to reduce the maximum delay. Even after the network problem is fixed, tuning these two parameters can still shorten the delay considerably. (This will be verified in a later test.) In theory, the two parameters set how often, counted in requests, the broker purges its queues of delayed requests (the purgatory), so that the I/O threads do not have to re-check already completed requests each time they process a delayed request.
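As a sketch only, the two parameters would typically be set in the broker's server.properties. The post does not state which values were used, so the numbers below are placeholders for illustration, not recommendations; the Kafka broker documentation lists 1000 as the default for both.

```properties
# Placeholder values, not taken from this test; tune and measure in your own environment.
# Purge interval (in number of requests) of the fetch request purgatory.
fetch.purgatory.purge.interval.requests=100
# Purge interval (in number of requests) of the producer request purgatory.
producer.purgatory.purge.interval.requests=100
```

Compare the maximum E2E delay before and after the change to confirm that the new values actually help.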

In addition, in a healthy network environment, even a three-node Kafka cluster can support 2000 consumer groups of 15 consumers each, that is, 30,000 consumers in total.


Hope you can learn from it, thank you!

