
Spark: Spark Streaming Optimization

Latest reply: Oct 11, 2021 13:50:15

1.1.1 Spark Streaming Optimization

Scenario

Spark Streaming is a micro-batch stream processing framework that features second-level latency and high throughput. The goal of Spark Streaming tuning is therefore to increase throughput so that the system processes as much data as possible per unit time while keeping latency at the second level.

Note

This chapter applies to scenarios in which the data source is Kafka.

Procedure

A simple stream processing system consists of a data source, a receiver, and a processor. Here, the data source is Kafka, the receiver is the Kafka receiver in Spark Streaming, and the processor is Spark Streaming itself.

To tune Spark Streaming, the performance of all three components must be optimized.

• Data Source Tuning

In real-world scenarios, the data source stores data on local disks to ensure fault tolerance, whereas Spark Streaming keeps its computation results in memory. The data source can therefore become the biggest bottleneck of the streaming system.

For Kafka performance tuning, pay attention to the following:

- Use Kafka 0.8.2 or later, which provides the new Producer API with asynchronous sending.

- Configure multiple data directories for the Broker, an appropriate number of I/O threads, and an appropriate number of Partitions for each Topic.

For details, see the performance optimization section of the Kafka open-source documentation: http://kafka.apache.org/documentation.html
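As a sketch of the first point, the new (0.8.2+) Kafka producer batches records and sends them asynchronously; the settings below are illustrative starting points I have chosen for this example, not values from the article or measured recommendations.

```python
# Example configuration for the "new" Kafka producer (Kafka 0.8.2+),
# which sends asynchronously and batches records in the background.
# All values here are illustrative starting points only.
def async_producer_config(brokers):
    return {
        "bootstrap.servers": brokers,     # broker list for the new producer API
        "acks": "1",                      # leader-only ack: throughput vs. durability trade-off
        "linger.ms": "5",                 # wait up to 5 ms to fill a batch before sending
        "batch.size": "65536",            # 64 KB batches amortize per-request overhead
        "compression.type": "snappy",     # smaller payloads over the wire
    }
```

Larger `linger.ms` and `batch.size` values generally raise throughput at the cost of per-record latency, which is the trade-off Spark Streaming tuning is making anyway.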

• Receiver Tuning

Spark Streaming provides receivers for various data sources, such as Kafka, Flume, MQTT, and ZeroMQ. The Kafka receivers come in the most variants and are the most mature.

Spark Streaming includes Kafka receiver APIs in three modes:

- KafkaReceiver receives Kafka data directly. If the process terminates abnormally, data loss may occur.

- ReliableKafkaReceiver records the offsets of received data in ZooKeeper, so that data can be re-fetched after a failure.

- DirectKafka reads the data of each Kafka Partition directly through RDDs, ensuring high reliability.

By design, DirectKafka performs best, and in actual tests it also clearly outperforms the other two APIs. The DirectKafka API is therefore recommended.

The data receiver acts as a Kafka consumer. For details about its tuning, see the Kafka open-source documentation: http://kafka.apache.org/documentation.html
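The recommended direct approach can be sketched as follows in PySpark. The broker address and topic name ("events") are placeholders, and the `createDirectStream` call is shown as a comment because it needs a live StreamingContext and cluster.

```python
# Sketch of the DirectKafka (direct-stream) approach for Spark Streaming.
# The direct API maps each Kafka Partition to an RDD partition and tracks
# offsets in Spark checkpoints itself, instead of relying on a receiver
# plus ZooKeeper, which is what gives it its reliability.
def direct_stream_params(brokers):
    return {
        "metadata.broker.list": brokers,   # broker list used by the direct (simple-consumer) API
        "auto.offset.reset": "smallest",   # start from the earliest offset when no checkpoint exists
    }

# Usage (illustration only; requires PySpark and the streaming-kafka package):
# from pyspark.streaming.kafka import KafkaUtils
# stream = KafkaUtils.createDirectStream(ssc, ["events"],
#                                        direct_stream_params("kafka01:9092"))
```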

• Processor Tuning

Spark Streaming runs on top of Spark. Consequently, most tuning measures for Spark also apply to Spark Streaming, for example:

- Data Serialization

- Optimizing Memory Configuration

- Setting the Degree of Parallelism (DOP)

- Using the External Shuffle Service to Improve Performance
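The four measures above map onto standard Spark configuration keys. A minimal sketch, with example values chosen for illustration (assumed, not benchmarked):

```python
# Illustrative SparkConf settings covering the four tuning measures above.
# The keys are standard Spark properties; the values are examples only
# and should be sized to the actual workload.
processor_tuning = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",  # data serialization
    "spark.executor.memory": "4g",                                     # memory configuration
    "spark.default.parallelism": "200",                                # degree of parallelism (DOP)
    "spark.shuffle.service.enabled": "true",                           # external shuffle service
}
```

These can be passed via `SparkConf.setAll(processor_tuning.items())` or as `--conf` options to spark-submit.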

Note

When performing Spark Streaming tuning, note one thing: pursuing performance alone may degrade the reliability of the whole Spark Streaming system. For example:

If spark.streaming.receiver.writeAheadLog.enable is set to false, disk operations are reduced and performance improves. However, without the WAL mechanism, data is lost when recovery from an abnormal exit is required.

Therefore, during Spark Streaming tuning, configuration items that ensure data reliability must not be disabled in a production environment.
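Concretely, the reliability-related settings can be kept on explicitly. The WAL flag is the one named in the article; `spark.streaming.stopGracefullyOnShutdown` is an additional standard Spark property included here as an assumption of good practice, not from the article.

```python
# Reliability-critical settings that should stay enabled in production,
# even though turning them off can look like a performance win.
reliability_conf = {
    # Keep the write-ahead log so received data survives driver/receiver failure:
    "spark.streaming.receiver.writeAheadLog.enable": "true",
    # Finish in-flight batches instead of dropping them on shutdown (assumed extra):
    "spark.streaming.stopGracefullyOnShutdown": "true",
}
```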

• Log Archive Optimization

The spark.eventLog.group.size parameter groups the JobHistory logs of an application by the specified number of jobs. Each group creates a separate log file, preventing the JobHistory read failures that an oversized log from a long-running application would otherwise cause. If this parameter is set to 0, logs are not grouped.

Most Spark Streaming jobs are small and are generated at a high rate. As a result, grouping happens frequently and produces a large number of small log files, consuming disk I/O resources. You are advised to increase the parameter value, for example, to 1000 or greater.

 


zaheernew (MVE Author) · Created Oct 11, 2021 13:50:15

Thanks for sharing.
