1.1.1 Spark Streaming Optimization
Scenario
Spark Streaming is a micro-batch stream processing framework that features second-level latency and high throughput. The aim of Spark Streaming tuning is therefore to improve throughput, so that the system processes as much data as possible per unit time while keeping latency at the second level.
This chapter applies to scenarios in which the data source is Kafka.
Procedure
A simple stream processing system consists of a data source, a receiver, and a processor. Here the data source is Kafka, the receiver is the Kafka receiver in Spark Streaming, and the processor is Spark Streaming itself.
For Spark Streaming tuning, all three components must perform well.
- Data Source Tuning
In real application scenarios, the data source writes data to local disks to ensure fault tolerance, whereas Spark Streaming keeps its computation results in memory. The data source can therefore become the biggest bottleneck of the streaming system.
For Kafka performance tuning, pay attention to the following:
− Use Kafka 0.8.2 or later, which provides the new Producer API with asynchronous sending.
− Configure multiple data directories and multiple I/O threads for each Broker, and an appropriate number of Partitions for each Topic.
For details, see the performance optimization section of the open-source Kafka documentation: http://kafka.apache.org/documentation.html
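The broker-side settings above can be sketched in `server.properties` form. The directory paths and numbers below are illustrative assumptions, not recommended values; size them for your own disks and consumer parallelism:

```properties
# server.properties (Kafka broker) -- illustrative values only
# Spread log segments across several physical disks to parallelize I/O.
log.dirs=/data1/kafka-logs,/data2/kafka-logs,/data3/kafka-logs
# Number of threads the broker uses for disk I/O.
num.io.threads=8
# Default number of partitions for newly created topics.
num.partitions=8
```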
- Receiver Tuning
Spark Streaming provides receivers for various data sources, such as Kafka, Flume, MQTT, and ZeroMQ. Among them, the Kafka receivers come in the most varieties and are the most mature.
Spark Streaming provides three Kafka receiver APIs:
− The KafkaReceiver receives Kafka data directly. If the process fails, data may be lost.
− The ReliableKafkaReceiver records the consumed offsets in ZooKeeper, so that data is not lost when the process recovers from a failure.
− The DirectKafka API reads the data of each Kafka Partition directly through RDDs, ensuring high reliability.
By design, DirectKafka offers the best performance, and in actual tests it also clearly outperforms the other two APIs. As a result, the DirectKafka API is recommended.
The data receiver is a Kafka consumer. For details about consumer tuning, see the open-source Kafka documentation: http://kafka.apache.org/documentation.html
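A minimal sketch of the recommended direct approach, using the Spark 1.x `KafkaUtils.createDirectStream` API. The broker addresses, topic name, and batch interval below are placeholder assumptions:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafkaExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectKafkaExample")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches

    // "metadata.broker.list" points the direct stream at the Kafka brokers.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

    // Each Kafka partition maps to one RDD partition; offsets are tracked by
    // Spark Streaming itself, without a ZooKeeper-based receiver.
    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))

    // Count the messages in each batch (values only; keys are ignored).
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```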
- Processor Tuning
Spark Streaming runs on top of Spark, so most Spark tuning measures also apply to Spark Streaming. Take the following as examples:
− Data Serialization
− Optimizing Memory Configuration
− Setting the Degree of Parallelism (DOP)
− Using the External Shuffle Service to Improve Performance
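The four measures above map to well-known Spark configuration items. A sketch in `spark-defaults.conf` form; the memory and parallelism values are illustrative assumptions to be tuned for your cluster:

```properties
# spark-defaults.conf -- illustrative values only
# Data serialization: Kryo is faster and more compact than Java serialization.
spark.serializer                 org.apache.spark.serializer.KryoSerializer
# Memory configuration: executor memory sized for the streaming workload.
spark.executor.memory            4g
# Degree of parallelism: default number of partitions used in shuffles.
spark.default.parallelism        8
# External shuffle service: serves shuffle files independently of executors.
spark.shuffle.service.enabled    true
```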
When performing Spark Streaming tuning, note one thing: pursuing performance alone may reduce the reliability of the whole Spark Streaming system. For example:
If spark.streaming.receiver.writeAheadLog.enable is set to false, disk operations are reduced and performance improves. However, without the WAL mechanism, data is lost when recovery from a failure is required.
Therefore, during Spark Streaming tuning, the configuration items that ensure data reliability must not be disabled in the production environment.
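For example, the reliability switch mentioned above should stay enabled in production:

```properties
# spark-defaults.conf -- keep reliability switches on in production
# Write-ahead log for receiver-based streams: costs extra disk writes, but
# received data survives a driver or receiver failure.
spark.streaming.receiver.writeAheadLog.enable    true
```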
- Log Archive Optimization
The spark.eventLog.group.size parameter groups the JobHistory logs of an application by the specified number of jobs. Each group creates one log file, which prevents JobHistory read failures caused by an oversized log generated during long-term running of the application. If this parameter is set to 0, logs are not grouped.
Most Spark Streaming jobs are small jobs generated at a high rate. As a result, grouping occurs frequently, producing a large number of small log files and consuming disk I/O resources. You are advised to increase the parameter value, for example, to 1000 or greater.
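The recommendation above can be expressed as a config sketch. The value 1000 follows the advice in the text; note that this parameter comes from the distribution described here and is not part of open-source Spark:

```properties
# spark-defaults.conf -- distribution-specific parameter (see text)
# Group JobHistory event logs every 1000 jobs to avoid many tiny log files.
spark.eventLog.group.size    1000
```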


