Hello, everyone!
In this post, I will share the principles and architecture of Spark.
Spark System Architecture
As shown in the following figure, Spark Core is the core of the Spark ecosystem. It reads data from components at the persistence layer, such as HDFS and HBase.
Mesos, YARN, and Spark's own standalone cluster manager schedule jobs for Spark applications. These applications can come from different components: Spark Shell/Spark Submit for batch processing, Spark Streaming for real-time processing, Spark SQL for ad hoc queries, MLlib for machine learning, GraphX for graph processing, and SparkR for mathematical computing.

Typical Case - WordCount

WordCount
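The classic WordCount case can be sketched in plain Python that mirrors Spark's `flatMap`/`map`/`reduceByKey` pipeline (the Spark equivalent is shown in the docstring; this is an illustration of the dataflow, not Spark code itself):

```python
from collections import defaultdict

def word_count(lines):
    """Mirror Spark's WordCount pipeline on a plain Python list.

    Spark equivalent (sketch):
      sc.textFile(path) \
        .flatMap(lambda line: line.split()) \
        .map(lambda w: (w, 1)) \
        .reduceByKey(lambda a, b: a + b)
    """
    # flatMap: split each line into words
    words = [w for line in lines for w in line.split()]
    # map: pair each word with a count of 1
    pairs = [(w, 1) for w in words]
    # reduceByKey: sum the counts per word
    counts = defaultdict(int)
    for w, n in pairs:
        counts[w] += n
    return dict(counts)
```

In Spark, each of these steps runs in parallel across partitions of the input; the logic per element is the same.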

Spark SQL Overview
Spark SQL is the Spark module for structured data processing. In Spark applications, you can seamlessly use SQL statements or the DataFrame API to query structured data.

Spark SQL allows developers to directly process RDDs and to query external data stored in Hive and HBase. An important feature of Spark SQL is its ability to handle relational tables and RDDs in a unified manner, so developers can easily combine SQL commands for external queries with more complex data analysis.
Spark SQL processes SQL statements in a way similar to a relational database: it parses each SQL statement into a tree, and then uses rules to bind and optimize that tree.
Lexical and syntax parsing: parses the read SQL statements to identify keywords (such as SELECT, FROM, and WHERE), expressions, projections, and data sources, checks that the statements are well formed, and produces a logical plan.
Bind: binds the SQL statement to the database's data dictionary (columns, tables, and views). If the referenced projections and data sources exist, the statement can be executed.
Optimize: applies rule-based optimizations to the logical plan and generates one or more candidate execution plans, from which the optimal one is selected.
Execute: executes the optimal execution plans obtained in the previous step and returns the dataset queried from the database.
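The four phases above can be illustrated with a toy query engine. All names here (the catalog, tables, and helper functions) are invented for illustration; real Spark SQL implements these phases in the Catalyst optimizer:

```python
# Toy illustration of the parse -> bind -> optimize -> execute phases.
CATALOG = {"people": ["name", "age"]}            # data dictionary: table -> columns
TABLES = {"people": [{"name": "Ann", "age": 34},
                     {"name": "Bob", "age": 19}]}

def parse(sql):
    # Lexical/syntax parsing: recognize SELECT/FROM/WHERE and build a logical
    # plan for queries shaped like "SELECT col FROM table WHERE col > n".
    tokens = sql.split()
    plan = {"project": [tokens[tokens.index("SELECT") + 1]],
            "scan": tokens[tokens.index("FROM") + 1],
            "filter": None}
    if "WHERE" in tokens:
        i = tokens.index("WHERE")
        plan["filter"] = (tokens[i + 1], tokens[i + 2], int(tokens[i + 3]))
    return plan

def bind(plan):
    # Bind: check the projection and data source against the data dictionary.
    cols = CATALOG.get(plan["scan"])
    if cols is None or any(c not in cols for c in plan["project"]):
        raise ValueError("unknown table or column")
    return plan

def optimize(plan):
    # Optimize (rule-based): prune columns so the scan reads only what is needed.
    needed = set(plan["project"])
    if plan["filter"]:
        needed.add(plan["filter"][0])
    plan["scan_columns"] = sorted(needed)
    return plan

def execute(plan):
    # Execute: run the chosen plan and return the queried dataset.
    rows = TABLES[plan["scan"]]
    if plan["filter"]:
        col, op, val = plan["filter"]
        assert op == ">"
        rows = [r for r in rows if r[col] > val]
    return [{c: r[c] for c in plan["project"]} for r in rows]

result = execute(optimize(bind(parse("SELECT name FROM people WHERE age > 20"))))
```

The pipeline returns `[{"name": "Ann"}]` for the sample data; Catalyst performs the same sequence, only on a far richer tree of logical and physical operators.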
For the differences between Spark SQL and Hive SQL, see What's the Difference between Spark SQL and Hive?
Spark SQL and Hive
Differences:
The execution engine of Spark SQL is Spark Core, and the default execution engine of Hive is MapReduce.
The execution speed of Spark SQL can be 10 to 100 times faster than that of Hive.
Spark SQL does not support buckets, whereas Hive does.
Relationships:
Spark SQL depends on the metadata of Hive.
Spark SQL is compatible with most syntax and functions of Hive.
Spark SQL can use the custom functions of Hive.
Spark Structured Streaming Overview
Structured Streaming is a stream-processing engine built on the Spark SQL engine. You can express a streaming computation the same way you would a batch computation on static data. As streaming data continues to arrive, Spark SQL processes it incrementally and continuously, and keeps updating the result set.
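The incremental model can be sketched in plain Python: each arriving micro-batch updates a running result set rather than triggering a full recomputation. The function name and batch layout below are invented for illustration; in real code this is what Structured Streaming does internally for a streaming aggregation:

```python
# Toy sketch of Structured Streaming's incremental model: each new
# micro-batch updates the running result set instead of recomputing
# the aggregation from scratch.
def update_result(result, batch):
    """result: running word counts; batch: newly arrived lines."""
    for line in batch:
        for word in line.split():
            result[word] = result.get(word, 0) + 1
    return result

result = {}
# The unbounded input stream arrives as a sequence of micro-batches.
for batch in (["cat dog"], ["dog dog"], []):
    result = update_result(result, batch)
```

After the three batches, `result` is `{"cat": 1, "dog": 3}`: each batch touched only the keys it contained, which is what makes the model efficient for unbounded input.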

Example Programming Model of Structured Streaming

Spark Streaming Overview
The basic principle of Spark Streaming is to split the real-time input data stream into time slices (on the order of seconds), and then use the Spark engine to process the data of each time slice in a manner similar to batch processing.

Window Interval and Sliding Interval

The window slides over the DStream. The RDDs that fall within the window are merged and operated on to generate a window-based RDD.
Window length: the duration covered by a window.
Sliding interval: the interval at which the window operation is performed.
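The two parameters can be sketched in plain Python, treating each element of `batches` as the data of one batch interval. The function name is invented for illustration; Spark's actual windowing works on RDDs inside a DStream:

```python
# Toy sketch of window length vs. sliding interval over a DStream.
# Each element of `batches` holds the data for one batch interval (e.g. 1 s).
def windows(batches, window_length, sliding_interval):
    """Return the merged contents of each window as it slides."""
    out = []
    for end in range(window_length, len(batches) + 1, sliding_interval):
        # Merge all batches that fall within the current window.
        merged = [x for b in batches[end - window_length:end] for x in b]
        out.append(merged)
    return out
```

For example, with four batches `[[1], [2], [3], [4]]`, a window length of 2 and a sliding interval of 1 yield the overlapping windows `[1, 2]`, `[2, 3]`, `[3, 4]`; a sliding interval of 2 yields the non-overlapping `[1, 2]`, `[3, 4]`.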
Spark Streaming is a high-throughput, fault-tolerant stream-processing system for real-time data streams. It can perform complex operations such as map, reduce, and join on multiple data sources (such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets) and save the results to external file systems and databases, or feed them to real-time dashboards.
The core idea of Spark Streaming is to split stream computing into a series of short batch jobs, with Spark Core as the batch-processing engine. That is, the input data of Spark Streaming is divided into segments based on a specified time slice (for example, 1 second). Each segment is converted into an RDD, and DStream transformations in Spark Streaming are translated into RDD transformations in Spark; the intermediate results of these RDD transformations are kept in memory.
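This micro-batch idea can be sketched in plain Python: split a timestamped stream into time slices, then apply the same operation to each slice. The function and variable names below are invented for illustration:

```python
# Toy sketch: split an input stream into time slices (micro-batches), then
# apply the same operation to every slice, as Spark Streaming does when it
# turns DStream operations into per-batch RDD operations.
def to_batches(events, slice_seconds):
    """Group (timestamp, value) events into consecutive time slices."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // slice_seconds), []).append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
batches = to_batches(events, slice_seconds=1)   # [["a", "b"], ["c"], ["d"]]
per_batch_counts = [len(b) for b in batches]    # one "RDD operation" per batch
```

Each slice is processed independently, which is why Spark Streaming inherits Spark Core's batch semantics and throughput, at the cost of a latency floor of one batch interval.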
Storm is a well-known framework in the real-time computing field. Compared with Storm, Spark Streaming provides better throughput; each framework performs better in different scenarios.
Application scenario of Storm:
1. Storm is recommended in scenarios where even a one-second delay is unacceptable, for example, a financial system that requires real-time transaction processing and analysis.
2. If a reliable transaction mechanism and reliability mechanism are required for real-time computing, that is, the processing of data must be accurate, Storm can be used.
3. If dynamic adjustment of real-time computing program parallelism is required based on peak and off-peak hours, Storm can be used to maximize the use of cluster resources (usually for small companies with cluster resource constraints).
4. If a big data system that requires real-time computing does not need interactive SQL queries or complex transformation operators, Storm is a good fit.
Spark Streaming: if sub-second latency, a strong transaction mechanism, and dynamic parallelism adjustment are not required, Spark Streaming can be used. As part of the Spark technology stack, Spark Streaming integrates seamlessly with Spark Core and Spark SQL: delayed batch processing, interactive queries, and other operations can be performed immediately and seamlessly on data that has just been processed in real time. This greatly enhances the advantages and capabilities of Spark Streaming.
Comparison Between Spark Streaming and Storm
For more details, see What's the Difference between Spark Streaming and Storm?
Summary of Spark-related posts:
A Spark Task Fails to Be Started Because Local Random Ports Are Used Up
Big Data Practice Analysis: Spark Read and Write Process Analysis
That's all, thanks!


