HCIA-Big Data | Spark Principles and Architecture

Latest reply: May 17, 2022 08:31:19

Hello, everyone!

In this post, I will share the Spark principles and architecture with you.

Spark System Architecture

As shown in the following figure, Spark Core is the core of the Spark ecosystem. Data is read from persistence-layer components such as HDFS and HBase.

Mesos, YARN, and Spark's own standalone cluster manager schedule jobs for Spark applications. These applications can come from different components, such as Spark Shell/Spark Submit for batch processing, Spark Streaming for real-time processing, Spark SQL for ad hoc queries, MLlib for machine learning, GraphX for graph processing, and SparkR for statistical computing.

Spark System Architecture

Typical Case - WordCount

WordCount
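The classic WordCount pipeline — flatMap the lines into words, map each word to a (word, 1) pair, then reduceByKey with addition — can be sketched in plain Python. The function below mirrors the shape of the Spark operators for illustration only; it is not the Spark API:

```python
from collections import defaultdict

def word_count(lines):
    """Model of Spark's WordCount: flatMap -> map -> reduceByKey."""
    # flatMap: split every line into individual words
    words = [w for line in lines for w in line.split()]
    # map: pair each word with a count of 1
    pairs = [(w, 1) for w in words]
    # reduceByKey: sum the counts per word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

result = word_count(["to be or not", "to be"])
# result == {"to": 2, "be": 2, "or": 1, "not": 1}
```

In actual Spark the same three steps run in parallel over partitions of an RDD, but the per-record logic is identical.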

Spark SQL Overview

Spark SQL is the module used in Spark for structured data processing. In Spark applications, you can seamlessly use SQL statements or DataFrame APIs to query structured data.

Spark SQL Overview

Spark SQL allows developers to directly process RDDs and to query external data stored in Hive and HBase. An important feature of Spark SQL is its ability to process relational tables and RDDs in a unified manner, so that developers can easily combine SQL commands for external queries with more complex data analysis.

Spark SQL processes SQL statements in a way similar to relational databases: it parses a SQL statement into a tree, and then applies rules to bind and optimize that tree.

Lexical and syntax parsing: parses the SQL statement to identify keywords (such as SELECT, FROM, and WHERE), expressions, projections, and data sources; checks whether the statement is standard; and forms a logical plan.

Bind: binds the SQL statement to the database's data dictionary (columns, tables, and views). If the referenced projections and data sources exist, the statement can be executed.

Optimize: generates several candidate execution plans and selects the optimal one.

Execute: executes the optimal plan obtained in the previous step and returns the dataset queried from the database.
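The four phases above can be illustrated with a toy pipeline over a list-of-dicts "table". The plan representation, the catalog shape, and the one-statement grammar here are invented for illustration and are far simpler than Spark's actual Catalyst trees:

```python
def parse(sql):
    """Lexical/syntax phase: recognize a 'SELECT <col> FROM <table>' shape
    and produce a tiny logical plan."""
    tokens = sql.split()
    if len(tokens) != 4 or tokens[0].upper() != "SELECT" or tokens[2].upper() != "FROM":
        raise ValueError("non-standard SQL statement")
    return {"projection": tokens[1], "source": tokens[3]}

def bind(plan, catalog):
    """Bind phase: check the projection and data source against the catalog."""
    table = catalog.get(plan["source"])
    if table is None or plan["projection"] not in table["columns"]:
        raise LookupError("unknown table or column")
    return {**plan, "rows": table["rows"]}

def optimize(plan):
    """Optimize phase: pick the best candidate plan.
    Here there is only one candidate, so it is returned as-is."""
    return plan

def execute(plan):
    """Execute phase: run the optimal plan and return the result set."""
    col = plan["projection"]
    return [row[col] for row in plan["rows"]]

catalog = {"users": {"columns": ["name", "age"],
                     "rows": [{"name": "ann", "age": 31}, {"name": "bo", "age": 27}]}}
names = execute(optimize(bind(parse("SELECT name FROM users"), catalog)))
# names == ["ann", "bo"]
```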

For the differences between Spark SQL and Hive SQL, see What's the Difference Between Spark SQL and Hive?

Spark SQL and Hive

Differences:

  • The execution engine of Spark SQL is Spark Core, and the default execution engine of Hive is MapReduce.

  • The execution speed of Spark SQL is 10 to 100 times faster than Hive.

  • Spark SQL does not support buckets, but Hive does.

Relationships:

  • Spark SQL depends on the metadata of Hive.

  • Spark SQL is compatible with most syntax and functions of Hive.

  • Spark SQL can use the custom functions of Hive.

Spark Structured Streaming Overview

Structured Streaming is a streaming data–processing engine built on the Spark SQL engine. You can express a streaming computation in the same way you would express a batch computation on static data. As streaming data is continuously generated, Spark SQL processes it incrementally and continuously updates the result set.
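The incremental model described above — each new micro-batch of input updates a running result set instead of recomputing everything — can be sketched in plain Python. This is a conceptual model only, not the Structured Streaming API:

```python
class RunningWordCount:
    """Model of Structured Streaming's incremental update: each new
    micro-batch of lines updates the result table in place."""

    def __init__(self):
        self.result = {}  # the continuously updated result set

    def process_batch(self, lines):
        # Only the new rows are processed; prior counts are reused.
        for line in lines:
            for word in line.split():
                self.result[word] = self.result.get(word, 0) + 1
        return dict(self.result)  # snapshot after this increment

stream = RunningWordCount()
first = stream.process_batch(["cat dog"])     # {'cat': 1, 'dog': 1}
second = stream.process_batch(["dog"])        # {'cat': 1, 'dog': 2}
```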

Structured Streaming

Example Programming Model of Structured Streaming

Spark Streaming Overview

The basic principle of Spark Streaming is to split real-time input data streams by time slice (in seconds), and then use the Spark engine to process the data of each time slice in a manner similar to batch processing.

Spark Streaming

Window Interval and Sliding Interval

The window slides over the DStream. The RDDs that fall within the window are merged and operated on to generate a window-based RDD.

  • Window length: indicates the duration of a window.

  • Sliding window interval: indicates the interval for performing window operations.
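For example, with a batch interval of 1 second, a window length of 3 seconds, and a sliding interval of 2 seconds, each window merges the last three one-second batches, and a new window fires every two batches. A plain-Python model of this merge (illustrative only, not the Spark API):

```python
def sliding_windows(batches, window_len, slide):
    """Merge the per-time-slice batches (lists of records) that fall into
    each window; window_len and slide are expressed in batch intervals."""
    windows = []
    for end in range(window_len, len(batches) + 1, slide):
        # Merge the RDD-like batches covered by this window position
        merged = [rec for batch in batches[end - window_len:end] for rec in batch]
        windows.append(merged)
    return windows

batches = [[1], [2], [3], [4], [5]]  # one batch per time slice
wins = sliding_windows(batches, window_len=3, slide=2)
# wins == [[1, 2, 3], [3, 4, 5]]
```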

Spark Streaming is a stream processing system that performs high-throughput, fault-tolerant processing on real-time data streams. It can perform complex operations such as map, reduce, and join on multiple data sources (such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets), and save the results to external file systems and databases or apply them to real-time dashboards.

The core idea of Spark Streaming is to split stream computing into a series of short batch jobs, with Spark Core as the batch processing engine. That is, the input data of Spark Streaming is divided into data segments based on a specified time slice (for example, 1 second). Each data segment is converted into an RDD in Spark, so DStream transformations in Spark Streaming become RDD transformations in Spark, and the intermediate results of those RDD transformations are kept in memory.
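The micro-batching described above — dividing the input by time slice and turning each segment into an RDD-like batch — can be modelled as grouping timestamped events by slice. This is a plain-Python illustration, not the Spark API:

```python
def split_into_batches(events, slice_seconds):
    """Group (timestamp, value) events into consecutive time slices,
    mirroring how Spark Streaming turns a stream into per-interval RDDs."""
    batches = {}
    for ts, value in events:
        # Events in the same slice land in the same micro-batch
        batches.setdefault(int(ts // slice_seconds), []).append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d")]
micro_batches = split_into_batches(events, slice_seconds=1)
# micro_batches == [["a", "b"], ["c"], ["d"]]
```

Each inner list would then be processed by the ordinary Spark batch engine.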

Storm is a well-known framework in the real-time computing field. Compared with Storm, Spark Streaming provides better throughput; each outperforms the other in different scenarios.

Application scenarios of Storm:

1. Storm is recommended in scenarios where even a one-second delay is unacceptable, for example, a financial system that requires real-time transaction processing and analysis.

2. If real-time computing requires a reliable transaction mechanism, that is, data must be processed exactly once, Storm can be used.

3. If the parallelism of a real-time computing program must be adjusted dynamically based on peak and off-peak hours to maximize the use of cluster resources (typical for small companies with constrained cluster resources), Storm can be used.

4. If a big data application system requiring real-time computing does not need to execute SQL interactive queries or complex transformation operators, Storm is ideal.

Spark Streaming: If strict real-time latency, a strong transaction mechanism, and dynamic parallelism adjustment are not required, Spark Streaming can be used. As part of the Spark technology stack, Spark Streaming integrates seamlessly with Spark Core and Spark SQL: batch processing, interactive queries, and other operations can be performed immediately and seamlessly on the intermediate data produced by real-time processing. This greatly enhances the advantages and capabilities of Spark Streaming.

Comparison Between Spark Streaming and Storm

For more details, see What's the Difference between Spark Streaming and Storm?

Summary of Spark-related posts

Title

[FI Components]  Relationship between Spark and HDFS

[FI Components] Basic Principle of Spark and Architecture

[FI Components] Working Principle of Spark

[FI Components] Spark Streaming Principle

[FI components Log] Spark Component Log Introduction

Insufficient Resources Causing Spark Running Failure

Disabling Driver in Advance Causing Spark Task Failures


A Spark Task Fails to Be Started Because Local Random Ports Are Used Up

What Is Spark SQL?

Spark Streaming consumption

Spark Basic Principles (Part I): RDD

Spark Application Scenarios

Three modes for SparkSQL to read MySQL

Spark Reads Hive and Writes HBase Samples

Big Data Practice Analysis: Spark Read and Write Process Analysis

Spark Learning Summary

That's all, thanks!

The post is synchronized to: HCIA-Big Data
