HCIA-Big Data | Introduction to Spark

Latest reply: Jul 27, 2022 11:28:30



Hello, everyone!

This post describes the basic concepts of Spark and the similarities and differences between the Resilient Distributed Dataset (RDD), DataSet, and DataFrame data structures in Spark. It also introduces the features of Spark SQL, Spark Streaming, and Structured Streaming.

What is Spark?

Apache Spark was originally developed at the UC Berkeley AMPLab in 2009.

It is a fast, versatile, and scalable memory-based big data computing engine.

As a one-stop solution, Apache Spark integrates batch processing, real-time streaming, interactive query, graph programming, and machine learning.

Application Scenarios of Spark

Batch processing can be used for extract, transform, and load (ETL) jobs.

Machine learning can be used to automatically determine whether eBay buyers' comments are positive or negative.

Interactive analysis can be used to query the Hive warehouse.

Stream processing can be used for real-time businesses such as page-click stream analysis, recommendation systems, and public opinion analysis.

Highlights of Spark

[Figure: Highlights of Spark]

Spark and MapReduce

[Figure: Spark and MapReduce]

Spark Data Structure

RDD - Core Concept of Spark

Resilient Distributed Datasets (RDDs) are resilient (fault-tolerant), read-only, partitioned distributed datasets.

RDDs are kept in memory by default; depending on the storage level, partitions that do not fit in memory can be written to disk.

RDD data is distributed across the cluster as partitions.

RDDs have a lineage mechanism, which allows for rapid data recovery when data loss occurs.
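
As a rough illustration of lineage-based recovery, the following pure-Python sketch models an RDD as a list of read-only partitions plus a record of how it was derived from its parent. It is not Spark's API: `ToyRDD`, `lose_partition`, and `get_partition` are hypothetical names, and unlike real Spark the toy `map` computes its result eagerly so that the focus stays on recovery.

```python
class ToyRDD:
    """Toy stand-in for an RDD: read-only partitions plus lineage (parent + function)."""

    def __init__(self, partitions, parent=None, fn=None):
        self._partitions = [tuple(p) for p in partitions]
        self.parent = parent   # lineage: the dataset this one was derived from
        self.fn = fn           # lineage: per-partition function used to derive it

    def map(self, f):
        """Derive a new ToyRDD and remember the parent and function as lineage."""
        def fn(part):
            return tuple(f(x) for x in part)

        computed = [fn(self.get_partition(i)) for i in range(len(self._partitions))]
        return ToyRDD(computed, parent=self, fn=fn)

    def lose_partition(self, i):
        """Simulate losing the in-memory copy of partition i (e.g. a node failure)."""
        self._partitions[i] = None

    def get_partition(self, i):
        """Return partition i, recomputing it from the parent if it was lost."""
        if self._partitions[i] is None:
            if self.parent is None:
                raise RuntimeError("source partition lost and no lineage to rebuild it")
            self._partitions[i] = self.fn(self.parent.get_partition(i))
        return self._partitions[i]


source = ToyRDD([[1, 2], [3, 4]])        # two partitions
doubled = source.map(lambda x: x * 2)
doubled.lose_partition(1)                # the second partition disappears
recovered = doubled.get_partition(1)     # rebuilt from the parent, not from a replica
print(recovered)
```

Only the lost partition is recomputed; the surviving partition of `doubled` is left untouched.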

[Figure: RDD]

RDD Dependencies

[Figure: RDD dependencies]

Differences Between Wide and Narrow Dependencies

For more details, see What's the difference between wide and narrow dependencies?
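
In general Spark terms, a dependency is narrow when each child partition is computed from a single parent partition, and wide when a child partition may need records from many parent partitions, which requires a shuffle. The sketch below shows the two cases in plain Python. The function names and the byte-sum partitioner are assumptions made for illustration and are not Spark APIs.

```python
from collections import defaultdict

# Two parent partitions of (key, value) records.
parent = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]


def narrow_map(partitions, f):
    """Narrow: child partition i is computed from parent partition i only."""
    return [[f(record) for record in part] for part in partitions]


def wide_group_by_key(partitions, num_out):
    """Wide: records are shuffled by key, so a child partition may read many parents."""
    out = [defaultdict(list) for _ in range(num_out)]
    sources = [set() for _ in range(num_out)]   # parent partitions feeding each child
    for parent_idx, part in enumerate(partitions):
        for key, value in part:
            target = sum(key.encode()) % num_out   # deterministic toy partitioner
            out[target][key].append(value)
            sources[target].add(parent_idx)
    return [dict(d) for d in out], sources


mapped = narrow_map(parent, lambda kv: (kv[0], kv[1] * 10))
grouped, sources = wide_group_by_key(parent, num_out=2)
print(mapped)
print(grouped, sources)
```

Here `sources` records which parent partitions contributed to each child partition: with the narrow operation that is always exactly one parent, while after the shuffle a child partition can draw on both parents.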

Stage Division of RDD

[Figure: Stage Division of RDD]
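
Stages are generally cut at wide (shuffle) dependencies, while consecutive narrow operations are pipelined within one stage. The following sketch applies that rule to a linear chain of operations. It is a simplification under stated assumptions: the `split_into_stages` helper and the dependency tags are hypothetical, real job graphs can branch, and a shuffle operation is assigned here entirely to the stage it opens.

```python
def split_into_stages(ops):
    """Group a linear chain of (name, dependency) operations into stages.

    A "wide" operation starts a new stage because its input must be shuffled;
    any other operation is pipelined into the current stage.
    """
    stages = [[]]
    for name, dependency in ops:
        if dependency == "wide" and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages


chain = [
    ("textFile", "source"),
    ("flatMap", "narrow"),
    ("map", "narrow"),
    ("reduceByKey", "wide"),
    ("map", "narrow"),
    ("sortByKey", "wide"),
]
stages = split_into_stages(chain)
for number, stage in enumerate(stages):
    print(number, stage)
```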

RDD Operation Type

Spark operations can be classified into creation, transformation, control, and action operations.

  • Creation operation: Used to create an RDD. An RDD is created from an in-memory collection or an external storage system, or by a transformation operation.

  • Transformation operation: An RDD is transformed into a new RDD through certain operations. Transformations are lazy: they only define a new RDD and do not compute it immediately.

  • Control operation: Used for RDD persistence. An RDD can be stored on disk or in memory according to different storage policies. For example, the cache API caches the RDD in memory by default.

  • Action operation: An operation that triggers Spark to run the computation. Action operations in Spark are classified into two types: one outputs the calculation result, and the other saves the RDD to an external file system or database.
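
The difference between lazy transformations and actions can be sketched with a small pure-Python model. `LazyRDD` is a hypothetical class written for this post, not Spark's implementation: its transformations only append to a recorded plan, and nothing runs until an action such as `collect` is called.

```python
class LazyRDD:
    """Toy model: transformations only record a plan; actions execute it."""

    def __init__(self, data, plan=()):
        self._data = list(data)
        self._plan = tuple(plan)    # recorded transformations, not yet run

    def map(self, f):
        """Transformation: returns a new LazyRDD without computing anything."""
        return LazyRDD(self._data, self._plan + (("map", f),))

    def filter(self, predicate):
        """Transformation: returns a new LazyRDD without computing anything."""
        return LazyRDD(self._data, self._plan + (("filter", predicate),))

    def collect(self):
        """Action: runs the recorded plan and returns the result."""
        items = self._data
        for kind, fn in self._plan:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def count(self):
        """Action: runs the plan and returns the number of records."""
        return len(self.collect())


calls = []


def traced_square(x):
    calls.append(x)     # record every time the function really runs
    return x * x


rdd = LazyRDD([1, 2, 3, 4]).map(traced_square).filter(lambda x: x > 4)
calls_before_action = len(calls)   # still 0: nothing has executed yet
result = rdd.collect()             # the action triggers the computation
print(calls_before_action, result)
```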

RDD Creation Operation

Currently, there are two types of basic RDDs:

  • Parallel collection: An existing in-memory collection is split into partitions so that it can be computed in parallel.

  • External storage: The RDD is built from a file, and functions are executed on each record in the file. The file system must be HDFS or any storage system supported by Hadoop.

Both types of RDD are operated on in the same way. Transformations on them produce a series of child RDDs, which together form a lineage graph.
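
The two creation paths can be mimicked in plain Python as follows. This is only a sketch: `parallelize` and `from_text_file` here are local helper functions with assumed behavior (contiguous, roughly equal slices; one record per line in a single partition), and the temporary local file stands in for HDFS or another Hadoop-supported store.

```python
import os
import tempfile


def parallelize(collection, num_partitions):
    """Split an in-memory collection into contiguous, roughly equal partitions."""
    items = list(collection)
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


def from_text_file(path):
    """Build a single-partition toy dataset with one record per line of the file."""
    with open(path, encoding="utf-8") as handle:
        return [[line.rstrip("\n") for line in handle]]


# Creation from an in-memory collection.
from_memory = parallelize(range(1, 8), 3)

# Creation from external storage (a temporary local file in this sketch).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "words.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("spark\nhadoop\nhive\n")
    from_file = from_text_file(path)

print(from_memory)
print(from_file)
```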

Control Operation

Spark can persist an RDD in memory or on disk. Keeping an RDD in memory greatly speeds up iterative computing and data sharing between computing models. In older Spark releases with static memory management, roughly 60% of an executor's memory was reserved by default for caching data and the remaining 40% for running tasks; since Spark 1.6, storage and execution share a unified memory region instead. The persist() and cache() operations are used for data persistence. cache() is a special case of persist() that uses the default storage level.
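
Why persistence helps when a dataset is reused can be seen in a small pure-Python model. `CachedDataset` is a hypothetical class, not Spark's API; it assumes that an unpersisted dataset is recomputed for every action, and that a persisted one is stored when the first action runs and reused afterwards.

```python
class CachedDataset:
    """Toy dataset that is recomputed on every action unless it is persisted."""

    def __init__(self, compute):
        self._compute = compute     # function that produces the data
        self._stored = None
        self._persist = False
        self.compute_calls = 0      # how many times the data was really computed

    def persist(self):
        """Mark the dataset for reuse; the data is stored on the first action."""
        self._persist = True
        return self

    def cache(self):
        """Shorthand for persist() with the default storage policy."""
        return self.persist()

    def _materialize(self):
        if self._stored is not None:
            return self._stored
        self.compute_calls += 1
        data = self._compute()
        if self._persist:
            self._stored = data
        return data

    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())


uncached = CachedDataset(lambda: [x * x for x in range(5)])
uncached.count()
uncached.total()

cached = CachedDataset(lambda: [x * x for x in range(5)]).cache()
cached.count()
cached_total = cached.total()

print(uncached.compute_calls, cached.compute_calls)
```

Two actions on the unpersisted dataset compute the data twice, while the cached dataset computes it once and serves the second action from the stored copy.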

Transformation Operation - Transformation Operator

[Figure: Transformation Operation - Transformation Operator]

Action Operation - Action Operator

[Figure: Action Operation - Action Operator]

DataFrame

Similar to an RDD, a DataFrame is an immutable, resilient, distributed dataset. In addition to the data, it records the data structure information, known as the schema, so a DataFrame resembles a two-dimensional table.

The query plan of a DataFrame can be optimized by the Spark Catalyst optimizer, so users get efficient execution while Spark stays easy to use: programs written with the DataFrame API are converted into efficient forms for execution.
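
To give a feel for what optimizing a query plan means, the sketch below applies one toy rewrite rule (moving a filter ahead of a projection) to a tiny linear plan and compares the work done. It is purely illustrative: the plan format, `execute`, and `push_filters_first` are assumptions made for this example, and Catalyst's real plans and rules are far richer.

```python
def execute(plan, rows):
    """Run a linear plan of ("filter", column, predicate) / ("project", columns) steps."""
    work = 0
    for step in plan:
        work += len(rows)    # count the rows touched by this step
        if step[0] == "filter":
            _, column, predicate = step
            rows = [r for r in rows if predicate(r[column])]
        else:
            _, columns = step
            rows = [{c: r[c] for c in columns} for r in rows]
    return rows, work


def push_filters_first(plan):
    """Toy rule: move a filter ahead of a preceding projection that keeps its column."""
    plan = list(plan)
    for i in range(1, len(plan)):
        prev, cur = plan[i - 1], plan[i]
        if cur[0] == "filter" and prev[0] == "project" and cur[1] in prev[1]:
            plan[i - 1], plan[i] = cur, prev
    return plan


rows = [
    {"name": "Ann", "age": 31, "city": "Oslo"},
    {"name": "Bob", "age": 17, "city": "Rome"},
    {"name": "Cy", "age": 45, "city": "Lima"},
    {"name": "Di", "age": 12, "city": "Kyiv"},
]
plan = [("project", ["name", "age"]), ("filter", "age", lambda age: age >= 18)]

naive_result, naive_work = execute(plan, rows)
optimized_result, optimized_work = execute(push_filters_first(plan), rows)
print(naive_work, optimized_work)
```

Both plans return the same rows, but the rewritten plan projects only the rows that survive the filter, so it touches fewer rows overall.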

DataSet

DataFrame is a special case of DataSet (DataFrame = Dataset[Row]), so you can use the "as" method to convert a DataFrame to a DataSet. Row is a generic type: all table records, whatever their structure, are represented as Row objects.

DataSet is a strongly typed dataset, for example, Dataset[Car] and Dataset[Person].
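
The idea of turning generic rows into a strongly typed dataset can be sketched in plain Python. This is an analogy rather than Spark code: the `Person` class and the `as_typed` helper are hypothetical, generic rows are modeled as dictionaries, and the type check happens when the conversion runs.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Person:
    name: str
    age: int


def as_typed(rows, cls):
    """Convert generic rows (dicts) to instances of cls, checking names and types."""
    typed = []
    for row in rows:
        kwargs = {}
        for field in fields(cls):
            value = row[field.name]    # KeyError if the column is missing
            if not isinstance(value, field.type):
                raise TypeError(f"column {field.name!r} is not {field.type.__name__}")
            kwargs[field.name] = value
        typed.append(cls(**kwargs))
    return typed


# Generic rows: any column can hold anything until it is checked.
rows = [{"name": "Ann", "age": 31}, {"name": "Bob", "age": 17}]

# Typed records: fields are accessed as attributes with known types.
people = as_typed(rows, Person)
adult_names = [p.name for p in people if p.age >= 18]
print(adult_names)

try:
    as_typed([{"name": "Eve", "age": "n/a"}], Person)
    rejected = False
except TypeError:
    rejected = True    # a row that does not match the declared types is refused
```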

Differences Between DataFrame, DataSet, and RDD

Assume that there are two lines of data in an RDD, which are displayed as follows:

[Figure: sample data in an RDD]

The data in DataFrame is displayed as follows:

[Figure: sample data in a DataFrame]

The data in DataSet is displayed as follows:

[Figure: sample data in a DataSet]

For more details, see The difference between RDD, DataFrame and DataSet in Spark.

Summary of Spark-related posts

Title

[FI Components] Relationship between Spark and HDFS

[FI Components] Basic Principle of Spark and Architecture

[FI Components] Working Principle of Spark

[FI Components] Spark Streaming Principle

[FI components Log] Spark Component Log Introduction

Insufficient Resources Causing Spark Running Failure

Disabling Driver in Advance Causing Spark Task Failures

A Spark Task Fails to Be Started Because Local Random Ports Are Used Up

What Is Spark SQL?

Spark Streaming consumption

Spark Basic Principles (Part I): RDD

Spark Application Scenarios

Three modes for SparkSQL to read MySQL

Spark Reads Hive and Writes HBase Samples

Big Data Practice Analysis: Spark Read and Write Process Analysis

Spark Learning Summary

That's all, thanks!

The post is synchronized to: HCIA-Big Data


NTan33
Created Apr 9, 2022 03:34:14

An interesting topic for those inclined towards it.

olive.zhao
Created Apr 11, 2022 00:51:36
Thanks, friend!  
RNT
Created Apr 9, 2022 05:12:54

Useful post


olive.zhao
Created Apr 11, 2022 00:51:41
 
Saqibaz
Created Apr 11, 2022 03:43:35

Good share

olive.zhao
Created Apr 12, 2022 00:38:05
Thanks, friend!  
zaheernew
MVE Author Created Apr 11, 2022 08:17:46

Useful Info

olive.zhao
Created Apr 12, 2022 00:38:25
Thanks, friend!  
TriNguyen
Created Apr 11, 2022 09:54:17

Great share

olive.zhao
Created Apr 12, 2022 00:38:52
Thanks, friend!  
phuta
Created Apr 11, 2022 13:42:52

Thanks for sharing

olive.zhao
Created Apr 12, 2022 00:39:02
 
hanhcao
Created Apr 12, 2022 07:09:21

Good one

MahMush
Moderator Author Created Apr 29, 2022 15:31:48

Important information shared

SaraZahid
Created Apr 29, 2022 16:46:30

Thanks for sharing