Hello, everyone!
This post describes the basic concepts of Spark and the similarities and differences between the Resilient Distributed Dataset (RDD), DataSet, and DataFrame data structures in Spark, as well as the features of Spark SQL, Spark Streaming, and Structured Streaming.
What is Spark?
Apache Spark was developed in the UC Berkeley AMP lab in 2009.
It is a fast, versatile, and scalable memory-based big data computing engine.
As a one-stop solution, Apache Spark integrates batch processing, real-time streaming, interactive query, graph programming, and machine learning.
Application Scenarios of Spark
Batch processing can be used for extract, transform, and load (ETL) jobs.
Machine learning can be used to automatically determine whether eBay buyers' comments are positive or negative.
Interactive analysis can be used to query the Hive warehouse.
Streaming processing can be used for real-time businesses such as page-click stream analysis, recommendation systems, and public opinion analysis.
Highlights of Spark
Spark and MapReduce
Spark Data Structure
RDD - Core Concept of Spark
Resilient Distributed Datasets (RDDs) are read-only, partitioned, fault-tolerant distributed datasets.
RDDs are stored in memory by default and are written to disk when memory is insufficient.
RDD data is stored in clusters as partitions.
RDDs have a lineage mechanism, which allows for rapid data recovery when data loss occurs.
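As a quick illustration of these properties, here is a minimal Scala sketch (the input path is a placeholder) that inspects an RDD's partitions and lineage:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

// Data is stored across the cluster as partitions.
val lines = sc.textFile("input.txt")   // placeholder path
println(lines.getNumPartitions)

// The lineage records how each RDD was derived, so a lost partition
// can be recomputed from its parents instead of restored from a replica.
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
println(counts.toDebugString)
```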
RDD Dependencies
Differences Between Wide and Narrow Dependencies
For more details, see What's the difference between wide and narrow dependencies?
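As a minimal sketch of the distinction, the narrow-dependency operator below keeps each child partition tied to a single parent partition, while the wide-dependency operator pulls data from all parent partitions:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Narrow dependency: each output partition depends on exactly one
// input partition, so no data moves between nodes.
val doubled = pairs.mapValues(_ * 2)

// Wide dependency: all values for a key must be gathered from every
// partition, so reduceByKey introduces a shuffle; a shuffle is exactly
// where Spark draws a stage boundary.
val totals = pairs.reduceByKey(_ + _)
```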
Stage Division of RDD
RDD Operation Types
Spark operations can be classified into creation, transformation, control, and action operations (see the sketch after this list).
Creation operation: creates an RDD, either from an in-memory collection, from an external storage system, or by transforming an existing RDD.
Transformation operation: turns an RDD into a new RDD. Transformations are lazy: they only define the new RDD and do not execute it immediately.
Control operation: persists an RDD. An RDD can be stored in memory or on disk depending on the storage policy; for example, the cache API caches the RDD in memory by default.
Action operation: triggers Spark execution. Actions fall into two types: those that output the computation result and those that save the RDD to an external file system or database.
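A minimal sketch that strings the four operation types together (paths are placeholders):

```scala
// Creation: build an RDD from external storage.
val lines = sc.textFile("data.txt")

// Transformation: lazy; this only defines a new RDD, nothing runs yet.
val words = lines.flatMap(_.split(" "))

// Control: mark the RDD to be cached in memory.
words.cache()

// Action: triggers the actual computation. One kind outputs a result,
val total = words.count()

// the other saves the RDD to an external file system.
words.saveAsTextFile("output-dir")
```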
RDD Creation Operation
Currently, there are two types of basic RDDs:
Parallelized collection: an existing in-memory collection is distributed and computed in parallel.
External storage: a function is applied to each record in a file stored in HDFS or any other storage system supported by Hadoop.
Both types of RDDs can be operated on in the same way to derive child RDDs and form a lineage graph, as sketched below.
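A minimal sketch of both creation paths (the collection contents and the HDFS path are illustrative):

```scala
// Parallelized collection: distribute an in-memory collection.
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

// External storage: one record per line of an HDFS file.
val logs = sc.textFile("hdfs:///user/demo/logs.txt")

// Both kinds support the same operations and extend the lineage graph.
val evens = nums.filter(_ % 2 == 0)
```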
Control Operation
Spark can persist RDDs in memory or in the disk file system. Keeping RDDs in memory can greatly speed up iterative computation and data sharing between computing models. Generally, 60% of an execution node's memory is used to cache data, and the remaining 40% is used to run tasks. In Spark, the persist and cache operations provide data persistence, and cache is a special case of persist().
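A minimal sketch (the path is a placeholder, and the storage level is just one possible policy):

```scala
import org.apache.spark.storage.StorageLevel

val words = sc.textFile("data.txt").flatMap(_.split(" "))

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
words.cache()

// persist() lets you choose the policy explicitly; note that an RDD's
// storage level cannot be changed once it has been assigned.
val pairs = words.map((_, 1))
pairs.persist(StorageLevel.MEMORY_AND_DISK)
```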
Transformation Operation - Transformation Operator
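Some commonly used transformation operators, as a non-exhaustive sketch (all of them are lazy):

```scala
val rdd = sc.parallelize(Seq("a b", "b c"))

val words   = rdd.flatMap(_.split(" "))   // one record per word
val pairs   = words.map((_, 1))           // to key-value pairs
val unique  = words.distinct()            // remove duplicates
val grouped = pairs.groupByKey()          // wide: shuffles by key
val summed  = pairs.reduceByKey(_ + _)    // wide: aggregates by key
```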
Action Operation - Action Operator
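And some commonly used action operators, each of which triggers a job (`summed` and `words` come from the sketch above; the output path is a placeholder):

```scala
summed.count()                        // number of elements
summed.collect()                      // bring all results to the driver
summed.take(2)                        // first two elements
words.reduce(_ + _)                   // combine elements on the driver
summed.saveAsTextFile("out-dir")      // write to an external file system
```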
DataFrame
Like an RDD, a DataFrame is an immutable, resilient, distributed dataset. In addition to the data itself, it records structure information, known as a schema, so the data is organized like a two-dimensional table.
The query plan of a DataFrame is optimized by the Spark Catalyst optimizer, which makes DataFrames both easy to use and efficient: programs written against the DataFrame API are converted into an efficient form for execution.
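A minimal sketch (the car data and column names are made up) showing a DataFrame's schema and the plan Catalyst produces:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("df-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("Ford T", 1908), ("Tesla S", 2012)).toDF("model", "year")

df.printSchema()                      // the schema: named, typed columns
df.filter($"year" > 2000).explain()   // the Catalyst-optimized physical plan
```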
DataSet
DataFrame is a special case of DataSet (DataFrame = Dataset[Row]), so you can use the "as" method to convert a DataFrame to a DataSet. Row is a generic type, through which all table structure information is represented.
DataSet is a strongly typed dataset, for example, Dataset[Car] or Dataset[Person].
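For example (the `Person` case class and its data are illustrative; `spark` is the session from the sketch above):

```scala
import org.apache.spark.sql.Dataset
import spark.implicits._

case class Person(name: String, age: Int)

// DataFrame = Dataset[Row]; "as" turns it into a strongly typed Dataset.
val people: Dataset[Person] =
  Seq(("Alice", 29), ("Bob", 31)).toDF("name", "age").as[Person]

// Fields are now checked at compile time instead of by column-name strings.
people.map(p => p.age + 1)
```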
Differences Between DataFrame, DataSet, and RDD
Assume that an RDD, a DataFrame, and a DataSet each hold the same two rows of data. Each structure presents those rows differently:
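A sketch with the made-up people data from above (`people` is the Dataset defined in the previous example):

```scala
// RDD: opaque objects; Spark sees no structure inside them.
val rdd = spark.sparkContext.parallelize(
  Seq(Person("Alice", 29), Person("Bob", 31)))
rdd.foreach(println)
// Person(Alice,29)
// Person(Bob,31)

// DataFrame: named, typed columns; each row is a generic Row.
people.toDF().show()
// +-----+---+
// | name|age|
// +-----+---+
// |Alice| 29|
// |  Bob| 31|
// +-----+---+

// DataSet: the same columnar view, but each row is a typed Person.
people.show()
```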
For more details, see The difference between RDD, DataFrame and DataSet in Spark.
Summary of Spark-related posts

| Title |
| --- |
| A Spark Task Fails to Be Started Because Local Random Ports Are Used Up |
| Big Data Practice Analysis: Spark Read and Write Process Analysis |
That's all, thanks!