
Spark Basic Principles (Part I): RDD

Latest reply: Dec 17, 2021 02:00:32

Hello, everyone!

This post shares the basics of RDD with you.

1. What is RDD?

RDD stands for Resilient Distributed Dataset. Spark processes data based on RDDs. Formally, an RDD is a read-only, partitioned collection of records: a fault-tolerant collection of elements that can be operated on in parallel.

How is an RDD generated? There are three main generation modes (a combined sketch follows the list):

  • Generated by using the parallelize function: val distData = sc.parallelize(data)

  • Generated from persistent (external) storage: val lines = sc.textFile("data.txt")

  • Generated by another RDD calculation: val lineLengths = lines.map(s => s.length)
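
Putting the three modes together, here is a minimal runnable sketch of them in Spark's Scala API. The application name, the local master setting, and the input file data.txt are illustrative assumptions, not part of the original example.

import org.apache.spark.{SparkConf, SparkContext}

object RddCreationSketch {
  def main(args: Array[String]): Unit = {
    // A local SparkContext, only for trying the example out.
    val conf = new SparkConf().setAppName("rdd-creation-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // 1. Generated from an in-memory collection with parallelize
    val data = Seq(1, 2, 3, 4, 5)
    val distData = sc.parallelize(data)

    // 2. Generated from persistent (external) storage
    val lines = sc.textFile("data.txt")

    // 3. Generated from another RDD through a transformation
    val lineLengths = lines.map(s => s.length)

    println(distData.count()) // an action, which triggers the actual computation
    sc.stop()
  }
}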

RDDs are fault-tolerant collections of elements that can be operated on in parallel, and they are the basis of Spark's parallelism and fault tolerance.

2. Composition of RDD

2.1 partition

@volatile @transient private var partitions_ : Array[Partition] = _

When an RDD is generated by an RDD operation, the data is divided into n pieces. Each piece corresponds to a partition of the RDD, and the RDD stores an array of these partitions. The number of partitions can be specified when the RDD is created; if it is not specified, the spark.default.parallelism configuration item determines it.
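
As a small illustration (reusing the sc value from the sketch in Section 1), the number of partitions can be passed in when the RDD is created and inspected afterwards:

val eightParts = sc.parallelize(1 to 1000, 8) // explicit number of partitions
val fileRdd = sc.textFile("data.txt", 4)      // minimum number of partitions for a file-based RDD
println(eightParts.getNumPartitions)          // 8
println(sc.defaultParallelism)                // the value derived from spark.default.parallelism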

When tasks are generated, the number of partitions determines the number of tasks and therefore affects the program's parallelism. If there are too few partitions, resources cannot be fully used; for example, in local mode with 16 cores but only 8 partitions, half of the cores are idle. If there are too many partitions, there are too many tasks, and the time overhead for serializing and transmitting tasks increases.

The recommended number of partitions is typically 2-4 partitions for each CPU in your cluster.

The number of partitions can be adjusted by using the repartition and coalesce functions. repartition(numPartitions) is equivalent to coalesce(numPartitions, shuffle = true): it not only adjusts the number of partitions but also redistributes the data with a HashPartitioner, which generates a shuffle operation. The coalesce function lets you control whether to shuffle; if shuffle is set to false, the number of partitions can only be decreased.
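
A small sketch of both functions, reusing the eightParts RDD from the previous sketch:

val more = eightParts.repartition(16)                     // always shuffles; can increase or decrease the count
val fewer = eightParts.coalesce(2)                        // shuffle = false by default, so it can only decrease
val viaCoalesce = eightParts.coalesce(16, shuffle = true) // equivalent to repartition(16)
println((more.getNumPartitions, fewer.getNumPartitions, viaCoalesce.getNumPartitions))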

2.2 compute()

def compute(split: Partition, context: TaskContext): Iterator[T]

The compute function calculates the data of each partition and is implemented by the subclasses of RDD. Every RDD has a compute function, which is responsible for receiving the partition data passed from its parent RDD.
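
To make this concrete, here is a deliberately simplified custom RDD that overrides compute(). It is only a sketch showing where the per-partition calculation lives; it generates its own data instead of reading from a parent, whereas Spark's built-in RDDs (for example MapPartitionsRDD) typically obtain the parent partition's iterator and apply a function to it.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A hypothetical partition type: it only needs to know its own index.
class RangePartition(override val index: Int) extends Partition

// A hypothetical RDD whose partition i produces the numbers i*perPart until (i+1)*perPart.
class SimpleRangeRDD(sc: SparkContext, numParts: Int, perPart: Int)
  extends RDD[Int](sc, Nil) { // Nil: no parent RDDs, so no dependencies

  override protected def getPartitions: Array[Partition] =
    (0 until numParts).map(i => new RangePartition(i): Partition).toArray

  // compute() returns the data of one partition as an iterator.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val i = split.index
    (i * perPart until (i + 1) * perPart).iterator
  }
}

// Usage: new SimpleRangeRDD(sc, 4, 10).collect() returns the numbers 0 to 39.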

Let's look at the operations RDD provides for computation, which are essentially the APIs for computing RDDs.

RDD operations for calculation are classified into transformation operations and action operations. The difference between the two is that an RDD does not compute any values when a transformation is performed; values are computed only when an action is performed. RDD computation is therefore lazy. This mechanism enables RDD computations to form one or more calculation chains, which finally form a logical execution graph, a directed acyclic graph (DAG). Spark's parallel computing capability and fault tolerance are based on this mechanism. The logical execution graph will be further explained in the next chapter.
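
A tiny sketch of this laziness (reusing the sc value from earlier): the two transformations below do not run any job by themselves, and the whole chain is computed only when the action at the end is called.

val nums = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)   // transformation: nothing is computed yet
val big = doubled.filter(_ > 5) // transformation: still nothing is computed
println(big.count())            // action: the DAG built so far is executed now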

The following are some typical transformation and action operations. Each of these operations is implemented by overriding the compute() function to perform a specific calculation.

  • Typical transformation operations: map(func), flatMap(func), filter(func), mapPartitions(func), union(otherRDD), join(otherDataset, [numTasks]), groupByKey([numTasks]), reduceByKey(func, [numTasks]), sortByKey([ascending], [numTasks])

  • Typical action operations: count(), collect(), reduce(func), saveAsTextFile(path)

2.3 dependencies

@volatile private var dependencies_ : Seq[Dependency[_]] = _

dependencies indicates the dependencies between an RDD and other RDDs. Dependencies are generated by RDD operations, for example, val lineLengths = lines.map(s => s.length). This means that RDD lineLengths depends on RDD lines; RDD lines is also called the parent RDD of RDD lineLengths.

In Spark, dependencies are classified into two types: NarrowDependency (full dependency) and ShuffleDependency (partial dependency). What is the difference between the two? In an RDD, each partition can depend on one or more partitions of its parent. If a partition of the child RDD fully depends on the parent partitions it uses, the dependency is a NarrowDependency. If a partition of the child RDD needs only part of the data in a parent partition, the dependency is a ShuffleDependency.

(Figure: the four dependency cases between a child RDD x and its parent RDD a)

Among the four dependency cases in the figure, the first three are NarrowDependency: each partition in RDD x fully depends on one or more partitions in RDD a. In particular, the first case is called OneToOneDependency. The fourth case is ShuffleDependency: a partition in RDD x depends on only part of the data of a partition in the parent RDD, and the rest of that partition's data is needed by other partitions in RDD x. Why do the two kinds of dependencies need to be distinguished? This is related to Spark's physical execution graph, which will be further described in the next section.

Different RDD operations generate different dependencies. Operations such as map, which do not need to redistribute data, generate NarrowDependency, while shuffle operations such as reduceByKey generate ShuffleDependency. In the preceding example, lineLengths is generated through the map operation, so the dependency between lineLengths and lines is NarrowDependency.
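
As a small check (reusing the sc value from earlier), the dependencies field can be inspected directly: map keeps a OneToOneDependency on its parent, while reduceByKey introduces a ShuffleDependency. The exact printed class names may vary with the Spark version.

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val mapped = pairs.map { case (k, v) => (k, v * 10) }
val reduced = mapped.reduceByKey(_ + _)
println(mapped.dependencies)  // e.g. List(org.apache.spark.OneToOneDependency@...)
println(reduced.dependencies) // e.g. List(org.apache.spark.ShuffleDependency@...)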

For the first three cases (NarrowDependency), Spark can directly perform the computation in parallel. For the fourth case (ShuffleDependency), a shuffle operation is required. For details about the shuffle operation, see later chapters.

That's all, thanks!

olive.zhao
Admin Created Dec 17, 2021 02:00:32

Thanks for your sharing!