HCIA-Big Data | Introduction to MapReduce

Latest reply: Jul 28, 2022 17:54:34

Hello, everyone!

This post introduces MapReduce, one of the best-known computing frameworks for batch and offline processing in the big data field. You will learn its principles, processing flow, and application scenarios.

What is MapReduce?

MapReduce is designed and developed based on the MapReduce paper published by Google. Built on the divide-and-conquer approach, it is used for parallel, offline computation over large-scale datasets (larger than 1 TB). It has the following features:

  • Highly abstract programming model: Programmers only need to describe what to do; the execution framework takes care of how the work is carried out.

  • Outstanding scalability: Cluster capabilities can be improved by adding nodes.

  • High fault tolerance: Cluster availability and fault tolerance are improved through policies such as migrating computation or data.

Functions and Architectures of MapReduce

MapReduce Process

As the name implies, the MapReduce calculation process can be divided into two phases: Map and Reduce. The output in the Map phase is the input in the Reduce phase.

MapReduce can be understood as a process of summarizing a large amount of disordered data based on certain features and processing the data to obtain the final result.

  • In the Map phase, keys and values (features) are extracted from each piece of uncorrelated data.

  • In the Reduce phase, data is organized so that each key is followed by its values. These values are correlated. On this basis, further processing may be performed to obtain a result.
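The two phases can be sketched in a few lines of plain Python. This is only an in-memory illustration of the idea, not how Hadoop is programmed; the helper names (`map_phase`, `group_by_key`, `reduce_phase`) and the city/temperature data are made up for this example.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply map_fn to every record and collect the (key, value) pairs it emits."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def group_by_key(pairs):
    """Gather all values that share a key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_phase(grouped, reduce_fn):
    """Call reduce_fn once per key with the list of values for that key."""
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

# Hypothetical job: maximum temperature per city.
def extract_city_temp(line):
    city, temp = line.split(",")
    return [(city, int(temp))]

def max_temp(city, temps):
    return max(temps)

readings = ["Oslo,3", "Cairo,31", "Oslo,7", "Cairo,28"]
result = reduce_phase(group_by_key(map_phase(readings, extract_city_temp)), max_temp)
print(result)  # {'Oslo': 7, 'Cairo': 31}
```

The user only writes `extract_city_temp` and `max_temp`; everything else stands in for work the framework does.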

MapReduce Workflow

Figure: MapReduce Workflow

In the preceding figure, different Map tasks do not communicate with each other, and different Reduce tasks do not exchange information. Users cannot explicitly send messages from one machine to another. All data exchange is implemented by the MapReduce framework. 

Map Phase

Before jobs are submitted, the files to be processed are split. By default, the MapReduce framework regards a block as a split. Client applications can redefine the mapping between blocks and splits.
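As a rough illustration of how a file maps to splits, the sketch below cuts a file into fixed-size byte ranges. The function name `compute_splits` and the 128 MB block size are assumptions made for this example; the real framework applies additional rules when it computes splits.

```python
def compute_splits(file_size, split_size):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # assumed block size for this illustration

# One split per block: a 300 MB file yields three splits (128 + 128 + 44 MB).
splits = compute_splits(300 * MB, BLOCK_SIZE)

# A client that chooses a different split size gets a different mapping.
smaller_splits = compute_splits(300 * MB, 64 * MB)
```

Each split is processed by one Map task, so the split size indirectly sets the number of Map tasks.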

In the Map phase, output data is first stored in a ring (circular) memory buffer. When the buffer is about 80% full, a spill occurs: the data in the buffer is written to local disks.
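The spill trigger can be imitated with a toy buffer. This sketch is deliberately simplified: it counts records instead of bytes, uses a plain list rather than a real ring buffer, and spills synchronously. The class name `SpillBuffer` and its fields are hypothetical.

```python
class SpillBuffer:
    """Toy in-memory buffer that spills a sorted run once a fill threshold is reached."""

    def __init__(self, capacity, spill_ratio=0.8):
        self.capacity = capacity                  # assumed unit: records, not bytes
        self.threshold = capacity * spill_ratio   # 80% of capacity by default
        self.records = []
        self.spills = []                          # each entry stands in for one file on local disk

    def write(self, key, value):
        self.records.append((key, value))
        if len(self.records) >= self.threshold:
            self.spill()

    def spill(self):
        """Sort the buffered records, store them as one spill, and empty the buffer."""
        if self.records:
            self.spills.append(sorted(self.records))
            self.records = []

buf = SpillBuffer(capacity=5)   # threshold is 4 records
for word in "a b c d e f g h i".split():
    buf.write(word, 1)
buf.spill()                     # flush whatever remains when the map task ends

print([len(s) for s in buf.spills])  # [4, 4, 1]
```

Spilling before the buffer is completely full leaves room for the map function to keep writing while earlier data is being flushed.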

Figure: Map Phase

Partition: By default, a hash algorithm is used for partitioning. The MapReduce framework sets the number of partitions to the number of Reduce tasks. Records with the same key are sent to the same Reduce task for processing.
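A hash partitioner boils down to "hash the key, take the remainder". The sketch below uses `zlib.crc32` only because it gives the same result on every run; it is a stand-in for illustration, not the hash function the framework itself uses.

```python
import zlib

def partition(key, num_reduce_tasks):
    """Map a key to a partition index in the range [0, num_reduce_tasks)."""
    # crc32 is a stable stand-in hash chosen for reproducibility (assumption).
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks

pairs = [("Hi", 1), ("Hello", 1), ("Hi", 1), ("World", 1)]
num_reduce_tasks = 3

# Bucket every record by the partition of its key.
partitions = {p: [] for p in range(num_reduce_tasks)}
for key, value in pairs:
    partitions[partition(key, num_reduce_tasks)].append((key, value))
```

Because the partition depends only on the key, both `("Hi", 1)` records land in the same bucket and therefore reach the same Reduce task.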

Sort: The outputs of Map are sorted by key. For example, ('Hi', '1'), ('Hello', '1') is reordered as ('Hello', '1'), ('Hi', '1').

Combine: By default, combination is optional in the MapReduce framework. For example, you can combine ('Hi', '1'), ('Hi', '1'), ('Hello', '1'), and ('Hello', '1') into ('Hi', '2'), ('Hello', '2').
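Sorting and combining can be shown together on the same sample pairs. This is a minimal sketch with integer counts; `sort_and_combine` is a name invented for the example.

```python
from itertools import groupby
from operator import itemgetter

def sort_and_combine(pairs):
    """Sort map output by key, then sum the values of identical keys."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, sum(value for _, value in group))
            for key, group in groupby(ordered, key=itemgetter(0))]

map_output = [("Hi", 1), ("Hi", 1), ("Hello", 1), ("Hello", 1)]
combined = sort_and_combine(map_output)
print(combined)  # [('Hello', 2), ('Hi', 2)]
```

Combining on the Map side shrinks four records to two before anything is sent over the network, which is why it is worth enabling when the Reduce logic allows it (summing does).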

Merge: By the time a Map task finishes, many spill files may have been generated. These spill files must be merged into a single partitioned and sorted MapOutFile (MOF). To reduce the amount of data written, MOF files can be compressed before being written.
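Since every spill is already sorted, merging them is a k-way merge rather than a full re-sort. The sketch below models each record as a `(partition, key, value)` tuple, which is an assumption made for the example, and uses the standard `heapq.merge`.

```python
import heapq

def merge_spills(spills):
    """Merge several (partition, key)-sorted spill runs into one sorted run."""
    return list(heapq.merge(*spills))

# Each spill is already sorted by partition, then by key.
spill_1 = [(0, "Hello", 1), (1, "Hi", 1)]
spill_2 = [(0, "Hello", 1), (0, "World", 1)]
spill_3 = [(1, "Hi", 1), (1, "Hi", 1)]

mof = merge_spills([spill_1, spill_2, spill_3])
for record in mof:
    print(record)
```

In the merged result all records of partition 0 come first, followed by those of partition 1, so each Reduce task can later read its own contiguous section.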

Reduce Phase

The MOF files are sorted. If the amount of data received by a Reduce task is small, the data is stored directly in the buffer. As the number of files in the buffer increases, a MapReduce background thread merges them into a large ordered file. Many intermediate files are generated during the merge operation. The last merge result is passed directly to the Reduce function defined by the user.

When the data volume is small, the data does not need to be spilled to disk. Instead, it is merged in the cache and then passed to Reduce.

Figure: Reduce Phase

Generally, when the MOF output progress of a Map task reaches 3%, a Reduce task is started to obtain MOF files from each Map task. The number of Reduce tasks is set by the client and determines the number of MOF partitions. For this reason, each partition of the MOF files output by Map tasks has a corresponding Reduce task. The following figure shows the heavy load simulation process.

Shuffle Process

Shuffle is the intermediate data transfer process between the Map phase and the Reduce phase. It consists of Reduce tasks obtaining MOF files from the Map tasks, and then sorting and merging those files.

Figure: Shuffle Process

Before a Map task is complete, the system merges its spill files into one large file, which is saved on the local disk. During file merging, if the number of spill files is greater than the preset value (3 by default), the combiner can be started again. JobTracker keeps monitoring the execution of Map tasks and notifies Reduce tasks to obtain data.

The Reduce task queries the JobTracker through a remote procedure call (RPC) to check whether a Map task is completed. If it is, the Reduce task obtains its data. The obtained data is stored in the cache, and data from different Map machines is merged before being written to disk. Multiple spill files are merged into one or more large files, and the key-value pairs in the files are sorted.
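The Reduce side of the shuffle can be imitated as "fetch my partition from every Map output, merge, group by key". The sketch below does this in memory; it leaves out the RPC, caching, and disk steps, and `shuffle_for_reducer` and the `(partition, key, value)` record layout are assumptions made for the example.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def shuffle_for_reducer(map_outputs, partition_id):
    """Collect one partition from every map output and group its values by key."""
    # "Fetch": keep only the records that belong to this Reduce task.
    fetched = [
        [(key, value) for part, key, value in mof if part == partition_id]
        for mof in map_outputs
    ]
    # Merge the already-sorted runs, then gather the values of each key.
    merged = heapq.merge(*fetched, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(merged, key=itemgetter(0))]

# Two hypothetical map outputs, each sorted by partition and then by key.
mof_1 = [(0, "Hello", 2), (1, "Hi", 1)]
mof_2 = [(0, "Hello", 1), (0, "World", 1), (1, "Hi", 3)]

reducer_0_input = shuffle_for_reducer([mof_1, mof_2], partition_id=0)
reducer_1_input = shuffle_for_reducer([mof_1, mof_2], partition_id=1)
```

Each entry of the result is a key with the list of its values, which is the shape of input the user-defined Reduce function receives.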

WordCount Examples for Typical Programs

Figure: WordCount Examples for Typical Programs

If a user needs to count how often each word occurs in text file A, MapReduce can help implement the analysis quickly. The preceding figure shows the analysis process.

1. File A is stored on HDFS and divided into blocks A.1, A.2, and A.3 that are stored on DataNodes #1, #2, and #3. 

2. The WordCount program provides user-defined Map and Reduce functions and submits the analysis application to ResourceManager. ResourceManager then creates a job based on the request and creates three Map tasks and three Reduce tasks, each running in a container.

3. Map tasks 1, 2, and 3 each output an MOF file that is partitioned and sorted but not combined.

4. The Reduce tasks obtain the MOF files from the Map tasks. After combination, sorting, and user-defined Reduce logic processing, the statistics displayed in the table on the right are output.
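The four steps can be simulated end to end in one process. This is a sketch under stated assumptions: the block contents are invented (the original text of file A is not given), `zlib.crc32` stands in for the framework's hash, and lists replace HDFS blocks, containers, and MOF files.

```python
import zlib
from collections import defaultdict

NUM_REDUCE_TASKS = 3

def map_words(block):
    """User-defined Map function: emit (word, 1) for every word in a block."""
    return [(word, 1) for word in block.split()]

def reduce_counts(word, counts):
    """User-defined Reduce function: total the counts of one word."""
    return sum(counts)

def partition(word):
    # Stable stand-in hash so the run is reproducible (assumption).
    return zlib.crc32(word.encode("utf-8")) % NUM_REDUCE_TASKS

def run_wordcount(blocks):
    # Step 3: one Map task per block; output is partitioned and sorted, not combined.
    map_outputs = [sorted((partition(w), w, c) for w, c in map_words(b)) for b in blocks]
    results = {}
    # Step 4: each Reduce task collects its partition from every Map output.
    for reducer in range(NUM_REDUCE_TASKS):
        grouped = defaultdict(list)
        for mof in map_outputs:
            for part, word, count in mof:
                if part == reducer:
                    grouped[word].append(count)
        for word in sorted(grouped):
            results[word] = reduce_counts(word, grouped[word])
    return results

# Hypothetical contents of blocks A.1, A.2, and A.3.
blocks = ["hello world hello", "hi hadoop hello", "world hi hi"]
counts = run_wordcount(blocks)
print(counts)
```

With these sample blocks the totals are hello 3, hi 3, world 2, and hadoop 1; the order in which they are printed depends on which Reduce task each word is assigned to.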

Functions of WordCount

Figure: Functions of WordCount

Map Process of WordCount

Figure: Map Process of WordCount

Reduce Process of WordCount

Figure: Reduce Process of WordCount

That's all, thanks!

Next: HCIA-Big Data | Introduction to YARN


user_4000619
Moderator Created Apr 7, 2022 04:10:26

thanks
 
Saqibaz
Created Apr 7, 2022 04:12:45

thanks for sharing
 
wissal
MVE Created May 5, 2022 16:29:03

Learning together, never stop!

olive.zhao
Created May 6, 2022 01:01:36

Learning together!

NTan33
Created May 6, 2022 01:33:09

A great introduction to this topic.

MahMush
Moderator Author Created May 6, 2022 10:40:23

Hadoop's MapReduce architecture is used to create applications that can process enormous amounts of data on large clusters.

bobi
Created May 6, 2022 14:32:30

Thanks for sharing

VinceD
Moderator Created May 7, 2022 15:33:08

interesting content.
 
KasimAbubakr
Created May 10, 2022 04:18:00

Thank you

Sun_MRU
Moderator Created May 10, 2022 07:13:38

thanks for sharing