
How to control the containers (map and reducer) in the YARN level?

Created: Jun 24, 2019 03:22:48 | Latest reply: Jun 24, 2019 05:43:11

Hi all,


Could anyone please assist me with the issues below?


1. I can see in my cluster that most jobs run with more than 3000 mappers and 999 reducers (the default).


2. What values do you suggest developers set, and on what basis?


SET mapreduce.job.reduces=XX

    

3. The maximum number of reducers is always being used. Can we lower the default value instead of launching 999 reducers?


hive.exec.reducers.max

Default Value: 999

Please assist me with the above points.

Thanks,

Ramu A. 

Huawei India


Best answer

songminwang
Created Jun 24, 2019 05:43:11

Hello, sir!

What determines the number of mappers?

1. Number of input files
 
2. Input file size 

3. Configuration parameters

The number of input splits determines the number of map tasks to be run. The application runs one map task per split; generally, each input file produces at least one split.

If an input file is larger than the HDFS block size (128 MB by default), it is divided into multiple splits, so a single input file can produce two or more map tasks.

The formula for calculating the split size (which in turn determines the number of maps) is:

splitsize = max(minimumsize, min(maximumsize, blocksize))

If minimumsize and maximumsize are not set, splitsize defaults to blocksize.
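The split-size formula can be sketched as follows (a minimal illustration; the 128 MB default block size and the example file sizes are assumptions, and the real split logic lives in Hadoop's FileInputFormat):

```python
import math

def split_size(block_size, min_size=1, max_size=float("inf")):
    """splitsize = max(minimumsize, min(maximumsize, blocksize))."""
    return max(min_size, min(max_size, block_size))

def num_map_tasks(file_size, block_size=128 * 1024 * 1024,
                  min_size=1, max_size=float("inf")):
    """One map task per split; a file larger than splitsize yields several."""
    size = split_size(block_size, min_size, max_size)
    return max(1, math.ceil(file_size / size))

# With defaults, splitsize equals the block size (128 MB),
# so a 300 MB file produces 3 map tasks:
print(num_map_tasks(300 * 1024 * 1024))
# Raising minimumsize to 256 MB enlarges the splits and reduces the map count:
print(num_map_tasks(300 * 1024 * 1024, min_size=256 * 1024 * 1024))
```

This is why raising the minimum split size is a common way to reduce an excessive mapper count when the input consists of many block-sized files.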

Check which values are currently set in your cluster and tune them according to the actual workload.

The hive.exec.reducers.bytes.per.reducer parameter specifies how much input data each reducer processes; Hive estimates the reducer count as total input size divided by this value. The default value is 1 GB.

The hive.exec.reducers.max parameter caps the number of reducers. If total input size / bytes per reducer exceeds this cap, only the capped number of reducers is started. It does not affect an explicit mapred.reduce.tasks setting. The default value is 999.
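To directly lower the default cap, as asked in point 3, these parameters can be overridden per session (the values below are illustrative, not recommendations; they can also be set cluster-wide in hive-site.xml):

```sql
-- Lower the reducer cap for this session (illustrative value):
SET hive.exec.reducers.max=250;
-- Raising bytes-per-reducer also lowers the estimated reducer count:
SET hive.exec.reducers.bytes.per.reducer=2147483648;  -- 2 GB
```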

If mapred.reduce.tasks is specified, Hive does not use its estimation to calculate the number of reducers automatically; instead, it starts exactly that many reducers. The default value is -1, which means "estimate automatically".
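The estimation logic described above can be sketched as follows (a simplified model, assuming the stated defaults of 1 GB per reducer, a cap of 999, and -1 meaning "estimate automatically"):

```python
import math

GB = 1024 ** 3

def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=1 * GB,   # hive.exec.reducers.bytes.per.reducer
                      reducers_max=999,           # hive.exec.reducers.max
                      mapred_reduce_tasks=-1):    # mapred.reduce.tasks
    # An explicit mapred.reduce.tasks overrides the estimate entirely.
    if mapred_reduce_tasks > 0:
        return mapred_reduce_tasks
    # Otherwise: one reducer per bytes_per_reducer of input, capped at the max.
    estimated = math.ceil(total_input_bytes / bytes_per_reducer)
    return min(reducers_max, max(1, estimated))

print(estimate_reducers(10 * GB))                          # small job: 10 reducers
print(estimate_reducers(5000 * GB))                        # huge job: capped at 999
print(estimate_reducers(10 * GB, mapred_reduce_tasks=4))   # explicit override: 4
```

This shows why so many jobs end up at exactly 999 reducers: any job with more than 999 GB of input hits the cap under the defaults.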

The reducer count has a great impact on execution efficiency:

1. Too few reducers: if the data volume is large, each reducer processes too much data; tasks may run abnormally long, fail to finish, or hit OOM errors.

2. Too many reducers: many small output files are generated, increasing overhead and the NameNode's memory usage.

If mapred.reduce.tasks is not specified, Hive automatically calculates the number of reducers.
