1. Number of input files
2. Input file size
3. Configuration parameters
The number of files in the input directory largely determines the number of map tasks to be run.
The application runs one map task for each split. Generally, each input file corresponds to one map split.
If an input file is larger than the HDFS block size (128 MB by default), the file is divided into multiple splits, one per block, so more than one map task is generated for that file.
The formula for calculating the number of maps is as follows:
splitsize = max(minimumsize, min(maximumsize, blocksize))
If minimumsize and maximumsize are not set, the default value of splitsize is blocksize.
Check the values you have configured and adjust them according to the actual workload.
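The split-size formula above can be sketched in Python. The function and parameter names here are illustrative helpers, not part of any Hive or Hadoop API; the defaults mirror the values stated in the text (minimumsize unset, maximumsize unset, 128 MB block):

```python
import math

def split_size(block_size, min_size=1, max_size=None):
    """splitsize = max(minimumsize, min(maximumsize, blocksize))."""
    if max_size is None:
        max_size = block_size  # maximumsize unset: splitsize defaults to blocksize
    return max(min_size, min(max_size, block_size))

def num_maps(file_size, block_size=128 * 1024 * 1024,
             min_size=1, max_size=None):
    """One map task per split of the input file."""
    return max(1, math.ceil(file_size / split_size(block_size, min_size, max_size)))

# With the defaults, splitsize equals the block size, so a 300 MB file
# is divided into 3 splits and therefore runs 3 map tasks.
print(num_maps(300 * 1024 * 1024))  # 3
```

Raising minimumsize above the block size is a common way to reduce the number of map tasks when a directory contains many small files.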
The hive.exec.reducers.bytes.per.reducer parameter specifies the amount of input data processed by each reducer in a job. The default value is 1 GB.
The hive.exec.reducers.max parameter controls the maximum number of reducers. If total input size / hive.exec.reducers.bytes.per.reducer exceeds this maximum, only the number of reducers specified by this parameter is started. This parameter does not affect the mapred.reduce.tasks setting. The default value is 999.
If mapred.reduce.tasks is specified, Hive does not use its estimation function to automatically calculate the number of reducers.
Instead, Hive starts exactly the number of reducers given by this parameter. The default value is -1, meaning the parameter is not set.
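The estimation rule described above can be sketched as follows. This is an illustrative re-implementation of the rule as stated in the text, not Hive's actual source code; the parameter names in the comments are the real Hive settings:

```python
import math

BYTES_PER_REDUCER = 1024 ** 3  # hive.exec.reducers.bytes.per.reducer: 1 GB default
MAX_REDUCERS = 999             # hive.exec.reducers.max: 999 default

def estimate_reducers(total_input_bytes, forced=-1,
                      bytes_per_reducer=BYTES_PER_REDUCER,
                      max_reducers=MAX_REDUCERS):
    """Sketch of Hive's reducer-count rule from the text."""
    if forced > 0:  # mapred.reduce.tasks set: skip estimation entirely
        return forced
    estimated = math.ceil(total_input_bytes / bytes_per_reducer)
    return max(1, min(estimated, max_reducers))  # capped by hive.exec.reducers.max

print(estimate_reducers(10 * 1024 ** 3))            # 10 GB of input -> 10 reducers
print(estimate_reducers(10 * 1024 ** 3, forced=4))  # explicit setting wins -> 4
```

Note how the cap applies only to the estimated value: an explicit mapred.reduce.tasks is honored as-is.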
The number of reducers has a great impact on execution efficiency.
1. If the number of reducers is too small: with a large data volume, each reducer processes too much data, so tasks run for a long time or cannot finish, and reducers may run out of memory (OOM).
2. If the number of reducers is too large: many small output files are generated, which increases scheduling overhead and the memory usage of the NameNode.
If mapred.reduce.tasks is not specified, Hive automatically calculates the number of reducers.
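Assuming a Hive CLI or Beeline session, the parameters discussed above could be adjusted per session like this (the values shown are examples, not recommendations):

```
-- Let Hive estimate: cap data per reducer at 500 MB and allow at most 100 reducers.
SET hive.exec.reducers.bytes.per.reducer=524288000;
SET hive.exec.reducers.max=100;

-- Or force a fixed reducer count, bypassing the estimation entirely.
SET mapred.reduce.tasks=16;
```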