1.1.1 Case 1: MapReduce Statistics Sample Program
1.1.1.1 Scenario
Applicable Versions
FusionInsight HD V100R002C70, FusionInsight HD V100R002C80
Scenario
Develop a MapReduce application to perform the following operations on logs about dwell durations of netizens for shopping online:
- Collect statistics on female netizens who dwell on online shopping for over 2 hours on the weekend.
- The first column in the log file records names, the second column records gender, and the third column records the dwell duration in minutes. The three columns are separated by commas (,).
log1.txt: logs collected on Saturday
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
log2.txt: logs collected on Sunday
LiuYang,female,20
YuanJing,male,10
CaiXuyu,female,50
FangBo,female,50
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
CaiXuyu,female,50
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
FangBo,female,50
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
Data Planning
Save the original log files in the HDFS.
1. Create two text files input_data1.txt and input_data2.txt on the local computer, and copy log1.txt to input_data1.txt and log2.txt to input_data2.txt.
2. Create the /tmp/input folder in the HDFS, and run the following commands to upload input_data1.txt and input_data2.txt to the /tmp/input directory:
a. On the HDFS client of the Linux OS, run the hdfs dfs -mkdir /tmp/input command.
b. On the HDFS client of the Linux OS, run the hdfs dfs -put local_filepath /tmp/input command.
1.1.1.2 Development Guidelines
Collect statistics on female netizens who dwell on online shopping for over 2 hours on the weekend.
To achieve the objective, the process is as follows:
- Read the source file data.
- Filter the records for the time that female netizens spend online.
- Sum the total time that each female netizen spends online.
- Filter out the female netizens who spend more than 2 hours online.
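The four steps above can be sketched in plain Java (a minimal local illustration with no Hadoop dependencies; the class and method names are invented for this sketch, and the 120-minute threshold matches the sample code's log.time.threshold default):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FemaleInfoSketch {

    // Steps 1-4: read records, keep female entries, sum minutes per name,
    // and drop totals below the threshold (2 hours = 120 minutes).
    public static Map<String, Integer> collect(List<String> lines, int thresholdMinutes) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String line : lines) {
            String[] cols = line.split(",");   // name, gender, minutes
            if ("female".equals(cols[1])) {
                totals.merge(cols[0], Integer.parseInt(cols[2]), Integer::sum);
            }
        }
        totals.values().removeIf(total -> total < thresholdMinutes);
        return totals;
    }

    public static void main(String[] args) {
        // A subset of the weekend records from the scenario.
        List<String> lines = Arrays.asList(
                "LiuYang,female,20", "YuanJing,male,10", "CaiXuyu,female,50",
                "FangBo,female,50", "CaiXuyu,female,50", "FangBo,female,60",
                "CaiXuyu,female,50");
        System.out.println(collect(lines, 120)); // prints {CaiXuyu=150}
    }
}
```

The MapReduce version that follows performs the same filter, sum, and threshold steps, but distributed across map and reduce tasks.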
1.1.1.3 Sample Code Description
Function
Collect statistics on female netizens who dwell on online shopping for over 2 hours on the weekend.
The operation is performed in three steps:
- Filter the dwell durations of female netizens from the original files using the CollectionMapper class, which inherits from the Mapper abstract class.
- Sum the dwell duration of each female netizen, and output those who dwell online for over 2 hours, using the CollectionReducer class, which inherits from the Reducer abstract class.
- Use the main method to create a MapReduce job and submit it to the Hadoop cluster.
Sample Code
The following code snippets are used as an example. For complete code, see the com.huawei.bigdata.mapreduce.examples.FemaleInfoCollector class.
Example 1: The CollectionMapper class defines the map() and setup() methods of the Mapper abstract class.
public static class CollectionMapper extends
        Mapper<Object, Text, Text, IntWritable> {
    // Separator.
    String delim;
    // Gender filter.
    String sexFilter;
    // Name.
    private Text nameInfo = new Text();
    // The key and value of the output must be serializable.
    private IntWritable timeInfo = new IntWritable(1);

    /**
     * Distributed computing.
     *
     * @param key Object: offset of the original file.
     * @param value Text: a line of character data in the original file.
     * @param context Context: output parameter.
     * @throws IOException, InterruptedException
     */
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (line.contains(sexFilter)) {
            // Obtain the name from the line that is read.
            String name = line.substring(0, line.indexOf(delim));
            nameInfo.set(name);
            // Obtain the online duration.
            String time = line.substring(line.lastIndexOf(delim) + 1,
                    line.length());
            timeInfo.set(Integer.parseInt(time));
            // The map function outputs the key-value pair.
            context.write(nameInfo, timeInfo);
        }
    }

    /**
     * The setup() method is invoked once before the map function to perform initial operations.
     *
     * @param context Context
     */
    public void setup(Context context) throws IOException,
            InterruptedException {
        // The configuration information can be obtained through the context.
        delim = context.getConfiguration().get("log.delimiter", ",");
        sexFilter = delim
                + context.getConfiguration()
                        .get("log.sex.filter", "female") + delim;
    }
}
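The substring-based parsing inside map() can be exercised on its own with plain Java (the class and helper below are illustrative only and are not part of the sample project):

```java
public class MapperParseCheck {

    // Mirrors the substring logic of CollectionMapper.map() for a single line.
    static String[] parse(String line, String delim) {
        String name = line.substring(0, line.indexOf(delim));
        String time = line.substring(line.lastIndexOf(delim) + 1);
        return new String[] { name, time };
    }

    public static void main(String[] args) {
        // sexFilter is the delimiter-wrapped gender, as built in setup().
        String sexFilter = "," + "female" + ",";
        String line = "CaiXuyu,female,50";
        if (line.contains(sexFilter)) {
            String[] parsed = parse(line, ",");
            System.out.println(parsed[0] + " -> " + parsed[1]); // prints CaiXuyu -> 50
        }
        System.out.println("YuanJing,male,10".contains(sexFilter)); // prints false
    }
}
```

Wrapping the gender in delimiters ("female" becomes ",female,") prevents false matches against names that merely contain the word.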
Example 2: The CollectionReducer class defines the reduce() method of the Reducer abstract class.
public static class CollectionReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {
    // Statistics result.
    private IntWritable result = new IntWritable();
    // Total time threshold.
    private int timeThreshold;

    /**
     * @param key Text: key after the Mapper function.
     * @param values Iterable: all statistical results with the same key.
     * @param context Context
     * @throws IOException, InterruptedException
     */
    public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        // If the total time is below the threshold, no result is output.
        if (sum < timeThreshold) {
            return;
        }
        result.set(sum);
        // The output of reduce: key is the netizen's name; value is the netizen's total online time.
        context.write(key, result);
    }

    /**
     * The setup() method is invoked only once, before the map() method of a map task
     * or the reduce() method of a reduce task.
     *
     * @param context Context
     * @throws IOException, InterruptedException
     */
    public void setup(Context context) throws IOException,
            InterruptedException {
        // Configuration information can be obtained through the context.
        timeThreshold = context.getConfiguration().getInt(
                "log.time.threshold", 120);
    }
}
Example 3: Use the main() method to create a job, set parameters, and submit the job to the Hadoop cluster.
public static void main(String[] args) throws Exception {
    // Initialize the configuration.
    Configuration conf = new Configuration();
    // Secure login.
    LoginUtil.login(PRINCIPAL, KEYTAB, KRB, conf);
    // Obtain the input parameters.
    String[] otherArgs = new GenericOptionsParser(conf, args)
            .getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: collect female info <in> <out>");
        System.exit(2);
    }
    // Initialize the job object.
    @SuppressWarnings("deprecation")
    Job job = new Job(conf, "Collect Female Info");
    job.setJarByClass(FemaleInfoCollector.class);
    // Set the classes that execute the map and reduce tasks.
    // They can also be specified in the configuration file.
    job.setMapperClass(CollectionMapper.class);
    job.setReducerClass(CollectionReducer.class);
    // Set the combiner class. It is disabled by default. Generally, a class similar to
    // the reducer is used. Use it with caution: CollectionReducer discards sums below the
    // threshold, so using it as the combiner can drop partial sums before the final reduce.
    // The class can also be specified in the configuration file.
    job.setCombinerClass(CollectionReducer.class);
    // Set the output types of the job.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    // Submit the job to the remote environment for execution.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Example 4: The CollectionCombiner class combines the data from the Map function to reduce the amount of data transmitted from Map to Reduce.
/**
* Combiner class
*/
public static class CollectionCombiner extends
Reducer<Text, IntWritable, Text, IntWritable> {
// Intermediate statistical results
private IntWritable intermediateResult = new IntWritable();
/**
* @param key Text : key after Mapper
* @param values Iterable : all results with the same key in this map task
* @param context Context
* @throws IOException , InterruptedException
*/
public void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
intermediateResult.set(sum);
// In the output information, key indicates netizen information,
// and value indicates the total online time of the netizen in this map task.
context.write(key, intermediateResult);
}
}
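Why Example 4 exists can be seen with a small plain-Java model (illustrative only; real Hadoop may run a combiner zero or more times per map task). If CollectionReducer, which discards sums below the 120-minute threshold, is reused as the combiner, per-map partial sums below 120 are dropped even when the overall total exceeds the threshold; CollectionCombiner only sums, so nothing is lost. Using CaiXuyu's partial sums from the scenario (100 minutes on Saturday, 200 on Sunday):

```java
import java.util.Arrays;
import java.util.List;

public class CombinerEffect {

    // A plain combiner (like CollectionCombiner): just sum the per-map partial results.
    static int totalWithPlainCombiner(List<Integer> partialSums) {
        return partialSums.stream().mapToInt(Integer::intValue).sum();
    }

    // A thresholding combiner (like reusing CollectionReducer): partial sums
    // below the threshold are dropped before they ever reach the reducer.
    static int totalWithThresholdingCombiner(List<Integer> partialSums, int threshold) {
        return partialSums.stream().mapToInt(Integer::intValue)
                .filter(partial -> partial >= threshold).sum();
    }

    public static void main(String[] args) {
        List<Integer> caiXuyu = Arrays.asList(100, 200); // Saturday, Sunday partial sums
        System.out.println(totalWithPlainCombiner(caiXuyu));             // prints 300
        System.out.println(totalWithThresholdingCombiner(caiXuyu, 120)); // prints 200
    }
}
```

In general, a combiner must be a pure aggregation whose repeated application does not change the final reduce result; filtering belongs in the reducer only.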
1.1.1.4 Obtaining Sample Code
Using the FusionInsight Client
Obtain the sample project mapreduce-example-security in the HDFS directory in the FusionInsight_Services_ClientConfig file extracted from the client.
Using the Maven Project
Log in to Huawei DevCloud (https://codehub-cn-south-1.devcloud.huaweicloud.com/codehub/7076065/home) to download the code under components/mapreduce to the local PC.
1.1.1.5 Application Commissioning
Compiling and Running Programs
1. In the Eclipse development environment, select the LocalRunner.java project and click the run button to run the corresponding application project.
Alternatively, right-click the project and choose Run as > Java Application from the shortcut menu to run the application project.
Do not restart the HDFS service during the running of MapReduce jobs. Otherwise, the jobs may fail.
Viewing Commissioning Results
After the MapReduce application is run, you can view the running status of the MapReduce application in the following ways:
- Check the running status of the application in Eclipse.
- Use MapReduce logs to obtain the running status of the application.
- Log in to the MapReduce WebUI to check the running status of the application.
- Log in to the Yarn WebUI to check the running status of the application.
Contact the administrator to obtain a service account that has the right to access the web UI and its password.
1.1.1.5.1 Commissioning Applications on Windows
Compiling and Running Programs
Scenario
You can run applications in the Windows environment after the application code development is complete.
If the IBM JDK is used, applications cannot be run directly on Windows.
Procedure
1. In the Eclipse development environment, select the LocalRunner.java project and click the run button to run the corresponding application project.
Alternatively, right-click the project and choose Run as > Java Application from the shortcut menu to run the application project.
Do not restart the HDFS service during the running of MapReduce jobs. Otherwise, the jobs may fail.
2. Viewing the Commissioning Result
Scenario
After the MapReduce application is run, you can view the running status of the MapReduce application in the following ways:
- Check the running status of the application in Eclipse.
- Use MapReduce logs to obtain the running status of the application.
- Log in to the MapReduce WebUI to check the running status of the application.
- Log in to the Yarn WebUI to check the running status of the application.
Contact the administrator to obtain a service account that has the right to access the web UI and its password.
Procedure
- Viewing the running result to learn the application running status
View the output on the console to learn the application running status, as follows:
1848 [main] INFO org.apache.hadoop.security.UserGroupInformation - Login successful for user admin@HADOOP.COM using keytab file
Login success!!!!!!!!!!!!!!
7093 [main] INFO org.apache.hadoop.hdfs.PeerCache - SocketCache disabled.
9614 [main] INFO org.apache.hadoop.hdfs.DFSClient - Created HDFS_DELEGATION_TOKEN token 45 for admin on ha-hdfs:hacluster
9709 [main] INFO org.apache.hadoop.mapreduce.security.TokenCache - Got dt for hdfs://hacluster; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:hacluster, Ident: (HDFS_DELEGATION_TOKEN token 45 for admin)
10914 [main] INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider - Failing over to 53
12136 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 2
12731 [main] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:2
13405 [main] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1456738266914_0006
13405 [main] INFO org.apache.hadoop.mapreduce.JobSubmitter - Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:hacluster, Ident: (HDFS_DELEGATION_TOKEN token 45 for admin)
16019 [main] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Application submission is not finished, submitted application application_1456738266914_0006 is still in NEW
16975 [main] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1456738266914_0006
17069 [main] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: https://linux2:26001/proxy/application_1456738266914_0006/
17086 [main] INFO org.apache.hadoop.mapreduce.Job - Running job: job_1456738266914_0006
29811 [main] INFO org.apache.hadoop.mapreduce.Job - Job job_1456738266914_0006 running in uber mode : false
29811 [main] INFO org.apache.hadoop.mapreduce.Job - map 0% reduce 0%
41492 [main] INFO org.apache.hadoop.mapreduce.Job - map 100% reduce 0%
53161 [main] INFO org.apache.hadoop.mapreduce.Job - map 100% reduce 100%
53265 [main] INFO org.apache.hadoop.mapreduce.Job - Job job_1456738266914_0006 completed successfully
53393 [main] INFO org.apache.hadoop.mapreduce.Job - Counters: 50
The following exception may occur when the sample code is running in the Windows OS, but it will not affect services.
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
- Viewing the task execution status by using the MapReduce WebUI
Log in to FusionInsight Manager, choose Service Management > MapReduce > JobHistoryServer, and check the task execution status on the web page.
Figure 1-1 JobHistory WebUI
- Viewing the task execution status by using the Yarn WebUI
Log in to FusionInsight Manager, choose Service Management > Yarn > ResourceManager (Master), and check the task execution status on the web page.
Figure 1-2 ResourceManager WebUI
- Viewing MapReduce logs to obtain the application running status
View MapReduce logs to learn application running status, and adjust applications based on log information.
1.1.1.5.2 Running Applications on Linux
Compiling and Running Programs
Scenario
After the program code is developed, you can run the application in the Linux environment.
Prerequisites
The Yarn client has been installed.
Procedure
Step 1 Export the executable MapReduce application package.
- For the MapReduce statistics sample program, select the FemaleInfoCollector.java, LoginUtil.java, krb5.conf, and user.keytab files, and choose Export from the shortcut menu.
- For the MapReduce multi-component access sample program, select the LoginUtil.java, MultiComponentExample.java, and JarFinderUtil.java files, and choose Export from the shortcut menu.
Step 2 Select JAR file, as shown in Figure 1-3. Click Next.
Figure 1-3 Selecting JAR file
Step 3 Select a path for exporting the package, as shown in Figure 1-4. Click Finish.
Figure 1-4 Selecting a path for exporting the JAR file
Step 4 Upload the generated application package mapreduce-example.jar to a Linux client, for example, /srv/client/conf, in the same directory as the configuration file.
Step 5 Execute the sample project on Linux.
- For the MapReduce statistics sample project, run the following command:
yarn jar mapreduce-example.jar com.huawei.bigdata.mapreduce.examples.FemaleInfoCollector <inputPath> <outputPath>
This command is used to set parameters and submit jobs. In the command, <inputPath> indicates the input path of the HDFS file system, and <outputPath> indicates the output path of the HDFS file system.
- Before running the yarn jar mapreduce-example.jar com.huawei.bigdata.mapreduce.examples.FemaleInfoCollector <inputPath> <outputPath> command, upload the log1.txt and log2.txt files to the <inputPath> directory of the HDFS. For details, see the description of typical scenarios.
- Before running the yarn jar mapreduce-example.jar com.huawei.bigdata.mapreduce.examples.FemaleInfoCollector <inputPath> <outputPath> command, ensure that the <outputPath> directory does not exist. Otherwise, an error is reported.
- Do not restart the HDFS service during the running of MapReduce jobs. Otherwise, the jobs may fail.
- For the MapReduce multi-component access sample application, perform the following steps.
a. Obtain the user.keytab, krb5.conf, hbase-site.xml, hiveclient.properties, and hive-site.xml files, and create a folder in the Linux environment to save the configuration files, for example, /srv/client/conf.
Contact the administrator to obtain the user.keytab and krb5.conf files corresponding to the account and permission. Obtain hbase-site.xml from the HBase client, and hiveclient.properties and hive-site.xml from the Hive client.
b. Create the jaas_mr.conf file in the new folder. The file content is as follows:
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="user.keytab"
principal="test@HADOOP.COM"
useTicketCache=false
storeKey=true
debug=true;
};
In the preceding file content, test@HADOOP.COM is an example. Change it based on the site requirements.
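One common way a JAAS configuration file such as jaas_mr.conf is made visible to a JVM is through the java.security.auth.login.config system property. This is a general Java illustration only (the path is the example folder from step a); the sample project may wire the file in differently:

```java
public class JaasConfigSetup {
    public static void main(String[] args) {
        // Example path from step a; adjust to the folder actually created.
        System.setProperty("java.security.auth.login.config",
                "/srv/client/conf/jaas_mr.conf");
        System.out.println(System.getProperty("java.security.auth.login.config"));
        // prints /srv/client/conf/jaas_mr.conf
    }
}
```

The property must be set before any Kerberos login is attempted, because the JAAS machinery reads it when the first LoginContext is created.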
c. In the Linux environment, add the classpath required for running the sample project, for example:
export YARN_USER_CLASSPATH=/srv/client/conf/:/srv/client/HBase/hbase/lib/*:/srv/client/Hive/Beeline/lib/*
d. Submit the MapReduce job by running the following command to run the sample project.
yarn jar mapreduce-example.jar com.huawei.bigdata.mapreduce.examples.MultiComponentExample
----End
2. Viewing the Commissioning Result
Scenario
After the MapReduce application is run, you can view the running status of the MapReduce application in the following ways:
- Check the running status of the program based on the running result.
- Log in to the MapReduce WebUI to check the running status of the application.
- Log in to the Yarn WebUI to check the running status of the application.
- Use MapReduce logs to obtain the running status of applications.
Contact the administrator to obtain a service account that has the right to access the web UI and its password.
Procedure
- Viewing the task execution status by using the MapReduce WebUI
Log in to FusionInsight Manager, choose Service Management > MapReduce > JobHistoryServer, and check the task execution status on the web page.
Figure 1-5 JobHistory WebUI
- Viewing the task execution status by using the Yarn WebUI
Log in to FusionInsight Manager, choose Service Management > Yarn > ResourceManager (Master), and check the task execution status on the web page.
Figure 1-6 ResourceManager WebUI
- Viewing the running result of the MapReduce application
- After running the yarn jar mapreduce-example.jar command in the Linux OS, you can view the running status of the application. For example:
linux1:/opt # yarn jar mapreduce-example.jar /user/mapred/example/input/ /output6
16/02/24 15:45:40 INFO security.UserGroupInformation: Login successful for user admin@HADOOP.COM using keytab file user.keytab
Login success!!!!!!!!!!!!!!
16/02/24 15:45:40 INFO hdfs.PeerCache: SocketCache disabled.
16/02/24 15:45:41 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 28 for admin on ha-hdfs:hacluster
16/02/24 15:45:41 INFO security.TokenCache: Got dt for hdfs://hacluster; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:hacluster, Ident: (HDFS_DELEGATION_TOKEN token 28 for admin)
16/02/24 15:45:41 INFO input.FileInputFormat: Total input files to process : 2
16/02/24 15:45:41 INFO mapreduce.JobSubmitter: number of splits:2
16/02/24 15:45:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1455853029114_0027
16/02/24 15:45:42 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:hacluster, Ident: (HDFS_DELEGATION_TOKEN token 28 for admin)
16/02/24 15:45:42 INFO impl.YarnClientImpl: Submitted application application_1455853029114_0027
16/02/24 15:45:42 INFO mapreduce.Job: The url to track the job: https://linux1:26001/proxy/application_1455853029114_0027/
16/02/24 15:45:42 INFO mapreduce.Job: Running job: job_1455853029114_0027
16/02/24 15:45:50 INFO mapreduce.Job: Job job_1455853029114_0027 running in uber mode : false
16/02/24 15:45:50 INFO mapreduce.Job: map 0% reduce 0%
16/02/24 15:45:56 INFO mapreduce.Job: map 100% reduce 0%
16/02/24 15:46:03 INFO mapreduce.Job: map 100% reduce 100%
16/02/24 15:46:03 INFO mapreduce.Job: Job job_1455853029114_0027 completed successfully
16/02/24 15:46:03 INFO mapreduce.Job: Counters: 49
- Run the yarn application -status <ApplicationID> command in the Linux OS. The execution result shows the running status of the application. For example:
linux1:/opt # yarn application -status application_1455853029114_0027
Application Report :
    Application-Id : application_1455853029114_0027
    Application-Name : Collect Female Info
    Application-Type : MAPREDUCE
    User : admin
    Queue : default
    Start-Time : 1456299942302
    Finish-Time : 1456299962343
    Progress : 100%
    State : FINISHED
    Final-State : SUCCEEDED
    Tracking-URL : https://linux1:26014/jobhistory/job/job_1455853029114_0027
    RPC Port : 27100
    AM Host : SZV1000044726
    Aggregate Resource Allocation : 114106 MB-seconds, 42 vcore-seconds
    Log Aggregation Status : SUCCEEDED
    Diagnostics : Application finished execution.
    Application Node Label Expression : <Not set>
    AM container Node Label Expression : <DEFAULT_PARTITION>
- Viewing MapReduce logs to obtain the application running status
View MapReduce logs to learn the application running status, and adjust applications based on log information.