
MapReduce: Case 1: MapReduce Statistics Sample Program

Latest reply: Nov 26, 2018 09:57:46

1.1.1 Case 1: MapReduce Statistics Sample Program

1.1.1.1 Scenario

Applicable Versions

FusionInsight HD V100R002C70, FusionInsight HD V100R002C80

Scenario

Develop a MapReduce application to perform the following operations on logs about dwell durations of netizens for shopping online:

•   Collect statistics on female netizens who dwell on online shopping for over 2 hours on the weekend.

•   The first column in the log file records names, the second column records gender, and the third column records the dwell duration in minutes. The three columns are separated by commas (,).

log1.txt: logs collected on Saturday

LiuYang,female,20

YuanJing,male,10

GuoYijun,male,5

CaiXuyu,female,50

Liyuan,male,20

FangBo,female,50

LiuYang,female,20

YuanJing,male,10

GuoYijun,male,50

CaiXuyu,female,50

FangBo,female,60

log2.txt: logs collected on Sunday

LiuYang,female,20

YuanJing,male,10

CaiXuyu,female,50

FangBo,female,50

GuoYijun,male,5

CaiXuyu,female,50

Liyuan,male,20

CaiXuyu,female,50

FangBo,female,50

LiuYang,female,20

YuanJing,male,10

FangBo,female,50

GuoYijun,male,50

CaiXuyu,female,50

FangBo,female,60

Data Planning

Save the original log files in the HDFS.

1.         Create two text files input_data1.txt and input_data2.txt on the local computer, and copy log1.txt to input_data1.txt and log2.txt to input_data2.txt.

2.         Create the /tmp/input folder in the HDFS, and run the following commands to upload input_data1.txt and input_data2.txt to the /tmp/input directory:

a.         On the HDFS client of the Linux OS, run the hdfs dfs -mkdir /tmp/input command.

b.         On the HDFS client of the Linux OS, run the hdfs dfs -put local_filepath /tmp/input command.

1.1.1.2 Development Guidelines

Collect statistics on female netizens who dwell on online shopping for over 2 hours on the weekend.

To achieve the objective, the process is as follows:

•   Read the source file data.

•   Filter the records of the time that female netizens spend online.

•   Summarize the total time that each female netizen spends online.

•   Filter the records of female netizens who spend more than 2 hours online.
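The four steps above can be sketched in plain Java, without Hadoop, to check the expected result on the sample logs. The class name FemaleInfoSketch and its helper method are illustrative assumptions, not part of the sample project; the 120-minute threshold follows the scenario description:

```java
import java.util.*;

public class FemaleInfoSketch {
    // Returns netizens whose total dwell time reaches the threshold (minutes).
    public static Map<String, Integer> collect(List<String> lines, int threshold) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");            // name,gender,minutes
            if (!"female".equals(fields[1])) continue;    // keep female records only
            totals.merge(fields[0], Integer.parseInt(fields[2]), Integer::sum);
        }
        totals.values().removeIf(sum -> sum < threshold); // drop totals below threshold
        return totals;
    }

    public static void main(String[] args) {
        List<String> logs = Arrays.asList(
            // log1.txt (Saturday)
            "LiuYang,female,20", "YuanJing,male,10", "GuoYijun,male,5",
            "CaiXuyu,female,50", "Liyuan,male,20", "FangBo,female,50",
            "LiuYang,female,20", "YuanJing,male,10", "GuoYijun,male,50",
            "CaiXuyu,female,50", "FangBo,female,60",
            // log2.txt (Sunday)
            "LiuYang,female,20", "YuanJing,male,10", "CaiXuyu,female,50",
            "FangBo,female,50", "GuoYijun,male,5", "CaiXuyu,female,50",
            "Liyuan,male,20", "CaiXuyu,female,50", "FangBo,female,50",
            "LiuYang,female,20", "YuanJing,male,10", "FangBo,female,50",
            "GuoYijun,male,50", "CaiXuyu,female,50", "FangBo,female,60");
        System.out.println(collect(logs, 120));  // prints {CaiXuyu=300, FangBo=320}
    }
}
```

LiuYang totals only 80 minutes and is filtered out, which matches the expected behavior of the MapReduce job.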

1.1.1.3 Sample Code Description

Function

Collect statistics on female netizens who dwell on online shopping for over 2 hours on the weekend.

The operation is performed in three steps:

•   Filter the dwell durations of female netizens in the original files using the CollectionMapper class, which inherits from the Mapper abstract class.

•   Summarize the dwell duration of each female netizen, and output information about the female netizens who dwell online for over 2 hours, using the CollectionReducer class, which inherits from the Reducer abstract class.

•   Use the main method to create a MapReduce job and submit it to the Hadoop cluster.

Sample Code

The following code snippets are used as an example. For complete code, see the com.huawei.bigdata.mapreduce.examples.FemaleInfoCollector class.

Example 1: The CollectionMapper class defines the map() and setup() methods of the Mapper abstract class.

    public static class CollectionMapper extends
            Mapper<Object, Text, Text, IntWritable> {
        // Delimiter.
        String delim;
        // Gender filter.
        String sexFilter;
        // Name.
        private Text nameInfo = new Text();
        // The key and value of the output must be serializable.
        private IntWritable timeInfo = new IntWritable(1);

        /**
         * Distributed computing.
         *
         * @param key Object: offset of the original file.
         * @param value Text: a line of character data in the original file.
         * @param context Context: output parameter.
         * @throws IOException, InterruptedException
         */
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // A line of character string data that is read.
            String line = value.toString();
            if (line.contains(sexFilter)) {
                // Obtain the name.
                String name = line.substring(0, line.indexOf(delim));
                nameInfo.set(name);
                // Obtain the online duration.
                String time = line.substring(line.lastIndexOf(delim) + 1,
                        line.length());
                timeInfo.set(Integer.parseInt(time));
                // The map function outputs the key-value pair.
                context.write(nameInfo, timeInfo);
            }
        }

        /**
         * The setup() method is invoked once before the map() method to
         * perform initialization operations.
         *
         * @param context Context
         */
        public void setup(Context context) throws IOException,
                InterruptedException {
            // The configuration information can be obtained using the context.
            delim = context.getConfiguration().get("log.delimiter", ",");
            sexFilter = delim
                    + context.getConfiguration()
                            .get("log.sex.filter", "female") + delim;
        }
    }
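To see what map() does to a single record, the delimiter logic can be tried in isolation. MapParseSketch and its helper methods are illustrative names (not part of the sample class); the substring calls are the same as in CollectionMapper.map(), and the sexFilter string is built as in setup():

```java
public class MapParseSketch {
    // Same name-extraction logic as CollectionMapper.map().
    static String name(String line, String delim) {
        return line.substring(0, line.indexOf(delim));
    }

    // Same duration-extraction logic as CollectionMapper.map().
    static int minutes(String line, String delim) {
        return Integer.parseInt(
                line.substring(line.lastIndexOf(delim) + 1, line.length()));
    }

    public static void main(String[] args) {
        String delim = ",";
        String sexFilter = delim + "female" + delim;  // ",female," as in setup()
        String line = "LiuYang,female,20";
        if (line.contains(sexFilter)) {
            System.out.println(name(line, delim) + " -> "
                    + minutes(line, delim) + " min");  // prints LiuYang -> 20 min
        }
    }
}
```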

Example 2: The CollectionReducer class defines the reduce() method of the Reducer abstract class.

    public static class CollectionReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        // Statistics result.
        private IntWritable result = new IntWritable();
        // Total time threshold.
        private int timeThreshold;

        /**
         * @param key Text: key after the Mapper function.
         * @param values Iterable: all statistical results of the same key item.
         * @param context Context
         * @throws IOException, InterruptedException
         */
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            // If the total time is below the threshold, no result is output.
            if (sum < timeThreshold) {
                return;
            }
            result.set(sum);
            // The output of reduce is a key-value pair. Key: information about
            // a netizen. Value: total online time of the netizen.
            context.write(key, result);
        }

        /**
         * The setup() method is invoked only once, before the map() method of
         * a map task or the reduce() method of a reduce task.
         *
         * @param context Context
         * @throws IOException, InterruptedException
         */
        public void setup(Context context) throws IOException,
                InterruptedException {
            // The configuration information can be obtained using the context.
            timeThreshold = context.getConfiguration().getInt(
                    "log.time.threshold", 120);
        }
    }
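Note that the early return makes "over 2 hours" behave as "at least 120 minutes": a total of exactly 120 is still emitted, because the threshold comes from log.time.threshold (default 120) and the check is sum < timeThreshold. A minimal sketch of this boundary, with ThresholdSketch as a hypothetical stand-in for the reduce() logic:

```java
public class ThresholdSketch {
    // Mirrors CollectionReducer.reduce(): emit only when the summed minutes
    // are not below the threshold (default log.time.threshold = 120).
    static boolean emitted(int[] minutes, int timeThreshold) {
        int sum = 0;
        for (int m : minutes) {
            sum += m;
        }
        return !(sum < timeThreshold);  // same condition as the early return
    }

    public static void main(String[] args) {
        System.out.println(emitted(new int[]{60, 60}, 120));  // exactly 2 hours: true
        System.out.println(emitted(new int[]{60, 59}, 120));  // 119 minutes: false
    }
}
```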

Example 3: Use the main() method to create a job, set parameters, and submit the job to the Hadoop cluster.

  public static void main(String[] args) throws Exception {

// Initialize environment variables.

 Configuration conf = new Configuration();

// Secure login.

    LoginUtil.login(PRINCIPAL, KEYTAB, KRB, conf);

// Obtain the input parameter.

 String[] otherArgs = new GenericOptionsParser(conf, args)

        .getRemainingArgs();

    if (otherArgs.length != 2) {

      System.err.println("Usage: collect female info <in> <out>");

      System.exit(2);

    }

// Initialize the job task object.

    @SuppressWarnings("deprecation")

    Job job = new Job(conf, "Collect Female Info");

    job.setJarByClass(FemaleInfoCollector.class);

    // Set the classes that execute the map and reduce tasks during the running
    // of the system. The classes can also be specified in the configuration file.
    job.setMapperClass(CollectionMapper.class);
    job.setReducerClass(CollectionReducer.class);

    // Set the combiner class. It is not used by default, and it can also be
    // specified in the configuration file. A combiner must not change the final
    // result, so CollectionReducer (which filters by the time threshold) is not
    // safe here; CollectionCombiner only sums the values.
    job.setCombinerClass(CollectionCombiner.class);

// Set the output type of a job.

    job.setOutputKeyClass(Text.class);

    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));

  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

// Submit the task to the remote environment for execution.

    System.exit(job.waitForCompletion(true) ? 0 : 1);

  }

Example 4: The CollectionCombiner class combines the data from the Map function to reduce the amount of data transmitted from Map to Reduce.

    /**
     * Combiner class.
     */
    public static class CollectionCombiner extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        // Intermediate statistical results.
        private IntWritable intermediateResult = new IntWritable();

        /**
         * @param key Text: key after the Mapper function.
         * @param values Iterable: all results with the same key in this map task.
         * @param context Context
         * @throws IOException, InterruptedException
         */
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            intermediateResult.set(sum);
            // In the output, key indicates netizen information, and value
            // indicates the total online time of the netizen in this map task.
            context.write(key, intermediateResult);
        }
    }
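A combiner may run zero, one, or several times per map task, so it must leave the final result unchanged; summation qualifies because it is associative and commutative, and the combiner's output types match the reducer's input types. A plain-Java sketch of this property (CombinerSafetySketch is an illustrative name, not Hadoop API):

```java
public class CombinerSafetySketch {
    // Plain summation, standing in for the combiner/reducer value loop.
    static int sum(int... vals) {
        int total = 0;
        for (int v : vals) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Without a combiner: the reducer sees every map output value directly.
        int direct = sum(20, 50, 50);
        // With a combiner: each map task pre-sums its own values, and the
        // reducer then sums the partial results. Because addition is
        // associative and commutative, the final total is unchanged.
        int combined = sum(sum(20, 50), sum(50));
        System.out.println(direct == combined);  // prints true
    }
}
```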

1.1.1.4 Obtaining Sample Code

Using the FusionInsight Client

Obtain the sample project mapreduce-example-security in the HDFS directory in the FusionInsight_Services_ClientConfig file extracted from the client.

Using the Maven Project

Log in to Huawei DevCloud (https://codehub-cn-south-1.devcloud.huaweicloud.com/codehub/7076065/home) to download the code under components/mapreduce to the local PC.

1.1.1.5 Application Commissioning

Compiling and Running Programs

1.         In the Eclipse development environment, select the LocalRunner.java project and click the run button to run the corresponding application project.

Alternatively, right-click the project and choose Run as > Java Application from the shortcut menu to run the application project.

note

Do not restart the HDFS service during the running of MapReduce jobs. Otherwise, the jobs may fail.

Viewing Commissioning Results

After the MapReduce application is run, you can view the running status of the MapReduce application in the following ways:

•   Check the running status of the application in Eclipse.

•   Use MapReduce logs to obtain the running status of the application.

•   Log in to the MapReduce WebUI to check the running status of the application.

•   Log in to the Yarn WebUI to check the running status of the application.

note

Contact the administrator to obtain a service account that has the right to access the web UI and its password.

1.1.1.5.1 Commissioning Applications on Windows

Compiling and Running Programs

Scenario

You can run applications in the Windows environment after the application code development is complete.

note

If the IBM JDK is used, applications cannot be run directly on Windows.

Procedure

1.         In the Eclipse development environment, select the LocalRunner.java project and click the run button to run the corresponding application project.

Alternatively, right-click the project and choose Run as > Java Application from the shortcut menu to run the application project.

note

Do not restart the HDFS service during the running of MapReduce jobs. Otherwise, the jobs may fail.

2. Viewing the Commissioning Result

Scenario

After the MapReduce application is run, you can view the running status of the MapReduce application in the following ways:

•   Check the running status of the application in Eclipse.

•   Use MapReduce logs to obtain the running status of the application.

•   Log in to the MapReduce WebUI to check the running status of the application.

•   Log in to the Yarn WebUI to check the running status of the application.

note

Contact the administrator to obtain a service account that has the right to access the web UI and its password.

Procedure

•   Viewing the running result to learn the application running status

View the output on the console to learn the application running status, as follows:

1848 [main] INFO  org.apache.hadoop.security.UserGroupInformation  - Login successful for user admin@HADOOP.COM using keytab file 

Login success!!!!!!!!!!!!!!

7093 [main] INFO  org.apache.hadoop.hdfs.PeerCache  - SocketCache disabled.

9614 [main] INFO  org.apache.hadoop.hdfs.DFSClient  - Created HDFS_DELEGATION_TOKEN token 45 for admin on ha-hdfs:hacluster

9709 [main] INFO  org.apache.hadoop.mapreduce.security.TokenCache  - Got dt for hdfs://hacluster; Kind: HDFS_DELEGATION_TOKEN,

Service: ha-hdfs:hacluster, Ident: 

(HDFS_DELEGATION_TOKEN token 45 for admin)

10914 [main] INFO  org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider  - Failing over to 53

12136 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat  - Total input files to process : 2

12731 [main] INFO  org.apache.hadoop.mapreduce.JobSubmitter  - number of splits:2

13405 [main] INFO  org.apache.hadoop.mapreduce.JobSubmitter  - Submitting tokens for job: job_1456738266914_0006

13405 [main] INFO  org.apache.hadoop.mapreduce.JobSubmitter  - Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:hacluster, 

Ident: (HDFS_DELEGATION_TOKEN token 45 for admin)

16019 [main] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl  - Application submission is not finished, 

submitted application application_1456738266914_0006 is still in NEW

16975 [main] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl  - Submitted application application_1456738266914_0006

17069 [main] INFO  org.apache.hadoop.mapreduce.Job  - The url to track the job: 

https://linux2:26001/proxy/application_1456738266914_0006/

17086 [main] INFO  org.apache.hadoop.mapreduce.Job  - Running job: job_1456738266914_0006

29811 [main] INFO  org.apache.hadoop.mapreduce.Job  - Job job_1456738266914_0006 running in uber mode : false

29811 [main] INFO  org.apache.hadoop.mapreduce.Job  -  map 0% reduce 0%

41492 [main] INFO  org.apache.hadoop.mapreduce.Job  -  map 100% reduce 0%

53161 [main] INFO  org.apache.hadoop.mapreduce.Job  -  map 100% reduce 100%

53265 [main] INFO  org.apache.hadoop.mapreduce.Job  - Job job_1456738266914_0006 completed successfully

53393 [main] INFO  org.apache.hadoop.mapreduce.Job  - Counters: 50

note

The following exception may occur when the sample code is running in the Windows OS, but it will not affect services.

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

•   Viewing the task execution status by using the MapReduce WebUI

Log in to FusionInsight Manager, choose Service Management > MapReduce > JobHistoryServer, and check the task execution status on the web page.

Figure 1-1 JobHistory WebUI


•   Viewing the task execution status by using the Yarn WebUI

Log in to FusionInsight Manager, choose Service Management > Yarn > ResourceManager (Master), and check the task execution status on the web page.

Figure 1-2 ResourceManager WebUI


•   Viewing MapReduce logs to obtain the application running status

View MapReduce logs to learn the application running status, and adjust applications based on log information.

1.1.1.5.2 Running Applications on Linux

Compiling and Running Programs

Scenario

After the program code is developed, you can run the application in the Linux environment.

Prerequisites

The Yarn client has been installed.

Procedure

Step 1      Export the executable MapReduce application package.

•   For the MapReduce statistics sample program, select the FemaleInfoCollector.java, LoginUtil.java, krb5.conf, and user.keytab files, and choose Export from the shortcut menu.

•   For the MapReduce multi-component access sample program, select the LoginUtil.java, MultiComponentExample.java, and JarFinderUtil.java files, and choose Export from the shortcut menu.

Step 2      Select JAR file, as shown in Figure 1-3. Click Next.

Figure 1-3 Selecting JAR file


Step 3      Select a path for exporting the package, as shown in Figure 1-4. Click Finish.

Figure 1-4 Selecting a path for exporting the JAR file


Step 4      Upload the generated application package mapreduce-example.jar to the Linux client, for example, to /srv/client/conf, the same directory as the configuration files.

Step 5      Execute the sample project on Linux.

•   For the MapReduce statistics sample project, run the following command.

yarn jar mapreduce-example.jar com.huawei.bigdata.mapreduce.examples.FemaleInfoCollector <inputPath> <outputPath>

This command is used to set parameters and submit jobs. In the command, <inputPath> indicates the input path of the HDFS file system, and <outputPath> indicates the output path of the HDFS file system.

note

•  Before running the yarn jar mapreduce-example.jar com.huawei.bigdata.mapreduce.examples.FemaleInfoCollector <inputPath> <outputPath> command, upload the log1.txt and log2.txt files to the <inputPath> directory of the HDFS. For details, see the description of typical scenarios.

•  Before running the yarn jar mapreduce-example.jar com.huawei.bigdata.mapreduce.examples.FemaleInfoCollector <inputPath> <outputPath> command, ensure that the <outputPath> directory does not exist. Otherwise, an error is reported.

•  Do not restart the HDFS service during the running of MapReduce jobs. Otherwise, the jobs may fail.

•   For the sample application about multi-component access from MapReduce, perform the following steps.

a.         Obtain the user.keytab, krb5.conf, hbase-site.xml, hiveclient.properties, and hive-site.xml files, and create a folder in the Linux environment to save the configuration files, for example, /srv/client/conf.

note

Contact the administrator to obtain the user.keytab and krb5.conf files corresponding to the account and permission. Obtain hbase-site.xml from the HBase client, and hiveclient.properties and hive-site.xml from the Hive client.

b.         Create the jaas_mr.conf file in the new folder. The file content is as follows:

Client {  
com.sun.security.auth.module.Krb5LoginModule required  
useKeyTab=true  
keyTab="user.keytab"  
principal="test@HADOOP.COM"  
useTicketCache=false  
storeKey=true  
debug=true;  
};

note

In the preceding file content, test@HADOOP.COM is an example. Change it based on the site requirements.

c.         In the Linux environment, add the classpath required for running the sample project, for example,

export YARN_USER_CLASSPATH=/srv/client/conf/:/srv/client/HBase/hbase/lib/*:/srv/client/Hive/Beeline/lib/*

d.         Submit the MapReduce job and run the following command to run the sample project.

yarn jar mapreduce-example.jar com.huawei.bigdata.mapreduce.examples.MultiComponentExample

----End

2. Viewing the Commissioning Result

Scenario

After the MapReduce application is run, you can view the running status of the MapReduce application in the following ways:

•   Check the running status of the program based on the running result.

•   Log in to the MapReduce WebUI to check the running status of the application.

•   Log in to the Yarn WebUI to check the running status of the application.

•   Use MapReduce logs to obtain the running status of the application.

note

Contact the administrator to obtain a service account that has the right to access the web UI and its password.

Procedure

•   Viewing the task execution status by using the MapReduce WebUI

Log in to FusionInsight Manager, choose Service Management > MapReduce > JobHistoryServer, and check the task execution status on the web page.

Figure 1-5 JobHistory WebUI


•   Viewing the task execution status by using the Yarn WebUI

Log in to FusionInsight Manager, choose Service Management > Yarn > ResourceManager (Master), and check the task execution status on the web page.

Figure 1-6 ResourceManager WebUI


•   Viewing the running result of the MapReduce application

After running the yarn jar mapreduce-example.jar command in the Linux OS, you can view the running status of the application. For example:

linux1:/opt # yarn jar mapreduce-example.jar /user/mapred/example/input/ /output6  
16/02/24 15:45:40 INFO security.UserGroupInformation: Login successful for user admin@HADOOP.COM using keytab file user.keytab  
Login success!!!!!!!!!!!!!!  
16/02/24 15:45:40 INFO hdfs.PeerCache: SocketCache disabled.  
16/02/24 15:45:41 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 28 for admin on ha-hdfs:hacluster  
16/02/24 15:45:41 INFO security.TokenCache: Got dt for hdfs://hacluster; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:hacluster, Ident: (HDFS_DELEGATION_TOKEN token 28 for admin)  
16/02/24 15:45:41 INFO input.FileInputFormat: Total input files to process : 2  
16/02/24 15:45:41 INFO mapreduce.JobSubmitter: number of splits:2  
16/02/24 15:45:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1455853029114_0027  
16/02/24 15:45:42 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:hacluster, Ident: (HDFS_DELEGATION_TOKEN token 28 for admin)  
16/02/24 15:45:42 INFO impl.YarnClientImpl: Submitted application application_1455853029114_0027  
16/02/24 15:45:42 INFO mapreduce.Job: The url to track the job: https://linux1:26001/proxy/application_1455853029114_0027/  
16/02/24 15:45:42 INFO mapreduce.Job: Running job: job_1455853029114_0027  
16/02/24 15:45:50 INFO mapreduce.Job: Job job_1455853029114_0027 running in uber mode : false  
16/02/24 15:45:50 INFO mapreduce.Job:  map 0% reduce 0%  
16/02/24 15:45:56 INFO mapreduce.Job:  map 100% reduce 0%  
16/02/24 15:46:03 INFO mapreduce.Job:  map 100% reduce 100%  
16/02/24 15:46:03 INFO mapreduce.Job: Job job_1455853029114_0027 completed successfully  
16/02/24 15:46:03 INFO mapreduce.Job: Counters: 49

Run the yarn application -status <ApplicationID> command in the Linux OS. The execution result shows the running status of the application. For example:

linux1:/opt # yarn application -status application_1455853029114_0027  
Application Report : 
        Application-Id : application_1455853029114_0027  
        Application-Name : Collect Female Info  
        Application-Type : MAPREDUCE  
        User : admin  
        Queue : default  
        Start-Time : 1456299942302   
        Finish-Time : 1456299962343  
        Progress : 100%  
        State : FINISHED  
        Final-State : SUCCEEDED  
        Tracking-URL : https://linux1:26014/jobhistory/job/job_1455853029114_0027  
        RPC Port : 27100  
        AM Host : SZV1000044726  
        Aggregate Resource Allocation : 114106 MB-seconds, 42 vcore-seconds  
        Log Aggregation Status : SUCCEEDED  
        Diagnostics : Application finished execution. 
        Application Node Label Expression : <Not set>  
        AM container Node Label Expression : <DEFAULT_PARTITION>

•   Viewing MapReduce logs to obtain the application running status

View MapReduce logs to learn the application running status, and adjust applications based on log information.

 

