
Reading Data from Hive and Writing Data to HBase (Spark: Case 7)


Hi there, fellow Community members!


This post features the process of reading data from Hive and writing data to HBase. Please see below for more details regarding the topic.


1.1.1 Case 7: Reading Data from Hive and Writing Data to HBase

1.1.1.1 Scenario

Applicable Versions

FusionInsight HD V100R002C70, FusionInsight HD V100R002C80

Scenario

Assume that the person table of Hive stores a user's consumption of the current day and HBase table2 stores the user's history consumption data.

In the person table, the name=1,account=100 record indicates that user1's consumption amount on the current day is 100 CNY.

In table2, the key=1,cf:cid=1000 record indicates that user1's history consumption amount is 1000 CNY.

Based on some service requirements, a Spark application must be developed to implement the following functions:

Add the user's consumption amount of the current day to the user's history consumption amount based on the user name; that is, the user's total consumption amount = 100 (consumption amount of the current day) + 1000 (history consumption amount).

In the preceding example, the application run result is that in table2, the total consumption amount of user1 (key=1) is cf:cid=1100 CNY.

Data Planning

Before developing the application, create a Hive table named person and insert data to the table. At the same time, create HBase table2 so that you can write the data analysis result to it.

                              Step 1     Save original log files to HDFS.

1.        Create a blank log1.txt file on the local PC and write the following content to the file.

1,100

2.        Create the /tmp/input directory in HDFS and upload the log1.txt file to the directory.

a.        On the HDFS client, run the following commands for authentication:

cd /opt/hadoopclient

kinit <service user for authentication>

b.        On the HDFS client of the Linux OS, run the hadoop fs -mkdir /tmp/input command (the hdfs dfs command has the same function) to create a directory.

c.        On the HDFS client of the Linux OS, run the hadoop fs -put log1.txt /tmp/input command to upload the data file.
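To confirm that the file landed where expected (an optional check, not part of the original procedure), you can list and display it:

hadoop fs -ls /tmp/input
hadoop fs -cat /tmp/input/log1.txt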

                              Step 2     Store the imported data to the Hive table.

Ensure that the JDBCServer is started. Use the Beeline tool to create a Hive table and insert data to the table.

1.        Run the following command to create a Hive table named person:

create table person

(

name STRING,

account INT

)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' STORED AS TEXTFILE;

2.        Run the following command to load the data into the person table:
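A typical statement for the file uploaded in Step 1 would look like this (a sketch run in Beeline; adjust the path if your input directory differs):

load data inpath '/tmp/input/log1.txt' into table person;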

                              Step 3     Create an HBase table.

Ensure that the JDBCServer is started. Use the Beeline tool to create the HBase table and insert data.

1.        Run the following command to create an HBase table named table2:

create table table2

(

key string,

cid string

)

stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

with serdeproperties("hbase.columns.mapping" = ":key,cf:cid")

tblproperties("hbase.table.name" = "table2");

2.        Run the following command in the HBase shell to insert data into HBase table2:

put 'table2', '1', 'cf:cid', '1000'
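To confirm the write (an optional check in the HBase shell), scan the table; row 1 should show column cf:cid with value 1000:

scan 'table2'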

----End

1.1.1.2 Development Guidelines

1.        Query data in the Hive person table.

2.        Query data in table2 based on the key value in the person table.

3.        Sum the data records obtained in the previous two steps.

4.        Write the result of the previous step to table2.

1.1.1.3 Sample Code Description

1.1.1.3.1 Java Code Example

Function

In a Spark application, you can use Spark to call a Hive API to perform operations on a Hive table, and write the data analysis result of the Hive table to an HBase table.

Sample Code

The following code snippets are used as an example. For complete code, see com.huawei.bigdata.spark.examples.SparkHivetoHbase.

/**

* Read data from the Hive table, obtain the corresponding records from the HBase table based on the key value, sum the two values, and update the data in the HBase table.

 */

public class SparkHivetoHbase {

  public static void main(String[] args) throws Exception {

// Use the Spark interface to obtain data from the table.

    SparkConf conf = new SparkConf().setAppName("SparkHivetoHbase");

    JavaSparkContext jsc = new JavaSparkContext(conf);

    HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(jsc);

    DataFrame dataFrame = sqlContext.sql("select name, account from person");

// Traverse each partition in the Hive table and update the partition to the HBase table.

// If the number of data records is small, you can use the foreach() method.

    dataFrame.toJavaRDD().foreachPartition(

      new VoidFunction<Iterator<Row>>() {

        public void call(Iterator<Row> iterator) throws Exception {

          hBaseWriter(iterator);

        }

      }

    );

    jsc.stop();

  }

 /**

  * Update the HBase table records on the executor side.

  *

  * @param iterator partition data in the Hive table

  */

  private static void hBaseWriter(Iterator<Row> iterator) throws IOException {

// Read the HBase table.

    String tableName = "table2";

    String columnFamily = "cf";

    Configuration conf = HBaseConfiguration.create();

    Connection connection = null;

    Table table = null;

    try {

      connection = ConnectionFactory.createConnection(conf);

      table = connection.getTable(TableName.valueOf(tableName));

      List<Row> table1List = new ArrayList<Row>();

      List<Get> rowList = new ArrayList<Get>();

      while (iterator.hasNext()) {

        Row item = iterator.next();

        Get get = new Get(item.getString(0).getBytes());

        table1List.add(item);

        rowList.add(get);

      }

// Obtain the HBase table records.

      Result[] resultDataBuffer = table.get(rowList);

// Modify the records in the HBase table.

      List<Put> putList = new ArrayList<Put>();

      for (int i = 0; i < resultDataBuffer.length; i++) {

 // Value of the hive table

        Result resultData = resultDataBuffer[i];

        if (!resultData.isEmpty()) {

          // get hiveValue

          int hiveValue = table1List.get(i).getInt(1);

// Obtain the HBase value based on the column family and column qualifier.

          String hbaseValue = Bytes.toString(resultData.getValue(columnFamily.getBytes(), "cid".getBytes()));

          Put put = new Put(table1List.get(i).getString(0).getBytes());

// Calculation result

          int resultValue = hiveValue + Integer.valueOf(hbaseValue);

// Set the result to the put object.

          put.addColumn(Bytes.toBytes(columnFamily), Bytes.toBytes("cid"), Bytes.toBytes(String.valueOf(resultValue)));

          putList.add(put);

        }

      }

      if (putList.size() > 0) {

        table.put(putList);

      }

    } catch (IOException e) {

      e.printStackTrace();

  } finally {

      if (table != null) {

        try {

          table.close();

        } catch (IOException e) {

          e.printStackTrace();

        }

      }

      if (connection != null) {

        try {

// Close the HBase connection.

          connection.close();

        } catch (IOException e) {

          e.printStackTrace();

        }

      }

    }

  }

}

1.1.1.3.2 Scala Code Example

Function

In a Spark application, you can use Spark to call a Hive API to perform operations on a Hive table, and write the data analysis result of the Hive table to an HBase table.

Sample Code

The following code snippets are used as an example. For complete code, see com.huawei.bigdata.spark.examples.SparkHivetoHbase.

/**

* Read data from the Hive table, obtain the corresponding records from the HBase table based on the key value, sum the two values, and update the data in the HBase table.

  */

object SparkHivetoHbase {

  case class FemaleInfo(name: String, gender: String, stayTime: Int)

  def main(args: Array[String]) {

 // Use the Spark interface to obtain data from the table.

    val sparkConf = new SparkConf().setAppName("SparkHivetoHbase")

    val sc = new SparkContext(sparkConf)

    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

    import sqlContext.implicits._

    val dataFrame = sqlContext.sql("select name, account from person")

// Traverse each partition in the Hive table and update the partition to the HBase table.

// If the number of data records is small, you can use the foreach() method.

    dataFrame.rdd.foreachPartition(x => hBaseWriter(x))

    sc.stop()

  }

  /**

   * Update the HBase table records on the executor side.

   *

   * @param iterator partition data in the Hive table

   */

  def hBaseWriter(iterator: Iterator[Row]): Unit = {

// Read the HBase table.

    val tableName = "table2"

    val columnFamily = "cf"

    val conf = HBaseConfiguration.create()

    var table: Table = null

    var connection: Connection = null

    try {

      connection = ConnectionFactory.createConnection(conf)

      table = connection.getTable(TableName.valueOf(tableName))

      val iteratorArray = iterator.toArray

      val rowList = new util.ArrayList[Get]()

      for (row <- iteratorArray) {

        val get = new Get(row.getString(0).getBytes)

        rowList.add(get)

      }

// Obtain the HBase table record.

      val resultDataBuffer = table.get(rowList)

// Modify a record in the HBase table.

      val putList = new util.ArrayList[Put]()

      for (i <- 0 until iteratorArray.size) {

        // hbase row

        val resultData = resultDataBuffer(i)

        if (!resultData.isEmpty) {

          // Value of the Hive table

          var hiveValue = iteratorArray(i).getInt(1)

          // Obtain the HBase value based on the column family and column qualifier.

          val hbaseValue = Bytes.toString(resultData.getValue(columnFamily.getBytes, "cid".getBytes))

          val put = new Put(iteratorArray(i).getString(0).getBytes)

// Calculation result

          val resultValue = hiveValue + hbaseValue.toInt

// Set the result to the put object.

          put.addColumn(Bytes.toBytes(columnFamily), Bytes.toBytes("cid"), Bytes.toBytes(resultValue.toString))

          putList.add(put)

        }

      }

      if (putList.size() > 0) {

        table.put(putList)

      }

    } catch {

      case e: IOException =>

        e.printStackTrace();

    } finally {

      if (table != null) {

        try {

          table.close()

        } catch {

          case e: IOException =>

            e.printStackTrace();

        }

      }

      if (connection != null) {

      try {

// Close the HBase connection.

          connection.close()

        } catch {

          case e: IOException =>

            e.printStackTrace()

        }

      }

    }

  }

}

1.1.1.4 Obtaining Sample Code

Using the FusionInsight Client

Obtain the sample project from the sampleCode directory in the Spark directory of the FusionInsight_Services_ClientConfig file extracted from the client.

Security mode: SparkHivetoHbaseJavaExample and SparkHivetoHbaseScalaExample in the spark-examples-security directory

Non-security mode: SparkHivetoHbaseJavaExample and SparkHivetoHbaseScalaExample in the spark-examples-normal directory

Using the Maven Project

Log in to Huawei DevCloud (https://codehub-cn-south-1.devcloud.huaweicloud.com/codehub/7076065/home) and download the code to your local PC.

Security mode:

components/spark/spark-examples-security/SparkJavaExample

components/spark/spark-examples-security/SparkScalaExample

Non-security mode:

components/spark/spark-examples-normal/SparkJavaExample

components/spark/spark-examples-normal/SparkScalaExample

1.1.1.5 Application Commissioning

1.1.1.5.1 Compiling and Running the Application

Scenario

After the program code is developed, you can upload the code to the Linux client for running. The running procedures of applications developed in Scala or Java are the same.

note

•  The Spark application can run only in the Linux environment, not in the Windows environment.

•  A Spark application developed in Python does not need to be built into a jar through Artifacts. You just need to copy the sample project to the compiler.

•  Ensure that the Python version installed on the worker and driver is the same; otherwise, the following error is reported: "Python in worker has different version %s than that in driver %s." One way to keep the versions consistent is shown in the sketch below.
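One common way to keep the versions aligned is to point both the driver and the executors at the same interpreter before submitting (a sketch; the interpreter path is an assumption and depends on your environment):

export PYSPARK_PYTHON=/usr/bin/python
export PYSPARK_DRIVER_PYTHON=/usr/bin/python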

Procedure

                              Step 1     In IntelliJ IDEA, configure the Artifacts information of the project before creating the jar.

1.        On the main page of IDEA, choose File > Project Structure... to open the Project Structure page.

2.        On the Project Structure page, select Artifacts, click + and choose Jar > From modules with dependencies....

Figure 1-1 Add the Artifacts


 

3.        Select the corresponding module. The module corresponding to the Java sample projects is CollectFemaleInfo. Click OK.

Figure 1-2 Create Jar from Modules


 

4.        Configure the name, type, and output directory of the jar as required.

Figure 1-3 Configuring the basic information


 

5.        Right-click CollectFemaleInfo, choose Put into Output Root, and click Apply.

Figure 1-4 Put into Output Root


 

6.        Click OK.

                              Step 2     Create the jar.

1.        On the main page of IDEA, choose Build > Build Artifacts....

Figure 1-5 Build Artifacts


 

2.        On the displayed menu, choose CollectFemaleInfo > Build to create the jar.

Figure 1-6 Build


 

3.        If the following information is displayed in the event log, the jar is created successfully. You can obtain the jar from the directory configured in Step 1.4.

21:25:43 Compilation completed successfully in 36 sec

                              Step 3     Copy the jar created in Step 2 to the Spark running environment (Spark client), for example, /opt/hadoopclient/Spark, to run the Spark application.
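If the jar was built on a different machine, one way to copy it to the client node is scp (a sketch; the user name, host, and jar name are placeholders):

scp <your_application>.jar root@<client_node_IP>:/opt/hadoopclient/Spark/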

 

Notice

When a Spark task is running, it is prohibited to restart the HDFS service or restart all DataNode instances. Otherwise, the Spark task may fail, resulting in JobHistory data loss.

•  Run the sample projects of Spark Core (including Scala and Java).

Access the Spark client directory and run the bin/spark-submit script to execute the code.

<inputPath> indicates the input directory in the HDFS.

bin/spark-submit --class com.huawei.bigdata.spark.examples.FemaleInfoCollection --master yarn-client /opt/female/FemaleInfoCollection.jar <inputPath>

•  Run the sample projects of Spark SQL (Java and Scala).

Access the Spark client directory and run the bin/spark-submit script to execute the code.

<inputPath> indicates the input directory in the HDFS.

bin/spark-submit --class com.huawei.bigdata.spark.examples.FemaleInfoCollection --master yarn-client /opt/female/FemaleInfoCollection.jar <inputPath>

•  Run the sample projects of Spark Streaming (Java and Scala).

Access the Spark client directory and run the bin/spark-submit script to execute the code.

note

The location of Spark Streaming Kafka dependency package on the client is different from the location of other dependency packages. For example, the path to the Spark Streaming Kafka dependency package is $SPARK_HOME/lib/streamingClient, whereas the path to other dependency packages is $SPARK_HOME/lib. When running an application, you must add the configuration option to the spark-submit command to specify the path of Spark Streaming Kafka dependency package. The following is an example path:

--jars $SPARK_HOME/lib/streamingClient/kafka-clients-0.8.2.1.jar,$SPARK_HOME/lib/streamingClient/kafka_2.10-0.8.2.1.jar,$SPARK_HOME/lib/streamingClient/spark-streaming-kafka_2.10-1.5.1.jar

Example code of the Spark Streaming Write To Print sample is as follows:

bin/spark-submit --master yarn-client --jars $SPARK_HOME/lib/streamingClient/kafka-clients-0.8.2.1.jar,$SPARK_HOME/lib/streamingClient/kafka_2.10-0.8.2.1.jar,$SPARK_HOME/lib/streamingClient/spark-streaming-kafka_2.10-1.5.1.jar --class com.huawei.bigdata.spark.examples.FemaleInfoCollectionPrint /opt/female/FemaleInfoCollectionPrint.jar <checkPointDir> <batchTime> <topics> <brokers>

Example code of the Spark Streaming Write To Kafka sample is as follows:

bin/spark-submit --master yarn-client --jars $SPARK_HOME/lib/streamingClient/kafka-clients-0.8.2.1.jar,$SPARK_HOME/lib/streamingClient/kafka_2.10-0.8.2.1.jar,$SPARK_HOME/lib/streamingClient/spark-streaming-kafka_2.10-1.5.1.jar --class com.huawei.bigdata.spark.examples.FemaleInfoCollectionKafka /opt/female/FemaleInfoCollectionKafka.jar <checkPointDir> <batchTime> <windowTime> <topics> <brokers>

•  Run the sample projects of Accessing the Spark SQL Through JDBC (Java and Scala).

Access the Spark client directory and run the java -cp command to execute the code.

java -cp $SPARK_HOME/lib/*:$SPARK_HOME/conf:/opt/female/ThriftServerQueriesTest.jar com.huawei.bigdata.spark.examples.ThriftServerQueriesTest $SPARK_HOME/conf/hive-site.xml $SPARK_HOME/conf/spark-defaults.conf

note:

In the preceding command line, you can choose the minimal runtime dependency package based on the sample projects. For details of the runtime dependency packages, see References.

•  Run the Spark on HBase sample application (Java and Scala).

a.        Verify that the configuration options in the Spark client configuration file spark-defaults.conf are correctly configured.

When running the Spark on HBase sample application, set the configuration option spark.hbase.obtainToken.enabled in the Spark client configuration file spark-defaults.conf to true (the default value is false; changing the value to true does not affect existing services, but if you want to uninstall the HBase service, change the value back to false first), and set the configuration option spark.inputFormat.cache.enabled to false.

Table 1-1 Parameters

Parameter: spark.hbase.obtainToken.enabled
Description: Indicates whether to enable the function of obtaining the HBase token.
Default value: false

Parameter: spark.inputFormat.cache.enabled
Description: Indicates whether to cache the InputFormat that maps to HadoopRDD. If the parameter is set to true, the tasks of the same executor use the same InputFormat object. In this case, the InputFormat must be thread-safe. If caching the InputFormat is not required, set the parameter to false.
Default value: true
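For reference, the two options might look as follows in spark-defaults.conf (a sketch; the exact file location depends on your client installation):

spark.hbase.obtainToken.enabled   true
spark.inputFormat.cache.enabled   false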

 

b.        Access the Spark client directory and run the bin/spark-submit script to execute the code.

Run the sample applications in the following sequence: TableCreation > TableInputData > TableOutputData.

When the TableInputData sample application is running, <inputPath> needs to be specified. <inputPath> indicates the input path in the HDFS.

bin/spark-submit --class com.huawei.bigdata.spark.examples.TableInputData --master yarn-client /opt/female/TableInputData.jar <inputPath>

•  Run the Spark HBase to HBase sample application (Scala and Java).

Access the Spark client directory and run the bin/spark-submit script to execute the code.

bin/spark-submit --class com.huawei.bigdata.spark.examples.SparkHbasetoHbase --master yarn-client /opt/female/FemaleInfoCollection.jar

•  Run the Spark Hive to HBase sample application (Scala and Java).

Access the Spark client directory and run the bin/spark-submit script to execute the code.

bin/spark-submit --class com.huawei.bigdata.spark.examples.SparkHivetoHbase --master yarn-client /opt/female/FemaleInfoCollection.jar

•  Run the Spark Streaming Kafka to HBase sample application (Scala and Java).

Access the Spark client directory and run the bin/spark-submit script to execute the code.

When the sample application is running, specify <checkPointDir>, <topic>, and <brokerList>. <checkPointDir> indicates the directory where the application result is backed up, <topic> indicates the topic read from Kafka, and <brokerList> indicates the IP address of the Kafka server.

note:

On the client, the directory of the Spark Streaming Kafka dependency package is different from the directory of other dependency packages. For example, the directory of other dependency packages is $SPARK_HOME/lib, whereas the directory of the Spark Streaming Kafka dependency package is $SPARK_HOME/lib/streamingClient. Therefore, when running the application, add the configuration option in the spark-submit command to specify the directory of the Spark Streaming Kafka dependency package, for example, --jars $SPARK_HOME/lib/streamingClient/kafka-clients-0.8.2.1.jar,$SPARK_HOME/lib/streamingClient/kafka_2.10-0.8.2.1.jar,$SPARK_HOME/lib/streamingClient/spark-streaming-kafka_2.10-1.5.1.jar.

Example code of the Spark Streaming To HBase sample is as follows:

bin/spark-submit --master yarn-client --jars $SPARK_HOME/lib/streamingClient/kafka-clients-0.8.2.1.jar,$SPARK_HOME/lib/streamingClient/kafka_2.10-0.8.2.1.jar,$SPARK_HOME/lib/streamingClient/spark-streaming-kafka_2.10-1.5.1.jar --class com.huawei.bigdata.spark.examples.streaming.SparkOnStreamingToHbase /opt/female/FemaleInfoCollectionPrint.jar <checkPointDir> <topic> <brokerList>

•  Submit the application developed in Python.

Access the Spark client directory and run the bin/spark-submit script to execute the code.

<inputPath> indicates the input directory in the HDFS.

note:

Because the sample code does not contain authentication information, specify the authentication information by configuring spark.yarn.keytab and spark.yarn.principal when the application is run.

bin/spark-submit --master yarn-client --conf spark.yarn.keytab=/opt/FIclient/user.keytab --conf spark.yarn.principal=sparkuser /opt/female/SparkPythonExample/collectFemaleInfo.py <inputPath>

----End

References

The runtime dependency packages for the sample projects of Accessing the Spark SQL Through JDBC (Java and Scala) are as follows:

•  The sample projects of Accessing the Spark SQL Through JDBC (Scala):

          avro-1.7.7.jar

          commons-collections-3.2.2.jar

          commons-configuration-1.6.jar

          commons-io-2.4.jar

          commons-lang-2.6.jar

          commons-logging-1.1.3.jar

          guava-12.0.1.jar

          hadoop-auth-2.7.2.jar

          hadoop-common-2.7.2.jar

          hadoop-mapreduce-client-core-2.7.2.jar

          hive-exec-1.2.1.spark.jar

          hive-jdbc-1.2.1.spark.jar

          hive-metastore-1.2.1.spark.jar

          hive-service-1.2.1.spark.jar

          httpclient-4.5.2.jar

          httpcore-4.4.4.jar

          libthrift-0.9.3.jar

          log4j-1.2.17.jar

          slf4j-api-1.7.10.jar

          zookeeper-3.5.1.jar

          scala-library-2.10.4.jar

•  The sample projects of Accessing the Spark SQL Through JDBC (Java):

          commons-collections-3.2.2.jar

          commons-configuration-1.6.jar

          commons-io-2.4.jar

          commons-lang-2.6.jar

          commons-logging-1.1.3.jar

          guava-2.0.1.jar

          hadoop-auth-2.7.2.jar

          hadoop-common-2.7.2.jar

          hadoop-mapreduce-client-core-2.7.2.jar

          hive-exec-1.2.1.spark.jar

          hive-jdbc-1.2.1.spark.jar

          hive-metastore-1.2.1.spark.jar

          hive-service-1.2.1.spark.jar

          httpclient-4.5.2.jar

          httpcore-4.4.4.jar

          libthrift-0.9.3.jar

          log4j-1.2.17.jar

          slf4j-api-1.7.10.jar

          zookeeper-3.5.1.jar

1.1.1.5.2 Checking the Commissioning Result

Scenario

After a Spark application is run, you can check the running result through one of the following methods:

•  Viewing the command output.

•  Logging in to the Spark Web UI.

•  Viewing Spark logs.

Procedure

•  Check the operating result data of the Spark application.

The data storage directory and format are specified by users in the Spark application. You can obtain the data in the specified file.
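For this case, the result is written back to HBase table2, so a quick check (a sketch using the standard HBase shell) is to scan the table and confirm that cf:cid for row key 1 is now 1100:

hbase shell
scan 'table2'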

•  Check the status of the Spark application.

Spark provides the following two Web UIs:

          The Spark UI displays the status of applications that are being executed.

The Spark UI contains the Spark Jobs, Spark Stages, Storage, Environment, and Executors parts. In addition, Spark Streaming is displayed for a Streaming application.

To access the UI: on the Web UI of YARN, find the corresponding Spark application and click ApplicationMaster in the final column of the application information to access the Spark UI.

          The History Server UI displays the status of all Spark applications.

The History Server UI displays information such as the application ID, application name, start time, end time, execution time, and user to whom the application belongs. After the application ID is clicked, the Spark UI of the application is displayed.

•  View Spark logs to learn about application running conditions.

Spark logs offer immediate visibility into how an application is running, and you can adjust the application accordingly. For details about the logs, see the Spark part of the Log Description in the Administrator Guide.

 
