Got it

HCIA-Big Data | Introduction to Hive

Latest reply: Apr 23, 2022 13:56:24 1013 21 14 0 0

Hello, friend!

The Apache Hive data warehouse software helps read, write, and manage large data sets that reside in distributed storage by using SQL. Structures can be projected onto stored data. The command-line tool and JDBC driver are provided to connect users to Hive.

Hive Overview

Hive is a data warehouse tool running on Hadoop and supports PB-level distributed data query and management.

Hive features:

  • Supporting flexible extraction, transformation, and load (ETL).

  • Supporting multiple computing engines, such as Tez and Spark.

  • Supporting direct access to HDFS files and HBase.

  • Easy-to-use and easy-to-program.

Application Scenarios of Hive

Application Scenarios of Hive

Comparison Between Hive and Traditional Data Warehouses


Hive

Conventional Data Warehouse

Storage

HDFS is used to store data. Theoretically, 

infinite expansion is possible.

Clusters are used to store data, which have a capacity upper limit. With the increase of capacity, the computing speed decreases sharply. Therefore, data warehouses are applicable only to commercial applications with small data volumes.

Execution engine

Tez (default)

You can select more efficient algorithms to perform queries or take more optimization measures to speed up the queries.

Usage Method

HQL (SQL-like)

SQL

Flexibility

Metadata storage is independent of data storage, decoupling metadata and data.

Low flexibility. Data can be used for limited purposes.

Analysis speed

Computing depends on the cluster scale and the cluster is easy to expand. In the case of a large amount of data, computing is much faster than that of a common data warehouse.

When the data volume is small, the data processing speed is high. When the data volume is large, the speed decreases sharply.

Index

Low efficiency

High efficiency

Ease of use

Self-developed application models are 

needed, featuring high flexibility but 

delivering low usability.

A set of mature report solutions is integrated to facilitate data analysis.

Reliability

Data is stored in HDFS, implementing 

high data reliability and fault tolerance.

The reliability is low. If a query fails, you start the task again. The data fault tolerance depends on hardware RAID.

Environment dependency

Low dependency on hardware, applicable to common machines

Low dependency on hardware, applicable to common machine

Price

Open-source product, free of charge

Expensive in commercial use

Advantages of Hive

Advantages of Hive

Hive Architecture

Hive Architecture

Hive Running Process

The client submits the HQL command.

Tez executes the query.

YARN allocates resources to applications in the cluster and enables authorization for Hive jobs in the YARN queue.

Hive updates data in HDFS or Hive warehouse based on the table type.

Hive returns the query result through the JDBC connection.

Hive Running Process

Data Storage Model of Hive

Data Storage Model of Hive

Partition and Bucket

Partition: Data tables can be partitioned based on the value of a certain field.

  • Each partition is a directory.

  • The number of partitions is not fixed.

  • Partitions or buckets can be created in a partition.

Data can be stored in different buckets.

  • Each bucket is a file.

  • The number of buckets is specified when creating a table. The buckets can be sorted.

  • Data is hashed based on the value of a field and then stored in a bucket.

Managed Table and External Table

Hive can create managed tables and external tables.

  • By default, a managed table is created, and Hive moves data to the data warehouse directory.

  • When an external table is created, Hive accesses data outside the warehouse directory.

  • If all processing is performed by Hive, you are advised to use managed tables.

  • If you want to use Hive and other tools to process the same data set, you are advised to use external tables.


Managed Table

External Table

CREATE/LOAD

Data is moved to the repository directory.

The data location is not moved.

DROP

The metadata and data are deleted together.

Only the metadata is deleted.

Functions Supported by Hive

Built-in Hive Functions

  • Mathematical functions, such as round(), floor(), abs(), and rand().

  • Date functions, such as to_date(), month(), and day().

  • String functions, such as trim(), length(), and substr().

User-Defined Function (UDF)

Basic Hive Operations

Hive Usage

Running HiveServer2 and Beeline:

$ $HIVE_HOME/bin/hiveserver2

$ $HIVE_HOME/bin/beeline -u jdbc:hive2://$HS2_HOST:$HS2_PORT

Running Hcatalog:

$ $HIVE_HOME/hcatalog/sbin/hcat_server.sh

Running WebHCat (Templeton):

$ $HIVE_HOME/hcatalog/sbin/webhcat_server.sh

Hive SQL Overview

DDL-Data Definition Language:

  • Creates tables, modifies tables, deletes tables, partitions, and data types.

DML-Data Management Language:

  • Imports and exports data.

DQL-Data Query Language:

  • Performs simple queries.

  • Performs complex queries such as Group by, Order by and Join.

DDL Operations

Create a table:

hive> CREATE TABLE pokes (foo INT, bar STRING);

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

Browse the table:

hive> SHOW TABLES;

Describe a table:

hive> DESCRIBE invites;

Modify a table:

hive> ALTER TABLE events RENAME TO 3koobecaf; 

hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);

DML Operations

Load data to a table:

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

Export data to HDFS:

EXPORT TABLE invites TO '/department';

DQL Operations 

SELECTS and FILTERS:

hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

GROUP BY:

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;

hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

MULTITABLE INSERT:

FROM src

INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100

INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200;

JOIN:

hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

STREAMING:

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING  '/bin/cat' WHERE a.ds > '2008-08-09';

That's all, thanks!

  • x
  • convention:

Saqibaz
Created Mar 12, 2022 11:19:03

Thanks for sharing
View more
  • x
  • convention:

olive.zhao
olive.zhao Created Mar 15, 2022 02:02:23 (0) (0)
 
SamB
Created Mar 12, 2022 11:41:07

Very detailed and Very useful post, thanks
View more
  • x
  • convention:

olive.zhao
olive.zhao Created Mar 15, 2022 02:02:33 (0) (0)
 
Aimen
Created Mar 12, 2022 11:52:52

Good share
View more
  • x
  • convention:

huyvan
Created Mar 12, 2022 14:31:47

Good share
View more
  • x
  • convention:

olive.zhao
olive.zhao Created Mar 15, 2022 02:02:43 (0) (0)
Thanks!  
user_3915171
Created Mar 12, 2022 22:59:03

thanks
View more
  • x
  • convention:

NTan33
Created Mar 13, 2022 04:01:57

A good introduction to this topic.
View more
  • x
  • convention:

olive.zhao
olive.zhao Created Mar 15, 2022 02:02:53 (0) (0)
Thanks!  
user_4358465
Created Mar 13, 2022 15:51:04

Great introduction to Hive
View more
  • x
  • convention:

MahMush
Author Created Mar 20, 2022 04:30:02

completely defined
View more
  • x
  • convention:

olive.zhao
olive.zhao Created Apr 2, 2022 01:55:06 (0) (0)
Thanks!  
zaheernew
MVE Author Created Apr 11, 2022 08:16:49

Useful Info
View more
  • x
  • convention:

olive.zhao
olive.zhao Created Apr 12, 2022 00:36:53 (0) (0)
 
12
Back to list

Comment

You need to log in to comment to the post Login | Register
Comment

Notice: To protect the legitimate rights and interests of you, the community, and third parties, do not release content that may bring legal risks to all parties, including but are not limited to the following:
  • Politically sensitive content
  • Content concerning pornography, gambling, and drug abuse
  • Content that may disclose or infringe upon others ' commercial secrets, intellectual properties, including trade marks, copyrights, and patents, and personal privacy
Do not share your account and password with others. All operations performed using your account will be regarded as your own actions and all consequences arising therefrom will be borne by you. For details, see " User Agreement."

My Followers

Login and enjoy all the member benefits

Login

Block
Are you sure to block this user?
Users on your blacklist cannot comment on your post,cannot mention you, cannot send you private messages.
Reminder
Please bind your phone number to obtain invitation bonus.
Information Protection Guide
Thanks for using Huawei Enterprise Support Community! We will help you learn how we collect, use, store and share your personal information and the rights you have in accordance with Privacy Policy and User Agreement.