
Query Engines: Spark and Hive

Latest reply: Nov 24, 2021 04:50:38

Hello all,

With the development of the big data ecosystem, tools for big data processing have become more sophisticated and easier to use. One of the major requirements for analytics tools is the ability to execute ad hoc queries on the data stored in the system. SQL, being one of the main languages used for data querying, became the standard that most frameworks try to integrate, giving rise to a number of SQL-like languages. In this post, the Hive and Spark big data processing frameworks, their architecture, and their usage are presented.


Hive

Apache Hive is a data warehouse system that supports read and write operations over datasets in distributed storage. Queries use an SQL-like syntax provided by a dedicated query language called HiveQL. HiveServer2 (HS2) is a service that enables clients to run Hive queries.
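As an illustration, here is a minimal sketch of running a HiveQL query against HiveServer2 from Python. It assumes the third-party PyHive package and a reachable HS2 instance; the host name, user, table, and columns are hypothetical.

```python
# Minimal sketch: query HiveServer2 via PyHive (assumed installed).
# Host, credentials, and the web_logs table are hypothetical.
from pyhive import hive

conn = hive.connect(host="hs2.example.com", port=10000, username="analyst")
cursor = conn.cursor()
cursor.execute("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")
for page, hits in cursor.fetchall():
    print(page, hits)
cursor.close()
conn.close()
```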

Hive System Architecture

[Figure: Hive system architecture]

This figure shows the major components of Hive and how it interacts with Hadoop. The User Interface (UI) is used for interacting with the system; initially it was only a command line interface (CLI). The Driver is the component that receives the queries, provides the session handles, and exposes an API modeled on the JDBC/ODBC interfaces. The Compiler parses the query, performs semantic analysis of the query blocks and expressions, and creates an execution plan based on the table and partition metadata, which is stored in the Metastore.

The Metastore stores all of the structural information about the different tables and partitions in the warehouse, including the information about columns and their types, the serializers and deserializers used to read and write data to HDFS, and the location of data. 
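To make this concrete, the following is a minimal sketch of inspecting the metadata the Metastore keeps for a table, using Spark SQL with Hive support enabled; page_views is a hypothetical table (the data model sketch further below shows how such a table could be created).

```python
# Minimal sketch: read table metadata kept by the Hive Metastore
# through Spark SQL. The page_views table is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Columns and types, the SerDe/storage format, and the HDFS location
# are all answered from the Metastore, not from the data files.
spark.sql("DESCRIBE FORMATTED page_views").show(truncate=False)
spark.sql("SHOW PARTITIONS page_views").show(truncate=False)
```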


There are three main abstractions in the Hive data model: tables, partitions, and buckets. Hive tables can be viewed as similar to tables in relational databases.

Rows in Hive tables are organized into typed columns. Tables support filter, union, and join operations. All the data in a table is stored in an HDFS directory. Hive also supports external tables, which can be created on pre-existing files or directories in HDFS. Each table can have one or more partition keys that determine how the data is stored. Partitions allow the system to prune data based on query predicates, so that it does not have to scan all of the stored data for every query. The data stored in a partition can be further divided into buckets, based on the hash of a column; each bucket is stored as a file in the partition directory. Buckets make evaluations that depend on a sample of the data more efficient, as in the sketch below.
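The following is a minimal sketch of this data model in HiveQL, submitted through PySpark with Hive support; the table name, columns, bucket count, and file format are all hypothetical choices.

```python
# Minimal sketch: a partitioned, bucketed Hive table created via Spark SQL.
# Table name, columns, bucket count, and storage format are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-data-model-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Partitioning by event_date lets queries with a date predicate prune whole
# directories; bucketing by user_id lets sampling and joins on that column
# read only a subset of the bucket files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS page_views (
        user_id  BIGINT,
        page     STRING,
        duration INT
    )
    PARTITIONED BY (event_date STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# This query only needs to read the single matching partition directory.
spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM page_views
    WHERE event_date = '2021-04-01'
    GROUP BY page
""").show()
```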


Spark 

Apache Spark is a unified engine for distributed data processing that was created in 2009 at the University of California, Berkeley. Its first release came in 2010, and since then Apache Spark has become one of the most active open-source projects in big data processing. It is part of the Apache Software Foundation and has had over 1,000 contributors. Spark offers several libraries, including Spark SQL for analytics, Spark Streaming for stream processing, MLlib for machine learning, and GraphX for graph processing.

[Figure: Spark architecture]

Each Spark program has a driver that coordinates the execution of concurrent operations on the nodes in the cluster. The main abstraction is the resilient distributed dataset (RDD), which represents a collection of objects partitioned over the Spark nodes in the cluster; most commonly, an RDD is created by reading a file from HDFS. Another important abstraction in Spark is the concept of shared variables. By default, a copy of each variable used in a function is sent to every task that executes it concurrently. In some cases, however, variables need to be shared across tasks, or between tasks and the driver. For these cases, Spark supports two types of shared variables: broadcast variables, which are used to cache a value in memory on every node, and accumulators, which can be mutated only by adding to them, for instance sums and counts.
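A minimal PySpark sketch of both kinds of shared variables, using made-up lookup data and a made-up counter, might look like this:

```python
# Minimal sketch: broadcast variables and accumulators in PySpark.
# The lookup table and input codes are made up for illustration.
from pyspark import SparkContext

sc = SparkContext(appName="shared-variables-sketch")

# Broadcast: ship a read-only lookup table to each executor once,
# instead of once per task.
country_names = sc.broadcast({"DE": "Germany", "FR": "France"})

# Accumulator: a counter that tasks may only add to and that the
# driver can read back.
unknown_codes = sc.accumulator(0)

def resolve(code):
    if code not in country_names.value:
        unknown_codes.add(1)
        return "unknown"
    return country_names.value[code]

rdd = sc.parallelize(["DE", "FR", "XX", "DE"])
print(rdd.map(resolve).collect())  # ['Germany', 'France', 'unknown', 'Germany']
print(unknown_codes.value)         # 1
sc.stop()
```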


As noted above, the key programming abstraction in Spark is the RDD: a fault-tolerant collection of objects partitioned over the Spark nodes in the cluster that can be processed in parallel. Spark offers APIs in Java, Scala, Python, and R, through which users pass functions to be executed on the Spark cluster. Commonly, an RDD is first read from an external source, such as HDFS, and then transformed using operations such as map, filter, or groupBy. RDDs are evaluated lazily, so an efficient execution plan can be created for the transformations specified by the user, as in the sketch below.
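A minimal sketch of that workflow in PySpark, with a hypothetical HDFS path and log layout, could look like this:

```python
# Minimal sketch: lazy RDD transformations followed by an action.
# The HDFS path and log layout are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

lines = sc.textFile("hdfs:///data/access.log")   # nothing is read yet

# Transformations only extend the execution plan; no job runs here.
error_counts = (lines
                .filter(lambda line: "ERROR" in line)
                .map(lambda line: (line.split(" ")[0], 1))  # key by first field
                .reduceByKey(lambda a, b: a + b))

# The action triggers Spark to build and run an optimized plan.
print(error_counts.take(10))
sc.stop()
```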

Thank you.



The post is synchronized to: FusionInsight Components

Comments

S_Noch · Created Apr 1, 2021 04:03:56
Very good
    little_fish · Apr 1, 2021 05:44:19: follow me for more information

chantha · Created Apr 1, 2021 05:10:55
Well note
    little_fish · Apr 1, 2021 05:44:28: follow me for more information

wissal (MVE) · Created Apr 1, 2021 07:43:21
Learned, well done
    little_fish · Apr 1, 2021 08:10:34: hi wissal

Unicef (MVE) · Created Apr 1, 2021 08:14:05
Good share, thanks for posting

user_4001805 · Created Apr 1, 2021 14:34:16
Good sharing

phuta · Created Apr 1, 2021 14:49:55
Thanks for sharing

Herediano · Created Apr 1, 2021 16:10:08
Thank you

user_4143437 · Created Apr 12, 2021 12:27:29
Nice one
    little_fish · Apr 13, 2021 01:46:34: thanks dear

user_4143437 · Created Apr 15, 2021 18:44:58
Nice one
    little_fish · Apr 16, 2021 00:33:25: Thanks