
Apache Hive vs. Apache HBase

Latest reply: Jul 26, 2022 13:47:46

Hello, everyone!

In this post, we are going to talk about Apache Hive vs. Apache HBase.

People are always asking me at meetups whether they should use Apache Hive, Apache HBase, Apache Spark SQL, or some buzzword data engine.

My answer is yes: use them all for the appropriate use case and data.

Ask yourself some questions first:

  • What does your data look like?

  • How many rows will you have?

  • What is more important: reads, writes, appends, updates, or deletes?

  • Do you need SQL?

  • Do you need deep, rich, full ANSI SQL?

  • What interfaces will you have to the data? JDBC? APIs? Apache Spark?

  • How many concurrent users will access this data?

  • How often is it inserted? Updated? Deleted? Read? Joined? Exported?

  • Is this structured? Unstructured? Semistructured? Avro? JSON?

  • Do you want to integrate with OLAP? Druid?

  • Is this for temporary use?

  • Is this part of a real-time streaming ingest flow?

  • Is it columnar? If so, how many columns? Are they naturally grouped?

  • Do you have sparse data?

  • What BI or query tool are you using?

  • Do you need to do scans?

  • Is your data key-value?

My next question is: How are you ingesting it? For most cases, it makes sense to use Apache NiFi for either Apache Hive or Apache HBase destinations. Sometimes, Apache Sqoop makes sense as well. What is the source format? Do you need to store it in the original format? Is it already JSON or CSV?

Apache HBase has some very interesting updates coming in version 2.0 that make it great for a lot of use cases.

Apache Hive is great for its full SQL, in-memory caching, sorting, joins, ACID support, and integration with BI tools, Druid, and Spark SQL.

With Apache Phoenix, HBase has a good set of SQL to start with, but it's nowhere near as mature or rich as Apache Hive's SQL.
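
To give a feel for the Phoenix side, here is a minimal sketch in Python that only builds a parameterized statement and never talks to a cluster. `UPSERT INTO ... VALUES` is Phoenix's write statement (it has no separate INSERT and UPDATE). The `EVENTS` table and its columns are hypothetical, and the `?` placeholders assume a JDBC-style driver.

```python
def phoenix_upsert(table, row):
    """Return (sql, params) for a parameterized Phoenix UPSERT.

    `row` maps column names to values. Table and column names are
    interpolated as-is, so they must come from trusted code, not user input.
    """
    columns = list(row)
    placeholders = ", ".join("?" for _ in columns)
    sql = f"UPSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    params = [row[c] for c in columns]
    return sql, params


# Hypothetical table and columns, for illustration only.
sql, params = phoenix_upsert(
    "EVENTS", {"ID": 42, "KIND": "click", "PAYLOAD": '{"x": 1}'}
)
print(sql)
print(params)
```

Running the same statement again with the same primary key overwrites the row instead of failing on a duplicate, which is what makes upserts handy for repeated loads.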

Apache HBase Pros:

  • Excels at huge sparse datasets

  • NoSQL store

  • Medium object storage

  • Key-value usage

  • Co-processors

  • UDF

  • Apache Phoenix for SQL

  • Upserts in Phoenix

  • Apache Spark Connector

  • Scans

Apache HBase Cons:

  • Needs richer SQL

  • Requires architecting access methods

  • Not for tiny data
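
The "architecting access methods" point mostly comes down to row key design: HBase keeps rows sorted by the bytes of the row key, so the key layout decides which scans are cheap. Below is a minimal sketch in Python that just builds and sorts keys locally. The device/timestamp fields, the `#` separator, and the 13-digit timestamp width are all assumptions made for illustration.

```python
MAX_MILLIS = 10**13 - 1  # assumption: epoch timestamps fit in 13 decimal digits


def row_key(device_id: str, ts_millis: int) -> bytes:
    """Composite key: device id, then a reversed, zero-padded timestamp.

    Subtracting from MAX_MILLIS makes newer readings sort first, and
    zero-padding keeps the byte order consistent with the numeric order.
    """
    reversed_ts = MAX_MILLIS - ts_millis
    return f"{device_id}#{reversed_ts:013d}".encode("utf-8")


# Sorting the byte strings mimics how HBase orders rows by key.
keys = sorted(
    row_key(device, ts)
    for device, ts in [("dev-b", 1_000), ("dev-a", 1_000), ("dev-a", 2_000)]
)
for key in keys:
    print(key)
```

With this layout, all rows for one device sit next to each other, so "latest readings for dev-a" becomes a short prefix scan rather than a full table scan.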

Apache Hive Pros:

  • Real SQL database

  • Massive datasets

  • ACID tables

  • BI tool integration

  • EDW use cases

  • HBase integration

  • Apache Hivemall for machine learning

  • Druid interactivity and integration

  • Strong Apache Spark support

  • Strong security integration

  • UDF

  • Various file formats on HDFS, including Apache ORC, Apache Parquet, CSV, and JSON

  • ACID merge

  • Hybrid procedural SQL on Hadoop (HPL/SQL)

Apache Hive Cons:

  • Not for key-value data

  • Not for tiny data; use an RDBMS

  • Not for people in ivory towers who just complain about other people's software but don't have any enterprise applications of their own

  • You need to run the latest LLAP version

  • Configuration is tricky if you are not using Apache Ambari

So, who wins? There was a time when I tried to use Apache Phoenix for everything, since its JDBC driver is really solid, makes it easy to load lots of data quickly, and makes for fast queries. It's also great for use cases where I used to reach for something like MongoDB, such as varying JSON data.

Apache Hive has the Apache Spark SQL integration and rich SQL that make it great for tabular data, and its Apache ORC format is amazing.

In most use cases, Apache Hive wins. For NoSQL, sparse data, and really high-end requirements, Apache HBase wins. The good news is that they both work well together on the same Hadoop cluster and utilize your massive HDFS store. I rarely see places where they don't use both. Use them both; if one doesn't work, use the other. The two together have solved every query and storage requirement that I have had for 100 different use cases in dozens of different enterprises.

How do you get the benefits of Apache HBase and still run Apache Hive queries?

Create external Apache Hive tables that point to HBase tables. You can then use those HBase tables in all of your Apache Hive applications.
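
As a rough sketch of what such a table definition can look like, the Python helper below only assembles the HiveQL text; it doesn't connect to Hive or HBase. The storage handler class and the `hbase.columns.mapping` and `hbase.table.name` properties come from Hive's HBase integration. The table names, the column family `d`, and the columns are made-up placeholders.

```python
def hive_over_hbase_ddl(hive_table, hbase_table, key_column, columns):
    """Build a CREATE EXTERNAL TABLE statement for a Hive table backed by HBase.

    key_column: (hive_column, hive_type) mapped to the HBase row key.
    columns: list of (hive_column, hive_type, "family:qualifier") tuples.
    """
    key_name, key_type = key_column
    col_defs = [f"  {key_name} {key_type}"]
    # The mapping is positional: one entry per Hive column, ":key" = row key.
    mapping = [":key"]
    for name, hive_type, hbase_column in columns:
        col_defs.append(f"  {name} {hive_type}")
        mapping.append(hbase_column)
    lines = [
        f"CREATE EXTERNAL TABLE {hive_table} (",
        ",\n".join(col_defs),
        ")",
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'",
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{','.join(mapping)}')",
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}')",
    ]
    return "\n".join(lines)


# Hypothetical names, for illustration only.
ddl = hive_over_hbase_ddl(
    hive_table="sensor_readings_hive",
    hbase_table="sensor_readings",
    key_column=("row_key", "STRING"),
    columns=[("temperature", "DOUBLE", "d:temp"), ("site", "STRING", "d:site")],
)
print(ddl)
```

Once a statement like this has been run in Hive, the HBase table can be queried and joined like any other Hive table, while the data itself stays in HBase.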

Created Jul 26, 2022 13:26:57

Good share

Created Jul 26, 2022 13:27:27

Thanks for sharing