ORC vs Parquet

Latest reply: Jul 26, 2022 11:43:59

Both ORC and Parquet are widely used open-source columnar file formats in the Hadoop ecosystem. They are comparable in efficiency and speed, and both are designed primarily to accelerate big data analytics. Working with ORC files is about as straightforward as working with Parquet files, and both generally offer better read performance for analytical queries than row-based formats. Each has its own pros and cons, so which one is the better choice depends on the workload.

 

ORC

ORC (Optimized Row Columnar) files store data efficiently and were designed to overcome the limitations of earlier Hive file formats. The format stores data compactly and lets readers skip irrelevant data without large, complex, or manually maintained indexes.

An ORC file divides rows into groups called stripes, and within each stripe the data is laid out column by column. The file ends with a file footer holding file-level metadata, followed by a postscript that contains the compression parameters and the size of the compressed footer. The Hive documentation cites a stripe size of 250 MB, although the default orc.stripe.size setting in recent releases is 64 MB. Large stripes enable large, efficient reads from HDFS.
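To make the stripe idea concrete, here is a minimal sketch in plain Python. It is not the real ORC writer: the function name build_stripes and the rows_per_stripe parameter are made up for illustration, and real stripes are sized in bytes rather than in rows. It only shows how rows are grouped first and then stored column by column inside each group.

```python
from typing import Any

Row = dict[str, Any]


def build_stripes(rows: list[Row], rows_per_stripe: int) -> list[dict[str, list[Any]]]:
    """Group rows into stripes, then store each stripe column by column.

    Hypothetical helper for illustration only; real ORC sizes stripes in bytes.
    """
    stripes = []
    for start in range(0, len(rows), rows_per_stripe):
        chunk = rows[start:start + rows_per_stripe]
        # Pivot the chunk: one list of values per column name.
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        stripes.append(columns)
    return stripes


rows = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Lima"},
    {"id": 3, "city": "Pune"},
]
stripes = build_stripes(rows, rows_per_stripe=2)
print(stripes)
```

With three rows and two rows per stripe, the result is two stripes, and the values of each column sit next to each other within a stripe.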

Parquet

Parquet is an open-source columnar file format for Hadoop that can represent nested data structures in a flat, columnar layout. For analytical workloads it is generally more efficient in both storage and performance than the traditional row-oriented approach. It is particularly advantageous for queries that read only a few columns from a "wide" (many-column) table, because only the necessary columns are read and I/O is minimized.
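The benefit for wide tables can be sketched in a few lines of plain Python. This is only a toy model, not the Parquet reader: the table is a dictionary of column lists, and read_columns is a hypothetical helper that counts how many values a query has to touch.

```python
def read_columns(table, wanted):
    """Return only the requested columns and the number of values touched.

    Hypothetical helper; a dict of lists stands in for a columnar file.
    """
    touched = 0
    result = {}
    for name in wanted:
        result[name] = table[name]
        touched += len(table[name])
    return result, touched


# A "wide" table: 50 columns with 1000 values each.
wide_table = {f"col{i}": list(range(1000)) for i in range(50)}

selected, touched = read_columns(wide_table, ["col3", "col7"])
total = sum(len(values) for values in wide_table.values())
print(touched, total)  # 2000 50000
```

A query that needs two of the fifty columns touches 2,000 values instead of 50,000. A row-oriented layout would have to read every row in full to get the same two fields.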

 

Comparison of ORC and Parquet:

ORC was created as a successor to the Record Columnar File (RCFile) format, which originated at Facebook, and adds efficient columnar reads, predicate pushdown, and lazy reads. It replaces RCFile and is reported to reduce the size of relational data by up to 75%. Parquet was inspired by Google's Dremel paper and was developed by Cloudera and Twitter. It went through the Apache Incubator and is now a top-level Apache project.

ORC and Parquet are both prominent column-oriented big data file formats that store data by column. ORC is most tightly integrated with Hive, although other engines such as Spark can also read and write it, while Parquet is supported across most Hadoop ecosystem projects. As a rule of thumb, ORC is a better fit for Hive and Parquet works well with Apache Spark, which reads and writes Parquet by default.

Both formats carry built-in indexes and statistics, so neither requires manually maintained indexes, and both suit read-intensive jobs. ORC files are organized as independent stripes, each containing index data, row data, and a stripe footer. The file footer stores column-level aggregates: count, min, max, and sum. Parquet organizes data into row groups and column chunks, which are divided into pages; each page has a header, repetition levels, definition levels, and the encoded data.
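The count, min, max, and sum statistics are what let a reader skip data it does not need. The sketch below is a simplified model in plain Python, not ORC's actual reader: column_stats and scan_greater_than are hypothetical names, and each stripe is just a list of integers for one column.

```python
def column_stats(values):
    """Aggregates of the kind kept per column: count, min, max, sum."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
    }


def scan_greater_than(stripes, threshold):
    """Return values > threshold, reading only stripes that can contain a match."""
    matches = []
    stripes_read = 0
    for stats, values in stripes:
        if stats["max"] <= threshold:
            continue  # No value in this stripe can match, so skip it entirely.
        stripes_read += 1
        matches.extend(v for v in values if v > threshold)
    return matches, stripes_read


raw_stripes = [[1, 5, 9], [12, 15, 18], [21, 30, 27]]
# Statistics are computed once at write time and stored with the data.
stripes = [(column_stats(values), values) for values in raw_stripes]

matches, stripes_read = scan_greater_than(stripes, threshold=20)
print(matches, stripes_read)  # [21, 30, 27] 1
```

For the filter "value > 20", only the last stripe is read, because the stored max of the first two stripes already rules them out. This is the basic idea behind predicate pushdown.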

To summarize:
  • Parquet is better suited to storing nested data.
  • ORC is generally stronger at predicate pushdown.
  • ORC supports ACID transactions in Hive.
  • ORC often achieves better compression.
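Parquet's handling of nested data rests on the repetition and definition levels from the Dremel paper. The sketch below is a heavily simplified illustration for a single repeated field (a list of integers per record), not the real Parquet encoding: encode and decode are hypothetical helpers, repetition level 0 marks the start of a new record and 1 marks another value in the same record, and definition level 0 marks a record with no values.

```python
def encode(records):
    """Flatten lists into (repetition level, definition level, value) triples."""
    triples = []
    for record in records:
        if not record:
            # Empty list: nothing is defined, but the record must still be counted.
            triples.append((0, 0, None))
            continue
        for i, value in enumerate(record):
            rep = 0 if i == 0 else 1  # 0 starts a new record, 1 continues it.
            triples.append((rep, 1, value))
    return triples


def decode(triples):
    """Rebuild the original lists from the flat triples."""
    records = []
    for rep, definition, value in triples:
        if rep == 0:
            records.append([])
        if definition == 1:
            records[-1].append(value)
    return records


records = [[1, 2], [], [3]]
triples = encode(records)
print(triples)  # [(0, 1, 1), (1, 1, 2), (0, 0, None), (0, 1, 3)]
print(decode(triples) == records)  # True
```

The nested structure is stored as one flat column of values plus two small integer columns, and the original records can be rebuilt from them without any loss.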



alopez
MVE Created Jul 15, 2022 14:03:56

Good job, dear friend!

MahMush
MahMush Created Jul 30, 2022 17:36:53 (0) (0)
Thanks for appreciating, hope you find it useful.  
TuanNg
Created Jul 15, 2022 14:10:03

Good share

MahMush
MahMush Created Jul 18, 2022 05:29:14 (0) (0)
I hope the comparison of Parquet and ORC makes it clear which big data systems each format is more compatible with.
DienLg
Created Jul 15, 2022 14:18:53

Thanks for sharing

MahMush
MahMush Created Jul 30, 2022 17:37:46 (0) (0)
It's always a pleasure to share knowledge.
Saqibaz
Created Jul 15, 2022 15:29:23

Thanks for sharing

MahMush
MahMush Created Jul 18, 2022 05:31:39 (0) (0)
Nice to have your feedback  
wissal
MVE Created Jul 15, 2022 16:21:59

Important content

MahMush
MahMush Created Jul 30, 2022 17:39:13 (0) (0)
Glad to see your response.  
Saqib123
Moderator Created Jul 15, 2022 20:00:44

Thanks for Sharing

MahMush
MahMush Created Jul 18, 2022 05:31:07 (0) (0)
It's good to see that you liked it  
andersoncf1
MVE Author Created Jul 15, 2022 21:11:49

Well done

MahMush
MahMush Created Jul 30, 2022 17:38:32 (0) (0)
Thank you dear friend.  
Unicef
MVE Created Jul 16, 2022 02:40:27

THANKS Q

MahMush
MahMush Created Jul 18, 2022 05:30:38 (0) (0)
Glad to see the response  
MahMush
Moderator Author Created Jul 26, 2022 11:42:37

PARQUET is better at storing nested data. ORC is better at Predicate Pushdown. ORC has ACID properties. ORC is more efficient at compression.