Got it

ORC: some of its exceptional features

Latest reply: Jul 26, 2022 16:29:33 124 3 2 0 0

What is ORC:

ORC is a type-aware columnar Hadoop file format. It's optimized for massive streaming readings but has rapid row locating. Columnar data storage enables the reader access, decompress, and process only the needed values. Because ORC files are type-aware, the writer chooses the best encoding and generates an internal index.


Predicate pushdown uses indexes to select which file stripes to read for a query, and row indexes can restrict the search to 10,000 rows. ORC supports all Hive types, including structs, lists, maps, and unions.

ORC is popular among Hadoop users. Facebook employs ORC to save petabytes in their data warehouse and showed it's faster than RC File or Parquet. Yahoo employs ORC and has given benchmark results. Standard ORC file stripes are 64MB. Independent file stripes are the natural unit of distributed work. Each stripe's columns are divided so the reader can read only the needed ones. ORC files store Hive data efficiently. It was developed to overcome Hive file constraints. Hive reads, writes, and processes data faster using ORC files.


Some exceptional features of ORC:

ORC provides numerous advantages over RCFile, including:


·         Each process produces a single file, reducing NameNode load.

·         Datetime, decimal, and complex Hive types supported (struct, list, map, and union)

·         file-based indexes

·         Data type-based skip row groups that fail predicate filtering

·         Run-length encoding for integer columns Dictionary encoding for text columns Concurrent reads of the same file using distinct RecordReaders Split files without scanning for markers Bound memory needed for reading or writing Protocol Buffers metadata, which allows adding and removing fields

·         Filestructure

·         An ORC file contains stripes of row data and footer information. A postscript at the file's end contains compression parameters and footer size.

·         It's 250 MB by default. Large stripe sizes facilitate HDFS reads.

·         File footer lists stripes, rows per stripe, and column data types. Count, min, max, and total are column-level aggregates.






  • x
  • convention:

Moderator Author Created Jul 26, 2022 11:51:57

The ORC file format has the advantage of efficient compression, as it is stored as columns and compressed, resulting in smaller disc reads. In Tez, the columnar format is also ideal for vectorization optimizations.
View more
  • x
  • convention:

Moderator Created Jul 26, 2022 16:29:33

Its a nice post
View more
  • x
  • convention:

MahMush Created Jul 30, 2022 15:12:30 (0) (0)
happy to see your response.  


You need to log in to comment to the post Login | Register

Notice: To protect the legitimate rights and interests of you, the community, and third parties, do not release content that may bring legal risks to all parties, including but are not limited to the following:
  • Politically sensitive content
  • Content concerning pornography, gambling, and drug abuse
  • Content that may disclose or infringe upon others ' commercial secrets, intellectual properties, including trade marks, copyrights, and patents, and personal privacy
Do not share your account and password with others. All operations performed using your account will be regarded as your own actions and all consequences arising therefrom will be borne by you. For details, see " User Agreement."

My Followers

Login and enjoy all the member benefits


Are you sure to block this user?
Users on your blacklist cannot comment on your post,cannot mention you, cannot send you private messages.
Please bind your phone number to obtain invitation bonus.
Information Protection Guide
Thanks for using Huawei Enterprise Support Community! We will help you learn how we collect, use, store and share your personal information and the rights you have in accordance with Privacy Policy and User Agreement.