What is ORC:
ORC is a type-aware, columnar file format for Hadoop. It is optimized for large streaming reads while still supporting fast lookup of individual rows. Storing data in columnar form lets the reader access, decompress, and process only the values a query needs. Because ORC files are type-aware, the writer chooses the most suitable encoding for each type and builds an internal index as the file is written.
Predicate pushdown uses these indexes to decide which stripes of a file need to be read for a query, and the row indexes can narrow the search to a particular set of 10,000 rows. ORC supports all Hive types, including the complex types: structs, lists, maps, and unions.
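The idea behind predicate pushdown can be illustrated with a small sketch. This is not ORC's actual reader code: the `RowGroup` class and both helper functions are hypothetical names made up for illustration, and only the 10,000-row stride comes from the text. The sketch records min/max values per group of rows and then skips every group whose range cannot contain the value being searched for.

```python
from dataclasses import dataclass
from typing import List

ROW_INDEX_STRIDE = 10_000  # rows covered by one index entry (figure from the text)


@dataclass
class RowGroup:
    first_row: int   # position of the group's first row within the file
    min_value: int   # smallest column value in the group
    max_value: int   # largest column value in the group


def build_row_groups(values: List[int], stride: int = ROW_INDEX_STRIDE) -> List[RowGroup]:
    """Record min/max statistics for every `stride` rows (hypothetical helper)."""
    groups = []
    for start in range(0, len(values), stride):
        chunk = values[start:start + stride]
        groups.append(RowGroup(start, min(chunk), max(chunk)))
    return groups


def groups_to_read(groups: List[RowGroup], target: int) -> List[RowGroup]:
    """Keep only the groups whose [min, max] range could contain `target`."""
    return [g for g in groups if g.min_value <= target <= g.max_value]


# Illustrative sorted column of 50,000 values: 0, 1, 2, ...
column = list(range(50_000))
index = build_row_groups(column)
candidates = groups_to_read(index, target=23_456)
print(len(index), [g.first_row for g in candidates])  # 5 [20000]
```

In this toy case only one of the five groups has to be scanned. The benefit depends on how the data is ordered: on an unsorted column, the min/max ranges overlap more and fewer groups can be skipped.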
ORC is widely used among Hadoop users. Facebook uses ORC to save petabytes in its data warehouse and has reported that it is faster than RCFile or Parquet. Yahoo also uses ORC and has published benchmark results. ORC files are divided into stripes of roughly 64 MB by default. The stripes in a file are independent of each other, which makes them the natural unit of distributed work. Within each stripe, the columns are stored separately so that the reader can read only the columns it needs. ORC was designed to overcome the limitations of the other Hive file formats, and using ORC files improves performance when Hive reads, writes, and processes data.
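As a rough illustration of why storing columns separately within each stripe helps, the following sketch models a stripe as a mapping from column name to a list of values. The functions `to_stripes` and `read_columns` and the sample rows are hypothetical; real ORC stripes are binary structures, not Python dictionaries.

```python
from typing import Dict, List

Row = Dict[str, object]


def to_stripes(rows: List[Row], rows_per_stripe: int) -> List[Dict[str, list]]:
    """Split rows into stripes, storing each column's values contiguously."""
    stripes = []
    for start in range(0, len(rows), rows_per_stripe):
        chunk = rows[start:start + rows_per_stripe]
        stripes.append({name: [r[name] for r in chunk] for name in chunk[0]})
    return stripes


def read_columns(stripe: Dict[str, list], wanted: List[str]) -> Dict[str, list]:
    """Touch only the requested columns; the others are never accessed."""
    return {name: stripe[name] for name in wanted}


# Illustrative data: five rows with three columns each.
rows = [{"id": i, "name": f"user{i}", "score": i * 10} for i in range(5)]
stripes = to_stripes(rows, rows_per_stripe=2)

# A query that needs only "score" never touches "id" or "name".
projected = [read_columns(s, ["score"]) for s in stripes]
print(projected)  # [{'score': [0, 10]}, {'score': [20, 30]}, {'score': [40]}]
```

Because each stripe is self-contained, the three stripes in this example could also be handed to three different workers.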
Some exceptional features of ORC:
ORC provides numerous advantages over RCFile, including:
· Each task produces a single output file, which reduces the load on the NameNode
· Support for Hive's datetime, decimal, and complex types (struct, list, map, and union)
· Lightweight indexes stored within the file, which allow skipping row groups that fail predicate filtering
· Compression based on data type: run-length encoding for integer columns and dictionary encoding for text columns
· Concurrent reads of the same file using separate RecordReaders
· Ability to split files without scanning for markers
· A bound on the amount of memory needed for reading or writing
· Metadata stored using Protocol Buffers, which allows fields to be added and removed
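The two type-specific encodings named in the list can be sketched in a few lines. These are deliberately simplified textbook versions, written to show the idea only; ORC's real encodings are more elaborate, and the function names here are hypothetical.

```python
from typing import Dict, List, Tuple


def run_length_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((v, 1))              # start a new run
    return runs


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Store each distinct string once and replace every row with an integer id."""
    ids: Dict[str, int] = {}
    encoded = [ids.setdefault(v, len(ids)) for v in values]
    return list(ids), encoded


print(run_length_encode([7, 7, 7, 3, 3, 9]))
# [(7, 3), (3, 2), (9, 1)]

print(dictionary_encode(["us", "de", "us", "us", "fr"]))
# (['us', 'de', 'fr'], [0, 1, 0, 0, 2])
```

Run-length encoding pays off when a column contains long runs of equal values, and dictionary encoding pays off when a text column has few distinct values relative to its row count.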
File structure:
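· An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer. A postscript at the end of the file holds the compression parameters and the size of the footer.
· The default stripe size is given as 250 MB in the Hive documentation, which differs from the 64 MB figure above, so the value likely depends on the version and configuration in use. Large stripe sizes enable large, efficient reads from HDFS.
· The file footer lists the stripes in the file, the number of rows per stripe, and each column's data type. It also holds the column-level aggregates count, min, max, and sum.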
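To make the footer contents more concrete, here is a minimal sketch of how the column-level aggregates (count, min, max, sum) could be accumulated per stripe and combined into file-level values. The `ColumnStats` class, the `merge` function, and the dictionary layout of `footer` are assumptions for illustration and do not mirror ORC's actual Protocol Buffers schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnStats:
    count: int
    minimum: int
    maximum: int
    total: int  # the "sum" aggregate


def stats_for(values: List[int]) -> ColumnStats:
    """Compute the four aggregates for one stripe's worth of column values."""
    return ColumnStats(len(values), min(values), max(values), sum(values))


def merge(stats: List[ColumnStats]) -> ColumnStats:
    """Combine per-stripe statistics into file-level aggregates."""
    return ColumnStats(
        count=sum(s.count for s in stats),
        minimum=min(s.minimum for s in stats),
        maximum=max(s.maximum for s in stats),
        total=sum(s.total for s in stats),
    )


# Illustrative integer column split across three stripes.
stripes = [[4, 8, 15], [16, 23], [42]]
per_stripe = [stats_for(rows) for rows in stripes]

# Hypothetical in-memory stand-in for the footer described above.
footer = {
    "rows_per_stripe": [s.count for s in per_stripe],
    "column_type": "int",
    "column_stats": merge(per_stripe),
}
print(footer["rows_per_stripe"], footer["column_stats"])
# [3, 2, 1] ColumnStats(count=6, minimum=4, maximum=42, total=108)
```

All four aggregates can be combined without revisiting the row data, which is what makes them cheap to maintain while writing.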
References:
https://orc.apache.org/docs/
https://cwiki.apache.org/confluence/display/hive/languagemanual+orc
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-orc