Apache ORC

ORC (Optimized Row Columnar) is an open-source, column-oriented data storage format in the Apache Hadoop ecosystem. It is comparable to Parquet and RCFile, and was announced by Hortonworks in collaboration with Facebook one month before Parquet. It is highly optimized for reading, writing, and processing data in Hive.

An ORC file is made up of stripes and a footer.

Stripes: Group the file's data into chunks. Each stripe holds:

  • Index data: Stored as columns. Holds the min and max values for each column and the row positions within each column, which helps locate the stripes and row groups that contain the required data.
  • Row data: The actual data of the file, also stored as columns.
  • Stripe footer: Contains a directory of stream (serialized data) locations.
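
To make the stripe layout concrete, here is a minimal in-memory sketch. It is not the real ORC encoding: it only splits rows into stripes, stores each stripe's row data column by column, and records the min and max of each column as index data. The names `build_stripes` and `rows_per_stripe` are hypothetical, and real ORC sizes stripes in bytes rather than in row counts.

```python
from typing import Any


def build_stripes(rows: list[dict[str, Any]], rows_per_stripe: int) -> list[dict]:
    """Toy model of ORC stripes: column-wise row data plus min/max index data."""
    stripes = []
    for start in range(0, len(rows), rows_per_stripe):
        chunk = rows[start:start + rows_per_stripe]
        # Row data: one list of values per column instead of one dict per row.
        row_data = {col: [row[col] for row in chunk] for col in chunk[0]}
        # Index data: min and max per column (real ORC also keeps row positions).
        index_data = {col: (min(vals), max(vals)) for col, vals in row_data.items()}
        stripes.append({"index": index_data, "rows": row_data, "first_row": start})
    return stripes


# Illustrative data only: six rows with two integer columns.
rows = [{"id": i, "price": 10 * i} for i in range(1, 7)]
stripes = build_stripes(rows, rows_per_stripe=3)
print(stripes[0]["rows"])   # {'id': [1, 2, 3], 'price': [10, 20, 30]}
print(stripes[1]["index"])  # {'id': (4, 6), 'price': (40, 60)}
```

Keeping the values of one column together is what lets a reader fetch only the columns a query needs, and the small index entry is enough to decide whether a stripe is worth reading at all.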

Footer: Collects general file information.

  • Metadata: Statistical information about the columns at stripe level. It enables input split elimination based on predicate push-down, which is evaluated for each stripe.
  • File footer: Contains the list of stripes, the number of rows per stripe, the data type of each column, and column-level aggregates such as min, max, and sum.
  • Postscript: Contains the lengths of the file footer and metadata, the version of the file, the general compression used (none, zlib, snappy, etc.), and the size of the compressed footer.
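
The stripe elimination that the metadata makes possible can be sketched as follows. This is an illustration under stated assumptions, not ORC's reader code: `StripeStats` and `stripes_to_read` are hypothetical names, the statistics cover a single integer column, and the predicate is a simple range filter `low <= x <= high`.

```python
from dataclasses import dataclass


@dataclass
class StripeStats:
    """Hypothetical stand-in for one column's statistics in one stripe."""
    stripe_id: int
    min_value: int
    max_value: int


def stripes_to_read(stats: list[StripeStats], low: int, high: int) -> list[int]:
    """Return ids of stripes whose [min, max] range can overlap low <= x <= high."""
    return [
        s.stripe_id
        for s in stats
        if s.max_value >= low and s.min_value <= high
    ]


# Illustrative statistics for three stripes of a sorted column.
stats = [
    StripeStats(0, min_value=1, max_value=100),
    StripeStats(1, min_value=101, max_value=200),
    StripeStats(2, min_value=201, max_value=300),
]
selected = stripes_to_read(stats, low=150, high=220)
print(selected)  # [1, 2]
```

Stripe 0 is skipped without being read because its maximum lies below the filter range. The statistics can only rule stripes out: a selected stripe may still contain no matching rows, so the filter is applied again to the rows that are actually read.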

The default stripe size is 250 MB. Large stripes enable large, efficient reads from HDFS.

The format supports ACID transactions, built-in indexes, and all of Hive's types, including the complex types: structs, lists, maps, and unions. It is efficient for business intelligence workloads and improves read, write, and processing performance in Hive.

Projects using ORC include Hadoop, Spark, Arrow, Flink, Iceberg, Druid, Gobblin, and NiFi.

