Apache ORC

ORC (Optimized Row Columnar) is an open-source, column-oriented data storage format from the Apache Hadoop ecosystem. It is comparable to Parquet and RCFile, and was created by Hortonworks, in collaboration with Facebook, one month before Parquet. It is highly optimized for reading, writing, and processing data in Hive.
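As a minimal sketch of how ORC is typically consumed, the PySpark snippet below writes a small DataFrame as ORC files and reads them back; the session, paths, and table name are placeholders chosen for illustration.

    from pyspark.sql import SparkSession

    # A local session is enough to experiment; production setups would
    # usually enable Hive support to register tables in the metastore.
    spark = SparkSession.builder.appName("orc-demo").getOrCreate()

    df = spark.createDataFrame(
        [(1, "alice", 42.0), (2, "bob", 17.5)],
        ["id", "name", "score"],
    )

    # Write the DataFrame as ORC files, then read them back.
    df.write.mode("overwrite").orc("/tmp/events_orc")
    spark.read.orc("/tmp/events_orc").show()

    # The same format can back a Hive-style table.
    df.write.mode("overwrite").format("orc").saveAsTable("events")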

The structure of an ORC file includes stripes and a footer.

Stripes: Group the data into chunks of rows.

  • Index data: Stored per column. Keeps the min and max values of each column and the row positions within each column, which helps locate the stripes and row groups holding the requested data.
  • Row data: The actual data of the file, also stored as columns.
  • Stripe footer: Contains a directory of stream (serialized data) locations.

Footer: Collects general file information.

  • Metadata: Various statistical information about the columns at stripe level. This enables input split elimination based on predicate push-down, which is evaluated for each stripe.
  • File footer: Contains the list of stripes, the number of rows per stripe, the data type of each column, and column-level aggregates (min, max, and sum).
  • Postscript: Contains the lengths of the file footer and metadata sections, the version of the file, the general compression used (none, zlib, snappy, etc.), and the size of the compressed footer.
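These sections are exposed programmatically by most ORC readers. As a sketch, assuming a recent pyarrow build with ORC support (attribute names may vary across versions), the footer- and postscript-level information of a freshly written file can be inspected as follows; the path is a placeholder.

    import pyarrow as pa
    import pyarrow.orc as orc

    # Write a small table so there is a file to inspect.
    table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    orc.write_table(table, "/tmp/sample.orc")

    # ORCFile surfaces information read from the footer and postscript.
    f = orc.ORCFile("/tmp/sample.orc")
    print(f.schema)       # column types recorded in the file footer
    print(f.nrows)        # total number of rows in the file
    print(f.nstripes)     # number of stripes
    print(f.compression)  # general compression declared in the postscript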

The default stripe size is 250 MB. Large stripe sizes enable large, efficient sequential reads from HDFS.
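The stripe size can be tuned at write time. Below is a sketch using pyarrow's ORC writer, assuming its stripe_size and compression keyword arguments (present in recent versions; names may differ in older releases); the values are purely illustrative.

    import pyarrow as pa
    import pyarrow.orc as orc

    table = pa.table({"id": list(range(1_000_000))})

    # stripe_size is expressed in bytes: 256 MiB here, in the same order
    # of magnitude as the 250 MB default mentioned above. Larger stripes
    # favor long sequential reads; smaller stripes give finer pruning.
    orc.write_table(
        table,
        "/tmp/big.orc",
        stripe_size=256 * 1024 * 1024,
        compression="zlib",
    )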

This format includes support for ACID transactions, built-in indexes, and all of Hive's types: structs, lists, maps, and unions. It is efficient for Business Intelligence workloads and improves read, write, and processing performance in Hive.
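As an illustration of the nested type support, the minimal PySpark sketch below round-trips a struct, a list, and a map through ORC (unions are not covered, and ACID semantics are a Hive table-level feature not shown here); names and paths are placeholders.

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.appName("orc-types").getOrCreate()

    # One row mixing a struct, a list (array) and a map, three of the
    # nested types ORC stores natively.
    df = spark.createDataFrame([
        Row(
            user=Row(id=1, name="alice"),
            tags=["hadoop", "hive"],
            attrs={"team": "data", "lang": "fr"},
        )
    ])

    df.write.mode("overwrite").orc("/tmp/nested_orc")
    spark.read.orc("/tmp/nested_orc").printSchema()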

Projects using ORC include Hadoop, Spark, Arrow, Flink, Iceberg, Druid, Gobblin, and NiFi.
