Apache Avro

Avro is a row-based data serialization format hosted by the Apache Software Foundation. An Avro file consists of a header, which contains the schema serialized as JSON, followed by the data. The data itself can be encoded either as JSON or in binary. Most applications store data in the binary encoding for performance reasons: it is smaller and faster to process. The schema thus remains both machine-interpretable and human-readable, while the data stays highly optimized. Another key feature is that Avro binary files are compressible and splittable.
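
As a minimal sketch, assuming the third-party fastavro library and an invented User schema, the snippet below writes a few records to an Avro container file with a deflate codec and reads them back; the schema travels as JSON in the file header while the records are binary encoded.

```python
# Minimal sketch using the third-party fastavro library (pip install fastavro).
from fastavro import writer, reader, parse_schema

# The schema is plain JSON: human-readable, yet parsed by machines.
schema = parse_schema({
    "type": "record",
    "name": "User",
    "namespace": "example.avro",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["null", "int"], "default": None},
    ],
})

records = [
    {"name": "Alyssa", "favorite_number": 256},
    {"name": "Ben", "favorite_number": None},
]

# Binary encoding with a deflate codec: the file stays compact and splittable.
with open("users.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")

# The schema is embedded in the file header, so reading needs no external schema.
with open("users.avro", "rb") as f:
    for record in reader(f):
        print(record)
```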

Avro is particularly well suited for use cases requiring schema evolution. It supports dynamic typing of the data, as the schema can be modified over time. Different versions of the schema are preserved, which allows schema conflicts to be resolved. This is useful for managing data quality in stream processing applications such as Kafka: consumers can adapt to the schema currently available. In addition, consumers and Hadoop MapReduce tasks can take advantage of the splittability of the binary files for parallel processing.
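
Below is a hedged sketch of schema evolution, continuing the hypothetical users.avro file from the previous example and again assuming the fastavro library: the consumer declares a newer reader schema with an added email field and a default value, and the library resolves it against the writer's schema stored in the file header.

```python
# Hypothetical evolution: the consumer adds an "email" field with a default
# and can still read files written with the older schema (assumes fastavro).
from fastavro import parse_schema, reader

new_schema = parse_schema({
    "type": "record",
    "name": "User",
    "namespace": "example.avro",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["null", "int"], "default": None},
        {"name": "email", "type": ["null", "string"], "default": None},  # new field
    ],
})

with open("users.avro", "rb") as f:
    # The writer's schema comes from the file header; the reader's schema is
    # resolved against it, and the missing "email" field takes its default.
    for record in reader(f, reader_schema=new_schema):
        print(record)
```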

The supported data types are the following (an example schema combining several of them is shown after the list):

  • Primitive: null, boolean, int, long, float, double, bytes, and string.
  • Complex: arrays, enums, fixed, maps, records, and unions.
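
As an illustration of the complex types listed above, the sketch below parses a hypothetical schema that combines fixed, enum, array, map, and union fields (again assuming the fastavro library; names and sizes are invented for the example).

```python
# Hypothetical schema combining several Avro complex types (assumes fastavro).
from fastavro import parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Sensor",
    "namespace": "example.avro",
    "fields": [
        # fixed: a fixed-size byte sequence, here a 16-byte identifier
        {"name": "id", "type": {"type": "fixed", "name": "SensorId", "size": 16}},
        # enum: a closed set of symbols
        {"name": "status", "type": {"type": "enum", "name": "Status",
                                    "symbols": ["OK", "DEGRADED", "FAILED"]}},
        # array and map of primitive values
        {"name": "readings", "type": {"type": "array", "items": "double"}},
        {"name": "labels", "type": {"type": "map", "values": "string"}},
        # union: the field may be null or a string
        {"name": "comment", "type": ["null", "string"], "default": None},
    ],
})
```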

Avro can also be used to exchange data over RPC, the schema being shared when the connection is established. The compressibility of the files further increases the efficiency of data exchange and storage.
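
Avro RPC negotiates the protocol schemas during a connection handshake; a full RPC example is out of scope here, but the core idea of encoding a datum without embedding its schema, because both sides already share it, can be sketched with fastavro's schemaless encoding. The schema and payload below are illustrative assumptions, not the Avro RPC wire protocol.

```python
# Sketch of exchanging a single datum when both sides already share the schema,
# e.g. agreed upon once at connection time (assumes fastavro).
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

# Schema known to both peers (hypothetical example).
schema = parse_schema({
    "type": "record",
    "name": "Ping",
    "namespace": "example.avro",
    "fields": [{"name": "message", "type": "string"}],
})

# Sender: encode the datum alone, without file header or embedded schema.
buf = io.BytesIO()
schemaless_writer(buf, schema, {"message": "hello"})
payload = buf.getvalue()  # compact binary payload sent over the wire

# Receiver: decode with the same shared schema.
print(schemaless_reader(io.BytesIO(payload), schema))
```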

