Avro is a row-based data serialization format hosted by the Apache Software Foundation. An Avro file consists of a header containing the schema, serialized in JSON, followed by the data, encoded in either JSON or binary. Most applications store data in the binary encoding because it is smaller and faster to process. The schema thus remains machine-interpretable as well as human-readable, while the data itself stays compact. Another key feature is that Avro binary files are compressible and splittable.
Avro is particularly well suited to use cases requiring schema evolution. Because every file carries the schema it was written with, the schema can be modified over time: different versions coexist, and conflicts between them are resolved through Avro's schema resolution rules. This is useful for managing data quality in stream processing applications such as Kafka, where consumers can adapt to the schema currently available. In addition, consumers and Hadoop MapReduce tasks can take advantage of the splittability of the binary files for parallel processing.
The supported data types are:
- Primitive: null, boolean, int, long, float, double, bytes, and string.
- Complex: arrays, enums, fixed, maps, records, and unions.
Avro can also be used for remote procedure calls (RPC), with the schemas exchanged during the connection handshake. The compressibility of the files further improves the efficiency of both data exchange and storage.