File Format

A file format is a way of representing and organizing data inside a file. The format is often associated with, and identified by, a file extension (e.g. .csv). This representation allows software that supports the format to decode the information contained in the file, and it enables interoperability between programs.

Choosing an appropriate file format is of paramount importance in data processing: depending on the use case, some formats are more suitable than others because of their specific characteristics. For example, CSV is easy to read and widely used even though it lacks a strict formal definition.
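
To make the point about CSV's lack of formalism concrete, here is a minimal sketch using only the Python standard library. The sample records and the helper names `roundtrip_csv` and `roundtrip_json` are hypothetical and chosen for illustration. The sketch writes the same records to CSV and to JSON, reads them back, and shows that CSV returns every value as a string while JSON preserves basic types.

```python
import csv
import io
import json

# Hypothetical sample records, used only for illustration.
records = [
    {"id": 1, "name": "alice", "active": True, "score": 9.5},
    {"id": 2, "name": "bob", "active": False, "score": 7.0},
]


def roundtrip_csv(rows):
    """Serialize rows to CSV in memory, then parse them back."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    buffer.seek(0)
    # CSV carries no type information: every value comes back as a string.
    return [dict(row) for row in csv.DictReader(buffer)]


def roundtrip_json(rows):
    """Serialize rows to JSON, then parse them back."""
    # JSON distinguishes numbers, booleans and strings.
    return json.loads(json.dumps(rows))


csv_rows = roundtrip_csv(records)
json_rows = roundtrip_json(records)

print("CSV :", csv_rows[0])
print("JSON:", json_rows[0])
```

After the CSV round trip, the reader has to decide how to convert each column back to a number or a boolean. A format that carries type information, or an explicit schema, removes that guesswork.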

Learn more
Wikipedia

Related articles

Comparison of database architectures: data warehouse, data lake and data lakehouse

Categories: Big Data, Data Engineering | Tags: Data Governance, Infrastructure, Iceberg, Parquet, Spark, Data Lake, Lakehouse, Data Warehouse, File Format

Database architectures have experienced constant innovation, evolving with the appearance of new use cases, technical constraints, and requirements. From the three database structures we are comparing…

By Gonzalo ETSE

May 17, 2022

CSV package for Node.js version 6

Categories: Node.js | Tags: Data Engineering, Refactoring, CSV, File Format, Release and features

Version 6 of the CSV package for Node.js is released along with its sub-projects. Here are the latest versions…

By David WORMS

Nov 15, 2021

Storage size and generation time in popular file formats

Categories: Data Engineering, Data Science | Tags: Avro, HDFS, Hive, ORC, Parquet, Big Data, Data Lake, File Format, JavaScript Object Notation (JSON)

Choosing an appropriate file format is essential, whether your data transits on the wire or is stored at rest. Each file format comes with its own advantages and disadvantages. We covered them in a…

By Barthelemy NGOM

Mar 22, 2021

Introduction to Ludwig and how to deploy a Deep Learning model via Flask

Categories: Data Science, Tech Radar | Tags: Learning and tutorial, Deep Learning, Ludwig Deep Learning Toolbox, Machine Learning, Python

Over the past decade, Machine Learning and deep learning models have proven to be very effective in performing a wide variety of tasks such as fraud detection, product recommendation, autonomous…

By Robert Walid SOARES

Mar 2, 2020

Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop

Categories: Data Engineering, Learning | Tags: Apache Spark Streaming, Spark, Python, Streaming

Spark can process streaming data on a multi-node Hadoop cluster relying on HDFS for the storage and YARN for the scheduling of jobs. Thus, Spark Structured Streaming integrates well with Big Data…

By Oskar RYNKIEWICZ

May 28, 2019

Data Lake ingestion best practices

Categories: Big Data, Data Engineering | Tags: NiFi, Data Governance, HDF, Operation, Avro, Hive, ORC, Spark, Data Lake, File Format, Protocol Buffers, Registry, Schema

Creating a Data Lake requires rigor and experience. Here are some good practices around data ingestion both for batch and stream architectures that we recommend and implement with our customers…

By David WORMS

Jun 18, 2018

State of the Hadoop open-source ecosystem in early 2013

Categories: Big Data | Tags: Flume, Mesos, Phoenix, Pig, Hadoop, Kafka, Mahout, Data Science

Hadoop is already a large ecosystem and my guess is that 2013 will be the year when it grows even larger. There are some pieces that we no longer need to present. ZooKeeper, HBase, Hive, Pig, Flume…

By David WORMS

Jul 8, 2013

Apache Hive Essentials How-to by Darren Lee

Categories: Business Intelligence, Learning | Tags: UDF, Hadoop, Hive, File Format, SQL

Recently, I’ve been asked to review a new book on Apache Hive called “Apache Hive Essentials How-to” (edit: the second edition is now available) written by Darren Lee and published by Packt Publishing…

By David WORMS

Apr 23, 2013

Convert .flac music files to .mp3 on osx

Categories: Hack | Tags: OS X, File Format

As a long-time OS X user, one should know by now that iTunes doesn’t support the FLAC format. It is now 2012, and I’ve been waiting for this to happen for years. Losing patience, dark…

By David WORMS

Jul 20, 2012

HDFS and Hive storage - comparing file formats and compression methods

Categories: Big Data | Tags: Business intelligence, Hive, ORC, Parquet, File Format

A few days ago, we conducted a test to compare various Hive file formats and compression methods. Among those file formats, some are native to HDFS and apply to all Hadoop users. The…

By David WORMS

Mar 13, 2012

Two Hive UDAF to convert an aggregation to a map

Categories: Data Engineering | Tags: Java, HBase, Hive, File Format

I am publishing two new Hive UDAFs to help with maps in Apache Hive. The source code is available on GitHub in two Java classes, “UDAFToMap” and “UDAFToOrderedMap”, or you can download the jar file. The…

By David WORMS

Mar 6, 2012

Timeseries storage in Hadoop and Hive

Categories: Data Engineering | Tags: CRM, timeseries, Tuning, Hadoop, HDFS, Hive, File Format

In the next few weeks, we will be exploring the storage and analytics of a large generated dataset. This dataset is composed of CRM tables associated with one time series table of about 7,000 billiard rows…

By David WORMS

Jan 10, 2012

Canada - Morocco - France

We are a team of Open Source enthusiasts doing consulting in Big Data, Cloud, DevOps, Data Engineering, Data Science…

We provide our customers with accurate insights on how to leverage technologies to turn their use cases into production projects, reduce their costs, and shorten their time to market.

If you enjoy reading our publications and have an interest in what we do, contact us and we will be thrilled to cooperate with you.

Support Ukraine