Data Lake
A Data Lake is a central repository that collects data from various sources, where the emphasis is on storing data quickly and at low cost, at the expense of a well-defined structure.
A wide variety of data can be stored in a Data Lake, such as structured data (like the rows and columns of a classical RDBMS), semi-structured data (CSV, XML and JSON files), and unstructured data (images, videos, emails, web pages…).
In a Data Lake, the data is stored in its raw format, untouched, which keeps it flexible for later use. Data Lakes are, in general, a solid basis for data preparation, reporting, visualization, in-depth analysis, data science and machine learning.
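As a rough illustration of storing data "raw and untouched" and applying structure only later, here is a minimal sketch that uses a local directory as a stand-in for the lake. The function names (`land_raw`, `read_json_events`), the `raw/<source>/<date>/` layout and the sample payloads are assumptions made for this example, not a convention prescribed by any particular platform.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def land_raw(lake_root: Path, source: str, filename: str, payload: bytes, day: date) -> Path:
    """Write the payload byte-for-byte under <root>/raw/<source>/<YYYY-MM-DD>/ (hypothetical layout)."""
    target_dir = lake_root / "raw" / source / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing, no validation: stored as received
    return target


def read_json_events(path: Path) -> list[dict]:
    """Apply structure only at read time (schema on read) to a JSON Lines file."""
    text = path.read_text(encoding="utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# A temporary directory stands in for the object store or HDFS path.
lake = Path(tempfile.mkdtemp())
day = date(2022, 6, 20)

# Semi-structured data (JSON Lines) and unstructured data (arbitrary bytes)
# land side by side without any transformation.
events = b'{"user": "alice", "action": "login"}\n{"user": "bob", "action": "logout"}\n'
events_path = land_raw(lake, "web", "events.jsonl", events, day)
image_path = land_raw(lake, "camera", "frame.bin", bytes([0, 255, 16]), day)

# The structure is imposed later, by whichever consumer needs it.
parsed = read_json_events(events_path)
```

Because ingestion never interprets the payload, the same landing function handles every kind of data. The cost is that each consumer has to know, or discover, how to read what was stored.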
- Learn more: Wikipedia
Related articles

Architecture of object-based storage and S3 standard specifications
Categories: Big Data, Data Governance | Tags: Database, API, Amazon S3, Big Data, Data Lake, Storage
Object storage has been growing in popularity among data storage architectures. Compared to file systems and block storage, object storage faces no limitations when handling petabytes of data. By…
By Luka BIGOT
Jun 20, 2022

Comparison of database architectures: data warehouse, data lake and data lakehouse
Categories: Big Data, Data Engineering | Tags: Data Governance, Infrastructure, Iceberg, Parquet, Spark, Data Lake, Data Warehouse, File Format
Database architectures have experienced constant innovation, evolving with the appearance of new use cases, technical constraints, and requirements. From the three database structures we are comparing…
By Gonzalo ETSE
May 17, 2022

An overview of Cloudera Data Platform (CDP)
Categories: Big Data, Cloud Computing, Data Engineering | Tags: SDX, Data Analytics, Big Data, Cloud, Cloudera, CDP, CDH, Data Hub, Data Lake, Data Warehouse
Cloudera Data Platform (CDP) is a cloud computing platform for businesses. It provides integrated and multifunctional self-service tools in order to analyze and centralize data. It brings security and…
Jul 19, 2021

Self-Paced training from Databricks: a guide to self-enablement on Big Data & AI
Categories: Data Engineering, Learning | Tags: Cloud, Data Lake, Databricks, Delta Lake, MLflow
Self-paced training courses are offered by Databricks as part of their Academy program. The price is $2,000 USD for unlimited access to the training courses for a period of 1 year, but also free for customers…
May 26, 2021

Storage size and generation time in popular file formats
Categories: Data Engineering, Data Science | Tags: Hive, ORC, Avro, HDFS, Parquet, Big Data, Data Lake, File Format, JavaScript Object Notation (JSON)
Choosing an appropriate file format is essential, whether your data transits on the wire or is stored at rest. Each file format comes with its own advantages and disadvantages. We covered them in a…
Mar 22, 2021

Connecting to ADLS Gen2 from Hadoop (HDP) and Nifi (HDF)
Categories: Big Data, Cloud Computing, Data Engineering | Tags: NiFi, Hadoop, HDFS, Authentication, Authorization, Azure, Azure Data Lake Storage (ADLS), OAuth2
As data projects built in the Cloud are becoming more and more frequent, a common use case is to interact with Cloud storage from an existing on-premises Big Data platform. Microsoft Azure recently…
Nov 5, 2020

Download datasets into HDFS and Hive
Categories: Big Data, Data Engineering | Tags: Hive, Business intelligence, Data Analytics, Data Engineering, Data structures, Database, Hadoop, HDFS, Big Data, Data Lake, Data Warehouse
Nowadays, the analysis of large amounts of data is becoming more and more possible thanks to Big Data technologies (Hadoop, Spark…). This explains the explosion of the data volume and the…
By Aida NGOM
Jul 31, 2020

Snowflake, the Data Warehouse for the Cloud, introduction and tutorial
Categories: Business Intelligence, Cloud Computing | Tags: Cloud, Data Lake, Data Science, Data Warehouse, Snowflake
Snowflake is a SaaS-based data-warehousing platform that centralizes, in the cloud, the storage and processing of structured and semi-structured data. The increasing generation of data produced over…
Apr 7, 2020

Cloudera CDP and Cloud migration of your Data Warehouse
Categories: Big Data, Cloud Computing | Tags: Azure, Cloudera, Data Hub, Data Lake, Data Warehouse
While one of our customers is anticipating a move to the Cloud and with the recent announcement of Cloudera CDP availability in mid-September during the Strata conference, it seems like the appropriate…
By David WORMS
Dec 16, 2019

Innovation, project vs product culture in Data Science
Categories: Data Science, Data Governance | Tags: DevOps, Agile, Scrum
Data Science carries the jobs of tomorrow. It is closely linked to the understanding of the business use cases, the behaviors and the insights that will be extracted from existing data. The stakes are…
By David WORMS
Oct 8, 2019

Data Lake ingestion best practices
Categories: Big Data, Data Engineering | Tags: Hive, NiFi, ORC, Data Governance, HDF, Operation, Protocol Buffers, Avro, Spark, Data Lake, File Format, Registry, Schema
Creating a Data Lake requires rigor and experience. Here are some good practices around data ingestion both for batch and stream architectures that we recommend and implement with our customers…
By David WORMS
Jun 18, 2018

Oracle and Hive, how data are published?
Categories: Big Data | Tags: Hive, Sqoop, Oracle, Data Lake
In the past few days, I’ve published 3 related articles: a first one covering the option to integrate Oracle and Hadoop, a second one explaining how to install and use the Oracle SQL Connector with…
By David WORMS
Jul 6, 2013