Data Engineering
Data is the energy that fuels digital transformation. Developers consume it in their applications. Data Analysts explore, query, and share it. Data Scientists feed their algorithms with it. Data Engineers are responsible for building the value chain that covers the collection, cleaning, enrichment, and provisioning of data.
Handling scalability, guaranteeing data security and integrity, tolerating failures, processing data in batches or as continuous streams, validating schemas, publishing APIs, and selecting the formats, models, and databases suited to how the data is exposed are all responsibilities of the Data Engineer. The trust and success of those who consume and exploit the data stem from this work.
Articles related to Data Engineering

Apache Hop 101, quick tutorial to get started
Categories: Data Engineering | Tags: Data Engineering, DevOps, Learning and tutorial, Pipeline, Airflow, Hop, Iceberg, NiFi, Argo Workflows, Docker, Git
This hands-on tutorial walks through the creation of a project, pipeline, and workflow in Apache Hop. Building on the core concepts introduced in the previous article and using a Docker-based…
By Mori HUANG
May 26, 2026

Apache Hop 101, introduction and installation
Categories: Data Engineering | Tags: Data Engineering, DevOps, Learning and tutorial, Pipeline, Airflow, Hop, Iceberg, NiFi, Argo Workflows, Docker, Git
Apache Hop is an ETL (Extract, Transform, and Load) tool designed to make pipeline development intuitive, maintainable, and scalable. This article is part of a series of 2 articles: Apache Hop 10…
By Mori HUANG
May 10, 2026

CDP part 6: end-to-end data lakehouse ingestion pipeline with CDP
Categories: Big Data, Data Engineering, Learning | Tags: Business intelligence, Data Engineering, Iceberg, NiFi, Spark, Big Data, Cloudera, CDP, Data Analytics, Data Lake, Data Warehouse
In this hands-on lab session we demonstrate how to build an end-to-end big data solution with Cloudera Data Platform (CDP) Public Cloud, using the infrastructure we have deployed and configured over…
By Tobias CHAVARRIA
July 24, 2023

CDP part 1: introduction to end-to-end data lakehouse architecture with CDP
Categories: Cloud Computing, Data Engineering, Infrastructure | Tags: Data Engineering, Hortonworks, Iceberg, AWS, Azure, Big Data, Cloud, Cloudera, CDP, Cloudera Manager, Data Warehouse
Cloudera Data Platform (CDP) is a hybrid data platform for big data transformation, machine learning and data analytics. In this series we describe how to build and use an end-to-end big data…
By Stephan BAUM
June 8, 2023

Keycloak deployment in EC2
Categories: Cloud Computing, Data Engineering, Infrastructure | Tags: Security, EC2, Authentication, AWS, Docker, Keycloak, SSL/TLS, SSO
Why use Keycloak: Keycloak is an open-source identity provider (IdP) using single sign-on (SSO). An IdP is a tool to create, maintain, and manage identity information for principals and to provide…
By Stephan BAUM
March 14, 2023

Big data infrastructure internship
Categories: Big Data, Data Engineering, DevOps & SRE, Infrastructure | Tags: Infrastructure, Hadoop, Big Data, Cluster, Internship, Kubernetes, TDP
Job description: Big Data and distributed computing are at the core of Adaltas. We accompany our partners in the deployment, maintenance, and optimization of some of the largest clusters in France…
By Stephan BAUM
December 2, 2022

Comparison of database architectures: data warehouse, data lake and data lakehouse
Categories: Big Data, Data Engineering | Tags: Data Governance, Infrastructure, Iceberg, Parquet, Spark, Data Lake, Data lakehouse, Data Warehouse, File Format
Database architectures have experienced constant innovation, evolving with the appearance of new use cases, technical constraints, and requirements. From the three database structures we are comparing…
By Gonzalo ETSE
May 17, 2022

Databricks logs collection with Azure Monitor at a Workspace Scale
Categories: Cloud Computing, Data Engineering, Adaltas Summit 2021 | Tags: Metrics, Monitoring, Spark, Azure, Databricks, Log4j
Databricks is an optimized data analytics platform based on Apache Spark. Monitoring the Databricks platform is crucial to ensure data quality, job performance, and security issues by limiting access to…
By Claire PLAYE
May 10, 2022

An overview of Cloudera Data Platform (CDP)
Categories: Big Data, Cloud Computing, Data Engineering | Tags: SDX, Big Data, Cloud, Cloudera, CDP, CDH, Data Analytics, Data Hub, Data Lake, Data lakehouse, Data Warehouse
Cloudera Data Platform (CDP) is a cloud computing platform for businesses. It provides integrated and multifunctional self-service tools in order to analyze and centralize data. It brings security and…
July 19, 2021

Self-Paced training from Databricks: a guide to self-enablement on Big Data & AI
Categories: Data Engineering, Learning | Tags: Cloud, Data Lake, Databricks, Delta Lake, MLflow
Self-paced training courses are offered by Databricks through its Academy program. The price is USD 2,000 for unlimited access to the training courses for a period of 1 year, but also free for customers…
By Anna KNYAZEVA
May 26, 2021

Find your way into data related Microsoft Azure certifications
Categories: Cloud Computing, Data Engineering | Tags: Data Governance, Azure, Data Science
Microsoft Azure has certification paths for many technical job roles such as developer, Data Engineer, Data Scientist and solution architect among others. Each of these certifications consists of…
By Barthelemy NGOM
April 14, 2021

Apache Liminal: when MLOps meets GitOps
Categories: Big Data, Containers Orchestration, Data Engineering, Data Science, Tech Radar | Tags: Data Engineering, CI/CD, Data Science, Deep Learning, Deployment, Docker, GitOps, Kubernetes, Machine Learning, MLOps, Open source, Python, TensorFlow
Apache Liminal is open-source software that offers a solution for deploying end-to-end Machine Learning pipelines. It centralizes all the steps needed to construct Machine Learning…
By Aargan COINTEPAS
March 31, 2021

Storage size and generation time in popular file formats
Categories: Data Engineering, Data Science | Tags: Avro, HDFS, Hive, ORC, Parquet, Big Data, Data Lake, File Format, JavaScript Object Notation (JSON)
Choosing an appropriate file format is essential, whether your data transits on the wire or is stored at rest. Each file format comes with its own advantages and disadvantages. We covered them in a…
By Barthelemy NGOM
March 22, 2021

TensorFlow Extended (TFX): the components and their functionalities
Categories: Big Data, Data Engineering, Data Science, Learning | Tags: Beam, Data Engineering, Pipeline, CI/CD, Data Science, Deep Learning, Deployment, Machine Learning, MLOps, Open source, Python, TensorFlow
Putting Machine Learning (ML) and Deep Learning (DL) models in production certainly is a difficult task. It has been recognized as more failure-prone and time-consuming than the modeling itself, yet…
March 5, 2021

Connecting to ADLS Gen2 from Hadoop (HDP) and Nifi (HDF)
Categories: Big Data, Cloud Computing, Data Engineering | Tags: Hadoop, HDFS, NiFi, Authentication, Authorization, Azure, Azure Data Lake Storage (ADLS), OAuth2
As data projects built in the Cloud are becoming more and more frequent, a common use case is to interact with Cloud storage from an existing on-premises Big Data platform. Microsoft Azure recently…
By Gauthier LEONARD
November 5, 2020

Experiment tracking with MLflow on Databricks Community Edition
Categories: Data Engineering, Data Science, Learning | Tags: Spark, Databricks, Deep Learning, Delta Lake, Machine Learning, MLflow, Notebook, Python, Scikit-learn
Introduction to Databricks Community Edition and MLflow: Every day the number of tools helping Data Scientists to build models faster increases. Consequently, the need to manage the results and the…
September 10, 2020

Download datasets into HDFS and Hive
Categories: Big Data, Data Engineering | Tags: Business intelligence, Data Engineering, Data structures, Database, Hadoop, HDFS, Hive, Big Data, Data Analytics, Data Lake, Data lakehouse, Data Warehouse
Introduction: Nowadays, the analysis of large amounts of data is becoming more and more possible thanks to Big Data technologies (Hadoop, Spark,…). This explains the explosion of the data volume and the…
By Aida NGOM
July 31, 2020

Comparison of different file formats in Big Data
Categories: Big Data, Data Engineering | Tags: Business intelligence, Data structures, Avro, HDFS, ORC, Parquet, Batch processing, Big Data, CSV, JavaScript Object Notation (JSON), Kubernetes, Protocol Buffers
In data processing, there are different types of file formats to store your data sets. Each format has its own pros and cons depending upon the use cases and exists to serve one or several purposes…
By Aida NGOM
July 23, 2020

Importing data to Databricks: external tables and Delta Lake
Categories: Data Engineering, Data Science, Learning | Tags: Parquet, AWS, Amazon S3, Azure Data Lake Storage (ADLS), Databricks, Delta Lake, Python
During a Machine Learning project we need to keep track of the training data we are using. This is important for audit purposes and for assessing the performance of the models, developed at a later…
May 21, 2020

Optimization of Spark applications in Hadoop YARN
Categories: Data Engineering, Learning | Tags: Tuning, Hadoop, Spark, Python
Apache Spark is an in-memory data processing tool widely used in companies to deal with Big Data issues. Running a Spark application in production requires user-defined resources. This article…
March 30, 2020

MLflow tutorial: an open source Machine Learning (ML) platform
Categories: Data Engineering, Data Science, Learning | Tags: AWS, Azure, Databricks, Deep Learning, Deployment, Machine Learning, MLflow, MLOps, Python, Scikit-learn
Introduction and principles of MLflow: With increasingly cheap computing power and storage and at the same time increasing data collection in all walks of life, many companies integrated Data Science…
March 23, 2020

Logstash pipelines remote configuration and self-indexing
Categories: Data Engineering, Infrastructure | Tags: Docker, Elasticsearch, Kibana, Logstash, Log4j
Logstash is a powerful data collection engine that integrates into the Elastic Stack (Elasticsearch - Logstash - Kibana). The goal of this article is to show you how to deploy a fully managed Logstash…
December 13, 2019

Internship Data Science & Data Engineer - ML in production and streaming data ingestion
Categories: Data Engineering, Data Science | Tags: DevOps, Flink, Hadoop, HBase, Kafka, Spark, Internship, Kubernetes, Python
Context: The exponential evolution of data has turned the industry upside down by redefining data storage, processing and data ingestion pipelines. Mastering these methods considerably facilitates…
By David WORMS
November 26, 2019

Insert rows in BigQuery tables with complex columns
Categories: Cloud Computing, Data Engineering | Tags: GCP, BigQuery, Schema, SQL
Google’s BigQuery is a cloud data warehousing system designed to process enormous volumes of data with several features available. Out of all those features, let’s talk about the support of Struct…
By César BEREZOWSKI
November 22, 2019

Machine Learning model deployment
Categories: Big Data, Data Engineering, Data Science, DevOps & SRE | Tags: DevOps, Operation, AI, Cloud, Machine Learning, MLOps, On-premises, Schema
“Enterprise Machine Learning requires looking at the big picture […] from a data engineering and a data platform perspective,” lectured Justin Norman during the talk on the deployment of Machine…
By Oskar RYNKIEWICZ
September 30, 2019

Spark Streaming part 4: clustering with Spark MLlib
Categories: Data Engineering, Data Science, Learning | Tags: Spark, Apache Spark Streaming, Big Data, Clustering, Machine Learning, Scala, Streaming
Spark MLlib is an Apache Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, the Spark framework can serve as a platform for…
By Oskar RYNKIEWICZ
June 27, 2019

Spark Streaming part 3: DevOps, tools and tests for Spark applications
Categories: Big Data, Data Engineering, DevOps & SRE | Tags: DevOps, Learning and tutorial, Spark, Apache Spark Streaming
Whenever services are unavailable, businesses experience large financial losses. Spark Streaming applications can break, like any other software application. A streaming application operates on data…
By Oskar RYNKIEWICZ
May 31, 2019
