Data Engineering
Data is the energy that feeds digital transformation. Developers consume it in their applications. Data Analysts search, query, and share it. Data Scientists feed their algorithms with it. Data Engineers are responsible for building the value chain that covers the collection, cleaning, enrichment, and provision of data.
Managing scalability, ensuring data security and integrity, providing fault tolerance, processing batch and streaming data, validating schemas, publishing APIs, and selecting the formats, models, and databases appropriate for how the data is exposed are the responsibilities of the Data Engineer. The trust and success of those who consume and exploit the data derive from this work.

Articles related to Data Engineering
Connecting to ADLS Gen2 from Hadoop (HDP) and Nifi (HDF)
Categories: Big Data, Cloud Computing, Data Engineering | Tags: HDFS, NiFi, Authentication, Authorization, Hadoop, Azure Data Lake Storage (ADLS), Azure, OAuth2
As data projects built in the Cloud become more and more frequent, a common use case is to interact with Cloud storage from an existing on-premises Big Data platform. Microsoft Azure recently…
Nov 5, 2020
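As a minimal sketch of what such a connection can involve on the Hadoop side, here is a core-site.xml fragment for the ABFS driver using an OAuth2 service principal; the tenant, client ID, and secret are placeholders, and the article's actual setup may differ:

```xml
<!-- Minimal ABFS OAuth2 configuration (all values are placeholders) -->
<property>
  <name>fs.azure.account.auth.type</name>
  <value>OAuth</value>
</property>
<property>
  <name>fs.azure.account.oauth.provider.type</name>
  <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
</property>
<property>
  <name>fs.azure.account.oauth2.client.id</name>
  <value>CLIENT_ID</value>
</property>
<property>
  <name>fs.azure.account.oauth2.client.secret</name>
  <value>CLIENT_SECRET</value>
</property>
<property>
  <name>fs.azure.account.oauth2.client.endpoint</name>
  <value>https://login.microsoftonline.com/TENANT_ID/oauth2/token</value>
</property>
```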
Categories: Data Engineering, Data Science, Learning | Tags: Spark, Deep Learning, Databricks, Delta Lake, Machine Learning, MLflow, Notebook, Python, Scikit-learn
Introduction to Databricks Community Edition and MLflow: Every day the number of tools helping Data Scientists to build models faster increases. Consequently, the need to manage the results and the…
Sep 10, 2020
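A minimal sketch of the MLflow tracking API the article builds on; the experiment name, parameter, and metric below are illustrative:

```python
import mlflow

# Record the parameters and metrics of a training run
# (names and values are illustrative)
mlflow.set_experiment("demo-experiment")
with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("rmse", 0.27)
```

Each run is then browsable in the MLflow UI, which is what makes results comparable across experiments.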
Download datasets into HDFS and Hive
Categories: Big Data, Data Engineering | Tags: Analytics, HDFS, Hive, Big Data, Data Analytics, Data Engineering, Data structures, Database, Hadoop, Data Lake, Data Warehouse
Introduction: Nowadays, analyzing large amounts of data is becoming more and more possible thanks to Big Data technologies (Hadoop, Spark,…). This explains the explosion of the data volume and the…
By Aida NGOM
Jul 31, 2020
Comparison of different file formats in Big Data
Categories: Big Data, Data Engineering | Tags: Analytics, Avro, HDFS, Hive, Kafka, MapReduce, ORC, Spark, Batch processing, Big Data, CSV, Data Analytics, Data structures, Database, JSON, Protocol Buffers, Hadoop, Parquet, Kubernetes, XML
In data processing, there are different file formats to store your data sets. Each format has its own pros and cons depending on the use case, and exists to serve one or several purposes…
By Aida NGOM
Jul 23, 2020
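A hedged PySpark sketch contrasting two of the formats discussed; the paths are hypothetical, and Avro requires the external spark-avro module:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-comparison").getOrCreate()

# Source CSV: human-readable but schema-less and slow to scan (path hypothetical)
df = spark.read.option("header", True).csv("hdfs:///data/events.csv")

# Parquet: columnar and compressed, well suited to analytical scans
df.write.mode("overwrite").parquet("hdfs:///data/events_parquet")

# Avro: row-oriented with an embedded schema, well suited to record exchange
# (requires the spark-avro module on the classpath)
df.write.mode("overwrite").format("avro").save("hdfs:///data/events_avro")
```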
Categories: Data Engineering, Data Science, Learning | Tags: Parquet, AWS, Amazon S3, Azure Data Lake Storage (ADLS), Databricks, Delta Lake, Python
During a Machine Learning project, we need to keep track of the training data we are using. This is important for audit purposes and for assessing the performance of models developed at a later…
May 21, 2020
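A minimal sketch of how Delta Lake's time travel can pin down the exact training set a model saw; the paths are hypothetical and the Delta Lake package must be on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-versioning").getOrCreate()

# Persist the training set as a Delta table (paths hypothetical)
df = spark.read.parquet("/data/training_set")
df.write.format("delta").mode("overwrite").save("/delta/training_set")

# Later, reload the exact snapshot a given model was trained on
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/delta/training_set")
```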
Categories: Data Engineering, Learning | Tags: Spark, Tuning, Hadoop, Python
Apache Spark is an in-memory data processing tool widely used in companies to deal with Big Data issues. Running a Spark application in production requires user-defined resources. This article…
Mar 30, 2020
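As an illustration of user-defined resources, a hedged PySpark sketch with purely illustrative sizing values; real values must be derived from the cluster capacity and the workload:

```python
from pyspark.sql import SparkSession

# Fixed resource sizing for a production job (values are illustrative)
spark = (
    SparkSession.builder
    .appName("tuned-app")
    .config("spark.executor.instances", "8")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.memoryOverhead", "1g")
    .config("spark.dynamicAllocation.enabled", "false")
    .getOrCreate()
)
```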
Categories: Data Engineering, Data Science, Learning | Tags: Deep Learning, AWS, Databricks, Deployment, Machine Learning, Azure, MLflow, MLOps, Python, Scikit-learn
Introduction and principles of MLflow: With increasingly cheaper computing power and storage, and at the same time increasing data collection in all walks of life, many companies integrated Data Science…
Mar 23, 2020
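A hedged sketch of MLflow's model packaging, one of the principles the article covers; the toy data and model are illustrative, not the article's own:

```python
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

with mlflow.start_run():
    model = LinearRegression().fit(X, y)
    # Package the model with its environment so it can be served later,
    # e.g. `mlflow models serve -m runs:/<run_id>/model`
    mlflow.sklearn.log_model(model, "model")
```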
Logstash pipelines remote configuration and self-indexing
Categories: Data Engineering, Infrastructure | Tags: Docker, Elasticsearch, Kibana, Logstash, Log4j
Logstash is a powerful data collection engine that integrates into the Elastic Stack (Elasticsearch - Logstash - Kibana). The goal of this article is to show you how to deploy a fully managed Logstash…
Dec 13, 2019
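As a hedged sketch of the remote configuration side, these logstash.yml settings make Logstash fetch its pipeline definitions from Elasticsearch through X-Pack centralized pipeline management; hosts and credentials are placeholders:

```yaml
# logstash.yml: pull pipelines from Elasticsearch instead of local files
# (values are placeholders)
xpack.management.enabled: true
xpack.management.pipeline.id: ["main"]
xpack.management.elasticsearch.hosts: ["http://elasticsearch:9200"]
xpack.management.elasticsearch.username: logstash_admin_user
xpack.management.elasticsearch.password: changeme
```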
Internship Data Science & Data Engineer - ML in production and streaming data ingestion
Categories: Data Engineering, Data Science | Tags: Flink, Kafka, Spark, DevOps, Hadoop, HBase, Kubernetes, Python
Context: The exponential growth of data has turned the industry upside down by redefining data storage, processing, and ingestion pipelines. Mastering these methods considerably facilitates…
By David WORMS
Nov 26, 2019
Insert rows in BigQuery tables with complex columns
Categories: Cloud Computing, Data Engineering | Tags: GCP, BigQuery, Schema, SQL
Google’s BigQuery is a cloud data warehousing system designed to process enormous volumes of data and offering several features. Out of all those features, let’s talk about the support of Struct…
Nov 22, 2019
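As a small illustration of what Struct support enables, a hedged standard SQL sketch against a hypothetical table:

```sql
-- Insert a row whose address column is a STRUCT
-- (dataset, table, and column names are hypothetical)
INSERT INTO `my_dataset.users` (name, address)
VALUES ('Alice', STRUCT('123 Main St' AS street, 'Paris' AS city));
```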
Machine Learning model deployment
Categories: Big Data, Data Engineering, Data Science, DevOps & SRE | Tags: DevOps, Operation, AI, Cloud, Machine Learning, MLOps, On-premises, Schema
“Enterprise Machine Learning requires looking at the big picture … from a data engineering and a data platform perspective,” lectured Justin Norman during the talk on the deployment of Machine…
Sep 30, 2019
Spark Streaming part 4: clustering with Spark MLlib
Categories: Data Engineering, Data Science, Learning | Tags: Spark, Apache Spark Streaming, Big Data, Streaming, Clustering, Machine Learning, Scala
Spark MLlib is Apache Spark’s library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, the Spark framework can serve as a platform for…
Jul 11, 2019
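For a flavor of the MLlib API, a minimal batch KMeans sketch in PySpark; the toy data is illustrative, and the article itself deals with streaming, so its code differs:

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kmeans-demo").getOrCreate()

# Toy 2D points (illustrative); a real pipeline would stream features in
df = spark.createDataFrame(
    [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 9.0)], ["x", "y"])
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

model = KMeans(k=2, seed=42).fit(features)
print(model.clusterCenters())
```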
Spark Streaming part 3: DevOps, tools and tests for Spark applications
Categories: Big Data, Data Engineering, DevOps & SRE | Tags: Spark, Apache Spark Streaming, DevOps, Learning and tutorial
Whenever services are unavailable, businesses experience large financial losses. Spark Streaming applications can break, like any other software application. A streaming application operates on data…
Jun 19, 2019
Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop
Categories: Data Engineering, Learning | Tags: Spark, Apache Spark Streaming, Streaming, Python
Spark can process streaming data on a multi-node Hadoop cluster, relying on HDFS for storage and YARN for job scheduling. Thus, Spark Structured Streaming integrates well with Big Data…
May 28, 2019
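A minimal PySpark sketch of a file-based stream checkpointed and stored on HDFS; the paths and schema are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-sink").getOrCreate()

# Treat new JSON files landing in an HDFS directory as a stream
# (paths and schema are hypothetical)
stream = spark.readStream.schema("id INT, value STRING").json("hdfs:///incoming")

query = (
    stream.writeStream
    .format("parquet")
    .option("path", "hdfs:///warehouse/events")
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .start()
)
query.awaitTermination()
```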
Spark Streaming part 1: build data pipelines with Spark Structured Streaming
Categories: Data Engineering, Learning | Tags: Kafka, Spark, Apache Spark Streaming, Big Data, Streaming
Spark Structured Streaming is a new engine introduced with Apache Spark 2 for processing streaming data. It is built on top of the existing Spark SQL engine and Spark DataFrames. The…
Apr 18, 2019
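A minimal PySpark sketch of a Structured Streaming pipeline reading from Kafka; the broker and topic are placeholders, and the spark-sql-kafka package is required:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-source").getOrCreate()

# Continuous source reading a Kafka topic (broker and topic are placeholders)
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Kafka delivers bytes; cast the payload before processing
query = (
    df.selectExpr("CAST(value AS STRING)")
    .writeStream
    .format("console")
    .start()
)
query.awaitTermination()
```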
Publish Spark SQL DataFrame and RDD with Spark Thrift Server
Categories: Data Engineering | Tags: Hive, Spark, Thrift, JDBC, Hadoop, SQL
The distributed and in-memory nature of the Spark engine makes it an excellent candidate to expose data to clients that expect low latencies. Dashboards, notebooks, BI studios, KPI-based reports…
Mar 25, 2019
Apache Flink: past, present and future
Categories: Data Engineering | Tags: Flink, Pipeline, Streaming, Kubernetes, Machine Learning, SQL
Apache Flink is a little gem that deserves a lot more attention. Let’s dive into Flink’s past, its current state, and the future it is heading toward, following the keynotes and presentations at Flink…
Nov 5, 2018
Data Lake ingestion best practices
Categories: Big Data, Data Engineering | Tags: Avro, Hive, NiFi, ORC, Spark, Data Governance, HDF, Operation, Protocol Buffers, Data Lake, File Format, Registry, Schema
Creating a Data Lake requires rigor and experience. Here are some good practices around data ingestion both for batch and stream architectures that we recommend and implement with our customers…
By David WORMS
Jun 18, 2018
Apache Beam: a unified programming model for data processing pipelines
Categories: Data Engineering, DataWorks Summit 2018 | Tags: Apex, Beam, Flink, Spark, Pipeline
In this article, we will review the concepts, the history, and the future of Apache Beam, which may well become the new standard for defining data processing pipelines. At DataWorks Summit 2018 in…
May 24, 2018
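A minimal Python sketch of the Beam programming model; it runs on the local DirectRunner, and the same pipeline definition could target the Spark or Flink runners:

```python
import apache_beam as beam

# A single pipeline definition, portable across runners
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["alpha", "beta", "gamma"])
        | "Upper" >> beam.Map(str.upper)
        | "Print" >> beam.Map(print)
    )
```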
What's new in Apache Spark 2.3?
Categories: Data Engineering, DataWorks Summit 2018 | Tags: Arrow, ORC, Spark, PySpark, Docker, Streaming, Tuning, Spark MLlib, Kubernetes, pandas
Let’s dive into the new features offered by the 2.3 release of Apache Spark. This article is a compilation of the following talks seen at the DataWorks Summit 2018 and additional research: Apache…
May 23, 2018
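Among the 2.3 features, Arrow-backed pandas UDFs are easy to show; a minimal sketch in the Spark 2.3 style, with an illustrative function name:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("arrow-udf").getOrCreate()

# Vectorized UDF: operates on pandas Series batches exchanged via Arrow,
# instead of one row at a time
@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1.0

spark.range(5).select(plus_one(col("id")).alias("id_plus_one")).show()
```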
Execute Python in an Oozie workflow
Categories: Data Engineering | Tags: Oozie, Elasticsearch, REST, Python
Oozie workflows allow you to use multiple actions to execute code; however, doing so with Python can be a bit tricky. Let’s see how to do it. I’ve recently designed a workflow that would interact…
Mar 6, 2018
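For a hint of the technique, a hedged workflow.xml sketch using a shell action to ship and run a Python script; the script name and paths are hypothetical, and the article's own workflow may differ:

```xml
<!-- Sketch: ship a Python script to the worker node with a shell action
     (script name and paths are hypothetical) -->
<workflow-app name="python-workflow" xmlns="uri:oozie:workflow:0.5">
  <start to="python-action"/>
  <action name="python-action">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>my_script.py</exec>
      <file>scripts/my_script.py#my_script.py</file>
      <capture-output/>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Python action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```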
Oracle DB synchronization to Hadoop with CDC
Categories: Data Engineering | Tags: Hive, Sqoop, CDC, GoldenGate, Oracle, Data Warehouse
This note is the result of a discussion about the synchronization of data written in a database to a warehouse stored in Hadoop. Thanks to Claude Daub from GFI who wrote it and who authorizes us to…
By David WORMS
Jul 31, 2017
EclairJS - Putting a Spark in Web Apps
Categories: Data Engineering, Front End | Tags: Spark, JavaScript, Jupyter
Presentation by David Fallside from IBM, images extracted from the presentation. Introduction: Web App development has moved from Java to Node.js and JavaScript. It provides a simple and rich…
By David WORMS
Jul 17, 2016
Splitting HDFS files into multiple Hive tables
Categories: Data Engineering | Tags: Flume, HDFS, Hive, Oozie, Pig, SQL
I am going to show how to split a CSV file stored inside HDFS into multiple Hive tables based on the content of each record. The context is simple. We are using Flume to collect logs from all over our…
By David WORMS
Sep 15, 2013
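The classic HiveQL building block for this kind of fan-out is the multi-table insert; a hedged sketch with hypothetical table and column names:

```sql
-- One scan of the raw table feeds several destination tables
-- (table and column names are hypothetical)
FROM logs_raw
INSERT OVERWRITE TABLE web_logs SELECT ts, message WHERE source = 'web'
INSERT OVERWRITE TABLE app_logs SELECT ts, message WHERE source = 'app';
```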
Testing the Oracle SQL Connector for Hadoop HDFS
Categories: Data Engineering | Tags: HDFS, Database, File system, Oracle, CDH, SQL
Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in HDFS files or a Hive table. You can also query and join data in HDFS or a Hive table with other…
By David WORMS
Jul 15, 2013
Options to connect and integrate Hadoop with Oracle
Categories: Data Engineering | Tags: Avro, HDFS, Hive, MapReduce, Sqoop, Database, Java, NoSQL, Oracle, R, RDBMS, SQL
I will list the different tools and libraries available to us developers to integrate Oracle and Hadoop. The Oracle SQL Connector for HDFS described below is covered in a follow-up article…
By David WORMS
May 15, 2013
Two Hive UDAF to convert an aggregation to a map
Categories: Data Engineering | Tags: Hive, Java, HBase, File Format
I am publishing two new Hive UDAFs to help with maps in Apache Hive. The source code is available on GitHub in two Java classes: “UDAFToMap” and “UDAFToOrderedMap”, or you can download the jar file. The…
By David WORMS
Mar 6, 2012
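A hedged usage sketch: only the class names come from the article, while the jar path and Java package are assumptions:

```sql
-- Register the UDAF (jar path and package name are assumptions)
ADD JAR hdfs:///libs/hive-udaf-map.jar;
CREATE TEMPORARY FUNCTION to_map AS 'com.adaltas.UDAFToMap';

-- Collapse each user's events into a single key/value map
SELECT uid, to_map(event_time, event_type)
FROM events
GROUP BY uid;
```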
Timeseries storage in Hadoop and Hive
Categories: Data Engineering | Tags: HDFS, Hive, CRM, timeseries, Tuning, Hadoop, File Format
In the next few weeks, we will be exploring the storage and analysis of a large generated dataset. This dataset is composed of CRM tables associated with one timeseries table of about 7,000 billion rows…
By David WORMS
Jan 10, 2012