Apache Hadoop
Hadoop is a massively scalable platform commonly used to process big data workloads. At its core, it is composed of a distributed file system (HDFS) and a resource manager (YARN).
Hadoop provides a high level of durability and availability while still being able to process computational analytical workloads in parallel. The combination of availability, durability, and scalability of processing makes Hadoop a natural fit for Big Data workloads.
Related articles

TDP workshop: Become a TDP power user from your terminal
Categories: Events, Learning | Tags: Ansible, DevOps, Hadoop, Open source, TDP
The TDP CLI is used to deploy and operate your TDP services. It relies on tdp-lib to provide control and flexibility at your fingertips. Some time ago, we announced the public release of TDP - Trunk…
By Paul FARAULT
Jun 17, 2022

Introducing Trunk Data Platform: the Open-Source Big Data Distribution Curated by TOSIT
Categories: Big Data, DevOps & SRE, Infrastructure | Tags: Ansible, Ranger, DevOps, Hortonworks, Hadoop, HBase, Knox, Spark, Cloudera, CDP, CDH, Open source, TDP
Ever since Cloudera and Hortonworks merged, the choice of commercial Hadoop distributions for on-prem workloads essentially boils down to CDP Private Cloud. CDP can be seen as the “best of both worlds…
Apr 14, 2022

Apache HBase: RegionServers co-location
Categories: Big Data, Adaltas Summit 2021, Infrastructure | Tags: Ambari, Database, HDP, Infrastructure, Tuning, Hadoop, HBase, Big Data, Storage
RegionServers are the processes that manage the storage and retrieval of data in Apache HBase, the non-relational column-oriented database in Apache Hadoop. It is through their daemons that any CRUD…
Feb 22, 2022

H2O in practice: a protocol combining AutoML with traditional modeling approaches
Categories: Data Science, Learning | Tags: Automation, Cloud, H2O, Machine Learning, MLOps, On-premises, Open source, Python, XGBoost
H2O comes with a lot of functionalities. The second part of the series H2O in practice proposes a protocol to combine AutoML modeling with traditional modeling and optimization approaches. The objective…
Nov 12, 2021

Internship in Data Engineering
Categories: Front End, Learning | Tags: Hive, Metrics, Monitoring, Kafka, Delta Lake, Elasticsearch, IaC, Internship, Kubernetes, Streaming
Job Description Data is a valuable business asset. Some call it the new oil. The data engineer collects, transforms and refines raw data into information that can be used by business analysts and data…
By David WORMS
Oct 25, 2021

Internship in Big Data infrastructure with TDP
Categories: Infrastructure, Learning | Tags: Cyber Security, DevOps, Java, Hadoop, IaC, Internship, TDP
Job Description Big Data and distributed computing are at Adaltas’ core. We support our partners in the deployment, maintenance and optimization of some of France’s largest clusters. Adaltas is also an…
By Daniel HARTY
Oct 25, 2021

H2O in practice: a Data Scientist feedback
Categories: Data Science, Learning | Tags: Automation, Cloud, H2O, Machine Learning, MLOps, On-premises, Open source, Python
Automated machine learning (AutoML) platforms are gaining popularity and becoming a new important tool in the data scientists’ toolbox. A few months ago, I introduced H2O, an open-source platform for…
Sep 29, 2021

Adaltas Summit 2021, 2nd edition in Corsica
Categories: Adaltas Summit 2021, Learning | Tags: Ansible, Hadoop, Spark, Azure, Blockchain, Deep Learning, Docker, Terraform, Kubernetes, Node.js
For its second edition, the whole Adaltas crew is gathering in Corsica for a whole week, with 2 days dedicated to technology: the 23rd and the 24th of September 2021. After a year and a half of sanitary…
By David WORMS
Sep 21, 2021

Storage size and generation time in popular file formats
Categories: Data Engineering, Data Science | Tags: Hive, ORC, Avro, HDFS, Parquet, Big Data, Data Lake, File Format, JavaScript Object Notation (JSON)
Choosing an appropriate file format is essential, whether your data transits on the wire or is stored at rest. Each file format comes with its own advantages and disadvantages. We covered them in a…
Mar 22, 2021

Build your open source Big Data distribution with Hadoop, HBase, Spark, Hive & Zeppelin
Categories: Big Data, Infrastructure | Tags: Hive, Maven, Unit tests, Hadoop, HBase, Spark, Git, Release and features, TDP
The Hadoop ecosystem gave birth to many popular projects including HBase, Spark and Hive. While technologies like Kubernetes and S3 compatible object storages are growing in popularity, HDFS and YARN…
Dec 18, 2020

Faster model development with H2O AutoML and Flow
Categories: Data Science, Learning | Tags: Automation, Cloud, H2O, Machine Learning, MLOps, On-premises, Open source, Python
Building Machine Learning (ML) models is a time-consuming process. It requires expertise in statistics, ML algorithms, and programming. On top of that, it also requires the ability to translate a…
Dec 10, 2020

Connecting to ADLS Gen2 from Hadoop (HDP) and Nifi (HDF)
Categories: Big Data, Cloud Computing, Data Engineering | Tags: NiFi, Hadoop, HDFS, Authentication, Authorization, Azure, Azure Data Lake Storage (ADLS), OAuth2
As data projects built in the Cloud are becoming more and more frequent, a common use case is to interact with Cloud storage from an existing on-premises Big Data platform. Microsoft Azure recently…
Nov 5, 2020

Rebuilding HDP Hive: patch, test and build
Categories: Big Data, Infrastructure | Tags: Hive, Maven, GitHub, Java, Unit tests, Git, Release and features, TDP
The Hortonworks HDP distribution will soon be deprecated in favor of Cloudera’s CDP. One of our clients wanted a new Apache Hive feature backported into HDP 2.6.0. We thought it was a good opportunity…
Oct 6, 2020

Installing Hadoop from source: build, patch and run
Categories: Big Data, Infrastructure | Tags: Maven, Java, LXD, Unit tests, Hadoop, HDFS, Docker, TDP
Commercial Apache Hadoop distributions have come and gone. The two leaders, Cloudera and Hortonworks, have merged: HDP is no more and CDH is now CDP. MapR has been acquired by HP and IBM BigInsights…
Aug 4, 2020

Download datasets into HDFS and Hive
Categories: Big Data, Data Engineering | Tags: Hive, Business intelligence, Data Analytics, Data Engineering, Data structures, Database, Hadoop, HDFS, Big Data, Data Lake, Data Warehouse
Introduction Nowadays, the analysis of large amounts of data is becoming more and more possible thanks to Big data technology (Hadoop, Spark,…). This explains the explosion of the data volume and the…
By Aida NGOM
Jul 31, 2020

Comparison of different file formats in Big Data
Categories: Big Data, Data Engineering | Tags: ORC, Batch processing, Business intelligence, Data structures, Protocol Buffers, Avro, HDFS, Parquet, Big Data, CSV, JavaScript Object Notation (JSON), Kubernetes
In data processing, there are different types of file formats to store your data sets. Each format has its own pros and cons depending upon the use cases and exists to serve one or several purposes…
By Aida NGOM
Jul 23, 2020

Optimization of Spark applications in Hadoop YARN
Categories: Data Engineering, Learning | Tags: Tuning, Hadoop, Spark, Python
Apache Spark is an in-memory data processing tool widely used in companies to deal with Big Data issues. Running a Spark application in production requires user-defined resources. This article…
Mar 30, 2020

Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop
Categories: Data Engineering, Learning | Tags: Apache Spark Streaming, Spark, Python, Streaming
Spark can process streaming data on a multi-node Hadoop cluster relying on HDFS for the storage and YARN for the scheduling of jobs. Thus, Spark Structured Streaming integrates well with Big Data…
May 28, 2019

Should you move your Big Data and Data Lake to the Cloud
Categories: Big Data, Cloud Computing | Tags: DevOps, AWS, Azure, Cloud, CDP, Databricks, GCP
Should you follow the trend and migrate your data, workflows and infrastructure to GCP, AWS and Azure? During the Strata Data Conference in New York, a general focus was put on moving customer’s Big…
Dec 9, 2019

Internship Data Science & Data Engineer - ML in production and streaming data ingestion
Categories: Data Engineering, Data Science | Tags: Flink, DevOps, Hadoop, HBase, Kafka, Spark, Internship, Kubernetes, Python
Context The exponential evolution of data has turned the industry upside down by redefining data storage, processing and data ingestion pipelines. Mastering these methods considerably facilitates…
By David WORMS
Nov 26, 2019

InfraOps & DevOps Internship - build a Big Data & Kubernetes PaaS
Categories: Big Data, Containers Orchestration | Tags: DevOps, LXD, NoSQL, Hadoop, Kafka, Spark, Ceph, Internship, Kubernetes
Context The acquisition of a high-capacity cluster is in line with Adaltas’ desire to build a PaaS-type offering to use and to provide Big Data and container orchestration platforms. The platforms are…
By David WORMS
Nov 26, 2019

Machine Learning model deployment
Categories: Big Data, Data Engineering, Data Science, DevOps & SRE | Tags: DevOps, Operation, AI, Cloud, Machine Learning, MLOps, On-premises, Schema
“Enterprise Machine Learning requires looking at the big picture […] from a data engineering and a data platform perspective,” lectured Justin Norman during the talk on the deployment of Machine…
Sep 30, 2019

Running Apache Hive 3, new features and tips and tricks
Categories: Big Data, Business Intelligence, DataWorks Summit 2019 | Tags: Druid, Hive, JDBC, LLAP, Hadoop, Kafka, Release and features
Apache Hive 3 brings a bunch of new and nice features to the data warehouse. Unfortunately, like many major FOSS releases, it comes with a few bugs and not much documentation. It is available since…
Jul 25, 2019

Publish Spark SQL DataFrame and RDD with Spark Thrift Server
Categories: Data Engineering | Tags: Hive, Thrift, JDBC, Hadoop, Spark, SQL
The distributed and in-memory nature of the Spark engine makes it an excellent candidate to expose data to clients which expect low latencies. Dashboards, notebooks, BI studios, KPIs-based reports…
Mar 25, 2019

Multihoming on Hadoop
Categories: Infrastructure | Tags: Kerberos, Hadoop, HDFS, Network
Multihoming, which means having multiple networks attached to one node, is one of the main components to manage the heterogeneous network usage of an Apache Hadoop cluster. This article is an…
Mar 5, 2019

Apache Knox made easy!
Categories: Big Data, Cyber Security, Adaltas Summit 2018 | Tags: Ranger, Kerberos, LDAP, Active Directory, REST, Knox
Apache Knox is the secure entry point of a Hadoop cluster, but can it also be the entry point for my REST applications? Apache Knox overview Apache Knox is an application gateway for interacting in a…
Feb 4, 2019

Monitoring a production Hadoop cluster with Kubernetes
Categories: DevOps & SRE | Tags: Thrift, Grafana, Shinken, Hadoop, Knox, Cluster, Docker, Elasticsearch, Kubernetes, Node, Node.js, Prometheus, Python
Monitoring a production grade Hadoop cluster is a real challenge and needs to be constantly evolving. The software we use today is based on Nagios. Very efficient when it comes to the simplest…
Dec 21, 2018

One week to discuss technology in a Moroccan riad
Categories: Adaltas Summit 2018, Learning | Tags: Flink, CDSW, Gatsby, React.js, Hadoop, Knox, Data Science, Deep Learning, Kubernetes, Node.js
Adaltas organises its first conference this year, between the 22nd and the 26th of October. On the agenda of these 5 days of conference: discuss technology in one of the most beautiful riads of Marrakech. Mix the…
By David WORMS
Oct 11, 2018

Deep learning on YARN: running Tensorflow and friends on Hadoop cluster
Categories: Data Science | Tags: YARN, GPU, Hadoop, MXNet, Spark, Spark MLlib, Deep Learning, PyTorch, TensorFlow, XGBoost
With the arrival of Hadoop 3, YARN offers more flexibility in resource management. It is now possible to perform Deep Learning analysis on GPUs with specific development environments, leveraging…
Jul 24, 2018

Clusters and workloads migration from Hadoop 2 to Hadoop 3
Categories: Big Data, Infrastructure | Tags: Slider, YARN, Erasure Coding, Rolling Upgrade, HDFS, Spark, Docker
Hadoop 2 to Hadoop 3 migration is a hot subject. How to upgrade your clusters, which features present in the new release may solve current problems and bring new opportunities, how are your current…
Jul 25, 2018

Apache Hadoop YARN 3.0 – State of the union
Categories: Big Data, DataWorks Summit 2018 | Tags: YARN, GPU, Hortonworks, Hadoop, HDFS, MapReduce, Cloudera, Data Science, Docker, Release and features
This article covers the “Apache Hadoop YARN: state of the union” talk held by Wangda Tan from Hortonworks during the Dataworks Summit 2018. What is Apache YARN? As a reminder, YARN is one of the two…
May 31, 2018

Running Enterprise Workloads in the Cloud with Cloudbreak
Categories: Big Data, Cloud Computing, DataWorks Summit 2018 | Tags: Cloudbreak, HDP, Operation, Hadoop, AWS, Azure, GCP, OpenStack
This article is based on Peter Darvasi and Richard Doktorics’ talk Running Enterprise Workloads in the Cloud at the DataWorks Summit 2018 in Berlin. It presents Hortonworks’ automated deployment tool…
May 28, 2018

Present and future of Hadoop workflow scheduling: Oozie 5.x
Categories: Big Data, DataWorks Summit 2018 | Tags: Hive, Sqoop, HDP, REST, Hadoop, Oozie, CDH
During the DataWorks Summit Europe 2018 in Berlin, I had the opportunity to attend a breakout session on Apache Oozie. It covers the new features released in Oozie 5.0, including future features of…
May 23, 2018

Apache Apex with Apache SAMOA
Categories: Data Science, Events, Tech Radar | Tags: Apex, Flink, Samoa, Storm, Tools, Hadoop, Machine Learning
Traditional Machine Learning Batch Oriented Supervised - most common Training and Scoring One time model building Data set Training: Model building Holdout: Parameter tuning Test: Accuracy Online…
Jul 17, 2016

Apache Apex: next gen Big Data analytics
Categories: Data Science, Events, Tech Radar | Tags: Apex, Flink, Storm, Tools, Hadoop, Kafka, Data Science, Machine Learning
Below is a compilation of my notes taken during the presentation of Apache Apex by Thomas Weise from DataTorrent, the company behind Apex. Introduction Apache Apex is an in-memory distributed parallel…
Jul 17, 2016

Hive, Calcite and Druid
Categories: Big Data | Tags: Druid, Hive, Business intelligence, Database, Hadoop
BI/OLAP requires interactive visualization of complex data streams: Real time bidding events User activity streams Voice call logs Network traffic flows Firewall events Application KPIs Traditional…
By David WORMS
Jul 14, 2016

MariaDB integration with Hadoop
Categories: Infrastructure | Tags: Hive, Database, HA, MariaDB, Hadoop
During a workshop with one of our customers, Adaltas has identified a potential risk in using MariaDB’s High Availability (HA) strategy. Since the customer selected Cloudera’s CDH 5 distribution, the…
By David WORMS
Jul 31, 2017

Red Hat Storage Gluster and its integration with Hadoop
Categories: Big Data | Tags: GlusterFS, Red Hat, Hadoop, HDFS, Storage
I had the opportunity to be introduced to Red Hat Storage and Gluster in a joint presentation by Red Hat France and the company StartX. I have here recompiled my notes, at least partially. I will…
By David WORMS
Jul 3, 2015

Components for CDH and HDP
Categories: Big Data | Tags: Flume, Hive, Sqoop, Zookeeper, Hortonworks, HDP, Hadoop, Oozie, Cloudera, CDH
I was interested in comparing the different components distributed by Cloudera and Hortonworks. This also gives us an idea of the versions packaged by the two distributions. At the time of this writing…
By David WORMS
Sep 22, 2013

State of the Hadoop open-source ecosystem in early 2013
Categories: Big Data | Tags: Flume, Mesos, Phoenix, Pig, Hadoop, Kafka, Mahout, Data Science
Hadoop is already a large ecosystem and my guess is that 2013 will be the year where it grows even larger. There are some pieces that we no longer need to present. ZooKeeper, HBase, Hive, Pig, Flume…
By David WORMS
Jul 8, 2013

The state of Hadoop distributions
Categories: Big Data | Tags: Hortonworks, Intel, Oracle, Hadoop, Cloudera
Apache Hadoop is of course made available for download on its official webpage. However, downloading and installing the several components that make a Hadoop cluster is not an easy task and is a…
By David WORMS
May 11, 2013

Apache Hive Essentials How-to by Darren Lee
Categories: Business Intelligence, Learning | Tags: Hive, UDF, Hadoop, File Format, SQL
Recently, I’ve been asked to review a new book on Apache Hive called “Apache Hive Essentials How-to” (edit: the second edition is now available) written by Darren Lee and published by Packt Publishing…
By David WORMS
Apr 23, 2013

Definitions of machine learning algorithms present in Apache Mahout
Categories: Data Science | Tags: Algorithm, Classification, Hadoop, Mahout, Clustering, Machine Learning
Apache Mahout is a machine learning library built for scalability. Its core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop…
By David WORMS
Mar 8, 2013

Merging multiple files in Hadoop
Categories: Hack | Tags: File system, Hadoop, HDFS
This is a command I used to concatenate the files stored in Hadoop HDFS matching a globbing expression into a single file. It uses the “getmerge” utility but, contrary to “getmerge”, the final…
By David WORMS
Jan 12, 2013

Hadoop and R with RHadoop
Categories: Business Intelligence, Data Science | Tags: Thrift, Data Analytics, Learning and tutorial, R, Hadoop, HBase, HDFS, MapReduce
RHadoop is a bridge between R, a language and environment to statistically explore data sets, and Hadoop, a framework that allows for the distributed processing of large data sets across clusters of…
By David WORMS
Jul 19, 2012

Timeseries storage in Hadoop and Hive
Categories: Data Engineering | Tags: Hive, CRM, timeseries, Tuning, Hadoop, HDFS, File Format
In the next few weeks, we will be exploring the storage and analytics of a large generated dataset. This dataset is composed of CRM tables associated to one time series table of about 7,000 billiard rows…
By David WORMS
Jan 10, 2012

Hadoop and HBase installation on OSX in pseudo-distributed mode
Categories: Big Data, Learning | Tags: Hue, Infrastructure, Hadoop, HBase, Big Data, Deployment
The operating system chosen is OSX but the procedure is not so different for any Unix environment because most of the software is downloaded from the Internet, uncompressed and set manually. Only a…
By David WORMS
Dec 1, 2010

Storage and massive processing with Hadoop
Categories: Big Data | Tags: Hadoop, HDFS, Storage
Apache Hadoop is a system for building shared storage and processing infrastructures for large volumes of data (multiple terabytes or petabytes). Hadoop clusters are used by a wide range of projects…
By David WORMS
Nov 26, 2010