Apache Hadoop

Hadoop is a massively scalable platform commonly used to process big data workloads. At its core, it is composed of a distributed file system (HDFS) and a resource manager (YARN).

Hadoop provides a high level of durability and availability while still being able to process computational analytical workloads in parallel. The combination of availability, durability, and scalability of processing makes Hadoop a natural fit for Big Data workloads.

Related articles

Running Apache Hive 3, new features and tips and tricks

Categories: Big Data, Business Intelligence, DataWorks Summit 2019 | Tags: Druid, Hive, Kafka, Cloudera, Data Warehouse, JDBC, LLAP, Active Directory, Release and features, Hadoop

Apache Hive 3 brings a bunch of new and nice features to the data warehouse. Unfortunately, like many major FOSS releases, it comes with a few bugs and not much documentation. It is available since…

By Gauthier LEONARD

Jul 25, 2019

Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop

Categories: Data Engineering, Learning | Tags: Spark, Apache Spark Streaming, Big Data, File Format, Data Governance, Python, Streaming, Hadoop

Spark can process streaming data on a multi-node Hadoop cluster relying on HDFS for the storage and YARN for the scheduling of jobs. Thus, Spark Structured Streaming integrates well with Big Data…

By Oskar RYNKIEWICZ

May 28, 2019

Publish Spark SQL DataFrame and RDD with Spark Thrift Server

Categories: Data Engineering | Tags: Hive, Spark, Thrift, JDBC, Hadoop, SQL

The distributed and in-memory nature of the Spark engine makes it an excellent candidate to expose data to clients which expect low latencies. Dashboards, notebooks, BI studios, KPI-based reports…

By Oskar RYNKIEWICZ

Mar 25, 2019

Multihoming on Hadoop

Categories: Infrastructure | Tags: HDFS, Kerberos, Network, Hadoop

Multihoming, which means having multiple networks attached to one node, is one of the main components to manage the heterogeneous network usage of an Apache Hadoop cluster. This article is an…

By Joris RUMMENS

Mar 5, 2019

Apache Knox made easy!

Categories: Big Data, Cyber Security, Adaltas Summit 2018 | Tags: Ambari, Hive, Knox, Ranger, Shiro, Solr, JDBC, Kerberos, LDAP, Active Directory, REST, SSL/TLS, Hadoop, SSO

Apache Knox is the secure entry point of a Hadoop cluster, but can it also be the entry point for my REST applications? Apache Knox overview Apache Knox is an application gateway for interacting in a…

By Michael HATOUM

Feb 4, 2019

Monitoring a production Hadoop cluster with Kubernetes

Categories: DevOps & SRE | Tags: Knox, Thrift, Docker, Elasticsearch, Grafana, Kubernetes, Node.js, Prometheus, Python, Shinken, Hadoop

Monitoring a production-grade Hadoop cluster is a real challenge and needs to be constantly evolving. The software we use today is based on Nagios. Very efficient when it comes to the simplest…

By Paul-Adrien CORDONNIER

Dec 21, 2018

One week to discuss technology in a Moroccan riad

Categories: Adaltas Summit 2018, Learning | Tags: Flink, Knox, CDSW, Deep Learning, Gatsby, Kubernetes, Node.js, React.js, Hadoop

Adaltas is organising its first conference this year, from the 22nd to the 26th of October. On the agenda for these 5 days of conference: discussing technology in one of the most beautiful riads of Marrakech. Mix the…

By David WORMS

Oct 11, 2018

Clusters and workloads migration from Hadoop 2 to Hadoop 3

Categories: Big Data, Infrastructure | Tags: HBase, HDFS, Oozie, Slider, Spark, YARN, Docker, Erasure Coding, Operation, Rolling Upgrade, SLA, Hadoop

Hadoop 2 to Hadoop 3 migration is a hot subject. How to upgrade your clusters, which features present in the new release may solve current problems and bring new opportunities, how are your current…

By Lucas BAKALIAN

Jul 25, 2018

Deep learning on YARN: running Tensorflow and friends on Hadoop cluster

Categories: Data Science | Tags: Spark, Spark MLlib, YARN, Deep Learning, GPU, PyTorch, TensorFlow, XGBoost, Hadoop

With the arrival of Hadoop 3, YARN offers more flexibility in resource management. It is now possible to perform Deep Learning analysis on GPUs with specific development environments, leveraging…

By Louis BIANCHERIN

Jul 24, 2018

Apache Hadoop YARN 3.0 – State of the union

Categories: Big Data, DataWorks Summit 2018 | Tags: HDFS, MapReduce, YARN, Cloudera, Docker, GPU, Hortonworks, Release and features, Hadoop

This article covers the “Apache Hadoop YARN: state of the union” talk held by Wangda Tan from Hortonworks during the DataWorks Summit 2018. What is Apache YARN? As a reminder, YARN is one of the two…

By Lucas BAKALIAN

May 31, 2018

Running Enterprise Workloads in the Cloud with Cloudbreak

Categories: Big Data, Cloud Computing, DataWorks Summit 2018 | Tags: AWS, Cloudbreak, GCP, HDP, Azure, OpenStack, Operation, Hadoop

This article is based on Peter Darvasi and Richard Doktorics’ talk Running Enterprise Workloads in the Cloud at the DataWorks Summit 2018 in Berlin. It presents Hortonworks’ automated deployment tool…

By Joris RUMMENS

May 28, 2018

Present and future of Hadoop workflow scheduling: Oozie 5.x

Categories: Big Data, DataWorks Summit 2018 | Tags: Hive, Oozie, Sqoop, CDH, HDP, REST, Hadoop

During the DataWorks Summit Europe 2018 in Berlin, I had the opportunity to attend a breakout session on Apache Oozie. It covers the new features released in Oozie 5.0, including future features of…

By Leo SCHOUKROUN

May 23, 2018

MariaDB integration with Hadoop

Categories: Infrastructure | Tags: Hive, Database, HA, MariaDB, Hadoop

During a workshop with one of our customers, Adaltas identified a potential risk in using MariaDB’s High Availability (HA) strategy. Since the customer selected Cloudera’s CDH 5 distribution, the…

By David WORMS

Jul 31, 2017

Apache Apex with Apache SAMOA

Categories: Data Science, Events, Tech Radar | Tags: Apex, Flink, Samoa, Storm, Machine Learning, Tools, Hadoop

Traditional Machine Learning Batch Oriented Supervised - most common Training and Scoring One time model building Data set Training: Model building Holdout: Parameter tuning Test: Accuracy Online…

By Pierre SAUVAGE

Jul 17, 2016

Apache Apex: next gen Big Data analytics

Categories: Data Science, Events, Tech Radar | Tags: Apex, Flink, Kafka, Storm, Data Science, Machine Learning, Tools, Hadoop

Below is a compilation of my notes taken during the presentation of Apache Apex by Thomas Weise from DataTorrent, the company behind Apex. Introduction Apache Apex is an in-memory distributed parallel…

By César BEREZOWSKI

Jul 17, 2016

Hive, Calcite and Druid

Categories: Big Data | Tags: Analytics, Druid, Hive, Database, Hadoop

BI/OLAP requires interactive visualization of complex data streams: Real time bidding events User activity streams Voice call logs Network traffic flows Firewall events Application KPIs Traditional…

By David WORMS

Jul 14, 2016

Red Hat Storage Gluster and its integration with Hadoop

Categories: Big Data | Tags: HDFS, GlusterFS, Red Hat, Storage, Hadoop

I had the opportunity to be introduced to Red Hat Storage and Gluster in a joint presentation by Red Hat France and the company StartX. I have compiled my notes here, at least partially. I will…

By David WORMS

Jul 3, 2015

Components for CDH and HDP

Categories: Big Data | Tags: Flume, Hive, Oozie, Sqoop, Zookeeper, Cloudera, CDH, Hortonworks, HDP, Hadoop

I was interested in comparing the different components distributed by Cloudera and Hortonworks. This also gives us an idea of the versions packaged by the two distributions. At the time of this writing…

By David WORMS

Sep 22, 2013

State of the Hadoop open-source ecosystem in early 2013

Categories: Big Data | Tags: Flume, Kafka, Mahout, Mesos, Phoenix, Pig, File Format, Hadoop

Hadoop is already a large ecosystem and my guess is that 2013 will be the year when it grows even larger. There are some pieces that we no longer need to present. ZooKeeper, HBase, Hive, Pig, Flume…

By David WORMS

Jul 8, 2013

The state of Hadoop distributions

Categories: Big Data | Tags: Cloudera, Hortonworks, Intel, Oracle, Hadoop

Apache Hadoop is of course made available for download on its official webpage. However, downloading and installing the several components that make up a Hadoop cluster is not an easy task and is a…

By David WORMS

May 11, 2013

Apache Hive Essentials How-to by Darren Lee

Categories: Business Intelligence, Learning | Tags: Hive, File Format, UDF, Hadoop, SQL

Recently, I’ve been asked to review a new book on Apache Hive called “Apache Hive Essentials How-to” written by Darren Lee and published by Packt Publishing. In short, I sincerely recommend it. I…

By David WORMS

Apr 23, 2013

Definitions of machine learning algorithms present in Apache Mahout

Categories: Data Science | Tags: Algorithms, Mahout, Classification, Clustering, Machine Learning, Hadoop

Apache Mahout is a machine learning library built for scalability. Its core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop…

By David WORMS

Mar 8, 2013

Merging multiple files in Hadoop

Categories: Hack | Tags: HDFS, File system, Hadoop

This is a command I used to concatenate the files stored in Hadoop HDFS matching a globbing expression into a single file. It uses the “getmerge” utility but, contrary to “getmerge”, the final…

By David WORMS

Jan 12, 2013

Hadoop and R with RHadoop

Categories: Business Intelligence, Data Science | Tags: HBase, HDFS, MapReduce, Thrift, Data Analytics, Learning and tutorial, R, Hadoop

RHadoop is a bridge between R, a language and environment to statistically explore data sets, and Hadoop, a framework that allows for the distributed processing of large data sets across clusters of…

By David WORMS

Jul 19, 2012

Timeseries storage in Hadoop and Hive

Categories: Data Engineering | Tags: HDFS, Hive, CRM, File Format, timeseries, Tuning, Hadoop

In the next few weeks, we will be exploring the storage and analysis of a large generated dataset. This dataset is composed of CRM tables associated with one timeseries table of about 7,000 billiard rows…

By David WORMS

Jan 10, 2012

Hadoop and HBase installation on OSX in pseudo-distributed mode

Categories: Big Data, Learning | Tags: HBase, Big Data, Hue, Deployment, Infrastructure, Hadoop

The operating system chosen is OSX but the procedure is not so different for any Unix environment because most of the software is downloaded from the Internet, uncompressed and set manually. Only a…

By David WORMS

Dec 1, 2010

Storage and massive processing with Hadoop

Categories: Big Data | Tags: HDFS, Nutch, Cloudera, Google, Hadoop

Apache Hadoop is a system for building shared storage and processing infrastructures for large volumes of data (multiple terabytes or petabytes). Hadoop clusters are used by a wide range of projects…

By David WORMS

Nov 26, 2010

Canada - Morocco - France

International locations

10 rue de la Kasbah
2393 Rabbat
Canada

We are a team of Open Source enthusiasts doing consulting in Big Data, Cloud, DevOps, Data Engineering, Data Science…

We provide our customers with accurate insights on how to leverage technologies to convert their use cases to projects in production, how to reduce their costs and how to shorten their time to market.

If you enjoy reading our publications and have an interest in what we do, contact us and we will be thrilled to cooperate with you.