Apache Hadoop HDFS

Related articles

Connecting to ADLS Gen2 from Hadoop (HDP) and NiFi (HDF)

Categories: Big Data, Cloud Computing, Data Engineering | Tags: HDFS, NiFi, Authentication, Authorization, Hadoop, Azure Data Lake Storage (ADLS), Azure, OAuth2

As data projects built in the Cloud become more and more frequent, a common use case is to interact with Cloud storage from an existing on-premises Big Data platform. Microsoft Azure recently…
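
To give an idea of what such a connection involves, here is a minimal sketch configuring the ABFS driver (hadoop-azure) with OAuth2 client credentials; the account, container, tenant and secret values are placeholders:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdlsGen2Sketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // OAuth2 client credentials flow with an Azure AD service principal.
    // All values below are placeholders.
    conf.set("fs.azure.account.auth.type", "OAuth");
    conf.set("fs.azure.account.oauth.provider.type",
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider");
    conf.set("fs.azure.account.oauth2.client.id", "<application-id>");
    conf.set("fs.azure.account.oauth2.client.secret", "<client-secret>");
    conf.set("fs.azure.account.oauth2.client.endpoint",
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token");

    // List the root of a container through the ABFS driver.
    FileSystem fs = FileSystem.get(
        new URI("abfss://<container>@<account>.dfs.core.windows.net/"), conf);
    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath());
    }
  }
}
```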

By Gauthier LEONARD

Nov 5, 2020

Installing Hadoop from source: build, patch and run

Categories: Big Data, Infrastructure | Tags: HDFS, Maven, Docker, Java, LXD, Unit tests, Hadoop

Commercial Apache Hadoop distributions have come and gone. The two leaders, Cloudera and Hortonworks, have merged: HDP is no more and CDH is now CDP. MapR has been acquired by HPE and IBM BigInsights…

By Leo SCHOUKROUN

Aug 4, 2020

Download datasets into HDFS and Hive

Categories: Big Data, Data Engineering | Tags: Analytics, HDFS, Hive, Big Data, Data Analytics, Data Engineering, Data structures, Database, Hadoop, Data Lake, Data Warehouse

Nowadays, analyzing large amounts of data is becoming more and more feasible thanks to Big Data technologies (Hadoop, Spark,…). This explains the explosion of the data volume and the…
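
As a taste of the HDFS side of this workflow, here is a minimal sketch, with hypothetical paths, that uploads a locally downloaded dataset into HDFS using the Java FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadToHdfs {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from the core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Copy a locally downloaded dataset into HDFS; a Hive external
    // table can then be declared on top of the target directory.
    Path local = new Path("/tmp/dataset.csv");
    Path target = new Path("/data/raw/dataset.csv");
    fs.mkdirs(target.getParent());
    fs.copyFromLocalFile(local, target);
    System.out.println("Uploaded to " + target);
  }
}
```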

By Aida NGOM

Jul 31, 2020

Comparison of different file formats in Big Data

Categories: Big Data, Data Engineering | Tags: Analytics, Avro, HDFS, Hive, Kafka, MapReduce, ORC, Spark, Batch processing, Big Data, CSV, Data Analytics, Data structures, Database, JSON, Protocol Buffers, Hadoop, Parquet, Kubernetes, XML

In data processing, there are different types of file formats to store your data sets. Each format has its own pros and cons depending on the use case and exists to serve one or several purposes…
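
To make one of these formats concrete, here is a small, self-contained sketch writing an Avro container file with the Java GenericRecord API; the schema and record values are invented for the example:

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteSketch {
  public static void main(String[] args) throws Exception {
    // A minimal schema: one string field and one int field.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");
    user.put("age", 42);

    // Avro container files embed the schema, so readers need no
    // external definition to decode the records.
    try (DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, new File("users.avro"));
      writer.append(user);
    }
  }
}
```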

By Aida NGOM

Jul 23, 2020

Hadoop Ozone part 3: advanced replication strategy with Copyset

Categories: Infrastructure | Tags: HDFS, Ozone, Cluster, Kubernetes, Node

Hadoop Ozone provides a way of setting a ReplicationType for every write you make on the cluster. Right now, HDFS and Ratis are supported, but more advanced replication strategies can be achieved. In this…
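
As a hedged illustration of what a per-write replication choice looks like, here is a sketch against the Ozone Java client API of that era; the volume, bucket and key names are hypothetical and assumed to pre-exist:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;

import org.apache.hadoop.hdds.client.ReplicationFactor;
import org.apache.hadoop.hdds.client.ReplicationType;
import org.apache.hadoop.hdds.conf.OzoneConfiguration;
import org.apache.hadoop.ozone.client.ObjectStore;
import org.apache.hadoop.ozone.client.OzoneBucket;
import org.apache.hadoop.ozone.client.OzoneClient;
import org.apache.hadoop.ozone.client.OzoneClientFactory;
import org.apache.hadoop.ozone.client.io.OzoneOutputStream;

public class OzoneReplicationSketch {
  public static void main(String[] args) throws Exception {
    // Connects to the Ozone Manager declared in ozone-site.xml.
    OzoneClient client = OzoneClientFactory.getRpcClient(new OzoneConfiguration());
    ObjectStore store = client.getObjectStore();

    // Volume and bucket are assumed to have been created beforehand.
    OzoneBucket bucket = store.getVolume("vol1").getBucket("bucket1");

    byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
    // The replication strategy is chosen per key at write time.
    try (OzoneOutputStream out = bucket.createKey("key1", data.length,
        ReplicationType.RATIS, ReplicationFactor.THREE,
        new HashMap<String, String>())) {
      out.write(data);
    }
    client.close();
  }
}
```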

Hadoop Ozone part 2: tutorial and getting started with its features

Categories: Infrastructure | Tags: HDFS, CLI, Learning and tutorial, REST, Ozone, Amazon S3, Cluster

Hadoop Ozone releases come with a handy docker-compose file to try Ozone out. The instructions below provide details on how to use it. You can also use the Katacoda training sandbox which…

Hadoop Ozone part 1: an introduction to the new filesystem

Categories: Infrastructure | Tags: HDFS, Ozone, Cluster, Kubernetes

Hadoop Ozone is an object store for Hadoop. It is designed to scale to billions of objects of varying sizes. It is currently in development. The roadmap is available on the project wiki. This article…

Multihoming on Hadoop

Categories: Infrastructure | Tags: HDFS, Kerberos, Network, Hadoop

Multihoming, which means having multiple networks attached to one node, is one of the main mechanisms for managing the heterogeneous network usage of an Apache Hadoop cluster. This article is an…

By Joris RUMMENS

Mar 5, 2019

Deploying a secured Flink cluster on Kubernetes

Categories: Big Data | Tags: Flink, HDFS, Kafka, Elasticsearch, Encryption, Kerberos, SSL/TLS

When deploying secured Flink applications inside Kubernetes, you are faced with two choices. Assuming your Kubernetes is secure, you may rely on the underlying platform or rely on Flink native…

By David WORMS

Oct 8, 2018

Clusters and workloads migration from Hadoop 2 to Hadoop 3

Categories: Big Data, Infrastructure | Tags: HDFS, Slider, Spark, YARN, Docker, Erasure Coding, Rolling Upgrade

Hadoop 2 to Hadoop 3 migration is a hot subject. How to upgrade your clusters, which features present in the new release may solve current problems and bring new opportunities, how are your current…

By Lucas BAKALIAN

Jul 25, 2018

Apache Hadoop YARN 3.0 – State of the union

Categories: Big Data, DataWorks Summit 2018 | Tags: HDFS, MapReduce, YARN, Cloudera, Docker, GPU, Hortonworks, Release and features, Hadoop

This article covers the “Apache Hadoop YARN: state of the union” talk held by Wangda Tan from Hortonworks during the Dataworks Summit 2018. What is Apache YARN? As a reminder, YARN is one of the two…

By Lucas BAKALIAN

May 31, 2018

Apache Metron in the Real World

Categories: Cyber Security, DataWorks Summit 2018 | Tags: Algorithm, HDFS, Kafka, NiFi, Solr, Spark, Storm, Elasticsearch, pcap, RDBMS, Metron, SQL

Apache Metron is a storage and analytic platform specialized in cyber security. This talk was about demonstrating the usages and capabilities of Apache Metron in the real world. The presentation was…

By Michael HATOUM

May 29, 2018

Red Hat Storage Gluster and its integration with Hadoop

Categories: Big Data | Tags: HDFS, GlusterFS, Red Hat, Hadoop, Storage

I had the opportunity to be introduced to Red Hat Storage and Gluster in a joint presentation by Red Hat France and the company StartX. I have compiled my notes here, at least partially. I will…

By David WORMS

Jul 3, 2015

Splitting HDFS files into multiple Hive tables

Categories: Data Engineering | Tags: Flume, HDFS, Hive, Oozie, Pig, SQL

I am going to show how to split a CSV file stored inside HDFS into multiple Hive tables based on the content of each record. The context is simple. We are using Flume to collect logs from all over our…
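
The core trick is Hive's multi-table INSERT, which scans the source once and routes each record by content. Here is a hypothetical sketch submitting such a statement over JDBC; host, credentials and table names are invented:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SplitHiveTables {
  public static void main(String[] args) throws Exception {
    // HiveServer2 endpoint; host, credentials and tables are hypothetical.
    try (Connection conn = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {
      // Multi-table INSERT: one scan of the source table, each record
      // routed to a destination table according to its content.
      stmt.execute(
          "FROM raw_logs "
          + "INSERT OVERWRITE TABLE logs_error SELECT ts, msg WHERE level = 'ERROR' "
          + "INSERT OVERWRITE TABLE logs_info SELECT ts, msg WHERE level = 'INFO'");
    }
  }
}
```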

By David WORMS

Sep 15, 2013

Kerberos and delegation tokens security with WebHDFS

Categories: Cyber Security | Tags: HDFS, Big Data, HTTP, Kerberos

WebHDFS is an HTTP REST server bundled with the latest versions of Hadoop. What interests me in this article is digging into security with the Kerberos and delegation token functionalities. I will cover…
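
As a preview, here is a minimal sketch reading a file over WebHDFS while authenticating with a previously obtained delegation token instead of Kerberos; the NameNode host, port and path are placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsTokenRead {
  public static void main(String[] args) throws Exception {
    // Delegation token previously obtained with ?op=GETDELEGATIONTOKEN,
    // an operation that itself requires SPNEGO/Kerberos authentication.
    String token = args[0];

    // Read a file while authenticating with the token instead of Kerberos.
    URL url = new URL("http://namenode.example.com:50070/webhdfs/v1"
        + "/tmp/file.txt?op=OPEN&delegation=" + token);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setInstanceFollowRedirects(true); // the NameNode redirects to a DataNode
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```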

By David WORMS

Jul 25, 2013

Testing the Oracle SQL Connector for Hadoop HDFS

Categories: Data Engineering | Tags: HDFS, Database, File system, Oracle, CDH, SQL

Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in HDFS files or a Hive table. You can also query and join data in HDFS or a Hive table with other…

By David WORMS

Jul 15, 2013

Oracle to Apache Hive with the Oracle SQL Connector

Categories: Business Intelligence | Tags: HDFS, Hive, Network, Oracle

In a previous article published last week, I introduced the choices available to connect Oracle and Hadoop. In a follow-up article, I covered the Oracle SQL Connector, its installation and integration…

By David WORMS

May 27, 2013

Options to connect and integrate Hadoop with Oracle

Categories: Data Engineering | Tags: Avro, HDFS, Hive, MapReduce, Sqoop, Database, Java, NoSQL, Oracle, R, RDBMS, SQL

I will list the different tools and libraries available to us developers in order to integrate Oracle and Hadoop. The Oracle SQL Connector for HDFS described below is covered in a follow-up article…

By David WORMS

May 15, 2013

Merging multiple files in Hadoop

Categories: Hack | Tags: HDFS, File system, Hadoop

This is a command I used to concatenate the files stored in Hadoop HDFS matching a globbing expression into a single file. It uses the “getmerge” utility but, contrary to “getmerge”, the final…
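
For illustration, here is a rough Java equivalent of the idea, with hypothetical paths: expand a globbing expression and concatenate the matches into a single file that stays inside HDFS (whereas getmerge writes to the local filesystem):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsMerge {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Expand the globbing expression, then append every match into a
    // single destination file kept inside HDFS.
    FileStatus[] parts = fs.globStatus(new Path("/logs/2013/01/*.log"));
    try (FSDataOutputStream out = fs.create(new Path("/logs/2013/01/merged.log"))) {
      for (FileStatus part : parts) {
        try (FSDataInputStream in = fs.open(part.getPath())) {
          IOUtils.copyBytes(in, out, conf, false);
        }
      }
    }
  }
}
```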

By David WORMS

Jan 12, 2013

Hadoop and R with RHadoop

Categories: Business Intelligence, Data Science | Tags: HDFS, MapReduce, Thrift, Data Analytics, Learning and tutorial, R, Hadoop, HBase

RHadoop is a bridge between R, a language and environment to statistically explore data sets, and Hadoop, a framework that allows for the distributed processing of large data sets across clusters of…

By David WORMS

Jul 19, 2012

HDFS and Hive storage - comparing file formats and compression methods

Categories: Big Data | Tags: Analytics, Hive, ORC, Parquet, File Format

A few days ago, we conducted a test to compare various Hive file formats and compression methods. Among those file formats, some are native to HDFS and apply to all Hadoop users. The…

By David WORMS

Mar 13, 2012

Two Hive UDAFs to convert an aggregation to a map

Categories: Data Engineering | Tags: Hive, Java, HBase, File Format

I am publishing two new Hive UDAFs to help with maps in Apache Hive. The source code is available on GitHub in two Java classes: “UDAFToMap” and “UDAFToOrderedMap”, or you can download the jar file. The…
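
For readers curious about the shape of such a function, here is a simplified sketch using the classic UDAF/UDAFEvaluator API of that era; it collects string key/value pairs into a map, whereas the real classes on GitHub are more general:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

public class UDAFToMapSketch extends UDAF {
  public static class Evaluator implements UDAFEvaluator {
    private Map<String, String> map;

    public Evaluator() {
      init();
    }

    public void init() {
      map = new HashMap<String, String>();
    }

    // Called once per row with the key and value expressions.
    public boolean iterate(String key, String value) {
      if (key != null) {
        map.put(key, value);
      }
      return true;
    }

    // Partial aggregate shipped from map tasks to reduce tasks.
    public Map<String, String> terminatePartial() {
      return map;
    }

    // Merge a partial aggregate computed by another task.
    public boolean merge(Map<String, String> other) {
      if (other != null) {
        map.putAll(other);
      }
      return true;
    }

    public Map<String, String> terminate() {
      return map;
    }
  }
}
```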

By David WORMS

Mar 6, 2012

Timeseries storage in Hadoop and Hive

Categories: Data Engineering | Tags: HDFS, Hive, CRM, timeseries, Tuning, Hadoop, File Format

In the next few weeks, we will be exploring the storage and analytics of a large generated dataset. This dataset is composed of CRM tables associated with one timeseries table of about 7,000 billion rows…

By David WORMS

Jan 10, 2012

Storage and massive processing with Hadoop

Categories: Big Data | Tags: HDFS, Hadoop, Storage

Apache Hadoop is a system for building shared storage and processing infrastructures for large volumes of data (multiple terabytes or petabytes). Hadoop clusters are used by a wide range of projects…

By David WORMS

Nov 26, 2010

Canada - Morocco - France

International locations

10 rue de la Kasbah
2393 Rabat
Morocco

We are a team of Open Source enthusiasts doing consulting in Big Data, Cloud, DevOps, Data Engineering, Data Science…

We provide our customers with accurate insights on how to leverage technologies to convert their use cases to projects in production, how to reduce their costs and accelerate their time to market.

If you enjoy reading our publications and have an interest in what we do, contact us and we will be thrilled to cooperate with you.