HDFS

Multihoming on Hadoop

Multihoming, which means attaching a node to multiple networks, is a key technique for managing the heterogeneous network usage of an Apache Hadoop cluster. This article introduces the concept and its applications for real-world businesses. […]

March 5th, 2019 | Categories: Adaltas Summit 2018, Big Data, Data Engineering

Managing User Identities on Big Data Clusters

Securing a Big Data cluster involves integrating or deploying specific services to store users. Some users are cluster-specific while others are available across all clusters. It is not always easy to understand how these different services fit together and whether they should be shared across multiple clusters. Also, which strategy to choose and what are [...]

November 8th, 2018 | Categories: Big Data, Cyber Security

Deploying a secured Flink cluster on Kubernetes

When deploying secured Flink applications inside Kubernetes, you face two choices. Assuming your Kubernetes cluster is secure, you may rely on the underlying platform or on Flink's native solutions to secure your application from the inside. Note that these two solutions are not mutually exclusive. […]

October 8th, 2018 | Categories: Big Data, Cyber Security

Clusters and workloads migration from Hadoop 2 to Hadoop 3

Migrating from Hadoop 2 to Hadoop 3 is a hot subject. How do you upgrade your clusters? Which features of the new release may solve current problems and bring new opportunities? How are your current processes impacted? Which migration strategy is the most appropriate for your organization? […]

July 25th, 2018 | Categories: Big Data

Data Lake ingestion best practices

Creating a Data Lake requires rigor and experience. Here are some good practices for data ingestion, in both batch and stream architectures, that we recommend and implement with our customers. […]

June 18th, 2018 | Categories: Data Engineering, DevOps

Red Hat Storage Gluster and its integration with Hadoop

I had the opportunity to be introduced to Red Hat Storage and Gluster in a joint presentation by Red Hat France and the company StartX. Here I have compiled my notes, at least partially. I will conclude with the integration between Red Hat Storage and Hadoop, especially what we can expect before conducting an [...]

July 3rd, 2016 | Categories: Big Data

HDFS and Hive storage – comparing file formats and compression methods

A few days ago, we conducted a test to compare various Hive file formats and compression methods. Among those file formats, some are native to HDFS and apply to all Hadoop users. The test suite is composed of similar Hive queries which create a table, optionally set a compression type and load [...]

March 13th, 2012 | Categories: Data Engineering

Two Hive UDAF to convert an aggregation to a map

I am publishing two new Hive UDAFs to help with maps in Apache Hive. The source code is available on GitHub in two Java classes, “UDAFToMap” and “UDAFToOrderedMap”, or you can download the jar file. The first function converts an aggregation into a map and internally uses a Java HashMap. The second function extends [...]

March 6th, 2012 | Categories: Data Engineering

Storage and massive processing with Hadoop

Apache Hadoop is a system for building shared storage and processing infrastructures for large volumes of data (multiple terabytes or petabytes). Hadoop clusters are used by a wide range of projects for a growing number of web players (Yahoo!, eBay, Facebook, LinkedIn, Twitter) and their size continues to increase. Yahoo! has 45,000 machines with the [...]

November 26th, 2010 | Categories: Big Data