Big Data

Merging multiple files in Hadoop

This is a command I used to concatenate the files stored in Hadoop HDFS that match a globbing expression into a single file. It uses the "getmerge" utility of "hadoop fs" but, contrary to "getmerge", the final merged file isn't put into the local filesystem but inside HDFS. Here's what it looks like: echo '' > [...]

By | 2017-11-21T20:13:25+00:00 July 12th, 2013|Categories: Big Data|0 Comments

The state of Hadoop distributions

Apache Hadoop is of course available for download from its official web page. However, downloading and installing the several components that make up a Hadoop cluster is a daunting task. Below is a list of the main distributions including Hadoop. This follows an article published a few days ago about [...]

By | 2017-11-21T20:16:10+00:00 July 11th, 2013|Categories: Big Data|0 Comments

Definitions of machine learning algorithms present in Apache Mahout

Apache Mahout is a machine learning library built for scalability. Its core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. It contains various algorithms, which we define below. Each of them may have multiple implementations. A majority but not all of the [...]

By | 2017-11-21T20:16:25+00:00 July 8th, 2013|Categories: Big Data|0 Comments

Oracle and Hive: how is data published?

In the past few days, I've published three related articles: a first one covering the option to integrate Oracle and Hadoop, a second one explaining how to install and use the Oracle SQL Connector with HDFS, and a third one explaining how to install and use the Oracle SQL Connector with Hive. Those last two [...]

By | 2017-11-21T20:16:51+00:00 July 6th, 2013|Categories: Big Data|0 Comments

Components for CDH and HDP

I was interested in comparing the components distributed by Cloudera and Hortonworks. This also gives us an idea of the versions packaged by the two distributions. At the time of this writing, April 2013, I am comparing the Cloudera distribution 4.2.0 and the Hortonworks Data Platform 2.0.0. CDH 4.2.0 bigtop-jsvc 1.0.10-cdh4.2.0 bigtop-tomcat 6.0.35-cdh4.2.0 datafu [...]

By | 2017-11-21T20:17:08+00:00 July 2nd, 2013|Categories: Big Data|0 Comments

HDFS and Hive storage: a comparison of file formats and compression methods

A few days ago, we ran a test to compare the different file formats and compression methods available in Hive. Some of these formats are native to HDFS and apply to all Hadoop users. The test suite is composed of similar Hive queries which create a [...]

By | 2017-11-21T20:19:09+00:00 July 15th, 2012|Categories: Big Data|0 Comments

Two Hive UDAFs to convert an aggregation to a map

I am publishing two new Hive UDAFs to help with maps in Apache Hive. The source code is available on GitHub in two Java classes, “UDAFToMap” and “UDAFToOrderedMap”, or you can download the jar file. The first function converts an aggregation into a map and internally uses a Java HashMap. The second function extends [...]
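To illustrate the idea of folding an aggregation into a map, here is a minimal plain-Java sketch. It does not use the Hive API and is not the code of “UDAFToMap” or “UDAFToOrderedMap”. The class name `ToMapSketch`, the nested `Aggregator` and its `ordered` flag are hypothetical, the method names only loosely mirror the iterate/merge/terminate steps of a Hive UDAF evaluator, and the use of a `TreeMap` for the ordered variant is an assumption, since the excerpt is cut off before describing it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class ToMapSketch {

    /** Hypothetical aggregator: collects (key, value) rows into a single map. */
    static final class Aggregator<K, V> {
        private final Map<K, V> state;

        Aggregator(boolean ordered) {
            // Assumption: the ordered variant is backed by a TreeMap,
            // which requires keys that implement Comparable.
            if (ordered) {
                state = new TreeMap<>();
            } else {
                state = new HashMap<>();
            }
        }

        // Called once per input row; a later row with the same key
        // overwrites the earlier value.
        void iterate(K key, V value) {
            state.put(key, value);
        }

        // Folds a partial result (for example from another task) into this one.
        void merge(Map<K, V> partial) {
            state.putAll(partial);
        }

        // Returns the final aggregated map.
        Map<K, V> terminate() {
            return state;
        }
    }

    public static void main(String[] args) {
        Aggregator<String, Integer> ordered = new Aggregator<>(true);
        ordered.iterate("b", 2);
        ordered.iterate("a", 1);
        ordered.iterate("b", 3);
        System.out.println(ordered.terminate()); // prints {a=1, b=3}
    }
}
```

With the unordered variant, the same rows produce the same entries, but the iteration order of a HashMap is not guaranteed.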

By | 2017-11-21T20:23:13+00:00 March 6th, 2012|Categories: Big Data|0 Comments

Timeseries storage in Hadoop and Hive

In the next few weeks, we will be exploring the storage and analytics of a large generated dataset. This dataset is composed of CRM tables associated with one time series table of about 7,000 billiard rows. Before importing the dataset into Hive, we will be exploring different optimization options expected to impact speed and storage size. [...]

By | 2017-11-21T20:22:06+00:00 January 10th, 2012|Categories: Big Data|0 Comments