Big Data

Definitions of machine learning algorithms present in Apache Mahout

Apache Mahout is a machine learning library built for scalability. Its core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. It contains various algorithms, which we define below. Each of them may have multiple implementations. A majority but not all of the [...]

July 8th, 2013 | Categories: Big Data

Oracle and Hive: how is data published?

In the past few days, I've published three related articles: a first one covering the options for integrating Oracle and Hadoop, a second one explaining how to install and use the Oracle SQL Connector with HDFS, and a third one explaining how to install and use the Oracle SQL Connector with Hive. Those last two [...]

July 6th, 2013 | Categories: Big Data

Components for CDH and HDP

I was interested in comparing the different components distributed by Cloudera and Hortonworks. This also gives us an idea of the versions packaged by the two distributions. At the time of this writing, April 2013, I am comparing the Cloudera distribution 4.2.0 and the Hortonworks Data Platform 2.0.0. CDH 4.2.0 bigtop-jsvc 1.0.10-cdh4.2.0 bigtop-tomcat 6.0.35-cdh4.2.0 datafu [...]

July 2nd, 2013 | Categories: Big Data

HDFS and Hive storage: comparing file formats and compression methods

A few days ago, we ran a test to compare the different file formats and compression methods available in Hive. Some of these formats are native to HDFS and apply to all Hadoop users. The test suite is made up of Hive queries, all similar, which create a [...]

July 15th, 2012 | Categories: Big Data

Two Hive UDAFs to convert an aggregation to a map

I am publishing two new Hive UDAFs to help with maps in Apache Hive. The source code is available on GitHub in two Java classes, “UDAFToMap” and “UDAFToOrderedMap”, or you can download the jar file. The first function converts an aggregation into a map and internally uses a Java HashMap. The second function extends [...]

March 6th, 2012 | Categories: Big Data

Time series storage in Hadoop and Hive

In the next few weeks, we will be exploring the storage and analysis of a large generated dataset. This dataset is composed of CRM tables associated with one time series table of about 7,000 billion rows. Before importing the dataset into Hive, we will be exploring different optimization options expected to impact speed and storage size. [...]

January 10th, 2012 | Categories: Big Data

Introduction to MapReduce

Information systems have more and more data to store and process. Companies such as Google, Facebook and Twitter, along with many others, store astronomical quantities of information coming from their customers and must be able to serve them with the best recommendations while ensuring the longevity of their systems. [...]

June 26th, 2010 | Categories: Big Data